# Exercise Six: Structured Data with ugly soup and Pandas

This example scrapes information from the GitHub page for the user "pret." pret primarily works on from-scratch recreations of the Pokémon games, which I thought would be interesting to work with. It turns out GitHub has a variety of quirks that are awkward for the Beautiful Soup scraper to handle, which led to me wanting to rip my hair out. By that point I was far too deep to try anything else, so you'll see some gross-looking code that I tried to clean up at the end.

## Stages 1-3 (Ambiguous)

Below you'll see the first three steps of the project. I thought that the way you presented the data in the final post was much cleaner than what I had been doing previously, so I decided to work with that as well.

The only problematic part here is the "information" entry in project_dict. Originally it was meant to be divided into "Language," "Stargazers," and "Hooks," but try as I might, these pieces refused to separate from one another because of the way GitHub displays them. My attempts to divide the section further returned a variety of errors that I couldn't find solutions for online, and they seemed directly related to how the Beautiful Soup scraper reads data and text.

In [122]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

In [123]:
url = 'https://github.com/pret'

user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent)
# Named soup because the loop in the next cell calls soup.find_all(...)
soup = BeautifulSoup(response.text, 'html.parser')

project_dict = {'projecttitle':[], 'projectd':[], 'information':[]}

In [124]:
for project in soup.find_all('div', class_='pinned-item-list-item-content'):
    project_dict['projecttitle'].append(project.find().find('a').text)
    project_dict['projectd'].append(project.find(class_="pinned-item-desc color-text-secondary text-small d-block mt-2 mb-3").text)
    project_dict['information'].append(project.find(class_="mb-0 f6 color-text-secondary").text)
                                    
print(project_dict['projecttitle'])
print(project_dict['projectd'])
print(project_dict['information'])

['\npokered\n', '\npokecrystal\n', '\npokeemerald\n', '\npokeruby\n', '\npokeyellow\n', '\npokegold\n']
['\n        Disassembly of Pokémon Red/Blue\n      ', '\n        Disassembly of Pokémon Crystal\n      ', '\n        Decompilation of Pokémon Emerald\n      ', '\n        Decompilation of Pokémon Ruby/Sapphire\n      ', '\n        Disassembly of Pokemon Yellow\n      ', '\n        Disassembly of Pokémon Gold/Silver\n      ']
['\n\n\nAssembly\n\n\n\n\n\n            2.9k\n          \n\n\n\n\n            529\n          \n', '\n\n\nAssembly\n\n\n\n\n\n            1.5k\n          \n\n\n\n\n            471\n          \n', '\n\n\nC\n\n\n\n\n\n            920\n          \n\n\n\n\n            683\n          \n', '\n\n\nC\n\n\n\n\n\n            519\n          \n\n\n\n\n            150\n          \n', '\n\n\nAssembly\n\n\n\n\n\n            420\n          \n\n\n\n\n            134\n          \n', '\n\n\nAssembly\n\n\n\n\n\n            251\n          \n\n\n\n\n            61\n          \n']
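Looking at the output above, the three values inside each "information" string are separated only by runs of spaces and newlines, which suggests one possible way to pull them apart. The following is a minimal sketch, not what the post actually does: `raw_information` and `split_information` are hypothetical names, the sample strings are copied from the printed output rather than scraped live, and the key names are the ones the post originally intended.

```python
# Two 'information' strings copied from the scraped output above.
raw_information = [
    '\n\n\nAssembly\n\n\n\n\n\n            2.9k\n          \n\n\n\n\n            529\n          \n',
    '\n\n\nC\n\n\n\n\n\n            920\n          \n\n\n\n\n            683\n          \n',
]


def split_information(text):
    """Split one 'information' string into its three whitespace-separated fields."""
    # With no arguments, str.split() splits on any run of whitespace
    # (spaces and newlines alike) and discards the empty pieces.
    fields = text.split()
    if len(fields) != 3:
        # Assumption: every card shows exactly three values. A card with a
        # missing value, or a language name containing a space, would need
        # separate handling.
        raise ValueError(f"expected 3 fields, got {fields!r}")
    language, stargazers, hooks = fields
    return {"Language": language, "Stargazers": stargazers, "Hooks": hooks}


separated = [split_information(text) for text in raw_information]
print(separated)
```

This only works as long as the page keeps whitespace between the values, so it should be treated as a starting point rather than a robust parser.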


## Step 4: Trying to Make the Data Look Nice

The data itself was horrendously ugly because of the way Beautiful Soup read GitHub's markup, so I tried to clean it as best I could with some string cleaning functions that I discovered as I worked and researched for this project. I was ultimately unable to make the values look good or separate them from one another, which is why the data presentation here looks so strange. I have included some of the steps I took.

In [125]:
information = project_dict['information']
newinfo1 = []

for element in information:
    newinfo1.append(element.replace("\n", ""))

print(newinfo1)

infofinal = []
for element in newinfo1:
    infofinal.append(element.replace(" ", ""))
    
print(infofinal)

['Assembly            2.9k                      529          ', 'Assembly            1.5k                      471          ', 'C            920                      683          ', 'C            519                      150          ', 'Assembly            420                      134          ', 'Assembly            251                      61          ']
['Assembly2.9k529', 'Assembly1.5k471', 'C920683', 'C519150', 'Assembly420134', 'Assembly25161']
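The second printed list shows why deleting every space is risky: the values fuse together ('C920683'). One alternative, sketched here under the assumption that the strings look like the scraped samples above, is to collapse each run of whitespace into a single space instead of removing it. `collapse_whitespace` is a hypothetical helper name, and the sample lists are copied from the earlier output.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    # split() breaks on any whitespace run; join() puts single spaces back.
    return " ".join(text.split())


# Samples copied from the scraped output earlier in the post.
titles = ['\npokered\n', '\npokecrystal\n']
descriptions = ['\n        Disassembly of Pokémon Red/Blue\n      ']
information = [
    '\n\n\nAssembly\n\n\n\n\n\n            2.9k\n          \n\n\n\n\n            529\n          \n',
]

clean_titles = [collapse_whitespace(t) for t in titles]
clean_descriptions = [collapse_whitespace(d) for d in descriptions]
clean_information = [collapse_whitespace(i) for i in information]

print(clean_titles)
print(clean_descriptions)
print(clean_information)
```

Because the single spaces survive, the cleaned "information" strings stay readable and can still be split into separate values later.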


In [126]:
projects = pd.DataFrame(project_dict)
pd.set_option("display.max_colwidth", None)


projects

|   | projecttitle | projectd | information |
|---|---|---|---|
| 0 | \npokered\n | \n Disassembly of Pokémon Red/Blue\n | \n\n\nAssembly\n\n\n\n\n\n 2.9k\n \n\n\n\n\n 529\n \n |
| 1 | \npokecrystal\n | \n Disassembly of Pokémon Crystal\n | \n\n\nAssembly\n\n\n\n\n\n 1.5k\n \n\n\n\n\n 471\n \n |
| 2 | \npokeemerald\n | \n Decompilation of Pokémon Emerald\n | \n\n\nC\n\n\n\n\n\n 920\n \n\n\n\n\n 683\n \n |
| 3 | \npokeruby\n | \n Decompilation of Pokémon Ruby/Sapphire\n | \n\n\nC\n\n\n\n\n\n 519\n \n\n\n\n\n 150\n \n |
| 4 | \npokeyellow\n | \n Disassembly of Pokemon Yellow\n | \n\n\nAssembly\n\n\n\n\n\n 420\n \n\n\n\n\n 134\n \n |
| 5 | \npokegold\n | \n Disassembly of Pokémon Gold/Silver\n | \n\n\nAssembly\n\n\n\n\n\n 251\n \n\n\n\n\n 61\n \n |


## Step 5: Presenting the Data

I ended up defaulting to the manual way I learned to present data in Python, using numpy and matplotlib, and wrote the values out by hand for the plotter. In the future, I would like to use a scraper that automatically writes the data to a .csv file, like the one we worked with previously.
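As a rough sketch of the .csv idea, here is one way cleaned rows could be written out with the standard library's `csv` module. The `rows` list is hypothetical and typed in from the scraped output earlier in the post, and the column names are assumptions. With a pandas DataFrame already in hand, `DataFrame.to_csv` would likely be the more direct route.

```python
import csv
import io

# Hypothetical cleaned rows, typed in from the scraped output earlier in the post.
rows = [
    {"projecttitle": "pokered", "language": "Assembly", "stargazers": "2.9k"},
    {"projecttitle": "pokeemerald", "language": "C", "stargazers": "920"},
]

# Write to an in-memory buffer so the sketch has no side effects. Swapping in
# open("projects.csv", "w", newline="") would write a real file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["projecttitle", "language", "stargazers"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```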

In [127]:
import matplotlib.pyplot as plt
import numpy as np

game = ('pokered', 'pokecrystal', 'pokeemerald', 'pokeruby', 'pokeyellow', 'pokegold')
# First count from each 'information' string above, typed in by hand ('2.9k' -> 2900)
height = [2900, 1500, 920, 519, 420, 251]
y_pos = np.arange(len(game))

plt.barh(y_pos, height)

plt.yticks(y_pos, game)

plt.show()

<Figure size 640x480 with 1 Axes>
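Typing the bar lengths in by hand invites transcription slips, so a small helper could derive them from the scraped text instead. This is a sketch under assumptions: `parse_count` is a hypothetical function, the input strings are the first count from each "information" string in the scraped output, and a trailing "k" is taken to mean thousands. Since the page rounds those values, the result is approximate.

```python
def parse_count(text):
    """Convert a display count such as '2.9k' or '519' to an integer."""
    text = text.strip().lower()
    if text.endswith("k"):
        # '2.9k' is a rounded figure, so 2900 is only an approximation.
        return round(float(text[:-1]) * 1000)
    return int(text)


games = ['pokered', 'pokecrystal', 'pokeemerald', 'pokeruby', 'pokeyellow', 'pokegold']
# First count from each 'information' string in the scraped output.
count_text = ['2.9k', '1.5k', '920', '519', '420', '251']

counts = [parse_count(t) for t in count_text]
print(dict(zip(games, counts)))
```

The resulting `counts` list could be passed to `plt.barh` in place of the hand-typed values.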