## Intermediate Lab: Creating a Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for all of your classmates' GitHub repos, using only the CSV file with everyone's URL.

The process has four general steps:

 - load the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information to turn it into a dataframe
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1: Load the CSV file with everyone's GitHub URL

In [1]:
import requests
import pandas as pd

In [3]:
# IMPORTANT -- MAKE SURE THE FILE PATH LISTED MATCHES WHAT'S ON YOUR COMPUTER
with open('datrepos.csv') as f:
    data = f.readlines()

In [4]:
# notice that data is a list of strings, one for each line of the file
data

['Name,Github URL\n',
 'Jonathan,https://github.com/JonathanBechtel\n',
 'Rezwana,https://github.com/rezsharmeen\n',
 'Marnie,https://github.com/marnierl\n',
 'Kristina,https://github.com/hayniek\n',
 'Harley,https://github.com/harleyhoffmann\n',
 'Uma,https://github.com/umap1230\n',
 'Andrew Cal,https://github.com/AndrewCal2013\n',
 'Zarina ,https://github.com/zarinajm7\n',
 'Lisa,https://github.com/lisastaal\n',
 'Emma ,https://github.com/ewynn5\n',
 'Jacob,https://github.com/jdonahue94\n',
 'Tina,https://github.com/tinagads\n',
 'Danielle,https://github.com/dlemi\n',
 ',\n',
 ',\n',
 'Jake H,https://github.com/jhoernsch\n',
 'Avinash,https://github.com/avirathore2\n',
 ',\n',
 ',https://github.com/bebenono/sample-ga-repo\n',
 'Krithi N,https://github.com/nkrithi\n']
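
The same kind of file can also be parsed with Python's built-in `csv` module, which splits each row into its columns for you. The sketch below is one possible alternative, not the approach this lab follows. It reads from an in-memory string (`sample_text`, a hypothetical stand-in holding a few rows copied from the output above) so that it runs without the file on disk.

```python
import csv
import io

# hypothetical stand-in for the file: a few rows copied from the output above
sample_text = (
    "Name,Github URL\n"
    "Jonathan,https://github.com/JonathanBechtel\n"
    "Zarina ,https://github.com/zarinajm7\n"
    ",\n"
)

# DictReader uses the header row for keys, so each row becomes a dictionary
rows = list(csv.DictReader(io.StringIO(sample_text)))

for row in rows:
    print(row)
```

Notice that the row containing only a comma still comes through, with empty strings for both columns, so the cleaning in the next steps is still needed.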

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

In [5]:
# this loops through each item in the list, starting at position 1 (skipping the header), and replaces the \n character with nothing
cleaned_data = [repo.replace('\n', "") for repo in data[1:]]

In [6]:
# we can confirm that the newline characters are no longer there
cleaned_data

['Jonathan,https://github.com/JonathanBechtel',
 'Rezwana,https://github.com/rezsharmeen',
 'Marnie,https://github.com/marnierl',
 'Kristina,https://github.com/hayniek',
 'Harley,https://github.com/harleyhoffmann',
 'Uma,https://github.com/umap1230',
 'Andrew Cal,https://github.com/AndrewCal2013',
 'Zarina ,https://github.com/zarinajm7',
 'Lisa,https://github.com/lisastaal',
 'Emma ,https://github.com/ewynn5',
 'Jacob,https://github.com/jdonahue94',
 'Tina,https://github.com/tinagads',
 'Danielle,https://github.com/dlemi',
 ',',
 ',',
 'Jake H,https://github.com/jhoernsch',
 'Avinash,https://github.com/avirathore2',
 ',',
 ',https://github.com/bebenono/sample-ga-repo',
 'Krithi N,https://github.com/nkrithi']
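
The output above still contains rows that are just a comma, because some lines in the file had no name or URL. A minimal sketch of one way to drop those rows while stripping the newline is shown here. It assumes a blank row always looks exactly like `','` after stripping, and `sample_lines` is a hypothetical stand-in for `data`.

```python
# hypothetical stand-in for `data`: a few lines copied from the Step 1 output
sample_lines = [
    'Name,Github URL\n',
    'Jonathan,https://github.com/JonathanBechtel\n',
    ',\n',
    'Emma ,https://github.com/ewynn5\n',
]

# strip() removes the trailing newline plus any stray spaces around the line;
# the condition keeps only rows that have something besides the comma
cleaned_sample = [
    line.strip() for line in sample_lines[1:] if line.strip() != ','
]

print(cleaned_sample)
```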

### Step 3:  Separate the username in each string from everything else

In [7]:
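# split each line on 'github.com/' and keep the first path segment after it, which is the username
# rows without a url (like ',') are skipped so they don't turn into bogus usernames
usernames = [line.split('github.com/')[1].split('/')[0] for line in cleaned_data if 'github.com/' in line]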
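
If the string splitting feels fragile, the username can also be pulled out with the standard library's `urllib.parse`. This is a sketch under a few assumptions: each line has the form `name,url`, the username is the first segment of the URL path, and `extract_username` and `sample_lines` are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlparse

def extract_username(line):
    """Return the GitHub username from a 'name,url' line, or None if there is no url."""
    # everything after the first comma is the url part
    _, _, url = line.partition(',')
    # the path looks like '/username' or '/username/some-repo'
    path_parts = [part for part in urlparse(url.strip()).path.split('/') if part]
    return path_parts[0] if path_parts else None

# a few lines copied from the Step 2 output
sample_lines = [
    'Jonathan,https://github.com/JonathanBechtel',
    ',',
    ',https://github.com/bebenono/sample-ga-repo',
]

sample_usernames = [extract_username(line) for line in sample_lines]
# drop the rows that had no url
sample_usernames = [name for name in sample_usernames if name is not None]
print(sample_usernames)
```

The third sample line links to a repo rather than a profile, which is why the helper takes the first path segment instead of the last one.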

### Step 4: Obtain the repo data for every single GitHub username

In [None]:
# this part of the url will never change
base_url = 'https://api.github.com'

In [None]:
# this goes through every username, inserts it into the api url, and passes that url into requests.get()
# calling .json() on each response gives us a list of repos for every single user
repo_lists = [requests.get(f"{base_url}/users/{username}/repos").json() for username in usernames]
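
When a username does not exist, or when too many requests are sent in a short time, the GitHub API answers with a JSON object (a dictionary with a `message` key) instead of a list of repos. The sketch below shows one way to keep only the list responses before moving on. The values in `sample_responses` are hypothetical, hand-written stand-ins rather than real API output, so the example runs without a network connection.

```python
# hypothetical, hand-written stand-ins for parsed API responses (not real API output)
sample_responses = [
    [{'name': 'repo-a'}, {'name': 'repo-b'}],   # a user with two repos
    {'message': 'Not Found'},                   # an error response is a dict
    [],                                         # a user with no public repos
]

# keep only the responses that are actually lists of repos
valid_repo_lists = [resp for resp in sample_responses if isinstance(resp, list)]

print(len(valid_repo_lists))
```

Applying the same filter to `repo_lists` would keep the later steps from tripping over an error dictionary.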

### Step 5: Create a 'flat' list that contains every unique repo for every single user

Answer with a list comprehension:

In [None]:
# this is a nested for-loop using a list comprehension that returns each item inside the inner list
repos = [repo for user in repo_lists for repo in user]

Nested loops with comprehensions can be difficult to interpret sometimes, so if a regular for-loop is easier to digest, this is a different way of writing the same thing:

In [None]:
repos = []

for user in repo_lists:
    for repo in user:
        repos.append(repo)
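
A third option, offered here only as an alternative sketch, is `itertools.chain.from_iterable` from the standard library, which flattens one level of nesting. `sample_repo_lists` is a hypothetical stand-in for `repo_lists` with made-up repo names.

```python
from itertools import chain

# hypothetical stand-in for `repo_lists`: one inner list per user
sample_repo_lists = [
    [{'name': 'repo-a'}, {'name': 'repo-b'}],
    [],
    [{'name': 'repo-c'}],
]

# chain.from_iterable walks each inner list in order and yields its items one by one
flat_repos = list(chain.from_iterable(sample_repo_lists))

print([repo['name'] for repo in flat_repos])
```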

### Step 6:  Get information about the name, owner, url, and date of every single repo.

In [None]:
# this creates a list of all the values for the name key
repo_names = [repo['name'] for repo in repos]
# ditto for the login key -- notice it's accessed inside the owner key
owners     = [repo['owner']['login'] for repo in repos]
# next two work the same way
urls       = [repo['html_url'] for repo in repos]
dates      = [repo['created_at'] for repo in repos]
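
Another way to organize the same information is to build one small dictionary per repo in a single pass, rather than four parallel lists. The sketch below assumes each repo dictionary has the keys used in this step (`name`, `owner`, `html_url`, `created_at`). The values in `sample_repos` are made up for illustration.

```python
# hypothetical repo dictionaries with made-up values; only the keys mirror the ones used in this step
sample_repos = [
    {
        'name': 'repo-a',
        'owner': {'login': 'user-one'},
        'html_url': 'https://github.com/user-one/repo-a',
        'created_at': '2020-01-15T10:30:00Z',
    },
    {
        'name': 'repo-b',
        'owner': {'login': 'user-two'},
        'html_url': 'https://github.com/user-two/repo-b',
        'created_at': '2021-06-01T08:00:00Z',
    },
]

# build one small dictionary per repo instead of four parallel lists
records = [
    {
        'Owner': repo['owner']['login'],
        'Name': repo['name'],
        'URL': repo['html_url'],
        'Date': repo['created_at'],
    }
    for repo in sample_repos
]

print(records[0])
```

A list of records like this can be passed to `pd.DataFrame()` as well, with each dictionary becoming one row.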

### Step 7: Create a dictionary with the data created in Step 6

In [None]:
data_dict = {
    'Owner': owners,
    'Name': repo_names,
    'URL': urls,
    'Date': dates
}
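
When a dataframe is built from a dictionary of lists, pandas expects every list to have the same length, so a quick check before the next step can save some confusion. This is a minimal sketch of such a check. `sample_dict` is a hypothetical stand-in for `data_dict` with made-up values.

```python
# hypothetical stand-in for `data_dict`, with made-up values
sample_dict = {
    'Owner': ['user-one', 'user-two'],
    'Name': ['repo-a', 'repo-b'],
    'URL': ['https://github.com/user-one/repo-a', 'https://github.com/user-two/repo-b'],
    'Date': ['2020-01-15T10:30:00Z', '2021-06-01T08:00:00Z'],
}

# collect the length of every column; a set collapses equal lengths into one value
column_lengths = {len(values) for values in sample_dict.values()}

assert len(column_lengths) == 1, "columns have different lengths"
print(column_lengths)
```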

### Step 8: Pass your dictionary into `pd.DataFrame()` to get your final dataset

In [None]:
import pandas as pd

# this will take your dictionary and turn it into a dataframe
df = pd.DataFrame(data_dict)

In [None]:
# look how pretty it is :)
df.head()
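
The `Date` column holds text at this point, since the API returns each timestamp as a string. As an optional follow-up, the sketch below converts the column to real datetime values so the repos can be sorted or grouped by date. `sample_df` is a hypothetical stand-in for `df` with made-up values, and the timestamp format is assumed to match the `created_at` strings.

```python
import pandas as pd

# hypothetical stand-in for the final dataframe, with made-up values
sample_df = pd.DataFrame({
    'Owner': ['user-one', 'user-two'],
    'Name': ['repo-a', 'repo-b'],
    'Date': ['2020-01-15T10:30:00Z', '2021-06-01T08:00:00Z'],
})

# convert the text timestamps into real datetime values
sample_df['Date'] = pd.to_datetime(sample_df['Date'])

# the .dt accessor now exposes pieces of each date, such as the year
sample_df['Year'] = sample_df['Date'].dt.year

# sort so the most recently created repo comes first
sample_df = sample_df.sort_values('Date', ascending=False)

print(sample_df)
```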