## Intermediate Lab: Creating a Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for all of your classmates' GitHub repos, starting from nothing but a CSV file with everyone's GitHub URL.

The process has four general steps:

 - load in the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information to turn it into a dataframe
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1: Load in the CSV file with everyone's GitHub repo

In [1]:
import requests
import pandas as pd

In [2]:
url = r"C:\Users\Jonat\Downloads\Github Repos DAT-10-19 - Sheet1.csv"

In [3]:
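# the file is saved as UTF-8, so say so explicitly; otherwise accented names such as Chloé come out garbled on Windows
with open(url, encoding='utf-8') as f:
    data = f.readlines()

In [4]:
data

['Name,Github URL\n',
 'Chloé,https://github.com/chloemd\n',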
 ',\n',
 'Gary,https://github.com/Gmarin10\n',
 'Cameron ,https://github.com/clefevre01\n',
 'Oore,https://github.com/ladipoore\n',
 'Jaryd Thornton,https://github.com/jcolethornton\n',
 'Peter,https://github.com/Lothdyn/my-1019-repo\n',
 'Alvaro ,https://github.com/alvarog01/mydat1019\n',
 ',\n',
 'Amanda Chernishkin,https://github.com/amandachernishkin\n',
 'John Mayer,https://github.com/mayerjp01\n',
 'Nidhi Mahambre,https://github.com/nidhim03']
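
Reading the file with `readlines()` leaves the header, the blank `,` rows, and the newline characters to clean up by hand. As an alternative sketch, the standard library's `csv` module can do the splitting for you. The `sample` string below is a shortened copy of the file contents shown above, held in memory so the example runs on its own; with the real file you would pass the open file object to `csv.reader` instead.

```python
import csv
import io

# a shortened, in-memory stand-in for the class CSV file
sample = (
    "Name,Github URL\n"
    "Gary,https://github.com/Gmarin10\n"
    ",\n"
    "Peter,https://github.com/Lothdyn/my-1019-repo\n"
    "Nidhi Mahambre,https://github.com/nidhim03"
)

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row

# keep only rows that actually contain a URL, trimming stray spaces
rows = [(name.strip(), link.strip()) for name, link in reader if link.strip()]

print(rows)
```

Each item in `rows` is a `(name, url)` pair, so the empty `,` rows never make it into the list.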

In [8]:
urls = [url for url in data if 'https' in url]

In [11]:
urls

['Chloé,https://github.com/chloemd\n',
 'Gary,https://github.com/Gmarin10\n',
 'Cameron ,https://github.com/clefevre01\n',
 'Oore,https://github.com/ladipoore\n',
 'Jaryd Thornton,https://github.com/jcolethornton\n',
 'Peter,https://github.com/Lothdyn/my-1019-repo\n',
 'Alvaro ,https://github.com/alvarog01/mydat1019\n',
 'Amanda Chernishkin,https://github.com/amandachernishkin\n',
 'John Mayer,https://github.com/mayerjp01\n',
 'Nidhi Mahambre,https://github.com/nidhim03']

In [14]:
[len(url.split('/')) for url in urls]

[4, 4, 4, 4, 4, 5, 5, 4, 4, 4]

In [15]:
for url in urls:
    if len(url.split('/')) == 4:
        print(url.split('/')[-1])
    else:
        print(url.split('/')[-2])

chloemd

Gmarin10

clefevre01

ladipoore

jcolethornton

Lothdyn
alvarog01
amandachernishkin

mayerjp01

nidhim03


In [17]:
usernames = []

for url in urls:
    if len(url.split('/')) == 4:
        usernames.append(url.split('/')[-1])
    else:
        usernames.append(url.split('/')[-2])

In [20]:
usernames[0].replace('\n', '')

'chloemd'

In [21]:
usernames = [user.replace('\n', '') for user in usernames]

In [22]:
usernames

['chloemd',
 'Gmarin10',
 'clefevre01',
 'ladipoore',
 'jcolethornton',
 'Lothdyn',
 'alvarog01',
 'amandachernishkin',
 'mayerjp01',
 'nidhim03']
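
Counting the pieces returned by `split('/')` works for this file, but it depends on every line having exactly four or five pieces. A sketch that avoids the length check is to parse the URL and take the first segment of its path, which is the username for both profile URLs and repo URLs. `username_from_url` is a hypothetical helper name, and it assumes the name has already been split off so that only the URL is passed in.

```python
from urllib.parse import urlparse

def username_from_url(url):
    """Return the GitHub username: the first path segment of a profile or repo URL."""
    path = urlparse(url.strip()).path        # e.g. '/Lothdyn/my-1019-repo'
    return path.strip('/').split('/')[0]

print(username_from_url('https://github.com/chloemd\n'))                 # chloemd
print(username_from_url('https://github.com/Lothdyn/my-1019-repo\n'))   # Lothdyn
```

Because `strip()` runs first, the trailing `\n` is removed in the same step and no separate `replace` pass is needed.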

In [24]:
api_url = "https://api.github.com/users/chloemd/repos"
requests.get(api_url).json()

[{'id': 305542140,
  'node_id': 'MDEwOlJlcG9zaXRvcnkzMDU1NDIxNDA=',
  'name': 'DAT-1019-Chloe',
  'full_name': 'chloemd/DAT-1019-Chloe',
  'private': False,
  'owner': {'login': 'chloemd',
   'id': 73141231,
   'node_id': 'MDQ6VXNlcjczMTQxMjMx',
   'avatar_url': 'https://avatars1.githubusercontent.com/u/73141231?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/chloemd',
   'html_url': 'https://github.com/chloemd',
   'followers_url': 'https://api.github.com/users/chloemd/followers',
   'following_url': 'https://api.github.com/users/chloemd/following{/other_user}',
   'gists_url': 'https://api.github.com/users/chloemd/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/chloemd/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/chloemd/subscriptions',
   'organizations_url': 'https://api.github.com/users/chloemd/orgs',
   'repos_url': 'https://api.github.com/users/chloemd/repos',
   'events_url': 'https://api.github.com/users/c

In [25]:
user_repos = [requests.get(f"https://api.github.com/users/{user}/repos").json()
              for user in usernames]

In [26]:
user_repos

[[{'id': 305542140,
   'node_id': 'MDEwOlJlcG9zaXRvcnkzMDU1NDIxNDA=',
   'name': 'DAT-1019-Chloe',
   'full_name': 'chloemd/DAT-1019-Chloe',
   'private': False,
   'owner': {'login': 'chloemd',
    'id': 73141231,
    'node_id': 'MDQ6VXNlcjczMTQxMjMx',
    'avatar_url': 'https://avatars1.githubusercontent.com/u/73141231?v=4',
    'gravatar_id': '',
    'url': 'https://api.github.com/users/chloemd',
    'html_url': 'https://github.com/chloemd',
    'followers_url': 'https://api.github.com/users/chloemd/followers',
    'following_url': 'https://api.github.com/users/chloemd/following{/other_user}',
    'gists_url': 'https://api.github.com/users/chloemd/gists{/gist_id}',
    'starred_url': 'https://api.github.com/users/chloemd/starred{/owner}{/repo}',
    'subscriptions_url': 'https://api.github.com/users/chloemd/subscriptions',
    'organizations_url': 'https://api.github.com/users/chloemd/orgs',
    'repos_url': 'https://api.github.com/users/chloemd/repos',
    'events_url': 'https://ap
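
The calls above assume every request succeeds. A slightly more defensive sketch checks the status code and asks for up to 100 repos per page (the GitHub REST API returns 30 per page by default). `fetch_repos` and its `get` parameter are hypothetical names introduced here; `get` exists only so the function can be tried with a stand-in instead of a live request.

```python
def fetch_repos(username, get=None, base_url="https://api.github.com"):
    """Return the list of repos for one user, or [] if the API reports an error."""
    if get is None:
        # imported lazily so the function can also be called with a stand-in `get`
        import requests
        get = requests.get

    response = get(
        f"{base_url}/users/{username}/repos",
        params={"per_page": 100},  # ask for the largest page size GitHub allows
        timeout=10,                # don't wait forever on a stalled connection
    )

    # anything other than 200 (unknown user, rate limit, ...) yields no repos
    if response.status_code != 200:
        return []
    return response.json()
```

With this in place, the list of responses could be built as `[fetch_repos(user) for user in usernames]`. Users with more than 100 repos would still need additional pages, which this sketch does not handle.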

In [4]:
# notice that data is a list filled with strings that contain info about each line
data

['Name,Repo\n',
 'Jonathan Bechtel,https://github.com/JonathanBechtel\n',
 'Aoife Duna,https://github.com/aoifeduna\n',
 'Erik Lindernoren,https://github.com/eriklindernoren']

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

In [5]:
# this loops through each item in the list, starting at position 1 to skip the header, and replaces the \n character with nothing
cleaned_data = [repo.replace('\n', "") for repo in data[1:]]

In [6]:
# we can confirm now that these marks are no longer there
cleaned_data

['Jonathan Bechtel,https://github.com/JonathanBechtel',
 'Aoife Duna,https://github.com/aoifeduna',
 'Erik Lindernoren,https://github.com/eriklindernoren']
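
`replace('\n', '')` only handles the newline character. As a small variation, `strip()` removes any whitespace at either end of the line, which also covers `\r\n` line endings and stray trailing spaces. The sketch below reuses the `data` list shown above.

```python
data = ['Name,Repo\n',
        'Jonathan Bechtel,https://github.com/JonathanBechtel\n',
        'Aoife Duna,https://github.com/aoifeduna\n',
        'Erik Lindernoren,https://github.com/eriklindernoren']

# strip() trims whitespace from both ends of each line;
# the trailing `if` drops any line that is empty after trimming
cleaned_data = [line.strip() for line in data[1:] if line.strip()]

print(cleaned_data)
```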

### Step 3:  Separate the username in each string from everything else

In [10]:
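# first split each line on the comma and keep the second piece, which is the url
repo_urls = [line.split(',')[1] for line in cleaned_data]
# we do the same thing, except we take the LAST item from the list returned by split()
usernames = [url.split('/')[-1] for url in repo_urls]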

### Step 4: Obtain the repo data for every single github username

In [11]:
# this part of the url will never change
base_url = 'https://api.github.com'

In [12]:
# this goes through every username, and inserts it into the api url, and then passes that into requests.get().json()
# to obtain a list of repos for every single user
repo_lists = [requests.get(f"{base_url}/users/{username}/repos").json() for username in usernames]

### Step 5: Create a 'flat' list that contains every unique repo for every single user

Answer with list comprehension:

In [13]:
# this is a nested for-loop using a list comprehension that returns each item inside the inner list
repos = [repo for user in repo_lists for repo in user]

Nested loops with comprehensions can be difficult to interpret sometimes, so if a regular for-loop is easier to digest, this is a different way of writing the same thing:

In [14]:
repos = []

for user in repo_lists:
    for repo in user:
        repos.append(repo)
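
A third option is `itertools.chain.from_iterable`, which flattens one level of nesting. The sketch below also guards against a case the loops above don't: when a request fails (an unknown user, for example), the GitHub API typically returns a dictionary with a `message` key rather than a list of repos. The `repo_lists` value here is a hypothetical stand-in, trimmed to a single key per repo.

```python
from itertools import chain

# hypothetical stand-in for repo_lists: two users with repos, one error response
repo_lists = [
    [{'name': 'cdc-dashboard'}, {'name': 'data'}],
    {'message': 'Not Found'},
    [{'name': 'DAT-10-14'}],
]

# keep only the responses that are lists, then chain them into one flat list
repos = list(chain.from_iterable(r for r in repo_lists if isinstance(r, list)))

print([repo['name'] for repo in repos])   # ['cdc-dashboard', 'data', 'DAT-10-14']
```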

### Step 6:  Get information about the name, owner, url, and date of every single repo.

In [16]:
# this creates a list of all the values for the name key
repo_names = [repo['name'] for repo in repos]
# ditto for the login key -- notice it's accessed inside the owner key
owners     = [repo['owner']['login'] for repo in repos]
# next two work the same way
urls       = [repo['html_url'] for repo in repos]
dates      = [repo['created_at'] for repo in repos]

### Step 7:  Create a dictionary with the data created from step 6

In [17]:
data_dict = {
    'Owner': owners,
    'Name': repo_names,
    'URL': urls,
    'Date': dates
}

### Step 8:  Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

In [18]:
import pandas as pd

# this will take your dictionary and turn it into a dataframe
df = pd.DataFrame(data_dict)
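
Building four separate lists and then a dictionary is one route; another is to build one small dictionary per repo. The sketch below does that in plain Python. The `repos` list is a hypothetical, trimmed-down stand-in containing only the keys this lab uses.

```python
# hypothetical stand-in for repos, keeping only the keys used in this lab
repos = [
    {'name': 'cdc-dashboard',
     'owner': {'login': 'JonathanBechtel'},
     'html_url': 'https://github.com/JonathanBechtel/cdc-dashboard',
     'created_at': '2016-11-02T14:39:37Z'},
    {'name': 'data',
     'owner': {'login': 'JonathanBechtel'},
     'html_url': 'https://github.com/JonathanBechtel/data',
     'created_at': '2019-01-14T22:09:06Z'},
]

# one dictionary per repo, with the same column names as data_dict
records = [
    {'Owner': repo['owner']['login'],
     'Name': repo['name'],
     'URL': repo['html_url'],
     'Date': repo['created_at']}
    for repo in repos
]

print(records[0]['Name'])   # cdc-dashboard
```

`pd.DataFrame` accepts a list of dictionaries like `records` as well, treating each dictionary as one row, so `pd.DataFrame(records)` would give the same four columns.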

In [19]:
# look how pretty it is :)
df.head()

| | Owner | Name | URL | Date |
|---|---|---|---|---|
| 0 | JonathanBechtel | cdc-dashboard | https://github.com/JonathanBechtel/cdc-dashboard | 2016-11-02T14:39:37Z |
| 1 | JonathanBechtel | DAT-01-21 | https://github.com/JonathanBechtel/DAT-01-21 | 2020-01-21T12:57:43Z |
| 2 | JonathanBechtel | DAT-06-24 | https://github.com/JonathanBechtel/DAT-06-24 | 2019-06-26T15:12:49Z |
| 3 | JonathanBechtel | DAT-10-14 | https://github.com/JonathanBechtel/DAT-10-14 | 2019-10-14T16:13:47Z |
| 4 | JonathanBechtel | data | https://github.com/JonathanBechtel/data | 2019-01-14T22:09:06Z |
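
The `Date` values are still plain strings at this point. As a possible next step, the sketch below converts one of them into a timezone-aware `datetime` using only the standard library. `parse_github_date` is a hypothetical helper name; in pandas, `pd.to_datetime` would usually be the tool for converting a whole column.

```python
from datetime import datetime, timezone

def parse_github_date(text):
    """Convert a timestamp such as '2016-11-02T14:39:37Z' into an aware UTC datetime."""
    # the trailing 'Z' means UTC, so attach that timezone after parsing
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

created = parse_github_date('2016-11-02T14:39:37Z')
print(created.year, created.month, created.day)   # 2016 11 2
```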
