## Intermediate Lab: Creating a Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for your classmates' GitHub repos, using only the CSV file with everyone's URL.

The process has four general steps:

 - load in the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information to turn it into a DataFrame
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1: Load in the CSV file with everyone's GitHub repo

In [1]:
import requests
import pandas as pd

In [2]:
url = r"C:\Users\Jonat\Downloads\Github Repos DAT-10-19 - Sheet1.csv"

In [3]:
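# the file is UTF-8; without an explicit encoding, Windows falls back to its
# locale encoding and garbles accented names such as Chloé
with open(url, encoding="utf-8") as f:
    data = f.readlines()

In [4]:
data

['Name,Github URL\n',
 'Chloé,https://github.com/chloemd\n',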
 ',\n',
 'Gary,https://github.com/Gmarin10\n',
 'Cameron ,https://github.com/clefevre01\n',
 'Oore,https://github.com/ladipoore\n',
 'Jaryd Thornton,https://github.com/jcolethornton\n',
 'Peter,https://github.com/Lothdyn/my-1019-repo\n',
 'Alvaro ,https://github.com/alvarog01/mydat1019\n',
 ',\n',
 'Amanda Chernishkin,https://github.com/amandachernishkin\n',
 'John Mayer,https://github.com/mayerjp01\n',
 'Nidhi Mahambre,https://github.com/nidhim03']
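
The raw lines above include empty `,` rows and a name with a trailing space (`'Cameron '`). As a minimal sketch, assuming the same two-column layout, the standard-library `csv` module can parse the rows and skip the blanks in one pass. The `sample_text` string stands in for the file contents and is only an illustration.

```python
import csv
import io

# stand-in for the file contents, copied from a few of the lines shown above
sample_text = (
    "Name,Github URL\n"
    "Gary,https://github.com/Gmarin10\n"
    ",\n"
    "Cameron ,https://github.com/clefevre01\n"
    "Peter,https://github.com/Lothdyn/my-1019-repo\n"
)

# DictReader uses the header row as keys, so each row becomes a dict
reader = csv.DictReader(io.StringIO(sample_text))

# keep only rows that actually have a URL, trimming stray whitespace
rows = [
    {"name": row["Name"].strip(), "url": row["Github URL"].strip()}
    for row in reader
    if row["Github URL"].strip()
]

for row in rows:
    print(row["name"], row["url"])
```

With a real file, an open file handle (opened with `newline=""`) can be passed to `csv.DictReader` in place of the `io.StringIO` object.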

In [4]:
urls = [url for url in data if 'https' in url]

In [6]:
urls[0].split('/')

['Chloé,https:', '', 'github.com', 'chloemd\n']

In [7]:
[len(url.split('/')) for url in urls]

[4, 4, 4, 4, 4, 5, 5, 4, 4, 4]

In [8]:
for url in urls:
    if len(url.split('/')) == 4:
        print(url.split('/')[-1])
    else:
        print(url.split('/')[-2])

chloemd

Gmarin10

clefevre01

ladipoore

jcolethornton

Lothdyn
alvarog01
amandachernishkin

mayerjp01

nidhim03


In [9]:
usernames = []

for url in urls:
    if len(url.split('/')) == 4:
        usernames.append(url.split('/')[-1])
    else:
        usernames.append(url.split('/')[-2])

In [20]:
usernames[0].replace('\n', '')

'chloemd'

In [10]:
usernames

['chloemd\n',
 'Gmarin10\n',
 'clefevre01\n',
 'ladipoore\n',
 'jcolethornton\n',
 'Lothdyn',
 'alvarog01',
 'amandachernishkin\n',
 'mayerjp01\n',
 'nidhim03']

In [11]:
usernames = [user.replace('\n', '') for user in usernames]

In [12]:
usernames

['chloemd',
 'Gmarin10',
 'clefevre01',
 'ladipoore',
 'jcolethornton',
 'Lothdyn',
 'alvarog01',
 'amandachernishkin',
 'mayerjp01',
 'nidhim03']
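
The length check above works because every URL in this file splits into either four or five pieces. As an alternative sketch (not the approach used in the lab), `urllib.parse.urlparse` from the standard library isolates the path, and the username is then always the first path segment, whether or not a repo name follows. The helper name `username_from_url` is made up for this example, and it expects just the URL, so split each line on the comma first.

```python
from urllib.parse import urlparse


def username_from_url(url):
    """Return the first path segment of a GitHub URL, i.e. the account name."""
    path = urlparse(url.strip()).path      # e.g. '/Lothdyn/my-1019-repo'
    return path.strip("/").split("/")[0]


sample_urls = [
    "https://github.com/chloemd\n",
    "https://github.com/Lothdyn/my-1019-repo\n",
    "https://github.com/alvarog01/mydat1019\n",
]

print([username_from_url(url) for url in sample_urls])
```

Because the helper strips whitespace and slashes itself, no separate `replace('\n', '')` pass is needed afterwards.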

In [13]:
api_url = "https://api.github.com/users/chloemd/repos"
requests.get(api_url).json()

[{'id': 305542140,
  'node_id': 'MDEwOlJlcG9zaXRvcnkzMDU1NDIxNDA=',
  'name': 'DAT-1019-Chloe',
  'full_name': 'chloemd/DAT-1019-Chloe',
  'private': False,
  'owner': {'login': 'chloemd',
   'id': 73141231,
   'node_id': 'MDQ6VXNlcjczMTQxMjMx',
   'avatar_url': 'https://avatars1.githubusercontent.com/u/73141231?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/chloemd',
   'html_url': 'https://github.com/chloemd',
   'followers_url': 'https://api.github.com/users/chloemd/followers',
   'following_url': 'https://api.github.com/users/chloemd/following{/other_user}',
   'gists_url': 'https://api.github.com/users/chloemd/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/chloemd/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/chloemd/subscriptions',
   'organizations_url': 'https://api.github.com/users/chloemd/orgs',
   'repos_url': 'https://api.github.com/users/chloemd/repos',
   'events_url': 'https://api.github.com/users/c

In [14]:
user_repos = [requests.get(f"https://api.github.com/users/{user}/repos").json()
              for user in usernames]
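
This endpoint is paginated: by default GitHub returns at most 30 repos per request, and the `per_page` parameter can raise that to 100. For a classmate with many repos, a single call may therefore miss some. The sketch below shows one way to walk the pages. `repos_page_url` and `fetch_all_repos` are hypothetical helpers, and `get_json` is any function you supply that turns a URL into parsed JSON.

```python
def repos_page_url(username, page, per_page=100):
    """Build the URL for one page of a user's public repos."""
    return (f"https://api.github.com/users/{username}/repos"
            f"?per_page={per_page}&page={page}")


def fetch_all_repos(username, get_json, per_page=100):
    """Collect every page of repos for one user.

    get_json is a callable that takes a URL and returns the parsed JSON.
    """
    repos = []
    page = 1
    while True:
        payload = get_json(repos_page_url(username, page, per_page))
        # an error (unknown user, rate limit) comes back as a dict, not a list
        if not isinstance(payload, list):
            raise ValueError(f"unexpected response for {username}: {payload}")
        repos.extend(payload)
        # a short page means there is nothing left to fetch
        if len(payload) < per_page:
            return repos
        page += 1
```

In the notebook this could be called as `fetch_all_repos('chloemd', lambda url: requests.get(url, timeout=10).json())`.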

In [19]:
user_repos[3]

[{'id': 305542154,
  'node_id': 'MDEwOlJlcG9zaXRvcnkzMDU1NDIxNTQ=',
  'name': 'DAT-1019',
  'full_name': 'ladipoore/DAT-1019',
  'private': False,
  'owner': {'login': 'ladipoore',
   'id': 22856225,
   'node_id': 'MDQ6VXNlcjIyODU2MjI1',
   'avatar_url': 'https://avatars0.githubusercontent.com/u/22856225?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/ladipoore',
   'html_url': 'https://github.com/ladipoore',
   'followers_url': 'https://api.github.com/users/ladipoore/followers',
   'following_url': 'https://api.github.com/users/ladipoore/following{/other_user}',
   'gists_url': 'https://api.github.com/users/ladipoore/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/ladipoore/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/ladipoore/subscriptions',
   'organizations_url': 'https://api.github.com/users/ladipoore/orgs',
   'repos_url': 'https://api.github.com/users/ladipoore/repos',
   'events_url': 'https://api.github.c

In [4]:
# notice that data is a list filled with strings that contain info about each line
data

['Name,Repo\n',
 'Jonathan Bechtel,https://github.com/JonathanBechtel\n',
 'Aoife Duna,https://github.com/aoifeduna\n',
 'Erik Lindernoren,https://github.com/eriklindernoren']

In [None]:
[[{}, {}], [{}], [{}, {}, {}]]

In [None]:
[{}, {}, {}, {}, {}, {}]

In [20]:
list_of_repos = []

for user in user_repos:
    for repo in user:
        list_of_repos.append(repo)

In [23]:
list_of_repos = [repo for user in user_repos for repo in user]

In [24]:
list_of_repos[1]

{'id': 119903053,
 'node_id': 'MDEwOlJlcG9zaXRvcnkxMTk5MDMwNTM=',
 'name': 'DP2',
 'full_name': 'Gmarin10/DP2',
 'private': False,
 'owner': {'login': 'Gmarin10',
  'id': 23226500,
  'node_id': 'MDQ6VXNlcjIzMjI2NTAw',
  'avatar_url': 'https://avatars1.githubusercontent.com/u/23226500?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/Gmarin10',
  'html_url': 'https://github.com/Gmarin10',
  'followers_url': 'https://api.github.com/users/Gmarin10/followers',
  'following_url': 'https://api.github.com/users/Gmarin10/following{/other_user}',
  'gists_url': 'https://api.github.com/users/Gmarin10/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/Gmarin10/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/Gmarin10/subscriptions',
  'organizations_url': 'https://api.github.com/users/Gmarin10/orgs',
  'repos_url': 'https://api.github.com/users/Gmarin10/repos',
  'events_url': 'https://api.github.com/users/Gmarin10/events{/privacy}',
  'r

In [27]:
pd.DataFrame([{'name': 'john', 'age': 34} for i in range(10)])

| | name | age |
|---|---|---|
| 0 | john | 34 |
| 1 | john | 34 |
| 2 | john | 34 |
| 3 | john | 34 |
| 4 | john | 34 |
| 5 | john | 34 |
| 6 | john | 34 |
| 7 | john | 34 |
| 8 | john | 34 |
| 9 | john | 34 |


In [42]:
df = pd.DataFrame({
'name': [repo['name'] for repo in list_of_repos],
'html_url': [repo['html_url'] for repo in list_of_repos],
'created_at': [repo['created_at'] for repo in list_of_repos],
'login': [repo['owner']['login'] for repo in list_of_repos]
})

In [38]:
string = 'this is my string'

In [39]:
string['name']

TypeError: string indices must be integers
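
One way this same error can surface later in the lab (an assumption, not something the cells above demonstrate) is when an API call fails. GitHub then returns a dictionary such as `{"message": "Not Found"}` instead of a list of repos, and looping over a dictionary yields its string keys. The sketch below uses made-up stand-in responses to show the effect and one way to guard against it.

```python
# a dict yields its KEYS when you loop over it, and keys are strings
error_payload = {"message": "Not Found"}
print([type(item).__name__ for item in error_payload])

# stand-in responses: one real list of repos, one error payload
responses = [
    [{"name": "DP2"}, {"name": "GA-DAT-1019"}],
    error_payload,
]

# keep only the responses that are lists before flattening
valid = [resp for resp in responses if isinstance(resp, list)]
names = [repo["name"] for user in valid for repo in user]
print(names)
```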

In [None]:
[{'name': 'jonathan', 'age': 35}]

In [36]:
df.head()

| | name | html_url | created_at | login |
|---|---|---|---|---|
| 0 | DAT-1019-Chloe | https://github.com/chloemd/DAT-1019-Chloe | 2020-10-20T00:00:09Z | chloemd |
| 1 | DP2 | https://github.com/Gmarin10/DP2 | 2018-02-01T22:47:29Z | Gmarin10 |
| 2 | GA-DAT-1019 | https://github.com/Gmarin10/GA-DAT-1019 | 2020-10-20T00:00:09Z | Gmarin10 |
| 3 | Project-Basta | https://github.com/Gmarin10/Project-Basta | 2019-06-11T23:08:20Z | Gmarin10 |
| 4 | Wallbreakers | https://github.com/Gmarin10/Wallbreakers | 2019-06-24T22:59:34Z | Gmarin10 |


In [37]:
pd.DataFrame(list_of_repos)

| | id | node_id | name | full_name | private | owner | html_url | description | fork | url | ... | forks_count | mirror_url | archived | disabled | open_issues_count | license | forks | open_issues | watchers | default_branch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 305542140 | MDEwOlJlcG9zaXRvcnkzMDU1NDIxNDA= | DAT-1019-Chloe | chloemd/DAT-1019-Chloe | False | {'login': 'chloemd', 'id': 73141231, 'node_id'... | https://github.com/chloemd/DAT-1019-Chloe | My DAT 10/19 repo | False | https://api.github.com/repos/chloemd/DAT-1019-... | ... | 0 | | False | False | 0 | | 0 | 0 | 1 | main |
| 1 | 119903053 | MDEwOlJlcG9zaXRvcnkxMTk5MDMwNTM= | DP2 | Gmarin10/DP2 | False | {'login': 'Gmarin10', 'id': 23226500, 'node_id... | https://github.com/Gmarin10/DP2 | Senior design project | False | https://api.github.com/repos/Gmarin10/DP2 | ... | 1 | | False | False | 0 | | 1 | 0 | 0 | master |
| 2 | 305542139 | MDEwOlJlcG9zaXRvcnkzMDU1NDIxMzk= | GA-DAT-1019 | Gmarin10/GA-DAT-1019 | False | {'login': 'Gmarin10', 'id': 23226500, 'node_id... | https://github.com/Gmarin10/GA-DAT-1019 | | False | https://api.github.com/repos/Gmarin10/GA-DAT-1019 | ... | 0 | | False | False | 0 | | 0 | 0 | 0 | main |
| 3 | 191462349 | MDEwOlJlcG9zaXRvcnkxOTE0NjIzNDk= | Project-Basta | Gmarin10/Project-Basta | False | {'login': 'Gmarin10', 'id': 23226500, 'node_id... | https://github.com/Gmarin10/Project-Basta | An algorithm that utilizes the Python Airtable... | False | https://api.github.com/repos/Gmarin10/Project-... | ... | 0 | | False | False | 0 | {'key': 'gpl-3.0', 'name': 'GNU General Public... | 0 | 0 | 0 | master |
| 4 | 193592471 | MDEwOlJlcG9zaXRvcnkxOTM1OTI0NzE= | Wallbreakers | Gmarin10/Wallbreakers | False | {'login': 'Gmarin10', 'id': 23226500, 'node_id... | https://github.com/Gmarin10/Wallbreakers | weekly coding exercises for Wallbreakers fello... | False | https://api.github.com/repos/Gmarin10/Wallbrea... | ... | 0 | | False | False | 0 | {'key': 'mit', 'name': 'MIT License', 'spdx_id... | 0 | 0 | 0 | master |
| 5 | 305542132 | MDEwOlJlcG9zaXRvcnkzMDU1NDIxMzI= | Test-Repo | clefevre01/Test-Repo | False | {'login': 'clefevre01', 'id': 73093116, 'node_... | https://github.com/clefevre01/Test-Repo | | False | https://api.github.com/repos/clefevre01/Test-Repo | ... | 0 | | False | False | 0 | | 0 | 0 | 0 | main |
| 6 | 305542154 | MDEwOlJlcG9zaXRvcnkzMDU1NDIxNTQ= | DAT-1019 | ladipoore/DAT-1019 | False | {'login': 'ladipoore', 'id': 22856225, 'node_i... | https://github.com/ladipoore/DAT-1019 | ga dat 1019 class | False | https://api.github.com/repos/ladipoore/DAT-1019 | ... | 0 | | False | False | 0 | | 0 | 0 | 0 | main |
| 7 | 98481060 | MDEwOlJlcG9zaXRvcnk5ODQ4MTA2MA== | EdgarSearch | ladipoore/EdgarSearch | False | {'login': 'ladipoore', 'id': 22856225, 'node_i... | https://github.com/ladipoore/EdgarSearch | A script to download fillings from Edgar more ... | False | https://api.github.com/repos/ladipoore/EdgarSe... | ... | 0 | | False | False | 0 | | 0 | 0 | 0 | master |
| 8 | 81528354 | MDEwOlJlcG9zaXRvcnk4MTUyODM1NA== | Euler | ladipoore/Euler | False | {'login': 'ladipoore', 'id': 22856225, 'node_i... | https://github.com/ladipoore/Euler | trying out a few project euler problems in python | False | https://api.github.com/repos/ladipoore/Euler | ... | 0 | | False | False | 0 | | 0 | 0 | 0 | master |
| 9 | 81888125 | MDEwOlJlcG9zaXRvcnk4MTg4ODEyNQ== | PythonClass | ladipoore/PythonClass | False | {'login': 'ladipoore', 'id': 22856225, 'node_i... | https://github.com/ladipoore/PythonClass | HW from python class and more | False | https://api.github.com/repos/ladipoore/PythonC... | ... | 0 | | False | False | 0 | | 0 | 0 | 0 | master |
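
Notice that the `owner` and `license` columns above hold whole dictionaries rather than single values. One option, sketched here on two trimmed-down records rather than the full API response, is `pd.json_normalize`, which expands nested dictionaries into dotted column names such as `owner.login`.

```python
import pandas as pd

# two records cut down to a handful of keys taken from the output above
sample_repos = [
    {"name": "DAT-1019-Chloe", "owner": {"login": "chloemd", "id": 73141231}},
    {"name": "DP2", "owner": {"login": "Gmarin10", "id": 23226500}},
]

# nested keys become their own columns, joined to the parent key with a dot
flat_df = pd.json_normalize(sample_repos)
print(flat_df[["name", "owner.login"]])
```

This avoids writing `repo['owner']['login']` by hand for every nested field you want to keep.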



### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

In [5]:
# this loops through each item in the list, starting at position 1 to skip the header, and replaces the \n character with nothing
cleaned_data = [repo.replace('\n', "") for repo in data[1:]]

In [6]:
# we can confirm now that these marks are no longer there
cleaned_data

['Jonathan Bechtel,https://github.com/JonathanBechtel',
 'Aoife Duna,https://github.com/aoifeduna',
 'Erik Lindernoren,https://github.com/eriklindernoren']
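
As a small alternative sketch, `str.splitlines()` drops the line endings while splitting, so the separate `replace` pass is not needed. The `raw_text` string below stands in for what `f.read()` would return for the file shown above.

```python
# stand-in for the result of f.read() on the file shown above
raw_text = (
    "Name,Repo\n"
    "Jonathan Bechtel,https://github.com/JonathanBechtel\n"
    "Aoife Duna,https://github.com/aoifeduna\n"
    "Erik Lindernoren,https://github.com/eriklindernoren"
)

# splitlines() removes the line endings; [1:] skips the header row
cleaned = raw_text.splitlines()[1:]
print(cleaned)
```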

### Step 3:  Separate the username in each string from everything else

In [10]:
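# repo_urls is not defined in the cells shown here; one way to build it is to
# split each line on the comma and keep the LAST item, which is the url
repo_urls = [line.split(',')[-1] for line in cleaned_data]
# we do the same thing with '/', again taking the LAST item from the list returned by split()
usernames = [url.split('/')[-1] for url in repo_urls]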

### Step 4: Obtain the repo data for every single GitHub username

In [11]:
# this part of the url will never change
base_url = 'https://api.github.com'

In [12]:
# this goes through every username, and inserts it into the api url, and then passes that into requests.get().json()
# to obtain a list of repos for every single user
repo_lists = [requests.get(f"{base_url}/users/{username}/repos").json() for username in usernames]
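
Unauthenticated requests to the GitHub API are limited to 60 per hour, which a long class list can use up quickly. A minimal sketch of sending a personal access token with each request follows. The environment variable name `GITHUB_TOKEN` and the helper `github_headers` are assumptions for illustration.

```python
import os


def github_headers(token=None):
    """Build request headers, adding the token only when one is provided."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


# read the token from the environment so it never appears in the notebook
headers = github_headers(os.environ.get("GITHUB_TOKEN"))
print(sorted(headers))
```

The resulting dictionary can be passed along with each call, for example `requests.get(url, headers=headers)`.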

### Step 5: Create a 'flat' list that contains every unique repo for every single user

Answer with list comprehension:

In [13]:
# this is a nested for-loop using a list comprehension that returns each item inside the inner list
repos = [repo for user in repo_lists for repo in user]

Nested loops with comprehensions can be difficult to interpret sometimes, so if a regular for-loop is easier to digest, this is a different way of writing the same thing:

In [14]:
repos = []

for user in repo_lists:
    for repo in user:
        repos.append(repo)
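
A third way to write the same flattening step, shown here as a sketch on a tiny made-up sample, is `itertools.chain.from_iterable` from the standard library. It joins the inner lists end to end without an explicit nested loop.

```python
from itertools import chain

# stand-in for repo_lists: one inner list per user
repo_lists_sample = [
    [{"name": "DP2"}, {"name": "GA-DAT-1019"}],
    [],                       # a user with no public repos
    [{"name": "Test-Repo"}],
]

# chain.from_iterable walks each inner list in turn
flat = list(chain.from_iterable(repo_lists_sample))
print([repo["name"] for repo in flat])
```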

### Step 6:  Get information about the name, owner, url, and date of every single repo.

In [16]:
# this creates a list of all the values for the name key
repo_names = [repo['name'] for repo in repos]
# ditto for the login key -- notice it's accessed inside the owner key
owners     = [repo['owner']['login'] for repo in repos]
# next two work the same way
urls       = [repo['html_url'] for repo in repos]
dates      = [repo['created_at'] for repo in repos]

### Step 7: Create a dictionary with the data created in step 6

In [17]:
data_dict = {
    'Owner': owners,
    'Name': repo_names,
    'URL': urls,
    'Date': dates
}
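
The four separate lists from step 6 stay aligned only because they are all built from the same `repos` list in the same order. A sketch of an alternative is to build one small dictionary per repo, since `pd.DataFrame` also accepts a list of such records. `summarize_repo` is a hypothetical helper, and the sample repo is trimmed to the keys used here.

```python
def summarize_repo(repo):
    """Pick out the four fields the lab cares about from one repo dict."""
    return {
        "Owner": repo["owner"]["login"],
        "Name": repo["name"],
        "URL": repo["html_url"],
        "Date": repo["created_at"],
    }


sample_repo = {
    "name": "DP2",
    "html_url": "https://github.com/Gmarin10/DP2",
    "created_at": "2018-02-01T22:47:29Z",
    "owner": {"login": "Gmarin10"},
    "private": False,   # extra keys are simply ignored
}

records = [summarize_repo(repo) for repo in [sample_repo]]
print(records)
```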

### Step 8: Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

In [18]:
import pandas as pd

# this will take your dictionary and turn it into a dataframe
df = pd.DataFrame(data_dict)

In [19]:
# look how pretty it is :)
df.head()

| | Owner | Name | URL | Date |
|---|---|---|---|---|
| 0 | JonathanBechtel | cdc-dashboard | https://github.com/JonathanBechtel/cdc-dashboard | 2016-11-02T14:39:37Z |
| 1 | JonathanBechtel | DAT-01-21 | https://github.com/JonathanBechtel/DAT-01-21 | 2020-01-21T12:57:43Z |
| 2 | JonathanBechtel | DAT-06-24 | https://github.com/JonathanBechtel/DAT-06-24 | 2019-06-26T15:12:49Z |
| 3 | JonathanBechtel | DAT-10-14 | https://github.com/JonathanBechtel/DAT-10-14 | 2019-10-14T16:13:47Z |
| 4 | JonathanBechtel | data | https://github.com/JonathanBechtel/data | 2019-01-14T22:09:06Z |
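
The `Date` column above still holds plain strings. As a possible next step, sketched on two of the rows shown, `pd.to_datetime` converts them to real timestamps so they can be sorted or broken into parts. The `Year` column is added only for illustration.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Name": ["cdc-dashboard", "DAT-01-21"],
    "Date": ["2016-11-02T14:39:37Z", "2020-01-21T12:57:43Z"],
})

# parse the ISO 8601 strings; utc=True keeps the timezone from the trailing Z
sample_df["Date"] = pd.to_datetime(sample_df["Date"], utc=True)
sample_df["Year"] = sample_df["Date"].dt.year

# newest repos first
print(sample_df.sort_values("Date", ascending=False))
```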
