## Intermediate Lab:  Creating A Dataset Using the GitHub API

In this lab you'll create a dataset containing the metadata for your classmates' GitHub repos, using only the CSV file with everyone's URL.

The process breaks down into these 4 general steps:

 - load in the CSV file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - re-structure this information to turn it into a dataframe.
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1:  Load in the CSV file with everyone's GitHub repo

**Note:** There are a number of ways to do this, but the easiest way is usually this:

    `with open('file.csv') as f:
        data = f.readlines()`
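
As one alternative to `readlines()`, the standard library's `csv` module splits each row into its cells for you. The sketch below is a minimal illustration only: it reads from an in-memory string that stands in for the real file, and the two sample rows are made up.

```python
import csv
import io

# Hypothetical stand-in for the real file; to read from disk,
# use open('file.csv', newline='') in place of io.StringIO(...).
sample = io.StringIO(
    "Name,Repo\n"
    "Person 1,https://github.com/username1\n"
    "Person 2,https://github.com/username2\n"
)

rows = list(csv.reader(sample))      # each row becomes a list of cell values
header, records = rows[0], rows[1:]  # first row holds the column names

print(header)
print(records)
```

The rest of the lab assumes the `readlines()` output (a list of strings), so treat this as a side note rather than a replacement.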

In [2]:
# your code here
with open('C:\\users\\escag\\Downloads\\DAT-07-28 Github Repos - Sheet1.csv') as f:
    data = f.readlines()


In [17]:
# skip the header row, then print each remaining line
data1 = data[1:]
for i in data1:
    print(i)

Jonathan Bechtel,https://github.com/JonathanBechtel

Luki Elizalde,https://github.com/groovyluki

iuliana trufas,https://github.com/Yuliana-GitHub

Neraj Thangarajah,https://github.com/nthang1

Alina Urs,https://github.com/sprintkayaking

Ashleigh Grant,https://github.com/AshleighGrant

Nick Hudgell ,Https://github.com/nhudgell/GADS

Elisa Scagnetto,https://github.com/lisadt/es_repo280720


In [25]:
# split each line around 'github.com'; the second piece holds the username
data2=[i.split('github.com') for i in data1]
data2

[['Jonathan Bechtel,https://', '/JonathanBechtel\n'],
 ['Luki Elizalde,https://', '/groovyluki\n'],
 ['iuliana trufas,https://', '/Yuliana-GitHub\n'],
 ['Neraj Thangarajah,https://', '/nthang1\n'],
 ['Alina Urs,https://', '/sprintkayaking\n'],
 ['Ashleigh Grant,https://', '/AshleighGrant\n'],
 ['Nick Hudgell ,Https://', '/nhudgell/GADS\n'],
 ['Elisa Scagnetto,https://', '/lisadt/es_repo280720']]

In [72]:
# keep the text after 'github.com', trim the newline and slashes,
# then take the first path segment so any trailing repo name is dropped
data7 = [i[1].strip().strip('/').split('/')[0] for i in data2]
data7

['JonathanBechtel',
 'groovyluki',
 'Yuliana-GitHub',
 'nthang1',
 'sprintkayaking',
 'AshleighGrant',
 'nhudgell',
 'lisadt']

What you should have now is a list in which each item is a string containing the comma-separated cell values from one row of the CSV file.

It should generically look like this:

    `['Name,Repo\n',
      'Person 1,https://github.com/username1\n',
      'Person 2,https://github.com/username2\n',
       ......
       ]`

Double check this is the case.

In [75]:
# your code here


The only thing we need from each item is the person's username: the part of the string that follows `https://github.com/`, as in `https://github.com/username_here`.  Everything else is junk.

We'll need to go through a few steps to get our info down to a usable format.  

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

When you're done you should have a list that looks like this:

      `[
      'Person 1,https://github.com/username1',
      'Person 2,https://github.com/username2',
       ......
       ]`

**hint:** The `replace()` method for strings is probably one of the more useful options that you have.  If you want to replace something with nothing, you can simply specify `""` for that part.
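
One possible way to do this, sketched on made-up sample data in the shape `readlines()` returns (the names `data` and `cleaned` are only illustrative):

```python
# Hypothetical sample in the same shape as the list described above.
data = [
    "Name,Repo\n",
    "Person 1,https://github.com/username1\n",
    "Person 2,https://github.com/username2\n",
]

# data[1:] skips the header row; replace() swaps each newline for nothing.
cleaned = [line.replace("\n", "") for line in data[1:]]

print(cleaned)
```

`line.strip()` would also work here, and it additionally removes stray spaces at either end of the line.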

In [None]:
# your code here


### Step 3:  Separate the url in each string from everything else

When you're done you should have a new list that looks like this:

    `[
       'https://github.com/username1',
       'https://github.com/username2',
       ...
     ]`
     
**hint:** The `split()` method will help you out here.
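
A minimal sketch of the idea, assuming each line holds exactly one comma (a name that itself contains a comma would break it); the sample values are made up:

```python
# Hypothetical sample: header and newlines already removed.
cleaned = [
    "Person 1,https://github.com/username1",
    "Person 2,https://github.com/username2",
]

# split(",") turns each line into [name, url]; index 1 keeps the URL.
urls = [line.split(",")[1] for line in cleaned]

print(urls)
```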

### Step 4:  Separate the username in each string from everything else

When you're done you should have a new list that looks like this:

    `[
       'username1',
       'username2',
       ...
     ]`
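
Some URLs in a class list may point at a specific repo rather than at the profile, or start with `Https` instead of `https`. The sketch below is one way to cope with both, using the standard library's `urlparse`; the helper name `username_from_url` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical sample URLs, including one with an extra repo segment.
urls = [
    "https://github.com/username1",
    "Https://github.com/username2/some_repo",
]

def username_from_url(url):
    """Return the first path segment of a GitHub URL."""
    path = urlparse(url.strip()).path       # e.g. '/username2/some_repo'
    return path.strip("/").split("/")[0]    # keep only the first segment

usernames = [username_from_url(u) for u in urls]

print(usernames)
```

Taking the first path segment avoids having to hard-code the repo names that need removing.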

In [None]:
# your code here


### Step 5: Obtain the repo data for every single github username

The repository info for every single public account is available via the following URL: `https://api.github.com/users/:the_username/repos`

So basically, `requests.get('https://api.github.com/users/:the_username/repos').json()` will return a list of that user's public repos.  (The API pages its results, 30 repos per page by default, so a user with more than 30 repos needs more than one request.)
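
The GitHub API returns repos in pages, 30 at a time by default, and accepts `per_page` (up to 100) and `page` query parameters. The sketch below only builds the request URL and makes no network call; `repos_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def repos_url(username, page=1, per_page=100):
    """Build the repos URL for one page of one user's results."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"https://api.github.com/users/{username}/repos?{query}"

print(repos_url("username1"))
print(repos_url("username1", page=2))
```

Passing each URL to `requests.get(...).json()` and stopping when a page comes back as an empty list is one way to collect every repo for a user. Keep in mind that every page counts as a separate request.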

When you're done, you should have a *list of lists*, with each inner list containing one user's repos.  It'll look like this:

`[[{user1, repo1}, {user1, repo2}], [{user2, repo1}], [{user3, repo1}, {user3, repo2}, {user3, repo3}], .....]`

**Warning:** We're using the free, unauthenticated version of the API here, which allows only 60 requests per hour before throttling kicks in.  Once that quota is used up, the response you get back is a dictionary telling you that you've exceeded the rate limit, or something similar.

If that's the case, try using your phone (or your neighbor's) as a hotspot and connect from there to get a new IP address.
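
Because a throttled request gives back a dictionary where a list was expected, the two cases can be told apart by type. The sketch below is a minimal illustration on made-up payloads (it makes no network calls); `split_responses` is a hypothetical helper name, and the sample message text is only an assumption about what the error looks like.

```python
def split_responses(usernames, payloads):
    """Separate successful repo lists from error dictionaries."""
    repos_by_user, errors = {}, {}
    for user, payload in zip(usernames, payloads):
        if isinstance(payload, list):
            repos_by_user[user] = payload            # normal case: a list of repos
        elif isinstance(payload, dict):
            errors[user] = payload.get("message", "unknown error")
        else:
            errors[user] = "unexpected response type"
    return repos_by_user, errors

# Made-up payloads standing in for the parsed .json() results.
payloads = [
    [{"name": "repo1"}],
    {"message": "API rate limit exceeded"},
]
ok, failed = split_responses(["username1", "username2"], payloads)

print(ok)
print(failed)
```

Checking for errors before moving on keeps a stray dictionary from ending up among the repo lists.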

In [76]:
# your code here
import requests
reposlist=[]
for user in data7:
    reposlist.append(requests.get(f'https://api.github.com/users/{user}/repos').json())

In [77]:
reposlist

[[{'id': 260764681,
   'node_id': 'MDEwOlJlcG9zaXRvcnkyNjA3NjQ2ODE=',
   'name': 'bitcoin',
   'full_name': 'JonathanBechtel/bitcoin',
   'private': False,
   'owner': {'login': 'JonathanBechtel',
    'id': 481696,
    'node_id': 'MDQ6VXNlcjQ4MTY5Ng==',
    'avatar_url': 'https://avatars1.githubusercontent.com/u/481696?v=4',
    'gravatar_id': '',
    'url': 'https://api.github.com/users/JonathanBechtel',
    'html_url': 'https://github.com/JonathanBechtel',
    'followers_url': 'https://api.github.com/users/JonathanBechtel/followers',
    'following_url': 'https://api.github.com/users/JonathanBechtel/following{/other_user}',
    'gists_url': 'https://api.github.com/users/JonathanBechtel/gists{/gist_id}',
    'starred_url': 'https://api.github.com/users/JonathanBechtel/starred{/owner}{/repo}',
    'subscriptions_url': 'https://api.github.com/users/JonathanBechtel/subscriptions',
    'organizations_url': 'https://api.github.com/users/JonathanBechtel/orgs',
    'repos_url': 'https://api.

In [80]:
list1=[1,2,3,4,5]
list2=[6,7,8,9,0]
list3=[list1,list2]
list3

[[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]

In [85]:
list4=[n for i in list3 for n in i]
list4

[1, 2, 3, 4, 5, 6, 7, 8, 9, 0]

In [86]:
reposfin=[n for i in reposlist for n in i]

### Step 6: Create a 'flat' list that contains every unique repo for every single user

When you're done you should have a list that looks like this: `[{user1, repo1}, {user1, repo2}, ...., {user n, repo m}]`

I.e., instead of a list filled with other lists that hold dictionaries, make it a single list with just dictionaries inside, with no nested levels like you had before.

So, go from this:

`[[{user1, repo1}, {user1, repo2}, {user1, repo3}], [{user2, repo1}, {user2, repo2}]]`
    
To this:

`[{user1, repo1}, {user1, repo2}, {user1, repo3}, {user2, repo1}, {user2, repo2}]`
    
If you have questions about what this entails, then please contact me ASAP.
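
As a rough illustration of flattening, here are two equivalent approaches on a made-up nested list: `itertools.chain.from_iterable` from the standard library and a nested list comprehension. The dictionary contents are placeholders, not real API output.

```python
from itertools import chain

# Hypothetical nested structure: one inner list per user.
nested = [
    [{"owner": "user1", "name": "repo1"}, {"owner": "user1", "name": "repo2"}],
    [{"owner": "user2", "name": "repo1"}],
]

# Option 1: chain the inner lists together end to end.
flat = list(chain.from_iterable(nested))

# Option 2: the same thing as a comprehension (outer loop first, then inner).
flat_alt = [repo for user_repos in nested for repo in user_repos]

print(flat)
```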

In [87]:
reposfin

[{'id': 260764681,
  'node_id': 'MDEwOlJlcG9zaXRvcnkyNjA3NjQ2ODE=',
  'name': 'bitcoin',
  'full_name': 'JonathanBechtel/bitcoin',
  'private': False,
  'owner': {'login': 'JonathanBechtel',
   'id': 481696,
   'node_id': 'MDQ6VXNlcjQ4MTY5Ng==',
   'avatar_url': 'https://avatars1.githubusercontent.com/u/481696?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/JonathanBechtel',
   'html_url': 'https://github.com/JonathanBechtel',
   'followers_url': 'https://api.github.com/users/JonathanBechtel/followers',
   'following_url': 'https://api.github.com/users/JonathanBechtel/following{/other_user}',
   'gists_url': 'https://api.github.com/users/JonathanBechtel/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/JonathanBechtel/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/JonathanBechtel/subscriptions',
   'organizations_url': 'https://api.github.com/users/JonathanBechtel/orgs',
   'repos_url': 'https://api.github.com/users/Jo

### Step 7:  Get information about the name, owner, url, and date of every single repo.

In the dictionary for each repo there are keys called `name`, `login`, `html_url`, and `created_at`.  These are going to populate the values for our different columns.

Values for each one of these keys will need to exist inside their own lists.

**hint:** Notice that the `login` key is nested inside a dictionary that's the value to the `owner` key at the outer level.
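
To make the nesting concrete, here is a small sketch on a made-up repo dictionary that contains only the four keys of interest (real API responses carry many more):

```python
# Hypothetical, heavily trimmed repo dictionary.
repo = {
    "name": "repo1",
    "html_url": "https://github.com/username1/repo1",
    "created_at": "2020-01-01T00:00:00Z",
    "owner": {"login": "username1"},
}

name = repo["name"]               # top-level key: one lookup
login = repo["owner"]["login"]    # nested key: 'owner' first, then 'login'

print(name, login)
```

The same two-step lookup works inside a list comprehension that runs over every repo.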

In [108]:
# your code here
for repo in reposfin:
    print(repo['created_at'])

2020-05-02T19:57:48Z
2016-11-02T14:39:37Z
2020-05-01T14:46:48Z
2020-01-21T12:57:43Z
2019-06-26T15:12:49Z
2020-07-28T03:06:15Z
2019-10-14T16:13:47Z
2019-01-14T22:09:06Z
2016-09-01T16:55:29Z
2019-05-14T11:48:59Z
2016-12-30T00:10:24Z
2015-01-21T04:07:02Z
2016-12-22T23:40:54Z
2019-11-29T20:12:51Z
2019-03-31T02:36:48Z
2017-03-30T20:03:14Z
2019-06-06T06:41:54Z
2019-11-05T23:56:33Z
2019-06-06T06:33:28Z
2019-06-26T23:50:04Z
2019-06-27T00:03:23Z
2020-07-28T19:11:40Z
2020-05-02T19:55:10Z
2019-06-29T23:40:04Z
2019-10-15T00:31:57Z
2020-01-22T00:58:08Z
2019-12-19T14:14:00Z
2015-07-15T12:20:22Z
2017-10-17T04:01:46Z
2016-10-15T22:36:22Z
2020-07-22T12:45:12Z
2020-07-28T19:11:38Z
2020-07-31T11:18:35Z
2020-06-02T06:18:24Z
2020-07-28T19:20:31Z
2020-07-28T19:11:39Z
2020-07-26T20:03:49Z
2020-07-26T20:42:11Z
2020-07-28T19:10:55Z
2020-07-28T19:12:16Z
2020-07-28T19:11:39Z
2020-07-28T19:11:31Z


### Step 8:  Create a dictionary with the data created from step 7

Your final output should look like this:

    `{
       'Owner': [list with the `login` values for each repo],
       'Name' : [list with the `name` values for each repo],
       'URL'  : [list with the `html_url` values for each repo],
       'Date' : [list with the `created_at` values for each repo]
     }`
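
One way to build this structure is to start with empty lists and append to all four in a single pass, which keeps the lists the same length and in the same order. The sketch below uses two made-up repo dictionaries; the variable names are illustrative.

```python
# Hypothetical, trimmed repo dictionaries.
repos = [
    {"name": "repo1", "html_url": "https://github.com/username1/repo1",
     "created_at": "2020-01-01T00:00:00Z", "owner": {"login": "username1"}},
    {"name": "repo2", "html_url": "https://github.com/username2/repo2",
     "created_at": "2020-02-01T00:00:00Z", "owner": {"login": "username2"}},
]

columns = {"Owner": [], "Name": [], "URL": [], "Date": []}
for repo in repos:
    columns["Owner"].append(repo["owner"]["login"])
    columns["Name"].append(repo["name"])
    columns["URL"].append(repo["html_url"])
    columns["Date"].append(repo["created_at"])

print(columns)
```

Position `i` in every list then refers to the same repo, which is what a dataframe row needs.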

In [112]:
# your answer here
# each comprehension already walks every repo, so no outer loop is needed
dict_repos = {
    'Owner': [repo['owner']['login'] for repo in reposfin],
    'Name' : [repo['name'] for repo in reposfin],
    'URL'  : [repo['html_url'] for repo in reposfin],
    'Date' : [repo['created_at'] for repo in reposfin]
}
    


In [113]:
dict_repos

{'Owner': ['JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'groovyluki',
  'groovyluki',
  'Yuliana-GitHub',
  'Yuliana-GitHub',
  'nthang1',
  'sprintkayaking',
  'sprintkayaking',
  'sprintkayaking',
  'AshleighGrant',
  'AshleighGrant',
  'Nhudgell',
  'lisadt'],
 'Name': ['bitcoin',
  'cdc-dashboard',
  'covid-19',
  'DAT-01-21',
  'DAT-06-24',
  'DAT-07-28',
  'DAT-10-14',
  'data',
  'Data-Analysis',
  'easym

### Step 9:  Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

Use the `df.head()` method to confirm that you have something that's formatted appropriately.
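
A minimal sketch of this step on a made-up dictionary follows. It also converts the `Date` strings to timestamps with `pd.to_datetime`, which is an optional extra the lab does not ask for.

```python
import pandas as pd

# Hypothetical dictionary of equal-length lists; each key becomes a column.
columns = {
    "Owner": ["username1", "username2"],
    "Name": ["repo1", "repo2"],
    "URL": ["https://github.com/username1/repo1",
            "https://github.com/username2/repo2"],
    "Date": ["2020-05-02T19:57:48Z", "2016-11-02T14:39:37Z"],
}

df = pd.DataFrame(columns)

# Optional: parse the ISO 8601 strings so dates can be sorted and filtered.
df["Date"] = pd.to_datetime(df["Date"])

print(df.head())
```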

In [114]:
# your answer here
import pandas as pd

reposdf = pd.DataFrame(dict_repos)

In [115]:
reposdf.head()

|    | Owner | Name | URL | Date |
|---:|:------|:-----|:----|:-----|
| 0 | JonathanBechtel | bitcoin | https://github.com/JonathanBechtel/bitcoin | 2020-05-02T19:57:48Z |
| 1 | JonathanBechtel | cdc-dashboard | https://github.com/JonathanBechtel/cdc-dashboard | 2016-11-02T14:39:37Z |
| 2 | JonathanBechtel | covid-19 | https://github.com/JonathanBechtel/covid-19 | 2020-05-01T14:46:48Z |
| 3 | JonathanBechtel | DAT-01-21 | https://github.com/JonathanBechtel/DAT-01-21 | 2020-01-21T12:57:43Z |
| 4 | JonathanBechtel | DAT-06-24 | https://github.com/JonathanBechtel/DAT-06-24 | 2019-06-26T15:12:49Z |


In [123]:
reposdf.tail()

|    | Owner | Name | URL | Date |
|---:|:------|:-----|:----|:-----|
| 37 | sprintkayaking | GA-test | https://github.com/sprintkayaking/GA-test | 2020-07-26T20:42:11Z |
| 38 | AshleighGrant | DAT-07-28 | https://github.com/AshleighGrant/DAT-07-28 | 2020-07-28T19:10:55Z |
| 39 | AshleighGrant | DAT07-28-AG | https://github.com/AshleighGrant/DAT07-28-AG | 2020-07-28T19:12:16Z |
| 40 | Nhudgell | GADS | https://github.com/Nhudgell/GADS | 2020-07-28T19:11:39Z |
| 41 | lisadt | es_repo280720 | https://github.com/lisadt/es_repo280720 | 2020-07-28T19:11:31Z |


In [124]:
reposdf

|    | Owner | Name | URL | Date |
|---:|:------|:-----|:----|:-----|
| 0 | JonathanBechtel | bitcoin | https://github.com/JonathanBechtel/bitcoin | 2020-05-02T19:57:48Z |
| 1 | JonathanBechtel | cdc-dashboard | https://github.com/JonathanBechtel/cdc-dashboard | 2016-11-02T14:39:37Z |
| 2 | JonathanBechtel | covid-19 | https://github.com/JonathanBechtel/covid-19 | 2020-05-01T14:46:48Z |
| 3 | JonathanBechtel | DAT-01-21 | https://github.com/JonathanBechtel/DAT-01-21 | 2020-01-21T12:57:43Z |
| 4 | JonathanBechtel | DAT-06-24 | https://github.com/JonathanBechtel/DAT-06-24 | 2019-06-26T15:12:49Z |
| 5 | JonathanBechtel | DAT-07-28 | https://github.com/JonathanBechtel/DAT-07-28 | 2020-07-28T03:06:15Z |
| 6 | JonathanBechtel | DAT-10-14 | https://github.com/JonathanBechtel/DAT-10-14 | 2019-10-14T16:13:47Z |
| 7 | JonathanBechtel | data | https://github.com/JonathanBechtel/data | 2019-01-14T22:09:06Z |
| 8 | JonathanBechtel | Data-Analysis | https://github.com/JonathanBechtel/Data-Analysis | 2016-09-01T16:55:29Z |
| 9 | JonathanBechtel | easyml | https://github.com/JonathanBechtel/easyml | 2019-05-14T11:48:59Z |
