## Intermediate Lab:  Creating a Dataset Using the GitHub API

In this lab you'll create a dataset containing all the meta information about your classmates' GitHub repos, using only the csv file with everyone's URL.

The process breaks down into these 4 general steps:

 - load in the csv file
 - clean the individual lines of the file to get them ready to use
 - connect to the GitHub API to obtain information about everyone's repos
 - restructure this information to turn it into a dataframe.
 
It'll be a great way to work through the first step of many data science problems: creating a workable dataset out of unorganized, messy data.  Let's get started!

### Step 1:  Load in the csv file with everyone's GitHub repo

**Note:** There are a number of ways to do this, but the easiest way is usually this:

    with open('file.csv') as f:
        data = f.readlines()
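
As one alternative, the standard-library `csv` module splits each row into fields for you. The sketch below assumes a two-column file like the one in this lab and uses made-up sample rows held in memory instead of a real file, so the names and URLs are placeholders.

```python
import csv
import io

# Hypothetical sample standing in for the real csv file (an assumption for illustration).
sample_csv = (
    "Name,Repo\n"
    "Person 1,https://github.com/username1\n"
    "Person 2,https://github.com/username2\n"
)

# csv.reader accepts any iterable of lines, including an open file object.
reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)   # first row holds the column names
rows = list(reader)     # remaining rows, each already split into [name, url]

print(header)
print(rows)
```

With a real file you would pass the open file object `f` to `csv.reader` in place of the `io.StringIO` wrapper. The rest of this lab sticks with `readlines()` so you can practice the string cleaning by hand.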

In [2]:
import requests
import pandas as pd


In [3]:
cd 

/Users/lukielizalde


In [4]:
cd Downloads

/Users/lukielizalde/Downloads


In [5]:
with open('DAT-07-28 Github Repos - Sheet1.csv') as f:
    csv_data = f.readlines()

In [6]:
new_list=[]

for repo in csv_data:
    new_list.append(repo.replace('\n',''))

What you should have now is a list, and each item is a string that contains the comma separated values of each cell in the row of that csv file.  

It should generically look like this:

    ['Name,Repo\n',
     'Person 1,https://github.com/username1\n',
     'Person 2,https://github.com/username2\n',
     ......
    ]

Double check this is the case.

In [7]:
csv_data

['Name,Github Repo\n',
 'Jonathan Bechtel,https://github.com/JonathanBechtel\n',
 'Luki Elizalde,https://github.com/groovyluki\n',
 'iuliana trufas,https://github.com/Yuliana-GitHub\n',
 'Neraj Thangarajah,https://github.com/nthang1\n',
 'Alina Urs,https://github.com/sprintkayaking\n',
 'Ashleigh Grant,https://github.com/AshleighGrant\n',
 'Nick Hudgell ,Https://github.com/nhudgell/GADS\n',
 'Elisa Scagnetto,https://github.com/lisadt/es_repo280720']

In [8]:
new_list = [repo.replace('\n', '') for repo in csv_data]

In [9]:
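# just crop out the username from each string
# split() is a string method, so it has to be applied to each item, not to the list
[item.split('/')[-1] for item in new_list[0:5]]

['Name,Github Repo', 'JonathanBechtel', 'groovyluki', 'Yuliana-GitHub', 'nthang1']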

The only thing we need out of each item is the person's username, the part of the string that follows `https://github.com/`, as in `https://github.com/username_here`.  Everything else is junk.

We'll need to go through a few steps to get our info down to a usable format.  

### Step 2: Remove the `\n` from each item in the list, as well as the item that contains the header info.

When you're done you should have a list that looks like this:

    [
     'Person 1,https://github.com/username1',
     'Person 2,https://github.com/username2',
     ......
    ]

**hint:** The `replace()` method for strings is probably one of the more useful options that you have.  If you want to replace something with nothing, you can simply specify `""` for that part.
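
A minimal sketch of that hint, using made-up lines in the same shape as the output of `readlines()` (the names and URLs are placeholders, not the class data):

```python
# Hypothetical sample lines; the last one has no trailing newline, like the end of a file.
lines = [
    'Name,Repo\n',
    'Person 1,https://github.com/username1\n',
    'Person 2,https://github.com/username2',
]

# Slice off the header row, then replace each newline with an empty string.
cleaned = [line.replace('\n', '') for line in lines[1:]]

print(cleaned)
```

Slicing with `[1:]` skips the header before any cleaning happens, so the header never reaches the later steps.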

In [10]:
# the lines were loaded into csv_data; [1:] skips the header row
clean_data = [repo.replace('\n', "") for repo in csv_data[1:]]

In [11]:
# same result with an explicit loop: append one cleaned string per line
clean_data = []
for item in csv_data[1:]:
    clean_data.append(item.replace('\n', ''))

In [12]:
clean_data

['Jonathan Bechtel,https://github.com/JonathanBechtel',
 'Luki Elizalde,https://github.com/groovyluki',
 'iuliana trufas,https://github.com/Yuliana-GitHub',
 'Neraj Thangarajah,https://github.com/nthang1',
 'Alina Urs,https://github.com/sprintkayaking',
 'Ashleigh Grant,https://github.com/AshleighGrant',
 'Nick Hudgell ,Https://github.com/nhudgell/GADS',
 'Elisa Scagnetto,https://github.com/lisadt/es_repo280720']

### Step 3:  Separate the url in each string from everything else

When you're done you should have a new list that looks like this:

    [
     'https://github.com/username1',
     'https://github.com/username2',
     ...
    ]
     
**hint:** The `split()` method will help you out here.
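
As a quick illustration of how `split()` behaves on one row (the row below is a made-up placeholder):

```python
line = 'Person 1,https://github.com/username1'  # hypothetical sample row

# split(',') returns a list of the pieces on either side of each comma.
parts = line.split(',')
name, url = parts[0], parts[1]

print(parts)
print(url)
```

Indexing with `[1]` picks the second piece, which is the url in a two-column row like this one.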

In [13]:
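repo_url = [line.split(',')[1] for line in clean_data]

In [14]:
repo_url

['https://github.com/JonathanBechtel',
 'https://github.com/groovyluki',
 'https://github.com/Yuliana-GitHub',
 'https://github.com/nthang1',
 'https://github.com/sprintkayaking',
 'https://github.com/AshleighGrant',
 'Https://github.com/nhudgell/GADS',
 'https://github.com/lisadt/es_repo280720']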

### Step 4:  Separate the username in each string from everything else

When you're done you should have a new list that looks like this:

    [
     'username1',
     'username2',
     ...
    ]
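
Taking the last piece after splitting on `/` only works when the url ends at the username. If someone submitted a link to a specific repo, such as `https://github.com/username/some_repo`, the last piece is the repo name instead. One possible way to guard against that, sketched here with the standard-library `urlparse` function and a hypothetical helper named `extract_username`, is to take the first segment of the url path:

```python
from urllib.parse import urlparse

def extract_username(url):
    """Return the first path segment of a GitHub url (hypothetical helper)."""
    path = urlparse(url.strip()).path                   # e.g. '/username2/some_repo'
    segments = [part for part in path.split('/') if part]
    return segments[0] if segments else None

# Made-up urls: one profile link and one link that points at a repo.
urls = ['https://github.com/username1', 'Https://github.com/username2/some_repo']
usernames = [extract_username(url) for url in urls]

print(usernames)
```

This is only a sketch under the assumption that every url has the form `https://github.com/<username>/...`; anything else would need its own handling.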

In [15]:
username = [url.split('/')[-1] for url in repo_url]

In [16]:
username

['JonathanBechtel',
 'groovyluki',
 'Yuliana-GitHub',
 'nthang1',
 'sprintkayaking',
 'AshleighGrant',
 'GADS',
 'es_repo280720']

In [17]:
# the header row was already removed in step 2, so loop over every username
username2 = [requests.get(f'https://api.github.com/users/{user}/repos').json() for user in username]

In [18]:
# all the stuff from the individual repos
[[{},{}], [[{},{}]], [{}]]

[[{}, {}], [[{}, {}]], [{}]]

In [19]:
username2

In [20]:
user_repos = []

# loop over the API responses in username2, not the username strings
for user in username2:
    for repo in user:
        user_repos.append(repo)

In [21]:
user_repos[0]

### Step 5: Obtain the repo data for every single github username

The repository info for every single public account is available via the following url: `https://api.github.com/users/:the_username/repos`

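So basically, `requests.get('https://api.github.com/users/:the_username/repos').json()` will return a list of that user's public repos.  Note that the API paginates its results and returns 30 repos per page by default, so a user with more than 30 repos will only have the first 30 in the list.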

When you're done, you should have a *list of lists*, with each inner list containing one user's repos.  It'll look like this:

`[[{user1, repo1}, {user1, repo2}], [{user2, repo1}], [{user3, repo1}, {user3, repo2}, {user3, repo3}], ...]`

**Warning:** We're using the free, unauthenticated version of the API here.  That means we can only make 60 API calls per hour before getting throttled.  If we've used up our quota, the response will be a dictionary saying that we've exceeded our rate limit, or something similar.

If that's the case, try using your phone (or your neighbor's) as a hotspot and connect from there to get a new IP address.
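
Because a successful call gives back a list while a throttled or failed call gives back a dictionary (typically with a `message` key), you can tell the two apart by type. The sketch below uses made-up payloads in place of real API responses, and `split_responses` is a hypothetical helper name, not part of `requests` or the GitHub API.

```python
def split_responses(responses):
    """Separate successful repo lists from error payloads (hypothetical helper)."""
    repo_lists, errors = [], []
    for payload in responses:
        if isinstance(payload, list):
            repo_lists.append(payload)   # normal case: a list of repo dictionaries
        else:
            errors.append(payload)       # e.g. a dictionary with a 'message' key
    return repo_lists, errors

# Made-up payloads standing in for the parsed .json() results.
responses = [
    [{'name': 'repo1'}, {'name': 'repo2'}],
    {'message': 'API rate limit exceeded'},
]
good, bad = split_responses(responses)

print(len(good), 'successful,', len(bad), 'failed')
```

Checking the failed payloads before moving on tells you whether you were throttled or whether a username simply doesn't exist.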

In [22]:
githubusers_url = 'https://api.github.com'
# the list from step 4 is called username, so loop over that
repo_lists = [requests.get(f"{githubusers_url}/users/{user}/repos").json() for user in username]

In [23]:
username

In [24]:
repo_lists

### Step 6: Create a 'flat' list that contains every unique repo for every single user

When you're done you should have a list that looks like this: `[{user1, repo1}, {user1, repo2}, ..., {user n, repo m}]`

I.e., instead of a list filled with other lists that have dictionaries inside them, make it a single list with just dictionaries inside, with no nested levels like you had before.

So, go from this:

`[[{user1, repo1}, {user1, repo2}, {user1, repo3}], [{user2, repo1}, {user2, repo2}]]`
    
To this:

`[{user1, repo1}, {user1, repo2}, {user1, repo3}, {user2, repo1}, {user2, repo2}]`
    
If you have questions about what this entails, then please contact me ASAP.
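
For reference, here is one way to flatten a nested list, sketched with `itertools.chain.from_iterable` from the standard library. The nested list is a made-up stand-in for the API results, and a pair of nested `for` loops would do the same job.

```python
from itertools import chain

# Made-up nested list: one inner list of repo dictionaries per user.
nested = [
    [{'name': 'repo1'}, {'name': 'repo2'}],
    [{'name': 'repo3'}],
]

# chain.from_iterable walks each inner list in turn and yields its items one by one.
flat = list(chain.from_iterable(nested))

print(flat)
```

The order is preserved: all of the first user's repos come first, then the second user's, and so on.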

In [25]:
repos = []

# walk through each user's list and collect every repo into one flat list
for user in repo_lists:
    for repo in user:
        repos.append(repo)

In [26]:
repos

### Step 7:  Get information about the name, owner, url, and date of every single repo.

In the dictionary for each repo there are keys called `name`, `login`, `html_url`, and `created_at`.  These are going to populate the values for our different columns.

Values for each one of these keys will need to exist inside their own lists.

**hint:** Notice that the `login` key is nested inside a dictionary that's the value to the `owner` key at the outer level.
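
To make the nesting concrete, here is a made-up repo dictionary that contains only the four keys used in this step (a real response has many more fields):

```python
# Hypothetical repo dictionary, trimmed down to the keys this step needs.
repo = {
    'name': 'repo1',
    'owner': {'login': 'username1'},
    'html_url': 'https://github.com/username1/repo1',
    'created_at': '2020-01-01T00:00:00Z',
}

# 'login' lives one level down, so it takes two lookups to reach it.
owner_login = repo['owner']['login']

print(owner_login)
```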

In [27]:
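# keep only dictionaries: a failed request returns a dictionary instead of a list,
# so flattening it adds its keys (plain strings) to repos
repo_name = [repo['name'] for repo in repos if isinstance(repo, dict)]
user = [repo['owner']['login'] for repo in repos if isinstance(repo, dict)]
url = [repo['html_url'] for repo in repos if isinstance(repo, dict)]
date = [repo['created_at'] for repo in repos if isinstance(repo, dict)]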


### Step 8:  Create a dictionary with the data created from step 7

Your final output should look like this:

    {
       'Owner': [list with the login value for each repo],
       'Name' : [list with the name value for each repo],
       'URL'  : [list with the html_url value for each repo],
       'Date' : [list with the created_at value for each repo]
     }
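
pandas expects every list in a dictionary like this to have the same length, so a quick check before the next step can save some confusion. A minimal sketch, using made-up lists in place of the ones built in step 7:

```python
# Made-up lists standing in for the four lists from step 7.
user = ['username1', 'username1', 'username2']
repo_name = ['repo1', 'repo2', 'repo1']
url = [
    'https://github.com/username1/repo1',
    'https://github.com/username1/repo2',
    'https://github.com/username2/repo1',
]
date = ['2020-01-01T00:00:00Z', '2020-02-01T00:00:00Z', '2020-03-01T00:00:00Z']

data_dict = {'Owner': user, 'Name': repo_name, 'URL': url, 'Date': date}

# Count the items under each key and confirm that every count matches.
lengths = {key: len(values) for key, values in data_dict.items()}
same_length = len(set(lengths.values())) == 1

print(lengths, same_length)
```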

In [65]:
data_dict = {
    'Owner': user,
    'Name': repo_name,
    'URL': url,
    'Date': date
}

In [66]:
data_dict

{'Owner': ['JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'JonathanBechtel',
  'groovyluki',
  'groovyluki',
  'Yuliana-GitHub',
  'Yuliana-GitHub',
  'nthang1',
  'sprintkayaking',
  'sprintkayaking',
  'sprintkayaking',
  'AshleighGrant',
  'AshleighGrant'],
 'Name': ['bitcoin',
  'cdc-dashboard',
  'covid-19',
  'DAT-01-21',
  'DAT-06-24',
  'DAT-07-28',
  'DAT-10-14',
  'data',
  'Data-Analysis',
  'easyml',
  'formula-app',
  'fo

### Step 9:  Pass your dictionary into the `pd.DataFrame()` constructor to get your final dataset

Use the `df.head()` method to confirm that you have something that's formatted appropriately.

In [67]:
import pandas as pd

# this will take your dictionary and turn it into a dataframe
df = pd.DataFrame(data_dict)

In [68]:
df.head()

| | Owner | Name | URL | Date |
|---|---|---|---|---|
| 0 | JonathanBechtel | bitcoin | https://github.com/JonathanBechtel/bitcoin | 2020-05-02T19:57:48Z |
| 1 | JonathanBechtel | cdc-dashboard | https://github.com/JonathanBechtel/cdc-dashboard | 2016-11-02T14:39:37Z |
| 2 | JonathanBechtel | covid-19 | https://github.com/JonathanBechtel/covid-19 | 2020-05-01T14:46:48Z |
| 3 | JonathanBechtel | DAT-01-21 | https://github.com/JonathanBechtel/DAT-01-21 | 2020-01-21T12:57:43Z |
| 4 | JonathanBechtel | DAT-06-24 | https://github.com/JonathanBechtel/DAT-06-24 | 2019-06-26T15:12:49Z |
