# Data collection

In this notebook, I'll use the **GitHub API** to extract various information from my user profile such as repositories, commits and more. I'll also save this data to **.csv** files so that I can draw insights.

## Importing libraries and defining constants

I'll import various libraries needed for fetching the data.

In [10]:
import json
import requests
import numpy as np
import pandas as pd

I'd like the username to be easy to change, so I'll store it in a constant.

In [3]:
USERNAME = 'kb22'

## User information

I'll first extract user information such as the name and related URLs, which will be useful later.

In [138]:
data = requests.get(url = 'https://api.github.com/users/' + USERNAME)
data = data.json()
data

{u'documentation_url': u'https://developer.github.com/v3/#rate-limiting',
 u'message': u"API rate limit exceeded for 103.21.19.10. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}

From the JSON response, I'll extract basic information such as `bio`, `name`, `public_repos` and `public_gists`. I'll also keep some of the URLs handy, including `repos_url`, `gists_url` and `blog`.

In [51]:
print("Information about user {}:\n".format(USERNAME))
print("Name: {}".format(data['name']))
print("About: {}".format(data['bio']))
print("Repos: {}".format(data['public_repos']))
print("Gists: {}".format(data['public_gists']))

Information about user kb22:

Name: Karan Bhanot
About: Working with Python, Machine Learning and Deep Learning to explore the field of Data Science and Machine Learning.
Repos: 36
Gists: 208
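
The lookups above raise a `KeyError` if the API returns an error payload (such as a rate-limit message) instead of a profile. Below is a minimal sketch of a guard, assuming error payloads carry a `message` key and no `login` key. The helper name `summarize_user` is hypothetical and not part of the original notebook.

```python
def summarize_user(payload):
    """Pick the profile fields used in this notebook from a /users/<name> response."""
    # Assumption: error payloads contain 'message' but no 'login'
    if 'message' in payload and 'login' not in payload:
        raise ValueError("GitHub API error: {}".format(payload['message']))
    fields = ['name', 'bio', 'public_repos', 'public_gists']
    # .get() returns None for any field the profile does not include
    return {field: payload.get(field) for field in fields}
```

Calling `summarize_user(data)` before printing would surface the API's own error message rather than a bare `KeyError`.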


## Repositories

Next, I'll fetch repositories for the user.

In [137]:
REPOS_URL = data['repos_url']
repos_data = requests.get(REPOS_URL)
repos_data = repos_data.json()

TypeError: list indices must be integers, not str
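
The repositories endpoint is paginated: it returns 30 items per page by default and at most 100 when `per_page` is set, so a single request may miss some of the 36 repositories reported above. The sketch below shows one way to walk the pages. The helper names `fetch_all_pages` and `github_page_getter` are hypothetical, and the loop assumes a page shorter than `per_page` marks the end of the results.

```python
def fetch_all_pages(get_page, per_page=100):
    """Collect items from a paginated endpoint.

    get_page(page, per_page) must return the decoded JSON list for that page.
    """
    items = []
    page = 1
    while True:
        batch = get_page(page, per_page)
        if not batch:
            break
        items.extend(batch)
        # A short page means there is nothing left to fetch
        if len(batch) < per_page:
            break
        page += 1
    return items


def github_page_getter(url):
    """Build a get_page callable for a GitHub list endpoint (hypothetical helper)."""
    import requests  # imported lazily so fetch_all_pages has no dependencies

    def get_page(page, per_page):
        response = requests.get(url, params={'per_page': per_page, 'page': page})
        response.raise_for_status()
        return response.json()

    return get_page
```

With these helpers, `repos_data = fetch_all_pages(github_page_getter(REPOS_URL))` would gather every page into one list.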

I'll first explore a single repository to see what information is available.

In [52]:
repos_data[0]

{u'archive_url': u'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/{archive_format}{/ref}',
 u'archived': False,
 u'assignees_url': u'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/assignees{/user}',
 u'blobs_url': u'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/git/blobs{/sha}',
 u'branches_url': u'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/branches{/branch}',
 u'clone_url': u'https://github.com/kb22/Activity-Recognition-using-Machine-Learning.git',
 u'collaborators_url': u'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/collaborators{/collaborator}',
 u'comments_url': u'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/comments{/number}',
 u'commits_url': u'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/commits{/sha}',
 u'compare_url': u'https://api.github.com/repos/kb22

There are a number of things that we can keep track of here. I'll select the following:
1. id: Unique id for the repository.
2. forks_count: Total forks of the repository.
3. has_wiki: A boolean that signifies if the repository has a wiki document.
4. license: The license type (if any).
5. name: The name of the repository.
6. open_issues_count: Total open issues in the repository.
7. watchers_count: Total users watching the repository.
8. created_at: The time and date when the repository was first created.
9. description: The description of the repository.
10. updated_at: The time and date when the repository was last updated.
11. login: Username of the owner of the repository.
12. stargazers_count: The total stars on the repository.

I'll also keep track of some urls for further analysis including:
1. url: The url of the repository.
2. tags_url: Tags for a given repository.
3. languages_url: All languages in the repository.
4. commits_url: List of all commits in the repository.
5. branches_url: Branches in the repository.

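For the commits and branches URLs, I'll remove the trailing placeholder inside the braces (such as `{/sha}` or `{/branch}`). The languages URL has no placeholder, so I'll keep it as it is. I'll also build a topics URL for each repository.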
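
As a small illustration of the placeholder removal described here, the sketch below cuts a URL at its first opening brace. The helper name `strip_url_template` is hypothetical. It assumes the placeholders always sit at the end of the URL, as they do in the output shown earlier.

```python
def strip_url_template(url):
    """Drop everything from the first '{' onwards, leaving the plain URL."""
    return url.split('{')[0]


commits_url = 'https://api.github.com/repos/kb22/Animation-in-C/commits{/sha}'
print(strip_url_template(commits_url))
```

URLs without a placeholder pass through unchanged, so the helper is safe to apply to every URL column.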

In [127]:
repos_information = []
for repo in repos_data:
    # Use `row` rather than `data` so the user-profile dict fetched earlier is not overwritten
    row = []
    # The API already returns text strings, so no .encode("utf-8") calls are needed
    row.append(repo['id'])
    row.append(repo['name'])
    row.append(repo['description'] if repo['description'] is not None else "")
    row.append(repo['created_at'])
    row.append(repo['updated_at'])
    row.append(repo['owner']['login'])
    row.append(repo['license']['name'] if repo['license'] is not None else None)
    row.append(repo['has_wiki'])
    row.append(repo['forks_count'])
    row.append(repo['open_issues_count'])
    row.append(repo['stargazers_count'])
    row.append(repo['watchers_count'])
    row.append(repo['url'])
    row.append(repo['languages_url'])
    # Strip the trailing placeholder (e.g. "{/sha}") from the templated URLs
    row.append(repo['commits_url'].split("{")[0])
    row.append(repo['branches_url'].split("{")[0])
    row.append('https://api.github.com/repos/' + USERNAME + '/' + repo['name'] + '/topics')
    repos_information.append(row)

In [132]:
repos_df = pd.DataFrame(repos_information, columns = ['Id', 'Name', 'Description', 'Created on', 'Updated on', 
                                                      'Owner', 'License', 'Includes wiki', 'Forks count', 
                                                      'Issues count', 'Stars count', 'Watchers count', 'Repo URL', 
                                                      'Languages URL', 'Commits URL', 'Branches URL', 'Topics URL'])
repos_df.head(5)

Unnamed: 0,Id,Name,Description,Created on,Updated on,Owner,License,Includes wiki,Forks count,Issues count,Stars count,Watchers count,Repo URL,Languages URL,Commits URL,Branches URL,Topics URL
0,154344656,Activity-Recognition-using-Machine-Learning,Use machine learning to classify various activ...,2018-10-23T14:40:38Z,2019-04-30T08:00:17Z,kb22,MIT License,True,3,0,1,1,https://api.github.com/repos/kb22/Activity-Rec...,https://api.github.com/repos/kb22/Activity-Rec...,https://api.github.com/repos/kb22/Activity-Rec...,https://api.github.com/repos/kb22/Activity-Rec...,https://api.github.com/repos/kb22/Activity-Rec...
1,98045107,Animation-in-C,A frame by frame Animation designed in C Langu...,2017-07-22T16:52:58Z,2017-07-22T17:06:23Z,kb22,,True,0,0,0,0,https://api.github.com/repos/kb22/Animation-in-C,https://api.github.com/repos/kb22/Animation-in...,https://api.github.com/repos/kb22/Animation-in...,https://api.github.com/repos/kb22/Animation-in...,https://api.github.com/repos/kb22/Animation-in...
2,175083480,Article-Recommender,"Using LDA, the project recommends Wikipedia ar...",2019-03-11T21:04:20Z,2019-06-17T12:54:23Z,kb22,MIT License,True,2,0,3,3,https://api.github.com/repos/kb22/Article-Reco...,https://api.github.com/repos/kb22/Article-Reco...,https://api.github.com/repos/kb22/Article-Reco...,https://api.github.com/repos/kb22/Article-Reco...,https://api.github.com/repos/kb22/Article-Reco...
3,104722536,Attendance-Marker,A QR Code based attendance tracking application.,2017-09-25T08:20:58Z,2017-09-25T08:22:21Z,kb22,,True,0,0,0,0,https://api.github.com/repos/kb22/Attendance-M...,https://api.github.com/repos/kb22/Attendance-M...,https://api.github.com/repos/kb22/Attendance-M...,https://api.github.com/repos/kb22/Attendance-M...,https://api.github.com/repos/kb22/Attendance-M...
4,161978072,Color-Identification-using-Machine-Learning,This project uses Machine Learning to extract ...,2018-12-16T07:27:38Z,2019-06-19T08:26:32Z,kb22,MIT License,True,3,0,11,11,https://api.github.com/repos/kb22/Color-Identi...,https://api.github.com/repos/kb22/Color-Identi...,https://api.github.com/repos/kb22/Color-Identi...,https://api.github.com/repos/kb22/Color-Identi...,https://api.github.com/repos/kb22/Color-Identi...


I'll save this data to a .csv file called **repos_info.csv**.

In [133]:
repos_df.to_csv('repos_info.csv', index = False)

## Topics

For topics of each repository, I'll iterate through all repos' `Topics URL` and get the corresponding data.

In [142]:
for i in range(repos_df.shape[0]):
    response = requests.get(repos_df.loc[i, 'Topics URL'], 
                     headers = {"Accept": "application/vnd.github.mercy-preview+json",
                                "content-type":"application/json"})
    response = response.json()
    # Error payloads carry a 'message' key; successful ones carry 'names'
    if 'message' not in response:
        topics = response['names']
        topics = ', '.join(topics)
        print(topics)
    else:
        print("Could not fetch data due to: {}".format(response['message']))

Could not fetch data due to: API rate limit exceeded for 103.21.19.10. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)
Could not fetch data due to: API rate limit exceeded for 103.21.19.10. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)
Could not fetch data due to: API rate limit exceeded for 103.21.19.10. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)
Could not fetch data due to: API rate limit exceeded for 103.21.19.10. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)
Could not fetch data due to: API rate limit exceeded for 103.21.19.10. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)
Could not fetch data due to: API ra

KeyboardInterrupt: 
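
The rate-limit errors above come from unauthenticated requests, and the message itself points out that authenticated requests get a higher limit. Here is a minimal sketch of building authenticated headers, assuming a personal access token is available. The helper name `build_github_headers` and the environment variable `GITHUB_TOKEN` are assumptions made for this example and are not part of the original notebook.

```python
import os


def build_github_headers(token=None):
    """Return request headers, adding Authorization when a token is available."""
    # GITHUB_TOKEN is a hypothetical environment variable name chosen for this sketch
    token = token or os.environ.get('GITHUB_TOKEN')
    headers = {'Accept': 'application/vnd.github+json'}
    if token:
        headers['Authorization'] = 'token ' + token
    return headers
```

The result can be passed to any request in this notebook, for example `requests.get(url, headers=build_github_headers())`. Reading the token from the environment keeps it out of the saved notebook.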

'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/labels'