# Data collection

In this notebook, I'll use the **GitHub API** to extract various information from my user profile such as repositories, commits and more. I'll also save this data to **.csv** files so that I can draw insights.

## Importing libraries and defining constants

I'll import various libraries needed for fetching the data.

In [1]:
import json

import numpy as np
import pandas as pd
import requests
from requests.auth import HTTPBasicAuth

I'll load the credentials from a JSON file (`credentials.json`) and create an `authentication` variable.

In [2]:
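# Read the username and password from credentials.json, closing the file afterwards
with open('credentials.json') as credentials_file:
    credentials = json.load(credentials_file)
authentication = HTTPBasicAuth(credentials['username'], credentials['password'])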
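
GitHub's REST API no longer accepts account passwords for Basic authentication, so today the `password` field would need to hold a personal access token. As a minimal sketch, and assuming a hypothetical `token` key in `credentials.json` (it is not part of the original file), a token can also be sent as a request header. The helper name `build_auth_headers` is likewise only illustrative.

```python
def build_auth_headers(token):
    """Return request headers that authenticate with a personal access token."""
    return {
        'Authorization': 'Bearer {}'.format(token),
        'Accept': 'application/vnd.github+json',
    }

# Hypothetical usage; the 'token' key is an assumption, not part of the original file:
# headers = build_auth_headers(credentials['token'])
# data = requests.get('https://api.github.com/users/kb22', headers=headers).json()
```

The headers can then be passed to `requests.get` through its `headers` argument in place of `auth`.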

## User information

I'll first extract the user information, such as the name and related URLs, which will be useful later.

In [3]:
data = requests.get('https://api.github.com/users/' + credentials['username'],
                    auth = authentication)
data = data.json()
data

{'login': 'kb22',
 'id': 14316277,
 'node_id': 'MDQ6VXNlcjE0MzE2Mjc3',
 'avatar_url': 'https://avatars3.githubusercontent.com/u/14316277?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/kb22',
 'html_url': 'https://github.com/kb22',
 'followers_url': 'https://api.github.com/users/kb22/followers',
 'following_url': 'https://api.github.com/users/kb22/following{/other_user}',
 'gists_url': 'https://api.github.com/users/kb22/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/kb22/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/kb22/subscriptions',
 'organizations_url': 'https://api.github.com/users/kb22/orgs',
 'repos_url': 'https://api.github.com/users/kb22/repos',
 'events_url': 'https://api.github.com/users/kb22/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/kb22/received_events',
 'type': 'User',
 'site_admin': False,
 'name': 'Karan Bhanot',
 'company': 'Cvent',
 'blog': 'https://medium.com/@bhanotkaran

From the JSON output above, I'll extract basic information such as `name`, `location`, `email`, `bio`, `public_repos` and `public_gists`. I'll also keep some of the URLs handy, including `repos_url`, `gists_url` and `blog`.

In [4]:
print("Information about user {}:\n".format(credentials['username']))
print("Name: {}".format(data['name']))
print("Email: {}".format(data['email']))
print("Location: {}".format(data['location']))
print("Public repos: {}".format(data['public_repos']))
print("Public gists: {}".format(data['public_gists']))
print("About: {}".format(data['bio']))

Information about user kb22:

Name: Karan Bhanot
Email: bhanotkaran22@gmail.com
Location: India
Public repos: 36
Public gists: 208
About: Working with Python, Machine Learning and Deep Learning to explore the field of Data Science and Machine Learning.
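
Some profile fields can be missing or empty depending on what a user has filled in, so direct indexing may raise a `KeyError`. The sketch below shows one way to pick the fields listed above defensively; `summarize_profile` is a hypothetical helper and the sample dict is a trimmed-down stand-in for `data`.

```python
def summarize_profile(profile, fields=('name', 'email', 'location',
                                       'public_repos', 'public_gists', 'bio')):
    """Pick the selected fields from a profile dict, using None for missing keys."""
    return {field: profile.get(field) for field in fields}

# A trimmed-down stand-in for the `data` dict returned by the API
sample = {'name': 'Karan Bhanot', 'location': 'India', 'public_repos': 36}
summary = summarize_profile(sample)
print(summary['name'], summary['email'])
```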


## Repositories

Next, I'll fetch repositories for the user.

In [5]:
REPOS_URL = data['repos_url']
repos_data = requests.get(REPOS_URL, auth = authentication)
repos_data = repos_data.json()
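
GitHub's list endpoints return 30 items per page by default, while the profile above reports 36 public repositories, so a single request may not return every repository. A minimal sketch of following the pagination links is shown below. `fetch_all_pages` is a hypothetical helper, and its `get` parameter is an assumption added only so the function can be tried with a stub instead of the network.

```python
def fetch_all_pages(url, auth=None, per_page=100, get=None):
    """Collect the items from every page of a paginated GitHub list endpoint."""
    if get is None:
        import requests  # imported lazily so a stub can be supplied instead
        get = requests.get
    items = []
    params = {'per_page': per_page}  # 100 is the largest page size GitHub allows
    while url:
        response = get(url, auth=auth, params=params)
        response.raise_for_status()
        items.extend(response.json())
        # requests parses the Link header into response.links
        url = response.links.get('next', {}).get('url')
        params = None  # the 'next' URL already carries its own query string
    return items

# Hypothetical usage with the names from this notebook:
# repos_data = fetch_all_pages(REPOS_URL, auth=authentication)
```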

I'll first look at a single repository to see which fields are worth keeping.

In [6]:
repos_data[0]

{'id': 154344656,
 'node_id': 'MDEwOlJlcG9zaXRvcnkxNTQzNDQ2NTY=',
 'name': 'Activity-Recognition-using-Machine-Learning',
 'full_name': 'kb22/Activity-Recognition-using-Machine-Learning',
 'private': False,
 'owner': {'login': 'kb22',
  'id': 14316277,
  'node_id': 'MDQ6VXNlcjE0MzE2Mjc3',
  'avatar_url': 'https://avatars3.githubusercontent.com/u/14316277?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/kb22',
  'html_url': 'https://github.com/kb22',
  'followers_url': 'https://api.github.com/users/kb22/followers',
  'following_url': 'https://api.github.com/users/kb22/following{/other_user}',
  'gists_url': 'https://api.github.com/users/kb22/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/kb22/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/kb22/subscriptions',
  'organizations_url': 'https://api.github.com/users/kb22/orgs',
  'repos_url': 'https://api.github.com/users/kb22/repos',
  'events_url': 'https://api.github.com/us

There are a number of fields worth tracking here. I'll select the following:
1. id: Unique id for the repository.
2. name: The name of the repository.
3. description: The description of the repository.
4. created_at: The time and date when the repository was first created.
5. updated_at: The time and date when the repository was last updated.
6. login: Username of the owner of the repository.
7. license: The license type (if any).
8. has_wiki: A boolean that signifies if the repository has a wiki document.
9. forks_count: Total forks of the repository.
10. open_issues_count: Total open issues in the repository.
11. stargazers_count: The total stars on the repository.
12. watchers_count: Total users watching the repository.

I'll also keep track of some URLs for further analysis, including:
1. url: The url of the repository.
2. commits_url: The url for all commits in the repository.
3. languages_url: The url for all languages in the repository.

For the commits URL, I'll remove the trailing template placeholder inside the braces.

In [7]:
repos_information = []
for repo in repos_data:
    # Use a separate name so the user profile stored in `data` is not overwritten
    repo_info = [
        repo['id'],
        repo['name'],
        repo['description'],
        repo['created_at'],
        repo['updated_at'],
        repo['owner']['login'],
        # The license is None when the repository has no license file
        repo['license']['name'] if repo['license'] is not None else None,
        repo['has_wiki'],
        repo['forks_count'],
        repo['open_issues_count'],
        repo['stargazers_count'],
        repo['watchers_count'],
        repo['url'],
        # Drop the trailing template placeholder in braces
        repo['commits_url'].split("{")[0],
        repo['languages_url'],
    ]
    repos_information.append(repo_info)
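
GitHub returns several URLs as templates. For example, `following_url` ends in `{/other_user}` in the user output above. The short sketch below isolates the stripping step on a single value. The `{/sha}` suffix in the sample is the form GitHub uses for `commits_url`, and the helper name `strip_url_template` is hypothetical.

```python
def strip_url_template(url):
    """Drop everything from the first '{' onwards in a templated API URL."""
    return url.split('{')[0]

commits_url = 'https://api.github.com/repos/kb22/Animation-in-C/commits{/sha}'
print(strip_url_template(commits_url))
```

A URL without a placeholder is returned unchanged, so the helper is safe to apply to every URL column.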

In [8]:
repos_df = pd.DataFrame(repos_information, columns = ['Id', 'Name', 'Description', 'Created on', 'Updated on', 
                                                      'Owner', 'License', 'Includes wiki', 'Forks count', 
                                                      'Issues count', 'Stars count', 'Watchers count',
                                                      'Repo URL', 'Commits URL', 'Languages URL'])
repos_df.head(10)

| | Id | Name | Description | Created on | Updated on | Owner | License | Includes wiki | Forks count | Issues count | Stars count | Watchers count | Repo URL | Commits URL | Languages URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 154344656 | Activity-Recognition-using-Machine-Learning | Use machine learning to classify various activ... | 2018-10-23T14:40:38Z | 2019-04-30T08:00:17Z | kb22 | MIT License | True | 3 | 0 | 1 | 1 | https://api.github.com/repos/kb22/Activity-Rec... | https://api.github.com/repos/kb22/Activity-Rec... | https://api.github.com/repos/kb22/Activity-Rec... |
| 1 | 98045107 | Animation-in-C | A frame by frame Animation designed in C Langu... | 2017-07-22T16:52:58Z | 2017-07-22T17:06:23Z | kb22 | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/kb22/Animation-in-C | https://api.github.com/repos/kb22/Animation-in... | https://api.github.com/repos/kb22/Animation-in... |
| 2 | 175083480 | Article-Recommender | Using LDA, the project recommends Wikipedia ar... | 2019-03-11T21:04:20Z | 2019-06-17T12:54:23Z | kb22 | MIT License | True | 2 | 0 | 3 | 3 | https://api.github.com/repos/kb22/Article-Reco... | https://api.github.com/repos/kb22/Article-Reco... | https://api.github.com/repos/kb22/Article-Reco... |
| 3 | 104722536 | Attendance-Marker | A QR Code based attendance tracking application. | 2017-09-25T08:20:58Z | 2017-09-25T08:22:21Z | kb22 | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/kb22/Attendance-M... | https://api.github.com/repos/kb22/Attendance-M... | https://api.github.com/repos/kb22/Attendance-M... |
| 4 | 161978072 | Color-Identification-using-Machine-Learning | This project uses Machine Learning to extract ... | 2018-12-16T07:27:38Z | 2019-06-19T08:26:32Z | kb22 | MIT License | True | 3 | 0 | 11 | 11 | https://api.github.com/repos/kb22/Color-Identi... | https://api.github.com/repos/kb22/Color-Identi... | https://api.github.com/repos/kb22/Color-Identi... |
| 5 | 98046895 | Computer-Simulation-using-Proteus | Simulated the functioning of a basic computer ... | 2017-07-22T17:23:04Z | 2017-07-22T17:23:04Z | kb22 | | True | 0 | 0 | 0 | 0 | https://api.github.com/repos/kb22/Computer-Sim... | https://api.github.com/repos/kb22/Computer-Sim... | https://api.github.com/repos/kb22/Computer-Sim... |
| 6 | 186462080 | Coursera_Capstone | The repository will include code for the Capst... | 2019-05-13T17:00:15Z | 2019-06-17T08:10:53Z | kb22 | | True | 1 | 0 | 0 | 0 | https://api.github.com/repos/kb22/Coursera_Cap... | https://api.github.com/repos/kb22/Coursera_Cap... | https://api.github.com/repos/kb22/Coursera_Cap... |
| 7 | 149577869 | Create-dataset-using-API | A Python notebook to create a dataset through ... | 2018-09-20T08:36:38Z | 2019-06-20T19:07:44Z | kb22 | | True | 7 | 0 | 3 | 3 | https://api.github.com/repos/kb22/Create-datas... | https://api.github.com/repos/kb22/Create-datas... | https://api.github.com/repos/kb22/Create-datas... |
| 8 | 173539133 | Create-Face-Data-from-Images | Using OpenCV Face Detection Neural Network to ... | 2019-03-03T06:24:15Z | 2019-06-09T12:34:33Z | kb22 | | True | 4 | 0 | 5 | 5 | https://api.github.com/repos/kb22/Create-Face-... | https://api.github.com/repos/kb22/Create-Face-... | https://api.github.com/repos/kb22/Create-Face-... |
| 9 | 183209160 | Deep-Learning-A-Z-Coursework | The project includes all the files for the Dee... | 2019-04-24T10:48:09Z | 2019-06-05T21:49:42Z | kb22 | GNU General Public License v3.0 | True | 0 | 0 | 2 | 2 | https://api.github.com/repos/kb22/Deep-Learnin... | https://api.github.com/repos/kb22/Deep-Learnin... | https://api.github.com/repos/kb22/Deep-Learnin... |


## Languages

To get the languages used in each repository, I'll iterate through every repo's `Languages URL` and fetch the corresponding data. I'll also store the result back in the dataframe.

In [9]:
for i in range(repos_df.shape[0]):
    response = requests.get(repos_df.loc[i, 'Languages URL'], auth = authentication)
    response = response.json()
    print(i, response)
    # The response maps each language to its size in bytes; keep only the names.
    # An empty response yields an empty string.
    repos_df.loc[i, 'Languages'] = ', '.join(response.keys())

0 {'Jupyter Notebook': 2080768}
1 {'C': 196326}
2 {'Python': 8506}
3 {'Java': 17850}
4 {'Jupyter Notebook': 4152085}
5 {}
6 {'Jupyter Notebook': 1789648, 'HTML': 427681}
7 {'Jupyter Notebook': 20657}
8 {'Python': 2871}
9 {'Jupyter Notebook': 295254}
10 {'Jupyter Notebook': 85404}
11 {'Python': 3034}
12 {'Jupyter Notebook': 47204}
13 {'Jupyter Notebook': 261173}
14 {'Jupyter Notebook': 39932}
15 {'Java': 67597, 'PHP': 6386}
16 {'Jupyter Notebook': 519792}
17 {'HTML': 5632, 'JavaScript': 3664, 'CSS': 2389}
18 {'Jupyter Notebook': 1089017}
19 {'Jupyter Notebook': 119427}
20 {'Jupyter Notebook': 299710}
21 {'JavaScript': 21089, 'Python': 4965, 'HTML': 3208, 'CSS': 2327}
22 {'PHP': 28283, 'CSS': 20740}
23 {'HTML': 72173955, 'Jupyter Notebook': 185823}
24 {'Jupyter Notebook': 147471}
25 {'C++': 3726}
26 {'Java': 24850}
27 {'Java': 17705}
28 {'Jupyter Notebook': 333190, 'Python': 6086}
29 {'Python': 2045}
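
The numbers in each response are the bytes of code written in that language, which the loop above discards. If the relative weight of each language is of interest, a small helper can turn the byte counts into percentages. This is only a sketch, and `language_shares` is a hypothetical name.

```python
def language_shares(byte_counts):
    """Convert a {language: bytes} mapping into {language: percentage}."""
    total = sum(byte_counts.values())
    if total == 0:
        return {}  # covers repositories with no detected languages
    return {lang: round(100 * size / total, 1) for lang, size in byte_counts.items()}

# Byte counts taken from one of the responses printed above
print(language_shares({'Java': 67597, 'PHP': 6386}))
```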


I'll save this data to a .csv file called **repos_info.csv**.

In [10]:
repos_df.to_csv('repos_info.csv', index = False)

## Commits

I'll now create a dataset of the commits made to each repository so far, starting with a look at the response for the first repository.

In [11]:
response = requests.get(repos_df.loc[0, 'Commits URL'], auth = authentication)
response.json()

[{'sha': '62bf829df5045ed10787551b46a1ae1cc103e1d2',
  'node_id': 'MDY6Q29tbWl0MTU0MzQ0NjU2OjYyYmY4MjlkZjUwNDVlZDEwNzg3NTUxYjQ2YTFhZTFjYzEwM2UxZDI=',
  'commit': {'author': {'name': 'Karan Bhanot',
    'email': 'bhanotkaran22@gmail.com',
    'date': '2018-10-29T13:10:26Z'},
   'committer': {'name': 'GitHub',
    'email': 'noreply@github.com',
    'date': '2018-10-29T13:10:26Z'},
   'message': 'Update README.md',
   'tree': {'sha': '24e087e5aab5ccdf19451665672ae2a4930c6ceb',
    'url': 'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/git/trees/24e087e5aab5ccdf19451665672ae2a4930c6ceb'},
   'url': 'https://api.github.com/repos/kb22/Activity-Recognition-using-Machine-Learning/git/commits/62bf829df5045ed10787551b46a1ae1cc103e1d2',
   'comment_count': 0,
   'verification': {'verified': True,
    'reason': 'valid',
    'signature': '-----BEGIN PGP SIGNATURE-----\n\nwsBcBAABCAAQBQJb1wbCCRBK7hj4Ov3rIwAAdHIIADt3HJ0NU5/3fg4MSLwuWT3o\nDjR1aUs3xrbHu67xUGLje1etxzJ9+hUa

I'll save the id (the commit SHA), the date and the message of each commit.

In [12]:
commits_information = []
for i in range(repos_df.shape[0]):
    response = requests.get(repos_df.loc[i, 'Commits URL'], auth = authentication)
    response = response.json()
    # Skip repositories whose response is an error object rather than a list of commits
    if not isinstance(response, list):
        continue
    for commit in response:
        # Keep the commit SHA, the committer date and the commit message
        commits_information.append([
            commit['sha'],
            commit['commit']['committer']['date'],
            commit['commit']['message'],
        ])
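
The records built in this loop do not say which repository a commit belongs to, which can make per-repository analysis harder later on. One possible extension, sketched here under the assumption that the repository name is passed in from `repos_df`, is to flatten each commit into a dict that also carries that name. `flatten_commit` and the `Repo` key are hypothetical.

```python
def flatten_commit(commit, repo_name):
    """Return the fields kept for one commit, plus the repository it belongs to."""
    return {
        'Repo': repo_name,
        'Id': commit['sha'],
        'Date': commit['commit']['committer']['date'],
        'Message': commit['commit']['message'],
    }

# A trimmed-down commit object in the shape returned by the commits endpoint
sample_commit = {
    'sha': '62bf829df5045ed10787551b46a1ae1cc103e1d2',
    'commit': {
        'committer': {'name': 'GitHub', 'date': '2018-10-29T13:10:26Z'},
        'message': 'Update README.md',
    },
}
row = flatten_commit(sample_commit, 'Activity-Recognition-using-Machine-Learning')
print(row)
```

A list of such dicts can be passed straight to `pd.DataFrame`, which uses the keys as column names.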

In [13]:
commits_df = pd.DataFrame(commits_information, columns = ['Id', 'Date', 'Message'])
commits_df.head(5)

| | Id | Date | Message |
|---|---|---|---|
| 0 | 62bf829df5045ed10787551b46a1ae1cc103e1d2 | 2018-10-29T13:10:26Z | Update README.md |
| 1 | a49bf9574b3f4170a5baf27ef8664e9a053ccaed | 2018-10-27T17:11:37Z | Remove unused imports |
| 2 | c13241c5428628f63d44dd1e7a1e568f930dd05b | 2018-10-27T15:31:59Z | Use test.csv for testing |
| 3 | 050ef708c4334382d06fd1dc8061ed14d58da6b4 | 2018-10-25T10:50:27Z | Use machine learning algorithms |
| 4 | 48593306b25ad2456fc55131a73e78a1405e62db | 2018-10-24T16:06:06Z | Add visualizations of the dataset |


I'll save this data to a .csv file called **commits_info.csv**.

In [14]:
commits_df.to_csv('commits_info.csv', index = False)