
# GH Archive examples

GH Archive is a project that records the public GitHub timeline, archives it, and makes it easily accessible for further analysis. It captures the events exposed by the GitHub Events API, aggregates them into hourly archives, and also stores them in Google BigQuery.

## Import packages

In [0]:
import json
import gzip
import pandas as pd
from datetime import datetime
import os

## Get archives

To download all 24 hourly archives for 2020-04-20, we can use the `wget` command with the shell brace expansion `{0..23}`:

In [2]:
!wget https://data.gharchive.org/2020-04-20-{0..23}.json.gz

--2020-04-29 15:01:18--  https://data.gharchive.org/2020-04-20-0.json.gz
Resolving data.gharchive.org (data.gharchive.org)... 104.27.165.156, 104.27.164.156, 2606:4700:3031::681b:a59c, ...
Connecting to data.gharchive.org (data.gharchive.org)|104.27.165.156|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31934128 (30M) [application/gzip]
Saving to: ‘2020-04-20-0.json.gz’


2020-04-29 15:01:19 (41.2 MB/s) - ‘2020-04-20-0.json.gz’ saved [31934128/31934128]

--2020-04-29 15:01:19--  https://data.gharchive.org/2020-04-20-1.json.gz
Reusing existing connection to data.gharchive.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 35050565 (33M) [application/gzip]
Saving to: ‘2020-04-20-1.json.gz’


2020-04-29 15:01:20 (37.4 MB/s) - ‘2020-04-20-1.json.gz’ saved [35050565/35050565]

--2020-04-29 15:01:20--  https://data.gharchive.org/2020-04-20-2.json.gz
Reusing existing connection to data.gharchive.org:443.
HTTP request sent, awaiting response... 200 OK
[output truncated]
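The brace expansion in the `wget` command relies on the shell. As a minimal sketch, the same list of URLs can be built in Python; the helper name `hourly_archive_urls` is hypothetical, and the URL pattern (hours without zero padding) is taken from the command above.

```python
from datetime import date

BASE_URL = "https://data.gharchive.org"  # same host as in the wget command

def hourly_archive_urls(day):
    """Return the 24 hourly archive URLs for one day (hour is not zero-padded)."""
    return [f"{BASE_URL}/{day:%Y-%m-%d}-{hour}.json.gz" for hour in range(24)]

urls = hourly_archive_urls(date(2020, 4, 20))
print(urls[0])   # first hour of the day
print(urls[-1])  # last hour of the day
```

The resulting list could then be passed to any download tool; this sketch only builds the URLs and does not fetch anything.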

Then we sort the archives chronologically, using the date and hour encoded in each filename:

In [3]:
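def extract_date(filename):
    # '2020-04-20-7.json.gz' -> datetime(2020, 4, 20, 7)
    return datetime.strptime(filename.split('.')[0], '%Y-%m-%d-%H')

# Sort by the parsed date and hour rather than by the raw filename string.
data = [f for f in os.listdir() if f.endswith(".json.gz")]
data_sorted = sorted(data, key=extract_date)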

data_sorted

['2020-04-20-0.json.gz',
 '2020-04-20-1.json.gz',
 '2020-04-20-2.json.gz',
 '2020-04-20-3.json.gz',
 '2020-04-20-4.json.gz',
 '2020-04-20-5.json.gz',
 '2020-04-20-6.json.gz',
 '2020-04-20-7.json.gz',
 '2020-04-20-8.json.gz',
 '2020-04-20-9.json.gz',
 '2020-04-20-10.json.gz',
 '2020-04-20-11.json.gz',
 '2020-04-20-12.json.gz',
 '2020-04-20-13.json.gz',
 '2020-04-20-14.json.gz',
 '2020-04-20-15.json.gz',
 '2020-04-20-16.json.gz',
 '2020-04-20-17.json.gz',
 '2020-04-20-18.json.gz',
 '2020-04-20-19.json.gz',
 '2020-04-20-20.json.gz',
 '2020-04-20-21.json.gz',
 '2020-04-20-22.json.gz',
 '2020-04-20-23.json.gz']
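Parsing the date matters because the hour in the filename is not zero-padded, so a plain string sort would not be chronological. The following sketch compares both orderings on three filenames; the `names` list is made up for illustration and follows the naming pattern shown above.

```python
from datetime import datetime

# Illustrative filenames following the archive naming pattern.
names = ["2020-04-20-10.json.gz", "2020-04-20-2.json.gz", "2020-04-20-9.json.gz"]

def extract_date(filename):
    return datetime.strptime(filename.split(".")[0], "%Y-%m-%d-%H")

plain = sorted(names)                      # lexicographic: '10' sorts before '2'
by_date = sorted(names, key=extract_date)  # chronological: 2, 9, 10

print(plain)
print(by_date)
```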

## List GitHub events from an archive

In [0]:
data_archive = data_sorted[0]

In [0]:
# Each line of the archive is one JSON-encoded event; collect its type.
types = []
with gzip.open(data_archive) as f:
    for line in f:
        json_data = json.loads(line)
        types.append(json_data['type'])

In [6]:
pd.DataFrame({"types": types}).groupby("types").size().sort_values(ascending=False)

types
PushEvent                        42278
CreateEvent                       9636
PullRequestEvent                  5696
WatchEvent                        5041
IssueCommentEvent                 3556
DeleteEvent                       2204
IssuesEvent                       1988
ForkEvent                         1667
PullRequestReviewCommentEvent      794
ReleaseEvent                       313
GollumEvent                        261
PublicEvent                        260
MemberEvent                        206
CommitCommentEvent                 187
dtype: int64
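The count above covers a single hourly archive. As a rough sketch, the same tally can be streamed over several archives with `collections.Counter`, without keeping the full list of types in memory. The function name `count_event_types` is hypothetical, and the tiny archive written to a temporary directory is synthetic data that only mimics the one-event-per-line layout.

```python
import gzip
import json
import os
import tempfile
from collections import Counter

def count_event_types(paths):
    """Tally the 'type' field of every event across the given .json.gz files."""
    counts = Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as archive:
            for line in archive:
                counts[json.loads(line)["type"]] += 1
    return counts

# Synthetic archive for illustration only (not real GH Archive data).
sample_events = [{"type": "PushEvent"}, {"type": "PushEvent"}, {"type": "WatchEvent"}]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for event in sample_events:
        f.write(json.dumps(event) + "\n")

counts = count_event_types([sample_path])
print(counts.most_common())
```

With the real data, passing `data_sorted` instead of the sample path would presumably count events for the whole day.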

## Working with PushEvent

Let's print the structure of a PushEvent JSON object:

In [0]:
# Keep only the events whose type is PushEvent.
push_events = []
with gzip.open(data_archive) as archive:
    for line in archive:
        json_data = json.loads(line)
        if json_data['type'] == 'PushEvent':
            push_events.append(json_data)

In [8]:
print(json.dumps(push_events[0], indent=4))

{
    "id": "12094461500",
    "type": "PushEvent",
    "actor": {
        "id": 43556190,
        "login": "paulpatault",
        "display_login": "paulpatault",
        "gravatar_id": "",
        "url": "https://api.github.com/users/paulpatault",
        "avatar_url": "https://avatars.githubusercontent.com/u/43556190?"
    },
    "repo": {
        "id": 251396393,
        "name": "paulpatault/ElectronApp",
        "url": "https://api.github.com/repos/paulpatault/ElectronApp"
    },
    "payload": {
        "push_id": 4942456594,
        "size": 1,
        "distinct_size": 1,
        "ref": "refs/heads/master",
        "head": "e6ed0fc4ca94b2c70967dbc4be04c4b77c465a3e",
        "before": "3582055733a4368c34d3a986936e8d1ad767a9ac",
        "commits": [
            {
                "sha": "e6ed0fc4ca94b2c70967dbc4be04c4b77c465a3e",
                "author": {
                    "name": "paul",
                    "email": "3412ff613491d9ef3b65c4a94ea42fb53750a40c@gmail.com"
          
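The full JSON dump is long, so a compact outline of the keys can be easier to scan. The sketch below uses a hypothetical helper, `outline`, that lists each key with the type of its value; the `event` dictionary is a shortened, made-up stand-in for a real PushEvent.

```python
def outline(value, indent=0):
    """Return one line per key, indented by nesting level, with the value's type name."""
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(" " * indent + f"{key}: {type(child).__name__}")
            lines.extend(outline(child, indent + 2))
    elif isinstance(value, list) and value:
        # Describe list elements using the first item only.
        lines.extend(outline(value[0], indent))
    return lines

# Shortened stand-in for a PushEvent (illustrative values).
event = {
    "id": "1",
    "type": "PushEvent",
    "payload": {"size": 1, "commits": [{"sha": "abc"}]},
}
print("\n".join(outline(event)))
```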

## Mining of PushEvent commits

Let's extract a list of rows to build a pandas `DataFrame` later, keeping only the commits flagged as `distinct`:

In [0]:
rows = []
for push in push_events:
    repo_name = push['repo']['name']
    created_at = push['created_at']
    # One row per distinct commit in the push.
    for commit in push['payload']['commits']:
        if commit['distinct']:
            rows.append({
                'id': commit['sha'],
                'repository': repo_name,
                'author': commit['author']['name'],
                'message': commit['message'],
                'url': commit['url'],
                'created_at': created_at
            })

In [10]:
commits_data = pd.DataFrame(rows)

commits_data.head()

| | id | repository | author | message | url | created_at |
|---|---|---|---|---|---|---|
| 0 | e6ed0fc4ca94b2c70967dbc4be04c4b77c465a3e | paulpatault/ElectronApp | paul | Update main.js | https://api.github.com/repos/paulpatault/Elect... | 2020-04-20T00:00:00Z |
| 1 | 99b9cfc11ac6dd01327f9ceb9b36d6d8a0b83843 | SusumuKanazawa/sample | SusumuKanazawa | Merge pull request #9 from SusumuKanazawa/deve... | https://api.github.com/repos/SusumuKanazawa/sa... | 2020-04-20T00:00:00Z |
| 2 | 60a3dffc4c6e6c485155c83d9e9cae4ccd6512c7 | SuperBrainBro/LUDUM-DARE-46 | NoohAlavi | a | https://api.github.com/repos/SuperBrainBro/LUD... | 2020-04-20T00:00:00Z |
| 3 | a17f54b66141b82db68f1283860d5f7493f15af8 | hurl2526/what-i-learned-week-13 | Patrick | what-i-learned-13 | https://api.github.com/repos/hurl2526/what-i-l... | 2020-04-20T00:00:00Z |
| 4 | ed1db100a939e12f31c8c0710ce0882e9e63cea9 | OpenSAGE/OpenSAGE.BlenderPlugin | Michael Schnabel | material import rework wip | https://api.github.com/repos/OpenSAGE/OpenSAGE... | 2020-04-20T00:00:00Z |
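The extraction loop can also be written as a small generator, which makes it easier to reuse across several archives. This is a minimal sketch, not the notebook's own method: the name `commit_rows` is hypothetical, the tolerance for a missing `commits` list is an added assumption, and `sample_push` is made-up data that mirrors the PushEvent fields used above.

```python
def commit_rows(push):
    """Yield one flat dict per distinct commit in a PushEvent."""
    repo_name = push["repo"]["name"]
    created_at = push["created_at"]
    # Assumption: tolerate a missing commit list instead of raising KeyError.
    for commit in push.get("payload", {}).get("commits", []):
        if commit.get("distinct"):
            yield {
                "id": commit["sha"],
                "repository": repo_name,
                "author": commit["author"]["name"],
                "message": commit["message"],
                "url": commit["url"],
                "created_at": created_at,
            }

# Made-up PushEvent with one distinct and one non-distinct commit.
sample_push = {
    "repo": {"name": "user/project"},
    "created_at": "2020-04-20T00:00:00Z",
    "payload": {
        "commits": [
            {"sha": "abc123", "distinct": True, "author": {"name": "alice"},
             "message": "first", "url": "https://example.invalid/abc123"},
            {"sha": "def456", "distinct": False, "author": {"name": "bob"},
             "message": "second", "url": "https://example.invalid/def456"},
        ]
    },
}

sample_rows = list(commit_rows(sample_push))
print(sample_rows)
```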


To count the commits for each repository:

In [11]:
commits_data.groupby('repository').size().sort_values(ascending=False)

repository
shangwoa/ab4860                                    398
y232/ntdtv                                         143
ShemaxCodes/oo-tic-tac-toe-onl01-seng-pt-032320    142
NoeCampos22/App_2LeyNewton                         140
akuppan1/rules-for-derivatives-lab-ds-apply-000    124
                                                  ... 
irudym/iloilo-core                                   1
irudym/iloilo-client                                 1
irov/Mengine                                         1
ironfroggy/SeedMagic                                 1
000SergeyMayer000/tracker                            1
Length: 19106, dtype: int64
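For a single column, `value_counts()` gives the same per-group counts as `groupby(...).size()` and already sorts them in descending order. A small sketch on made-up repository names (the `sample` frame is illustrative, not GH Archive data):

```python
import pandas as pd

# Illustrative data: one row per commit, as in commits_data.
sample = pd.DataFrame({"repository": ["u/a", "u/b", "u/a", "u/c", "u/a", "u/b"]})

by_groupby = sample.groupby("repository").size().sort_values(ascending=False)
by_value_counts = sample["repository"].value_counts()

# The two repositories with the most commits.
top2 = by_value_counts.head(2)
print(top2)
```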

## Working with duplicates

What happens if we call `groupby()` on the commit id?

In [12]:
commits_data.groupby('id').size().sort_values(ascending=False)

id
1f23447aaa7e01dc92d41c76c2db70ca37e767e0    18
26fe874a38e92ea0433e1ba552a92229a483c610    11
15fc953f75766541bf2f1ed95945da2961a47ca6    11
6d7fb9f46ac61761c3d1b8c9a320090980bae2cd    10
0abcf2d1769eaea3af4e8821bd2c865351687715     8
                                            ..
a9a138ac10a55f508bc14fd164fba185fc551f6e     1
a9a131b0bb3b888393e621fb43f73d17f0055988     1
a99fa6f5dd8ae63f89a292bdb0f8c9df253044a8     1
a99f96753e7db59a4f3d6f73a125cb09936699f1     1
00009ce5eec4e505baa18e5bcbfe3add18dd4f4d     1
Length: 49608, dtype: int64

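Some commit ids appear more than once, even though only `distinct` commits were kept. To count the redundant rows (every occurrence after the first):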

In [13]:
commits_data.duplicated(subset='id').sum()

1114

To drop the rows whose commit id has already appeared, keeping the first occurrence:

In [14]:
commits_data = commits_data.drop_duplicates(subset='id')

commits_data.duplicated(subset='id').sum()

0
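Before dropping duplicates, it can help to look at which repositories share a commit id. The sketch below uses `duplicated(keep=False)` to select every occurrence of a repeated id rather than only the later ones; the `commits` frame and its repository names are made up for illustration.

```python
import pandas as pd

# Illustrative data: the same SHA pushed to several repositories.
commits = pd.DataFrame({
    "id": ["a1", "a1", "b2", "c3", "c3", "c3"],
    "repository": ["u/x", "v/x-fork", "u/y", "u/z", "w/z", "q/z"],
})

# Default keep='first': flags only occurrences after the first one.
n_extra = commits.duplicated(subset="id").sum()

# keep=False: flags every row whose id occurs more than once.
shared = commits[commits.duplicated(subset="id", keep=False)]
repos_per_sha = shared.groupby("id")["repository"].nunique()

# Keep the first row for each id, as in the cell above.
deduped = commits.drop_duplicates(subset="id", keep="first")

print(n_extra)
print(repos_per_sha)
print(deduped)
```

Which row to keep (`keep='first'` or `keep='last'`) is a judgment call; with the rows in archive order, `'first'` retains the earliest push that contained the commit.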

## Resources

- [GH Archive](https://www.gharchive.org/)
- [GitHub event types](https://developer.github.com/v3/activity/events/types/)