# Processing GitHub Events Taken From BigQuery

After creating a BigQuery account, the following SQL query was used to extract the events from 2016 through May 19, 2017 for all 99 repos and their forks taken from SamplingProcess.ipynb. The query was run on May 20, 2017 at 5:02 PM PST.

`SELECT type,
       payload,
       repo.id AS repo_id,
       actor.id AS user_id,
       actor.login AS login,
       created_at,
       id AS archive_id
  FROM [githubarchive]
 WHERE repo.id IN (
           SELECT repo_id
             FROM [repos_table]
       );`
       
The exported files are named 2016_events.gz, 2017_01_events.gz, 2017_02_events.gz, 2017_03_events.gz, 2017_04_events.gz, and 2017_05_events.gz. To avoid reading each file separately, the files were combined with the following shell command.

`cat *.gz > events.gz`

These files can be downloaded from the link on the README. 
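
Concatenating the archives with `cat` works because a gzip file may hold several compressed members, and readers decompress them back to back. The sketch below uses only Python's standard library and two made-up miniature exports (the file names and rows are illustrative, not the real data) to show the side effect that matters in the next section: each part's header line survives in the combined file.

```python
import gzip
import os
import tempfile

# Hypothetical miniature exports; the real inputs are the BigQuery files listed above.
parts = {
    'part_a.gz': 'type,repo_id\nWatchEvent,1\n',
    'part_b.gz': 'type,repo_id\nForkEvent,2\n',
}

tmp_dir = tempfile.mkdtemp()
combined_path = os.path.join(tmp_dir, 'events.gz')

# Equivalent of `cat part_a.gz part_b.gz > events.gz`: append the raw bytes.
with open(combined_path, 'wb') as out:
    for name in sorted(parts):
        part_path = os.path.join(tmp_dir, name)
        with gzip.open(part_path, 'wb') as gz:
            gz.write(parts[name].encode('utf8'))
        with open(part_path, 'rb') as raw:
            out.write(raw.read())

# gzip reads the members one after another as a single stream.
with gzip.open(combined_path, 'rb') as gz:
    lines = gz.read().decode('utf8').splitlines()

# Both header lines are still present in the combined output.
header_count = sum(1 for line in lines if line == 'type,repo_id')
```
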

In [1]:
import pandas as pd
import sqlite3 as sql
import json
import sys
stdout = sys.stdout
reload(sys)
sys.setdefaultencoding('utf8')
sys.stdout = stdout

con = sql.connect('github.db')
c = con.cursor()

## Reading in the combined file

NOTE: Because the files were concatenated, each file's header row reappears as a data row in the combined file. These repeated header rows need to be removed.

In [2]:
events = pd.read_table('events.gz', sep=',', engine='c')

In [3]:
# find the indices of the repeated header rows
rm_indices = events['type'][events['type'] == 'type'].index

# drop those rows
events = events.drop(rm_indices)
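
The same cleanup can also be written as a boolean mask, which avoids handling index labels directly. This is a minimal sketch on a toy frame (`toy` and its values are made up for illustration); it assumes, as above, that a stray header row is any row whose `type` value is the literal string `type`.

```python
import pandas as pd

# Toy frame standing in for `events`; the second row is a stray header line.
toy = pd.DataFrame({
    'type': ['WatchEvent', 'type', 'ForkEvent'],
    'repo_id': ['1', 'repo_id', '2'],
})

# Flag rows whose `type` value is the column name itself, then keep the rest.
is_header = toy['type'] == 'type'
cleaned = toy[~is_header].reset_index(drop=True)
```
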

In [4]:
# fix encoding in payload
events['payload'] = events['payload'].str.encode('ascii', 'ignore')

# change str representation of dict to actual dict
events['payload'] = events['payload'].apply(json.loads)

# change data type to numeric
events[['repo_id', 'user_id', 'archive_id']] = events[['repo_id', 'user_id', 'archive_id']].applymap(pd.to_numeric)

# change data type to date
events['created_at'] = events['created_at'].apply(pd.to_datetime)

events.shape

(478495, 7)
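
`pd.to_numeric` and `pd.to_datetime` also accept a whole column, so the conversions can be written without element-wise `applymap`/`apply` calls. A minimal sketch on a one-row toy frame (the frame `raw` is an assumption for illustration, and the code is written for Python 3, so the ASCII re-encoding step is left out):

```python
import json

import pandas as pd

# One-row stand-in for `events`, with every value still a string.
raw = pd.DataFrame({
    'payload': ['{"action": "started"}'],
    'repo_id': ['55584626'],
    'created_at': ['2016-05-20 09:37:31'],
})

# JSON text -> dict, one call per cell (json.loads is not vectorized).
raw['payload'] = raw['payload'].apply(json.loads)

# Column-wise conversions: one call per column rather than one per cell.
raw['repo_id'] = pd.to_numeric(raw['repo_id'])
raw['created_at'] = pd.to_datetime(raw['created_at'])
```
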

In [5]:
# reselect all the repos (accidentally chose repos with issues not in English)
repos_ids = pd.read_sql('SELECT DISTINCT(repo_id) AS repos_ids FROM repos;', con)

In [6]:
events = events[events.repo_id.isin(repos_ids.repos_ids)]

events.shape

(475273, 7)
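
The filtering step can be tried end to end on an in-memory database. This sketch is only an illustration: `con_demo`, the `repos` rows, and `toy_events` are made up, and only the `isin` pattern mirrors the cells above.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for github.db with a tiny repos table (ids are made up).
con_demo = sqlite3.connect(':memory:')
con_demo.execute('CREATE TABLE repos (repo_id NUMERIC NOT NULL)')
con_demo.executemany('INSERT INTO repos VALUES (?)', [(1,), (1,), (3,)])

# DISTINCT collapses the repeated id before it reaches pandas.
keep = pd.read_sql('SELECT DISTINCT repo_id FROM repos;', con_demo)

# Events for repo 2 are dropped because that id is not in the repos table.
toy_events = pd.DataFrame({'repo_id': [1, 2, 3, 3], 'type': ['WatchEvent'] * 4})
filtered = toy_events[toy_events['repo_id'].isin(keep['repo_id'])]
```
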

In [7]:
events['type'].unique()

array(['ForkEvent', 'PushEvent', 'WatchEvent', 'CreateEvent',
       'DeleteEvent', 'GollumEvent', 'IssuesEvent', 'PullRequestEvent',
       'IssueCommentEvent', 'CommitCommentEvent',
       'PullRequestReviewCommentEvent', 'ReleaseEvent', 'MemberEvent',
       'PublicEvent'], dtype=object)

## Creating a user table

Run once. Now that all the events are available, we can create a table of unique users and then drop the login and archive_id.

In [36]:
create_users_table = '''
CREATE TABLE users (
    user_id NUMERIC      NOT NULL,
    login   VARCHAR (40) NOT NULL,
    PRIMARY KEY (
        user_id,
        login
    )
)
WITHOUT ROWID;'''

In [39]:
c.execute(create_users_table)
con.commit()

In [40]:
# users can change their login name, so the same user_id can appear with several logins,
# or the login can be that of an organization the user belongs to
user_table = events[['user_id', 'login']].drop_duplicates()

user_table.to_sql(name='users', con=con, if_exists='append', index=False)
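
The composite primary key is what makes this a run-once step: appending the same `(user_id, login)` pairs a second time violates the key. A small sketch against an in-memory database (the rows and the name `con_demo` are hypothetical) shows both behaviours, one id under two logins being accepted and an exact repeat being rejected.

```python
import sqlite3

con_demo = sqlite3.connect(':memory:')
con_demo.execute('''
CREATE TABLE users (
    user_id NUMERIC      NOT NULL,
    login   VARCHAR (40) NOT NULL,
    PRIMARY KEY (user_id, login)
)
WITHOUT ROWID;''')

# Hypothetical rows: user 1 renamed the account, so the id has two logins.
rows = [(1, 'old_name'), (1, 'new_name'), (2, 'someone')]
con_demo.executemany('INSERT INTO users VALUES (?, ?)', rows)

# Re-inserting an existing (user_id, login) pair violates the primary key.
try:
    con_demo.execute('INSERT INTO users VALUES (?, ?)', (2, 'someone'))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

logins_for_user_1 = [r[0] for r in con_demo.execute(
    'SELECT login FROM users WHERE user_id = 1 ORDER BY login')]
```
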

# Event Payloads To Tables In The Database

## Watch Event
The WatchEvent is related to starring a repository, not watching. See this API blog post for an explanation. The event’s actor is the user who starred a repository, and the event’s repository is the repository that was starred.

`
KEY     TYPE    DESCRIPTION
action  string  The action that was performed. Currently, can only be started.`

In [8]:
# check the structure of Watch Event
events[events['type'] == 'WatchEvent'].head()

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
21,WatchEvent,{u'action': u'started'},55584626,33275,waigani,2016-05-20 09:37:31,4040270085
22,WatchEvent,{u'action': u'started'},72145556,8845083,v7lin,2016-12-14 01:57:21,5017278521
23,WatchEvent,{u'action': u'started'},64392484,1678118,bbbenji,2016-08-07 19:17:47,4383818313
24,WatchEvent,{u'action': u'started'},69798748,128654,BekoBou,2016-10-19 15:40:14,4735036669
25,WatchEvent,{u'action': u'started'},69798748,1716049,antonvasilenko,2016-10-28 14:37:24,4782874020


In [9]:
watch_table = events[['repo_id', 'user_id', 'login', 'created_at', 'archive_id']][events['type'] == 'WatchEvent']

In [10]:
# rename cols and change data type
watch_table = watch_table.rename(columns={'created_at':'date_starred', 'login':'user_login'})

watch_table['date_starred'] = watch_table['date_starred'].apply(pd.to_datetime)

In [11]:
print watch_table.shape

watch_table.head()

(230336, 5)


Unnamed: 0,repo_id,user_id,user_login,date_starred,archive_id
21,55584626,33275,waigani,2016-05-20 09:37:31,4040270085
22,72145556,8845083,v7lin,2016-12-14 01:57:21,5017278521
23,64392484,1678118,bbbenji,2016-08-07 19:17:47,4383818313
24,69798748,128654,BekoBou,2016-10-19 15:40:14,4735036669
25,69798748,1716049,antonvasilenko,2016-10-28 14:37:24,4782874020


In [12]:
# drop all duplicates across ALL columns
watch_table = watch_table.drop_duplicates()

print watch_table.shape

(230198, 5)


In [13]:
create_watch_table = '''
CREATE TABLE watch_events (
    repo_id      NUMERIC      NOT NULL,
    date_starred TEXT         NOT NULL,
    user_id      NUMERIC      NOT NULL,
    user_login   VARCHAR (40) NOT NULL,
    archive_id   NUMERIC      NOT NULL,
    PRIMARY KEY (
        repo_id,
        user_id,
        date_starred,
        archive_id
    )
)
WITHOUT ROWID;
'''

In [14]:
c.execute(create_watch_table)
con.commit()

In [15]:
watch_table.to_sql(name='watch_events', con=con, if_exists='append', index=False)
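
Once the rows are stored, stars per repository can be counted with a plain `GROUP BY`. The sketch below rebuilds the table in an in-memory database (`con_demo` is an assumption for illustration) and loads three rows copied from the preview above.

```python
import sqlite3

con_demo = sqlite3.connect(':memory:')
con_demo.execute('''
CREATE TABLE watch_events (
    repo_id      NUMERIC      NOT NULL,
    date_starred TEXT         NOT NULL,
    user_id      NUMERIC      NOT NULL,
    user_login   VARCHAR (40) NOT NULL,
    archive_id   NUMERIC      NOT NULL,
    PRIMARY KEY (repo_id, user_id, date_starred, archive_id)
)
WITHOUT ROWID;''')

# Three rows from the preview, in the table's column order.
rows = [
    (55584626, '2016-05-20 09:37:31', 33275, 'waigani', 4040270085),
    (69798748, '2016-10-19 15:40:14', 128654, 'BekoBou', 4735036669),
    (69798748, '2016-10-28 14:37:24', 1716049, 'antonvasilenko', 4782874020),
]
con_demo.executemany('INSERT INTO watch_events VALUES (?, ?, ?, ?, ?)', rows)

# One output row per repository, most-starred first.
stars_per_repo = con_demo.execute('''
    SELECT repo_id, COUNT(*) AS stars
      FROM watch_events
     GROUP BY repo_id
     ORDER BY stars DESC, repo_id;''').fetchall()
```
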

## Push Event
Triggered when a repository branch is pushed to. In addition to branch pushes, webhook push events are also triggered when repository tags are pushed.

`
KEY                       TYPE      DESCRIPTION
ref                       string    The full Git ref that was pushed. Example: "refs/heads/master".
head                      string    The SHA of the most recent commit on ref after the push.
before                    string    The SHA of the most recent commit on ref before the push.
size                      integer   The number of commits in the push.
distinct_size             integer   The number of distinct commits in the push.
commits                   array     An array of commit objects describing the pushed commits. (The array includes a
                                    maximum of 20 commits. If necessary, you can use the Commits API to fetch
                                    additional commits. This limit is applied to timeline events only and isn't
                                    applied to webhook deliveries.)
commits[][sha]            string    The SHA of the commit.
commits[][message]        string    The commit message.
commits[][author]         object    The git author of the commit.
commits[][author][name]   string    The git author's name.
commits[][author][email]  string    The git author's email address.
commits[][url]            url       Points to the commit API resource.
commits[][distinct]       boolean   Whether this commit is distinct from any that have been pushed before.
`

In [77]:
events[events['type'] == 'PushEvent'].head()

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
7,PushEvent,"{u'size': 1, u'head': u'9ace8b54e6b6199109e4f8...",58482213,2467194,yoshuawuyts,2016-08-29 02:28:48 UTC,4481961993
8,PushEvent,"{u'size': 2, u'head': u'1bb34dc6a8c738c65474e9...",67361765,2182307,jpuri,2016-10-24 19:14:12 UTC,4758352997
9,PushEvent,"{u'size': 1, u'head': u'dafa402c9b56f6ca7e388c...",63506379,4643257,AeonLucid,2016-08-07 12:37:43 UTC,4383281798
10,PushEvent,"{u'size': 3, u'head': u'7aba5f0eb216b3127b7522...",65438297,3266682,dgcrouse,2016-08-23 06:02:58 UTC,4455249468
11,PushEvent,"{u'size': 1, u'head': u'5ba0cd4c4706426a6fce75...",75232752,2057020,neskk,2016-12-02 09:08:54 UTC,4958264505


In [39]:
# check the structure of a push event
events['payload'][events['type'] == 'PushEvent'].iloc[0]

{u'before': u'1bba545d19231f5d1030f742673db07db98e354f',
 u'commits': [{u'author': {u'email': u'a1a7f265cb3ad36479782781c7488a7b334751fd@gmail.com',
    u'name': u'Yoshua Wuyts'},
   u'distinct': True,
   u'message': u'tests: fix history',
   u'sha': u'9ace8b54e6b6199109e4f867d5404e94dc6f2efd',
   u'url': u'https://api.github.com/repos/yoshuawuyts/choo/commits/9ace8b54e6b6199109e4f867d5404e94dc6f2efd'}],
 u'distinct_size': 1,
 u'head': u'9ace8b54e6b6199109e4f867d5404e94dc6f2efd',
 u'push_id': 1271365738,
 u'ref': u'refs/heads/update-router',
 u'size': 1}
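
A push payload nests its commits in a list, so turning it into table rows means emitting one row per commit and carrying the push-level fields along. The notebook does not build a push table, so the function `flatten_push` and the choice of columns below are assumptions for illustration; the payload is a trimmed copy of the example above.

```python
# Payload trimmed from the example above (author email and commit URL omitted).
payload = {
    'before': '1bba545d19231f5d1030f742673db07db98e354f',
    'commits': [{
        'author': {'name': 'Yoshua Wuyts'},
        'distinct': True,
        'message': 'tests: fix history',
        'sha': '9ace8b54e6b6199109e4f867d5404e94dc6f2efd',
    }],
    'distinct_size': 1,
    'head': '9ace8b54e6b6199109e4f867d5404e94dc6f2efd',
    'push_id': 1271365738,
    'ref': 'refs/heads/update-router',
    'size': 1,
}


def flatten_push(payload):
    """Return one dict per commit, repeating the push-level fields."""
    rows = []
    for commit in payload.get('commits', []):
        rows.append({
            'push_id': payload['push_id'],
            'ref': payload['ref'],
            'sha': commit['sha'],
            'message': commit['message'],
            'author_name': commit['author']['name'],
            'distinct': commit['distinct'],
        })
    return rows


commit_rows = flatten_push(payload)
```

Because timeline events list at most 20 commits, the number of rows produced this way can be smaller than the payload's `size` field.
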

In [12]:
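# expand the PushEvent payloads into columns
# (assumed: the cell that defined PushEvent_payload is not shown; this mirrors how pr_payload is built below)
PushEvent_payload = pd.DataFrame(events['payload'][events['type'] == 'PushEvent'].tolist())

# checking the different sizes of commits
PushEvent_payload['distinct_size'].unique()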

array([   1,    2,    3,    4,    5,   11,    0,   89,   73,    6,   37,
         16,    7,   15,   53,   18,   19,    9,   13,    8,   25,   12,
        183,   10,   14,   24,   22,   27,   43,   75,   76,  141,   30,
         55,   85,   32,   90,   29,   21,   28,   40,   44,   35,   66,
        108,   34,   38,  420,   67,   20,   23,  261,   58,   17,  238,
         98,   99,   31,  342,  833,   26,  178,   63,  135,  179,   68,
         95,   48,   80,   62,   42,  112,   41,   54,   36,  171,   64,
         46,  130,  532,   70,   91,   56,  474,   69,  136,  262,   71,
        169,   97,   84,  114,  321,   33,  175,  426,  268,  218,   94,
        148,  499,   74,   57,   39,  351,  122,  251,  213,   51,   72,
        128,  216,  111,   93,   50,  120,   45,  106,  713,   81,  392,
         49,   88,  147,  314,  187,   86,  115,  126,   79,   60,  233,
        297,  104,  185,  157,   59,  127,   65,  294,   92,  140,  264,
        105,  158,   52,  331,  121,  493,  172,   

## Issue Comment Event
Triggered when an issue comment is created, edited, or deleted.

`
KEY                 TYPE    DESCRIPTION
action              string  The action that was performed on the comment. Can be one of "created", "edited", or
                            "deleted".
changes             object  The changes to the comment if the action was "edited".
changes[body][from] string  The previous version of the body if the action was "edited".
issue               object  The issue the comment belongs to.
comment             object  The comment itself.
`

In [16]:
IssueCommentEvents = events[events['type'] == 'IssueCommentEvent']

In [17]:
# example of a payload
IssueCommentEvents['payload'].iloc[0]

{u'action': u'created',
 u'comment': {u'body': u'Ya once formatted you will have to transfer again',
  u'created_at': u'2016-10-29T14:22:29Z',
  u'html_url': u'https://github.com/Plailect/Guide/issues/665#issuecomment-257094294',
  u'id': 257094294,
  u'issue_url': u'https://api.github.com/repos/Plailect/Guide/issues/665',
  u'updated_at': u'2016-10-29T14:22:29Z',
  u'url': u'https://api.github.com/repos/Plailect/Guide/issues/comments/257094294',
  u'user': {u'avatar_url': u'https://avatars.githubusercontent.com/u/17306233?v=3',
   u'events_url': u'https://api.github.com/users/pbanj/events{/privacy}',
   u'followers_url': u'https://api.github.com/users/pbanj/followers',
   u'following_url': u'https://api.github.com/users/pbanj/following{/other_user}',
   u'gists_url': u'https://api.github.com/users/pbanj/gists{/gist_id}',
   u'gravatar_id': u'',
   u'html_url': u'https://github.com/pbanj',
   u'id': 17306233,
   u'login': u'pbanj',
   u'organizations_url': u'https://api.github.com/user

In [124]:
ICL = []

for index, row in IssueCommentEvents.iterrows():
    
    payload = row['payload']
    comment = payload['comment']
    issue = payload['issue']
    
    ICL.append({'repo_id': row['repo_id'], 'user_id': row['user_id'], 
               'user_login': row['login'], 'date_created': row['created_at'],
               'archive_id': row['archive_id'], 'action': payload['action'], 
               'comment': comment['body'], 'issue_comment_id': pd.to_numeric(comment['id']), 
                'url': comment['html_url'], 'issue_id': pd.to_numeric(issue['id']),
               'issue_state': issue['state']})

In [125]:
# number of issue comment events extracted
len(ICL)

69821

In [126]:
issue_comments_table = pd.DataFrame(ICL)

In [127]:
issue_comments_table.head()

Unnamed: 0,action,archive_id,comment,date_created,issue_comment_id,issue_id,issue_state,repo_id,url,user_id,user_login
0,created,4786583626,Ya once formatted you will have to transfer again,2016-10-29 14:22:29,257094294,186062739,open,52920387,https://github.com/Plailect/Guide/issues/665#i...,17306233,pbanj
1,created,4404634353,it only seems to happen to be when I am on the...,2016-08-11 12:44:22,239150022,170343029,open,63730796,https://github.com/mchristopher/PokemonGo-Desk...,5121698,justinblayney
2,created,3885837271,BTW - this I ran into this because of a [commo...,2016-04-13 21:48:02,209662472,148192004,open,54994103,https://github.com/FezVrasta/popper.js/issues/...,136564,rosskevin
3,created,4400512395,Looks like this feature isn't avaible from the...,2016-08-10 18:03:45,238951429,170368421,closed,64392484,https://github.com/kvangent/PokeAlarm/issues/7...,10712294,kvangent
4,created,4073978702,+1 I was going to make a similar PR as well fo...,2016-05-28 04:18:24,222289102,156508185,open,53321815,https://github.com/picturepan2/spectre/pull/55...,5353151,DJTB
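
The per-row extraction can be pulled into a small function, which makes it easy to check on a single event before looping over all of them. This is a sketch only: `comment_row` is a hypothetical helper, the event is a trimmed copy of the payload printed above, and the issue values come from the first row of the table.

```python
# Trimmed copy of the first IssueCommentEvent shown above.
event = {
    'repo_id': 52920387,
    'user_id': 17306233,
    'login': 'pbanj',
    'created_at': '2016-10-29 14:22:29',
    'archive_id': 4786583626,
    'payload': {
        'action': 'created',
        'comment': {
            'body': 'Ya once formatted you will have to transfer again',
            'id': 257094294,
            'html_url': 'https://github.com/Plailect/Guide/issues/665#issuecomment-257094294',
        },
        'issue': {'id': 186062739, 'state': 'open'},
    },
}


def comment_row(event):
    """Flatten one IssueCommentEvent into the columns used for the table."""
    payload = event['payload']
    comment = payload['comment']
    issue = payload['issue']
    return {
        'repo_id': event['repo_id'],
        'user_id': event['user_id'],
        'user_login': event['login'],
        'date_created': event['created_at'],
        'archive_id': event['archive_id'],
        'action': payload['action'],
        'comment': comment['body'],
        'issue_comment_id': int(comment['id']),
        'url': comment['html_url'],
        'issue_id': int(issue['id']),
        'issue_state': issue['state'],
    }


row = comment_row(event)
```
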


In [128]:
# only created events
issue_comments_table['action'].unique()

array([u'created'], dtype=object)

In [129]:
# drop any duplicates across ALL columns (not sure why they would be there though???)
issue_comments_table = issue_comments_table.drop_duplicates()

In [134]:
create_issue_comments_table = '''
CREATE TABLE issue_comment_events (
    issue_comment_id NUMERIC      NOT NULL,
    issue_id         NUMERIC      NOT NULL,
    repo_id          NUMERIC      NOT NULL,
    date_created     TEXT         NOT NULL,
    [action]         VARCHAR (15) NOT NULL,
    comment          TEXT,
    issue_state      VARCHAR (15) NOT NULL,
    url              TEXT         NOT NULL,
    user_id          NUMERIC      NOT NULL,
    user_login       VARCHAR (40) NOT NULL,
    archive_id       NUMERIC      NOT NULL,
    PRIMARY KEY (
        issue_comment_id,
        issue_id,
        repo_id,
        date_created
    )
)
WITHOUT ROWID;
'''

In [135]:
c.execute(create_issue_comments_table)
con.commit()

In [136]:
issue_comments_table.to_sql(name='issue_comment_events', con=con, if_exists='append', index=False)

## Fork Event
Triggered when a user forks a repository.

`
KEY     TYPE    DESCRIPTION
forkee  object  The created repository.
`

In [115]:
# check the structure of Fork Event
events['payload'][events['type'] == 'ForkEvent'].iloc[0]

{u'forkee': {u'archive_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/{archive_format}{/ref}',
  u'assignees_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/assignees{/user}',
  u'blobs_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/git/blobs{/sha}',
  u'branches_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/branches{/branch}',
  u'clone_url': u'https://github.com/DavidKindler/js-stack-from-scratch.git',
  u'collaborators_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/collaborators{/collaborator}',
  u'comments_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/comments{/number}',
  u'commits_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/commits{/sha}',
  u'compare_url': u'https://api.github.com/repos/DavidKindler/js-stack-from-scratch/compare/{base}...{head}',
  u'contents_url': u'https://api.github.com/repos/DavidK
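
The notebook stops at inspecting the fork payload. If a fork table were wanted, a few fields of `forkee` would probably be enough. The sketch below is an assumption throughout: `fork_row` is a hypothetical helper, the ids are invented, and `id` and `full_name` are keys expected on a GitHub repository object that do not appear in the truncated output above (`clone_url` does).

```python
def fork_row(event):
    """Pick a few forkee fields; .get() tolerates keys that are absent."""
    forkee = event['payload']['forkee']
    return {
        'repo_id': event['repo_id'],      # the repository that was forked
        'user_id': event['user_id'],      # the user who forked it
        'fork_id': forkee.get('id'),
        'fork_full_name': forkee.get('full_name'),
        'fork_clone_url': forkee.get('clone_url'),
    }


# Illustrative event: the numeric ids are made up, the URL is from the output above.
event = {
    'repo_id': 111,
    'user_id': 222,
    'payload': {
        'forkee': {
            'id': 333,
            'full_name': 'DavidKindler/js-stack-from-scratch',
            'clone_url': 'https://github.com/DavidKindler/js-stack-from-scratch.git',
        },
    },
}

row = fork_row(event)
```
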

## Issues Event
Triggered when an issue is assigned, unassigned, labeled, unlabeled, opened, edited, milestoned, demilestoned, closed, or reopened.

`
KEY                  TYPE    DESCRIPTION
action               string  The action that was performed. Can be one of "assigned", "unassigned", "labeled", 
                             "unlabeled", "opened", "edited", "milestoned", "demilestoned", "closed", or "reopened".
issue                object  The issue itself.
changes              object  The changes to the issue if the action was "edited".
changes[title][from] string  The previous version of the title if the action was "edited".
changes[body][from]  string  The previous version of the body if the action was "edited".
assignee             object  The optional user who was assigned or unassigned from the issue.
label                object  The optional label that was added or removed from the issue.
`

In [29]:
IssueEvents = events[events['type'] == 'IssuesEvent']

print IssueEvents.shape

(27585, 7)


In [30]:
IL = []

for index, row in IssueEvents.iterrows():
    
    payload = row['payload']
    issue = payload['issue']
    
    # join the label names with '::'; fall back to None when labels are missing
    try:
        labels = '::'.join(labs['name'] for labs in issue['labels'])
    except (KeyError, TypeError):
        labels = None

    IL.append({'repo_id': row['repo_id'], 'user_id': row['user_id'], 
               'user_login': row['login'], 'date_created': row['created_at'],
               'archive_id': row['archive_id'], 'action': payload['action'],
               'description': issue['body'], 'comments_count': pd.to_numeric(issue['comments']),
               'issue_id': pd.to_numeric(issue['id']), 'issue_number': pd.to_numeric(issue['number']),
               'title': issue['title'], 'labels': labels, 'url':issue['html_url']})

In [31]:
issue_table = pd.DataFrame(IL)

In [32]:
issue_table.head()

Unnamed: 0,action,archive_id,comments_count,date_created,description,issue_id,issue_number,labels,repo_id,title,url,user_id,user_login
0,closed,3900021338,2,2016-04-18 02:06:46,I noticed in the screenshot it looks different...,149019668,4,,53632140,How to modify theme,https://github.com/zyedidia/micro/issues/4,24260,montanaflynn
1,opened,4545891531,0,2016-09-11 14:18:12,I am using Mac OSX and the tracker will work g...,176245961,1209,,63730796,how to avoid getting IP banned,https://github.com/mchristopher/PokemonGo-Desk...,22104808,lochnesskid
2,opened,4676639609,0,2016-10-07 12:40:12,"Hi,\r\n\r\nThank you very much for the plugin....",181658747,133,,61213281,Expand and Contract Messages in the android no...,https://github.com/fechanique/cordova-plugin-f...,19806010,iamntg
3,closed,4287637946,1,2016-07-16 22:07:29,*(6/20: updated the numbers)*\r\n\r\nI don't c...,161111673,5,,57929326,CSrankings is missing half my papers (on DBLP),https://github.com/emeryberger/CSrankings/issu...,1612723,emeryberger
4,closed,4841857831,1,2016-11-09 17:11:30,https://plailect.github.io/Guide/updating-a9lh...,188194735,695,,52920387,Add an optional label to section IV on updatin...,https://github.com/Plailect/Guide/issues/695,16979510,Plailect


In [33]:
# drop any duplicates across ALL columns (not sure why they would be there though???)
issue_table = issue_table.drop_duplicates()

In [34]:
create_issues_table = '''
CREATE TABLE issues_events (
    issue_id       NUMERIC      NOT NULL,
    repo_id        NUMERIC      NOT NULL,
    date_created   TEXT         NOT NULL,
    [action]       VARCHAR (15) NOT NULL,
    title          TEXT,
    description    TEXT,
    labels         TEXT,
    issue_number   NUMERIC,
    comments_count INT,
    url            TEXT         NOT NULL,
    user_id        NUMERIC      NOT NULL,
    user_login     VARCHAR (40) NOT NULL,
    archive_id     NUMERIC      NOT NULL,
    PRIMARY KEY (
        issue_id,
        repo_id,
        date_created,
        [action]
    )
)
WITHOUT ROWID;
'''

In [35]:
c.execute(create_issues_table)
con.commit()

In [36]:
issue_table.to_sql(name='issues_events', con=con, if_exists='append', index=False)
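
Since the labels are stored as a single `::`-joined string, anything reading the `labels` column has to split them again. A minimal round-trip sketch follows; `join_labels` and `split_labels` are hypothetical helpers, and the label names are made up.

```python
SEPARATOR = '::'


def join_labels(issue):
    """Return the label names joined by '::', or None if labels are absent."""
    labels = issue.get('labels')
    if labels is None:
        return None
    return SEPARATOR.join(label['name'] for label in labels)


def split_labels(joined):
    """Invert join_labels; None and '' both mean 'no labels'."""
    if not joined:
        return []
    return joined.split(SEPARATOR)


issue = {'labels': [{'name': 'bug'}, {'name': 'help wanted'}]}
stored = join_labels(issue)
restored = split_labels(stored)
```

The round trip is only lossless when no label name itself contains `::`; a JSON-encoded list would be one way around that if it ever became a problem.
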

## Pull Request Event
Triggered when a pull request is assigned, unassigned, labeled, unlabeled, opened, edited, closed, reopened, or synchronized. Also triggered when a pull request review is requested, or when a review request is removed.

`
KEY                  TYPE     DESCRIPTION
action               string   The action that was performed. Can be one of "assigned", "unassigned", 
                              "review_requested", "review_request_removed", "labeled", "unlabeled", "opened", 
                              "edited", "closed", or "reopened". If the action is "closed" and the merged key is 
                              false, the pull request was closed with unmerged commits. If the action is "closed" 
                              and the merged key is true, the pull request was merged.
number               integer  The pull request number.
changes              object   The changes to the comment if the action was "edited".
changes[title][from] string   The previous version of the title if the action was "edited".
changes[body][from]  string   The previous version of the body if the action was "edited".
pull_request         object   The pull request itself.
`

In [8]:
events[events['type'] == 'PullRequestEvent'].head()

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
127,PullRequestEvent,"{u'action': u'closed', u'number': 196, u'pull_...",58482213,2467194,yoshuawuyts,2016-07-27 10:00:21 UTC,4333251316
128,PullRequestEvent,"{u'action': u'opened', u'number': 158, u'pull_...",50261548,5373549,mehmandarov,2016-07-09 13:10:32 UTC,4255480534
129,PullRequestEvent,"{u'action': u'opened', u'number': 127, u'pull_...",63730796,1134310,rodrigograca31,2016-07-22 17:07:21 UTC,4315231077
268,PullRequestEvent,"{u'action': u'opened', u'number': 96, u'pull_r...",50261548,7223349,tombusby,2016-05-09 14:13:18 UTC,3989494180
269,PullRequestEvent,"{u'action': u'closed', u'number': 66, u'pull_r...",50261548,2743180,sergiokopplin,2016-03-03 20:50:55 UTC,3719640122


In [108]:
pr_payload = pd.DataFrame(events['payload'][events['type'] == 'PullRequestEvent'].tolist())

In [116]:
pr_payload['action'].unique()

array([u'closed', u'opened', u'reopened'], dtype=object)

In [110]:
pr_payload_pr = pd.DataFrame(pr_payload['pull_request'].tolist())

In [139]:
pr_payload_pr.head()

Unnamed: 0,_links,additions,assignee,assignees,base,body,changed_files,closed_at,comments,comments_url,...,requested_reviewers,review_comment_url,review_comments,review_comments_url,state,statuses_url,title,updated_at,url,user
0,{u'review_comment': {u'href': u'https://api.gi...,1,,[],{u'repo': {u'issues_url': u'https://api.github...,The requirebin demo was also in the example ar...,1,2016-07-27T10:00:21Z,1,https://api.github.com/repos/yoshuawuyts/choo/...,...,,https://api.github.com/repos/yoshuawuyts/choo/...,0,https://api.github.com/repos/yoshuawuyts/choo/...,closed,https://api.github.com/repos/yoshuawuyts/choo/...,Update link to example,2016-07-27T10:00:21Z,https://api.github.com/repos/yoshuawuyts/choo/...,{u'following_url': u'https://api.github.com/us...
1,{u'review_comment': {u'href': u'https://api.gi...,3,,[],{u'repo': {u'issues_url': u'https://api.github...,,1,,0,https://api.github.com/repos/sergiokopplin/ind...,...,,https://api.github.com/repos/sergiokopplin/ind...,0,https://api.github.com/repos/sergiokopplin/ind...,open,https://api.github.com/repos/sergiokopplin/ind...,Minor changes. Typo fixes and cleanup.,2016-07-09T13:10:32Z,https://api.github.com/repos/sergiokopplin/ind...,{u'following_url': u'https://api.github.com/us...
2,{u'review_comment': {u'href': u'https://api.gi...,3,,[],{u'repo': {u'issues_url': u'https://api.github...,A simple contribution to make the URL clickabl...,1,,0,https://api.github.com/repos/mchristopher/Poke...,...,,https://api.github.com/repos/mchristopher/Poke...,0,https://api.github.com/repos/mchristopher/Poke...,open,https://api.github.com/repos/mchristopher/Poke...,Make the URL clickable,2016-07-22T17:07:21Z,https://api.github.com/repos/mchristopher/Poke...,{u'following_url': u'https://api.github.com/us...
3,{u'review_comment': {u'href': u'https://api.gi...,1,,,{u'repo': {u'issues_url': u'https://api.github...,I was making a couple of alterations to main.c...,2,,0,https://api.github.com/repos/sergiokopplin/ind...,...,,https://api.github.com/repos/sergiokopplin/ind...,0,https://api.github.com/repos/sergiokopplin/ind...,open,https://api.github.com/repos/sergiokopplin/ind...,Allow css caching and fix orphaned assets/styl...,2016-05-09T14:13:17Z,https://api.github.com/repos/sergiokopplin/ind...,{u'following_url': u'https://api.github.com/us...
4,{u'review_comment': {u'href': u'https://api.gi...,13,,,{u'repo': {u'issues_url': u'https://api.github...,"I needed an icon for my Medium.com account, so...",5,2016-03-03T20:50:54Z,1,https://api.github.com/repos/sergiokopplin/ind...,...,,https://api.github.com/repos/sergiokopplin/ind...,0,https://api.github.com/repos/sergiokopplin/ind...,closed,https://api.github.com/repos/sergiokopplin/ind...,Added a social icon for Medium.com writing pla...,2016-03-03T20:50:54Z,https://api.github.com/repos/sergiokopplin/ind...,{u'following_url': u'https://api.github.com/us...


In [114]:
pr_payload_pr.columns

Index([u'_links', u'additions', u'assignee', u'assignees', u'base', u'body',
       u'changed_files', u'closed_at', u'comments', u'comments_url',
       u'commits', u'commits_url', u'created_at', u'deletions', u'diff_url',
       u'head', u'html_url', u'id', u'issue_url', u'locked',
       u'maintainer_can_modify', u'merge_commit_sha', u'mergeable',
       u'mergeable_state', u'merged', u'merged_at', u'merged_by', u'milestone',
       u'number', u'patch_url', u'rebaseable', u'requested_reviewers',
       u'review_comment_url', u'review_comments', u'review_comments_url',
       u'state', u'statuses_url', u'title', u'updated_at', u'url', u'user'],
      dtype='object')

In [143]:
pr_payload_pr[['commits', 'commits_url', 'created_at', 'deletions', 'diff_url',
       'head', 'html_url', 'id', 'issue_url', 'locked',
       'maintainer_can_modify', 'merge_commit_sha', 'mergeable',
       'mergeable_state', 'merged', 'merged_at', 'merged_by', 'milestone',
       'number', 'patch_url', 'rebaseable']].head()

Unnamed: 0,commits,commits_url,created_at,deletions,diff_url,head,html_url,id,issue_url,locked,...,merge_commit_sha,mergeable,mergeable_state,merged,merged_at,merged_by,milestone,number,patch_url,rebaseable
0,1,https://api.github.com/repos/yoshuawuyts/choo/...,2016-07-25T18:39:12Z,1,https://github.com/yoshuawuyts/choo/pull/196.diff,{u'repo': {u'issues_url': u'https://api.github...,https://github.com/yoshuawuyts/choo/pull/196,78730407,https://api.github.com/repos/yoshuawuyts/choo/...,False,...,d547d5863bf3b50502910110c17d197619842b27,,unknown,True,2016-07-27T10:00:21Z,{u'following_url': u'https://api.github.com/us...,,196,https://github.com/yoshuawuyts/choo/pull/196.p...,
1,1,https://api.github.com/repos/sergiokopplin/ind...,2016-07-09T13:10:32Z,3,https://github.com/sergiokopplin/indigo/pull/1...,{u'repo': {u'issues_url': u'https://api.github...,https://github.com/sergiokopplin/indigo/pull/158,76846611,https://api.github.com/repos/sergiokopplin/ind...,False,...,,,unknown,False,,,,158,https://github.com/sergiokopplin/indigo/pull/1...,
2,1,https://api.github.com/repos/mchristopher/Poke...,2016-07-22T17:07:21Z,1,https://github.com/mchristopher/PokemonGo-Desk...,{u'repo': {u'issues_url': u'https://api.github...,https://github.com/mchristopher/PokemonGo-Desk...,78509356,https://api.github.com/repos/mchristopher/Poke...,False,...,,,unknown,False,,,,127,https://github.com/mchristopher/PokemonGo-Desk...,
3,1,https://api.github.com/repos/sergiokopplin/ind...,2016-05-09T14:13:16Z,1188,https://github.com/sergiokopplin/indigo/pull/9...,{u'repo': {u'issues_url': u'https://api.github...,https://github.com/sergiokopplin/indigo/pull/96,69351580,https://api.github.com/repos/sergiokopplin/ind...,False,...,,,unknown,False,,,,96,https://github.com/sergiokopplin/indigo/pull/9...,
4,3,https://api.github.com/repos/sergiokopplin/ind...,2016-03-03T20:10:00Z,0,https://github.com/sergiokopplin/indigo/pull/6...,{u'repo': {u'issues_url': u'https://api.github...,https://github.com/sergiokopplin/indigo/pull/66,61617351,https://api.github.com/repos/sergiokopplin/ind...,False,...,6a86bde259c541af2e5163e1feb053d246a9744d,,unknown,True,2016-03-03T20:50:54Z,{u'following_url': u'https://api.github.com/us...,,66,https://github.com/sergiokopplin/indigo/pull/6...,


In [145]:
pr_payload_pr['merged_by'].iloc[0]

{u'avatar_url': u'https://avatars.githubusercontent.com/u/2467194?v=3',
 u'events_url': u'https://api.github.com/users/yoshuawuyts/events{/privacy}',
 u'followers_url': u'https://api.github.com/users/yoshuawuyts/followers',
 u'following_url': u'https://api.github.com/users/yoshuawuyts/following{/other_user}',
 u'gists_url': u'https://api.github.com/users/yoshuawuyts/gists{/gist_id}',
 u'gravatar_id': u'',
 u'html_url': u'https://github.com/yoshuawuyts',
 u'id': 2467194,
 u'login': u'yoshuawuyts',
 u'organizations_url': u'https://api.github.com/users/yoshuawuyts/orgs',
 u'received_events_url': u'https://api.github.com/users/yoshuawuyts/received_events',
 u'repos_url': u'https://api.github.com/users/yoshuawuyts/repos',
 u'site_admin': False,
 u'starred_url': u'https://api.github.com/users/yoshuawuyts/starred{/owner}{/repo}',
 u'subscriptions_url': u'https://api.github.com/users/yoshuawuyts/subscriptions',
 u'type': u'User',
 u'url': u'https://api.github.com/users/yoshuawuyts'}

In [144]:
pr_payload_pr['head'].iloc[0]

{u'label': u'MattMcFarland:patch-2',
 u'ref': u'patch-2',
 u'repo': {u'archive_url': u'https://api.github.com/repos/MattMcFarland/choo/{archive_format}{/ref}',
  u'assignees_url': u'https://api.github.com/repos/MattMcFarland/choo/assignees{/user}',
  u'blobs_url': u'https://api.github.com/repos/MattMcFarland/choo/git/blobs{/sha}',
  u'branches_url': u'https://api.github.com/repos/MattMcFarland/choo/branches{/branch}',
  u'clone_url': u'https://github.com/MattMcFarland/choo.git',
  u'collaborators_url': u'https://api.github.com/repos/MattMcFarland/choo/collaborators{/collaborator}',
  u'comments_url': u'https://api.github.com/repos/MattMcFarland/choo/comments{/number}',
  u'commits_url': u'https://api.github.com/repos/MattMcFarland/choo/commits{/sha}',
  u'compare_url': u'https://api.github.com/repos/MattMcFarland/choo/compare/{base}...{head}',
  u'contents_url': u'https://api.github.com/repos/MattMcFarland/choo/contents/{+path}',
  u'contributors_url': u'https://api.github.com/repos/Ma

In [140]:
pr_payload_pr['_links'].iloc[0]

{u'comments': {u'href': u'https://api.github.com/repos/yoshuawuyts/choo/issues/196/comments'},
 u'commits': {u'href': u'https://api.github.com/repos/yoshuawuyts/choo/pulls/196/commits'},
 u'html': {u'href': u'https://github.com/yoshuawuyts/choo/pull/196'},
 u'issue': {u'href': u'https://api.github.com/repos/yoshuawuyts/choo/issues/196'},
 u'review_comment': {u'href': u'https://api.github.com/repos/yoshuawuyts/choo/pulls/comments{/number}'},
 u'review_comments': {u'href': u'https://api.github.com/repos/yoshuawuyts/choo/pulls/196/comments'},
 u'self': {u'href': u'https://api.github.com/repos/yoshuawuyts/choo/pulls/196'},
 u'statuses': {u'href': u'https://api.github.com/repos/yoshuawuyts/choo/statuses/6aa88da861783d11efddae59a1cd3fcf7a5f9de7'}}

In [141]:
pr_payload_pr['base'].iloc[0]

{u'label': u'yoshuawuyts:master',
 u'ref': u'master',
 u'repo': {u'archive_url': u'https://api.github.com/repos/yoshuawuyts/choo/{archive_format}{/ref}',
  u'assignees_url': u'https://api.github.com/repos/yoshuawuyts/choo/assignees{/user}',
  u'blobs_url': u'https://api.github.com/repos/yoshuawuyts/choo/git/blobs{/sha}',
  u'branches_url': u'https://api.github.com/repos/yoshuawuyts/choo/branches{/branch}',
  u'clone_url': u'https://github.com/yoshuawuyts/choo.git',
  u'collaborators_url': u'https://api.github.com/repos/yoshuawuyts/choo/collaborators{/collaborator}',
  u'comments_url': u'https://api.github.com/repos/yoshuawuyts/choo/comments{/number}',
  u'commits_url': u'https://api.github.com/repos/yoshuawuyts/choo/commits{/sha}',
  u'compare_url': u'https://api.github.com/repos/yoshuawuyts/choo/compare/{base}...{head}',
  u'contents_url': u'https://api.github.com/repos/yoshuawuyts/choo/contents/{+path}',
  u'contributors_url': u'https://api.github.com/repos/yoshuawuyts/choo/contribut

In [138]:
pr_payload_pr['html_url'].iloc[0]

u'https://github.com/yoshuawuyts/choo/pull/196'

## Create Event
Represents a created repository, branch, or tag.

`
KEY            TYPE    DESCRIPTION
ref_type       string  The object that was created. Can be one of "repository", "branch", or "tag".
ref            string  The git ref (or null if only a repository was created).
master_branch  string  The name of the repository's default branch (usually master).
description    string  The repository's current description.
`

In [37]:
CreateEvent = events[events['type'] == 'CreateEvent']

print CreateEvent.shape

(7752, 7)


In [38]:
CreateEvent['payload'].iloc[0]

{u'description': u'A statuspage generator that lets you host your statuspage for free on Github.',
 u'master_branch': u'master',
 u'pusher_type': u'user',
 u'ref': u'pyup-update-pygithub-1.27.1-to-1.28',
 u'ref_type': u'branch'}

In [39]:
create_table = pd.concat([CreateEvent.drop(['payload'], axis=1), CreateEvent['payload'].apply(pd.Series)], axis=1)
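
`apply(pd.Series)` builds one Series per row, which can become slow on larger frames. A minimal sketch of an alternative, assuming every payload is a flat dict like the one printed above; the `sample` frame and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the CreateEvent frame (illustrative values only).
sample = pd.DataFrame({
    'type': ['CreateEvent', 'CreateEvent'],
    'repo_id': [1, 2],
    'payload': [
        {'ref': 'v1.0', 'ref_type': 'tag', 'master_branch': 'master'},
        {'ref': 'fix/typo', 'ref_type': 'branch', 'master_branch': 'gh-pages'},
    ],
}, index=[106, 107])

# Build all payload columns in one call, reusing the original index so that
# pd.concat lines the rows up correctly.
payload_cols = pd.DataFrame(sample['payload'].tolist(), index=sample.index)

flat = pd.concat([sample.drop(['payload'], axis=1), payload_cols], axis=1)
```

Passing `index=sample.index` matters here: without it the payload columns would get a fresh 0..n-1 index and `pd.concat` would misalign the rows.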

In [40]:
create_table.head()

Unnamed: 0,type,repo_id,user_id,login,created_at,archive_id,description,master_branch,pusher_type,ref,ref_type
106,CreateEvent,52598117,2930472,jayfk,2016-09-11 20:06:45,4546427203,A statuspage generator that lets you host your...,master,user,pyup-update-pygithub-1.27.1-to-1.28,branch
107,CreateEvent,50261548,2743180,sergiokopplin,2016-05-24 16:18:36,4055726574,:ramen: Minimalist Jekyll Template,gh-pages,user,0.4.0,tag
108,CreateEvent,51905436,4154003,dherault,2016-06-01 08:08:32,4086935723,Emulate AWS and API Gateway locally when deve...,master,user,v2.5.1,tag
252,CreateEvent,59716022,633234,dwlfrth,2016-05-28 16:34:01,4074681227,:ramen: Minimalist Jekyll Template,gh-pages,user,fix/email_link,branch
253,CreateEvent,66615835,12601115,egorbenko,2016-08-26 04:34:46,4473514913,a lightweight artificially intelligent multi-e...,master,user,docker,branch


In [41]:
create_table_cols = ['repo_id', 'user_id', 'login', 'created_at', 'archive_id', 'description',
                     'master_branch', 'ref', 'ref_type']

create_table = create_table[create_table_cols]

In [42]:
create_table = create_table.rename(columns={'login': 'user_login', 'created_at':'date_created'})

In [44]:
# drop all duplicates across ALL columns
create_table = create_table.drop_duplicates()

print create_table.shape

(7748, 9)


In [43]:
create_create_table = '''
CREATE TABLE create_events (
    repo_id       NUMERIC      NOT NULL,
    date_created  TEXT         NOT NULL,
    master_branch VARCHAR (50) NOT NULL,
    ref_type      VARCHAR (15) NOT NULL,
    ref           TEXT,
    description   TEXT,
    user_id       NUMERIC      NOT NULL,
    user_login    VARCHAR (40) NOT NULL,
    archive_id    NUMERIC      NOT NULL,
    PRIMARY KEY (
        repo_id,
        user_id,
        archive_id
    )
)
WITHOUT ROWID;
'''

In [45]:
c.execute(create_create_table)
con.commit()

In [46]:
create_table.to_sql(name='create_events', con=con, if_exists='append', index=False)
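
The composite primary key is the reason duplicates are dropped before the insert. A small sketch, using an in-memory database and a cut-down, hypothetical `create_events_demo` table, of how SQLite reacts to a repeated key:

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.execute('''
CREATE TABLE create_events_demo (
    repo_id    NUMERIC NOT NULL,
    user_id    NUMERIC NOT NULL,
    archive_id NUMERIC NOT NULL,
    ref        TEXT,
    PRIMARY KEY (repo_id, user_id, archive_id)
)
WITHOUT ROWID;
''')

row = (52598117, 2930472, 4546427203, 'some-branch')
insert_sql = 'INSERT INTO create_events_demo VALUES (?, ?, ?, ?)'
demo.execute(insert_sql, row)

# Inserting the same key a second time violates the primary key.
try:
    demo.execute(insert_sql, row)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = demo.execute('SELECT COUNT(*) FROM create_events_demo').fetchone()[0]
```

With `if_exists='append'`, a single colliding row would likely make the whole `to_sql` call fail in the same way, so deduplicating first is the safer order.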

## Gollum Event
Triggered when a Wiki page is created or updated.

`
KEY                TYPE    DESCRIPTION
pages              array   The pages that were updated.
pages[][page_name] string  The name of the page.
pages[][title]     string  The current page title.
pages[][action]    string  The action that was performed on the page. Can be "created" or "edited".
pages[][sha]       string  The latest commit SHA of the page.
pages[][html_url]  string  Points to the HTML wiki page.
`

In [23]:
events[events['type'] == 'GollumEvent'].head()

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
110,GollumEvent,"{u'pages': [{u'title': u'Mudana de Regio', u'h...",56832437,10860756,HenryLeonheart,2016-04-25 10:09:09 UTC,3930925431
111,GollumEvent,"{u'pages': [{u'title': u'Home', u'html_url': u...",63730796,20782190,MacDogg,2016-08-02 15:57:48 UTC,4360926496
112,GollumEvent,"{u'pages': [{u'title': u'Part 1 (Decrypt9)', u...",52920387,16979510,Plailect,2016-09-14 12:48:42 UTC,4561726421
113,GollumEvent,"{u'pages': [{u'title': u'Part 3 (RedNAND)', u'...",58546876,12843673,KonsoleHL,2016-06-10 17:21:42 UTC,4131124722
114,GollumEvent,"{u'pages': [{u'title': u'Part 1 (Decrypt9)', u...",52920387,16979510,Plailect,2016-08-27 19:33:32 UTC,4479691144


In [24]:
events['payload'][events['type'] == 'GollumEvent'].iloc[0]

{u'pages': [{u'action': u'edited',
   u'html_url': u'/HenryLeonheart/Guide_Portuguese/wiki/Mudan%C3%A7a-de-Regi%C3%A3o',
   u'page_name': u'Mudana-de-Regio',
   u'sha': u'fe9d278b7d921f92b6f4cfb4321717d23cc51acf',
   u'summary': None,
   u'title': u'Mudana de Regio'}]}
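
No table is built for Gollum events in this notebook. If one were needed, the `pages` array could be flattened to one row per page. The helper below is a hypothetical sketch, not part of the original pipeline; it only uses keys visible in the payload above:

```python
def flatten_gollum_pages(event):
    """Return one flat dict per wiki page touched by a GollumEvent row."""
    rows = []
    for page in event['payload'].get('pages', []):
        rows.append({
            'repo_id': event['repo_id'],
            'user_id': event['user_id'],
            'archive_id': event['archive_id'],
            'action': page.get('action'),
            'page_name': page.get('page_name'),
            'title': page.get('title'),
            'sha': page.get('sha'),
            'html_url': page.get('html_url'),
        })
    return rows


# Row modelled on the first GollumEvent printed above.
example_event = {
    'repo_id': 56832437,
    'user_id': 10860756,
    'archive_id': 3930925431,
    'payload': {'pages': [{
        'action': 'edited',
        'html_url': '/HenryLeonheart/Guide_Portuguese/wiki/Mudan%C3%A7a-de-Regi%C3%A3o',
        'page_name': 'Mudana-de-Regio',
        'sha': 'fe9d278b7d921f92b6f4cfb4321717d23cc51acf',
        'summary': None,
        'title': 'Mudana de Regio',
    }]},
}

page_rows = flatten_gollum_pages(example_event)
```

The resulting list of dicts can be handed to `pd.DataFrame(...)` in the same way as the other event tables.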

## Delete Event
Represents a deleted branch or tag.

`
KEY         TYPE    DESCRIPTION
ref_type    string  The object that was deleted. Can be "branch" or "tag".
ref         string  The full git ref.
`

In [47]:
DeleteEvents = events[events['type'] == 'DeleteEvent']

In [48]:
delete_table = pd.concat([DeleteEvents.drop(['payload'], axis=1), DeleteEvents['payload'].apply(pd.Series)], axis=1)

In [49]:
print delete_table.shape

delete_table.head()

(3745, 9)


Unnamed: 0,type,repo_id,user_id,login,created_at,archive_id,pusher_type,ref,ref_type
109,DeleteEvent,50340470,4172932,KunalKapadia,2016-10-12 11:31:05,4698047202,user,greenkeeper-babel-preset-stage-2-6.17.0,branch
397,DeleteEvent,58482213,2467194,yoshuawuyts,2016-10-28 00:54:28,4779948081,user,rm-hash-match,branch
398,DeleteEvent,67534988,6452882,mcrowson,2016-12-14 14:16:43,5020199929,user,doppins/botocore-equals-1.4.30,branch
542,DeleteEvent,61498284,3019665,jakirkham,2016-11-30 15:11:15,4946666703,user,skip_no_change_upgrade,branch
702,DeleteEvent,70114942,5995907,bebaps,2016-12-09 01:08:08,4993725855,user,gh-pages,branch


In [50]:
delete_table_cols = ['repo_id', 'user_id', 'login', 'created_at', 'archive_id', 'ref', 'ref_type']

delete_table = delete_table[delete_table_cols]

In [51]:
delete_table = delete_table.rename(columns={'created_at': 'date_deleted', 'login':'user_login'})

In [52]:
delete_table = delete_table.drop_duplicates()

print delete_table.shape

(3737, 7)


In [53]:
create_delete_table = '''
CREATE TABLE delete_events (
    repo_id      NUMERIC      NOT NULL,
    date_deleted TEXT         NOT NULL,
    ref_type     VARCHAR (15) NOT NULL,
    ref          TEXT,
    user_id      NUMERIC      NOT NULL,
    user_login   VARCHAR (40) NOT NULL,
    archive_id   NUMERIC      NOT NULL,
    PRIMARY KEY (
        repo_id,
        user_id,
        date_deleted,
        ref
    )
)
WITHOUT ROWID;
'''

In [54]:
c.execute(create_delete_table)
con.commit()

In [55]:
delete_table.to_sql(name='delete_events', con=con, if_exists='append', index=False)
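
Note that `drop_duplicates()` compares all columns, while the primary key of `delete_events` covers only four of them. A hedged sketch of a pre-insert check on just the key columns, using made-up rows:

```python
import pandas as pd

key_cols = ['repo_id', 'user_id', 'date_deleted', 'ref']

# Illustrative rows only: the first two share a key but differ in archive_id.
demo = pd.DataFrame({
    'repo_id': [1, 1, 2],
    'user_id': [10, 10, 20],
    'date_deleted': ['2016-10-12 11:31:05', '2016-10-12 11:31:05',
                     '2016-12-09 01:08:08'],
    'ref': ['feature-x', 'feature-x', 'gh-pages'],
    'archive_id': [100, 101, 102],
})

# Full-row deduplication keeps all three rows because archive_id differs.
full_row_count = len(demo.drop_duplicates())

# Rows that would collide on the primary key (both members of each pair).
collisions = demo[demo.duplicated(subset=key_cols, keep=False)]

# One possible resolution: keep the first row per key.
deduped = demo.drop_duplicates(subset=key_cols, keep='first')
```

Whether keeping the first row is appropriate depends on the data; inspecting `collisions` before deciding is the cautious route.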

## Pull Request Review Comment Event
Triggered when a comment on a pull request's unified diff is created, edited, or deleted (in the Files Changed tab).

`
KEY                  TYPE    DESCRIPTION
action               string  The action that was performed on the comment. Can be one of "created", "edited", or
                             "deleted".
changes              object  The changes to the comment if the action was "edited".
changes[body][from]  string  The previous version of the body if the action was "edited".
pull_request         object  The pull request the comment belongs to.
comment              object  The comment itself.
`

In [28]:
events[events['type'] == 'PullRequestReviewCommentEvent'].head()

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
157,PullRequestReviewCommentEvent,"{u'action': u'created', u'comment': {u'origina...",51769689,170270,sindresorhus,2016-06-17 23:58:49 UTC,4163151005
158,PullRequestReviewCommentEvent,"{u'action': u'created', u'comment': {u'origina...",51769689,737065,paulmolluzzo,2016-04-05 01:03:36 UTC,3845956436
290,PullRequestReviewCommentEvent,"{u'action': u'created', u'comment': {u'origina...",51769689,737065,paulmolluzzo,2016-05-14 12:49:23 UTC,4014735110
441,PullRequestReviewCommentEvent,"{u'action': u'created', u'comment': {u'origina...",51769689,228037,blackjid,2016-03-30 22:50:31 UTC,3827449219
593,PullRequestReviewCommentEvent,"{u'action': u'created', u'comment': {u'origina...",50063252,399441,mathom,2016-07-06 17:02:57 UTC,4241678868


In [29]:
events['payload'][events['type'] == 'PullRequestReviewCommentEvent'].iloc[0]

{u'action': u'created',
 u'comment': {u'_links': {u'html': {u'href': u'https://github.com/sindresorhus/refined-github/pull/251#discussion_r67589649'},
   u'pull_request': {u'href': u'https://api.github.com/repos/sindresorhus/refined-github/pulls/251'},
   u'self': {u'href': u'https://api.github.com/repos/sindresorhus/refined-github/pulls/comments/67589649'}},
  u'body': u'Oh, ok. Just never seen `createContextualFragment` used before.\r\n\r\nI think I would rather do:\r\n\r\n```js\r\n(new DOMParser()).parseFromString(\'<!DOCTYPE html><html lang="en"><body><p></p></body></html>\', \'text/html\');\r\n```',
  u'commit_id': u'a4af7b1c41e6850a1d0ae86e471840a3d28341b4',
  u'created_at': u'2016-06-17T23:58:49Z',
  u'diff_hunk': u'@@ -0,0 +1,56 @@\n+window.showRealNames = () => {\n+\tconst storageKey = \'cachedNames\';\n+\n+\tconst getCachedUsers = cb => {\n+\t\tchrome.storage.local.get(storageKey, data => cb(data[storageKey]));\n+\t};\n+\n+\tconst updateCachedUsers = users => {\n+\t\tchrome.s
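
The payload above is truncated, and no table is created for this event type here. As a sketch only, the fields that are visible could be pulled out as follows; the function name and output column names are assumptions, and `.get` is used so that missing keys yield `None`:

```python
def summarize_review_comment(payload):
    """Extract the fields visible in the printed payload into a flat dict."""
    comment = payload.get('comment', {})
    links = comment.get('_links', {})
    return {
        'action': payload.get('action'),
        'commit_id': comment.get('commit_id'),
        'date_created': comment.get('created_at'),
        'body': comment.get('body'),
        'diff_hunk': comment.get('diff_hunk'),
        'url': links.get('html', {}).get('href'),
    }


# Shortened stand-in for the payload printed above (body text abbreviated).
example_payload = {
    'action': 'created',
    'comment': {
        '_links': {'html': {'href': 'https://github.com/sindresorhus/refined-github/pull/251#discussion_r67589649'}},
        'body': 'Oh, ok.',
        'commit_id': 'a4af7b1c41e6850a1d0ae86e471840a3d28341b4',
        'created_at': '2016-06-17T23:58:49Z',
    },
}

summary = summarize_review_comment(example_payload)
```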

## Release Event
Triggered when a release is published.

`
KEY      TYPE    DESCRIPTION
action   string  The action that was performed. Currently, can only be "published".
release  object  The release itself.
`

In [73]:
# check the structure of Release Event
ReleaseEvent = events[events['type'] == 'ReleaseEvent']

print ReleaseEvent.shape

ReleaseEvent.head()

(1320, 7)


Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
414,ReleaseEvent,"{u'action': u'published', u'release': {u'body'...",70534107,7341604,rcmalli,2016-10-15 09:41:26,4715111891
415,ReleaseEvent,"{u'action': u'published', u'release': {u'body'...",48933503,1820165,afollestad,2016-02-29 20:12:02,3703380867
711,ReleaseEvent,"{u'action': u'published', u'release': {u'body'...",54994103,5382443,FezVrasta,2016-03-29 21:37:19,3821753509
893,ReleaseEvent,"{u'action': u'published', u'release': {u'body'...",53632140,5513065,zyedidia,2016-12-13 00:02:52,5009951237
1043,ReleaseEvent,"{u'action': u'published', u'release': {u'body'...",53632140,5513065,zyedidia,2016-11-23 00:02:28,4910890860


In [74]:
RE_payload = pd.DataFrame(ReleaseEvent['payload'].tolist())

In [75]:
RE_payload_release = pd.DataFrame(RE_payload['release'].tolist())

In [76]:
pd.DataFrame(RE_payload_release['assets'].tolist())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,{u'uploader': {u'following_url': u'https://api...,{u'uploader': {u'following_url': u'https://api...,,,,,,,,,,
1,{u'uploader': {u'following_url': u'https://api...,,,,,,,,,,,
2,,,,,,,,,,,,
3,,,,,,,,,,,,
4,,,,,,,,,,,,
5,,,,,,,,,,,,
6,,,,,,,,,,,,
7,,,,,,,,,,,,
8,,,,,,,,,,,,
9,,,,,,,,,,,,


In [77]:
RE_payload_release['draft'].unique()

array([False], dtype=object)

In [78]:
RL = []

for index, row in ReleaseEvent.iterrows():
    
    payload = row['payload']
    release = payload['release']
    
    RL.append({'repo_id': row['repo_id'], 'user_id': row['user_id'], 
               'user_login': row['login'], 'date_published': pd.to_datetime(row['created_at']),
               'archive_id': row['archive_id'], 
               'description': release['body'], 'date_created': pd.to_datetime(release['created_at']),
               'url': release['html_url'], 'release_id': pd.to_numeric(release['id']), 'name': release['name'],
               'prerelease': release['prerelease'], 'tag_name': release['tag_name'],
               'target_commitish': release['target_commitish']})

In [79]:
release_table = pd.DataFrame(RL)

release_table.head()

Unnamed: 0,archive_id,date_created,date_published,description,name,prerelease,release_id,repo_id,tag_name,target_commitish,url,user_id,user_login
0,4715111891,2016-10-15 09:33:56,2016-10-15 09:41:26,,v0.4,False,4397054,70534107,v0.4,squeezenet,https://github.com/rcmalli/deep-learning-model...,7341604,rcmalli
1,3703380867,2016-02-29 20:06:02,2016-02-29 20:12:02,"1. Some internal library updates, including Ap...",1.3.0,False,2713152,48933503,1.3.0,master,https://github.com/afollestad/polar-dashboard/...,1820165,afollestad
2,3821753509,2016-03-29 21:24:40,2016-03-29 21:37:19,This is the first stable release of Near.js in...,Near.js,False,2910797,54994103,v0.1.0,master,https://github.com/FezVrasta/popper.js/release...,5382443,FezVrasta
3,5009951237,2016-12-11 21:43:07,2016-12-13 00:02:52,Autogenerated nightly build of micro,Nightly build,True,4900653,53632140,nightly,master,https://github.com/zyedidia/micro/releases/tag...,5513065,zyedidia
4,4910890860,2016-11-20 16:07:04,2016-11-23 00:02:28,Autogenerated nightly build of micro,Nightly build,True,4736225,53632140,nightly,master,https://github.com/zyedidia/micro/releases/tag...,5513065,zyedidia


In [80]:
# drop duplicates across ALL columns
release_table = release_table.drop_duplicates()

release_table.shape

(1318, 13)

In [83]:
create_release_table = '''
CREATE TABLE release_events (
    release_id       NUMERIC      NOT NULL,
    repo_id          NUMERIC      NOT NULL,
    date_created     TEXT         NOT NULL,
    date_published   TEXT         NOT NULL,
    name             VARCHAR (50),
    description      TEXT,
    prerelease       INT          NOT NULL,
    target_commitish VARCHAR (50) NOT NULL,
    tag_name         VARCHAR (50) NOT NULL,
    url              TEXT         NOT NULL,
    user_id          NUMERIC      NOT NULL,
    user_login       VARCHAR (40) NOT NULL,
    archive_id       NUMERIC      NOT NULL,
    PRIMARY KEY (
        release_id,
        repo_id,
        date_created,
        date_published
    )
)
WITHOUT ROWID;
'''

In [84]:
c.execute(create_release_table)
con.commit()

In [85]:
release_table.to_sql(name='release_events', con=con, if_exists='append', index=False)
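
The `prerelease` column is declared `INT` because SQLite has no separate boolean type. A small sketch with an in-memory database and a hypothetical two-column `release_demo` table shows how Python booleans come back:

```python
import sqlite3

demo = sqlite3.connect(':memory:')
demo.execute(
    'CREATE TABLE release_demo (tag_name TEXT NOT NULL, prerelease INT NOT NULL)'
)
demo.executemany(
    'INSERT INTO release_demo VALUES (?, ?)',
    [('v0.4', False), ('nightly', True)],
)

# Booleans are stored as the integers 0 and 1.
stored = demo.execute(
    'SELECT tag_name, prerelease, typeof(prerelease) '
    'FROM release_demo ORDER BY tag_name'
).fetchall()

# Convert back to bool when filtering on the Python side.
prereleases = [tag for tag, flag, _ in stored if bool(flag)]
```

In SQL the same filter is simply `WHERE prerelease = 1`.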

## Commit Comment Event
Triggered when a commit comment is created.

`
KEY      TYPE    DESCRIPTION
comment  object  The comment itself.
`

In [32]:
events[events['type'] == 'CommitCommentEvent'].head()

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
156,CommitCommentEvent,{u'comment': {u'commit_id': u'6eeebef7153392f7...,51905436,4699,demetriusnunes,2016-10-17 18:50:26 UTC,4722793643
921,CommitCommentEvent,{u'comment': {u'commit_id': u'ca1dfd904d9a7a5c...,61412022,3254314,mathieudutour,2016-07-17 23:19:46 UTC,4289198735
1622,CommitCommentEvent,{u'comment': {u'commit_id': u'7460f8c8019b4721...,54027312,2439146,patchthecode,2016-12-24 20:17:31 UTC,5068635588
2307,CommitCommentEvent,{u'comment': {u'commit_id': u'52201980c02160e1...,63730796,20220872,harshmasters07,2016-08-05 12:25:31 UTC,4377838170
2452,CommitCommentEvent,{u'comment': {u'commit_id': u'00a626ec9a21c0d4...,51905436,4154003,dherault,2016-03-15 22:53:15 UTC,3767542773


In [33]:
events['payload'][events['type'] == 'CommitCommentEvent'].iloc[0]

{u'comment': {u'body': u'Now `this` is undefined, so `this.serverlessLog` throws.',
  u'commit_id': u'6eeebef7153392f714e018052f6a76c8b80bab48',
  u'created_at': u'2016-10-17T18:50:26Z',
  u'html_url': u'https://github.com/dherault/serverless-offline/commit/6eeebef7153392f714e018052f6a76c8b80bab48#commitcomment-19456791',
  u'id': 19456791,
  u'line': 326,
  u'path': u'src/index.js',
  u'position': 67,
  u'updated_at': u'2016-10-17T18:50:26Z',
  u'url': u'https://api.github.com/repos/dherault/serverless-offline/comments/19456791',
  u'user': {u'avatar_url': u'https://avatars.githubusercontent.com/u/4699?v=3',
   u'events_url': u'https://api.github.com/users/demetriusnunes/events{/privacy}',
   u'followers_url': u'https://api.github.com/users/demetriusnunes/followers',
   u'following_url': u'https://api.github.com/users/demetriusnunes/following{/other_user}',
   u'gists_url': u'https://api.github.com/users/demetriusnunes/gists{/gist_id}',
   u'gravatar_id': u'',
   u'html_url': u'https:

## Member Event
Triggered when a user is added or removed as a collaborator to a repository, or has their permissions changed.  

`
KEY                           TYPE    DESCRIPTION
member                        object  The user that was added.
action                        string  The action that was performed. Can be one of "added", "deleted", or "edited".
changes                       object  The changes to the collaborator permissions if the action was "edited".
changes[old_permission][from] string  The previous permissions of the collaborator if the action was "edited".
`

In [86]:
MemberEvent = events[events['type'] == 'MemberEvent']

In [87]:
MemberEvent.head(n=1)

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
1218,MemberEvent,{u'member': {u'following_url': u'https://api.g...,59815426,11867815,majia67,2016-08-05 14:45:16,4378518804


In [88]:
MemberEvent['payload'].iloc[0]

{u'action': u'added',
 u'member': {u'avatar_url': u'https://avatars.githubusercontent.com/u/8280291?v=3',
  u'events_url': u'https://api.github.com/users/easyworld/events{/privacy}',
  u'followers_url': u'https://api.github.com/users/easyworld/followers',
  u'following_url': u'https://api.github.com/users/easyworld/following{/other_user}',
  u'gists_url': u'https://api.github.com/users/easyworld/gists{/gist_id}',
  u'gravatar_id': u'',
  u'html_url': u'https://github.com/easyworld',
  u'id': 8280291,
  u'login': u'easyworld',
  u'organizations_url': u'https://api.github.com/users/easyworld/orgs',
  u'received_events_url': u'https://api.github.com/users/easyworld/received_events',
  u'repos_url': u'https://api.github.com/users/easyworld/repos',
  u'site_admin': False,
  u'starred_url': u'https://api.github.com/users/easyworld/starred{/owner}{/repo}',
  u'subscriptions_url': u'https://api.github.com/users/easyworld/subscriptions',
  u'type': u'User',
  u'url': u'https://api.github.com/us

In [89]:
# check which actions occur: only "added" appears in this data
MemberEvent_payload = pd.DataFrame(MemberEvent['payload'].tolist())

MemberEvent_payload['action'].unique()

array([u'added'], dtype=object)

In [90]:
ML = []

for index, row in MemberEvent.iterrows():
    
    payload = row['payload']
    member = payload['member']

    ML.append({'repo_id': row['repo_id'], 'by_user_id': row['user_id'],
               'by_user_login': row['login'], 'action_date': row['created_at'],
               'archive_id': row['archive_id'], 'action':payload['action'], 
               'at_user_id': pd.to_numeric(member['id']), 'at_user_login': member['login']})

In [91]:
member_table = pd.DataFrame(ML)

In [92]:
member_table.head()

Unnamed: 0,action,action_date,archive_id,at_user_id,at_user_login,by_user_id,by_user_login,repo_id
0,added,2016-08-05 14:45:16,4378518804,8280291,easyworld,11867815,majia67,59815426
1,added,2016-08-29 00:27:01,4481744866,1712363,keyphact,4643257,AeonLucid,63506379
2,added,2016-11-27 06:38:03,4928235573,11958359,AkiraLaine,8327811,SimulatedGREG,58616946
3,added,2016-04-05 10:05:28,3847383026,18283204,andynoack,18219846,jopohl,55258005
4,added,2016-09-02 19:23:32,4510275804,20820195,wagamamaz,10713581,zsdonghao,60626727


In [93]:
create_members_table = '''
CREATE TABLE member_events (
    repo_id       NUMERIC      NOT NULL,
    action_date   TEXT         NOT NULL,
    [action]      VARCHAR (15) NOT NULL,
    at_user_id    NUMERIC      NOT NULL,
    at_user_login VARCHAR (40) NOT NULL,
    by_user_id    NUMERIC      NOT NULL,
    by_user_login VARCHAR (40) NOT NULL,
    archive_id    NUMERIC      NOT NULL,
    PRIMARY KEY (
        repo_id,
        action_date,
        at_user_id
    )
)
WITHOUT ROWID;
'''

In [94]:
c.execute(create_members_table)
con.commit()

In [104]:
member_table.to_sql(name='member_events', con=con, if_exists='append', index=False)
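
SQLite does not validate declared type names. It derives a column's affinity from substrings of the name, so a misspelled type is accepted silently and falls back to NUMERIC affinity. A minimal sketch, using a throwaway `affinity_demo` table, of how that can change what is stored:

```python
import sqlite3

demo = sqlite3.connect(':memory:')

# 'VARCHAR' contains 'CHAR' and gets TEXT affinity; the misspelled
# 'VARCAHR' matches no known substring and gets NUMERIC affinity.
demo.execute('CREATE TABLE affinity_demo (good VARCHAR (15), typo VARCAHR (15))')
demo.execute('INSERT INTO affinity_demo VALUES (?, ?)', ('123', '123'))

good_type, typo_type = demo.execute(
    'SELECT typeof(good), typeof(typo) FROM affinity_demo'
).fetchone()
```

For a column such as `action`, which only holds words like "added", the difference is harmless, but a numeric-looking string in such a column would be converted to a number.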

## Public Event
Triggered when a private repository is open sourced.

In [96]:
events[events['type'] == 'PublicEvent'].head()

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
2915,PublicEvent,{},67749622,2577440,tqchen,2016-09-30 04:17:37,4641374648
14965,PublicEvent,{},61355800,8341653,daijifeng001,2016-06-21 07:33:32,4172657198
24271,PublicEvent,{},57929326,1612723,emeryberger,2016-05-16 00:21:35,4017077526
33191,PublicEvent,{},64519183,6261322,logaretm,2016-08-15 03:35:31,4417258560
34361,PublicEvent,{},54168759,1480321,SergioBenitez,2016-12-23 15:38:42,5066128048


In [97]:
public_table = events[['repo_id', 'user_id', 'login', 'created_at', 'archive_id']][events['type'] == 'PublicEvent']

In [98]:
public_table = public_table.rename(columns={'login':'user_login', 'created_at': 'date_open_sourced'})
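
The selection and rename above could also be written as a single `.loc` step, which picks rows and columns together instead of chaining two indexing operations. A sketch on a made-up `demo_events` frame:

```python
import pandas as pd

# Illustrative frame; only the PublicEvent rows mirror values shown above.
demo_events = pd.DataFrame({
    'type': ['PublicEvent', 'WatchEvent', 'PublicEvent'],
    'payload': [{}, {}, {}],
    'repo_id': [67749622, 1, 61355800],
    'user_id': [2577440, 2, 8341653],
    'login': ['tqchen', 'someone', 'daijifeng001'],
    'created_at': ['2016-09-30 04:17:37', '2016-01-01 00:00:00',
                   '2016-06-21 07:33:32'],
    'archive_id': [4641374648, 3, 4172657198],
})

cols = ['repo_id', 'user_id', 'login', 'created_at', 'archive_id']

# Row mask and column list in one .loc call, then the rename.
public_demo = (
    demo_events
    .loc[demo_events['type'] == 'PublicEvent', cols]
    .rename(columns={'login': 'user_login', 'created_at': 'date_open_sourced'})
)
```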

In [101]:
create_public_table = '''
CREATE TABLE public_events (
    repo_id           NUMERIC      NOT NULL,
    date_open_sourced TEXT         NOT NULL,
    user_id           NUMERIC      NOT NULL,
    user_login        VARCHAR (40) NOT NULL,
    archive_id        NUMERIC      NOT NULL,
    PRIMARY KEY (
        repo_id,
        user_id,
        date_open_sourced
    )
)
WITHOUT ROWID;
'''

In [102]:
c.execute(create_public_table)
con.commit()

In [103]:
public_table.to_sql(name='public_events', con=con, if_exists='append', index=False)

## Resources
- https://apple.stackexchange.com/questions/80611/merging-multiple-csv-files-without-merging-the-header
- https://stackoverflow.com/questions/16890582/unixmerge-multiple-csv-files-with-same-header-by-keeping-the-header-of-the-firs
- https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte
- https://stackoverflow.com/questions/988228/convert-a-string-representation-of-a-dictionary-to-a-dictionary
- https://stackoverflow.com/questions/29712962/how-can-i-convert-string-to-dict-or-list
- https://stackoverflow.com/questions/36606930/delete-an-element-in-a-json-object
- https://stackoverflow.com/questions/21745213/changed-github-password-no-longer-able-to-push-back-to-the-remote
- https://stackoverflow.com/questions/14661701/how-to-drop-a-list-of-rows-from-pandas-dataframe
- https://stackoverflow.com/questions/25494182/print-not-showing-in-ipython-notebook-python