# Processing GitHub Events Taken From BigQuery

After creating a BigQuery account, the following SQL query was used to extract the events from 2016 through May 19, 2017 for all 99 repos and their forks taken from SamplingProcess.ipynb. The query was run on May 20, 2017 at 5:02 PM PST.

`SELECT type,
       payload,
       repo.id AS repo_id,
       actor.id AS user_id,
       actor.login AS login,
       created_at,
       id AS archive_id
  FROM [githubarchive]
 WHERE repo.id IN (
           SELECT repo_id
             FROM [repos_table]
       );`
       
The results were exported to six files: 2016_events.gz, 2017_01_events.gz, 2017_02_events.gz, 2017_03_events.gz, 2017_04_events.gz, and 2017_05_events.gz. Rather than reading each file in separately, the files were combined into one with the following shell command.

`cat *.gz > events.gz`

These files can be downloaded from the link in the README.
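An alternative to concatenating with `cat` is to read each export separately and combine the results in pandas, which avoids the repeated header rows altogether. The sketch below is not what this notebook ran: `read_event_files` is a hypothetical helper, and it assumes every export is a gzip-compressed CSV with the same columns.

```python
import pandas as pd


def read_event_files(paths):
    """Read each gzip-compressed CSV export and stack them into one DataFrame."""
    frames = [pd.read_csv(path, compression='gzip') for path in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)
```

For example, `read_event_files(sorted(glob.glob('20*_events.gz')))` would pick up the six exports listed above, assuming they sit in the working directory and `glob` has been imported.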

In [2]:
import pandas as pd
from github import Github
import sqlite3 as sql
import json
import sys

# Python 2 only: make UTF-8 the default string encoding
reload(sys)
sys.setdefaultencoding('utf8')

## Reading in the combined file

NOTE: Because the files were concatenated, the header row of every file after the first appears as a data row in the combined file. These rows need to be removed.

In [3]:
events = pd.read_table('events.gz', sep=',', engine='c')

In [4]:
# find the index labels of the repeated header rows
rm_indices = events[events['type'] == 'type'].index

# drop those rows by label
events = events.drop(rm_indices)
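The header-removal step above can also be written with a boolean mask, which avoids handling index labels directly. This is a minimal sketch on a made-up toy frame. `drop_repeated_headers` is a hypothetical helper, and it assumes a repeated header row is any row whose `type` value is the literal string `type`.

```python
import pandas as pd


def drop_repeated_headers(df, column='type'):
    """Drop rows where `column` holds its own name, i.e. a repeated header."""
    is_header = df[column] == column
    # keep the non-header rows and renumber them from 0
    return df[~is_header].reset_index(drop=True)


# toy data: the middle row is a header that was concatenated into the body
toy = pd.DataFrame({
    'type': ['WatchEvent', 'type', 'PushEvent'],
    'repo_id': ['1', 'repo_id', '2'],
})
cleaned = drop_repeated_headers(toy)
```

Columns that contained header text may have been read as strings, so numeric columns such as `repo_id` may need converting afterwards, for example with `pd.to_numeric`.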

In [5]:
# drop non-ASCII characters from payload
events['payload'] = events['payload'].str.encode('ascii', 'ignore')

# parse the JSON string in payload into an actual dict
events['payload'] = events['payload'].apply(json.loads)
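To illustrate the conversion above, the following sketch parses a column of JSON strings into dictionaries. The sample payloads are made up for illustration. `parse_payload` is a hypothetical helper that returns an empty dict for missing values instead of raising, which is an assumption and not something this notebook does.

```python
import json

import pandas as pd


def parse_payload(raw):
    """Parse one JSON string into a dict; treat missing values as empty."""
    if pd.isnull(raw):
        return {}
    return json.loads(raw)


# made-up payloads: a populated one, an empty one, and a missing one
sample = pd.Series(['{"action": "started"}', '{}', None])
parsed = sample.apply(parse_payload)
```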

In [9]:
events['type'].value_counts()

WatchEvent                       232086
PushEvent                         81444
IssueCommentEvent                 70140
ForkEvent                         29973
IssuesEvent                       27845
PullRequestEvent                  15298
CreateEvent                        7778
GollumEvent                        4238
DeleteEvent                        3748
PullRequestReviewCommentEvent      3505
ReleaseEvent                       1329
CommitCommentEvent                  840
MemberEvent                         248
PublicEvent                          23
Name: type, dtype: int64

In [19]:
events[events['type'] == 'PublicEvent']

Unnamed: 0,type,payload,repo_id,user_id,login,created_at,archive_id
2915,PublicEvent,{},67749622,2577440,tqchen,2016-09-30 04:17:37 UTC,4641374648
14965,PublicEvent,{},61355800,8341653,daijifeng001,2016-06-21 07:33:32 UTC,4172657198
24271,PublicEvent,{},57929326,1612723,emeryberger,2016-05-16 00:21:35 UTC,4017077526
33191,PublicEvent,{},64519183,6261322,logaretm,2016-08-15 03:35:31 UTC,4417258560
34361,PublicEvent,{},54168759,1480321,SergioBenitez,2016-12-23 15:38:42 UTC,5066128048
65370,PublicEvent,{},63221595,2718714,jcjohnson,2016-07-14 07:16:31 UTC,4276405905
74073,PublicEvent,{},64519183,6261322,logaretm,2016-08-07 18:38:24 UTC,4383764773
102306,PublicEvent,{},56734422,1477672,ro31337,2016-12-23 23:50:41 UTC,5067399987
121268,PublicEvent,{},59029620,450140,ivpusic,2016-05-18 08:04:41 UTC,4028777168
146726,PublicEvent,{},61412022,311752,ianstormtaylor,2016-07-13 20:51:52 UTC,4274547126


## Process the different kinds of events
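One possible way to start handling each kind of event separately is to split the frame by its `type` column. This is only a sketch on made-up rows, not this notebook's implementation, and `split_by_type` is a hypothetical helper.

```python
import pandas as pd


def split_by_type(df):
    """Return a dict mapping each event type to the rows of that type."""
    return {name: group.reset_index(drop=True)
            for name, group in df.groupby('type')}


# made-up rows for illustration
toy_events = pd.DataFrame({
    'type': ['WatchEvent', 'PushEvent', 'WatchEvent'],
    'repo_id': [1, 2, 1],
})
by_type = split_by_type(toy_events)
```

Each value in `by_type` is an ordinary DataFrame, so type-specific processing, such as reading fields out of `payload`, can be applied to one event type at a time.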

## Resources
- https://apple.stackexchange.com/questions/80611/merging-multiple-csv-files-without-merging-the-header
- https://stackoverflow.com/questions/16890582/unixmerge-multiple-csv-files-with-same-header-by-keeping-the-header-of-the-firs
- https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte
- https://stackoverflow.com/questions/988228/convert-a-string-representation-of-a-dictionary-to-a-dictionary
- https://stackoverflow.com/questions/29712962/how-can-i-convert-string-to-dict-or-list
- https://stackoverflow.com/questions/36606930/delete-an-element-in-a-json-object
- https://stackoverflow.com/questions/21745213/changed-github-password-no-longer-able-to-push-back-to-the-remote
- https://stackoverflow.com/questions/14661701/how-to-drop-a-list-of-rows-from-pandas-dataframe