Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Newer
Older
100644 71 lines (57 sloc) 2.407 kB
c6cee95 @igrigorik sync script + simple readme
authored
1 # Google BigQuery + Github Archive
2
3 [Google BigQuery](https://developers.google.com/bigquery/) is a web service that lets you do interactive analysis of massive datasets—up to billions of rows.
4
5 The Github Activity stream is automatically uploaded to BigQuery sevice to enable interactive analysis.
6
7 ## Sample Queries
8
9 ```sql
cd8003a @igrigorik moar better query examples for BigQuery
authored
10 /* distribution of different events on GitHub */
11 SELECT type, count(type) as cnt
12 FROM [github.events]
13 GROUP BY type
14 ORDER BY cnt DESC
15
16 /* distribution of different events on GitHub for Ruby */
17 SELECT type, count(type) as cnt
18 FROM [github.events]
19 WHERE repository_language="Ruby"
20 GROUP BY type
21 ORDER BY cnt DESC
22
23 /* watches for a specific language + date range */
24 SELECT repository_name, count(repository_name) as watches, repository_description, repository_url
25 FROM github.events
26 WHERE type="WatchEvent"
27 AND repository_language="Ruby"
28 AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
29 GROUP BY repository_name, repository_description, repository_url
30 ORDER BY watches DESC
31
32 /* top 100 repos for Ruby by number of pushes */
33 SELECT repository_name, count(repository_name) as pushes, repository_description, repository_url
34 FROM github.events
35 WHERE type="PushEvent"
36 AND repository_language="Ruby"
37 AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
38 GROUP BY repository_name, repository_description, repository_url
39 ORDER BY watches DESC
40 LIMIT 100
41
42 /* push events by language */
43 SELECT repository_language, count(repository_language) as pushes
44 FROM github.events
45 WHERE type="PushEvent"
46 AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
47 GROUP BY repository_language
48 ORDER BY pushes DESC
49
50 /* show recent push events for Go, sorted by time */
51 SELECT repository_name, repository_watchers, url, PARSE_UTC_USEC(created_at) as date
52 FROM github.events
53 WHERE type="PushEvent"
54 AND repository_language="Go"
55 AND repository_watchers > 1
56 AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
57 ORDER BY date DESC
54d5e35 @igrigorik update readme's & web
authored
58 ```
c6cee95 @igrigorik sync script + simple readme
authored
59
54d5e35 @igrigorik update readme's & web
authored
60 For full schema of available fields to select, order, and group by, see schema.js.
61
62 ## Manually loading the data
63
64 If you want to load the archive data into your own BigQuery project:
65
66 ```bash
67 $> wget http://data.githubarchive.org/2012-03-11-15.json.gz
68 $> ruby transform.rb -i 2012-03-11-15.json.gz
69 $> python bq.py --apilog true load github.events 2012-03-11-15.json.gz-out.csv.gz schema.js
70 ```
Something went wrong with that request. Please try again.