Merge branch 'bigquery'

commit 0b591f3270351085cd8ecc82cf875aeef740725d 2 parents 3d86cdb + 37fe903
@igrigorik authored
21 README.md
@@ -51,6 +51,27 @@ Yajl::Parser.parse(js) do |event|
print event
end
```
+__Note: [example script to import data into SQLite db](https://gist.github.com/2426614)__
+
+----
+
+The GitHub Archive dataset is also available via [Google BigQuery](https://developers.google.com/bigquery/). The JSON data is [normalized](https://github.com/igrigorik/githubarchive.org/blob/master/bigquery/schema.js) and updated every hour, allowing you to run [arbitrary queries](https://developers.google.com/bigquery/docs/query-reference) and analysis over the entire dataset in seconds. To get started, log in to the BigQuery console (bigquery.cloud.google.com) and add the project (name: "*githubarchive*"):
+
+![BigQuery](http://www.githubarchive.org/assets/img/bigquery-directions.png)
+
+An example query (for more, see the [repository readme](https://github.com/igrigorik/githubarchive.org/tree/master/bigquery)):
+
+```sql
+/* top 100 repos for Ruby by number of pushes */
+SELECT repository_name, count(repository_name) as pushes, repository_description, repository_url
+FROM github.timeline
+WHERE type="PushEvent"
+ AND repository_language="Ruby"
+ AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
+GROUP BY repository_name, repository_description, repository_url
+ORDER BY pushes DESC
+LIMIT 100
+```
### License
28 bigquery/README.md
@@ -2,27 +2,29 @@
[Google BigQuery](https://developers.google.com/bigquery/) is a web service that lets you do interactive analysis of massive datasets—up to billions of rows.
-The Github Activity stream is automatically uploaded to BigQuery sevice to enable interactive analysis.
+The GitHub activity stream is automatically uploaded to the BigQuery service to enable interactive analysis. Follow the [instructions to access the dataset](http://www.githubarchive.org/).
## Sample Queries
+Have a clever query you would like to share? Fork the project, add your query as **queries/name.sql**, and send a pull request!
+
```sql
/* distribution of different events on GitHub */
SELECT type, count(type) as cnt
-FROM [github.events]
+FROM [github.timeline]
GROUP BY type
ORDER BY cnt DESC
/* distribution of different events on GitHub for Ruby */
SELECT type, count(type) as cnt
-FROM [github.events]
+FROM [github.timeline]
WHERE repository_language="Ruby"
GROUP BY type
ORDER BY cnt DESC
/* watches for a specific language + date range */
SELECT repository_name, count(repository_name) as watches, repository_description, repository_url
-FROM github.events
+FROM github.timeline
WHERE type="WatchEvent"
AND repository_language="Ruby"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
@@ -31,17 +33,17 @@ ORDER BY watches DESC
/* top 100 repos for Ruby by number of pushes */
SELECT repository_name, count(repository_name) as pushes, repository_description, repository_url
-FROM github.events
+FROM github.timeline
WHERE type="PushEvent"
AND repository_language="Ruby"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
GROUP BY repository_name, repository_description, repository_url
-ORDER BY watches DESC
+ORDER BY pushes DESC
LIMIT 100
/* push events by language */
SELECT repository_language, count(repository_language) as pushes
-FROM github.events
+FROM github.timeline
WHERE type="PushEvent"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
GROUP BY repository_language
@@ -49,7 +51,7 @@ ORDER BY pushes DESC
/* show recent push events for Go, sorted by time */
SELECT repository_name, repository_watchers, url, PARSE_UTC_USEC(created_at) as date
-FROM github.events
+FROM github.timeline
WHERE type="PushEvent"
AND repository_language="Go"
AND repository_watchers > 1
@@ -58,13 +60,3 @@ ORDER BY date DESC
```
For full schema of available fields to select, order, and group by, see schema.js.
-
-## Manually loading the data
-
-If you want to load the archive data into your own BigQuery project:
-
-```bash
-$> wget http://data.githubarchive.org/2012-03-11-15.json.gz
-$> ruby transform.rb -i 2012-03-11-15.json.gz
-$> python bq.py --apilog true load github.events 2012-03-11-15.json.gz-out.csv.gz schema.js
-```
8 bigquery/queries/top_watches_by_language.sql
@@ -0,0 +1,8 @@
+/* watches for a specific language + date range */
+SELECT repository_name, count(repository_name) as watches, repository_description, repository_url
+FROM github.timeline
+WHERE type="WatchEvent"
+ AND repository_language="Ruby"
+ AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('2012-04-01 00:00:00')
+GROUP BY repository_name, repository_description, repository_url
+ORDER BY watches DESC
BIN  web/assets/img/bigquery-directions.png
24 web/index.html
@@ -19,6 +19,10 @@
padding-top: 20px;
padding-bottom: 40px;
}
+ .large {
+ font-size:16px;
+ line-height:25px;
+ }
</style>
<!-- Le HTML5 shim, for IE6-8 support of HTML5 elements -->
@@ -60,7 +64,7 @@ <h1 style="margin-bottom:0.5em">
<div class="row">
<div class="span12">
- <p style="font-size:16px; line-height:25px">GitHub provides <a href="http://developer.github.com/v3/events/types/">18 event types</a>, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly archives, which you can access with any HTTP client:</p>
+ <p class="large">GitHub provides <a href="http://developer.github.com/v3/events/types/">18 event types</a>, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly archives, which you can access with any HTTP client:</p>
<table class="table table-striped">
<thead>
@@ -88,9 +92,25 @@ <h1 style="margin-bottom:0.5em">
<p><em>Note: timeline data is available starting March 11, 2012.</em></p>
<br />
- <p style="font-size:16px; line-height:25px">Each archive contains a stream of JSON encoded GitHub events (<a href="https://gist.github.com/2017462">sample</a>), which you can process in any language. Ruby example:</p>
+ <p class="large">Each archive contains a stream of JSON encoded GitHub events (<a href="https://gist.github.com/2017462">sample</a>), which you can process in any language. Ruby example:</p>
<script src="https://gist.github.com/2017506.js"></script>
+ <p><em>Note: <a href="https://gist.github.com/2426614">example script to import data into SQLite db</a></em></p>
+
+ </div>
+ </div>
+
+ <hr />
+ <div class="row">
+ <div class="span12">
+ <p class="large">The GitHub Archive dataset is also available via <a href="https://developers.google.com/bigquery/">Google BigQuery</a>. The JSON data is <a href="https://github.com/igrigorik/githubarchive.org/blob/master/bigquery/schema.js">normalized</a> and updated every hour, allowing you to run <a href="https://developers.google.com/bigquery/docs/query-reference">arbitrary queries</a> and analysis over the entire dataset in seconds. To get started, log in to the BigQuery console (<a href="https://bigquery.cloud.google.com/">bigquery.cloud.google.com</a>) and add the project (name: "<b>githubarchive</b>"):</p>
+
+ <div class="hero-unit" align="center" style="padding:10px">
+ <img src="assets/img/bigquery-directions.png" />
+ </div>
+
+ <p class="large">An example query (for more, see the <a href="https://github.com/igrigorik/githubarchive.org/tree/master/bigquery">repository readme</a>):</p>
+ <script src="https://gist.github.com/2521371.js?file=github-ruby.sql"></script>
</div>
</div>
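
The sample queries in this commit share one shape: filter `github.timeline` by event type, language, and start date, then group by repository and sort by the count. As a rough illustration only, that pattern could be wrapped in a small Ruby helper. The method name `top_repos_query` and its parameters are hypothetical and not part of the repository; the column and table names are taken from the queries in the diff.

```ruby
# Hypothetical helper (not part of githubarchive.org): returns the text of a
# "top repositories by event count" query in the style of the samples above.
# Values are interpolated without escaping, so pass trusted input only.
def top_repos_query(type:, language:, since:, limit: 100)
  <<~SQL
    SELECT repository_name, count(repository_name) AS cnt, repository_description, repository_url
    FROM github.timeline
    WHERE type="#{type}"
      AND repository_language="#{language}"
      AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC('#{since}')
    GROUP BY repository_name, repository_description, repository_url
    ORDER BY cnt DESC
    LIMIT #{Integer(limit)}
  SQL
end

# Print the query text for Ruby push events since April 2012.
puts top_repos_query(type: "PushEvent", language: "Ruby", since: "2012-04-01 00:00:00")
```

The printed text can be pasted into the BigQuery console. Using one alias (`cnt`) for both the count and the `ORDER BY` avoids the mismatch between `pushes` and `watches` that this commit corrects in bigquery/README.md.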