A tool to collect stats about language usage on GitHub

Copyright (c) 2012 IBM Corporation

This is primarily a research tool that can connect to github.com/languages and save a list of the top projects per language. It is developed under the auspices of a joint study agreement between IBM Research and the University of Nebraska-Lincoln.

The primary developer of this software is Patrick Wagstrom.

License

This tool is licensed under the terms of the Apache Software License v2.0.


This tool uses Maven to build and manage dependencies. Use the following command to download the dependencies, compile the software, and package it:

mvn clean compile package


The defaults for the script are pretty sane. You can just type:

./github.sh

If you would rather specify a configuration file use:

./github.sh -c [CONFIGURATION]

Configuration Settings

See src/main/java/net/wagstrom/research/github/language/PropNames.java for the names of properties. The default values for those properties can be found in src/main/java/net/wagstrom/research/github/language/Defaults.java.

Properties that are not set in the configuration file will use default values.

In general you can get by with a simple configuration file like this:


Everything else should magically work.

Simple Queries

By default the program uses the embedded Derby database. The easiest way to perform queries on it is to use the ij tool that ships with the binary distribution of Derby.

This query gets the JavaScript projects marked as 'Most Watched This Week' obtained from the most recent update of the data:

select username, reponame, rank
  from repo, proglang, repoupdate, topcategory
 where repo.id=repoupdate.repo_id
       and repoupdate.proglang_id=proglang.id
       and proglang.name='JavaScript'
       and repoupdate.category_id=topcategory.id
       and topcategory.name='Most Watched This Week'
       and repoupdate.update_id=(select max(id) from githubupdate);
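
If you would rather run the same query from Java than from ij, something along these lines should work. This is only a sketch: the class name, the method name, and the database path `github-languages` are made up for illustration, and it assumes the Derby embedded driver (derby.jar) is on the classpath. The table and column names are the ones used in the query above, with the language and category turned into parameters.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class TopProjectsQuery {
    // Same query as above; language and category are bound as parameters.
    static final String SQL =
          "select username, reponame, rank"
        + "  from repo, proglang, repoupdate, topcategory"
        + " where repo.id = repoupdate.repo_id"
        + "   and repoupdate.proglang_id = proglang.id"
        + "   and proglang.name = ?"
        + "   and repoupdate.category_id = topcategory.id"
        + "   and topcategory.name = ?"
        + "   and repoupdate.update_id = (select max(id) from githubupdate)";

    // Returns one tab-separated line per project: rank, then username/reponame.
    static List<String> topProjects(Connection conn, String language, String category)
            throws SQLException {
        List<String> rows = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(SQL)) {
            stmt.setString(1, language);
            stmt.setString(2, category);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getString("rank") + "\t"
                            + rs.getString("username") + "/" + rs.getString("reponame"));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical database location; pass the real JDBC URL as the first argument.
        String url = args.length > 0 ? args[0] : "jdbc:derby:github-languages";
        try (Connection conn = DriverManager.getConnection(url)) {
            for (String row : topProjects(conn, "JavaScript", "Most Watched This Week")) {
                System.out.println(row);
            }
        }
    }
}
```

Binding the language and category as parameters means the same statement can be reused for other languages or categories without editing the SQL text.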

Database Schema

The schema is currently found in the project source code. See the top of src/main/java/net/wagstrom/research/github/language/DatabaseDriver.java for the declarations of all of the tables.

I Don't Want to Run this Scraper

That's great to hear. You probably shouldn't. I've been running this scraper for a long time and can provide you with the data I've collected as a Derby database. Unfortunately, there were times when GitHub changed how its pages work and I missed updates for a few months at a stretch. But hey, it's better than having to run the scraper yourself. Feel free to email me and I can get you a copy of this data.


If you have found a bug, please file an issue on GitHub. If you are then able to patch the bug yourself, please create a pull request on GitHub. If you have other contributions you think might be helpful, feel free to create a pull request as well, although it might be helpful to contact me with your idea first.