ghtorrent: Mirror and index data from the Github API
A library and a collection of scripts used to retrieve data from the Github API
and extract metadata into an SQL database, in a modular and scalable manner. The
scripts are distributed as a Gem (ghtorrent), but they can also be run by
checking out this repository.
GHTorrent can be used for a variety of purposes, such as:
- Mirror the Github API event stream and follow links from events to actual data to gradually build a Github index
- Create a queryable metadata database for a specific repository
- Construct a data source for extracting process analytics (see for example those) for one or more repositories
GHTorrent's components (which can be used individually) are:
- APIClient: Knows how to query the Github API (both single entities and pages) and respect the API request limit. Can be configured to override the default IP address, in case of multihomed hosts.
- Retriever: Knows how to retrieve specific Github entities (users, repositories, watchers) by name. Uses an optional persister to avoid retrieving data that have not changed.
- Persister: A key/value store, which can be backed by a real key/value store, to store Github JSON replies and query them on request. The backing key/value store must support arbitrary queries to the stored JSON objects.
- GHTorrent: Knows how to extract information from the data retrieved by the retriever in order to update an SQL database (see schema) with metadata.
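As an illustration of what "respecting the API request limit" can involve, here is a minimal sketch. It is not GHTorrent's APIClient code. It assumes the response headers are available as a plain Hash with lower-case keys, and the helper name `seconds_to_wait` is hypothetical. The `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers are the ones the Github API returns with each response.

```ruby
# Hypothetical helper (not part of GHTorrent): decide how long to pause
# before the next request, based on Github's rate-limit response headers.
#
# headers: Hash with lower-case keys, e.g.
#   'x-ratelimit-remaining' => requests left in the current window
#   'x-ratelimit-reset'     => window reset time, in UTC epoch seconds
def seconds_to_wait(headers, now = Time.now)
  remaining = headers.fetch('x-ratelimit-remaining', '1').to_i
  return 0 if remaining > 0

  # No requests left: wait until the window resets (never a negative time).
  reset_at = headers.fetch('x-ratelimit-reset', now.to_i.to_s).to_i
  [reset_at - now.to_i, 0].max
end

# Usage: sleep only when the limit is exhausted.
headers = { 'x-ratelimit-remaining' => '0', 'x-ratelimit-reset' => '1060' }
pause = seconds_to_wait(headers, Time.at(1_000))
puts "would sleep for #{pause} seconds"
```

A client built this way would call the helper after every response and sleep for the returned number of seconds before issuing the next request.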
The Persister and GHTorrent components have configurable back ends:
- Persister: Either uses MongoDB > 3.0 (mongo driver) or no persister
- GHTorrent: GHTorrent is tested mainly with MySQL and SQLite, but can theoretically be used with any SQL database compatible with Sequel. Your mileage may vary.
For distributed mirroring you also need RabbitMQ >= 3.3.
1. Install GHTorrent
GHTorrent is written in Ruby (tested with Ruby > 2.0). To install it as a Gem do:
sudo gem install ghtorrent
2. Install Your Preferred Database
Depending on which SQL database you want to use, install the appropriate dependency gem.
sudo gem install mysql2 # or sqlite3
Copy config.yaml.tmpl to a file in your home directory.
All provided scripts accept the -c option, which specifies the location of the configuration file.
You can find more information on how to set up a cluster of machines that mirror data in parallel on the Wiki.
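To make the role of the -c option concrete, here is a minimal sketch of how a script could accept it and read a YAML configuration. This is not GHTorrent's own option handling. The helper names `config_path` and `parse_config`, the default file name, and the `sql`/`url` keys are assumptions for illustration only; consult config.yaml.tmpl for the real keys.

```ruby
require 'optparse'
require 'yaml'

# Hypothetical helper: return the path given with -c, or a default.
def config_path(argv, default = 'config.yaml')
  path = default
  OptionParser.new do |opts|
    opts.on('-c FILE', 'Location of the configuration file') do |file|
      path = file
    end
  end.parse(argv) # non-destructive: argv itself is left untouched
  path
end

# Hypothetical helper: turn YAML text into a Hash (empty text gives {}).
def parse_config(yaml_text)
  YAML.safe_load(yaml_text) || {}
end

# Usage with inline YAML; the keys below are illustrative assumptions.
settings = parse_config("sql:\n  url: sqlite://github.db\n")
puts config_path(['-c', '/home/me/ghtorrent.yaml'])
puts settings['sql']['url']
```

In a real script the YAML text would come from `File.read(config_path(ARGV))`.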
To mirror the event stream and capture all data:
ght-mirror-events.rb periodically polls Github's event queue (https://api.github.com/events), stores all new events in the configured persister, and posts them to the github exchange in RabbitMQ.
ght-data_retrieval.rb creates queues that route posted events to processor functions. The functions use the appropriate Github API call to retrieve the linked contents, extract metadata (for database storage), and store the retrieved data in the appropriate collection in the persister, to avoid duplicate API calls. Data in the SQL database contain pointers (the ext_ref_id field) to the "raw" data in the persister.
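The two steps above (keep only new events, then route each one to a processor function by type) can be sketched in plain Ruby. This is an illustration, not the code of the scripts themselves: it uses an in-memory Set and a Hash of lambdas in place of the persister and RabbitMQ queues, and the names `new_events` and `route` are hypothetical. The `id`, `type` and `repo` fields are those found in Github event JSON.

```ruby
require 'set'

# Keep only events whose id has not been seen before, recording new ids.
# Set#add? returns nil when the id is already present, so duplicates
# (across polls or within one batch) are filtered out in a single pass.
def new_events(events, seen_ids)
  events.select { |event| seen_ids.add?(event['id']) }
end

# Route an event to the processor registered for its type.
# Returns the processor's result, or nil when no processor is registered.
def route(event, processors)
  processor = processors[event['type']]
  processor && processor.call(event)
end

# Usage: the processors here only describe what they would retrieve.
processors = {
  'PushEvent'  => ->(e) { "retrieve commits for #{e['repo']['name']}" },
  'WatchEvent' => ->(e) { "retrieve watchers for #{e['repo']['name']}" }
}
seen = Set.new
polled = [
  { 'id' => '1', 'type' => 'PushEvent',  'repo' => { 'name' => 'a/b' } },
  { 'id' => '2', 'type' => 'WatchEvent', 'repo' => { 'name' => 'a/b' } }
]
new_events(polled, seen).each { |event| puts route(event, processors) }
```

Polling the same batch again yields no new events, which is the property that lets a mirror poll the event queue repeatedly without reprocessing what it already stored.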
To retrieve data for a repository or user:
ght-retrieve-repo retrieves all data for a specific repository
ght-retrieve-user retrieves all data for a specific user
To perform maintenance:
ght-load loads selected events from the persister to the queue in order for the ght-data-retrieval script to reprocess them
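The idea of reloading selected events for reprocessing can be sketched as follows. This does not reproduce ght-load's actual options or selection logic: the criteria used here (event type and a cutoff time), the helper name `select_events`, and the in-memory array standing in for the queue are all assumptions. The `created_at` field is the ISO 8601 timestamp carried by Github event JSON.

```ruby
require 'time'

# Hypothetical selection: events of one type created at or after a cutoff.
def select_events(stored, type:, since:)
  stored.select do |event|
    event['type'] == type && Time.parse(event['created_at']) >= since
  end
end

# Usage: an Array stands in for both the persister and the queue.
stored = [
  { 'id' => '1', 'type' => 'PushEvent',  'created_at' => '2012-06-01T09:00:00Z' },
  { 'id' => '2', 'type' => 'PushEvent',  'created_at' => '2012-06-02T10:00:00Z' },
  { 'id' => '3', 'type' => 'WatchEvent', 'created_at' => '2012-06-03T11:00:00Z' }
]
queue = []
select_events(stored, type: 'PushEvent', since: Time.utc(2012, 6, 2)).each do |event|
  queue << event # a real setup would post the event to the message queue
end
puts queue.map { |event| event['id'] }.inspect
```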
There are two sets of data:
- Raw events: Github's event stream. These are the roots for mirroring operations. The ght-data-retrieval crawler starts from an event and goes deep into the rabbit hole.
- SQL dumps + Linked data: Data dumps from the SQL database and the corresponding MongoDB entities.
Bugs & Feature Requests
Please tell us about features you'd like or bugs you've discovered on our Issue Tracker.
Patches, bug fixes, etc. are welcome. Please fork the repository and create a pull request once your fix or new feature is ready.
Citing GHTorrent in your Research
If you find GHTorrent and the accompanying datasets useful in your research, please consider citing the following paper:
Georgios Gousios and Diomidis Spinellis, "GHTorrent: GitHub’s data from a firehose," in MSR '12: Proceedings of the 9th Working Conference on Mining Software Repositories, June 2–3, 2012. Zurich, Switzerland.