Note: we are currently using protobuf 2.4.0a

Setup (on a recent OSX)

  1. Install the latest Xcode with command line tools -- or just the command line tools
  2. Install homebrew
  3. Install protobuf: brew install protobuf
  4. Install parallel, which some scripts depend on: brew install parallel
  5. Run the DataFetchPipeline.rb in scripts/
  6. Look at build.xml for the pipeline stages available.
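
Before running the pipeline it can help to confirm that the tools from the steps above are actually on PATH. The following is a minimal sketch, not part of the project's scripts; the helper name check_tools is hypothetical, and the inclusion of ruby in the example call is an assumption based on the .rb extension of DataFetchPipeline.rb.

```shell
# check_tools (hypothetical helper): print each named tool that is not on PATH.
# Returns 0 if every tool was found, 1 otherwise.
check_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      status=1
    fi
  done
  return "$status"
}

# Example call (tool names taken from the setup steps; ruby is an assumption):
#   check_tools protoc parallel ruby
# To compare the installed protobuf against the version noted at the top:
#   protoc --version
```

If the function prints nothing and returns 0, all listed tools were found.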

Setup (on MS Windows)

  1. Install Google's protocol buffers binaries:
  2. Unpack the archive and put protoc.exe somewhere in PATH
  3. Generate protobuf's classes.

Steps 4-6 are the same as for OSX.

Monitoring the cluster

To check what's inside the generated HAR file:

hadoop fs -ls -R har:///projects/dataset.har
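
For a large archive the recursive listing can be long. As a rough sketch, the listing can be piped through a small filter that counts regular files. The helper name count_files is hypothetical, and it assumes the usual listing format in which each entry starts with a permission string ("-" for files, "d" for directories).

```shell
# count_files (hypothetical helper): read `hadoop fs -ls -R` output on stdin
# and print the number of regular files, i.e. lines whose permission string
# starts with "-". Directory lines (starting with "d") are not counted.
count_files() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the status clean.
  grep -c '^-' || true
}

# Example call:
#   hadoop fs -ls -R har:///projects/dataset.har | count_files
```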

Running hadoop tests with junit outside the unibe network

  1. Make sure that you have a public ssh key. If you don't, follow this guide:
  2. Append your public ssh key to ~/.ssh/authorized_keys
  3. Make sure that you can ssh -l deploy without being asked for a password.
  4. Run your test with ./ uploadJar -DmainClass=ch.unibe.scg.cells.hadoop.JUnitRunner -DclassArgument=ch.unibe.scg.cells.hadoop.CellsTestSuite. If the run fails, the errors are printed to the console.
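
Steps 2 and 3 can be sketched as a small helper that appends a key only if it is not already present, so running it twice does not duplicate the entry. This is an illustration, not a script shipped with the project: the name append_key is hypothetical, and the key file name id_rsa.pub in the example call is an assumption. Run it on the machine whose authorized_keys file should receive the key.

```shell
# append_key PUBKEY_FILE AUTH_FILE (hypothetical helper)
# Append the key in PUBKEY_FILE to AUTH_FILE unless that exact line is
# already there. The directory of AUTH_FILE must already exist.
append_key() {
  pubkey_file=$1
  auth_file=$2
  key=$(cat "$pubkey_file") || return 1
  touch "$auth_file"
  # -F: fixed string, -x: whole-line match, -q: quiet
  if ! grep -Fxq -e "$key" "$auth_file"; then
    printf '%s\n' "$key" >> "$auth_file"
  fi
}

# Example call (id_rsa.pub is an assumed key file name):
#   append_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

To check step 3 without risking an interactive prompt, ssh can be run with -o BatchMode=yes, which makes it fail instead of asking for a password.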

To kill a hung job

hadoop job -kill job_<your_job_id>
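
Since killing the wrong job is hard to undo, a thin wrapper that checks the argument first may be useful. The sketch below is an assumption-laden illustration: the names is_job_id and kill_job are hypothetical, and the pattern assumes job ids of the form job_<digits>_<digits>.

```shell
# is_job_id (hypothetical helper): succeed only if $1 looks like
# job_<digits>_<digits>. The id format is an assumption.
is_job_id() {
  printf '%s\n' "$1" | grep -Eq '^job_[0-9]+_[0-9]+$'
}

# kill_job (hypothetical helper): kill the job only if the id is well-formed.
# Returns 2 without calling hadoop when the argument does not look like a job id.
kill_job() {
  if is_job_id "$1"; then
    hadoop job -kill "$1"
  else
    echo "not a job id: $1" >&2
    return 2
  fi
}

# Example call:
#   kill_job job_<your_job_id>
```

Running job ids can be looked up with hadoop job -list.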

To download the data

Copy the local scripts across the cluster:


Run the DataFetchPipeline:

ssh leela ./scripts/ohloh/DataFetchPipeline.rb

Or locally, for testing:

./scripts/ohloh/DataFetchPipeline.rb --max_repos 3

HBase shell

Open with:

hbase shell

List tables:

list

Check MR output

Check HBase table size:

hadoop fs -du -h -s /hbase/

Check size of HAR file:

hadoop fs -du -h /projects/dataset.har