Twidoop is a tool for streaming statuses from the Twitter firehose into Hadoop. You should be familiar with Twitter's Streaming API before you think about using Twidoop.
Statuses are stored as NULL-terminated JSON strings.
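As a rough illustration of this storage format, the Python sketch below splits a chunk of bytes on the NULL byte and parses each piece as JSON. It is not part of Twidoop: the helper name `iter_statuses` and the sample data are made up for the example, and it assumes the file contents have already been read out of HDFS and are UTF-8 encoded.

```python
import json


def iter_statuses(data):
    """Yield one parsed status per NULL-terminated JSON record in `data` (bytes).

    Hypothetical helper, not part of Twidoop. Assumes UTF-8 encoded JSON.
    """
    for record in data.split(b"\0"):
        # The terminator after the last status leaves an empty trailing
        # piece; skip it along with any other blank records.
        if record.strip():
            yield json.loads(record.decode("utf-8"))


# Made-up sample: two statuses, each terminated by a NULL byte.
sample = b'{"id": 1, "text": "hello"}\0{"id": 2, "text": "world"}\0'
statuses = list(iter_statuses(sample))
```

Splitting on the terminator rather than on newlines avoids relying on each status being written on a single line.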
Statuses are stored in files keyed by day, e.g. statuses-2009-12-25.json. The day is based on local time, not the time in the status.
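To make the naming rule concrete, here is a small Python sketch that builds a file name of this shape from the local clock. It is not Twidoop's own code: the helper name `status_filename` is hypothetical, and the pattern is inferred from the example name above.

```python
from datetime import datetime


def status_filename(now=None):
    """Return the day-keyed file name for `now` (local time by default).

    Hypothetical helper, not Twidoop's own code; the pattern is inferred
    from the example name statuses-2009-12-25.json.
    """
    if now is None:
        now = datetime.now()  # naive local time, not the status timestamp
    return now.strftime("statuses-%Y-%m-%d.json")


# A status received at 23:59 local time on Christmas Day.
name = status_filename(datetime(2009, 12, 25, 23, 59))
```

Because the local clock decides the file, a status whose own timestamp already falls on the next day can still land in the previous day's file.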
If you stop and restart Twidoop, it will blow away the data collected for that day.
The HDFS replica count and block size are set when Twidoop starts. The defaults are one replica and a block size of 16 MB. You might want to change these with --replicas and --block-size. A block size of 5 MB is better for the samplehose, since otherwise it takes too long to get a new block of data.
Twidoop only works with Hadoop 0.20.x.
- Clojure 1.1.x
Installation (Mac OS X)
Install MacPorts & leiningen:
$ sudo port sync
$ sudo port install leiningen

$ mkdir -p ~/.bin
$ echo 'export PATH=~/.bin:$PATH' >> ~/.profile
$ . ~/.profile
$ curl http://github.com/technomancy/leiningen/raw/stable/bin/lein > ~/.bin/lein
$ chmod +x ~/.bin/lein
$ lein self-install
$ lein jar
Run twidoop with --help, which should be self-explanatory:
$ ./twidoop --help
twidoop -- Stream the Twitter firehose into Hadoop.
Options
  --output, -o <arg>      Output here on HDFS [default /firehose]
  --hdfs <arg>            HDFS to connect to [default hdfs://localhost:9000]
  --block-size, -b <arg>  HDFS block size (in megabytes) [default 16]
  --replicas, -r <arg>    HDFS replica count [default 1]
  --type, -t <arg>        Type of stream to read from: sample or firehose [default sample]
  --user, -u <arg>        Twitter username
  --pass, -p <arg>        Twitter password
Stream the samplehose (available to any registered Twitter user) into HDFS on localhost:
$ ./twidoop -u twitter_user -p twitter_pass --hdfs hdfs://localhost:9000
Stream the firehose into "/user/ieure/firehose" on an HDFS cluster:
$ ./twidoop -u twitter_user -p twitter_pass -t firehose -o /user/ieure/firehose --hdfs hdfs://hadoop.internal:9000
Twidoop is licensed under the three-clause BSD license. See LICENSE for the complete licensing terms.