
Update README

commit 88608635da20118a27fd41f4972db34f2caa1aab (1 parent: 83e7774)
ddaniels888 authored
Showing with 45 additions and 68 deletions in README.md (+45 −68).
@@ -1,7 +1,8 @@
# Twitter Gardenhose
-This simple node.js script saves the 1% of the twitter firehose that is public to S3.
+This node.js script saves the public 1% sample of the Twitter firehose to S3. It can be run locally or as a Heroku app.
+
+## Setup
-## How to Use
You will need Twitter API Keys and AWS API Keys. Gather them first.
### Twitter API Keys
@@ -23,21 +24,17 @@ You will need Twitter API Keys, and AWS API Keys. Gather them first.
* Access Key ID
* Secret Access Key
-## How to use locally
-### Set API keys in your environment
-This README assumes Mac OS X.
-
-`cp .env.example .env`
+## Running Locally
-Put the values you just gathered from Twitter and AWS in place of the placeholders denoted with <<>>
+For local usage, this README assumes Mac OS X, though the steps should be similar for other operating systems.
-`source .env`
+### Install node and required packages
Make sure node.js is installed.
`brew install node`
-Make sure you have all of the required packages. Simply run:
+Make sure you have all of the required packages:
`npm install`
@@ -45,16 +42,32 @@ inside of the folder. This will automatically detect and read from the
package.json file. To learn more about the package.json file, see
[http://npmjs.org/doc/json.html](http://npmjs.org/doc/json.html)
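
For reference only, a package.json for an app like this might look roughly like the sketch below. The dependency names are assumptions, not the project's actual manifest (knox is mentioned in the How it Works section further down; ntwitter is a guess at the Twitter client):

    {
      "name": "twitter-gardenhose",
      "version": "0.0.1",
      "dependencies": {
        "ntwitter": "*",
        "knox": "*"
      }
    }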
+### Set API keys in your environment
+
+Create a copy of the template .env file:
+
+`cp .env.example .env`
+
+Edit .env, putting the values you just gathered from Twitter and AWS in place of the placeholders denoted with <<>>.
+
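
To illustrate, a filled-in .env has one line per key, using the same variable names as the heroku config:add commands later in this README. The export form is an assumption based on the `source .env` step; the actual .env.example may differ:

    export TWITTER_CONSUMER_KEY=<<Your twitter consumer key>>
    export TWITTER_CONSUMER_SECRET=<<Your twitter consumer secret>>
    export TWITTER_ACCESS_TOKEN_KEY=<<Your twitter access token key>>
    export TWITTER_ACCESS_TOKEN_SECRET=<<Your twitter access token secret>>
    export AWS_ACCESS_KEY_ID=<<Your AWS access key ID>>
    export AWS_SECRET_ACCESS_KEY=<<Your AWS secret access key>>
    export AWS_S3_BUCKET_NAME=<<Your AWS S3 bucket name to store tweets>>

Keep .env out of version control, since it holds real credentials.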
+### Dry run
+
Once all of the packages are installed, try a dry run which won't write to S3:
-`node main.js -dv`
+ source .env
+ node main.js -dv
+
+If your API keys and environment are set up correctly, many tweets will fly past your console. Control-C to kill the process.
+
+### Real run
+
+If you want to actually write to S3, run:
-And if your API keys and environment are set up correctly, many tweets will fly past your console. Control-C to kill the process.
+ source .env
+ node main.js
-If you want to actually write to S3 run
-`node main.js`
+## Running on Heroku
-## How to use from Heroku
Once you are satisfied that everything is set up correctly, you probably want to run the Gardenhose on a PaaS which will have faster upload to S3 and will keep running even when you shut down your local machine. This README assumes you have a [heroku](http://heroku.com) account, and have installed the [heroku toolbelt](https://toolbelt.heroku.com).
### To deploy the app to heroku:
@@ -62,76 +75,40 @@ Once you are satisfied that everything is set up correctly, you probably want to
2. `git push heroku master` - Pushes and deploys the app
3. Set heroku configuration
-`heroku config:add TWITTER_CONSUMER_KEY=<<Your twitter consumer key>>`
-
-`heroku config:add TWITTER_CONSUMER_SECRET=<<Your twitter consumer secret>>`
+ heroku config:add TWITTER_CONSUMER_KEY=<<Your twitter consumer key>>
+ heroku config:add TWITTER_CONSUMER_SECRET=<<Your twitter consumer secret>>
+ heroku config:add TWITTER_ACCESS_TOKEN_KEY=<<Your twitter access token key>>
+ heroku config:add TWITTER_ACCESS_TOKEN_SECRET=<<Your twitter access token secret>>
+ heroku config:add AWS_ACCESS_KEY_ID=<<Your AWS access key ID>>
+ heroku config:add AWS_SECRET_ACCESS_KEY=<<Your AWS secret access key>>
+ heroku config:add AWS_S3_BUCKET_NAME=<<Your AWS S3 bucket name to store tweets>>
+
+### Start a worker on heroku
-`heroku config:add TWITTER_ACCESS_TOKEN_KEY=<<Your twitter access token key>>`
-
-`heroku config:add TWITTER_ACCESS_TOKEN_SECRET=<<Your twitter access token secret>>`
-
-`heroku config:add AWS_ACCESS_KEY_ID=<<Your AWS access key ID>>`
-
-`heroku config:add AWS_SECRET_ACCESS_KEY=<<Your AWS secret access key>>`
-
-`heroku config:add AWS_S3_BUCKET_NAME=<<Your AWS S3 bucket name to store tweets>>`
-
-### Start on Heroku
`heroku ps:scale worker=1` - Starts the worker and begins storing tweets.
-### Stop on Heroku
+### Stop the worker on heroku
+
`heroku ps:scale worker=0` - Shuts down the worker.
Note: you should not have more than one worker running. Doing so will cause
duplication in the tweets that are stored on S3.
-### To monitor the status
-Assuming you have the [heroku toolbelt](https://toolbelt.heroku.com) installed
-simply run the command:
+### Monitoring status
+To monitor the worker, run:
`heroku logs --tail`
-This will show you the current status of the twitter streaming.
+This will show you the current status of the gardenhose.
## How it Works
-The remarkable power of Node.js makes this app somewhat trivial.
-
-The Twitter library for node sets up an object with the authentication
-credentials of any Twitter app. You then connect it to a twitter stream. To
-see the types of stream available and the parameters those streams take,
-go to https://dev.twitter.com/docs/streaming-apis
-Node.js is event-based. That means that a function will get called every
-time an 'event' happens. We have a function that is bound to the 'data'
-event of the twitter stream that will write that tweet to a file.
+The twitter_gardenhose app sucks down tweets and writes a new file every time its buffer reaches 20MB. Each individual file is named with the Unix timestamp at which it was created.
-Since tweets come in so rapidly, we use a filestream object instead of a
-plain file object. The difference between the two is documented here:
-http://nodejs.org/api/fs.html
-
-When a file reaches a certain size (currently arbitrarily 20MB), the file
-will be closed, a new one created, and the full one uploaded to amazon's
-S3. We use a library called [Knox](https://github.com/LearnBoost/knox/) to
-super easily upload stuff to S3 buckets.
-
-The uploading and closing of a file happens in the closeCurrentStream()
-method.
-
-We didn't want a file that was too small, otherwise we'd get hundreds of
-thousands of small files on S3. That's inefficent. We didn't want a single
-file that was too big because we don't want to lose all of our tweets
-in case something bad happens to that file.
-
-## Data output
-Each individual file is named by unix timestamp it was created.
-
-We dump raw tweets into JSON compatible text files.
-
-The JSON for a tweet is way, way more than 140 characters. It contains
-information for the tweet, the media, the user, retweet info, geolocation,
-and a bunch of other stuff.
+Raw tweets are stored as JSON, one object per tweet. In addition to the text of the tweet, the JSON object contains lots of metadata about the tweet (the media, the user, retweet info, geolocation, etc.).
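
To make the flow described above concrete, here is a rough sketch of the pieces: an authenticated Twitter stream, a 'data' handler that writes each tweet to a file stream, rotation at roughly 20MB, and an upload via Knox. It assumes an ntwitter-style client and illustrative names (ROTATE_BYTES, the statuses/sample endpoint); it is not the project's actual main.js:

    // Illustrative sketch only, not the project's actual main.js.
    // Stream the public sample, buffer tweets to a file stream,
    // rotate at roughly 20MB, then upload the closed file to S3 with knox.
    var fs = require('fs');
    var twitter = require('ntwitter');   // assumed Twitter client library
    var knox = require('knox');

    var twit = new twitter({
      consumer_key: process.env.TWITTER_CONSUMER_KEY,
      consumer_secret: process.env.TWITTER_CONSUMER_SECRET,
      access_token_key: process.env.TWITTER_ACCESS_TOKEN_KEY,
      access_token_secret: process.env.TWITTER_ACCESS_TOKEN_SECRET
    });

    var s3 = knox.createClient({
      key: process.env.AWS_ACCESS_KEY_ID,
      secret: process.env.AWS_SECRET_ACCESS_KEY,
      bucket: process.env.AWS_S3_BUCKET_NAME
    });

    var ROTATE_BYTES = 20 * 1024 * 1024;   // ~20MB per file
    var filename = Date.now() + '.json';   // files are named by timestamp
    var out = fs.createWriteStream(filename);
    var written = 0;

    // Close the current file, ship it to S3, and start a fresh one.
    function closeCurrentStream() {
      var done = filename;
      out.end();
      s3.putFile(done, '/' + done, function (err) {
        if (err) console.error('upload failed:', err);
      });
      filename = Date.now() + '.json';
      out = fs.createWriteStream(filename);
      written = 0;
    }

    twit.stream('statuses/sample', function (stream) {
      stream.on('data', function (tweet) {
        var line = JSON.stringify(tweet) + '\n';
        out.write(line);
        written += Buffer.byteLength(line);
        if (written >= ROTATE_BYTES) closeCurrentStream();
      });
    });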
## Filtering Tweets
+
What if you wanted to limit the gardenhose to tweets with location tags in the continental US?
Change the call to twit.stream to read:
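
As an illustration only (not necessarily the project's actual parameters), a statuses/filter call with a rough continental-US bounding box might look like this, assuming the same ntwitter-style client as above; the coordinates are approximate:

    // Illustrative: bounding boxes are given as
    // SW longitude, SW latitude, NE longitude, NE latitude.
    twit.stream('statuses/filter',
                { locations: '-124.8,24.4,-66.9,49.4' },
                function (stream) {
      stream.on('data', function (tweet) {
        // handle the tweet as before
      });
    });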