A simple continuous harvester for Twitter with Node

twitter-harvest

A simple continuous harvester for twitter

This application captures tweets posted around the world. It currently works only with version 1.1 of the Twitter Streaming API.

  • Define or modify cfg/cfg.json and create at least one capture agent in the cfg/agents/ directory (with enable set to true).
  • You can activate mail alerts from an SMTP account such as Gmail (see Private configuration and the mail_alert flag in Main configuration).
  • If fs_out is true (default), the captured tweets are written to the file system using the naming convention shown after this list.
  • If todo_out is true (it should be false by default), a kind of queue is created (directory 'data/TODO') holding the filenames to be consumed by an external process. This allows the tweets to be written to any database.
    • Note that the number of files per directory is limited (depending on the OS), so the external process needs to consume the filenames regularly to avoid issues.

data_dir/year/month/day/hour-min-sec_tweet-id

e.g.

data/2015/9/24/16-30-44_647055571951190000
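
The naming convention above can be illustrated with a minimal sketch. The function name `tweetPath` is hypothetical and not part of twitter-harvest, and the use of UTC time without zero padding is an assumption taken from the example path (the change log suggests that later versions pad numbers to two digits and add a JSON extension).

```javascript
const path = require('path');

// Build data_dir/year/month/day/hour-min-sec_tweet-id for one tweet.
// Assumptions: UTC time, no zero padding (as in the example above).
// The tweet id is passed as a string, since tweet ids exceed the range
// that JavaScript numbers can represent exactly.
function tweetPath(dataDir, createdAt, tweetId) {
  const dirParts = [
    createdAt.getUTCFullYear(),
    createdAt.getUTCMonth() + 1, // getUTCMonth() is zero-based
    createdAt.getUTCDate()
  ].map(String);
  const time = [
    createdAt.getUTCHours(),
    createdAt.getUTCMinutes(),
    createdAt.getUTCSeconds()
  ].join('-');
  return path.posix.join(dataDir, ...dirParts, time + '_' + tweetId);
}

const when = new Date(Date.UTC(2015, 8, 24, 16, 30, 44));
console.log(tweetPath('data', when, '647055571951190000'));
// prints "data/2015/9/24/16-30-44_647055571951190000"
```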

Install

$ npm install --save twitter-harvest

Usage

$ node twitter-harvest.js

Usage with forever

$ npm install -g forever
$ forever start twitter-harvest.js

With forever, the task keeps running indefinitely, even after you leave your session.

Main configuration

{
  "agents_dir"    : "cfg/agents/",
  "data_dir"      : "./data/",
  "private_cfg"   : "./cfg/cfg-private.json",

  "mail_alert"    : false,

  "fs_out"        : true,
  "std_out"       : true,
  "todo_out"      : true  
}
  • agents_dir: path where the agent files are stored
  • data_dir: path where the tweets are written on the file system
  • private_cfg: file where private data is stored (such as mail credentials)
  • mail_alert: if true, enable mail alerts in case of failure
  • fs_out: if true, write the Twitter data to the file system
  • std_out: if true, write the Twitter data to the console
  • todo_out: if true, write the JSON filename to the 'data/TODO' directory (to be consumed by another process that writes to a database such as MySQL)
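
As an illustration of how these options fit together, here is a minimal sketch that parses a main configuration and fills in missing values. The function `parseMainConfig` and the `DEFAULTS` object are hypothetical. Only the defaults for fs_out and todo_out come from this README; the other default values are assumptions. The project itself validates its configuration with JSON Schema, which this sketch does not reproduce.

```javascript
// Hypothetical defaults. fs_out (true) and todo_out (false) follow the
// README text; the remaining values are assumptions for illustration.
const DEFAULTS = {
  agents_dir: 'cfg/agents/',
  data_dir: './data/',
  private_cfg: './cfg/cfg-private.json',
  mail_alert: false,
  fs_out: true,
  std_out: false,
  todo_out: false
};

// Parse the JSON text of cfg/cfg.json, apply defaults for missing keys,
// and check that every output flag is a boolean.
function parseMainConfig(jsonText) {
  const cfg = Object.assign({}, DEFAULTS, JSON.parse(jsonText));
  for (const flag of ['mail_alert', 'fs_out', 'std_out', 'todo_out']) {
    if (typeof cfg[flag] !== 'boolean') {
      throw new TypeError(flag + ' must be true or false');
    }
  }
  return cfg;
}
```

For example, `parseMainConfig('{"std_out": true}')` returns a configuration with std_out enabled and every other option at its default.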

Agents configuration

Put all the agent definition files in the agents directory (one file per agent).

$ cat cfg/agents/*.json
{
  "type_doc"            : "twitter",
  "enable"              : true,
  "type_filter"         : "track",
  "type_api"            : "stream",
  "name"                : "keywords-geneva",
  "filter"              : {
    "track"             : "genève,geneva,genebra,genevra,genf"
  },
  "stream"              : "filter",
  "consumer_key"        : "...",
  "consumer_secret"     : "...",
  "access_token_key"    : "...",
  "access_token_secret" : "..."  
}

This agent captures all tweets that mention the word Geneva in several languages.

{
  "type_doc"            : "twitter",
  "enable"              : true,
  "type_filter"         : "locations",
  "type_api"            : "stream",
  "name"                : "location-geneva",
  "filter"              : {
    "locations"  : "5.77,45.85,7.15,46.80"
  },
  "stream"              : "filter",
  "consumer_key"        : "...",
  "consumer_secret"     : "...",
  "access_token_key"    : "...",
  "access_token_secret" : "..."
}

This agent captures all tweets posted around the Geneva area (Switzerland).
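
The locations value is a bounding box. In the Twitter Streaming API, it is given as longitude,latitude pairs with the southwest corner first. The following sketch shows how such a string could be interpreted; the helper names `parseBoundingBox` and `contains` are hypothetical and not part of twitter-harvest.

```javascript
// Parse "west,south,east,north" (southwest corner first, longitude before latitude).
function parseBoundingBox(locations) {
  const parts = locations.split(',').map(Number);
  if (parts.length !== 4 || parts.some(Number.isNaN)) {
    throw new Error('expected four comma-separated numbers');
  }
  const [west, south, east, north] = parts;
  return { west, south, east, north };
}

// Check whether a point lies inside the box (boxes crossing the
// antimeridian are not handled in this sketch).
function contains(box, lon, lat) {
  return lon >= box.west && lon <= box.east &&
         lat >= box.south && lat <= box.north;
}

const geneva = parseBoundingBox('5.77,45.85,7.15,46.80');
console.log(contains(geneva, 6.14, 46.20)); // prints "true" (city of Geneva)
```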

  • type_doc : 'twitter'
  • enable : if true, this agent is launched
  • type_filter : locations | track | follow
  • stream : filter | firehose (if you are lucky enough to have access)
  • consumer_key, consumer_secret, access_token_key, access_token_secret : personal keys provided by Twitter for using its APIs
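
To illustrate the role of the enable flag, here is a minimal sketch of how agent files could be collected from the agents directory. The function `loadEnabledAgents` is hypothetical; the actual loading code in twitter-harvest may differ.

```javascript
const fs = require('fs');
const path = require('path');

// Read every *.json file in agentsDir (one file per agent) and keep
// only the agents whose "enable" field is true.
function loadEnabledAgents(agentsDir) {
  return fs.readdirSync(agentsDir)
    .filter((name) => name.endsWith('.json'))
    .map((name) => JSON.parse(fs.readFileSync(path.join(agentsDir, name), 'utf8')))
    .filter((agent) => agent.enable === true);
}
```

Calling `loadEnabledAgents('cfg/agents/')` would then return the list of agent definitions to launch.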

For more details, see the Twitter API documentation: https://dev.twitter.com/streaming/overview/request-parameters
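
Since version 0.3.4 the harvester relies on the Twit library. The sketch below shows one way an agent definition could be translated into the arguments Twit expects. The function `toTwitArgs` and the field mapping are assumptions, not the project's actual code; in particular, Twit names the token option access_token, whereas the agent file uses access_token_key.

```javascript
// Map an agent definition to Twit constructor options and stream arguments.
function toTwitArgs(agent) {
  return {
    credentials: {
      consumer_key: agent.consumer_key,
      consumer_secret: agent.consumer_secret,
      access_token: agent.access_token_key,
      access_token_secret: agent.access_token_secret
    },
    // "filter" becomes the endpoint statuses/filter, "firehose" becomes statuses/firehose
    path: 'statuses/' + agent.stream,
    params: agent.filter
  };
}

// With the twit package installed, the result could be used roughly as
// follows (handleTweet is a hypothetical callback):
//   const Twit = require('twit');
//   const args = toTwitArgs(agent);
//   new Twit(args.credentials)
//     .stream(args.path, args.params)
//     .on('tweet', handleTweet);
```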

Private configuration

{
  "mail_service"    : "gmail",
  "mail_auth_user"  : "username",
  "mail_auth_path"  : "password",
  "mail_from"       : "alert_twitter_harvest",
  "mail_to"         : "name@gmail.com"
}
  • mail_service : name of the mail service
  • mail_auth_user : username for the mail service
  • mail_auth_path : password for the mail service
  • mail_from : sender of the alert mail
  • mail_to : recipient of the alert mail

A mail is also sent when the system starts; you should receive it in your mailbox if everything is configured correctly.

Note: the supported mail systems are those provided by the nodemailer Node module (see the supported services at https://github.com/andris9/nodemailer-wellknown#supported-services), but only Gmail has been tested. For Gmail, you may have to lower the security level of your mail account (so don't use a personal account) and explicitly authorize the application using this URL: https://g.co/allowaccess
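
As a rough illustration, the private configuration could be mapped to nodemailer as sketched below. The function `toMailSettings` is hypothetical, and the subject and text values are placeholders; the actual alert code in twitter-harvest may be organized differently.

```javascript
// Translate the private configuration into the options object for
// nodemailer.createTransport() and a message object for sendMail().
function toMailSettings(privateCfg, subject, text) {
  return {
    transport: {
      service: privateCfg.mail_service,
      auth: {
        user: privateCfg.mail_auth_user,
        pass: privateCfg.mail_auth_path // despite its name, this field holds the password
      }
    },
    message: {
      from: privateCfg.mail_from,
      to: privateCfg.mail_to,
      subject: subject,
      text: text
    }
  };
}

// With the nodemailer package installed, roughly:
//   const nodemailer = require('nodemailer');
//   const settings = toMailSettings(cfg, 'twitter-harvest alert', 'An agent failed');
//   nodemailer.createTransport(settings.transport).sendMail(settings.message, callback);
```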

Test

$ gulp

Notes

Currently, three error messages are printed when twitter-harvest is launched. They are harmless and can be ignored. Here are these error messages:

{ [Error: Cannot find module './build/Release/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }
{ [Error: Cannot find module './build/default/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }
{ [Error: Cannot find module './build/Debug/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }

To do

  • add more tests
  • add an option to include extra info in the output (from agents)
  • add other API interfaces (not only the streaming API)

License

MIT © Arnaud Gaudinat

Change log

  • 0.3.4:
    • change the node twitter lib to Twit (for better error handling)
  • 0.3.3:
    • add the TODO option and directory to allow writing in DB
    • add 2 digits on filenames and JSON extension
  • 0.3.2:
    • add JSONschema validation