
Code used in production of Stowe et al. 2016. There have been significant bug fixes and processing improvements since the paper, but the dataset, methods, and features are the same.


python [-p FLOAT | --param FLOAT] [-t STRING] [datafile]

The -p argument specifies the main parameter used by the ML algorithm; its meaning depends on which algorithm is selected.

The -t argument specifies which tag to classify. This can be any of the high-level tags:

  • Sentiment
  • Information
  • Reporting
  • Movement
  • Preparation
  • Actions

Finally, a dataset file may be passed; if none is given, the script uses the default file set in
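The command-line interface above can be sketched with argparse. This is a hypothetical reconstruction, not the repository's actual option handling: the long name `--tag` for `-t`, the default tag value, and the `parse_args` helper are all assumptions for illustration.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the CLI described in the usage line above.
    parser = argparse.ArgumentParser(description="Classify tweets for a high-level tag")
    parser.add_argument("-p", "--param", type=float, default=1.0,
                        help="main parameter for the ML algorithm (meaning varies by algorithm)")
    parser.add_argument("-t", "--tag", default="Sentiment",  # long name and default are assumed
                        help="high-level tag to classify (e.g. Sentiment, Movement)")
    parser.add_argument("datafile", nargs="?", default=None,
                        help="optional dataset file; falls back to the script's default")
    return parser.parse_args(argv)

args = parse_args(["-p", "0.5", "-t", "Movement", "data/part1/cleaned.json"])
```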

Code Overview

Takes a .json object of tweets (defaulting to the provided 'data/part1/cleaned.json'), keyed by tweet_id. Each entry has the form: tweet_id: {'text': ''*, 'geo_coords': '[lat, long]' or '[]', 'user': 'user name', 'date': 'MM-DD-YYYY HH:MM:SS', 'annotations': [list of possible anns, or one element "None"], 'previous': 'tweet_id of previous tweet in user stream', 'next': 'tweet_id of next tweet in user stream'}
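A minimal illustration of the expected record schema, assuming the field names above; the tweet id and all values here are made up:

```python
import json

# One record in the shape described above (all values are invented examples).
record = {
    "12345": {
        "text": "",                      # released data ships this field empty (see *Tweet texts)
        "geo_coords": "[]",
        "user": "user name",
        "date": "01-15-2016 12:00:00",
        "annotations": ["None"],
        "previous": "12344",
        "next": "12346",
    }
}
# Round-trip through json, as it would be loaded from cleaned.json.
blob = json.loads(json.dumps(record))
tweet = blob["12345"]
```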

The json is loaded, featurized according to the parameters of the script using the module, and then run through 5-fold CV using the module. The particular ML algorithm can be specified via the ALGORITHM parameter (either 'SVM', 'NB', or 'LR'). It returns a dictionary mapping tweet_id to a binary prediction, along with F1, precision, and recall.
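The evaluation loop can be sketched generically. The repository's own featurizer and CV modules are not named above, so this sketch uses plain scikit-learn equivalents; the `evaluate` function, its signature, and the toy data are assumptions, not the repo's code.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

# The three algorithm choices named in the README.
ALGORITHMS = {"SVM": LinearSVC, "NB": MultinomialNB, "LR": LogisticRegression}

def evaluate(tweet_ids, X, y, algorithm="SVM"):
    """Run 5-fold CV; return {tweet_id: prediction} plus F1, precision, recall."""
    clf = ALGORITHMS[algorithm]()
    preds = cross_val_predict(clf, X, y, cv=5)
    p, r, f1, _ = precision_recall_fscore_support(y, preds, average="binary")
    return dict(zip(tweet_ids, preds)), f1, p, r

# Toy feature matrix with binary labels, purely for illustration.
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0],
              [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
ids = [str(i) for i in range(10)]
preds, f1, p, r = evaluate(ids, X, y, "LR")
```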



Requirements

  • Python, tested on 3.4
  • GenSim, for the Word2Vec model
  • NLTK, for text normalization
  • SciKit-Learn, for the machine learning algorithms (SVM / Naive Bayes / LogReg)
  • NumPy, for numerical support; the SciKit-Learn or NLTK installations should already include numpy/scipy


*Tweet texts

We are not able to directly provide Tweet texts as users may make tweets private or delete them. Instead, we provide all of our metadata, along with tweet ids. This allows collection of available tweets via Twitter without unnecessarily exposing user data.
Because of this, the data provided (data/part1/cleaned.json) contains an empty 'text' field. This field should be filled with Tweet texts collected from Twitter and tagged with the Twitter-NLP tagger, with both --pos and --chunk flags.

The 'text' field for each tweet should look like this:

'text':'Just/O/RB/B-ADVP posted/O/VBD/B-VP a/O/DT/B-NP photo/O/NN/I-NP @/O/IN/B-PP Eight/O/NNP/B-NP Mile/B-geo-loc/NNP/I-NP River/I-geo-loc/NNP/I-NP'

This field is then parsed into POS and NE tags. This is done by splitting on "/", with the [-2] element being the POS tag and the [-3] element being the NE tag. Elements [0:-3] are joined into the lexical item, which is then normalized.
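The splitting convention above can be sketched as follows; `parse_token` is a hypothetical helper name, but the index arithmetic follows the README's description:

```python
def parse_token(tok):
    """Split one tagged token into (word, ne_tag, pos_tag) per the README's indices."""
    parts = tok.split("/")
    pos = parts[-2]                 # [-2] is the POS tag
    ne = parts[-3]                  # [-3] is the NE tag
    word = "/".join(parts[:-3])     # [0:-3] rejoined: the lexical item may itself contain "/"
    return word, ne, pos

word, ne, pos = parse_token("Mile/B-geo-loc/NNP/I-NP")
```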

As an alternative, tweets can be formatted with dummy values in place of the POS and NER tags:
'text':'Just/0/0/0 posted/0/0/0 a/0/0/0 photo/0/0/0 @/0/0/0 Eight/0/0/0 Mile/0/0/0 River/0/0/0'

This allows the featurizer to parse the text, without POS or NER features.
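Producing the dummy format from plain tweet text is a one-liner; `to_dummy_format` is a hypothetical helper for illustration:

```python
def to_dummy_format(text):
    # Append "/0/0/0" dummy NE/POS/chunk slots to each whitespace-separated token,
    # matching the fallback format shown above.
    return " ".join(tok + "/0/0/0" for tok in text.split())
```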

Word Embeddings

The word embedding model is provided via Git Large File Storage (LFS).
Please send any and all questions to:

