kevincstowe/chime-ml

chime-ml

Code used to produce the results in Stowe et al. 2016. There have been significant bug fixes and processing improvements since the paper, but the dataset, methods, and features are the same.

USAGE

python CHIME-ML.py [-p FLOAT | --param FLOAT] [-t STRING] [datafile]

The -p argument sets the main parameter of the ML algorithm. Which parameter this is depends on the algorithm.

The -t argument specifies which tag to classify. It can be any of the high-level tags:

  • Sentiment
  • Information
  • Reporting
  • Movement
  • Preparation
  • Actions

Finally, a dataset file may be passed. If none is given, the default file set in CHIME-ML.py is used.
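As an illustration of the command-line interface described above, here is a minimal argparse sketch. It is not taken from CHIME-ML.py, and the actual script may parse its arguments differently. The names `build_parser` and `TAGS` are hypothetical, and restricting -t to the six listed tags is an assumption.

```python
import argparse

# The six high-level tags listed in this README.
TAGS = ["Sentiment", "Information", "Reporting",
        "Movement", "Preparation", "Actions"]


def build_parser():
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(prog="CHIME-ML.py")
    # Main parameter of the ML algorithm; its meaning depends on the algorithm.
    parser.add_argument("-p", "--param", type=float, default=None)
    # Tag to classify (assumed here to be limited to the listed tags).
    parser.add_argument("-t", dest="tag", choices=TAGS, default=None)
    # Optional dataset file; None means "use the script's default".
    parser.add_argument("datafile", nargs="?", default=None)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-p", "0.5", "-t", "Sentiment", "data/part1/cleaned.json"]
)
no_args = build_parser().parse_args([])
```

With no arguments, every value falls back to None, which a script could treat as "use the defaults".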

Code Overview

Takes a .json object of tweets (defaulting to the provided 'data/part1/cleaned.json'). The object is keyed by tweet_id, and each entry has the following form:

tweet_id:{'text':''*, 'geo_coords':'[lat, long]' or '[]', 'user':'user name', 'date':'MM-DD-YYYY HH:MM:SS', 'annotations':[list of possible annotations, or one element "None"], 'previous':'tweet_id of previous tweet in user stream', 'next':'tweet_id of next tweet in user stream'}
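To make the record layout concrete, the sketch below loads one record and decodes its fields. The record is hypothetical: the id, user, coordinates, date, and neighbouring ids are invented for illustration. The helper `parse_record` is not part of this repository, and treating 'geo_coords' as a string-encoded list is an assumption based on the description above.

```python
import json
from datetime import datetime

# Hypothetical record in the documented layout; all values are made up.
sample_json = """
{
  "1001": {
    "text": "",
    "geo_coords": "[40.01, -105.27]",
    "user": "example_user",
    "date": "10-29-2012 14:05:00",
    "annotations": ["None"],
    "previous": "1000",
    "next": "1002"
  }
}
"""

tweets = json.loads(sample_json)


def parse_record(record):
    """Decode coordinates, timestamp, and annotations from one record."""
    coords = record["geo_coords"]
    if isinstance(coords, str):
        # Assumed: coordinates are stored as a string such as '[lat, long]' or '[]'.
        coords = json.loads(coords)
    # The documented date format is MM-DD-YYYY HH:MM:SS.
    when = datetime.strptime(record["date"], "%m-%d-%Y %H:%M:%S")
    # A single "None" element marks a tweet without annotations.
    anns = [a for a in record["annotations"] if a != "None"]
    return coords, when, anns


coords, when, anns = parse_record(tweets["1001"])
```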

The json is loaded, featurized according to the parameters of the CHIME-ML.py script using the Features.py module, and then run through 5-fold cross-validation (CV) using the Learn.py module. The ML algorithm is selected with the ALGORITHM parameter (one of 'SVM', 'NB', or 'LR'). The run returns a dictionary of tweet_id:binary prediction, as well as F1, precision, and recall.
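The evaluation step can be sketched in plain Python. This is not the Learn.py implementation: `five_folds` and `binary_scores` are hypothetical helpers, the fold assignment is a simple unshuffled round-robin, and whether the real code shuffles or stratifies is not stated here. The sketch only shows how tweet ids might be split into five folds and how precision, recall, and F1 follow from a dictionary of binary predictions.

```python
def five_folds(tweet_ids, k=5):
    """Split ids into k (train_ids, test_ids) pairs, round-robin."""
    ids = sorted(tweet_ids)
    folds = [ids[i::k] for i in range(k)]
    splits = []
    for i, test_ids in enumerate(folds):
        # Training set is every id that is not in the held-out fold.
        train_ids = [t for j, fold in enumerate(folds) if j != i for t in fold]
        splits.append((train_ids, test_ids))
    return splits


def binary_scores(predictions, gold):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in predictions.items() if p == 1 and gold[t] == 1)
    fp = sum(1 for t, p in predictions.items() if p == 1 and gold[t] == 0)
    fn = sum(1 for t, p in predictions.items() if p == 0 and gold[t] == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up ids and labels, for illustration only.
splits = five_folds(["t%d" % i for i in range(10)])
predictions = {"a": 1, "b": 1, "c": 0, "d": 0}
gold = {"a": 1, "b": 0, "c": 1, "d": 0}
precision, recall, f1 = binary_scores(predictions, gold)
```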

CURRENTLY REQUIRES

Packages

Python, tested on 3.4
GenSim, for the Word2Vec model
NLTK, for text normalization
SciKit-Learn, for machine learning algorithms (SVM/Naive Bayes/LogReg)
Numpy, for support. SciKit-Learn or NLTK installations should include numpy/scipy.

Extras

*Tweet texts

We are not able to directly provide Tweet texts as users may make tweets private or delete them. Instead, we provide all of our metadata, along with tweet ids. This allows collection of available tweets via Twitter without unnecessarily exposing user data.
Because of this, the data provided (data/part1/cleaned.json) contains an empty 'text' field. This field should be filled with Tweet texts collected from Twitter and tagged with the Twitter-NLP tagger, with both --pos and --chunk flags.

The 'text' field for each tweet should look like this:

'text':'Just/O/RB/B-ADVP posted/O/VBD/B-VP a/O/DT/B-NP photo/O/NN/I-NP @/O/IN/B-PP Eight/O/NNP/B-NP Mile/B-geo-loc/NNP/I-NP River/I-geo-loc/NNP/I-NP http://t.co/1nkkwsIZ/O/URL/I-NP'

This field is then parsed into POS and NE tags by splitting each whitespace-separated token on "/": the [-2] element is the POS tag and the [-3] element is the NE tag. Elements [0:-3] are joined back into the lexical item (which may itself contain "/", as in a URL) and normalized.
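The splitting rule above can be sketched as follows. `parse_token` and `parse_text` are hypothetical names, not functions from Features.py, and the normalization step is omitted because its details are not given here.

```python
def parse_token(token):
    """Split one tagged token into (word, pos_tag, ne_tag)."""
    parts = token.split("/")
    # Everything before the last three fields is the lexical item;
    # rejoining with "/" restores words that contain slashes, e.g. URLs.
    word = "/".join(parts[:-3])
    ne_tag = parts[-3]
    pos_tag = parts[-2]
    # parts[-1] is the chunk tag, which this sketch does not use.
    return word, pos_tag, ne_tag


def parse_text(text):
    """Parse a whole 'text' field into a list of (word, pos, ne) tuples."""
    return [parse_token(tok) for tok in text.split()]


example = ("Just/O/RB/B-ADVP posted/O/VBD/B-VP a/O/DT/B-NP photo/O/NN/I-NP "
           "@/O/IN/B-PP Eight/O/NNP/B-NP Mile/B-geo-loc/NNP/I-NP "
           "River/I-geo-loc/NNP/I-NP http://t.co/1nkkwsIZ/O/URL/I-NP")
parsed = parse_text(example)
```

Indexing from the end of the split list is what keeps the URL intact: only the last three fields are treated as tags.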

As an alternative, one could format their tweets with dummy values for POS and NER:
'text':'Just/0/0/0 posted/0/0/0 a/0/0/0 photo/0/0/0 @/0/0/0 Eight/0/0/0 Mile/0/0/0 River/0/0/0 http://t.co/1nkkwsIZ/0/0/0'

This allows the featurizer to parse the text, without POS or NER features.
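A small helper can produce the dummy-tagged format from raw tweet text. This is a sketch, not part of the repository: `add_dummy_tags` is a hypothetical name, and splitting on whitespace is an assumption that may not match the tokenization of the Twitter-NLP tagger.

```python
def add_dummy_tags(raw_text, dummy="0"):
    """Append three dummy tag fields (NE, POS, chunk) to every token."""
    return " ".join(
        "{0}/{1}/{1}/{1}".format(token, dummy) for token in raw_text.split()
    )


raw = "Just posted a photo @ Eight Mile River http://t.co/1nkkwsIZ"
tagged = add_dummy_tags(raw)
```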

Word Embeddings

The word embedding model is provided via Git Large File Storage (Git LFS).

Please send any and all questions to:
kevin.stowe@colorado.edu
