
Twitter Data

All of the API data is available in the MongoDB server. On top of that, there is a predict column that measures the text-only alcohol relevance of a tweet, along with labels collected from Amazon Mechanical Turk and, lastly, a random_number generated on insert.

NOTE: random_number is strongly correlated with the presence of labels, so don't use random_number for the test/train split.

If you just want access to the collection of interest in the Mongo instance:

>>> from config import db
>>> db.count()
80324

Data Access

>>> from dao import DataAccess
>>> df = DataAccess.as_dataframe()
>>> df.columns
Index(['created_at', 'labels', 'predict', 'text', 'user'], dtype='object')
>>> df["user"][1].keys()
dict_keys(['created_at', 'statuses_count', 'favourites_count', 'followers_count', 'friends_count'])
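
Per the note above, the test/train split should come from the labeled rows themselves rather than random_number. A minimal sketch, assuming df from the snippet above (train_test_split and the 80/20 ratio are just one option):

```python
from sklearn.model_selection import train_test_split

# Restrict to tweets that actually carry AMT labels, then split.
# random_number is deliberately not used anywhere here.
labeled = df[df["labels"].notnull()]
train, test = train_test_split(labeled, test_size=0.2, random_state=42)
```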

Feature Exploration

To investigate how time and user information can improve the quality of our classifier, I have created a notebook of visualizations of our features with respect to the predict index constructed earlier.

#TODO

Gensim Phrases and tfidf

With respect to efficiency, I will also investigate using Gensim's Phrases and TfidfModel on the entire dataset along with the control, to build weightings appropriate for the test data. Our training data has far more references to words like drinking, drunk, etc., so it will skew a lot of our word weights unless we fit them on the full dataset.

  • TfidfModel: the training data distribution does not represent real-world data because of the filters we've already applied, which is why we should consider training tfidf and Phrases on the entire dataset.
  • Phrases: keeps the width of our matrices manageable, since we learn the appropriate n-gram run lengths from the data rather than enumerating every n-gram.
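
A minimal sketch of that fit-on-everything idea, assuming db from config is the tweet collection above (the whitespace tokenizer and the min_count/threshold values are placeholders):

```python
from gensim.corpora import Dictionary
from gensim.models import Phrases, TfidfModel

from config import db

# Tokenize every tweet in the collection, not just the filtered
# training set, so words like "drinking" keep realistic weights.
texts = [tweet["text"].lower().split() for tweet in db.find()]

# Learn n-gram run lengths (collocations) from the full corpus.
bigram = Phrases(texts, min_count=5, threshold=10.0)
phrased = [bigram[doc] for doc in texts]

# Fit tf-idf weights over the same full corpus.
dictionary = Dictionary(phrased)
tfidf = TfidfModel(dictionary=dictionary)

# Weight a new tweet with the pretrained models.
bow = dictionary.doc2bow(bigram["had a few drinks last night".split()])
weights = tfidf[bow]
```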

Experiments should be done to measure:

  • File size
  • Run time (would help a lot with GridSearch)
  • Performance
#TODO
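
A rough harness for those three measurements might look like the following, where model and corpus stand in for whichever artifact and bag-of-words corpus are being compared:

```python
import os
import pickle
import time

# `model` and `corpus` are placeholders for whatever is being compared
# (e.g. tfidf trained on training data vs. trained on everything).
start = time.time()
transformed = [model[doc] for doc in corpus]
print("run time: %.2fs" % (time.time() - start))

with open("model.pickle", "wb") as f:
    pickle.dump(model, f)
print("file size: %d bytes" % os.path.getsize("model.pickle"))
```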

Pipelines

Since I am primarily using sklearn, we should make use of Pipelines. I'd like to build a few classes that wrap Pipelines and give us more functionality, such as:

  • Cross Validation of multiple classifiers
  • Diagnostics, ROC, Confusion Matrices, Learning Rate, Learning Curve
  • Persistence into MongoDB and S3 (clf.pickle and diagnostics with a fixed schema)
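
A minimal sketch of the cross-validation piece of such a wrapper, assuming tweets come in as raw text (the class name, estimators, and roc_auc scoring are placeholders, not a fixed design):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class PipelineRunner:
    """Cross-validate several classifiers behind a shared text step."""

    def __init__(self, classifiers, cv=5):
        self.classifiers = classifiers
        self.cv = cv

    def evaluate(self, texts, labels):
        scores = {}
        for name, clf in self.classifiers.items():
            pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
            scores[name] = cross_val_score(
                pipe, texts, labels, cv=self.cv, scoring="roc_auc").mean()
        return scores

runner = PipelineRunner({"logreg": LogisticRegression(),
                         "nb": MultinomialNB()})
# runner.evaluate(df["text"], df["labels"])
```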

Custom Transformers and Pickles

  • Custom Gensim Phrase/tfidf Transformers
  • Loading Pretrained TfidfVectorizers
  • Custom Tokenizers
  • Custom DateTime Encoders
  • Custom Reputation Models
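
As one example, a gensim Phrases step could be wrapped as a sklearn transformer so it rides along inside a Pipeline and pickles with it (a sketch; the class name and defaults are illustrative):

```python
from gensim.models import Phrases
from sklearn.base import BaseEstimator, TransformerMixin

class PhraseTransformer(BaseEstimator, TransformerMixin):
    """Wrap gensim Phrases so it fits inside a sklearn Pipeline."""

    def __init__(self, min_count=5, threshold=10.0):
        self.min_count = min_count
        self.threshold = threshold

    def fit(self, X, y=None):
        # X is an iterable of token lists.
        self.phrases_ = Phrases(X, min_count=self.min_count,
                                threshold=self.threshold)
        return self

    def transform(self, X):
        # Return each document with learned collocations joined.
        return [self.phrases_[doc] for doc in X]
```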

Learning Rate and AMT

The data I decide to pay to classify via AMT should be determined iteratively by looking at the learning curve, perhaps at a resolution of 500 labels per batch.

[learning curve plot]

I should keep submitting jobs until that curve flattens out.
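
sklearn's learning_curve can drive that decision directly. A sketch, assuming pipe is a pipeline like the one above and X, y are the currently labeled tweets:

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Evaluate at batch-of-500 increments, mirroring the AMT resolution,
# staying within the size of the CV training folds.
sizes = np.arange(500, int(0.8 * len(y)), 500)
train_sizes, train_scores, val_scores = learning_curve(
    pipe, X, y, train_sizes=sizes, cv=5, scoring="roc_auc")

# Keep submitting AMT jobs while the validation score is still rising.
print(val_scores.mean(axis=1))
```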

Multiclass Classification

Once the algorithm classifies something as alcohol-related, we then need to build subclasses around that. Perhaps use different tfidf weights? Reconsider features? Not certain at this point in time.

  • Should also probably find a way to break up the pipeline at this point so we don't waste time recomputing features when a tweet passes the first classifier (see the sketch after this list).
  • Diagnostics for multi-class classification are a bit different than for single-class.
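
A sketch of that two-stage split, assuming an already-fitted vectorizer, a binary binary_clf, and a subtype_clf (all hypothetical names), with integer subclass labels:

```python
import numpy as np

# `texts` is the batch of tweets to process.
X = vectorizer.transform(texts)            # expensive step, done once

alcohol_mask = binary_clf.predict(X) == 1  # stage 1: alcohol vs. not

sub_labels = np.full(len(texts), -1)       # -1 = not alcohol-related
sub_labels[alcohol_mask] = subtype_clf.predict(X[alcohol_mask])  # stage 2
```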

Processing

The last step is to write the processing code that takes raw Twitter JSON data and produces the relevant statistics over a collection of tweets.

The interface goes as follows:

$ script.py --input s3.address:twitter.json --output output.json

Once the file is generated, a notebook can then take that output to make plots.
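
A minimal argparse skeleton matching that interface, with the statistics computation left as a stub:

```python
import argparse
import json

def main():
    parser = argparse.ArgumentParser(
        description="Compute statistics over a collection of tweets.")
    parser.add_argument("--input", required=True,
                        help="raw twitter json (e.g. an s3 address)")
    parser.add_argument("--output", required=True,
                        help="where to write the statistics json")
    args = parser.parse_args()

    stats = {}  # TODO: classify tweets and aggregate statistics here
    with open(args.output, "w") as f:
        json.dump(stats, f)

if __name__ == "__main__":
    main()
```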