Home
All of the API data is available in the MongoDB server. On top of that, there is a `predict` column that measures the text-only alcohol relevance of a tweet, along with `labels` gathered from Amazon Mechanical Turk and, lastly, a `random_number` generated on insert.

NOTE: `random_number` is strongly correlated with the presence of `labels`, so don't use `random_number` for the test/train split.
If you just want access to the collection of interest in the mongo instance:

```python
>>> from config import db
>>> db.count()
80324
```

To pull everything into a pandas DataFrame:

```python
>>> from dao import DataAccess
>>> df = DataAccess.as_dataframe()
>>> df.columns
Index(['created_at', 'labels', 'predict', 'text', 'user'], dtype='object')
>>> df["user"][1].keys()
dict_keys(['created_at', 'statuses_count', 'favourites_count', 'followers_count', 'friends_count'])
```
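Since `random_number` should not drive the split, here is a minimal sketch of the alternative: keep only labeled rows and let sklearn do the splitting. The tiny DataFrame below is a toy stand-in for `DataAccess.as_dataframe()`; the real frame has the same columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for DataAccess.as_dataframe(); real rows carry tweet text,
# a `predict` score, and AMT `labels` (None when the tweet is unlabeled).
df = pd.DataFrame({
    "text": ["beer tonight", "coffee run", "wine friday", "gym day"],
    "labels": ["alcohol", None, "alcohol", None],
})

labeled = df[df["labels"].notnull()]          # only rows with AMT labels
train, test = train_test_split(labeled, test_size=0.5, random_state=0)
```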
To investigate how time and user information can improve the quality of our classifier, I have created a notebook of visualizations of our features with respect to the `predict` index constructed above.
# TODO
With respect to efficiency, I will also investigate using Gensim's `Phrases` and `TfidfModel` on the entire dataset, along with the control, to build appropriate weightings for the test data. Our training data has many more references to words like `drinking`, `drunk`, etc., so it will skew a lot of our word weights unless the weights are fit on the full corpus.
- TfidfModel: Training data distribution does not represent real world data due to various filters we've already included. This is why we should consider training tfidf and phrases on the entire dataset.
- Phrases: improves the width of our matrices since we know the appropriate n-gram run lengths.
Experiments should measure:
- File size
- Run time (would help a lot with GridSearch)
- Performance
# TODO
Since I am primarily using sklearn, we should make use of Pipelines. I'd like to build a few classes that wrap Pipelines and give us more functionality, such as:
- Cross-validation of multiple classifiers
- Diagnostics: ROC, confusion matrices, learning rate, learning curve
- Persistence into MongoDB and S3 (clf.pickle and diagnostics with a fixed schema)
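The cross-validation item above can be sketched with plain sklearn before any wrapper class exists: one shared feature step, a dict of candidate estimators, and `cross_val_score` for each. The data and candidate names here are illustrative, not from the repo.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["beer tonight", "drunk again", "coffee run",
         "gym then work", "wine friday", "meeting at nine"]
y = [1, 1, 0, 0, 1, 0]                    # toy alcohol / not-alcohol labels

candidates = {
    "logreg": LogisticRegression(),
    "nb": MultinomialNB(),
}

for name, clf in candidates.items():
    # Same feature extraction for every candidate classifier.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, texts, y, cv=3)
    print(name, scores.mean())
```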
Custom Transformers and Pickles
- Custom Gensim Phrase/tfidf Transformers
- Loading Pretrained TfidfVectorizers
- Custom Tokenizers
- Custom DateTime Encoders
- Custom Reputation Models
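As a shape for the transformers listed above, here is a minimal sketch of one of them: a hypothetical DateTime encoder that turns `created_at` timestamps into an hour-of-day feature column and plugs into a Pipeline. The class name and feature choice are illustrative.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class HourOfDayEncoder(BaseEstimator, TransformerMixin):
    """Encode tweet timestamps as an hour-of-day column (illustrative)."""

    def fit(self, X, y=None):
        return self                       # stateless: nothing to learn

    def transform(self, X):
        # Accepts a Series of timestamp strings, returns an (n, 1) array.
        hours = pd.to_datetime(X).dt.hour
        return hours.to_numpy().reshape(-1, 1)

enc = HourOfDayEncoder()
features = enc.transform(pd.Series(["2015-06-01 23:15:00",
                                    "2015-06-02 09:30:00"]))
```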
The data I decide to pay to classify via AMT should be determined iteratively by looking at the learning curve, perhaps at a resolution of 500 tweets per batch. I should keep submitting jobs until that curve flattens out.
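The stopping rule can be sketched with sklearn's `learning_curve` (synthetic data below stands in for the labeled tweets; the flatness threshold is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] > 0).astype(int)            # toy labels standing in for AMT data

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# If validation score barely moved over the last increment, stop labeling.
val_mean = val_scores.mean(axis=1)
flattened = abs(val_mean[-1] - val_mean[-2]) < 0.01
```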
Once the algorithm classifies something as alcohol-related, we then need to build subclasses around that. Perhaps use different tfidf weights? Reconsider features? Not certain at this point in time.
- Should also probably find a way to break up the pipeline at this point so we don't waste time recomputing features if it passes the first classifier.
- Diagnostics for Multi-class are a bit different than Single-class.
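One way to break up the pipeline, as suggested above, is to compute the features once and let the second-stage model reuse them for rows that pass the first classifier (a sketch with illustrative data; the real second stage would be a multi-class model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["beer tonight", "drunk again", "coffee run", "gym day"]
stage1_y = [1, 1, 0, 0]                     # alcohol vs. not

vec = TfidfVectorizer().fit(texts)
X = vec.transform(texts)                    # features computed once

stage1 = LogisticRegression().fit(X, stage1_y)
mask = stage1.predict(X) == 1               # rows that pass the first stage
X_alcohol = X[mask]                         # second stage reuses the features
```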
Last step is to write the processing code that can take raw twitter json data and produce the relevant statistics over a collection of tweets.
The interface goes as follows:

```
$ script.py --input s3.address:twitter.json --output output.json
```

Once the file is generated, a notebook can then take that output to make plots.
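A skeleton for that script could look like the following; the flag names come from the example invocation, while everything inside `main` is an illustrative placeholder for the real statistics code:

```python
import argparse
import json

def build_parser():
    parser = argparse.ArgumentParser(
        description="Produce statistics over a collection of raw tweets.")
    parser.add_argument("--input", required=True, help="source of raw tweet JSON")
    parser.add_argument("--output", required=True, help="stats JSON to write")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    stats = {"source": args.input}           # placeholder for real statistics
    with open(args.output, "w") as fh:
        json.dump(stats, fh)

if __name__ == "__main__":
    main()
```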