Home
All of the API data is available in the MongoDB server. On top of that, there is a `predict` column that measures the text-only alcohol relevance of a tweet, along with `labels` gathered from Amazon Mechanical Turk and, lastly, a `random_number` generated on insert.

NOTE: `random_number` is strongly correlated with the presence of `labels`, so don't use `random_number` for the test/train split.
If you just want access to the collection of interest in the mongo instance:

```python
>>> from config import db
>>> db.count()
80324
```

To pull everything into a pandas DataFrame:

```python
>>> from dao import DataAccess
>>> df = DataAccess.as_dataframe()
>>> df.columns
Index(['created_at', 'labels', 'predict', 'text', 'user'], dtype='object')
>>> df["user"][1].keys()
dict_keys(['created_at', 'statuses_count', 'favourites_count', 'followers_count', 'friends_count'])
```
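Since `random_number` should not drive the split, here is a minimal sketch of the alternative: keep only labeled rows and let sklearn do the splitting. The tiny DataFrame below is a toy stand-in for `DataAccess.as_dataframe()`; the real frame has the same columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for DataAccess.as_dataframe(); real rows carry tweet text,
# a `predict` score, and AMT `labels` (None when the tweet is unlabeled).
df = pd.DataFrame({
    "text": ["beer tonight", "coffee run", "wine friday", "gym day"],
    "labels": ["alcohol", None, "alcohol", None],
})

labeled = df[df["labels"].notnull()]          # only rows with AMT labels
train, test = train_test_split(labeled, test_size=0.5, random_state=0)
```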
To investigate how time and user information can improve the quality of our classifier, I have created a notebook of visualizations of our features with respect to the `predict` index constructed above.
# TODO
With respect to efficiency, I will also investigate using Gensim's `Phrases` and `TfidfModel` on the entire dataset, along with the control, to build appropriate weightings for the test data. Our training data has many more references to words like `drinking`, `drunk`, etc., so it will skew a lot of our word weights unless the weights are fit on the full corpus.
- TfidfModel: Training data distribution does not represent real world data due to various filters we've already included. This is why we should consider training tfidf and phrases on the entire dataset.
- Phrases: improves the width of our matrices since we know the appropriate n-gram run lengths.
Experiments should measure:
- File size
- Run time (would help a lot with GridSearch)
- Performance
# TODO
Since I am primarily using sklearn, we should make use of Pipelines. I'd like to build a few classes that wrap Pipelines and give us more functionality, such as:
- Cross-validation of multiple classifiers
- Diagnostics: ROC, confusion matrices, learning rate, learning curve
- Persistence into MongoDB and S3 (clf.pickle and diagnostics with a fixed schema)
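The cross-validation item above can be sketched with plain sklearn before any wrapper class exists: one shared feature step, a dict of candidate estimators, and `cross_val_score` for each. The data and candidate names here are illustrative, not from the repo.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["beer tonight", "drunk again", "coffee run",
         "gym then work", "wine friday", "meeting at nine"]
y = [1, 1, 0, 0, 1, 0]                    # toy alcohol / not-alcohol labels

candidates = {
    "logreg": LogisticRegression(),
    "nb": MultinomialNB(),
}

for name, clf in candidates.items():
    # Same feature extraction for every candidate classifier.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, texts, y, cv=3)
    print(name, scores.mean())
```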
Custom Transformers and Pickles
- Custom Gensim Phrase/tfidf Transformers
- Loading Pretrained TfidfVectorizers
- Custom Tokenizers
- Custom DateTime Encoders
- Custom Reputation Models
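As a shape for the transformers listed above, here is a minimal sketch of one of them: a hypothetical DateTime encoder that turns `created_at` timestamps into an hour-of-day feature column and plugs into a Pipeline. The class name and feature choice are illustrative.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class HourOfDayEncoder(BaseEstimator, TransformerMixin):
    """Encode tweet timestamps as an hour-of-day column (illustrative)."""

    def fit(self, X, y=None):
        return self                       # stateless: nothing to learn

    def transform(self, X):
        # Accepts a Series of timestamp strings, returns an (n, 1) array.
        hours = pd.to_datetime(X).dt.hour
        return hours.to_numpy().reshape(-1, 1)

enc = HourOfDayEncoder()
features = enc.transform(pd.Series(["2015-06-01 23:15:00",
                                    "2015-06-02 09:30:00"]))
```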
The data I decide to pay to classify via AMT should be determined iteratively by looking at the learning curve, perhaps at a resolution of 500 tweets per batch. I should keep submitting jobs until that curve flattens out.
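The stopping rule can be sketched with sklearn's `learning_curve` (synthetic data below stands in for the labeled tweets; the flatness threshold is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] > 0).astype(int)            # toy labels standing in for AMT data

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# If validation score barely moved over the last increment, stop labeling.
val_mean = val_scores.mean(axis=1)
flattened = abs(val_mean[-1] - val_mean[-2]) < 0.01
```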
Once the algorithm classifies something as alcohol-related, we then need to build subclasses around that. Perhaps use different tfidf weights? Reconsider features? Not certain at this point in time.
- Should also probably find a way to break up the pipeline at this point so we don't waste time recomputing features if it passes the first classifier.
- Diagnostics for Multi-class are a bit different than Single-class.
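One way to break up the pipeline, as suggested above, is to compute the features once and let the second-stage model reuse them for rows that pass the first classifier (a sketch with illustrative data; the real second stage would be a multi-class model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["beer tonight", "drunk again", "coffee run", "gym day"]
stage1_y = [1, 1, 0, 0]                     # alcohol vs. not

vec = TfidfVectorizer().fit(texts)
X = vec.transform(texts)                    # features computed once

stage1 = LogisticRegression().fit(X, stage1_y)
mask = stage1.predict(X) == 1               # rows that pass the first stage
X_alcohol = X[mask]                         # second stage reuses the features
```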
Last step is to write the processing code that can take raw twitter json data and produce the relevant statistics over a collection of tweets.
The interface goes as follows:

```
$ script.py --input s3.address:twitter.json --output output.json
```

Once the file is generated, a notebook can then take that output to make plots.
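A skeleton for that script could look like the following; the flag names come from the example invocation, while everything inside `main` is an illustrative placeholder for the real statistics code:

```python
import argparse
import json

def build_parser():
    parser = argparse.ArgumentParser(
        description="Produce statistics over a collection of raw tweets.")
    parser.add_argument("--input", required=True, help="source of raw tweet JSON")
    parser.add_argument("--output", required=True, help="stats JSON to write")
    return parser

def main(argv=None):
    args = build_parser().parse_args(argv)
    stats = {"source": args.input}           # placeholder for real statistics
    with open(args.output, "w") as fh:
        json.dump(stats, fh)

if __name__ == "__main__":
    main()
```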