We train a classifier to tell comments from the politics subreddit apart from comments in the science subreddit.



## Load some text data



One source of text data is Reddit.  Download some comments from
[http://files.pushshift.io/reddit/comments/](http://files.pushshift.io/reddit/comments/) and use `bz2` to decompress
them.  Here I load a file called `RC_2010-10.bz2`.  Each line of the
decompressed file is a JSON object, which can be decoded using
`json.loads` in Python after `import json`.

I load over 250k comments from the "politics" and "science"
subreddits.



In [1]:
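import bz2
import json

comments = []
# Open in text mode so that each line is a str holding one JSON object.
with bz2.open('/home/jim/downloads/RC_2010-10.bz2', 'rt', encoding='utf-8') as f:
    for line in f:
        comment = json.loads(line)
        # Keep only the two subreddits we want to tell apart.
        if comment['subreddit'] in ('politics', 'science'):
            comments.append(comment)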
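
To try the loading step without downloading a full month of comments, here is a minimal sketch that writes a tiny bz2-compressed file of JSON lines and reads it back with the same filter.  The three sample records and the temporary file path are made up for illustration; the real dump has many more fields per comment.

```python
import bz2
import json
import os
import tempfile

# Hypothetical sample records with only the two fields used in this post.
sample = [
    {"subreddit": "politics", "body": "Who will win the election?"},
    {"subreddit": "science", "body": "Matter and energy are related."},
    {"subreddit": "pics", "body": "Look at this cat."},
]

# Write one JSON object per line into a bz2-compressed temporary file.
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line and keep only the wanted subreddits.
wanted = {"politics", "science"}
kept = []
with bz2.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        comment = json.loads(line)
        if comment["subreddit"] in wanted:
            kept.append(comment)

print(len(kept), [c["subreddit"] for c in kept])
```

Reading line by line keeps memory use low, since only the comments that pass the filter are held in the list.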

JSON is a popular format, so it behooves us to be comfortable with it.

But `comment['body']` is a string of text, and the classifier needs
numbers, so we convert each comment to a vector.



In [1]:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(n_features=2**18)
corpus = [comment['body'] for comment in comments]
X = vectorizer.fit_transform(corpus)
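
The idea behind `HashingVectorizer` is the hashing trick: each token is hashed to a column index, so no vocabulary has to be stored.  The following is a simplified sketch of that idea in plain Python, not scikit-learn's implementation (which, among other differences, uses a different hash function, signed updates, and row normalization).  The function name `hash_features` and the use of MD5 are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def hash_features(text, n_features=2**18):
    """Map a text to {column index: count} using the hashing trick."""
    counts = Counter()
    for token in text.lower().split():
        # A stable hash of the token, reduced to a column index.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        counts[index] += 1
    return dict(counts)


# Word order does not matter: both texts contain the same tokens.
a = hash_features("the election the vote")
b = hash_features("vote the election the")
print(a == b)
```

Because the index depends only on the token and `n_features`, the same text always lands in the same columns, at the cost of occasional collisions between different tokens.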

We want to predict `y`, which is `True` for a politics comment and `False` for a science comment.



In [1]:
import numpy as np
y = np.array([comment['subreddit'] == 'politics' for comment in comments])

## Evaluating the model



Let's see how well this works.  We hold out 30% of the comments for testing and train on the rest.



In [1]:
from sklearn.model_selection import train_test_split 
from sklearn import linear_model
model = linear_model.SGDClassifier(alpha=1e-05)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify=y)
model.fit( X_train, y_train )
y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
print("F1 Score: {:.2f}".format(f1_score(y_test, y_pred)))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
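
To see how these numbers relate, here is a small sketch that derives accuracy and F1 from the four cells of a binary confusion matrix.  For boolean labels, scikit-learn lays the matrix out as `[[tn, fp], [fn, tp]]`, with rows for the true class and columns for the predicted class.  The counts below are hypothetical, not results from this data set.

```python
def scores(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of the predicted politics, how many were right
    recall = tp / (tp + fn)      # of the actual politics, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts, with "politics" as the positive class.
tn, fp, fn, tp = 20, 5, 10, 65
accuracy, precision, recall, f1 = scores(tn, fp, fn, tp)
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("F1 Score: {:.2f}".format(f1))
```

When one subreddit is much larger than the other, accuracy alone can look good for a model that mostly predicts the larger class, which is why F1 and the confusion matrix are worth reading alongside it.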

## Fitting the final model



When we're happy with our model, we fit it on **all** the data.



In [1]:
from sklearn import linear_model
model = linear_model.SGDClassifier(alpha=1e-05)
model.fit(X,y)

You may want to save the model for later.  Let's dump it to disk using
`joblib`, a library that scikit-learn depends on.  (Older scikit-learn
releases exposed it as `sklearn.externals.joblib`, which has since been removed.)



In [1]:
import joblib
_ = joblib.dump(model, "science-versus-politics.model", compress=9)

We can reload it.



In [1]:
model = joblib.load("science-versus-politics.model")

And we can use it to classify text.



In [1]:
def classify(text):
    # HashingVectorizer is stateless, so transform needs no prior fitting.
    if model.predict(vectorizer.transform([text]))[0]:
        return 'politics'
    else:
        return 'science'

This is probably `politics`.



In [1]:
classify('Who will win the election?')

I hope this is `science`.



In [1]:
classify('Is there any relationship between matter and energy?')
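
Note that only the model was saved to disk, not the vectorizer.  That is workable here because `HashingVectorizer` keeps no fitted state: a fresh instance built with the same parameters maps text to the same columns.  A minimal sketch of that property, assuming scikit-learn is installed and the same `n_features` is used as above:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "Who will win the election?",
    "Is there any relationship between matter and energy?",
]

# Two independently constructed vectorizers; neither is ever fitted.
first = HashingVectorizer(n_features=2**18)
second = HashingVectorizer(n_features=2**18)

A = first.transform(docs)
B = second.transform(docs)

# The largest absolute difference between the two sparse matrices is zero.
same = abs(A - B).max() == 0
print(A.shape, same)
```

In a new session, then, it should be enough to reload the model and recreate the vectorizer with identical settings.  A vectorizer that learns a vocabulary, such as `CountVectorizer`, would need to be saved alongside the model instead.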