# NLP Classification

Workflow for supervised learning on text data

In [252]:
import pandas as pd

from sklearn.pipeline import Pipeline

# ignore warning messages (sklearn has a ton)
import warnings
warnings.filterwarnings('ignore')

In [258]:
# load both datasets
office_df = pd.read_csv('../data/dundermifflin.csv')
print('Office df shape:', office_df.shape)

overwatch_df = pd.read_csv('../data/overwatch.csv')
print('Overwatch df shape:', overwatch_df.shape)

Office df shape: (41467, 6)
Overwatch df shape: (47774, 6)


In [259]:
# combine data into single DataFrame
all_df = pd.concat([office_df, overwatch_df])
print(all_df.shape)
all_df.head()

(89241, 6)


Unnamed: 0.1,Unnamed: 0,title,id,subreddit,body,comment
0,0,Should I call you Jimothy?,ay2o5j,DunderMifflin,,I read somewhere that most people who think th...
1,1,Should I call you Jimothy?,ay2o5j,DunderMifflin,,I got Oscar Martinez... Michael am I gay?
2,2,Should I call you Jimothy?,ay2o5j,DunderMifflin,,That is correct.
3,3,Should I call you Jimothy?,ay2o5j,DunderMifflin,,Am I the only one who took slight pride in get...
4,4,Should I call you Jimothy?,ay2o5j,DunderMifflin,,You got: Creed Bratton\nYou're very mysterious...


In [260]:
# data looks fairly balanced
all_df.subreddit.value_counts()

Overwatch        47774
DunderMifflin    41467
Name: subreddit, dtype: int64
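
A useful reference point for the accuracies reported later is the majority-class baseline, i.e. the accuracy of always predicting the larger class. The sketch below computes it from the counts printed above; the variable names are illustrative and not part of the original notebook, and the exact share in a random test split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {"Overwatch": 47774, "DunderMifflin": 41467}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "classifier" that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 4))
```

A model scoring close to this value has probably learned little beyond the class prior.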

## Preprocess Text

As a first step, we'll tokenize the documents, remove stopwords, and normalize case and punctuation.

Gensim requires a `gensim.corpora.Dictionary` object built from a list of lists, where each nested list holds the tokens of one document.
The scikit-learn vectorizers used below instead expect one string per document, so our preprocessing functions return each cleaned comment as a single space-joined string:

In [285]:
import spacy

# Full model name: works on Windows and Unix
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

# Shortcut link (mac/ubuntu/etc.); only available in spaCy 2.x, removed in spaCy 3
# nlp = spacy.load('en', disable=['ner', 'parser'])

stop_words = nlp.Defaults.stop_words

In [308]:
import re

# you can also extend the stopword list with your own terms
stop_words.update(['los', 'angeles'])

def clean_token(token):
    # replace anything that isn't a letter or apostrophe with a space
    c_token = re.sub("[^A-Za-z']+", ' ', str(token))
    # lower-case and strip whitespace
    c_token = c_token.lower().strip()
    # remove stopwords
    if c_token in stop_words:
        return ''
    return c_token

def clean_comment(comment):
    # missing comments (e.g. NaN) become empty strings so the column stays all-str
    if not isinstance(comment, str):
        return ''
    tokens = [clean_token(token) for token in comment.split()]
    # drop empty strings and re-join into a single cleaned string
    return ' '.join(token for token in tokens if token != '')


In [309]:
example_text = "USC is cool but Los Angeles is whatever."
print(clean_comment(example_text))

usc cool
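
To see how the cleaning treats punctuation, digits, and hyphens, here is a self-contained sketch of the same logic. It assumes a tiny, hypothetical stop-word set (`STOP_WORDS`) in place of spaCy's list, and it returns an empty string for non-string input.

```python
import re

# Hypothetical, tiny stop-word set standing in for spaCy's list.
STOP_WORDS = {"is", "but", "the", "a"}

def clean_token(token):
    # Replace every run of characters other than letters/apostrophes with a space.
    cleaned = re.sub(r"[^A-Za-z']+", " ", str(token)).lower().strip()
    return "" if cleaned in STOP_WORDS else cleaned

def clean_comment(comment):
    # Non-string input (e.g. NaN) becomes an empty string.
    if not isinstance(comment, str):
        return ""
    tokens = (clean_token(tok) for tok in comment.split())
    return " ".join(tok for tok in tokens if tok)

result = clean_comment("The show is well-known, but it's from 2005!")
print(result)
```

Note that a hyphenated token such as `well-known` comes out as two words. Because the stopword check runs on the whole token, neither half is checked individually.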


In [310]:
# apply text-cleaning to every comment
all_df['clean_comment'] = all_df['comment'].apply(clean_comment)

In [311]:
all_df['clean_comment'][1:10]

1                       got oscar martinez michael gay
2                                              correct
3                    took slight pride getting michael
4    got creed bratton you're mysterious people har...
5    damn took test i'm phyllis guess lot learn tow...
6    god took quiz time im toby camera document rea...
7                                 identity theft crime
8                                          daryll love
9    quiz dwight retaking order michael you're defi...
Name: clean_comment, dtype: object

## Split Data into Training/Testing Sets

Note: by convention, `X` holds the input data and `y` holds the labels.

In [290]:
from sklearn.model_selection import train_test_split

X = all_df['clean_comment']
y = all_df['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [291]:
print('Training Data Shape:', X_train.shape)
print('Testing Data Shape:', X_test.shape)

Training Data Shape: (62468,)
Testing Data Shape: (26773,)
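
The split above is reshuffled on every run and does not explicitly preserve the class ratio. The sketch below, on hypothetical toy data, shows the `stratify` and `random_state` parameters of `train_test_split`, which keep the class proportions equal across both sets and make the split reproducible. The seed value and the toy frame are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned comments and their subreddit labels.
toy = pd.DataFrame({
    "clean_comment": [f"comment {i}" for i in range(20)],
    "subreddit": ["Overwatch"] * 12 + ["DunderMifflin"] * 8,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["clean_comment"],
    toy["subreddit"],
    test_size=0.25,
    stratify=toy["subreddit"],  # keep the 60/40 class ratio in both splits
    random_state=42,            # arbitrary seed, fixed for reproducibility
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```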


## Create embeddings

Before throwing our data into ML models, we need to create embeddings, i.e. turn each comment into a numeric vector.
This can be done in several ways, with complexity differing depending on the desired method.
For this notebook, we'll keep it simple and just use scikit-learn's built-in vectorizers.

In [292]:
# Create tf-idf embeddings
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# init model w/ default params
tfidf = TfidfVectorizer()

# learn vocabulary and idf weights from the training data only
tfidf.fit(X_train)

# transform the training data
corpus_train = tfidf.transform(X_train)
# equivalent one-step form:
# corpus_train = tfidf.fit_transform(X_train)

# transform the test data using the vocabulary learned from training
corpus_test = tfidf.transform(X_test)

In [293]:
print('Training Shape:', corpus_train.shape)
print('Testing Shape:', corpus_test.shape)

Training Shape: (62468, 33761)
Testing Shape: (26773, 33761)
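
Both matrices have the same number of columns because the vocabulary is fixed when the vectorizer is fit on the training data. A small sketch on made-up documents illustrates this, and shows that words unseen during fitting are simply dropped at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
train_docs = ["michael scott paper", "dwight beet farm", "payload push payload"]
test_docs = ["michael pushes payload", "completely unseen words"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learn vocabulary + idf, then transform
test_matrix = vec.transform(test_docs)        # reuse the training vocabulary

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)

# Number of non-zero features per test document.
print(test_matrix[0].nnz, test_matrix[1].nnz)
```

The second test document shares no words with the training vocabulary, so its row is all zeros.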


## sklearn Estimators

Using an `sklearn` ML algorithm is super simple: initialize, `fit`, then `predict`.

In [294]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# init classifier with default params
rf = RandomForestClassifier()

# train classifier on our data
rf.fit(corpus_train, y_train)

# use classifier to make predictions
y_preds = rf.predict(corpus_test)

# get accuracy; the signature is accuracy_score(y_true, y_pred)
accuracy_score(y_test, y_preds)

0.77223322003511
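
Accuracy alone does not show which class the errors fall on. As a sketch on hand-written toy labels (not results from this notebook), `confusion_matrix` from `sklearn.metrics` breaks the predictions down per class. Rows are true labels and columns are predicted labels, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written toy labels, for illustration only.
y_true = ["Overwatch", "Overwatch", "DunderMifflin", "DunderMifflin", "Overwatch"]
y_pred = ["Overwatch", "DunderMifflin", "DunderMifflin", "DunderMifflin", "Overwatch"]

labels = ["DunderMifflin", "Overwatch"]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(acc)
print(cm)
```

Unlike accuracy, the confusion matrix is not symmetric in its arguments, which is one reason to keep the `(y_true, y_pred)` order consistent.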

## sklearn pipelines

Idea: throw everything together into one object. Treat this object itself as a single classifier.

In [295]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Simple Pipeline: raw text in, predicted labels out
pipeline_model = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression())
])
pipeline_model.fit(X_train, y_train)
y_preds = pipeline_model.predict(X_test)
accuracy_score(y_test, y_preds)

0.8111903783662645
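
The sketch below shows the same idea on a handful of made-up documents: the pipeline accepts raw strings for both `fit` and `predict`, and each fitted step stays accessible through `named_steps`. The documents and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up documents and labels, for illustration only.
docs = [
    "paper sales dwight",
    "michael scott paper",
    "payload tank healer",
    "healer ult payload",
]
labels = ["DunderMifflin", "DunderMifflin", "Overwatch", "Overwatch"]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(docs, labels)

# Inspect a fitted step by name.
vocab_size = len(toy_pipeline.named_steps["vectorizer"].vocabulary_)

# Predict directly on raw text; vectorization happens inside the pipeline.
prediction = toy_pipeline.predict(["dwight sells paper"])[0]

print(vocab_size, prediction)
```

Because the vectorizer is fit inside `Pipeline.fit`, it only ever sees the training data, which avoids leaking test vocabulary into the model.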

## Cycling through multiple embeddings + pipelines

Defining a pipeline allows us to iterate through many different model + vectorizer combinations.
Also, the code becomes more concise and more expressive.

In [296]:
from sklearn.decomposition import TruncatedSVD

def create_pipeline(vectorizer, estimator, reducer=False):
    """
    Create pipeline with optional dimensionality-reduction.
    """
    steps = [
        ('vectorizer', vectorizer)
    ]
    if reducer:
        # note: TruncatedSVD defaults to n_components=2
        steps.append(('reducer', TruncatedSVD()))
    steps.append(('classifier', estimator))
    return Pipeline(steps)


In [297]:
# Single Classifier with pipeline
from sklearn.linear_model import SGDClassifier

pipe = create_pipeline(CountVectorizer(), SGDClassifier(), reducer=False)

pipe.fit(X_train, y_train)
y_preds = pipe.predict(X_test)
print(accuracy_score(y_test, y_preds))

0.8014417510178165


In [298]:
# Create combinations

from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

models = []
for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    for estimator in (LogisticRegression, SGDClassifier, RandomForestClassifier):
        # clone the vectorizer so the pipelines don't share one fitted object
        models.append(create_pipeline(clone(vectorizer), estimator(), reducer=False))
        models.append(create_pipeline(clone(vectorizer), estimator(), reducer=True))

print(models[1:4])

[Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]), Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))]), Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=

In [299]:
# Assess Models
scores = []
for model in models:
    # class name of the final estimator, e.g. 'LogisticRegression'
    model_name = type(model.named_steps['classifier']).__name__

    if 'reducer' in model.named_steps:
        acc_print = 'Accuracy of {} with dimensionality reduction: {}'
    else:
        acc_print = 'Accuracy of {} without dimensionality reduction: {}'

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    scores.append(accuracy)
    print(acc_print.format(model_name, accuracy))

Accuracy of LogisticRegression without dimensionality reduction: 0.8111903783662645
Accuracy of LogisticRegression with dimensionality reduction: 0.5318791319613043
Accuracy of SGDClassifier without dimensionality reduction: 0.7990512830090016
Accuracy of SGDClassifier with dimensionality reduction: 0.5318791319613043
Accuracy of RandomForestClassifier without dimensionality reduction: 0.7696559967131065
Accuracy of RandomForestClassifier with dimensionality reduction: 0.5620587905725918
Accuracy of LogisticRegression without dimensionality reduction: 0.809584282672842
Accuracy of LogisticRegression with dimensionality reduction: 0.5336346319052777
Accuracy of SGDClassifier without dimensionality reduction: 0.793560676801255
Accuracy of SGDClassifier with dimensionality reduction: 0.5318791319613043
Accuracy of RandomForestClassifier without dimensionality reduction: 0.7727187838494005
Accuracy of RandomForestClassifier with dimensionality reduction: 0.526836738
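
The accuracies with dimensionality reduction sit close to the majority-class share of the data (47774 of 89241 comments, about 0.535). One likely contributor is that `TruncatedSVD()` defaults to `n_components=2`, so the tens of thousands of vectorizer features are compressed into just two. The sketch below, on made-up documents, shows the default and how a larger value could be passed. The value 5 is chosen only to fit the toy data; the scikit-learn documentation suggests around 100 components for latent semantic analysis.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
docs = [
    "michael scott paper company",
    "dwight beet farm schrute",
    "jim pam office prank",
    "payload push escort map",
    "healer tank damage ult",
    "widowmaker sniper headshot",
    "office paper sales call",
    "tank ult payload push",
]
X = TfidfVectorizer().fit_transform(docs)

default_svd = TruncatedSVD()                              # n_components defaults to 2
wider_svd = TruncatedSVD(n_components=5, random_state=0)  # illustrative value

X_default = default_svd.fit_transform(X)
X_wider = wider_svd.fit_transform(X)

print(default_svd.n_components)
print(X_default.shape, X_wider.shape)
```

In `create_pipeline`, the same idea would mean constructing the reducer with an explicit `n_components` instead of relying on the default.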