# Building Text Classifiers

Frank Neugebauer
March 24, 2019

The objective of this project is to demonstrate the accuracy of different text classifiers in Python. To get that output, corpora from Reddit that show categorized and controversial entries is used.

Some of what's demonstrated:
* Reading JSON files
* Sampling to increase performance
* Tokenization
* Creating vectors as features
* Logistic regression with different penalities
* Multinomrial Naive Bayes

First, import everything that's needed.

In [74]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## DO NOT RUN THIS BLOCK
The next step is to load the data, but it's massive and in order to avoid processing problems, I load the data and take a sample of 1000 from each. Then, that sample data is saved as separate CSV files.

Only the small samples have been uploaded, so do not run this (unless, of course, you have the full JSON files noted in the code.

In [42]:
cat_comments = pd.read_json('data/categorized-comments.jsonl', lines=True)
cont_comments = pd.read_json('data/controversial-comments.jsonl', lines=True)

small_cat_comments = cat_comments.sample(n=1000)
small_cont_comments = cont_comments.sample(n=1000)

small_cat_comments.to_csv(r'data/small_cat_comments.csv')
small_cont_comments.to_csv(r'data/small_cont_comments.csv')

In order to avoid loading the entire data set each time, this code block independently loads the sample CSV files. This means that the previous step can be skipped every time except the first time (or any time you change the sample size).

This code should always work because at a minimum, the initial 1,000 observation files should be there (e.g., `small_cat_comments.csv`).

In [44]:
cat_comments = pd.read_csv(r'data/small_cat_comments.csv')
cont_comments = pd.read_csv(r'data/small_cont_comments.csv')

## Create Corpora - one for the categorized text, the other for controversial

In [45]:
cat_comments_only = cat_comments['txt']
cont_comments_only = cont_comments['txt']
cat_corpus = cat_comments_only.tolist()
cont_corpus = cont_comments_only.tolist()

# Show a little of one of the corpora unfiltered
print(cat_corpus[1:5])

["What precious thing did you want to post that you can't?\n\n", 'Freeney sack@!\n\nGood job old timer!!', "Don't blow it....keep it simple.... count your money ", "Which platform? When? Can't play till Friday probably. Even if I have the freaking season pass :("]


Notice that the corpus is still intact; it includes stop words and punctuation - and even newline sequences. This can be wasteful depending on your objectives. In this case, stop words (e.g., 'the') can be removed since they don't indicate caegories or controversy (intuitively).

In [46]:
def clean_corpus(corpus):
    word_tokens = []
    for sentence in corpus:
        word_tokens.append(word_tokenize(sentence))

    filtered_sentences = []
    for tokenized_sentence in word_tokens:
        filtered_sentence = []
        for word in tokenized_sentence:
            if word not in stop_words:
                filtered_sentence.append(word)
        filtered_sentences.append(filtered_sentence)

    concat_sent = ''
    final_corpus = []
    for filtered_word in filtered_sentences:
        for element in filtered_word:
            concat_sent += str(element)
            concat_sent += ' '
        final_corpus.append(concat_sent)
    
    return final_corpus
    
stop_words = set(stopwords.words('english')) 

cat_corpus_clean = clean_corpus(cat_corpus)
cont_corpus_clean = clean_corpus(cont_corpus)

# Show a little of the filtered corpus
print(cat_corpus_clean[1:5])

["I got ta say , Nintendo knocked park getting games like lined fill gap Zelda Mario What precious thing want post ca n't ? ", "I got ta say , Nintendo knocked park getting games like lined fill gap Zelda Mario What precious thing want post ca n't ? Freeney sack @ ! Good job old timer ! ! ", "I got ta say , Nintendo knocked park getting games like lined fill gap Zelda Mario What precious thing want post ca n't ? Freeney sack @ ! Good job old timer ! ! Do n't blow ... .keep simple ... . count money ", "I got ta say , Nintendo knocked park getting games like lined fill gap Zelda Mario What precious thing want post ca n't ? Freeney sack @ ! Good job old timer ! ! Do n't blow ... .keep simple ... . count money Which platform ? When ? Ca n't play till Friday probably . Even I freaking season pass : ( "]


In [81]:
cat_vectorizer = TfidfVectorizer()
cont_vectorizer = TfidfVectorizer()
cat_vector = cat_vectorizer.fit_transform(cat_corpus_clean)
cont_vector = cont_vectorizer.fit_transform(cont_corpus_clean)

cat_features = cat_vector.toarray()
cont_features = cont_vector.toarray()
cat_target = cat_comments['cat']
cont_target = cont_comments['con']

accuracy_df = pd.DataFrame(columns=['Model', 'Data Set', 'Accuracy(Train)', 'Accuracy(Test)'])

def createClassifier(model_txt, data_txt, target, features, test_size, classifier):
    features_train, features_test, target_train, target_test = \
        train_test_split(features, target, test_size=test_size)

    model = classifier.fit(features_train, target_train)
    test_predictions = model.predict(features_test)
    train_predictions = model.predict(features_train)

    accuracy_test = accuracy_score(target_test, test_predictions)
    accuracy_train = accuracy_score(target_train, train_predictions)

    new_df = accuracy_df.append({'Model': model_txt, 'Data Set':data_txt, 'Accuracy(Train)':accuracy_train,
                        'Accuracy(Test)':accuracy_test}, ignore_index=True)
    print(new_df)
    return model

With the reusable function created, call it for each variation, for each data set.

In [82]:
classifier = LogisticRegression(random_state=0, penalty='l1', solver='liblinear')
model_cont_lr_l1 = createClassifier('LR (L1)', 'Controversy', cont_target, cont_features, 0.25, classifier)

classifier = LogisticRegression(random_state=0, penalty='l2', solver='liblinear')
model_cont_lr_l2 = createClassifier('LR (L2)', 'Controversy', cont_target, cont_features, 0.25, classifier)

classifier = MultinomialNB(class_prior=[0.25, 0.5])
model_cont_nb = createClassifier('NB     ', 'Controversy', cont_target, cont_features, 0.25, classifier)

classifier = LogisticRegression(random_state=0, penalty='l1', solver='liblinear', multi_class='ovr')
model_cat_lr_l1 = createClassifier('LR (L1)', 'Categories', cat_target, cat_features, 0.25, classifier)

classifier = LogisticRegression(random_state=0, penalty='l2', solver='liblinear', multi_class='ovr')
model_cat_lr_l2 = createClassifier('LRn (L2)', 'Categories', cat_target, cat_features, 0.25, classifier)

classifier = MultinomialNB(class_prior=[0.25, 0.25, .25, .25])
model_cat_nb = createClassifier('NB      ', 'Categories', cat_target, cat_features, 0.25, classifier)

     Model     Data Set  Accuracy(Train)  Accuracy(Test)
0  LR (L1)  Controversy         0.958667           0.956
     Model     Data Set  Accuracy(Train)  Accuracy(Test)
0  LR (L2)  Controversy         0.957333            0.96
     Model     Data Set  Accuracy(Train)  Accuracy(Test)
0  NB       Controversy            0.956           0.964
     Model    Data Set  Accuracy(Train)  Accuracy(Test)
0  LR (L1)  Categories         0.426667           0.468
      Model    Data Set  Accuracy(Train)  Accuracy(Test)
0  LRn (L2)  Categories             0.46           0.392
      Model    Data Set  Accuracy(Train)  Accuracy(Test)
0  NB        Categories            0.408            0.48


## Out of Sample Predictions

Putting this in context, this model can be used as the 'engine' to make predictions based on new data. Taking a step back, in theory, the corpus and prediction can be anything you have the right data for - in this case, the data was great because every category comment had a category and every controversial commenet was noted as such. Without that level of detail, this engine would not be possible because you could not train a model as shown.

Here I'll take the same comment and run it through both the category and controvery models to see if works.

In [98]:
#the_new_comment = np.array(word_tokenize("I think the Wonkaland football team really sucks."))
the_new_comment = "I think the Wondaland football team really sucks"

print(the_new_comment)
foo = np.array(the_new_comment)

model_cont_lr_l1.predict(foo.reshape(1,-1))

I think the Wondaland football team really sucks


ValueError: X has 1 features per sample; expecting 5327