# Building Text Classifiers

Frank Neugebauer
October 1, 2019

The objective of this project is to demonstrate the accuracy of different text classifiers in Python. To do that, it uses corpora from Reddit that contain categorized and controversial comments.

Some of what's demonstrated:
* Reading JSON files
* Sampling to increase performance
* Tokenization
* Creating vectors as features
* Logistic regression with different penalties
* Multinomial Naive Bayes

First, import everything that's needed.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Loading and Sampling Data

The next step is to load the data. Because the full files are massive, I load them once and take a random sample of 1,000 rows from each to avoid processing problems. Each sample is then saved as a separate CSV file.

Only the small samples have been uploaded, so do not run this (unless, of course, you have the full JSON files noted in the code).

In [3]:
#cat_comments = pd.read_json('data/categorized-comments.jsonl', lines=True)
#cont_comments = pd.read_json('data/controversial-comments.jsonl', lines=True)

#small_cat_comments = cat_comments.sample(n=1000)
#small_cont_comments = cont_comments.sample(n=1000)

#small_cat_comments.to_csv(r'data/small_cat_comments.csv')
#small_cont_comments.to_csv(r'data/small_cont_comments.csv')
print("NO NEED TO RUN THIS, JUST LOAD THE CSV(s) IN THE NEXT BLOCK!")

NO NEED TO RUN THIS, JUST LOAD THE CSV(s) IN THE NEXT BLOCK!
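
As written, the sampling step draws different rows each time it runs. The sketch below shows one way to make the sample repeatable and to keep the saved CSV tidy. It is an illustration only: the in-memory JSON Lines text stands in for the real files, and `random_state=42` and `index=False` are my assumptions, not settings used in the original code.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real JSON Lines file: one JSON object per line.
jsonl_text = "\n".join(
    '{"cat": "sports", "txt": "comment %d"}' % i for i in range(50)
)
toy_comments = pd.read_json(io.StringIO(jsonl_text), lines=True)

# random_state fixes the random draw, so the same rows come back on every run.
sample_a = toy_comments.sample(n=10, random_state=42)
sample_b = toy_comments.sample(n=10, random_state=42)
print("Same rows both times:", sample_a.index.equals(sample_b.index))

# index=False keeps the old row index out of the CSV, so reloading it does not
# create an extra "Unnamed: 0" column.
csv_text = sample_a.to_csv(index=False)
reloaded = pd.read_csv(io.StringIO(csv_text))
print("Columns after reload:", list(reloaded.columns))
```

With the real data, the same two arguments would go on the `sample` and `to_csv` calls in the cell above.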


In order to avoid loading the entire data set each time, this code block independently loads the sample CSV files. This means that the previous step can be skipped every time except the first time (or any time you change the sample size).

This code should always work because at a minimum, the initial 1,000 observation files should be there (e.g., `small_cat_comments.csv`).

In [4]:
cat_comments = pd.read_csv(r'data/small_cat_comments.csv')
#cont_comments = pd.read_csv(r'data/small_cont_comments.csv')

## Set the Target and Features

Within the categorical data set, the target is the `cat` (category) column and the feature (there's only one) is the `txt` column.

In [8]:
cat_target = cat_comments['cat']
cat_features = cat_comments['txt']

# Show a little of the features and target
print("Target sample:\n", cat_target[0:5], "\n")
print("Feature sample:\n", cat_features[0:5])

Target sample:
 0               video_games
1    science_and_technology
2                    sports
3                    sports
4               video_games
Name: cat, dtype: object 

Feature sample:
 0    I gotta say, Nintendo knocked it out of the pa...
1    What precious thing did you want to post that ...
2               Freeney sack@!\n\nGood job old timer!!
3    Don't blow it....keep it simple.... count your...
4    Which platform? When? Can't play till Friday p...
Name: txt, dtype: object


## Split to Train and Test Sets

The train set should be bigger since it's used to build the model, whereas the test set is used to validate the model with new data. For that reason, the test data is only 25% of the full set.

In [10]:
features_train, features_test, target_train, target_test = \
        train_test_split(cat_features, cat_target, test_size=0.25)
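
The split above is random, so the accuracy reported later will vary from run to run. A minimal sketch of a repeatable split that also keeps the category proportions equal in both sets is shown below. The toy data is invented, and `random_state=0` and `stratify` are my additions rather than part of the original cell.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 sports comments and 4 video_games comments.
toy_texts = ["comment %d" % i for i in range(12)]
toy_labels = ["sports"] * 8 + ["video_games"] * 4

# stratify keeps the 2:1 category ratio in both the train and the test set;
# random_state makes the shuffle repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

print("Train size:", len(x_train), "Test size:", len(x_test))
print("Test categories:", Counter(y_test))
```

Stratifying matters most with a small sample like this one, where a purely random split can leave a rare category almost absent from the test set.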

## Create the Model and Predictions

A pipeline is built from a `TfidfVectorizer` and a `MultinomialNB` (Naive Bayes) object. The pipeline first turns the words into a vector of numbers so that Naive Bayes can then be used to predict the category.

In plain English, the sentences are broken down into words, which are then converted to numeric weights. From the training data, the algorithm learns how strongly each word is associated with each category. New sentences are converted to numbers in the same way, and the category that is most probable given those words is chosen as the prediction.

In [23]:
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_model.fit(features_train, target_train)
nb_test_predictions = nb_model.predict(features_test)

print("Small sample of the categorical predictions:", nb_test_predictions[0:5])

Small sample of the categorical predictions: ['video_games' 'video_games' 'video_games' 'video_games' 'sports']
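
To make the two pipeline stages more concrete, here is a small sketch on a made-up corpus. The four sentences and their labels are invented for illustration, and the step names `tfidfvectorizer` and `multinomialnb` are the lowercase class names that `make_pipeline` assigns automatically.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, invented for illustration only.
toy_texts = [
    "great touchdown in the football game",
    "the quarterback threw a touchdown",
    "new console game released today",
    "the console controller feels great",
]
toy_labels = ["sports", "sports", "video_games", "video_games"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# Step 1: the vectorizer learns a vocabulary and turns each text into one row
# of TF-IDF weights, with one column per vocabulary word.
vectorizer = toy_model.named_steps["tfidfvectorizer"]
weights = vectorizer.transform(toy_texts)
print("Vocabulary size:", len(vectorizer.vocabulary_))
print("Feature matrix shape:", weights.shape)

# Step 2: Naive Bayes receives the same kind of row for a new text and
# predicts the category it considers most probable.
toy_prediction = toy_model.predict(["touchdown for the quarterback"])
print("Prediction:", toy_prediction)
```

The feature matrix has one row per comment and one column per word, which is why a large vocabulary combined with only 1,000 comments leaves most words with very little evidence behind them.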


## Accuracy

Knowing how accurate the model is on the test set matters greatly, because it indicates how well the model will perform in the "real world." The lower the accuracy, the less likely the model is to predict a new sample correctly. Unfortunately, the Naive Bayes model did not predict well, likely for several reasons, including:

* The sample size is very small
* The text may require more tweaking (e.g., removing stop words and punctuation)
* Yeah, I think that's all...
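
The text tweaks in the second bullet can be tried without writing separate cleaning code. The sketch below compares the default `TfidfVectorizer` with one that uses scikit-learn's built-in English stop word list. The comment is invented, and whether this actually improves accuracy on the Reddit sample is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical comment, invented for illustration only.
toy_comment = ["The game was amazing, and the crowd loved it!"]

# Default settings: text is lowercased and punctuation is dropped.
default_vectorizer = TfidfVectorizer()
default_vectorizer.fit(toy_comment)
default_tokens = sorted(default_vectorizer.vocabulary_)
print("Default tokens: ", default_tokens)

# stop_words="english" additionally removes very common words.
filtered_vectorizer = TfidfVectorizer(stop_words="english")
filtered_vectorizer.fit(toy_comment)
filtered_tokens = sorted(filtered_vectorizer.vocabulary_)
print("Filtered tokens:", filtered_tokens)
```

Because the default tokenizer already lowercases the text and ignores punctuation, stop word removal is the only tweak from the list that changes the features here.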

In [15]:
cat_accuracy_test = accuracy_score(target_test, nb_test_predictions)
print("The accuracy for the categorical test is:", cat_accuracy_test)

The accuracy for the categorical test is: 0.496
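
An accuracy figure is easier to judge next to a baseline. The sketch below computes accuracy by hand (correct predictions divided by total predictions) and compares it with always guessing the most common category. The labels and predictions are invented for illustration and are not taken from the Reddit sample.

```python
from collections import Counter

# Hypothetical true labels and predictions, invented for illustration only.
actual = ["sports", "video_games", "sports", "news", "video_games", "video_games"]
predicted = ["sports", "video_games", "video_games", "video_games", "video_games", "sports"]

# Accuracy is the share of predictions that match the true label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print("Accuracy:", accuracy)

# Baseline: always predict whichever category is most common.
most_common_label, most_common_count = Counter(actual).most_common(1)[0]
baseline_accuracy = most_common_count / len(actual)
print("Baseline (always '%s'):" % most_common_label, baseline_accuracy)
```

In this toy case the model is no better than the baseline. Running the same comparison on `target_test` would show how much the 0.496 above really says about the model.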


## Out of Sample Predictions

Putting this in context, this model can be used as the 'engine' to make predictions on new data. Taking a step back, in theory, the corpus and prediction can be anything you have the right data for. In this case, the data was great because every categorized comment was labeled with its category and every controversial comment was flagged as such. Without those labels, this engine would not be possible because you could not train a model as shown.

Here I'll run two new comments, one easy and one hard, through the category model to see if it works.

In [20]:
new_data_easy = "I love football!"
new_data_hard = "I really love this game. Football is amazing!"

pred = nb_model.predict([new_data_easy])
print("The easy predicted category is:", pred)

pred = nb_model.predict([new_data_hard])
print("The hard predicted category is:", pred)

The easy predicted category is: ['sports']
The hard predicted category is: ['video_games']
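
A single predicted label hides how close the decision was. The sketch below uses `predict_proba` to rank every category for a comment. The toy corpus is invented, and `ranked_categories` is a hypothetical helper written for this example, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, invented for illustration only.
demo_texts = [
    "football season starts this weekend",
    "the team won the football match",
    "this game has amazing graphics",
    "love this game on my console",
]
demo_labels = ["sports", "sports", "video_games", "video_games"]

demo_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
demo_model.fit(demo_texts, demo_labels)


def ranked_categories(model, text):
    """Return (category, probability) pairs, most probable first."""
    classes = model.named_steps["multinomialnb"].classes_
    probabilities = model.predict_proba([text])[0]
    return sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)


hard_text = "I really love this game. Football is amazing!"
ranking = ranked_categories(demo_model, hard_text)
for category, probability in ranking:
    print(category, round(float(probability), 3))
```

Applied to `nb_model`, the same helper would show whether `sports` was a close second for the "hard" comment or not in contention at all.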


## Observations

The challenge (in particular) with the "hard" new data is that both `game` and `football` appear in the text. More samples should help with this problem, and it illustrates the fundamental nature of these kinds of models: they require data.

If this were a production application, the new sample (i.e., txt="I really love this game. Football is amazing!" cat="sports") would be added to the corpus so that the next time the model is trained, it has a better chance of classifying similar comments correctly.
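
A minimal sketch of that feedback loop follows, assuming the corpus lives in a DataFrame with the same `cat` and `txt` columns used earlier. The three starting rows are invented, and a single added row is not guaranteed to change any particular prediction.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical starting corpus, invented for illustration only.
corpus = pd.DataFrame({
    "cat": ["sports", "video_games", "video_games"],
    "txt": [
        "the team won the football match",
        "this game has amazing graphics",
        "love this game on my console",
    ],
})

# The corrected sample, labeled by hand with the right category.
new_rows = pd.DataFrame({
    "cat": ["sports"],
    "txt": ["I really love this game. Football is amazing!"],
})

# Append the new row, then retrain from scratch on the updated corpus.
updated_corpus = pd.concat([corpus, new_rows], ignore_index=True)
retrained_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
retrained_model.fit(updated_corpus["txt"], updated_corpus["cat"])

print("Corpus size before and after:", len(corpus), len(updated_corpus))
```

Retraining from scratch keeps the vocabulary and the word weights consistent with the full corpus, which is simpler than trying to update the fitted pipeline in place.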