# Text Classification

*Based on content by [Vilja Hulden](https://programminghistorian.org/en/lessons/naive-bayesian), [Jake VanderPlas](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.05-Naive-Bayes.ipynb), and [Jacob Eisenstein](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/slides/ch02-linear-classification.pdf).*

We begin with some standard imports:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # fancier viz

## Wrapping Our Heads Around Generative Models

### Gaussian Naive Bayes

Perhaps the easiest naive Bayes classifier to understand is Gaussian naive Bayes.
In this classifier, the assumption is that *data from each label is drawn from a simple Gaussian distribution*.
Imagine that you have the following data:

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu');

One extremely fast way to create a simple model is to assume that the data is described by a Gaussian distribution with no covariance between dimensions.
This model can be fit by simply finding the mean and standard deviation of the points within each label, which is all you need to define such a distribution.
The result of this naive Gaussian assumption is shown in the following figure:

![(run code in Appendix to generate image)](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/05.05-gaussian-NB.png?raw=1)

The ellipses here represent the Gaussian generative model for each label, with larger probability toward the center of the ellipses.
With this generative model in place for each class, we have a simple recipe to compute the likelihood $P({\rm features}~|~L_1)$ for any data point, and thus we can quickly compute the posterior ratio and determine which label is the most probable for a given point.
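To make this recipe concrete, here is a minimal sketch of the same idea written with NumPy alone. It assumes the `make_blobs` data generated above; the names `means`, `stds`, `log_joint`, `posterior`, and `predict_by_hand` are illustrative choices for this sketch and are not part of Scikit-Learn.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same toy data as in the cell above
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)

classes = np.unique(y)
# Per-label mean and standard deviation of each feature (no covariance terms)
means = np.array([X[y == k].mean(axis=0) for k in classes])
stds = np.array([X[y == k].std(axis=0) for k in classes])
# Prior P(L_k), estimated as the fraction of training points with that label
log_priors = np.log(np.array([np.mean(y == k) for k in classes]))

def log_joint(points):
    """Return log P(features | L_k) + log P(L_k) for every point and label."""
    points = np.asarray(points, dtype=float)[:, None, :]  # shape (n, 1, d)
    log_pdf = (-0.5 * np.log(2 * np.pi * stds ** 2)
               - (points - means) ** 2 / (2 * stds ** 2))  # shape (n, k, d)
    # "Naive" step: features are independent, so log-densities add up
    return log_pdf.sum(axis=2) + log_priors

def posterior(points):
    """Normalize the joint scores into posterior probabilities per label."""
    scores = log_joint(points)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

def predict_by_hand(points):
    """Pick the label with the largest posterior."""
    return classes[np.argmax(log_joint(points), axis=1)]
```

Calling `posterior(X[:3])` gives one row of label probabilities per point, and `predict_by_hand(X)` gives the most probable label for each.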

This procedure is implemented in Scikit-Learn's ``sklearn.naive_bayes.GaussianNB`` estimator:

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);

Now let's generate some new data and predict the label:

In [None]:
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model.predict(Xnew)

Now we can plot this new data to get an idea of where the decision boundary is:

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim);

We see a slightly curved boundary in the classifications—in general, the boundary in Gaussian naive Bayes is quadratic.
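To see where the quadratic shape comes from, one can write out the log of the posterior ratio under the Gaussian assumption. The notation is an assumption of this sketch: $x_d$ is the $d$-th feature, and $\mu_{kd}$ and $\sigma_{kd}$ are the mean and standard deviation of feature $d$ within label $L_k$.

$$
\log\frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})}
= \log\frac{P(L_1)}{P(L_2)}
+ \sum_d \left[ \log\frac{\sigma_{2d}}{\sigma_{1d}}
- \frac{(x_d - \mu_{1d})^2}{2\sigma_{1d}^2}
+ \frac{(x_d - \mu_{2d})^2}{2\sigma_{2d}^2} \right]
$$

The decision boundary is the set of points where this expression equals zero. Every term is at most quadratic in $x_d$, so the boundary is a quadratic curve; if the two labels happen to share the same $\sigma$ in every dimension, the squared terms cancel and the boundary becomes a straight line.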

Of course, the final classification will only be as good as the model assumptions that lead to it, which is why Gaussian naive Bayes often does not produce very good results.

Still, in many cases—especially as the number of features becomes large—this assumption is not detrimental enough to prevent Gaussian naive Bayes from being a useful method.

## Multinomial Naive Bayes

The Gaussian assumption just described is by no means the only simple assumption that could be used to specify the generative distribution for each label. Another useful example is multinomial naive Bayes, where the features are assumed to be generated from a simple multinomial distribution. The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent word counts (or other frequencies within the documents to be classified).

The idea is precisely the same as before, except that instead of modeling the data distribution with the best-fit Gaussian, we model the data distribution with a best-fit multinomial distribution.
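As a rough illustration of what "best-fit multinomial" means, the sketch below fits per-label word probabilities to a tiny, made-up count matrix. The vocabulary, counts, labels, and the helper `predict_counts` are all hypothetical; the add-one smoothing (`alpha = 1.0`) matches the default of Scikit-Learn's `MultinomialNB`, but the code is not that estimator's implementation.

```python
import numpy as np

# Hypothetical toy data: rows are documents, columns are word counts
vocab = ['engine', 'orbit', 'pixel']
counts = np.array([[3, 0, 0],
                   [2, 1, 0],
                   [0, 4, 1],
                   [0, 3, 0]])
labels = np.array([0, 0, 1, 1])  # 0 = "autos", 1 = "space" (made-up labels)

alpha = 1.0  # add-one (Laplace) smoothing so unseen words keep nonzero probability
classes = np.unique(labels)

# Total count of each word within each label
class_counts = np.array([counts[labels == k].sum(axis=0) for k in classes])
# Smoothed per-label word probabilities: the fitted multinomial parameters
word_log_prob = np.log((class_counts + alpha)
                       / (class_counts + alpha).sum(axis=1, keepdims=True))
log_prior = np.log(np.array([np.mean(labels == k) for k in classes]))

def predict_counts(doc_counts):
    """Score a count vector under each label and return the best label.

    The multinomial coefficient is omitted because it is the same for
    every label and so does not affect the comparison.
    """
    scores = log_prior + np.asarray(doc_counts) @ word_log_prob.T
    return classes[np.argmax(scores, axis=-1)]
```

For example, `predict_counts([2, 0, 0])` scores a document containing "engine" twice, and `predict_counts([0, 2, 0])` one containing "orbit" twice.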

### Downloading the data ###
Here we will use the sparse word count features from the 20 Newsgroups corpus that is part of scikit-learn to show how we might classify some documents into categories.

Let's download the data and take a look at the possible categories:

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names # gets us the names of the newsgroups

For simplicity here, we will select just a few of these categories, and download the training and testing set built into this data package:

In [None]:
categories = ['rec.autos', 'rec.motorcycles',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories) # quickly make a train dataset
test = fetch_20newsgroups(subset='test', categories=categories) # quickly make a test dataset

Here is a representative entry from the data:

In [None]:
print(train.data[5])

To use these data in our classifier, we need to convert the content of each string into a vector of numbers. We can either use a simple bag-of-words (BoW) model, which records how many times each word occurs in a document, or a weighting scheme such as TF-IDF, which scales those counts down for words that appear in many documents.
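The difference between the two representations is easier to see on a small example. The sketch below uses a made-up three-sentence corpus (the sentences and variable names are hypothetical) and Scikit-Learn's `CountVectorizer` and `TfidfVectorizer` with their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus for illustration only
docs = ['the rocket reached orbit',
        'the car needs a new engine',
        'the engine of the rocket']

# Bag of words: each column is a vocabulary word, each entry a raw count
bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs)      # sparse matrix, one row per document

# TF-IDF: same columns, but counts are reweighted and each row is normalized
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)

# vocabulary_ maps each word to its column index
the_col = tfidf.vocabulary_['the']
orbit_col = tfidf.vocabulary_['orbit']

# In the first document both words occur once, but "the" appears in every
# document, so TF-IDF gives it a smaller weight than the rarer word "orbit"
print(bow_matrix.toarray())
print(tfidf_matrix[0, the_col], tfidf_matrix[0, orbit_col])
```

Note that the default tokenizer drops single-character tokens such as "a", so they never enter the vocabulary.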

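For this classifier we will use the TF-IDF vectorizer and create a pipeline (a scikit-learn object that chains processing steps so that the output of one becomes the input of the next) that attaches it to a multinomial naive Bayes classifier. Strictly speaking, the multinomial model expects counts, but fractional TF-IDF weights also work well in practice: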

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

With this pipeline, we can apply the model to the training data, and predict labels for the test data:

In [None]:
model.fit(train.data, train.target)
labels = model.predict(test.data)
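To show what the pipeline does behind the scenes, here is a small sketch that runs the two steps by hand and compares the result with the pipeline. The toy documents and targets are invented for this example so that it runs without downloading anything.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 0 = cars, 1 = space
toy_docs = ['the engine stalled on the highway',
            'new tires and a loud engine',
            'the shuttle reached orbit',
            'a satellite launch into orbit']
toy_targets = [0, 0, 1, 1]
new_docs = ['engine trouble again', 'orbit of the satellite']

# With a pipeline: one fit() and one predict() on raw strings
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(toy_docs, toy_targets)
pipe_pred = pipe.predict(new_docs)

# By hand: fit the vectorizer on training text, then reuse it for new text
vec = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(toy_docs), toy_targets)
manual_pred = clf.predict(vec.transform(new_docs))  # transform(), not fit_transform()

print(pipe_pred, manual_pred)
```

The main convenience of the pipeline is that it guarantees new text is transformed with the vocabulary learned from the training data, which is easy to get wrong when the steps are managed separately.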

Now that we have predicted the labels for the test data, we can evaluate them to learn about the performance of the estimator.
For example, here is the confusion matrix between the true and predicted labels for the test data:

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

Evidently, even this very simple classifier can separate space talk from computer talk (not surprising), as well as car talk from motorcycle talk (which I'd guess would be harder). 
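If the layout of the heatmap is unclear, a tiny example may help. The labels below are made up; the point is only to show that `confusion_matrix` puts true labels on the rows and predicted labels on the columns, which is why the plotting code above transposes the matrix to get true labels on the x-axis.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six documents in three classes
true_labels = [0, 0, 1, 1, 1, 2]
pred_labels = [0, 1, 1, 1, 2, 2]

# Entry [i, j] counts documents whose true label is i and predicted label is j
toy_mat = confusion_matrix(true_labels, pred_labels)
print(toy_mat)

# Correct guesses sit on the diagonal, so accuracy is the trace over the total
toy_accuracy = np.trace(toy_mat) / toy_mat.sum()
print(toy_accuracy)
```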

The very cool thing here is that we now have the tools to determine the category for *any* string, using the ``predict()`` method of this pipeline.

Here's a quick utility function that will return the prediction for a single string:

In [None]:
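def predict_category(s, train=train, model=model):
    """Return the newsgroup name that the fitted pipeline predicts for string s."""
    pred = model.predict([s])  # predict() expects a list of documents
    return train.target_names[pred[0]]  # map the integer label back to its name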

Let's try it out:

In [None]:
predict_category('sending a payload to the ISS')

In [None]:
predict_category('discussing whether or not to wear a helmet')

In [None]:
predict_category('determining the screen resolution')

In [None]:
predict_category('YOU TYPE SOMETHING IN!')

Remember that this is nothing more sophisticated than a simple probability model for the (weighted) frequency of each word in the string; nevertheless, the result is striking.
Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective.

## Measures of classification ##

The basic measure of classification prowess is accuracy: how often did the classifier guess the class of a document correctly? This is calculated by simply dividing the number of correct guesses by the total number of documents considered.

Often, though, we’re interested in a specific category: for example, how well the classifier did with respect to the “comp.graphics” category above.

If we’re considering a single category, we need a few more numbers: first, how many trials belonging to the category (say, “comp.graphics”) there are in our test sample; second, how many times we’ve guessed that a trial belongs to the “comp.graphics” category; and third, how many times we have guessed correctly that a trial belongs to “comp.graphics.”

With this information, we can calculate two standard measures of classification performance: precision and recall. Precision tells us what proportion of the trials we guessed to be in the “comp.graphics” category really were in it (the third number divided by the second). Recall tells us what proportion of the actual “comp.graphics” trials we caught (the third number divided by the first). Scikit-learn reports both for every category:

In [None]:
from sklearn.metrics import classification_report
report = classification_report(test.target, labels, target_names=train.target_names)

print(report) # f1-score is harmonic mean of precision and recall
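The three counts described above are enough to compute these measures by hand. The sketch below does so with NumPy for one category of a small, made-up set of labels; the arrays and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical true and predicted labels for eight trials
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])
pred_labels = np.array([0, 0, 1, 1, 0, 0, 2, 2])
category = 0  # the single category we are interested in

# The three numbers from the text
n_in_category = np.sum(true_labels == category)   # trials truly in the category
n_guessed = np.sum(pred_labels == category)       # trials we guessed were in it
n_correct = np.sum((true_labels == category) & (pred_labels == category))

precision = n_correct / n_guessed      # how many of our guesses were right
recall = n_correct / n_in_category     # how many of the true members we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# Accuracy looks at all categories at once
accuracy = np.mean(true_labels == pred_labels)

print(precision, recall, f1, accuracy)
```

Here precision and recall differ because the classifier guessed category 0 more often than it actually occurred, which costs precision without costing recall.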


## The uses of outliers ##

TK