## Naive Bayes

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

The goal is to classify each document into the correct newsgroup using only the text of the document. To keep things simple, we use just four newsgroups: `alt.atheism`, `soc.religion.christian`, `comp.graphics` and `sci.med`.

In [None]:
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix

## Getting the data
The following code will download training and test sets containing the documents. It might take a little bit of time to fetch the data!

In [None]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=0)

twenty_test = fetch_20newsgroups(subset='test',
                                 categories=categories,
                                 shuffle=True,
                                 random_state=0)

twenty_train.target_names

In [None]:
print(len(twenty_train.target))
print("First target is", twenty_train.target_names[twenty_train.target[0]])
print("First document is", twenty_train.data[0])

## Feature extraction
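For each training document, we count the number of occurrences of each word (*term*) and use these counts to build a **document-term** matrix, with one row per document and one column per term in the vocabulary. Each entry is the number of times that term occurs in that document. Because most terms are absent from most documents, the matrix is stored in a sparse format.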
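
To make the layout of this matrix concrete, here is a minimal sketch that builds the same kind of count matrix by hand using only the standard library. The three-sentence corpus is made up for illustration, and the whitespace tokenizer is a simplification: `CountVectorizer` by default also lowercases, but it splits on a regular expression and drops single-character tokens, so its vocabulary can differ from this one.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from 20 Newsgroups).
docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cat and dog",
]

# Simplified tokenization: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary maps each distinct term to a column index (alphabetical order).
all_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary = {term: j for j, term in enumerate(all_terms)}

# One row per document, one column per term; each entry is a raw count.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts.get(term, 0) for term in all_terms])

print(vocabulary)
for row in matrix:
    print(row)
```

Row `i`, column `j` of `matrix` plays the same role as `X_train_counts[i, j]`: the count of term `j` in document `i`.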

In [None]:
count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(twenty_train.data)
print(X_train_counts.shape)
print(type(X_train_counts))

In [None]:
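# Count for a single (document, term) pair: row 0 is the first document,
# column 26895 is the index of one term in the fitted vocabulary.
print(X_train_counts[0, 26895])
# Printing a sparse matrix lists only its non-zero entries.
print(X_train_counts)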

## Fitting the model
Let's now fit a naive Bayes model to the training data. For each newsgroup, the model estimates how likely each term is to appear, and it uses these estimates to predict the newsgroup of a document from its term counts.
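
As a rough illustration of what a multinomial naive Bayes classifier computes, the sketch below estimates class priors and smoothed per-class term probabilities from a tiny count matrix, then scores a new document in log space. The vocabulary, counts and labels are invented for this example, and the helper names `fit` and `predict` are hypothetical; this is a simplified version of the idea, not the scikit-learn implementation.

```python
import math

# Hypothetical toy data: term counts over a 3-term vocabulary, one row per document.
vocabulary = ["bible", "opengl", "glucose"]
X = [
    [3, 0, 0],  # class 0
    [2, 1, 0],  # class 0
    [0, 4, 1],  # class 1
]
y = [0, 0, 1]
alpha = 1.0  # Laplace (add-one) smoothing, so unseen terms never get probability 0


def fit(X, y, alpha):
    """Estimate log P(class) and log P(term | class) from count data."""
    n_terms = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]  # per-term counts within class c
        denom = sum(totals) + alpha * n_terms
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(x, log_prior, log_likelihood):
    """Return the class with the highest log prior plus count-weighted log likelihood."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


log_prior, log_likelihood = fit(X, y, alpha)
print(predict([1, 0, 0], log_prior, log_likelihood))  # one occurrence of "bible"
print(predict([0, 2, 1], log_prior, log_likelihood))  # "opengl" twice, "glucose" once
```

The "naive" part is the sum over terms: each term contributes to the score independently of the others, given the class.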

In [None]:
model = MultinomialNB().fit(X_train_counts, twenty_train.target)

## Model evaluation

In [None]:
docs_new = ['Reading the Bible',
            'OpenGL vs DirectX',
            'Diabetes and glucose',
            'Humanists in the Republican Party']
X_new_counts = count_vectorizer.transform(docs_new)

predicted = model.predict(X_new_counts)

for doc, category in zip(docs_new, predicted):
    print('{0} => {1}'.format(doc, twenty_train.target_names[category]))

In [None]:
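docs_test = twenty_test.data
# Reuse the vocabulary learned from the training set: transform, not fit_transform.
X_test_counts = count_vectorizer.transform(docs_test)
predicted = model.predict(X_test_counts)
# Accuracy is the fraction of test documents whose predicted label matches the true label.
print("Test set accuracy is", np.mean(predicted == twenty_test.target))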

In [None]:
print(classification_report(twenty_test.target,
                            predicted,
                            target_names=twenty_test.target_names))

In [None]:
print(confusion_matrix(twenty_test.target, predicted))
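
To help read the confusion matrix and the classification report, the following sketch computes accuracy, precision and recall by hand from a small confusion matrix. The labels are invented for illustration and are not results from the newsgroup model. As in `confusion_matrix`, rows correspond to true classes and columns to predicted classes.

```python
# Hypothetical true and predicted labels for a 3-class problem.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
n_classes = 3

# cm[i][j] counts documents whose true class is i and predicted class is j.
cm = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

# Correct predictions sit on the diagonal.
accuracy = sum(cm[c][c] for c in range(n_classes)) / len(y_true)

precision, recall = [], []
for c in range(n_classes):
    predicted_as_c = sum(cm[i][c] for i in range(n_classes))  # column sum
    truly_c = sum(cm[c])                                      # row sum
    precision.append(cm[c][c] / predicted_as_c)
    recall.append(cm[c][c] / truly_c)

for row in cm:
    print(row)
print("accuracy:", accuracy)
print("precision per class:", precision)
print("recall per class:", recall)
```

Precision for a class asks how many of the documents assigned to it really belong there (column view), while recall asks how many of the documents that belong there were found (row view).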