# Naive-Bayes Classification

In this lab, we consider the 20 newsgroups text dataset from [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

In [None]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint

import numpy as np
import matplotlib.pyplot as plt
%matplotlib

# Training set
cats = ['rec.sport.baseball', 'sci.electronics', 'misc.forsale']
train_data = fetch_20newsgroups(subset='train', categories=cats)

## Getting to know your data

In [None]:
print(train_data.target, len(train_data.target))
print(train_data.target[0:10])
print(train_data.target_names)

In [None]:
print(train_data.target_names[train_data.target[0]])
print()
print(train_data.data[0])

### Data Preprocessing

Using the [text feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) modules of scikit-learn, *the machine learning library for Python*, build a **document-term** matrix for the training data you just retrieved (see the *Data-Types* lecture).

In [None]:
# Write your code here

Modify it so that only words that appear in at least 10% of the documents in the corpus are kept.
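
One possible way to express such a threshold is the `min_df` parameter of `CountVectorizer`: a float is read as a proportion of documents, an integer as an absolute count. The sketch below uses a made-up four-document corpus and a threshold of 50% so that the effect is visible; for the lab itself the corresponding value would be `min_df=0.1`. All variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "ball" is in 3 documents, "for" and "sale" in 2.
toy_corpus = [
    "ball game tonight",
    "ball and bat",
    "ball for sale",
    "circuit board for sale",
]

# Keep only terms that occur in at least half of the documents.
filtered_vectorizer = CountVectorizer(min_df=0.5)
filtered_dtm = filtered_vectorizer.fit_transform(toy_corpus)
kept_terms = sorted(filtered_vectorizer.vocabulary_)

print(kept_terms)
print(filtered_dtm.shape)
```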

In [None]:
# Write your code here

Use this matrix to compute the relative frequency of each term. Then sort the resulting array from the most frequent term to the least frequent one. Note that in this case we work with `scipy.sparse` matrices and `numpy.matrix` objects rather than plain NumPy arrays.
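
The sparse-matrix caveat can be sketched as follows. Summing a `scipy.sparse` matrix over an axis returns a two-dimensional `numpy.matrix`, so the sketch flattens it with `np.asarray(...).ravel()` before sorting. The small count matrix and the `toy_terms` labels are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical document-term counts: 3 documents, 4 terms.
toy_terms = np.array(["ball", "circuit", "sale", "the"])
toy_counts = csr_matrix(np.array([
    [1, 0, 0, 2],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]))

# Column sums come back as a 2-D object; flatten to a 1-D array.
term_counts = np.asarray(toy_counts.sum(axis=0)).ravel()
relative_freq = term_counts / term_counts.sum()

# argsort is ascending, so reverse it to get the most frequent term first.
order = np.argsort(relative_freq)[::-1]
for term, freq in zip(toy_terms[order], relative_freq[order]):
    print(f"{term:10s} {freq:.3f}")
```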

In [None]:
# Write your code here

Use `matplotlib` to plot the frequencies of the 10 most frequent terms in the corpus, ranked from the most frequent to the least frequent.

In [None]:
# Write your code here

Verify graphically that this distribution follows a **power law** (see [scipy.stats.powerlaw](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.powerlaw.html)).
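
A common way to check for a power law is to look at the data on log-log axes: if frequency is proportional to rank raised to some negative exponent, the points fall close to a straight line. The sketch below uses synthetic frequencies as a stand-in for the sorted term frequencies (the exponent 1.1 is an arbitrary assumption) and fits a line in log-log space with `numpy.polyfit`.

```python
import numpy as np

# Synthetic stand-in for sorted term frequencies: freq = C / rank**1.1.
ranks = np.arange(1, 201)
synthetic_freqs = 1000.0 / ranks ** 1.1

# A power law is a straight line in log-log space; the slope is the exponent.
slope, intercept = np.polyfit(np.log(ranks), np.log(synthetic_freqs), deg=1)
print(f"estimated exponent: {slope:.3f}")
```

For the graphical check, `plt.loglog(ranks, synthetic_freqs, '.')` should show an approximately straight line; real term frequencies typically deviate from it at the lowest and highest ranks.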

In [None]:
# Write your code here

From the scikit-learn documentation, [http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting](http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting):

In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency. This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.

$$\text{tf}(t,d) = \text{Number of times term }t \text{ occurs in document } d$$

If $N$ is the total number of documents in the corpus $D$ then

$$\text{idf}(t,D)=\log \frac{N}{\big|\{d\in D\mid t\in d \}\big|}$$

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,D)$$
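
The three formulas above can be evaluated by hand on a tiny tokenized corpus. This is a minimal sketch assuming the natural logarithm and a term that occurs in at least one document; the corpus and the helper names `tf`, `idf` and `tf_idf` are made up. Note that `TfidfVectorizer` with its default settings uses a smoothed variant of the idf and normalizes each row, so its values will not match these exactly.

```python
import math
from collections import Counter

# Hypothetical corpus, already split into tokens.
toy_corpus = [
    ["ball", "game", "ball"],
    ["ball", "sale"],
    ["circuit", "sale"],
]

def tf(term, doc):
    """Number of times `term` occurs in `doc`."""
    return Counter(doc)[term]

def idf(term, corpus):
    """log(N / number of documents containing `term`); assumes that count is > 0."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "ball" occurs twice in the first document and in 2 of the 3 documents.
print(tf_idf("ball", toy_corpus[0], toy_corpus))
# "game" occurs once, and only in the first document, so its idf is larger.
print(tf_idf("game", toy_corpus[0], toy_corpus))
```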

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', min_df=0.01, max_df=0.8)
text_train_data = vectorizer.fit_transform(train_data.data)

Explain each parameter passed to [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) in the cell above.

In [None]:
print(len(train_data.data))
print(text_train_data.shape)

## Classification

Build a *Naive Bayes* classifier that predicts the category of a given message.
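
As a hedged sketch of the general pattern, the example below fits scikit-learn's `MultinomialNB` on tf–idf features from a made-up four-document corpus. Choosing the multinomial variant is an assumption (it is a common choice for text features, but the lab does not prescribe it), and all `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set with two categories.
toy_docs = [
    "the pitcher threw a fast ball",
    "home run in the ninth inning",
    "the circuit needs a new resistor",
    "measure the voltage across the capacitor",
]
toy_labels = ["baseball", "baseball", "electronics", "electronics"]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_docs)

toy_clf = MultinomialNB()
toy_clf.fit(X_toy, toy_labels)

# New messages must go through the same fitted vectorizer.
new_message = ["the pitcher threw a home run"]
predicted = toy_clf.predict(toy_vectorizer.transform(new_message))
print(predicted[0])
```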

In [None]:
# Write your code here

Assess the performance of this classifier with respect to the test set. For this purpose, you have to use [TfidfVectorizer.transform](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform) instead of `TfidfVectorizer.fit_transform`.
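
The reason for using `transform` on the test set is that the vocabulary and idf weights must come from the training data only, so that both matrices share the same columns. A minimal sketch on made-up data, with `accuracy_score` as one possible metric (the lab does not specify which one to use):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy train/test split.
toy_train_docs = [
    "the pitcher threw a fast ball",
    "home run in the ninth inning",
    "the circuit needs a new resistor",
    "measure the voltage across the capacitor",
]
toy_train_labels = ["baseball", "baseball", "electronics", "electronics"]
toy_test_docs = [
    "the inning ended with a home run",
    "replace the resistor in the circuit",
]
toy_test_labels = ["baseball", "electronics"]

toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(toy_train_docs)  # learns the vocabulary
X_toy_test = toy_vectorizer.transform(toy_test_docs)        # reuses it; unseen words are dropped

toy_clf = MultinomialNB().fit(X_toy_train, toy_train_labels)
toy_accuracy = accuracy_score(toy_test_labels, toy_clf.predict(X_toy_test))

print(X_toy_train.shape, X_toy_test.shape)
print(toy_accuracy)
```

Calling `fit_transform` on the test documents instead would build a different vocabulary, and the classifier would receive columns that no longer correspond to the features it was trained on.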

In [None]:
# Write your code here

Perform a **k-fold cross-validation** (with k = 10) on the complete data set. For this purpose, use the function [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) from scikit-learn. Compute the mean, standard deviation, minimum and maximum of these scores.
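
One way to set this up, sketched here on synthetic documents, is to wrap the vectorizer and the classifier in a pipeline so that the vocabulary is re-learned on the training part of every fold. The use of `make_pipeline`, the random toy documents and the two made-up word lists are assumptions for illustration; with the real data, the documents and targets of the complete data set would be passed instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
sport_words = ["pitcher", "inning", "homerun", "team", "game", "bat"]
elec_words = ["circuit", "voltage", "resistor", "chip", "battery", "wire"]

def make_docs(words, n_docs):
    """Build n_docs synthetic documents of 5 random words each."""
    return [" ".join(rng.choice(words, size=5)) for _ in range(n_docs)]

# 10 documents per class, so each of the 10 folds can hold one of each.
toy_docs = make_docs(sport_words, 10) + make_docs(elec_words, 10)
toy_labels = [0] * 10 + [1] * 10

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(toy_model, toy_docs, toy_labels, cv=10)

print(f"mean={scores.mean():.3f} std={scores.std():.3f} "
      f"min={scores.min():.3f} max={scores.max():.3f}")
```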

In [None]:
# Write your code here