## Multinomial Naive Bayes on 20newsgroups Dataset

In this question, you are asked to write code for classifying text data using Multinomial Naive Bayes.
[The 20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/) is a benchmark dataset used for text classification and clustering tasks.

#### A Few Task Reminders:
* Import required libraries.
* Load the data and split into train, validation, and test sets. You can also perform k-fold cross-validation on the train set for better performance estimates.
* Report the size of each class and perform any required data pre-processing.
* The input data here is text, so one goal of this assignment/question is to learn a few basics of working with text data. In particular, for the purpose of the assignment, you only need to learn what Term Frequency–Inverse Document Frequency (TF-IDF) is and how to use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class.
* Train the Multinomial Naive Bayes Classifier on the data.
* Make predictions and evaluate the model. Generate a confusion matrix and a classification report (e.g. using 'classification_report').
* Summarize your findings and make sure you sufficiently document your code.
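
The splitting and cross-validation steps in the list above can be sketched as follows. This is only an illustration under stated assumptions: the toy corpus, the two labels, and the names `docs_train`, `docs_val`, and `cv_scores` are hypothetical stand-ins, not part of the assignment data. It wraps the vectorizer and classifier in a pipeline so that TF-IDF statistics are re-fit on the training part of each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the newsgroup posts (0 = space, 1 = autos)
docs = [
    "the rocket reached orbit", "nasa launched a new satellite",
    "the shuttle docked with the station", "astronauts trained for the moon mission",
    "the telescope observed a distant galaxy", "a probe landed on mars",
    "the engine needs new oil", "my car has a flat tire",
    "the dealer sold me a used truck", "brakes and tires were replaced",
    "the sedan gets good mileage", "he repaired the transmission",
]
labels = [0] * 6 + [1] * 6

# Hold out 25% of the documents as a validation set, keeping class proportions
docs_train, docs_val, y_train, y_val = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# Pipeline: the vectorizer is fit only on the training part of each fold
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 3-fold stratified cross-validation on the training portion
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, docs_train, y_train, cv=cv)
print("accuracy per fold:", cv_scores)
```

With so few documents the fold scores are not meaningful; the point is only the mechanics of splitting and scoring.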

**Note.** This problem intentionally extends beyond what we have covered in class (i.e. text data pre-processing) to encourage more independent learning and exploration with hands-on ML experience and applications. I hope you have fun with these types of questions and that they are not very stressful. Also, feedback is welcomed and encouraged!

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Let's load the data and take a look at the target names:

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names

For this problem, only consider the four categories below as our classes:

In [None]:
classes = ['comp.windows.x', 'rec.autos', 'sci.space', 'talk.religion.misc']
train = fetch_20newsgroups(subset='train', categories=classes)
test = fetch_20newsgroups(subset='test', categories=classes)
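
One possible way to report the size of each class is to count the integer labels with `np.bincount`. The helper `class_sizes` and the toy label array below are hypothetical and only illustrate the idea.

```python
import numpy as np

def class_sizes(targets, target_names):
    """Return a {class name: number of documents} mapping from integer labels."""
    counts = np.bincount(np.asarray(targets), minlength=len(target_names))
    return dict(zip(target_names, counts.tolist()))

# Hypothetical labels standing in for a real target array
toy_targets = [0, 2, 1, 2, 2, 3, 0]
toy_names = ['comp.windows.x', 'rec.autos', 'sci.space', 'talk.religion.misc']

sizes = class_sizes(toy_targets, toy_names)
print(sizes)
# {'comp.windows.x': 2, 'rec.autos': 1, 'sci.space': 3, 'talk.religion.misc': 1}
```

On the notebook's data, the same helper could be called as `class_sizes(train.target, train.target_names)`.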

In [None]:
# A sample in the train data
print(train.data[5])

#### Term Frequency–Inverse Document Frequency (TF-IDF):

In the vector space model of text data, each document can be represented as an $n_1$-dimensional vector, where $n_1$ is the size of the dictionary (number of terms) used to represent the documents in the dataset.
Given $n_2$ documents, we construct a term-document matrix $X \in \mathbb{R}^{n_1 \times n_2}$ where $X_{i j}$ corresponds to the significance of term $i$ in document $j$ using a Term Frequency–Inverse Document Frequency (TF-IDF) representation:

$$
X_{i j}=T F_{i j} \log \left(\frac{n_2}{D F_i}\right)
$$
where,
\begin{align*}
& \text{Term Frequency: } \quad TF_{i j}=\frac{f_{i j}}{\sum \limits_{k=1}^{n_1} f_{k j}} \quad \text{ where } f_{i j} \text{ is the number of times term $i$ appears in document $j$} \\
& \text{Document Frequency: } \quad D F_i= |\{j \in \text{corpus} \mid f_{i j} > 0\}|: \text{ number of documents containing term $i$}
\end{align*}
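
To make the formula concrete, here is a minimal NumPy sketch that evaluates it directly on a three-document toy corpus. The corpus and variable names are hypothetical. Note that `TfidfVectorizer` with default settings will not reproduce these numbers: it uses raw counts for the term frequency, a smoothed inverse document frequency, and L2-normalizes each document vector.

```python
import numpy as np

# Hypothetical toy corpus, already tokenized
docs = [
    "space shuttle launch".split(),
    "car engine engine".split(),
    "space car".split(),
]

vocab = sorted({term for doc in docs for term in doc})
n1, n2 = len(vocab), len(docs)  # number of terms, number of documents

# f[i, j] = number of times term i appears in document j
f = np.zeros((n1, n2))
for j, doc in enumerate(docs):
    for term in doc:
        f[vocab.index(term), j] += 1

tf = f / f.sum(axis=0, keepdims=True)   # TF_ij: counts normalized by document length
df = (f > 0).sum(axis=1)                # DF_i: number of documents containing term i
X = tf * np.log(n2 / df)[:, np.newaxis] # X_ij = TF_ij * log(n2 / DF_i)

print(vocab)
print(np.round(X, 3))
```

For instance, "engine" appears twice in the three-word second document and in no other document, so its entry is $(2/3)\log(3/1)$, while "car" appears in two of the three documents and is weighted by the smaller factor $\log(3/2)$.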


In [None]:
# For the purpose of the assignment, you can use the following:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')  # one-time download of the NLTK stop-word corpus

n_features = 5000  # keep only the 5000 most frequent terms in the corpus
stop_words_list = nltk.corpus.stopwords.words('english') # optional for assignment
vectorizer = TfidfVectorizer(max_features=n_features, stop_words=stop_words_list)
vectors = vectorizer.fit_transform( ) #insert appropriate training set
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()
denselist = np.array(dense).transpose()  # rows are terms, columns are documents

In [None]:
# For the purpose of the assignment, use the default parameters; there is no need to tune any hyperparameters or options.
# For example,

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
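
The remaining steps (fit, predict, evaluate) might look roughly like the sketch below. It runs on a hypothetical two-class toy corpus rather than the newsgroup data, and all variable names are illustrative. The key detail is that the vectorizer is fit on the training documents only and then reused, via `transform`, on the test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (0 = space, 1 = autos)
train_docs = [
    "the rocket reached orbit", "nasa launched a satellite",
    "the shuttle docked with the station", "a probe landed on mars",
    "the engine needs new oil", "my car has a flat tire",
    "the dealer sold a used truck", "the brakes were replaced",
]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["the satellite is in orbit", "the truck needs new brakes"]
test_y = [0, 1]

# Fit the vocabulary and IDF weights on the training set only
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)  # reuse the fitted vocabulary

clf = MultinomialNB()  # default parameters
clf.fit(X_train, train_y)
pred = clf.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes
cm = confusion_matrix(test_y, pred)
report = classification_report(
    test_y, pred, target_names=["space", "autos"], zero_division=0
)
print(cm)
print(report)
```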