<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/11b-naive-bayes-20newsgroups.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>


# 11b -- Naive Bayes 20 newsgroups classification

* The [multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution) describes the probability of observing counts among a number of categories.
* Multinomial naive Bayes is appropriate for features that represent counts or count rates.
* It works like Gaussian naive Bayes, except that each class is modeled with a multinomial distribution instead of a Gaussian.
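
To make the idea of fitting one multinomial per class concrete, here is a minimal from-scratch sketch. It assumes add-one (Laplace) smoothing and a tiny made-up count matrix. The function names `fit_multinomial_nb` and `predict_multinomial_nb` are hypothetical helpers for illustration, not part of scikit-learn.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and per-class log feature probabilities from count rows."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_prob = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each feature within class c, plus the smoothing term
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(totals)
        log_prob[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_prob

def predict_multinomial_nb(x, log_prior, log_prob):
    """Return the class with the largest log posterior (up to a shared constant)."""
    scores = {c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_prob[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# Made-up word counts for the hypothetical vocabulary ["orbit", "pixel", "prayer"]
X = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [0, 2, 1]]
y = ["space", "space", "graphics", "graphics"]

log_prior, log_prob = fit_multinomial_nb(X, y)
print(predict_multinomial_nb([2, 0, 0], log_prior, log_prob))
print(predict_multinomial_nb([0, 4, 0], log_prior, log_prob))
```

The smoothing term keeps a word that never appears in a class from driving that class's probability to zero.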

### Reading

* [05.04-Feature-Engineering.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.04-Feature-Engineering.ipynb)
  * simple example of [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
  * simple example of [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
* [05.05-Naive-Bayes.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.05-Naive-Bayes.ipynb) VanderPlas -- github
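
As a quick illustration of the two vectorizers named in the reading list, the sketch below applies both to a tiny made-up corpus. The corpus and variable names are assumptions for illustration only. The vocabulary is recovered from the fitted `vocabulary_` attribute so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus, for illustration only
corpus = ["the rocket reached orbit",
          "the image has low resolution",
          "orbit orbit orbit"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)   # sparse matrix of raw word counts

# Column order of the matrix: words sorted by their column index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(counts.toarray())

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)    # counts reweighted by inverse document frequency
print(tfidf.toarray().round(2))
```

With the default settings, `TfidfVectorizer` also scales each row to unit length, so the values are weights rather than counts.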


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()

print(type(data))
print(dir(data))

data.target_names

In [None]:
# Extract newsgroups for 4 categories
categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [None]:
print('train.data:', type(train.data),'has length:', len(train.data))
print('train.target:', type(train.target),'has length:', len(train.target))
print('train.target_names:', train.target_names)
print('train.target[5]:', train.target[5])
print('train.data[5]:', train.data[5])

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [None]:
model.fit(train.data, train.target)
labels = model.predict(test.data)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
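
To read numbers off a confusion matrix rather than a heatmap, per-class recall and precision can be computed directly. The sketch below uses a hypothetical 3-class matrix (`example_mat`, made-up values, not results from the newsgroups model) in the layout returned by `confusion_matrix`: rows are true labels and columns are predicted labels.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true labels, columns are predicted labels
example_mat = np.array([[50,  2,  3],
                        [ 4, 40,  6],
                        [ 1,  9, 35]])

diag = np.diag(example_mat)
recall = diag / example_mat.sum(axis=1)      # fraction of each true class that was recovered
precision = diag / example_mat.sum(axis=0)   # fraction of each predicted class that is correct
accuracy = np.trace(example_mat) / example_mat.sum()

print("recall:   ", recall.round(3))
print("precision:", precision.round(3))
print("accuracy: ", round(float(accuracy), 3))
```

Note that the heatmap above plots the transpose, so there the rows are predicted labels and the columns are true labels.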

### Once you've trained the model, you can give it any string.

In [None]:
def predict_category(s, train=train, model=model):
    """Return the name of the newsgroup that the model predicts for string s."""
    pred = model.predict([s])
    return train.target_names[pred[0]]

In [None]:
predict_category('sending a payload to the ISS')

In [None]:
predict_category('discussing islam vs atheism')

In [None]:
predict_category('determining the screen resolution')
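
Besides a single label, a fitted `MultinomialNB` pipeline can report class probabilities through `predict_proba`. The sketch below is self-contained and trains on a made-up four-document corpus instead of the newsgroups data. The names `toy_model` and `predict_with_proba` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training set standing in for the newsgroups data
docs = ["rocket launch orbit payload", "satellite orbit nasa shuttle",
        "pixel image render screen", "screen resolution graphics card"]
targets = [0, 0, 1, 1]
target_names = ["sci.space", "comp.graphics"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(docs, targets)

def predict_with_proba(s, model=toy_model, names=target_names):
    """Return (category name, probability) pairs, most likely first."""
    proba = model.predict_proba([s])[0]   # one probability per class, in label order
    return sorted(zip(names, proba), key=lambda pair: pair[1], reverse=True)

ranked = predict_with_proba("sending a payload into orbit")
print(ranked)
```

Naive Bayes probabilities are often overconfident, so they are usually more useful for ranking categories than as calibrated estimates.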

# When to use naive Bayes (from VanderPlas)...

* Naive Bayes classifiers make stringent assumptions about the data.
  * As a result, they generally do not perform as well as a more complicated model.
  * But they have several advantages:
    * They are extremely fast for both training and prediction
    * They provide straightforward probabilistic prediction
    * They are often very easily interpretable
    * They have very few (if any) tunable parameters
* These advantages make a naive Bayes classifier a good initial baseline.
  * If it performs well, then great.
  * If it does not perform well, you can explore more sophisticated models, with naive Bayes as the baseline.
* Naive Bayes classifiers tend to perform especially well in one of the following situations:
  * When the naive assumptions actually match the data (very rare in practice)
  * For very well-separated categories, when model complexity is less important
  * For very high-dimensional data, when model complexity is less important

The last two points seem distinct, but they are related: as the dimension of a dataset grows, it becomes much less likely for any two points to be found close together (after all, they must be close in every single dimension to be close overall). This means that clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information. For this reason, simplistic classifiers like naive Bayes tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.
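
The claim that points drift apart as dimensions are added can be checked with a small simulation. This is only a sketch under a strong assumption: the points are drawn uniformly at random from the unit cube, which is not how text features behave. The helper name `mean_nearest_neighbor_distance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nearest_neighbor_distance(n_points, n_dims):
    """Average distance from each random point to its closest neighbor."""
    pts = rng.random((n_points, n_dims))
    # Pairwise Euclidean distances via broadcasting
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)   # ignore the distance from a point to itself
    return dist.min(axis=1).mean()

results = {d: mean_nearest_neighbor_distance(200, d) for d in (2, 10, 100)}
for d, value in results.items():
    print(f"{d:>3} dims: mean nearest-neighbor distance = {value:.3f}")
```

With the number of points held fixed, the nearest neighbor moves farther away as the dimension increases, which is the sense in which high-dimensional data is sparse.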