# Naive Bayes using sklearn

### Load the 20 newsgroups dataset

In [1]:
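from sklearn.datasets import fetch_20newsgroups

# Restrict the corpus to the four science newsgroups.
cats = ['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space']
# subset='all' returns the train and test posts together; used here only for inspection.
newsgroups = fetch_20newsgroups(subset='all', categories=cats)
data = newsgroups.data        # list of raw post texts
classes = newsgroups.target   # integer labels indexing newsgroups.target_names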

In [2]:
print(data[1])
print('target::',classes[1])

From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: vangus nerve (vagus nerve)
Article-I.D.: pitt.19397
Reply-To: geb@cs.pitt.edu (Gordon Banks)
Organization: Univ. of Pittsburgh Computer Science
Lines: 16

In article <52223@seismo.CSS.GOV> bwb@seismo.CSS.GOV (Brian W. Barker) writes:

>mostly right. Is there a connection between vomiting
>and fainting that has something to do with the vagus nerve?
>
Stimulation of the vagus nerve slows the heart and drops the blood
pressure.




-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 
----------------------------------------------------------------------------

target:: 2


### Fit and transform the training data

In [3]:
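from sklearn.feature_extraction.text import CountVectorizer

newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
cv = CountVectorizer()
# fit() learns the vocabulary from the training posts only.
cv.fit(newsgroups_train.data)
# transform() maps each post to a sparse row of word counts.
transformed = cv.transform(newsgroups_train.data)
print(transformed.shape)  # (number of documents, vocabulary size)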

(2373, 38683)
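
The shape above says that 2373 training posts were mapped onto a vocabulary of 38683 distinct tokens. As a rough illustration of what the fit and transform steps do, here is a minimal pure-Python sketch on a made-up two-document corpus. It is not scikit-learn's implementation: the `docs` list and the `tokenize` and `transform` helpers are hypothetical, and the regular expression only approximates the default tokenization (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the 20 newsgroups data.
docs = ["the vagus nerve slows the heart", "the heart rate drops"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# "fit": collect every distinct token and give it a column index (sorted order).
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in tokenize(d)}))}

def transform(doc):
    # "transform": count tokens and place each count in its vocabulary column.
    row = [0] * len(vocab)
    for tok, n in Counter(tokenize(doc)).items():
        if tok in vocab:  # tokens unseen during fit are silently dropped
            row[vocab[tok]] = n
    return row

matrix = [transform(d) for d in docs]
print(vocab)
print(matrix)
```

Each row has one entry per vocabulary term, which is why the real matrix is so wide and is stored in sparse form.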


### Feed the training documents and their labels to a Naive Bayes classifier

In [4]:
from sklearn.naive_bayes import MultinomialNB

# alpha=1 is additive (Laplace) smoothing: every word count is incremented by 1.
clf = MultinomialNB(alpha=1)
clf.fit(transformed, newsgroups_train.target)

MultinomialNB(alpha=1, class_prior=None, fit_prior=True)
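
To make the role of `alpha` concrete, the sketch below computes smoothed per-class word probabilities as (count + alpha) / (class total + alpha * vocabulary size) and scores one document. The counts, class names, and the uniform class prior are assumptions made up for illustration; the fitted classifier above estimates class priors from the training labels instead.

```python
import math

# Hypothetical word counts per class over a three-word vocabulary.
counts = {"sci.med": [3, 0, 1], "sci.space": [0, 4, 0]}
alpha = 1.0

def smoothed_log_probs(row, alpha):
    # Additive smoothing keeps every probability non-zero, so log() is always defined.
    total = sum(row) + alpha * len(row)
    return [math.log((n + alpha) / total) for n in row]

log_probs = {c: smoothed_log_probs(row, alpha) for c, row in counts.items()}

def predict(doc_counts):
    # Assumed uniform prior: the prior term is the same for all classes and is omitted.
    scores = {c: sum(n * lp for n, lp in zip(doc_counts, lps))
              for c, lps in log_probs.items()}
    return max(scores, key=scores.get)

print(predict([1, 0, 1]))
```

Without smoothing, the zero count for the second word in `sci.med` would give a probability of 0 and rule that class out for any document containing the word.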

### Predict the labels of the testing data

In [6]:
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)
# Reuse the vectorizer fitted on the training data; do not refit on the test set.
transformed_test = cv.transform(newsgroups_test.data)
predictions = clf.predict(transformed_test)
print(predictions)

[3 3 2 ..., 1 3 2]


### Compute the F-score

In [7]:
from sklearn import metrics
print(metrics.f1_score(newsgroups_test.target, predictions, average='micro'))

0.935402153262
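
With `average='micro'`, true positives, false positives, and false negatives are pooled over all classes before the F-score is computed; in a single-label multiclass problem like this one, that makes the micro score equal to plain accuracy. Macro averaging instead computes one F-score per class and takes their unweighted mean. The sketch below shows the difference on made-up labels; the `f1_scores` helper is hypothetical and is not part of scikit-learn.

```python
def f1_scores(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    # Micro: pool the counts over all classes, then compute a single F-score.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN)

    # Macro: one F-score per class, then an unweighted mean.
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Hypothetical labels: class 0 is always right, classes 1 and 2 are confused once each.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 2, 1]
micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro, macro, accuracy)
```

Here the micro score matches accuracy, while the macro score is lower because the two smaller classes, which carry the errors, weigh as much as the large one.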



#### Links/References

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin/16001