# ANLY-501 mod 6 discussion: Naive Bayes and linear SVM with Python

Rui Qiu (rq47)

2021-11-08

The text data comes from the well-known [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from pprint import pprint

news_train = fetch_20newsgroups(subset='train', shuffle=True)

In [2]:
pprint(list(news_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


## Naive Bayes Classifier

In [3]:
nb_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])

nb_clf = nb_clf.fit(news_train.data, news_train.target)
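
For intuition about what the `vect` and `tfidf` steps feed into the classifier, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes whitespace tokenisation (simpler than what `CountVectorizer` does by default), the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document. The function name `tfidf_rows` and the toy sentences are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Toy TF-IDF: raw counts times smoothed IDF, then L2-normalised per document."""
    # Assumption: lowercase + whitespace split stands in for real tokenisation.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

# Hypothetical toy corpus, not taken from 20 Newsgroups.
toy_docs = [
    "the shuttle reached orbit",
    "the team won the game",
    "the orbit of the telescope",
]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

In the first toy document, "the" occurs in every document and so gets the smallest weight, while "shuttle" occurs in only one and gets the largest.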

In [4]:
nb_test = [
    "People ‘unvaccinated by choice’ in Singapore no longer can receive free covid-19 treatment",
    "Bradley Beal is setting the tone for the thriving Wizards: ‘It’s taking us to another level’",
    "Lewis Hamilton and Mercedes not giving up F1 title fight, insists Toto Wolff",
    "Hubble telescope team gets one science instrument running again, continues troubleshooting glitch"
]

nb_pred = nb_clf.predict(nb_test)

nb_pred

for doc, category in zip(nb_test, nb_pred):
    print("{0} => {1}".format(doc, news_train.target_names[category]))

People ‘unvaccinated by choice’ in Singapore no longer can receive free covid-19 treatment => soc.religion.christian
Bradley Beal is setting the tone for the thriving Wizards: ‘It’s taking us to another level’ => soc.religion.christian
Lewis Hamilton and Mercedes not giving up F1 title fight, insists Toto Wolff => soc.religion.christian
Hubble telescope team gets one science instrument running again, continues troubleshooting glitch => sci.space


It seems that our naive Bayes classifier got only one of the four test headlines right (the Hubble story); the other three were all assigned to soc.religion.christian.
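
To see how a multinomial naive Bayes model can push unrelated short texts into the same class, here is a toy pure-Python sketch. It illustrates the mechanism only and is not a diagnosis of the run above. It assumes whitespace tokens, raw counts instead of TF-IDF, and additive smoothing with `alpha=1.0`. Out-of-vocabulary words are skipped, so a headline with no known words is decided by the class prior alone. The helper names, labels and sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial naive Bayes model with additive smoothing."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.lower().split())
    vocab = {term for counts in term_counts.values() for term in counts}
    # Log prior: share of training documents in each class.
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {t: math.log((term_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_lik, vocab

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    log_prior, log_lik, vocab = model
    # Words never seen in training are dropped, as a vectoriser would do.
    known = [t for t in doc.lower().split() if t in vocab]
    scores = {c: log_prior[c] + sum(log_lik[c][t] for t in known) for c in log_prior}
    return max(scores, key=scores.get)

# Hypothetical, deliberately imbalanced toy training set.
docs = ["god faith church", "faith prayer church", "bible faith god", "orbit shuttle launch"]
labels = ["religion", "religion", "religion", "space"]
model = train_nb(docs, labels)

print(predict_nb(model, "shuttle launch delayed"))  # shares two words with the space document
print(predict_nb(model, "wizards beat lakers"))     # no known words, so the prior decides
```

In this toy setup the second headline lands in the majority class purely because nothing in it overlaps with the training vocabulary.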

## Linear SVM

In [5]:
svm_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LinearSVC(random_state=42))])

svm_clf = svm_clf.fit(news_train.data, news_train.target)
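
`LinearSVC` handles multiple classes one-vs-rest by default: each class gets its own weight vector and intercept, and the prediction is the class with the largest linear score. The sketch below shows just that decision rule in pure Python. The weights, biases and feature values are hand-picked and hypothetical, not learned from the newsgroup data, and `ovr_predict` is an illustrative helper, not a scikit-learn function.

```python
def ovr_predict(weights, biases, features):
    """Return the label with the largest linear score w.x + b, plus all scores."""
    scores = {}
    for label, w in weights.items():
        # Sparse dot product: terms without a weight contribute nothing.
        dot = sum(w.get(term, 0.0) * value for term, value in features.items())
        scores[label] = dot + biases[label]
    return max(scores, key=scores.get), scores

# Hypothetical per-class weights and intercepts, for illustration only.
weights = {
    "sci.space": {"telescope": 1.2, "orbit": 0.9, "team": -0.1},
    "rec.autos": {"engine": 1.1, "team": 0.2},
}
biases = {"sci.space": -0.3, "rec.autos": -0.2}

# Hypothetical TF-IDF-style features for one short document.
features = {"telescope": 0.7, "team": 0.4}

label, scores = ovr_predict(weights, biases, features)
print(label, scores)
```

Unlike naive Bayes, there is no class prior here; a class only wins if the weighted features push its score above the others.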

In [6]:
svm_pred = svm_clf.predict(nb_test)

svm_pred

for doc, category in zip(nb_test, svm_pred):
    print("{0} => {1}".format(doc, news_train.target_names[category]))

People ‘unvaccinated by choice’ in Singapore no longer can receive free covid-19 treatment => sci.med
Bradley Beal is setting the tone for the thriving Wizards: ‘It’s taking us to another level’ => comp.graphics
Lewis Hamilton and Mercedes not giving up F1 title fight, insists Toto Wolff => comp.windows.x
Hubble telescope team gets one science instrument running again, continues troubleshooting glitch => sci.space


Well, two correct this time (sci.med and sci.space). An improvement, I see!