# Naive Bayes Example and SVM Homework

Guest Proff: Edward Raff
edraff1@umbc.edu
raff.edward@umcb.edu


In [0]:
import pandas as pd
from sklearn import datasets
import tensorflow as tf
from matplotlib import pyplot
from matplotlib import pyplot
from matplotlib import pylab
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import export_graphviz
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
#some new tricks
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

Say we have our features $x_1 \ldots x_d$, and a set of classes $c_1, \ldots, c_k$. We want to build a classifier based on probability and statistics. Describing a classifier in probablistic terms, we want to know the probability of class $c_k$ given our features $x$, we would write that as:

$$P(c_k | x_1, \ldots, x_d)$$

If we knew what the joint probability $P(c, x_1, x_2 ,\ldots, x_n)$ was, we could apply bayes rule to obtain this answer. 

$$P(c_k | x_1, \ldots, x_d)  \propto P(c_k) \cdot P(x_1, \ldots, x_d | c_k) $$


How do we evaluate this? One option is to try and factor the joint probability $P(c, x_1, x_2 ,\ldots, x_n)$. If we factor this using the [chain rule](https://en.wikipedia.org/wiki/Chain_rule_(probability)), we get

$$P( x_1, x_2 ,\ldots, x_d, c) = $$
$$P(x_1 | x_2, \ldots, x_d, c) \cdot  P(x_2 | x_3, \ldots, x_d, c) \cdot  P(x_3 | x_4, \ldots, x_d, c) \ldots  P(x_{d-1} | x_d, c) \cdot  P(x_d | c) \cdot P(c)$$

But this dosn't help us all that much. One naive assumption we could make to simplify things is to assume that knowing the class $c$ explains away all other features. This is the [conditional indepdendence](https://en.wikipedia.org/wiki/Independence_(probability_theory)#Conditional_independence) assumption 

iff $A\perp B | C$, then $$P(A , B | C) = P(A | C) \cdot P(B | C)$$

or equivalently, 

iff $A\perp B | C$, then $$P(A | B,  C) = P(A | C)$$

If we assume $x_i \perp x_j | C, \forall i \neq j$, then we can simplify our application of the chain rule to get

$$P(c_k | x_1, \ldots, x_d) \propto \prod_{i=1}^d P(x_i | c_k)$$

$$P(c_k | x_1, \ldots, x_d) = \frac{\prod_{i=1}^d P(x_i | c_k)}{\sum_{z=1}^k \prod_{i=1}^d P(x_i | c_z)}$$

In [3]:
# p(pass | eat_pie, study) 
p_pass = 90.0/100.0
#p(pass | eat_pie, study) ∝ p(eat_pie | pass) * p(study | pass) * p(pass)

p_study_g_pass = 85.0/90.0
p_pie_g_pass = 75/90.0

p_study_g_no_pass = 2.0/10.0
p_pie_g_no_pass = 5.0/10.0

#p_pass_g_pie_study = p_pie_g_pass * p_study_g_pass * p_pass
#p_no_pass_g_pie_study = p_pie_g_no_pass * p_study_g_no_pass * (1-p_pass)

#What if we remove pie?
p_pass_g_pie_study = p_study_g_pass * p_pass
p_no_pass_g_pie_study = p_study_g_no_pass * (1-p_pass)

print("Probability of Passing given that I study and eat pie = ", p_pass_g_pie_study/(p_pass_g_pie_study+p_no_pass_g_pie_study))



Probability of Passing given that I study and eat pie =  0.9770114942528736


In [4]:
#Load the data
data = datasets.fetch_20newsgroups(subset='all')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [5]:
print(data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [0]:
# For a number of reasons Pandas isn't very useful when processing text data. We'll
# stick with lists or numpy arrays.
data_train_X, data_test_X = train_test_split(data.data, random_state=1 )
data_train_y, data_test_y = train_test_split(data.target, random_state=1 )

In [7]:
data_train_X[0]

"From: feustel@netcom.com (David Feustel)\nSubject: Re: BATF/FBI Murders Almost Everyone in Waco Today! 4/19\nOrganization: DAFCO: OS/2 Software Support & Consulting\nLines: 10\n\nIt's truly unfortunate that we don't have the Japanese tradition of\nHari-Kari for public officials to salvage some tatters of honor after\nthey commit offenses against humanity like were perpetrated in Waco,\nTexas today.\n-- \nDave Feustel N9MYI <feustel@netcom.com>\n\nI'm beginning to look forward to reaching the %100 allocation of taxes\nto pay for the interest on the national debt. At that point the\nfederal government will be will go out of business for lack of funds.\n"

In [0]:
# Here we use SKLearn's CountVectorizer
# note we use fit transform on the training data and transform on the test data
# if a word only appears in the test data it is ignored.
vectorizer = CountVectorizer(binary=False)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

In [9]:
vectors_train[0]

<1x153196 sparse matrix of type '<class 'numpy.int64'>'
	with 80 stored elements in Compressed Sparse Row format>

In [10]:
word_of_interest = "re"
#Did email # 0 have the word "from"
print("Did email 0 have the word '", word_of_interest, "': ", vectors_train[0][0,vectorizer.vocabulary_[word_of_interest]] >= 1)

Did email 0 have the word ' re ':  True


In [11]:
# Let's look at the shape of the data
# The first number is the number of examples
# The second is the number of words in the corpus
print(vectors_train.shape)
print(vectors_test.shape)


(14134, 153196)
(4712, 153196)


In [12]:
# Let's look at the shape of the data
# (document, token)     1 if present (Sparse representation so if it's not there 0 implied)
print(vectors_train)

  (0, 67564)	1
  (0, 87663)	1
  (0, 42700)	1
  (0, 106908)	1
  (0, 70259)	1
  (0, 38642)	1
  (0, 145948)	2
  (0, 70625)	1
  (0, 64650)	1
  (0, 111510)	1
  (0, 36019)	1
  (0, 54023)	1
  (0, 101517)	1
  (0, 105838)	1
  (0, 79938)	1
  (0, 108926)	1
  (0, 133280)	1
  (0, 33106)	1
  (0, 2647)	1
  (0, 117412)	1
  (0, 66645)	1
  (0, 90355)	1
  (0, 38905)	1
  (0, 101164)	1
  (0, 53599)	1
  :	:
  (14133, 69349)	2
  (14133, 42728)	1
  (14133, 83751)	1
  (14133, 127516)	1
  (14133, 144750)	1
  (14133, 70572)	2
  (14133, 134323)	1
  (14133, 33821)	1
  (14133, 80748)	3
  (14133, 146276)	1
  (14133, 33677)	1
  (14133, 36019)	2
  (14133, 105838)	1
  (14133, 127776)	1
  (14133, 135553)	8
  (14133, 105264)	2
  (14133, 134355)	8
  (14133, 145072)	1
  (14133, 81002)	3
  (14133, 89567)	1
  (14133, 131382)	1
  (14133, 106436)	1
  (14133, 130634)	2
  (14133, 48888)	3
  (14133, 67189)	3


The above equation is much easier to deal with. We can now deal with each feature independently, and simply multiply them together! The only qestion is, how do we pick $P$? We need to define a probability distribution that we belive each $x_i$ comes from. 

For text classification problems, the multinomial distribution is the most popular choice to use with naive bayes. 

In [13]:
nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [14]:
#print a confusion matrix and accuracy
print(confusion_matrix(nb.predict(vectors_test), data_test_y))
print(accuracy_score(nb.predict(vectors_test), data_test_y))

[[183   0   1   0   0   0   0   0   0   0   0   0   0   0   0   2   0   0
    0  16]
 [  0 220  43   5   7  29   1   2   0   0   0   3   7   1   2   1   0   1
    0   0]
 [  0   0  39   0   2   0   1   0   0   0   0   0   0   0   0   0   1   0
    0   0]
 [  0   7 101 232   9   5   5   0   0   0   0   0  13   2   0   1   1   0
    0   0]
 [  0   3  25  11 209   1   4   0   1   0   0   0   5   0   1   1   0   1
    0   0]
 [  0   5  30   1   0 189   0   0   0   0   0   2   0   0   0   0   0   0
    0   0]
 [  1   2   5   5   2   1 213   3   4   1   0   1   0   2   1   0   0   0
    0   0]
 [  1   0   3   1   0   0  12 241   1   0   0   0   2   0   0   1   0   0
    0   1]
 [  0   0   2   0   0   1   2  14 244   0   1   0   1   2   0   1   0   0
    0   0]
 [  0   0   0   0   0   0   2   0   0 221   4   0   0   0   2   0   0   0
    0   1]
 [  0   0   0   1   0   0   1   0   0   1 255   0   0   0   0   1   0   0
    0   0]
 [  0   2   3   0   0   1   2   0   0   1   0 248   0   1   0   0

# What about continuous data?

You have to use the "right" (read, good enough) distribution for the type of data you are looking at. When we are working with non-text data, the Multinomial is not a popular choice, and people tend to use what is called the Gaussian Naive Bayes variant. 

In this case, we assume $P(x_i | c_j)$  follows a guassian distribution. Since this one is easier, we will work through the math. We essenitaly just plug in the guassian distribution to get the following likelihood. 

$$P \left( x _ { i } | c_j \right) = \frac { 1 } { \sqrt { 2 \pi \sigma _ { c_j } ^ { 2 } } } \exp \left( - \frac { \left( x _ { i } - \mu _ { c_j } \right) ^ { 2 } } { 2 \sigma _ { c_j } ^ { 2 } } \right)$$


Now we have two parameters of the guassian distribution $\mu_{c_j}$ and $\sigma{c_j}$ that we need to set. If we wanted to take an optimization approach and define an objective function, we would take the negative log likelihood

$$-\log\left(P(x_i | c_j)\right) = \frac{1}{2} \left(\frac{(\mu_{c_j}-x_i)^2}{{\sigma _ { c_j }}^2}+2 \log (\sigma _ { c_j })+\log (2)+\log (\pi )\right)$$


Lucky for us, guassians are convient, and the minimizer is in the notation $\mu_{ c_j }$ is the mean of $x_i$ in class $c_j$ and $\sigma _ { c_j }$ is the standard deviation of $x_i$ for class $c_j$. If $n_{c_j}$ is the number of data points that have the label $c_j$, we get

$$\mu_{c_j} = \frac{1}{n_{c_j}} \sum_{z=1}^{n_{c_j}} x^{(z)}_i$$

$$\sigma_{c_j} = \sqrt{\frac{1}{n_{c_j}}  \sum_{z=1}^{n_{c_j}} \left(\mu_{c_j}-x^{(z)}_i\right)^2}$$

Now lets try it out on some data!

In [15]:
data_iris = datasets.load_iris()
data_iris_train_X, data_test_X = train_test_split(data_iris.data, random_state=1 )
data_iris_train_y, data_test_y = train_test_split(data_iris.target, random_state=1 )
#gnb= GaussianNB()
gnb= MultinomialNB()
gnb.fit(data_iris_train_X, data_iris_train_y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
pred_iris=gnb.predict(data_test_X)
print( confusion_matrix( pred_iris, data_test_y ) )
print( accuracy_score( pred_iris, data_test_y ) )

[[13  0  0]
 [ 0  0  0]
 [ 0 16  9]]
0.5789473684210527


# Homework Problem 1
- Note the data contains many headers, quotes and headers.  These create things that are easy to idenitify. You can remove this from the documents using  data_train = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes')
- What happens to the performance of the model?  Is this to be expected?  Was it a good idea?
- I used a CountVectorizer(binary=True)
- What happens to the vectors if you set binary=False?  What happens to the performance?
- Try training an svm with a linear kernel using binary vectors, count vectors, also apply a tfidf transform to the data.  Compare performance with Naive Bayes.




In [17]:
import pandas as pd
from sklearn import datasets
import tensorflow as tf
from matplotlib import pyplot
from matplotlib import pyplot
from matplotlib import pylab
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import export_graphviz
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
#some new tricks
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# getting 20newsgroups dataset
data = datasets.fetch_20newsgroups(subset='all')

data_train_X, data_test_X = train_test_split(data.data, random_state=1 )
data_train_y, data_test_y = train_test_split(data.target, random_state=1 )

# vectorizer = CountVectorizer(binary=False)
vectorizer = CountVectorizer(binary=True)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(nb.predict(vectors_test), data_test_y))
print(accuracy_score(nb.predict(vectors_test), data_test_y))

[[184   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0
    0  17]
 [  0 220  36   5   3  22   0   3   0   0   0   7   7   2   1   2   0   0
    0   0]
 [  0   0  81   0   2   0   1   0   0   0   0   0   0   0   0   0   1   0
    0   0]
 [  0   7  97 231  10   6   8   0   0   2   0   1   9   3   1   1   0   0
    0   0]
 [  0   3   6  10 211   0   4   0   1   0   0   0   6   0   1   1   0   0
    0   0]
 [  0   4  23   1   0 194   0   0   0   0   1   2   0   0   1   0   0   0
    0   0]
 [  1   1   2   6   3   1 214   4   4   2   0   0   1   1   1   0   0   0
    0   0]
 [  1   0   3   0   0   0  10 240   1   0   0   0   2   1   0   0   0   0
    0   1]
 [  0   0   0   0   0   1   3  10 244   0   1   0   0   0   0   0   0   0
    0   0]
 [  0   0   0   0   0   0   2   0   0 221   2   0   0   0   1   0   0   0
    0   0]
 [  0   0   0   1   0   0   1   0   0   1 255   0   0   0   0   1   0   0
    0   0]
 [  0   2   5   0   1   1   2   1   0   0   0 244   1   0   0   0

In [18]:
# getting cleaned data for training
data_clean = datasets.fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'))

data_clean_train_X, data_test_X = train_test_split(data_clean.data, random_state=1 )
data_clean_train_y, data_test_y = train_test_split(data_clean.target, random_state=1 )

# vect_clean = CountVectorizer(binary=False)
vect_clean = CountVectorizer(binary=True)
vect_clean_train=vect_clean.fit_transform(data_train_X)
vect_clean_test=vect_clean.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vect_clean_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(nb.predict(vect_clean_test), data_test_y))
print(accuracy_score(nb.predict(vect_clean_test), data_test_y))

[[138   1   2   1   1   1   0   2   3   0   3   3   0   2   2   7   4   5
    9  21]
 [  1 181  38   4   2  27   3   2   0   0   0   6   8   3   7   2   0   2
    0   0]
 [  0   0  34   0   3   0   0   0   0   0   0   0   0   0   0   0   1   0
    0   0]
 [  0  11 104 219  17   8  14   0   2   2   0   0  14   2   2   0   0   0
    0   0]
 [  0  12  14   9 180   4  10   0   2   0   1   1   8   1   1   1   0   1
    1   0]
 [  0  10  37   2   2 175   0   1   0   0   1   2   0   0   2   0   0   1
    0   0]
 [  0   1   2   4   4   2 179   6   3   4   0   0   3   2   0   0   0   0
    0   1]
 [  2   0   0   2   4   1  15 214  16   0   1   0   4   2   3   0   1   0
    1   2]
 [  3   1   4   0   0   1   1   6 191   1   3   1   2   2   0   1   2   1
    2   0]
 [  5   7   9   7   8   2   4  12   9 209  13  11   7   7  10   5   9   6
    5   5]
 [  0   1   1   0   0   0   3   0   1   1 226   0   0   0   1   0   0   0
    1   0]
 [  4   5   4   2   2   5   5   1   1   1   1 203   4   0   3   0

# RESULT
... it looks like it got worse?

* I used a CountVectorizer(binary=True)
* What happens to the vectors if you set binary=False? What happens to the performance?
* Try training an svm with a linear kernel using binary vectors, count vectors, also apply a tfidf transform to the data. Compare performance with Naive Bayes.




In [19]:
# getting cleaned data for training
data_clean = datasets.fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'))

data_clean_train_X, data_test_X = train_test_split(data_clean.data, random_state=1 )
data_clean_train_y, data_test_y = train_test_split(data_clean.target, random_state=1 )

vect_false = CountVectorizer(binary=False)
vect_false_train=vect_false.fit_transform(data_train_X)
vect_false_test=vect_false.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vect_false_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(nb.predict(vect_false_test), data_test_y))
print(accuracy_score(nb.predict(vect_false_test), data_test_y))

[[135   0   2   1   1   0   0   3   2   1   6   2   1   1   6  10   5   7
    9  27]
 [  1 186  45   4   4  31   4   2   1   2   0   3   9   1   6   3   0   2
    0   0]
 [  0   0   8   0   2   0   0   0   0   0   0   0   0   0   0   0   1   0
    0   0]
 [  0   8 104 210  16   5  11   0   2   0   0   0  12   2   0   0   0   0
    0   0]
 [  0  10  17  12 174   4  11   0   2   0   1   1   8   1   1   1   0   0
    0   0]
 [  0  10  44   2   3 174   0   1   0   0   0   3   0   0   1   0   0   1
    0   0]
 [  0   0   2   5   4   1 174   3   2   3   0   0   1   1   0   0   0   0
    0   0]
 [  1   0   0   1   4   2  15 217  16   0   1   1   5   2   3   0   1   0
    1   2]
 [  3   1   1   0   0   1   3   9 190   1   1   1   1   2   2   1   0   0
    0   0]
 [  5   7  10   7   7   1   4  12   9 206  17  11   7   6   9   5   7   5
    5   5]
 [  0   0   1   0   0   0   1   0   1   1 222   1   0   0   0   0   0   0
    1   0]
 [  2   7  10   2   2   3   6   0   1   0   0 200   7   0   2   0

# RESULT
... it looks like it got worse again????

Let's try it on the original...

In [20]:
# getting 20newsgroups dataset
data = datasets.fetch_20newsgroups(subset='all')

data_train_X, data_test_X = train_test_split(data.data, random_state=1 )
data_train_y, data_test_y = train_test_split(data.target, random_state=1 )

vectorizer = CountVectorizer(binary=False)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(nb.predict(vectors_test), data_test_y))
print(accuracy_score(nb.predict(vectors_test), data_test_y))

[[183   0   1   0   0   0   0   0   0   0   0   0   0   0   0   2   0   0
    0  16]
 [  0 220  43   5   7  29   1   2   0   0   0   3   7   1   2   1   0   1
    0   0]
 [  0   0  39   0   2   0   1   0   0   0   0   0   0   0   0   0   1   0
    0   0]
 [  0   7 101 232   9   5   5   0   0   0   0   0  13   2   0   1   1   0
    0   0]
 [  0   3  25  11 209   1   4   0   1   0   0   0   5   0   1   1   0   1
    0   0]
 [  0   5  30   1   0 189   0   0   0   0   0   2   0   0   0   0   0   0
    0   0]
 [  1   2   5   5   2   1 213   3   4   1   0   1   0   2   1   0   0   0
    0   0]
 [  1   0   3   1   0   0  12 241   1   0   0   0   2   0   0   1   0   0
    0   1]
 [  0   0   2   0   0   1   2  14 244   0   1   0   1   2   0   1   0   0
    0   0]
 [  0   0   0   0   0   0   2   0   0 221   4   0   0   0   2   0   0   0
    0   1]
 [  0   0   0   1   0   0   1   0   0   1 255   0   0   0   0   1   0   0
    0   0]
 [  0   2   3   0   0   1   2   0   0   1   0 248   0   1   0   0

# RESULT
Also worse, but not by much (0.8843 down to 0.8727)

* Try training an svm with a linear kernel using binary vectors, count vectors, also apply a tfidf transform to the data. Compare performance with Naive Bayes.

In [21]:
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer

# getting 20newsgroups dataset
data = datasets.fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'))

data_train_X, data_test_X = train_test_split(data.data, random_state=1 )
data_train_y, data_test_y = train_test_split(data.target, random_state=1 )

vectorizer = TfidfVectorizer(binary=True)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

svm = SVC(kernel='linear')
svm.fit(vectors_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(svm.predict(vectors_test), data_test_y))
print(accuracy_score(svm.predict(vectors_test), data_test_y))

[[129   1   0   2   0   0   0   3   2   3   2   2   1   2   4  12   3   5
    8  23]
 [  1 167  11   7   5  21   2   1   4   3   1   8   5   2   5   1   1   0
    1   0]
 [  1  11 181  17  10  18   4   0   0   0   0   2   3   0   2   1   1   0
    1   0]
 [  0   9  20 196  18   3   6   0   0   0   0   2  10   1   0   0   0   1
    0   1]
 [  1   7   8  10 166   1   4   1   1   0   1   1   6   2   0   1   1   0
    0   0]
 [  0  10  11   1   1 170   2   1   0   0   1   2   0   0   1   1   1   0
    0   0]
 [  1   2   0   5   6   0 204   6   4   2   0   1   4   1   1   0   0   0
    0   1]
 [  6  10  14  11   9   6  12 208  17  19  11  16  13   8  12   8   9   5
    9   9]
 [  6   3   3   0   4   2   4  14 195   7   4   1   6   4   3   0   6   4
    4   3]
 [  5   1   0   0   1   2   1   0   4 176  13   2   1   4   2   2   1   6
    5   2]
 [  0   0   0   0   0   0   0   0   2   9 215   0   0   0   0   0   1   0
    1   0]
 [  1   3   2   0   1   0   2   0   1   0   1 185   3   1   2   0

# RESULT
* This actually took really long, and after researching a bit on the web it seems this is to be expected...
* It looks like I did something wrong here, though

# Homework Problem 2
- Use the fetch_sms_spam() function below to get the sms spam data set.
- More info: [Here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)
- Apply the vectorization and classification tasks above to  the data.
- Which gave the best performance (use a confusionmatrix and accuraccy score to help decide this).

In [0]:
# It's worth while reading through this function there's useful things here, that
# I'm not explicitly covering in class.
def fetch_sms_spam():
  import requests # requests is a handy http library
  import zipfile # a zip library
  r=requests.get('http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip')
  # I know of no way to get data out of a zipfile without saving to disk extracting and reading in, sigh
  zf=open('smsspam.zip','wb')
  zf.write(r.content)
  zip_ref = zipfile.ZipFile('smsspam.zip', 'r')
  zip_ref.extractall('smsspam')
  zip_ref.close()
  zf.close
  sms_file=open('smsspam/SMSSpamCollection.txt','r')
  #object to return
  data = {'data':[], 'target':[], 'target_classes': ['ham', 'spam']}
  # First tab splits the class and the SMS message
  # There's an argument to be made I should use the csv library and use
  # delimiter = '\t', I can't argue with that - it's a good idea. I did it this
  # way for pedogogy.
  for line in sms_file:
    idx = line.find('\t')
    target = line[:idx]
    doc = line[idx+1:]
    data['data'] += [doc]
    if target == 'ham': data['target'] += [0]
    else: data['target'] += [1]
  sms_file.close()
  return data

In [0]:
data1 = fetch_sms_spam()

In [24]:
from sklearn.svm import SVC

data_train_X, data_test_X = train_test_split(data1['data'], random_state=1 )
data_train_y, data_test_y = train_test_split(data1['target'], random_state=1 )

vectorizer = CountVectorizer(binary=True)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

svm = SVC()
svm.fit(vectors_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(svm.predict(vectors_test), data_test_y))
print(accuracy_score(svm.predict(vectors_test), data_test_y))



[[1212  182]
 [   0    0]]
0.8694404591104734


# RESULT

* Not entirely sure what to make of that...

* Accuracy score looks decent, at least.

In [25]:
from sklearn.svm import SVC

data_train_X, data_test_X = train_test_split(data1['data'], random_state=1 )
data_train_y, data_test_y = train_test_split(data1['target'], random_state=1 )

vectorizer = CountVectorizer(binary=False)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

svm = SVC()
svm.fit(vectors_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(svm.predict(vectors_test), data_test_y))
print(accuracy_score(svm.predict(vectors_test), data_test_y))



[[1212  182]
 [   0    0]]
0.8694404591104734


# RESULT

* vectors being binary or not doesn't matter for SVC

In [26]:
data_train_X, data_test_X = train_test_split(data1['data'], random_state=1 )
data_train_y, data_test_y = train_test_split(data1['target'], random_state=1 )

vectorizer = CountVectorizer(binary=False)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(nb.predict(vectors_test), data_test_y))
print(accuracy_score(nb.predict(vectors_test), data_test_y))

[[1204    7]
 [   8  175]]
0.9892395982783357


# RESULT
* Again, confusion matrix is a bit odd...
* But accuracy score is way better with Naive Bayes!!

In [27]:
vectorizer = CountVectorizer(binary=True)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print a confusion matrix and accuracy
print(confusion_matrix(nb.predict(vectors_test), data_test_y))
print(accuracy_score(nb.predict(vectors_test), data_test_y))

[[1203    7]
 [   9  175]]
0.9885222381635581


# RESULT
* nearly identical accuracy score no matter if binary or not

# CONCLUSION

* out of previous tests of various prediction models, it was found that the accuracy scores are as follow:
  * SVC = 0.8694
  * NB w/ Binary Vectors = 0.9885
  * NB w/o Binary Vectors = 0.9892
* It looks like the Naive Bayes without Binary Vectors performed the best for the spam detection data!
* The following code produces only the accuracy score for all the models created in this assignment

In [28]:
# -*- coding: utf-8 -*-
"""
Created on Sun Mar  3 14:43:55 2019

@author: matti
"""

# =============================================================================
# Naive Bayes with binary vectors as applied to the unclean 20newsgroups data
# =============================================================================
# getting 20newsgroups dataset
data = datasets.fetch_20newsgroups(subset='all')

data_train_X, data_test_X = train_test_split(data.data, random_state=1 )
data_train_y, data_test_y = train_test_split(data.target, random_state=1 )

# vectorizer = CountVectorizer(binary=False)
vectorizer = CountVectorizer(binary=True)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print the accuracy score
print('The accuracy score for the Naive Bayes with binary vectors as applied to the unclean 20newsgroups data is: ',accuracy_score(nb.predict(vectors_test), data_test_y))


# =============================================================================
# Naive Bayes with binary vectors as applied to the cleaned 20newsgroups data without headers, footers, and quotes
# =============================================================================
# getting cleaned data for training
data_clean = datasets.fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'))

data_clean_train_X, data_test_X = train_test_split(data_clean.data, random_state=1 )
data_clean_train_y, data_test_y = train_test_split(data_clean.target, random_state=1 )

# vect_clean = CountVectorizer(binary=False)
vect_clean = CountVectorizer(binary=True)
vect_clean_train=vect_clean.fit_transform(data_train_X)
vect_clean_test=vect_clean.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vect_clean_train, data_train_y)

#print the accuracy score
print('The accuracy score for the Naive Bayes with binary vectors as applied to the cleaned 20newsgroups data is: ',accuracy_score(nb.predict(vect_clean_test), data_test_y))


# =============================================================================
# Naive Bayes without binary vectors as applied to the unclean 20newsgroups data
# =============================================================================
# getting 20newsgroups dataset
data = datasets.fetch_20newsgroups(subset='all')

data_train_X, data_test_X = train_test_split(data.data, random_state=1 )
data_train_y, data_test_y = train_test_split(data.target, random_state=1 )

vectorizer = CountVectorizer(binary=False)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print the accuracy score
print('The accuracy score for the Naive Bayes without binary vectors as applied to the unclean 20newsgroups data: ',accuracy_score(nb.predict(vectors_test), data_test_y))


# =============================================================================
# Naive Bayes without binary vectors as applied to the cleaned 20newsgroups data
# =============================================================================
data_clean = datasets.fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'))

data_clean_train_X, data_test_X = train_test_split(data_clean.data, random_state=1 )
data_clean_train_y, data_test_y = train_test_split(data_clean.target, random_state=1 )

vect_false = CountVectorizer(binary=False)
vect_false_train=vect_false.fit_transform(data_train_X)
vect_false_test=vect_false.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vect_false_train, data_train_y)

#print the accuracy score
print('The accuracy score for the Naive Bayes without binary vectors as applied to the cleaned 20newsgroups data: ',accuracy_score(nb.predict(vect_false_test), data_test_y))





# =============================================================================
# The script is getting hung up on the SVC fitting...
# =============================================================================
#from sklearn.svm import SVC
#from sklearn.feature_extraction.text import TfidfVectorizer
#
## getting 20newsgroups dataset
#data = datasets.fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'))
#
#data_train_X, data_test_X = train_test_split(data.data, random_state=1 )
#data_train_y, data_test_y = train_test_split(data.target, random_state=1 )
#
#vectorizer = TfidfVectorizer(binary=True)
#vectors_train=vectorizer.fit_transform(data_train_X)
#vectors_test=vectorizer.transform(data_test_X)
#
#svm = SVC(gamma='auto')
#svm.fit(vectors_train, data_train_y)
#
#
#print(confusion_matrix(svm.predict(vectors_test), data_test_y))
#print(accuracy_score(svm.predict(vectors_test), data_test_y))
# =============================================================================
# I will debug this later.
# =============================================================================


# =============================================================================
# Making the function for getting the sms spam dataset
# =============================================================================
# It's worth while reading through this function there's useful things here, that
# I'm not explicitly covering in class.
def fetch_sms_spam():
  import requests # requests is a handy http library
  import zipfile # a zip library
  r=requests.get('http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip')
  # I know of no way to get data out of a zipfile without saving to disk extracting and reading in, sigh
  zf=open('smsspam.zip','wb')
  zf.write(r.content)
  zip_ref = zipfile.ZipFile('smsspam.zip', 'r')
  zip_ref.extractall('smsspam')
  zip_ref.close()
  zf.close
  sms_file=open('smsspam/SMSSpamCollection.txt','r')
  #object to return
  data = {'data':[], 'target':[], 'target_classes': ['ham', 'spam']}
  # First tab splits the class and the SMS message
  # There's an argument to be made I should use the csv library and use
  # delimiter = '\t', I can't argue with that - it's a good idea. I did it this
  # way for pedogogy.
  for line in sms_file:
    idx = line.find('\t')
    target = line[:idx]
    doc = line[idx+1:]
    data['data'] += [doc]
    if target == 'ham': data['target'] += [0]
    else: data['target'] += [1]
  sms_file.close()
  return data

data1 = fetch_sms_spam()


# =============================================================================
# SVC model with binary vectors as applied to the sms spam data
# =============================================================================
data_train_X, data_test_X = train_test_split(data1['data'], random_state=1 )
data_train_y, data_test_y = train_test_split(data1['target'], random_state=1 )

vectorizer = CountVectorizer(binary=True)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

svm = SVC(gamma='auto')
svm.fit(vectors_train, data_train_y)

#print the accuracy score
print('\nThe accuracy score for the SVC model with binary vectors as applied to the sms spam data: ',accuracy_score(svm.predict(vectors_test), data_test_y))


# =============================================================================
# SVC model without binary vectors as applied to the sms spam data
# =============================================================================
from sklearn.svm import SVC

data_train_X, data_test_X = train_test_split(data1['data'], random_state=1 )
data_train_y, data_test_y = train_test_split(data1['target'], random_state=1 )

vectorizer = CountVectorizer(binary=False)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

svm = SVC(gamma='auto')
svm.fit(vectors_train, data_train_y)

#print the accuracy score
print('The accuracy score for the SVC model without binary vectors as applied to the sms spam data: ',accuracy_score(svm.predict(vectors_test), data_test_y))


# =============================================================================
# Naive Bayes model with binary vectors as applied to the sms spam data
# =============================================================================
data_train_X, data_test_X = train_test_split(data1['data'], random_state=1 )
data_train_y, data_test_y = train_test_split(data1['target'], random_state=1 )

vectorizer = CountVectorizer(binary=True)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print the accuracy score
print('The accuracy score for the NB model with binary vectors as applied to the sms spam data: ',accuracy_score(nb.predict(vectors_test), data_test_y))


# =============================================================================
# Naive Bayes model without binary vectors as applied to the sms spam data
# =============================================================================
vectorizer = CountVectorizer(binary=False)
vectors_train=vectorizer.fit_transform(data_train_X)
vectors_test=vectorizer.transform(data_test_X)

nb = MultinomialNB(alpha=.1)
nb.fit(vectors_train, data_train_y)

#print the accuracy score
print('The accuracy score for the NB model without binary vectors as applied to the sms spam data: ',accuracy_score(nb.predict(vectors_test), data_test_y))

The accuracy score for the Naive Bayes with binary vectors as applied to the unclean 20newsgroups data is:  0.8843378607809848
The accuracy score for the Naive Bayes with binary vectors as applied to the cleaned 20newsgroups data is:  0.7425721561969439
The accuracy score for the Naive Bayes without binary vectors as applied to the unclean 20newsgroups data:  0.8726655348047538
The accuracy score for the Naive Bayes without binary vectors as applied to the cleaned 20newsgroups data:  0.7292020373514432

The accuracy score for the SVC model with binary vectors as applied to the sms spam data:  0.8694404591104734
The accuracy score for the SVC model without binary vectors as applied to the sms spam data:  0.8694404591104734
The accuracy score for the NB model with binary vectors as applied to the sms spam data:  0.9885222381635581
The accuracy score for the NB model without binary vectors as applied to the sms spam data:  0.9892395982783357
