## Read data using pandas

In [11]:
import pandas as pd
df = pd.read_csv('data/SMSSpamCollection', delimiter='\t', header=None, names=['y', 'X'])

# Show the first 5 rows
df.head()

| | y | X |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [6]:
# Count how many of each category
df.y.value_counts()

ham     4825
spam     747
Name: y, dtype: int64

## Split the data into training and test sets

The test set (25% of the messages, the `train_test_split` default) is set aside for independent testing.

In [25]:
from sklearn.model_selection import train_test_split

# shuffle and split the corpus into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.X, df.y, random_state=42)

# Count how many of each category for training and test sets
# (y_train and y_test are already pandas Series)
print(y_train.value_counts())
print(y_test.value_counts())

ham     3618
spam     561
Name: y, dtype: int64
ham     1207
spam     186
Name: y, dtype: int64
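
As a rough illustration of what the split does, here is a minimal sketch on made-up data. The toy messages and labels are assumptions, and `stratify` is an optional argument that the notebook above does not use; it is shown only because it keeps the spam/ham ratio the same in both sets.

```python
from sklearn.model_selection import train_test_split

# Toy corpus (made up for illustration): 16 ham and 4 spam messages
toy_messages = [f"message {i}" for i in range(20)]
toy_labels = ["ham"] * 16 + ["spam"] * 4

# test_size defaults to 0.25, so 5 of the 20 messages are held out.
# stratify is optional: it keeps the class proportions equal in both sets.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_messages, toy_labels, random_state=42, stratify=toy_labels
)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_test.count("spam"), toy_y_test.count("ham"))
```

Without `stratify`, the class proportions in the two sets are only approximately equal, as in the counts printed above.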


## Process text data to extract features

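- Tokenize each message into words. `CountVectorizer`'s default pattern keeps tokens of two or more alphanumeric characters and treats punctuation as a separator.
- Stop words are not removed with the default settings; pass `stop_words='english'` to remove them.
- Convert each message into a vector of word counts (how many times each vocabulary word appears in that message).
- Uppercase is common in spam messages, so we preserve case by setting `lowercase=False`.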

In [77]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase=False)
vectorizer.fit(X_train)

Z_train = vectorizer.transform(X_train)
Z_test = vectorizer.transform(X_test)

Z_train

<4179x9329 sparse matrix of type '<class 'numpy.int64'>'
	with 56614 stored elements in Compressed Sparse Row format>

In [78]:
Z_test

<1393x9329 sparse matrix of type '<class 'numpy.int64'>'
	with 16839 stored elements in Compressed Sparse Row format>
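
To see what the vectorizer produces, here is a small sketch on two made-up messages (they are not from the SMS corpus, and the name `demo_vectorizer` is only for illustration). It shows that `FREE` and `free` become separate features when `lowercase=False`, and that the single-character token `a` is dropped by the default token pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, made up for illustration
docs = ["FREE entry, win a FREE prize", "free tonight? call me"]

demo_vectorizer = CountVectorizer(lowercase=False)
counts = demo_vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index; list tokens in column order
vocab = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(vocab)
# ['FREE', 'call', 'entry', 'free', 'me', 'prize', 'tonight', 'win']

# One row per message, one column per vocabulary token
print(counts.toarray())
# [[2 0 1 0 0 1 0 1]
#  [0 1 0 1 1 0 1 0]]
```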

## Train a Naive Bayes classifier

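Naive Bayes classifiers are generally a good starting point for a simple classification model.

`GaussianNB` estimates the probability of spam vs. ham by modelling the likelihood of each feature $x_i$ as a Gaussian distribution with a per-class mean $\mu_y$ and variance $\sigma^2_y$: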

$$
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)
$$

http://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
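
The formula can be evaluated by hand for a single feature. The sketch below is only an illustration: the helper `gaussian_likelihood` and the per-class means and variances are hypothetical values, not statistics taken from the SMS data.

```python
import math

def gaussian_likelihood(x, mean, var):
    """Density of x under a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics for one word-count feature
spam_mean, spam_var = 1.2, 0.8
ham_mean, ham_var = 0.1, 0.3

x = 2  # the word appears twice in a message
p_spam = gaussian_likelihood(x, spam_mean, spam_var)
p_ham = gaussian_likelihood(x, ham_mean, ham_var)

# A count of 2 is far more likely under the (hypothetical) spam statistics
print(p_spam, p_ham, p_spam > p_ham)
```

The classifier multiplies such per-feature likelihoods together with the class prior and predicts the class with the larger product.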

In [79]:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(Z_train.toarray(), y_train)

print('Accuracy', clf.score(Z_test.toarray(), y_test))

Accuracy 0.9267767408470926


In [80]:
from sklearn.metrics import confusion_matrix

y_pred = clf.predict(Z_test.toarray())

confusion_matrix(y_test, y_pred)

array([[1121,   86],
       [  16,  170]], dtype=int64)

Accuracy looks good (92.7%) on the test set, but the errors are not negligible: 86 of the 1,207 "ham" messages (7.1%) are flagged as spam, and 16 of the 186 "spam" messages (8.6%) slip through as ham.

```
array([[1121,  86],
       [  16,  170]], dtype=int64)
```

- 1121 = number of correctly classified 'ham' SMS
- 170 = number of correctly classified 'spam' SMS
- 16 = number of 'spam' wrongly classified as 'ham'
- 86 = number of 'ham' wrongly classified as 'spam'
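
The four counts can be turned into rates with plain arithmetic. This sketch copies the values from the confusion matrix above; the variable names are only for illustration, and "spam" is treated as the positive class.

```python
# Values copied from the confusion matrix above
# (rows: true label, columns: predicted label; order: ham, spam)
tn, fp = 1121, 86   # true ham:  predicted ham, predicted spam
fn, tp = 16, 170    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)      # of messages flagged as spam, how many are spam
spam_recall = tp / (tp + fn)         # of actual spam, how many are caught
ham_flagged_rate = fp / (fp + tn)    # of actual ham, how many are wrongly flagged

print(f"accuracy         {accuracy:.3f}")          # 0.927
print(f"spam precision   {spam_precision:.3f}")    # 0.664
print(f"spam recall      {spam_recall:.3f}")       # 0.914
print(f"ham flagged rate {ham_flagged_rate:.3f}")  # 0.071
```

The precision of about 66% means roughly one in three messages flagged as spam is actually ham, which the overall accuracy figure hides.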

## Inspecting error cases

Let's take a look at the incorrectly labelled cases.

In [83]:
spam_classified_as_ham = X_test[(y_test == 'spam') & (y_test != y_pred)]

spam_classified_as_ham

881     Reminder: You have not downloaded the content ...
3864    Oh my god! I've found your number again! I'm s...
2402    Babe: U want me dont u baby! Im nasty and have...
4527    I want some cock! My hubby's away, I need a re...
2663    Hello darling how are you today? I would love ...
751     Do you realize that in about 40 years, we'll h...
3463    Bloomberg -Message center +447797706009 Why wa...
1126    For taking part in our mobile survey yesterday...
227     Will u meet ur dream partner soon? Is ur caree...
3755    Bloomberg -Message center +447797706009 Why wa...
856     Talk sexy!! Make new friends or fall in love i...
3360    Sorry I missed your call let's talk when you h...
2770    Burger King - Wanna play footy at a top stadiu...
731     Email AlertFrom: Jeri StewartSize: 2KBSubject:...
1507    Thanks for the Vote. Now sing along with the s...
68      Did you hear about the new "Divorce Barbie"? I...
Name: X, dtype: object

In [84]:
ham_classified_as_spam = X_test[(y_test == 'ham') & (y_test != y_pred)]

ham_classified_as_spam

1859                     Sir, i am waiting for your call.
2952                     Hey now am free you can call me.
3142                       Customer place i will call you
2422    Err... Cud do. I'm going to  at 8pm. I haven't...
4937                             K..k.:)congratulation ..
5519    Can you pls send me that company name. In saib...
5339                  You'd like that wouldn't you? Jerk!
315             You made my day. Do have a great day too.
5349                                          I'm home...
33      For fear of fainting with the of all that hous...
1870                       Mom wants to know where you at
4050                     Yeah that's the impression I got
2067                           Then. You are eldest know.
4677                              Ü ready then call me...
3535                           Good evening! How are you?
3689                           I'll meet you in the lobby
4999                               Can you talk with me..
132           

## Save model for deployment

We can optimize the model further (for example by using word vectors instead of counts), but for now let's save the model so that we can use it from our web server.

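Scikit-learn models can be saved using pickle, Python's general-purpose serialisation library. The vectorizer is saved as well, because the server needs it to turn incoming messages into the same features the classifier was trained on.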

http://scikit-learn.org/stable/modules/model_persistence.html

In [85]:
import pickle

# Serialise the CountVectorizer (the with block closes the file afterwards)
with open('model/vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)

# Serialise the GaussianNB classifier
with open('model/model.pickle', 'wb') as f:
    pickle.dump(clf, f)

In [1]:
!dir model

 Volume in drive D is DATA
 Volume Serial Number is B200-6E0E

 Directory of D:\stackup-workshops\simple-ml\model

04/10/2018  03:25 PM    <DIR>          .
04/10/2018  03:25 PM    <DIR>          ..
04/10/2018  03:25 PM           299,209 model.pickle
04/10/2018  03:25 PM           399,600 vectorizer.pickle
               2 File(s)        698,809 bytes
               2 Dir(s)  782,077,120,512 bytes free
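
On the web server side, the two pickles would be loaded back and used together. The sketch below is one possible way to do that, not the workshop's actual server code: it trains toy stand-ins on made-up messages and writes them to a temporary directory so that it runs on its own, and the `classify` helper is a hypothetical name. In practice the files would be `model/vectorizer.pickle` and `model/model.pickle`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy stand-ins for the trained objects (made-up messages, not the SMS corpus)
texts = ["WIN a FREE prize now", "FREE cash, claim now",
         "see you at lunch", "call me when you are home"]
labels = ["spam", "spam", "ham", "ham"]
toy_vectorizer = CountVectorizer(lowercase=False).fit(texts)
toy_clf = GaussianNB().fit(toy_vectorizer.transform(texts).toarray(), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)  # stands in for the model/ directory

    # Save both objects, as in the cell above
    with open(model_dir / "vectorizer.pickle", "wb") as f:
        pickle.dump(toy_vectorizer, f)
    with open(model_dir / "model.pickle", "wb") as f:
        pickle.dump(toy_clf, f)

    # What a server could do once at startup
    with open(model_dir / "vectorizer.pickle", "rb") as f:
        loaded_vectorizer = pickle.load(f)
    with open(model_dir / "model.pickle", "rb") as f:
        loaded_clf = pickle.load(f)

def classify(message):
    """Hypothetical helper: vectorize one message and return the predicted label."""
    features = loaded_vectorizer.transform([message]).toarray()  # GaussianNB needs a dense array
    return loaded_clf.predict(features)[0]

print(classify("FREE prize now"))
print(classify("see you at lunch"))
```

Loading a pickle requires a scikit-learn version compatible with the one that wrote it, so the server environment should match the training environment.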
