# Unigrams, Bigrams, and Trigrams in Naive Bayes Classifiers 
# Using data from Table 13.1


In [10]:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
import numpy as np

df = pd.read_csv('./iaml-edimburg-spam-data.csv', usecols=[0,1], encoding='latin-1')
df.columns = ['label','message']
df

Unnamed: 0,label,message
0,spam,send us your password
1,ham,send us your review
2,ham,review y our password
3,spam,review us
4,spam,send your password
5,spam,send us your account
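
For readers who do not have the CSV file at hand, the six messages in the table above can be rebuilt inline. This is only a convenience sketch: the names `rows` and `inline_df` are not part of the original notebook, and the rows are typed in exactly as displayed.

```python
import pandas as pd

# The six labelled messages from the table above, typed in by hand
# (assumption: the displayed table is the complete data set).
rows = [
    ("spam", "send us your password"),
    ("ham", "send us your review"),
    ("ham", "review y our password"),
    ("spam", "review us"),
    ("spam", "send your password"),
    ("spam", "send us your account"),
]

inline_df = pd.DataFrame(rows, columns=["label", "message"])

# Quick look at the class balance
print(inline_df["label"].value_counts())
```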


# Pre-processing

With the data loaded, it is time for some preprocessing.
We will focus on removing variation that is useless for the task at hand.
First, we convert the labels from strings to binary values for our classifier:

In [11]:
df['label'] = df.label.map({'ham': 0, 'spam': 1})  

#Second, convert all characters in the message to lower case:

df['message'] = df.message.map(lambda x: x.lower())  

#Third, remove any punctuation:
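# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['message'] = df['message'].str.replace(r'[^\w\s]', '', regex=True)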

df

Unnamed: 0,label,message
0,1,send us your password
1,0,send us your review
2,0,review y our password
3,1,review us
4,1,send your password
5,1,send us your account
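
The table above looks unchanged because the sample messages are already lower case and contain no punctuation. A small sketch on made-up strings (the `sample` series is hypothetical, not part of the data set) shows what the two cleaning steps do:

```python
import pandas as pd

# Hypothetical messages with capitals and punctuation
sample = pd.Series(["Send us your PASSWORD!!!", "Review, please: y/n?"])

# Lower-case first, then drop every character that is neither
# a word character (\w) nor whitespace (\s)
cleaned = sample.str.lower().str.replace(r"[^\w\s]", "", regex=True)

print(cleaned.tolist())
```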


In [12]:
# Fourth, tokenize the messages into single words using nltk.
nltk.download('punkt')  # assumption: recent NLTK releases may ask for 'punkt_tab' instead

#Now we can apply the tokenization:
df['message'] = df['message'].apply(nltk.word_tokenize)  

df

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\oscar.cala\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,label,message
0,1,"[send, us, your, password]"
1,0,"[send, us, your, review]"
2,0,"[review, y, our, password]"
3,1,"[review, us]"
4,1,"[send, your, password]"
5,1,"[send, us, your, account]"
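
Each token list above is a list of unigrams. As a rough illustration of how bigrams and trigrams follow from the same lists, here is a minimal pure-Python sketch. The helper `ngrams` is a hypothetical name introduced for this example, not a function used elsewhere in the notebook.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) of consecutive tokens in a list."""
    # zip over n shifted copies of the list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["send", "us", "your", "password"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A message with k tokens yields k - n + 1 n-grams, so short messages such as "review us" contribute a single bigram and no trigrams.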


Fifth, we will perform some word stemming. The idea of stemming is to normalize the text so that all variations of a word (for example, different tenses) map to the same form and are treated as the same feature. One of the most popular stemming algorithms is the Porter stemmer:

In [13]:
stemmer = PorterStemmer()

df['message'] = df['message'].apply(lambda x: [stemmer.stem(y) for y in x])  

df

Unnamed: 0,label,message
0,1,"[send, us, your, password]"
1,0,"[send, us, your, review]"
2,0,"[review, y, our, password]"
3,1,"[review, us]"
4,1,"[send, your, password]"
5,1,"[send, us, your, account]"
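
The table above is unchanged because every word in the sample is already in its stem form. To see the stemmer do some work, here is a small sketch on inflected words; the `words` list is made up for illustration and does not come from the data set.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical inflected forms of words that appear in the messages
words = ["sending", "reviews", "reviewed", "passwords"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```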


Finally, we will transform the messages into word-occurrence counts, which are the features that we will feed into our model:

In [14]:
# This converts the list of words into space-separated strings
df['message'] = df['message'].apply(lambda x: ' '.join(x))

count_vect = CountVectorizer()  
counts = count_vect.fit_transform(df['message'])  

df

Unnamed: 0,label,message
0,1,send us your password
1,0,send us your review
2,0,review y our password
3,1,review us
4,1,send your password
5,1,send us your account
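
The cell above counts single words only. To bring bigrams and trigrams into the feature set, `CountVectorizer` accepts an `ngram_range` argument. The sketch below is one possible variation, not the configuration used in the rest of this notebook; `messages` and `ngram_vect` are names introduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "send us your password",
    "send us your review",
    "review y our password",
    "review us",
    "send your password",
    "send us your account",
]

# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams
ngram_vect = CountVectorizer(ngram_range=(1, 3))
ngram_counts = ngram_vect.fit_transform(messages)

# The learned vocabulary now mixes single words and word sequences
print(sorted(ngram_vect.vocabulary_))
print(ngram_counts.shape)
```

With only six short messages, most bigrams and trigrams occur once, so the extra columns add little evidence per feature.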


In [15]:
counts

<6x7 sparse matrix of type '<type 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>
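
The matrix has 7 columns although the messages contain 8 distinct tokens. The default token pattern of `CountVectorizer` only keeps tokens of two or more characters, so the stray "y" is dropped. A short sketch to inspect the vocabulary and the dense counts (the `messages` list repeats the data shown earlier; `vocab` is a name introduced here):

```python
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "send us your password",
    "send us your review",
    "review y our password",
    "review us",
    "send your password",
    "send us your account",
]

count_vect = CountVectorizer()
counts = count_vect.fit_transform(messages)

# Order the terms by their column index in the count matrix
vocab = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)

print(vocab)
print(counts.toarray())
```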

We could keep the simple word count per message, but a common refinement is to re-weight the counts with term frequency-inverse document frequency, better known as tf-idf:

In [16]:
transformer = TfidfTransformer().fit(counts)

counts = transformer.transform(counts)  
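
To make the weighting less of a black box, the sketch below recomputes it by hand and compares the result with `TfidfTransformer`. It assumes the transformer's default settings, in which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit L2 norm. The variable names (`tf`, `doc_freq`, `manual`) are introduced for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

messages = [
    "send us your password",
    "send us your review",
    "review y our password",
    "review us",
    "send your password",
    "send us your account",
]

raw_counts = CountVectorizer().fit_transform(messages)
tfidf = TfidfTransformer().fit_transform(raw_counts).toarray()

# Manual computation under the default (smoothed) idf
tf = raw_counts.toarray().astype(float)
n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)              # messages containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise rows

print(np.allclose(manual, tfidf))
```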

# Training the Model

Now that we have performed feature extraction from our data, it is time to build our model. We will start by splitting our data into training and test sets:

In [17]:
X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], test_size=0.1, random_state=69)  
model = BernoulliNB().fit(X_train, y_train)  
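
One detail worth checking: with its default `binarize=0.0`, `BernoulliNB` turns every positive feature value into 1 before fitting, so it only sees whether a word is present. The sketch below, which fits on all six messages rather than on the train split, suggests that the tf-idf weights make no difference to this particular model. The names `nb_counts` and `nb_tfidf` are introduced here for the comparison.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

messages = [
    "send us your password",
    "send us your review",
    "review y our password",
    "review us",
    "send your password",
    "send us your account",
]
labels = [1, 0, 0, 1, 1, 1]  # 1 = spam, 0 = ham

raw_counts = CountVectorizer().fit_transform(messages)
tfidf = TfidfTransformer().fit_transform(raw_counts)

# Same model, two feature matrices
nb_counts = BernoulliNB().fit(raw_counts, labels)
nb_tfidf = BernoulliNB().fit(tfidf, labels)

# Both inputs are binarized to the same 0/1 matrix, so the
# learned per-class word probabilities should coincide
same = np.allclose(nb_counts.feature_log_prob_, nb_tfidf.feature_log_prob_)
print(same)
```

If the weights are meant to influence the classifier, a model that uses feature magnitudes, such as `MultinomialNB`, would be one option to consider.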


# Evaluating the Model
Once we have put together our classifier, we can evaluate its performance on the test set:

In [18]:
predicted = model.predict(X_test)

print(np.mean(predicted == y_test))  

1.0
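
With six messages and `test_size=0.1`, the score above comes from a single held-out message. One way to use every message for testing once is leave-one-out cross-validation. The sketch below is an alternative evaluation under that assumption, not the procedure of this notebook; it wraps the vectorizer and the classifier in a pipeline so that the vocabulary is learned from the training fold only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

messages = [
    "send us your password",
    "send us your review",
    "review y our password",
    "review us",
    "send your password",
    "send us your account",
]
labels = [1, 0, 0, 1, 1, 1]  # 1 = spam, 0 = ham

pipeline = make_pipeline(CountVectorizer(), BernoulliNB())

# One fold per message: train on five, test on the remaining one
scores = cross_val_score(pipeline, messages, labels, cv=LeaveOneOut())

print(scores)         # one 0/1 accuracy per held-out message
print(scores.mean())  # fraction of messages classified correctly
```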


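A high accuracy can hide a classifier that favors the legitimate class while ignoring the spam class. To check for this, let's have a look at the confusion matrix. Keep in mind that with `test_size=0.1` and six messages the test set holds a single message, so both the accuracy above and the matrix below rest on one prediction: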

In [19]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, predicted))

[[1]]
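
The output is a 1x1 matrix because only one class appears in the single-message test set. Passing `labels` explicitly keeps the usual 2x2 layout (rows are true classes, columns are predicted classes). The values of `y_true` and `y_pred` below are hypothetical, since the output above does not reveal which class the held-out message belongs to.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical single-message test set: true label spam (1), predicted spam (1)
y_true = [1]
y_pred = [1]

# labels=[0, 1] forces one row and one column per class,
# even for classes that are absent from the test set
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```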
