# Introduction
This notebook contains the code for the **Spam Detection** mini-project in Udacity's Machine Learning Engineer Nanodegree programme.

### Step 1.1: Understanding our dataset

In [11]:
import pandas as pd

In [12]:
# The file is tab-separated and has no header row, so the column names are supplied explicitly
df = pd.read_table('SMSSpamCollection', names=['label', 'sms_message'])
df.head()

Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Step 1.2: Data Preprocessing

In [13]:
# Map the text labels to integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
print(df.shape)
df.head()

(5572, 2)


Unnamed: 0,label,sms_message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


This dataset consists of 5,572 rows, i.e. 5,572 SMS messages.

### Step 2.1: Bag of Words

In [14]:
# Implement BoW on my own
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

In [17]:
# Convert to lower case
lower_case_documents = [d.lower() for d in documents]
print(lower_case_documents)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


In [23]:
# Remove punctuation
import string

# Translation table that deletes every punctuation character
punctuation_table = str.maketrans('', '', string.punctuation)

sans_punctuation_documents = [d.translate(punctuation_table) for d in lower_case_documents]
sans_punctuation_documents

['hello how are you',
 'win money win from home',
 'call me now',
 'hello call hello you tomorrow']

###### Important
The trick here is the `str.maketrans()` function. According to its documentation, when a third argument is passed (as in my code), every character in that argument is mapped to `None`, i.e. it is deleted when the table is applied with `translate()`. In other words, I am removing all punctuation from my documents.
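
To make the behaviour described above concrete, here is a minimal sketch that builds the translation table on its own and inspects one entry. The variable names `table`, `sample` and `cleaned` are only illustrative.

```python
import string

# With three arguments, str.maketrans maps each character of the third
# argument to None, so str.translate drops those characters.
table = str.maketrans("", "", string.punctuation)

sample = "hello, call hello you tomorrow?"  # one of the sample documents, lower-cased
cleaned = sample.translate(table)

print(table[ord(",")])  # the comma's entry in the table
print(cleaned)          # the sentence without punctuation
```

The table is keyed by Unicode code points, which is why the lookup uses `ord(",")` rather than the character itself.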

In [25]:
# Tokenisation
preprocessed_documents = [d.split() for d in sans_punctuation_documents]
preprocessed_documents

[['hello', 'how', 'are', 'you'],
 ['win', 'money', 'win', 'from', 'home'],
 ['call', 'me', 'now'],
 ['hello', 'call', 'hello', 'you', 'tomorrow']]

In [26]:
# Count frequencies
from collections import Counter
import pprint

In [28]:
frequency_list = [Counter(d) for d in preprocessed_documents]
pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
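
The per-document counters can be lined up into a single document-term matrix by hand. The sketch below is one way to do it, assuming an alphabetically sorted vocabulary; the names `vocabulary` and `matrix` are illustrative and not part of the original exercise.

```python
from collections import Counter

preprocessed_documents = [
    ['hello', 'how', 'are', 'you'],
    ['win', 'money', 'win', 'from', 'home'],
    ['call', 'me', 'now'],
    ['hello', 'call', 'hello', 'you', 'tomorrow'],
]
frequency_list = [Counter(d) for d in preprocessed_documents]

# Vocabulary: every distinct word across all documents, sorted alphabetically
# (the ordering is an assumption made for this sketch).
vocabulary = sorted(set().union(*frequency_list))

# One row per document, one column per vocabulary word.
# A Counter returns 0 for a word it has never seen.
matrix = [[counts[word] for word in vocabulary] for counts in frequency_list]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so documents of different lengths end up as vectors of the same size.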


### Step 2.2: Implementing BoW in scikit-learn

In [29]:
# Here we create a frequency matrix on a smaller document set to make sure we understand how the
# document-term matrix is generated. We reuse the sample document set 'documents'.
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

In [61]:
count_vector = CountVectorizer()

##### Parameters of CountVectorizer

In [62]:
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
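
Two of the printed defaults matter most for this project: `lowercase=True` and `token_pattern='(?u)\\b\\w\\w+\\b'`. The sketch below only approximates their combined effect with the standard `re` module; it is not the library's actual code path, and the helper `tokenize` and the sample sentence are hypothetical.

```python
import re

# Default token_pattern copied from the printed parameters above:
# a token is a run of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Rough stand-in for the default analysis: lowercase, then regex tokens."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = tokenize("Hello, I won a prize!")
print(tokens)
```

Punctuation never matches `\w`, and single-character words such as "I" and "a" are too short for the pattern, so both are dropped without a separate cleaning step.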


In [64]:
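# Fit the vectorizer on the documents and list the learned vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
count_vector.fit(documents)
count_vector.get_feature_names()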

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

In [65]:
# Transform the documents into a document-term count array
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]])

In [70]:
# Create a pandas DataFrame
frequency_matrix = pd.DataFrame(doc_array, columns = count_vector.get_feature_names())
frequency_matrix

Unnamed: 0,are,call,from,hello,home,how,me,money,now,tomorrow,win,you
0,1,0,0,1,0,1,0,0,0,0,0,1
1,0,0,1,0,1,0,0,1,0,0,2,0
2,0,1,0,0,0,0,1,0,1,0,0,0
3,0,1,0,2,0,0,0,0,0,1,0,1


**YEAH!** I have implemented a BoW representation! Yuppy!

### Step 3.1: Training and testing sets

In [72]:
from sklearn.model_selection import train_test_split

In [79]:
X_train, X_test, Y_train, Y_test = train_test_split(df['sms_message'], df['label'], random_state=1)

In [80]:
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
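
The 4179/1393 split follows from the default test fraction, since the call passes neither `test_size` nor `train_size`. The sketch below reproduces the arithmetic, assuming scikit-learn's documented default of `test_size=0.25`; rounding the test count up is an assumption of this sketch and makes no difference here because the product is a whole number.

```python
import math

n_samples = 5572
test_size = 0.25  # assumed default when no size argument is passed

n_test = math.ceil(test_size * n_samples)  # rounding rule is an assumption
n_train = n_samples - n_test

print(n_train, n_test)
```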


### Step 3.2: Apply BoW processing to our dataset

In [83]:
# Define a new count vectorizer
count_vector = CountVectorizer()

# Fit the training data and return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data
testing_data = count_vector.transform(X_test)
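
The vocabulary is learned from the training data only, and the test data is then encoded against that fixed vocabulary. The plain-Python sketch below illustrates the idea with hypothetical helpers `fit_vocabulary` and `transform` and made-up toy messages. It splits on whitespace for brevity, which is a simplification of what `CountVectorizer` does.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Collect the sorted set of words seen in the training documents."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def transform(docs, vocabulary):
    """Count only the words that are already in the vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows

train_docs = ['win money now', 'call me tomorrow']  # toy training messages
test_docs = ['win a free prize now']                # toy test message

vocabulary = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocabulary)
test_rows = transform(test_docs, vocabulary)

print(vocabulary)
print(test_rows)
```

Words that appear only in the test message ("a", "free", "prize") get no column. The test matrix therefore has the same width as the training matrix, and nothing about the test set leaks into the features.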

### Step 4.1: Bayes' Theorem implementation from scratch

In [89]:
# Example diabetes
# P(D)
p_diabetes = 0.01

# P(~D)
p_no_diabetes = 0.99

# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9

# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9

In [90]:
# P(Pos) by the law of total probability: P(D) * P(Pos|D) + P(~D) * P(Pos|~D)
p_pos = p_diabetes * p_pos_diabetes + p_no_diabetes * (1 - p_neg_no_diabetes)

In [93]:
# P(D|Pos) = P(D) * P(Pos|D) / P(Pos)
p_diabetes_pos = p_diabetes * p_pos_diabetes / p_pos
print('Probability of an individual having diabetes, given that that individual got '
      'a positive test result is:\n{}'.format(p_diabetes_pos))

Probability of an individual having diabetes, given that that individual got a positive test result is:
0.08333333333333336
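
The same calculation can be wrapped in a small function to see how strongly the result depends on the prior. This is a sketch only: the function name `posterior` and the second prior of 0.1 are illustrative assumptions, not part of the original exercise.

```python
def posterior(prior, sensitivity, specificity):
    """Return P(D|Pos) from P(D), P(Pos|D) and P(Neg|~D) via Bayes' theorem."""
    false_positive_rate = 1 - specificity  # P(Pos|~D)
    p_pos = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / p_pos

rare = posterior(prior=0.01, sensitivity=0.9, specificity=0.9)
common = posterior(prior=0.1, sensitivity=0.9, specificity=0.9)

print(round(rare, 4))    # close to the 0.0833 computed above
print(round(common, 4))  # a tenfold higher prior gives roughly 0.5
```

With a 1% prior, most positive results come from the much larger healthy group, which is why a test that is 90% accurate in both directions still yields a posterior of only about 8%.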


In [96]:
# Instructions:
# Compute the probability of an individual not having diabetes, given that the individual got a positive test result.
# In other words, compute P(~D|Pos).

# The formula is: P(~D|Pos) = P(~D) * P(Pos|~D) / P(Pos)

# Note that P(Pos|~D) can be computed as 1 - P(Neg|~D).

# Therefore:
# P(Pos|~D) = p_pos_no_diabetes = 1 - 0.9 = 0.1

# Since P(D|Pos) and P(~D|Pos) sum to 1, the same result is obtained as 1 - P(D|Pos):
print('Probability of an individual not having diabetes, given that that individual got '
      'a positive test result is:\n{}'.format(1 - p_diabetes_pos))

Probability of an individual not having diabetes, given that that individual got a positive test result is:
0.9166666666666666


In [97]:
### Example with other features
# Instructions: Compute the probability of the words 'freedom' and 'immigration' being said in a speech, or
# P(F,I).

# The first step is multiplying the probability of Jill Stein giving a speech by her individual
# probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_j_text.

# The second step is multiplying the probability of Gary Johnson giving a speech by his individual
# probabilities of saying the words 'freedom' and 'immigration'. Store this in a variable called p_g_text.

# The third step is to add both of these probabilities, which gives P(F,I).

In [98]:
# P(J)
p_j = 0.5

# P(F|J)
p_j_f = 0.1

# P(I|J)
p_j_i = 0.1

# P(J) * P(F|J) * P(I|J), treating the two words as independent given the speaker
p_j_text = p_j * p_j_f * p_j_i
print(p_j_text)

0.005000000000000001


In [99]:
# P(G)
p_g = 0.5

# P(F|G)
p_g_f = 0.7

# P(I|G)
p_g_i = 0.2

# P(G) * P(F|G) * P(I|G), treating the two words as independent given the speaker
p_g_text = p_g * p_g_f * p_g_i
print(p_g_text)

0.06999999999999999


In [100]:
# P(F,I) = P(J) * P(F|J) * P(I|J) + P(G) * P(F|G) * P(I|G)
p_f_i = p_j_text + p_g_text

print('Probability of the words freedom and immigration being said: {}'.format(p_f_i))

Probability of the words freedom and immigration being said: 0.075


In [102]:
# P(J|F,I)
p_j_fi = p_j_text / p_f_i
print('The probability that Jill Stein is the speaker, given the words freedom and immigration: {}'.format(p_j_fi))

The probability that Jill Stein is the speaker, given the words freedom and immigration: 0.06666666666666668


In [105]:
# P(G|F,I)
p_g_fi = p_g_text / p_f_i
print('The probability that Gary Johnson is the speaker, given the words freedom and immigration: {}'.format(p_g_fi))

The probability that Gary Johnson is the speaker, given the words freedom and immigration: 0.9333333333333332
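
The two-candidate calculation generalises to any number of classes and words. The sketch below is one possible formulation, with a hypothetical helper `naive_bayes_posteriors`. It multiplies each prior by its per-word likelihoods and then normalises by their sum, which plays the role of P(F,I). It uses `math.prod`, so it assumes Python 3.8 or later.

```python
from math import prod

def naive_bayes_posteriors(priors, likelihoods):
    """Return P(class | words), treating words as independent given the class."""
    joint = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    evidence = sum(joint.values())  # total probability of the observed words
    return {c: joint[c] / evidence for c in joint}

priors = {'Jill Stein': 0.5, 'Gary Johnson': 0.5}
# Per-class likelihoods of 'freedom' and 'immigration', taken from the cells above
likelihoods = {'Jill Stein': [0.1, 0.1], 'Gary Johnson': [0.7, 0.2]}

result = naive_bayes_posteriors(priors, likelihoods)
for name, p in result.items():
    print(name, round(p, 4))
```

The posteriors sum to 1 by construction, which is a quick sanity check on any hand-rolled implementation.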


### Step 5: Implementing Naive Bayes using scikit-learn

In [106]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [107]:
predictions = naive_bayes.predict(testing_data)
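
To give a feel for what a multinomial Naive Bayes classifier with `alpha=1.0` computes, here is a from-scratch sketch on toy data. It is not scikit-learn's implementation: the helpers `train` and `predict`, the whitespace tokenisation and the four toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    vocabulary = sorted({w for d in docs for w in d.split()})
    model = {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in class_docs for w in d.split())
        # Adding alpha to every word count keeps unseen words from getting probability 0
        total = sum(counts.values()) + alpha * len(vocabulary)
        model[c] = {
            'log_prior': math.log(len(class_docs) / len(docs)),
            'log_likelihood': {w: math.log((counts[w] + alpha) / total) for w in vocabulary},
        }
    return model

def predict(model, doc):
    """Return the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {}
    for c, params in model.items():
        known = [w for w in doc.split() if w in params['log_likelihood']]
        scores[c] = params['log_prior'] + sum(params['log_likelihood'][w] for w in known)
    return max(scores, key=scores.get)

# Toy data: 1 = spam, 0 = ham
docs = ['win money now', 'win prize money', 'call me tomorrow', 'see you tomorrow']
labels = [1, 1, 0, 0]

model = train(docs, labels)
print(predict(model, 'win money tomorrow'))
print(predict(model, 'call me now'))
```

Without smoothing, "win money tomorrow" would get probability zero under both classes, because each class has at least one of its words missing. With `alpha=1.0` the two classes can still be compared.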

### Step 6: Evaluate the model

In [108]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [109]:
print('Accuracy score: {}'.format(accuracy_score(Y_test, predictions)))
print('Precision score: {}'.format(precision_score(Y_test, predictions)))
print('Recall score: {}'.format(recall_score(Y_test, predictions)))
print('F1 score: {}'.format(f1_score(Y_test, predictions)))

Accuracy score: 0.9885139985642498
Precision score: 0.9720670391061452
Recall score: 0.9405405405405406
F1 score: 0.9560439560439562
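
The four scores are all functions of the confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an assumption: they were chosen because they add up to the 1393 test rows and reproduce the reported scores, and they should be treated as illustrative rather than as output of the model.

```python
# Assumed counts (not printed by the notebook), with spam as the positive class
tp, fp, fn, tn = 174, 5, 11, 1203

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all messages classified correctly
precision = tp / (tp + fp)                  # share of predicted spam that really is spam
recall = tp / (tp + fn)                     # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print('Accuracy score: {}'.format(accuracy))
print('Precision score: {}'.format(precision))
print('Recall score: {}'.format(recall))
print('F1 score: {}'.format(f1))
```

Read this way, accuracy is dominated by the large ham class, while recall shows that a little under 6% of spam messages still get through.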
