# Building a Spam Filter with Naive Bayes

**Project Goal**: Design a filter to detect spam SMS messages.

**Dataset**: The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.

## Part 1: Exploratory Data Analysis

In [33]:
import pandas as pd

In [34]:
df = pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label', 'SMS'])

In [35]:
print(df.columns)

Index(['Label', 'SMS'], dtype='object')


In [36]:
print(str(len(df)))

5572


In [37]:
pd.set_option('display.max_colwidth', 50)

## Part 2: Dividing test and training data

* Reserving 20% of the data for testing, 80% for training the model
* We'll randomize the dataset before splitting

In [38]:
def print_label_totals(df,column,labels):
    # Print the count and percentage of rows carrying each label
    total_rows = len(df)
    for label in labels:
        num_label = len(df[df[column] == label])
        print(f"{num_label} out of {total_rows} are label: {label} - {(num_label/total_rows)*100}%")

In [39]:
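randomized_df = df.sample(frac=1,random_state=1)
num_training_rows = int(len(df)*.8)
# Note: the slices below are taken from df, not randomized_df, so this split
# keeps the original row order. Slice randomized_df instead to split shuffled data.
training_set = df.iloc[:num_training_rows]
test_set = df.iloc[num_training_rows:]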
print_label_totals(training_set,'Label',['ham','spam'])
print_label_totals(test_set,'Label',['ham','spam'])

3855 out of 4457 are label: ham - 86.49315683194975%
602 out of 4457 are label: spam - 13.506843168050258%
970 out of 1115 are label: ham - 86.99551569506725%
145 out of 1115 are label: spam - 13.004484304932735%
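
For a split that draws from the shuffled rows, the slices need to come from the shuffled frame itself. Below is a minimal sketch on toy data. The names `toy_df`, `shuffled`, `train`, and `test` are illustrative and do not appear elsewhere in this notebook, and the `.copy()` call is an assumption about how one might avoid pandas' copy-of-a-slice warnings later on.

```python
import pandas as pd

# Toy stand-in for the SMS data: 8 ham and 2 spam messages (illustrative only)
toy_df = pd.DataFrame({
    "Label": ["ham"] * 8 + ["spam"] * 2,
    "SMS": [f"message {i}" for i in range(10)],
})

# Shuffle all rows reproducibly, then renumber the index from 0
shuffled = toy_df.sample(frac=1, random_state=1).reset_index(drop=True)

# The first 80% of the shuffled rows go to training, the rest to testing
split_at = int(len(shuffled) * 0.8)
train = shuffled.iloc[:split_at].copy()
test = shuffled.iloc[split_at:].reset_index(drop=True)

print(len(train), len(test))
```

Every row lands in exactly one of the two sets, and rerunning with the same `random_state` reproduces the same split.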


## Part 3: Data Preprocessing and Cleaning for the Training data

We'll need to encode the dataset so that each message is represented not as a string, but as a set of counts recording how many times each word appears in that message.

This will enable us to calculate occurrence frequencies for individual words across the data set and then eventually probabilities.

We'll also need to remove punctuation and standardize capitalization across all of the messages to simplify string matching for words later.
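
To make the target representation concrete, here is a small sketch of the encoding on two made-up messages. The helper `tokenize` and the variables `messages` and `count_table` are hypothetical names used only for this illustration; the notebook's own cells below do the same steps with pandas string methods.

```python
import re
from collections import Counter

import pandas as pd

# Two made-up messages, used only to illustrate the encoding
messages = ["Win a FREE prize!!", "free free entry"]

def tokenize(text):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    return re.sub(r"\W", " ", text).lower().split()

# One Counter per message: word -> number of occurrences in that message
counts = [Counter(tokenize(m)) for m in messages]

# Rows are messages, columns are vocabulary words, missing words count as 0
count_table = pd.DataFrame(counts).fillna(0).astype(int)
print(count_table)
```

Each row of `count_table` is the quantitative version of one message, which is the shape of data the probability calculations need.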

In [40]:
training_set.iloc[2,1]

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

In [41]:
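# regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
training_set['SMS'] = training_set['SMS'].str.lower()
# Alternative: keep letters only (note that this also removes digits)
# training_set['SMS'].str.replace(r'[^a-zA-Z]', ' ', regex=True)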
print(training_set.iloc[2,1])

free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005  text fa to 87121 to receive entry question std txt rate t c s apply 08452810075over18 s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [42]:
print(training_set.head(3))

  Label                                                SMS
0   ham  go until jurong point  crazy   available only ...
1   ham                      ok lar    joking wif u oni   
2  spam  free entry in 2 a wkly comp to win fa cup fina...


In [43]:
training_set['SMS'] = training_set['SMS'].str.split()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [44]:
print(training_set.head(3))

  Label                                                SMS
0   ham  [go, until, jurong, point, crazy, available, o...
1   ham                     [ok, lar, joking, wif, u, oni]
2  spam  [free, entry, in, 2, a, wkly, comp, to, win, f...


In [45]:
vocab_list = []
for message in training_set['SMS']:
    for word in message:
        vocab_list.append(word)
        
vocabulary = list(set(vocab_list))
print(vocabulary[:20])

['haunt', 'colleagues', 'exam', 'burial', 'hp', '09066364311', '09701213186', 'fast', 'mouth', 'oath', 'bcums', 'later', 'babes', 'motive', '2004', 'cmon', 'photo', 'medicine', 'official', 'file']


In [48]:
word_counts_per_sms = {}
for vocab_word in vocabulary:
    word_counts_per_sms[vocab_word] = [0]*len(training_set)
    
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        
word_count_df = pd.DataFrame(word_counts_per_sms)

In [52]:
df_new_training_set = pd.concat([training_set,word_count_df],axis=1,sort=False)
print(df_new_training_set.head(3))

  Label                                                SMS  haunt  colleagues  \
0   ham  [go, until, jurong, point, crazy, available, o...      0           0   
1   ham                     [ok, lar, joking, wif, u, oni]      0           0   
2  spam  [free, entry, in, 2, a, wkly, comp, to, win, f...      0           0   

   exam  burial  hp  09066364311  09701213186  fast  ...  headstart  hardcore  \
0     0       0   0            0            0     0  ...          0         0   
1     0       0   0            0            0     0  ...          0         0   
2     0       0   0            0            0     0  ...          0         0   

   james  iraq  tsandcs  result  2px  suffers  less  irritating  
0      0     0        0       0    0        0     0           0  
1      0     0        0       0    0        0     0           0  
2      0     0        0       0    0        0     0           0  

[3 rows x 7811 columns]


## Part 4: Training the Naive Bayes Algorithm

### Theoretical basis

We will use the following equations:
\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{aligned}
&N_{w_i|Spam} = \text{the number of times the word } w_i \text{ occurs in spam messages} \\
&N_{w_i|Ham} = \text{the number of times the word } w_i \text{ occurs in non-spam (ham) messages} \\
\\
&N_{Spam} = \text{total number of words in spam messages} \\
&N_{Ham} = \text{total number of words in non-spam (ham) messages} \\
\\
&N_{Vocabulary} = \text{number of unique words in the vocabulary} \\
&\alpha = 1 \ \ \ \ (\alpha \text{ is a smoothing parameter})
\end{aligned}
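
To see how the two proportionality relations are used together, here is a toy sketch of the decision rule: compute both products and pick the larger. All probabilities below are invented for illustration, and `score` is a hypothetical helper, not part of this notebook. Skipping words that are missing from the dictionaries is also an assumption of this sketch.

```python
from math import prod  # available in Python 3.8+

# Invented priors and per-word likelihoods, for illustration only
p_spam_toy, p_ham_toy = 0.2, 0.8
p_word_given_spam_toy = {"free": 0.05, "prize": 0.04, "lunch": 0.001}
p_word_given_ham_toy = {"free": 0.005, "prize": 0.001, "lunch": 0.02}

def score(words, prior, likelihoods):
    """Prior times the product of P(w_i | class); words not in the dict are skipped."""
    return prior * prod(likelihoods[w] for w in words if w in likelihoods)

def classify(words):
    spam_score = score(words, p_spam_toy, p_word_given_spam_toy)
    ham_score = score(words, p_ham_toy, p_word_given_ham_toy)
    return "spam" if spam_score > ham_score else "ham"

print(classify(["free", "prize"]))
print(classify(["lunch"]))
```

Because both scores omit the same normalizing constant, comparing them is enough to decide which class is more probable.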

In [53]:
p_spam = len(df_new_training_set[df_new_training_set['Label'] == 'spam'])/len(df_new_training_set)
print(p_spam)

In [54]:
p_ham = len(df_new_training_set[df_new_training_set['Label'] == 'ham'])/len(df_new_training_set)
print(p_ham)

In [57]:
n_spam = 0
all_spam_df = df_new_training_set[df_new_training_set['Label'] == 'spam']
for word_list in all_spam_df['SMS']:
    n_spam += len(word_list)
print(n_spam)

15341


In [58]:
n_ham = 0
all_ham_df = df_new_training_set[df_new_training_set['Label'] == 'ham']
for word_list in all_ham_df['SMS']:
    n_ham += len(word_list)
print(n_ham)

57202


In [59]:
n_vocabulary = len(vocabulary)
print(n_vocabulary)

7809


In [None]:
alpha = 1
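
As a quick check of what the smoothing term does, the sketch below plugs the counts printed above (15341 spam words, 7809 vocabulary words) into the formula for P(w_i|Spam). The function name `smoothed_probability` and the variable names are hypothetical and used only here.

```python
def smoothed_probability(word_count, n_class_words, n_vocab, alpha=1):
    """Additive (Laplace) smoothing: (N_w|class + alpha) / (N_class + alpha * N_vocabulary)."""
    return (word_count + alpha) / (n_class_words + alpha * n_vocab)

# Values taken from the outputs printed above
n_spam_words = 15341
n_vocab_words = 7809

# A word never seen in spam still receives a small non-zero probability
p_unseen = smoothed_probability(0, n_spam_words, n_vocab_words)

# A word seen 10 times in spam
p_seen_10 = smoothed_probability(10, n_spam_words, n_vocab_words)

print(p_unseen, p_seen_10)
```

Without the `alpha` term, a single unseen word would zero out the whole product for that class.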

In [None]:
# Initialize two dictionaries in which each key is a unique word from our
# vocabulary (a string) and each value starts at 0. One dictionary stores the
# parameters for P(wi|Spam), the other the parameters for P(wi|Ham).
spam_word_probabilities = {}
ham_word_probabilities = {}

for vocab_word in vocabulary:
    spam_word_probabilities[vocab_word] = 0
    ham_word_probabilities[vocab_word] = 0

In [None]:
# Isolate the spam and the ham messages in the training set into two different DataFrames.
spam_prob = df_new_training_set[df_new_training_set['Label'] == 'spam']
ham_prob = df_new_training_set[df_new_training_set['Label'] == 'ham']

# Iterate over the vocabulary, and, for each word, calculate P(wi|Spam) and P(wi|Ham)
for vocab_word in vocabulary:
    num_word_occurrences_spam = spam_prob[vocab_word].sum()
    num_word_occurrences_ham = ham_prob[vocab_word].sum()
    p_word_given_spam =  (num_word_occurrences_spam+alpha)/(n_spam+alpha*n_vocabulary)
    p_word_given_ham =  (num_word_occurrences_ham+alpha)/(n_ham+alpha*n_vocabulary)
    # update the probability value in the two dictionaries
    spam_word_probabilities[vocab_word] = p_word_given_spam
    ham_word_probabilities[vocab_word] = p_word_given_ham
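
The loop above sums one column at a time. As a possible alternative, the same parameters can be computed for all words at once by summing the word-count columns in a single call. This is a sketch on a toy frame, not the notebook's method: `toy`, `toy_vocabulary`, and `word_probabilities` are illustrative names.

```python
import pandas as pd

# Toy word-count table: one row per message, one column per vocabulary word
toy = pd.DataFrame({
    "Label": ["spam", "ham", "spam"],
    "free": [2, 0, 1],
    "lunch": [0, 1, 0],
})
toy_vocabulary = ["free", "lunch"]
toy_alpha = 1

def word_probabilities(frame, label, vocab, alpha):
    """Return {word: P(word | label)} using additive smoothing."""
    # Per-word totals over all messages with this label
    counts = frame.loc[frame["Label"] == label, vocab].sum()
    n_class_words = counts.sum()  # total words in this class
    probs = (counts + alpha) / (n_class_words + alpha * len(vocab))
    return probs.to_dict()

toy_spam_probs = word_probabilities(toy, "spam", toy_vocabulary, toy_alpha)
toy_ham_probs = word_probabilities(toy, "ham", toy_vocabulary, toy_alpha)
print(toy_spam_probs)
print(toy_ham_probs)
```

Within each class the smoothed probabilities sum to 1 over the vocabulary, which is a handy sanity check on the parameters.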

In [None]:
def spam_classify_message(complete_message):
    all_words = complete_message.str.split()