# SMS Spam Filter Project
## Creating an SMS spam filter using the multinomial Naive Bayes algorithm

Text (SMS) message spam is a serious issue for millions of consumers around the world. While most spam messages are simply annoying, many are targeted campaigns to steal consumers' information. Filtering messages is not a simple endeavour. How does a computer decide whether an incoming message is spam or legitimate? One method, and the one chosen for this project, is the multinomial Naive Bayes algorithm. In general, the algorithm works as follows:
1. Learns how humans classify messages.
2. Uses that human knowledge to estimate the probabilities that a new message is spam or not spam (using Bayes' theorem).
3. Classifies a new message based on these probabilities. If the probability that a message is spam is higher than the probability that it is not spam, the algorithm classifies the message as spam, and vice versa. If the probabilities are equal, a human must decide the classification.
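
The decision rule in step 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `naive_bayes_score` and `classify` are hypothetical, and all probability values are picked by hand rather than learned from the dataset.

```python
from math import prod

def naive_bayes_score(prior, word_probs, message_words):
    """Prior times the product of the per-word conditional probabilities.

    The result is proportional to P(class | message), which is all we
    need when comparing two classes for the same message.
    """
    return prior * prod(word_probs[word] for word in message_words)

def classify(p_spam, p_ham):
    """Apply the decision rule from step 3."""
    if p_spam > p_ham:
        return 'spam'
    if p_ham > p_spam:
        return 'ham'
    return 'needs human classification'

# Hypothetical, hand-picked probabilities for illustration only
p_word_given_spam = {'free': 0.20, 'prize': 0.10, 'lunch': 0.01}
p_word_given_ham = {'free': 0.02, 'prize': 0.01, 'lunch': 0.10}

message = ['free', 'prize']
p_spam = naive_bayes_score(0.13, p_word_given_spam, message)
p_ham = naive_bayes_score(0.87, p_word_given_ham, message)
print(classify(p_spam, p_ham))
```

Even though the assumed prior favours ham, the words in this toy message are far more likely under the spam probabilities, so the spam score comes out higher.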

Our goal for this project is:
* To build an SMS spam filter with at least 80% accuracy using the multinomial Naive Bayes algorithm.

We will use a dataset of 5,572 already classified SMS messages from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) to "teach" the computer how to classify messages. 

To begin our project, we will import the dataset and explore it to familiarize ourselves with the data.

In [136]:
# Import libraries and dataset
import pandas as pd

sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [137]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [138]:
# Find percentage of spam vs ham (non-spam) messages
sms['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

### Establishing a training and testing set
On a simple exploration of our dataset, we see that about 87% of these messages are ham (non-spam) and 13% are spam. Next, we want to begin building our spam filter. However, before we start creating the software, we want to establish a method to test and verify the software works correctly. If we wait until the end of the project, we might be tempted to create a biased test so the software passes. 

When we complete the spam filter, we need to test how well it classifies new messages. In order to do this, we will first split our data into two sets:
* A training set: we will use this to "train" the algorithm how to classify messages
* A test set: we will use this to test the accuracy of the filter. 

In general, we want to use as much data as possible to train the algorithm while having enough data to test the algorithm. We will keep 80% of the dataset for training and 20% for testing. 
* The training set will have 4,458 messages
* The test set will have 1,114 messages
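
These set sizes follow directly from the 80/20 split. A quick check, assuming the training size is rounded to the nearest whole message (the variable names are illustrative):

```python
total_messages = 5572

# 80% for training, rounded to a whole number of messages
train_size = round(total_messages * 0.8)
# Everything that is left goes to the test set
test_size = total_messages - train_size

print(train_size, test_size)
```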

The test is simple: as the messages are already classified by a human, we only need to compare the classifications the algorithm makes to the human classifications. We will use this test once we actually build the software, but first we will split the dataset and begin creating the algorithm.
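
The comparison described above amounts to computing accuracy: the number of correct classifications divided by the total number of classified messages. A minimal sketch, where the helper `accuracy` and the toy label lists are hypothetical:

```python
def accuracy(human_labels, predicted_labels):
    """Fraction of messages where the prediction matches the human label."""
    correct = sum(
        human == predicted
        for human, predicted in zip(human_labels, predicted_labels)
    )
    return correct / len(human_labels)

# Toy labels for illustration only, not taken from the dataset
human = ['ham', 'spam', 'ham', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'spam']
print(accuracy(human, predicted))
```

Here four of the five toy predictions match the human labels, which would just meet the 80% goal.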

In [139]:
# First we will randomize the dataset
random_sms = sms.sample(frac=1, random_state=1) # random_state=1 for consistency

# Split dataset into train (first 4,458 rows) and test (remaining 1,114 rows)
# drop=True discards the shuffled index instead of adding it as an `index` column
train = random_sms.iloc[:4458, :].reset_index(drop=True)
test = random_sms.iloc[4458:, :].reset_index(drop=True)

In [140]:
# Verify the training sample is consistent with the entire dataset
train['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [141]:
# Verify the test sample is consistent with the entire dataset
test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

### Data Cleaning
Before we can move on to creating the algorithm, we need to clean these datasets. The goal of the cleaning process is to transform the train and test dataframes so that they have one column for each word in the entire vocabulary of these messages. Each row will still represent a message, but instead of relying on a single `SMS` column containing the message as a string, each word column will contain the number of times that word occurs in the message. In other words, each column will contain the frequency of that word in the message.

The SMS messages contain capitalizations and punctuation marks that we do not want. In order to transform the dataframes, we will first remove punctuation and make all the messages lowercase. 

In [142]:
# Replace punctuation (any non-word character) in train with spaces
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)

In [143]:
# Convert train messages to lowercase
train['SMS'] = train['SMS'].str.lower()
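
To see the effect of these two steps on a single message, the same regular expression can be applied with Python's built-in `re` module. The sample message is made up for illustration:

```python
import re

sample = "WINNER!! You've won a FREE prize."

# Replace every non-word character (anything other than letters,
# digits, and underscores) with a space
no_punctuation = re.sub(r'\W', ' ', sample)
# Lowercase the result so 'FREE' and 'free' count as the same word
cleaned = no_punctuation.lower()

print(cleaned.split())
```

Note that the apostrophe is also replaced, so "You've" ends up as the two tokens `you` and `ve`.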

Next, we will create a list called `vocabulary` that contains all the unique words that occur in these messages. We will first split the strings in `SMS` into lists. Then, we will add all the individual words to the list `vocabulary`. Finally, we will convert the list to a set and back to a list to remove duplicate words.

In [144]:
# Transform each row in `SMS` into a list
train['SMS'] = train["SMS"].str.split()

# Initialize list for vocabulary
vocabulary = []

# Iterate over `SMS` and add each word to `vocabulary`
for msg in train['SMS']:
    for word in msg:
        vocabulary.append(word)

# Keep only unique words
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)
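
The same vocabulary can be built more compactly with a set comprehension, which removes duplicates as the words are collected. A sketch on toy data, where `toy_messages` is a hypothetical stand-in for `train['SMS']` after splitting:

```python
toy_messages = [
    ['free', 'prize', 'free'],
    ['see', 'you', 'soon'],
    ['free', 'lunch', 'soon'],
]

# A set keeps only one copy of each word
toy_vocabulary = list({word for msg in toy_messages for word in msg})

# Sets are unordered, so sort before printing to get a stable result
print(sorted(toy_vocabulary))
```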


Next, we want to create a dictionary we can use to create a new dataframe with the columns representing all words in `vocabulary` and the rows indicating the number of times each word occurs in an individual message. 

* First, we will create a dictionary `word_counts_per_sms` where each key is a unique word in `vocabulary` and each value is a list of zeros whose length equals the number of messages in the training set, `len(train['SMS'])`. 
* Next, we will loop over `train['SMS']` using the `enumerate()` function to get both the index and the SMS message. 
    * Using a nested loop, we loop over `sms` (where `sms` is a list of strings, and each string represents a word in a message).
        * Inside that loop, we increment `word_counts_per_sms[word][index]` by 1.

In [145]:
# Define dictionary word_counts_per_sms: each key is a word from vocabulary and each value is a list of 0s of length len(train['SMS'])
word_counts_per_sms = {
    unique_word: [0] * len(train["SMS"]) for unique_word in vocabulary
}

# Iterate over train["SMS"] using enumerate() to add counts to each key in word_counts_per_sms
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
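
The same counting logic, run on two made-up messages, shows what the resulting dictionary looks like. The names `toy_messages`, `toy_vocabulary`, and `toy_counts` are hypothetical stand-ins for `train['SMS']`, `vocabulary`, and `word_counts_per_sms`:

```python
toy_messages = [
    ['free', 'prize', 'free'],
    ['see', 'you', 'soon'],
]
toy_vocabulary = sorted({word for msg in toy_messages for word in msg})

# One list of zeros per word, with one slot per message
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Position `index` in each list holds the count for message number `index`
for index, msg in enumerate(toy_messages):
    for word in msg:
        toy_counts[word][index] += 1

print(toy_counts)
```

Because `[0] * len(toy_messages)` is evaluated once per key inside the comprehension, every word gets its own list, and incrementing one word's count does not affect the others.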



In [None]:
# Transform word_counts_per_sms to dataframe
word_counts = pd.DataFrame(word_counts_per_sms)

# Concatenate word_counts with train
train_updated = pd.concat([train, word_counts], axis=1)
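
A small sketch of these last two steps on toy data may help show the shape of the result. The frames `toy_train`, `toy_word_counts`, and `toy_updated` are hypothetical and only mirror the structure of `train`, `word_counts`, and `train_updated`:

```python
import pandas as pd

toy_train = pd.DataFrame({
    'Label': ['spam', 'ham'],
    'SMS': [['free', 'prize', 'free'], ['see', 'you', 'soon']],
})
toy_counts = {
    'free': [2, 0],
    'prize': [1, 0],
    'see': [0, 1],
    'soon': [0, 1],
    'you': [0, 1],
}

# Each dictionary key becomes a column; each list position becomes a row
toy_word_counts = pd.DataFrame(toy_counts)

# axis=1 places the word columns next to the original columns,
# matching rows by their index labels
toy_updated = pd.concat([toy_train, toy_word_counts], axis=1)
print(toy_updated)
```

Since `pd.concat` with `axis=1` aligns rows by index label, both frames need the same 0 to n-1 index. That is why the index of `train` is reset after shuffling.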