# Spam or Not Spam

Spam filtering is now built into virtually every email and SMS client. The UCI Machine Learning Repository provides a dataset of SMS messages that have already been labeled as spam or not spam.

The goal of this project is to apply Multinomial Naive Bayes classification to build a spam filter for this dataset of SMS messages. The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [1]:
import pandas as pd

#read in file
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])

print(sms.shape)
sms.head()

(5572, 2)


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
#Count spam vs ham
sms['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In this dataset, 87% of the messages are ham (not spam), while 13% are spam.

## Create a Training Set and a Test Set

In [3]:
#create a randomized dataset
sms_randomized = sms.sample(frac = 1, random_state = 1)

#index our dataset so we can split it into training and testing
training_index = round(len(sms_randomized)*0.8)

#create the sets
train_set = sms_randomized[:training_index].reset_index(drop=True)
test_set = sms_randomized[training_index:].reset_index(drop=True)

In [4]:
#check if we randomized and split into 2 sets correctly
print(train_set['Label'].value_counts(normalize=True))
print(test_set['Label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


We shuffled the SMS dataset and placed 80% of the messages in a training set and the remaining 20% in a test set. The print-out shows that the split preserved the class balance: the proportions of spam and ham messages in the training and test sets are nearly identical to those in the full dataset.
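
As a quick sanity check on the split sizes, the arithmetic behind `training_index` can be worked through by hand. This is a minimal sketch that only assumes the 5,572 rows reported by `sms.shape` above; the names `n_messages`, `n_train`, and `n_test` are introduced here for illustration.

```python
# Number of rows reported by sms.shape earlier in the notebook
n_messages = 5572

# Same rule as the notebook: the first 80% of the shuffled rows go to training
training_index = round(n_messages * 0.8)

n_train = training_index               # rows [0, training_index)
n_test = n_messages - training_index   # all remaining rows

print(n_train, n_test)  # 4458 1114
```

So the training set should hold 4,458 messages and the test set 1,114.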

## Clean Data for Naive Bayes Classification

Remove all punctuation from the messages and convert them to lower case.

In [5]:
#training set before cleaning
train_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [6]:
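#replace every non-word character (punctuation) in the SMS column with a space
#regex=True is needed because newer pandas versions treat the pattern as a literal string by default
train_set['SMS'] = train_set['SMS'].str.replace(r'\W', ' ', regex=True)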

#transform messages to lower case
train_set['SMS'] = train_set['SMS'].str.lower()
train_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
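
To see what the cleaning steps do to a single message, here is a small sketch that applies the same pattern with Python's `re` module to a plain string instead of a pandas column. The helper name `clean_message` and the sample text are made up for illustration. Note that `\W` matches anything that is not a letter, digit, or underscore, so Unicode letters such as `ü` are kept.

```python
import re

def clean_message(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    message = re.sub(r'\W', ' ', message)
    return message.lower().split()

# The apostrophe in "There's" becomes a space, leaving a stray "s" token
print(clean_message("Yes, princess. There's a card!"))
# ['yes', 'princess', 'there', 's', 'a', 'card']
```

The stray `s` token matches what appears in the cleaned training set above ("there s a card").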


### Making a Vocabulary

To classify the SMS messages, we need to analyze every word in the body of each message. To do that, we'll build a list of the unique words that appear in the training set: a vocabulary.

In [7]:
#turn each sms message into a list of individual words
train_set['SMS'] = train_set['SMS'].str.split()
train_set.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [8]:
#loop over SMS messages to find each unique word
vocabulary = []

for row in train_set['SMS']:
    for word in row:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))
print('There are {} unique words in our training set of SMS messages.'.format(len(vocabulary)))

There are 7783 unique words in our training set of SMS messages.
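
The same idea can be written more compactly with a set comprehension. The sketch below uses two made-up tokenized messages (`toy_messages`) rather than the real training set, and sorts the result only so that the printed order is reproducible, since a Python set has no guaranteed order.

```python
# Hypothetical tokenized messages, in the same shape as train_set['SMS'] after .str.split()
toy_messages = [
    ['yep', 'by', 'the', 'pretty', 'sculpture'],
    ['the', 'pretty', 'card'],
]

# Collect every word once; duplicates such as 'the' and 'pretty' collapse automatically
toy_vocabulary = sorted({word for message in toy_messages for word in message})

print(toy_vocabulary)
# ['by', 'card', 'pretty', 'sculpture', 'the', 'yep']
print(len(toy_vocabulary))  # 6
```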


### Make Final Transformation of Training Set

To run the Naive Bayes algorithm, which we will show later on, we need a count of how many times each vocabulary word appears in each SMS message.

In [9]:
#create a dictionary where each key is a unique word and each value is a list of zeros, one per SMS in the training set
word_counts_per_sms = {unique_word: [0] * len(train_set['SMS'])
                       for unique_word in vocabulary}

#loop through each SMS message, noting the message number and message body
#(the loop variable is named message so it does not overwrite the sms DataFrame)
for index, message in enumerate(train_set['SMS']):

    #loop through each word in each SMS message
    for word in message:
        #add one to this word's count for the current message
        word_counts_per_sms[word][index] += 1

In [10]:
#convert dictionary to DataFrame
word_counts = pd.DataFrame(word_counts_per_sms)

#display first few rows
word_counts.head()

Unnamed: 0,story,murder,lodging,2nights,said,ago,lionp,desert,fyi,quizzes,...,caught,ga,mathews,trebles,9t,loose,rcb,tsunamis,radio,40gb
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
#Combine our training set with word counts
train_set_clean = pd.concat([train_set, word_counts], axis=1)
train_set_clean.head()

Unnamed: 0,Label,SMS,story,murder,lodging,2nights,said,ago,lionp,desert,...,caught,ga,mathews,trebles,9t,loose,rcb,tsunamis,radio,40gb
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


After cleaning, we now have a training dataset where each row includes these three things:

1) The label - spam or ham (not spam)<br>
2) The SMS message - as a list of lowercase words with no punctuation<br>
3) A count of how many times each vocabulary word appears in the SMS message
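
The word-count columns described above can be illustrated on a tiny example. This is a sketch using `collections.Counter` from the standard library instead of the dictionary-of-lists approach in the notebook; `toy_messages`, `toy_vocabulary`, and `toy_rows` are hypothetical names and data.

```python
from collections import Counter

# Two made-up tokenized messages
toy_messages = [
    ['win', 'cash', 'win'],
    ['see', 'you', 'soon'],
]

# Fixed column order, one column per vocabulary word
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One row per message: how often each vocabulary word occurs in it
toy_rows = []
for message in toy_messages:
    counts = Counter(message)  # words missing from the message count as 0
    toy_rows.append([counts[word] for word in toy_vocabulary])

print(toy_vocabulary)  # ['cash', 'see', 'soon', 'win', 'you']
print(toy_rows)        # [[1, 0, 0, 2, 0], [0, 1, 1, 0, 1]]
```

Each row sums to the number of words in its message, which is the property the constants in the next section rely on.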

## Preparing for Naive Bayes Algorithm

With a cleaned training set, we can now prepare the variables we need for the Naive Bayes algorithm. These variables feed the two equations at the core of the algorithm:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\\
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

First, we will use our cleaned training set to find the values below. Because they stay the same for every message we classify, computing them once saves time later.

- P(Spam) and P(Ham)
- N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub>
- Set Laplace Smoothing: $\alpha = 1$
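
To make the smoothing formula concrete, here is a worked example with made-up numbers. The helper `smoothed_probability` and the toy counts are assumptions for illustration, not values from the dataset. It shows that a word never seen in a class still gets a small non-zero probability, and that the smoothed probabilities still sum to 1 across the vocabulary.

```python
import math

def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class), following the equations above."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical class: 10 words in total, spread over a 5-word vocabulary
toy_counts = {'win': 2, 'cash': 3, 'free': 5, 'dinner': 0, 'tonight': 0}
n_class = sum(toy_counts.values())   # 10
n_vocabulary = len(toy_counts)       # 5

toy_parameters = {
    word: smoothed_probability(count, n_class, n_vocabulary)
    for word, count in toy_counts.items()
}

print(toy_parameters['win'])     # (2 + 1) / (10 + 5 * 1) = 3/15
print(toy_parameters['dinner'])  # (0 + 1) / (10 + 5 * 1) = 1/15, small but not zero

# The smoothed estimates form a valid probability distribution
print(math.isclose(sum(toy_parameters.values()), 1.0))
```

Without smoothing, `dinner` would have probability 0 and would wipe out the whole product for any message containing it.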

### Calculating Constants

In [None]:
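#Calculate number of words in all spam messages
#(count the words in the SMS column directly; summing whole rows fails on the text columns in newer pandas)
n_spam = train_set_clean.loc[train_set_clean['Label'] == 'spam', 'SMS'].apply(len).sum()

#Calculate number of words in all ham messages
n_ham = train_set_clean.loc[train_set_clean['Label'] == 'ham', 'SMS'].apply(len).sum()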

#Calculate number of unique words in vocabulary
n_vocab = len(vocabulary)

#Laplace smoothing
alpha = 1

### Calculating Parameters

Now that we have the constants, we can go on and solve our two equations for parameters $P(w_i|Spam)$ and $P(w_i|Ham)$.

In [None]:
#create 2 dictionaries to store probability of a word given spam or ham sms
parameters_spam = {word: 0 for word in vocabulary} 
parameters_ham = {word: 0 for word in vocabulary} 

#isolate spam and ham messages as 2 separate dataframes
spam_sms = train_set_clean[train_set_clean['Label'] == 'spam']
ham_sms = train_set_clean[train_set_clean['Label'] == 'ham']

#loop through vocabulary and calculate probabilities of P(word|spam) and P(word|not spam) 
for word in vocabulary:
    
    #calculate total appearances of a word for all spam messages
    n_word_given_spam = spam_sms[word].sum()
    
    #calculate probabilities via naive bayes equations
    spam_numerator = n_word_given_spam + alpha
    spam_denominator = n_spam + alpha * n_vocab
    
    #update dictionary
    parameters_spam[word] = spam_numerator / spam_denominator

    #repeat process for ham messages
    n_word_given_ham = ham_sms[word].sum()
    ham_numerator = n_word_given_ham + alpha
    ham_denominator = n_ham + alpha * n_vocab
    parameters_ham[word] = ham_numerator / ham_denominator

### Calculating $P(Spam)$ and $P(Ham)$

We will need these two prior probabilities later on.

In [None]:
#calculate probabilities of sms being spam or not spam
p_spam = train_set_clean['Label'].value_counts(normalize=True)['spam']
p_ham = train_set_clean['Label'].value_counts(normalize=True)['ham']

## Building the Spam Filter

To build the spam filter with the Naive Bayes algorithm, we need to evaluate two probability expressions for each message: one for spam and one for ham (not spam). The message gets the label with the larger value.

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\\
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}
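
One practical caveat with the products above: multiplying many probabilities smaller than 1 produces very small numbers, and for long messages the result can underflow to 0.0 in floating point. A common workaround, which is not what this notebook's classifier does, is to compare sums of logarithms instead. The sketch below uses made-up priors and parameters (`toy_p_spam`, `toy_parameters_spam`, and so on) and a hypothetical function name `classify_log`.

```python
import math
import re

# Hypothetical priors and word probabilities, for illustration only
toy_p_spam, toy_p_ham = 0.5, 0.5
toy_parameters_spam = {'win': 0.4, 'cash': 0.4, 'dinner': 0.1, 'tonight': 0.1}
toy_parameters_ham = {'win': 0.1, 'cash': 0.1, 'dinner': 0.4, 'tonight': 0.4}

def classify_log(message):
    """Compare log P(Spam) + sum(log P(w|Spam)) against the same sum for ham."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    for word in words:
        # Words outside the vocabulary are skipped, as in the product form
        if word in toy_parameters_spam:
            log_spam += math.log(toy_parameters_spam[word])
        if word in toy_parameters_ham:
            log_ham += math.log(toy_parameters_ham[word])

    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('Win cash!'))        # spam
print(classify_log('Dinner tonight?'))  # ham
```

Because the logarithm is monotonically increasing, comparing the log sums gives the same ordering as comparing the original products whenever those products can be represented without underflow.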

In [None]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            #use parameters_spam dictionary
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            #use parameters_ham dictionary
            p_ham_given_message *= parameters_ham[word]
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [None]:
#test a few sample messages

message_1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
message_2 = 'Hey, what time is dinner tonight?'

classify(message_1)

In [None]:
classify(message_2)

## Spam Filter Accuracy

Based on the two test messages above, it seems like we built a good spam filter with the Naive Bayes algorithm. Below, let's apply our spam filter to the test set we set aside earlier.

In [None]:
#same steps as classify(), but the label is returned instead of printed
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [None]:
#apply our function to the test set's SMS messages
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

In [None]:
#compare our results
correct = 0
total = len(test_set)

#iterrows() yields (index, row) pairs; we only need the row
for _, row in test_set.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1

#calculate accuracy
accuracy = (correct/total)*100

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Our spam filter makes predictions at {:5.3f}% accuracy.'.format(accuracy))
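
Because roughly 87% of the messages are ham, accuracy alone can look high even for a weak filter. As a complement, here is a minimal sketch of precision and recall for the spam class, computed from plain Python lists. The function name `spam_metrics` and the toy labels are hypothetical; on the real data the inputs would be the `Label` and `predicted` columns of the test set.

```python
def spam_metrics(actual, predicted):
    """Return accuracy, plus precision and recall for the 'spam' class."""
    pairs = list(zip(actual, predicted))

    true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    false_pos = sum(a == 'ham' and p == 'spam' for a, p in pairs)
    # Spam that was not labeled spam (including 'needs human classification')
    false_neg = sum(a == 'spam' and p != 'spam' for a, p in pairs)
    correct = sum(a == p for a, p in pairs)

    return {
        'accuracy': correct / len(pairs),
        # Of the messages flagged as spam, how many really were spam?
        'precision': true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the real spam messages, how many did the filter catch?
        'recall': true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }

# Made-up labels for illustration
toy_actual = ['ham', 'spam', 'ham', 'spam', 'ham', 'ham']
toy_predicted = ['ham', 'spam', 'ham', 'ham', 'spam', 'ham']

toy_metrics = spam_metrics(toy_actual, toy_predicted)
print(toy_metrics)
```

In this toy case four of six labels match, while only one of the two spam messages is caught and only one of the two spam predictions is right.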

In [None]:
#Incorrectly classified SMS messages
test_set[test_set['predicted'] != test_set['Label']]

## Conclusion

We were able to build an SMS spam filter that classifies messages as spam or not spam with 98.743% accuracy. Out of 1,114 messages in our test set, only 14 were incorrectly classified.

## Next Steps

As a next step, it would be worthwhile to explore why these SMS messages were misclassified by our spam filter. Because the filter converts every message to lower case before classifying it, there is also an opportunity to make it more robust and accurate by taking letter case (upper and lower) into account.