# Classifying texts as spam or not spam

The dataset contains 5572 text messages that were previously classified as spam or not spam. It was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) and contains SMS messages from four sources:

* Grumbletext: 425 spam messages
* Random ham messages from NUS SMS Corpus: 3375 non-spam messages
* Caroline Tag's PhD Thesis: 450 non-spam messages
* SMS Spam Corpus v.0.1 Big: 1002 non-spam, 322 spam messages

The purpose of this notebook is to build a spam filter using the Naive Bayes algorithm.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('SMSSpamCollection', 
            sep='\t', header=None, names=['Label','SMS'])

In [3]:
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [5]:
counted = df['Label'].value_counts()
perc_nspam = round(counted['ham']/(df['Label'].count()),2)

print('Percent of non-spam messages: {0}'.format(perc_nspam*100))
print('Percent of spam messages: {0}'.format((1-perc_nspam)*100))

Percent of non-spam messages: 87.0
Percent of spam messages: 13.0


Approximately 87% of the messages in the dataset are non-spam and 13% are spam.
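
The same class shares can also be read off directly with `value_counts(normalize=True)`. The snippet below is a minimal sketch on a toy `labels` Series (hypothetical data standing in for `df['Label']`), not the notebook's own cell.

```python
import pandas as pd

# Toy labels standing in for df['Label'] (hypothetical data, not the real dataset)
labels = pd.Series(['ham', 'ham', 'ham', 'spam'], name='Label')

# normalize=True returns fractions of the total instead of raw counts
shares = labels.value_counts(normalize=True)

# Look the classes up by label rather than by position
ham_pct = round(shares['ham'] * 100, 1)
spam_pct = round(shares['spam'] * 100, 1)

print('Percent of non-spam messages: {0}'.format(ham_pct))
print('Percent of spam messages: {0}'.format(spam_pct))
```

Indexing by label (`shares['ham']`) avoids depending on which class happens to be listed first.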

The dataset is divided into training (80%) and test (20%) sets. Our goal is to correctly classify at least 80% of the test-set messages as spam or non-spam.

In [6]:
df_s = df.sample(frac=1, # randomize entire dataset
          random_state=1, # seed
         )

In [7]:
# calculate what 80% of the number of rows is
0.8*df_s.shape[0]

4457.6

In [8]:
# the first 4459 rows (about 80%) of the randomized dataset are for training
train_df = df_s.iloc[:4459,:].reset_index(drop = True)
# the rest is for testing
test_df = df_s.iloc[4459:,:].reset_index(drop=True)

In [9]:
counted = train_df['Label'].value_counts()
perc_nspam = round(counted['ham']/(train_df['Label'].count()),2)

print('Percent of non-spam messages: {0}'.format(perc_nspam*100))
print('Percent of spam messages: {0}'.format((1-perc_nspam)*100))

Percent of non-spam messages: 87.0
Percent of spam messages: 13.0


In [10]:
counted = test_df['Label'].value_counts()
perc_nspam = round(counted['ham']/(test_df['Label'].count()),2)

print('Percent of non-spam messages: {0}'.format(perc_nspam*100))
print('Percent of spam messages: {0}'.format((1-perc_nspam)*100))

Percent of non-spam messages: 87.0
Percent of spam messages: 13.0


Both splits retain the same percentages of spam and non-spam messages as the full dataset.

# Data Cleaning

In [11]:
train_df.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [12]:
# Replace every character that is not a letter, digit, or underscore with a space
train_df['SMS'] = train_df['SMS'].str.replace(r'\W', ' ', regex=True)
# Lower case each letter
train_df['SMS'] = train_df['SMS'].str.lower()

In [13]:
train_df.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [14]:
# Turn each message into a list of words
train_df['SMS'] = train_df['SMS'].str.split()

In [15]:
vocabulary = []

# extract each word into a new list
for cell in train_df['SMS']:
    for word in cell:
        vocabulary.append(word)

In [16]:
# turn the list into a set, to get rid of duplicates, 
# and then back into a list

print(len(vocabulary))
vocabulary = list(set(vocabulary))
print(len(vocabulary))

72436
7785


In [17]:
word_counts_per_sms = {unique_word: [0] * len(train_df['SMS']) for unique_word in vocabulary}

This creates a dictionary whose keys are the unique words in the training set. Each value is a list of zeros with one entry per training message (4,459 in total).

In [18]:
for index, cell in enumerate(train_df['SMS']):
    for word in cell:
        # word is one word of the current message
        # index is the row position of that message in train_df
        word_counts_per_sms[word][index] += 1

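`enumerate(sequence, start=0)` returns an enumerate object that yields `(index, item)` pairs, for example `(0, first_message), (1, second_message), (2, third_message)`. Here each item is one message's list of words.

For each word in a message, look up the word's key and add 1 at that message's index.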
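
To make the counting step concrete, here is a small sketch of the same idea on three made-up, already tokenized messages. The names `messages`, `vocab`, and `counts` are illustrative only and do not appear in the notebook.

```python
# Three toy messages, already lower-cased and split into words (made-up data)
messages = [['free', 'win', 'free'], ['see', 'you'], ['win', 'you']]

# Unique words across all messages
vocab = sorted({word for msg in messages for word in msg})

# One list of zeros per word, with one slot per message
counts = {word: [0] * len(messages) for word in vocab}

# index is the message position, msg is that message's list of words
for index, msg in enumerate(messages):
    for word in msg:
        counts[word][index] += 1

print(counts['free'])  # occurrences of 'free' in each of the three messages
```

Each list in `counts` becomes one column once the dictionary is passed to `pd.DataFrame`, so every row corresponds to a message and every column to a word.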

In [19]:
words_df  = pd.DataFrame(word_counts_per_sms) # turn dictionary into a df.

The new `words_df` is concatenated with the training set so that the `Label` and `SMS` columns are included.

In [20]:
new_train_df = pd.concat([train_df, words_df],axis=1)

# Spam Filter

## Defining Constants

To set up the spam filter, the key parameters for Naive Bayes need to be defined.

__Naive Bayes__:

We need to calculate the probabilities for both spam and non-spam. The following equations are used:

\begin{equation}
\begin{aligned}
P(Spam | w_1,w_2, ..., w_n) &\propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam) \\
P(Ham | w_1,w_2, ..., w_n) &\propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{aligned}
\end{equation}

1. The probability that a message is `spam` given that it contains the words `w_1 ... w_n` (drawn from our `vocabulary`) is proportional to the probability that any message is spam times the product, over every word `w_i` in the message, of the probability that `w_i` occurs given that the message is spam.
2. The same is true of `non-spam` calculations
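
As a worked illustration of the two proportionality statements, the sketch below scores a two-word message. All priors and per-word probabilities here are made-up numbers chosen for the example, not values estimated from the dataset.

```python
import math

# Hypothetical priors and per-word conditional probabilities (made-up numbers)
prior_spam, prior_ham = 0.13, 0.87
p_word_spam = {'free': 0.02, 'prize': 0.01}
p_word_ham = {'free': 0.002, 'prize': 0.0005}

message = ['free', 'prize']

# Prior times the product of the per-word probabilities, once per class
score_spam = prior_spam * math.prod(p_word_spam[w] for w in message)
score_ham = prior_ham * math.prod(p_word_ham[w] for w in message)

# The larger score wins; the scores are proportional to the posteriors,
# not probabilities themselves, so they need not sum to 1
label = 'spam' if score_spam > score_ham else 'ham'
print(score_spam, score_ham, label)
```

Because both scores share the same omitted normalizing constant, comparing them is enough to pick a label.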

To calculate the probability that `w_i` occurs in a message given that the message is spam, we can use the following equations:

\begin{equation}
\begin{aligned}
P(w_i|Spam) &= \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) &= \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
\end{equation}

1. The probability that `w_i` occurs in a message given that the message is `spam` equals the number of times `w_i` occurs in all spam messages plus alpha, divided by the total number of words in all spam messages plus alpha times the number of words in our vocabulary.
2. The same is true of `non-spam` calculations.
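
The smoothed estimate can be written as a small helper. This is a sketch only: the function name `smoothed_probability` and the counts passed to it are hypothetical and are not part of the notebook.

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Made-up counts: 100 words in the class, vocabulary of 50 unique words
p_seen = smoothed_probability(3, 100, 50)    # (3 + 1) / (100 + 50)
p_unseen = smoothed_probability(0, 100, 50)  # (0 + 1) / (100 + 50)

print(p_seen, p_unseen)
```

With `alpha = 1`, a word that never appears in a class still gets a small non-zero probability, so a single unseen word cannot drive the whole product to zero.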

__Variable key:__

* `spam_df`: a filtered version of `train_df` containing only spam messages
* `non_spam_df`: a filtered version of `train_df` containing only non-spam messages
* `p_spam`: probability that any message is labeled spam
* `p_non_spam`: probability that any message is labeled non_spam
* `alpha`: Laplace smoothing parameter
* `n_vocab`: number of words in our `vocabulary` list
* `n_spam`: total (not unique) number of words in all spam messages
* `n_non_spam`: total (not unique) number of words in all non_spam messages
* `n_w_spam_dic`: number of times each word occurs in all spam messages
* `n_w_non_spam_dic`: number of times each word occurs in all non_spam messages
* `p_w_given_spam`: the probability that a word is in a message, given the message is labeled spam
* `p_w_given_non_spam`: the probability that a word is in a message, given the message is labeled non-spam
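
Several of these constants can also be computed without an explicit row loop. The sketch below assumes a toy frame `toy` (made-up data) with the same `Label` and tokenized `SMS` columns as `train_df`; it is an alternative formulation, not the approach the notebook takes.

```python
import pandas as pd

# Toy stand-in for train_df after tokenizing (made-up data)
toy = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham'],
    'SMS': [['see', 'you', 'soon'], ['win', 'cash', 'now', 'free'], ['ok']],
})

# .str.len() also works on list-valued cells, giving the word count per message
words_per_message = toy['SMS'].str.len()

is_spam = toy['Label'] == 'spam'
n_spam = int(words_per_message[is_spam].sum())       # total words in spam messages
n_non_spam = int(words_per_message[~is_spam].sum())  # total words in non-spam messages

# The mean of a boolean Series is the fraction of True values
p_spam = is_spam.mean()
p_non_spam = 1 - p_spam

print(n_spam, n_non_spam, p_spam, p_non_spam)
```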

In [21]:
new_train_df.head()

Unnamed: 0,Label,SMS,dogging,professional,witout,the,trainners,mrt,end,read,...,bein,09061790121,near,waste,zouk,warranty,father,maximize,worrying,goigng
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
spam_df = new_train_df[new_train_df['Label'] == 'spam']
non_spam_df = new_train_df[new_train_df['Label'] == 'ham']

In [23]:
# The probability that any message in our train_df is spam
p_spam = (len(spam_df))/len(new_train_df)
# The probability that any message in our train_df is non-spam
p_non_spam = (len(non_spam_df))/len(new_train_df)

In [24]:
# the total (not unique) number of words in spam messages
n_spam = 0
# the total (not unique) number of words in non_spam messages
n_non_spam = 0

for row in range(len(train_df)):
    if train_df.loc[row,'Label'] == 'ham':
        n_non_spam += len(train_df.loc[row,'SMS'])
    if train_df.loc[row,'Label'] == 'spam':
        n_spam += len(train_df.loc[row,'SMS'])

# the total number of unique words in all messages
n_vocab = len(vocabulary)

# alpha for Laplace smoothing
alpha = 1

## Creation of Algorithm

### Probability of words given label

In [25]:
# a dictionary of the number of times each word
# occurs in spam and non-spam messages
n_w_spam_dic = {}
n_w_non_spam_dic = {}

for word in vocabulary:
    # initialize for number of occurrences list
    n_w_spam_dic[word] = 0
    n_w_non_spam_dic[word] = 0

In [26]:
def n_word_given_key(word, key='spam'):
    '''
    Counts how many times the word occurs across all messages in the
    spam (key='spam') or non-spam (any other key) training subset.
    '''
    if key == 'spam':
        series = spam_df['SMS']
    else:
        series = non_spam_df['SMS']
        
    n_word = 0
    for cell in series:
        for msg_word in cell:
            if msg_word == word:
                n_word += 1
    return n_word

In [27]:
# For each word in the vocabulary
for word in vocabulary:
    # Find the number of times it occurs in spam and in non-spam messages
    n_w_spam_dic[word] = n_word_given_key(word,'spam')
    n_w_non_spam_dic[word] = n_word_given_key(word,'non-spam')

Now that we have the number of times each word occurs in spam and non-spam messages, we can calculate the probability that each word is in a message given that the message is either spam or non-spam:

\begin{equation}
\begin{aligned}
P(w_i|Spam) &= \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) &= \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
\end{equation}



In [28]:
p_w_given_spam = {}
p_w_given_non_spam = {}

# The probability that each word occurs in a message given the message is spam or non-spam
for word in vocabulary:
    p_w_given_spam[word] = (n_w_spam_dic[word] + alpha) / (n_spam + (alpha * n_vocab))
    p_w_given_non_spam[word] = (n_w_non_spam_dic[word] + alpha) / (n_non_spam + (alpha * n_vocab))
    
    

### Probability of label given message

We have all the parameters to calculate the probability of some message being spam (or non-spam) given the message's contents.

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}
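
Multiplying many small probabilities can underflow to zero for long messages. A common variant, which is not what this notebook does, compares sums of logarithms instead. The sketch below uses a hypothetical helper `log_score` and made-up probabilities.

```python
import math

def log_score(words, prior, word_probs):
    """Log of prior times the product of per-word probabilities.

    Words missing from word_probs are skipped, as in the product form.
    """
    total = math.log(prior)
    for word in words:
        if word in word_probs:
            total += math.log(word_probs[word])
    return total

# Hypothetical per-word probabilities (made-up numbers)
p_word_spam = {'free': 0.02, 'prize': 0.01}
p_word_ham = {'free': 0.002, 'prize': 0.0005}
message = ['free', 'prize', 'unknownword']

log_spam = log_score(message, 0.13, p_word_spam)
log_ham = log_score(message, 0.87, p_word_ham)
label = 'spam' if log_spam > log_ham else 'ham'

# Why logs help: a plain product of 200 factors of 0.001 underflows to 0.0,
# while the equivalent sum of logs stays a finite number
underflowed = 0.001 ** 200
log_equivalent = 200 * math.log(0.001)
print(label, underflowed, log_equivalent)
```

Because the logarithm is strictly increasing, comparing log scores selects the same label as comparing the raw products whenever the products do not underflow.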



In [29]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_non_spam_given_message = p_non_spam
    
    for word in message:
        if word in p_w_given_spam:
            p_spam_given_message *= p_w_given_spam[word]
        if word in p_w_given_non_spam:
            p_non_spam_given_message *= p_w_given_non_spam[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Non-spam|message):', p_non_spam_given_message)

    if p_non_spam_given_message > p_spam_given_message:
        print('Label: Not spam')
    elif p_non_spam_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [30]:
my_test_spam = 'WINNER!! This is the secret code to unlock the money: C3421.'
my_test_non_spam = "Sounds good, Tom, then see u there"

In [31]:
classify(my_test_spam)

P(Spam|message): 1.3467710812047665e-25
P(Non-spam|message): 1.9339258494902383e-27
Label: Spam


In [32]:
classify(my_test_non_spam)

P(Spam|message): 2.43520654878629e-25
P(Non-spam|message): 3.6832948885668876e-21
Label: Not spam


# Run Algorithm on test_df

In [33]:
import re

def classify_test(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_non_spam_given_message = p_non_spam
    
    for word in message:
        if word in p_w_given_spam:
            p_spam_given_message *= p_w_given_spam[word]
        if word in p_w_given_non_spam:
            p_non_spam_given_message *= p_w_given_non_spam[word]

    if p_non_spam_given_message > p_spam_given_message:
        return 'ham'
    elif p_non_spam_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [34]:
test_df['predicted'] = test_df['SMS'].apply(classify_test)

In [35]:
test_df.head()

Unnamed: 0,Label,SMS,predicted
0,ham,But i haf enuff space got like 4 mb...,ham
1,spam,Had your mobile 10 mths? Update to latest Oran...,spam
2,ham,All sounds good. Fingers . Makes it difficult ...,ham
3,ham,"All done, all handed in. Don't know if mega sh...",ham
4,ham,But my family not responding for anything. Now...,ham


In [37]:
test_df['eval'] = ''

In [39]:
# Compare labels and predictions for every row at once; assigning to
# rows yielded by iterrows() is not guaranteed to update test_df
is_correct = test_df['Label'] == test_df['predicted']
test_df['eval'] = is_correct.map({True: 'correct', False: 'incorrect'})

correct = is_correct.sum()
total = len(test_df)
accuracy = round((correct/total)*100, 2)
print('Our algorithm correctly classified our messages {0}% of the time'.format(accuracy))

Our algorithm correctly classified our messages 98.74% of the time


# Disclaimer

Our first try at the algorithm had an accuracy rate of 12.49%. This was because `classify_test` returned `"Not Spam"` instead of `"ham"`. Because of this error, the `predicted` column could never match the `Label` column for non-spam messages, so only correctly predicted spam messages counted as correct.
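
A quick check for this kind of mismatch is to compare the set of predicted values against the set of true labels. The sketch below uses a toy frame (made-up data) that mimics the error described above.

```python
import pandas as pd

# Toy frame mimicking the mismatch: predictions use 'Not Spam', labels use 'ham'
toy = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham'],
    'predicted': ['Not Spam', 'spam', 'Not Spam'],
})

# Predicted values that never occur among the true labels
unexpected = set(toy['predicted']) - set(toy['Label'])
print(unexpected)
```

On the real `test_df`, the fallback value `'needs human classification'` would also show up in such a set, which is expected since it is not a true label.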

# Message from Dataquest

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.

If you want to keep working on this project, here are a few next steps you can take:

* Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusions.
* Make the filtering process more complex by making the algorithm sensitive to letter case.
* Get the project portfolio-ready by using a few tips from our style guide for data science projects.
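
One possible starting point for the first suggestion is a confusion table built with `pd.crosstab`, followed by a filter on mismatched rows. The frame below is made-up toy data with the same `Label` and `predicted` columns as `test_df`.

```python
import pandas as pd

# Toy stand-in for test_df (made-up data)
toy = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy['Label'], toy['predicted'])
print(confusion)

# The individual messages that were classified incorrectly
misclassified = toy[toy['Label'] != toy['predicted']]
print(misclassified)
```

The off-diagonal cells separate spam that slipped through from non-spam that was flagged, which are usually different kinds of mistakes.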

Congratulations, this is the end of the Conditional Probability course! We've come a long way and learned how to:

* Assign probabilities to events based on certain conditions by using conditional probability rules.
* Assign probabilities to events based on whether or not they are statistically independent of other events.
* Assign probabilities to events based on prior knowledge by using Bayes' theorem.
* Create a spam filter for SMS messages using the multinomial Naive Bayes algorithm.


In [41]:
test_df[test_df['eval'] == 'incorrect']

Unnamed: 0,Label,SMS,predicted,eval
113,spam,Not heard from U4 a while. Call me now am here...,ham,incorrect
134,spam,More people are dogging in your area now. Call...,ham,incorrect
151,ham,Unlimited texts. Limited minutes.,spam,incorrect
158,ham,26th OF JULY,spam,incorrect
283,ham,Nokia phone is lovly..,spam,incorrect
292,ham,A Boy loved a gal. He propsd bt she didnt mind...,needs human classification,incorrect
301,ham,No calls..messages..missed calls,spam,incorrect
318,ham,We have sent JD for Customer Service cum Accou...,spam,incorrect
503,spam,Oh my god! I've found your number again! I'm s...,ham,incorrect
545,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham,incorrect
