# The War on Spam Text Messages
We all ___hate___ spam texts. Around election time, the number of spam texts seems to skyrocket. Although this filter might not perform perfectly against political campaign texts, a few tweaks and additional data could potentially help filter those out too.  
Using a Naive Bayes algorithm, we are going to build a spam filter for SMS text messages.  
The data set includes a human-assigned classification for each SMS message. These labels will be used for training and for checking the accuracy of the filter. (The classification 'ham' indicates a non-spam message.)
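
For reference, a Naive Bayes filter of this kind typically compares two scores per message. The following is a sketch of the standard multinomial form with additive smoothing. The symbols $N_{w_i \mid Spam}$ (how often word $w_i$ occurs in spam messages), $N_{Spam}$ (total number of words in spam messages) and $N_{Vocabulary}$ (number of distinct words) are notation assumed here, and $\alpha$ is the smoothing parameter.

$$P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

The same two expressions apply to ham with $Spam$ replaced by $Ham$. A message is labeled spam when its spam score is larger than its ham score.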

In [1]:
import pandas as pd
import re

msgs = pd.read_csv('SMSSpamCollection.txt', sep='\t', header=None, names=['label', 'SMS'])

In [2]:
msgs.describe()

       label                     SMS
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30


In [3]:
# Splitting the dataframe into 2 dataframes: a training set (80%) and a testing set (the remaining 20%).
train = msgs.sample(frac=.8, random_state=1)
test = msgs.drop(train.index)

In [4]:
# Resetting both indexes.
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

In [5]:
print('Ham vs Spam percentages of each dataframe')
print('Training:')
print(train['label'].value_counts(normalize=True)*100)
print('Testing:')
print(test['label'].value_counts(normalize=True)*100)

Ham vs Spam percentages of each dataframe
Training:
ham     86.54105
spam    13.45895
Name: label, dtype: float64
Testing:
ham     86.804309
spam    13.195691
Name: label, dtype: float64


In [6]:
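# Cleaning up the data: replace non-word characters with spaces and lowercase everything.
# regex=True is required in recent pandas versions, where str.replace defaults to literal matching.
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()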

### Building the Vocabulary for the Naive Bayes Algorithm
Here we extract every word out of the text messages in the training data set. The `vocab` is then turned into a `set()` to remove any duplicate words.

In [7]:
train['SMS'] = train['SMS'].str.split()
vocab = []
for row in train['SMS']:
    for word in row:
        vocab.append(word)

vocab = set(vocab)
vocab = list(vocab)
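
As a side note, the same kind of vocabulary can be built in one step with a set comprehension. The sketch below uses two made-up messages (`toy_msgs` is a hypothetical stand-in for the training data) and mirrors the cleaning applied above.

```python
import re

# Hypothetical toy messages standing in for the training data.
toy_msgs = ["WINNER!! Free money now", "Are you free tonight?"]

# Same cleaning as above: non-word characters become spaces, then lowercase and split.
tokenized = [re.sub(r'\W', ' ', m).lower().split() for m in toy_msgs]

# A set comprehension keeps each distinct word once; sorting makes the order stable.
toy_vocab = sorted({word for sms in tokenized for word in sms})
print(toy_vocab)
```

The word "free" appears in both messages but only once in `toy_vocab`.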

In [8]:
word_cnt_per_sms = {word: [0] * len(train['SMS']) for word in vocab}
for i, sms in enumerate(train['SMS']):
    for word in sms:
        word_cnt_per_sms[word][i] += 1

In [9]:
word_cnts_df = pd.DataFrame(data=word_cnt_per_sms)
train_df = pd.concat([train, word_cnts_df], axis=1)

### Calculating the probabilities for each word in both the spam and ham classed messages

In [10]:
# Look the classes up by label rather than by position in the sorted counts.
p_spam = train_df['label'].value_counts(normalize=True)['spam']
p_ham = train_df['label'].value_counts(normalize=True)['ham']

n_spam = train_df[train_df['label'] == 'spam'].iloc[:,2:].sum(axis=1)
n_ham = train_df[train_df['label'] ==  'ham'].iloc[:,2:].sum(axis=1)

n_spam = n_spam.sum()
n_ham = n_ham.sum()
n_vocab = train_df.shape[1] - 2
# Changing alpha from 1 to 0.6 actually increased the accuracy a bit (0.1%).
# alpha is the smoothing parameter: it keeps the probability above 0 for
# vocabulary words that never appear in one of the two classes.
alpha = .6

In [11]:
spam_params = {w: [0] for w in vocab}
ham_params = {w: [0] for w in vocab}
spam_msgs =  train_df[train_df['label']=='spam']
ham_msgs = train_df[train_df['label']=='ham']

In [12]:
for k in spam_params:
    spam_params[k] = (spam_msgs[k].sum() + alpha)/(n_spam + alpha*n_vocab)
for k in ham_params:
    ham_params[k] = (ham_msgs[k].sum() + alpha)/(n_ham + alpha*n_vocab)
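
To make the smoothing step concrete, here is a minimal sketch on made-up data. The names `toy_spam`, `toy_words` and `toy_alpha` are hypothetical and only illustrate the same formula as the cell above.

```python
from collections import Counter

# Hypothetical tokenized spam messages (not taken from the data set).
toy_spam = [["free", "money"], ["free", "prize", "now"]]
# Hypothetical vocabulary; "hello" never occurs in the toy spam messages.
toy_words = ["free", "money", "prize", "now", "hello"]
toy_alpha = 0.6

counts = Counter(word for sms in toy_spam for word in sms)
n_words = sum(counts.values())  # total number of words in the spam class

def smoothed(word):
    """P(word | spam) with additive smoothing."""
    return (counts[word] + toy_alpha) / (n_words + toy_alpha * len(toy_words))

toy_params = {w: smoothed(w) for w in toy_words}
print(toy_params)
```

Even though "hello" has a count of 0 in the spam class, its smoothed probability stays above 0, and the probabilities over the whole vocabulary still sum to 1.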

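### The classify function, which checks a message against the training data set
For each word in the message, the function multiplies the running spam score by the probability of that word given spam, and the running ham score by the probability of that word given ham. Words that are not in the vocabulary are skipped. The message is labeled spam only if the final spam score is strictly greater than the ham score.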

In [13]:
# classification function
def classify(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_msg = p_spam
    p_ham_given_msg = p_ham
    
    label = 'ham'
    
    for word in message:
        if word in spam_params:
            p_spam_given_msg *= spam_params[word]
        if word in ham_params:
            p_ham_given_msg *= ham_params[word]
    if p_spam_given_msg > p_ham_given_msg:
        label = 'spam'
    if p_spam_given_msg == p_ham_given_msg:
        print('Probabilities are equal, assigning to ham')
        
    # if the probabilities are equal, err on the side of ham:
    # the default label 'ham' set above is kept
    return label

classify('winner free money')

'spam'
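
Multiplying many probabilities below 1 yields very small numbers, and for long messages both products could underflow to 0.0, which is one plausible way to end up with a tie. A common alternative is to add logarithms instead of multiplying. The sketch below is not the function used in this notebook: `classify_log` is a hypothetical variant that takes its parameters explicitly, and the toy parameter values are invented for illustration.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_params, ham_params):
    """Hypothetical log-space variant of classify(); ties still go to ham."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in spam_params:
            log_spam += math.log(spam_params[word])
        if word in ham_params:
            log_ham += math.log(ham_params[word])
    return 'spam' if log_spam > log_ham else 'ham'

# Toy parameters for illustration only (not the trained values).
toy_spam_params = {'free': 0.05, 'money': 0.04, 'hello': 0.001}
toy_ham_params = {'free': 0.005, 'money': 0.004, 'hello': 0.03}

print(classify_log('Free money!!', 0.13, 0.87, toy_spam_params, toy_ham_params))
print(classify_log('hello', 0.13, 0.87, toy_spam_params, toy_ham_params))
```

Because sums of logs do not underflow the way long products do, a message that repeats a word hundreds of times can still be compared reliably.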

In [14]:
test['prediction'] = test['SMS'].apply(classify)

Probabilities are equal, assigning to ham


### Now we will calculate the accuracy of our filter against the human classifications

In [15]:
correct, fail = 0, 0
total = test.shape[0]
index_failed = []
for _, row in test.iterrows():
    if row['label'] == row['prediction']:
        correct += 1
    else:
        fail += 1
        index_failed.append(row)
        
print(f'Correct: {correct}')
print(f'Failed: {fail}')
print(f'Accuracy: {correct/total*100:.2f}%')

Correct: 1102
Failed: 12
Accuracy: 98.92%


# 98.92% Accuracy
The spam filter correctly classified 98.92% of the test data set. When a message's probability of spam equals its probability of ham, we classify it as ham. This tie rule avoided 1 failed classification in our test messages, and the tie message was printed 7 times when the algorithm was run on the entire data set (see last cell). All of the messages classified as ham due to equal probabilities were predicted correctly. Another slight tweak was adjusting the alpha value: changing it to 0.6 instead of 1 improved our accuracy a little as well (about 0.1%).  
  
Out of the 12 failed classifications, 7 were spam messages that would have made it through our filter, and 5 were ham messages that would have been filtered (classified as spam) erroneously.  
  
When running the filter on the entire data set, we were 99.30% accurate, although that figure includes the messages the filter was trained on. The Naive Bayes algorithm was quite effective at separating spam messages from ham messages. 
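
The split between missed spam and wrongly filtered ham can be counted directly from (label, prediction) pairs. The sketch below uses a handful of invented pairs (`toy_pairs` is hypothetical and does not reproduce the results above).

```python
from collections import Counter

# Hypothetical (label, prediction) pairs; not the notebook's results.
toy_pairs = [('ham', 'ham'), ('ham', 'ham'), ('ham', 'spam'),
             ('spam', 'spam'), ('spam', 'ham'), ('spam', 'spam')]

confusion = Counter(toy_pairs)
missed_spam = confusion[('spam', 'ham')]   # spam that slipped through the filter
blocked_ham = confusion[('ham', 'spam')]   # ham that was filtered by mistake
accuracy = (confusion[('ham', 'ham')] + confusion[('spam', 'spam')]) / len(toy_pairs)

print(f'missed spam: {missed_spam}, blocked ham: {blocked_ham}, accuracy: {accuracy:.2f}')
```

On the notebook's own data frame, `pd.crosstab(test['label'], test['prediction'])` should produce the equivalent table in one call.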

In [16]:
for i in index_failed:
    print('--------------')
    print(i['SMS'])
    print('prediction:', i['prediction'])
    print('label:', i['label']) 

--------------
No calls..messages..missed calls
prediction: spam
label: ham
--------------
Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r
prediction: ham
label: spam
--------------
26th OF JULY
prediction: spam
label: ham
--------------
0A$NETWORKS allow companies to bill for SMS, so they are responsible for their "suppliers", just as a shop has to give a guarantee on what they sell. B. G.
prediction: ham
label: spam
--------------
RCT' THNQ Adrian for U text. Rgds Vatian
prediction: ham
label: spam
--------------
Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
prediction: ham
label: spam
--------------
2/2 146tf150p
prediction: ham
label: spam
--------------
Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50
prediction: ham
label: spam
---

In [17]:
# This function returns more details about a particular message: the label,
# the unnormalized ham and spam scores, and the difference between the two (ham minus spam).
def reclass_check(message, tie_ham=True):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_msg = p_spam
    p_ham_given_msg = p_ham
    
    label = 'ham'
    
    for word in message:
        if word in spam_params:
            p_spam_given_msg *= spam_params[word]
        if word in ham_params:
            p_ham_given_msg *= ham_params[word]
    if p_spam_given_msg > p_ham_given_msg:
        label = 'spam'
    if not tie_ham:
        if p_spam_given_msg == p_ham_given_msg:
            label = 'not sure'
    return label, p_ham_given_msg, p_spam_given_msg, (p_ham_given_msg - p_spam_given_msg)

In [18]:
# To check whether defaulting to ham on a tie caused any failures, run reclass_check on the
# failed messages and see if any difference (the last value) is 0.
# None are, for both the testing set and the entire data set (not displayed),
# but you can try it out if you put this code in a cell below the last cell.
for i in index_failed:
    print(reclass_check(i['SMS']))

('spam', 3.3337319614499942e-18, 1.546878343065051e-17, -1.2135051469200517e-17)
('ham', 3.8459433558699016e-61, 1.053234146087431e-70, 3.8459433548166676e-61)
('spam', 5.496785842777343e-13, 5.6897472274736865e-12, -5.140068643195952e-12)
('ham', 1.4401616339935071e-68, 1.9018184774337135e-71, 1.4382598155160735e-68)
('ham', 6.759560064531696e-08, 3.201537848985072e-08, 3.5580222155466237e-08)
('ham', 1.9674023341509878e-84, 2.285903402623665e-93, 1.9674023318650843e-84)
('ham', 1.5571659241445183e-05, 9.357979941924251e-06, 6.213679299520932e-06)
('ham', 1.2991774226844665e-66, 4.1413736950670494e-73, 1.299177008547097e-66)
('spam', 1.3047315846592374e-12, 1.4113300331928352e-11, -1.2808568747269115e-11)
('ham', 3.980373812097504e-95, 4.591967501563511e-99, 3.979914615347347e-95)
('spam', 3.3310522207230706e-10, 4.8964901798375424e-09, -4.563384957765235e-09)
('spam', 1.1565730001971696e-60, 3.593766741311431e-58, -3.58220101130946e-58)


### And this is how it performs against the entire data set (both the testing and training sets combined)

In [19]:
msgs['prediction'] = msgs['SMS'].apply(classify)

Probabilities are equal, assigning to ham
Probabilities are equal, assigning to ham
Probabilities are equal, assigning to ham
Probabilities are equal, assigning to ham
Probabilities are equal, assigning to ham
Probabilities are equal, assigning to ham
Probabilities are equal, assigning to ham


In [20]:
correct, fail = 0, 0
total = msgs.shape[0]
index_failed = []
for _, row in msgs.iterrows():
    if row['label'] == row['prediction']:
        correct += 1
    else:
        fail += 1
        index_failed.append(row)
print('For the entire data set (test and train sets):')       
print(f'correct: {correct}')
print(f'failed: {fail}')
print(f'Accuracy: {correct/total*100:.2f}%')

For the entire data set (test and train sets):
correct: 5533
failed: 39
Accuracy: 99.30%


#### Not too shabby!