Lets get naive with Bayes! 

In this project, we are going to build a spamfilter to decide wheter a sms is spam or not.

Lets get to it! 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pd.read_csv('SMSSpamCollection',sep = '\t', header=None, names = ['Label','SMS'])

In [3]:
data

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


5572 entries, and no null values (yay) 

In [5]:
data[data['Label'] == 'ham'].count()

Label    4825
SMS      4825
dtype: int64

In [6]:
data[data['Label'] == 'spam'].count()

Label    747
SMS      747
dtype: int64

So the majority of the smses are not spam (aka ham) 

In order for this to work properly (or more so lets see if it actually can detect spam) lets create a test and training set by first randomize the whole dataset

In [7]:
random_data = data.sample(frac=1,random_state=1)

In [8]:
cut_off = int(len(random_data)*0.8)

In [9]:
training_set = random_data.iloc[:cut_off,:].reset_index()

In [10]:
test_set = random_data.iloc[cut_off:,:].reset_index()

In [11]:
training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4457 entries, 0 to 4456
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   4457 non-null   int64 
 1   Label   4457 non-null   object
 2   SMS     4457 non-null   object
dtypes: int64(1), object(2)
memory usage: 104.6+ KB


In [12]:
test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   index   1115 non-null   int64 
 1   Label   1115 non-null   object
 2   SMS     1115 non-null   object
dtypes: int64(1), object(2)
memory usage: 26.3+ KB


In [13]:
training_set['Label'].value_counts(normalize = True)

ham     0.86538
spam    0.13462
Name: Label, dtype: float64

In [14]:
test_set['Label'].value_counts(normalize=True)

ham     0.868161
spam    0.131839
Name: Label, dtype: float64

Good distrivution of spam sms in both of the sets, lets gooooooo! 

In [15]:
import re
def data_cleaning(str):
    return re.sub('\W',' ',str).lower()

In [16]:
training_set['SMS'] = training_set['SMS'].apply(data_cleaning).str.lower()

In [17]:
training_set['SMS']

0                            yep  by the pretty sculpture
1           yes  princess  are you going to make me moan 
2                              welp apparently he retired
3                                                 havent 
4       i forgot 2 ask ü all smth   there s a card on ...
                              ...                        
4452               how about clothes  jewelry  and trips 
4453    sorry  i ll call later in meeting any thing re...
4454    babe  i fucking love you too    you know  fuck...
4455    u ve been selected to stay in 1 of 250 top bri...
4456    hello my boytoy     geeee i miss you already a...
Name: SMS, Length: 4457, dtype: object

In [18]:
training_set['SMS'] = training_set['SMS'].str.split()

In [19]:
vocabulary = []

for lst in training_set['SMS']:
    for string in lst:
        vocabulary.append(string)

In [20]:
vocabulary = set(vocabulary)

In [21]:
vocabulary = list(vocabulary)

In [22]:
len(vocabulary)

7782

Now when we have a list of all the possible words in the SMS col, lets create a dataframe that says how many of the words are in each SMS

In [23]:
word_counts_per_sms = {unique_word: [0]*len(training_set['SMS']) for unique_word in vocabulary}

In [24]:
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        if word != '':
            word_counts_per_sms[word][index] +=1

In [25]:
training_word_count = pd.DataFrame(word_counts_per_sms)

In [26]:
training_set = pd.concat([training_set,training_word_count],axis=1)

Lets start to do some classification! First, lets define p_spam, Nspam etc...

In [27]:
p_non_spam, p_spam =training_set['Label'].value_counts(normalize = True)

In [30]:
# Isolating spam and ham messages first
spam_messages = training_set[training_set['Label'] == 'spam']
ham_messages = training_set[training_set['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set)
p_ham = len(ham_messages) / len(training_set)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

Time to calculate the frequecy of words in that are spam and ham, for each individual word

In [31]:
spam_dict ={unique_word: 0 for unique_word in vocabulary}
ham_dict = {unique_word: 0 for unique_word in vocabulary}

In [32]:
def count_words_for_words(row):
    if row['Label'] == 'ham':
        for word in row['SMS']:
                ham_dict[word] +=1
    else:
        for word in row['SMS']:
                spam_dict[word] +=1
training_set.apply(count_words_for_words,axis=1)

0       None
1       None
2       None
3       None
4       None
        ... 
4452    None
4453    None
4454    None
4455    None
4456    None
Length: 4457, dtype: object

In [33]:
probability_spam_dict ={unique_word: 0 for unique_word in vocabulary}
probability_ham_dict ={unique_word: 0 for unique_word in vocabulary}


In [34]:
for key,value in probability_spam_dict.items():
    probability_spam_dict[key] = (spam_dict[key]+alpha)/(n_spam+alpha*n_vocabulary)

for key,value in probability_ham_dict.items():
    probability_ham_dict[key] = (ham_dict[key]+alpha)/(n_spam+alpha*n_vocabulary)


In [38]:
def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    
    p_spam_given_message = p_spam 
    for word in message:
        if word not in probability_spam_dict:
            p_spam_given_message *= alpha/(n_vocabulary*alpha+n_spam)
        else: 
            p_spam_given_message *= probability_spam_dict[word]
    
    p_ham_given_message = p_non_spam
    for word in message:
        if word not in probability_ham_dict:
            p_ham_given_message *= alpha/(n_vocabulary*alpha+n_ham)
        else:
            p_ham_given_message *=probability_ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return None

In [39]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'ham'

In [40]:
test_set['predicted'] = test_set['SMS'].apply(classify)

In [41]:
test_set['predicted'].value_counts(dropna = False,normalize=True)

ham     0.909417
spam    0.090583
Name: predicted, dtype: float64

Lets calculate the accuracy of the spam filter! 

In [42]:
correct = 0
total = len(test_set)

for index,row in test_set.iterrows():
    if row['Label'] == row['predicted']:
        correct +=1
        
correct/total

0.9533632286995516

In [43]:
test_set['Label'].value_counts(normalize=True)

ham     0.868161
spam    0.131839
Name: Label, dtype: float64