# Building a Spam Filter with Naive Bayes

In this project we work with a dataset of SMS spam messages that comes from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).<br>
The data collection process is described in more detail on [this page](https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

**Our main objective is to classify messages as spam or non-spam using the Naive Bayes algorithm.**<br>
Our first step is to teach the computer how to classify messages.<br>
The dataset contains 5,572 SMS messages that have already been classified by humans.

### Data Set Description from Their Website
- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site.
- A subset of 3,375 SMS randomly chosen ham messages of the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages.
- A list of 450 SMS ham messages collected from Caroline Tag's PhD Thesis.
- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is public.

In [8]:
import pandas as pd
import numpy as np
import yaml

In [9]:
#reading string constants from config.yml
with open('config.yml') as config:
    data = yaml.load(config, Loader=yaml.loader.SafeLoader)
sms = data['sms_string']
label = data['label_string']

In [2]:
sms_massages = pd.read_csv("SMSSpamCollection",sep="\t",header=None,names=[label, sms])
sms_massages

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


- ham means non-spam.
- 5572 rows × 2 columns

In [3]:
sms_massages["Label"].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

About 86.59% of the messages are ham and 13.41% are spam.

------------------------------------
Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we first split our dataset into two sets:

- A training set, which we'll use to "train" the computer to classify messages.
- A test set, which we'll use to test how well the spam filter classifies new messages.

## Train & Test Data.

**The training set will have 4,458 messages (about 80% of the dataset).**<br>
**The test set will have 1,114 messages (about 20% of the dataset).**

In [4]:
sample_massages =  sms_massages.sample(frac=1,random_state=1)
sample_massages.shape

(5572, 2)

In [5]:
train_test_data = round(len(sample_massages)* 0.8)
train_test_data

4458

In [6]:
# The first 80% of the shuffled rows form the training set, the rest the test set
train = sample_massages[:train_test_data].reset_index(drop=True)
test = sample_massages[train_test_data:].reset_index(drop=True)
print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)


In [7]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


--------------------------------------------
## Letter Case and Punctuation

Remove all the punctuation from the SMS column

In [8]:
import re

In [9]:
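# Example: replace every non-word character in a sample string with a space
sample_cleaned = re.sub(r'\W', ' ', 'Secret!! Money, goods.')

In [10]:
train = train.copy()
test = test.copy()
# The replacement must be a plain space. Passing the cleaned sample string as the
# replacement instead splices "Secret   Money  goods " into every message, which is
# what the outputs recorded in this notebook show.
train[sms] = train[sms].str.replace(r'\W', ' ', regex=True)
test[sms] = test[sms].str.replace(r'\W', ' ', regex=True)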
train


Unnamed: 0,Label,SMS
0,ham,YepSecret Money goods Secret Money goods...
1,ham,YesSecret Money goods Secret Money goods...
2,ham,WelpSecret Money goods apparentlySecret M...
3,ham,HaventSecret Money goods
4,ham,ISecret Money goods forgotSecret Money g...
...,...,...
4453,ham,SorrySecret Money goods Secret Money goo...
4454,ham,BabeSecret Money goods Secret Money good...
4455,spam,USecret Money goods veSecret Money goods...
4456,ham,HelloSecret Money goods mySecret Money g...


Transform every letter in every word to lower case.

In [11]:
train[sms] = train[sms].str.lower()
test[sms] = test[sms].str.lower()
train.head(10)

Unnamed: 0,Label,SMS
0,ham,yepsecret money goods secret money goods...
1,ham,yessecret money goods secret money goods...
2,ham,welpsecret money goods apparentlysecret m...
3,ham,haventsecret money goods
4,ham,isecret money goods forgotsecret money g...
5,ham,oksecret money goods isecret money goods...
6,ham,isecret money goods wantsecret money goo...
7,ham,nosecret money goods dearsecret money go...
8,ham,oksecret money goods pasecret money good...
9,ham,illsecret money goods besecret money goo...


In [12]:
train[sms] = train[sms].str.split()
# test[sms] = test[sms].str.split()
train.head()

Unnamed: 0,Label,SMS
0,ham,"[yepsecret, money, goods, secret, money, goods..."
1,ham,"[yessecret, money, goods, secret, money, goods..."
2,ham,"[welpsecret, money, goods, apparentlysecret, m..."
3,ham,"[haventsecret, money, goods]"
4,ham,"[isecret, money, goods, forgotsecret, money, g..."
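
A minimal sketch of the intended cleaning on a single string: every non-word character becomes a space, letters are lower-cased, and the result is split on whitespace. The helper name `clean_message` is hypothetical and not part of the notebook.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, lower-case, and split into words."""
    return re.sub(r"\W", " ", text).lower().split()

print(clean_message("Secret!! Money, goods."))  # ['secret', 'money', 'goods']
```

Applied to a whole column, the same three steps correspond to `str.replace(r'\W', ' ', regex=True)`, `str.lower()` and `str.split()`.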


## Creating the Vocabulary
The vocabulary is the list of unique words in the training set.

In [13]:
vocabulary = []
for message in train[sms]:       # each message is a list of words
    for word in message:         # each word is a string
        if word not in vocabulary:
            vocabulary.append(word)

In [14]:
print("We have {} unique words in the vocabulary".format(len(vocabulary)))

We have 8400 unique words in the vocabulary


Just to be sure we don't have duplicates, we convert the list to a set and back.

In [15]:
vocabulary = list(set(vocabulary))
len(vocabulary)

8400
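
As a sketch of an alternative, the vocabulary can be collected directly with a set comprehension, which avoids the repeated list membership checks. The `tokenised` list below is made-up data standing in for `train[sms]`, and `vocabulary_sketch` is a hypothetical name.

```python
# Hypothetical tokenised messages standing in for train[sms].
tokenised = [["secret", "money"], ["money", "now"], ["see", "u", "now"]]

# A set keeps each distinct word once; sorting gives a stable, readable order.
vocabulary_sketch = sorted({word for message in tokenised for word in message})
print(vocabulary_sketch)  # ['money', 'now', 'secret', 'see', 'u']
```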

## The Final Training Set

In [16]:
word_counts_per_sms = {unique_word: [0] * len(train[sms]) for unique_word in vocabulary}

# The loop variable must not be named `sms`, which holds the column name
for index, message in enumerate(train[sms]):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [17]:
words_counts = pd.DataFrame(word_counts_per_sms)
words_counts.iloc[:5,:5]

Unnamed: 0,sporadicallysecret,gloucesterroadsecret,overheatingsecret,creditsecret,w1jsecret
0,0,0,0,0,0
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0


In [18]:
words_concat = pd.concat([train,words_counts],axis=1)
print(words_concat.head(2))
print(words_concat.shape)

  Label                                                SMS  \
0   ham  [yepsecret, money, goods, secret, money, goods...   
1   ham  [yessecret, money, goods, secret, money, goods...   

   sporadicallysecret  gloucesterroadsecret  overheatingsecret  creditsecret  \
0                   0                     0                  0             0   
1                   0                     0                  0             0   

   w1jsecret  theater  telephonesecret  questionsecret  ...  nttsecret  \
0          0        0                0               0  ...          0   
1          0        0                0               0  ...          0   

   longer  billedsecret  txtauctionsecret  smidginsecret  websecret  \
0       0             0                 0              0          0   
1       0             0                 0              0          0   

   amusedsecret  3gbpsecret  txtxsecret  thasasecret  
0             0           0           0            0  
1             0          

## Calculating P(Spam), P(Ham), N_Spam and N_Ham

In [32]:
spam_massages =  words_concat[words_concat[label] == "spam"]
ham_massages  = words_concat[words_concat[label] == "ham"]
print(len(ham_massages))
print(len(spam_massages))
print(len(words_concat))


3858
600
4458


**P(Spam) = number of spam messages / total number of messages**<br>
**P(Ham) = number of ham messages / total number of messages**

In [20]:
p_spam = len(spam_massages) / len(words_concat)
p_ham = len(ham_massages) / len(words_concat)
print(p_spam)
print(p_ham)

0.13458950201884254
0.8654104979811574


**N_Spam is the number of words in all the spam messages.**<br>
**N_Ham is the number of words in all the ham messages.**

In [21]:
#N_spam
n_words_per_spam_message = spam_massages[sms].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_massages[sms].apply(len)
n_ham = n_words_per_ham_message.sum()

n_vocabulary = len(vocabulary)
alpha = 1


print(n_ham)
print(n_spam)
print(n_vocabulary)

203458
52470
8400


------------------------------

## Calculating Parameters

In [22]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

In [37]:
# Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_massages[word].sum()   # spam_massages is defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_massages[word].sum()   # ham_massages is defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
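
Written out, the loop above applies additive (Laplace) smoothing. The symbols mirror the variables `n_word_given_spam`, `n_word_given_ham`, `n_spam`, `n_ham`, `n_vocabulary` and `alpha` (set to 1 earlier):

$$
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

With smoothing, a word that never occurs in spam still receives a small non-zero probability, so a single unseen word cannot force the whole product to zero.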

In [45]:
# Dictionary 
print(type(parameters_spam))

<class 'dict'>


In [25]:
from functions import classify

In [26]:
classify("WINNER!! This is the secret code to unlock the money: C3421.", p_spam, p_ham)

P(Spam|message): 7.753304364778861e-27
P(Ham|message): 1.39986880035683e-27
Label: Spam


In [27]:
classify("Sounds good, Tom, then see u there", p_spam, p_ham)

P(Spam|message): 1.6106258959218911e-25
P(Ham|message): 7.664602182405598e-24
Label: Ham


Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
  - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
  - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
  - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
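
The steps above can be sketched as a small function. The notebook imports `classify` from a `functions` module that is not shown here, so this is only an illustration under assumptions: the name `classify_sketch`, its signature (the parameter dictionaries are passed in explicitly), the returned strings and the toy probabilities are all hypothetical. The two scores are products proportional to P(Spam|w1, ..., wn) and P(Ham|w1, ..., wn), which is enough for the comparison.

```python
import re

def classify_sketch(message, p_spam, p_ham, parameters_spam, parameters_ham):
    """Return 'spam', 'ham', or 'needs human classification' for one message."""
    words = re.sub(r"\W", " ", message).lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in words:
        # Words outside the vocabulary are skipped in this sketch.
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return "ham"
    if p_ham_given_message < p_spam_given_message:
        return "spam"
    return "needs human classification"


# Made-up parameters for illustration only; not values from the notebook.
toy_spam = {"money": 0.20, "secret": 0.10, "see": 0.01}
toy_ham = {"money": 0.02, "secret": 0.01, "see": 0.10}

print(classify_sketch("Secret money!!", 0.13, 0.87, toy_spam, toy_ham))  # spam
print(classify_sketch("See u there", 0.13, 0.87, toy_spam, toy_ham))     # ham
```

Returning the label instead of printing it makes the same function usable with `Series.apply` when labelling a whole test set.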

## Measuring the Spam Filter's Accuracy

In [39]:
from functions import classify_test_set

In [41]:
test['predicted'] = test[sms].apply(classify_test_set, args=(p_spam, p_ham))
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,latersecret money goods isecret money go...,ham
1,ham,butsecret money goods isecret money good...,ham
2,spam,hadsecret money goods yoursecret money g...,spam
3,ham,allsecret money goods soundssecret money ...,ham
4,ham,allsecret money goods donesecret money g...,ham


In [80]:
correct = 0
total = test.shape[0]

for row in test.iterrows():
    row = row[1]
    if row[label] == row["predicted"]:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1089
Incorrect: 25
Accuracy: 0.9775583482944344


**The accuracy is about 97.76%, which is really good. Our spam filter looked at 1,114 messages that it hadn't seen in training and classified 1,089 of them correctly.**
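
Accuracy alone does not show whether the errors are spam that slipped through or ham that was wrongly flagged. A minimal sketch of tallying each (actual, predicted) pair follows. The label lists are made up for illustration and are not results from this notebook, and the names `actual`, `predicted`, `pair_counts` and `n_correct` are hypothetical.

```python
from collections import Counter

# Made-up labels for illustration; not results from the notebook.
actual = ["ham", "ham", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham"]

# Count every (actual, predicted) pair, then sum the pairs that agree.
pair_counts = Counter(zip(actual, predicted))
n_correct = sum(count for (a, p), count in pair_counts.items() if a == p)
accuracy = n_correct / len(actual)

print(pair_counts[("spam", "ham")])  # spam predicted as ham: 1
print(pair_counts[("ham", "spam")])  # ham predicted as spam: 1
print(accuracy)                      # 0.6
```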

## Summary
We built a spam filter that is about 97.76% accurate on the test set: 1,089 of the 1,114 messages were classified correctly and 25 were misclassified.