<a href="https://colab.research.google.com/github/hussam95/Portfolio/blob/main/Spam_Filter_Hussam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Spam Filter With the Multinomial Naive Bayes Algorithm

In this project, we will use the Naive Bayes algorithm to build a spam filter for unwanted text messages (SMS). The algorithm rests on Bayes' theorem, which describes how to update the probability of a hypothesis in light of evidence and prior knowledge.

In our case, there are two competing hypotheses for each new SMS: it is spam, or it is not spam. We will weigh these two hypotheses using the evidence available in this [dataset](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

The Naive Bayes algorithm assigns a probability score to each of the two hypotheses. Whichever hypothesis scores higher (spam or not spam) determines how the new message is classified. In other words, we update our beliefs about a message (spam or not spam) using the words it contains as evidence.
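
To make the idea of updating beliefs concrete, here is a minimal sketch of Bayes' theorem applied to a single word. All numbers and the `toy_` names are hypothetical and chosen only for illustration; they are not computed from the dataset.

```python
# Hypothetical priors: share of spam and non-spam messages
toy_p_spam = 0.13
toy_p_ham = 0.87

# Hypothetical likelihoods of seeing the word "free" in each class
toy_p_free_given_spam = 0.05
toy_p_free_given_ham = 0.005

# Bayes' theorem: P(class | word) = P(word | class) * P(class) / P(word)
toy_p_free = toy_p_free_given_spam * toy_p_spam + toy_p_free_given_ham * toy_p_ham
toy_p_spam_given_free = toy_p_free_given_spam * toy_p_spam / toy_p_free
toy_p_ham_given_free = toy_p_free_given_ham * toy_p_ham / toy_p_free

print(round(toy_p_spam_given_free, 3), round(toy_p_ham_given_free, 3))
```

With these made-up numbers, seeing the word "free" raises the probability of spam from the prior of 13% to roughly 60%.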

## Reading the Dataset

In [None]:
# Read the tab-separated file. header=None keeps the first row as data
# instead of turning it into a header, and names sets the column names.
import pandas as pd
data = pd.read_csv('SMSSpamCollection',
                   sep='\t', header=None,
                   names=['Label', 'SMS'])
data


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
# Calculating percentages of spam and non-spam messages in the original dataset
a = data['Label'].value_counts(normalize=True) * 100
print('Spam%: {}\nNon-spam%: {}'.format(a['spam'], a['ham']))

Spam%: 13.406317300789663
Non-spam%: 86.59368269921033


## Splitting Dataset: Creating Training and Test Set

Once our spam filter is done, we'll need to test how good it is with classifying new messages. To test the spam filter, we're first going to split our dataset into two categories:

- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how good the spam filter is with classifying new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The **training set** will have **4,458 messages** (about 80% of the dataset).
- The **test set** will have **1,114 messages** (about 20% of the dataset).
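
The two set sizes follow directly from the 80/20 split. A short sketch of the arithmetic (the variable names here are illustrative and are not used elsewhere in the notebook):

```python
n_total = 5572          # messages in the dataset
train_fraction = 0.8    # share kept for training

n_train = round(n_total * train_fraction)  # 4457.6 rounds to 4458
n_test = n_total - n_train                 # the remainder goes to the test set

print(n_train, n_test)
```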

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.

**For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of the new messages to be classified correctly as spam or ham (non-spam).**



In [None]:
# Randomizing the entire dataset
random = data.sample(frac=1, random_state=1)

In [None]:
# Selecting first 4,458 values of dataset for training purposes
training = random.iloc[0:4458]
training= training.reset_index(drop=True)

In [None]:
# Selecting last 1,114 values of dataset for testing purposes
testing= random.iloc[4458:]
testing= testing.reset_index(drop=True)

In [None]:
## Calculating spam non-spam percentage in training set
a=training['Label'].value_counts(normalize=True)*100
print('Training-Spam%: {}\nTraining-Non-spam%: {}'.format(a['spam'],
                                                            a['ham']))

Training-Spam%: 13.458950201884253
Training-Non-spam%: 86.54104979811575


In [None]:
## Calculating spam non-spam percentage in testing set
a=testing['Label'].value_counts(normalize=True)*100
print('Testing-Spam%: {}\nTesting-Non-spam%: {}'.format(a['spam'],
                                                            a['ham']))

Testing-Spam%: 13.195691202872531
Testing-Non-spam%: 86.80430879712748


Remember, the spam and non-spam percentages in the original dataset of 5,572 messages were:
- **Spam%: 13.406317300789663**
- **Non-spam%: 86.59368269921033**

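After randomly splitting the data into training (4,458 messages) and testing (1,114 messages) sets, the spam and non-spam percentages in each set stay within about a quarter of a percentage point of the original ones, so both sets are representative of the full dataset.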

## Data Cleaning

All unique words in the SMS column constitute the vocabulary we need to calculate the different probabilities. The goal is to transform the training dataset so that it has one column per vocabulary word.

Moreover, instead of storing the words themselves, each column will hold the number of times that word occurs in each message.
  
  ![picture](https://drive.google.com/uc?export=view&id=1d9w3YBvMQ_KX0x7MMAV7saakAZBnComH)
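
The target layout can be sketched on a toy example using only the standard library. The three messages and the `toy_` names below are hypothetical; the notebook itself builds the same kind of table with pandas on the real training set.

```python
from collections import Counter

# Hypothetical, already cleaned messages
toy_messages = [
    'secret prize claim now',
    'coming to my secret party',
    'winner claim secret prize now',
]

# Vocabulary: every unique word, in order of first appearance
toy_vocabulary = list(dict.fromkeys(w for m in toy_messages for w in m.split()))

# One row per message, one column per vocabulary word, values are word counts
toy_rows = []
for m in toy_messages:
    counts = Counter(m.split())
    toy_rows.append([counts[w] for w in toy_vocabulary])

print(toy_vocabulary)
for row in toy_rows:
    print(row)
```

Each row has as many entries as there are words in the vocabulary, and most entries are zero because a single message uses only a few of those words.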




However, to achieve this transformation, we first need to bring our data into shape. So, let's do some cleaning.


In [None]:
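# Cleaning the SMS column

## Replacing punctuation (any non-word character) with a space.
## regex=True is needed because newer pandas versions treat the pattern as a literal string by default.
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True)

## Changing case to lower in the SMS column
training['SMS'] = training['SMS'].str.lower()
training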

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
...,...,...
4453,ham,sorry i ll call later in meeting any thing re...
4454,ham,babe i fucking love you too you know fuck...
4455,spam,u ve been selected to stay in 1 of 250 top bri...
4456,ham,hello my boytoy geeee i miss you already a...


## Finding Unique Words/Vocabulary in All SMS 

Let's now collect the unique words in the SMS column of the training dataset and count them.

In [None]:
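vocabulary = []
seen = set()  # set membership checks are much faster than searching a list

for sms in training['SMS']:
    for word in sms.split():
        # Keep each word once, in order of first appearance
        if word not in seen:
            seen.add(word)
            vocabulary.append(word)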


In [None]:
print("There are {} unique words in the vocabulary of the training set's SMS column.".format(len(vocabulary)))

There are 7783 unique words in the vocabulary of the training set's SMS column.


## Data Transformation

In [None]:
# The code [0] * 5 outputs [0, 0, 0, 0, 0].
# So [0] * len(training['SMS']) outputs a list as long as training['SMS']
# in which every element is 0.

word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

# Loop over training['SMS'] with enumerate() to get both
# the row index and the SMS message (index and sms).

for index, sms in enumerate(training['SMS']):
    for word in sms.split():
        word_counts_per_sms[word][index] += 1

word_counts=pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,yep,by,the,pretty,sculpture,yes,princess,are,you,going,to,make,me,moan,welp,apparently,he,retired,havent,i,forgot,2,ask,ü,all,smth,there,s,a,card,on,da,present,lei,how,want,write,or,sign,it,...,cutting,ooh,4got,moseley,weds,bruv,09053750005,310303,08718725756,140ppm,dan,reminded,cme,hos,occupied,armenia,swann,abbey,09066660100,2309,harry,potter,phoenix,readers,inconsiderate,nag,recession,hence,genes,prakesh,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,1,2,2,2,1,1,1,1,2,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
training_clean=pd.concat([training,word_counts],axis=1)
training_clean.head()

Unnamed: 0,Label,SMS,yep,by,the,pretty,sculpture,yes,princess,are,you,going,to,make,me,moan,welp,apparently,he,retired,havent,i,forgot,2,ask,ü,all,smth,there,s,a,card,on,da,present,lei,how,want,write,or,...,cutting,ooh,4got,moseley,weds,bruv,09053750005,310303,08718725756,140ppm,dan,reminded,cme,hos,occupied,armenia,swann,abbey,09066660100,2309,harry,potter,phoenix,readers,inconsiderate,nag,recession,hence,genes,prakesh,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,ham,yep by the pretty sculpture,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,ham,yes princess are you going to make me moan,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,ham,welp apparently he retired,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,ham,havent,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,ham,i forgot 2 ask ü all smth there s a card on ...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,1,2,2,2,1,1,1,1,2,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Calculations


### Calculating Constants
Now that we're done with data cleaning and have a training set to work with, we can begin creating the spam filter. To classify new messages, the Naive Bayes algorithm needs to evaluate the two equations below:

  ![picture](https://drive.google.com/uc?export=view&id=1wQCDOI0Zf_1VpR_9fo5OVW7J4mdb5PpV) 

Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, we need to use these equations:

  ![picture](https://drive.google.com/uc?export=view&id=1xluKdDobth8U0cxbljSqvs4J8908GCLN)
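
For readers who cannot load the images, the four equations can be written out in LaTeX. This is a sketch that assumes the images show the standard multinomial Naive Bayes formulas with additive smoothing, which is what the code later in this notebook computes. Here N_{w_i|Spam} is the number of times the word w_i occurs in spam messages, N_{Spam} is the total number of words in spam messages, and N_{Vocabulary} is the number of unique words.

```latex
\begin{aligned}
P(\text{Spam} \mid w_1, w_2, \dots, w_n) &\propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, w_2, \dots, w_n) &\propto P(\text{Ham}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Ham}) \\
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}} \\
P(w_i \mid \text{Ham}) &= \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{aligned}
```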

Let's begin by calculating the constants that appear in these four equations, i.e. P(spam), P(ham), Nspam, Nham, alpha, and Nvocabulary.

In [None]:
spam_messages= training_clean[training_clean['Label']=='spam']
ham_messages = training_clean[training_clean['Label']=='ham']

p_spam = len(spam_messages)/len(training_clean)
p_ham = len(ham_messages)/len(training_clean)

n_spam = spam_messages.iloc[0:, 2:].sum().sum()
n_ham= ham_messages.iloc[0:,2:].sum().sum()

alpha = 1
n_vocabulary = len(vocabulary)

print('p_spam:{}\np_ham:{}\nn_spam:{}\nn_ham:{}\nn_vocabulary={}\nalpha:{}'.format(p_spam, p_ham, n_spam,n_ham,n_vocabulary,alpha))


p_spam:0.13458950201884254
p_ham:0.8654104979811574
n_spam:15190
n_ham:57237
n_vocabulary=7783
alpha:1


### Calculating Parameters

We can use our training set to calculate the probability for each word in our vocabulary. If our vocabulary contained only the words "lost", "navigate", and "sea", then we'd need to calculate six probabilities:

- P("lost"|Spam) and P("lost"|Ham)
- P("navigate"|Spam) and P("navigate"|Ham)
- P("sea"|Spam) and P("sea"|Ham)

We have 7,783 words in our vocabulary, which means we'll need to calculate a total of 15,566 probabilities. For each word, we need to calculate both P(wi|Spam) and P(wi|Ham).
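
As a quick sanity check of the smoothing formula, the sketch below plugs in the constants printed above for a word that never occurs in a spam message. The helper name `smoothed_probability` is an assumption made for this example and is not part of the notebook.

```python
def smoothed_probability(word_count, n_class, alpha, n_vocabulary):
    """P(w|class) with additive smoothing: (N_w + alpha) / (N_class + alpha * N_vocabulary)."""
    return (word_count + alpha) / (n_class + alpha * n_vocabulary)

# Constants taken from the output of the previous cell
check_alpha = 1
check_n_spam = 15190
check_n_vocabulary = 7783

# A word with zero occurrences in spam still gets a small non-zero probability
p_unseen_in_spam = smoothed_probability(0, check_n_spam, check_alpha, check_n_vocabulary)
print(p_unseen_in_spam)  # 1 / 22973, about 4.4e-05
```

Without the smoothing term, such a word would have probability 0 and would wipe out the whole product for any message containing it.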

In [None]:
# P(w_i|Spam) for every word in the vocabulary, with additive smoothing (alpha)
p_word_given_spam = {}
for col in spam_messages.columns[2:]:
    # Number of times this word occurs across all spam messages
    n_w_i = spam_messages[col].sum()
    p_word_given_spam[col] = (n_w_i + alpha) / (n_spam + alpha * n_vocabulary)

p_w_given_spam = pd.DataFrame(p_word_given_spam.items(), columns=['word', 'p_w_given_spam'])
p_w_given_spam.head()


Unnamed: 0,word,p_w_given_spam
0,yep,4.4e-05
1,by,0.001524
2,the,0.006878
3,pretty,4.4e-05
4,sculpture,4.4e-05


In [None]:
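# P(w_i|Ham) for every word in the vocabulary, with additive smoothing (alpha)
p_word_given_ham = {}
for col in ham_messages.columns[2:]:
    # Number of times this word occurs across all ham messages
    n_w_i = ham_messages[col].sum()
    p_word_given_ham[col] = (n_w_i + alpha) / (n_ham + alpha * n_vocabulary)

p_w_given_ham = pd.DataFrame(p_word_given_ham.items(), columns=['word', 'p_w_given_ham'])
p_w_given_ham.head()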

Unnamed: 0,word,p_w_given_ham
0,yep,0.000154
1,by,0.001707
2,the,0.014165
3,pretty,0.0002
4,sculpture,3.1e-05


## Classifying a New Message

Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
1. If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
2. If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
3. If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.


For the classify() function below, note that:

1. The input variable message is assumed to be a string. We perform the same cleaning on it as on the training set: punctuation is removed, the text is lowercased, and the string is split into words.

2. P(Spam) and P(Ham) get multiplied by the probabilities of the individual words given each class (see the equations below). We therefore initialize the two running products with the p_spam and p_ham values we calculated above.

  ![picture](https://drive.google.com/uc?export=view&id=1WylUcWb0xDjrGxIYmg6yxlWQNlY2xFvw)






In [None]:
import re

sms1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
sms2 = "Sounds good, Tom, then see u there"

def classify(message):
    # Same cleaning as for the training set: strip punctuation, lowercase, split into words
    message = re.sub(r'\W', ' ', message).lower().split()

    p_spam_given_message = p_spam  # see eq. above to understand initialization
    p_ham_given_message = p_ham    # see eq. above to understand initialization

    for w in message:
        # Words that are not in the vocabulary are skipped
        if w in p_word_given_spam:
            p_spam_given_message *= p_word_given_spam[w]
        if w in p_word_given_ham:
            p_ham_given_message *= p_word_given_ham[w]

    if p_spam_given_message > p_ham_given_message:
        print('The message is spam!')
    elif p_ham_given_message > p_spam_given_message:
        print('The message is not a spam!')
    else:
        print('Sorry. I cannot tell if the message is spam or not. Consult some human!')
    return (p_spam_given_message, p_ham_given_message)

print(classify(sms1))
print(classify(sms2))

The message is spam!
(1.3481290211300841e-25, 1.9368049028589875e-27)
The message is not a spam!
(2.4372375665888117e-25, 3.687530435009238e-21)
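
The scores above are already as small as 1e-25 for messages of only a few words. For much longer messages, a product of many small probabilities can underflow to 0.0 in floating point. A common alternative, which is not what this notebook does, is to add logarithms instead of multiplying probabilities. The sketch below shows the idea; the function `log_score` and all `toy_` values are hypothetical.

```python
import math

def log_score(words, prior, word_probs):
    """Log of prior * product of word probabilities, computed as a sum of logs."""
    score = math.log(prior)
    for w in words:
        if w in word_probs:          # unknown words are skipped, as in classify()
            score += math.log(word_probs[w])
    return score

# Hypothetical parameters, for illustration only
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_word_given_spam = {'win': 0.02, 'money': 0.03, 'see': 0.001}
toy_word_given_ham = {'win': 0.001, 'money': 0.002, 'see': 0.01}

toy_words = ['win', 'money', 'now']
toy_spam_score = log_score(toy_words, toy_p_spam, toy_word_given_spam)
toy_ham_score = log_score(toy_words, toy_p_ham, toy_word_given_ham)
toy_label = 'spam' if toy_spam_score > toy_ham_score else 'ham'
print(toy_label)

# Why it matters: 100 factors of 1e-5 underflow as a product but not as a sum of logs
underflow_product = 1.0
for _ in range(100):
    underflow_product *= 1e-5
log_sum = 100 * math.log(1e-5)
print(underflow_product, log_sum)
```

Because the logarithm is strictly increasing, comparing the two log scores gives the same decision as comparing the two products whenever the products are still representable.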


## Measuring the Spam Filter's Accuracy

In [None]:
import re

def classify(message):
    # Same cleaning as for the training set: strip punctuation, lowercase, split into words
    message = re.sub(r'\W', ' ', message).lower().split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for w in message:
        # Words that are not in the vocabulary are skipped
        if w in p_word_given_spam:
            p_spam_given_message *= p_word_given_spam[w]
        if w in p_word_given_ham:
            p_ham_given_message *= p_word_given_ham[w]

    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    elif p_ham_given_message > p_spam_given_message:
        return 'ham'
    else:
        # Tie: return an explicit marker instead of None
        return 'needs human classification'
    

correct_predictions = 0
total_predictions = len(testing)

# Compare the filter's prediction with the human label for every test message
for label, message in zip(testing['Label'], testing['SMS']):
    if classify(message) == label:
        correct_predictions += 1

print('The accuracy of spam filter is: {}%'.format(round((correct_predictions/total_predictions)*100, 3)))


The accuracy of spam filter is: 98.743%
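
An accuracy of 98.743% on 1,114 test messages corresponds to 1,100 correct predictions. Since most messages are ham, it can also be useful to see which kind of error the filter makes. The sketch below tallies (actual, predicted) pairs with the standard library; the toy labels are hypothetical and are not taken from the test set.

```python
from collections import Counter

# Reported accuracy expressed as a count of correct predictions
reported_accuracy = round(1100 / 1114 * 100, 3)
print(reported_accuracy)

# Hypothetical labels, for illustration only
toy_actual    = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
toy_predicted = ['ham', 'ham', 'spam', 'ham',  'ham', 'spam']

# Count every (actual, predicted) combination
outcomes = Counter(zip(toy_actual, toy_predicted))

toy_correct = sum(n for (actual, predicted), n in outcomes.items() if actual == predicted)
toy_accuracy = toy_correct / len(toy_actual)
missed_spam = outcomes[('spam', 'ham')]    # spam that slipped through
false_alarms = outcomes[('ham', 'spam')]   # ham wrongly flagged as spam

print(toy_accuracy, missed_spam, false_alarms)
```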



**In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.**


