
## Spam Filter Using Naive Bayes

In this project, we build a spam filter for SMS messages.

To do that, we use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified as spam or ham.

In [0]:
# import libraries
import pandas as pd
import numpy as np
import re



In [0]:
df = pd.read_csv("/content/SMSSpamCollection.csv",header=None,sep='\t',names=['Label', 'SMS'])

In [71]:
df.shape

(5572, 2)

In [72]:
df.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [73]:
df['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [74]:
df['Label'].value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Approximately 86.6% of the messages are non-spam (ham) and only 13.4% are spam.

In [0]:
df_shuffled = df.sample(frac=1,random_state=1)

In [76]:
df_shuffled[:5]

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"
958,ham,Welp apparently he retired
4642,ham,Havent.
4674,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [0]:
len_shuffled = len(df_shuffled)

In [0]:
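# .copy() makes train and test independent DataFrames, so later column assignments do not trigger SettingWithCopyWarning
train = df_shuffled[:round(0.8*len_shuffled)].copy()
test = df_shuffled[round(0.8*len_shuffled):].copy()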

In [79]:
train.shape,test.shape

((4458, 2), (1114, 2))

In [80]:
train['Label'].value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [81]:
test['Label'].value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The proportions of spam and non-spam messages in both the training and test sets are approximately the same as in the original dataset.
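
The shuffle-then-slice split used above can be tried on a small frame to see how the class proportions are compared. This is only a sketch: `shuffle_split`, `toy`, `toy_train`, and `toy_test` are hypothetical names introduced for illustration, and the toy data is made up.

```python
import pandas as pd

def shuffle_split(frame, train_frac=0.8, seed=1):
    """Shuffle the rows, then cut at train_frac; returns independent copies."""
    shuffled = frame.sample(frac=1, random_state=seed)
    cut = round(train_frac * len(shuffled))
    return shuffled.iloc[:cut].copy(), shuffled.iloc[cut:].copy()

# Hypothetical toy data: 8 ham and 2 spam messages.
toy = pd.DataFrame({"Label": ["ham"] * 8 + ["spam"] * 2,
                    "SMS": [f"message {i}" for i in range(10)]})
toy_train, toy_test = shuffle_split(toy)

# Compare class proportions in the training part against the full data.
print(toy["Label"].value_counts(normalize=True))
print(toy_train["Label"].value_counts(normalize=True))
```

A random shuffle only approximates the original proportions. With a dataset of this size the approximation is close, as the outputs above show.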

## Data Cleaning

- Lowercase all letters
- Remove all punctuation
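
Both cleaning steps can be sketched on a single string before applying them to the whole column. The helper name `clean_text` and the sample message are hypothetical, used only for illustration.

```python
import re

def clean_text(text):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r"\W", " ", text).lower()

# Hypothetical sample message.
sample = "Free entry!! Call NOW..."
print(clean_text(sample).split())
```

Note that `\W` leaves letters, digits, and underscores intact, so numbers such as phone numbers survive the cleaning.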

In [82]:
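# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)
test['SMS'] = test['SMS'].str.replace(r'\W', ' ', regex=True)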



In [83]:
train['SMS'] = train['SMS'].str.lower()
test['SMS'] = test['SMS'].str.lower()



In [84]:
train.head()


Unnamed: 0,Label,SMS
1078,ham,yep by the pretty sculpture
4028,ham,yes princess are you going to make me moan
958,ham,welp apparently he retired
4642,ham,havent
4674,ham,i forgot 2 ask ü all smth there s a card on ...


## Creating Vocabulary

In [85]:
train['SMS'] = train['SMS'].str.split(' ')



In [86]:
train.head()

Unnamed: 0,Label,SMS
1078,ham,"[yep, , by, the, pretty, sculpture]"
4028,ham,"[yes, , princess, , are, you, going, to, make,..."
958,ham,"[welp, apparently, he, retired]"
4642,ham,"[havent, ]"
4674,ham,"[i, forgot, 2, ask, ü, all, smth, , , there, s..."
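
The empty strings visible in the token lists above come from splitting on a single space: wherever punctuation was replaced by a space next to an existing space, two spaces end up side by side. A minimal sketch of the difference, using a hypothetical variable `cleaned`:

```python
# Two spaces where a comma was replaced during cleaning.
cleaned = "yep  by the pretty sculpture"

# Splitting on a single space keeps an empty string between consecutive spaces.
print(cleaned.split(" "))

# Splitting with no argument collapses runs of whitespace and drops empty strings.
print(cleaned.split())
```

The notebook keeps `split(' ')`, so the empty string ends up as a vocabulary entry and contributes to the token counts. Switching to `split()` would likely change the vocabulary size and the counts reported below.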


In [0]:
vocabulary = []
for i in train['SMS']:
  for j in i:
    vocabulary.append(j)


In [0]:
vocabulary =set(vocabulary)

In [89]:
len(vocabulary)

7784

In [90]:
type(vocabulary)

set

In [0]:
vocabulary =list(vocabulary)

In [92]:
type(vocabulary)

list

In [0]:
train = train.reset_index(drop=True)

In [94]:
train.head()


Unnamed: 0,Label,SMS
0,ham,"[yep, , by, the, pretty, sculpture]"
1,ham,"[yes, , princess, , are, you, going, to, make,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,"[havent, ]"
4,ham,"[i, forgot, 2, ask, ü, all, smth, , , there, s..."


In [0]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
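
The same per-message counting can be followed on a tiny corpus. This is a sketch with made-up data; `toy_messages`, `toy_vocabulary`, and `toy_counts` are hypothetical names, and `collections.Counter` is swapped in for the manual increment.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenized messages.
toy_messages = [["free", "prize", "free"], ["see", "you", "there"]]
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# One list of per-message counts for each vocabulary word, as in the cell above.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, message in enumerate(toy_messages):
    for word, count in Counter(message).items():
        toy_counts[word][index] = count

print(toy_counts["free"])
```

Each dictionary key becomes one column and each list position one row when the dictionary is passed to `pd.DataFrame`.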

In [0]:
word_counts_per_sms  = pd.DataFrame(word_counts_per_sms)

In [0]:
word_counts_per_sms =pd.concat([word_counts_per_sms,train[['Label','SMS']]],axis = 1)

In [98]:
word_counts_per_sms.head()


Unnamed: 0,Unnamed: 1,happening,fumbling,life,eveb,loving,algorithms,suzy,220,shldxxxx,flirtparty,proper,system,dropped,den,1,cl,lead,swalpa,wahala,cool,havebeen,taxless,elections,alwys,loooooool,specs,fffff,kaitlyn,tahan,themes,pest,15pm,soooo,listed,chapter,sorting,wildlife,der,showr,...,indian,09061744553,get,restock,push,2004,worms,splash,spiritual,wahleykkum,second,cttergg,bergkamp,apples,2mro,44,correction,toothpaste,cramps,accessible,ithink,bout,call2optout,tomorro,toking,alerts,09096102316,lf56,success,printer,loveme,replys150,london,trade,tear,px3748,role,na,Label,SMS
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[yep, , by, the, pretty, sculpture]"
1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[yes, , princess, , are, you, going, to, make,..."
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[welp, apparently, he, retired]"
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[havent, ]"
4,7,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[i, forgot, 2, ask, ü, all, smth, , , there, s..."


## Calculating constants for Multinomial Naive Bayes

In [0]:
# probability of spam
p_spam = len(train[train['Label'] == 'spam'])/len(train)

In [100]:
p_spam

0.13458950201884254

In [0]:
p_ham = len(train[train['Label'] == 'ham'])/len(train)

In [102]:
p_ham

0.8654104979811574

In [103]:
train[train['Label'] == 'spam']['SMS']

16      [freemsg, why, haven, t, you, replied, to, my,...
18      [congrats, , 2, mobile, 3g, videophones, r, yo...
56      [free, message, activate, your, 500, free, tex...
60      [call, from, 08702490080, , , tells, u, 2, cal...
61      [someone, has, conacted, our, dating, service,...
                              ...                        
4437    [congratulations, you, ve, won, , you, re, a, ...
4439    [win, the, newest, , harry, potter, and, the, ...
4443    [someone, u, know, has, asked, our, dating, se...
4449    [your, chance, to, be, on, a, reality, fantasy...
4455    [u, ve, been, selected, to, stay, in, 1, of, 2...
Name: SMS, Length: 600, dtype: object

In [0]:
alpha = 1

In [0]:
train['word_length'] = train['SMS'].apply(len)

In [106]:
train.head()

Unnamed: 0,Label,SMS,word_length
0,ham,"[yep, , by, the, pretty, sculpture]",6
1,ham,"[yes, , princess, , are, you, going, to, make,...",12
2,ham,"[welp, apparently, he, retired]",4
3,ham,"[havent, ]",2
4,ham,"[i, forgot, 2, ask, ü, all, smth, , , there, s...",33


In [0]:
nspam = train[train['Label'] == 'spam']['word_length'].sum()

In [0]:
nham = train[train['Label'] == 'ham']['word_length'].sum()

In [109]:
nham

71219

In [0]:
nvocab = len(vocabulary)

In [121]:
nvocab

7784

## Calculating Parameters

In [0]:
dict_spam_param = {}
dict_ham_param = {}

In [0]:
for word in vocabulary:
  dict_spam_param[word] = 0
  dict_ham_param[word] = 0


In [123]:
len(dict_ham_param)

7784

In [113]:
train.head(3)

Unnamed: 0,Label,SMS,word_length
0,ham,"[yep, , by, the, pretty, sculpture]",6
1,ham,"[yes, , princess, , are, you, going, to, make,...",12
2,ham,"[welp, apparently, he, retired]",4


In [0]:
word_counts_per_sms_spam = word_counts_per_sms[word_counts_per_sms['Label'] == 'spam']
word_counts_per_sms_ham = word_counts_per_sms[word_counts_per_sms['Label'] == 'ham']

In [115]:
word_counts_per_sms.head(3)

Unnamed: 0,Unnamed: 1,happening,fumbling,life,eveb,loving,algorithms,suzy,220,shldxxxx,flirtparty,proper,system,dropped,den,1,cl,lead,swalpa,wahala,cool,havebeen,taxless,elections,alwys,loooooool,specs,fffff,kaitlyn,tahan,themes,pest,15pm,soooo,listed,chapter,sorting,wildlife,der,showr,...,indian,09061744553,get,restock,push,2004,worms,splash,spiritual,wahleykkum,second,cttergg,bergkamp,apples,2mro,44,correction,toothpaste,cramps,accessible,ithink,bout,call2optout,tomorro,toking,alerts,09096102316,lf56,success,printer,loveme,replys150,london,trade,tear,px3748,role,na,Label,SMS
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[yep, , by, the, pretty, sculpture]"
1,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[yes, , princess, , are, you, going, to, make,..."
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,ham,"[welp, apparently, he, retired]"


In [116]:
word_counts_per_sms_ham.loc[:,'happening'].sum()

4
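
As a worked example, the count just shown can be combined with `nham`, `nvocab`, and `alpha` from the earlier cells to estimate the probability of one word by hand. The variable names `n_happening_ham` and `p_happening_ham` are hypothetical; the numbers are copied from the outputs above, and the formula is the one applied in the loop that follows.

```python
# Values taken from the notebook outputs above.
alpha = 1
n_happening_ham = 4   # occurrences of "happening" in ham training messages
nham = 71219          # total number of tokens in ham training messages
nvocab = 7784         # vocabulary size

# Laplace-smoothed estimate of P("happening" | ham).
p_happening_ham = (n_happening_ham + alpha) / (nham + alpha * nvocab)
print(p_happening_ham)
```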

In [0]:
for word in vocabulary:
  nwspam = word_counts_per_sms_spam.loc[:,word].sum()
  nwham = word_counts_per_sms_ham.loc[:,word].sum()
  pwspam = (nwspam + alpha)/(nspam+ (alpha *nvocab))
  pwham = (nwham + alpha)/(nham+ (alpha *nvocab))
  dict_spam_param[word] = pwspam 
  dict_ham_param[word] = pwham 
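
Written as formulas, the loop above computes the following smoothed estimates for every word $w$ in the vocabulary. The symbol names are chosen here for readability: $N_{w \mid \text{Spam}}$ corresponds to `nwspam`, $N_{\text{Spam}}$ to `nspam`, $N_{\text{Vocabulary}}$ to `nvocab`, and $\alpha$ to `alpha`.

```latex
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
```

With $\alpha = 1$, a word that never occurs in one class still receives a small non-zero probability, so a single unseen word cannot drive the whole product to zero.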



## Classifying a new message

In [0]:
def classify(message):
    # Apply the same cleaning as the training data: strip punctuation, lowercase, tokenize.
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the class priors, then multiply in each known word's probability.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        # Words that never appeared in the training vocabulary are ignored.
        if word in dict_spam_param:
            p_spam_given_message *= dict_spam_param[word]
        if word in dict_ham_param:
            p_ham_given_message *= dict_ham_param[word]

    # These are unnormalized scores proportional to the posteriors.
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [125]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 4.8441099043219405e-26
P(Ham|message): 3.3551837806850006e-28
Label: Spam


In [126]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 1.099416001503406e-25
P(Ham|message): 9.431033496608998e-22
Label: Ham
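
The scores printed above are already around 1e-22 to 1e-28 for short messages, because many small probabilities are multiplied together. A common alternative, which this notebook does not use, is to add logarithms instead of multiplying. The sketch below assumes hypothetical names (`log_score`, `toy_spam_params`, `toy_ham_params`) and made-up parameter values.

```python
import math

def log_score(tokens, prior, params):
    """Sum of the log prior and log word probabilities; unknown words are skipped."""
    score = math.log(prior)
    for token in tokens:
        if token in params:
            score += math.log(params[token])
    return score

# Hypothetical toy parameters, not the values learned in this notebook.
toy_spam_params = {"winner": 0.02, "money": 0.03, "see": 0.001}
toy_ham_params = {"winner": 0.001, "money": 0.002, "see": 0.02}
tokens = ["winner", "money", "unknownword"]

spam_score = log_score(tokens, 0.13, toy_spam_params)
ham_score = log_score(tokens, 0.87, toy_ham_params)
print("spam" if spam_score > ham_score else "ham")
```

Because the logarithm is monotonic, comparing the summed logs gives the same label as comparing the products, while avoiding floating-point underflow on long messages.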


## Measuring Spam Filter's Accuracy

In [0]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in dict_spam_param:
            p_spam_given_message *= dict_spam_param[word]

        if word in dict_ham_param:
            p_ham_given_message *= dict_ham_param[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [130]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()



Unnamed: 0,Label,SMS,predicted
2131,ham,later i guess i needa do mcat study too,ham
3418,ham,but i haf enuff space got like 4 mb,ham
3424,spam,had your mobile 10 mths update to latest oran...,spam
1538,ham,all sounds good fingers makes it difficult ...,ham
5393,ham,all done all handed in don t know if mega sh...,ham


In [0]:
correct = 0
total = len(test)

In [132]:
total

1114

In [0]:
for idx,row in test.iterrows():
  if row['Label'] == row['predicted']:
    correct +=1
accuracy = correct/total



In [134]:
accuracy

0.9865350089766607
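
The row-by-row loop above can also be expressed as a single vectorized comparison. The sketch below uses a hypothetical toy frame (`toy_results`) rather than the notebook's test set.

```python
import pandas as pd

# Hypothetical toy predictions, not the notebook's test set.
toy_results = pd.DataFrame({
    "Label":     ["ham", "spam", "ham", "spam"],
    "predicted": ["ham", "spam", "ham", "ham"],
})

# Fraction of rows where the prediction matches the label.
toy_accuracy = (toy_results["Label"] == toy_results["predicted"]).mean()
print(toy_accuracy)
```

Applied to the real data, the same expression on `test['Label']` and `test['predicted']` should give the accuracy computed by the loop.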

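## Conclusion

The model classified 98.65% of the test messages correctly (1,099 of 1,114), well above the 86.8% that always predicting ham would achieve on this test set.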