<a href="https://colab.research.google.com/github/prabhudc/DQ/blob/master/DQ_Spam_Filter_with_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Naive Bayes Spam Filter on SMS messages

This project is to implement a multinomian Naive Bayes classifier to differentiate between spam and non-spam. The classifier will be run on an SMS dataset containing 5572 messages. The data was source from the [UCI Machine Learning Library](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

  import pandas.util.testing as tm


### Preperation

In [2]:
df_data = pd.read_csv("/content/drive/My Drive/DataScience/files/SMSSpamCollection",sep="\t",header=None,names=['Label', 'SMS'])
df_data.sample(n=5)

Unnamed: 0,Label,SMS
2488,ham,K ill drink.pa then what doing. I need srs mod...
1159,ham,Hey! There's veggie pizza... :/
2539,ham,The monthly amount is not that terrible and yo...
980,ham,Another month. I need chocolate weed and alcohol.
1572,ham,Near kalainar tv office.thenampet


In [3]:
round((df_data['Label'].value_counts()/df_data.shape[0])*100,2)

ham     86.59
spam    13.41
Name: Label, dtype: float64

We see that over 86 % of the messages are good messages and the remaining little over 13 percent are spam messages

Now, we randomize the dataset and split it into a train and test set. 80% of records are set as train and the remaining as test.

In [4]:
df_data = df_data.sample(frac=1,random_state=1)
df_train =  df_data.iloc[:int(df_data.shape[0]*0.8),:]
df_test =  df_data.iloc[int(df_data.shape[0]*0.8):,:]

Let's check if the distribution of labels is the same for the newly created train and test set too.

In [5]:
print(((df_train['Label'].value_counts()/df_train.shape[0]))*100)
print(((df_test['Label'].value_counts()/df_test.shape[0]))*100)

ham     86.53803
spam    13.46197
Name: Label, dtype: float64
ham     86.816143
spam    13.183857
Name: Label, dtype: float64


Indeed the distributions are similar to the original dataset

### Data pre-processing

1. Remove punctuations
2. Convert all words to lower-case

In [6]:
df_train.iloc[:,1] = df_train['SMS'].apply(lambda x : re.sub('\W',' ',x).lower()).copy()
df_test.iloc[:,1] = df_test['SMS'].apply(lambda x : re.sub('\W',' ',x).lower()).copy()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value


Next step is to create a vocabulary by collecting unique words across all SMS messages.

In [7]:
vocabulary_set = set()
dummy = df_train['SMS'].apply(lambda x :  [vocabulary_set.add(elem) for elem in x.split()] )

In [8]:
print(len(vocabulary_set))

7782


There are totally 7782 words in the vocabulary

Now we collect the counts of occurrences of each word from the vocabulary in each of the messages

In [9]:
list_of_counts = []

def get_word_count(sms_text):
  # Get the word-counts in each text and save it in a dictionary
  word_count_dict = dict(pd.Series(sms_text.split()).value_counts())
  # Go oevr each word in the vocabular and collect the counts
  vocab_dict = dict()
  for elem in vocabulary_set:
    if word_count_dict.get(elem) != None:
      vocab_dict[elem] = word_count_dict.get(elem)
    else:
      vocab_dict[elem] = 0  
  list_of_counts.append(vocab_dict)

dummy = df_train['SMS'].apply(lambda x  : get_word_count(x))
df_word_counts = pd.DataFrame(list_of_counts)




  """


In [10]:
df_word_counts.head()

Unnamed: 0,constantly,film,sis,ploughing,wan2,password,planned,enter,aren,hopeso,yan,children,times,opportunity,responsibility,ros,created,sake,kissing,hogli,goals,acc,bras,dudette,northampton,flowers,wee,desires,miss,vegas,understood,cafe,tightly,petey,drivby,biz,4t,yeovil,any,5th,...,2u,ou,appointments,arestaurant,alfie,jacket,affections,everyone,recently,sambar,retired,dent,ee,minmobsmorelkpobox177hp51fl,reg,shipped,cosign,82050,slave,benefits,boys,changing,w1,emotion,ic,govt,necessary,watevr,patty,billed,secret,80062,eire,slovely,refreshed,put,hop,forced,avenue,thank
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [11]:
df_word_counts.shape


(4457, 7782)

As seen above, 4457 is the number of records in the training set and 7782 is the length of the vocabulary.

We combine the word counts with the human-assigned labels

In [27]:
df_train_counts = pd.concat( [df_train[['Label']].reset_index(drop=True), df_word_counts],axis=1)
df_train_counts.head()

Unnamed: 0,Label,constantly,film,sis,ploughing,wan2,password,planned,enter,aren,hopeso,yan,children,times,opportunity,responsibility,ros,created,sake,kissing,hogli,goals,acc,bras,dudette,northampton,flowers,wee,desires,miss,vegas,understood,cafe,tightly,petey,drivby,biz,4t,yeovil,any,...,2u,ou,appointments,arestaurant,alfie,jacket,affections,everyone,recently,sambar,retired,dent,ee,minmobsmorelkpobox177hp51fl,reg,shipped,cosign,82050,slave,benefits,boys,changing,w1,emotion,ic,govt,necessary,watevr,patty,billed,secret,80062,eire,slovely,refreshed,put,hop,forced,avenue,thank
0,ham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,ham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,ham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,ham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,ham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


We now create a function that could delivery the probabilities given a dataset of word-counts per message

In [183]:
def get_overall_probabilities(df_counts):
  total_word_count = df_counts.iloc[:,1:].sum().sum()
  spam_word_count = df_counts[df_counts['Label'] == 'spam'].iloc[:,1:].sum().sum()
  ham_word_count = df_counts[df_counts['Label'] == 'ham'].iloc[:,1:].sum().sum()
  p_spam = spam_word_count/len(vocabulary_set)
  p_ham = ham_word_count/len(vocabulary_set)
  return (p_spam,p_ham)

Define a function to get the probabilities for each word being spam or ham

In [179]:
def get_word_probabilities(df_counts,vocabulary_set,alpha=1):
  total_word_count = df_counts.iloc[:,1:].sum().sum()
  spam_word_count = df_counts[df_counts['Label'] == 'spam'].iloc[:,1:].sum().sum()
  ham_word_count = total_word_count - spam_word_count
  # No. of time each word occurs in a spam/ham message
  ham_w_count = df_train_counts[df_train_counts['Label'] == 'ham'].iloc[:,1:].apply(func=sum,  axis=0)
  spam_w_count = df_train_counts[df_train_counts['Label'] == 'spam'].iloc[:,1:].apply(func=sum,  axis=0)
  # Probabilities of each word being a spam and as a ham
  p_w_spam = (spam_w_count + alpha)/(spam_word_count + alpha * len(vocabulary_set))
  p_w_ham = (ham_w_count + alpha)/(ham_word_count + alpha * len(vocabulary_set))
  df_word_probabilities = pd.DataFrame([p_w_spam,p_w_ham],index=['spam','ham'])
  return  df_word_probabilities   

In [184]:
# Get overall probabilities
p_spam,p_ham = get_overall_probabilities(df_train_counts)
# Get word-level probabilities of being spam or ham
df_word_probabilities = get_word_probabilities(df_train_counts,vocabulary_set,alpha=1)

Define a text classifier that relies on the parameters determined above

In [157]:
def classify(raw_text):
  cleansed_text = re.sub('\W',' ',raw_text).lower()
  word_list = cleansed_text.split()
  p_spam_given_message = p_spam
  p_ham_given_message = p_ham
  
  # For each word in the text, calculate spam/ham probabilities
  for word in word_list:  
    if vocabulary_set.issuperset([word]) == True:
      p_spam_given_message *= df_word_probabilities[word]['spam']
      p_ham_given_message *= df_word_probabilities[word]['ham']    
  # else do nothing
    
  if p_ham_given_message > p_spam_given_message:
    return 'ham'
  elif p_spam_given_message > p_ham_given_message:
    return 'spam'
  else:
    return 'needs human classification'

Now classify and determine the model-accuracy

In [185]:
train_classified = df_train['SMS'].apply(classify)
train_accuracy = np.sum(df_train['Label'] == train_classified)*100/df_train.shape[0]

test_classified = df_test['SMS'].apply(classify)
test_accuracy = np.sum(df_test['Label'] == test_classified)*100/df_test.shape[0]

### Conclustion

In [186]:
print('Training accuracy : {:.2f}'.format(train_accuracy))
print('Test accuracy : {:.2f}'.format(test_accuracy))

Training accuracy : 99.10
Test accuracy : 98.65


The Naive classifier was able to predict at an accuracy of 98.65 on the test dataset