# SPAM DETECTOR
(using Bayes Theorem)

#### What this notebook covers:
    * Understanding the dataset
    * Getting the Bag of Words from the dataset
    * Implementing Bag of Words from scratch
    * Implementing Bag of Words using scikit-learn
    * Training and testing the sets
    * Applying the Bag of Words processing to our dataset
    * Bayes Theorem implementation from scratch
    * Naive Bayes Theorem implementation from scratch
    * Evaluating the model
    * Conclusion
    ---
For more NLP learning: https://mhardik003.notion.site/NLP-751ad844946e499c9c64445a1254f648

#### Understanding the Dataset
    We will be using the dataset from the UCI Machine Learning Repository which has a very good collection of datasets for experimental research purposes.
    The direct data can be downloaded from : https://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

    The SMS Spam Collection v.1 (hereafter the corpus) is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

    The SMS Spam Collection v.1 (text file: smsspamcollection) has a total of 4,827 SMS legitimate messages (86.6%) and a total of 747 (13.4%) spam messages.

    The files contain one message per line. Each line is composed of two columns: one with the label (ham or spam) and the other with the raw text. Here are some examples:

        * ham   What you doing?how are you?
        * ham   Ok lar... Joking wif u oni...
        * ham   dun say so early hor... U c already then say...
        * ham   MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
        * ham   Siva is in hostel aha:-.
        * ham   Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
        * spam   FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
        * spam   Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
        * spam   URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU

    Note: messages are not chronologically sorted.



#### Downloading the libraries

In [18]:
! pip3 install -U scikit-learn


Defaulting to user installation because normal site-packages is not writeable


### Importing the Libraries 

In [3]:
import pandas as pd
import re  # regex
import pprint  # pretty print
from nltk.corpus import stopwords
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB  # naive Bayes
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [4]:
# dataset downloaded from the link above

smsdataset = pd.read_table('./Spam Detector Resources/SMSSpamCollection.txt',
                           sep='\t', header=None, names=['label', 'sms_message'])
smsdataset.head()  # gives the top 5 entries of the smsdataset


Unnamed: 0,label,sms_message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Data Preprocessing
    We convert the labels 'ham' and 'spam' to 0 and 1. scikit-learn classifiers can be trained on string labels, but numeric labels keep the later steps simpler and more predictable.

    Our model would still be able to make predictions if we left our labels as strings, but we could run into issues later when calculating performance metrics such as precision and recall, which need to know which class counts as positive. With 0/1 labels the default positive class (1, i.e. spam) is used, whereas string labels require it to be specified explicitly. To avoid these unexpected errors, it is better to feed our categorical values to the model as integers.

In [5]:
smsdataset['label'] = smsdataset.label.map({'ham': 0, 'spam': 1})
print(smsdataset.shape)  # tells the rows and columns of the dataset
smsdataset.head()


(5572, 2)


Unnamed: 0,label,sms_message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
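As a small illustration of the point about metrics, the sketch below uses made-up labels (not the SMS data) to show that `precision_score` accepts string labels when `pos_label` names the positive class, while 0/1 labels work with the defaults. Omitting `pos_label` with string labels is expected to raise a `ValueError`, since the default `pos_label=1` is not among the labels; that call is deliberately left out here.

```python
from sklearn.metrics import precision_score

# Hypothetical toy labels, not taken from the SMS corpus
y_true = ['ham', 'spam', 'spam', 'ham']
y_pred = ['ham', 'spam', 'ham', 'ham']

# With string labels the positive class has to be named explicitly
precision_str = precision_score(y_true, y_pred, pos_label='spam')

# After mapping to integers the default pos_label=1 applies
mapping = {'ham': 0, 'spam': 1}
y_true_int = [mapping[label] for label in y_true]
y_pred_int = [mapping[label] for label in y_pred]
precision_int = precision_score(y_true_int, y_pred_int)

print(precision_str, precision_int)
```

Both calls give the same precision, so the mapping changes convenience rather than the result.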


### Bag of Words (BoW)
    Bag of Words (BoW) is a term for problems in which a 'bag of words', or a collection of text data, needs to be worked with. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually, and the order in which the words occur does not matter.


    Using this, we can convert a collection of documents to a matrix, with each document being a row and each word (token) being a column, and the corresponding (row, column) values being the frequency of occurrence of each word or token in that document. This is beneficial since most ML algorithms rely on numerical data being fed into them as input, and email/SMS messages are usually text heavy.
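The document-to-matrix idea described above can be sketched on a tiny, made-up corpus. The three sentences and the variable names below are illustrative assumptions and are not part of the SMS dataset.

```python
from collections import Counter

# Hypothetical mini-corpus used only for illustration
documents = ["Hello how are you", "Win money win from home", "Call me now"]

# Lowercase and split each document into tokens
tokenized = [doc.lower().split() for doc in documents]

# The sorted vocabulary over the whole corpus fixes the column order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, and the second row holds a 2 in the `win` column because that word appears twice in the second sentence.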

##### Bag of Words from scratch

In [6]:
# text preprocessing
# NLTK's English stop-word list (requires a one-time nltk.download('stopwords'));
# a set makes the membership test below fast
stopwords_temp = set(stopwords.words('english'))

cleaned_rows = []
for _, row in smsdataset.iterrows():
    lower_case = row['sms_message'].lower()  # converting to lower case
    # replacing punctuation (anything that is not a letter or digit) with spaces
    removed_punctuation = re.sub(r"[^a-zA-Z0-9]", " ", lower_case)
    removed_punctuation_list = removed_punctuation.split()
    removed_stopwords = [
        word for word in removed_punctuation_list if word not in stopwords_temp]

    final_sms = " ".join(removed_stopwords)
    cleaned_rows.append({'label': row['label'], 'sms_message': final_sms})

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a
# list and the DataFrame is built once at the end
bagofwords_scratch = pd.DataFrame(cleaned_rows, columns=['label', 'sms_message'])


# pprint.pprint(bagofwords_scratch)


# tokenising each cleaned message; duplicates are kept so that the counts
# below reflect how often a word occurs in the message
bagofwords_scratch_list = [
    message.split() for message in bagofwords_scratch['sms_message']]

# pprint.pprint(bagofwords_scratch_list)

# one Counter (word -> frequency) per message
frequency_list_scratch = [
    Counter(message_wordlist) for message_wordlist in bagofwords_scratch_list]

# print(frequency_list_scratch)
bagofwords_scratch.head()


Unnamed: 0,label,sms_message
0,0,go jurong point crazy available bugis n great ...
1,0,ok lar joking wif u oni
2,1,free entry 2 wkly comp win fa cup final tkts 2...
3,0,u dun say early hor u c already say
4,0,nah think goes usf lives around though


#### Bag of Words using Scikit-learn

    CountVectorizer tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
    It then counts the occurrences of each of those tokens.

Please Note:

* The CountVectorizer method automatically converts all tokenized words to their lower case form so that it does not treat words like 'He' and 'he' differently. It does this using the lowercase parameter which is by default set to True.

* It also ignores all punctuation so that words followed by a punctuation mark (for example: 'hello!') are not treated differently than the same words not prefixed or suffixed by a punctuation mark (for example: 'hello'). It does this using the token_pattern parameter which has a default regular expression which selects tokens of 2 or more alphanumeric characters.

* The third parameter to take note of is the stop_words parameter. Stop words are the most commonly used words in a language, such as 'am', 'an', 'and' and 'the'. By setting this parameter to 'english', CountVectorizer will automatically ignore all words (from our input text) that are found in scikit-learn's built-in list of English stop words. The parameter also accepts a custom list, which is what the cell below does with NLTK's English stop words. This is helpful because stop words can skew our calculations when we are trying to find key words that are indicative of spam.


In [20]:
# collecting the raw messages into a list
sms_sentences = smsdataset["sms_message"].tolist()
# print(sms_sentences)

# creating the CountVectorizer instance, telling it to remove the English stop words
stopwords1 = stopwords.words("english")
count_vector_sciki = CountVectorizer(stop_words=stopwords1)
print(count_vector_sciki)

# fitting the CountVectorizer object to the sms_sentences dataset using fit()
count_vector_sciki.fit(sms_sentences)

# getting the list of words that have been selected as features using the get_feature_names_out() method
count_vector_sciki.get_feature_names_out()


CountVectorizer(stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...])


array(['00', '000', '000pes', ..., 'zyada', 'èn', 'ú1'], dtype=object)

In [8]:
# creating a matrix with each row being a message and each column being a word in the whole corpus
# the corresponding value is the frequency of occurrence of that word (the column) in a particular document (the row)

sms_array = count_vector_sciki.transform(sms_sentences).toarray()
sms_array


# converting the array obtained (sms_array) into a dataframe and set the column names to the word names
frequency_matrix = pd.DataFrame(
    sms_array, columns=count_vector_sciki.get_feature_names_out())
frequency_matrix
# basically this is the bag of words for the sms_messages dataset


Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Splitting the `smsdataset` into training and testing sets

We split our dataset into a training set and a testing set so that we can evaluate the model later on messages it has not seen.


*    `X_train` is our training data for the 'sms_message' column.
*    `y_train` is our training data for the 'label' column.
*    `X_test` is our testing data for the 'sms_message' column.
*    `y_test` is our testing data for the 'label' column.

We also print the number of rows in the training and testing data.


In [9]:
X_train, X_test, y_train, y_test = train_test_split(smsdataset['sms_message'],
                                                    smsdataset['label'],
                                                    random_state=1)

X_train_scratch, X_test_scratch, y_train_scratch, y_test_scratch = train_test_split(bagofwords_scratch['sms_message'],
                                                                                    bagofwords_scratch['label'],
                                                                                    random_state=1)


print('Number of rows in the total set: {}'.format(smsdataset.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

# print("--------\nX_train\n---------\n",X_train)
print("-------\nX_train\n-------\n", X_train_scratch)
print("-------\nY_train\n-------\n", y_train_scratch)


Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
-------
X_train
-------
 710     4mths half price orange line rental latest cam...
3740                                       stitch trouser
2711    hope enjoyed new content text stop 61610 unsub...
3155    heard u4 call 4 rude chat private line 0122358...
3748                        neva tell noe home da aft wat
                              ...                        
905     getting worried derek taylor already assumed w...
5192       oh oh den muz change plan liao go back yan jiu
3980    ceri u rebel sweet dreamz little buddy c ya 2m...
235     text meet someone sexy today u find date even ...
5157                                         k k sms chat
Name: sms_message, Length: 4179, dtype: object
-------
Y_train
-------
 710     1
3740    0
2711    1
3155    1
3748    0
       ..
905     0
5192    0
3980    0
235     1
5157    0
Name: label, Length: 4179, dtype: int64
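The 4179/1393 split shown above follows from `train_test_split` holding out 25% of the rows when neither `test_size` nor `train_size` is given. A minimal sketch, using a plain list of row numbers as a stand-in for the messages:

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 5572 messages: only the number of rows matters here
rows = list(range(5572))

# No test_size is passed, so the default 25% hold-out applies
train_rows, test_rows = train_test_split(rows, random_state=1)

print(len(train_rows), len(test_rows))
```

Passing `test_size=0.2`, for example, would hold out 20% of the rows instead.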

### Applying Bag of Words to the dataset

Using CountVectorizer()

*    Firstly, we have to fit our training data (X_train) into CountVectorizer() and return the matrix.
*    Secondly, we have to transform our testing data (X_test) to return the matrix.

Note that `X_train` is our training data for the 'sms_message' column in our dataset and we will be using this to train our model.

`X_test` is our testing data for the 'sms_message' column and this is the data we will be using(after transformation to a matrix) to make predictions on. We will then compare those predictions with `y_test` in a later step.

In [10]:
# just as done above for the whole dataset, we will be using CountVectorizer on the training and the testing sets separately

# Instantiate the CountVectorizer method
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)
# print(training_data)


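# Transform the testing data and return the matrix. Note that we only call transform() here:
# fitting the CountVectorizer on the testing data would learn a different vocabulary,
# and the feature count would no longer match what the model was trained on
testing_data = count_vector.transform(X_test)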
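Why the testing data must go through the vocabulary learned from the training data can be seen on a toy example. The sentences below are made up, and the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and testing messages
train_docs = ["free prize call now", "see you at home"]
test_docs = ["call me at home later"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)

# transform() reuses the training vocabulary, so the column counts match
test_matrix = vectorizer.transform(test_docs)

# Fitting a separate vectorizer on the test set yields a different vocabulary
refit_matrix = CountVectorizer().fit_transform(test_docs)

print(train_matrix.shape, test_matrix.shape, refit_matrix.shape)
```

Only `test_matrix` has the same number of columns as `train_matrix`, which is what a classifier trained on `train_matrix` expects. Words that never occurred in training ('me', 'later') are simply ignored.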


### Applying the Naive Bayes Theorem from scratch

In layman's terms, Bayes' theorem calculates the probability of an event occurring, based on other probabilities that are related to the event in question. It starts from a prior (the probabilities that we already know or are given) and produces the posterior (the probability we are looking to compute once the evidence has been taken into account).

To learn more go to the notion page : https://mhardik003.notion.site/NLP-751ad844946e499c9c64445a1254f648

![Probability of spam](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/ebfa3e5d-8341-4283-8176-ec6e757f3bea/Untitled.png)
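As a worked example of the prior and posterior described above, the sketch below applies Bayes' theorem to a single word. All three input probabilities are hypothetical numbers chosen for illustration and are not measured from the corpus.

```python
# Hypothetical inputs (assumptions, not values computed from the dataset)
p_spam = 0.13             # prior: P(spam)
p_word_given_spam = 0.30  # likelihood: P(word appears | spam)
p_word_given_ham = 0.02   # likelihood: P(word appears | ham)

p_ham = 1 - p_spam

# Total probability of seeing the word in any message
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_spam_given_word, 2))
```

With these numbers a message containing the word is spam with probability of about 0.69, even though only 13% of all messages are spam.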

In [11]:
# using the training data

# number of spam messages
number_spam = Counter(y_train)[1]
print("Number of spam messages in training data : ", number_spam)

# number of ham messages
number_ham = Counter(y_train)[0]
print("Number of ham messages in training data : ", number_ham)

# probability of spam messages
p_spam = float(number_spam)/(number_spam+number_ham)
print("Probability of spam messages in training data : ", p_spam)

# probability of ham messages
p_ham = float(number_ham)/(number_spam+number_ham)
print("Probability of ham messages in training data : ", p_ham)


training_bagofwords_spam = []  # bag of words containing words from spam messages
training_bagofwords_ham = []  # bag of words containing words from ham messages

for i, j in X_train_scratch.items():  # Series.iteritems() was removed in pandas 2.0
    if(y_train_scratch[i] == 0):
        training_bagofwords_ham.extend(j.split())
    elif(y_train_scratch[i] == 1):
        training_bagofwords_spam.extend(j.split())

# print(training_bagofwords)
training_bagofwords_spam_counter = Counter(training_bagofwords_spam)
training_bagofwords_ham_counter = Counter(training_bagofwords_ham)
pprint.pprint(training_bagofwords_spam_counter)
pprint.pprint(training_bagofwords_ham_counter)

false_spam = 0
true_spam = 0
false_ham = 0
true_ham = 0


for i, j in X_test_scratch.items():  # Series.iteritems() was removed in pandas 2.0
    testing_list_temp = j.split()
    p_msg_spam = p_spam
    p_msg_ham = p_ham
    for word_message in testing_list_temp:
        if(training_bagofwords_spam_counter[word_message] != 0):
            p_msg_spam = p_msg_spam * \
                training_bagofwords_spam_counter[word_message]/number_spam
        else:
            p_msg_spam = p_msg_spam * 10**-6
        if(training_bagofwords_ham_counter[word_message] != 0):
            p_msg_ham = p_msg_ham * \
                training_bagofwords_ham_counter[word_message]/number_ham
        else:
            p_msg_ham = p_msg_ham*10**-6
    if(p_msg_spam < p_msg_ham):
        if(y_test_scratch[i] == 0):
            true_ham += 1
        else:
            false_ham += 1
    else:
        if(y_test_scratch[i] == 1):
            true_spam += 1
        else:
            false_spam += 1

testing_counter = Counter(y_test)

number_actually_spam = testing_counter[1]
number_actually_ham = testing_counter[0]

# p_msg_spam=p_msg_spam
# print("probability that the message is spam : ", p_msg_spam)
# print("probability that the message is ham : ", p_msg_ham)
# print(p_msg_ham>p_msg_spam)

print("True spam : ", true_spam)
print("False spam : ", false_spam)
print("True ham : ", true_ham)
print("False ham : ", false_ham)


# ratio of the number of correct predictions to the total number of predictions
accuracy_scratch = true_spam/(len(y_test)) + true_ham/(len(y_test))
print("accuracy from scratch : ", accuracy_scratch)
# [True Positives/(True Positives + False Positives)]
precision_scratch = true_spam/(true_spam+false_spam)
print("precision from scratch : ", precision_scratch)
# [True Positives/(True Positives + False Negatives)]
sensitivity_scratch = true_spam/(number_actually_spam)
print("Recall (sensitivity) from scratch : ", sensitivity_scratch)
f1=1/(1/precision_scratch+1/sensitivity_scratch)*2
print("f1 : ", f1)


Number of spam messages in training data :  562
Number of ham messages in training data :  3617
Probability of spam messages in training data :  0.13448193347690834
Probability of ham messages in training data :  0.8655180665230916
Counter({'call': 271,
         'free': 158,
         '2': 156,
         'u': 139,
         'txt': 114,
         '4': 107,
         'ur': 101,
         'text': 99,
         'stop': 97,
         'mobile': 90,
         'claim': 88,
         'reply': 81,
         '1': 79,
         'prize': 75,
         'www': 72,
         'get': 66,
         'cash': 56,
         'uk': 55,
         'new': 48,
         '150p': 48,
         'tone': 47,
         'urgent': 47,
         'send': 46,
         'c': 46,
         'nokia': 46,
         'win': 45,
         'week': 44,
         'co': 43,
         'contact': 42,
         'guaranteed': 42,
         'please': 41,
         'service': 41,
         'com': 40,
         'customer': 40,
         'msg': 38,
         '50': 38,
         

In [17]:
print(training_data)
print(y_train)

  (0, 509)	1
  (0, 3181)	1
  (0, 5193)	1
  (0, 4781)	1
  (0, 3971)	1
  (0, 5479)	1
  (0, 3880)	1
  (0, 1572)	1
  (0, 4987)	1
  (0, 2864)	2
  (0, 3170)	1
  (0, 7424)	1
  (0, 4983)	1
  (0, 264)	1
  (0, 1552)	1
  (0, 4375)	1
  (0, 4743)	1
  (0, 50)	1
  (0, 6656)	1
  (0, 6892)	1
  (0, 4662)	1
  (0, 4779)	1
  (0, 2022)	1
  (1, 2222)	1
  (1, 7420)	1
  :	:
  (4177, 4255)	1
  (4177, 4446)	1
  (4177, 4778)	1
  (4177, 2744)	1
  (4177, 254)	1
  (4177, 5490)	1
  (4177, 2556)	1
  (4177, 4508)	1
  (4177, 6034)	1
  (4177, 6662)	1
  (4177, 307)	1
  (4177, 837)	1
  (4177, 3700)	1
  (4177, 5796)	1
  (4177, 358)	1
  (4177, 4934)	1
  (4177, 2453)	1
  (4177, 2097)	1
  (4177, 5403)	1
  (4177, 2786)	1
  (4177, 6577)	1
  (4178, 1691)	1
  (4178, 4238)	1
  (4178, 7257)	1
  (4178, 5999)	1
710     1
3740    0
2711    1
3155    1
3748    0
       ..
905     0
5192    0
3980    0
235     1
5157    0
Name: label, Length: 4179, dtype: int64
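The from-scratch classifier above multiplies raw probabilities, divides word counts by the number of messages in each class, and substitutes a fixed 10**-6 for unseen words. A common alternative, sketched here on made-up data, divides by the total number of words in the class, applies Laplace (add-one) smoothing and sums log-probabilities so that long messages do not underflow. The toy messages, the function names and `alpha=1.0` are assumptions for illustration, not the notebook's method.

```python
import math
from collections import Counter

# Hypothetical toy training data: (message, label) with 1 = spam, 0 = ham
train = [
    ("win free prize now", 1),
    ("free cash call now", 1),
    ("are you coming home", 0),
    ("see you at lunch", 0),
]

word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for message, label in train:
    word_counts[label].update(message.split())
    class_counts[label] += 1

vocabulary = set(word_counts[0]) | set(word_counts[1])


def log_posterior(message, label, alpha=1.0):
    """Unnormalised log P(label | message) with Laplace smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in message.split():
        numerator = word_counts[label][word] + alpha
        denominator = total_words + alpha * len(vocabulary)
        score += math.log(numerator / denominator)
    return score


def classify(message):
    """Return the label (0 or 1) with the higher log-posterior."""
    return max((0, 1), key=lambda label: log_posterior(message, label))


print(classify("free prize call"), classify("see you at home"))
```

Because every word receives a small non-zero probability in both classes, no hand-picked constant is needed for words that were never seen in one of them.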


### Applying the Naive Bayes Theorem using Scikit-Learn

In [13]:
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
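For reference, the same vectorise, fit and predict flow can be run end to end on a tiny made-up dataset. This is a minimal sketch with hypothetical messages, not a substitute for the cells in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam, 0 = ham
train_messages = ["win free prize now", "free cash call now",
                  "are you coming home", "see you at lunch"]
train_labels = [1, 1, 0, 0]
test_messages = ["free prize call", "see you at home"]

# Learn the vocabulary on the training messages only
toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_messages)

# Reuse that vocabulary for the test messages
test_counts = toy_vectorizer.transform(test_messages)

toy_model = MultinomialNB()
toy_model.fit(train_counts, train_labels)
toy_predictions = toy_model.predict(test_counts).tolist()

print(toy_predictions)
```

The first test message shares its words with the spam examples and the second with the ham examples, so the predictions come out as spam and ham respectively.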


### Evaluating the Model

**`Accuracy`** measures how often the classifier makes the correct prediction. It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

**`Precision`** tells us what proportion of the messages we classified as spam actually were spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all positives (all messages classified as spam, irrespective of whether that was the correct classification). In other words, it is the ratio

*[True Positives/(True Positives + False Positives)]*

**`Recall(sensitivity)`** tells us what proportion of the messages that actually were spam were classified by us as spam. It is the ratio of true positives (messages classified as spam that actually are spam) to all the messages that actually were spam. In other words, it is the ratio

*[True Positives/(True Positives + False Negatives)]*

---

For classification problems with skewed class distributions, as in our case, accuracy by itself is not a very good metric. Suppose we had 100 text messages of which only 2 were spam and the other 98 were not. We could classify 90 messages as not spam (including the 2 that were spam, which would be false negatives) and 10 as spam (all 10 false positives) and still get an accuracy of 88%, even though not a single spam message was caught. For such cases, precision and recall come in very handy. These two metrics can be combined into the F1 score, which is the harmonic mean of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score.
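The 100-message example can be worked through numerically. In the sketch below, `safe_divide` is a hypothetical helper that returns 0.0 when a ratio is undefined, which is an assumption made to keep the example runnable.

```python
def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Counts from the 100-message example: 2 spam, 98 ham
true_positives = 0    # spam correctly flagged as spam
false_positives = 10  # ham wrongly flagged as spam
false_negatives = 2   # spam that was missed
true_negatives = 88   # ham correctly passed as ham

total = true_positives + false_positives + false_negatives + true_negatives
accuracy = (true_positives + true_negatives) / total
precision = safe_divide(true_positives, true_positives + false_positives)
recall = safe_divide(true_positives, true_positives + false_negatives)
f1 = safe_divide(2 * precision * recall, precision + recall)

print(accuracy, precision, recall, f1)
```

Accuracy comes out at 0.88, while precision, recall and F1 are all 0, which is why accuracy alone is misleading on skewed data.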

In [14]:
predictions = naive_bayes.predict(testing_data)
print('Accuracy score: ', format(accuracy_score(y_test, predictions)))
print('Precision score: ', format(precision_score(y_test, predictions)))
print('Recall score: ', format(recall_score(y_test, predictions)))
print('F1 score: ', format(f1_score(y_test, predictions)))


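ValueError: X has 4056 features, but MultinomialNB is expecting 7455 features as input.

This `ValueError` is what `predict` raises when the test matrix is built with `fit_transform` rather than `transform`: the vectorizer then learns a separate 4056-word vocabulary from `X_test`, whereas the model was trained on 7455 features. Transforming `X_test` with the vectorizer that was fitted on `X_train` keeps the feature counts aligned.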