<a href="https://colab.research.google.com/github/khojwar/Machine-Learning-and-Deep-Learning/blob/main/NLP/Spam_Filtering_using_Bag_of_Word_(BoW).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset source: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import string

### Read Data

`string.punctuation` is a string containing all ASCII punctuation characters: ``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``

`str.maketrans('', '', string.punctuation)` creates a translation table that maps each character in `string.punctuation` to `None`. This effectively removes all punctuation characters from the string.
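As a quick illustration of the translation table described above, here is a minimal sketch on a made-up sentence. The `sample` string and the `remove_punct` name are illustrative assumptions, not part of the dataset or the notebook.

```python
import string

# Build the table once; the third argument lists the characters to delete.
remove_punct = str.maketrans("", "", string.punctuation)

# Hypothetical message, used only to show the effect.
sample = "WINNER!! Call 0800-123-456 now, it's FREE..."

# Lowercase first, then drop every punctuation character.
cleaned = sample.lower().translate(remove_punct)
print(cleaned)  # winner call 0800123456 now its free
```

Note that removing punctuation also fuses tokens such as phone numbers and URLs into single words, which is visible in the cleaned messages printed later.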

In [4]:
# read the dataset
spam_df = pd.read_csv('/content/spam.csv', encoding="ISO-8859-1")

# keep the label and message columns and give them descriptive names
spam_df = spam_df[["v1", "v2"]].rename(columns={"v1": "spam", "v2": "text"})

# convert the label column to boolean (True = spam)
spam_df["spam"] = spam_df["spam"] == "spam"

# lowercase everything and remove punctuation (build the translation table once)
punct_table = str.maketrans("", "", string.punctuation)
spam_df["text"] = spam_df["text"].apply(lambda t: t.lower().translate(punct_table))

# shuffle the rows (no random_state is set, so the order differs between runs)
spam_df = spam_df.sample(frac=1)

In [5]:
spam_df

| | spam | text |
|---|---|---|
| 4333 | False | boo what time u get out u were supposed to tak... |
| 4506 | False | he neva grumble but i sad lor hee buy tmr lor ... |
| 1902 | False | my sister got placed in birla soft da |
| 2442 | False | i donno if they are scorable |
| 4061 | False | hi dear we saw dear we both are happy where yo... |
| ... | ... | ... |
| 4200 | False | wylie update my weed dealer carlos went to fre... |
| 772 | False | idc get over here you are not weaseling your w... |
| 5400 | False | hard but true how much you show amp  express y... |
| 170 | False | sir i need axis bank account no and bank address |


`.iloc` *selects rows and columns from a DataFrame* or Series by integer position, regardless of the index labels.
> `.iloc`: Integer-based (positional) indexing.

*   e.g. `df.iloc[0, 1]` *(select the first row and the second column)*
*   e.g. `df.iloc[0:2, 1:3]` *(select a range of rows and columns; the end of each slice is excluded)*

> `.loc`: Label-based indexing, e.g. `df.loc[0, 'B']`
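The difference matters here because the shuffled DataFrame keeps its original index labels, so position and label no longer agree. A minimal sketch on a toy frame (the frame `df` and its values are made up for illustration):

```python
import pandas as pd

# Toy frame whose index labels (7, 8, 9) differ from the positions (0, 1, 2).
df = pd.DataFrame({"A": [10, 20, 30], "B": ["x", "y", "z"]}, index=[7, 8, 9])

by_position = df.iloc[0, 1]   # first row, second column, by position
by_label = df.loc[7, "B"]     # row labelled 7, column "B", by label
block = df.iloc[0:2, 0:2]     # rows 0-1 and columns 0-1 (slice end excluded)

print(by_position, by_label)  # x x
print(block.shape)            # (2, 2)
```

With this frame, `df.loc[0, "B"]` would raise a `KeyError`, because no row carries the label `0`.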

In [6]:
# print the first 5 spam messages
for t in spam_df[spam_df["spam"]].iloc[:5].text:
    print(t)
    print("------------")

what do u want for xmas how about 100 free text messages  a new video phone with half price line rental call free now on 0800 0721072 to find out more
------------
rct thnq adrian for u text rgds vatian
------------
important information 4 orange user 0789xxxxxxx today is your lucky day2find out why log onto httpwwwurawinnercom theres a fantastic surprise awaiting you
------------
4mths half price orange line rental  latest camera phones 4 free had your phone 11mths  call mobilesdirect free on 08000938767 to update now or2stoptxt
------------
bored housewives chat n date now 08717507711 btnational rate 10pmin only from landlines
------------


In [7]:
# print the first 5 non-spam messages
for t in spam_df[~spam_df["spam"]].iloc[:5].text:
    print(t)
    print("------------")

boo what time u get out u were supposed to take me shopping today 
------------
he neva grumble but i sad lor hee buy tmr lor aft lunch but we still meetin 4 lunch tmr a not neva hear fr them lei ìï got a lot of work ar
------------
my sister got placed in birla soft da
------------
i donno if they are scorable
------------
hi dear we saw dear we both are happy where you my battery is low
------------


### Split the dataset into train and test set

In [8]:
# get the training set
train_spam_df = spam_df.iloc[:int(len(spam_df)*0.7)]

# get the testing set
test_spam_df = spam_df.iloc[int(len(spam_df)*0.7):]

In [9]:
# prior probability P(SPAM); note it is estimated from the full dataset, test rows included
FRAC_SPAM_TEXTS = spam_df["spam"].mean()
print(FRAC_SPAM_TEXTS)

0.13406317300789664


### Create Spam Bag of Words and Non-Spam Bag of Words

The purpose of creating a "Spam Bag of Words" and a "Non-Spam Bag of Words" is to *convert text messages into numerical features* (here, per-class word frequencies) that a classifier can work with.

In [10]:
# get all words from the spam and non-spam training messages
# (note: split(' ') keeps empty strings wherever two spaces are adjacent)
train_spam_words = ' '.join(train_spam_df[train_spam_df['spam']].text).split(' ')
train_non_spam_words = ' '.join(train_spam_df[~train_spam_df['spam']].text).split(' ')

# find the words that occur in both the spam and the non-spam messages
common_words = set(train_spam_words).intersection(train_non_spam_words)

Next, calculate the relative frequency of each word in `common_words` within the spam messages, and likewise within the non-spam messages. This ***creates a Bag of Words (BoW) representation*** with normalized term frequencies, which serve as the per-class word probabilities for classification.
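The relative-frequency idea can be sketched with `collections.Counter`, which tallies every word in a single pass, whereas calling `list.count` inside a loop rescans the whole list once per word. The token list below is a hypothetical stand-in for the real training words.

```python
from collections import Counter

# Hypothetical spam tokens, standing in for the real training words.
spam_tokens = "free call now free prize".split()

counts = Counter(spam_tokens)   # word -> number of occurrences
total = len(spam_tokens)        # total number of tokens

# Relative frequency: occurrences divided by the total token count.
rel_freq = {word: count / total for word, count in counts.items()}

print(rel_freq["free"])  # 0.4
print(rel_freq["call"])  # 0.2
```

Restricting the result to the common words would then be a dictionary comprehension over that set.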

An alternative weighting is TF-IDF (Term Frequency-Inverse Document Frequency), which takes into account the ***importance of words within the entire corpus*** and is often more effective for text classification tasks.

In [11]:
train_spam_bow = dict()

# relative frequency of each common word in spam messages:
# occurrences of the word divided by the total number of words in spam messages
for w in common_words:
    train_spam_bow[w] = train_spam_words.count(w) / len(train_spam_words)

In [12]:
train_non_spam_bow = dict()

for w in common_words:
    train_non_spam_bow[w] = train_non_spam_words.count(w) / len(train_non_spam_words)

### Predict on Test Set

# $P(\text{SPAM} \mid \text{"urgent please call this number"})$
# $\propto P(\text{"urgent please call this number"} \mid \text{SPAM}) \times P(\text{SPAM})$
# $= P(\text{"urgent"} \mid \text{SPAM}) \times P(\text{"please"} \mid \text{SPAM}) \times \dots \times P(\text{SPAM})$

The last step assumes that the words are conditionally independent given the class (the "naive" Bayes assumption).

### To avoid numerical underflow from multiplying many small probabilities, equivalently compute:

# $\log(P(\text{"urgent"} \mid \text{SPAM}) \times P(\text{"please"} \mid \text{SPAM}) \times \dots \times P(\text{SPAM}))$
# $= \log(P(\text{"urgent"} \mid \text{SPAM})) + \log(P(\text{"please"} \mid \text{SPAM})) + \dots + \log(P(\text{SPAM}))$
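To see why the logarithm helps, here is a small sketch with hypothetical per-word probabilities. The value `1e-5` and the count of 100 words are assumptions chosen only to trigger the effect.

```python
import math

# Hypothetical: 100 words, each with probability 1e-5 under some class.
probs = [1e-5] * 100

# The direct product is far below the smallest positive float and becomes 0.0.
product = math.prod(probs)

# The sum of logs stays a perfectly ordinary number.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -1151.29
```

Because the logarithm is monotonic, comparing the two log scores gives the same decision as comparing the original products would, without the underflow.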

The `predict_text` function ***predicts whether a given text is spam or non-spam*** by comparing the log-probability scores of its words under the spam and non-spam Bag of Words (BoW) representations.

In [16]:
def predict_text(t, verbose=False):
    """Return True if the list of words `t` scores at least as high for spam as for non-spam."""
    # disregard any word that is not in the BOWs (both are keyed by the same common words)
    valid_words = [w for w in t if w in train_spam_bow]

    # get the probabilities of each valid word showing up in the spam and non-spam BOW
    spam_probs = [train_spam_bow[w] for w in valid_words]
    non_spam_probs = [train_non_spam_bow[w] for w in valid_words]

    # print the per-word probabilities if requested
    if verbose:
        data_df = pd.DataFrame()
        data_df["word"] = valid_words
        data_df["spam_probs"] = spam_probs
        data_df["non_spam_probs"] = non_spam_probs
        data_df["ratio"] = [s/n if n>0 else np.inf for s, n in zip(spam_probs, non_spam_probs)]
        print(data_df)

    # spam score: sum of log word probabilities plus the log prior P(SPAM)
    spam_score = sum(np.log(p) for p in spam_probs) + np.log(FRAC_SPAM_TEXTS)

    # non-spam score: sum of log word probabilities plus the log prior P(NOT SPAM)
    non_spam_score = sum(np.log(p) for p in non_spam_probs) + np.log(1 - FRAC_SPAM_TEXTS)

    # if verbose, report the two scores
    if verbose:
        print("Spam score: ", spam_score)
        print("Non_spam_score: ", non_spam_score)

    # mark the text as spam if the spam score is at least as high
    return spam_score >= non_spam_score



In [17]:
predict_text('urgent call this number'.split(), verbose=True)

     word  spam_probs  non_spam_probs      ratio
0  urgent    0.003961        0.000041  95.696007
1    call    0.019886        0.003436   5.787879
2    this    0.004991        0.003394   1.470451
3  number    0.001901        0.000931   2.041515
Spam score:  -23.02356605704421
Non_spam_score:  -28.57426883634924


True

In [18]:
predict_text('hey do you want to go a movie tonight'.split(), verbose=True)

      word  spam_probs  non_spam_probs     ratio
0      hey    0.000317        0.001470  0.215653
1       do    0.001347        0.004967  0.271139
2      you    0.016321        0.025272  0.645811
3     want    0.001505        0.002256  0.667238
4       to    0.036524        0.022271  1.639995
5       go    0.001743        0.003374  0.516641
6        a    0.021154        0.016103  1.313668
7    movie    0.000158        0.000207  0.765568
8  tonight    0.000079        0.000973  0.081443
Spam score:  -59.00155438061725
Non_spam_score:  -50.78711215789316


False

In [19]:
predict_text('offer for unlimited money call now'.split(), verbose=True)

        word  spam_probs  non_spam_probs      ratio
0      offer    0.001347        0.000062  21.691095
1        for    0.011171        0.006954   1.606326
2  unlimited    0.000634        0.000083   7.655681
3      money    0.000238        0.000704   0.337751
4       call    0.019886        0.003436   5.787879
5        now    0.010458        0.004264   2.452791
Spam score:  -37.30034183613151
Non_spam_score:  -42.588685132213584


True

In [20]:
predict_text('are you at class yet'.split(), verbose=True)

  word  spam_probs  non_spam_probs     ratio
0  are    0.004833        0.005423  0.891215
1  you    0.016321        0.025272  0.645811
2   at    0.001505        0.005381  0.279727
3  yet    0.000158        0.000745  0.212658
Spam score:  -26.70589436565434
Non_spam_score:  -21.465962175516808


False

In [22]:
predictions = test_spam_df["text"].apply(lambda t: predict_text(t.split()))

The code below calculates the true positive rate (TPR, also called recall) of the classifier's predictions. The TPR *measures the proportion of actual positive (spam) instances that are correctly classified as positive (spam) by the model.*

In [27]:
frac_spam_messages_correctly_detected = np.sum((predictions == True) & (test_spam_df.spam == True)) / np.sum(test_spam_df.spam == True)
print(f"Fraction Spam Correctly Detected : {frac_spam_messages_correctly_detected}")

Fraction Spam Correctly Detected : 0.918918918918919


The code below calculates the False Positive Rate (FPR) of the classifier's predictions. The FPR *measures the proportion of actual negative (non-spam) instances that are incorrectly classified as positive (spam) by the model.*

* `(predictions == True) & (test_spam_df.spam == False)` creates a Boolean mask that is *True where the prediction is True (predicted as spam) and the actual label in the `test_spam_df` DataFrame is False (non-spam)*.

* `np.sum((predictions == True) & (test_spam_df.spam == False))` *counts the True values in that Boolean mask*. In other words, it counts the number of valid messages that were predicted as spam.

* `np.sum(test_spam_df.spam == False)` counts the total number of valid (non-spam) messages in the `test_spam_df` DataFrame.
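To see these masks in isolation, here is a sketch on hypothetical arrays. The `actual` and `predicted` values are made up, and the true positive rate is computed alongside for comparison.

```python
import numpy as np

# Hypothetical labels: 3 spam messages followed by 5 valid ones.
actual = np.array([True, True, True, False, False, False, False, False])
# Hypothetical predictions: one spam message missed, one valid message flagged.
predicted = np.array([True, True, False, True, False, False, False, False])

# True positive rate: spam predicted as spam / all actual spam.
tpr = np.sum(predicted & actual) / np.sum(actual)

# False positive rate: valid predicted as spam / all actual valid.
fpr = np.sum(predicted & ~actual) / np.sum(~actual)

print(tpr)  # 2 of 3 spam messages caught
print(fpr)  # 1 of 5 valid messages flagged
```

On Boolean arrays, `predicted & ~actual` is equivalent to the `(predictions == True) & (... == False)` form used in the notebook.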

In [26]:
frac_valid_sent_to_spam = np.sum((predictions == True) & (test_spam_df.spam == False)) / np.sum(test_spam_df.spam == False)
print(f"Fraction Valid Messages Sent to Spam: {frac_valid_sent_to_spam}")

Fraction Valid Messages Sent to Spam: 0.01793103448275862
