<a href="https://colab.research.google.com/github/makhijakabir/assignments-ml/blob/main/Assignment_No_03_Spam_Filtering_using_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PROBLEM STATEMENT

- The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

- The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

- Note that the notebook below loads `emails.csv`, in which the raw message is in the `text` column and the label is in the `spam` column (1 for spam, 0 for ham).


# STEP #0: LIBRARIES IMPORT


In [96]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [97]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

%matplotlib inline

In [98]:
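# Path to the email dataset on the mounted Google Drive
dataset = '/content/drive/MyDrive/Colab Notebooks/Sem 5 data/emails.csv'

# Label values used in the `spam` column
spam = 1
notSpam = 0

# Position of the first ham email (assumes all spam rows come first in the file)
spamIndex = 1368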

# STEP #1: IMPORT DATASET

In [99]:
data = pd.read_csv(dataset)

In [100]:
data

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


In [101]:
strLens = data.text.str.len()

data['len'] = strLens

In [102]:
data

Unnamed: 0,text,spam,len
0,Subject: naturally irresistible your corporate...,1,1484
1,Subject: the stock trading gunslinger fanny i...,1,598
2,Subject: unbelievable new homes made easy im ...,1,448
3,Subject: 4 color printing special request add...,1,500
4,"Subject: do not have money , get software cds ...",1,235
...,...,...,...
5723,Subject: re : research and development charges...,0,1189
5724,"Subject: re : receipts from visit jim , than...",0,1167
5725,Subject: re : enron case study update wow ! a...,0,2131
5726,"Subject: re : interest david , please , call...",0,1060


# STEP #2: VISUALIZE DATASET

In [103]:
# Let's see the longest message (43,952 characters)

indexLongest = data['len'].idxmax()  # row label of the longest email
email = data.loc[indexLongest, 'text']
email



In [104]:
# Let's divide the messages into spam and ham.
# This positional split assumes the first spamIndex rows are the spam emails.

spamEmails = data.iloc[:spamIndex, :]
hamEmails = data.iloc[spamIndex:, :]

In [105]:
spamEmails

Unnamed: 0,text,spam,len
0,Subject: naturally irresistible your corporate...,1,1484
1,Subject: the stock trading gunslinger fanny i...,1,598
2,Subject: unbelievable new homes made easy im ...,1,448
3,Subject: 4 color printing special request add...,1,500
4,"Subject: do not have money , get software cds ...",1,235
...,...,...,...
1363,Subject: are you ready to get it ? hello ! v...,1,347
1364,Subject: would you like a $ 250 gas card ? do...,1,188
1365,"Subject: immediate reply needed dear sir , i...",1,3164
1366,Subject: wanna see me get fisted ? fist bang...,1,734


In [106]:
hamEmails

Unnamed: 0,text,spam,len
1368,"Subject: hello guys , i ' m "" bugging you "" f...",0,1188
1369,Subject: sacramento weather station fyi - - ...,0,1997
1370,Subject: from the enron india newsdesk - jan 1...,0,7902
1371,Subject: re : powerisk 2001 - your invitation ...,0,3644
1372,Subject: re : resco database and customer capt...,0,5535
...,...,...,...
5723,Subject: re : research and development charges...,0,1189
5724,"Subject: re : receipts from visit jim , than...",0,1167
5725,Subject: re : enron case study update wow ! a...,0,2131
5726,"Subject: re : interest david , please , call...",0,1060
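
The positional split above only works while the file stays sorted by label. A minimal sketch of a label-based alternative, using a small hypothetical frame (`toy`) in place of the notebook's `data`:

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `data`.
toy = pd.DataFrame({
    "text": ["win cash now", "meeting at noon", "cheap pills", "lunch tomorrow?"],
    "spam": [1, 0, 1, 0],
})

# Select rows by label value rather than by position,
# so the result does not depend on row order.
spam_rows = toy[toy["spam"] == 1]
ham_rows = toy[toy["spam"] == 0]

print(len(spam_rows), len(ham_rows))
```

Applied to the real frame, the same boolean masks on the `spam` column should reproduce `spamEmails` and `hamEmails` if the file is indeed sorted as assumed.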


# STEP #3: CREATE TESTING AND TRAINING DATASET/DATA CLEANING

# STEP 3.3 COUNT VECTORIZER EXAMPLE 

In [107]:
vectorizer = CountVectorizer(stop_words='english')
all_features = vectorizer.fit_transform(data.text)
all_features.shape

(5728, 36996)
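
To make the shape above concrete, here is a minimal sketch of what `CountVectorizer` produces on three hypothetical one-line documents: one row per document, one column per vocabulary word, and each cell holding a word count. The documents and variable names are illustrative and not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; not taken from emails.csv.
docs = [
    "free prize for the winner",
    "claim the free prize",
    "meeting notes for the team",
]

toy_vectorizer = CountVectorizer(stop_words="english")
counts = toy_vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Column index assigned to the word "free", and its count in each document.
free_col = toy_vectorizer.vocabulary_["free"]
free_counts = counts[:, free_col].toarray().ravel().tolist()

print(sorted(toy_vectorizer.vocabulary_))
print(counts.toarray())
print(free_counts)
```

Common words such as "the" and "for" are dropped by `stop_words="english"`, which is why they get no column.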

In [108]:
vectorizer.vocabulary_

{'subject': 32145,
 'naturally': 23219,
 'irresistible': 18705,
 'corporate': 9986,
 'identity': 17562,
 'lt': 21006,
 'really': 27817,
 'hard': 16546,
 'recollect': 27941,
 'company': 9223,
 'market': 21520,
 'suqgestions': 32408,
 'information': 18103,
 'isoverwhelminq': 18751,
 'good': 15964,
 'catchy': 7986,
 'logo': 20818,
 'stylish': 32126,
 'statlonery': 31776,
 'outstanding': 24679,
 'website': 35805,
 'make': 21296,
 'task': 32839,
 'easier': 12539,
 'promise': 26937,
 'havinq': 16654,
 'ordered': 24447,
 'iogo': 18626,
 'automaticaily': 5740,
 'world': 36333,
 'ieader': 17579,
 'isguite': 18729,
 'ciear': 8594,
 'products': 26835,
 'effective': 12742,
 'business': 7477,
 'organization': 24485,
 'practicable': 26421,
 'aim': 4338,
 'hotat': 17240,
 'nowadays': 23820,
 'marketing': 21528,
 'efforts': 12759,
 'list': 20662,
 'clear': 8769,
 'benefits': 6385,
 'creativeness': 10221,
 'hand': 16470,
 'original': 24510,
 'logos': 20821,
 'specially': 31356,
 'reflect': 28097,
 ...

In [109]:
xTrain, xTest, yTrain, yTest = train_test_split(all_features, data.spam, test_size=0.3, random_state=88)
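
In the cells above the vectorizer is fitted on every email before the split, so the vocabulary also reflects the test emails. A minimal sketch of one alternative, on hypothetical toy data: split the raw text first, fit the vectorizer on the training portion only, then reuse it to transform the test portion. The `stratify` argument is an addition not used in the notebook; it keeps the spam/ham ratio the same in both parts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 4 spam (1) and 4 ham (0) messages.
texts = [
    "free prize now", "claim free cash", "cheap pills online", "win money fast",
    "project meeting today", "agenda attached", "lunch tomorrow", "quarterly report draft",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text before building any features.
raw_train, raw_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=88, stratify=labels
)

vec = CountVectorizer(stop_words="english")
x_train = vec.fit_transform(raw_train)  # vocabulary learned from training text only
x_test = vec.transform(raw_test)        # test text mapped onto that same vocabulary

print(x_train.shape, x_test.shape)
```

Words that appear only in the test text are ignored by `transform`, which mirrors what happens when the model later sees unseen emails.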

# STEP #4: TRAINING THE MODEL ON THE TRAINING SET

In [110]:
classifier = MultinomialNB()
classifier.fit(xTrain, yTrain)

MultinomialNB()
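
As a rough illustration of what the fitted classifier does, the sketch below trains `MultinomialNB` on four hypothetical messages and classifies two new ones. The model scores each class using word counts seen in training, with add-one smoothing (`alpha=1.0`, the library default) so that unseen words do not zero out a class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages: 1 = spam, 0 = ham.
texts = [
    "free prize now",
    "free cash prize",
    "project meeting today",
    "meeting agenda attached",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
nb = MultinomialNB(alpha=1.0)
nb.fit(vec.fit_transform(texts), labels)

# Classify two unseen messages using the vocabulary learned above.
new_messages = ["free prize", "meeting today"]
pred = nb.predict(vec.transform(new_messages))

print(list(pred))
```

"free" and "prize" occur only in the spam examples and "meeting" and "today" only in the ham examples, so the first message is assigned to spam and the second to ham.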

In [111]:
# Count how many test emails the classifier labels correctly
nrCorrect = (yTest == classifier.predict(xTest)).sum()
nrInCorrect = yTest.size - nrCorrect

print("The number of correctly classified emails is:", nrCorrect)
print("The number of incorrectly classified emails is:", nrInCorrect)

The number of correctly classified emails is: 1697
The number of incorrectly classified emails is: 22


In [112]:
fracWrong = nrInCorrect / (nrCorrect + nrInCorrect)
print('The testing accuracy of the model is:', 1-fracWrong)

The testing accuracy of the model is: 0.9872018615474113


# STEP #5: EVALUATING THE MODEL

In [113]:
classifier.score(xTest, yTest)

0.9872018615474113

In [114]:
recall_score(yTest, classifier.predict(xTest))

0.9876543209876543

In [115]:
precision_score(yTest, classifier.predict(xTest))

0.9592326139088729

In [116]:
f1_score(yTest, classifier.predict(xTest))

0.9732360097323601
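
The scores above can be tied back to confusion-matrix counts. The sketch below assumes 400 true positives, 5 false negatives and 17 false positives among the 1,719 test emails. These counts are inferred from the reported scores and are not printed by the notebook; they are the counts consistent with 22 errors in total.

```python
# Assumed counts, inferred from the reported scores (spam is the positive class).
tp, fn, fp = 400, 5, 17
n_test = 1697 + 22          # correctly + incorrectly classified test emails
tn = n_test - tp - fn - fp  # remaining emails are ham correctly kept as ham

recall = tp / (tp + fn)        # share of actual spam that was caught
precision = tp / (tp + fp)     # share of flagged emails that really were spam
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / n_test

print(recall, precision, f1, accuracy)
```

Under these assumed counts, most of the 22 errors are legitimate emails flagged as spam (17) rather than spam that slipped through (5), which is why precision is lower than recall.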

# STEP #6: LET'S ADD AN ADDITIONAL FEATURE: TF-IDF

In [118]:
# Raw email texts as a NumPy array
doc = data['text'].values

# Build TF-IDF features, dropping English stop words
tf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tf.fit_transform(doc)
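
To show how TF-IDF differs from raw counts, here is a minimal sketch on three hypothetical documents. With the default settings, a word that appears in more documents gets a lower inverse document frequency, and each row is scaled to unit length. The corpus and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; "free" appears in two documents, "prize" in one.
docs = [
    "free prize free entry",
    "free meeting agenda",
    "project meeting notes",
]

toy_tf = TfidfVectorizer()
weights = toy_tf.fit_transform(docs)  # sparse matrix: documents x vocabulary

# Inverse document frequency of a common word versus a rarer one.
idf_free = toy_tf.idf_[toy_tf.vocabulary_["free"]]
idf_prize = toy_tf.idf_[toy_tf.vocabulary_["prize"]]

# Each row is L2-normalised by default.
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(idf_free, idf_prize)
print(row_norms)
```

A matrix like `tfidf_matrix` could then be split and passed to `MultinomialNB` in the same way as the count features, although the notebook does not show that step.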