## Dependencies

## Introduction

<p><b>Algorithms we're considering</b></p>
<p>Naive Bayes</p>
<p>Support Vector Machines</p>
<p>K Nearest Neighbors</p>

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
path = "spam_ham_dataset.csv"
data = pd.read_csv(path)
data

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5171 entries, 0 to 5170
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5171 non-null   int64 
 1   label       5171 non-null   object
 2   text        5171 non-null   object
 3   label_num   5171 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 161.7+ KB


In [28]:
# data.info() reports 5171 non-null entries for every column, so there are no null values

## Data Cleaning

In [29]:
import nltk # natural language toolkit
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer # reduce each word to its dictionary base form (lemma)
from sklearn.feature_extraction.text import TfidfVectorizer  # convert documents to TF-IDF feature vectors
import string



nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [30]:
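lemmatizer = WordNetLemmatizer()
stopwords_set = set(stopwords.words('english'))
# Build the punctuation-stripping table once instead of on every iteration
punctuation_table = str.maketrans('', '', string.punctuation)

words = []

for raw_text in data['text']:
    # Casefold, strip punctuation, and split on whitespace
    tokens = raw_text.casefold().translate(punctuation_table).split()
    # Drop stopwords and lemmatize the remaining words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_set]
    words.append(' '.join(tokens))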
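To make the cleaning steps concrete, here is a minimal sketch that applies the same casefold, punctuation-stripping, and stopword-removal steps to a single string. It uses only the standard library: `TOY_STOPWORDS`, `clean_text`, and the sample message are hypothetical stand-ins (the notebook uses NLTK's English stopword list), and lemmatization is deliberately left out.

```python
import string

# Hypothetical mini stopword list, standing in for NLTK's English list.
TOY_STOPWORDS = {"the", "is", "for", "on", "a", "of", "this"}


def clean_text(text, stopwords=TOY_STOPWORDS):
    """Casefold, strip ASCII punctuation, drop stopwords, and rejoin."""
    text = text.casefold()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


# Made-up message, not taken from the dataset.
sample = "Subject: Cheap software!!! Buy now, this offer is for you."
cleaned = clean_text(sample)
print(cleaned)  # subject cheap software buy now offer you
```

Note that stripping punctuation before the stopword check turns contractions such as "don't" into "dont", which would no longer match an entry containing an apostrophe.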

## Vectorize Words

In [31]:
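vectorizer = TfidfVectorizer()

# fit_transform returns a sparse matrix; it is converted to a dense array here
# because GaussianNB (used below) does not accept sparse input
X = vectorizer.fit_transform(words).toarray()
# Target: 0 = ham, 1 = spam
y = data['label_num']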
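As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF rows by hand for a tiny made-up corpus. It assumes the weighting that, to my understanding, matches `TfidfVectorizer`'s defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization per document). The function name `tfidf` and the corpus are hypothetical, and the tokenization here is a plain whitespace split rather than scikit-learn's token pattern.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset.
docs = ["cheap software offer", "gas nomination meter", "cheap gas offer offer"]


def tfidf(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized weights."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])
    return vocab, rows


vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first document, "software" appears in only one document while "cheap" appears in two, so "software" receives the larger weight.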

### Split Data into Training and Test Data

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [33]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4136, 47906)
(4136,)
(1035, 47906)
(1035,)
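The shapes above can be sanity-checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to the training set (my understanding of its behavior when only `test_size` is passed), and that the TF-IDF values are stored as 8-byte floats. It also estimates how much memory the dense matrix produced by `.toarray()` occupies.

```python
import math

n_samples = 5171    # rows reported by data.info()
n_features = 47906  # second dimension of X_train.shape
test_size = 0.2

# Assumed rounding rule: test set rounded up, training set takes the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 4136 1035

# Dense float64 storage for the full feature matrix X.
bytes_per_value = 8
dense_bytes = n_samples * n_features * bytes_per_value
print(f"{dense_bytes / 2**30:.2f} GiB")
```

At close to 2 GiB for the full matrix, keeping the vectorizer output sparse may be worth considering for the models that accept sparse input.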


## Naive Bayes

#### Gaussian Naive Bayes

In [40]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
GNB_model = GaussianNB()
# Model training
GNB_model.fit(X_train, y_train)
GNB_model.score(X_train, y_train)

0.988394584139265

In [41]:
# on test data
GNB_model.score(X_test, y_test)

0.9468599033816425

#### Multinomial Naive Bayes

In [42]:
MNB_model = MultinomialNB()
# Model training
MNB_model.fit(X_train, y_train)
MNB_model.score(X_train, y_train)

0.9608317214700194

In [43]:
# on test data
MNB_model.score(X_test, y_test)

0.9217391304347826

#### Bernoulli Naive Bayes

In [44]:
BNB_model = BernoulliNB()
# Model training
BNB_model.fit(X_train, y_train)
BNB_model.score(X_train, y_train)

0.8740328820116054

In [45]:
# on test data
BNB_model.score(X_test, y_test)

0.8473429951690822

## Support Vector Machines

In [38]:
# Import the SVM module
from sklearn import svm

# Create an SVM classifier with default settings
clf = svm.SVC()

# Train the model on the training set, then report training accuracy
clf.fit(X_train, y_train)
clf.score(X_train, y_train)


1.0

In [39]:
# Performance on test data
clf.score(X_test, y_test)

0.9855072463768116
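One caveat about the numbers in this notebook: the vectorizer was fit on the full corpus before the train/test split, so vocabulary and IDF statistics from the test messages were visible during feature construction. A possible way to avoid that, sketched here on a made-up toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The texts, labels, and variable names below are hypothetical, and this is not how the scores above were produced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = spam, 0 = ham.
texts = [
    "cheap software offer buy now",
    "gas nomination for january",
    "win cheap pills offer today",
    "meter reading for the pipeline",
    "buy cheap watches now",
    "daily gas nomination update",
    "exclusive offer buy software",
    "industrial worksheet for august",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so the test messages never reach the vectorizer's fit.
train_texts, test_texts, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(), SVC())
model.fit(train_texts, y_tr)
preds = model.predict(test_texts)
print(preds, model.score(test_texts, y_te))
```

Because the pipeline passes the sparse TF-IDF matrix straight to `SVC`, no `.toarray()` call is needed in this variant.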

## K Nearest Neighbors

In [36]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.score(X_train, y_train)

0.9765473887814313

In [37]:
# Performance on test data
knn.score(X_test, y_test)

0.9603864734299516

## Results

| Model | Variant | Accuracy on Test |
| :---------------- | :------: | ----: |
| Naive Bayes | Gaussian Naive Bayes | 0.9469 |
| | Multinomial Naive Bayes | 0.9217 |
| | Bernoulli Naive Bayes | 0.8473 |
| SVM | | 0.9855 |
| K Nearest Neighbors | | 0.9604 |
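The table reports test accuracy only. As a small supplement, the sketch below collects the training and test accuracies printed in the cells above, ranks the models by test accuracy, and shows the train/test gap along with the approximate number of misclassified test messages out of 1035. The dictionary layout and variable names are my own; the numbers are copied from the notebook outputs.

```python
# (train accuracy, test accuracy) as printed by the cells above
scores = {
    "Gaussian Naive Bayes": (0.988394584139265, 0.9468599033816425),
    "Multinomial Naive Bayes": (0.9608317214700194, 0.9217391304347826),
    "Bernoulli Naive Bayes": (0.8740328820116054, 0.8473429951690822),
    "SVM": (1.0, 0.9855072463768116),
    "K Nearest Neighbors": (0.9765473887814313, 0.9603864734299516),
}
n_test = 1035  # size of the test set, from y_test.shape

# Rank models from best to worst test accuracy.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
errors = {name: round((1 - test) * n_test) for name, (_, test) in scores.items()}

for name, (train, test) in ranked:
    print(
        f"{name:<24} train={train:.4f} test={test:.4f} "
        f"gap={train - test:.4f} errors={errors[name]}"
    )
```

On these figures the SVM ranks first, and every model scores somewhat higher on the training data than on the test data.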
