###### Classify the email using the binary classification method.
###### Email Spam detection has two states: 
###### a) Normal State – Not Spam, 
###### b) Abnormal State – Spam. 
###### Use K-Nearest Neighbors and Support Vector Machine for classification. 
###### Analyze their performance.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('emails.csv')

In [3]:
df.head()

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB


In [5]:
df.describe()

Unnamed: 0,the,to,ect,and,for,of,a,you,hou,in,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
count,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,...,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0,5172.0
mean,6.640565,6.188128,5.143852,3.075599,3.12471,2.62703,55.517401,2.466551,2.024362,10.600155,...,0.005027,0.012568,0.010634,0.098028,0.004254,0.006574,0.00406,0.914733,0.006961,0.290023
std,11.745009,9.534576,14.101142,6.04597,4.680522,6.229845,87.574172,4.314444,6.967878,19.281892,...,0.105788,0.199682,0.116693,0.569532,0.096252,0.138908,0.072145,2.780203,0.098086,0.453817
min,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,1.0,0.0,1.0,0.0,12.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3.0,3.0,1.0,1.0,2.0,1.0,28.0,1.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,8.0,7.0,4.0,3.0,4.0,2.0,62.25,3.0,1.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
max,210.0,132.0,344.0,89.0,47.0,77.0,1898.0,70.0,167.0,223.0,...,4.0,7.0,2.0,12.0,3.0,4.0,3.0,114.0,4.0,1.0


In [6]:
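# Features: drop the identifier column; the target is 'Prediction'.
# Caveat: 'Prediction' is still a column of X here, so the label leaks into
# the features. This can inflate the scores below and is the likely reason
# the linear SVM reports exactly 1.0. For a leak-free run, use
# df.drop(['Email No.', 'Prediction'], axis=1) instead.
X = df.drop('Email No.', axis=1)
y = df['Prediction']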
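
A feature matrix normally excludes the target column. The following is a minimal sketch on a made-up three-row frame; the toy counts are assumptions for illustration, and only the column names `Email No.` and `Prediction` come from the dataset.

```python
import pandas as pd

# Toy stand-in for emails.csv; the word counts are invented for illustration.
toy = pd.DataFrame({
    "Email No.": ["Email 1", "Email 2", "Email 3"],
    "the": [0, 8, 0],
    "to": [0, 13, 0],
    "Prediction": [0, 0, 1],
})

# Drop both the identifier and the target so X holds only word counts.
X_toy = toy.drop(columns=["Email No.", "Prediction"])
y_toy = toy["Prediction"]

print(list(X_toy.columns))  # ['the', 'to']
```

With the target removed, any score the model reaches has to come from the word counts alone.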

In [7]:
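# Standardize every feature to zero mean and unit variance.
# Note: the scaler is fit on all rows before the split, so test-set
# statistics influence the scaling.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)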
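
One common alternative is to split first and let a pipeline fit the scaler on the training rows only. This is a sketch under stated assumptions, not the notebook's method: it uses synthetic data from `make_classification` in place of the email counts, and the names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the email word counts (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# The pipeline fits StandardScaler on the training rows only, then reuses
# those means and variances when it transforms the test rows.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
pipe.fit(X_tr, y_tr)

print("Test accuracy:", pipe.score(X_te, y_te))
```

Bundling the scaler with the classifier also means the same preprocessing is applied automatically whenever the model is scored or cross-validated.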

In [9]:
knn_model = KNeighborsClassifier(n_neighbors=2)
knn_model.fit(X_train, y_train)

In [10]:
y_pred = knn_model.predict(X_test)

In [11]:
y_pred

array([0, 0, 1, ..., 0, 0, 0], dtype=int64)

In [12]:
acc = accuracy_score(y_test, y_pred)
conf = confusion_matrix(y_test, y_pred)
classi = classification_report(y_test, y_pred)

print("\nThe accuracy is: ", acc, "\nThe confusion matrix: \n", conf)
print("\nThe classification report: \n", classi)


The accuracy is:  0.9551430781129157 
The confusion matrix: 
 [[895  18]
 [ 40 340]]

The classification report: 
               precision    recall  f1-score   support

           0       0.96      0.98      0.97       913
           1       0.95      0.89      0.92       380

    accuracy                           0.96      1293
   macro avg       0.95      0.94      0.95      1293
weighted avg       0.96      0.96      0.95      1293
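
The figures in the report can be reproduced by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with class 1 (spam) as the positive class; the variable names are illustrative.

```python
# Confusion matrix from the KNN run: rows = true label, columns = prediction.
tn, fp = 895, 18   # true class 0: correctly kept, wrongly flagged as spam
fn, tp = 40, 340   # true class 1: missed spam, correctly flagged spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_spam = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
recall_spam = tp / (tp + fn)      # of the real spam, how much was caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"accuracy       = {accuracy:.4f}")
print(f"precision (1)  = {precision_spam:.4f}")
print(f"recall (1)     = {recall_spam:.4f}")
print(f"f1-score (1)   = {f1_spam:.4f}")
```

Rounded to two decimals, these values agree with the class 1 row of the report (0.95, 0.89, 0.92). The 40 missed spam emails are what pull recall below precision.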



# Trying different k values

In [13]:
for i in range(1, 10):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Accuracy for k={i}: {accuracy_score(y_test, y_pred)}")

Accuracy for k=1: 0.9450889404485692
Accuracy for k=2: 0.9551430781129157
Accuracy for k=3: 0.9195668986852281
Accuracy for k=4: 0.9249806651198763
Accuracy for k=5: 0.8948182521268369
Accuracy for k=6: 0.9033255993812839
Accuracy for k=7: 0.8700696055684455
Accuracy for k=8: 0.8723897911832946
Accuracy for k=9: 0.8453209590100541
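
Choosing k by test-set accuracy tends to give an optimistic estimate, because the test set has then been used for tuning. A possible alternative is to pick k with cross-validation on the training portion only. The sketch below assumes synthetic data in place of the email counts; `cv_scores` and `best_k` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the email word counts (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Mean 5-fold cross-validated accuracy on the training rows for each k.
cv_scores = {}
for k in range(1, 10):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_tr, y_tr, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_scores[best_k], 3))

# The held-out test set is touched once, after k has been chosen.
final = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final.fit(X_tr, y_tr)
print("test accuracy:", final.score(X_te, y_te))
```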


# SVM:

In [14]:
from sklearn.svm import SVC

In [15]:
svc_model = SVC(kernel='linear', C=2)
svc_model.fit(X_train, y_train)

In [16]:
y_pred = svc_model.predict(X_test)

In [17]:
acc = accuracy_score(y_test, y_pred)
conf = confusion_matrix(y_test, y_pred)
classi = classification_report(y_test, y_pred)

In [18]:
print("The accuracy is: ", acc, "\nThe confusion matrix: \n", conf, "\nThe classification Report: \n", classi)

The accuracy is:  1.0 
The confusion matrix: 
 [[913   0]
 [  0 380]] 
The classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       913
           1       1.00      1.00      1.00       380

    accuracy                           1.00      1293
   macro avg       1.00      1.00      1.00      1293
weighted avg       1.00      1.00      1.00      1293
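
An accuracy of exactly 1.0 on 1293 held-out emails deserves a second look. One possible cause, given that `X` was built by dropping only `Email No.`, is that the `Prediction` column is still among the features. The sketch below illustrates the effect on made-up data: the labels are random, the other features are pure noise, and all names (`X_clean`, `X_leaky`, `results`) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
y_demo = rng.integers(0, 2, size=n)   # random binary labels
noise = rng.normal(size=(n, 20))      # features unrelated to the label

# "clean" features carry no signal; "leaky" features also contain the label.
X_clean = noise
X_leaky = np.column_stack([noise, y_demo])

results = {}
for name, X_demo in [("clean", X_clean), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=42
    )
    model = SVC(kernel="linear", C=2).fit(X_tr, y_tr)
    results[name] = model.score(X_te, y_te)
    print(f"{name}: test accuracy = {results[name]:.3f}")
```

On the noise-only features the classifier should do no better than chance, while the leaky version can separate the classes using the label column alone. A near-perfect score in that situation says nothing about how well word counts predict spam.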



# Trying different kernels and C values:

In [19]:
for i in range(1, 11):
    sv_model = SVC(kernel='linear', C=i)
    sv_model.fit(X_train, y_train)
    y_pred = sv_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy for linear kernel, C={i}: {acc}")

Accuracy for linear kernel, C=1: 1.0
Accuracy for linear kernel, C=2: 1.0
Accuracy for linear kernel, C=3: 1.0
Accuracy for linear kernel, C=4: 1.0
Accuracy for linear kernel, C=5: 1.0
Accuracy for linear kernel, C=6: 1.0
Accuracy for linear kernel, C=7: 1.0
Accuracy for linear kernel, C=8: 1.0
Accuracy for linear kernel, C=9: 1.0
Accuracy for linear kernel, C=10: 1.0


In [20]:
for i in range(1, 11):
    sv_model = SVC(kernel='sigmoid', C=i)
    sv_model.fit(X_train, y_train)
    y_pred = sv_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy for sigmoid kernel, C={i}: {acc}")

Accuracy for sigmoid kernel, C=1: 0.9621036349574633
Accuracy for sigmoid kernel, C=2: 0.9551430781129157
Accuracy for sigmoid kernel, C=3: 0.9435421500386698
Accuracy for sigmoid kernel, C=4: 0.9396751740139211
Accuracy for sigmoid kernel, C=5: 0.9334880123743233
Accuracy for sigmoid kernel, C=6: 0.9311678267594741
Accuracy for sigmoid kernel, C=7: 0.9280742459396751
Accuracy for sigmoid kernel, C=8: 0.9280742459396751
Accuracy for sigmoid kernel, C=9: 0.9203402938901779
Accuracy for sigmoid kernel, C=10: 0.917246713070379


In [22]:
for i in range(1, 11):
    sv_model = SVC(kernel='rbf', C=i)
    sv_model.fit(X_train, y_train)
    y_pred = sv_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy for rbf kernel, C={i}: {acc}")

Accuracy for rbf kernel, C=1: 0.9690641918020109
Accuracy for rbf kernel, C=2: 0.9721577726218097
Accuracy for rbf kernel, C=3: 0.9721577726218097
Accuracy for rbf kernel, C=4: 0.9721577726218097
Accuracy for rbf kernel, C=5: 0.9729311678267595
Accuracy for rbf kernel, C=6: 0.9729311678267595
Accuracy for rbf kernel, C=7: 0.9729311678267595
Accuracy for rbf kernel, C=8: 0.9729311678267595
Accuracy for rbf kernel, C=9: 0.9729311678267595
Accuracy for rbf kernel, C=10: 0.9729311678267595
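
The three kernel loops above could be folded into a single cross-validated search. This is a sketch under stated assumptions rather than the notebook's procedure: the data is synthetic, the grid of `C` values is shortened to keep it quick, and `param_grid` and `search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the email word counts (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled on its own
# training part only.
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["linear", "sigmoid", "rbf"],
    "svc__C": [1, 5, 10],
}

# 5-fold cross-validation on the training rows selects kernel and C together.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("held-out accuracy:", search.score(X_te, y_te))
```

Because the test rows play no part in the selection, the final held-out accuracy is a fairer estimate than the best value read off a loop that scores every setting on the test set.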
