<h2>Spam Mail Prediction With Support Vector Machine

<h3>In this notebook, we build a Support Vector Machine (SVM) classifier that predicts whether a message is spam

<h4>Importing Libraries

In [1]:
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split #Splitting data
from sklearn.feature_extraction.text import TfidfVectorizer #Used to extract features from text
from sklearn.svm import LinearSVC #SVM model
from sklearn.metrics import accuracy_score #Evaluation of model

<h4>Reading the data

In [2]:
#Loading the dataset
df = pd.read_csv('spamham.csv')
df.head()

,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
#Replace null values with empty strings
df = df.where((pd.notnull(df)), '')
df.shape

(5572, 2)

Our data contains 5572 rows and 2 columns: Category (the label) and Message (the text we extract features from)

<h4>Let's label spam mail as 0 and non-spam mail (ham) as 1

In [4]:
df.loc[df['Category'] == 'spam', 'Category'] = 0
df.loc[df['Category'] == 'ham', 'Category'] = 1
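
The same encoding can also be written as a single `Series.map` call, which produces integer labels directly instead of leaving the column with `object` dtype. The sketch below is only an illustration: the `toy` frame, its rows, and the `Label` column name are made up and are not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical rows).
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", "WIN a free prize now", "ok thanks"],
})

# Same encoding as in the notebook: spam -> 0, ham -> 1.
label_map = {"spam": 0, "ham": 1}
toy["Label"] = toy["Category"].map(label_map)

print(toy["Label"].tolist())
```

Any category missing from `label_map` would become `NaN`, so a typo in the labels shows up immediately rather than silently staying a string.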

In [5]:
#Separating feature (message) and label (spam or ham) for model
x = df['Message']
y = df['Category']

In [6]:
x.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: Message, dtype: object

In [7]:
y.head()

0    1
1    1
2    0
3    1
4    1
Name: Category, dtype: object

<h3>Splitting data for training and testing

In [8]:
#We will use 20% of data for testing and 80% for training
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=3)
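
Spam datasets are often imbalanced, so a purely random split can leave the test set with a different spam-to-ham ratio than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (80 ham, 20 spam) purely to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (1) and 20 spam (0).
labels = pd.Series([1] * 80 + [0] * 20)
texts = pd.Series([f"message {i}" for i in range(100)])

# stratify=labels keeps the ham/spam ratio the same in both splits.
t_train, t_test, l_train, l_test = train_test_split(
    texts, labels, test_size=0.2, random_state=3, stratify=labels
)

train_spam = int((l_train == 0).sum())
test_spam = int((l_test == 0).sum())
print(train_spam, test_spam)
```

With these toy labels, one fifth of each split is spam, matching the ratio in the full set.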

<h4>Feature Extraction

In [9]:
#Transforming the text data to feature vectors that can be used as input to the SVM model using TfidfVectorizer
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)  #lowercase expects a boolean, not the string 'True'
x_train_features = feature_extraction.fit_transform(x_train)
x_test_features = feature_extraction.transform(x_test)

#Converting values of y_train and y_test to integer
y_train = y_train.astype('int')
y_test = y_test.astype('int')
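
The cell above fits the vectorizer on the training messages only and then reuses it on the test messages. As a rough illustration of why that matters, the sketch below uses a tiny made-up corpus (`train_docs` and `test_docs` are hypothetical): words that appear only at test time are not in the learned vocabulary and are simply ignored, and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, for illustration only.
train_docs = [
    "free prize call now",
    "are you coming home",
    "call me when you are free",
]
test_docs = ["free lottery prize"]  # "lottery" never appears in train_docs

vec = TfidfVectorizer(stop_words="english", lowercase=True)
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them without refitting

print(train_matrix.shape, test_matrix.shape)
print("lottery" in vec.vocabulary_)
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.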

<h3>Modelling (Support Vector Machine)

In [10]:
svm = LinearSVC()

<h4>Training the model with training data

In [11]:
svm.fit(x_train_features, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

<h3>Evaluation

In [12]:
#Prediction on training data
train_pred = svm.predict(x_train_features)

#Accuracy
train_acc = accuracy_score(y_train, train_pred)

In [13]:
print(f'Accuracy on training data: {train_acc}')

Accuracy on training data: 0.9993269015032533


<h4>The model classifies about 99.9% of the training messages correctly

In [14]:
#Prediction on testing data
test_pred = svm.predict(x_test_features)

#Accuracy
test_acc = accuracy_score(y_test, test_pred)

In [15]:
print(f'Accuracy on testing data: {test_acc}')

Accuracy on testing data: 0.9820627802690582


<h4>The model also performs well on the held-out test data, with about 98.2% accuracy
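
Accuracy alone can hide how the errors are distributed between the two classes. A confusion matrix together with precision and recall for the spam class gives a fuller picture. The sketch below uses made-up labels and predictions (`y_true` and `y_hat` are hypothetical, not the notebook's results) with the same encoding, 0 for spam and 1 for ham.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (0 = spam, 1 = ham).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_hat = [1, 1, 1, 1, 1, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, ordered [spam, ham].
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# pos_label=0 because spam is encoded as 0 in this notebook.
spam_precision = precision_score(y_true, y_hat, pos_label=0)
spam_recall = recall_score(y_true, y_hat, pos_label=0)

print(cm)
print(spam_precision, spam_recall)
```

In this toy case one of the three spam messages is missed, so spam recall is 2/3 even though overall accuracy is 90%. The same calls could be applied to `y_test` and `test_pred`.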

<h4>Prediction on a random mail

In [17]:
random_mail = ['We tried to contact you re your reply to our offer of a Video Handset? 750 anytime networks mins? UNLIMITED TEXT? Camcorder? Reply or call 08000930705 NOW']

#Extracting features from the text
random_mail_features = feature_extraction.transform(random_mail)

#Predicting
rand_pred = svm.predict(random_mail_features)
print(rand_pred)

if rand_pred[0] == 1:
    print('HAM MAIL')
else:
    print('SPAM MAIL')

[0]
SPAM MAIL
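
Because every new message has to pass through the same vectorizer before it reaches the classifier, the two steps can be bundled into a single scikit-learn pipeline. This is an optional alternative to the cells above, sketched on a tiny made-up corpus (`messages`, `targets`, and `spam_clf` are hypothetical names), so its predictions say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus (0 = spam, 1 = ham), for illustration only.
messages = [
    "WIN a FREE prize, reply NOW to claim",
    "URGENT! claim your free cash prize today",
    "free entry to win cash, reply now",
    "are we still meeting for lunch tomorrow",
    "can you send me the lecture notes",
    "running late, see you at the station",
]
targets = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes raw text and then classifies it in one call.
spam_clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
spam_clf.fit(messages, targets)

pred = spam_clf.predict(["claim your free prize now"])
print("HAM MAIL" if pred[0] == 1 else "SPAM MAIL")
```

With a pipeline, raw strings go straight into `predict`, which removes the risk of forgetting the `transform` step or using a differently fitted vectorizer.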
