
Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Importing Dataset

In [2]:
df = pd.read_csv('spam.csv', encoding="ISO-8859-1")

In [3]:
df.head()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
df.tail()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
5571,ham,Rofl. Its true to its name,,,


In [5]:
df.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [6]:
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [7]:
df = df.drop_duplicates(keep = 'first')

In [9]:
df.v1.value_counts()

ham     4516
spam     653
Name: v1, dtype: int64
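The counts above show a clear class imbalance, which is worth keeping in mind when reading the accuracy figures in the evaluation section. A minimal sketch, using only the two counts printed above (the variable names are illustrative and not part of the notebook), of the accuracy a classifier would reach by always predicting the majority class:

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4516, "spam": 653}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(f"Rows after deduplication: {total}")
print(f"Always predicting '{majority_label}' gives {baseline_accuracy:.1%} accuracy")
```

Any trained model should be judged against this baseline rather than against 0%.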

In [10]:
df.replace({'v1' : {'spam' : 0, 'ham' : 1}}, inplace=True)

In [11]:
df.head()

,v1,v2
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."


In [12]:
X = df['v2']
Y = df['v1']

In [13]:
X.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: v2, dtype: object

In [14]:
Y.head()

0    1
1    1
2    0
3    1
4    1
Name: v1, dtype: int64

Splitting the Dataset into Training and Test Sets

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2,random_state = 3)
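As a quick sanity check on the split sizes, the arithmetic can be sketched by hand. This assumes that scikit-learn rounds the test portion up and gives the remainder to the training set, which is my understanding of `train_test_split` when only `test_size` is given; the variable names are illustrative.

```python
import math

# Rows left after drop_duplicates(): 4516 ham + 653 spam
n_samples = 4516 + 653
test_size = 0.2

# Assumption: the test portion is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The training size computed this way agrees with the 4135 rows reported for `X_train_features` further down.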

In [16]:
X_train.head()

4443                       COME BACK TO TAMPA FFFFUUUUUUU
982     Congrats! 2 mobile 3G Videophones R yours. cal...
3822    Please protect yourself from e-threats. SIB ne...
3924       As if i wasn't having enough trouble sleeping.
4927    Just hopeing that wasnÛ÷t too pissed up to re...
Name: v2, dtype: object

In [17]:
Y_train.head()

4443    1
982     0
3822    1
3924    1
4927    1
Name: v1, dtype: int64

In [18]:
X_test.head()

4994    Just looked it up and addie goes back Monday, ...
4292    You best watch what you say cause I get drunk ...
4128                 Me i'm not workin. Once i get job...
4429          Yar lor... How u noe? U used dat route too?
660     Under the sea, there lays a rock. In the rock,...
Name: v2, dtype: object

In [19]:
Y_test.head()

4994    1
4292    1
4128    1
4429    1
660     1
Name: v1, dtype: int64

Feature Extraction

In [23]:
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english')

Converting X_train and X_test to TF-IDF feature vectors

In [41]:
X_train_features = feature_extraction.fit_transform(X_train)

X_test_features = feature_extraction.transform(X_test)
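The vectorizer is fitted on the training text only and then reused to transform the test text, so both matrices share one vocabulary. A minimal sketch on made-up toy messages (not taken from `spam.csv`) shows the effect: a word that appears only in the test text gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages, invented for illustration only
train_docs = ["win a free prize today", "free entry to win cash", "lunch at home today"]
test_docs = ["free prize tonight"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the training text
train_matrix = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; unseen words such as "tonight" are ignored
test_matrix = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.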

Converting Y_train and Y_test values to integers

In [42]:
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [26]:
X_train.head()

4443                       COME BACK TO TAMPA FFFFUUUUUUU
982     Congrats! 2 mobile 3G Videophones R yours. cal...
3822    Please protect yourself from e-threats. SIB ne...
3924       As if i wasn't having enough trouble sleeping.
4927    Just hopeing that wasnÛ÷t too pissed up to re...
Name: v2, dtype: object

In [27]:
X_train_features

<4135x7378 sparse matrix of type '<class 'numpy.float64'>'
	with 31488 stored elements in Compressed Sparse Row format>

Training the Model using Logistic Regression

In [28]:
Model = LogisticRegression()

In [29]:
Model.fit(X_train_features,Y_train)
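For intuition, a fitted binary logistic regression scores a message as a weighted sum of its features plus a bias, then passes that score through a sigmoid to get a probability. The sketch below uses hypothetical weights and a two-feature vector purely for illustration; it is not the fitted `Model`, and `predict_label` is a made-up helper name.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability of class 1), where 1 = ham and 0 = spam."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    probability_ham = sigmoid(score)
    label = 1 if probability_ham >= threshold else 0
    return label, probability_ham

# Hypothetical weights: feature 0 is a "spammy" word, feature 1 a "hammy" word
weights = [-2.0, 1.5]
bias = 0.5

spam_like = predict_label([0.8, 0.0], weights, bias)  # score = -1.1
ham_like = predict_label([0.0, 0.6], weights, bias)   # score = 1.4

print(spam_like, ham_like)
```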

Evaluation of the Model:

Prediction on Training Data

In [43]:
Training_data_Model_prediction = Model.predict(X_train_features)
Training_data_Model_prediction

array([1, 1, 1, ..., 1, 1, 1])

In [31]:
# Accuracy of the training data
acc_score_for_training_data = accuracy_score(Y_train,Training_data_Model_prediction)
print("Accuracy of Training Data : ",round(acc_score_for_training_data*100, 3))

Accuracy of Training Data :  96.227


Prediction on Testing Data

In [44]:
Testing_data_Model_prediction = Model.predict(X_test_features)
Testing_data_Model_prediction

array([1, 1, 1, ..., 1, 0, 1])

In [45]:
# Accuracy of the testing data
acc_score_for_testing_data = accuracy_score(Y_test,Testing_data_Model_prediction)
print("Accuracy of Testing Data : ",round(acc_score_for_testing_data*100, 3))

Accuracy of Testing Data :  96.035
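Because ham outnumbers spam by roughly seven to one after deduplication, accuracy alone can hide missed spam. The sketch below uses made-up toy labels (not the notebook's predictions) to show how a confusion matrix plus precision and recall for the spam class (label 0) could complement the accuracy score; the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Toy labels for illustration only: 1 = ham, 0 = spam
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 1, 0, 1, 1, 1]  # one spam message is missed

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [spam, ham]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Treat spam (label 0) as the class of interest
spam_precision = precision_score(y_true, y_pred, pos_label=0)
spam_recall = recall_score(y_true, y_pred, pos_label=0)

print(accuracy, spam_precision, spam_recall)
print(matrix)
```

In this toy case the accuracy is 87.5% even though half of the spam is missed, which is the kind of gap the extra metrics are meant to expose. The same calls could be applied to `Y_test` and `Testing_data_Model_prediction`.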


Testing the Prediction Model

In [34]:
io_mail = ["Your order will be shipped within 2-3 business days. You will receive a shipping notification email with tracking information once your order has shipped."]

In [35]:
# Convert this text into feature vectors
io_mail_features = feature_extraction.transform(io_mail)

# Make a prediction (1 = ham, 0 = spam)
prediction = Model.predict(io_mail_features)
print(prediction)
if prediction[0] == 1:
    print("It is a HAM mail")
else:
    print("It is a SPAM mail")

[1]
It is a HAM mail


In [38]:
io_mail = ["Congratulations! You have been randomly selected to receive a free iPhone 14. To claim your prize, simply click on the following link and enter your shipping information"]

In [39]:
# Convert this text into feature vectors
io_mail_features = feature_extraction.transform(io_mail)

# Make a prediction (1 = ham, 0 = spam)
prediction = Model.predict(io_mail_features)
print(prediction)
if prediction[0] == 1:
    print("It is a HAM mail")
else:
    print("It is a SPAM mail")

[0]
It is a SPAM mail
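The two checks above repeat the same transform-then-predict steps, so they could be wrapped in a small helper. The sketch below is one possible shape, not part of the original notebook: `classify_message` is a hypothetical name, and the tiny vectorizer and model are trained on invented toy messages only so that the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_message(text, vectorizer, model):
    """Return 'HAM' or 'SPAM' for one message (1 = ham, 0 = spam)."""
    features = vectorizer.transform([text])
    label = int(model.predict(features)[0])
    return "HAM" if label == 1 else "SPAM"

# Toy training data, invented for illustration only
toy_texts = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch today",
    "see you at home tonight",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

result = classify_message("claim your free prize", toy_vectorizer, toy_model)
print(result)
```

In the notebook itself the same helper would be called with the already fitted objects, for example `classify_message(io_mail[0], feature_extraction, Model)`.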
