# Spam Mail Classification


Welcome to our NLP Spam Mail Detection Project! In an age where digital communication has become ubiquitous, the ever-increasing volume of unsolicited and deceptive emails, commonly known as spam, poses a significant threat to our online experience. Our project aims to harness the power of Natural Language Processing (NLP) to build a robust and efficient system that can accurately identify and filter out spam emails from legitimate ones.

At its core, NLP enables computers to process human language, which makes it well suited to detecting spam written to evade simple keyword filters. This notebook takes a standard approach: it converts each email into TF-IDF features and trains a logistic regression classifier on them. Retraining the model on newer labeled mail is how it would keep up with new spamming tactics.

We train the model on a labeled dataset of emails (`mail_data.csv`), split into a training set and a held-out test set. Accuracy is measured on both splits. The goal is to minimize false positives (legitimate mail flagged as spam) while still catching most spam.

With this project, we envision empowering individuals and organizations to reclaim their inboxes and focus on the emails that truly matter. Together, let us embrace the power of NLP to create a spam-free digital ecosystem that fosters communication, collaboration, and security.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
data = pd.read_csv("mail_data.csv")
data

,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
...,...,...,...,...
5166,1518,ham,Subject: put the 10 on the ft\r\nthe transport...,0
5167,404,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...,0
5168,2933,ham,Subject: calpine daily gas nomination\r\n>\r\n...,0
5169,1409,ham,Subject: industrial worksheets for august 2000...,0


In [3]:
# Drop the leftover index column that was saved into the CSV.
data.drop(['Unnamed: 0'], axis=1, inplace=True)

In [4]:
# Drop the numeric label column (ham=0, spam=1); the text label is re-encoded in a later cell.
data.drop(['label_num'], axis=1, inplace=True)

In [5]:
data

,label,text
0,ham,Subject: enron methanol ; meter # : 988291\r\n...
1,ham,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,spam,"Subject: photoshop , windows , office . cheap ..."
4,ham,Subject: re : indian springs\r\nthis deal is t...
...,...,...
5166,ham,Subject: put the 10 on the ft\r\nthe transport...
5167,ham,Subject: 3 / 4 / 2000 and following noms\r\nhp...
5168,ham,Subject: calpine daily gas nomination\r\n>\r\n...
5169,ham,Subject: industrial worksheets for august 2000...


In [6]:
data.isnull().sum()

label    0
text     0
dtype: int64

In [7]:
# Encode the labels as numbers: spam -> 0, ham -> 1.
data.loc[data['label'] == "spam", "label"] = 0
data.loc[data['label'] == "ham", "label"] = 1

In [8]:
data

,label,text
0,1,Subject: enron methanol ; meter # : 988291\r\n...
1,1,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,1,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,0,"Subject: photoshop , windows , office . cheap ..."
4,1,Subject: re : indian springs\r\nthis deal is t...
...,...,...
5166,1,Subject: put the 10 on the ft\r\nthe transport...
5167,1,Subject: 3 / 4 / 2000 and following noms\r\nhp...
5168,1,Subject: calpine daily gas nomination\r\n>\r\n...
5169,1,Subject: industrial worksheets for august 2000...


In [9]:
x=data['text']
y=data['label']

In [17]:
# Hold out 20% of the mails for testing; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
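The split above is purely random. When one class is much rarer than the other, a stratified split keeps the spam/ham ratio the same in both sets. The following is a minimal sketch on invented toy data (the `toy` frame and the `tx_*`/`ty_*` names are hypothetical, not part of this notebook), using the `stratify` parameter of `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with the notebook's encoding (0 = spam, 1 = ham); the values are invented.
toy = pd.DataFrame({
    'text': [f'message {i}' for i in range(20)],
    'label': [0] * 5 + [1] * 15,
})

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy['text'], toy['label'],
    test_size=0.2, random_state=0,
    stratify=toy['label'],  # keep the spam/ham ratio equal in both splits
)

# Count how many spam (0) and ham (1) mails landed in each split.
print(ty_train.value_counts().to_dict())
print(ty_test.value_counts().to_dict())
```

In this notebook the equivalent change would be passing `stratify=y` to the existing call, which may shift the reported accuracies slightly.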

In [23]:
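# Convert text to TF-IDF features: lowercase the text and drop English stop words.
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Fit the vocabulary on the training mails only, then reuse it for the test mails.
x_train_feature = feature_extraction.fit_transform(x_train)
x_test_feature = feature_extraction.transform(x_test)

# The 0/1 labels were assigned into a string column, so cast them to int.
y_train = y_train.astype('int')
y_test = y_test.astype('int')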
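To make the `fit_transform` / `transform` distinction concrete, here is a small sketch on an invented toy corpus (the `toy_train` and `toy_test` texts are made up, not taken from `mail_data.csv`). It shows that the test matrix reuses the vocabulary learned from the training texts, and that words never seen during fitting are simply ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
toy_train = [
    "cheap software offer, buy now",
    "meeting notes for the gas nomination",
    "buy cheap pills now",
]
toy_test = ["cheap gas offer from unknownvendor"]

vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training texts.
toy_train_features = vectorizer.fit_transform(toy_train)
# transform only applies what was learned; it never adds new vocabulary.
toy_test_features = vectorizer.transform(toy_test)

# Both matrices have one column per vocabulary term.
print(toy_train_features.shape, toy_test_features.shape)
# Stop words such as 'the' are dropped; unseen words are not in the vocabulary.
print('the' in vectorizer.vocabulary_, 'unknownvendor' in vectorizer.vocabulary_)
```

Fitting on the training split only, as the cell above does, keeps information about the test mails out of the vocabulary and IDF weights.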

In [25]:

# Train a logistic regression classifier on the TF-IDF features.
lr = LogisticRegression()
lr.fit(x_train_feature, y_train)
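As an alternative to keeping the vectorizer and the classifier as separate objects, the two steps can be bundled with scikit-learn's `make_pipeline`, so that raw text goes in and a label comes out. This is a sketch on invented toy data, not the notebook's own method; `spam_pipeline`, `toy_texts`, and `toy_labels` are hypothetical names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy mails with the notebook's encoding (0 = spam, 1 = ham); the texts are invented.
toy_texts = [
    "cheap pills offer",
    "cheap software offer",
    "pills software discount",
    "gas meter report",
    "gas nomination report",
    "meter nomination schedule",
]
toy_labels = [0, 0, 0, 1, 1, 1]

# The pipeline applies TF-IDF first and feeds the result to the classifier.
spam_pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words='english', lowercase=True),
    LogisticRegression(),
)
spam_pipeline.fit(toy_texts, toy_labels)

# predict accepts raw strings; vectorization happens inside the pipeline.
toy_predictions = spam_pipeline.predict(["cheap pills discount", "gas meter schedule"])
print(toy_predictions)
```

One benefit of this design is that the fitted vocabulary and the model cannot get out of sync, since they are fitted and stored together.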

In [26]:

# Accuracy on the training set (optimistic, since the model has already seen this data).
prediction_of_training = lr.predict(x_train_feature)
accuracy = accuracy_score(y_train, prediction_of_training)
print('accuracy of training data : ', accuracy)

accuracy of training data :  0.9961315280464217


In [27]:
# Accuracy on the held-out test set, which the model has not seen during training.
prediction_of_testing = lr.predict(x_test_feature)
accuracy = accuracy_score(y_test, prediction_of_testing)
print('accuracy of testing data : ', accuracy)

accuracy of testing data :  0.9884057971014493
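Accuracy alone does not show how many legitimate mails were flagged as spam, which is the false-positive concern raised in the introduction. A confusion matrix separates the two kinds of error. The sketch below uses invented labels and predictions (`y_true` and `y_pred` are hypothetical); in this notebook the same calls would presumably take `y_test` and `prediction_of_testing`:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels and predictions (0 = spam, 1 = ham), invented for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# labels=[0, 1] fixes the order: rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
spam_caught, spam_missed = cm[0]   # actual spam: predicted spam / predicted ham
ham_flagged, ham_kept = cm[1]      # actual ham: predicted spam / predicted ham

print('spam caught:', spam_caught, '| spam missed:', spam_missed)
print('ham flagged as spam:', ham_flagged, '| ham kept:', ham_kept)

# Per-class precision and recall in one table.
print(classification_report(y_true, y_pred, labels=[0, 1], target_names=['spam', 'ham']))
```

Here `ham_flagged` counts the false positives in the spam-filter sense, that is, legitimate mail that would have been filtered out.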


In [29]:
input_mail=['''Subject: re : indian springs
this deal is to book the teco pvr revenue . it is my understanding that teco
just sends us a check , i haven ' t received an answer as to whether there is a
predermined price associated with this deal or if teco just lets us know what
we are giving . i can continue to chase this deal down if you need ''']
input_data_feature = feature_extraction.transform(input_mail)

prediction = lr.predict(input_data_feature)

if prediction[0] == 0:  # predict returns an array; 0 = spam, 1 = ham
    print('The mail is SPAM')
else:
    print('The mail is NOT SPAM')

The mail is NOT SPAM
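The prediction step above can be wrapped in a small helper that also reports how confident the model is. The following is a sketch only: `classify_mail` is a hypothetical function, and the toy vectorizer and model are trained on invented texts so that the block runs on its own. In the notebook, `feature_extraction` and `lr` would be passed instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_mail(text, vectorizer, model, spam_label=0):
    """Return ('SPAM' or 'NOT SPAM', estimated spam probability) for one mail."""
    features = vectorizer.transform([text])
    # Look up which predict_proba column belongs to the spam class.
    spam_column = list(model.classes_).index(spam_label)
    spam_probability = float(model.predict_proba(features)[0, spam_column])
    label = 'SPAM' if model.predict(features)[0] == spam_label else 'NOT SPAM'
    return label, spam_probability


# Toy training data (0 = spam, 1 = ham), invented for illustration.
toy_texts = [
    "cheap pills offer",
    "cheap software offer",
    "pills software discount",
    "gas meter report",
    "gas nomination report",
    "meter nomination schedule",
]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

label, probability = classify_mail("cheap pills discount", toy_vectorizer, toy_model)
print(label, round(probability, 3))
```

Returning the probability alongside the label leaves room for a stricter spam threshold later, if flagging legitimate mail turns out to be the costlier mistake.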
