# Email-Spam-Classifier Project
In this project we train a model to classify whether an email is spam or not.

## Importing the data and Data Cleaning
* Import the data into a pandas DataFrame
* Check for Null values and handle them
* Check for Duplicate rows and remove them
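
The three steps above can be sketched on a tiny stand-in DataFrame before running them on the real file. The rows below are hypothetical and are not taken from `emails.csv`; only the column names `text` and `spam` match the notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for emails.csv
toy = pd.DataFrame({
    "text": ["Subject: win money", "Subject: meeting notes",
             "Subject: win money", "Subject: hello"],
    "spam": ["1", "0", "1", None],
})

print(toy.isnull().sum())             # null count per column
toy_clean = toy.dropna()              # drop rows that contain a null value
print(toy_clean.duplicated().sum())   # number of fully duplicated rows
toy_clean = toy_clean.drop_duplicates()
print(toy_clean.shape)
```

Dropping nulls before checking duplicates mirrors the order used in the cells that follow.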

In [1]:

import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("emails.csv")

In [3]:
df

Unnamed: 0,text,spam,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 100,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106,Unnamed: 107,Unnamed: 108,Unnamed: 109
0,Subject: naturally irresistible your corporate...,1,,,,,,,,,...,,,,,,,,,,
1,Subject: the stock trading gunslinger fanny i...,1,,,,,,,,,...,,,,,,,,,,
2,Subject: unbelievable new homes made easy im ...,1,,,,,,,,,...,,,,,,,,,,
3,Subject: 4 color printing special request add...,1,,,,,,,,,...,,,,,,,,,,
4,"Subject: do not have money , get software cds ...",1,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5726,"Subject: re : receipts from visit jim , than...",0,,,,,,,,,...,,,,,,,,,,
5727,Subject: re : enron case study update wow ! a...,0,,,,,,,,,...,,,,,,,,,,
5728,"Subject: re : interest david , please , call...",0,,,,,,,,,...,,,,,,,,,,
5729,Subject: news : aurora 5 . 2 update aurora ve...,0,,,,,,,,,...,,,,,,,,,,


In [4]:
df.shape

(5731, 110)

In [5]:
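# Keep only the text and spam columns; .copy() avoids SettingWithCopyWarning on later in-place edits
df = df.iloc[:, 0:2].copy()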

In [6]:
df.isnull().sum()

text    0
spam    2
dtype: int64

In [7]:
# Dropping the 2 rows that have null values
df.dropna(inplace = True)

In [8]:
df.duplicated().sum()

33

In [9]:
# There are 33 duplicate rows; dropping them
df.drop_duplicates(inplace=True)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5696 entries, 0 to 5730
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5696 non-null   object
 1   spam    5696 non-null   object
dtypes: object(2)
memory usage: 133.5+ KB


In [11]:
# Removing rows whose spam column contains text instead of a single-character 0/1 label
df["spam"] = df['spam'].apply(lambda x: "text_result" if len(x) > 1 else x)
df = df[df["spam"] != "text_result"]

In [12]:
# Converting the data type of spam column from object type to int
df['spam'] = df['spam'].astype('int')
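
An alternative to filtering on string length is to let pandas decide what counts as a number. The sketch below is not the notebook's method; it assumes stray labels are arbitrary non-numeric strings and uses a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame: the last label is stray text instead of 0/1
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "spam": ["1", "0", " some stray text"],
})

# Non-numeric labels become NaN, which can then be dropped like any other null
toy["spam"] = pd.to_numeric(toy["spam"], errors="coerce")
toy = toy.dropna(subset=["spam"]).astype({"spam": "int"})
print(toy)
```

Unlike the length check, this also keeps valid multi-digit or padded numeric strings, which may or may not be desirable for a strict 0/1 label column.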

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5694 entries, 0 to 5730
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5694 non-null   object
 1   spam    5694 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 133.5+ KB


In [14]:
df.iloc[15].text

'Subject: search engine position  be the very first listing in the top search engines immediately .  our company will now place any business with a qualified website  permanently at the top of the major search engines guaranteed never to move  ( ex : yahoo ! , msn , alta vista , etc . ) . this promotion includes unlimited  traffic and is not going to last long . if you are interested in being  guaranteed first position in the top search engines at a promotional fee ,  please contact us promptly to find out if you qualify via email at  searchl 1 @ telefonica . net . pe it \' s very important to include the url ( s ) if you  are interested in promoting ! ! ! this is not pay per click . examples will  be provided .  this promotion is only valid in the usa and canada .  sincerely ,  the search engine placement specialists  if you wish to be removed from this list , please respond to the following  email address and type the word " remove " in your subject line :  search 6 @ speedy . com . 

## Text Preprocessing
1. Convert all characters to lowercase.
2. Remove special characters.
3. Remove stopwords.
4. Convert to vectors.

In [15]:
# Convert all text to lowercase
df['text'] = df['text'].apply(lambda x:x.lower())

In [16]:
df

Unnamed: 0,text,spam
0,subject: naturally irresistible your corporate...,1
1,subject: the stock trading gunslinger fanny i...,1
2,subject: unbelievable new homes made easy im ...,1
3,subject: 4 color printing special request add...,1
4,"subject: do not have money , get software cds ...",1
...,...,...
5726,"subject: re : receipts from visit jim , than...",0
5727,subject: re : enron case study update wow ! a...,0
5728,"subject: re : interest david , please , call...",0
5729,subject: news : aurora 5 . 2 update aurora ve...,0


In [17]:
df["text"][0]

"subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  ma

In [18]:
# Function to remove special characters (keeps only alphanumeric characters and spaces)
def rem_special_chars(text):
    new_text = ""
    for i in text:
        if i.isalnum() or i == " ":
            new_text += i
    return new_text.strip()
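
The same filtering can be sketched with a regular expression. This is a hypothetical variant (`rem_special_chars_re` is not part of the notebook) and it assumes ASCII-only text: `str.isalnum()` also accepts non-ASCII letters and digits, while the pattern below does not.

```python
import re

def rem_special_chars_re(text):
    # Hypothetical regex variant: keep ASCII letters, digits and spaces only
    return re.sub(r"[^0-9A-Za-z ]", "", text).strip()

sample = "subject: 100 % satisfaction guaranteed !"
print(repr(rem_special_chars_re(sample)))
```

As with the loop version, removed characters leave their surrounding spaces behind, so runs of double spaces remain in the output.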

In [19]:
# Removing special characters from the data
df['text'] = df['text'].apply(rem_special_chars)

In [20]:
df

Unnamed: 0,text,spam
0,subject naturally irresistible your corporate ...,1
1,subject the stock trading gunslinger fanny is...,1
2,subject unbelievable new homes made easy im w...,1
3,subject 4 color printing special request addi...,1
4,subject do not have money get software cds fr...,1
...,...,...
5726,subject re receipts from visit jim thanks ...,0
5727,subject re enron case study update wow all ...,0
5728,subject re interest david please call shi...,0
5729,subject news aurora 5 2 update aurora versi...,0


In [21]:
# Checking if the special characters are removed or not.
text = "subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  marketing break - through shouldn ' t make gaps in your budget . 100 % satisfaction  guaranteed : we provide unlimited amount of changes with no extra fees for you to  be surethat you will love the result of this collaboration . have a look at our  portfolio _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ not interested . . . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"
rem_special_chars(text)

'subject naturally irresistible your corporate identity  lt is really hard to recollect a company  the  market is full of suqgestions and the information isoverwhelminq  but a good  catchy logo  stylish statlonery and outstanding website  will make the task much easier   we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader  it isguite ciear that  without good products  effective business organization and practicable aim it  will be hotat nowadays market  but we do promise that your marketing efforts  will become much more effective  here is the list of clear  benefits  creativeness  hand  made  original logos  specially done  to reflect your distinctive company image  convenience  logo and stationery  are provided in all formats  easy  to  use content management system letsyou  change your website content and even its structure  promptness  you  will see logo drafts within three business days  affordability  your  marketing break  through 

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

In [23]:
# Creating a CountVectorizer that keeps the 10000 most frequent words in the corpus as features
# for classifying an email.
# Because stop_words='english' is passed, scikit-learn's built-in English stop-word list is excluded.

cv = CountVectorizer(stop_words='english',max_features=10000)

In [24]:
# Fitting the vectorizer on the full dataset (the train/test split happens later) and converting to a dense array
X = cv.fit_transform(df['text']).toarray()

In [25]:
y = df["spam"].values

In [26]:
y

array([1, 1, 1, ..., 0, 0, 0])

In [27]:
# Splitting the data into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)
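
Because the vectorizer in this notebook is fitted before the split, its vocabulary is built from test emails as well. A possible alternative, sketched here on a hypothetical toy corpus, is to split the raw text first and wrap the vectorizer and classifier in a pipeline. The `stratify` and `random_state` arguments are additions that the notebook does not use; they keep the class balance in both splits and make the split reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not taken from emails.csv): 5 spam, 5 non-spam
texts = [
    "win free money now", "claim your free prize", "cheap loans approved fast",
    "free offer click here", "win a cash prize today",
    "meeting agenda attached", "project report for review", "lunch on friday",
    "see the updated schedule", "notes from the call",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text first so the vocabulary is learned from training data only
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

pipe = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipe.fit(txt_train, lab_train)
preds = pipe.predict(txt_test)
print(len(txt_train), len(txt_test), preds)
```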

In [28]:
print(X_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y_train.shape)

(4555, 10000)
(1139, 10000)
(1139,)
(4555,)


In [29]:
# Using a Naive Bayes model for our Classifier
from sklearn.naive_bayes import MultinomialNB

In [30]:
# Defining our model
clf = MultinomialNB()

In [31]:
clf.fit(X_train, y_train)

In [32]:
from sklearn.metrics import accuracy_score

In [33]:
y_pred = clf.predict(X_test)

In [34]:
# Checking the accuracy
accuracy_score(y_test, y_pred)

0.9894644424934153
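
Accuracy alone does not show which kind of error the classifier makes, and for a spam filter a legitimate email flagged as spam is usually the more costly mistake. The sketch below uses hypothetical label arrays, not the notebook's `y_test` and `y_pred`, to show how a confusion matrix, precision and recall could be read.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 non-spam (0) and 4 spam (1) emails
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)   # rows: true class, columns: predicted class
precision = precision_score(y_true, y_hat)   # share of predicted spam that is spam
recall = recall_score(y_true, y_hat)         # share of actual spam that was caught
print(cm)
print(precision, recall)
```

The same three calls can be applied to `y_test` and `y_pred` from the cells above.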

In [36]:
# Number of features (vocabulary size) kept by the vectorizer
len(cv.get_feature_names_out())

10000

In [37]:
# Number of stop words (e.g. a, an, the) excluded by the vectorizer
len(cv.get_stop_words())

318

## Post training
* Save the model and the CountVectorizer object for future use without retraining


In [38]:
import pickle

In [39]:
# Saving the model and the CountVectorizer to .pkl files for reuse without retraining.
# The 'model' directory must exist first; otherwise open() raises FileNotFoundError.
import os
os.makedirs('model', exist_ok=True)
with open('model/cv.pkl', 'wb') as f:
    pickle.dump(cv, f)
with open('model/clf.pkl', 'wb') as f:
    pickle.dump(clf, f)
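
Once both pickle files exist, another program can load them and classify new text. The sketch below is self-contained under stated assumptions: it trains a tiny stand-in model on hypothetical sentences and writes to a temporary directory rather than the notebook's `model/` folder.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in model trained on hypothetical sentences
texts = ["win free money now", "free prize claim now",
         "meeting agenda attached", "project report attached"]
labels = [1, 1, 0, 0]
toy_cv = CountVectorizer(stop_words="english")
toy_clf = MultinomialNB().fit(toy_cv.fit_transform(texts), labels)

# Save both objects; the directory is created before open() is called
model_dir = os.path.join(tempfile.mkdtemp(), "model")
os.makedirs(model_dir, exist_ok=True)
for name, obj in [("cv.pkl", toy_cv), ("clf.pkl", toy_clf)]:
    with open(os.path.join(model_dir, name), "wb") as f:
        pickle.dump(obj, f)

# Later, possibly in another program: load and predict
with open(os.path.join(model_dir, "cv.pkl"), "rb") as f:
    loaded_cv = pickle.load(f)
with open(os.path.join(model_dir, "clf.pkl"), "rb") as f:
    loaded_clf = pickle.load(f)

new_mail = ["claim your free prize"]
pred = loaded_clf.predict(loaded_cv.transform(new_mail))
print(pred)
```

New emails should go through the same preprocessing as the training text (lowercasing and special-character removal) before `transform` is called, and pickle files should only be loaded from trusted sources.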