<a href="https://colab.research.google.com/github/itskrutinewalkar/Email-Classifier/blob/main/Email_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Email_Classifier**

In [1]:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


In [2]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!ls ~/.kaggle
!kaggle datasets download -d purusinghvi/email-spam-classification-dataset

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Dataset URL: https://www.kaggle.com/datasets/purusinghvi/email-spam-classification-dataset
License(s): MIT
Downloading email-spam-classification-dataset.zip to /content
 77% 33.0M/43.0M [00:00<00:00, 88.3MB/s]
100% 43.0M/43.0M [00:00<00:00, 89.6MB/s]


## **Create a Pandas DataFrame**

In [3]:
df = pd.read_csv('/content/email-spam-classification-dataset.zip')
df.shape

(83448, 2)
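
`pd.read_csv` is given the `.zip` path directly in the cell above. pandas infers the compression from the file extension and can read an archive that contains a single CSV. The sketch below reproduces this on a tiny archive; the file names and rows are made up for illustration and are not part of the Kaggle dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical two-row CSV written into a single-file zip archive.
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "toy_emails.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(
            "toy_emails.csv",
            "label,text\n0,see you at lunch\n1,win a free prize now\n",
        )
    # Compression is inferred from the ".zip" extension.
    toy_df = pd.read_csv(zip_path)

print(toy_df.shape)
```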

## **Separating Data for Analysis**

In [4]:
#check the number of null values in each column
df.isnull().sum()

label    0
text     0
dtype: int64

In [5]:
#check how much data is available for ham (label 0) and spam (label 1)
ham = df[df.label == 0]
spam = df[df.label == 1]
ham.shape

(39538, 2)

In [6]:
spam.shape

(43910, 2)

## **Undersampling Spam Data for better predictions**

In [7]:
spam_sample = spam.sample(n=39537)
spam_sample

| | label | text |
|---|---|---|
| 64452 | 1 | on our behalf we can recommend you the canadia... |
| 36429 | 1 | we present you a us licensed online pharmescap... |
| 49886 | 1 | would you like to discover the secrets slot ow... |
| 77219 | 1 | panda software ranks famous people most often ... |
| 10768 | 1 | did you know you can refinance up to escapenum... |
| ... | ... | ... |
| 27862 | 1 | i aescapenumber aescapenumberthe house work o... |
| 15046 | 1 | to say adios muchachos head on over to go to :... |
| 17298 | 1 | downloadable software ds is a fast paced compa... |
| 29267 | 1 | announces building nymex existing perkins set ... |


In [8]:
# concatenate ham and spam_sample row-wise (axis=0) into a new, nearly balanced dataframe
new_df = pd.concat([ham, spam_sample], axis=0)
new_df

| | label | text |
|---|---|---|
| 2 | 0 | computer connection from cnn com wednesday es... |
| 4 | 0 | thanks for all your answers guys i know i shou... |
| 5 | 0 | larry king live at escapenumber escapenumber p... |
| 6 | 0 | michael pobega wrote i'm not sure if it's the ... |
| 7 | 0 | hi i have this error tr sample escapenumber es... |
| ... | ... | ... |
| 27862 | 1 | i aescapenumber aescapenumberthe house work o... |
| 15046 | 1 | to say adios muchachos head on over to go to :... |
| 17298 | 1 | downloadable software ds is a fast paced compa... |
| 29267 | 1 | announces building nymex existing perkins set ... |


In [9]:
new_df.shape

(79075, 2)
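
The notebook samples `n=39537` spam rows, one fewer than the 39538 ham rows, and does not fix a seed, so the sample changes on every run. A minimal sketch of an alternative, assuming the goal is an exactly balanced and reproducible sample: take `n` from the size of the ham frame and pass a `random_state`. The toy frame and the seed value `42` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: 3 ham rows (label 0) and 5 spam rows (label 1).
toy = pd.DataFrame({
    "label": [0, 0, 0, 1, 1, 1, 1, 1],
    "text": [f"message {i}" for i in range(8)],
})
toy_ham = toy[toy.label == 0]
toy_spam = toy[toy.label == 1]

# Draw exactly as many spam rows as there are ham rows; the seed is arbitrary.
toy_spam_sample = toy_spam.sample(n=len(toy_ham), random_state=42)

# Stack the two frames row-wise to get a balanced dataset.
balanced = pd.concat([toy_ham, toy_spam_sample], axis=0)
print(balanced.label.value_counts().to_dict())
```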

## **Splitting the data into features and targets**

In [10]:
#The dataset has two columns: 'text' holds the features (X) and 'label' holds the target (Y)
X = new_df['text']
Y = new_df['label']

In [11]:
X.head()

2     computer connection from cnn com wednesday es...
4    thanks for all your answers guys i know i shou...
5    larry king live at escapenumber escapenumber p...
6    michael pobega wrote i'm not sure if it's the ...
7    hi i have this error tr sample escapenumber es...
Name: text, dtype: object

In [12]:
Y.head()

2    0
4    0
5    0
6    0
7    0
Name: label, dtype: int64

## **Splitting the data into training and testing data**

In [13]:
#using train test split from sklearn
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [14]:
X_train.shape

(63260,)

In [15]:
X_test.shape

(15815,)

In [16]:
Y_train.shape

(63260,)

In [17]:
Y_test.shape

(15815,)
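
The `stratify=Y` argument used in the split keeps the ham/spam ratio the same in the training and test sets. The following sketch checks this on made-up labels; the data and variable names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 50 ham (0) and 50 spam (1) messages.
labels = pd.Series([0] * 50 + [1] * 50)
texts = pd.Series([f"message {i}" for i in range(100)])

texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=2
)

# With stratification both splits keep the 50/50 class ratio.
print(labels_train.value_counts().to_dict())
print(labels_test.value_counts().to_dict())
```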

## **Convert the text data into numerical data**

In [18]:
feature_extraction = TfidfVectorizer(min_df = 2, stop_words='english', lowercase=True)

In [19]:
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
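
The vectorizer is fitted on the training text only and then reused to transform the test text, so words that appear only in the test set are ignored. The sketch below illustrates this on a made-up corpus and also shows the effect of `min_df=2`, which drops terms found in fewer than two documents. The sentences are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus.
train_texts = ["free prize click now", "meeting notes attached", "free meeting room"]
test_texts = ["free lottery prize"]

# Learn the vocabulary and IDF weights from the training text only.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)

# Reuse the fitted vocabulary; "lottery" was never seen, so it is dropped.
test_matrix = vectorizer.transform(test_texts)
print(train_matrix.shape, test_matrix.shape)
print("lottery" in vectorizer.vocabulary_)

# min_df=2 keeps only terms that occur in at least two training documents.
pruned = TfidfVectorizer(min_df=2).fit(train_texts)
print(sorted(pruned.vocabulary_))
```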

In [20]:
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

In [21]:
print(X_train_features)

  (0, 101615)	0.03558329440529161
  (0, 79299)	0.04179648982083858
  (0, 1627)	0.061403395541682476
  (0, 148)	0.059929424787974264
  (0, 1756)	0.05306410977010731
  (0, 98280)	0.0747452322522196
  (0, 19769)	0.0899565899742732
  (0, 1278)	0.059339653699070426
  (0, 267)	0.06441312991311894
  (0, 59090)	0.34239669813076407
  (0, 102845)	0.4147763671317907
  (0, 2472)	0.07465606234276377
  (0, 312)	0.061557836924199597
  (0, 1185)	0.15139938473164272
  (0, 1809)	0.18796382151214822
  (0, 348)	0.1886184002870401
  (0, 36231)	0.3073921522048524
  (0, 38643)	0.2495151327277738
  (0, 32537)	0.3141548989674287
  (0, 61455)	0.31035106652970756
  (0, 42990)	0.053460798381598736
  (0, 0)	0.10543187144007571
  (0, 101581)	0.05313962908894049
  (0, 938)	0.060734873533547125
  (0, 72021)	0.06723989275099353
  :	:
  (63259, 26252)	0.09533914457710697
  (63259, 93925)	0.05672019568935565
  (63259, 21891)	0.06549327423888711
  (63259, 23590)	0.07167151587537758
  (63259, 41132)	0.06943928478509712
  

## **Model Learning**

### *Logistic Regression*

In [22]:
model = LogisticRegression()


In [23]:
#training the logistic regression model with training data
model.fit(X_train_features, Y_train)

### *MultinomialNB*

In [24]:
from sklearn.naive_bayes import MultinomialNB

In [25]:
#create an instance of MultinomialNB classifier
mnb = MultinomialNB()

In [26]:
#train the model
mnb.fit(X_train_features, Y_train)
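
The notebook keeps the vectorizer and the classifiers as separate objects. One possible alternative, not used in this notebook, is to bundle both steps with scikit-learn's `make_pipeline` so that raw text goes in and labels come out. The toy texts and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 1 = spam, 0 = ham.
toy_texts = [
    "win a free prize now",
    "claim your free reward",
    "lunch meeting at noon",
    "project notes attached",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one call.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_texts, toy_labels)

# Raw strings can be passed straight to predict().
toy_predictions = spam_pipeline.predict(["free prize inside", "notes from the meeting"])
print(toy_predictions)
```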

## **Model Evaluation**

### *Logistic Regression*

In [27]:
#accuracy score on train data
X_train_prediction = model.predict(X_train_features)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [28]:
training_data_accuracy

0.9903888713246918

In [29]:
#accuracy on test data
X_test_prediction = model.predict(X_test_features)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [30]:
test_data_accuracy

0.9843186847929181

### *MultinomialNB Classifier*

In [31]:
#make predictions on training data and calculate accuracy score
mnb_train_prediction = mnb.predict(X_train_features)
mnb_train_accuracy = accuracy_score(Y_train, mnb_train_prediction)

In [32]:
mnb_train_accuracy

0.9799557382232058

In [33]:
#make predictions on test data and calculate accuracy score
mnb_prediction = mnb.predict(X_test_features)
mnb_test_accuracy = accuracy_score(Y_test, mnb_prediction)

In [34]:
mnb_test_accuracy

0.9731900094846665
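
Accuracy alone does not show whether a model's errors are ham flagged as spam or spam that slipped through. A minimal sketch of how a confusion matrix, precision, and recall could be computed with `sklearn.metrics` follows. The label lists are made up and are not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = ham, 1 = spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields (true neg, false pos, false neg, true pos).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

# Precision: share of predicted spam that really is spam.
precision = precision_score(y_true, y_pred)
# Recall: share of actual spam that was caught.
recall = recall_score(y_true, y_pred)
print(precision, recall)
```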

## **Prediction of email**

In [35]:
input_mail = ["Greetings! Your ticket for 'Emily In Paris' has been booked for 9PM",
              "StarsPwn has lunched a new game and we think you can be our testr. Click to dwnload th file",
              "We recently suspected a malicious activity from this computer. Please clck on the below link to verify your authorization!"]
#convert the text to TF-IDF features using the fitted vectorizer
input_data_features = feature_extraction.transform(input_mail)

#making predictions with the logistic regression model
logistic_prediction = model.predict(input_data_features)

#making predictions with the MultinomialNB classifier
multinomial_prediction = mnb.predict(input_data_features)

In [36]:
logistic_prediction

array([1, 0, 1])

In [37]:
multinomial_prediction

array([0, 0, 1])

In [38]:
for i in logistic_prediction:
  if i==0:
    print('Logistic Regression says: Ham')
  else:
    print('Logistic Regression says: Spam')

print()

for i in multinomial_prediction:
  if i==0:
    print('MultinomialNB says: Ham')
  else:
    print('MultinomialNB says: Spam')

Logistic Regression says: Spam
Logistic Regression says: Ham
Logistic Regression says: Spam

MultinomialNB says: Ham
MultinomialNB says: Ham
MultinomialNB says: Spam
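
The two printing loops above differ only in the model name. They could be folded into one small helper, sketched below. `LABEL_NAMES` and `describe_predictions` are hypothetical names introduced here, and the input array is the logistic regression output shown earlier.

```python
import numpy as np

# Mapping used throughout the notebook: 0 = ham, 1 = spam.
LABEL_NAMES = {0: "Ham", 1: "Spam"}


def describe_predictions(model_name, predictions):
    """Return one readable line per predicted label."""
    return [f"{model_name} says: {LABEL_NAMES[int(p)]}" for p in predictions]


lines = describe_predictions("Logistic Regression", np.array([1, 0, 1]))
for line in lines:
    print(line)
```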
