### Naive Bayes Algorithm with Multinomial

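Multinomial Naive Bayes is the variant of Naive Bayes designed for count features, such as the number of times each word occurs in a text.

The classifier itself does not convert text into numbers. That conversion is done beforehand by a vectorizer (here, `CountVectorizer`), and the resulting word counts are the input to the model.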
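
For reference, the decision rule behind Multinomial Naive Bayes can be written as below. This is the standard formulation with additive (Laplace) smoothing, given as background rather than taken from this notebook; the symbols are introduced only for this illustration.

$$
\hat{y} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \theta_{ci} \Big],
\qquad
\theta_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
$$

Here $x_i$ is the count of word $i$ in the message, $N_{ci}$ is the total count of word $i$ across training messages of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the vocabulary size, and $\alpha$ is the smoothing parameter ($\alpha = 1$ is the default in scikit-learn's `MultinomialNB`).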


### Aim of this project

The main aims of this project are to:

(i) Convert the text of each message into numerical form.

(ii) Predict whether the message is 'Ham' or 'Spam'.


## Steps performed by this algorithm:

1.   Import all the necessary libraries

2.   Load the dataset

3.    Perform exploratory data analysis (EDA)

4.    Divide the dataset into independent and dependent variables

5.    Divide the independent and dependent variables into training and testing data

6.    Convert the input text into vectors using Count Vectorizer and transform the inputs

7.    Train the model

8.    Make predictions on the test data

9.    Evaluate the model performance

10.   Test on the new samples

11.   Predict whether the email is 'Ham' or 'Spam'
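
The vectorize, train, and predict steps listed above can be sketched end to end on a handful of made-up messages. This is a minimal illustration and not the notebook's own code: the toy texts and labels are hypothetical, and `make_pipeline` is used here only to keep the sketch short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, not taken from Email.csv.
toy_texts = [
    "win a free prize now",
    "free entry to win cash",
    "are we meeting for lunch",
    "call me when you get home",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline first turns text into word counts, then fits the classifier.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

# Predict on two unseen toy messages.
toy_predictions = toy_pipeline.predict(["free cash prize", "lunch at home"])
print(toy_predictions)
```

The notebook below performs the same steps one at a time, which makes each intermediate result easier to inspect.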

### Step 1:  Import all the necessary libraries

In [1321]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Step 2:  Load the dataset

In [1322]:
data = pd.read_csv('Email.csv', encoding = 'latin-1')

print(data)

        v1                                                 v2 Unnamed: 2  \
0      ham  Go until jurong point, crazy.. Available only ...        NaN   
1      ham                      Ok lar... Joking wif u oni...        NaN   
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3      ham  U dun say so early hor... U c already then say...        NaN   
4      ham  Nah I don't think he goes to usf, he lives aro...        NaN   
...    ...                                                ...        ...   
5567  spam  This is the 2nd time we have tried 2 contact u...        NaN   
5568   ham              Will Ì_ b going to esplanade fr home?        NaN   
5569   ham  Pity, * was in mood for that. So...any other s...        NaN   
5570   ham  The guy did some bitching but I acted like i'd...        NaN   
5571   ham                         Rofl. Its true to its name        NaN   

     Unnamed: 3 Unnamed: 4  
0           NaN        NaN  
1           NaN        NaN  


### Step 3:  Perform exploratory data analysis (EDA)

In [1323]:
### Get all the columns from the dataset
data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [1324]:
### Remove all the unnecessary columns from the dataset

data.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)

In [1325]:
data.columns

Index(['v1', 'v2'], dtype='object')

### OBSERVATIONS:

1. All the unnecessary columns have been removed from the dataset.

2. Only two columns are left

      (i.)   v1 ------------------>  label

      (ii.)  v2 ------------------>  text

In [1326]:
### Rename the remaining columns to meaningful names

data.columns = ['label','text']

In [1327]:
data

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ì_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [1328]:
### Reorder the columns in the dataframe

data = data[['text','label']]

In [1329]:
data

| | text | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |
| ... | ... | ... |
| 5567 | This is the 2nd time we have tried 2 contact u... | spam |
| 5568 | Will Ì_ b going to esplanade fr home? | ham |
| 5569 | Pity, * was in mood for that. So...any other s... | ham |
| 5570 | The guy did some bitching but I acted like i'd... | ham |


In [1330]:
### get the top five rows of the dataset

print(data.head())

                                                text label
0  Go until jurong point, crazy.. Available only ...   ham
1                      Ok lar... Joking wif u oni...   ham
2  Free entry in 2 a wkly comp to win FA Cup fina...  spam
3  U dun say so early hor... U c already then say...   ham
4  Nah I don't think he goes to usf, he lives aro...   ham


In [1331]:
### get the bottom five rows of the dataset

print(data.tail())

                                                   text label
5567  This is the 2nd time we have tried 2 contact u...  spam
5568              Will Ì_ b going to esplanade fr home?   ham
5569  Pity, * was in mood for that. So...any other s...   ham
5570  The guy did some bitching but I acted like i'd...   ham
5571                         Rofl. Its true to its name   ham


In [1332]:
### get the total number of records in the dataset

print("Total no of records in the dataset is:", len(data))

Total no of records in the dataset is: 5572


In [1333]:
### Get the shape of the dataset

print(data.shape)

(5572, 2)


In [1334]:
### get the columns used in the dataset

data.columns

Index(['text', 'label'], dtype='object')

In [1335]:
### get the information about the columns used in the dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5572 non-null   object
 1   label   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [1336]:
### get the descriptive statistics about the columns used in the dataset

data.describe(include='object')

| | text | label |
|---|---|---|
| count | 5572 | 5572 |
| unique | 5169 | 2 |
| top | Sorry, I'll call later | ham |
| freq | 30 | 4825 |
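
The `describe()` output above shows that 'ham' is the most frequent label, with 4825 of 5572 records, so the classes are imbalanced. A minimal sketch of how the class shares could be checked follows; `toy_labels` is a hypothetical stand-in for `data['label']`, and only the two counts are taken from the output above.

```python
import pandas as pd

# Hypothetical miniature label column standing in for data['label'].
toy_labels = pd.Series(["ham", "ham", "ham", "spam"], name="label")

# normalize=True returns proportions instead of raw counts.
toy_share = toy_labels.value_counts(normalize=True)
print(toy_share)

# Counts reported by describe() above: 4825 'ham' records out of 5572.
ham_share = 4825 / 5572
print(f"Share of 'ham' before removing duplicates: {ham_share:.3f}")
```

With roughly 87% of the records labelled 'ham', a model that always predicts 'ham' would already score about that accuracy on the raw data, which is worth keeping in mind when reading the accuracy figure later on.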


In [1337]:
### Check if there are any NULL records left in the dataset

data.isnull().sum()

text     0
label    0
dtype: int64

### OBSERVATIONS:

1. There are no NULL Values left in the dataset.

In [1338]:
### Check if there are any duplicates in the dataset

data[data.duplicated()]

| | text | label |
|---|---|---|
| 102 | As per your request 'Melle Melle (Oru Minnamin... | ham |
| 153 | As per your request 'Melle Melle (Oru Minnamin... | ham |
| 206 | As I entered my cabin my PA said, '' Happy B'd... | ham |
| 222 | Sorry, I'll call later | ham |
| 325 | No calls..messages..missed calls | ham |
| ... | ... | ... |
| 5524 | You are awarded a SiPix Digital Camera! call 0... | spam |
| 5535 | I know you are thinkin malaria. But relax, chi... | ham |
| 5539 | Just sleeping..and surfing | ham |
| 5553 | Hahaha..use your brain dear | ham |


### OBSERVATIONS:

1. There are 403 duplicate records in the dataset. So we need to remove all these duplicate records from the dataset.
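
A duplicate count like the one in the observation above can be obtained by summing the boolean mask returned by `duplicated()`. The sketch below uses a small hypothetical DataFrame (`toy`) rather than the real dataset.

```python
import pandas as pd

# Hypothetical miniature dataset with two repeated rows.
toy = pd.DataFrame({
    "text": ["Sorry, I'll call later", "Ok", "Sorry, I'll call later", "Ok", "Hi"],
    "label": ["ham", "ham", "ham", "ham", "ham"],
})

# duplicated() marks every repeat of an earlier row as True (the first occurrence stays False).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()

print("Duplicate rows:", n_duplicates)
print("Rows after removing duplicates:", len(deduplicated))
```

Applied to the real `data`, the same `data.duplicated().sum()` call would be expected to report the 403 duplicates noted above.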

In [1339]:
# Reassign instead of using inplace=True: `data` was created by selecting columns
# from another DataFrame, so an in-place change triggers pandas' SettingWithCopyWarning.
data = data.drop_duplicates()


In [1340]:
### Check if there are any duplicates in the dataset

data[data.duplicated()]

| | text | label |
|---|---|---|


### OBSERVATIONS:

1. There are no duplicate records in the dataset.

In [1341]:
### Get the shape of the revised dataset

print(data.shape)

(5169, 2)


### OBSERVATIONS:

1. The dataset has been reduced from 5572 records to 5169 records because the 403 duplicate records have been removed.

In [1342]:
data

| | text | label |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |
| ... | ... | ... |
| 5567 | This is the 2nd time we have tried 2 contact u... | spam |
| 5568 | Will Ì_ b going to esplanade fr home? | ham |
| 5569 | Pity, * was in mood for that. So...any other s... | ham |
| 5570 | The guy did some bitching but I acted like i'd... | ham |


### Step 4:  Divide the dataset into independent and dependent variables

In [1343]:
X = data['text']

Y = data['label']

In [1344]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: text, Length: 5169, dtype: object


In [1345]:
print(Y)

0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: label, Length: 5169, dtype: object


### Step 5:   Divide the independent and dependent variables into training and testing data

In [1346]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state=42)

In [1347]:
print(X_train)

2228                       Those were my exact intentions
5529                            What about this one then.
2149                   Waaaat?? Lololo ok next time then!
5058    Free video camera phones with Half Price line ...
5051    Tick, tick, tick .... Where are you ? I could ...
                              ...                        
4740    Many more happy returns of the day. I wish you...
474     Nice line said by a broken heart- Plz don't cu...
3266                    Ok then i come n pick u at engin?
4016    Eek that's a lot of time especially since Amer...
879     U have a Secret Admirer who is looking 2 make ...
Name: text, Length: 4135, dtype: object


In [1348]:
print(X_test)

1617                        Did u download the fring app?
2064    Pass dis to all ur contacts n see wat u get! R...
1272                                                Ok...
3020                       Am in film ill call you later.
3642    Sorry, left phone upstairs. OK, might be hecti...
                              ...                        
4146    Pls help me tell sura that i'm expecting a bat...
1208                      Also maaaan are you missing out
4795    URGENT This is our 2nd attempt to contact U. Y...
3575    The sign of maturity is not when we start sayi...
2820                Oh god..taken the teeth?is it paining
Name: text, Length: 1034, dtype: object


In [1349]:
print("Shape of the input  training data is:",  X_train.shape)
print("Shape of the input  testing  data is:",  X_test.shape)

Shape of the input  training data is: (4135,)
Shape of the input  testing  data is: (1034,)


In [1350]:
print(Y_train)

2228     ham
5529     ham
2149     ham
5058    spam
5051     ham
        ... 
4740     ham
474      ham
3266     ham
4016     ham
879     spam
Name: label, Length: 4135, dtype: object


In [1351]:
print(Y_test)

1617     ham
2064     ham
1272     ham
3020     ham
3642     ham
        ... 
4146     ham
1208     ham
4795    spam
3575     ham
2820     ham
Name: label, Length: 1034, dtype: object


In [1352]:
print("Shape of the output  training data is:",  Y_train.shape)
print("Shape of the output  testing  data is:",  Y_test.shape)

Shape of the output  training data is: (4135,)
Shape of the output  testing  data is: (1034,)
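
Because spam is the minority class, one optional variation on the split above is to pass `stratify` so that both subsets keep the same class proportions. The notebook's own split does not do this, and adopting it would change the numbers reported later; the sketch below only illustrates the parameter on hypothetical toy data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 'ham' and 4 'spam' messages (20% spam).
toy_X = [f"message {i}" for i in range(20)]
toy_y = ["ham"] * 16 + ["spam"] * 4

# stratify=toy_y keeps the 80/20 class ratio in both the training and the test subset.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

spam_in_test = y_te.count("spam")
print("Test size:", len(y_te), "| spam in test:", spam_in_test)
```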


### Step 6: Convert the input text into vectors using Count Vectorizer and transform the inputs

In [1353]:
from sklearn.feature_extraction.text import CountVectorizer

### create an object for CountVectorizer

vectorizer = CountVectorizer(stop_words = 'english')

### using the object for CountVectorizer, transform the inputs

X_train_scaled = vectorizer.fit_transform(X_train)

print(X_train_scaled)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 31577 stored elements and shape (4135, 7395)>
  Coords	Values
  (0, 2588)	1
  (0, 3545)	1
  (2, 6999)	1
  (2, 4002)	1
  (2, 4696)	1
  (2, 6589)	1
  (3, 2862)	1
  (3, 6938)	1
  (3, 1553)	1
  (3, 4954)	1
  (3, 3168)	1
  (3, 5171)	1
  (3, 3941)	1
  (3, 5450)	1
  (3, 271)	1
  (3, 4419)	1
  (3, 516)	1
  (3, 1997)	1
  (3, 4630)	1
  (3, 4299)	1
  (3, 251)	1
  (3, 6770)	1
  (3, 4344)	1
  (3, 52)	1
  (3, 1534)	1
  :	:
  (4132, 4965)	1
  (4132, 2504)	1
  (4133, 6589)	1
  (4133, 6192)	1
  (4133, 4007)	1
  (4133, 2550)	1
  (4133, 3930)	1
  (4133, 4303)	1
  (4133, 4029)	1
  (4133, 901)	1
  (4133, 2451)	1
  (4133, 4973)	1
  (4133, 5893)	1
  (4134, 6867)	1
  (4134, 4139)	1
  (4134, 1895)	1
  (4134, 4013)	1
  (4134, 6069)	1
  (4134, 5715)	1
  (4134, 799)	1
  (4134, 5512)	1
  (4134, 6543)	1
  (4134, 216)	1
  (4134, 6198)	1
  (4134, 45)	1


In [1354]:
X_test_scaled = vectorizer.transform(X_test)

print(X_test_scaled)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6958 stored elements and shape (1034, 7395)>
  Coords	Values
  (0, 969)	1
  (0, 2218)	1
  (0, 2350)	1
  (1, 1305)	1
  (1, 1336)	1
  (1, 1816)	1
  (1, 1897)	1
  (1, 2259)	1
  (1, 2641)	1
  (1, 3015)	1
  (1, 3105)	1
  (1, 3357)	1
  (1, 3643)	1
  (1, 4081)	1
  (1, 4310)	1
  (1, 4647)	1
  (1, 4756)	1
  (1, 4862)	1
  (1, 4984)	1
  (1, 5027)	1
  (1, 5361)	2
  (1, 5395)	1
  (1, 5966)	1
  (1, 6365)	1
  (1, 6550)	1
  :	:
  (1031, 392)	1
  (1031, 698)	1
  (1031, 1077)	1
  (1031, 1113)	1
  (1031, 1742)	1
  (1031, 1813)	1
  (1031, 1895)	1
  (1031, 5185)	1
  (1031, 6871)	1
  (1031, 7331)	1
  (1032, 783)	1
  (1032, 1279)	1
  (1032, 1461)	1
  (1032, 2565)	1
  (1032, 4552)	1
  (1032, 5669)	1
  (1032, 5873)	1
  (1032, 5956)	1
  (1032, 6151)	2
  (1032, 6538)	2
  (1032, 6814)	1
  (1033, 3036)	1
  (1033, 4691)	1
  (1033, 6389)	1
  (1033, 6449)	1


### OBSERVATIONS:

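1.  Each stored entry pairs a (row, column) coordinate with a count: the row index identifies the message, the column index identifies a word in the vocabulary learned from the training data, and the value is the number of times that word occurs in that message. The 7395 columns correspond to the 7395 vocabulary terms.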

### Step 7:  Train the model

In [1355]:
X_train_scaled.shape

(4135, 7395)

In [1356]:
Y_train.shape

(4135,)

In [1357]:
from sklearn.naive_bayes import MultinomialNB

### Create an object for MultinomialNB

Multi_NB = MultinomialNB()

### using the object  for MultinomialNB, train the model

Multi_NB.fit(X_train_scaled, Y_train)
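
What `fit` estimates can be checked by hand on a tiny example. The sketch below assumes the default additive smoothing (`alpha=1.0`) and a hypothetical 3-word count matrix; it compares a manual per-class word probability with the values scikit-learn stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 3 messages (rows) over a 3-word vocabulary (columns).
toy_counts = np.array([[2, 1, 0],   # spam
                       [1, 0, 0],   # spam
                       [0, 1, 3]])  # ham
toy_labels = np.array(["spam", "spam", "ham"])

toy_nb = MultinomialNB(alpha=1.0).fit(toy_counts, toy_labels)

# Manual estimate for 'spam': (word count + alpha) / (total count + alpha * vocabulary size)
spam_word_counts = toy_counts[toy_labels == "spam"].sum(axis=0)
manual_spam = (spam_word_counts + 1.0) / (spam_word_counts.sum() + 1.0 * toy_counts.shape[1])

# scikit-learn stores log-probabilities, one row per class in the order of classes_.
spam_row = list(toy_nb.classes_).index("spam")
sklearn_spam = np.exp(toy_nb.feature_log_prob_[spam_row])

print("manual :", manual_spam)
print("sklearn:", sklearn_spam)
```

The smoothing term keeps a word that never appeared in a class (the third word for 'spam' here) from receiving a probability of zero.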

### Step 8: Make predictions on the test data

In [1358]:
X_test_scaled

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6958 stored elements and shape (1034, 7395)>

In [1359]:
Y_pred = Multi_NB.predict(X_test_scaled)

In [1360]:
Y_pred

array(['ham', 'ham', 'ham', ..., 'spam', 'ham', 'ham'], dtype='<U4')
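
Besides hard labels, `MultinomialNB` can also return class probabilities through `predict_proba`, which is useful when a different decision threshold is wanted. The sketch below is self-contained and uses hypothetical toy messages and its own `toy_model`, not the notebook's `Multi_NB`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data.
toy_texts = ["win a free prize", "free cash now", "see you at lunch", "call me later"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

toy_features = toy_vectorizer.transform(["free prize", "lunch later"])
toy_proba = toy_model.predict_proba(toy_features)

# Columns follow toy_model.classes_, which scikit-learn keeps in sorted order.
for row in toy_proba:
    print(dict(zip(toy_model.classes_, row.round(3))))

# Taking the most probable class per row reproduces what predict() returns.
toy_top = toy_model.classes_[toy_proba.argmax(axis=1)]
```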

### Step 9:  Evaluate the model performance

In [1361]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(Y_test, Y_pred)

print("Accuracy score of the model is:", (ac*100.0))

Accuracy score of the model is: 98.35589941972921


In [1362]:
cm = confusion_matrix(Y_test, Y_pred)

print("Confusion Matrix of the model is:", (cm))

Confusion Matrix of the model is: [[885   4]
 [ 13 132]]


In [1363]:
cr = classification_report(Y_test, Y_pred)

print("Classification Report of the model is:", (cr))

Classification Report of the model is:               precision    recall  f1-score   support

         ham       0.99      1.00      0.99       889
        spam       0.97      0.91      0.94       145

    accuracy                           0.98      1034
   macro avg       0.98      0.95      0.96      1034
weighted avg       0.98      0.98      0.98      1034
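
The figures in the classification report can be reproduced directly from the confusion matrix printed earlier. In that matrix the rows are the true labels and the columns are the predicted labels, both in the order ['ham', 'spam']. The sketch below only redoes that arithmetic; the variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix as printed above: rows = true label, columns = predicted label.
cm = np.array([[885, 4],
               [13, 132]])

accuracy = np.trace(cm) / cm.sum()           # (885 + 132) / 1034
spam_precision = cm[1, 1] / cm[:, 1].sum()   # 132 / (4 + 132)
spam_recall = cm[1, 1] / cm[1, :].sum()      # 132 / (13 + 132)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy       : {accuracy:.4f}")
print(f"spam precision : {spam_precision:.2f}")
print(f"spam recall    : {spam_recall:.2f}")
print(f"spam f1-score  : {spam_f1:.2f}")
```

Read this way, the matrix shows that 13 spam messages were classified as ham and 4 ham messages were classified as spam, which is why spam recall (0.91) is lower than spam precision (0.97).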



### Step 10: Test on the new samples

In [1364]:
# Test the model on custom messages
sample_messages = [
    "Congratulations! You've won a free iPhone. Click here to claim.",
    "Hey, are we still meeting for lunch today?"
]

### use the vectorizer, transform the inputs

transformed_data = vectorizer.transform(sample_messages)

print(transformed_data)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 11 stored elements and shape (2, 7395)>
  Coords	Values
  (0, 1742)	1
  (0, 1761)	1
  (0, 1881)	1
  (0, 2862)	1
  (0, 3573)	1
  (0, 6919)	1
  (0, 7216)	1
  (1, 3275)	1
  (1, 4076)	1
  (1, 4232)	1
  (1, 6622)	1


In [1365]:
### Make the predictions on the transformed data

Y_pred_transformed = Multi_NB.predict(transformed_data)

In [1366]:
Y_pred_transformed

array(['spam', 'ham'], dtype='<U4')

### OBSERVATIONS:

1.  Message: "Congratulations! You've won a free iPhone. Click here to claim." is predicted as 'spam'.

2.  Message: "Hey, are we still meeting for lunch today?" is predicted as 'ham'.
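
The transform-then-predict sequence used for the sample messages can be wrapped in a small helper. `classify_messages` is a hypothetical name introduced for this sketch, and the toy vectorizer and model stand in for the notebook's `vectorizer` and `Multi_NB`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def classify_messages(messages, fitted_vectorizer, fitted_model):
    """Return (message, predicted label) pairs for a list of raw text messages."""
    # Only transform here: refitting the vectorizer would change the column meanings.
    features = fitted_vectorizer.transform(messages)
    return list(zip(messages, fitted_model.predict(features)))


# Hypothetical toy stand-ins for the fitted vectorizer and classifier.
toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB().fit(
    toy_vectorizer.fit_transform(["win a free prize", "see you at lunch"]),
    ["spam", "ham"],
)

results = classify_messages(["free prize inside", "lunch tomorrow"], toy_vectorizer, toy_model)
for message, label in results:
    print(f"{label}: {message}")
```

With the notebook's objects, the equivalent call would be `classify_messages(sample_messages, vectorizer, Multi_NB)`.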

### Step 11:  Predict whether the email is 'Ham' or 'Spam'

In [1368]:
# Index 1 selects the second sample message ("Hey, are we still meeting for lunch today?")
if Y_pred_transformed[1] == "ham":
    print("The email message is Ham")
else:
    print("The email message is Spam")

The email message is Ham
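
The check above looks at one sample only. A minimal sketch that reports every sample in turn is shown below; the labels are copied from the `Y_pred_transformed` output shown earlier instead of being recomputed, so the block runs on its own.

```python
sample_messages = [
    "Congratulations! You've won a free iPhone. Click here to claim.",
    "Hey, are we still meeting for lunch today?",
]

# Labels copied from the Y_pred_transformed output shown earlier.
predicted_labels = ["spam", "ham"]

# Pair each message with its label and build one report line per message.
report_lines = [
    f"The email message is {label.capitalize()}: {message}"
    for message, label in zip(sample_messages, predicted_labels)
]

for line in report_lines:
    print(line)
```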
