
# Naive Bayes Classifier
## 1. Defining the question
Implement a Naive Bayes classifier for the given dataset, then apply optimization techniques and compare the performance before and after.
## 2. Defining the metric of success
This exercise will be a success if the model achieves an accuracy of at least 80%.
## 3. Understanding the context
The last column of 'spambase.data' denotes whether the e-mail was 
considered spam (1) or not (0), i.e. unsolicited commercial e-mail.  
Most of the attributes indicate whether a particular word or
character was frequently occurring in the e-mail.  The run-length
attributes (55-57) measure the length of sequences of consecutive 
capital letters.  For the statistical measures of each attribute, 
see the end of this file.
## 4. Defining the experimental designs

*   Load the dataset
*   Preview and explore the dataset
*   EDA
*   Data cleaning
*   Modelling using Naive Bayes classifier for different test sizes
*   Evaluation of the model


## 6. Challenging the solution
Apply optimization techniques and compare the performances
## 7. Conclusion

## Import necessary libraries

In [1]:
# libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import Normalizer

## Load the dataset

In [2]:
df = pd.read_csv('/content/spambase.data', header = None)

## Preview and Explore the dataset

In [3]:
# head
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [4]:
# TAIL
df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
4596,0.31,0.0,0.62,0.0,0.0,0.31,0.0,0.0,0.0,0.0,...,0.0,0.232,0.0,0.0,0.0,0.0,1.142,3,88,0
4597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.353,0.0,0.0,1.555,4,14,0
4598,0.3,0.0,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.102,0.718,0.0,0.0,0.0,0.0,1.404,6,118,0
4599,0.96,0.0,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.057,0.0,0.0,0.0,0.0,1.147,5,78,0
4600,0.0,0.0,0.65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.125,0.0,0.0,1.25,5,40,0


In [5]:
# data types
df.dtypes

0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
30    float64
31    float64
32    float64
33    float64
34    float64
35    float64
36    float64
37    float64
38    float64
39    float64
40    float64
41    float64
42    float64
43    float64
44    float64
45    float64
46    float64
47    float64
48    float64
49    float64
50    float64
51    float64
52    float64
53    float64
54    float64
55      int64
56      int64
57      int64
dtype: object

In [6]:
# info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4601 non-null   float64
 1   1       4601 non-null   float64
 2   2       4601 non-null   float64
 3   3       4601 non-null   float64
 4   4       4601 non-null   float64
 5   5       4601 non-null   float64
 6   6       4601 non-null   float64
 7   7       4601 non-null   float64
 8   8       4601 non-null   float64
 9   9       4601 non-null   float64
 10  10      4601 non-null   float64
 11  11      4601 non-null   float64
 12  12      4601 non-null   float64
 13  13      4601 non-null   float64
 14  14      4601 non-null   float64
 15  15      4601 non-null   float64
 16  16      4601 non-null   float64
 17  17      4601 non-null   float64
 18  18      4601 non-null   float64
 19  19      4601 non-null   float64
 20  20      4601 non-null   float64
 21  21      4601 non-null   float64
 22  

In [10]:
# missing values
print(df.isnull().sum().sum())

print('There are no missing values')


0
There are no missing values


In [12]:
# duplicate rows
duplicates = df.duplicated().sum()
print(duplicates)

print('There are', duplicates, 'duplicate rows')

391
There are 391 duplicate rows


There are duplicate rows. However, they are plausibly genuine observations rather than recording errors, since the same spam message can be sent to many recipients.

Therefore, we will not drop them.
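
One way to check this reasoning is to see how the repeated rows split across the two classes. The following is a minimal sketch, assuming the label sits in a known column (column 57 in this notebook). The helper name `duplicates_by_class` and the toy frame are illustrative and not part of the original analysis.

```python
import pandas as pd

def duplicates_by_class(frame: pd.DataFrame, label_col) -> pd.Series:
    """Count rows that repeat an earlier row, grouped by class label."""
    is_dup = frame.duplicated()  # True for every repeat after the first occurrence
    return frame.loc[is_dup, label_col].value_counts().sort_index()

# Toy data standing in for the real frame (hypothetical values):
# column 0 is a feature, column 1 is the label.
toy = pd.DataFrame({0: [0.1, 0.1, 0.5, 0.5, 0.5, 0.9],
                    1: [1, 1, 0, 0, 0, 1]})
dup_counts = duplicates_by_class(toy, label_col=1)
print(dup_counts)
```

On the real data the call would be `duplicates_by_class(df, 57)`. If most repeats carry label 1, that would support the idea that they are the same spam message reaching several recipients.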

## EDA


In [None]:
# describe
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


A proper description of the dataset columns would be needed in order to perform more extensive exploratory analysis.

Class distribution (from the documentation):
*   Spam      1813  (39.4%)
*   Non-Spam  2788  (60.6%)
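
The documented class distribution can also be recomputed from the label column. The sketch below rebuilds a label series from the documented counts so that it runs on its own. The helper name `class_distribution` is an assumption introduced here.

```python
import pandas as pd

def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each class label."""
    counts = labels.value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

# Stand-in labels built from the documented counts:
# 2788 non-spam (0) and 1813 spam (1).
labels = pd.Series([0] * 2788 + [1] * 1813)
dist = class_distribution(labels)
print(dist)
```

In the notebook itself the call would be `class_distribution(df[57])`. The classes are moderately imbalanced, which is worth keeping in mind when reading accuracy figures.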



## Data Cleaning

### Validity
One data cleaning step that would be valuable for this dataset is renaming the columns to their actual (meaningful) names. However, we have not been provided with those names in the documentation.

### Completeness
There are no missing values

### Consistency
There are duplicate rows. However, as stated earlier, they are possibly genuine observations and we are not going to drop them. Spam messages received by different people could be identical.

In fact, this will help in identifying the most frequent spam messages that people receive.

## Modelling using Naive Bayes Classifier
Naive Bayes classifiers come in three main variants: Gaussian, Multinomial and Bernoulli.
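
All three variants are available in `sklearn.naive_bayes`. The sketch below fits each one on synthetic, non-negative count-like data, which is a hypothetical stand-in and not the spam dataset, only to show that they share the same fit/score interface. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic, non-negative count-like features (hypothetical stand-in data)
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
rates = np.where(y_demo[:, None] == 1, 3.0, 1.0)  # class 1 has higher rates
X_demo = rng.poisson(lam=rates, size=(400, 5))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

variants = {
    "gaussian": GaussianNB(),        # continuous features, normal within each class
    "multinomial": MultinomialNB(),  # non-negative counts or frequencies
    "bernoulli": BernoulliNB(),      # binary features (values binarized at 0 by default)
}
scores = {name: est.fit(X_tr, y_tr).score(X_te, y_te)
          for name, est in variants.items()}
print(scores)
```

Which variant suits a dataset depends on how its features are encoded. The scores printed here say nothing about the spam data.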



### Gaussian Naive Bayes Classifier

This type of classifier assumes that each feature is normally distributed within each class, so it is best suited to cases where all features are continuous.
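
Written out, the standard Gaussian Naive Bayes model combines a class prior with one normal density per feature. The symbols $\mu_{iy}$ and $\sigma_{iy}^2$ (the mean and variance of feature $i$ within class $y$) are notation introduced here for illustration.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right)
$$

The predicted class is the $y$ that maximizes this expression. The "naive" part is the product over features, which treats them as independent given the class.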

In [None]:
# All 57 features are numeric, so we use the Gaussian variant.
# Column 57 is the spam label; the remaining columns are the features.
X = df.drop(57, axis = 1)
y = df[57]

80-20 split

In [None]:
# Splitting our data into a training set and a test set
# 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6) 

In [None]:
# Training our model
# 
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

In [None]:
# Predicting our test predictors
predicted = model.predict(X_test)




In [None]:
# evaluation
accuracy = accuracy_score(y_test, predicted)
matrix = confusion_matrix(y_test, predicted)

print('........accuracy...........')
print(accuracy)
print('........confusion matrix...........')
print(matrix)

........accuracy...........
0.8154180238870793
........confusion matrix...........
[[403 156]
 [ 14 348]]
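
Accuracy alone hides how the errors are distributed. The sketch below derives precision, recall and specificity from the confusion matrix reported above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, with class 0 (non-spam) first.

```python
import numpy as np

# Confusion matrix reported for the 80-20 split.
# Rows = true class, columns = predicted class, class 0 (non-spam) first.
matrix = np.array([[403, 156],
                   [14, 348]])
tn, fp, fn, tp = matrix.ravel()

accuracy = (tp + tn) / matrix.sum()
precision = tp / (tp + fp)    # share of predicted spam that really is spam
recall = tp / (tp + fn)       # share of actual spam that was caught
specificity = tn / (tn + fp)  # share of non-spam that was left alone

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f}")
```

Under that convention, the model catches almost all spam (14 missed) but flags 156 of the 559 legitimate e-mails as spam, which matters for a spam filter even though the overall accuracy clears 80%.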


70-30 split

In [None]:
# Splitting our data into a training set and a test set
# 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6) 

# Training our model
# 
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

# Predicting our test predictors
predicted = model.predict(X_test)

# evaluation
accuracy = accuracy_score(y_test, predicted)
matrix = confusion_matrix(y_test, predicted)

print('........accuracy...........')
print(accuracy)
print('........confusion matrix...........')
print(matrix)

........accuracy...........
0.8233164373642288
........confusion matrix...........
[[619 227]
 [ 17 518]]


60-40 split

In [None]:
# Splitting our data into a training set and a test set
# 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=6) 

# Training our model
# 
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

# Predicting our test predictors
predicted = model.predict(X_test)

# evaluation
accuracy = accuracy_score(y_test, predicted)
matrix = confusion_matrix(y_test, predicted)

print('........accuracy...........')
print(accuracy)
print('........confusion matrix...........')
print(matrix)

........accuracy...........
0.8288973384030418
........confusion matrix...........
[[816 289]
 [ 26 710]]


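The accuracy of the model increases slightly as the test size increases (about 81.5%, 82.3% and 82.9%).

The differences are small and each comes from a single random split, so they may reflect sampling variation rather than overfitting. Diagnosing overfitting would require comparing training accuracy with test accuracy.

The recommended way to train a model is to keep the training set large so that the model has enough data to learn from.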
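
A sketch of how the split comparison could be made more robust: record training and test accuracy side by side for each test size, and use k-fold cross-validation for an estimate that does not hinge on one split. It runs on synthetic data from `make_classification`, a hypothetical stand-in for the spam features, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the spam features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     random_state=6)

# Train vs test accuracy for each test size; a large gap between the two
# would point to overfitting.
split_scores = {}
for test_size in (0.2, 0.3, 0.4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=6
    )
    model = GaussianNB().fit(X_tr, y_tr)
    split_scores[test_size] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

# 5-fold cross-validation averages over several splits instead of one
cv_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5)

for size, (train_acc, test_acc) in split_scores.items():
    print(f"test_size={size}: train={train_acc:.3f} test={test_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

In the notebook, `X_demo` and `y_demo` would be replaced by `X` and `y`. The spread of the cross-validation scores gives a sense of how much of the difference between splits is noise.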

## Optimizing the model

By normalizing the data

In [None]:
# Separate the features from the label
X = df.drop(57, axis = 1)
y = df[57]

In [None]:
# Create normalizer
normalizer = Normalizer(norm='l2')

# Transform feature matrix
# Note: the result is only displayed, not assigned, so X itself is unchanged
normalizer.transform(X)

array([[0.00000000e+00, 2.24834975e-03, 2.24834975e-03, ...,
        1.31950026e-02, 2.14295835e-01, 9.76626921e-01],
       [2.03297105e-04, 2.71062806e-04, 4.84040726e-04, ...,
        4.95076854e-03, 9.77762266e-02, 9.95187732e-01],
       [2.59683978e-05, 0.00000000e+00, 3.07292708e-04, ...,
        4.25059392e-03, 2.09911216e-01, 9.77710178e-01],
       ...,
       [2.53812946e-03, 0.00000000e+00, 2.53812946e-03, ...,
        1.18784459e-02, 5.07625892e-02, 9.98330921e-01],
       [1.22759766e-02, 0.00000000e+00, 0.00000000e+00, ...,
        1.46672345e-02, 6.39373781e-02, 9.97423098e-01],
       [0.00000000e+00, 0.00000000e+00, 1.59858723e-02, ...,
        3.07420620e-02, 1.22968248e-01, 9.83745986e-01]])

Evaluate the performance of the model after normalizing the data

In [None]:
# Splitting our data into a training set and a test set
# 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6) 

# Training our model
# 
clf = GaussianNB()  
model = clf.fit(X_train, y_train) 

# Predicting our test predictors
predicted = model.predict(X_test)

# evaluation
accuracy = accuracy_score(y_test, predicted)
matrix = confusion_matrix(y_test, predicted)

print('........accuracy...........')
print(accuracy)
print('........confusion matrix...........')
print(matrix)

........accuracy...........
0.8154180238870793
........confusion matrix...........
[[403 156]
 [ 14 348]]


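The accuracy is still the same (~81.5%), and so is the confusion matrix. This is expected: the output of normalizer.transform(X) was never assigned back to X, so the model was trained on the original, unnormalized features.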
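
For the normalization to influence the model, the transformed array has to be kept and passed to the split. A minimal sketch on synthetic data, which is a hypothetical stand-in for the spam features, so no accuracy for the spam data is implied:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import Normalizer

# Synthetic stand-in for the spam features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=6)

# transform returns a new array; keep it, otherwise nothing changes
X_norm = Normalizer(norm="l2").transform(X_demo)

# Normalizer rescales each row (sample) to unit L2 length
row_norms = np.linalg.norm(X_norm, axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_norm, y_demo, test_size=0.2, random_state=6
)
acc = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
print(f"accuracy on normalized stand-in data: {acc:.3f}")
```

Note that `Normalizer` works per row, not per column. On the spam data the large run-length columns dominate each row, as the array printed earlier shows with last-column values close to 1, so a column-wise scaler may be closer to what is wanted. That choice is left open here.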

## Follow up questions

1. Did we have the right data?
Yes, but there is a need for a better description of each of the columns/features in order to gain a better understanding of the data.

## Conclusion

We accept the ~81% accuracy, since it still meets our success criterion of at least 80% accuracy.