# Data understanding
1. Define the question

  Perform classification of the testing set samples using the Naive Bayes   Classifier on the data and compute the accuracy.

2. Metrics for success

  The project will be considered successful if we are able to get a high accuracy score after comparing the different test sizes, with the Optimal Naive Bayes Classifier.

3. Recording the experimental design

  The following steps will be followed during this exercise:

 * Data Understanding
 * Data Preparation
 * Data Cleaning
 * Multivariate Analysis
 * Modelling with Naive Bayes Classifier
 * Evaluation

## Data preparation

  Importing necessary libraries

In [23]:
# Import all necessary libraries. 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.ensemble import GradientBoostingClassifier

Reading Data

In [24]:
# Loading our datset
df = pd.read_csv('spambase.data')
#preview our first 5 data
df.head(5)

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4600 non-null   float64
 1   0.64    4600 non-null   float64
 2   0.64.1  4600 non-null   float64
 3   0.1     4600 non-null   float64
 4   0.32    4600 non-null   float64
 5   0.2     4600 non-null   float64
 6   0.3     4600 non-null   float64
 7   0.4     4600 non-null   float64
 8   0.5     4600 non-null   float64
 9   0.6     4600 non-null   float64
 10  0.7     4600 non-null   float64
 11  0.64.2  4600 non-null   float64
 12  0.8     4600 non-null   float64
 13  0.9     4600 non-null   float64
 14  0.10    4600 non-null   float64
 15  0.32.1  4600 non-null   float64
 16  0.11    4600 non-null   float64
 17  1.29    4600 non-null   float64
 18  1.93    4600 non-null   float64
 19  0.12    4600 non-null   float64
 20  0.96    4600 non-null   float64
 21  0.13    4600 non-null   float64
 22  

In [26]:
df.shape

(4600, 58)

Spam dataframe has 4600 columns and 58 rows

In [27]:
df.describe()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,0.7,0.64.2,0.8,0.9,0.10,0.32.1,0.11,1.29,1.93,0.12,0.96,0.13,0.14,0.15,0.16,0.17,0.18,0.19,0.20,0.21,0.22,0.23,0.24,0.25,0.26,0.27,0.28,0.29,0.30,0.31,0.32.2,0.33,0.34,0.35,0.36,0.37,0.38,0.39,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
count,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0,4600.0
mean,0.104576,0.212922,0.280578,0.065439,0.312222,0.095922,0.114233,0.105317,0.090087,0.239465,0.059837,0.54168,0.09395,0.058639,0.049215,0.248833,0.142617,0.184504,1.662041,0.085596,0.809728,0.121228,0.101667,0.094289,0.549624,0.265441,0.767472,0.124872,0.098937,0.102874,0.064767,0.047059,0.09725,0.047846,0.105435,0.097498,0.136983,0.013204,0.078646,0.064848,0.043676,0.132367,0.046109,0.079213,0.301289,0.179863,0.005446,0.031876,0.038583,0.139061,0.01698,0.26896,0.075827,0.044248,5.191827,52.17087,283.290435,0.393913
std,0.305387,1.2907,0.50417,1.395303,0.672586,0.27385,0.39148,0.401112,0.278643,0.644816,0.201565,0.861791,0.301065,0.335219,0.258871,0.825881,0.444099,0.53093,1.775669,0.509821,1.200938,1.025866,0.350321,0.442681,1.671511,0.887043,3.367639,0.538631,0.593389,0.456729,0.403435,0.328594,0.555966,0.32948,0.532315,0.402664,0.423493,0.220675,0.434718,0.349953,0.361243,0.7669,0.223835,0.622042,1.011787,0.911214,0.076283,0.285765,0.243497,0.270377,0.109406,0.815726,0.245906,0.429388,31.732891,194.912453,606.413764,0.488669
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,1.31,0.0,0.22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065,0.0,0.0,0.0,0.0,2.2755,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.3825,0.0,0.0,0.0,0.0,0.16,0.0,0.8,0.0,0.0,0.0,0.1,0.0,0.0,2.64,0.0,1.27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.0,0.0,0.0,0.0,0.188,0.0,0.31425,0.052,0.0,3.70525,43.0,265.25,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,2.61,9.67,5.55,10.0,4.41,20.0,7.14,9.09,18.75,18.18,11.11,17.1,5.45,12.5,20.83,16.66,33.33,9.09,14.28,5.88,12.5,4.76,18.18,4.76,20.0,7.69,6.89,8.33,11.11,4.76,7.14,14.28,3.57,20.0,21.42,22.05,2.17,10.0,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [28]:
#We need to rename our data with  provided data

column_names = ['word_freq_make'  ,'word_freq_address' ,'word_freq_all' ,'word_freq_3d' ,'word_freq_our'        
,'word_freq_over' ,'word_freq_remove' ,'word_freq_internet' ,'word_freq_order' ,'word_freq_mail','word_freq_receive' ,'word_freq,_will' ,'word_freq_people' ,'word_freq_report' ,'word_freq_addresses'  
,'word_freq_free' ,'word_freq_business' ,'word_freq_email' ,'word_freq_you' ,'word_freq_credit'     
,'word_freq_your' ,'word_freq_font' ,'word_freq_000' ,'word_freq_money' ,'word_freq_hp' ,'word_freq_hpl' ,'word_freq_george' ,'word_freq_650' ,'word_freq_lab' ,'word_freq_labs'       
,'word_freq_telnet' ,'word_freq_857' ,'word_freq_data' ,'word_freq_415' ,'word_freq_85' ,'word_freq_technology' 
,'word_freq_1999' ,'word_freq_parts' ,'word_freq_pm' ,'word_freq_direct' ,'word_freq_cs' ,'word_freq_meeting' ,'word_freq_original' ,'word_freq_project' ,'word_freq_re' ,'word_freq_edu' ,'word_freq_table' ,'word_freq_conference' 
,'char_freq_;' ,'char_freq_(' ,'char_freq_[' ,'char_freq_!' ,'char_freq_$' ,'char_freq_#','capital_run_length_average'
,'capital_run_length_longest' ,'capital_run_length_total', 'class']

df.columns = column_names
df.head(3)

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,word_freq_receive,"word_freq,_will",word_freq_people,word_freq_report,word_freq_addresses,word_freq_free,word_freq_business,word_freq_email,word_freq_you,word_freq_credit,word_freq_your,word_freq_font,word_freq_000,word_freq_money,word_freq_hp,word_freq_hpl,word_freq_george,word_freq_650,word_freq_lab,word_freq_labs,word_freq_telnet,word_freq_857,word_freq_data,word_freq_415,word_freq_85,word_freq_technology,word_freq_1999,word_freq_parts,word_freq_pm,word_freq_direct,word_freq_cs,word_freq_meeting,word_freq_original,word_freq_project,word_freq_re,word_freq_edu,word_freq_table,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,class
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1


## Data Cleaning

In [29]:
# Checking for the missing values
df.isna().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq,_will               0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

There are no missing values

### Multivariate Analysis

In [30]:
# selecting the data between predictor and target
X = df.drop('class', axis=1)
y = df['class'].values

In [31]:
#splitting the data
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=10)

In [32]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [33]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=13)

X_train = lda.fit_transform(X_train,y_train)
X_test = lda.transform(X_test)



In [34]:
#preview our first five LDA factors
lda_factors = pd.DataFrame(index = X.columns.values, data = lda.coef_[0].T)
lda_factors.sort_values(0, ascending = False).head(5)

Unnamed: 0,0
word_freq_remove,0.768917
word_freq_free,0.606238
word_freq_our,0.600923
word_freq_your,0.596977
word_freq_000,0.591661


# Modelling

We will first create a base model

### Base modelling

In [35]:
# Defining our predictor and target variables
X = df.drop(['class'], axis = 1)
y = df['class']

# Spliting our dataset

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2,random_state=0)
# Scaling predictor variables
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

# Fitting the data
regressor = LogisticRegression()
regressor.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [36]:
y_pred = regressor.predict(X_test)

# Getting the score of the baseline model. 
cm = confusion_matrix(y_test, y_pred)
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print ('The accuracy score: ', accuracy_score(y_test, y_pred))
print('The confusuion matrix is: ', "\n",cm)


Root Mean Squared Error: 0.2553769592276246
Mean Squared Error: 0.06521739130434782
The accuracy score:  0.9347826086956522
The confusuion matrix is:  
 [[513  25]
 [ 35 347]]


The model has a low RMSE of 0.2553 hence its a good fit


## Naive Bayes Model

In [37]:
#Use Gausian Naive Bayes to fit your model
clf = GaussianNB()  
model = clf.fit(X_train, y_train)

In [38]:
#Predicting 
y_pred = model.predict(X_test)
print ('The accuracy score: ', accuracy_score(y_test, y_pred))

The accuracy score:  0.8402173913043478


The Gausian Naive Bayes model has an accuracy score of 84%

### Hyperparameter Tuning

Lets tune the model to see if the accuracy score will increase

In [39]:
cv_method = RepeatedStratifiedKFold(n_splits=5,  n_repeats=3, random_state=999)
from sklearn.model_selection import GridSearchCV
params_NB = {'var_smoothing': np.logspace(0,-7, num=100)}
gs_NB = GridSearchCV(estimator=model, param_grid=params_NB, cv=cv_method,verbose=1,scoring='accuracy')
gs_NB.fit(X_test, y_test);

Fitting 15 folds for each of 100 candidates, totalling 1500 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1500 out of 1500 | elapsed:    3.5s finished


In [40]:
# Predict
y_pred = gs_NB.predict(X_test)
# Check accuracy Score on test dataset
accuracy_score = accuracy_score(y_test,y_pred)
print('The accuracy_score is: ', accuracy_score)

The accuracy_score is:  0.8369565217391305


The tuned model has an accuracy score of 83% which is low than the previous model. Hence we will opt to use the untuned model

In [41]:
# Create a list to hold the test size
test_size = [0.1, 0.2, 0.3, 0.4, 0.5]

for test in test_size:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test, random_state = 0)
    
    # Fitting to the classifier
    clf = GaussianNB()  
    model = clf.fit(X_train, y_train)
    # Predicting our test predictors
    predicted = model.predict(X_test)

    print("The accuracy score of test size {} is:".format(test), (metrics.accuracy_score(y_test, predicted)*100))


The accuracy score of test size 0.1 is: 82.3913043478261
The accuracy score of test size 0.2 is: 83.04347826086956
The accuracy score of test size 0.3 is: 82.82608695652173
The accuracy score of test size 0.4 is: 82.01086956521739
The accuracy score of test size 0.5 is: 81.91304347826087


The model works best at 20% of the test size since it has the highest accuracy score

### Challenging the solution

In [45]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=100,max_depth=5)

gbc.fit(X_train,y_train)

#predict for target on train dataset
y_pred_target = gbc.predict(X_train)
print('Target predict on train: ',y_pred_target)

#predict for target on test dataset
y_pred_test = gbc.predict(X_test)
print('Target predict on test: ',y_pred_test)


Target predict on train:  [0 0 1 ... 1 0 0]
Target predict on test:  [1 0 0 ... 1 1 0]


In [46]:
print(confusion_matrix(y_test,y_pred_target))
print(classification_report(y_test,y_pred_target))

[[857 507]
 [581 355]]
              precision    recall  f1-score   support

           0       0.60      0.63      0.61      1364
           1       0.41      0.38      0.39       936

    accuracy                           0.53      2300
   macro avg       0.50      0.50      0.50      2300
weighted avg       0.52      0.53      0.52      2300



In [47]:
print(confusion_matrix(y_test,y_pred_test))
print(classification_report(y_test,y_pred_test))

[[1317   47]
 [  62  874]]
              precision    recall  f1-score   support

           0       0.96      0.97      0.96      1364
           1       0.95      0.93      0.94       936

    accuracy                           0.95      2300
   macro avg       0.95      0.95      0.95      2300
weighted avg       0.95      0.95      0.95      2300



Conclusion

---

The GradientBoostingClassifier model has an accuracy score of 98% and 95% on train and test datasets respectively.

It aslo has a 61% and 95% f1 score on train and test dataset respectively