### Classification Task for Spambase Data Set

I am HsinYu Chang, currently a master's student in Data Informatics at USC. This notebook walks through a classification task on the Spambase data set from the UCI Machine Learning Repository. Below are the steps I followed and some details I would like to share. Enjoy!

##### Downloading and Preprocessing the Spambase data set

To start the classification task, I downloaded the Spambase data set and shuffled it so that any ordering introduced during data collection would not bias the results. The spam rate in the data set is roughly 40%, so the classes are balanced enough to train a classifier without resampling.

In [1]:
import pandas
import numpy as np
from random import sample
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

In [2]:
inputfile = 'spambase.data'
spam = np.loadtxt(inputfile, delimiter = ',')
np.random.shuffle(spam)
print('Spam rate in dataset: %.3f' %(sum(spam[:, -1]==1)/len(spam)))

Spam rate in dataset: 0.394


##### Splitting the Data Set into Training Set and Test Set

First, I split the data set 70%-30%, holding out 30% as the test set to evaluate the performance of the final model. Before training any model, I checked a dummy classifier to get a baseline: the error rates obtained by simply guessing. With that baseline, it is easier to judge how good the final model really is.

In [3]:
trainsize = int(0.7 * len(spam))
traini = np.array(sample(range(0, len(spam)), trainsize))

In [4]:
X = spam[traini, :-1]
y = spam[traini, -1]

print('Spam rate in train set: %.3f' % (sum(y==1)/len(y)))

Spam rate in train set: 0.396


In [5]:
testX = np.delete(spam[:, :-1], traini, 0)
testy = np.delete(spam[:, -1], traini, 0)

print('Spam rate in test set: %.3f' % (sum(testy==1)/len(testy)))

Spam rate in test set: 0.390
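
As an alternative to sampling the indices by hand, the same 70%-30% split can be sketched with scikit-learn's `train_test_split`. This is not what the notebook above does; it is a minimal illustration on a synthetic stand-in array (the names `demo`, `demoX`, `demoTestX` and the random data are assumptions made for the example). Passing `stratify` keeps the spam rate nearly identical in both parts, and `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Spambase array: 57 feature columns plus a 0/1 label column.
rng = np.random.default_rng(0)
features = rng.random((1000, 57))
labels = (rng.random(1000) < 0.4).astype(float)
demo = np.column_stack([features, labels])

# stratify preserves the class proportions in both parts;
# random_state fixes the shuffle so the split can be reproduced.
demoX, demoTestX, demoy, demoTesty = train_test_split(
    demo[:, :-1], demo[:, -1],
    test_size=0.3, stratify=demo[:, -1], random_state=0)

print('train spam rate: %.3f, test spam rate: %.3f' % (demoy.mean(), demoTesty.mean()))
```

With the real data, `demo` would simply be replaced by the loaded `spam` array.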


In [6]:
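# random guessing by training class frequencies; newer scikit-learn releases default to 'prior'
dummy = DummyClassifier(strategy='stratified')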
dummy.fit(X, y)
predicty = dummy.predict(testX)
tn, fp, fn, tp = confusion_matrix(testy, predicty).ravel()
print('(Dummy Model) FPR: %.3f, FNR: %.3f, Overall Error: %.3f' % (fp/(fp+tn), fn/(fn+tp), (fn+fp)/(tn+fp+fn+tp)))

(Dummy Model) FPR: 0.390, FNR: 0.610, Overall Error: 0.476
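
The three rates printed in this notebook are all derived from the confusion matrix: FPR = FP / (FP + TN), FNR = FN / (FN + TP), and overall error = (FP + FN) / total. The sketch below spells that out in plain Python so the definitions are easy to check by hand. The helper name `error_rates` and the toy labels are hypothetical and only serve the illustration; label 1 (spam) is treated as the positive class.

```python
def error_rates(y_true, y_pred):
    """Return (FPR, FNR, overall error), treating label 1 (spam) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fpr = fp / (fp + tn)          # share of non-spam wrongly flagged as spam
    fnr = fn / (fn + tp)          # share of spam that slipped through
    overall = (fp + fn) / len(pairs)
    return fpr, fnr, overall

# Toy example: 4 non-spam and 4 spam messages.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]
print(error_rates(toy_true, toy_pred))  # (0.25, 0.5, 0.375)
```

If the dummy model guesses spam with a probability equal to the training spam rate (about 0.396), one would expect an FPR near 0.396 and an FNR near 0.604, which is in line with the baseline numbers reported above.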


##### Hyperparameter Searching on Training Set to Build Multinomial Naive Bayes Model 

Next, I built a classifier based on Multinomial Naive Bayes, which makes the strong assumption that all features are conditionally independent of one another given the class. To choose the hyperparameters (alpha and the class prior settings), I ran a randomized search over a small grid on the training set, sampling 10 parameter combinations and scoring each with 5-fold cross-validation using the F1 score.
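
For reference, the decision rule behind Multinomial Naive Bayes with additive smoothing can be written as follows. This is the standard textbook form (the same one given in the scikit-learn documentation), not something derived in this notebook:

$$
\hat{y} = \arg\max_{y \in \{0, 1\}} \left[ \log P(y) + \sum_{i=1}^{n} x_i \log \hat{\theta}_{yi} \right],
\qquad
\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}
$$

Here $x_i$ is the value of feature $i$, $N_{yi}$ is the sum of feature $i$ over the training samples of class $y$, $N_y = \sum_{i} N_{yi}$, and $n$ is the number of features. The smoothing parameter $\alpha$ and the class prior $P(y)$ are exactly the quantities tuned by the hyperparameter search.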

In [7]:
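# hyperparameter search space for MultinomialNB
# class_prior is ordered as [P(non-spam), P(spam)]; when it is not None, fit_prior is ignored
paramDict = {'alpha': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
             'fit_prior': [True, False],
             'class_prior': [None, [.4, .6], [.3, .7], [.2, .8]]}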

In [8]:
niter = 10
multiNB = MultinomialNB()
paraSearch = RandomizedSearchCV(multiNB, param_distributions = paramDict, n_iter=niter, cv=5, scoring = 'f1')
paraSearch.fit(X, y)
bestPara = paraSearch.best_params_  # parameters of the top-ranked candidate

In [9]:
bestPara
#paraSearch.cv_results_

{'fit_prior': True, 'class_prior': [0.3, 0.7], 'alpha': 0.1}
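
The selected parameters contain both `fit_prior` and an explicit `class_prior`. In scikit-learn, an explicit `class_prior` takes precedence, so `fit_prior` has no effect for such combinations. The small check below illustrates this on made-up count data (the arrays `toyX` and `toyy` are assumptions for the example, not the Spambase data).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up non-negative, count-like features with random 0/1 labels.
rng = np.random.default_rng(0)
toyX = rng.integers(0, 5, size=(200, 6))
toyy = rng.integers(0, 2, size=200)

# Same explicit prior, opposite fit_prior settings.
withFit = MultinomialNB(alpha=0.1, fit_prior=True, class_prior=[0.3, 0.7]).fit(toyX, toyy)
noFit = MultinomialNB(alpha=0.1, fit_prior=False, class_prior=[0.3, 0.7]).fit(toyX, toyy)

# Both models end up with the supplied prior, ordered like withFit.classes_ (0 first, then 1).
print(np.exp(withFit.class_log_prior_))
print(np.allclose(withFit.class_log_prior_, noFit.class_log_prior_))
```

In practice this means the search space has fewer distinct models than the number of parameter combinations suggests.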

##### Building the Multinomial Naive Bayes Model Step by Step with 5-Fold Cross-Validation

In [10]:
result =[] 
kfold = KFold(n_splits=5, shuffle = True)

In [11]:
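# foldTraini/foldValidi avoid overwriting traini, which holds the train/test split indices
for foldTraini, foldValidi in kfold.split(X):
    trainX, validX = X[foldTraini, :], X[foldValidi, :]
    trainy, validy = y[foldTraini], y[foldValidi]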
    multiNB = MultinomialNB(alpha = bestPara['alpha'], 
                            fit_prior = bestPara['fit_prior'], 
                            class_prior = bestPara['class_prior']   )
    multiNB.fit(trainX, trainy)
    predicty = multiNB.predict(validX)
    tn, fp, fn, tp = confusion_matrix(validy, predicty).ravel()
    result.append([fp/(fp+tn), fn/(fn+tp), (fn+fp)/(tn+fp+fn+tp)])

In [12]:
avgrate = np.mean(result, axis = 0)
finalresult = np.vstack((np.array(result),avgrate))
resultDF = pandas.DataFrame(finalresult)
resultDF.columns = ['FPR', 'FNR', 'Overall Err']
resultDF.index = ['fold %s' % str(i) for i in range(1,6)] + ['Average']

In [13]:
print(resultDF)

              FPR       FNR  Overall Err
fold 1   0.147059  0.254237     0.186335
fold 2   0.207447  0.287313     0.240683
fold 3   0.175060  0.268722     0.208075
fold 4   0.208000  0.211896     0.209627
fold 5   0.205962  0.243636     0.222050
Average  0.188706  0.253161     0.213354
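
The manual loop above can also be expressed more compactly with `cross_validate` and custom scorers. The following is a sketch of that alternative on made-up data, not a replacement for the results in the table; the scorer functions `fpr`, `fnr`, and `overall_err` and the toy arrays are hypothetical names introduced for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import MultinomialNB

def fpr(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn)

def fnr(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn / (fn + tp)

def overall_err(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (fp + fn) / (tn + fp + fn + tp)

# Made-up count-like data standing in for the training set.
rng = np.random.default_rng(0)
toyX = rng.integers(0, 5, size=(200, 6))
toyy = rng.integers(0, 2, size=200)

scoring = {'fpr': make_scorer(fpr), 'fnr': make_scorer(fnr), 'err': make_scorer(overall_err)}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(MultinomialNB(alpha=0.1), toyX, toyy, cv=cv, scoring=scoring)

# One value per fold is stored under 'test_<name>'; report the average of each.
for name in scoring:
    print('%s: %.3f' % (name, scores['test_' + name].mean()))
```

The per-fold arrays in `scores` correspond to the rows of the table, and their means correspond to the `Average` row.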


##### Training the Final Multinomial Naive Bayes Model on the Full Training Set

In [14]:
multiNB = MultinomialNB(alpha = bestPara['alpha'], fit_prior = bestPara['fit_prior'], class_prior = bestPara['class_prior'])
multiNB.fit(X, y)

MultinomialNB(alpha=0.1, class_prior=[0.3, 0.7], fit_prior=True)

##### Model Evaluation on the Test Set

In [15]:
predicty = multiNB.predict(testX)
tn, fp, fn, tp = confusion_matrix(testy, predicty).ravel()
print('(Multinomial Naive Bayes) FPR: %.3f, FNR: %.3f, Overall Error: %.3f' % (fp/(fp+tn), fn/(fn+tp), (fn+fp)/(tn+fp+fn+tp)))

(Multinomial Naive Bayes) FPR: 0.172, FNR: 0.270, Overall Error: 0.210
