# Naive Bayes
Classifying student success data with the [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) classifier from the sklearn module. The data set comes from UCI's machine learning repository and can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00320/). A description of the data can be found [here](https://archive.ics.uci.edu/ml/datasets/student+performance).

## Import Data
Import the data into a pandas data frame. Create dummy variables for each categorical predictor in the data set and build the design matrix. Create a normalized design matrix as well to compare model performance. Convert the response variable to three classes: *0 (fail)*, *1 (pass)*, and *2 (incomplete)*. A student is placed in the passing class with a score of 10 or higher, in the failing class with a score below 10 but above 0, and in the incomplete class with a score of 0.

In [1]:
import time
import random
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.extmath import cartesian
from sklearn import metrics
from sklearn import preprocessing


df = pd.read_csv('student-por2.csv')
df = pd.get_dummies(df)#, drop_first=True)

def response_conv(arr):
    new = []
    for i in arr:
        if (i > 0 and i < 10):           # condition where student failed
            new.append(0)                 
                                          
        elif (i >= 10):                   # condition where student passed
            new.append(1)                 
    
        else:                             # condition where student received an incomplete
            new.append(2)
    return(new)                           # 1-dimensional response variable returned

X = df.drop('G3', axis=1)                 # this is the design matrix
y = list(df.G3)                           # this is the discrete response vector
y_new = response_conv(y)                  # this is the multinomial response vector
X_norm = preprocessing.normalize(X)       # this is the normalized design matrix (each row scaled to unit L2 norm)

random.seed(42)
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y_new, test_size=0.33, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X_norm, y_new, test_size=0.33, random_state=42)
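
With its default arguments, `preprocessing.normalize` rescales each row (each student) to unit Euclidean length rather than rescaling each column. The following is a minimal pure-Python sketch of that row-wise operation on toy data; the helper name `l2_normalize_rows` and the toy matrix are illustrative assumptions, not part of the notebook.

```python
import math

def l2_normalize_rows(rows):
    """Scale each row to unit Euclidean (L2) length.

    All-zero rows are returned unchanged, since they have no direction to keep.
    """
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        if norm == 0.0:
            normalized.append([float(v) for v in row])
        else:
            normalized.append([v / norm for v in row])
    return normalized

# Hypothetical toy design matrix: three observations, two features
toy = [[3, 4], [0, 0], [1, 0]]
for row in l2_normalize_rows(toy):
    print(row)
```

Because every entry of a normalized row is at most 1, the feature values passed to the classifier are much smaller than the raw counts and grades in the non-normalized matrix.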

## Naive Accuracy
Before we start training and selecting parameters for our model, we must find the distribution of the classes in the response variable. Whichever class is dominant, our model should perform better than simply guessing that class for every observation. For example, if the dominant class is 1 and 1's make up 83% of the response data, then our model should have higher than 83% accuracy.

In [2]:
zero = 0
one = 0
two = 0

for i in y1_test:
    if i == 0:
        zero += 1
    elif i == 1:
        one += 1
    else:
        two += 1

num1 = round((zero/len(y1_test))*100,2)
num2 = round((one/len(y1_test))*100,2)
num3 = round((two/len(y1_test))*100,2)
print("The response vector has the following distribution: \nzeros: %r zeros comprising of %r percent of the response data. \nones: %r ones comprising of %r percent of the response data. \ntwos: %r twos comprising of %r percent of the response data." % (zero,num1,one,num2,two,num3))
print("\n")

The response vector has the following distribution: 
zeros: 23 zeros comprising of 10.7 percent of the response data. 
ones: 187 ones comprising of 86.98 percent of the response data. 
twos: 5 twos comprising of 2.33 percent of the response data.
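
The same baseline can be computed more compactly. This is a small sketch, assuming only the class counts printed above (23 zeros, 187 ones, 5 twos); the function name `majority_baseline` is hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Rebuild a label list with the class counts reported for the test set
labels = [0] * 23 + [1] * 187 + [2] * 5
label, acc = majority_baseline(labels)
print(label, round(acc, 4))  # 1 0.8698
```

Any model worth keeping should beat this figure on the same test set.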




## Optimal Alpha 
Using 10-fold cross-validation, we select an optimal value of alpha for each design matrix from the candidates 0.1, 0.01, 0.001, and 0.0001. The optimal alpha is the one that maximizes the mean accuracy of the model across the folds.

In [3]:
def opt(X,y):
    acc = []
    alphas = 10.0**-np.arange(1,5)
    for a in alphas:
        nb = MultinomialNB(alpha=a)
        scores = cross_val_score(nb, X, y, cv=10, scoring='accuracy')
        acc.append(scores.mean())
    

    opt_ = alphas[acc.index(max(acc))]
    return(opt_)

a1 = opt(X1_train,y1_train)
a3 = opt(X3_train,y3_train)

print("The optimal alpha value is %r for Non-standardized design matrix." % a1)
print("The optimal alpha value is %r for Normalized design matrix." % a3)

The optimal alpha value is 0.10000000000000001 for Non-standardized design matrix.
The optimal alpha value is 0.10000000000000001 for Normalized design matrix.
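
For context, alpha is the additive smoothing parameter of the multinomial model: each per-class feature probability is estimated as (count + alpha) / (total + alpha * number of features). The sketch below illustrates that formula on hypothetical toy counts for a single class; the function name and the counts are assumptions for illustration only.

```python
def smoothed_feature_probs(feature_counts, alpha):
    """Additively smoothed estimates of P(feature | class) for one class."""
    total = sum(feature_counts)
    n_features = len(feature_counts)
    return [(c + alpha) / (total + alpha * n_features) for c in feature_counts]

# Hypothetical counts of three features within one class; the second was never seen
counts = [3, 0, 1]
for alpha in (0.1, 0.01, 0.001, 0.0001):
    probs = smoothed_feature_probs(counts, alpha)
    print(alpha, [round(p, 5) for p in probs])
```

Smaller alphas push the probability of an unseen feature closer to zero, while larger alphas pull all estimates toward a uniform distribution.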


## Fit and Predict
Fit a model for both the non-normalized and the normalized design matrix using the optimal alphas. Predict on the respective test sets and collect the predictions in a data frame for comparison.

In [4]:
nb1 = MultinomialNB(alpha=a1).fit(X1_train,y1_train)
nb3 = MultinomialNB(alpha=a3).fit(X3_train,y3_train)

nb_pred1 = nb1.predict(X1_test)
nb_pred3 = nb3.predict(X3_test)

pred = pd.DataFrame(list(zip(y1_test, nb_pred1, nb_pred3)), columns=['y_act','y_nb','y_nb_norm'])
pred.index.name = 'Obs'

pred

| Obs | y_act | y_nb | y_nb_norm |
|-----|-------|------|-----------|
| 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 |
| 7 | 0 | 0 | 1 |
| 8 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 |


## Results
Accuracy, a confusion matrix, and a classification report are returned for each design matrix. We see that the non-normalized design matrix yields an accuracy (about 85.6%) lower than the naive accuracy (about 87.0%), so it should not be considered as the final model. Nor should the normalized design matrix: with it, our model only guesses the dominant class for every observation, so its accuracy merely equals the naive accuracy. Therefore neither model will be considered for this particular case.
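
These figures can be checked by hand from the confusion-matrix counts reported for the non-normalized model in this section's output. The sketch below assumes the sklearn convention that rows are true classes and columns are predicted classes; the helper names are hypothetical.

```python
# Confusion matrix for the non-normalized model, copied from this section's output
# (rows: true class, columns: predicted class; order Fail(0), Pass(1), Inc(2))
cm = [
    [12, 9, 2],
    [19, 168, 0],
    [1, 0, 4],
]

def accuracy(cm):
    """Share of observations on the diagonal (correct predictions)."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def recall(cm, k):
    """Share of true class-k observations that were predicted as k."""
    return cm[k][k] / sum(cm[k])

def precision(cm, k):
    """Share of class-k predictions that were actually class k."""
    return cm[k][k] / sum(row[k] for row in cm)

print(round(accuracy(cm), 4))      # 0.8558
print(round(recall(cm, 0), 2))     # 0.52
print(round(precision(cm, 0), 2))  # 0.38
```

The low precision and recall on the failing class show that even the model that does not collapse to the dominant class separates failing students poorly.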

In [5]:
cm_nb1 = pd.DataFrame(metrics.confusion_matrix(y1_test, nb_pred1), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_nb3 = pd.DataFrame(metrics.confusion_matrix(y3_test, nb_pred3), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])

print ("The accuracy of the Non-standardized Naive Bayes model is: ", nb1.score(X1_test,y1_test))
print ("\n")
print ("The accuracy of the Normalized Naive Bayes model is: ", nb3.score(X3_test,y3_test))
print ("\n")

print("Non-standardized Naive Bayes Confusion Matrix: \n", cm_nb1)
print ("\n")
print("Normalized Naive Bayes Confusion Matrix: \n", cm_nb3)
print ("\n")

print("Classification report for Non-standardized design matrix:\n", metrics.classification_report(y1_test,nb_pred1))
print("\n")
print("Classification report for Normalized design matrix:\n", metrics.classification_report(y3_test,nb_pred3))

The accuracy of the Non-standardized Naive Bayes model is:  0.855813953488


The accuracy of the Normalized Naive Bayes model is:  0.86976744186


Non-standardized Naive Bayes Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       12        9       2
Pass(1)       19      168       0
Inc(2)         1        0       4


Normalized Naive Bayes Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)        0       23       0
Pass(1)        0      187       0
Inc(2)         0        5       0


Classification report for Non-standardized design matrix:
              precision    recall  f1-score   support

          0       0.38      0.52      0.44        23
          1       0.95      0.90      0.92       187
          2       0.67      0.80      0.73         5

avg / total       0.88      0.86      0.87       215



Classification report for Normalized design matrix:
              precision    recall  f1-score   support

          0       0.00      0.00      0.0

