# Classification and Regression Trees (CART)
Classifying student success data with the [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) from the sklearn module. The data set comes from UCI's machine learning repository and can be downloaded [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00320/). A description of the data can be found [here](https://archive.ics.uci.edu/ml/datasets/student+performance).

## Import Data
Import the data into a pandas dataframe. Get dummy variables for each categorical predictor in the data set and return the design matrix. Create a normalized and a standardized design matrix as well to compare model performance. Convert the response variable to three classes: *0 (fail)*, *1 (pass)*, and *2 (incomplete)*. A student is put into the passing class if they got a score greater than or equal to 10, the failing class if they got a score below 10 but above 0, and the incomplete class if the student received a 0.

In [1]:
import time
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.extmath import cartesian
from sklearn import metrics
from sklearn import preprocessing

df = pd.read_csv('student-por2.csv')
df = pd.get_dummies(df)#, drop_first=True)

def response_conv(arr):
    new = []
    for i in arr:
        if (i > 0 and i < 10):            # condition where student failed
            new.append(0)                 
                                          
        elif (i >= 10):                   # condition where student passed
            new.append(1)                 
    
        else:                             # condition where student received an incomplete
            new.append(2)
    return(new)                           # 1-dimensional array returned

X = df.drop('G3', axis=1)                 # this is the design matrix
y = list(df.G3)                           # this is the discrete response vector
y_new = response_conv(y)                  # this is the multinomial response vector

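clf = DecisionTreeClassifier()            # unpruned tree, used only to rank predictors
clf.fit(X,y)                              # note: fit on the raw G3 grades (y), not the three classes (y_new)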

model = SelectFromModel(clf,prefit=True)
newX = model.transform(X)                 # select most influential predictors

X_scale = preprocessing.scale(newX)       # scaled design matrix
X_norm = preprocessing.normalize(newX)    # normalized design matrix

random.seed(42)
X1_train, X1_test, y1_train, y1_test = train_test_split(newX, y_new, test_size=0.33, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X_scale, y_new, test_size=0.33, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X_norm, y_new, test_size=0.33, random_state=42)
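
The two transformed matrices above differ in the axis they operate on: `preprocessing.scale` standardizes each column (predictor), while `preprocessing.normalize` rescales each row (student) to unit length. The following is a minimal sketch of that difference on a small made-up matrix; `X_toy` and its values are assumptions for illustration and are not taken from the student data. Threshold splits in a tree depend only on the ordering of values within a column, which column-wise standardizing preserves and row-wise normalizing generally does not.

```python
import numpy as np
from sklearn import preprocessing

# Toy design matrix (hypothetical values, not from the student data set)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

X_toy_scale = preprocessing.scale(X_toy)      # per column: zero mean, unit variance
X_toy_norm = preprocessing.normalize(X_toy)   # per row: unit Euclidean (L2) length

col_means = X_toy_scale.mean(axis=0)              # approximately 0 for every column
col_stds = X_toy_scale.std(axis=0)                # approximately 1 for every column
row_norms = np.linalg.norm(X_toy_norm, axis=1)    # approximately 1 for every row

print("column means after scale:", col_means)
print("column stds after scale: ", col_stds)
print("row norms after normalize:", row_norms)
```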



## Naive Accuracy
Before we start training and selecting parameters for our model, we must find the distribution of the classes in the response variable. Whichever class is dominant, our model should perform better than just guessing that class for every observation. For example, if the dominant class is 1 and 1's make up 83% of the response data, then our model should have higher than 83% accuracy.

In [2]:
zero = 0
one = 0
two = 0

for i in y1_test:
    if i == 0:
        zero += 1
    elif i == 1:
        one += 1
    else:
        two += 1

num1 = round((zero/len(y1_test))*100,2)
num2 = round((one/len(y1_test))*100,2)
num3 = round((two/len(y1_test))*100,2)
print("The response vector has the following distribution: \nzeros: %r zeros comprising of %r percent of the response data. \nones: %r ones comprising of %r percent of the response data. \ntwos: %r twos comprising of %r percent of the response data." % (zero,num1,one,num2,two,num3))
print("\n")

The response vector has the following distribution: 
zeros: 23 zeros comprising of 10.7 percent of the response data. 
ones: 187 ones comprising of 86.98 percent of the response data. 
twos: 5 twos comprising of 2.33 percent of the response data.
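
The same baseline can be computed more compactly. The sketch below rebuilds a label list from the class counts printed above (23, 187, and 5) and takes the share of the most frequent class as the naive accuracy; `y_test_demo` is a stand-in name for `y1_test`, introduced here only for illustration.

```python
from collections import Counter

# Class counts reported above for the test split: 23 fails, 187 passes, 5 incompletes
y_test_demo = [0] * 23 + [1] * 187 + [2] * 5

counts = Counter(y_test_demo)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the most frequent class
baseline = majority_count / len(y_test_demo)

print("majority class:", majority_class)
print("naive accuracy: %.4f" % baseline)
```

Any model trained later has to beat this figure on the same test split to be worth using.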




## Optimal Model Parameters
We choose the combination of parameters that maximizes the mean 10-fold cross-validated accuracy. The optimal parameters and the total run time of the cross-validation process are returned.

In [3]:
start_time = time.time()
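# every combination of criterion, splitter, max_features and max_depth to try
# note: max_features='auto' was removed in scikit-learn 1.3; use 'sqrt' on newer versions
combos = cartesian([['gini','entropy'],['best','random'],['auto','log2'],np.arange(1,(X1_train.shape[0]-1))])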

def opt(X,y):
    acc = []

    for c,s,mf,md in combos:
        dt = DecisionTreeClassifier(criterion=c,splitter=s,max_features=mf,max_depth=int(md))
        scores = cross_val_score(dt, X, y, cv=10, scoring='accuracy')
        acc.append(scores.mean())
    
    opt_ = combos[acc.index(max(acc))]
    return(opt_)

c1,s1,mf1,md1 = opt(X1_train,y1_train)
c2,s2,mf2,md2 = opt(X2_train,y2_train)
c3,s3,mf3,md3 = opt(X3_train,y3_train)

print ("The optimal criterion, splitter, max_features and max_depth are %s, %s, %s, and %r respectively for Non-standardized design matrix." % (str(c1),str(s1),str(mf1),int(md1)))
print ("The optimal criterion, splitter, max_features and max_depth are %s, %s, %s, and %r respectively for Standardized design matrix." % (str(c2),str(s2),str(mf2),int(md2)))
print ("The optimal criterion, splitter, max_features and max_depth are %s, %s, %s, and %r respectively for Normalized design matrix." % (str(c3),str(s3),str(mf3),int(md3)))
print("Run time: %r minutes" % (int(time.time() - start_time)/60))

The optimal criterion, splitter, max_features and max_depth are entropy, best, auto, and 398 respectively for Non-standardized design matrix.
The optimal criterion, splitter, max_features and max_depth are gini, best, log2, and 167 respectively for Standardized design matrix.
The optimal criterion, splitter, max_features and max_depth are entropy, best, auto, and 112 respectively for Normalized design matrix.
Run time: 2.6 minutes
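
The manual loop above is an exhaustive grid search, and scikit-learn's `GridSearchCV` performs the same kind of search. The sketch below is one way it might look, not the method used in this notebook. It makes several assumptions: the data is synthetic (`make_classification` stands in for `X1_train` and `y1_train`), `'sqrt'` replaces `'auto'`, the depth range is kept deliberately small, 5 folds are used instead of 10 to keep it quick, and `random_state` is fixed so the result is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for X1_train, y1_train (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, n_informative=4,
                                     n_classes=3, random_state=42)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_features': ['sqrt', 'log2'],   # 'sqrt' stands in for the older 'auto'
    'max_depth': list(range(1, 6)),     # deliberately small range for the demo
}

# Cross-validate every combination and keep the one with the best mean accuracy
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

Passing `scoring='neg_log_loss'` instead would select on log loss rather than accuracy.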


## Fit and Predict
After tuning the parameters, we fit a tree with the optimal parameters to each design matrix. Predictions are made and returned in a data frame for comparison.

In [4]:
dt1 = DecisionTreeClassifier(criterion=c1,splitter=s1,max_features=mf1,max_depth=int(md1)).fit(X1_train,y1_train)
dt2 = DecisionTreeClassifier(criterion=c2,splitter=s2,max_features=mf2,max_depth=int(md2)).fit(X2_train,y2_train)
dt3 = DecisionTreeClassifier(criterion=c3,splitter=s3,max_features=mf3,max_depth=int(md3)).fit(X3_train,y3_train)

dt_pred1 = dt1.predict(X1_test)
dt_pred2 = dt2.predict(X2_test)
dt_pred3 = dt3.predict(X3_test)

pred = pd.DataFrame(list(zip(y1_test, dt_pred1, dt_pred2, dt_pred3)), columns=['y_act','y_dt','y_dt_stan','y_dt_norm'])
pred.index.name = 'Obs'

# remove comment below to save the predictions in a csv file and view the full data frame in excel
#pred.to_csv("preds.csv")
pred

| Obs | y_act | y_dt | y_dt_stan | y_dt_norm |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 0 | 1 | 1 | 1 |
| 8 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 |


## Results
Accuracy, confusion matrices, and classification reports are returned for each design matrix. The normalized design matrix scores about 86.5%, which is below our naive accuracy of 86.98%, so it should not be considered for the final model. Both the non-standardized and standardized design matrices perform better than the naive accuracy. The non-standardized model, however, predicts the *passing* class a bit more accurately (178 of 187 passing students correct, against 171), so we'll choose that as our final model.

In [5]:
cm_dt1 = pd.DataFrame(metrics.confusion_matrix(y1_test, dt_pred1), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_dt2 = pd.DataFrame(metrics.confusion_matrix(y2_test, dt_pred2), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])
cm_dt3 = pd.DataFrame(metrics.confusion_matrix(y3_test, dt_pred3), index = ['Fail(0)','Pass(1)','Inc(2)'],columns=['Fail(0)','Pass(1)','Inc(2)'])


print ("The accuracy of the Non-standardized Decision Tree model is: ", dt1.score(X1_test,y1_test))
print ("\n")
print ("The accuracy of the Standardized Decision Tree model is: ", dt2.score(X2_test,y2_test))
print ("\n")
print ("The accuracy of the Normalized Decision Tree model is: ", dt3.score(X3_test,y3_test))
print ("\n")

print("Non-standardized Decision Tree Confusion Matrix: \n", cm_dt1)
print ("\n")
print("Standardized Decision Tree Confusion Matrix: \n", cm_dt2)
print ("\n")
print("Normalized Decision Tree Confusion Matrix: \n", cm_dt3)
print ("\n")

print("Classification report for Non-standardized design matrix:\n", metrics.classification_report(y1_test,dt_pred1))
print("\n")
print("Classification report for standardized design matrix:\n", metrics.classification_report(y2_test,dt_pred2))
print("\n")
print("Classification report for Normalized design matrix:\n", metrics.classification_report(y3_test,dt_pred3))

The accuracy of the Non-standardized Decision Tree model is:  0.911627906977


The accuracy of the Standardized Decision Tree model is:  0.888372093023


The accuracy of the Normalized Decision Tree model is:  0.86511627907


Non-standardized Decision Tree Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       16        7       0
Pass(1)        9      178       0
Inc(2)         2        1       2


Standardized Decision Tree Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       16        7       0
Pass(1)       16      171       0
Inc(2)         1        0       4


Normalized Decision Tree Confusion Matrix: 
          Fail(0)  Pass(1)  Inc(2)
Fail(0)       10       11       2
Pass(1)       14      173       0
Inc(2)         1        1       3
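
As a cross-check, the accuracy and the per-class precision and recall can be recomputed directly from a confusion matrix. The sketch below does this for the non-standardized matrix printed above, with rows as actual classes and columns as predicted classes. The variable names are introduced here for illustration only.

```python
import numpy as np

# Non-standardized confusion matrix from above (rows: actual, columns: predicted)
cm = np.array([[16,   7, 0],
               [ 9, 178, 0],
               [ 2,   1, 2]])

correct = np.diag(cm)                     # correctly classified count per class

accuracy = correct.sum() / cm.sum()       # correct predictions over all observations
precision = correct / cm.sum(axis=0)      # correct / everything predicted as that class
recall = correct / cm.sum(axis=1)         # correct / everything actually in that class

print("accuracy: ", accuracy)
print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
```

The accuracy works out to 196/215, matching the reported score for the non-standardized model.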


Classification report for Non-standardized design matrix:
              precision    recall  f1-score   support

          0       0.59      0.70      0.64        23
          1       0.96      0.95      0.95       18