## Feed Unprocessed Data into Classifiers, Score, and Measure Accuracy

**PROCESS**

Fed raw, skewed, unscaled data into the following four classifiers:
- Logistic Regression
- KNeighbors Classifier
- Decision Tree Classifier
- Support Vector Classifier (default RBF kernel, since `SVC` is created without a `kernel` argument)

**RESULTS**

For the Madelon and Cook datasets, the log loss scores were as follows:
- Madelon log loss:
        {'DecisionTree': 13.815665146800779,
         'KNeighbors': 9.6709160277198709,
         'LogisticRegression': 13.355169450800107,
         'SVClassifier': 13.124735030066057}
- Cook log loss:
        {'DecisionTree': 13.229545793211978,
         'KNeighbors': 13.690058934974759,
         'LogisticRegression': 15.469364125630751,
         'SVClassifier': 13.020220360271567}
 
- Logistic Regression had the worst (highest) log loss on the Cook dataset (15.47). On the Madelon dataset, the Decision Tree was the worst (13.82), with Logistic Regression second worst (13.36).
- On the Madelon dataset, the KNeighbors Classifier performed the best, with a log loss of 9.67.
- On the Cook dataset, the SVClassifier performed the best, with a log loss of 13.02.

**ADDITIONAL RESULTS**
- Additional results are stored in the results dictionaries for the UCI Madelon and Cook Madelon datasets.
- Classification reports and confusion matrices were also generated to measure accuracy.
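
The log loss values above were computed from hard class predictions (`predict`) rather than predicted probabilities, so every misclassified sample is charged roughly the same clipped penalty. As a rough check, assuming the clipping constant of `1e-15` that older `sklearn` releases used by default (an assumption, since the installed version is not stated here), the reported Madelon scores are approximately the test error rate multiplied by `-ln(1e-15)`:

```python
import math

# Assumption: log_loss clipped predictions at eps = 1e-15 (default in older sklearn releases).
PENALTY_PER_ERROR = -math.log(1e-15)  # about 34.54

# Madelon test accuracies reported later in this notebook.
madelon_test_accuracy = {
    'DecisionTree': 0.60,
    'KNeighbors': 0.72,
    'LogisticRegression': 0.61333333333333329,
    'SVClassifier': 0.62,
}

# Each error contributes about the same penalty, so log loss is close to error rate * penalty.
approx_log_loss = {name: (1 - acc) * PENALTY_PER_ERROR
                   for name, acc in madelon_test_accuracy.items()}

for name, value in approx_log_loss.items():
    print(f'{name}: {value:.2f}')
```

Under that assumption, ranking the models by this log loss is essentially the same as ranking them by test accuracy. A probability-based log loss would need the output of `predict_proba` instead.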

In [16]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline

In [17]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC 

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import classification_report, confusion_matrix, log_loss

In [75]:
from pprint import pprint

### Load the Data from Pickled DataFrames

In [18]:
pwd

'/home/jovyan/ipynb'

In [19]:
cook_total_sample = pd.read_pickle('../assets/pickled_samples/cook_total_samples.p')
madelon_train_sample = pd.read_pickle('../assets/pickled_samples/madelon_sample_train.p')
madelon_train_sample_label = pd.read_pickle('../assets/pickled_samples/madelon_sample_train_labels.p')

**Madelon:** The test set does not need to be loaded here, since it is the hold-out data reserved for testing the final classification model's accuracy. Instead, run `train_test_split` on the training data.


### Run the Data through the Classifiers and Obtain Train & Test Scores

#### Madelon Dataset

In [20]:
madelon_train_sample.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
1239,492,526,464,476,500,479,445,475,488,467,...,521,477,489,456,487,486,484,467,507,469
1252,479,488,615,478,476,485,471,476,496,475,...,479,473,472,287,479,493,473,465,496,484
988,474,505,502,469,494,477,409,478,500,485,...,500,463,497,482,482,521,482,482,469,456
457,493,499,534,491,489,480,530,479,487,473,...,487,484,496,383,477,486,491,472,549,499
886,476,518,504,486,502,478,460,474,486,480,...,466,480,475,682,506,489,486,468,505,501


In [21]:
madelon_train_sample.shape

(600, 500)

In [22]:
madelon_train_sample_label.shape

(600,)

In [23]:
mad_X_train, mad_X_test, mad_y_train, mad_y_test = train_test_split(
    madelon_train_sample, madelon_train_sample_label)

In [24]:
display(mad_X_train.shape)
display(mad_X_test.shape)
display(mad_y_train.shape)
display(mad_y_test.shape)

(450, 500)

(150, 500)

(450,)

(150,)
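
The 450/150 split above follows from the default test fraction of 25% in `train_test_split`. The call in this notebook sets no `random_state`, so the rows in each split change from run to run. A minimal sketch on synthetic stand-in data (the array contents, `random_state=42`, and `stratify` are illustrative assumptions, not what the notebook ran) shows one way to make the split repeatable and class-balanced:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Madelon training sample.
rng = np.random.RandomState(0)
X_demo = rng.randint(400, 600, size=(600, 500))
y_demo = np.repeat([-1, 1], 300)  # balanced labels, for illustration only

# random_state fixes the shuffle; stratify keeps class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
```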

#### Madelon Dataset (Raw Benchmarking without any Preprocessing)
Uses the out-of-the-box default parameters provided by `sklearn` for the selected classification models; only `n_jobs` and `random_state` are set explicitly.
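
To see which defaults are actually in play, each estimator exposes `get_params()`. A short sketch follows; only the few parameters printed here are inspected, and defaults can differ between `sklearn` versions:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Inspect a few defaults that matter when the features are unscaled.
svc_defaults = SVC().get_params()
knn_defaults = KNeighborsClassifier().get_params()

print('SVC kernel:', svc_defaults['kernel'])
print('SVC C:', svc_defaults['C'])
print('KNN n_neighbors:', knn_defaults['n_neighbors'])
```

`SVC()` uses the RBF kernel unless `kernel='linear'` is passed, and both the RBF kernel and the KNN distance computation are sensitive to feature scale.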

In [25]:
names_of_classifiers = ['LogisticRegression', 'KNeighbors', 'DecisionTree', 'SVClassifier']

classifiers = [
    LogisticRegression(n_jobs=-1, random_state=42),
    KNeighborsClassifier(n_jobs=-1),
    DecisionTreeClassifier(random_state=42),
    SVC(random_state=42)]

Store the results in dictionaries so they can subsequently be loaded into a pandas DataFrame for comparison.

In [26]:
mad_raw_test_scores = {}
mad_raw_train_scores = {}
mad_raw_y_preds = {}

for name, clfr in zip(names_of_classifiers, classifiers):
    clfr.fit(mad_X_train, mad_y_train)
    
    train_score = clfr.score(mad_X_train, mad_y_train)
    test_score = clfr.score(mad_X_test, mad_y_test)
    y_pred = clfr.predict(mad_X_test)
    
    mad_raw_train_scores[name] = train_score
    mad_raw_test_scores[name] = test_score
    mad_raw_y_preds[name] = y_pred
    

In [27]:
mad_raw_test_scores

{'DecisionTree': 0.59999999999999998,
 'KNeighbors': 0.71999999999999997,
 'LogisticRegression': 0.61333333333333329,
 'SVClassifier': 0.62}

In [28]:
mad_raw_train_scores

{'DecisionTree': 1.0,
 'KNeighbors': 0.79777777777777781,
 'LogisticRegression': 1.0,
 'SVClassifier': 1.0}
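
The two score dictionaries above can be lined up in a single DataFrame to make the gap between train and test accuracy visible. A minimal sketch using the values printed above (the column names `train`, `test`, and `gap` are illustrative choices):

```python
import pandas as pd

# Scores copied from the two outputs above.
mad_raw_train_scores = {'DecisionTree': 1.0, 'KNeighbors': 0.79777777777777781,
                        'LogisticRegression': 1.0, 'SVClassifier': 1.0}
mad_raw_test_scores = {'DecisionTree': 0.60, 'KNeighbors': 0.72,
                       'LogisticRegression': 0.61333333333333329, 'SVClassifier': 0.62}

# One row per classifier, one column per score type.
mad_scores = pd.DataFrame({'train': mad_raw_train_scores, 'test': mad_raw_test_scores})
mad_scores['gap'] = mad_scores['train'] - mad_scores['test']

print(mad_scores.sort_values('test', ascending=False).round(3))
```

Three of the four models reach a training accuracy of 1.0 while scoring near 0.6 on the test split, which points to overfitting on the raw features.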

In [29]:
mad_raw_y_preds

{'DecisionTree': array([ 1,  1,  1, -1, -1, -1,  1, -1, -1,  1, -1,  1,  1, -1, -1,  1,  1,
         1, -1,  1, -1, -1, -1,  1, -1, -1, -1,  1, -1, -1, -1, -1, -1,  1,
        -1, -1, -1,  1,  1, -1,  1, -1, -1,  1, -1,  1,  1, -1,  1, -1,  1,
        -1, -1, -1,  1,  1,  1,  1, -1, -1,  1,  1,  1, -1, -1,  1,  1, -1,
         1, -1,  1, -1, -1,  1, -1, -1, -1,  1,  1, -1,  1,  1, -1,  1,  1,
        -1,  1, -1,  1, -1,  1,  1, -1,  1, -1,  1, -1,  1,  1,  1,  1, -1,
        -1,  1,  1,  1, -1, -1, -1, -1, -1, -1,  1,  1, -1, -1,  1, -1,  1,
        -1,  1, -1,  1,  1,  1,  1, -1, -1, -1,  1,  1, -1,  1, -1, -1,  1,
        -1,  1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  1]),
 'KNeighbors': array([ 1,  1, -1, -1, -1, -1,  1, -1, -1,  1, -1, -1, -1, -1,  1,  1,  1,
        -1,  1,  1, -1, -1, -1,  1,  1, -1,  1,  1, -1, -1, -1, -1, -1, -1,
        -1,  1,  1,  1, -1, -1, -1, -1, -1,  1, -1,  1, -1, -1, -1, -1,  1,
         1, -1, -1,  1,  1,  1, -1, -1,  1,  1, -1, -1, -1,  1, -1, 

In [39]:
names_of_classifiers

['LogisticRegression', 'KNeighbors', 'DecisionTree', 'SVClassifier']

In [72]:
madelon_classification_reports = {}

for classifier in names_of_classifiers:
    madelon_classification_reports[classifier] = classification_report(mad_y_test, mad_raw_y_preds[classifier])

pprint(madelon_classification_reports)

{'DecisionTree': '             precision    recall  f1-score   support\n'
                 '\n'
                 '         -1       0.62      0.63      0.63        79\n'
                 '          1       0.58      0.56      0.57        71\n'
                 '\n'
                 'avg / total       0.60      0.60      0.60       150\n',
 'KNeighbors': '             precision    recall  f1-score   support\n'
               '\n'
               '         -1       0.69      0.86      0.76        79\n'
               '          1       0.78      0.56      0.66        71\n'
               '\n'
               'avg / total       0.73      0.72      0.71       150\n',
 'LogisticRegression': '             precision    recall  f1-score   support\n'
                       '\n'
                       '         -1       0.65      0.58      0.61        79\n'
                       '          1       0.58      0.65      0.61        71\n'
                       '\n'
                       'avg / tota

In [45]:
names_of_classifiers

['LogisticRegression', 'KNeighbors', 'DecisionTree', 'SVClassifier']

In [42]:
def generate_confusion_matrix_madelon(y_actual, y_preds):
    conf_matrix = pd.DataFrame(confusion_matrix(y_actual, y_preds),
                               columns=['Predicted -1', 'Predicted 1'],
                               index=['Actual -1', 'Actual 1'])
    return conf_matrix

In [70]:
madelon_confusion_matrices = {}

for classifier in names_of_classifiers:
    madelon_confusion_matrices[classifier] = generate_confusion_matrix_madelon(mad_y_test, mad_raw_y_preds[classifier])
    
madelon_confusion_matrices

{'DecisionTree':            Predicted -1  Predicted 1
 Actual -1            50           29
 Actual 1             31           40,
 'KNeighbors':            Predicted -1  Predicted 1
 Actual -1            68           11
 Actual 1             31           40,
 'LogisticRegression':            Predicted -1  Predicted 1
 Actual -1            46           33
 Actual 1             25           46,
 'SVClassifier':            Predicted -1  Predicted 1
 Actual -1            79            0
 Actual 1             57           14}
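
The confusion matrices show that similar accuracies can hide very different behavior. As a worked example using the SVClassifier counts above, accuracy, precision, and recall for class 1 can be derived directly from the four cells:

```python
import numpy as np

# Rows are actual (-1, 1); columns are predicted (-1, 1). Counts copied from the output above.
svc_matrix = np.array([[79, 0],
                       [57, 14]])

tn, fp, fn, tp = svc_matrix.ravel()

accuracy = (tp + tn) / svc_matrix.sum()
precision_pos = tp / (tp + fp)   # of the samples predicted 1, how many were actually 1
recall_pos = tp / (tp + fn)      # of the actual 1 samples, how many were found

print(f'accuracy={accuracy:.2f} precision={precision_pos:.2f} recall={recall_pos:.2f}')
```

The SVClassifier predicts class 1 only 14 times. Every one of those predictions is correct, but it finds just 14 of the 71 actual class 1 samples, so its 0.62 accuracy comes almost entirely from predicting -1.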

In [68]:
madelon_log_loss = {}

for classifier in names_of_classifiers:
    madelon_log_loss[classifier] = log_loss(mad_y_test, mad_raw_y_preds[classifier])

madelon_log_loss

{'DecisionTree': 13.815665146800779,
 'KNeighbors': 9.6709160277198709,
 'LogisticRegression': 13.355169450800107,
 'SVClassifier': 13.124735030066057}
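
`log_loss` is designed to score predicted probabilities; passing the hard labels from `predict`, as in the cell above, collapses it into a rescaled error count. A hedged sketch of the probability-based variant, run on synthetic data from `make_classification` rather than the Madelon sample (the dataset, its parameters, and the variable names are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the real data.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in clf.classes_.
proba = clf.predict_proba(X_test)
proba_loss = log_loss(y_test, proba, labels=clf.classes_)

print(f'log loss from probabilities: {proba_loss:.3f}')
```

Note that `SVC` only exposes `predict_proba` when it is constructed with `probability=True`.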

#### Cook Dataset

In [101]:
cook_total_sample.head()

Unnamed: 0,_id,feat_000,feat_001,feat_002,feat_003,feat_004,feat_005,feat_006,feat_007,feat_008,...,feat_991,feat_992,feat_993,feat_994,feat_995,feat_996,feat_997,feat_998,feat_999,target
0,116031,-0.063592,-0.935132,-0.788636,2.006542,0.057752,-0.612374,-0.31929,-0.130704,-0.426335,...,0.079754,-0.609663,1.101417,-0.485404,0.085902,-0.780068,0.155906,0.241406,0.538386,1
1,24415,-0.452243,0.258384,0.620509,0.38908,-0.197159,0.829617,-0.059411,0.910375,-0.323078,...,-0.634202,0.556551,2.037437,-0.4826,-1.418812,0.0792,-0.368648,0.219643,-0.10873,1
2,115872,1.073645,-1.01595,-0.355322,0.452687,-0.744907,-0.776871,0.385545,0.576864,-0.339835,...,-0.270593,0.25033,0.173127,-0.67309,-0.450532,1.538424,0.276987,-0.257989,-0.351097,1
3,62456,-0.269215,1.790995,-0.171136,0.258013,-0.215587,-0.516337,-0.228766,-0.446238,0.41839,...,0.7739,-0.321531,0.847676,-1.532333,-0.613422,-1.498944,-1.059311,0.628973,-0.830657,0
4,173909,0.398804,0.579328,-0.905363,-0.12414,-0.545298,0.409123,-0.179135,0.275275,-0.253539,...,-0.643034,-0.752793,0.176453,0.234722,1.122761,-1.139794,1.231819,-0.783419,1.448478,1


In [6]:
cook_target = cook_total_sample['target']
cook_features = cook_total_sample.drop(['_id', 'target'], axis=1)

In [7]:
display(cook_target.shape)
display(cook_features.shape)

(6600,)

(6600, 1000)

In [8]:
cook_X_train, cook_X_test, cook_y_train, cook_y_test = train_test_split(cook_features, cook_target)

In [11]:
cook_raw_test_scores = {}
cook_raw_train_scores = {}
cook_raw_y_preds = {}

for name, clfr in zip(names_of_classifiers, classifiers):
    clfr.fit(cook_X_train, cook_y_train)
    
    train_score = clfr.score(cook_X_train, cook_y_train)
    test_score = clfr.score(cook_X_test, cook_y_test)
    y_pred = clfr.predict(cook_X_test)
    
    cook_raw_train_scores[name] = train_score
    cook_raw_test_scores[name] = test_score
    cook_raw_y_preds[name] = y_pred
    

In [12]:
cook_raw_test_scores

{'DecisionTree': 0.61696969696969695,
 'KNeighbors': 0.60363636363636364,
 'LogisticRegression': 0.55212121212121212,
 'SVClassifier': 0.62303030303030305}

In [13]:
cook_raw_train_scores

{'DecisionTree': 1.0,
 'KNeighbors': 0.76282828282828286,
 'LogisticRegression': 0.73777777777777775,
 'SVClassifier': 0.97131313131313135}
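
Because the Cook loop refits the same estimator objects that were used for Madelon, the Madelon-fitted models are overwritten. A possible refactor, sketched here with a hypothetical `benchmark` helper and synthetic data, fits a fresh `clone` of each estimator and returns the scores as a DataFrame:

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def benchmark(estimators, X_train, X_test, y_train, y_test):
    """Fit an unfitted copy of each estimator and collect train/test accuracy."""
    rows = {}
    for name, estimator in estimators.items():
        model = clone(estimator).fit(X_train, y_train)  # the original object stays unfitted
        rows[name] = {'train': model.score(X_train, y_train),
                      'test': model.score(X_test, y_test)}
    return pd.DataFrame.from_dict(rows, orient='index')

# Synthetic stand-in data; the notebook would pass the Madelon or Cook splits instead.
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
splits = train_test_split(X, y, random_state=42)  # X_train, X_test, y_train, y_test

estimators = {'KNeighbors': KNeighborsClassifier(),
              'DecisionTree': DecisionTreeClassifier(random_state=42)}
scores = benchmark(estimators, *splits)
print(scores)
```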

In [14]:
cook_raw_y_preds

{'DecisionTree': array([0, 1, 1, ..., 1, 0, 1]),
 'KNeighbors': array([1, 0, 0, ..., 1, 0, 0]),
 'LogisticRegression': array([0, 1, 0, ..., 1, 0, 1]),
 'SVClassifier': array([0, 1, 1, ..., 0, 0, 1])}

In [50]:
names_of_classifiers

['LogisticRegression', 'KNeighbors', 'DecisionTree', 'SVClassifier']

In [60]:
cook_classification_reports={}

for classifier in names_of_classifiers:
    cook_classification_reports[classifier] = classification_report(cook_y_test, cook_raw_y_preds[classifier])
    
pprint(cook_classification_reports)

{'DecisionTree': '             precision    recall  f1-score   support\n'
                 '\n'
                 '          0       0.61      0.63      0.62       814\n'
                 '          1       0.63      0.61      0.62       836\n'
                 '\n'
                 'avg / total       0.62      0.62      0.62      1650\n',
 'KNeighbors': '             precision    recall  f1-score   support\n'
               '\n'
               '          0       0.59      0.64      0.61       814\n'
               '          1       0.62      0.57      0.59       836\n'
               '\n'
               'avg / total       0.60      0.60      0.60      1650\n',
 'LogisticRegression': '             precision    recall  f1-score   support\n'
                       '\n'
                       '          0       0.55      0.55      0.55       814\n'
                       '          1       0.56      0.56      0.56       836\n'
                       '\n'
                       'avg / tota

In [49]:
def generate_confusion_matrix_cook(y_actual, y_preds):
    conf_matrix = pd.DataFrame(confusion_matrix(y_actual, y_preds),
                               columns=['Predicted 0', 'Predicted 1'],
                               index=['Actual 0', 'Actual 1'])
    return conf_matrix
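
The Madelon and Cook helpers differ only in their hard-coded label names. One possible consolidation, sketched here with a hypothetical helper named `labeled_confusion_matrix`, builds the row and column names from the labels themselves:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def labeled_confusion_matrix(y_actual, y_preds, labels):
    """Return a confusion matrix DataFrame with headers built from the given labels."""
    matrix = confusion_matrix(y_actual, y_preds, labels=labels)
    return pd.DataFrame(matrix,
                        index=[f'Actual {label}' for label in labels],
                        columns=[f'Predicted {label}' for label in labels])

# Toy example with Madelon-style labels (-1, 1).
demo = labeled_confusion_matrix([-1, -1, 1, 1, 1], [-1, 1, 1, 1, -1], labels=[-1, 1])
print(demo)
```

Passing `labels=[0, 1]` would cover the Cook dataset with the same function.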

In [64]:
cook_confusion_matrices = {}

for classifier in names_of_classifiers:
    cook_confusion_matrices[classifier] = generate_confusion_matrix_cook(cook_y_test, cook_raw_y_preds[classifier])
    
pprint(cook_confusion_matrices)

{'DecisionTree':           Predicted 0  Predicted 1
Actual 0          509          305
Actual 1          327          509,
 'KNeighbors':           Predicted 0  Predicted 1
Actual 0          517          297
Actual 1          357          479,
 'LogisticRegression':           Predicted 0  Predicted 1
Actual 0          445          369
Actual 1          370          466,
 'SVClassifier':           Predicted 0  Predicted 1
Actual 0          508          306
Actual 1          316          520}


In [65]:
names_of_classifiers

['LogisticRegression', 'KNeighbors', 'DecisionTree', 'SVClassifier']

In [67]:
cook_log_loss = {}

for classifier in names_of_classifiers:
    cook_log_loss[classifier] = log_loss(cook_y_test, cook_raw_y_preds[classifier])

cook_log_loss

{'DecisionTree': 13.229545793211978,
 'KNeighbors': 13.690058934974759,
 'LogisticRegression': 15.469364125630751,
 'SVClassifier': 13.020220360271567}