# Final Project
## Nicholas Schenone - A13599911

- 3 trials
- 7 classifiers
    - SVM
    - Logistic Regression
    - Decision Tree
    - Perceptron
    - Multilayer Perceptron
    - KNN
    - Random Forest
- 3 datasets
    - Heart Disease: https://www.kaggle.com/ronitf/heart-disease-uci
    - Mushroom: https://archive.ics.uci.edu/ml/datasets/Mushroom
    - Adult Data Set: https://archive.ics.uci.edu/ml/datasets/Adult
- 3 partitions (20/80, 50/50, 80/20)
- 3 accuracies per (train, validation, test)

### Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

import json

import seaborn as sns

### Pre-Process Data

In [2]:
def adult_pre_process(data_path="data/adult/adult.csv", split=0.2):
    df_adult = pd.read_csv(data_path)
    # One-hot encode every categorical column (the label column too, if it is categorical)
    df_adult_one_hot = pd.get_dummies(df_adult)

    # Features: every encoded column except the last, standardized.
    # NOTE: if the label was expanded into two dummy columns, the other one stays in X.
    X = df_adult_one_hot.iloc[:, :-1]
    X = StandardScaler().fit_transform(X)

    # Target: the last encoded column
    y = df_adult_one_hot.iloc[:, -1]
    y = y.values.ravel()

    # `split` is the fraction of rows held out for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)

    return X, y, X_train, X_test, y_train, y_test

def heart_pre_process(data_path="data/heart_disease/heart.csv", split=0.2):
    df_heart = pd.read_csv(data_path)
    # Features: every column except the last, standardized
    X = df_heart.iloc[:, :-1]
    X = StandardScaler().fit_transform(X)

    # Target: the last column
    y = df_heart.iloc[:, -1]
    y = y.values.ravel()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)

    return X, y, X_train, X_test, y_train, y_test

def mushroom_pre_process(data_path="data/mushroom/mushroom.csv", split=0.2):
    # The file has no header row, so columns are numbered from 0
    df_mushroom = pd.read_csv(data_path, header=None)
    df_mush_one_hot = pd.get_dummies(df_mushroom)

    # Features: every encoded column after the first, standardized.
    # NOTE: if column 0 is the class label, its other dummy column stays in X.
    X = df_mush_one_hot.iloc[:, 1:]
    X = StandardScaler().fit_transform(X)

    # Target: the first encoded column
    y = df_mush_one_hot.iloc[:, :1]
    y = y.values.ravel()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)

    return X, y, X_train, X_test, y_train, y_test

def pre_process(dataset, split=0.2):
    # Dispatch to the loader for the requested dataset
    if dataset == "mush":
        return mushroom_pre_process(split=split)
    elif dataset == "heart":
        return heart_pre_process(split=split)
    elif dataset == "adult":
        return adult_pre_process(split=split)
    else:
        raise ValueError(f"Unknown dataset: {dataset}")
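
One thing worth checking in loaders like these is whether a one-hot encoded label ends up on both sides of the split. The following is a minimal sketch on a hypothetical toy frame (the column names `age`, `job`, and `label` are assumptions, not the real Adult schema). It shows how encoding the whole frame and dropping only the last column can leave a label-derived column in the features, and one way to avoid that by separating the label first.

```python
import pandas as pd

# Hypothetical toy frame: "label" is the target, the rest are features.
df = pd.DataFrame({
    "age": [25, 40, 33, 58],
    "job": ["a", "b", "a", "c"],
    "label": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Encoding the whole frame expands the label into two dummy columns.
leaky = pd.get_dummies(df)
# Dropping only the last column keeps the other label column in the features.
X_leaky = leaky.iloc[:, :-1]

# Separating the label first keeps every label-derived column out of X.
y = (df["label"] == ">50K").astype(int)
X = pd.get_dummies(df.drop(columns=["label"]))

print(list(X_leaky.columns))
print(list(X.columns))
```

If a classifier can read the complementary label column, near-perfect accuracy says little about the model itself, so it is worth inspecting the encoded column names before training.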

In [3]:
heart_X, heart_y, heart_X_train, heart_X_test, heart_y_train, heart_y_test = heart_pre_process(split=0.2)

mush_X, mush_y, mush_X_train, mush_X_test, mush_y_train, mush_y_test = mushroom_pre_process(split=0.2)

adult_X, adult_y, adult_X_train, adult_X_test, adult_y_train, adult_y_test = adult_pre_process(split=0.5)

### Classifiers and Functions

In [4]:
# SVM
def clf_SVM(param_grid):
    return svm.SVC(C = param_grid["C"],
                   gamma=param_grid["gamma"],
                   kernel=param_grid["kernel"],
                   max_iter = 10000)

# Logistic Regression
def clf_log(param_grid):
    return LogisticRegression(C = param_grid["C"],
                              penalty = param_grid["penalty"],
                              solver="liblinear",
                              max_iter = 10000)

# Decision Tree
def clf_tree(param_grid):
    return DecisionTreeClassifier(criterion=param_grid["criterion"],
                                  max_depth=param_grid["max_depth"])

# Perceptron
def clf_perc(param_grid):
    return Perceptron(penalty=param_grid["penalty"],
                      alpha=param_grid["alpha"],
                      max_iter=param_grid["max_iter"],
                      tol=param_grid["tol"],
                      early_stopping=param_grid["early_stopping"])

# Multi-Layer Perceptron
def clf_mlp(param_grid):
    return MLPClassifier(activation=param_grid["activation"],
                      solver=param_grid["solver"],
                      hidden_layer_sizes=param_grid["hidden_layer_sizes"],
                      max_iter=param_grid["max_iter"],
                      tol=param_grid["tol"],
                      early_stopping=param_grid["early_stopping"])

# KNN
def clf_knn(param_grid):
    return KNeighborsClassifier(n_neighbors=param_grid["n_neighbors"])

# Random Forest
def clf_rf(param_grid):
    return RandomForestClassifier(bootstrap=param_grid["bootstrap"],
                                 max_depth=param_grid["max_depth"],
                                 max_features=param_grid["max_features"],
                                 min_samples_leaf=param_grid["min_samples_leaf"],
                                 min_samples_split=param_grid["min_samples_split"],
                                 n_estimators=param_grid["n_estimators"])

# General
def clf(model, param_grid):
    if model == "svm":
        return clf_SVM(param_grid)
    elif model=="log":
        return clf_log(param_grid)
    elif model=="tree":
        return clf_tree(param_grid)
    elif model=="perc":
        return clf_perc(param_grid)
    elif model=="mlp":
        return clf_mlp(param_grid)
    elif model=="knn":
        return clf_knn(param_grid)
    elif model=="rf":
        return clf_rf(param_grid)
    
def train_model(classifier, X_train, y_train):
    classifier.fit(X_train, y_train)

def hyper_tune(X_train, y_train, estimator, param_grid, k_top=3):
    # Randomized search over param_grid with 10-fold cross-validation
    grid_search = RandomizedSearchCV(estimator=estimator, param_distributions=param_grid, cv=10, n_iter=20, n_jobs=-1, verbose=10)
    grid_search.fit(X_train, y_train)
    results = pd.DataFrame(grid_search.cv_results_)
    results.sort_values(by='rank_test_score', inplace=True)
    # Take the first k_top rows by position; .loc[i] would look up the
    # original (unsorted) index labels instead of the best-ranked rows
    out = results['params'].head(k_top).tolist()
    print(f"Best {k_top} params:", out)
    return out

def evalModel(classifier, X_test, y_test):
    # Predict with the classifier passed in (not a global variable)
    y_pred = classifier.predict(X_test)

    # Macro averaging weights every class equally
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="macro")
    recall = recall_score(y_test, y_pred, average="macro")
    f_score = f1_score(y_test, y_pred, average="macro")

    return (accuracy, precision, recall, f_score)
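
To make the `average="macro"` setting concrete, here is a small pure-Python sketch of what macro averaging computes: each metric is calculated per class and the per-class values are averaged with equal weight. The helper name `macro_scores` and the sample labels are hypothetical; this is an illustration of the definition, not scikit-learn's implementation.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios are treated as 0.0 in this sketch.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(labels)
    # Macro F1 is the mean of per-class F1 values.
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical labels, for illustration only.
acc, prec, rec, f1 = macro_scores([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(acc, prec, rec, f1)
```

On imbalanced data, macro scores can sit well below accuracy, because a poorly predicted minority class counts as much as the majority class.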

## Hyperparameter Tuning

### SVM

In [None]:
svm_param_grid = {
    "C" : [1, 10, 100, 1000, 10000],
    "gamma" : [1e-6, 1e-5, 1e-4, 1e-3, 1e-2],
    "kernel" : ["linear", "rbf"]
}
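
This grid defines 5 × 5 × 2 = 50 combinations, while `hyper_tune` asks the randomized search for 20 iterations, so only a subset is tried. The sketch below is a rough illustration of that idea using the standard library only: it expands the grid and draws candidates without replacement. The helper `expand_grid` and the fixed seed are assumptions made for the example, and this is not how scikit-learn implements its sampler.

```python
import itertools
import random

# Same values as the SVM grid defined above.
svm_param_grid = {
    "C": [1, 10, 100, 1000, 10000],
    "gamma": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2],
    "kernel": ["linear", "rbf"],
}


def expand_grid(param_grid):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(param_grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(param_grid[k] for k in keys))]


all_candidates = expand_grid(svm_param_grid)

# Draw candidates without replacement, capped at the grid size.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
n_iter = 20
sampled = rng.sample(all_candidates, k=min(n_iter, len(all_candidates)))

print(len(all_candidates), len(sampled))
```

For the smaller grids later in the notebook (for example the 8-combination decision tree grid), the cap means the whole grid is covered.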

In [None]:
# Mushroom SVM Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, svm.SVC(), svm_param_grid)

with open('params/svm/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)
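
Saving parameter sets as JSON is convenient, but a few Python types change on the way back in. A small sketch with hypothetical parameter sets (shaped like entries from the grids in this notebook) shows the two cases that matter here: tuples come back as lists, and `None` survives as JSON `null`.

```python
import json

# Hypothetical parameter sets shaped like the ones hyper_tune returns.
params = [
    {"hidden_layer_sizes": (50,), "early_stopping": True, "tol": 1e-4},
    {"penalty": None, "alpha": 0.001, "max_iter": 1000},
]

restored = json.loads(json.dumps(params))

print(restored[0]["hidden_layer_sizes"])  # a list, no longer a tuple
print(restored[1]["penalty"])             # None round-trips via JSON null
```

The SVM parameters above are plain numbers and strings, so they round-trip unchanged.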

In [None]:
# Heart SVM Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, svm.SVC(), svm_param_grid)

with open('params/svm/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [None]:
# Adult SVM Tuning
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, svm.SVC(), svm_param_grid)

with open('params/svm/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

### Logistic Regression

In [None]:
log_param_grid = {
    "C" : [1, 10, 100, 1000, 10000],
    "penalty" : ["l1", "l2"],
}

In [None]:
# Mushroom Logistic Regression Tuning (liblinear supports both the l1 and l2 penalties in the grid)
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, LogisticRegression(solver="liblinear"), log_param_grid)

with open('params/log/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Logistic Regression Tuning (liblinear supports both the l1 and l2 penalties in the grid)
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, LogisticRegression(solver="liblinear"), log_param_grid)

with open('params/log/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [None]:
# Adult Logistic Regression Tuning (liblinear supports both the l1 and l2 penalties in the grid)
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, LogisticRegression(solver="liblinear"), log_param_grid)

with open('params/log/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

### Decision Tree

In [None]:
tree_param_grid = {
    "criterion" : ['gini', 'entropy'],
    "max_depth" : [4,6,8,12],
}

In [None]:
# Mushroom Decision Tree Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, DecisionTreeClassifier(), tree_param_grid)

with open('params/tree/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Decision Tree Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, DecisionTreeClassifier(), tree_param_grid)

with open('params/tree/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [None]:
# Adult Decision Tree Tuning
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, DecisionTreeClassifier(), tree_param_grid)

with open('params/tree/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

### Perceptron

In [None]:
perc_param_grid = {
    "penalty" : [None, "l1", "l2", "elasticnet"],
    "alpha" : [0.001, 0.0001, 0.00001],
    "max_iter" : [500, 1000, 2000],
    "tol" : [1e-4, 1e-3, 1e-2],
    "early_stopping" : [True, False]
}

In [None]:
# Mushroom Perceptron Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, Perceptron(), perc_param_grid)

with open('params/perc/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Perceptron Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, Perceptron(), perc_param_grid)

with open('params/perc/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [None]:
# Adult Perceptron Tuning
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, Perceptron(), perc_param_grid)

with open('params/perc/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

### Multi-Layer Perceptron

In [None]:
mlp_param_grid = {
    "hidden_layer_sizes" : [(100,), (50,), (200,), (25,)],
    "activation" : ["identity", "logistic", "tanh", "relu"],
    "solver" : ["lbfgs", "sgd", "adam"],
    "max_iter" : [200, 100, 300],
    "tol" : [1e-4, 1e-3, 1e-5],
    "early_stopping" : [True, False]
}

In [None]:
# Mushroom MLP Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, MLPClassifier(), mlp_param_grid)

with open('params/mlp/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart MLP Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, MLPClassifier(), mlp_param_grid)

with open('params/mlp/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [None]:
# Adult MLP Tuning
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, MLPClassifier(), mlp_param_grid)

with open('params/mlp/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

### KNN

In [None]:
knn_param_grid = {
    "n_neighbors" : [1, 3, 5, 9, 15, 25, 50, 75, 100],
}

In [None]:
# Mushroom KNN Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, KNeighborsClassifier(), knn_param_grid)

with open('params/knn/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart KNN Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, KNeighborsClassifier(), knn_param_grid)

with open('params/knn/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [None]:
# Adult KNN Tuning
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, KNeighborsClassifier(), knn_param_grid)

with open('params/knn/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

### Random Forest

In [None]:
rf_param_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    # 'auto' means 'sqrt' for classifiers; it was deprecated in scikit-learn 1.1 and removed in 1.3
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

In [None]:
# Mushroom Random Forest Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, RandomForestClassifier(), rf_param_grid)

with open('params/rf/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Random Forest Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, RandomForestClassifier(), rf_param_grid)

with open('params/rf/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [None]:
# Adult Random Forest Tuning
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, RandomForestClassifier(), rf_param_grid)

with open('params/rf/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

## Evaluate with Best Params

### SVM

In [None]:
with open('params/svm/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/svm/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)
    
with open('params/svm/best_param_grid_adult', 'r') as f:
    best_param_grid_adult = json.load(f)

In [None]:
# Mushroom SVM
for i in range(len(best_param_grid_mush)):
    print("Best params:", best_param_grid_mush[i], "\n")
    classifier = clf_SVM(best_param_grid_mush[i])
    train_model(classifier, mush_X_train, mush_y_train)
    acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart SVM
for i in range(len(best_param_grid_heart)):
    print("Best params:", best_param_grid_heart[i], "\n")
    classifier = clf_SVM(best_param_grid_heart[i])
    train_model(classifier, heart_X_train, heart_y_train)
    acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Adult SVM
for i in range(len(best_param_grid_adult)):
    print("Best params:", best_param_grid_adult[i], "\n")
    classifier = clf_SVM(best_param_grid_adult[i])
    train_model(classifier, adult_X_train, adult_y_train)
    acc, prec, rec, f = evalModel(classifier, adult_X_test, adult_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Logistic Regression

In [None]:
with open('params/log/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/log/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)
    
with open('params/log/best_param_grid_adult', 'r') as f:
    best_param_grid_adult = json.load(f)

In [None]:
# Mushroom Logistic Regression Training/Eval
for i in range(len(best_param_grid_mush)):
    print("Best params:", best_param_grid_mush[i], "\n")
    classifier = clf_log(best_param_grid_mush[i])
    train_model(classifier, mush_X_train, mush_y_train)
    acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart Logistic Regression Training/Eval
for i in range(len(best_param_grid_heart)):
    print("Best params:", best_param_grid_heart[i], "\n")
    classifier = clf_log(best_param_grid_heart[i])
    train_model(classifier, heart_X_train, heart_y_train)
    acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Adult Logistic Regression Training/Eval
for i in range(len(best_param_grid_adult)):
    print("Best params:", best_param_grid_adult[i], "\n")
    classifier = clf_log(best_param_grid_adult[i])
    train_model(classifier, adult_X_train, adult_y_train)
    acc, prec, rec, f = evalModel(classifier, adult_X_test, adult_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Decision Tree

In [None]:
with open('params/tree/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/tree/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)
    
with open('params/tree/best_param_grid_adult', 'r') as f:
    best_param_grid_adult = json.load(f)

In [None]:
# Mushroom Decision Tree Training/Eval
for i in range(len(best_param_grid_mush)):
    print("Best params:", best_param_grid_mush[i], "\n")
    classifier = clf_tree(best_param_grid_mush[i])
    train_model(classifier, mush_X_train, mush_y_train)
    acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart Decision Tree Training/Eval
for i in range(len(best_param_grid_heart)):
    print("Best params:", best_param_grid_heart[i], "\n")
    classifier = clf_tree(best_param_grid_heart[i])
    train_model(classifier, heart_X_train, heart_y_train)
    acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Adult Decision Tree Training/Eval
for i in range(len(best_param_grid_adult)):
    print("Best params:", best_param_grid_adult[i], "\n")
    classifier = clf_tree(best_param_grid_adult[i])
    train_model(classifier, adult_X_train, adult_y_train)
    acc, prec, rec, f = evalModel(classifier, adult_X_test, adult_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Perceptron

In [None]:
with open('params/perc/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/perc/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)
    
with open('params/perc/best_param_grid_adult', 'r') as f:
    best_param_grid_adult = json.load(f)

In [None]:
# Mushroom Perceptron Training/Eval
for i in range(len(best_param_grid_mush)):
    print("Best params:", best_param_grid_mush[i], "\n")
    classifier = clf_perc(best_param_grid_mush[i])
    train_model(classifier, mush_X_train, mush_y_train)
    acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart Perceptron Training/Eval
for i in range(len(best_param_grid_heart)):
    print("Best params:", best_param_grid_heart[i], "\n")
    classifier = clf_perc(best_param_grid_heart[i])
    train_model(classifier, heart_X_train, heart_y_train)
    acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Adult Perceptron Training/Eval
for i in range(len(best_param_grid_adult)):
    print("Best params:", best_param_grid_adult[i], "\n")
    classifier = clf_perc(best_param_grid_adult[i])
    train_model(classifier, adult_X_train, adult_y_train)
    acc, prec, rec, f = evalModel(classifier, adult_X_test, adult_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Multi-Layer Perceptron

In [None]:
with open('params/mlp/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/mlp/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)
    
with open('params/mlp/best_param_grid_adult', 'r') as f:
    best_param_grid_adult = json.load(f)

In [None]:
# Mushroom MLP Training/Eval
for i in range(len(best_param_grid_mush)):
    print("Best params:", best_param_grid_mush[i], "\n")
    classifier = clf_mlp(best_param_grid_mush[i])
    train_model(classifier, mush_X_train, mush_y_train)
    acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart MLP Training/Eval
for i in range(len(best_param_grid_heart)):
    print("Best params:", best_param_grid_heart[i], "\n")
    classifier = clf_mlp(best_param_grid_heart[i])
    train_model(classifier, heart_X_train, heart_y_train)
    acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Adult MLP Training/Eval
for i in range(len(best_param_grid_adult)):
    print("Best params:", best_param_grid_adult[i], "\n")
    classifier = clf_mlp(best_param_grid_adult[i])
    train_model(classifier, adult_X_train, adult_y_train)
    acc, prec, rec, f = evalModel(classifier, adult_X_test, adult_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### KNN

In [None]:
with open('params/knn/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/knn/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)
    
with open('params/knn/best_param_grid_adult', 'r') as f:
    best_param_grid_adult = json.load(f)

In [None]:
# Mushroom KNN Training/Eval
for i in range(len(best_param_grid_mush)):
    print("Best params:", best_param_grid_mush[i], "\n")
    classifier = clf_knn(best_param_grid_mush[i])
    train_model(classifier, mush_X_train, mush_y_train)
    acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart KNN Training/Eval
for i in range(len(best_param_grid_heart)):
    print("Best params:", best_param_grid_heart[i], "\n")
    classifier = clf_knn(best_param_grid_heart[i])
    train_model(classifier, heart_X_train, heart_y_train)
    acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Adult KNN Training/Eval
for i in range(len(best_param_grid_adult)):
    print("Best params:", best_param_grid_adult[i], "\n")
    classifier = clf_knn(best_param_grid_adult[i])
    train_model(classifier, adult_X_train, adult_y_train)
    acc, prec, rec, f = evalModel(classifier, adult_X_test, adult_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Random Forest

In [None]:
with open('params/rf/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/rf/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)
    
with open('params/rf/best_param_grid_adult', 'r') as f:
    best_param_grid_adult = json.load(f)

In [None]:
# Mushroom RF Training/Eval
for i in range(len(best_param_grid_mush)):
    print("Best params:", best_param_grid_mush[i], "\n")
    classifier = clf_rf(best_param_grid_mush[i])
    train_model(classifier, mush_X_train, mush_y_train)
    acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart RF Training/Eval
for i in range(len(best_param_grid_heart)):
    print("Best params:", best_param_grid_heart[i], "\n")
    classifier = clf_rf(best_param_grid_heart[i])
    train_model(classifier, heart_X_train, heart_y_train)
    acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Adult RF Training/Eval
for i in range(len(best_param_grid_adult)):
    print("Best params:", best_param_grid_adult[i], "\n")
    classifier = clf_rf(best_param_grid_adult[i])
    train_model(classifier, adult_X_train, adult_y_train)
    acc, prec, rec, f = evalModel(classifier, adult_X_test, adult_y_test)
    print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

## Evaluation Pipeline

- 3 trials
    - 3 datasets
        - 7 models
            - 2 splits (20/80, 80/20)

In [7]:
num_trials = 3
datasets = ['adult', 'mush', 'heart']
splits = [0.2, 0.8]
models = ['log', 'svm', 'tree', 'perc', 'mlp', 'knn', 'rf']
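
With this configuration, every classifier gets one score per trial for each dataset and split. A minimal sketch of the last two steps of the plan, averaging those per-trial scores and rank-ordering the classifiers, could look like the following. The accuracy values are made up for illustration and are not results from this project.

```python
from statistics import mean

# Hypothetical test accuracies: one value per trial for each classifier.
trial_accuracies = {
    "log": [0.90, 0.92, 0.91],
    "tree": [0.88, 0.93, 0.86],
    "knn": [0.85, 0.87, 0.86],
}

# Average over trials, then order classifiers from best to worst.
averaged = {name: mean(scores) for name, scores in trial_accuracies.items()}
ranking = sorted(averaged, key=averaged.get, reverse=True)

print(ranking)
```

Ranking on the numeric averages (before any string formatting) keeps the ordering independent of how the values are displayed later.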

In [None]:
"""
For i in three different datasets
   For j in three different partitions (20/80,50/50,80/20):
        For t in three different trials
            For c in three different classifiers
                 cross validate
                 find the optimal hyper-parameter
                 train using the hyper-parameter above
                 obtain the training and validation accuracy/error
                 test
                 obtain the testing accuracy
       compute the averaged accuracy (training, validation, and testing) for each classifier c out of three trials
       rank order the classifiers
""";

In [None]:
# Loop through datasets
dataset_scores = {}
for dataset in datasets:
    
    # Loop through dataset splits
    split_scores = {}
    for split in splits:

        # Prepare data splits
        X, y, X_train, X_test, y_train, y_test = pre_process(dataset=dataset, split=split)
        
        # Loop through trials
        trial_scores = {}
        for i in range(num_trials):
            
            # Loop through models
            model_scores = {}
            for model in models:
                
                # Load best model params for given model and dataset
                with open(f'params/{model}/best_param_grid_{dataset}', 'r') as f:
                    best_param_grid = json.load(f)

                # Create classifier from the i-th best parameter set
                classifier = clf(model=model, param_grid=best_param_grid[i])

                # Train classifier
                print(f"Training {dataset}-{split}-{i}-{model}")
                train_model(classifier, X_train, y_train)

                # Evaluate classifier on the test set, then on the training set
                print(f"Evaluating {dataset}-{split}-{i}-{model}")
                acc, prec, rec, f1 = evalModel(classifier, X_test, y_test)
                test = {"accuracy" : acc, "precision": prec, "recall" : rec, "f1_score" : f1}
                acc, prec, rec, f1 = evalModel(classifier, X_train, y_train)
                train = {"accuracy" : acc, "precision": prec, "recall" : rec, "f1_score" : f1}

                classifier_eval = {"train" : train, "test" : test}

                # Add evaluation scores for given model
                model_scores.update({f"model_{model}" : classifier_eval})
                
            # Add model scores for given trial
            trial_scores.update({f"trial_{i}" : model_scores})

        # Add trial scores for given split
        split_scores.update({f"split_{split}": trial_scores})

    # Add split scores for given dataset
    dataset_scores.update({f"data_{dataset}": split_scores})
    
with open('scores/dataset_scores', 'w') as f:
    json.dump(dataset_scores, f)
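
The saved file is a nested dictionary keyed by dataset, split, trial, model, and then train/test. As a rough sketch of how that nesting can be walked, the helper below (a hypothetical `iter_scores`, with made-up metric values) flattens it into one tuple per leaf, which is often easier to aggregate than hand-written nested loops.

```python
import json

# Hypothetical scores; only the nesting mirrors the pipeline above.
dataset_scores = {
    "data_heart": {
        "split_0.2": {
            "trial_0": {
                "model_log": {
                    "train": {"accuracy": 0.9, "f1_score": 0.9},
                    "test": {"accuracy": 0.8, "f1_score": 0.8},
                }
            }
        }
    }
}


def iter_scores(scores):
    """Yield (dataset, split, trial, model, phase, metrics) for every leaf."""
    for dataset, splits in scores.items():
        for split, trials in splits.items():
            for trial, models in trials.items():
                for model, phases in models.items():
                    for phase, metrics in phases.items():
                        yield dataset, split, trial, model, phase, metrics


# All keys are strings, so the structure survives a dump/load cycle.
restored = json.loads(json.dumps(dataset_scores))
rows = list(iter_scores(restored))

print(len(rows))
```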

## Dataset Results

In [8]:
with open('scores/dataset_scores', 'r') as f:
    dataset_scores = json.load(f)

In [9]:
def data_table(_split_name, _model_name, dataset_scores=dataset_scores, models=models):
    # Init empty lists to gather data
    adult_train_acc = []
    adult_test_acc = []
    adult_train_f1 = []
    adult_test_f1 = []
    
    heart_train_acc = []
    heart_test_acc = []
    heart_train_f1 = []
    heart_test_f1 = []
    
    mush_train_acc = []
    mush_test_acc = []
    mush_train_f1 = []
    mush_test_f1 = []
    
    # Loop through all datasets, trials, splits, etc and add data
    # to empty lists
    for dataset_name, dataset_data in dataset_scores.items():
        for trial_name, trial_data in dataset_data[_split_name].items():
            if dataset_name == "data_adult":
                adult_train_acc.append(trial_data[_model_name]["train"]["accuracy"])
                adult_test_acc.append(trial_data[_model_name]["test"]["accuracy"])
                adult_train_f1.append(trial_data[_model_name]["train"]["f1_score"])
                adult_test_f1.append(trial_data[_model_name]["test"]["f1_score"])
            elif dataset_name == "data_mush":
                mush_train_acc.append(trial_data[_model_name]["train"]["accuracy"])
                mush_test_acc.append(trial_data[_model_name]["test"]["accuracy"])
                mush_train_f1.append(trial_data[_model_name]["train"]["f1_score"])
                mush_test_f1.append(trial_data[_model_name]["test"]["f1_score"])
            elif dataset_name == "data_heart":
                heart_train_acc.append(trial_data[_model_name]["train"]["accuracy"])
                heart_test_acc.append(trial_data[_model_name]["test"]["accuracy"])
                heart_train_f1.append(trial_data[_model_name]["train"]["f1_score"])
                heart_test_f1.append(trial_data[_model_name]["test"]["f1_score"])
    
    # Convert lists to numpy arrays for computations (mean, std)
    adult_train_acc = np.asarray(adult_train_acc)
    adult_test_acc = np.asarray(adult_test_acc)
    adult_train_f1 = np.asarray(adult_train_f1)
    adult_test_f1 = np.asarray(adult_test_f1)
    
    heart_train_acc = np.asarray(heart_train_acc)
    heart_test_acc = np.asarray(heart_test_acc)
    heart_train_f1 = np.asarray(heart_train_f1)
    heart_test_f1 = np.asarray(heart_test_f1)
    
    mush_train_acc = np.asarray(mush_train_acc)
    mush_test_acc = np.asarray(mush_test_acc)
    mush_train_f1 = np.asarray(mush_train_f1)
    mush_test_f1 = np.asarray(mush_test_f1)
    
    # Display variable dictionaries
    # Split keys hold the test_size fraction, so each label reads test/train
    disp_split = {
        "split_0.2" : "20/80",
        "split_0.8" : "80/20"
    }

    disp_model = {
        'model_log': "Logistic Regression",
        'model_svm': "Support Vector Machine",
        'model_tree': "Decision Tree",
        'model_perc': "Perceptron",
        'model_mlp': "Multi-Layer Perceptron",
        'model_knn': "K-Nearest Neighbor",
        'model_rf': "Random Forest"
    }
    
    # Create dataframe
    df = pd.DataFrame([[
        f"{np.around(100 * adult_train_acc.mean(), decimals=2)} ± {np.around(100 * adult_train_acc.std(), decimals=2)}%",
        f"{np.around(100 * adult_test_acc.mean(), decimals=2)} ± {np.around(100 * adult_test_acc.std(), decimals=2)}%",
        f"{np.around(100 * mush_train_acc.mean(), decimals=2)} ± {np.around(100 * mush_train_acc.std(), decimals=2)}%",
        f"{np.around(100 * mush_test_acc.mean(), decimals=2)} ± {np.around(100 * mush_test_acc.std(), decimals=2)}%",
        f"{np.around(100 * heart_train_acc.mean(), decimals=2)} ± {np.around(100 * heart_train_acc.std(), decimals=2)}%",
        f"{np.around(100 * heart_test_acc.mean(), decimals=2)} ± {np.around(100 * heart_test_acc.std(), decimals=2)}%",
    ]], columns=[f"Adult Train Acc {disp_split[_split_name]}",
                 f"Adult Test Acc {disp_split[_split_name]}",
                 f"Mushroom Train Acc {disp_split[_split_name]}",
                 f"Mushroom Test Acc {disp_split[_split_name]}",
                 f"Heart Train Acc {disp_split[_split_name]}",
                 f"Heart Test Acc {disp_split[_split_name]}"])
    df[f"Avg Train Acc {disp_split[_split_name]}"] = f"{np.around(100 * np.array([adult_train_acc.mean(), mush_train_acc.mean(), heart_train_acc.mean()]).mean(), decimals=2)}%"
    
    df[f"Avg Test Acc {disp_split[_split_name]}"] = f"{np.around(100 * np.array([adult_test_acc.mean(), mush_test_acc.mean(), heart_test_acc.mean()]).mean(), decimals=2)}%"
    df = df.T
    df.columns = [f"{disp_model[_model_name]}"]
    return df.T
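
Each table cell combines the mean and the standard deviation over trials. The following is a small standard-library sketch of that formatting step. It uses `statistics.pstdev` (population standard deviation), which matches the default behaviour of NumPy's `std`. The helper name `fmt_mean_std` and the input values are hypothetical.

```python
from statistics import mean, pstdev


def fmt_mean_std(values):
    """Format fractions in [0, 1] as 'mean ± std%' with two decimals."""
    return f"{round(100 * mean(values), 2)} ± {round(100 * pstdev(values), 2)}%"


# Hypothetical per-trial accuracies, for illustration only.
cell = fmt_mean_std([0.70, 0.72, 0.74])
print(cell)
```

With only three trials per cell the spread estimate is coarse, so small differences between classifiers should be read with some caution.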

### 20/80 Split

In [10]:
model_list = ['model_log', 'model_svm', 'model_tree', 'model_perc', 'model_mlp', 'model_knn','model_rf']
data_tables = []
for model in model_list:
    data_tables.append(data_table("split_0.2", model))
data_table_split_02 = pd.concat(data_tables).sort_values(by=['Avg Test Acc 20/80'], ascending=False)
data_table_split_02

| Model | Adult Train Acc 20/80 | Adult Test Acc 20/80 | Mushroom Train Acc 20/80 | Mushroom Test Acc 20/80 | Heart Train Acc 20/80 | Heart Test Acc 20/80 | Avg Train Acc 20/80 | Avg Test Acc 20/80 |
|---|---|---|---|---|---|---|---|---|
| Multi-Layer Perceptron | 99.99 ± 0.0% | 99.96 ± 0.03% | 99.98 ± 0.02% | 100.0 ± 0.0% | 89.12 ± 0.39% | 71.04 ± 0.77% | 96.37% | 90.33% |
| Logistic Regression | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 88.84 ± 0.0% | 70.49 ± 0.0% | 96.28% | 90.16% |
| Decision Tree | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 96.42 ± 3.46% | 69.95 ± 2.04% | 98.81% | 89.98% |
| Perceptron | 99.81 ± 0.19% | 99.8 ± 0.19% | 100.0 ± 0.0% | 100.0 ± 0.0% | 82.64 ± 1.55% | 69.95 ± 2.04% | 94.15% | 89.91% |
| Support Vector Machine | 100.0 ± 0.0% | 99.99 ± 0.02% | 97.06 ± 4.16% | 97.19 ± 3.97% | 92.29 ± 5.46% | 69.95 ± 0.77% | 96.45% | 89.04% |
| Random Forest | 98.01 ± 2.75% | 97.69 ± 2.56% | 100.0 ± 0.0% | 100.0 ± 0.0% | 96.14 ± 2.81% | 68.85 ± 1.34% | 98.05% | 88.85% |
| K-Nearest Neighbor | 98.37 ± 1.19% | 95.93 ± 0.18% | 100.0 ± 0.0% | 100.0 ± 0.0% | 92.98 ± 4.97% | 70.49 ± 1.34% | 97.12% | 88.81% |


### 80/20 Split

In [11]:
model_list = ['model_log', 'model_svm', 'model_tree', 'model_perc', 'model_mlp', 'model_knn','model_rf']
data_tables = []
for model in model_list:
    data_tables.append(data_table("split_0.8", model))
data_table_split_08 = pd.concat(data_tables).sort_values(by=['Avg Test Acc 80/20'], ascending=False)
data_table_split_08

| Model | Adult Train Acc 80/20 | Adult Test Acc 80/20 | Mushroom Train Acc 80/20 | Mushroom Test Acc 80/20 | Heart Train Acc 80/20 | Heart Test Acc 80/20 | Avg Train Acc 80/20 | Avg Test Acc 80/20 |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 99.97 ± 0.04% | 100.0 ± 0.0% | 82.03 ± 0.19% | 100.0% | 94.0% |
| Multi-Layer Perceptron | 99.98 ± 0.01% | 99.87 ± 0.09% | 99.92 ± 0.12% | 99.78 ± 0.21% | 96.67 ± 2.72% | 80.93 ± 1.18% | 98.86% | 93.53% |
| Perceptron | 99.85 ± 0.12% | 99.74 ± 0.09% | 100.0 ± 0.0% | 99.93 ± 0.03% | 98.89 ± 0.79% | 78.6 ± 2.67% | 99.58% | 92.76% |
| Random Forest | 97.87 ± 2.64% | 97.12 ± 2.48% | 100.0 ± 0.0% | 100.0 ± 0.0% | 98.33 ± 2.36% | 78.05 ± 1.27% | 98.73% | 91.72% |
| Decision Tree | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 99.44 ± 0.79% | 73.66 ± 1.46% | 99.81% | 91.22% |
| K-Nearest Neighbor | 97.51 ± 1.83% | 94.16 ± 0.15% | 99.96 ± 0.03% | 99.9 ± 0.01% | 96.11 ± 2.83% | 78.05 ± 0.19% | 97.86% | 90.7% |
| Support Vector Machine | 100.0 ± 0.0% | 99.97 ± 0.05% | 83.62 ± 23.16% | 83.93 ± 22.56% | 97.22 ± 3.93% | 80.38 ± 1.03% | 93.61% | 88.09% |


## Instructions

This is a single-person project; no teamwork is allowed.

Report format:

Write a report of more than 1,000 words (excluding references) with the following main sections: a) abstract, b) introduction, c) method, d) experiment, e) conclusion, and f) references. You can follow the paper format of leading machine learning journals such as the Journal of Machine Learning Research (http://www.jmlr.org/) or IEEE Trans. on Pattern Analysis and Machine Intelligence (http://www.computer.org/web/tpami), or of leading conferences like NeurIPS (https://papers.nips.cc/) and ICML (https://icml.cc/). There is no page limit for your report.

Bonus points: 

If you feel that your work deserves bonus points due to reasons such as: a) novel ideas and applications, b) large efforts in your own data collection/preparation, c) state-of-the-art classification results, or d) new algorithms, please create a "Bonus Points" section to specifically describe why you deserve bonus points.

In this project you will choose any three classifiers out of those tested in


We have been discussing the classification problem in the form of two-class classifiers throughout the class. Some classifiers, such as decision trees, KNN, and random forests, are agnostic to the number of classes, but others, such as SVM and boosting, which involve explicit objective functions, are not.



The basic requirement for the final project is based on the two-class classification problem. If you have additional bandwidth, you can experiment with the multi-class classification setting. When preparing the dataset to train your two-class classifier, please merge the labels into two groups, positives and negatives, if your dataset happens to consist of multi-class labels.
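As a rough illustration of the label merging described above, the sketch below maps a multi-class label vector to positives and negatives. The helper name `merge_to_binary` and the five-level severity labels are hypothetical and are not taken from any of the project datasets.

```python
import numpy as np


def merge_to_binary(labels, positive_classes):
    """Map multi-class labels to 1 (positive group) or 0 (negative group)."""
    labels = np.asarray(labels)
    return np.isin(labels, list(positive_classes)).astype(int)


# Hypothetical 5-level severity label: 0 = healthy, 1-4 = some degree of disease.
severity = np.array([0, 2, 1, 0, 4, 3, 0])
y_binary = merge_to_binary(severity, positive_classes={1, 2, 3, 4})
print(y_binary)
```

Which classes count as positive is a modelling decision; it is worth stating the chosen grouping in the report so the class balance can be interpreted.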



Train your classifiers using the setting (not all metrics are needed) described in the empirical study by Caruana and Niculescu-Mizil. You are supposed to reproduce consistent results as in the paper. However, do expect some small variations. When evaluating the algorithms, you don’t need to use all the metrics that were reported in the paper. Using one metric, e.g. the classification accuracy, is sufficient. Please report the cross-validated classification results with the corresponding learned hyper-parameters.
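One possible way to report a cross-validated accuracy together with the learned hyper-parameters is scikit-learn's `GridSearchCV`. The sketch below is illustrative only: the synthetic data from `make_classification`, the grid over `C`, and the 5-fold setting are assumptions, not the settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real two-class dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Illustrative grid over the regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

best_params = search.best_params_             # learned hyper-parameters
cv_accuracy = search.best_score_              # mean validation accuracy of the best setting
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit model on held-out data

print(best_params, round(cv_accuracy, 3), round(test_accuracy, 3))
```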

Note that since you are choosing your own libraries for the classifiers, there are implementation details that will affect the classification results. Even with the same SVM, different implementations will not give identical results when trained on the same dataset. Therefore, don't expect results identical to those in the paper, as you are probably using a subset of the data and not all the features. A slightly different ranking is OK, but the overall trend should be consistent, e.g. random forest should do well, more training data leads to better results, KNN is not necessarily very bad, etc.

If you compute accuracy and follow the basic requirement of picking 3 classifiers and 3 datasets, you are looking at 3 trials X 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20). Each time, report the best accuracy under the chosen hyper-parameters. Since the accuracy is averaged over the 3 trials to rank-order the classifiers, you will report 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20) X 3 accuracies (train, validation, test). When debugging, always check the training accuracy first: as a sanity check that your implementation is correct, you should at least be able to push the training accuracy high (i.e. overfit the data). The heatmaps for your hyper-parameters are details that do not need to be compared too carefully. The hyper-parameter search is internal, and the final conclusion about the classifiers is based on the best hyper-parameters obtained each time.
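The bookkeeping described above (3 trials per partition, with train, validation, and test accuracy averaged over the trials) could be organized roughly as follows. This is a minimal sketch for a single classifier on synthetic data; the `max_depth` grid and the use of the trial index as the random seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

train_fractions = [0.2, 0.5, 0.8]  # the 20/80, 50/50 and 80/20 partitions
n_trials = 3
results = {}

for frac in train_fractions:
    scores = []
    for trial in range(n_trials):
        # A different shuffle per trial, so the average smooths out lucky splits.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, random_state=trial, stratify=y
        )
        search = GridSearchCV(
            DecisionTreeClassifier(random_state=trial),
            {"max_depth": [2, 4, 8]},
            cv=5,
            scoring="accuracy",
        )
        search.fit(X_tr, y_tr)
        scores.append(
            (
                search.score(X_tr, y_tr),  # train accuracy of the refit best model
                search.best_score_,        # mean validation accuracy from CV
                search.score(X_te, y_te),  # test accuracy on the held-out part
            )
        )
    # Average (train, validation, test) over the trials.
    results[frac] = np.mean(scores, axis=0)

for frac, (train_acc, val_acc, test_acc) in results.items():
    print(f"train fraction {frac}: train={train_acc:.3f} val={val_acc:.3f} test={test_acc:.3f}")
```

Repeating the outer structure over classifiers and datasets gives the full grid of reported numbers.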

Since the exact data setting might have changed, the specific parameters and hyper-parameters reported in Caruana and Niculescu-Mizil paper serve as a guideline but you don't need to try all of them. You can try a few standard ones, as long as your classification results are reasonable. If you pick the multi-layer perceptron as one of your classifiers, note that you may need to increase the number of layers to e.g. 3 and create more neurons in each layer to attain good results, for some datasets.
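For the multi-layer perceptron remark above, a configuration with three hidden layers might look like the sketch below when using scikit-learn's `MLPClassifier`. The layer width of 64, the iteration cap, and the synthetic data are illustrative assumptions rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real two-class dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Three hidden layers of 64 units each; the sizes are illustrative only.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=500, random_state=0),
)
mlp.fit(X, y)

# Sanity check in the spirit of the instructions: training accuracy should be high.
train_accuracy = mlp.score(X, y)
print(f"training accuracy: {train_accuracy:.3f}")
```

In practice the layer sizes would themselves be part of the hyper-parameter grid, for example by passing several `hidden_layer_sizes` tuples to `GridSearchCV`.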

You can alternatively or additionally adopt the datasets and classifiers reported in a follow-up paper, Caruana et al. ICML 2008.
 
You are encouraged to use Python, but using other programming languages and platforms is ok. The candidate classifiers include:
1. Boosting family classifiers
http://www.mathworks.com/matlabcentral/fileexchange/21317-adaboost
or
https://github.com/dmlc/xgboost
2. Support vector machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
3. Random Forests
http://www.stat.berkeley.edu/~breiman/RandomForests/
4. Decision Tree
http://www.rulequest.com/Personal/ (please also see the sample MATLAB code in the attachment)
5. K-nearest neighbors
http://www.mathworks.com/matlabcentral/fileexchange/19345-efficient-k-nearest-neighbor-searchusing-jit
6. Neural Nets
http://www.cs.colostate.edu/~anderson/code/
http://www.mathworks.com/products/neural-network/code-examples.html
7. Logistic regression classifier
8. Bagging family

The links above are for your reference. You can implement your own classifier or download other versions you like online (but you need to make sure the online code is reliable). You are supposed to write a formal report describing the experiments you run and the corresponding results (plus code).


Grading
Note that if you do well by satisfying the minimum requirement e.g. 3 classifiers on 3 datasets with cross-validation, you will receive a decent score but not the full 100 points. We are looking for something a bit more and please see the guideline below.

When reporting the experimental results, there are two main sets of comparisons we are looking for:
a. For each dataset on each partition, show the comparison for different algorithms, and hopefully be consistent with the findings in the paper with Random Forests being the best etc.
b. For each classifier on each dataset, show the comparison across different partitions; you are supposed to show the increase of test accuracy (decrease of test error) with more training data and less test data.

Note that the performance and function calls vary with the particular ML libraries you are using. For example, the same SVM classifier provided in different toolboxes might result in different errors even when trained on the same dataset. But the overall differences should be reasonable and interpretable. You may obtain a ranking that is somewhat different from that in the paper, due to differences in the detailed implementation of the classifiers, different training sizes, features, etc. But the overall trend should be explainable. For example: random forest usually has pretty good performance; KNN might not be as bad as you had thought; kernel-based SVM is sometimes sensitive to the hyper-parameters; using more data in training will lead to improvement, especially on difficult cases.

The merit and grading of your project can be judged from the aspects described below, which are common when reviewing a paper:
1. How challenging and large are the datasets you are studying? (10 points)
2. Any aspects that are new in terms of algorithm development, uniqueness of the data, or new applications? (10 points)
3. Is your experimental design comprehensive? Have you done thorough experiments in tuning hyper-parameters and performing cross-validation? You should also try different data partitions (e.g. 20% training and 80% testing, 50% training and 50% testing, and 80% training and 20% testing) for multiple rounds, e.g. 3 times for each of the above three partitions, and compute average scores to reduce the chance of accidental results. Try to report both the training and testing errors after cross-validation. You are also encouraged to report the training and validation errors during cross-validation using classification error/accuracy curves w.r.t. the hyper-parameters. (50 points)
4. Is your report written in a professional way with sections including abstract, introduction, data and problem description, method description, experiments, conclusion, and references? (30 points)
5. Bonus points will be assigned to projects in which new ideas have been developed and implemented, or thorough experiments where extensive empirical studies have been carried out (e.g. evaluated on >=5 classifiers and >=4 datasets).
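The training and validation accuracy curves w.r.t. a hyper-parameter mentioned in item 3 can be read from `GridSearchCV.cv_results_` when `return_train_score=True` is set. The sketch below uses KNN on synthetic data; the grid of `n_neighbors` values is an illustrative assumption.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real two-class dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 9, 15]},
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # needed to get the training curve as well
)
search.fit(X, y)

# One row per hyper-parameter value, with mean train and validation accuracy.
curve = pd.DataFrame(
    {
        "n_neighbors": [p["n_neighbors"] for p in search.cv_results_["params"]],
        "train_acc": search.cv_results_["mean_train_score"],
        "val_acc": search.cv_results_["mean_test_score"],
    }
)
print(curve)
```

The resulting frame can be plotted directly, for example with `curve.plot(x="n_neighbors", y=["train_acc", "val_acc"])` if matplotlib is available. A widening gap between the two curves indicates overfitting at that setting.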

### Links

https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
https://stackoverflow.com/questions/47793569/choosing-top-k-models-using-gridsearchcv-in-scikit-learn