# Final Project
## Nicholas Schenone - A13599911

- 3 trials
- 7 classifiers
    - SVM
    - Logistic Regression
    - Decision Tree
    - Perceptron
    - Multilayer Perceptron
    - KNN
    - Random Forest
- 3 datasets
    - Heart Disease: https://www.kaggle.com/ronitf/heart-disease-uci
    - Mushroom: https://archive.ics.uci.edu/ml/datasets/Mushroom
    - Somerville Happiness Survey Data Set: https://archive.ics.uci.edu/ml/datasets/Somerville+Happiness+Survey
- 3 partitions (20/80, 50/50, 80/20)
- 3 accuracies per (train, validation, test)
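
The design above is a full cross of trials, datasets, classifiers, and partitions. As a minimal sketch, the runs can be enumerated as follows, using the short keys that appear later in this notebook. The `partitions` mapping is an illustrative assumption: it reads each partition as train/test and gives the fraction that would be passed as `test_size`.

```python
from itertools import product

num_trials = 3
datasets = ["happy", "mush", "heart"]
models = ["svm", "log", "tree", "perc", "mlp", "knn", "rf"]
# train/test partition -> fraction passed as test_size (assumed reading)
partitions = {"20/80": 0.8, "50/50": 0.5, "80/20": 0.2}

# Every (trial, dataset, model, partition) combination is one training run
runs = list(product(range(num_trials), datasets, models, partitions))
print(len(runs))  # 3 * 3 * 7 * 3 = 189
```

Each run would report three accuracies (train, validation, test), which are then averaged over the trials.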

### Imports

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

import json

import seaborn as sns

### Pre-Process Data

In [2]:
def adult_pre_process(data_path="data/adult/adult.csv", split=0.2):
    df_adult = pd.read_csv(data_path)

    # One-hot encode the features and the label separately (the label is assumed
    # to be the last raw column) so that no label indicator column ends up in X.
    X = pd.get_dummies(df_adult.iloc[:, :-1])
    X = StandardScaler().fit_transform(X)

    y = pd.get_dummies(df_adult.iloc[:, -1]).iloc[:, -1]
    y = y.values.ravel()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)

    return X, y, X_train, X_test, y_train, y_test

In [3]:
def heart_pre_process(data_path="data/heart_disease/heart.csv", split=0.2):
    df_heart = pd.read_csv(data_path)
    X = df_heart.iloc[:, 0 : len(df_heart.columns) - 1]
    X = StandardScaler().fit_transform(X)

    y = df_heart.iloc[:, len(df_heart.columns) - 1]
    y = y.values.ravel()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)

    return X, y, X_train, X_test, y_train, y_test

In [4]:
def mushroom_pre_process(data_path="data/mushroom/mushroom.csv", split=0.2):
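    df_mushroom = pd.read_csv(data_path, header=None)

    # Column 0 is the class label. Encode it separately from the features:
    # one-hot encoding the whole frame would leave the complementary label
    # indicator in X and leak the target.
    X = pd.get_dummies(df_mushroom.iloc[:, 1:])
    X = StandardScaler().fit_transform(X)

    y = pd.get_dummies(df_mushroom.iloc[:, 0]).iloc[:, 0]
    y = y.values.ravel()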
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)

    return X, y, X_train, X_test, y_train, y_test

In [5]:
def happiness_pre_process(data_path="data/happiness_survey/SomervilleHappinessSurvey2015.csv", split=0.2):
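    df_happy = pd.read_csv(data_path, encoding="utf-16")
    df_happy = df_happy.astype(str)

    # Column 0 is the class label. Encode it separately from the features:
    # one-hot encoding the whole frame would leave the complementary label
    # indicator in X and leak the target.
    X = pd.get_dummies(df_happy.iloc[:, 1:])
    X = StandardScaler().fit_transform(X)

    y = pd.get_dummies(df_happy.iloc[:, 0]).iloc[:, 0]
    y = y.values.ravel()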

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split)

    return X, y, X_train, X_test, y_train, y_test

In [6]:
def pre_process(dataset, split=0.2):
    if dataset == "happy":
        return happiness_pre_process(split=split)
    elif dataset == "mush":
        return mushroom_pre_process(split=split)
    elif dataset == "heart":
        return heart_pre_process(split=split)
    else:
        raise ValueError(f"Unknown dataset: {dataset!r}")

In [7]:
heart_X, heart_y, heart_X_train, heart_X_test, heart_y_train, heart_y_test = heart_pre_process(split=0.2)

In [8]:
mush_X, mush_y, mush_X_train, mush_X_test, mush_y_train, mush_y_test = mushroom_pre_process(split=0.2)

In [9]:
happy_X, happy_y, happy_X_train, happy_X_test, happy_y_train, happy_y_test = happiness_pre_process(split=0.2)

In [10]:
adult_X, adult_y, adult_X_train, adult_X_test, adult_y_train, adult_y_test = adult_pre_process(split=0.2)

### SVM

In [11]:
def clf_SVM(param_grid):
    return svm.SVC(C = param_grid["C"], gamma=param_grid["gamma"], kernel=param_grid["kernel"], max_iter = 10000)

### Logistic Regression

In [12]:
def clf_log(param_grid):
    return LogisticRegression(C = param_grid["C"], penalty = param_grid["penalty"], solver="liblinear", max_iter = 10000)

### Decision Tree

In [13]:
def clf_tree(param_grid):
    return DecisionTreeClassifier(criterion=param_grid["criterion"], max_depth=param_grid["max_depth"])

### Perceptron

In [14]:
def clf_perc(param_grid):
    return Perceptron(penalty=param_grid["penalty"],
                      alpha=param_grid["alpha"],
                      max_iter=param_grid["max_iter"],
                      tol=param_grid["tol"],
                      early_stopping=param_grid["early_stopping"])

### Multi-Layer Perceptron

In [15]:
def clf_mlp(param_grid):
    return MLPClassifier(activation=param_grid["activation"],
                      solver=param_grid["solver"],
                      hidden_layer_sizes=param_grid["hidden_layer_sizes"],
                      max_iter=param_grid["max_iter"],
                      tol=param_grid["tol"],
                      early_stopping=param_grid["early_stopping"])

### KNN

In [16]:
def clf_knn(param_grid):
    return KNeighborsClassifier(n_neighbors=param_grid["n_neighbors"])

### Random Forest

In [17]:
def clf_rf(param_grid):
    return RandomForestClassifier(bootstrap=param_grid["bootstrap"],
                                 max_depth=param_grid["max_depth"],
                                 max_features=param_grid["max_features"],
                                 min_samples_leaf=param_grid["min_samples_leaf"],
                                 min_samples_split=param_grid["min_samples_split"],
                                 n_estimators=param_grid["n_estimators"])

### General

In [18]:
def clf(model, param_grid):
    if model == "svm":
        return clf_SVM(param_grid)
    elif model=="log":
        return clf_log(param_grid)
    elif model=="tree":
        return clf_tree(param_grid)
    elif model=="perc":
        return clf_perc(param_grid)
    elif model=="mlp":
        return clf_mlp(param_grid)
    elif model=="knn":
        return clf_knn(param_grid)
    elif model=="rf":
        return clf_rf(param_grid)
    else:
        raise ValueError(f"Unknown model: {model!r}")

In [19]:
def train_model(classifier, X_train, y_train):
    classifier.fit(X_train, y_train)

## Hyperparameter Tuning

In [20]:
def hyper_tune(X_train, y_train, estimator, param_grid):
    # Randomized search: samples up to n_iter parameter combinations from
    # param_grid and scores each one with 15-fold cross-validation.
    search = RandomizedSearchCV(estimator=estimator, param_distributions=param_grid, cv=15, n_iter=30, n_jobs=-1, verbose=10)
    search.fit(X_train, y_train)
    print("Best params:", search.best_params_)
    return search.best_params_

### SVM

In [21]:
svm_param_grid = {
    "C" : [1, 10, 100, 1000, 10000],
    "gamma" : [1e-6, 1e-5, 1e-4, 1e-3, 1e-2],
    "kernel" : ["linear", "rbf"]
}

In [None]:
# Happiness SVM Tuning
best_param_grid_happy = hyper_tune(happy_X_train, happy_y_train, svm.SVC(), svm_param_grid)

with open('params/svm/best_param_grid_happy', 'w') as f:
    json.dump(best_param_grid_happy, f)

In [None]:
# Mushroom SVM Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, svm.SVC(), svm_param_grid)

with open('params/svm/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart SVM Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, svm.SVC(), svm_param_grid)

with open('params/svm/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

In [22]:
# Adult SVM Tuning
best_param_grid_adult = hyper_tune(adult_X_train, adult_y_train, svm.SVC(), svm_param_grid)

with open('params/svm/best_param_grid_adult', 'w') as f:
    json.dump(best_param_grid_adult, f)

Fitting 15 folds for each of 30 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    5.3s

### Logistic Regression

In [None]:
log_param_grid = {
    "C" : [1, 10, 100, 1000, 10000],
    "penalty" : ["l1", "l2"],
}

In [None]:
# Happiness Logistic Regression Tuning
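# liblinear supports both the "l1" and "l2" penalties in the grid and matches clf_log
best_param_grid_happy = hyper_tune(happy_X_train, happy_y_train, LogisticRegression(solver="liblinear", max_iter=10000), log_param_grid)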

with open('params/log/best_param_grid_happy', 'w') as f:
    json.dump(best_param_grid_happy, f)

In [None]:
# Mushroom Logistic Regression Tuning
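# liblinear supports both the "l1" and "l2" penalties in the grid and matches clf_log
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, LogisticRegression(solver="liblinear", max_iter=10000), log_param_grid)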

with open('params/log/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Logistic Regression Tuning
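# liblinear supports both the "l1" and "l2" penalties in the grid and matches clf_log
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, LogisticRegression(solver="liblinear", max_iter=10000), log_param_grid)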

with open('params/log/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

### Decision Tree

In [None]:
tree_param_grid = {
    "criterion" : ['gini', 'entropy'],
    "max_depth" : [4,6,8,12],
}

In [None]:
# Happiness Decision tree Tuning
best_param_grid_happy = hyper_tune(happy_X_train, happy_y_train, DecisionTreeClassifier(), tree_param_grid)

with open('params/tree/best_param_grid_happy', 'w') as f:
    json.dump(best_param_grid_happy, f)

In [None]:
# Mushroom Decision Tree Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, DecisionTreeClassifier(), tree_param_grid)

with open('params/tree/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Decision Tree Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, DecisionTreeClassifier(), tree_param_grid)

with open('params/tree/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

### Perceptron

In [None]:
perc_param_grid = {
    "penalty" : [None, "l1", "l2", "elasticnet"],
    "alpha" : [0.001, 0.0001, 0.00001],
    "max_iter" : [500, 1000, 2000],
    "tol" : [1e-4, 1e-3, 1e-2],
    "early_stopping" : [True, False]
}

In [None]:
# Happiness Perceptron Tuning
best_param_grid_happy = hyper_tune(happy_X_train, happy_y_train, Perceptron(), perc_param_grid)

with open('params/perc/best_param_grid_happy', 'w') as f:
    json.dump(best_param_grid_happy, f)

In [None]:
# Mushroom Perceptron Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, Perceptron(), perc_param_grid)

with open('params/perc/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Perceptron Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, Perceptron(), perc_param_grid)

with open('params/perc/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

### Multi-Layer Perceptron

In [None]:
mlp_param_grid = {
    "hidden_layer_sizes" : [(100,), (50,), (200,), (25,)],
    "activation" : ["identity", "logistic", "tanh", "relu"],
    "solver" : ["lbfgs", "sgd", "adam"],
    "max_iter" : [200, 100, 300],
    "tol" : [1e-4, 1e-3, 1e-5],
    "early_stopping" : [True, False]
}

In [None]:
# Happiness Multi-Layer Perceptron Tuning
best_param_grid_happy = hyper_tune(happy_X_train, happy_y_train, MLPClassifier(), mlp_param_grid)

with open('params/mlp/best_param_grid_happy', 'w') as f:
    json.dump(best_param_grid_happy, f)

In [None]:
# Mushroom Multi-Layer Perceptron Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, MLPClassifier(), mlp_param_grid)

with open('params/mlp/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Multi-Layer Perceptron Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, MLPClassifier(), mlp_param_grid)

with open('params/mlp/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

### KNN

In [None]:
knn_param_grid = {
    "n_neighbors" : [1, 3, 5, 9, 15, 25, 50, 75, 100],
}

In [None]:
# Happiness KNN Tuning
best_param_grid_happy = hyper_tune(happy_X_train, happy_y_train, KNeighborsClassifier(), knn_param_grid)

with open('params/knn/best_param_grid_happy', 'w') as f:
    json.dump(best_param_grid_happy, f)

In [None]:
# Mushroom KNN Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, KNeighborsClassifier(), knn_param_grid)

with open('params/knn/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart KNN Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, KNeighborsClassifier(), knn_param_grid)

with open('params/knn/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

### Random Forest

In [None]:
rf_param_grid = {
    'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': ['sqrt', 'log2'],  # 'auto' was an alias for 'sqrt' and was removed in scikit-learn 1.3
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}

In [None]:
# Happiness Random Forest Tuning
best_param_grid_happy = hyper_tune(happy_X_train, happy_y_train, RandomForestClassifier(), rf_param_grid)

with open('params/rf/best_param_grid_happy', 'w') as f:
    json.dump(best_param_grid_happy, f)

In [None]:
# Mushroom Random Forest Tuning
best_param_grid_mush = hyper_tune(mush_X_train, mush_y_train, RandomForestClassifier(), rf_param_grid)

with open('params/rf/best_param_grid_mush', 'w') as f:
    json.dump(best_param_grid_mush, f)

In [None]:
# Heart Random Forest Tuning
best_param_grid_heart = hyper_tune(heart_X_train, heart_y_train, RandomForestClassifier(), rf_param_grid)

with open('params/rf/best_param_grid_heart', 'w') as f:
    json.dump(best_param_grid_heart, f)

## Evaluate with best Params

In [None]:
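def evalModel(classifier, X_test, y_test):
    # Predict with the classifier passed in (the parameter name was previously
    # misspelled, so the body silently fell back to the global `classifier`).
    y_pred = classifier.predict(X_test)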
    
    accuracy= accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average="macro")
    recall = recall_score(y_test, y_pred, average="macro")
    f_score = f1_score(y_test, y_pred, average="macro") 
    
    return (accuracy, precision, recall, f_score)

### SVM

In [None]:
with open('params/svm/best_param_grid_happy', 'r') as f:
    best_param_grid_happy = json.load(f)
    
with open('params/svm/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/svm/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)

In [None]:
# Happiness SVM Training/Eval
print("Best params:", best_param_grid_happy, "\n")
classifier = clf_SVM(best_param_grid_happy)
train_model(classifier, happy_X_train, happy_y_train)
acc, prec, rec, f = evalModel(classifier, happy_X_test, happy_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Mushroom SVM
print("Best params:", best_param_grid_mush, "\n")
classifier = clf_SVM(best_param_grid_mush)
train_model(classifier, mush_X_train, mush_y_train)
acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart SVM
print("Best params:", best_param_grid_heart, "\n")
classifier = clf_SVM(best_param_grid_heart)
train_model(classifier, heart_X_train, heart_y_train)
acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Logistic Regression

In [None]:
with open('params/log/best_param_grid_happy', 'r') as f:
    best_param_grid_happy = json.load(f)
    
with open('params/log/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/log/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)

In [None]:
# Happiness Logistic Regression Training/Eval
print("Best params:", best_param_grid_happy, "\n")
classifier = clf_log(best_param_grid_happy)
train_model(classifier, happy_X_train, happy_y_train)
acc, prec, rec, f = evalModel(classifier, happy_X_test, happy_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Mushroom Logistic Regression Training/Eval
print("Best params:", best_param_grid_mush, "\n")
classifier = clf_log(best_param_grid_mush)
train_model(classifier, mush_X_train, mush_y_train)
acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart Logistic Regression Training/Eval
print("Best params:", best_param_grid_heart, "\n")
classifier = clf_log(best_param_grid_heart)
train_model(classifier, heart_X_train, heart_y_train)
acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Decision Tree

In [None]:
with open('params/tree/best_param_grid_happy', 'r') as f:
    best_param_grid_happy = json.load(f)
    
with open('params/tree/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/tree/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)

In [None]:
# Happiness Decision Tree Training/Eval
print("Best params:", best_param_grid_happy, "\n")
classifier = clf_tree(best_param_grid_happy)
train_model(classifier, happy_X_train, happy_y_train)
acc, prec, rec, f = evalModel(classifier, happy_X_test, happy_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Mushroom Decision Tree Training/Eval
print("Best params:", best_param_grid_mush, "\n")
classifier = clf_tree(best_param_grid_mush)
train_model(classifier, mush_X_train, mush_y_train)
acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart Decision Tree Training/Eval
print("Best params:", best_param_grid_heart, "\n")
classifier = clf_tree(best_param_grid_heart)
train_model(classifier, heart_X_train, heart_y_train)
acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Perceptron

In [None]:
with open('params/perc/best_param_grid_happy', 'r') as f:
    best_param_grid_happy = json.load(f)
    
with open('params/perc/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/perc/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)

In [None]:
# Happiness Perceptron Training/Eval
print("Best params:", best_param_grid_happy, "\n")
classifier = clf_perc(best_param_grid_happy)
train_model(classifier, happy_X_train, happy_y_train)
acc, prec, rec, f = evalModel(classifier, happy_X_test, happy_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Mushroom Perceptron Training/Eval
print("Best params:", best_param_grid_mush, "\n")
classifier = clf_perc(best_param_grid_mush)
train_model(classifier, mush_X_train, mush_y_train)
acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart Perceptron Training/Eval
print("Best params:", best_param_grid_heart, "\n")
classifier = clf_perc(best_param_grid_heart)
train_model(classifier, heart_X_train, heart_y_train)
acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Multi-Layer Perceptron

In [None]:
with open('params/mlp/best_param_grid_happy', 'r') as f:
    best_param_grid_happy = json.load(f)
    
with open('params/mlp/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/mlp/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)

In [None]:
# Happiness MLP Training/Eval
print("Best params:", best_param_grid_happy, "\n")
classifier = clf_mlp(best_param_grid_happy)
train_model(classifier, happy_X_train, happy_y_train)
acc, prec, rec, f = evalModel(classifier, happy_X_test, happy_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Mushroom MLP Training/Eval
print("Best params:", best_param_grid_mush, "\n")
classifier = clf_mlp(best_param_grid_mush)
train_model(classifier, mush_X_train, mush_y_train)
acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart MLP Training/Eval
print("Best params:", best_param_grid_heart, "\n")
classifier = clf_mlp(best_param_grid_heart)
train_model(classifier, heart_X_train, heart_y_train)
acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### KNN

In [None]:
with open('params/knn/best_param_grid_happy', 'r') as f:
    best_param_grid_happy = json.load(f)
    
with open('params/knn/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/knn/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)

In [None]:
# Happiness KNN Training/Eval
print("Best params:", best_param_grid_happy, "\n")
classifier = clf_knn(best_param_grid_happy)
train_model(classifier, happy_X_train, happy_y_train)
acc, prec, rec, f = evalModel(classifier, happy_X_test, happy_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Mushroom KNN Training/Eval
print("Best params:", best_param_grid_mush, "\n")
classifier = clf_knn(best_param_grid_mush)
train_model(classifier, mush_X_train, mush_y_train)
acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart KNN Training/Eval
print("Best params:", best_param_grid_heart, "\n")
classifier = clf_knn(best_param_grid_heart)
train_model(classifier, heart_X_train, heart_y_train)
acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

### Random Forest

In [None]:
with open('params/rf/best_param_grid_happy', 'r') as f:
    best_param_grid_happy = json.load(f)
    
with open('params/rf/best_param_grid_mush', 'r') as f:
    best_param_grid_mush = json.load(f)
    
with open('params/rf/best_param_grid_heart', 'r') as f:
    best_param_grid_heart = json.load(f)

In [None]:
# Happiness RF Training/Eval
print("Best params:", best_param_grid_happy, "\n")
classifier = clf_rf(best_param_grid_happy)
train_model(classifier, happy_X_train, happy_y_train)
acc, prec, rec, f = evalModel(classifier, happy_X_test, happy_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Mushroom RF Training/Eval
print("Best params:", best_param_grid_mush, "\n")
classifier = clf_rf(best_param_grid_mush)
train_model(classifier, mush_X_train, mush_y_train)
acc, prec, rec, f = evalModel(classifier, mush_X_test, mush_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

In [None]:
# Heart RF Training/Eval
print("Best params:", best_param_grid_heart, "\n")
classifier = clf_rf(best_param_grid_heart)
train_model(classifier, heart_X_train, heart_y_train)
acc, prec, rec, f = evalModel(classifier, heart_X_test, heart_y_test)
print(f"Accuracy: {acc} \nPrecision: {prec} \nRecall: {rec} \nF1 Score {f}")

## Evaluation Pipeline

- 3 datasets
    - 7 models
        - 3 trials
            - 3 splits (20/80, 50/50, 80/20)

In [None]:
num_trials = 3
datasets = ['happy', 'mush', 'heart']
splits = [0.2, 0.5, 0.8]
models = ['log', 'svm', 'tree', 'perc', 'mlp', 'knn', 'rf']

In [None]:
# Loop through datasets
dataset_scores = {}
for dataset in datasets:
    
    # Loop through models
    model_scores = {}
    for model in models:
        
        # Load best model params for given model and dataset
        with open(f'params/{model}/best_param_grid_{dataset}', 'r') as f:
            best_param_grid = json.load(f)

        # Loop through trials
        trial_scores = {}
        for i in range(num_trials):

            # Loop through dataset splits
            split_scores = {}
            for split in splits:

                # Prepare data splits
                X, y, X_train, X_test, y_train, y_test = pre_process(dataset=dataset, split=split)

                # Create classifier
                classifier = clf(model=model, param_grid=best_param_grid)

                # Train classifier
                train_model(classifier, X_train, y_train)

                # Evaluate classifier
                acc, prec, rec, f1 = evalModel(classifier, X_test, y_test)
                classifier_eval = {"accuracy" : acc, "precision": prec, "recall" : rec, "f1_score" : f1}

                # Add evaluation scores for given split
                split_scores.update({f"split_{split}": classifier_eval})

            # Add split scores for given trial
            trial_scores.update({f"trial_{i}" : split_scores})
                
        # Add trial scores for given model
        model_scores.update({f"model_{model}" : trial_scores})
        
    # Add model scores for given dataset
    dataset_scores.update({f"data_{dataset}": model_scores})
    
with open('scores/dataset_scores', 'w') as f:
    json.dump(dataset_scores, f)

In [None]:
with open('scores/dataset_scores', 'r') as f:
    dataset_scores = json.load(f)

## Dataset Results

In [None]:
def query_data(_dataset_name, _model_name, dataset_scores=dataset_scores):
    split_02 = []
    split_05 = []
    split_08 = []
    for dataset, data in dataset_scores.items():
        for model_name, model_data in data.items():
            for trial_name, trial_data in model_data.items():
                for split_name, split_data in trial_data.items():
                    if dataset==_dataset_name and model_name==_model_name:
                        if split_name=="split_0.2":
                            split_02.append(split_data["accuracy"])
                        elif split_name=="split_0.5":
                            split_05.append(split_data["accuracy"])
                        elif split_name=="split_0.8":
                            split_08.append(split_data["accuracy"])
                        
    df = pd.DataFrame([split_02, split_05, split_08], columns=["Trial 1", "Trial 2", "Trial 3"])
    df["Trial_Avg"] = df.T.mean()
#     df["Type"] = _model_name
#     df["Data"] = _dataset_name
    df.loc[3] = df.mean()
    df = df.rename({0:"Split 0.2", 1:"Split 0.5", 2: "Split 0.8", 3: "Split_Avg"})
    return df

### Happiness Survey

In [None]:
happy_df = pd.concat({"Log Regression": query_data("data_happy", "model_log"),
                      "SVM" : query_data("data_happy", "model_svm"),
                      "Decision Tree" : query_data("data_happy", "model_tree"),
                      "Perceptron" : query_data("data_happy", "model_perc"),
                      "MLP" : query_data("data_happy", "model_mlp"),
                     "KNN" : query_data("data_happy", "model_knn"),
                     "Random Forest" : query_data("data_happy", "model_rf")})
happy_df

### Mushroom

In [None]:
mush_df = pd.concat({"Log Regression": query_data("data_mush", "model_log"),
                              "SVM" : query_data("data_mush", "model_svm"),
                    "Decision Tree" : query_data("data_mush", "model_tree"),
                    "Perceptron" : query_data("data_mush", "model_perc"),
                    "MLP" : query_data("data_mush", "model_mlp"),
                    "KNN" : query_data("data_mush", "model_knn"),
                    "Random Forest" : query_data("data_mush", "model_rf")})
mush_df

### Heart Disease

In [None]:
heart_df = pd.concat({"Log Regression": query_data("data_heart", "model_log"),
                              "SVM" : query_data("data_heart", "model_svm"),
                     "Decision Tree" : query_data("data_heart", "model_tree"),
                     "Perceptron" : query_data("data_heart", "model_perc"),
                     "MLP" : query_data("data_heart", "model_mlp"),
                     "KNN" : query_data("data_heart", "model_knn"),
                     "Random Forest" : query_data("data_heart", "model_rf")})
heart_df

In [None]:
main_df = pd.concat({"Happy" : happy_df, "Mush" : mush_df, "Heart" : heart_df})
main_df.to_csv("data/main_df.csv")
main_df

## Instructions

This is a single-person project; no teamwork.

Report format:

Write a report with >1,000 words (excluding references) including the main sections: a) abstract, b) introduction, c) method, d) experiment, e) conclusion, and f) references. You can follow the paper format of, e.g., leading machine learning journals such as the Journal of Machine Learning Research (http://www.jmlr.org/) or IEEE Trans. on Pattern Analysis and Machine Intelligence (http://www.computer.org/web/tpami), or leading conferences like NeurIPS (https://papers.nips.cc/) and ICML (https://icml.cc/). There is no page limit for your report.

Bonus points: 

If you feel that your work deserves bonus points due to reasons such as: a) novel ideas and applications, b) large efforts in your own data collection/preparation, c) state-of-the-art classification results, or d) new algorithms, please create a "Bonus Points" section to specifically describe why you deserve bonus points.

In this project you will choose any three classifiers out of those tested in


Throughout the class we have discussed classification in terms of two-class classifiers. Some classifiers, such as decision trees, KNN, and random forests, are agnostic to the number of classes, but others, such as SVM and boosting, which optimize explicit objective functions, are not.



The basic requirement for the final project is based on the two-class classification problem. If you have additional bandwidth, you can experiment with the multi-class classification setting. When preparing the dataset to train your (two-class) classifier, please merge the labels into two groups, positives and negatives, if your dataset happens to consist of multi-class labels.
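
Merging multi-class labels into positives and negatives can be sketched as below. The helper `binarize_labels` and the five-level example label are hypothetical and only illustrate the idea; which classes count as positive is a choice that depends on the dataset.

```python
import numpy as np

def binarize_labels(y, positive_classes):
    """Map every label in positive_classes to 1 and all other labels to 0."""
    y = np.asarray(y)
    return np.isin(y, list(positive_classes)).astype(int)

# Hypothetical five-level severity label (0 = healthy, 1-4 = increasing severity)
y_multi = np.array([0, 2, 1, 0, 4, 3, 0])
y_binary = binarize_labels(y_multi, positive_classes={1, 2, 3, 4})
print(y_binary)  # [0 1 1 0 1 1 0]
```

The same helper works for string labels, e.g. `binarize_labels(["a", "b", "c"], {"c"})`.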



Train your classifiers using the setting (not all metrics are needed) described in the empirical study by Caruana and Niculescu-Mizil. You are supposed to reproduce consistent results as in the paper. However, do expect some small variations. When evaluating the algorithms, you don’t need to use all the metrics that were reported in the paper. Using one metric, e.g. the classification accuracy, is sufficient. Please report the cross-validated classification results with the corresponding learned hyper-parameters.
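
One way to report a cross-validated result together with its learned hyper-parameters is sketched below with scikit-learn's `GridSearchCV`. The synthetic data from `make_classification`, the KNN grid, and the 5-fold setting are assumptions made for illustration, not the settings of the paper or of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; swap in one of the real datasets)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Exhaustive search over a small grid, scored by cross-validated accuracy
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 9, 15]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Learned hyper-parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
# With refit=True (the default), score() uses the best model refit on X_train
print("Held-out test accuracy:", search.score(X_test, y_test))
```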

Note that since you are choosing your own libraries for the classifiers, implementation details will affect the classification results. Different implementations of the same SVM will not give identical results even when trained on the same dataset. Therefore, don't expect results identical to those in the paper, as you are probably using a subset of the data and not all the features. A small difference in ranking is OK, but the overall trend should be consistent, e.g. random forest should do well, more training data leads to better results, KNN is not necessarily very bad, etc.

If you compute accuracy and follow the basic requirement of picking 3 classifiers and 3 datasets, you are looking at 3 trials X 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20). Each time, report the best accuracy under the chosen hyper-parameters. Since the accuracy is averaged over the 3 trials to rank-order the classifiers, you will report 3 classifiers X 3 datasets X 3 partitions (20/80, 50/50, 80/20) X 3 accuracies (train, validation, test). When debugging, always check the training accuracy first: as a sanity check that your implementation is correct, make sure you can at least push the training accuracy high (i.e. overfit the data). The heatmaps for your hyper-parameters are details that do not need to be compared too carefully. The hyper-parameter search is internal, and the final conclusion about the classifiers is based on the best hyper-parameters you obtained each time.
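The protocol above (3 trials per partition, a hyper-parameter search per trial, and train/validation/test accuracies averaged over trials) could be sketched as follows. The function name `run_trials`, the synthetic data from `make_classification`, the decision tree, and its `max_depth` grid are assumptions chosen only to keep the example small; they are not the settings from the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

def run_trials(X, y, estimator, param_grid, train_fractions=(0.2, 0.5, 0.8), n_trials=3):
    """Return, per training fraction, accuracies averaged over n_trials random splits."""
    results = {}
    for frac in train_fractions:
        scores, best_params = [], []
        for trial in range(n_trials):
            # A different seed per trial gives a different random partition.
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=frac, random_state=trial, stratify=y)
            # 5-fold cross-validation on the training part selects the hyper-parameters.
            search = GridSearchCV(estimator, param_grid, cv=5, scoring="accuracy")
            search.fit(X_tr, y_tr)
            scores.append((
                search.score(X_tr, y_tr),   # train accuracy of the refit best model
                search.best_score_,         # mean cross-validated (validation) accuracy
                search.score(X_te, y_te),   # held-out test accuracy
            ))
            best_params.append(search.best_params_)
        train, val, test = np.mean(scores, axis=0)
        results[frac] = {"train": train, "validation": val, "test": test,
                         "best_params": best_params}
    return results

# Synthetic stand-in data; replace with one of the real datasets.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
summary = run_trials(X, y, DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]})
for frac, accs in summary.items():
    print(frac, {k: round(v, 3) for k, v in accs.items() if k != "best_params"})
```

Calling `run_trials` once per classifier and dataset yields the 3 x 3 x 3 x 3 grid of averaged accuracies; the per-trial `best_params` entries are the learned hyper-parameters to report alongside them.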

Since the exact data setting might have changed, the specific parameters and hyper-parameters reported in Caruana and Niculescu-Mizil paper serve as a guideline but you don't need to try all of them. You can try a few standard ones, as long as your classification results are reasonable. If you pick the multi-layer perceptron as one of your classifiers, note that you may need to increase the number of layers to e.g. 3 and create more neurons in each layer to attain good results, for some datasets.
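As a rough illustration of varying the depth and width of a multi-layer perceptron, the sketch below searches over a few architectures with scikit-learn's `MLPClassifier`. The specific layer sizes, `max_iter`, and the synthetic data are assumptions made for the example, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data; replace with a real, standardized dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate architectures: 1, 2, and 3 hidden layers of increasing width.
architectures = [(16,), (32, 32), (64, 64, 64)]
search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    {"hidden_layer_sizes": architectures},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
best = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

If scikit-learn emits convergence warnings on a real dataset, raising `max_iter` or standardizing the features first is a reasonable next step.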

You can alternatively or additionally adopt the datasets and classifiers reported in a follow-up paper, Caruana et al. ICML 2008.
 
You are encouraged to use Python, but using other programming languages and platforms is ok. The candidate classifiers include:
1. Boosting family classifiers
http://www.mathworks.com/matlabcentral/fileexchange/21317-adaboost
or
https://github.com/dmlc/xgboost
2. Support vector machines
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
3. Random Forests
http://www.stat.berkeley.edu/~breiman/RandomForests/
4. Decision Tree
http://www.rulequest.com/Personal/ (please also see the sample MATLAB code in the attachment)
5. K-nearest neighbors
http://www.mathworks.com/matlabcentral/fileexchange/19345-efficient-k-nearest-neighbor-searchusing-jit
6. Neural Nets
http://www.cs.colostate.edu/~anderson/code/
http://www.mathworks.com/products/neural-network/code-examples.html
7. Logistic regression classifier
8. Bagging family

The links above are for your reference. You can implement your own classifier or download other
versions you like online (but make sure the online code is reliable). You are supposed to
write a formal report describing the experiments you run and the corresponding results (plus
code).


Grading
Note that if you do well on the minimum requirement, e.g. 3 classifiers on 3 datasets with cross-validation, you will receive a decent score but not the full 100 points. We are looking for something a bit more; please see the guideline below.

When reporting the experimental results, there are two main sets of comparisons we are looking for:
a. For each dataset on each partition, show the comparison between different algorithms, which will hopefully be consistent with the findings in the paper, with Random Forests being the best, etc.
b. For each classifier on each dataset, show the comparison across different partitions; you are supposed to show the increase of test accuracy (decrease of test error) with more training data and less test data.
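The two comparisons above could be laid out as pivot tables over per-trial results, as in the sketch below. The synthetic dataset, the two untuned classifiers, and the column names are assumptions for illustration; in the real experiment each record would hold the test accuracy of a tuned model on a real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-ins: one synthetic dataset and two untuned classifiers.
datasets = {"synthetic": make_classification(n_samples=300, n_features=10, random_state=0)}
classifiers = {"knn": KNeighborsClassifier(), "tree": DecisionTreeClassifier(random_state=0)}
partitions = {"20/80": 0.2, "50/50": 0.5, "80/20": 0.8}  # label -> training fraction

records = []
for data_name, (X, y) in datasets.items():
    for part_name, frac in partitions.items():
        for trial in range(3):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=frac, random_state=trial, stratify=y)
            for clf_name, clf in classifiers.items():
                acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
                records.append({"dataset": data_name, "partition": part_name,
                                "classifier": clf_name, "test_acc": acc})

df = pd.DataFrame(records)
# (a) algorithms side by side for each dataset and partition, averaged over trials
by_algorithm = df.pivot_table(index=["dataset", "partition"], columns="classifier",
                              values="test_acc", aggfunc="mean")
# (b) partitions side by side for each dataset and classifier, averaged over trials
by_partition = df.pivot_table(index=["dataset", "classifier"], columns="partition",
                              values="test_acc", aggfunc="mean")
print(by_algorithm.round(3))
print(by_partition.round(3))
```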

Note that the performance and function calls vary with the particular ML libraries you are using. For example, the same SVM classifier provided in different toolboxes might produce different errors even when trained on the same dataset. But the overall differences should be reasonable and interpretable. You may obtain a ranking that is somewhat different from that in the paper, due to differences in the detailed implementation of the classifiers, different training sizes, features, etc. But the overall trend should be explainable. For example, random forest usually performs well; KNN might not be as bad as you had thought; kernel-based SVM is sometimes sensitive to the hyper-parameters; and using more data in training will lead to improvement, especially on difficult cases.

The merit and grading of your project can be judged from aspects described below that are common
when reviewing a paper:
1. How challenging and large are the datasets you are studying? (10 points)
2. Any aspects that are new in terms of algorithm development, uniqueness of the data, or new
applications? (10 points)
3. Is your experimental design comprehensive? Have you done thorough experiments in tuning
hyper-parameters and performing cross-validation? You should also try different data partitions (e.g. 20% training and 80% testing, 50% training and 50% testing, and 80% training and 20% testing) for multiple rounds, e.g. 3 times for each of the three partitions, and compute average scores to reduce the chance of accidental results. Try to report both the training and testing errors after cross-validation; you are also encouraged to report the training and validation errors during cross-validation using classification error/accuracy curves w.r.t. the hyper-parameters. (50 points)
4. Is your report written in a professional way with sections including abstract, introduction, data
and problem description, method description, experiments, conclusion, and references? (30
points)
5. Bonus points will be assigned to projects in which new ideas have been developed and implemented, or thorough experiments where extensive empirical studies have been carried out (e.g. evaluated on >=5 classifiers and >=4 datasets).
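The training and validation accuracy curves w.r.t. a hyper-parameter mentioned in item 3 can be read from a grid search's `cv_results_` when `return_train_score=True` is set. The sketch below assumes KNN, a small list of `n_neighbors` values, and synthetic data purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with the training split of a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

k_values = [1, 3, 5, 9, 15]
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": k_values},
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # needed to get mean_train_score in cv_results_
)
search.fit(X, y)

# One entry per hyper-parameter value, in the order of k_values.
train_curve = search.cv_results_["mean_train_score"]
val_curve = search.cv_results_["mean_test_score"]
for k, tr, va in zip(k_values, train_curve, val_curve):
    print(f"k={k:2d}  train={tr:.3f}  validation={va:.3f}")
```

The two arrays can be plotted against `k_values` as accuracy curves, or converted to error curves as `1 - accuracy`.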

### Links

https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74