# Combining two feature sets

This notebook duplicates tests done in the `project3 combined features.ipynb` notebook, except uses decision trees as the feature selector. Below is a description of the tests.

In this notebook I will test combining two sets of features that were independently somewhat successful:
* Connection strengths between regions (only positive values)
* Local network statistics for each region

Only the most correlated features from each of these sets will be used, chosen by decision tree feature importance.

Note that the local network statistics need to be saved as a pickle file in this directory. They can be generated by running the code in `project3 network analysis - local level.ipynb`.

**Ensembling** was tested with a combination of the models as well. The (mean) ROC AUC score was calculated for the cross validated training data, and for the held out test set.

### Outcomes:
Wrt feature selection:
*

Previous outcomes:
* Combining these feature sets improved the ROC AUC and accuracy scores. These combined features were used in my final analysis.

* Additionally, a soft-voting ensemble classifier with logistic regression, Naive Bayes, and RBF SVM models improved on the ROC AUC score. This ensemble model is used for my final analysis.

In [1]:
%matplotlib inline

In [5]:
from bs4 import BeautifulSoup
from collections import defaultdict, OrderedDict

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

from scipy.stats import pearsonr, spearmanr

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score
from sklearn.metrics import roc_auc_score, plot_roc_curve, make_scorer, roc_curve
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
#from lightgbm import LGBMClassifier

from imblearn.over_sampling import RandomOverSampler, ADASYN

import os
import re

import pickle as pkl

### Load data

In [6]:
load_connection_features = True
if load_connection_features:
    with open("all_connection_features.pkl", "rb") as f:
        X = pkl.load(f)

In [7]:
with open("all_local_node_features.pkl", "rb") as f:
    node_measures = pkl.load(f)

In [8]:
X = pd.concat([X, node_measures], axis=1)

## Split data into training and hold out

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X.drop(columns=["adhd"]), X["adhd"], test_size=.2, random_state=2)

In [10]:
# standard scale the data
scale = True
if scale:
    scaler = StandardScaler().fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

In [11]:
# percentage of subjects with adhd in the training and testing sets
print("train positive class ratio: {:.3f}".format(np.count_nonzero(y_train == 1)/len(y_train)))
print("test positive class ratio: {:.3f}".format(np.count_nonzero(y_test == 1)/len(y_test)))

train positive class ratio: 0.368
test positive class ratio: 0.356


## Features selection using random forest feature importance

In [88]:
def most_important_features(X_tr, y_tr, X_te, num_features):
    print("training random forest classifier...")
    rf = RandomForestClassifier(max_depth=5, n_estimators=100)
    rf.fit(X_tr, y_tr)
    
    most_important_idxs = np.argsort(-rf.feature_importances_)
    most_important_idxs = most_important_idxs[:num_features]
    return (X_tr[:, most_important_idxs], X_te[:, most_important_idxs])

In [89]:
X_train, X_test = most_important_features(X_train, y_train, X_test, 1000)

training random forest classifier...


## Modeling

In [90]:
# trains the specified models and prints a confusion matrix for test set predictions
def test_models(models, X_train, y_train, X_test, y_test):
    for model_name in list(models.keys()):
        m = models[model_name]
        m.fit(X_train, y_train)
        
        preds = m.predict(X_test)
        
        roc_auc = roc_auc_score(y_test, preds)
        
        print("{}:".format(model_name))
        print("train acc = {:.3f}".format(m.score(X_train, y_train)))
        print("test acc = {:.3f}".format(m.score(X_test, y_test)))
        print(confusion_matrix(y_test, preds))
        print("ROC AUC = {:.3f}".format(roc_auc))
        print("-----------------------")

In [91]:
models = {"Logistic regression": LogisticRegressionCV(),
          "KNN": KNeighborsClassifier(n_neighbors=5),
          "SVM": svm.SVC(kernel="rbf"),
          "Naive Bayes": GaussianNB(), 
          "Random forest": RandomForestClassifier(max_depth=4), 
          #"Gradient boosting machine": LGBMClassifier(max_depth=4)
         }
test_models(models, X_train, y_train, X_test, y_test)

Logistic regression:
train acc = 0.887
test acc = 0.625
[[54 13]
 [26 11]]
ROC AUC = 0.552
-----------------------
KNN:
train acc = 0.726
test acc = 0.654
[[67  0]
 [36  1]]
ROC AUC = 0.514
-----------------------
SVM:
train acc = 0.986
test acc = 0.587
[[49 18]
 [25 12]]
ROC AUC = 0.528
-----------------------
Naive Bayes:
train acc = 0.808
test acc = 0.529
[[40 27]
 [22 15]]
ROC AUC = 0.501
-----------------------
Random forest:
train acc = 0.889
test acc = 0.644
[[66  1]
 [36  1]]
ROC AUC = 0.506
-----------------------


## Oversample the minority class
All of these classifiers have horrible positive class precision due to small positive class. If seperating subjects by gender, males already have an almost even split, this won't do anything.

In [71]:
X_train_resampled, y_train_resampled = RandomOverSampler(random_state=0).fit_sample(X_train, y_train)

In [72]:
models = {"Logistic regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(n_neighbors=5),
          "SVM": svm.SVC(kernel="rbf"),
          "Naive Bayes": GaussianNB(), 
          "Random forest": RandomForestClassifier(max_depth=4), 
          #"Gradient boosting machine": LGBMClassifier(max_depth=4)
         }
test_models(models, X_train_resampled, y_train_resampled, X_test, y_test)

Logistic regression:
train acc = 1.000
test acc = 0.548
[[39 28]
 [19 18]]
ROC AUC = 0.534
-----------------------
KNN:
train acc = 0.747
test acc = 0.596
[[57 10]
 [32  5]]
ROC AUC = 0.493
-----------------------
SVM:
train acc = 0.998
test acc = 0.538
[[43 24]
 [24 13]]
ROC AUC = 0.497
-----------------------
Naive Bayes:
train acc = 0.798
test acc = 0.519
[[40 27]
 [23 14]]
ROC AUC = 0.488
-----------------------
Random forest:
train acc = 0.975
test acc = 0.596
[[52 15]
 [27 10]]
ROC AUC = 0.523
-----------------------


#### Synthetic oversampling:

In [None]:
X_adasyn, y_adasyn = ADASYN(random_state=0).fit_sample(X_train, y_train)

In [None]:
models = {"Logistic regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(n_neighbors=5),
          "SVM": svm.SVC(kernel="rbf"),
          "Naive Bayes": GaussianNB(), 
          "Random forest": RandomForestClassifier(max_depth=4), 
          "Gradient boosting machine": LGBMClassifier(max_depth=4)
         }
test_models(models, X_adasyn, y_adasyn, X_test, y_test)

## Neural net

In [73]:
mlp = MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000)
mlp.fit(X_train, y_train)
print("train acc = {:.3f}".format(mlp.score(X_train, y_train)))
print("test acc = {:.3f}".format(mlp.score(X_test, y_test)))
print("ROC AUC = {:.3f}".format(roc_auc_score(mlp.predict(X_test), y_test)))
print(confusion_matrix(y_test, mlp.predict(X_test)))

train acc = 0.959
test acc = 0.529
ROC AUC = 0.469
[[45 22]
 [27 10]]


## Ensemble model

In [74]:
estimators = [("Logistic regression", LogisticRegression(C=0.0886)),
              #("mlp", MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000)),
              #("KNN", KNeighborsClassifier(n_neighbors=5)),
              ("SVM", svm.SVC(kernel="rbf", probability=True)),
              ("Naive Bayes", GaussianNB()),
              #("Random forest", RandomForestClassifier(max_depth=4)),
              #("Gradient boosting machine", LGBMClassifier(max_depth=4))
              ]
voter = VotingClassifier(estimators, voting="soft")

voter.fit(X_train, y_train)

preds = voter.predict(X_test)
roc_auc = roc_auc_score(preds, y_test)

print("train acc = {:.3f}".format(voter.score(X_train, y_train)))
print("test acc = {:.3f}".format(voter.score(X_test, y_test)))
print(confusion_matrix(y_test, preds))
print("ROC AUC = {:.3f}".format(roc_auc))

train acc = 0.945
test acc = 0.538
[[41 26]
 [22 15]]
ROC AUC = 0.508


In [None]:
conf_mat = print_confusion_matrix(confusion_matrix(y_test, preds), class_names=["No ADHD", "ADHD"], fontsize=17)

In [None]:
conf_mat.savefig("confusion_matrix.png", dpi=300, transparent=True)

In [None]:
print("recall:", recall_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("ROC AUC:", roc_auc_score(y_test, preds))

In [None]:
probas = voter.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, probas[:,1], pos_label=1)

In [None]:
plt.figure(figsize=(10,8))
plt.plot(fpr, tpr, linewidth=3)
plt.plot([0,1], [0,1], linestyle="--", linewidth=2)
plt.xlabel("False Positive Rate", fontsize=16, labelpad=10)
plt.ylabel("True\nPositive\nRate", fontsize=16, rotation=0, labelpad=40)
plt.title("ROC Curve", fontsize=24)
plt.tight_layout()
plt.savefig("ROC_curve.png", dpi=200, transparent=True)

## Average cross validated scores

In [None]:
model_avg_roc_aucs = {}

In [None]:
models = [("Logistic regression", LogisticRegression(C=0.0886)),
              ("KNN", KNeighborsClassifier(n_neighbors=5)),
              ("SVM", svm.SVC(kernel="rbf")),
              ("Naive Bayes", GaussianNB()),
              ("Random forest", RandomForestClassifier(max_depth=4)),
              ("Gradient boosting machine", LGBMClassifier(max_depth=4)),
              ("mlp", MLPClassifier(hidden_layer_sizes=(2,), max_iter=1500))]

for model_name, model in models:
    roc_scorer = make_scorer(roc_auc_score)
    scores = cross_val_score(model, X_train, y_train, scoring=roc_scorer, cv = 5)
    model_avg_roc_aucs[model_name] = np.mean(scores)

In [None]:
model_avg_roc_aucs

## Ensemble Cross val score

In [None]:
estimators = [("Logistic regression", LogisticRegression(C=0.0886)),
              ("SVM", svm.SVC(probability=True, kernel="rbf")),
              ("Naive Bayes", GaussianNB())]

voter = VotingClassifier(estimators, voting="soft")

voter.fit(X_train, y_train)

roc_scorer = make_scorer(roc_auc_score)
scores = cross_val_score(voter, X_train, y_train, scoring=roc_scorer, cv = 5)
print(scores)
print("mean:", np.mean(scores))

In [None]:
preds = voter.predict(X_test)
print(roc_auc_score(y_test, preds))
print(confusion_matrix(y_test, preds))