# Stacking ensemble to generalize the results of the single TabNet models

After training a separate TabNet model on each single MRI modality, we use a stacking ensemble to combine their results, aiming for better generalization and a higher AUC.

Each single TabNet model is trained to predict suicidal ideation in children.

In this notebook, we stack a TabNet model trained with structural MRI (sMRI) and a TabNet model trained with diffusion MRI (dMRI).

# Get ready to work!

## 1. Import libraries and load dataset

In [None]:
from google.colab import drive
drive.mount('content')

Drive already mounted at content; to attempt to forcibly remount, call drive.mount("content", force_remount=True).


In [None]:
!pip install pytorch_tabnet

Collecting pytorch_tabnet
  Downloading https://files.pythonhosted.org/packages/94/e5/2a808d611a5d44e3c997c0d07362c04a56c70002208e00aec9eee3d923b5/pytorch_tabnet-3.1.1-py3-none-any.whl
Installing collected packages: pytorch-tabnet
Successfully installed pytorch-tabnet-3.1.1


In [None]:
# Imports that were never used in this notebook (load_iris, mlxtend, make_pipeline) are dropped
from sklearn.linear_model import LogisticRegression
from pathlib import Path
import pandas as pd
import numpy as np
import dill  # used to load the pickled TabNet result objects
import os

#train_out = Path('si_ppc_rsf(corr)_s_dMRI_train.csv')
#test_out =Path('si_ppc_rsf(corr)_s_dMRI_test.csv')
train_out = '/content/sample_data/si_ppc_rsf(corr)_s_dMRI_train.csv'
test_out = '/content/sample_data/si_ppc_rsf(corr)_s_dMRI_test.csv'

train_data= pd.read_csv(train_out)
test_data= pd.read_csv(test_out)
print(len(train_data), len(test_data))

target ='Suicidalideation'
unused_feat = ['subjectkey', 'abcd_site']

# get subject keys from the data (not from the trained model)
real_train_subjkey = train_data['subjectkey']
real_test_subjkey = test_data['subjectkey']

5510 317


## 2. Load model trained with single MRI dataset

To stack the two models, we load the predicted probabilities produced by each single TabNet model.

These probabilities are the input features of the meta-learner.
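To make the idea concrete, here is a minimal sketch of stacking on synthetic data. It assumes two base models that each emit one probability per subject; the arrays `p_smri`, `p_dmri`, and `y` are made-up stand-ins and are not the data used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)  # synthetic ground-truth labels

# Hypothetical base-model probabilities: noisy scores that correlate with y
p_smri = np.clip(0.3 * y + rng.normal(0.3, 0.15, size=n), 0, 1)
p_dmri = np.clip(0.2 * y + rng.normal(0.3, 0.15, size=n), 0, 1)

# Meta-learner input: one column per base model, one row per subject
X_meta = np.column_stack([p_smri, p_dmri])

# The meta-learner maps the base probabilities to a final probability
meta = LogisticRegression()
meta.fit(X_meta, y)
stacked_prob = meta.predict_proba(X_meta)[:, 1]
print(X_meta.shape, stacked_prob.shape)
```

The rows of the two probability columns must refer to the same subjects in the same order, which is why the subject keys are checked later in this notebook.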


### Open the TabNet model trained with structural MRI

The 'Suicidality_TabNet.ipynb' notebook contains functions that save the predicted probability (preds_prob_) and the class value (Y_) together with the subject keys.

In this notebook, we load the saved model results to get the probability and the class value.

*   probability: the probability predicted for each subject by the trained model
*   class value: the ground-truth label for each subject


In [None]:
"""sMRI model"""
with open('/content/sample_data/new_SMRI.pkl', 'rb') as f:
    sMRI_model = dill.load(f, encoding='utf-8')

# define probability array
sMRI_validation_prob_result = np.array(sMRI_model.valid_prob_result.iloc[:,2])
sMRI_test_prob_result = np.array(sMRI_model.test_prob_result.iloc[:,2])

# define subject array
sMRI_y_valid_subject = sMRI_model.valid_prob_result.iloc[:,0].tolist()
sMRI_y_test_subject = sMRI_model.test_prob_result.iloc[:,0].tolist()

print(len(sMRI_validation_prob_result))
print(len(sMRI_test_prob_result))
print(len(sMRI_y_valid_subject))
print(len(sMRI_y_test_subject))

5510
317
5510
317


In [None]:
sMRI_model.valid_prob_result

Unnamed: 0,subjectkey,Y_,preds_prob_
0,NDARINVZVGAMFG7,0,0.027991
1,NDARINVWU8LHADL,0,0.037669
2,NDARINVHP3KMLEB,0,0.050731
3,NDARINV4HX3418K,0,0.038738
4,NDARINVT3Z5YXE1,0,0.065922
...,...,...,...
5505,NDARINVP4GPP1NB,1,0.084422
5506,NDARINV8P2DZ5TR,0,0.060809
5507,NDARINVBZH975R2,0,0.049012
5508,NDARINV34HCN2RW,0,0.057252


In [None]:
"""Import four arrays from each stored TabNet model"""
"""dMRI model"""
with open('/content/sample_data/new_DMRI.pkl', 'rb') as f:
    dMRI_model = dill.load(f, encoding='utf-8')

# define probability array
dMRI_validation_prob_result = np.array(dMRI_model.valid_prob_result.iloc[:,2])
dMRI_test_prob_result = np.array(dMRI_model.test_prob_result.iloc[:,2])

# define subject array
dMRI_y_valid_subject = dMRI_model.valid_prob_result.iloc[:,0].tolist()
dMRI_y_test_subject = dMRI_model.test_prob_result.iloc[:,0].tolist()

print(len(dMRI_validation_prob_result))
print(len(dMRI_test_prob_result))
print(len(dMRI_y_valid_subject))
print(len(dMRI_y_test_subject))

5510
317
5510
317


In [None]:
dMRI_model.valid_prob_result

Unnamed: 0,subjectkey,Y_,preds_prob_
0,NDARINVZVGAMFG7,0,0.050195
1,NDARINVWU8LHADL,0,0.064870
2,NDARINVHP3KMLEB,0,0.068794
3,NDARINV4HX3418K,0,0.058853
4,NDARINVT3Z5YXE1,0,0.061689
...,...,...,...
5505,NDARINVP4GPP1NB,1,0.113498
5506,NDARINV8P2DZ5TR,0,0.089805
5507,NDARINVBZH975R2,0,0.098930
5508,NDARINV34HCN2RW,0,0.087397


## Sort train/test data by subject key value loaded from each TabNet model

Both TabNet models store their subject keys in the same order.
We therefore reorder the train labels to follow the dMRI model's subject order and the test labels to follow the sMRI model's subject order.

In [None]:
# train data -> using subjectkey order of dMRI model
real_train_y = []
real_train_new_subjkey = []
for i in range(len(dMRI_y_valid_subject)):
    for j in range(len(real_train_subjkey)):
        if(dMRI_y_valid_subject[i] == real_train_subjkey[j]):
            real_train_y.append(train_data[target][j])
            real_train_new_subjkey.append(real_train_subjkey[j])
            break
print(dMRI_y_valid_subject == real_train_new_subjkey)

# test data -> using subjectkey order of sMRI model
real_test_y = []
real_test_new_subjkey = []
for i in range(len(sMRI_y_test_subject)):
    for j in range(len(real_test_subjkey)):
        if(sMRI_y_test_subject[i] == real_test_subjkey[j]):
            real_test_y.append(test_data[target][j])
            real_test_new_subjkey.append(real_test_subjkey[j])
            break
print(sMRI_y_test_subject == real_test_new_subjkey)

True
True


In [None]:
dMRI_y_valid_subject == sMRI_y_valid_subject == real_train_new_subjkey

True

In [None]:
dMRI_y_test_subject == sMRI_y_test_subject == real_test_new_subjkey

True
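The nested loops used above for the alignment scale with the product of the two list lengths. As a hedged alternative, the same reordering can be sketched with a pandas index lookup. The names `toy_data` and `model_order` are illustrative, and the sketch assumes subject keys are unique; unlike the loops, which silently skip a missing key, `reindex` marks it as NaN.

```python
import pandas as pd

# Toy stand-in for train_data (not the real dataset)
toy_data = pd.DataFrame({
    "subjectkey": ["A", "B", "C", "D"],
    "Suicidalideation": [0, 1, 0, 1],
})
# Toy stand-in for the subject order stored with a base model
model_order = ["C", "A", "D", "B"]

# Index the labels by subject key, then reorder to the model's subject order
labels_by_key = toy_data.set_index("subjectkey")["Suicidalideation"]
aligned_y = labels_by_key.reindex(model_order)

# A key missing from the data would appear as NaN, so check before use
assert not aligned_y.isna().any()
aligned_y = aligned_y.astype(int).tolist()
print(aligned_y)  # [0, 0, 1, 1]
```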

## Make train/test input dataset used in meta learner

In [None]:
"""Build Train, Test Dataset for Meta Learner"""

"""X (base-model probabilities) used to train the meta learner"""
y_train_data = {
    "sMRI" : sMRI_validation_prob_result,
    "dMRI" : dMRI_validation_prob_result
}
MRI_train_y = pd.DataFrame(y_train_data, columns=["sMRI", "dMRI"])
print(MRI_train_y)

"""X (base-model probabilities) used to test the meta learner"""
y_test_data = {
    "sMRI" : sMRI_test_prob_result,
    "dMRI" : dMRI_test_prob_result
}
MRI_test_y = pd.DataFrame(y_test_data, columns=["sMRI", "dMRI"])
print(MRI_test_y)

# X (input value) in meta learner
print(len(MRI_train_y), len(MRI_test_y))

# Y (output value) in meta learner
print(len(real_train_y), len(real_test_y))

          sMRI      dMRI
0     0.027991  0.050195
1     0.037669  0.064870
2     0.050731  0.068794
3     0.038738  0.058853
4     0.065922  0.061689
...        ...       ...
5505  0.084422  0.113498
5506  0.060809  0.089805
5507  0.049012  0.098930
5508  0.057252  0.087397
5509  0.051134  0.090343

[5510 rows x 2 columns]
         sMRI      dMRI
0    0.059449  0.067849
1    0.071267  0.060311
2    0.036613  0.041507
3    0.160847  0.092787
4    0.053307  0.046663
..        ...       ...
312  0.078056  0.068899
313  0.081891  0.120372
314  0.124456  0.146850
315  0.044546  0.033701
316  0.073807  0.056147

[317 rows x 2 columns]
5510 317
5510 317


# Data is ready! Let's define meta learner and train the model

We train three models (logistic regression, XGBoost, and random forest) as meta learners and compare their results to find the best one.
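The comparison can also be written as a single loop over candidate estimators. The sketch below runs on synthetic data and, to stay dependent on scikit-learn only, includes just two of the three meta learners; `make_split`, `candidates`, and `auc_by_model` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

def make_split(n):
    """Return synthetic labels and two noisy base-model probability columns."""
    y = rng.integers(0, 2, size=n)
    X = np.column_stack([
        np.clip(0.3 * y + rng.normal(0.3, 0.15, size=n), 0, 1),
        np.clip(0.2 * y + rng.normal(0.3, 0.15, size=n), 0, 1),
    ])
    return X, y

X_tr, y_tr = make_split(400)
X_te, y_te = make_split(100)

candidates = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(max_depth=2, random_state=0),
}

# Fit each candidate and record its AUC on the held-out split
auc_by_model = {}
for name, est in candidates.items():
    est.fit(X_tr, y_tr)
    auc_by_model[name] = roc_auc_score(y_te, est.predict_proba(X_te)[:, 1])

best_name = max(auc_by_model, key=auc_by_model.get)
print(auc_by_model, best_name)
```

An XGBoost classifier could be added to `candidates` in the same way. Picking the best model by test AUC can be optimistic, so a separate validation split may be a safer basis for the choice.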

## 1. Logistic Regression

In [None]:
"""1. Logistic Regression"""
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model = LogisticRegression()
model.fit(MRI_train_y, real_train_y)
print("training data ACC:",model.score(MRI_train_y, real_train_y))

y_pred = model.predict(MRI_test_y)
y_pred_prob = model.predict_proba(MRI_test_y)[:,1]

print("testing data ACC:",model.score(MRI_test_y, real_test_y))
print("testing data AUC:",roc_auc_score(real_test_y, y_pred_prob))

training data ACC: 0.8865698729582577
testing data ACC: 0.5141955835962145
testing data AUC: 0.7371626462861238
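Accuracy and AUC can diverge because `score` and `predict` apply a fixed 0.5 threshold to the predicted probability, while AUC depends only on how the probabilities rank the subjects. The toy example below uses made-up numbers to show the effect; whether this explains the gap seen in this notebook is an assumption that has not been checked against the data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and probabilities; every probability is below 0.5
y_true = np.array([0, 0, 0, 1, 1, 1])
prob = np.array([0.03, 0.05, 0.06, 0.08, 0.11, 0.12])

# Default threshold of 0.5: every subject is predicted as class 0
pred_default = (prob >= 0.5).astype(int)
acc_default = accuracy_score(y_true, pred_default)

# AUC only uses the ranking, which separates the classes perfectly here
auc = roc_auc_score(y_true, prob)

# A lower, hypothetical threshold recovers the positive class
pred_low = (prob >= 0.07).astype(int)
acc_low = accuracy_score(y_true, pred_low)

print(acc_default, auc, acc_low)  # 0.5 1.0 1.0
```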


## 2. XGBoost

In [None]:
"""2. XGBoost"""
import xgboost as xgb
from sklearn.metrics import roc_auc_score, accuracy_score

xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42, eval_metric="auc")
xgb_model.fit(MRI_train_y, real_train_y)
print("training data ACC:",xgb_model.score(MRI_train_y, real_train_y))

y_pred_xg = xgb_model.predict(MRI_test_y)
print("testing data ACC:",accuracy_score(real_test_y, y_pred_xg))

preds_prob = xgb_model.predict_proba(MRI_test_y)
print("testing data AUC:",roc_auc_score(y_score=preds_prob[:,1], y_true=real_test_y))

training data ACC: 0.8967332123411978
testing data ACC: 0.5047318611987381
testing data AUC: 0.7323461507841733


## 3. Random Forest

In [None]:
"""3. Random Forest"""
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score  # needed if this cell is run on its own

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(MRI_train_y, real_train_y)

y_preda = clf.predict(MRI_test_y)
print(accuracy_score(real_test_y, y_preda))
y_pred_prob = clf.predict_proba(MRI_test_y)[:,1]
test_auc = roc_auc_score(y_score=y_pred_prob, y_true=real_test_y)
print(test_auc)

0.501577287066246
0.7390932250616988


# Saving the probability result from meta learner

To inspect the probability distribution, we save the meta learner's predictions on the test set.

In [None]:
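"""Save the predicted probabilities of the meta learner to a CSV file"""
import pandas as pd
df = pd.DataFrame()
df['subjectkey'] = real_test_new_subjkey
# Note: y_pred_prob holds the Random Forest probabilities from the previous cell
df['Y_'] = real_test_y          # ground-truth class value
df['preds_prob'] = y_pred_prob  # predicted probability of the positive class
df.to_csv('sMRI_dMRI_test_prob.csv', index=True)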