# Prediction Assignment BIOS 624

## Q1 - Develop a predictive model for mortality

Solution: deep learning for tabular data

### Notes & explanations

Some of the columns are continuous (like age) and we will treat them as float numbers we can feed our model directly. Others are categorical and we will convert them to a unique index that we will feed to embedding layers.

The last part is the list of pre-processors we apply to our data:

- Categorify is going to take every categorical variable and make a map from integer to unique categories, then replace the values by the corresponding index.
- FillMissing will fill the missing values in the continuous variables by the median of existing values (you can choose a specific value if you prefer)
- Normalize will normalize the continuous variables (substract the mean and divide by the std)

### Set-up

In [1]:
# loading necessary libraries
from fastai.tabular.all import *
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
import scipy.stats

In [2]:
# reading in data
df = pd.read_table('mri with variable names.dat', delimiter=" ")

In [3]:
# exploring data
df.head()

Unnamed: 0,ptid,age,male,weight,height,packyrs,yrsquit,alcoh,physact,chf,...,sbp,aai,fev,dsst,vent,whgrd,numinf,volinf,obstime,death
1,1,72,1,173,169,54.0,0,0.0,9.8,0,...,139,1.03,1.28,25,2,2,1,7.46,2126,0
2,2,82,0,139,170,0.0,0,0.25,0.7,0,...,146,1.11,2.55,51,4,2,3,0.14,1841,0
3,3,89,1,145,170,0.0,0,1.25,1.6,0,...,134,1.01,2.38,27,4,1,2,0.18,1875,0
4,4,72,1,190,181,33.0,17,9.5,3.5,0,...,147,0.98,2.69,43,3,2,1,0.04,1897,0
5,5,70,0,153,158,0.0,0,0.25,0.7,0,...,117,0.94,2.03,48,2,1,0,0.0,2107,0


In [4]:
# deleting columns irrelevant for question to be answered
del df['ptid']
del df['obstime']

In [5]:
# checking column names and types
df.dtypes

age          int64
male         int64
weight       int64
height       int64
packyrs     object
yrsquit      int64
alcoh      float64
physact    float64
chf          int64
chd          int64
stroke       int64
diabete      int64
genhlth      int64
ldl         object
alb         object
crt         object
plt         object
sbp          int64
aai         object
fev         object
dsst        object
vent         int64
whgrd        int64
numinf      object
volinf      object
death        int64
dtype: object

In [6]:
# imputing missing values using median
df = df.replace(r'^[.]$', np.nan, regex=True)
df = df.fillna(df.median())

In [7]:
# making sure all categorical columns are set to the correct type
cat_cols = ['male', 'chf', 'chd', 'stroke', 'diabete', 'genhlth', 'vent', 'whgrd', 'numinf']
df[cat_cols] = df[cat_cols].astype('int')
df[cat_cols] = df[cat_cols].astype('object')

In [8]:
# making sure all continuous columns are set to the correct type
num_cols = ['age', 'weight', 'height', 'packyrs', 'yrsquit', 'alcoh', 'physact', 'ldl', 'alb', 'crt', 'plt', 'sbp', 'aai', 'fev', 'dsst', 'volinf']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='raise')

In [9]:
# instantiating metrics
Recall = Recall()
Precision = Precision()
RocAucBinary = RocAucBinary()

In [10]:
# creating function to calculate single sample CI
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

### Single model build

In [None]:
# building tabular object for base model
base_object = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                            cat_names = ['male', 'chf', 'chd', 'stroke', 'diabete', 'vent', 'whgrd', 'numinf'],
                            cont_names = ['age', 'weight', 'height', 'packyrs', 'yrsquit', 'alcoh', 'physact', 'ldl', 'alb', 'crt', 'plt', 'sbp', 'aai', 'fev', 'dsst', 'volinf'],
                            y_names='death',
                            y_block=CategoryBlock,
                            splits=RandomSplitter(valid_pct=0.3, seed=5)(range_of(df)))

In [None]:
# building data_loader
base_dls = base_object.dataloaders(bs=64)

In [None]:
# view processed data
base_dls.show_batch()

In [None]:
# build learner
base_learn = tabular_learner(base_dls, metrics=[accuracy, error_rate, Recall, Precision, RocAucBinary])

In [None]:
# training learner for 10 epochs
base_learn.fit_one_cycle(10)

In [None]:
# showing snapshot of predictions
base_learn.show_results()

In [None]:
# plotting confusion matrix
interp = ClassificationInterpretation.from_learner(base_learn)
interp.plot_confusion_matrix()

### Stratified K-fold CV

We declare our cat and cont vars, our procs & metrics. Along wtih this, to stay in v2 style our lists will be of type L.

From here, we will use the StratifiedKFold to generate 10 shuffled splits, and split them with the .split method. From here, we can go into each of those splits and they will contain our indexs. Convert them to L's and we can directly pass them into our TabularPandas. From here, we create our DataBunch, Learner, train it and print out the validation set statistics.

In [None]:
# creating lists for metrics
val_acc = L()
val_sen = L()
val_spe = L()
val_ppv = L()
val_npv = L()
val_auc = L()

# shortening for-loop
cat_names = ['male', 'chf', 'chd', 'stroke', 'diabete', 'vent', 'whgrd', 'numinf']
cont_names = ['age', 'weight', 'height', 'packyrs', 'yrsquit', 'alcoh', 'physact', 'ldl', 'alb', 'crt', 'plt', 'sbp', 'aai', 'fev', 'dsst', 'volinf']
procs = [Categorify, FillMissing, Normalize]
metrics=[accuracy, Recall, Precision, RocAucBinary]

# stratified k-fold CV
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=4)
res = skf.split(df.index, df["death"])
for x, y in res:
  ix = (L(list(x)), L(list(y)))
  to = TabularPandas(df, procs, cat_names, cont_names, y_names="death", y_block=CategoryBlock, splits=ix)
  data = to.dataloaders()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=metrics)
  learn.fit(10)
  val_acc.append(learn.validate()[1]) # accuracy
  val_sen.append(learn.validate()[2]) # sensitivity
  val_ppv.append(learn.validate()[3]) # ppv
  val_auc.append(learn.validate()[4]) # auc
  interp = ClassificationInterpretation.from_learner(learn)
  upp, low = interp.confusion_matrix()
  tn, fp = upp[0], upp[1]
  fn, tp = low[0], low[1]
  specificity = tn/(fp + tn)
  npv = tn/(tn+fn)
  val_npv.append(npv) # npv
  val_spe.append(specificity) # specificity

In [None]:
print(f'\nAccuracy:\nmean/conf: {mean_confidence_interval(val_acc)}\nstd: {np.std(val_acc)}')
print(f'\nSensitivity:\nmean/conf: {mean_confidence_interval(val_sen)}\nstd: {np.std(val_sen)}')
print(f'\nSpecificity:\nmean/conf: {mean_confidence_interval(val_spe)}\nstd: {np.std(val_spe)}')
print(f'\nPPV:\nmean/conf: {mean_confidence_interval(val_ppv)}\nstd: {np.std(val_ppv)}')
print(f'\nNPV:\nmean/conf: {mean_confidence_interval(val_npv)}\nstd: {np.std(val_npv)}')
print(f'\nAUC:\nmean/conf: {mean_confidence_interval(val_auc)}\nstd: {np.std(val_auc)}')

### K-fold cross-validation with SMOTE

In [17]:
# creating lists for metrics
val_acc = L()
val_sen = L()
val_spe = L()
val_ppv = L()
val_npv = L()
val_auc = L()

# shortening for-loop
cat_names = ['male', 'chf', 'chd', 'stroke', 'diabete', 'vent', 'whgrd', 'numinf']
cont_names = ['age', 'weight', 'height', 'packyrs', 'yrsquit', 'alcoh', 'physact', 'ldl', 'alb', 'crt', 'plt', 'sbp', 'aai', 'fev', 'dsst', 'volinf']
procs = [Categorify, FillMissing, Normalize]
metrics=[accuracy, Recall, Precision, RocAucBinary]

# k-fold
skf = KFold(n_splits=10, shuffle=True, random_state=4)
res = skf.split(df.index)
for x, y in res:
  # creating SMOTEr
  sm = SMOTE(random_state=4)
    
  # splitting into train & valid sets
  X_train, y_train = df.iloc[x,:-1], df.iloc[x,-1:]
  df_val = df.iloc[y,:]

  # oversampling training set
  X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)

  # converting training set back to df
  X_train = pd.DataFrame(X_train_oversampled, columns=X_train.columns)
  y_train = pd.DataFrame(y_train_oversampled, columns=y_train.columns)
  df_train = pd.concat([X_train, y_train], axis=1)
  
  # concatenating train & val sets
  df = pd.concat([df_train, df_val], axis=0, sort=False, ignore_index=True)
  train_idx = range(0,len(df_train))
  valid_idx = range(len(df_train)+1, len(df))
  splits = (L(list(train_idx)), L(list(valid_idx)))
    
  # building model 
  to = TabularPandas(df, procs, cat_names, cont_names, y_names="death", y_block=CategoryBlock, splits=splits)
  data = to.dataloaders()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=metrics)
  learn.fit(8)
  
  # extracting model metrics
  val_acc.append(learn.validate()[1]) # accuracy
  val_sen.append(learn.validate()[2]) # sensitivity
  val_ppv.append(learn.validate()[3]) # ppv
  val_auc.append(learn.validate()[4]) # auc
  interp = ClassificationInterpretation.from_learner(learn)
  upp, low = interp.confusion_matrix()
  tn, fp = upp[0], upp[1]
  fn, tp = low[0], low[1]
  specificity = tn/(fp + tn)
  npv = tn/(tn+fn)
  val_npv.append(npv) # npv
  val_spe.append(specificity) # specificity

epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.447438,0.650844,0.72,0.526316,0.869565,0.862731,00:00
1,0.328826,0.61676,0.706667,0.447368,0.944444,0.834282,00:00
2,0.26051,0.584457,0.706667,0.473684,0.9,0.843528,00:00
3,0.21539,0.55898,0.733333,0.5,0.95,0.857752,00:00
4,0.181432,0.523126,0.746667,0.526316,0.952381,0.876956,00:00
5,0.154699,0.465508,0.786667,0.605263,0.958333,0.897582,00:00
6,0.134332,0.388927,0.88,0.789474,0.967742,0.927454,00:00
7,0.120775,0.464075,0.813333,0.710526,0.9,0.897582,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.473971,0.62563,0.746667,0.589744,0.884615,0.88604,00:00
1,0.343965,0.545063,0.773333,0.692308,0.84375,0.882479,00:00
2,0.269965,0.4755,0.8,0.74359,0.852941,0.891738,00:00
3,0.223187,0.414865,0.813333,0.794872,0.837838,0.915954,00:00
4,0.191441,0.364092,0.84,0.794872,0.885714,0.935897,00:00
5,0.164473,0.30381,0.866667,0.846154,0.891892,0.954416,00:00
6,0.145224,0.2593,0.906667,0.846154,0.970588,0.967236,00:00
7,0.129808,0.234615,0.906667,0.871795,0.944444,0.970798,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.467455,0.675361,0.533333,0.146341,1.0,0.817791,00:00
1,0.344837,0.66396,0.546667,0.195122,0.888889,0.847202,00:00
2,0.274392,0.690429,0.573333,0.243902,0.909091,0.850072,00:00
3,0.224507,0.691239,0.613333,0.317073,0.928571,0.874462,00:00
4,0.190223,0.690229,0.626667,0.341463,0.933333,0.912482,00:00
5,0.166144,0.588538,0.68,0.439024,0.947368,0.912482,00:00
6,0.145531,0.456696,0.786667,0.634146,0.962963,0.946915,00:00
7,0.127656,0.358189,0.8,0.707317,0.90625,0.934003,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.457452,0.661628,0.573333,0.219512,1.0,0.85868,00:00
1,0.33076,0.643658,0.56,0.195122,1.0,0.870875,00:00
2,0.265328,0.602675,0.68,0.463415,0.904762,0.862984,00:00
3,0.218738,0.591072,0.64,0.390244,0.888889,0.885222,00:00
4,0.184524,0.52721,0.746667,0.585366,0.923077,0.903156,00:00
5,0.156196,0.462537,0.813333,0.707317,0.935484,0.913199,00:00
6,0.135479,0.388837,0.866667,0.829268,0.918919,0.923242,00:00
7,0.117834,0.354658,0.88,0.829268,0.944444,0.940459,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.488132,0.633477,0.567568,0.03125,0.5,0.834077,00:00
1,0.360593,0.58626,0.581081,0.0625,0.666667,0.851935,00:00
2,0.284765,0.535591,0.77027,0.5625,0.857143,0.857143,00:00
3,0.231581,0.498516,0.783784,0.59375,0.863636,0.870536,00:00
4,0.195034,0.463052,0.810811,0.65625,0.875,0.890625,00:00
5,0.16665,0.397305,0.837838,0.71875,0.884615,0.922619,00:00
6,0.149426,0.351701,0.851351,0.6875,0.956522,0.934524,00:00
7,0.135327,0.26394,0.878378,0.75,0.96,0.958333,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.471309,0.64677,0.72973,0.387097,0.923077,0.806452,00:00
1,0.349531,0.593045,0.675676,0.290323,0.818182,0.820705,00:00
2,0.273122,0.552989,0.743243,0.451613,0.875,0.83871,00:00
3,0.225676,0.524844,0.743243,0.451613,0.875,0.864966,00:00
4,0.191287,0.477074,0.783784,0.548387,0.894737,0.877719,00:00
5,0.16627,0.445493,0.797297,0.580645,0.9,0.895724,00:00
6,0.143589,0.378077,0.797297,0.645161,0.833333,0.909977,00:00
7,0.126843,0.387394,0.810811,0.709677,0.814815,0.909977,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.478608,0.679157,0.513514,0.027027,1.0,0.768444,00:00
1,0.352949,0.684007,0.513514,0.027027,1.0,0.799854,00:00
2,0.277816,0.728965,0.513514,0.027027,1.0,0.805698,00:00
3,0.228228,0.794272,0.540541,0.081081,1.0,0.850986,00:00
4,0.196718,0.756942,0.608108,0.216216,1.0,0.869248,00:00
5,0.171896,0.631713,0.675676,0.351351,1.0,0.924032,00:00
6,0.15006,0.436357,0.783784,0.594595,0.956522,0.953251,00:00
7,0.132707,0.273392,0.878378,0.837838,0.911765,0.962747,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.49474,0.675592,0.540541,0.276596,1.0,0.86052,00:00
1,0.35965,0.656656,0.581081,0.340426,1.0,0.871552,00:00
2,0.283618,0.623986,0.635135,0.425532,1.0,0.870764,00:00
3,0.235495,0.564231,0.689189,0.553191,0.928571,0.881797,00:00
4,0.20108,0.504062,0.77027,0.680851,0.941176,0.91253,00:00
5,0.177648,0.449014,0.797297,0.702128,0.970588,0.929078,00:00
6,0.156911,0.372321,0.837838,0.765957,0.972973,0.942474,00:00
7,0.136998,0.356892,0.891892,0.851064,0.97561,0.944838,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.465536,0.666048,0.486486,0.0,0.0,0.906433,00:00
1,0.348209,0.641444,0.527027,0.078947,1.0,0.902047,00:00
2,0.272595,0.610102,0.635135,0.315789,0.923077,0.895468,00:00
3,0.225235,0.572149,0.689189,0.394737,1.0,0.916667,00:00
4,0.190532,0.527736,0.72973,0.473684,1.0,0.923977,00:00
5,0.165011,0.482141,0.72973,0.5,0.95,0.937135,00:00
6,0.146191,0.367411,0.837838,0.710526,0.964286,0.945906,00:00
7,0.128765,0.303802,0.878378,0.789474,0.967742,0.954678,00:00


  _warn_prf(average, modifier, msg_start, len(result))


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.486872,0.645222,0.662162,0.142857,0.8,0.75,00:00
1,0.356278,0.603817,0.662162,0.142857,0.8,0.770963,00:00
2,0.279125,0.571558,0.662162,0.285714,0.615385,0.778727,00:00
3,0.229151,0.540948,0.662162,0.285714,0.615385,0.81677,00:00
4,0.193964,0.496642,0.675676,0.285714,0.666667,0.850932,00:00
5,0.168067,0.463329,0.783784,0.535714,0.833333,0.871118,00:00
6,0.146207,0.383754,0.783784,0.464286,0.928571,0.916925,00:00
7,0.125699,0.308927,0.878378,0.785714,0.88,0.93323,00:00


In [18]:
print(f'\nAccuracy:\nmean/conf: {mean_confidence_interval(val_acc)}\nstd: {np.std(val_acc)}')
print(f'\nSensitivity:\nmean/conf: {mean_confidence_interval(val_sen)}\nstd: {np.std(val_sen)}')
print(f'\nSpecificity:\nmean/conf: {mean_confidence_interval(val_spe)}\nstd: {np.std(val_spe)}')
print(f'\nPPV:\nmean/conf: {mean_confidence_interval(val_ppv)}\nstd: {np.std(val_ppv)}')
print(f'\nNPV:\nmean/conf: {mean_confidence_interval(val_npv)}\nstd: {np.std(val_npv)}')
print(f'\nAUC:\nmean/conf: {mean_confidence_interval(val_auc)}\nstd: {np.std(val_auc)}')


Accuracy:
mean/conf: (0.8616216301918029, 0.8343118232577359, 0.88893143712587)
std: 0.03621738672786539

Sensitivity:
mean/conf: (0.7842673610342727, 0.7395654001227702, 0.8289693219457753)
std: 0.0592823013994374

Specificity:
mean/conf: (0.9365102659056742, 0.9156470041072541, 0.9573735277040943)
std: 0.02766818611286141

PPV:
mean/conf: (0.9205070101167487, 0.8855552102679969, 0.9554588099655006)
std: 0.04635195170047576

NPV:
mean/conf: (0.8143422109255853, 0.7789553115588727, 0.849729110292298)
std: 0.04692896667334382

AUC:
mean/conf: (0.9406645482765745, 0.9241111986419083, 0.9572178979112408)
std: 0.021952519357155184


## Q2 - Is self-rated health predictive of mortality beyond the predictive capabilities of the available data?

Solution: build a model that incorporates the `genhlth` variable and compare metrics with the base model

In [19]:
# creating lists for metrics
val_acc = L()
val_sen = L()
val_spe = L()
val_ppv = L()
val_npv = L()
val_auc = L()

# shortening for-loop
cat_names = ['male', 'chf', 'chd', 'stroke', 'diabete', 'vent', 'whgrd', 'numinf', 'genhlth']
cont_names = ['age', 'weight', 'height', 'packyrs', 'yrsquit', 'alcoh', 'physact', 'ldl', 'alb', 'crt', 'plt', 'sbp', 'aai', 'fev', 'dsst', 'volinf']
procs = [Categorify, FillMissing, Normalize]
metrics=[accuracy, Recall, Precision, RocAucBinary]

# k-fold
skf = KFold(n_splits=10, shuffle=True, random_state=4)
res = skf.split(df.index)
for x, y in res:
  # creating SMOTEr
  sm = SMOTE(random_state=4)
    
  # splitting into train & valid sets
  X_train, y_train = df.iloc[x,:-1], df.iloc[x,-1:]
  df_val = df.iloc[y,:]

  # oversampling training set
  X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)

  # converting training set back to df
  X_train = pd.DataFrame(X_train_oversampled, columns=X_train.columns)
  y_train = pd.DataFrame(y_train_oversampled, columns=y_train.columns)
  df_train = pd.concat([X_train, y_train], axis=1)
  
  # concatenating train & val sets
  df = pd.concat([df_train, df_val], axis=0, sort=False, ignore_index=True)
  train_idx = range(0,len(df_train))
  valid_idx = range(len(df_train)+1, len(df))
  splits = (L(list(train_idx)), L(list(valid_idx)))
    
  # building model 
  to = TabularPandas(df, procs, cat_names, cont_names, y_names="death", y_block=CategoryBlock, splits=splits)
  data = to.dataloaders()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=metrics)
  learn.fit(8)
  
  # extracting model metrics
  val_acc.append(learn.validate()[1]) # accuracy
  val_sen.append(learn.validate()[2]) # sensitivity
  val_ppv.append(learn.validate()[3]) # ppv
  val_auc.append(learn.validate()[4]) # auc
  interp = ClassificationInterpretation.from_learner(learn)
  upp, low = interp.confusion_matrix()
  tn, fp = upp[0], upp[1]
  fn, tp = low[0], low[1]
  specificity = tn/(fp + tn)
  npv = tn/(tn+fn)
  val_npv.append(npv) # npv
  val_spe.append(specificity) # specificity

epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.483857,0.641573,0.710526,0.34375,0.916667,0.795455,00:00
1,0.340107,0.593925,0.684211,0.375,0.75,0.785511,00:00
2,0.259793,0.557761,0.684211,0.40625,0.722222,0.794744,00:00
3,0.214813,0.538672,0.697368,0.40625,0.764706,0.81392,00:00
4,0.178411,0.491366,0.75,0.5625,0.782609,0.853693,00:00
5,0.147236,0.432992,0.789474,0.625,0.833333,0.889915,00:00
6,0.129006,0.350824,0.842105,0.71875,0.884615,0.928977,00:00
7,0.113757,0.287235,0.881579,0.78125,0.925926,0.944602,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.472684,0.6555,0.565789,0.057143,1.0,0.894077,00:00
1,0.355762,0.619505,0.605263,0.142857,1.0,0.883624,00:00
2,0.277918,0.592921,0.618421,0.228571,0.8,0.889895,00:00
3,0.227987,0.603405,0.631579,0.228571,0.888889,0.897561,00:00
4,0.189412,0.563075,0.671053,0.342857,0.857143,0.905923,00:00
5,0.160192,0.467584,0.763158,0.542857,0.904762,0.929617,00:00
6,0.136178,0.336833,0.855263,0.742857,0.928571,0.943554,00:00
7,0.11869,0.3177,0.894737,0.942857,0.846154,0.92892,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.503609,0.64537,0.631579,0.27027,0.909091,0.82675,00:00
1,0.365643,0.599305,0.671053,0.378378,0.875,0.838531,00:00
2,0.27994,0.557731,0.710526,0.459459,0.894737,0.862786,00:00
3,0.225768,0.514412,0.710526,0.459459,0.894737,0.887041,00:00
4,0.187809,0.453857,0.776316,0.567568,0.954545,0.925156,00:00
5,0.158134,0.349627,0.855263,0.72973,0.964286,0.962578,00:00
6,0.133873,0.298479,0.894737,0.837838,0.939394,0.957034,00:00
7,0.114718,0.312968,0.868421,0.864865,0.864865,0.945946,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.483022,0.662225,0.644737,0.324324,0.857143,0.801802,00:00
1,0.344319,0.611027,0.671053,0.378378,0.875,0.835066,00:00
2,0.265369,0.551421,0.710526,0.432432,0.941176,0.863479,00:00
3,0.212169,0.512065,0.763158,0.540541,0.952381,0.883576,00:00
4,0.177009,0.455876,0.763158,0.567568,0.913043,0.909217,00:00
5,0.151529,0.431495,0.815789,0.702703,0.896552,0.91684,00:00
6,0.129349,0.412184,0.828947,0.756757,0.875,0.919612,00:00
7,0.110746,0.4358,0.842105,0.783784,0.878788,0.906445,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.473282,0.65974,0.657895,0.414634,0.894737,0.81115,00:00
1,0.34282,0.630837,0.671053,0.439024,0.9,0.806272,00:00
2,0.261674,0.601404,0.697368,0.487805,0.909091,0.81115,00:00
3,0.20814,0.574517,0.736842,0.585366,0.888889,0.843902,00:00
4,0.173425,0.553304,0.75,0.585366,0.923077,0.864111,00:00
5,0.146233,0.512143,0.802632,0.682927,0.933333,0.885714,00:00
6,0.125925,0.482142,0.855263,0.780488,0.941176,0.891289,00:00
7,0.107868,0.482173,0.842105,0.756098,0.939394,0.902439,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.470561,0.662757,0.592105,0.225,1.0,0.852778,00:00
1,0.341519,0.640754,0.618421,0.275,1.0,0.854167,00:00
2,0.262361,0.629142,0.644737,0.325,1.0,0.853472,00:00
3,0.20893,0.625676,0.618421,0.325,0.866667,0.855556,00:00
4,0.174448,0.630433,0.657895,0.4,0.888889,0.870139,00:00
5,0.14763,0.533385,0.697368,0.475,0.904762,0.897222,00:00
6,0.129917,0.479982,0.75,0.575,0.92,0.899306,00:00
7,0.114443,0.566154,0.736842,0.65,0.8125,0.866667,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.48055,0.660844,0.565789,0.083333,1.0,0.834028,00:00
1,0.356907,0.641298,0.552632,0.055556,1.0,0.841667,00:00
2,0.279016,0.624345,0.605263,0.166667,1.0,0.849306,00:00
3,0.225475,0.631225,0.605263,0.166667,1.0,0.879167,00:00
4,0.191258,0.601345,0.697368,0.361111,1.0,0.904167,00:00
5,0.163417,0.547818,0.710526,0.388889,1.0,0.949306,00:00
6,0.141344,0.420436,0.802632,0.583333,1.0,0.9625,00:00
7,0.121117,0.327109,0.815789,0.638889,0.958333,0.98125,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.504243,0.67178,0.513158,0.097561,1.0,0.836237,00:00
1,0.367382,0.612395,0.684211,0.439024,0.947368,0.855749,00:00
2,0.27938,0.594407,0.684211,0.439024,0.947368,0.862021,00:00
3,0.225564,0.562378,0.710526,0.536585,0.88,0.858537,00:00
4,0.187198,0.518708,0.736842,0.560976,0.92,0.890592,00:00
5,0.158531,0.47281,0.776316,0.609756,0.961538,0.917073,00:00
6,0.137789,0.353135,0.815789,0.756098,0.885714,0.935889,00:00
7,0.118067,0.335912,0.855263,0.829268,0.894737,0.934495,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.476569,0.681789,0.486842,0.093023,1.0,0.871036,00:00
1,0.349162,0.64487,0.552632,0.209302,1.0,0.865398,00:00
2,0.263387,0.6267,0.631579,0.348837,1.0,0.875264,00:00
3,0.211905,0.631826,0.671053,0.418605,1.0,0.887245,00:00
4,0.17759,0.618386,0.671053,0.418605,1.0,0.906977,00:00
5,0.151751,0.543789,0.75,0.581395,0.961538,0.916843,00:00
6,0.133737,0.466951,0.776316,0.627907,0.964286,0.940099,00:00
7,0.116661,0.433655,0.815789,0.697674,0.967742,0.926709,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.493271,0.655958,0.746667,0.605263,0.851852,0.807966,00:00
1,0.349441,0.619579,0.72,0.473684,0.947368,0.808677,00:00
2,0.267575,0.608234,0.706667,0.447368,0.944444,0.830014,00:00
3,0.213,0.573499,0.72,0.473684,0.947368,0.85064,00:00
4,0.175168,0.498259,0.773333,0.605263,0.92,0.879801,00:00
5,0.146233,0.461411,0.826667,0.657895,1.0,0.911095,00:00
6,0.129648,0.377428,0.853333,0.736842,0.965517,0.93101,00:00
7,0.1089,0.348813,0.866667,0.789474,0.9375,0.943812,00:00


In [20]:
print(f'\nAccuracy:\nmean/conf: {mean_confidence_interval(val_acc)}\nstd: {np.std(val_acc)}')
print(f'\nSensitivity:\nmean/conf: {mean_confidence_interval(val_sen)}\nstd: {np.std(val_sen)}')
print(f'\nSpecificity:\nmean/conf: {mean_confidence_interval(val_spe)}\nstd: {np.std(val_spe)}')
print(f'\nPPV:\nmean/conf: {mean_confidence_interval(val_ppv)}\nstd: {np.std(val_ppv)}')
print(f'\nNPV:\nmean/conf: {mean_confidence_interval(val_npv)}\nstd: {np.std(val_npv)}')
print(f'\nAUC:\nmean/conf: {mean_confidence_interval(val_auc)}\nstd: {np.std(val_auc)}')


Accuracy:
mean/conf: (0.8419298231601715, 0.8097058936658017, 0.8741537526545413)
std: 0.042734337859167554

Sensitivity:
mean/conf: (0.7734158636868396, 0.7059776259860925, 0.8408541013875866)
std: 0.08943441969217657

Specificity:
mean/conf: (0.9129982437909268, 0.8765889457857379, 0.9494075417961156)
std: 0.048284838831983874

PPV:
mean/conf: (0.9025938566048923, 0.8657898088508564, 0.9393979043589282)
std: 0.04880834324009758

NPV:
mean/conf: (0.8028951278706481, 0.7466167839140084, 0.8591734718272878)
std: 0.07463452789696814

AUC:
mean/conf: (0.928128463357206, 0.9060055699637994, 0.9502513567506126)
std: 0.02933866898080705


## Q3 - Comment on how useful some of the harder to obtain measurements (MRI, lab variables) are in predicting mortality relative to the other variables

Solution: build a model that does not have hard-to-obtain measurements and compare it with the base model

In [21]:
# creating lists for metrics
val_acc = L()
val_sen = L()
val_spe = L()
val_ppv = L()
val_npv = L()
val_auc = L()

# shortening for-loop
cat_names = ['male', 'chf', 'chd', 'stroke', 'diabete', 'genhlth']
cont_names = ['age', 'weight', 'height', 'packyrs', 'yrsquit', 'alcoh', 'physact', 'sbp', 'aai', 'fev', 'dsst']
procs = [Categorify, FillMissing, Normalize]
metrics=[accuracy, Recall, Precision, RocAucBinary]

# k-fold
skf = KFold(n_splits=10, shuffle=True, random_state=4)
res = skf.split(df.index)
for x, y in res:
  # creating SMOTEr
  sm = SMOTE(random_state=4)
    
  # splitting into train & valid sets
  X_train, y_train = df.iloc[x,:-1], df.iloc[x,-1:]
  df_val = df.iloc[y,:]

  # oversampling training set
  X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)

  # converting training set back to df
  X_train = pd.DataFrame(X_train_oversampled, columns=X_train.columns)
  y_train = pd.DataFrame(y_train_oversampled, columns=y_train.columns)
  df_train = pd.concat([X_train, y_train], axis=1)
  
  # concatenating train & val sets
  df = pd.concat([df_train, df_val], axis=0, sort=False, ignore_index=True)
  train_idx = range(0,len(df_train))
  valid_idx = range(len(df_train)+1, len(df))
  splits = (L(list(train_idx)), L(list(valid_idx)))
    
  # building model 
  to = TabularPandas(df, procs, cat_names, cont_names, y_names="death", y_block=CategoryBlock, splits=splits)
  data = to.dataloaders()
  learn = tabular_learner(data, layers=[200,100], loss_func=CrossEntropyLossFlat(), metrics=metrics)
  learn.fit(8)
  
  # extracting model metrics
  val_acc.append(learn.validate()[1]) # accuracy
  val_sen.append(learn.validate()[2]) # sensitivity
  val_ppv.append(learn.validate()[3]) # ppv
  val_auc.append(learn.validate()[4]) # auc
  interp = ClassificationInterpretation.from_learner(learn)
  upp, low = interp.confusion_matrix()
  tn, fp = upp[0], upp[1]
  fn, tp = low[0], low[1]
  specificity = tn/(fp + tn)
  npv = tn/(tn+fn)
  val_npv.append(npv) # npv
  val_spe.append(specificity) # specificity

epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.486357,0.674063,0.506494,0.075,0.75,0.756757,00:00
1,0.368575,0.652198,0.571429,0.225,0.818182,0.785135,00:00
2,0.295396,0.62412,0.649351,0.4,0.842105,0.789189,00:00
3,0.243594,0.664427,0.61039,0.325,0.8125,0.810135,00:00
4,0.211072,0.683981,0.61039,0.325,0.8125,0.825676,00:00
5,0.183534,0.66613,0.649351,0.375,0.882353,0.843919,00:00
6,0.167855,0.643484,0.649351,0.35,0.933333,0.881081,00:00
7,0.156072,0.531903,0.727273,0.55,0.88,0.876351,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.495277,0.655982,0.649351,0.324324,0.857143,0.766892,00:00
1,0.365857,0.618204,0.714286,0.459459,0.894737,0.766892,00:00
2,0.283877,0.595161,0.74026,0.540541,0.869565,0.777027,00:00
3,0.232605,0.599217,0.74026,0.513514,0.904762,0.799324,00:00
4,0.19894,0.532605,0.753247,0.513514,0.95,0.862838,00:00
5,0.179606,0.498499,0.74026,0.513514,0.904762,0.879054,00:00
6,0.162586,0.34627,0.857143,0.702703,1.0,0.962162,00:00
7,0.148435,0.321802,0.87013,0.756757,0.965517,0.951351,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.502955,0.688747,0.493506,0.0,0.0,0.679487,00:00
1,0.378247,0.718998,0.506494,0.052632,0.5,0.676788,00:00
2,0.299606,0.797646,0.545455,0.157895,0.666667,0.683536,00:00
3,0.245796,0.877854,0.571429,0.210526,0.727273,0.705128,00:00
4,0.209684,0.84071,0.584416,0.210526,0.8,0.771255,00:00
5,0.183099,0.722019,0.623377,0.315789,0.8,0.792848,00:00
6,0.161525,0.587364,0.649351,0.447368,0.73913,0.848853,00:00
7,0.14545,0.456572,0.805195,0.763158,0.828571,0.889339,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.54876,0.653392,0.675325,0.4,0.777778,0.75102,00:00
1,0.394127,0.607173,0.662338,0.6,0.636364,0.748299,00:00
2,0.303503,0.5647,0.701299,0.685714,0.666667,0.776871,00:00
3,0.24692,0.53302,0.727273,0.742857,0.684211,0.802041,00:00
4,0.210415,0.454811,0.792208,0.828571,0.74359,0.872789,00:00
5,0.186431,0.40117,0.831169,0.8,0.823529,0.903401,00:00
6,0.166498,0.355984,0.844156,0.828571,0.828571,0.92517,00:00
7,0.148379,0.345227,0.883117,0.885714,0.861111,0.927891,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.494911,0.650665,0.597403,0.03125,1.0,0.743056,00:00
1,0.368548,0.63453,0.623377,0.09375,1.0,0.747917,00:00
2,0.289463,0.685911,0.636364,0.125,1.0,0.780556,00:00
3,0.239909,0.709343,0.675325,0.25,0.888889,0.799306,00:00
4,0.207631,0.67747,0.675325,0.25,0.888889,0.847222,00:00
5,0.181747,0.598417,0.701299,0.3125,0.909091,0.89375,00:00
6,0.162672,0.458381,0.766234,0.46875,0.9375,0.926389,00:00
7,0.148263,0.386926,0.779221,0.5625,0.857143,0.915278,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.481399,0.675428,0.480519,0.069767,1.0,0.850205,00:00
1,0.353027,0.705899,0.480519,0.069767,1.0,0.854309,00:00
2,0.275181,0.732626,0.597403,0.27907,1.0,0.870725,00:00
3,0.221982,0.76981,0.597403,0.27907,1.0,0.889877,00:00
4,0.187358,0.729261,0.61039,0.302326,1.0,0.911765,00:00
5,0.161322,0.625447,0.714286,0.488372,1.0,0.940492,00:00
6,0.146541,0.513073,0.779221,0.627907,0.964286,0.928181,00:00
7,0.131822,0.495627,0.792208,0.651163,0.965517,0.928181,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.513015,0.66325,0.61039,0.166667,0.5,0.71773,00:00
1,0.381074,0.621636,0.714286,0.466667,0.7,0.741135,00:00
2,0.291025,0.573808,0.701299,0.433333,0.684211,0.765248,00:00
3,0.232971,0.538857,0.714286,0.5,0.681818,0.79078,00:00
4,0.202395,0.475907,0.766234,0.566667,0.772727,0.852482,00:00
5,0.18012,0.404744,0.818182,0.666667,0.833333,0.907092,00:00
6,0.160474,0.356738,0.844156,0.766667,0.821429,0.919858,00:00
7,0.145731,0.339147,0.844156,0.8,0.8,0.921277,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.500625,0.661053,0.649351,0.257143,0.9,0.755782,00:00
1,0.374494,0.606059,0.727273,0.457143,0.888889,0.784354,00:00
2,0.294156,0.553191,0.753247,0.571429,0.833333,0.806803,00:00
3,0.243689,0.518875,0.74026,0.542857,0.826087,0.82381,00:00
4,0.215467,0.483309,0.766234,0.571429,0.869565,0.865306,00:00
5,0.188862,0.43279,0.779221,0.657143,0.821429,0.893197,00:00
6,0.169889,0.406344,0.805195,0.714286,0.833333,0.897279,00:00
7,0.153295,0.414352,0.818182,0.8,0.8,0.894558,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.502719,0.683642,0.539474,0.347826,0.761905,0.697101,00:00
1,0.364457,0.699911,0.539474,0.326087,0.789474,0.699275,00:00
2,0.281707,0.73203,0.565789,0.391304,0.782609,0.710145,00:00
3,0.225418,0.74096,0.605263,0.456522,0.807692,0.734058,00:00
4,0.191116,0.670623,0.657895,0.521739,0.857143,0.780435,00:00
5,0.166063,0.552743,0.671053,0.543478,0.862069,0.846377,00:00
6,0.148071,0.449967,0.763158,0.73913,0.85,0.886957,00:00
7,0.132282,0.433399,0.776316,0.804348,0.822222,0.900725,00:00


epoch,train_loss,valid_loss,accuracy,recall_score,precision_score,roc_auc_score,time
0,0.502347,0.679205,0.526316,0.266667,0.8,0.751971,00:00
1,0.360312,0.662066,0.578947,0.355556,0.842105,0.768459,00:00
2,0.279023,0.707118,0.552632,0.311111,0.823529,0.774194,00:00
3,0.222144,0.744051,0.552632,0.311111,0.823529,0.812186,00:00
4,0.18644,0.739325,0.552632,0.288889,0.866667,0.848029,00:00
5,0.160772,0.651647,0.618421,0.4,0.9,0.863082,00:00
6,0.142067,0.551246,0.763158,0.644444,0.935484,0.885305,00:00
7,0.127564,0.566104,0.789474,0.733333,0.891892,0.871685,00:00


In [22]:
print(f'\nAccuracy:\nmean/conf: {mean_confidence_interval(val_acc)}\nstd: {np.std(val_acc)}')
print(f'\nSensitivity:\nmean/conf: {mean_confidence_interval(val_sen)}\nstd: {np.std(val_sen)}')
print(f'\nSpecificity:\nmean/conf: {mean_confidence_interval(val_spe)}\nstd: {np.std(val_spe)}')
print(f'\nPPV:\nmean/conf: {mean_confidence_interval(val_ppv)}\nstd: {np.std(val_ppv)}')
print(f'\nNPV:\nmean/conf: {mean_confidence_interval(val_npv)}\nstd: {np.std(val_npv)}')
print(f'\nAUC:\nmean/conf: {mean_confidence_interval(val_auc)}\nstd: {np.std(val_auc)}')


Accuracy:
mean/conf: (0.8085270047187805, 0.7749227236909356, 0.8421312857466253)
std: 0.04456491562300755

Sensitivity:
mean/conf: (0.7306972887325849, 0.6523093040705463, 0.8090852733946234)
std: 0.10395562158960438

Specificity:
mean/conf: (0.8834921548786662, 0.8321036022673226, 0.9348807074900097)
std: 0.06814984403967433

PPV:
mean/conf: (0.8671973993698131, 0.8240176725714968, 0.9103771261681294)
std: 0.05726356352623619

NPV:
mean/conf: (0.7699658334478464, 0.7094295513704043, 0.8305021155252885)
std: 0.08028126835019553

AUC:
mean/conf: (0.9076634587311176, 0.8894836530479785, 0.9258432644142568)
std: 0.024109473005550868
