# HCT Survival Predictions

Goal: Develop models to improve the prediction of transplant survival rates for patients undergoing allogeneic Hematopoietic Cell Transplantation (HCT), an important step in ensuring that every patient has a fair chance at a successful outcome, regardless of their background.

Improving survival predictions for allogeneic HCT patients is a vital healthcare challenge. Current predictive models often fall short in addressing disparities related to socioeconomic status, race, and geography. Addressing these gaps is crucial for enhancing patient care, optimizing resource utilization, and rebuilding trust in the healthcare system.

The goal is to address disparities by bridging diverse data sources, refining algorithms, and reducing biases to ensure equitable outcomes for patients across diverse race groups. Your work will help create a more just and effective healthcare environment, ensuring every patient receives the care they deserve.

## Dataset Description

The dataset consists of 59 variables related to hematopoietic stem cell transplantation (HSCT), encompassing a range of demographic and medical characteristics of both recipients and donors, such as age, sex, ethnicity, disease status, and treatment details. The primary outcome of interest is event-free survival, represented by the variable efs, while the time to event-free survival is captured by the variable efs_time. These two variables together encode the target for a censored time-to-event analysis. The data, which features equal representation across recipient racial categories including White, Asian, African-American, Native American, Pacific Islander, and More than One Race, was synthetically generated using the data generator from synthcity, trained on a large cohort of real CIBMTR data.
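As a minimal illustration of how `efs` and `efs_time` combine into a single censored target, the sketch below uses made-up rows (the values are invented, not taken from `train.csv`). When `efs` is 1 the event was observed at `efs_time`; when `efs` is 0 the patient was still event-free at the end of follow-up, so the true event time is only known to be later than `efs_time`. The helper name `describe_outcome` is hypothetical.

```python
import pandas as pd

# Toy rows for illustration only; the values are invented, not from train.csv.
toy = pd.DataFrame({
    "efs_time": [4.7, 42.4, 19.8],
    "efs": [1.0, 0.0, 0.0],
})

def describe_outcome(row):
    """Turn one (efs, efs_time) pair into a plain-language statement."""
    if row["efs"] == 1:
        return f"event observed at t = {row['efs_time']}"
    return f"censored: event-free for at least t = {row['efs_time']}"

toy["outcome"] = toy.apply(describe_outcome, axis=1)
print(toy)
```

This is why the two columns cannot be modelled separately: a row with `efs` equal to 0 is not a "no event ever" label, only a lower bound on the event time.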


- train.csv - the training set, with target efs (event-free survival)
- test.csv - the test set; your task is to predict the value of efs for this data
- sample_submission.csv - a sample submission file in the correct format with all predictions set to 0.50
- data_dictionary.csv - a list of all features and targets used in the dataset and their descriptions


## Import Packages

In [1]:
import numpy as np
import pandas as pd

## Import Dataset

In [2]:
path_data_dictionary = "C:/Users/julia/Desktop/Yanjun/hct competition/data_dictionary.csv"
path_test = "C:/Users/julia/Desktop/Yanjun/hct competition/test.csv"
path_train = "C:/Users/julia/Desktop/Yanjun/hct competition/train.csv"
data_dictionary = pd.read_csv(path_data_dictionary)
path_submission = "C:/Users/julia/Desktop/Yanjun/hct competition/sample_submission.csv"
test = pd.read_csv(path_test)
train = pd.read_csv(path_train)
submission = pd.read_csv(path_submission)
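The cell above repeats the same absolute folder in every path. A small sketch of one alternative is to build all paths from a single base directory with `pathlib`. The folder name `DATA_DIR` is a hypothetical placeholder and should be pointed at wherever the competition files actually live; the file names are the ones listed in the dataset description.

```python
from pathlib import Path

# Hypothetical base folder; point this at wherever the competition files live.
DATA_DIR = Path("hct competition")

# File names are the ones listed in the dataset description.
FILES = {
    "data_dictionary": "data_dictionary.csv",
    "train": "train.csv",
    "test": "test.csv",
    "submission": "sample_submission.csv",
}

paths = {name: DATA_DIR / file_name for name, file_name in FILES.items()}

for name, path in paths.items():
    print(f"{name}: {path}")  # each path can then be passed to pd.read_csv
```

With this layout only `DATA_DIR` needs to change when the notebook moves to another machine.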

In [3]:
data_dictionary

Unnamed: 0,variable,description,type,values
0,dri_score,Refined disease risk index,Categorical,['Intermediate' 'High' 'N/A - non-malignant in...
1,psych_disturb,Psychiatric disturbance,Categorical,['Yes' 'No' nan 'Not done']
2,cyto_score,Cytogenetic score,Categorical,['Intermediate' 'Favorable' 'Poor' 'TBD' nan '...
3,diabetes,Diabetes,Categorical,['No' 'Yes' nan 'Not done']
4,hla_match_c_high,Recipient / 1st donor allele level (high resol...,Numerical,
5,hla_high_res_8,Recipient / 1st donor allele-level (high resol...,Numerical,
6,tbi_status,TBI,Categorical,"['No TBI' 'TBI + Cy +- Other' 'TBI +- Other, <..."
7,arrhythmia,Arrhythmia,Categorical,['No' nan 'Yes' 'Not done']
8,hla_low_res_6,Recipient / 1st donor antigen-level (low resol...,Numerical,
9,graft_type,Graft type,Categorical,['Peripheral blood' 'Bone marrow']


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28800 entries, 0 to 28799
Data columns (total 60 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ID                      28800 non-null  int64  
 1   dri_score               28646 non-null  object 
 2   psych_disturb           26738 non-null  object 
 3   cyto_score              20732 non-null  object 
 4   diabetes                26681 non-null  object 
 5   hla_match_c_high        24180 non-null  float64
 6   hla_high_res_8          22971 non-null  float64
 7   tbi_status              28800 non-null  object 
 8   arrhythmia              26598 non-null  object 
 9   hla_low_res_6           25530 non-null  float64
 10  graft_type              28800 non-null  object 
 11  vent_hist               28541 non-null  object 
 12  renal_issue             26885 non-null  object 
 13  pulm_severe             26665 non-null  object 
 14  prim_disease_hct        28800 non-null

In [5]:
# remove the ID column from train
train1 = train.copy()
train1 = train1.drop(columns = 'ID')

In [6]:
train1.efs.value_counts(normalize=True)

# efs = 1 means that the patient had an event after the transplant
# efs = 0 means that the patient had no event after the transplant (censored)

efs
1.0    0.539306
0.0    0.460694
Name: proportion, dtype: float64

In [7]:
train1.head()

Unnamed: 0,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,graft_type,...,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10,efs,efs_time
0,N/A - non-malignant indication,No,,No,,,No TBI,No,6.0,Bone marrow,...,,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,42.356
1,Intermediate,No,Intermediate,No,2.0,8.0,"TBI +- Other, >cGy",No,6.0,Peripheral blood,...,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,Yes,10.0,1.0,4.672
2,N/A - non-malignant indication,No,,No,2.0,8.0,No TBI,No,6.0,Bone marrow,...,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,19.793
3,High,No,Intermediate,No,2.0,8.0,No TBI,No,6.0,Bone marrow,...,Permissive mismatched,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0,0.0,102.349
4,High,No,,No,2.0,8.0,No TBI,No,6.0,Peripheral blood,...,Permissive mismatched,Related,MEL,8.0,No,2.0,No,10.0,0.0,16.223


In [8]:
test

Unnamed: 0,ID,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,...,karnofsky_score,hepatic_mild,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10
0,28800,N/A - non-malignant indication,No,,No,,,No TBI,No,6.0,...,90.0,No,,Unrelated,"N/A, Mel not given",8.0,No,2.0,No,10.0
1,28801,Intermediate,No,Intermediate,No,2.0,8.0,"TBI +- Other, >cGy",No,6.0,...,90.0,No,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,Yes,10.0
2,28802,N/A - non-malignant indication,No,,No,2.0,8.0,No TBI,No,6.0,...,90.0,No,Permissive mismatched,Related,"N/A, Mel not given",8.0,No,2.0,No,10.0


## Data Analysis

### Check the proportion of missing data

In [9]:
missing = np.round(train1.isna().sum()/len(train), 3) * 100
missing

dri_score                  0.5
psych_disturb              7.2
cyto_score                28.0
diabetes                   7.4
hla_match_c_high          16.0
hla_high_res_8            20.2
tbi_status                 0.0
arrhythmia                 7.6
hla_low_res_6             11.4
graft_type                 0.0
vent_hist                  0.9
renal_issue                6.6
pulm_severe                7.4
prim_disease_hct           0.0
hla_high_res_6            18.3
cmv_status                 2.2
hla_high_res_10           24.9
hla_match_dqb1_high       18.1
tce_imm_match             38.7
hla_nmdp_6                14.6
hla_match_c_low            9.7
rituximab                  7.5
hla_match_drb1_low         9.2
hla_match_dqb1_low        14.6
prod_type                  0.0
cyto_score_detail         41.4
conditioning_intensity    16.6
ethnicity                  2.0
year_hct                   0.0
obesity                    6.1
mrd_hct                   57.6
in_vivo_tcd                0.8
tce_matc

In [10]:
df_missing = pd.DataFrame(missing, columns=['values']).reset_index()

In [11]:
# highlight each variable according to its percentage of missing data:

# map each missing-data percentage to a background color (red: >= 20%, yellow: 5% to 20%, green: < 5%)
def color_map(percent):
  cmap = []
  for x in percent:
    if x >= 20:
      temp = 'background-color: red'
    elif x >= 5:
      temp = 'background-color: yellow'
    else:
      temp = 'background-color: green'
    cmap.append(temp)
  return cmap
# df_missing.style.map(color_map)
df_missing.style.apply(lambda x: color_map(df_missing['values']), subset= ['index','values'], axis = 0)

Unnamed: 0,index,values
0,dri_score,0.5
1,psych_disturb,7.2
2,cyto_score,28.0
3,diabetes,7.4
4,hla_match_c_high,16.0
5,hla_high_res_8,20.2
6,tbi_status,0.0
7,arrhythmia,7.6
8,hla_low_res_6,11.4
9,graft_type,0.0


### 1. Process missing data, 2. use dummies to convert categorical variables to numerical variables, 3. decide which features are important to efs and efs_time

In [12]:
# 1 find variables with less than 5% missing data
indexless5 = np.where(missing < 5)
missingless5 = missing.index[indexless5]
print('variables with less than 5% missing data:', missingless5)

# # 2 find variables with 5% - 20% missing data
# index5to20 = np.where((missing>= 5) & (missing < 20))
# missing5to20 = missing.index[index5to20]
# print('variables with 5% - 20% missing data: ', missing5to20)

# # 3 find variables with more than 20% missing data
# index20 = np.where((missing>= 20))
# missing20 = missing.index[index20]
# print('variables with more than 20% missing data:', missing20)


variables with less than 5% missing data: Index(['dri_score', 'tbi_status', 'graft_type', 'vent_hist',
       'prim_disease_hct', 'cmv_status', 'prod_type', 'ethnicity', 'year_hct',
       'in_vivo_tcd', 'age_at_hct', 'gvhd_proph', 'sex_match', 'race_group',
       'comorbidity_score', 'karnofsky_score', 'donor_related',
       'melphalan_dose', 'efs', 'efs_time'],
      dtype='object')
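The three tiers used in this section (below 5%, 5% to 20%, 20% and above) can also be assigned in one step. The sketch below is one possible way to do it with `pd.cut`; the four percentages are copied from the missing-data output earlier in the notebook, and the tier labels are my own naming.

```python
import numpy as np
import pandas as pd

# Percentages copied from the missing-data output earlier in the notebook.
missing_pct = pd.Series(
    {"dri_score": 0.5, "psych_disturb": 7.2, "cyto_score": 28.0, "hla_high_res_8": 20.2}
)

# right=False makes each bin closed on the left: [0, 5), [5, 20), [20, inf)
tiers = pd.cut(
    missing_pct,
    bins=[0, 5, 20, np.inf],
    right=False,
    labels=["<5%", "5-20%", ">=20%"],
)

# Collect the variable names that fall in each tier.
by_tier = {label: list(tiers.index[tiers == label]) for label in tiers.cat.categories}
print(by_tier)
```

The left-closed bins match the `>= 5` and `>= 20` comparisons used in `color_map` and in the commented-out tier code.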


In [13]:
# replace missing data in variables with less than 5% missing data: mode for categorical columns, mean for numerical columns
missing_less5_object = train1.loc[:, missingless5].select_dtypes(include = 'object').columns
for col in missing_less5_object:
  train1[col] = train1[col].fillna(train1[col].mode()[0])

missing_less5_numerical = train1.loc[:, missingless5].select_dtypes(exclude = 'object').columns
for col in missing_less5_numerical:
  train1[col] = train1[col].fillna(train1[col].mean())

# check whether any missing data remain in the variables that originally had less than 5% missing data
train1[missingless5].isnull().sum()

dri_score            0
tbi_status           0
graft_type           0
vent_hist            0
prim_disease_hct     0
cmv_status           0
prod_type            0
ethnicity            0
year_hct             0
in_vivo_tcd          0
age_at_hct           0
gvhd_proph           0
sex_match            0
race_group           0
comorbidity_score    0
karnofsky_score      0
donor_related        0
melphalan_dose       0
efs                  0
efs_time             0
dtype: int64
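The same mode/mean rule can be wrapped in a small function so that it can be reused on another frame. This is only a sketch on an invented toy frame; `impute_simple` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def impute_simple(df, columns):
    """Fill NaNs with the mean (numeric columns) or the mode (other columns)."""
    out = df.copy()
    for col in columns:
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame for illustration; the values are invented.
toy = pd.DataFrame({
    "graft_type": ["Bone marrow", None, "Bone marrow", "Peripheral blood"],
    "age_at_hct": [10.0, 20.0, None, 30.0],
})
filled = impute_simple(toy, ["graft_type", "age_at_hct"])
print(filled)
```

The function returns a copy, so the original frame keeps its NaNs and can still be inspected afterwards.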

In [14]:
# train12 = train1.copy()

In [15]:
# # for variables which has 5% to 20% of missing data, create a missing indicator
# for col in missing5to20:
#   train12[col +'_missing'] = train1[col].isnull().astype(int)
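The commented-out cell above creates a `_missing` flag per column. A minimal sketch of that idea on an invented toy frame follows; `add_missing_indicators` is a hypothetical helper name. The flags have to be computed before any imputation, because `fillna` removes the information they record.

```python
import pandas as pd

# Toy frame for illustration; the values are invented.
toy = pd.DataFrame({
    "diabetes": ["No", None, "Yes"],
    "hla_low_res_6": [6.0, None, 5.0],
})

def add_missing_indicators(df, columns):
    """Append a 0/1 '<col>_missing' column for each listed column."""
    out = df.copy()
    for col in columns:
        out[col + "_missing"] = out[col].isnull().astype(int)
    return out

flagged = add_missing_indicators(toy, ["diabetes", "hla_low_res_6"])
print(flagged)
```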

In [16]:
# Use Cox Proportional Hazards Model to determine whether the missing data has an impact on survival
from lifelines import CoxPHFitter

In [17]:
# # Penalized Cox models, such as Ridge or Lasso regression, can help mitigate issues caused by collinearity and missing data. In lifelines, you can use the penalizer argument to apply regularization to the Cox model.
# cph = CoxPHFitter(penalizer=0.1)
# # use new data train2 (which is train1 remove missing20)
# train2 = train12.copy()
# train2 = train2.drop(columns = missing20)

# # replace category nan with missing, and replace numerical nan with median 
# object_columns = train2.select_dtypes(include = object).columns
# train2[object_columns]= train2.select_dtypes(include = object).fillna('Missing')

# numerical_columns = train2.select_dtypes(exclude = object).columns
# train2[numerical_columns] = train2[numerical_columns].fillna(train2[numerical_columns].median())

# # because the Cox Proportional Hazards Model only copes with numerical variables, convert categorical variables to numerical ones (get_dummies)
# train2 = pd.get_dummies(train2, drop_first= True, dtype = int)
# # remove the column gvhd_proph_FK+- others(not MMF,MTX) from train2, because every row has the same value in it
# train2 = train2.drop(columns = ['gvhd_proph_FK+- others(not MMF,MTX)'])

# # standardize train2
# from sklearn.preprocessing import StandardScaler
# std = StandardScaler()
# train2 = train2.astype(float)
# train2.iloc[:,:] = std.fit_transform(train2.iloc[:,:])

In [18]:
# # use train2 to fit the model 
# cph.fit(train2, duration_col= 'efs_time', event_col= 'efs')
# cph.print_summary()
# # the result of the model is moderate: the C-index is 0.62, so the data quality needs to improve

###### group variables with missing data less than 5% and more than 5%

In [19]:
# # Use Cox Proportional Hazards Model to determine whether the missing data has an impact on survival
# from lifelines import CoxPHFitter
# # 1 find variables which has percentage of missing data more than 5%
# indexmore5 = np.where(missing>= 5)
# missingmore5 = missing.index[indexmore5]
# print('variables which has percentage of missing data more than 5%: ', missingmore5)

In [20]:
# traintemp = train1.copy()

In [21]:
# for variables which has more than 5% of missing data, create a missing indicator
# for col in missingmore5:
#   traintemp[col +'_missing'] = train1[col].isnull().astype(int)

In [22]:
# # Penalized Cox models, such as Ridge or Lasso regression, can help mitigate issues caused by collinearity and missing data. In lifelines, you can use the penalizer argument to apply regularization to the Cox model.
# cph = CoxPHFitter(penalizer=0.1)
# # use new data train3 (a copy of traintemp; no columns are dropped)
# train3 = traintemp.copy()

# # replace category nan with missing, and replace numerical nan with median 
# object_columns = train3.select_dtypes(include = object).columns
# train3[object_columns]= train3.select_dtypes(include = object).fillna('Missing')

# numerical_columns = train3.select_dtypes(exclude = object).columns
# train3[numerical_columns] = train3[numerical_columns].fillna(train3[numerical_columns].median())

# # because the Cox Proportional Hazards Model only copes with numerical variables, convert categorical variables to numerical ones (get_dummies)
# train3 = pd.get_dummies(train3, drop_first= True, dtype = int)
# # remove the column gvhd_proph_FK+- others(not MMF,MTX) from train3, because every row has the same value in it
# train3 = train3.drop(columns = ['gvhd_proph_FK+- others(not MMF,MTX)'])

# # standardize train3
# from sklearn.preprocessing import StandardScaler
# std = StandardScaler()
# train3 = train3.astype(float)
# train3.iloc[:,:] = std.fit_transform(train3.iloc[:,:])

In [23]:
# # use train3 to fit the model 
# cph.fit(train3, duration_col= 'efs_time', event_col= 'efs')
# cph.print_summary()
# # the result of the model is moderate: the C-index is 0.63, so the data quality needs to improve

### Plot all variables 

In [24]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# # get numerical and object variables
# numerical_variables = train1.select_dtypes(exclude = object).columns
# object_variables = train1.select_dtypes(include = object).columns

# num_numerical_variables = len(numerical_variables)
# num_object = len(object_variables)

# # plot the histogram of all numerical variables
# fig, axes = plt.subplots(6, 4, figsize = (20, 20))
# axes = axes.flatten()
# for i, j in enumerate(numerical_variables):
#   sns.histplot(data = train1, x = j, ax = axes[i])
# plt.show()

In [25]:
# # plot the plot of all numerical variables vs efs
# fig, axes = plt.subplots(6, 4, figsize = (20, 20))
# axes = axes.flatten()
# for i, j in enumerate(numerical_variables[0:-2]):
#   sns.histplot(data = train1, x = j, hue = 'efs', multiple= 'dodge' , ax = axes[i])
# for i in range(-2, 0):
#   axes[i].axis('off')
# plt.show()

# # plot the plot of all numerical variables vs efs_time

# fig, axes = plt.subplots(6, 4, figsize = (20, 20))
# axes = axes.flatten()
# for i, j in enumerate(numerical_variables[0:-2]):
#   sns.scatterplot(data = train1, x = j, y = 'efs_time', hue = 'efs' , ax = axes[i])
# for i in range(-2, 0):
#   axes[i].axis('off')
# plt.show()

In [26]:
# # check the plot of all category variables
# fig, axes = plt.subplots(7, 5, figsize = (20, 25))
# axes = axes.flatten()
# for i, j in enumerate(object_variables):
#   sns.histplot(data = train1, x = j, ax = axes[i])
# plt.show()

In [27]:
# # check the plot of all category variables vs efs and efs_time

# fig, axes = plt.subplots(7, 5, figsize = (20, 25))
# axes = axes.flatten()
# for i, j in enumerate(object_variables):
#   sns.histplot(data = train1, x = j, hue = 'efs', ax = axes[i])
# plt.show()


# fig, axes = plt.subplots(7, 5, figsize = (20, 25))
# axes = axes.flatten()
# for i, j in enumerate(object_variables):
#   sns.scatterplot(data = train1, x = j, y = 'efs_time', hue = 'efs', ax = axes[i])
# plt.show()


#### We can see that some variables with missing data are very important, so we are going to use a random survival forest, which can handle missing values directly

In [28]:
train1.isnull().mean().round(3)

dri_score                 0.000
psych_disturb             0.072
cyto_score                0.280
diabetes                  0.074
hla_match_c_high          0.160
hla_high_res_8            0.202
tbi_status                0.000
arrhythmia                0.076
hla_low_res_6             0.114
graft_type                0.000
vent_hist                 0.000
renal_issue               0.066
pulm_severe               0.074
prim_disease_hct          0.000
hla_high_res_6            0.183
cmv_status                0.000
hla_high_res_10           0.249
hla_match_dqb1_high       0.181
tce_imm_match             0.387
hla_nmdp_6                0.146
hla_match_c_low           0.097
rituximab                 0.075
hla_match_drb1_low        0.092
hla_match_dqb1_low        0.146
prod_type                 0.000
cyto_score_detail         0.414
conditioning_intensity    0.166
ethnicity                 0.000
year_hct                  0.000
obesity                   0.061
mrd_hct                   0.576
in_vivo_

In [29]:
# use get_dummies to convert categorical variables to numerical variables
train1 = pd.get_dummies(train1, drop_first= True, dtype = float)

# split train1 into train4 and test4
from sklearn.model_selection import train_test_split
train4, test4 = train_test_split(train1, train_size= 0.7, random_state= 1, shuffle= True, stratify= train1['efs'])
train4 = train4.reset_index(drop = True).astype(np.float32)
test4 = test4.reset_index(drop = True)
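One detail of the encoding step is worth checking. As far as I understand `pd.get_dummies`, a missing value in a categorical column does not stay NaN after encoding: by default the row simply gets 0 in every dummy column, and with `drop_first= True` that is the same encoding as the dropped baseline category. Only the numerical columns would then still carry NaNs into the model. The sketch below shows this on an invented toy column, together with the `dummy_na` option that adds an explicit indicator.

```python
import numpy as np
import pandas as pd

# Toy column for illustration; the values are invented.
toy = pd.DataFrame({"cyto_score": ["Favorable", "Poor", np.nan]})

# Default behaviour: the NaN row gets 0 in every dummy column, which with
# drop_first=True is the same encoding as the dropped baseline category.
default = pd.get_dummies(toy, drop_first=True, dtype=float)

# dummy_na=True adds an explicit indicator column for missing values.
with_flag = pd.get_dummies(toy, drop_first=True, dummy_na=True, dtype=float)

print(default)
print(with_flag)
```

Whether the extra indicator columns help the survival model here is an open question; the sketch only shows what each option encodes.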

In [30]:
import gc
gc.collect()

from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

# split X and y from train4
train4_X = train4.drop(columns= ['efs', 'efs_time']).astype(np.float32)
train4_y = train4['efs_time']
train4_r = train4['efs']

# use Surv to create survival object
surv_object = Surv.from_arrays( train4_r, train4_y)

# create and fit the random survival forest model
rsf = RandomSurvivalForest(n_estimators= 10, max_depth= 10, n_jobs = 1, max_features= 'sqrt', random_state= 1)
rsf.fit(train4_X, surv_object)

In [31]:
# after training the model, force garbage collection
import gc
gc.collect()
c_index = rsf.score(train4_X, surv_object)
c_index

0.6919570532422651

In [32]:
test4_X = test4.drop(columns = ['efs', 'efs_time'])
test4_y = test4['efs_time']
test4_r = test4['efs']
# calculate the C-index for test data
y_test = Surv.from_arrays(test4_r, test4_y)
c_test_index = rsf.score(test4_X, y_test)
c_test_index

0.6460427419474594
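The two scores above are concordance indices (about 0.69 on the training split and 0.65 on the held-out split). To make clear what that number measures, here is a simplified pure-Python sketch of Harrell's C for right-censored data. It is not the scikit-survival implementation: it skips pairs with tied times, and the toy values are invented.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data (simplified: tied times are skipped)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # the earlier event has the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / comparable

# Toy data for illustration; the values are invented.
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risk = [0.9, 0.1, 0.2, 0.7]
print(concordance_index(times, events, risk))  # 3 of 4 comparable pairs agree -> 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking, so both scores above indicate a moderate ranking ability, with some loss from the training split to the held-out split.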

In [33]:
import gc
gc.collect()
from sklearn.inspection import permutation_importance

# assess the feature importance
# because train4_X is quite large, use a 10% sample of train4_X
small_train4_X = train4_X.sample(frac = 0.1, random_state= 1)
small_surv_object = surv_object[small_train4_X.index]
feature_importances = permutation_importance(rsf, small_train4_X, small_surv_object, n_repeats= 3, n_jobs = 1 , random_state= 1)

### Some permutation importances are negative, which means that shuffling these variables improved the model compared with the original data. This may be because the variables are irrelevant, noisy, or correlated with other variables. Let's check the feature correlation

In [34]:
feature_importances

{'importances_mean': array([ 6.42949799e-04,  1.66210252e-03,  6.28536694e-04,  7.95227382e-04,
         1.80174252e-03,  2.14943256e-03,  6.65196112e-04,  4.88792244e-04,
         1.05184331e-03,  2.21168046e-03,  1.28318408e-02,  1.06761417e-03,
         7.30733961e-03,  4.17040050e-04,  1.52194030e-02,  9.31420777e-04,
         1.44705482e-03,  1.80371649e-02,  8.11228017e-03,  1.62220537e-03,
         5.02160921e-04,  4.66525042e-03,  5.77986385e-04,  3.28399456e-03,
         6.45456425e-05,  2.02201236e-04,  0.00000000e+00,  2.78235585e-04,
         1.48444534e-03,  1.18118526e-02,  5.34955956e-04,  2.22254251e-04,
         9.81762200e-05,  1.95694451e-03,  8.97999085e-04,  2.52333774e-04,
         0.00000000e+00,  1.31180141e-04,  3.57904543e-03, -1.25331345e-06,
        -3.23772641e-05,  1.29529945e-03,  2.86110571e-03,  0.00000000e+00,
         0.00000000e+00,  2.91186491e-04,  3.82124826e-03,  1.13863527e-03,
         2.11183316e-04,  2.08467803e-04, -1.66899574e-04,  3.720669
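To show where values near zero (or slightly below) come from, here is a NumPy-only sketch of the permutation-importance idea. It uses a least-squares fit on invented toy data as a stand-in for the survival forest and an R² score as a stand-in for the C-index, so the numbers are not comparable to the output above; `permutation_importance_sketch` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y depends on column 0 only; column 1 is pure noise.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

# Stand-in "model": ordinary least squares fitted on the toy data.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def score(X, y):
    """R^2 of the fixed least-squares model on (X, y)."""
    residual = y - X @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)

def permutation_importance_sketch(X, y, n_repeats=3):
    """Mean drop in score after shuffling each column, one column at a time."""
    baseline = score(X, y)
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - score(X_perm, y))
        importances.append(float(np.mean(drops)))
    return importances

print(permutation_importance_sketch(X, y))
```

The informative column gets a large importance, while the noise column lands close to zero; with only a few repeats, a value that close to zero can fall on either side of it by chance.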

In [38]:
# calculate VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# use train4_X (efs and efs_time are already removed) to calculate VIF
# VIF cannot be computed with missing values, so drop every row that contains a NaN
train4_X_dropna = train4_X.dropna()

# repeatedly drop the feature with the largest VIF until every VIF is at most 5
# (the nan-aware functions ignore any NaN VIFs)
while True:
  vif = [variance_inflation_factor(exog = train4_X_dropna.values, exog_idx = i) for i in range(train4_X_dropna.shape[1])]
  if np.nanmax(vif) <= 5:
    break
  column_max = train4_X_dropna.columns[int(np.nanargmax(vif))]
  train4_X_dropna = train4_X_dropna.drop(columns = column_max)

# build the result table only after the loop, so its length matches the remaining columns
# (the original version assigned the shrinking VIF list to a frame created with all 148 columns,
# which raised "ValueError: Length of values (147) does not match length of index (148)")
vif_feature = pd.DataFrame({'variables': train4_X_dropna.columns, 'vif': vif})
vif_feature
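For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The NumPy sketch below computes it directly on invented toy data. Note one assumption that differs from the cell above: this sketch adds an intercept to each regression, whereas, as far as I know, the statsmodels function regresses on exactly the columns it is given.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    r_squared = 1.0 - residual.var() / target.var()
    return 1.0 / (1.0 - r_squared)

# Toy data for illustration; the values are randomly generated.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # independent of a and b
X = np.column_stack([a, b, c])

print([round(vif(X, j), 1) for j in range(X.shape[1])])
```

The two near-duplicate columns get a VIF far above the threshold of 5, while the independent column stays close to 1, which is the pattern the pruning loop relies on.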

In [82]:
X_const.isnull().sum()

const                                  0
hla_match_c_high                       0
hla_high_res_8                       865
hla_low_res_6                        249
hla_high_res_6                       865
                                    ... 
melphalan_dose_N/A, Mel not given      0
cardiac_Not done                       0
cardiac_Yes                            0
pulm_moderate_Not done                 0
pulm_moderate_Yes                      0
Length: 149, dtype: int64

In [73]:
correlation_matrix.columns

Index(['hla_match_c_high', 'hla_high_res_8', 'hla_low_res_6', 'hla_high_res_6',
       'hla_high_res_10', 'hla_match_dqb1_high', 'hla_nmdp_6',
       'hla_match_c_low', 'hla_match_drb1_low', 'hla_match_dqb1_low',
       ...
       'tce_div_match_GvH non-permissive', 'tce_div_match_HvG non-permissive',
       'tce_div_match_Permissive mismatched', 'donor_related_Related',
       'donor_related_Unrelated', 'melphalan_dose_N/A, Mel not given',
       'cardiac_Not done', 'cardiac_Yes', 'pulm_moderate_Not done',
       'pulm_moderate_Yes'],
      dtype='object', length=148)

In [61]:
train4_X_const.columns

Index(['const', 'hla_match_c_high', 'hla_high_res_8', 'hla_low_res_6',
       'hla_high_res_6', 'hla_high_res_10', 'hla_match_dqb1_high',
       'hla_nmdp_6', 'hla_match_c_low', 'hla_match_drb1_low',
       ...
       'tce_div_match_GvH non-permissive', 'tce_div_match_HvG non-permissive',
       'tce_div_match_Permissive mismatched', 'donor_related_Related',
       'donor_related_Unrelated', 'melphalan_dose_N/A, Mel not given',
       'cardiac_Not done', 'cardiac_Yes', 'pulm_moderate_Not done',
       'pulm_moderate_Yes'],
      dtype='object', length=149)

In [60]:
correlation_matrix

Unnamed: 0,hla_match_c_high,hla_high_res_8,hla_low_res_6,hla_high_res_6,hla_high_res_10,hla_match_dqb1_high,hla_nmdp_6,hla_match_c_low,hla_match_drb1_low,hla_match_dqb1_low,...,tce_div_match_GvH non-permissive,tce_div_match_HvG non-permissive,tce_div_match_Permissive mismatched,donor_related_Related,donor_related_Unrelated,"melphalan_dose_N/A, Mel not given",cardiac_Not done,cardiac_Yes,pulm_moderate_Not done,pulm_moderate_Yes
hla_match_c_high,1.000000,0.858538,0.757077,0.751491,0.853467,0.614428,0.748911,0.745979,0.692020,0.596936,...,-0.030697,-0.044676,0.077449,-0.408446,0.405292,-0.069573,-0.009688,0.001649,-0.034490,-0.017343
hla_high_res_8,0.858538,1.000000,0.910200,0.983467,0.986260,0.701509,0.885265,0.802705,0.815420,0.666526,...,-0.003310,-0.024853,0.056577,-0.453528,0.452058,-0.084268,-0.015178,0.017800,-0.034193,-0.012739
hla_low_res_6,0.757077,0.910200,1.000000,0.899234,0.913121,0.686176,0.883286,0.764865,0.895341,0.641250,...,0.018420,-0.005400,0.042416,-0.434485,0.433132,-0.084070,-0.015819,0.023425,-0.031134,-0.008212
hla_high_res_6,0.751491,0.983467,0.899234,1.000000,0.968904,0.681565,0.872655,0.761326,0.798444,0.640985,...,0.005804,-0.015887,0.044247,-0.430348,0.429599,-0.083827,-0.017480,0.023287,-0.033292,-0.010242
hla_high_res_10,0.853467,0.986260,0.913121,0.968904,1.000000,0.809601,0.889208,0.813504,0.830553,0.695501,...,-0.001359,-0.023968,0.067165,-0.473065,0.472203,-0.089225,-0.014965,0.014823,-0.035754,-0.013207
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"melphalan_dose_N/A, Mel not given",-0.069573,-0.084268,-0.084070,-0.083827,-0.089225,-0.077194,-0.085291,-0.062270,-0.078727,-0.081855,...,0.032097,0.047951,-0.052356,0.027837,-0.024740,1.000000,0.001327,-0.021210,-0.006328,0.016960
cardiac_Not done,-0.009688,-0.015178,-0.015819,-0.017480,-0.014965,-0.013553,-0.017886,-0.010341,-0.015350,-0.020291,...,-0.002071,-0.009693,-0.011445,0.017627,-0.015878,0.001327,1.000000,-0.016815,0.022287,-0.001426
cardiac_Yes,0.001649,0.017800,0.023425,0.023287,0.014823,0.002281,0.027029,0.000203,0.015407,0.015261,...,0.028497,0.008328,0.002387,-0.014842,0.013610,-0.021210,-0.016815,1.000000,0.008882,0.062991
pulm_moderate_Not done,-0.034490,-0.034193,-0.031134,-0.033292,-0.035754,-0.032096,-0.037855,-0.029345,-0.023028,-0.022643,...,-0.002030,0.001309,-0.008690,0.024916,-0.023092,-0.006328,0.022287,0.008882,1.000000,-0.035781
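The HLA columns in the matrix above are strongly correlated with each other (for example, `hla_high_res_8` and `hla_high_res_10` at about 0.986). A short sketch for listing such pairs from a correlation matrix follows; `high_correlation_pairs` and the 0.9 threshold are my own choices, and the 3x3 excerpt is copied from the output above.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(corr, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # Long format; masked entries are NaN and are removed explicitly.
    pairs = upper.stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Small excerpt of the correlation matrix shown above.
cols = ["hla_match_c_high", "hla_high_res_8", "hla_high_res_10"]
corr = pd.DataFrame(
    [[1.000000, 0.858538, 0.853467],
     [0.858538, 1.000000, 0.986260],
     [0.853467, 0.986260, 1.000000]],
    index=cols, columns=cols,
)
print(high_correlation_pairs(corr))
```

On the full matrix, the same call would give a list of candidate columns to drop or combine before refitting.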


### Use random forests, XGBoost, or LightGBM to handle missing data, or a Cox Proportional Hazards model