# HCT Survival Predictions

Goal: Develop models to improve the prediction of transplant survival rates for patients undergoing allogeneic Hematopoietic Cell Transplantation (HCT), an important step in ensuring that every patient has a fair chance at a successful outcome, regardless of their background.

Improving survival predictions for allogeneic HCT patients is a vital healthcare challenge. Current predictive models often fall short in addressing disparities related to socioeconomic status, race, and geography. Addressing these gaps is crucial for enhancing patient care, optimizing resource utilization, and rebuilding trust in the healthcare system.

The goal is to address disparities by bridging diverse data sources, refining algorithms, and reducing biases to ensure equitable outcomes for patients across diverse race groups. Your work will help create a more just and effective healthcare environment, ensuring every patient receives the care they deserve.

## Dataset Description

The dataset consists of 59 variables related to hematopoietic stem cell transplantation (HSCT), encompassing a range of demographic and medical characteristics of both recipients and donors, such as age, sex, ethnicity, disease status, and treatment details. The primary outcome of interest is event-free survival, represented by the variable efs, while the time to event-free survival is captured by the variable efs_time. These two variables together encode the target for a censored time-to-event analysis. The data, which features equal representation across recipient racial categories including White, Asian, African-American, Native American, Pacific Islander, and More than One Race, was synthetically generated using the data generator from synthcity, trained on a large cohort of real CIBMTR data.


    train.csv - the training set, with target efs (Event-free survival)
    test.csv - the test set; your task is to predict the value of efs for this data
    sample_submission.csv - a sample submission file in the correct format with all predictions set to 0.50
    data_dictionary.csv - a list of all features and targets used in dataset and their descriptions
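
The description above says that `efs` and `efs_time` together encode a censored time-to-event target. A minimal sketch of that encoding follows, assuming `efs = 1` marks an observed event and `efs = 0` a censored record. The toy values are made up for illustration. The structured-array layout mirrors what `sksurv.util.Surv.from_dataframe` returns, but only NumPy and pandas are used here.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "efs": [1.0, 0.0, 1.0, 0.0],
    "efs_time": [5.2, 40.1, 12.7, 88.3],
})

# Structured array: a boolean event indicator plus a float follow-up time.
y = np.empty(len(toy), dtype=[("efs", bool), ("efs_time", float)])
y["efs"] = toy["efs"].astype(bool).to_numpy()
y["efs_time"] = toy["efs_time"].to_numpy()

n_events = int(y["efs"].sum())       # records where the event was observed
n_censored = int((~y["efs"]).sum())  # records censored at efs_time
print(y.dtype.names, n_events, n_censored)
```

For a censored record, `efs_time` is only a lower bound on the true event time, which is why the two columns have to be modeled together.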


## Import Packages

In [1]:
import dask.dataframe as dd
from sklearn.ensemble import RandomForestClassifier
from dask_ml.model_selection import train_test_split
import pandas as pd
import numpy as np

## Import Dataset

In [2]:
# path_data_dictionary = "C:/Users/julia/Desktop/Yanjun/hct competition/data_dictionary.csv"
# path_test = "C:/Users/julia/Desktop/Yanjun/hct competition/test.csv"
path_train = "C:/Users/julia/Desktop/Yanjun/hct competition/train.csv"
# data_dictionary = pd.read_csv(path_data_dictionary)
# path_submission = "C:/Users/julia/Desktop/Yanjun/hct competition/sample_submission.csv"
# test = pd.read_csv(path_test)
train = dd.read_csv(path_train)
train = train.repartition(npartitions = 4)
# submission = pd.read_csv(path_submission)

In [3]:
train

Unnamed: 0_level_0,ID,dri_score,psych_disturb,cyto_score,diabetes,hla_match_c_high,hla_high_res_8,tbi_status,arrhythmia,hla_low_res_6,graft_type,vent_hist,renal_issue,pulm_severe,prim_disease_hct,hla_high_res_6,cmv_status,hla_high_res_10,hla_match_dqb1_high,tce_imm_match,hla_nmdp_6,hla_match_c_low,rituximab,hla_match_drb1_low,hla_match_dqb1_low,prod_type,cyto_score_detail,conditioning_intensity,ethnicity,year_hct,obesity,mrd_hct,in_vivo_tcd,tce_match,hla_match_a_high,hepatic_severe,donor_age,prior_tumor,hla_match_b_low,peptic_ulcer,age_at_hct,hla_match_a_low,gvhd_proph,rheum_issue,sex_match,hla_match_b_high,race_group,comorbidity_score,karnofsky_score,hepatic_mild,tce_div_match,donor_related,melphalan_dose,hla_low_res_8,cardiac,hla_match_drb1_high,pulm_moderate,hla_low_res_10,efs,efs_time
npartitions=4
,int64,string,string,string,string,float64,float64,string,string,float64,string,string,string,string,string,float64,string,float64,float64,string,float64,float64,string,float64,float64,string,string,string,string,int64,string,string,string,string,float64,string,float64,string,float64,string,float64,float64,string,string,string,float64,string,float64,float64,string,string,string,string,float64,string,float64,string,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


In [4]:
# remove Id from train
train = train.drop(columns = 'ID')
import gc
gc.collect()


31

In [5]:
# # check the correlation of missing data
# import missingno as msno
# msno.heatmap(train)
# # from the plot, I can see that missingness is correlated across some variables

### Check the proportion of missing data

In [6]:
# percentage of missing values per column
missing = train.isnull().mean().compute() * 100
df_missing = pd.DataFrame(missing, columns=['values']).sort_values(by = 'values', ascending = True)

# map each missing-data percentage to a background color band
def color_map(percent):
    cmap = []
    for x in percent:
        if x >= 20:
            temp = 'background-color: red'
        elif x >= 5:
            temp = 'background-color: orange'
        elif x >= 1:
            temp = 'background-color: yellow'
        else:
            temp = 'background-color: green'
        cmap.append(temp)
    return cmap

# color the 'values' column by band
df_missing.style.apply(color_map, axis = 0)

Unnamed: 0,values
efs_time,0.0
race_group,0.0
age_at_hct,0.0
efs,0.0
year_hct,0.0
prod_type,0.0
tbi_status,0.0
prim_disease_hct,0.0
graft_type,0.0
dri_score,0.534722


In [7]:
del missing, df_missing
gc.collect()

48

# Remove the missing data from train and assess the importance of variables using RandomSurvivalForest

### First, use clean data to find the important variables

In [8]:
# clean_data = train.copy()
# clean_data = clean_data.dropna()
# len(clean_data)

In [9]:
from sksurv.ensemble import RandomSurvivalForest
# use clean data find variables importance

# # convert categorical variables to numerical variables in the clean data
# clean_data = pd.get_dummies(data = clean_data, drop_first= True, dtype = int)
# # first balance the clean_data
# from imblearn.over_sampling import SMOTE

# smote = SMOTE(sampling_strategy = 'auto', random_state = 1)
# cond = clean_data.columns == 'efs'
# X_cond = clean_data.columns[~cond]
# X, y = smote.fit_resample(clean_data[X_cond], clean_data['efs'])


In [10]:
# new_clean = pd.concat([X, y], axis = 1)
# del X, y, clean_data, cond, smote
# gc.collect()

In [11]:
from sksurv.util import Surv
# use this new clean dataset to get the important variables 
# rsf = RandomSurvivalForest(n_estimators= 30, max_depth= 20, max_features= 'sqrt', random_state= 1)
# y = new_clean[['efs', 'efs_time']]
# Y = Surv.from_dataframe('efs', 'efs_time', y)
# cond = new_clean.columns.isin(['efs', 'efs_time'])
# X = new_clean[new_clean.columns[~cond]]

In [12]:
# del new_clean
# gc.collect()

In [13]:
# rsf.fit(X, Y)
# rsf.score(X, Y)

In [14]:
# use permutation importance to calculate importance of features
from sksurv.metrics import concordance_index_censored
from sklearn.inspection import permutation_importance
# create a dataframe of feature importance
# def C_score(estimator, X, y):
#   y_pred = estimator.predict(X)
#   # use the y passed to the scorer rather than the global Y
#   return concordance_index_censored(y['efs'], y['efs_time'], y_pred)[0]
  
# feature_importance = permutation_importance(rsf, X = X, y = Y, scoring = C_score, n_repeats = 3, random_state = 1)
# feature_importance 
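
The scorer in the cell above is built on the concordance index. As a rough illustration of what that number measures, here is a simplified pure-Python version. It is a sketch, not scikit-survival's implementation: it ignores ties in event time, and the function name and toy values are assumptions made for this example.

```python
def concordance_index(event, time, risk):
    """Simplified Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored record cannot be the earlier member of a pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier event has the higher risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied risk counts as half
    return concordant / comparable


event = [True, False, True, True]
time = [2.0, 4.0, 6.0, 8.0]
risk = [0.9, 0.3, 0.1, 0.5]   # higher value = earlier event expected
c_index = concordance_index(event, time, risk)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering. Permutation importance then measures how much this score drops when one feature column is shuffled.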

In [15]:
# mean_importance = feature_importance.importances_mean.mean()
# index = np.where(feature_importance.importances_mean >= mean_importance)
# import_variables_1 = X.columns[index]
# import_variables_1

In [16]:
# use these variables to fit the randomforestsurvival on original dataset, before doing this, clean old dataset, like X, Y, y
# del X, Y, y, index, feature_importance, mean_importance
# gc.collect()


#### The first method finds important variables from complete cases only, which is biased, so the full data is needed to find the important variables with RandomSurvivalForest

### Second, use the full data (samples) and RandomSurvivalForest to find the important variables

In [17]:
# balance the full original data by oversampling the minority efs class with replacement
y_counts = train.efs.value_counts().compute()
minority = y_counts.idxmin()
majority_size = y_counts.max()
minority_size = y_counts.min()
# dask's sample takes a fraction rather than a row count
new = train[train.efs == minority].sample(frac = (majority_size - minority_size) / minority_size, replace = True, random_state = 1)
train = dd.concat([train, new], axis = 0)
train = train.repartition(npartitions = 4)


In [18]:
train.efs.value_counts().compute()

efs
1.0    15532
0.0    15533
Name: count, dtype: int64
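
For reference, the same balancing idea can be sketched in plain pandas, where an exact row count can be requested. Dask's `sample` works with a fraction, which is presumably why the two counts above differ by one. The toy frame and column `x` are assumptions for illustration.

```python
import pandas as pd

# Toy frame: six rows of class 0.0 and four rows of class 1.0.
toy = pd.DataFrame({"efs": [0.0] * 6 + [1.0] * 4, "x": range(10)})

counts = toy["efs"].value_counts()
minority = counts.idxmin()                      # label of the smaller class
n_extra = int(counts.max() - counts.min())      # rows needed to reach balance

# Draw the missing rows from the minority class with replacement.
extra = toy[toy["efs"] == minority].sample(n=n_extra, replace=True, random_state=1)
balanced = pd.concat([toy, extra], ignore_index=True)

print(balanced["efs"].value_counts())
```

One design point worth keeping in mind: rows duplicated before a train/test split can land on both sides of it, so oversampling is often applied to the training portion only.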

In [19]:
del y_counts, minority, new, majority_size, minority_size
gc.collect()

20

In [20]:
train1 = train.copy()

In [21]:
# dd.get_dummies only handles categorical columns, so separate the numerical columns first
train1_numerical = train1.select_dtypes(include = 'number')
train1_object = train1.select_dtypes(exclude = 'number')

In [22]:
# categorize train1_object
train1_object = train1_object.categorize()

In [23]:
# change category variables to numerical variables
train1_encoded = dd.get_dummies(train1_object, drop_first= True, dtype = float)

In [None]:
train1_numerical.divisions

In [25]:
# concatenate the encoded categorical columns with the numerical columns
train1 = dd.concat([train1_encoded, train1_numerical], axis = 1)

In [26]:
del train1_object, train1_numerical, train1_encoded
gc.collect()

0
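
The encoding steps above (separate numeric columns, one-hot encode the rest with `drop_first`, concatenate) can be sketched on a tiny pandas frame. The category value "Bone marrow" and the toy numbers are assumptions for illustration. Only the column names come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "graft_type": ["Bone marrow", "Peripheral blood", None, "Peripheral blood"],
    "donor_age": [30.5, None, 41.0, 52.0],
})

numeric = toy.select_dtypes(include="number")
categorical = toy.select_dtypes(exclude="number").astype("category")

# drop_first removes the first category level, leaving one indicator column here.
encoded = pd.get_dummies(categorical, drop_first=True, dtype=float)
combined = pd.concat([encoded, numeric], axis=1)

print(combined)
```

Note that a missing category becomes a row of zeros in the indicator columns rather than a NaN, so after this step only the numeric columns still carry missing values.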

In [None]:
# from sksurv.ensemble import RandomSurvivalForest
# # define randomforestsurvival
# # randomsurvival = RandomSurvivalForest(n_estimators= 10, max_depth = 15, random_state= 1, max_features= 'sqrt')

# from sksurv.util import Surv
# from sksurv.metrics import concordance_index_censored
# from sklearn.inspection import permutation_importance

# # use small sample from train1 to calculate the important variables
# small_train1 = train1.sample(frac= 0.05, random_state = 1)
# cond = train1.columns.isin(['efs', 'efs_time'])
# small_X = small_train1[small_train1.columns[~cond]]
# small_y = train1.loc[small_X.index, ['efs', 'efs_time']]
# small_y = Surv.from_dataframe('efs', 'efs_time',small_y )

# del small_train1
# gc.collect()


In [None]:
# trainydataframe = train1[['efs','efs_time']]
# trainydataframe = Surv.from_dataframe('efs', 'efs_time', trainydataframe)
# randomsurvival.fit(train1[train1.columns[~cond]],trainydataframe)
# randomsurvival.score(train1[train1.columns[~cond]],trainydataframe)

In [None]:
# importance_feature = permutation_importance(estimator= randomsurvival, X = small_X, y = small_y,random_state= 1 , n_repeats = 3)

In [None]:
# del randomsurvival
# gc.collect()

In [None]:
# build a dataframe for feature importance
# importance = pd.DataFrame(data = importance_feature.importances_mean, index = small_X.columns,columns = ['importance1'])

In [None]:
# del importance_feature
# gc.collect()

In [None]:
# del trainydataframe
# gc.collect()

In [None]:
# # reorder variables according to their importance
# importance = importance.sort_values(by = 'importance1', ascending= False)

In [None]:
# mean_ = importance.importance1.mean()
# cond = importance.importance1 > mean_ * 0.3
# important_variables = importance[cond].index

In [None]:
# del mean_, cond, small_X, small_y
# gc.collect()

In [27]:
train1

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Unnamed: 0_level_0,dri_score_High - TED AML case <missing cytogenetics,dri_score_Intermediate,dri_score_Intermediate - TED AML case <missing cytogenetics,dri_score_Low,dri_score_Missing disease status,dri_score_N/A - disease not classifiable,dri_score_N/A - non-malignant indication,dri_score_N/A - pediatric,dri_score_TBD cytogenetics,dri_score_Very high,psych_disturb_Not done,psych_disturb_Yes,cyto_score_Intermediate,cyto_score_Normal,cyto_score_Not tested,cyto_score_Other,cyto_score_Poor,cyto_score_TBD,diabetes_Not done,diabetes_Yes,tbi_status_TBI + Cy +- Other,"tbi_status_TBI +- Other, -cGy, fractionated","tbi_status_TBI +- Other, -cGy, single","tbi_status_TBI +- Other, -cGy, unknown dose","tbi_status_TBI +- Other, <=cGy","tbi_status_TBI +- Other, >cGy","tbi_status_TBI +- Other, unknown dose",arrhythmia_Not done,arrhythmia_Yes,graft_type_Peripheral blood,vent_hist_Yes,renal_issue_Not done,renal_issue_Yes,pulm_severe_Not done,pulm_severe_Yes,prim_disease_hct_ALL,prim_disease_hct_AML,prim_disease_hct_CML,prim_disease_hct_HD,prim_disease_hct_HIS,prim_disease_hct_IEA,prim_disease_hct_IIS,prim_disease_hct_IMD,prim_disease_hct_IPA,prim_disease_hct_MDS,prim_disease_hct_MPN,prim_disease_hct_NHL,prim_disease_hct_Other acute leukemia,prim_disease_hct_Other leukemia,prim_disease_hct_PCD,prim_disease_hct_SAA,prim_disease_hct_Solid tumor,cmv_status_+/-,cmv_status_-/+,cmv_status_-/-,tce_imm_match_G/G,tce_imm_match_H/B,tce_imm_match_H/H,tce_imm_match_P/B,tce_imm_match_P/G,tce_imm_match_P/H,tce_imm_match_P/P,rituximab_Yes,prod_type_PB,cyto_score_detail_Intermediate,cyto_score_detail_Not tested,cyto_score_detail_Poor,cyto_score_detail_TBD,"conditioning_intensity_N/A, F(pre-TED) not submitted",conditioning_intensity_NMA,conditioning_intensity_No drugs reported,conditioning_intensity_RIC,conditioning_intensity_TBD,ethnicity_Non-resident of the U.S.,ethnicity_Not Hispanic or Latino,obesity_Not done,obesity_Yes,mrd_hct_Positive,in_vivo_tcd_Yes,tce_match_GvH 
non-permissive,tce_match_HvG non-permissive,tce_match_Permissive,hepatic_severe_Not done,hepatic_severe_Yes,prior_tumor_Not done,prior_tumor_Yes,peptic_ulcer_Not done,peptic_ulcer_Yes,gvhd_proph_CDselect alone,gvhd_proph_CSA + MMF +- others(not FK),"gvhd_proph_CSA + MTX +- others(not MMF,FK)","gvhd_proph_CSA +- others(not FK,MMF,MTX)",gvhd_proph_CSA alone,gvhd_proph_Cyclophosphamide +- others,gvhd_proph_Cyclophosphamide alone,gvhd_proph_FK+ MMF +- others,gvhd_proph_FK+ MTX +- others(not MMF),"gvhd_proph_FK+- others(not MMF,MTX)",gvhd_proph_FKalone,gvhd_proph_No GvHD Prophylaxis,gvhd_proph_Other GVHD Prophylaxis,"gvhd_proph_Parent Q = yes, but no agent",gvhd_proph_TDEPLETION +- other,gvhd_proph_TDEPLETION alone,rheum_issue_Not done,rheum_issue_Yes,sex_match_F-M,sex_match_M-F,sex_match_M-M,race_group_Asian,race_group_Black or African-American,race_group_More than one race,race_group_Native Hawaiian or other Pacific Islander,race_group_White,hepatic_mild_Not done,hepatic_mild_Yes,tce_div_match_GvH non-permissive,tce_div_match_HvG non-permissive,tce_div_match_Permissive mismatched,donor_related_Related,donor_related_Unrelated,"melphalan_dose_N/A, Mel not given",cardiac_Not done,cardiac_Yes,pulm_moderate_Not done,pulm_moderate_Yes,hla_match_c_high,hla_high_res_8,hla_low_res_6,hla_high_res_6,hla_high_res_10,hla_match_dqb1_high,hla_nmdp_6,hla_match_c_low,hla_match_drb1_low,hla_match_dqb1_low,year_hct,hla_match_a_high,donor_age,hla_match_b_low,age_at_hct,hla_match_a_low,hla_match_b_high,comorbidity_score,karnofsky_score,hla_low_res_8,hla_match_drb1_high,hla_low_res_10,efs,efs_time
npartitions=4
,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


## Use KNN to impute the new dataset (important variables)

In [None]:
from sklearn.impute import KNNImputer
from dask.distributed import Client
from dask_ml.preprocessing import StandardScaler

from dask_ml.wrappers import ParallelPostFit
from dask_ml.model_selection import train_test_split

In [None]:
# # helper function that splits a dask dataset into X and y
# def X_y(daskdata):
#   cond = daskdata.columns.isin(['efs', 'efs_time'])
  
#   X_columns = [col for col in daskdata.columns if col not in ['efs', 'efs_time']]  # Convert columns to list of strings
#   X = daskdata[X_columns]
#   y = daskdata['efs']
#   return X, y

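# impute the missing values of one partition with KNNImputer

def KNN(chunk, n_neighbors = 5):
    # neighbors are searched within this partition only
    Knn = KNNImputer(n_neighbors = n_neighbors)
    imputed_X = Knn.fit_transform(chunk)
    # keep the original index so the rows stay aligned with the target
    return pd.DataFrame(imputed_X, columns = chunk.columns, index = chunk.index)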
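
To make the imputation step concrete, here is a small worked example on a pandas frame with one missing value. It is a sketch under assumptions: the helper name `knn_impute`, the toy ages, and `n_neighbors=2` are chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute(chunk, n_neighbors=2):
    """Fill missing values from the nearest rows, keeping columns and index."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    values = imputer.fit_transform(chunk)
    return pd.DataFrame(values, columns=chunk.columns, index=chunk.index)


toy = pd.DataFrame(
    {
        "donor_age": [30.0, np.nan, 50.0, 52.0],
        "age_at_hct": [28.0, 31.0, 49.0, 51.0],
    },
    index=[10, 11, 12, 13],
)

imputed = knn_impute(toy)
# Row 11 is closest to rows 10 and 12 on age_at_hct, so its donor_age
# becomes the mean of their donor_age values.
print(imputed.loc[11, "donor_age"])
```

When a function like this is applied per partition, each row can only borrow from neighbors in the same partition, so the result may depend on how the data is partitioned.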

In [None]:
# separate the feature columns from the targets (efs, efs_time)
cond = train1.columns.isin(['efs', 'efs_time'])
train1_X = train1[train1.columns[~cond]]
# train1_y = train1.compute()

# # split X and y in train1 in each chunks
# meta_X = pd.DataFrame(columns = [col for col in train1.columns if col not in ['efs', 'efs_time']], dtype=float)
# meta_y = pd.DataFrame(columns = ['efs'], dtype = float)
# train1_X, train1_y = train1.map_partitions(X_y, meta = (meta_X, meta_y))

# standardize the features; missing values are still present at this point and are imputed in the next cell
scaler = StandardScaler()
train1_X_imputed = scaler.fit_transform(train1_X)



In [None]:
# use KNN function to impute missing data in all chunks
train1_X = train1_X_imputed.map_partitions(KNN, meta = train1_X_imputed)

In [None]:
# check whether any missing data remain
train1_X.isna().sum().compute()

In [None]:
# free the scaled intermediate frame
del train1_X_imputed
gc.collect()

In [None]:
cond = train.columns.isin(['efs', 'efs_time'])
train[train.columns[~cond]]

In [None]:
train['efs']

### Split the dask dataset into train and test, use Random Forest to train the data

In [None]:
train1_y = train1['efs']
(train1_y.compute().index == train1_X.compute().index).mean()

In [None]:
train1_y = dd.from_pandas(train1['efs'].compute(), npartitions = 4)
train1_y = train1_y.repartition(npartitions = 5)

In [None]:
# split the imputed features and the efs target into train and test sets;
# with two inputs, train_test_split returns four objects
X_train, X_test, y_train, y_test = train_test_split(train1_X, train1_y, test_size= 0.2, random_state= 1, shuffle= True)

In [None]:
# # remove the outlier
# from scipy import stats
# z_scores = np.abs(stats.zscore(train1_complete))
# threshold = 3
# # train1_complete = train1_complete[(z_scores < threshold).all(axis = 1)]
# train1_complete[(z_scores < threshold).all(axis = 1)]

In [None]:
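# split train1_complete (the imputed features together with efs and efs_time) into train and test data;
# train1_complete is not created in the cells above and must already exist as a DataFrame
train_complete, test_complete = train_test_split(train1_complete, test_size= 0.3, random_state= 1, shuffle= True, stratify= train1_complete['efs'])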
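
The `stratify` argument used above is a scikit-learn option. The sketch below imports scikit-learn's `train_test_split` directly and makes no assumption about whether the dask_ml version accepts it. It shows on a toy pandas frame that stratifying on `efs` preserves the event ratio in both parts. The toy columns and sizes are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 50/50 split between censored (0.0) and event (1.0) rows.
toy = pd.DataFrame({
    "x": range(20),
    "efs": [0.0] * 10 + [1.0] * 10,
    "efs_time": [float(t) for t in range(1, 21)],
})

train_part, test_part = train_test_split(
    toy, test_size=0.3, random_state=1, shuffle=True, stratify=toy["efs"]
)

# Both parts keep the original event ratio.
print(len(train_part), train_part["efs"].mean())
print(len(test_part), test_part["efs"].mean())
```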


In [None]:
# use randomforestsurvival to train the data
cond = train_complete.columns.isin(['efs', 'efs_time'])

train_y = Surv.from_dataframe('efs', 'efs_time', train_complete)
test_y = Surv.from_dataframe('efs', 'efs_time', test_complete)
RFS = RandomSurvivalForest(n_estimators= 15, random_state= 1, max_depth= 15, max_features= 'sqrt')

# use model above to fit the data
RFS.fit(train_complete[train_complete.columns[~cond]], train_y)

In [None]:
RFS.score(train_complete[train_complete.columns[~cond]], train_y)

In [None]:
RFS.score(test_complete[test_complete.columns[~cond]], test_y)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

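# rank the features by recursive feature elimination with a random forest;
# the cells below use rfe, so this has to run first
model = RandomForestClassifier(n_estimators = 100)
rfe = RFE(estimator=model)
fit = rfe.fit(train_complete[train_complete.columns[~cond]], train_complete['efs'])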

In [None]:
ranking = rfe.ranking_
# sort the features by their RFE ranking (1 = selected)
sorted_features = sorted(zip(train_complete.columns[~cond], ranking), key=lambda x: x[1])
# keep the features ranked 47 or better
important_variables = [feature for feature, rank in sorted_features if rank < 48]
important_variables

In [None]:
del sorted_features, ranking
gc.collect()

## Use batch processing to train RandomSurvivalForest on important variables

In [None]:
# first, add efs and efs_time to the important variables so the train and test sets can be saved to new files and the old datasets deleted
important_variables.append('efs')
important_variables.append('efs_time')

In [None]:
train_complete[important_variables].to_csv("C:/Users/julia/Desktop/Yanjun/new_data.csv", index = False)
test_complete[important_variables].to_csv("C:/Users/julia/Desktop/Yanjun/new_test_data.csv", index = False)

In [None]:
del train1, train1_complete, train_complete, test_complete, important_variables, rfe, fit, model, RFS, train_y, test_y
gc.collect()

Use chunks to train RandomSurvivalForest

In [None]:
chunks = pd.read_csv("C:/Users/julia/Desktop/Yanjun/new_data.csv", chunksize = 1000 )

# define a function which can change dataframe to surv in chunks

def surv_y_dataframe(chunk):
  cond = chunk.columns.isin(['efs', 'efs_time'])
  y_chunk = Surv.from_dataframe('efs', 'efs_time', chunk)
  x_chunk = chunk[chunk.columns[~cond]]
  return x_chunk, y_chunk 

In [None]:
# define a random survival forest 
estmators = 100
RSF = RandomSurvivalForest(n_estimators= estmators,  random_state= 1, warm_start= True, max_depth= 15, min_samples_split= 20, min_samples_leaf= 10,  max_features= 'sqrt')

# train the first chunk on randomforest
first_chunk = next(chunks)
x_first_chunk, y_first_chunk = surv_y_dataframe(first_chunk)

RSF.fit(x_first_chunk, y_first_chunk)

In [None]:
RSF.score(x_first_chunk, y_first_chunk)

In [None]:
# incrementally grow the forest: with warm_start, each fit adds 10 new trees trained on the current chunk
for chunk in chunks:
  x_chunk, y_chunk = surv_y_dataframe(chunk)
  estmators = estmators + 10
  RSF.n_estimators = estmators
  
  # fit update model with new chunk
  RSF.fit(x_chunk, y_chunk)
  print(RSF.score(x_chunk, y_chunk))
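
The loop above relies on `warm_start`. The sketch below illustrates that mechanism with scikit-learn's `RandomForestClassifier`, which is swapped in here only to keep the example small. It is assumed, not verified in this notebook, that `RandomSurvivalForest` follows the same convention. The random data and batch sizes are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=5, warm_start=True, random_state=1)
forest.fit(X[:100], y[:100])              # first batch: 5 trees
first_batch_trees = list(forest.estimators_)

forest.n_estimators += 3                  # ask for 3 more trees
forest.fit(X[100:], y[100:])              # second batch: only the new trees are fit

print(len(forest.estimators_))
```

The trees from earlier batches are left untouched, so each tree has seen only the chunk it was fit on. The final ensemble averages trees trained on different chunks rather than refitting every tree on all of the data.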

In [None]:
test = pd.read_csv("C:/Users/julia/Desktop/Yanjun/new_test_data.csv")
x_test, y_test = surv_y_dataframe(test)
RSF.score(x_test, y_test)