# HCT Survival Predictions

Goal:  Develop models to improve the prediction of transplant survival rates for patients undergoing allogeneic Hematopoietic Cell Transplantation (HCT) — an important step in ensuring that every patient has a fair chance at a successful outcome, regardless of their background.

Improving survival predictions for allogeneic HCT patients is a vital healthcare challenge. Current predictive models often fall short in addressing disparities related to socioeconomic status, race, and geography. Addressing these gaps is crucial for enhancing patient care, optimizing resource utilization, and rebuilding trust in the healthcare system.

The goal is to address disparities by bridging diverse data sources, refining algorithms, and reducing biases to ensure equitable outcomes for patients across diverse race groups. Your work will help create a more just and effective healthcare environment, ensuring every patient receives the care they deserve.

## Dataset Description

The dataset consists of 59 variables related to hematopoietic stem cell transplantation (HSCT), encompassing a range of demographic and medical characteristics of both recipients and donors, such as age, sex, ethnicity, disease status, and treatment details. The primary outcome of interest is event-free survival, represented by the variable efs, while the time to event-free survival is captured by the variable efs_time. These two variables together encode the target for a censored time-to-event analysis. The data, which features equal representation across recipient racial categories including White, Asian, African-American, Native American, Pacific Islander, and More than One Race, was synthetically generated using the data generator from synthcity, trained on a large cohort of real CIBMTR data.


    train.csv - the training set, with target efs (Event-free survival)
    test.csv - the test set; your task is to predict the value of efs for this data
    sample_submission.csv - a sample submission file in the correct format with all predictions set to 0.50
    data_dictionary.csv - a list of all features and targets used in dataset and their descriptions
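
The `efs` / `efs_time` pair described above can be read as one censored target: `efs = 1` means the event was observed at `efs_time`, and `efs = 0` means the patient was censored at `efs_time`. A minimal sketch of packing the two columns into a NumPy structured array (the layout survival libraries typically expect) follows. The three rows are invented toy values, not competition data.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration): event flag and follow-up time.
toy = pd.DataFrame({"efs": [1.0, 0.0, 1.0], "efs_time": [5.2, 40.1, 12.7]})

# One record per patient: a boolean event indicator plus a float time.
y = np.empty(len(toy), dtype=[("efs", bool), ("efs_time", float)])
y["efs"] = toy["efs"].astype(bool).to_numpy()
y["efs_time"] = toy["efs_time"].to_numpy()

print(y)
```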


## Import Package

In [None]:
import numpy as np
import pandas as pd

## Import Dataset

In [None]:
# path_data_dictionary = "C:/Users/julia/Desktop/Yanjun/hct competition/data_dictionary.csv"
# path_test = "C:/Users/julia/Desktop/Yanjun/hct competition/test.csv"
path_train = "C:/Users/julia/Desktop/Yanjun/hct competition/train.csv"
# data_dictionary = pd.read_csv(path_data_dictionary)
# path_submission = "C:/Users/julia/Desktop/Yanjun/hct competition/sample_submission.csv"
# test = pd.read_csv(path_test)
train = pd.read_csv(path_train)
# submission = pd.read_csv(path_submission)

In [None]:
train.head(5)

In [None]:
# remove the ID column from train
train = train.drop(columns = 'ID')
import gc
gc.collect()


In [None]:
# check the correlation of missing data
import missingno as msno
msno.heatmap(train)
# from the plot, missingness in some variables is correlated with missingness in others
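
The heatmap above visualizes nullity correlation, i.e. how often two columns are missing together. A rough sketch of the same quantity in plain pandas, on an invented toy frame (column names `a` to `d` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' and 'b' are always missing together, 'c' is missing
# elsewhere, and 'd' is fully observed.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, 3.0, 4.0, np.nan, 6.0],
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isna().astype(int)
# Columns that are never (or always) missing have no variance, so drop them.
nullity = nullity.loc[:, nullity.nunique() > 1]
# Pearson correlation of the missingness indicators.
nullity_corr = nullity.corr()
print(nullity_corr)
```

A value near 1 means two variables tend to be missing in the same rows; a value near -1 means one tends to be present exactly when the other is missing.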

### Check the proportion of missing data

In [None]:
# percentage of missing values per column
missing = np.round(train.isna().sum()/len(train), 3) * 100
df_missing = pd.DataFrame(missing, columns=['values']).sort_values(by = 'values', ascending = True)

# map each missing-data percentage to a background color by severity band:
# >= 20% red, >= 5% orange, >= 1% yellow, otherwise green
def color_map(percent):
  cmap = []
  for x in percent:
    if x >= 20:
      temp = 'background-color: red'
    elif x >= 5:
      temp = 'background-color: orange'
    elif x >= 1:
      temp = 'background-color: yellow'
    else:
      temp = 'background-color: green'
    cmap.append(temp)
  return cmap
# Styler.apply passes each column as a Series, so color_map can be used directly
df_missing.style.apply(color_map, axis = 0)
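
As an alternative view of the same bands, the percentages can be bucketed with `pd.cut` and counted. This is only a sketch with invented percentages; the variable names `v1` to `v5` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented missing-data percentages for five hypothetical variables.
pct = pd.Series([0.5, 1.0, 5.0, 20.0, 35.0], index=["v1", "v2", "v3", "v4", "v5"])

# Left-closed bins reproduce the >= 1, >= 5, >= 20 thresholds used above.
bands = pd.cut(
    pct,
    bins=[-np.inf, 1, 5, 20, np.inf],
    right=False,
    labels=["green", "yellow", "orange", "red"],
)
print(bands.value_counts().sort_index())
```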

In [None]:
del missing, df_missing
gc.collect()

# Remove the missing data from train and assess the importance of variables using RandomSurvivalForest

### First, use clean data to find the important variables

In [None]:
# clean_data = train.copy()
# clean_data = clean_data.dropna()
# len(clean_data)

In [None]:
from sksurv.ensemble import RandomSurvivalForest
# use clean data find variables importance

# # change category variable to numerical variables in clean data
# clean_data = pd.get_dummies(data = clean_data, drop_first= True, dtype = int)
# # first balance the clean_data
# from imblearn.over_sampling import SMOTE

# smote = SMOTE(sampling_strategy = 'auto', random_state = 1)
# cond = clean_data.columns == 'efs'
# X_cond = clean_data.columns[~cond]
# X, y = smote.fit_resample(clean_data[X_cond], clean_data['efs'])


In [None]:
# new_clean = pd.concat([X, y], axis = 1)
# del X, y, clean_data, cond, smote
# gc.collect()

In [None]:
from sksurv.util import Surv
# use this new clean dataset to get the important variables 
# rsf = RandomSurvivalForest(n_estimators= 30, max_depth= 20, max_features= 'sqrt', random_state= 1)
# y = new_clean[['efs', 'efs_time']]
# Y = Surv.from_dataframe('efs', 'efs_time', y)
# cond = new_clean.columns.isin(['efs', 'efs_time'])
# X = new_clean[new_clean.columns[~cond]]

In [None]:
# del new_clean
# gc.collect()

In [None]:
# rsf.fit(X, Y)
# rsf.score(X, Y)

In [None]:
# use permutation importance to calculate importance of features
from sksurv.metrics import concordance_index_censored
from sklearn.inspection import permutation_importance
# create a dataframe of feature importance
# def C_score(estimator, X, y):
#   y_pred = estimator.predict(X)
#   # score against the y passed in by permutation_importance, not the global Y
#   return concordance_index_censored(y['efs'], y['efs_time'], y_pred)[0]
  
# feature_importance = permutation_importance(rsf, X = X, y = Y, scoring = C_score, n_repeats = 3, random_state = 1)
# feature_importance 

In [None]:
# mean_importance = feature_importance.importances_mean.mean()
# index = np.where(feature_importance.importances_mean >= mean_importance)
# import_variables_1 = X.columns[index]
# import_variables_1

In [None]:
# these variables would then be used to fit RandomSurvivalForest on the original dataset; first free the old objects (X, Y, y)
# del X, Y, y, index, feature_importance, mean_importance
# gc.collect()


#### The first method finds important variables from complete cases only, which is biased, so the full data is needed to find the important variables with RandomSurvivalForest
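
To see why dropping incomplete rows is costly, here is a back-of-the-envelope sketch. It assumes every column is missing independently with the same probability, which is a simplification (the heatmap earlier suggests missingness is correlated in this data), so the numbers are illustrative only. Under that assumption the fraction of rows that survive `dropna()` is $(1 - p)^k$ for $k$ columns.

```python
import numpy as np

def complete_case_fraction(p_missing: float, n_columns: int) -> float:
    """Expected share of fully observed rows under independent missingness."""
    return (1.0 - p_missing) ** n_columns

# Analytic value for 10 columns with 5% missing each.
expected = complete_case_fraction(0.05, 10)

# Quick simulation of the same setting as a sanity check.
rng = np.random.default_rng(1)
mask = rng.random((20000, 10)) < 0.05      # True where a value is missing
simulated = (~mask.any(axis=1)).mean()     # share of rows with no missing value

print(round(expected, 3), round(simulated, 3))
```

Even a modest per-column missing rate removes a large share of rows, and the rows that remain need not be representative of the full cohort.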

### Second, use the full data (samples) and RandomSurvivalForest to find the important variables

In [None]:
# balance the full original data by oversampling the minority efs class


from sklearn.utils import resample
y_counts = train.efs.value_counts()
# label of the less frequent class
minority = y_counts.idxmin()

# draw (majority count - minority count) extra minority rows with replacement
new = resample(train[train.efs == minority], replace = True, n_samples = int(y_counts.max() - y_counts.min()), random_state = 1).reset_index(drop = True)
train = pd.concat([train, new], axis = 0)
train = train.reset_index(drop = True)
train['efs'].value_counts()
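
One caveat with oversampling the whole table: if the data is split into train and test afterwards, copies of the same row can land on both sides, which tends to make the held-out score look better than it is. A hedged sketch of the alternative order (split first, then oversample only the training part) on invented toy data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data (made up): one feature and an imbalanced binary efs flag.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "x": rng.normal(size=200),
    "efs": (rng.random(200) < 0.3).astype(int),
})

# 1) Split first, so the test rows are never duplicated into training.
train_part, test_part = train_test_split(
    toy, test_size=0.3, random_state=1, stratify=toy["efs"]
)

# 2) Oversample the minority class inside the training part only.
counts = train_part["efs"].value_counts()
minority = counts.idxmin()
extra = resample(
    train_part[train_part["efs"] == minority],
    replace=True,
    n_samples=int(counts.max() - counts.min()),
    random_state=1,
)
train_balanced = pd.concat([train_part, extra], axis=0)
print(train_balanced["efs"].value_counts())
```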

In [None]:
del y_counts, minority, new
gc.collect()

In [None]:
train1 = train.copy()

In [None]:
# one-hot encode the categorical variables (drop_first removes one redundant level per variable)
train1 = pd.get_dummies(train1, drop_first= True, dtype = float)
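
A detail worth checking at this step: with its default `dummy_na=False`, `pd.get_dummies` does not keep `NaN` in categorical columns. A missing category becomes a row of zeros, so a later imputer has nothing left to impute for those columns. The sketch below shows the effect on a tiny invented column (`donor_sex` is a hypothetical name used only for illustration).

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (third row).
toy = pd.DataFrame({"donor_sex": ["F", "M", np.nan, "M"]})

# Default behaviour: the missing row is encoded as all zeros,
# which makes it indistinguishable from the dropped level "F".
default = pd.get_dummies(toy, drop_first=True, dtype=float)

# dummy_na=True adds an explicit indicator column for missing values.
with_na = pd.get_dummies(toy, dummy_na=True, dtype=float)

print(default)
print(with_na)
```

Whether to add the indicator column or to set the dummy columns back to `NaN` before imputation is a modelling choice; the sketch only makes the default behaviour visible.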

In [None]:
from sksurv.ensemble import RandomSurvivalForest
# define randomforestsurvival
# randomsurvival = RandomSurvivalForest(n_estimators= 10, max_depth = 15, random_state= 1, max_features= 'sqrt')

from sksurv.util import Surv
from sksurv.metrics import concordance_index_censored
from sklearn.inspection import permutation_importance

# # use small sample from train1 to calculate the important variables
# small_train1 = train1.sample(frac= 0.05, random_state = 1)
# cond = train1.columns.isin(['efs', 'efs_time'])
# small_X = small_train1[small_train1.columns[~cond]]
# small_y = train1.loc[small_X.index, ['efs', 'efs_time']]
# small_y = Surv.from_dataframe('efs', 'efs_time',small_y )

# del small_train1
# gc.collect()


In [None]:
# trainydataframe = train1[['efs','efs_time']]
# trainydataframe = Surv.from_dataframe('efs', 'efs_time', trainydataframe)
# randomsurvival.fit(train1[train1.columns[~cond]],trainydataframe)
# randomsurvival.score(train1[train1.columns[~cond]],trainydataframe)

In [None]:
# importance_feature = permutation_importance(estimator= randomsurvival, X = small_X, y = small_y,random_state= 1 , n_repeats = 3)

In [None]:
# del randomsurvival
# gc.collect()

In [None]:
# build a dataframe for feature importance
# importance = pd.DataFrame(data = importance_feature.importances_mean, index = small_X.columns,columns = ['importance1'])

In [None]:
# del importance_feature
# gc.collect()

In [None]:
# del trainydataframe
# gc.collect()

In [None]:
# # reorder variables according to their importance
# importance = importance.sort_values(by = 'importance1', ascending= False)

In [None]:
# mean_ = importance.importance1.mean()
# cond = importance.importance1 > mean_ * 0.3
# important_variables = importance[cond].index

In [None]:
# del mean_, cond, small_X, small_y
# gc.collect()

## Use KNN to impute the new dataset (important variables)

In [None]:
train1= train1.astype(float)

from sklearn.impute import KNNImputer
from sklearn.preprocessing  import  StandardScaler
scaler = StandardScaler()
cond = train1.columns.isin(['efs', 'efs_time'])
train1.loc[:,~cond] = scaler.fit_transform(train1.loc[:,~cond])

# use knn to impute missing data
imputer = KNNImputer(n_neighbors =  5)
train1_complete = imputer.fit_transform(train1)
train1_complete = pd.DataFrame(train1_complete, columns= train1.columns)
train1_complete
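
The cell above fits the scaler and the imputer on the whole table, targets included. A possible variant, sketched here on invented data, fits both on the training features only and then applies them to the held-out features, so that no test information enters the preprocessing. The column names and the 10% missing rate are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature table with roughly 10% of the values knocked out.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
X = X.mask(rng.random(X.shape) < 0.1)

X_train, X_test = train_test_split(X, test_size=0.3, random_state=1)

# StandardScaler ignores NaN when fitting and passes it through,
# so it can run before the KNN imputer.
prep = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=5))

X_train_imp = pd.DataFrame(
    prep.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    prep.transform(X_test), columns=X.columns, index=X_test.index
)
print(X_train_imp.isna().sum().sum(), X_test_imp.isna().sum().sum())
```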

In [None]:
# del important_X
gc.collect()

In [None]:
# combine important_complete with y
# important_complete['efs'] = train1['efs']
# important_complete['efs_time'] = train1['efs_time']

In [None]:
# # remove the outlier
# from scipy import stats
# z_scores = np.abs(stats.zscore(train1_complete))
# threshold = 3
# # train1_complete = train1_complete[(z_scores < threshold).all(axis = 1)]
# train1_complete[(z_scores < threshold).all(axis = 1)]

In [None]:
# split the imputed data (train1_complete) into train and test sets, stratified on efs
from sklearn.model_selection import train_test_split


train_complete, test_complete = train_test_split(train1_complete, test_size= 0.3, random_state= 1, shuffle= True, stratify= train1_complete['efs'])


In [None]:
# use randomforestsurvival to train the data
cond = train_complete.columns.isin(['efs', 'efs_time'])

train_y = Surv.from_dataframe('efs', 'efs_time', train_complete)
test_y = Surv.from_dataframe('efs', 'efs_time', test_complete)
RFS = RandomSurvivalForest(n_estimators= 15, random_state= 1, max_depth= 15, max_features= 'sqrt')

# use model above to fit the data
RFS.fit(train_complete[train_complete.columns[~cond]], train_y)

In [None]:
RFS.score(train_complete[train_complete.columns[~cond]], train_y)

In [None]:
RFS.score(test_complete[test_complete.columns[~cond]], test_y)
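
The `score` calls above report a concordance index: the share of comparable patient pairs in which the patient with the earlier observed event also received the higher predicted risk. The following hand-rolled version is a simplified sketch for intuition only. It assumes no tied event times and is not the library's implementation; `harrell_c_index` is a name introduced here.

```python
import numpy as np

def harrell_c_index(event, time, risk):
    """Simplified Harrell's C (assumes no tied times)."""
    event = np.asarray(event, dtype=bool)
    time = np.asarray(time, dtype=float)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        later = time > time[i]  # everyone still event-free after patient i's event
        comparable += int(later.sum())
        concordant += float((risk[i] > risk[later]).sum())
        concordant += 0.5 * float((risk[i] == risk[later]).sum())  # tied risks count half
    return concordant / comparable

# Toy example: the second patient is censored.
c_toy = harrell_c_index(
    event=[True, False, True, True],
    time=[1.0, 2.0, 3.0, 4.0],
    risk=[4.0, 1.0, 2.0, 3.0],
)
print(c_toy)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a large gap between the training and held-out scores is a sign of overfitting.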

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 100)
rfe = RFE(estimator=model)
fit = rfe.fit(train_complete[train_complete.columns[~cond]], train_complete['efs'])

In [None]:
ranking = rfe.ranking_
# Sort the features by their RFE ranking (1 = selected, larger = eliminated earlier)
sorted_features = sorted(zip(train_complete.columns[~cond], ranking), key=lambda x: x[1])
# Keep the features whose rank is at most 47
important_variables = []
for feature, rank in sorted_features:
    if rank < 48:
        important_variables.append(feature)
print("Features kept (RFE rank <= 47):")
important_variables
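
For reference, `RFE` without `n_features_to_select` keeps half of the features and assigns them rank 1; every eliminated feature gets a rank from 2 upwards in reverse order of removal. The rank cut-off used above therefore keeps the selected half plus the most recently eliminated features. A small sketch on synthetic data that sets the number of features explicitly (the data and the choice of 4 features are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic classification data: 10 features, 4 of them informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=1
)

# Ask RFE for exactly 4 features; one feature is removed per round.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=25, random_state=1),
    n_features_to_select=4,
)
selector.fit(X, y)

# support_ is a boolean mask of the kept features (those with rank 1).
selected = [i for i, keep in enumerate(selector.support_) if keep]
print(selected, selector.ranking_)
```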

In [None]:
del sorted_features, ranking
gc.collect()

## Use batch processing to train RandomSurvivalForest on important variables

In [None]:
# first save the important variables plus efs and efs_time for train and test to new files, then delete the old datasets
important_variables.append('efs')
important_variables.append('efs_time')

In [None]:
train_complete[important_variables].to_csv("C:/Users/julia/Desktop/Yanjun/new_data.csv", index = False)
test_complete[important_variables].to_csv("C:/Users/julia/Desktop/Yanjun/new_test_data.csv", index = False)

In [None]:
del train1, train1_complete, train_complete, test_complete, important_variables, rfe, fit, model, RFS, train_y, test_y
gc.collect()

Use chunks to train RandomSurvivalForest

In [None]:
chunks = pd.read_csv("C:/Users/julia/Desktop/Yanjun/new_data.csv", chunksize = 1000 )

# define a function that splits a chunk into its features and a Surv target

def surv_y_dataframe(chunk):
  cond = chunk.columns.isin(['efs', 'efs_time'])
  y_chunk = Surv.from_dataframe('efs', 'efs_time', chunk)
  x_chunk = chunk[chunk.columns[~cond]]
  return x_chunk, y_chunk 

In [None]:
# define a random survival forest 
estmators = 100
RSF = RandomSurvivalForest(n_estimators= estmators,  random_state= 1, warm_start= True, max_depth= 15, min_samples_split= 20, min_samples_leaf= 10,  max_features= 'sqrt')

# train the first chunk on randomforest
first_chunk = next(chunks)
x_first_chunk, y_first_chunk = surv_y_dataframe(first_chunk)

RSF.fit(x_first_chunk, y_first_chunk)

In [None]:
RSF.score(x_first_chunk, y_first_chunk)

In [None]:
# increase the number of trees incrementally with each batch update
for chunk in chunks:
  x_chunk, y_chunk = surv_y_dataframe(chunk)
  estmators = estmators + 10
  RSF.n_estimators = estmators
  
  # fit update model with new chunk
  RSF.fit(x_chunk, y_chunk)
  print(RSF.score(x_chunk, y_chunk))
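
It helps to be clear about what `warm_start` does in a loop like this. In scikit-learn forests, a warm-started `fit` keeps the existing trees untouched and grows only the additional trees on the data passed to that call. The sketch below demonstrates this with `RandomForestRegressor` on random toy data. Swapping in the scikit-learn regressor is a simplification for the example, and the assumption that `RandomSurvivalForest` follows the same convention should be checked against the scikit-survival documentation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Two toy "chunks" of random data.
rng = np.random.default_rng(1)
X1, y1 = rng.normal(size=(100, 3)), rng.normal(size=100)
X2, y2 = rng.normal(size=(100, 3)), rng.normal(size=100)

forest = RandomForestRegressor(n_estimators=5, warm_start=True, random_state=1)
forest.fit(X1, y1)
first_trees = list(forest.estimators_)  # trees grown on chunk 1

# Ask for 3 more trees and fit on the second chunk.
forest.n_estimators += 3
forest.fit(X2, y2)

# The first 5 trees are the very same objects; only 3 new trees saw chunk 2.
print(len(forest.estimators_))
```

Under this behaviour, each chunk only influences the trees added during its own iteration, so the final ensemble is a mix of small forests trained on 1000 rows each rather than one forest trained on all rows.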

In [None]:
test = pd.read_csv("C:/Users/julia/Desktop/Yanjun/new_test_data.csv")
x_test, y_test = surv_y_dataframe(test)
RSF.score(x_test, y_test)