<!-- <p align="center">
  <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Logo-gustave-roussy.jpg/1200px-Logo-gustave-roussy.jpg" alt="Logo 1" width="250"/>
  <img src="https://upload.wikimedia.org/wikipedia/en/thumb/3/3f/Qube_Research_%26_Technologies_Logo.svg/1200px-Qube_Research_%26_Technologies_Logo.svg.png" alt="Logo 2" width="200" style="margin-left: 20px;"/>
</p> -->

# Data Challenge: Leukemia Risk Prediction


*GOAL OF THE CHALLENGE and WHY IT IS IMPORTANT:*

The goal of the challenge is to **predict disease risk for patients with blood cancer**, in the context of specific subtypes of adult myeloid leukemias.

The risk is measured through the **overall survival** of patients, i.e. the duration of survival from the diagnosis of the blood cancer to the time of death or last follow-up.

Estimating patient prognosis is critical for optimal clinical management. 
For example, patients with low-risk disease will be offered supportive care to improve blood counts and quality of life, while patients with high-risk disease will be considered for hematopoietic stem cell transplantation.

The performance metric used in the challenge is the **IPCW-C-Index**.

*THE DATASETS*

The **training set is made of 3,323 patients**.

The **test set is made of 1,193 patients**.

For each patient, you have access to CLINICAL data and MOLECULAR data.

The details of the data are as follows:

- OUTCOME:
  * OS_YEARS = Overall survival time in years
  * OS_STATUS = 1 (death), 0 (alive at the last follow-up)

- CLINICAL DATA, with one line per patient:
  
  * ID = unique identifier per patient
  * CENTER = clinical center
  * BM_BLAST = Bone marrow blasts in % (blasts are abnormal blood cells)
  * WBC = White Blood Cell count in Giga/L 
  * ANC = Absolute Neutrophil count in Giga/L
  * MONOCYTES = Monocyte count in Giga/L
  * HB = Hemoglobin in g/dL
  * PLT = Platelet count in Giga/L
  * CYTOGENETICS = A description of the karyotype observed in the blood cells of the patient, as reported by a cytogeneticist. Cytogenetics is the study of chromosomes. The karyotype is obtained from the tumoral blood cells. The notation follows the ISCN convention (https://en.wikipedia.org/wiki/International_System_for_Human_Cytogenomic_Nomenclature); see also https://en.wikipedia.org/wiki/Cytogenetic_notation. Note that a karyotype can be normal or abnormal. The notation 46,XX denotes a normal karyotype in females (23 pairs of chromosomes, including 2 X chromosomes) and 46,XY in males (23 pairs of chromosomes, including 1 X chromosome and 1 Y chromosome). A common abnormality in cancerous blood cells is, for example, the loss of chromosome 7 (monosomy 7, or -7), which is typically associated with higher-risk disease.
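
To make the ISCN notation above concrete, here is a minimal sketch of how a karyotype string could be split into clones and basic fields. The helper name `parse_karyotype` and the returned keys are illustrative assumptions, not part of the challenge material, and real karyotypes can be much messier than what this handles.

```python
import re

def parse_karyotype(karyotype):
    """Summarise the first clone of an ISCN-like karyotype string (hypothetical helper)."""
    # Clones are separated by "/", each optionally followed by a cell count like "[18]"
    clones = karyotype.split("/")
    first = re.sub(r"\[\d+\]$", "", clones[0].strip())
    # Within a clone, fields are comma-separated: count, sex chromosomes, abnormalities
    fields = first.split(",")
    abnormalities = fields[2:]
    return {
        "n_clones": len(clones),
        "chromosome_count": int(fields[0]) if fields[0].isdigit() else None,
        "sex": fields[1].upper() if len(fields) > 1 else None,
        "abnormalities": abnormalities,
        "monosomy_7": "-7" in abnormalities,
    }

example = parse_karyotype("46,xy,del(20)(q12)[2]/46,xy[18]")
monosomy = parse_karyotype("45,xy,-7")
```

Strings that do not follow the convention (for instance free text) simply yield `None` for the chromosome count in this sketch.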

In [None]:
## side note: the measurements are on different scales, so take care to normalize and standardize the data

## 2. exploit the CYTOGENETIC DATA

- GENE MOLECULAR DATA, with one line per patient per somatic mutation. Mutations are detected from the sequencing of the blood tumoral cells. 
We call somatic (= acquired) mutations the mutations that are found in the tumoral cells but not in other cells of the body.

  * ID = unique identifier per patient
  * CHR START END = position of the mutation on the human genome
  * REF ALT = reference and alternate (=mutant) nucleotide
  * GENE = the affected gene
  * PROTEIN_CHANGE = the consequence of the mutation on the protein that is expressed by a given gene
  * EFFECT = a broad categorization of the mutation consequences on a given gene.
  * VAF = Variant Allele Fraction = the **proportion** of sequencing reads carrying the mutant allele, used as a proxy for the proportion of cells carrying the mutation.
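
Because the molecular table has one row per patient per mutation, it has to be collapsed to one row per patient before it can be joined to the clinical table. The sketch below shows one possible aggregation on a made-up toy table; the column names `n_mutations` and `max_vaf` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy molecular table: one row per patient per mutation (values are made up)
toy_maf = pd.DataFrame({
    "ID": ["P1", "P1", "P2"],
    "GENE": ["TP53", "ASXL1", "TET2"],
    "VAF": [0.40, 0.10, 0.25],
})

# Collapse to one row per patient: number of mutations and largest VAF
per_patient = (
    toy_maf.groupby("ID")
    .agg(n_mutations=("GENE", "count"), max_vaf=("VAF", "max"))
    .reset_index()
)
```

Patients without any reported mutation do not appear in such a table, so a left merge onto the clinical data followed by filling the counts with 0 would be needed.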

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored , concordance_index_ipcw
from sklearn.impute import SimpleImputer
from sksurv.util import Surv

import os
import re
# print(os.listdir("."))

# Clinical Data
# TODO: adapt to the local path
df = pd.read_csv("../data/X_train/clinical_train.csv")
df_eval = pd.read_csv("../data/X_test/clinical_test.csv")

# Molecular Data
maf_df = pd.read_csv("../data/X_train/molecular_train.csv")
maf_eval = pd.read_csv("../data/X_test/molecular_test.csv")

target_df = pd.read_csv("../data/target_train.csv")
# TODO: will be provided on March 15
# target_df_test = pd.read_csv("./target_test.csv")

# Preview the data
df.head()

Unnamed: 0,ID,CENTER,BM_BLAST,WBC,ANC,MONOCYTES,HB,PLT,CYTOGENETICS
0,P132697,MSK,14.0,2.8,0.2,0.7,7.6,119.0,"46,xy,del(20)(q12)[2]/46,xy[18]"
1,P132698,MSK,1.0,7.4,2.4,0.1,11.6,42.0,"46,xx"
2,P116889,MSK,15.0,3.7,2.1,0.1,14.2,81.0,"46,xy,t(3;3)(q25;q27)[8]/46,xy[12]"
3,P132699,MSK,1.0,3.9,1.9,0.1,8.9,77.0,"46,xy,del(3)(q26q27)[15]/46,xy[5]"
4,P132700,MSK,6.0,128.0,9.7,0.9,11.1,195.0,"46,xx,t(3;9)(p13;q22)[10]/46,xx[10]"


### Step 1: Data Preparation (clinical data only)

For survival analysis, we’ll format the dataset so that OS_YEARS represents the time variable and OS_STATUS represents the event indicator.

In [2]:
### (sort of) Data Cleaning and Preprocessing

# Drop rows where 'OS_YEARS' is NaN if conversion caused any issues
# initial shape
# print(target_df.shape)
# drop rows with missing values
target_df.dropna(subset=['OS_YEARS', 'OS_STATUS'], inplace=True)
# final shape
# print(target_df.shape)
# percentage of rows dropped:
# (uses the clinical table's row count as reference, assuming it initially lists the same patients as target_df)
print(f'Percentage of initially dropped rows: {(1 - target_df.shape[0] / df.shape[0]) * 100:.2f}%')

# Check the data types of 'OS_STATUS' and 'OS_YEARS' before conversion
print(target_df[['OS_STATUS', 'OS_YEARS']].dtypes)

# Convert 'OS_YEARS' to numeric if it isn’t already
target_df['OS_YEARS'] = pd.to_numeric(target_df['OS_YEARS'], errors='coerce')

# Ensure 'OS_STATUS' is boolean
target_df['OS_STATUS'] = target_df['OS_STATUS'].astype(bool)

Percentage of initially dropped rows: 4.51%
OS_STATUS    float64
OS_YEARS     float64
dtype: object


In [3]:
### Feature Selection :

# Zero : features they selected for the Benchmark model :
features_basic = ['BM_BLAST', 'HB', 'PLT']
# Accuracies for each implemented model :
# Benchmark LightGBM: 
# Benchmark CoxPH:
# Benchmark RandomSurvivalForest:
# Benchmark when adding the Nmut count as a feature :


# First: Naively add all the features  (except the Gene column):
# features= ['BM_BLAST', 'HB', 'PLT', 'WBC', 'ANC', 'MONOCYTES']
# Very naive : slight improvement

# Second : construct some features based on scientific knowledge, and add them to the dataframe and the features list
df['BLAST_per_WBC'] = df['BM_BLAST'] / df['WBC']
df['ANC_per_WBC'] = df['ANC'] / df['WBC']
# df['MONOCYTES_per_WBC'] = df['MONOCYTES'] / df['WBC']           # too many missing values: dropped for now, see later how to exploit it
df["PL_per_HB"] = df['PLT'] / df['HB']
# add these features to the features list
features_additional = ['BLAST_per_WBC', 'ANC_per_WBC', 'PL_per_HB']

df.head()

Unnamed: 0,ID,CENTER,BM_BLAST,WBC,ANC,MONOCYTES,HB,PLT,CYTOGENETICS,BLAST_per_WBC,ANC_per_WBC,PL_per_HB
0,P132697,MSK,14.0,2.8,0.2,0.7,7.6,119.0,"46,xy,del(20)(q12)[2]/46,xy[18]",5.0,0.071429,15.657895
1,P132698,MSK,1.0,7.4,2.4,0.1,11.6,42.0,"46,xx",0.135135,0.324324,3.62069
2,P116889,MSK,15.0,3.7,2.1,0.1,14.2,81.0,"46,xy,t(3;3)(q25;q27)[8]/46,xy[12]",4.054054,0.567568,5.704225
3,P132699,MSK,1.0,3.9,1.9,0.1,8.9,77.0,"46,xy,del(3)(q26q27)[15]/46,xy[5]",0.25641,0.487179,8.651685
4,P132700,MSK,6.0,128.0,9.7,0.9,11.1,195.0,"46,xx,t(3;9)(p13;q22)[10]/46,xx[10]",0.046875,0.075781,17.567568
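
The ratio features above divide by `WBC` and `HB`. If a denominator were exactly zero, pandas would return `inf`, which later steps such as imputation may reject or mishandle. The sketch below shows one way to guard against this; `safe_ratio` is a hypothetical helper, the toy values are made up, and whether zeros actually occur in this dataset has not been checked here.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields NaN instead of inf for zero denominators."""
    # Series.where keeps values where the condition holds and sets the rest to NaN
    return numerator / denominator.where(denominator != 0)

toy = pd.DataFrame({"BM_BLAST": [14.0, 1.0, 6.0], "WBC": [2.8, 0.0, np.nan]})
toy["BLAST_per_WBC"] = safe_ratio(toy["BM_BLAST"], toy["WBC"])
```

Zero and missing denominators both end up as NaN, so they can be handled by the same imputation step.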


In [4]:
# Same for df_eval
df_eval['BLAST_per_WBC'] = df_eval['BM_BLAST'] / df_eval['WBC']
df_eval['ANC_per_WBC'] = df_eval['ANC'] / df_eval['WBC']
df_eval["PL_per_HB"] = df_eval['PLT'] / df_eval['HB']

In [5]:
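# Third approach: exploit the cytogenetic data
import re
# NOTE: categorize_cytogenetics upper-cases its input before matching, so the
# lowercase patterns (del, t, i, complex) never match as written; only -7, -5,
# -17 and +8 are effectively detected. Passing flags=re.IGNORECASE to re.search
# would enable them, but would change the feature and the scores reported below.
high_risk_patterns = [r"-7", r"-5", r"-17", r"del\(5\)", r"del\(17\)", r"del\(20\)", r"del\(9\)",
                      r"\+8", r"t\(3;3\)", r"complex", r"t\(9;11\)", r"i\(17\)"]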

def categorize_cytogenetics(cytogenetics):
    if pd.isna(cytogenetics) or cytogenetics.strip() == "":
        return "Unknown"
    cytogenetics = cytogenetics.upper()
    for pattern in high_risk_patterns:
        if re.search(pattern, cytogenetics):
            return "High_Risk"
    return "Low_Intermediate"

df["CYTO_RISK"] = df["CYTOGENETICS"].apply(categorize_cytogenetics)
df = pd.get_dummies(df, columns=["CYTO_RISK"])
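
In the cell above, the input is upper-cased while several patterns are written in lowercase, so a karyotype such as `46,xy,del(20)(q12)` is not flagged. A minimal sketch of a case-insensitive variant follows. The name `categorize_cytogenetics_ci` is hypothetical, the pattern list is copied unchanged from the cell above, and switching to this variant would change the `CYTO_RISK_High_Risk` feature and therefore the scores reported in this notebook.

```python
import re

# Same pattern list as in the notebook cell, compiled once and matched case-insensitively
HIGH_RISK_PATTERNS = [r"-7", r"-5", r"-17", r"del\(5\)", r"del\(17\)", r"del\(20\)", r"del\(9\)",
                      r"\+8", r"t\(3;3\)", r"complex", r"t\(9;11\)", r"i\(17\)"]
HIGH_RISK_RE = re.compile("|".join(HIGH_RISK_PATTERNS), flags=re.IGNORECASE)

def categorize_cytogenetics_ci(cytogenetics):
    """Case-insensitive variant of the categorisation (illustrative sketch)."""
    # Missing values (NaN, None) and empty strings are reported as "Unknown"
    if not isinstance(cytogenetics, str) or not cytogenetics.strip():
        return "Unknown"
    if HIGH_RISK_RE.search(cytogenetics):
        return "High_Risk"
    return "Low_Intermediate"

labels = [categorize_cytogenetics_ci(k) for k in
          ["46,xy,del(20)(q12)[2]/46,xy[18]", "46,xx", "47,XX,+8", float("nan")]]
```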


In [6]:
# Drop the "Unknown" and "Low_Intermediate" columns
# Is keeping only the "High_Risk" column enough?
df.drop(columns=["CYTO_RISK_Unknown"], inplace=True)
df.drop(columns=["CYTO_RISK_Low_Intermediate"], inplace=True)

In [7]:
features_cytogenetics = ['CYTO_RISK_High_Risk']
df.head()

Unnamed: 0,ID,CENTER,BM_BLAST,WBC,ANC,MONOCYTES,HB,PLT,CYTOGENETICS,BLAST_per_WBC,ANC_per_WBC,PL_per_HB,CYTO_RISK_High_Risk
0,P132697,MSK,14.0,2.8,0.2,0.7,7.6,119.0,"46,xy,del(20)(q12)[2]/46,xy[18]",5.0,0.071429,15.657895,False
1,P132698,MSK,1.0,7.4,2.4,0.1,11.6,42.0,"46,xx",0.135135,0.324324,3.62069,False
2,P116889,MSK,15.0,3.7,2.1,0.1,14.2,81.0,"46,xy,t(3;3)(q25;q27)[8]/46,xy[12]",4.054054,0.567568,5.704225,False
3,P132699,MSK,1.0,3.9,1.9,0.1,8.9,77.0,"46,xy,del(3)(q26q27)[15]/46,xy[5]",0.25641,0.487179,8.651685,False
4,P132700,MSK,6.0,128.0,9.7,0.9,11.1,195.0,"46,xx,t(3;9)(p13;q22)[10]/46,xx[10]",0.046875,0.075781,17.567568,False


In [8]:
# Same transformation for df_eval
df_eval["CYTO_RISK"] = df_eval["CYTOGENETICS"].apply(categorize_cytogenetics)
df_eval = pd.get_dummies(df_eval, columns=["CYTO_RISK"])
df_eval.drop(columns=["CYTO_RISK_Unknown"], inplace=True)
df_eval.drop(columns=["CYTO_RISK_Low_Intermediate"], inplace=True)

In [9]:
# # Third Approach - Feature Engineering: construct a feature based on the CYTOGENETIC DATA

# # TODO: use regular expressions
# # the classes are poorly defined

# # # Define risk groups based on cytogenetic data
# # high_risk_karyotypes = ['-7', 'del(5q)', 't(3;3)', 'complex karyotype']
# # low_risk_karyotypes = ['46,xx', '46,xy']

# # # Function to categorize karyotypes
# # def categorize_cytogenetics(cytogenetics):
# #     if any(hr in cytogenetics for hr in high_risk_karyotypes):
# #         return 'High_Risk'
# #     elif any(lr in cytogenetics for lr in low_risk_karyotypes):
# #         return 'Low_Risk'
# #     else:
# #         return 'Unknown'

# # TODO: normalize the expressions in the CYTOGENETICS column (look up the standard notation) before categorizing?

# low_risk_karyotypes = ['46,xx', '46,xy']
# def categorize_cytogenetics(cytogenetics):
#     if any(lr in cytogenetics for lr in low_risk_karyotypes):
#         return 'Low_Risk'
#     else:
#         return 'High_Risk'


# # Apply the function to create a new categorical column
# df['CYTO_RISK'] = df['CYTOGENETICS'].fillna('Unknown').apply(categorize_cytogenetics)
# df_eval['CYTO_RISK'] = df_eval['CYTOGENETICS'].fillna('Unknown').apply(categorize_cytogenetics)


# # One-Hot Encode the new categorical column
# df = pd.get_dummies(df, columns=['CYTO_RISK'], drop_first=True)
# df_eval = pd.get_dummies(df_eval, columns=['CYTO_RISK'], drop_first=True)

# # Add new categorical features to feature list
# cyto_features = ['CYTO_RISK_High_Risk', 'CYTO_RISK_Unknown']
# features = features_basic + cyto_features

# print("Updated feature set with cytogenetics:", features)

# df.head()

# # TODO : why did they add the 'Unknown' category ?

In [10]:
# features = features_basic
# features = features_basic + features_additional

features = features_basic + features_additional + features_cytogenetics
print("The features are:", features)
target = ['OS_YEARS', 'OS_STATUS']

# Create the survival data format
X = df.loc[df['ID'].isin(target_df['ID']), features]
y = Surv.from_dataframe('OS_STATUS', 'OS_YEARS', target_df)

print(target_df[['OS_STATUS', 'OS_YEARS']].dtypes)

The features are: ['BM_BLAST', 'HB', 'PLT', 'BLAST_per_WBC', 'ANC_per_WBC', 'PL_per_HB', 'CYTO_RISK_High_Risk']
OS_STATUS       bool
OS_YEARS     float64
dtype: object
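
The cell above selects `X` with `isin`, which keeps the row order of `df`, while `y` follows the row order of `target_df`. The two line up only if both files list the patients in the same order. A minimal sketch of an explicit alignment by merging on `ID` is shown below, using made-up toy frames; it illustrates the idea and is not the method used to produce the results in this notebook.

```python
import pandas as pd

# Toy tables whose rows are deliberately in a different order (values are made up)
clinical = pd.DataFrame({"ID": ["P1", "P2", "P3"], "HB": [7.6, 11.6, 14.2]})
target = pd.DataFrame({
    "ID": ["P3", "P1"],
    "OS_YEARS": [2.0, 0.5],
    "OS_STATUS": [True, False],
})

# An inner merge keeps only patients present in both tables, one aligned row each
merged = clinical.merge(target, on="ID", how="inner")
X_aligned = merged[["HB"]]
y_frame = merged[["OS_STATUS", "OS_YEARS"]]
```

`y_frame` could then be passed to `Surv.from_dataframe('OS_STATUS', 'OS_YEARS', y_frame)` so that features and outcomes share the same row order by construction.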


### Step 2: Splitting the Dataset
We’ll split the data into training and testing sets to evaluate the model’s performance.

In [11]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [12]:
# Compute percentage of missing values per column
missing_percentage = X_train.isnull().mean() * 100
# Print the missing values percentage (overall, not broken down by category)
print(missing_percentage)


# Median imputation for missing values (fitted on the training split only)
imputer = SimpleImputer(strategy="median")
X_train[features] = imputer.fit_transform(X_train[features])
X_test[features] = imputer.transform(X_test[features])

# X_train[features_cytogenetics] = imputer.fit_transform(X_train[features_cytogenetics])
# X_test[features_cytogenetics] = imputer.transform(X_test[features_cytogenetics])

# # do the same for the additional features
# X_train[features_additional] = imputer.fit_transform(X_train[features_additional])
# X_test[features_additional] = imputer.transform(X_test[features_additional])

# Note: the additional features have more missing values:
# WBC           7.3 %
# ANC           4.6 %
# MONOCYTES    16.3 %

BM_BLAST               2.746511
HB                     2.566412
PLT                    3.061684
BLAST_per_WBC          8.239532
ANC_per_WBC            8.374606
PL_per_HB              3.151733
CYTO_RISK_High_Risk    0.000000
dtype: float64


In [13]:
# TODO: maybe drop the data where a lot of values are missing?

### Step 3: Training Standard Machine Learning Methods

In this step, we train a standard LightGBM model on the survival data without accounting for censoring. We ignore the event status and use only the observed survival times as the target variable. This disregards whether an individual’s event (e.g., death) was observed or censored, effectively treating the problem as a standard regression task. While this provides a basic benchmark, it may be less accurate than survival-specific models, since it does not use the information carried by censored observations; it is still worth exploring.

In [14]:
# Import necessary libraries
import lightgbm as lgb
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

# Define LightGBM parameters
lgbm_params = {
    'max_depth': 3,         # TODO : when adding features try to increase the depth of the tree
    'learning_rate': 0.05,
    'verbose': -1
}

# Prepare the data for LightGBM
# The raw survival time (OS_YEARS) is used as the regression target; no scaling or event-based weighting is applied
X_train_lgb = X_train  # Features for training
y_train_transformed = y_train['OS_YEARS']

# Create LightGBM dataset
train_dataset = lgb.Dataset(X_train_lgb, label=y_train_transformed)

# Train the LightGBM model
model = lgb.train(params=lgbm_params, train_set=train_dataset)

# Make predictions on the training and testing sets (negated: a longer predicted survival means a lower risk score)
pred_train = -model.predict(X_train)
pred_test = -model.predict(X_test)

# Evaluate the model using Concordance Index IPCW
train_ci_ipcw = concordance_index_ipcw(y_train, y_train, pred_train, tau=7)[0]
test_ci_ipcw = concordance_index_ipcw(y_train, y_test, pred_test, tau=7)[0]
print(f"LightGBM Survival Model Concordance Index IPCW on train: {train_ci_ipcw:.2f}")
print(f"LightGBM Survival Model Concordance Index IPCW on test: {test_ci_ipcw:.2f}")


LightGBM Survival Model Concordance Index IPCW on train: 0.70
LightGBM Survival Model Concordance Index IPCW on test: 0.66
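
The predictions above are negated because concordance indices expect a higher score to mean a higher risk, i.e. a shorter survival. The sketch below implements plain Harrell's C-index on a toy example to show the effect of the sign. It is not the IPCW variant used for scoring, it ignores tied event times, and the toy values are made up.

```python
def harrell_c(times, events, risk):
    """Plain Harrell's C-index: share of comparable pairs ranked in the right order."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0  # earlier death has the higher risk score
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count for half
    return concordant / comparable

times = [1.0, 2.0, 3.0, 4.0]
events = [True, True, False, True]
predicted_years = [1.5, 2.5, 2.0, 5.0]

c_risk = harrell_c(times, events, [-p for p in predicted_years])  # negated: risk score
c_years = harrell_c(times, events, predicted_years)               # sign left unchanged
```

On this toy data the negated predictions rank four of the five comparable pairs correctly, while the raw predicted years give the mirror-image score.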


**Observation:** the model starts to overfit the training set when features are added.

In [15]:
# Visualizing the first tree of the LightGBM model
# 
# # Assuming the LightGBM model is defined as `model`
# plt.figure(figsize=(20, 10))
# lgb.plot_tree(model, tree_index=0, figsize=(20, 10), show_info=['split_gain', 'internal_value', 'internal_count', 'leaf_count'])
# plt.title("First Tree in LightGBM Model")
# plt.show()

### Step 4: Cox Proportional Hazards Model

To account for censoring in survival analysis, we use a Cox Proportional Hazards (Cox PH) model, a widely used method that estimates the effect of covariates on survival times without assuming a specific baseline survival distribution. The Cox PH model is based on the hazard function, $h(t | X)$, which represents the instantaneous risk of an event (e.g., death) at time $t$ given covariates $X$. The model assumes that the hazard can be expressed as:

$$h(t | X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)$$


where $h_0(t)$ is the baseline hazard function, and $\beta$ values are coefficients for each covariate, representing the effect of $X$ on the hazard. Importantly, the proportional hazards assumption implies that the hazard ratios between individuals are constant over time. This approach effectively leverages both observed and censored survival times, making it a more suitable method for survival data compared to standard regression techniques that ignore censoring.
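
To illustrate the proportional hazards assumption, the sketch below computes the hazard ratio between two patients and checks that it does not depend on the baseline hazard $h_0(t)$. The coefficients and covariate values are made up for illustration and are not fitted on the challenge data.

```python
import math

def linear_predictor(beta, x):
    """Compute beta_1 * x_1 + ... + beta_p * x_p."""
    return sum(b * v for b, v in zip(beta, x))

def hazard(h0_t, beta, x):
    """Cox hazard h(t | x) = h0(t) * exp(beta . x) for a given baseline value h0(t)."""
    return h0_t * math.exp(linear_predictor(beta, x))

beta = [0.03, -0.10]       # hypothetical coefficients for (BM_BLAST, HB)
patient_a = [14.0, 7.6]    # made-up covariates
patient_b = [1.0, 11.6]

# The ratio is the same whatever the baseline hazard is at time t
ratios = [hazard(h0, beta, patient_a) / hazard(h0, beta, patient_b) for h0 in (0.1, 5.0)]
hazard_ratio = math.exp(linear_predictor(beta, patient_a) - linear_predictor(beta, patient_b))
```

Since $h_0(t)$ cancels in the ratio, ranking patients by the linear predictor is enough to rank them by risk, which is what a concordance index evaluates.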


In [16]:
# Initialize and train the Cox Proportional Hazards model
cox = CoxPHSurvivalAnalysis()
cox.fit(X_train, y_train)

# Evaluate the model using Concordance Index IPCW
cox_cindex_train = concordance_index_ipcw(y_train, y_train, cox.predict(X_train), tau=7)[0]
cox_cindex_test = concordance_index_ipcw(y_train, y_test, cox.predict(X_test), tau=7)[0]
print(f"Cox Proportional Hazard Model Concordance Index IPCW on train: {cox_cindex_train:.2f}")
print(f"Cox Proportional Hazard Model Concordance Index IPCW on test: {cox_cindex_test:.2f}")

Cox Proportional Hazard Model Concordance Index IPCW on train: 0.67
Cox Proportional Hazard Model Concordance Index IPCW on test: 0.67


### Step 5: Naive Approach to Incorporate Mutations

In this step, we take a very naive approach to account for genetic mutations by simply counting the total number of somatic mutations per patient. Instead of analyzing specific mutations or their biological impact, we use this aggregate count as a basic feature to reflect the mutational burden for each individual. Although simplistic, this feature can serve as a general indicator of genetic variability across patients, which may influence survival outcomes. More sophisticated mutation analysis could be incorporated in future models to improve predictive power.


In [17]:
# Step: Extract the number of somatic mutations per patient
# Group by 'ID' and count the number of mutations (rows) per patient
tmp = maf_df.groupby('ID').size().reset_index(name='Nmut')

# Merge with the training dataset and replace missing values in 'Nmut' with 0
df_2 = df.merge(tmp, on='ID', how='left').fillna({'Nmut': 0})

In [18]:
# Select features
# features = ['BM_BLAST', 'HB', 'PLT', 'Nmut']
# features = ['BM_BLAST', 'HB', 'PLT', 'WBC', 'ANC', 'MONOCYTES', 'Nmut']
# features = features_basic + ['BLAST_per_WBC', 'ANC_per_WBC', 'PL_per_HB', 'Nmut']
# features = features_basic + ['Nmut']
features = features + ['Nmut']
target = ['OS_YEARS', 'OS_STATUS']

# Create the survival data format
X = df_2.loc[df_2['ID'].isin(target_df['ID']), features]
y = Surv.from_dataframe('OS_STATUS', 'OS_YEARS', target_df)

In [19]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# TODO : change the random state?

In [20]:
# Median imputation for missing values (fitted on the training split only)
imputer = SimpleImputer(strategy="median")
# basic features
# X_train[['BM_BLAST', 'HB', 'PLT', 'Nmut']] = imputer.fit_transform(X_train[['BM_BLAST', 'HB', 'PLT', 'Nmut']])
# X_test[['BM_BLAST', 'HB', 'PLT', 'Nmut']] = imputer.transform(X_test[['BM_BLAST', 'HB', 'PLT', 'Nmut']])

# Same, applied to the full feature list:
X_train[features] = imputer.fit_transform(X_train[features])
X_test[features] = imputer.transform(X_test[features])

In [21]:
print("the features are:", features)

the features are: ['BM_BLAST', 'HB', 'PLT', 'BLAST_per_WBC', 'ANC_per_WBC', 'PL_per_HB', 'CYTO_RISK_High_Risk', 'Nmut']


In [22]:
# Initialize and train the Cox Proportional Hazards model
# cox = CoxPHSurvivalAnalysis()
cox = CoxPHSurvivalAnalysis(alpha=1.)
cox.fit(X_train, y_train)

# Evaluate the model using Concordance Index IPCW
cox_cindex_train = concordance_index_ipcw(y_train, y_train, cox.predict(X_train), tau=7)[0]
cox_cindex_test = concordance_index_ipcw(y_train, y_test, cox.predict(X_test), tau=7)[0]
print(f"Cox Proportional Hazard Model Concordance Index IPCW on train: {cox_cindex_train:.2f}")
print(f"Cox Proportional Hazard Model Concordance Index IPCW on test: {cox_cindex_test:.2f}")

Cox Proportional Hazard Model Concordance Index IPCW on train: 0.69
Cox Proportional Hazard Model Concordance Index IPCW on test: 0.69


### Inference on test set

In [23]:

tmp_eval = maf_eval.groupby('ID').size().reset_index(name='Nmut')

# Merge with the evaluation dataset and replace missing values in 'Nmut' with 0
df_eval = df_eval.merge(tmp_eval, on='ID', how='left').fillna({'Nmut': 0})

df_eval.head()

Unnamed: 0,ID,CENTER,BM_BLAST,WBC,ANC,MONOCYTES,HB,PLT,CYTOGENETICS,BLAST_per_WBC,ANC_per_WBC,PL_per_HB,CYTO_RISK_High_Risk,Nmut
0,KYW1,KYW,68.0,3.45,0.5865,,7.6,48.0,"47,XY,+X,del(9)(q?)[15]/47,XY,+X[5]",19.710145,0.17,6.315789,False,4.0
1,KYW2,KYW,35.0,3.18,1.2402,,10.0,32.0,"46,XY,der(3)?t(3;11)(q26.2;q23),add(4)(p15).de...",11.006289,0.39,3.2,False,3.0
2,KYW3,KYW,,12.4,8.68,,12.3,25.0,"47,XX,+8",,0.7,2.03252,True,3.0
3,KYW4,KYW,61.0,5.55,2.0535,,8.0,44.0,Normal,10.990991,0.37,5.5,False,3.0
4,KYW5,KYW,2.0,1.21,0.7381,,8.6,27.0,"43,XY,dic(5;17)(q11.2;p11.2),-7,-13,-20,-22,+r...",1.652893,0.61,3.139535,True,3.0


In [25]:
# Basic features
# df_eval[['BM_BLAST', 'HB', 'PLT', 'Nmut']] = imputer.transform(df_eval[['BM_BLAST', 'HB', 'PLT', 'Nmut']])

# Additional features
# print(features)
# df_eval.head()
df_eval[features] = imputer.transform(df_eval[features])
prediction_on_test_set = cox.predict(df_eval.loc[:, features])

In [26]:
prediction_on_test_set

array([ 0.93783922, -0.33934019, -0.75878526, ..., -1.46714868,
       -1.21101545, -1.33908206])

#### Saving the submission

In [27]:
# submission = pd.Series(prediction_on_test_set, index=df_eval['ID'], name='OS_YEARS')
submission = pd.Series(prediction_on_test_set, index=df_eval['ID'], name='risk_score')

In [28]:
submission

ID
KYW1       0.937839
KYW2      -0.339340
KYW3      -0.758785
KYW4       0.583121
KYW5      -0.205101
             ...   
KYW1189   -1.339082
KYW1190   -1.211015
KYW1191   -1.467149
KYW1192   -1.211015
KYW1193   -1.339082
Name: risk_score, Length: 1193, dtype: float64

In [None]:
import datetime

os.makedirs('./output', exist_ok=True)

# Keep only the date, hour, and minute for the file name
now = datetime.datetime.now()
now = now.strftime("%Y-%m-%d_%H-%M")

submission.to_csv(f'./output/submission_{now}.csv')
# submission.to_csv('./output/all_clinical_features_submission.csv')
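
Before writing the file, a few basic checks on the submission can catch common mistakes. The sketch below is a hypothetical helper (`check_submission` is not part of the challenge tooling) run on made-up toy data; the exact format required by the challenge platform is not verified here.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a risk-score Series indexed by patient ID."""
    problems = []
    if submission.isna().any():
        problems.append("contains NaN")
    if submission.index.duplicated().any():
        problems.append("duplicate IDs")
    missing = set(expected_ids) - set(submission.index)
    if missing:
        problems.append(f"{len(missing)} missing IDs")
    return problems

good = pd.Series([0.9, -0.3], index=["KYW1", "KYW2"], name="risk_score")
bad = pd.Series([0.9, float("nan")], index=["KYW1", "KYW1"], name="risk_score")

good_report = check_submission(good, ["KYW1", "KYW2"])
bad_report = check_submission(bad, ["KYW1", "KYW2"])
```

An empty list means none of the three checks failed.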

In [63]:
submission

ID
KYW1       0.863839
KYW2      -0.513840
KYW3      -1.732613
KYW4       0.493821
KYW5      -1.125188
             ...   
KYW1189   -1.566600
KYW1190   -1.442281
KYW1191   -1.690919
KYW1192   -1.442281
KYW1193   -1.566600
Name: OS_YEARS, Length: 1193, dtype: float64

In [35]:

# random_submission = pd.Series(np.random.uniform(0, 1, len(submission)),index =submission.index, name='OS_YEARS')
# random_submission.to_csv('./random_submission.csv')
# random_submission