# Survival Modeling and Analysis System (Medical)

This notebook details a sequence of steps that can be applied to any Machine Learning model for ***survival***-type problems.

## 1. Dataset
* Each row represents one patient.
* The goal is to predict the risk (**risk**) that a patient dies.
    * Whether the death occurred is recorded in the **event** column.
---
In survival analysis we have:
* Targets:
    * **event**: True/False (the patient died / did not die).
    * **time**: The time at which the event occurs.
* Explanatory variables:
    * The remaining variables (age, prior_therapy, etc.)

The model's equation therefore takes the following form (shown here with two of the explanatory variables):
$$
risk = (w_0) + (w_1) \cdot age + (w_2) \cdot prior\_therapy
$$

With Machine Learning models, the goal is to find the best values (optimization) for the weights ***w0***, ***w1*** and ***w2*** in the equation above, and then to compute the associated **risk**.
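To make the equation concrete, here is a minimal sketch that evaluates the linear risk score for three patients. The weight values and the 0/1 encoding of `prior_therapy` are made-up assumptions for illustration, not values fitted on this dataset.

```python
import numpy as np

# Hypothetical weights (w_0, w_1, w_2); a real model would learn these from data.
w0, w1, w2 = 0.5, 0.02, 0.3

# Columns: age, prior_therapy (assumed encoding: 0 = No, 1 = Yes).
patients = np.array([
    [69.0, 0.0],
    [38.0, 0.0],
    [63.0, 1.0],
])

# risk = w_0 + w_1 * age + w_2 * prior_therapy, evaluated for every row at once.
risk = w0 + patients @ np.array([w1, w2])

# Rank patients from highest to lowest risk score.
ranking = np.argsort(-risk)
print(risk, ranking)
```

Only the ordering of the scores matters when patients are ranked by risk, which is also what the evaluation metric used later in the notebook measures.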

In [44]:
import pandas as pd

df_patients = pd.read_excel("../data/data_lung_cancer_smote.xlsx")
list_columns_categorical = df_patients.select_dtypes(include="object").columns
# Convert object columns to category so that OneHotEncoder() recognizes and encodes them.
df_patients[list_columns_categorical] = df_patients[list_columns_categorical].astype("category")
df_patients

Unnamed: 0,event,time,age,karnofsky_score,months_from_diagnosis,prior_therapy,treatment,celltype
0,True,2.373626,69.000000,60.000000,7.000000,No,Standard,Squamous
1,True,7.516484,38.000000,60.000000,3.000000,No,Standard,Squamous
2,True,4.153846,63.000000,60.000000,9.000000,Yes,Standard,Squamous
3,True,3.890110,65.000000,70.000000,11.000000,Yes,Standard,Squamous
4,True,0.329670,49.000000,20.000000,5.000000,No,Standard,Squamous
...,...,...,...,...,...,...,...,...
238,False,3.142881,65.810046,64.640822,4.762009,No,Standard,Smallcell
239,False,3.380047,36.495508,70.684273,21.551683,Yes,Test,Smallcell
240,False,3.082424,65.029553,81.087920,4.852974,No,Standard,Squamous
241,False,2.986648,62.424988,77.842548,4.084998,No,Standard,Smallcell


## 2. Feature Selection

Selection of the variables used in the model:
* `y (target)`: **event** and **time**
    * For the model to process these variables, they must be converted to another data structure (a *NumPy record array*), which is done with *.to_records()*.
* `x (explanatory)`: Variables relevant for computing a patient's risk.

### 2.1 Preprocessing Data

1. Check for **NaN** values and remove them from the dataset.
2. Convert the categorical explanatory variables (X) to numeric with **OneHotEncoder()**.
    * The target (y) needs no categorical-to-numeric conversion: `.fit()` accepts the record array directly, with the boolean **event** field first and the numeric **time** field second.
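The two preprocessing steps and the target conversion can be sketched on a tiny hand-made frame. The rows below are invented, and `pd.get_dummies(..., drop_first=True)` is used only as a pandas stand-in for the notebook's `OneHotEncoder()`; the column names it produces may differ from the encoder's.

```python
import pandas as pd

# Tiny invented sample with the same kinds of columns as the notebook's dataset.
df = pd.DataFrame({
    "event": [True, False, True],
    "time": [2.4, 3.1, 7.5],
    "age": [69.0, 65.8, 38.0],
    "prior_therapy": ["No", "Yes", "No"],
})

# Step 1: drop rows with missing values (a no-op here, since none are missing).
df = df.dropna()

# Target: a NumPy record array with the boolean event first and the time second.
y = df[["event", "time"]].to_records(index=False)

# Step 2: one-hot encode the categorical explanatory variables.
X = df.drop(["event", "time"], axis=1)
X["prior_therapy"] = X["prior_therapy"].astype("category")
X_encoded = pd.get_dummies(X, drop_first=True)

print(y.dtype)
print(list(X_encoded.columns))
```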

In [45]:
df_patients.isna().sum()

event                    0
time                     0
age                      0
karnofsky_score          0
months_from_diagnosis    0
prior_therapy            0
treatment                0
celltype                 0
dtype: int64

In [46]:
y = df_patients[["event", "time"]].to_records(index=False)

In [47]:
X = df_patients.drop(["event", "time"], axis=1)
X

Unnamed: 0,age,karnofsky_score,months_from_diagnosis,prior_therapy,treatment,celltype
0,69.000000,60.000000,7.000000,No,Standard,Squamous
1,38.000000,60.000000,3.000000,No,Standard,Squamous
2,63.000000,60.000000,9.000000,Yes,Standard,Squamous
3,65.000000,70.000000,11.000000,Yes,Standard,Squamous
4,49.000000,20.000000,5.000000,No,Standard,Squamous
...,...,...,...,...,...,...
238,65.810046,64.640822,4.762009,No,Standard,Smallcell
239,36.495508,70.684273,21.551683,Yes,Test,Smallcell
240,65.029553,81.087920,4.852974,No,Standard,Squamous
241,62.424988,77.842548,4.084998,No,Standard,Smallcell


In [48]:
from sksurv.preprocessing import OneHotEncoder

In [49]:
encoder = OneHotEncoder()

In [50]:
X = encoder.fit_transform(X)

### 2.2 Train Test Split

In [51]:
from sklearn.model_selection import train_test_split

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [53]:
pd.DataFrame({
    "Dataset": ["X_train", "X_test", "y_train", "y_test"],
    "Registros": [len(X_train), len(X_test), len(y_train), len(y_test)]
})

Unnamed: 0,Dataset,Registros
0,X_train,170
1,X_test,73
2,y_train,170
3,y_test,73
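One optional refinement, not used in this notebook, is to stratify the split on the event indicator so that train and test keep a similar share of observed events. A minimal sketch on synthetic data (the array sizes and the 40% event rate are arbitrary assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the encoded features and the (event, time) target.
X_demo = rng.normal(size=(100, 3))
y_demo = np.zeros(100, dtype=[("event", "?"), ("time", "<f8")])
y_demo["event"][:40] = True              # 40 observed events, 60 without the event
y_demo["time"] = rng.uniform(1, 10, size=100)

# stratify keeps the event ratio equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo["event"]
)

print(len(y_tr), y_tr["event"].mean())
print(len(y_te), y_te["event"].mean())
```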


### 2.3 Grid Search CV (Cross Validation)

In [76]:
from sksurv.tree import SurvivalTree   # needed here: this cell appears before the import in section 4

model_tree = SurvivalTree()
model_tree.get_params()         # Show the model's current parameters

{'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_samples_leaf': 3,
 'min_samples_split': 6,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

In [77]:
from sklearn.model_selection import GridSearchCV

In [91]:
model_cv_tree = GridSearchCV(
    verbose=2,
    cv=3,
    estimator=model_tree,
    param_grid={
        'max_depth': [2, 3, 4, 5],
        'min_samples_leaf': [10, 20, 30, 40, 50]
    }
)

In [92]:
model_cv_tree.fit(X_train, y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] END ...................max_depth=2, min_samples_leaf=10; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=10; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=10; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=20; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=20; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=20; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=30; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=30; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=30; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=40; total time=   0.0s
[CV] END ...................max_depth=2, min_samples_leaf=40; total time=   0.0s
[CV] END ...................max_depth=2, min_sam

In [93]:
model_cv_tree.best_params_

{'max_depth': 5, 'min_samples_leaf': 10}

In [94]:
model_cv_tree.score(X_test, y_test)

0.7826375082836315
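For scikit-survival estimators, `.score()` reports the concordance index: among comparable patient pairs, the fraction in which the patient who died earlier received the higher predicted risk. The function below is a simplified sketch of that idea (it ignores tied event times and is not the library's implementation); the four patients are invented.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's C-index; ignores pairs with identical times."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a patient without the event cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0      # earlier death had the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5      # tied predictions count as half
    return concordant / comparable


times = [2.0, 4.0, 6.0, 8.0]
events = [True, True, False, True]
risks = [0.9, 0.3, 0.5, 0.1]

c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to a random ordering and 1.0 to a perfect ordering, which gives a scale for reading the scores reported in this notebook.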

## 3. The Cox PH Model

### 3.1 Fit - Model fitting (mathematical equation)

In [55]:
from sksurv.linear_model import CoxPHSurvivalAnalysis

In [56]:
model_cox = CoxPHSurvivalAnalysis()

In [97]:
model_cox.get_params()

{'alpha': 0, 'n_iter': 100, 'ties': 'breslow', 'tol': 1e-09, 'verbose': 0}

#### Grid Search CV - Selecting the best parameters

In [96]:
from sklearn.model_selection import GridSearchCV

In [116]:
model_cv_cox = GridSearchCV(
    verbose=2,
    estimator=model_cox,
    param_grid={
        'ties': ['breslow', 'efron'],
        "n_iter" : [50, 100, 200, 300]
    }
)

In [117]:
model_cv_cox.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV] END ............................n_iter=50, ties=breslow; total time=   0.0s
[CV] END ............................n_iter=50, ties=breslow; total time=   0.0s
[CV] END ............................n_iter=50, ties=breslow; total time=   0.0s
[CV] END ............................n_iter=50, ties=breslow; total time=   0.0s
[CV] END ............................n_iter=50, ties=breslow; total time=   0.0s
[CV] END ..............................n_iter=50, ties=efron; total time=   0.0s
[CV] END ..............................n_iter=50, ties=efron; total time=   0.0s
[CV] END ..............................n_iter=50, ties=efron; total time=   0.0s
[CV] END ..............................n_iter=50, ties=efron; total time=   0.0s
[CV] END ..............................n_iter=50, ties=efron; total time=   0.0s
[CV] END ...........................n_iter=100, ties=breslow; total time=   0.0s
[CV] END ...........................n_iter=100, t

In [106]:
model_cv_cox.best_params_

{'n_iter': 50, 'ties': 'breslow'}

### 3.2 Score - Evaluating the predictions

In [108]:
model_cv_cox.score(X_test, y_test)

0.756129887342611

## 4. Decision Tree Model

### 4.1 Fit - Model fitting

In [110]:
from sksurv.tree import SurvivalTree

In [111]:
model_tree = SurvivalTree()

In [113]:
model_tree.get_params()

{'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_samples_leaf': 3,
 'min_samples_split': 6,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

In [118]:
model_cv_tree = GridSearchCV(
    verbose = 2,
    estimator = model_tree,
    param_grid = {
        "splitter": ["best", "random"],
        "max_depth": [3, 5, 10, 15, 20],
        'max_features': [1, 2, 3, 4, 5, 6],
        'min_samples_leaf': [3, 5, 10]
    }
)

In [119]:
model_cv_tree.fit(X_train, y_train)

Fitting 5 folds for each of 180 candidates, totalling 900 fits
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=best; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=best; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=best; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=best; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=best; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=random; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=random; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=random; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=random; total time=   0.0s
[CV] END max_depth=3, max_features=1, min_samples_leaf=3, splitter=random; total time=   0.0s
[CV] EN

In [121]:
model_cv_tree.best_params_

{'max_depth': 5, 'max_features': 6, 'min_samples_leaf': 5, 'splitter': 'best'}
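The number of fits reported in the log follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. A short sketch using scikit-learn's `ParameterGrid` to reproduce the count for the decision-tree grid above (the 5 folds are `GridSearchCV`'s default when `cv` is not set):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the decision-tree search above.
param_grid = {
    "splitter": ["best", "random"],
    "max_depth": [3, 5, 10, 15, 20],
    "max_features": [1, 2, 3, 4, 5, 6],
    "min_samples_leaf": [3, 5, 10],
}

n_candidates = len(ParameterGrid(param_grid))   # 2 * 5 * 6 * 3
n_folds = 5                                     # GridSearchCV's default cv
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Because the count grows multiplicatively, adding one more parameter list with four values would quadruple the number of fits.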

In [63]:
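model_tree.predict(X_test)  # requires a fitted model_tree; GridSearchCV only fits clones, so use model_cv_tree.predict(X_test) for the tuned tree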

array([ 32.2       ,  22.        ,   0.        ,   3.        ,
         0.        ,  11.        ,   0.        ,  32.        ,
        81.83333333, 102.05      ,  68.83333333,   0.        ,
         0.        ,   0.        , 102.05      ,   0.        ,
        11.        , 111.33333333,  25.        ,   0.        ,
        56.        ,  56.        ,  85.83333333,  60.33333333,
         7.33333333,   0.        , 110.5       ,  22.        ,
        55.38333333,   0.        ,  25.        ,   0.        ,
        56.        ,   7.33333333, 111.33333333,   0.        ,
       102.05      ,  25.        ,  32.        ,  33.83333333,
        55.38333333,  85.83333333,  32.2       ,  18.        ,
        25.        , 119.        ,  18.        ,  23.        ,
         0.        ,  70.5       ,   0.        ,  29.65      ,
        55.38333333,  85.83333333,  60.33333333, 102.05      ,
         0.        ,   0.        ,  23.        ,  22.        ,
       119.        ,  85.83333333,  32.        ,  22.  

### 4.3 Score - Evaluating the predictions

In [122]:
model_cv_tree.score(X_test, y_test)

0.8240556660039762

## 5. Random Forest Model

### 5.1 Fit - Model fitting

In [65]:
from sksurv.ensemble import RandomSurvivalForest

In [66]:
model_rf = RandomSurvivalForest()

In [124]:
model_rf.get_params()

{'bootstrap': True,
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_samples_leaf': 3,
 'min_samples_split': 6,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [125]:
model_cv_rf = GridSearchCV(
    verbose = 2,
    estimator = model_rf,
    param_grid = {
        "n_estimators": [50, 100, 200, 300],
        'min_samples_leaf': [2, 3, 5, 10]
    }
)

In [126]:
model_cv_rf.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END ................min_samples_leaf=2, n_estimators=50; total time=   0.0s
[CV] END ................min_samples_leaf=2, n_estimators=50; total time=   0.0s
[CV] END ................min_samples_leaf=2, n_estimators=50; total time=   0.0s
[CV] END ................min_samples_leaf=2, n_estimators=50; total time=   0.0s
[CV] END ................min_samples_leaf=2, n_estimators=50; total time=   0.0s
[CV] END ...............min_samples_leaf=2, n_estimators=100; total time=   0.0s
[CV] END ...............min_samples_leaf=2, n_estimators=100; total time=   0.0s
[CV] END ...............min_samples_leaf=2, n_estimators=100; total time=   0.0s
[CV] END ...............min_samples_leaf=2, n_estimators=100; total time=   0.0s
[CV] END ...............min_samples_leaf=2, n_estimators=100; total time=   0.0s
[CV] END ...............min_samples_leaf=2, n_estimators=200; total time=   0.1s
[CV] END ...............min_samples_leaf=2, n_es

In [128]:
model_cv_rf.best_params_

{'min_samples_leaf': 2, 'n_estimators': 200}

In [129]:
model_cv_rf.score(X_test, y_test)

0.8933068257123923

## 6. Support Vector Machine

In [70]:
from sksurv.svm import FastSurvivalSVM

In [130]:
model_svm = FastSurvivalSVM()

In [132]:
model_svm.get_params()

{'alpha': 1,
 'fit_intercept': False,
 'max_iter': 20,
 'optimizer': None,
 'random_state': None,
 'rank_ratio': 1.0,
 'timeit': False,
 'tol': None,
 'verbose': False}

In [136]:
model_cv_svm = GridSearchCV(
    verbose = 2,
    estimator = model_svm,
    param_grid = {
        "alpha": [1, 2, 3],
        "max_iter": [5, 10, 20, 40]
    }
)

In [137]:
model_cv_svm.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END ................................alpha=1, max_iter=5; total time=   0.0s
[CV] END ................................alpha=1, max_iter=5; total time=   0.0s
[CV] END ................................alpha=1, max_iter=5; total time=   0.0s
[CV] END ................................alpha=1, max_iter=5; total time=   0.0s
[CV] END ................................alpha=1, max_iter=5; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=10; total time=   0.0s


[CV] END ...............................alpha=1, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=40; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=40; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=40; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=40; total time=   0.0s
[CV] END ...............................alpha=1, max_iter=40; total time=   0.0s


[CV] END ................................alpha=2, max_iter=5; total time=   0.0s
[CV] END ................................alpha=2, max_iter=5; total time=   0.0s
[CV] END ................................alpha=2, max_iter=5; total time=   0.0s
[CV] END ................................alpha=2, max_iter=5; total time=   0.0s
[CV] END ................................alpha=2, max_iter=5; total time=   0.0s
[CV] END ...............................alpha=2, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=2, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=2, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=2, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=2, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=2, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=2, max_iter=20; total time=   0.0s
[CV] END ...................

[CV] END ...............................alpha=3, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=10; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=20; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=40; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=40; total time=   0.0s
[CV] END ...............................alpha=3, max_iter=40; total time=   0.0s
[CV] END ...................

In [138]:
model_cv_svm.score(X_test, y_test)

0.7468522200132538

## 7. Model results

In [72]:
columnas = ["Modelo", "Score"]

In [139]:
df_resultados = pd.DataFrame({
    "Modelos": ["Cox PH", "Decision Tree", "Random Forest", "SVM"],
    "Score": [
        model_cv_cox.score(X_test, y_test),
        model_cv_tree.score(X_test, y_test),
        model_cv_rf.score(X_test, y_test),
        model_cv_svm.score(X_test, y_test),
    ],
})
df_resultados.style.background_gradient()

Unnamed: 0,Modelos,Score
0,Cox PH,0.75613
1,Decision Tree,0.824056
2,Random Forest,0.893307
3,SVM,0.746852
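To read the comparison at a glance, the table can be sorted by score. The sketch below rebuilds the table from the values printed above rather than from the fitted models, so it runs on its own; the variable names `results`, `ranked` and `best_model` are illustrative.

```python
import pandas as pd

# Scores copied from the results table above.
results = pd.DataFrame({
    "Modelos": ["Cox PH", "Decision Tree", "Random Forest", "SVM"],
    "Score": [0.75613, 0.824056, 0.893307, 0.746852],
})

# Sort from best to worst and pick the top model.
ranked = results.sort_values("Score", ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, "Modelos"]

print(ranked)
print(best_model)
```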
