# Final Exam: In-Hospital Mortality Classification  
**Supervised Machine Learning**  

# Mateo Hernández Gualdron


*September 25, 2025*

---

## Purpose  
The purpose of this exam is to assess whether the learning objectives set at the start of the course have been met:

- Identify problems that can be solved with supervised Machine Learning.  
- Implement a supervised Machine Learning solution to practical problems.  
- Evaluate the performance of supervised Machine Learning models.  

---

## 1. Problem Description  
Predicting the in-hospital mortality of critically ill patients is important because of growing concern about patients' loss of control toward the end of life.  

An accurate prediction allows decisions to be made in advance, reducing the frequency of a mechanical, painful, and prolonged dying process.  

The goal of the exam is to design a classifier that predicts the mortality of these patients from several physiological, demographic, and disease-severity features.  

---

## 2. Dataset  
The provided data file is **`data_train.csv`**. It contains a total of **9105 patient records**.  

The dataset features include:  
- Physiological information about the patients.  
- Demographic data.  
- Information on disease severity.  
- Mortality indicators at 2 and 6 months (`surv2m`, `surv6m`).  
- In-hospital mortality column (`hospdead`, the label to predict).  

---

## 3. Evaluation  
You are free to use any type of model and any data preprocessing you consider appropriate.  

The final exam will be graded on the following criteria:

- **(10%)** Data preparation and preprocessing.  
- **(10%)** Evidence of appropriate training of your models.  
- **(20%)** Choice of model type, and the model selection and/or regularization method used, including numerical evidence.  
- **(20%)** Quality of your final model. The performance criterion used must be clearly specified and justified.  
  - What is the expected performance of your model on future data?  
  - Provide numerical evidence.  
- **(20%)** Analysis of your results.  
- **(20%)** Quality and organization of the report.  

---

## 4. Deliverables  
The deliverable is a report (**Jupyter Notebook**) of the procedure followed to arrive at your final model.  

The report must be well structured and include the information required in the evaluation. Include **plots and tables** that present the information concisely and clearly.  

The code must be well structured and appropriately commented. The executed cells and their outputs must be visible in the notebook.  


# Preprocessing

In [1]:
import pandas as pd
data_raw = pd.read_csv('data_train.csv',sep=',')
data_raw.drop('Unnamed: 0', axis=1, inplace=True)
data_raw.head(6)

Unnamed: 0,age,sex,dzgroup,dzclass,num.co,scoma,avtisst,race,sps,aps,...,dnr,dnrday,meanbp,hrt,resp,temp,crea,sod,adlsc,hospdead
0,62.84998,male,Lung Cancer,Cancer,0,0.0,7.0,other,33.898438,20.0,...,no dnr,5.0,97.0,69.0,22.0,36.0,1.199951,141.0,7.0,0
1,60.33899,female,Cirrhosis,COPD/CHF/Cirrhosis,2,44.0,29.0,white,52.695312,74.0,...,,,43.0,112.0,34.0,34.59375,5.5,132.0,1.0,1
2,52.74698,female,Cirrhosis,COPD/CHF/Cirrhosis,2,0.0,13.0,white,20.5,45.0,...,no dnr,17.0,70.0,88.0,28.0,37.39844,2.0,134.0,0.0,0
3,42.38498,female,Lung Cancer,Cancer,2,0.0,7.0,white,20.097656,19.0,...,no dnr,3.0,75.0,88.0,32.0,35.0,0.799927,139.0,0.0,0
4,79.88495,female,ARF/MOSF w/Sepsis,ARF/MOSF,1,26.0,18.666656,white,23.5,30.0,...,no dnr,16.0,59.0,112.0,20.0,37.89844,0.799927,143.0,2.0,0
5,93.01599,male,Coma,Coma,1,55.0,5.0,white,19.398438,27.0,...,no dnr,4.0,110.0,101.0,44.0,38.39844,0.699951,140.0,1.0,1


## Nulls and Duplicates

In [3]:
# null values and duplicate rows
print(f'Null values: {data_raw.isnull().sum().sum()}')
print(f'Duplicate rows: {data_raw.duplicated().sum()}')

Null values: 261
Duplicate rows: 0


In [4]:
# Null values per column
data_raw.isnull().sum()

age          0
sex          0
dzgroup      0
dzclass      0
num.co       0
scoma        1
avtisst     82
race        42
sps          1
aps          1
surv2m       1
surv6m       1
hday         0
diabetes     0
dementia     0
ca           0
dnr         30
dnrday      30
meanbp       1
hrt          1
resp         1
temp         1
crea        67
sod          1
adlsc        0
hospdead     0
dtype: int64

In [6]:
data_raw.dropna(inplace=True)
data_raw.head(5)

Unnamed: 0,age,sex,dzgroup,dzclass,num.co,scoma,avtisst,race,sps,aps,...,dnr,dnrday,meanbp,hrt,resp,temp,crea,sod,adlsc,hospdead
0,62.84998,male,Lung Cancer,Cancer,0,0.0,7.0,other,33.898438,20.0,...,no dnr,5.0,97.0,69.0,22.0,36.0,1.199951,141.0,7.0,0
2,52.74698,female,Cirrhosis,COPD/CHF/Cirrhosis,2,0.0,13.0,white,20.5,45.0,...,no dnr,17.0,70.0,88.0,28.0,37.39844,2.0,134.0,0.0,0
3,42.38498,female,Lung Cancer,Cancer,2,0.0,7.0,white,20.097656,19.0,...,no dnr,3.0,75.0,88.0,32.0,35.0,0.799927,139.0,0.0,0
4,79.88495,female,ARF/MOSF w/Sepsis,ARF/MOSF,1,26.0,18.666656,white,23.5,30.0,...,no dnr,16.0,59.0,112.0,20.0,37.89844,0.799927,143.0,2.0,0
5,93.01599,male,Coma,Coma,1,55.0,5.0,white,19.398438,27.0,...,no dnr,4.0,110.0,101.0,44.0,38.39844,0.699951,140.0,1.0,1


In [7]:
data_raw.shape

(8888, 26)
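
Dropping every incomplete row reduces the data from 9105 to 8888 records. As an alternative that this notebook does not use, the missing values could be imputed instead. The sketch below is a minimal illustration on a made-up toy frame (`toy`, `num_cols`, and `cat_cols` are hypothetical names, and the values are invented): numeric gaps take the column median and categorical gaps take the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_raw; the values are invented for illustration.
toy = pd.DataFrame({
    "crea": [1.2, np.nan, 0.8, 2.0],
    "race": ["white", None, "white", "black"],
    "hospdead": [0, 1, 0, 0],
})

# Hypothetical column lists; the label column is deliberately left out.
num_cols = ["crea"]
cat_cols = ["race"]

imputed = toy.copy()

# Numeric columns: fill gaps with the per-column median.
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())

# Categorical columns: fill gaps with the most frequent category.
for col in cat_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

In a real workflow the fill values would be computed on the training split only and then reused on the test split, so that no information from held-out rows reaches the model.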

## Column Transformation

In [28]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
data = data_raw.copy()

In [29]:
categoricas = data.select_dtypes(include=['object', 'category']).columns
numericas = data.select_dtypes(exclude=['object','category']).columns
print(f'categorical variables {categoricas}')
print(f'numeric variables {numericas}')

categorical variables Index(['sex', 'dzgroup', 'dzclass', 'race', 'ca', 'dnr'], dtype='object')
numeric variables Index(['age', 'num.co', 'scoma', 'avtisst', 'sps', 'aps', 'surv2m', 'surv6m',
       'hday', 'diabetes', 'dementia', 'dnrday', 'meanbp', 'hrt', 'resp',
       'temp', 'crea', 'sod', 'adlsc', 'hospdead'],
      dtype='object')


In [32]:
preprocesador = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categoricas),
        ('num', RobustScaler(), numericas)
    ]
)
pipeline_procesamiento = Pipeline(steps=[
    ('preprocesador', preprocesador)
])

# Keep the original dataset in `data`
data_proc = pipeline_procesamiento.fit_transform(data)

# Get the names of the transformed columns
feature_names = pipeline_procesamiento.named_steps['preprocesador'].get_feature_names_out()

# Convert to a DataFrame
data_proc = pd.DataFrame(data_proc, columns=feature_names, index=data.index)

data_proc.head(10)

Unnamed: 0,cat__sex_male,cat__dzgroup_CHF,cat__dzgroup_COPD,cat__dzgroup_Cirrhosis,cat__dzgroup_Colon Cancer,cat__dzgroup_Coma,cat__dzgroup_Lung Cancer,cat__dzgroup_MOSF w/Malig,cat__dzclass_COPD/CHF/Cirrhosis,cat__dzclass_Cancer,...,num__dementia,num__dnrday,num__meanbp,num__hrt,num__resp,num__temp,num__crea,num__sod,num__adlsc,num__hospdead
0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,-0.307692,0.454545,-0.645833,-0.2,-0.347655,0.0,0.571429,2.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.615385,-0.159091,-0.25,0.4,0.351565,0.800049,-0.428571,-0.333333,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,-0.461538,-0.045455,-0.25,0.8,-0.847655,-0.400024,0.285714,-0.333333,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.538462,-0.409091,0.25,-0.4,0.601565,-0.400024,0.857143,0.333333,0.0
5,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,-0.384615,0.75,0.020833,2.0,0.851565,-0.5,0.428571,0.0,1.0
6,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.022727,0.416667,0.4,0.351565,0.399902,-0.714286,0.0,0.0
7,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,-0.153846,-0.113636,0.0,0.2,0.44922,0.800049,0.285714,-0.333333,0.0
8,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,-0.538462,0.454545,-0.916667,-0.4,-0.05078,-0.199951,0.857143,2.0,0.0
9,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,-0.076923,0.159091,-0.125,-0.4,0.75,-0.400024,0.285714,-0.1684,0.0
10,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,-0.153846,-0.159091,0.104167,0.0,-0.097655,-0.199951,-0.428571,0.0,0.0


In [33]:
data_proc.columns

Index(['cat__sex_male', 'cat__dzgroup_CHF', 'cat__dzgroup_COPD',
       'cat__dzgroup_Cirrhosis', 'cat__dzgroup_Colon Cancer',
       'cat__dzgroup_Coma', 'cat__dzgroup_Lung Cancer',
       'cat__dzgroup_MOSF w/Malig', 'cat__dzclass_COPD/CHF/Cirrhosis',
       'cat__dzclass_Cancer', 'cat__dzclass_Coma', 'cat__race_black',
       'cat__race_hispanic', 'cat__race_other', 'cat__race_white',
       'cat__ca_no', 'cat__ca_yes', 'cat__dnr_dnr before sadm',
       'cat__dnr_no dnr', 'num__age', 'num__num.co', 'num__scoma',
       'num__avtisst', 'num__sps', 'num__aps', 'num__surv2m', 'num__surv6m',
       'num__hday', 'num__diabetes', 'num__dementia', 'num__dnrday',
       'num__meanbp', 'num__hrt', 'num__resp', 'num__temp', 'num__crea',
       'num__sod', 'num__adlsc', 'num__hospdead'],
      dtype='object')

# Train_Test_Split

In [38]:
X = data_proc.drop('num__hospdead', axis=1)
y = data_proc['num__hospdead']
X

Unnamed: 0,cat__sex_male,cat__dzgroup_CHF,cat__dzgroup_COPD,cat__dzgroup_Cirrhosis,cat__dzgroup_Colon Cancer,cat__dzgroup_Coma,cat__dzgroup_Lung Cancer,cat__dzgroup_MOSF w/Malig,cat__dzclass_COPD/CHF/Cirrhosis,cat__dzclass_Cancer,...,num__diabetes,num__dementia,num__dnrday,num__meanbp,num__hrt,num__resp,num__temp,num__crea,num__sod,num__adlsc
0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,-0.307692,0.454545,-0.645833,-0.2,-0.347655,0.000000,0.571429,2.000000
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.615385,-0.159091,-0.250000,0.4,0.351565,0.800049,-0.428571,-0.333333
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,-0.461538,-0.045455,-0.250000,0.8,-0.847655,-0.400024,0.285714,-0.333333
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.538462,-0.409091,0.250000,-0.4,0.601565,-0.400024,0.857143,0.333333
5,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-0.384615,0.750000,0.020833,2.0,0.851565,-0.500000,0.428571,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9100,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.076923,0.727273,0.083333,-0.2,-0.500000,-0.100098,-0.857143,-0.333333
9101,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.538462,-0.772727,-2.083333,-1.6,0.949220,4.699463,-0.285714,-0.333333
9102,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-0.076923,0.772727,-0.354167,0.0,0.000000,1.499756,0.285714,0.508464
9103,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,-0.307692,0.500000,0.208333,0.0,-0.148435,2.300049,-0.285714,-0.333333


In [39]:
y

0       0.0
2       0.0
3       0.0
4       0.0
5       1.0
       ... 
9100    0.0
9101    0.0
9102    0.0
9103    1.0
9104    0.0
Name: num__hospdead, Length: 8888, dtype: float64

In [45]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=77)

In [46]:
X_train.shape

(7110, 38)

In [47]:
X_test.shape

(1778, 38)
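
The split above is purely random. The notebook does not report the class balance of `hospdead`, but if deaths turn out to be the minority class, a stratified split is one way to keep the same proportion in both partitions. The sketch below uses invented toy arrays (`X_toy`, `y_toy`) with 20% positives and is only an illustration of the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 20 positives and 80 negatives (invented).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the 20% positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=77, stratify=y_toy
)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives:", y_te.sum(), "of", len(y_te))
```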

In [48]:
X_train

Unnamed: 0,cat__sex_male,cat__dzgroup_CHF,cat__dzgroup_COPD,cat__dzgroup_Cirrhosis,cat__dzgroup_Colon Cancer,cat__dzgroup_Coma,cat__dzgroup_Lung Cancer,cat__dzgroup_MOSF w/Malig,cat__dzclass_COPD/CHF/Cirrhosis,cat__dzclass_Cancer,...,num__diabetes,num__dementia,num__dnrday,num__meanbp,num__hrt,num__resp,num__temp,num__crea,num__sod,num__adlsc
8936,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.307692,0.750000,-0.562500,0.4,-0.347655,0.300049,0.714286,-0.333333
5525,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.230769,1.431818,0.729167,-0.2,1.300785,0.399902,0.714286,-0.333333
2145,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,-0.153846,-0.477273,-0.625000,0.1,-0.300780,-0.199951,0.428571,-0.333333
5584,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.230769,-0.136364,0.250000,0.0,0.449220,-0.400024,0.714286,-0.333333
655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,9.153846,-0.318182,-0.520833,0.0,-0.199215,3.899658,0.000000,0.333333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
172,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.153846,0.772727,-0.416667,-0.8,-0.398435,1.499756,-0.428571,0.000000
4924,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,-0.307692,1.318182,0.520833,-0.6,0.652345,0.000000,0.571429,0.000000
8004,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.076923,0.750000,-0.854167,0.4,-0.500000,0.899658,-0.571429,-0.333333
2330,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,-0.461538,0.136364,0.083333,0.6,0.000000,0.300049,0.571429,0.652018


In [49]:
y_train

8936    0.0
5525    0.0
2145    0.0
5584    0.0
655     1.0
       ... 
172     0.0
4924    0.0
8004    0.0
2330    0.0
9011    1.0
Name: num__hospdead, Length: 7110, dtype: float64

# Model

## GridSearch

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import KFold,GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, classification_report, auc, roc_auc_score, precision_score, f1_score

model = XGBClassifier(
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42
)

param_grid = {
    'n_estimators': [100, 200, 500],   # number of trees
    'max_depth': [3, 5, 7],            # maximum depth of each tree
    'learning_rate': [0.01, 0.1, 0.2], # learning rate
    'subsample': [0.8, 1.0],           # fraction of rows sampled for each tree
    'colsample_bytree': [0.8, 1.0]     # fraction of features sampled for each tree
}

grid = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validated AUC:", grid.best_score_)
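
To get a sense of the cost of this search, the number of candidate configurations can be counted directly from the grid. The sketch below copies the grid values from the cell above and uses only the standard library; `n_combos` and `n_fits` are illustrative names.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# Every combination of one value per hyperparameter.
combos = list(product(*param_grid.values()))
n_combos = len(combos)

# With cv=5, each combination is trained and scored on five folds.
n_fits = n_combos * 5

print(n_combos, "combinations,", n_fits, "cross-validation fits")
```

The grid therefore contains 3 × 3 × 3 × 2 × 2 = 108 combinations, which means 540 cross-validation fits, plus one final refit of the best configuration on the full training set.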

## Best Model

In [None]:
mejores_params = grid.best_params_
print("Best parameters found:", mejores_params)

# model refit on the full training set with those parameters
mejor_modelo = grid.best_estimator_

y_pred = mejor_modelo.predict(X_test)
y_proba = mejor_modelo.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
rocauc = roc_auc_score(y_test, y_proba)
# precision and F1 need hard class predictions, not probabilities
precision = precision_score(y_test, y_pred)
f1score = f1_score(y_test, y_pred)


# Classification report as a dict
report = classification_report(y_test, y_pred, output_dict=True)

# Build a DataFrame with the global metrics
metrics = {
    "accuracy": accuracy,
    "precision":precision,
    "recall": recall,
    "roc_auc": rocauc,
    'F1_Score':f1score
}
df_metrics = pd.DataFrame([metrics])

print("Global metrics:")
print(df_metrics)

# Classification report as a DataFrame
df_class_report = pd.DataFrame(report).transpose()

print("\nPer-class report:")
print(df_class_report)
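
The evaluation criteria ask for numerical evidence of the expected performance on future data. One possible way to complement the single test-set AUC, which this notebook does not implement, is a bootstrap percentile interval over the test predictions. The sketch below is a minimal illustration: `bootstrap_auc_ci` is a hypothetical helper, and `y_true` and `y_score` are synthetic stand-ins for `y_test` and `y_proba`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for y_test and y_proba (invented for illustration).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.5 * y_true + rng.normal(0.25, 0.25, size=500), 0, 1)

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for ROC AUC (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        # Resample test rows with replacement.
        idx = rng.integers(0, n, size=n)
        # AUC is undefined if a resample contains a single class; skip it.
        if np.unique(y_true[idx]).size < 2:
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

point = roc_auc_score(y_true, y_score)
lo, hi = bootstrap_auc_ci(y_true, y_score)
print(f"AUC = {point:.3f}, 95% bootstrap interval = [{lo:.3f}, {hi:.3f}]")
```

To apply it to the notebook's variables, the helper would need NumPy arrays, for example `bootstrap_auc_ci(y_test.to_numpy(), y_proba)`, because positional indexing on a pandas Series with a non-default index behaves differently.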