This notebook performs an exploratory analysis of the dataset generated in the previous notebook (`join.ipynb`).  
The workflow starts by opening `mh.csv`, which contains a clean, unified, and anonymized dataset derived from clinical records.  

In addition to the exploratory phase, several models are specified and fitted to analyze waiting times and to evaluate possible prediction scenarios.


This block imports the main libraries needed for the analysis and basic visualization of the data.

In [21]:
# Block 1 — Imports
# Load core libraries for data analysis and visualization

import pandas as pd
import matplotlib.pyplot as plt

# Optional: friendlier display in notebook
pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 150)

This block opens `mh.csv`, generated in the join notebook, and converts the columns that hold temporal information to datetime format.

In [22]:
# Block 2 — Load dataset from Join pipeline
# Load the anonymized dataset and parse datetime columns

# Path to the dataset generated in the Join notebook
path = r"C:\Users\wilmerbelza\Documents\Prediction model\mh.csv"

# Load dataset
df = pd.read_csv(path)

# Convert date columns to datetime
dt_candidates = [
    "FechaRegistro", "FechaEgreso", "FechaTriage",
    "FechaPrimeraEvolucion", "FechaPrimeraInterconsulta", "FechaPrimeraEvaluacion"
]
for col in [c for c in dt_candidates if c in df.columns]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

print("Dataset shape:", df.shape)
print("Datetime columns parsed:", [c for c in dt_candidates if c in df.columns])

# Preview first rows
print(df.head(5))

Dataset shape: (46135, 16)
Datetime columns parsed: ['FechaRegistro', 'FechaEgreso', 'FechaTriage', 'FechaPrimeraEvolucion', 'FechaPrimeraInterconsulta', 'FechaPrimeraEvaluacion']
            id_anon  Ingreso     Genero  EdadAtencion       FechaRegistro         FechaTriage  ClasificacionTriage  \
0  6124828209892124        1   Femenino            27 2023-08-17 18:15:08 2023-08-17 18:31:41                    3   
1  9294009b87eac4f1        1  Masculino            26 2023-04-04 12:16:32 2023-04-04 12:21:24                    2   
2  63cd4a4d6b899314        2  Masculino            26 2023-09-22 11:16:51 2023-09-22 11:20:51                    3   
3  8a3c5faefad7b7eb        3  Masculino            26 2023-09-23 11:01:40 2023-09-23 11:03:53                    3   
4  bda4010562c40190        4   Femenino            22 2023-01-04 09:10:34 2023-01-04 09:15:22                    3   

                                      MotivoConsulta FechaPrimeraEvolucion FechaPrimeraInterconsulta FechaPrime

This block shows a summary of the dataset: the column structure, the associated data types, and the number of non-null values in each variable. It also prints descriptive statistics for the numeric columns and the date range of the key timestamps.

In [23]:
# Block 3 — Dataset overview
# Show basic info about columns, dtypes, and non-null counts

df.info()

# Quick descriptive statistics for numeric columns
print("\n=== Descriptive statistics (numeric) ===")
print(df.describe().T)

# Quick descriptive statistics for datetime columns
print("\n=== Date ranges (min/max) ===")
for col in ["FechaRegistro","FechaTriage","FechaPrimeraEvaluacion"]:
    if col in df.columns:
        print(f"{col}: {df[col].min()} → {df[col].max()} (non-null {df[col].notna().mean():.2%})")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46135 entries, 0 to 46134
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   id_anon                    46135 non-null  object        
 1   Ingreso                    46135 non-null  int64         
 2   Genero                     46135 non-null  object        
 3   EdadAtencion               46135 non-null  int64         
 4   FechaRegistro              46135 non-null  datetime64[ns]
 5   FechaTriage                46135 non-null  datetime64[ns]
 6   ClasificacionTriage        46135 non-null  int64         
 7   MotivoConsulta             46135 non-null  object        
 8   FechaPrimeraEvolucion      46113 non-null  datetime64[ns]
 9   FechaPrimeraInterconsulta  23639 non-null  datetime64[ns]
 10  FechaPrimeraEvaluacion     46135 non-null  datetime64[ns]
 11  FechaEgreso                46135 non-null  datetime64[ns]
 12  Sali

This block analyzes the time intervals between registration, triage, and evaluation. It includes descriptive statistics and a preview of the first rows to check that the calculations are coherent.

In [24]:
# Block 4 — Time deltas exploration
# Explore descriptive stats and sample rows for time differences

time_cols = ['min_registro_a_triage','min_triage_a_eval','min_registro_a_eval']

print("=== Time deltas (minutes) descriptive statistics ===")
print(df[time_cols].describe(percentiles=[0.05,0.25,0.5,0.75,0.95,0.99]).T)

# Show first 10 rows with key time columns to inspect coherence
preview_cols = ['id_anon','Ingreso','FechaRegistro','FechaTriage','FechaPrimeraEvaluacion'] + time_cols
print("\n=== Preview of first 10 rows with time deltas ===")
print(df[preview_cols].head(10).to_string(index=False))

=== Time deltas (minutes) descriptive statistics ===
                         count        mean         std       min         5%        25%        50%        75%         95%         99%           max
min_registro_a_triage  46135.0   12.300046   40.603870  0.200000   2.133333   4.750000   8.166667  14.683333   31.450000   53.416667   2898.516667
min_triage_a_eval      46135.0  103.447122  950.345975  0.900000   9.600000  20.783333  40.383333  85.633333  200.838333  306.711000  53574.800000
min_registro_a_eval    46135.0  115.747168  951.155759  2.016667  16.450000  31.416667  52.800000  98.658333  215.333333  326.831667  53589.183333

=== Preview of first 10 rows with time deltas ===
         id_anon  Ingreso       FechaRegistro         FechaTriage FechaPrimeraEvaluacion  min_registro_a_triage  min_triage_a_eval  min_registro_a_eval
6124828209892124        1 2023-08-17 18:15:08 2023-08-17 18:31:41    2023-08-17 20:10:06              16.550000          98.416667           114.966667
9294
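
The deltas in the preview can be reproduced by hand. The sketch below assumes the `min_*` columns are plain timestamp differences expressed in minutes, which is how Block 7 rebuilds `min_triage_a_eval` when it is missing; the join notebook's own computation is not shown here. The timestamps are copied from the first preview row, and `minutes_between` is a hypothetical helper.

```python
import pandas as pd

# One visit, with timestamps copied from the first preview row
visit = pd.DataFrame({
    "FechaRegistro": ["2023-08-17 18:15:08"],
    "FechaTriage": ["2023-08-17 18:31:41"],
    "FechaPrimeraEvaluacion": ["2023-08-17 20:10:06"],
})
for col in list(visit.columns):
    visit[col] = pd.to_datetime(visit[col], errors="coerce")


def minutes_between(start: pd.Series, end: pd.Series) -> pd.Series:
    """Elapsed time from start to end, in minutes (hypothetical helper)."""
    return (end - start).dt.total_seconds() / 60.0


visit["min_registro_a_triage"] = minutes_between(visit["FechaRegistro"], visit["FechaTriage"])
visit["min_triage_a_eval"] = minutes_between(visit["FechaTriage"], visit["FechaPrimeraEvaluacion"])
visit["min_registro_a_eval"] = minutes_between(visit["FechaRegistro"], visit["FechaPrimeraEvaluacion"])

print(visit.filter(like="min_").round(6).to_string(index=False))
```

A useful coherence check is that the first two intervals add up to the third for every row.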

This block analyzes the distribution of the triage classification, presenting absolute and relative frequencies as well as the mean and median age of the patients in each category.

In [25]:
# Block 5 — Triage classification exploration
# Distribution of triage levels and relation with age

# 1) Frequency table
triage_counts = df['ClasificacionTriage'].value_counts(dropna=False).sort_index()
triage_props  = df['ClasificacionTriage'].value_counts(normalize=True, dropna=False).sort_index()

print("=== ClasificacionTriage distribution ===")
print(pd.DataFrame({'count': triage_counts, 'proportion': triage_props}))

# 2) Average age by triage level
if 'EdadAtencion' in df.columns:
    age_by_triage = df.groupby('ClasificacionTriage')['EdadAtencion'].agg(['count','mean','median'])
    print("\n=== Age statistics by ClasificacionTriage ===")
    print(age_by_triage)

=== ClasificacionTriage distribution ===
                     count  proportion
ClasificacionTriage                   
1                      423    0.009169
2                     9392    0.203576
3                    35635    0.772407
4                      657    0.014241
5                       28    0.000607

=== Age statistics by ClasificacionTriage ===
                     count       mean  median
ClasificacionTriage                          
1                      423  53.548463    54.0
2                     9392  54.833901    55.0
3                    35635  44.521734    41.0
4                      657  38.735160    37.0
5                       28  42.892857    39.0


This block examines the times between registration, triage, and evaluation by triage classification, with descriptive statistics for each level.

In [26]:
# Block 6 — Time deltas by triage classification
# Compare delays by triage level

time_cols = ['min_registro_a_triage','min_triage_a_eval','min_registro_a_eval']

# Group by triage and compute descriptive statistics
time_by_triage = df.groupby('ClasificacionTriage')[time_cols].agg(['count','mean','median','quantile'])

# Simplify: show count, mean, median, p90
summary = df.groupby('ClasificacionTriage')[time_cols].agg(
    count=('min_registro_a_eval','count'),
    mean_reg_tri=('min_registro_a_triage','mean'),
    median_reg_tri=('min_registro_a_triage','median'),
    mean_tri_eval=('min_triage_a_eval','mean'),
    median_tri_eval=('min_triage_a_eval','median'),
    mean_reg_eval=('min_registro_a_eval','mean'),
    median_reg_eval=('min_registro_a_eval','median'),
    p90_reg_eval=('min_registro_a_eval', lambda s: s.quantile(0.9))
).reset_index()

print("=== Time deltas by ClasificacionTriage ===")
print(summary)

=== Time deltas by ClasificacionTriage ===
   ClasificacionTriage  count  mean_reg_tri  median_reg_tri  mean_tri_eval  median_tri_eval  mean_reg_eval  median_reg_eval  p90_reg_eval
0                    1    423      8.556738        5.983333      28.686446        21.683333      37.243184        30.166667     61.163333
1                    2   9392     11.631793        7.316667      22.641030        18.850000      34.272823        29.225000     55.950000
2                    3  35635     12.523935        8.433333      81.146308        52.450000      93.670244        64.716667    181.180000
3                    4    657     12.121512        7.983333    2336.070497        96.533333    2348.192009       110.416667   7206.753333
4                    5     28     12.251786       10.250000    4332.394643       834.725000    4344.646429       847.433333  11220.213333
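
The large gap between mean and median for Triage IV and V suggests that a small number of very long records dominate the averages. A minimal sketch with synthetic numbers (invented for illustration, not taken from the dataset) shows how a handful of multi-day values moves the mean while leaving the median almost untouched:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic waits in minutes: 95 ordinary visits plus 5 multi-day records
typical = rng.uniform(30, 150, size=95)
extreme = np.full(5, 30_000.0)
waits = pd.Series(np.concatenate([typical, extreme]))

stats = {
    "mean": waits.mean(),
    "median": waits.median(),
    "p90": waits.quantile(0.9),
}
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

For skewed delays like these, the median and upper percentiles are usually a safer summary than the mean.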


This block analyzes the `target` variable, which indicates whether a Triage III patient was evaluated within the two-hour standard (1 = yes, 0 = no). Because the flag is built as "Triage III and evaluation within 120 minutes", patients at every other triage level are also coded 0.
The overall compliance distribution is shown first, followed by its breakdown by triage level, which reveals initial patterns in the timeliness of care.

In [69]:
# Block 7 — Ensure `target` exists and explore it

import pandas as pd
import numpy as np

def ensure_datetime(df, col):
    if col in df.columns and not np.issubdtype(df[col].dtype, np.datetime64):
        df[col] = pd.to_datetime(df[col], errors='coerce')
    return df

# 1) If needed, compute min_triage_a_eval from datetimes
if 'min_triage_a_eval' not in df.columns:
    needed = {'FechaTriage', 'FechaPrimeraEvaluacion'}
    if needed.issubset(df.columns):
        df = ensure_datetime(df, 'FechaTriage')
        df = ensure_datetime(df, 'FechaPrimeraEvaluacion')
        df['min_triage_a_eval'] = (
            (df['FechaPrimeraEvaluacion'] - df['FechaTriage'])
            .dt.total_seconds() / 60.0
        )
    else:
        raise KeyError(
            "`min_triage_a_eval` is missing and cannot be computed. "
            "Required columns: FechaTriage and FechaPrimeraEvaluacion."
        )

# 2) Create `target` if missing: Triage III and eval ≤ 120 min
if 'target' not in df.columns:
    if 'ClasificacionTriage' not in df.columns:
        raise KeyError("Column `ClasificacionTriage` is required to define the target.")
    df['target'] = (
        (df['ClasificacionTriage'] == 3) &
        (df['min_triage_a_eval'] <= 120)
    ).astype('int8')

# 3) Explore overall distribution
print("=== Overall target distribution (Triage III ≤2h) ===")
print(df['target'].value_counts(dropna=False))

# 4) Crosstab by triage level
if 'ClasificacionTriage' in df.columns:
    target_by_triage = pd.crosstab(df['ClasificacionTriage'], df['target'], margins=True)
    print("\n=== Target (≤2h) by ClasificacionTriage ===")
    print(target_by_triage)

=== Overall target distribution (Triage III ≤2h) ===
target
1    28568
0    17567
Name: count, dtype: int64

=== Target (≤2h) by ClasificacionTriage ===
target                   0      1    All
ClasificacionTriage                     
1                      423      0    423
2                     9392      0   9392
3                     7067  28568  35635
4                      657      0    657
5                       28      0     28
All                  17567  28568  46135
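
Because the flag combines two conditions, its mean over the whole dataset is not the Triage III compliance rate. The toy frame below (invented values) contrasts the Block 7 definition with a hypothetical masked variant, `target_masked`, that is left undefined outside Triage III. The last lines redo the arithmetic with the counts from the crosstab.

```python
import pandas as pd

toy = pd.DataFrame({
    "ClasificacionTriage": [2, 3, 3, 4],
    "min_triage_a_eval": [18.0, 45.0, 170.0, 96.0],
})

is_triage3 = toy["ClasificacionTriage"] == 3
on_time = toy["min_triage_a_eval"] <= 120

# Block 7 definition: every non-Triage III row becomes 0
toy["target"] = (is_triage3 & on_time).astype("int8")

# Hypothetical alternative: NaN outside Triage III, so means skip those rows
toy["target_masked"] = on_time.astype("float").where(is_triage3)

share_all = toy["target"].mean()         # 1 of 4 rows
share_t3 = toy["target_masked"].mean()   # 1 of 2 Triage III rows

# Same arithmetic with the counts reported in the crosstab
compliant, triage3_total, all_rows = 28568, 35635, 46135
rate_within_triage3 = compliant / triage3_total
rate_over_all_rows = compliant / all_rows
print(f"within Triage III: {rate_within_triage3:.4f}, over all rows: {rate_over_all_rows:.4f}")
```

Only the first rate answers the question "how often is the two-hour standard met?"; the second is diluted by the other triage levels.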


This block examines the demographic characteristics of the patients, showing the gender distribution and age statistics by triage level.

In [28]:
# Block 8 — Demographics by triage
# Explore age and gender distribution by triage level

# 1) Gender distribution by triage
print("=== Gender distribution by ClasificacionTriage ===")
gender_by_triage = pd.crosstab(df['ClasificacionTriage'], df['Genero'], margins=True)
print(gender_by_triage)

# 2) Age statistics by triage
if 'EdadAtencion' in df.columns:
    print("\n=== Age statistics by ClasificacionTriage ===")
    age_stats = df.groupby('ClasificacionTriage')['EdadAtencion'].agg(['count','mean','median','min','max'])
    print(age_stats)

=== Gender distribution by ClasificacionTriage ===
Genero               Femenino  Masculino    All
ClasificacionTriage                            
1                         169        254    423
2                        4943       4449   9392
3                       20268      15367  35635
4                         366        291    657
5                          12         16     28
All                     25758      20377  46135

=== Age statistics by ClasificacionTriage ===
                     count       mean  median  min  max
ClasificacionTriage                                    
1                      423  53.548463    54.0   18   99
2                     9392  54.833901    55.0   18  104
3                    35635  44.521734    41.0   18  102
4                      657  38.735160    37.0   18   86
5                       28  42.892857    39.0   21   86
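
Raw counts are hard to compare across levels of very different size, so row shares can help. The sketch below rebuilds the gender table from the counts printed above and normalizes each row; on the full DataFrame, `pd.crosstab(..., normalize="index")` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the gender crosstab (margins omitted)
counts = pd.DataFrame(
    {
        "Femenino": [169, 4943, 20268, 366, 12],
        "Masculino": [254, 4449, 15367, 291, 16],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="ClasificacionTriage"),
)

# Divide each row by its total to get the gender share within each triage level
row_share = counts.div(counts.sum(axis=1), axis=0)
print(row_share.round(3))
```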


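This block compares waiting times between episodes with `target` = 1 (Triage III, at most 120 min from triage to evaluation) and episodes with `target` = 0. The latter group combines Triage III episodes that exceeded 120 minutes with every episode at the other triage levels, so it is not a pure non-compliance group. The block reports the count, the mean and median of each interval (registration→triage, triage→evaluation, and registration→evaluation), and the 90th percentile of registration→evaluation.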

In [70]:
# Block 9 — Time deltas by target (Triage III ≤2h)
# Compare delays between compliant vs non-compliant (target) patients

import pandas as pd
import numpy as np

# Guard: ensure `target` exists (created in Block 7). If not, raise.
if 'target' not in df.columns:
    raise KeyError("Column `target` is missing. Run Block 7 to create it.")

time_cols = ['min_registro_a_triage', 'min_triage_a_eval', 'min_registro_a_eval']

# Sanity: make sure time columns are numeric
for c in time_cols:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')

# Summary by target (0 = non-compliant, 1 = compliant)
summary_simple = (
    df.groupby('target')[time_cols]
      .agg(
          count=('min_registro_a_eval','count'),
          mean_reg_tri=('min_registro_a_triage','mean'),
          median_reg_tri=('min_registro_a_triage','median'),
          mean_tri_eval=('min_triage_a_eval','mean'),
          median_tri_eval=('min_triage_a_eval','median'),
          mean_reg_eval=('min_registro_a_eval','mean'),
          median_reg_eval=('min_registro_a_eval','median'),
          p90_reg_eval=('min_registro_a_eval', lambda s: s.quantile(0.9))
      )
      .reset_index()
      .sort_values('target')
)

print("=== Time deltas by target (0 = no cumple, 1 = cumple Triage III ≤2h) ===")
print(summary_simple)

=== Time deltas by target (0 = no cumple, 1 = cumple Triage III ≤2h) ===
   target  count  mean_reg_tri  median_reg_tri  mean_tri_eval  median_tri_eval  mean_reg_eval  median_reg_eval  p90_reg_eval
0       0  17567     12.096415        7.883333     193.981803        38.300000     206.078218        53.033333        232.84
1       1  28568     12.425263        8.350000      47.775646        40.816667      60.200908        52.750000        107.15


This block crosses the time intervals by `ClasificacionTriage` level and by the `target` flag: 1 = Triage III evaluated within 120 minutes, 0 = otherwise (for Triage III, evaluation after more than 120 minutes; for every other level, always 0). It reports the count, the mean and median of the three key intervals (registration→triage, triage→evaluation, and registration→evaluation), and the 90th percentile of registration→evaluation. Within Triage III, this makes it possible to identify operational differences according to compliance with the two-hour standard.

In [71]:
# Block 10 — Time deltas by (ClasificacionTriage, target)
# Cross analysis of delays by triage level and compliance outcome

import pandas as pd
import numpy as np

# Guards
if 'target' not in df.columns:
    raise KeyError("Column `target` is missing. Run Block 7 to create it.")
if 'ClasificacionTriage' not in df.columns:
    raise KeyError("Column `ClasificacionTriage` is missing.")

time_cols = ['min_registro_a_triage', 'min_triage_a_eval', 'min_registro_a_eval']
for c in time_cols:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors='coerce')

summary_combo = (
    df.groupby(['ClasificacionTriage', 'target'])[time_cols]
      .agg(
          count=('min_registro_a_eval', 'count'),
          mean_reg_tri=('min_registro_a_triage', 'mean'),
          median_reg_tri=('min_registro_a_triage', 'median'),
          mean_tri_eval=('min_triage_a_eval', 'mean'),
          median_tri_eval=('min_triage_a_eval', 'median'),
          mean_reg_eval=('min_registro_a_eval', 'mean'),
          median_reg_eval=('min_registro_a_eval', 'median'),
          p90_reg_eval=('min_registro_a_eval', lambda s: s.quantile(0.9))
      )
      .reset_index()
      .sort_values(['ClasificacionTriage', 'target'])
)

print("=== Time deltas by ClasificacionTriage and target (0=no cumple, 1=cumple) ===")
print(summary_combo)

=== Time deltas by ClasificacionTriage and target (0=no cumple, 1=cumple) ===
   ClasificacionTriage  target  count  mean_reg_tri  median_reg_tri  mean_tri_eval  median_tri_eval  mean_reg_eval  median_reg_eval  p90_reg_eval
0                    1       0    423      8.556738        5.983333      28.686446        21.683333      37.243184        30.166667     61.163333
1                    2       0   9392     11.631793        7.316667      22.641030        18.850000      34.272823        29.225000     55.950000
2                    3       0   7067     12.922815        8.883333     216.045571       167.916667     228.968386       181.233333    278.393333
3                    3       1  28568     12.425263        8.350000      47.775646        40.816667      60.200908        52.750000    107.150000
4                    4       0    657     12.121512        7.983333    2336.070497        96.533333    2348.192009       110.416667   7206.753333
5                    5       0     28     12.2

This block presents a compact summary of the main characteristics of the processed dataset: its total size (rows and columns), the coverage of the key first-evolution, first-interconsultation, and first-evaluation timestamps, the distribution of patients by triage level, the distribution of the Triage III compliance flag (`target`), and a summary of the time intervals in minutes. It closes the exploratory stage before predictive modeling.

In [72]:
# Block 11 — Final exploration summary
# Compact recap of dataset characteristics

import numpy as np

print("\n=== FINAL EXPLORATION SUMMARY ===")

# Size
print(f"Rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")

# Coverage of key variables
def cov(col):
    return df[col].notna().mean() if col in df.columns else np.nan

print("\nCoverage:")
for col in ['FechaPrimeraEvolucion','FechaPrimeraInterconsulta','FechaPrimeraEvaluacion']:
    if col in df.columns:
        print(f" - {col}: {cov(col):.2%}")

# Triage distribution
if 'ClasificacionTriage' in df.columns:
    triage_counts = df['ClasificacionTriage'].value_counts().sort_index()
    print("\nTriage distribution:")
    print(triage_counts)

# Target distribution (≤2h compliance in Triage III)
if 'target' in df.columns:
    target_counts = df['target'].value_counts().sort_index()
    print("\nTarget distribution (0=no cumple, 1=cumple):")
    print(target_counts)

# Time deltas summary
if {'min_registro_a_triage','min_triage_a_eval','min_registro_a_eval'}.issubset(df.columns):
    print("\nTime deltas (minutes):")
    for c in ['min_registro_a_triage','min_triage_a_eval','min_registro_a_eval']:
        s = df[c].dropna()
        print(f"{c}: mean={s.mean():.1f}, median={s.median():.1f}, p90={s.quantile(0.9):.1f}")


=== FINAL EXPLORATION SUMMARY ===
Rows: 46,135
Columns: 17

Coverage:
 - FechaPrimeraEvolucion: 99.95%
 - FechaPrimeraInterconsulta: 51.24%
 - FechaPrimeraEvaluacion: 100.00%

Triage distribution:
ClasificacionTriage
1      423
2     9392
3    35635
4      657
5       28
Name: count, dtype: int64

Target distribution (0=no cumple, 1=cumple):
target
0    17567
1    28568
Name: count, dtype: int64

Time deltas (minutes):
min_registro_a_triage: mean=12.3, median=8.2, p90=23.9
min_triage_a_eval: mean=103.4, median=40.4, p90=154.9
min_registro_a_eval: mean=115.7, median=52.8, p90=168.5


This block redefines the target variable, focusing it on compliance with the Triage III standard: receiving the first medical evaluation within 2 hours of classification. The `target` variable is built only for patients with triage classification 3. The block also selects the predictors and makes a stratified 70/30 train/test split.

In [32]:
# Block 12 — Define target for Triage III compliance (<=120 min)

from sklearn.model_selection import train_test_split

# 1) Keep only Triage III patients
df_triage3 = df[df['ClasificacionTriage'] == 3].copy()

# 2) Create the target: 1 if evaluated within 120 min, 0 otherwise
df_triage3['target'] = (df_triage3['min_triage_a_eval'] <= 120).astype(int)

print("Target distribution (Triage III, 2h compliance):")
print(df_triage3['target'].value_counts(normalize=True))

# 3) Select predictor variables
features = [
    'EdadAtencion',
    'Genero',
    'min_registro_a_triage'
]

X = df_triage3[features]
y = df_triage3['target']

# 4) Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Train shape:", X_train.shape, " Test shape:", X_test.shape)
print("Target distribution in train:", y_train.value_counts(normalize=True))
print("Target distribution in test:", y_test.value_counts(normalize=True))

Target distribution (Triage III, 2h compliance):
target
1    0.801684
0    0.198316
Name: proportion, dtype: float64
Train shape: (24944, 3)  Test shape: (10691, 3)
Target distribution in train: target
1    0.801676
0    0.198324
Name: proportion, dtype: float64
Target distribution in test: target
1    0.801702
0    0.198298
Name: proportion, dtype: float64
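
The near-identical class shares in train and test come from the `stratify=y` argument. A minimal sketch on synthetic labels (an 80/20 mix chosen to resemble the output above; `toy_X` and `toy_y` are hypothetical names) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with an 80/20 class mix
toy_y = pd.Series([1] * 800 + [0] * 200, name="target")
toy_X = pd.DataFrame({"x": np.arange(len(toy_y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

print("train share of class 1:", round(y_tr.mean(), 3))
print("test share of class 1: ", round(y_te.mean(), 3))
```

Without stratification the two shares would drift apart by chance, which matters more when the minority class is small.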


This block defines a preprocessing step that one-hot encodes the categorical variables and standardizes the numeric ones.

In [35]:
# Block 13 — Preprocessing (encode categorical + scale numeric)
# Build a preprocessing pipeline for modeling

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Reuse the same features defined in Block 12
num_features = ['EdadAtencion', 'min_registro_a_triage']
cat_features = ['Genero']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_features),
        ('cat', OneHotEncoder(drop='if_binary', handle_unknown='ignore'), cat_features),
    ],
    remainder='drop'
)

print("Preprocessor ready. Numeric:", num_features, "Categorical:", cat_features)

Preprocessor ready. Numeric: ['EdadAtencion', 'min_registro_a_triage'] Categorical: ['Genero']


This block trains a logistic regression with class balancing inside a Pipeline that handles preprocessing (imputation, scaling, and encoding). It also tunes the decision threshold on the training set to improve the F1 score of class 0 (did not meet the 2-hour standard).

In [38]:
# Block 14 — Logistic Regression (balanced) + robust threshold tuning for class 0
# Self-contained: builds its own preprocessor and handles missing values

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve

# --- Rebuild features and target from Block 12 context (Triage III only) ---
features = ['EdadAtencion', 'Genero', 'min_registro_a_triage']
X_train = X_train[features].copy()
X_test  = X_test[features].copy()
y_train = y_train.astype(int)
y_test  = y_test.astype(int)

# --- Preprocessor: impute + scale numeric, impute + one-hot categorical ---
num_features = ['EdadAtencion', 'min_registro_a_triage']
cat_features = ['Genero']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('ohe', OneHotEncoder(drop='if_binary', handle_unknown='ignore'))
        ]), cat_features),
    ],
    remainder='drop'
)

# --- Model: class_weight balanced to combat 80/20 imbalance ---
log_reg_pipe = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

# Fit
log_reg_pipe.fit(X_train, y_train)

# === Eval with default threshold (0.5) ===
proba_test = log_reg_pipe.predict_proba(X_test)[:, 1]  # P(y=1: cumple ≤2h)
y_pred_default = (proba_test >= 0.5).astype(int)

print("=== Confusion Matrix (LogReg balanced, thr=0.5) ===")
print(confusion_matrix(y_test, y_pred_default))
print("\n=== Classification Report (LogReg balanced, thr=0.5) ===")
print(classification_report(y_test, y_pred_default, digits=3))

# === Robust threshold tuning to improve F1 for class 0 on TRAIN ===
# We tune using p0 = 1 - p1 and y0 = 1 for class 0 labels
proba_train = log_reg_pipe.predict_proba(X_train)[:, 1]
p0_train = 1.0 - proba_train
y0_train = (y_train == 0).astype(int)

prec, rec, thr = precision_recall_curve(y0_train, p0_train)

# precision_recall_curve returns len(thr)=len(prec)-1; compute F1 where valid
with np.errstate(divide='ignore', invalid='ignore'):
    f1_0 = np.where((prec + rec) > 0, 2 * prec * rec / (prec + rec), 0.0)

# Find best threshold index within available thresholds
if thr.size > 0:
    # prec[i] and rec[i] correspond to thr[i]; the final point (precision=1, recall=0)
    # has no threshold, so drop it to align F1 with `thr`
    f1_aligned = f1_0[:-1]
    best_idx = int(np.argmax(f1_aligned))
    best_thr_for_class0 = thr[best_idx]
else:
    best_thr_for_class0 = 0.5  # fallback

print(f"\nBest threshold for class 0 (train): {best_thr_for_class0:.4f}")

# Apply tuned threshold on TEST (predict class 0 if p0 >= thr*)
p0_test = 1.0 - proba_test
y_pred_tuned0 = (p0_test >= best_thr_for_class0).astype(int)  # 1 -> class 0
y_pred_tuned = np.where(y_pred_tuned0 == 1, 0, 1)  # map back to the original 0/1 labels

print("\n=== Confusion Matrix (LogReg balanced, tuned threshold) ===")
print(confusion_matrix(y_test, y_pred_tuned))
print("\n=== Classification Report (LogReg balanced, tuned threshold) ===")
print(classification_report(y_test, y_pred_tuned, digits=3))

=== Confusion Matrix (LogReg balanced, thr=0.5) ===
[[1207  913]
 [4528 4043]]

=== Classification Report (LogReg balanced, thr=0.5) ===
              precision    recall  f1-score   support

           0      0.210     0.569     0.307      2120
           1      0.816     0.472     0.598      8571

    accuracy                          0.491     10691
   macro avg      0.513     0.521     0.453     10691
weighted avg      0.696     0.491     0.540     10691


Best threshold for class 0 (train): 0.4526

=== Confusion Matrix (LogReg balanced, tuned threshold) ===
[[1958  162]
 [7464 1107]]

=== Classification Report (LogReg balanced, tuned threshold) ===
              precision    recall  f1-score   support

           0      0.208     0.924     0.339      2120
           1      0.872     0.129     0.225      8571

    accuracy                          0.287     10691
   macro avg      0.540     0.526     0.282     10691
weighted avg      0.741     0.287     0.248     10691
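
Threshold tuning with `precision_recall_curve` depends on how its three outputs line up, and an off-by-one shift is easy to introduce. The sketch below uses six invented scores to check the alignment: element `i` of the precision and recall arrays belongs to `thr[i]`, and the final (1, 0) point has no threshold. It then cross-checks the selected threshold against `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Invented labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.70])

prec, rec, thr = precision_recall_curve(y_true, scores)

# F1 at every curve point (0 where precision + recall is 0)
f1 = np.divide(2 * prec * rec, prec + rec,
               out=np.zeros_like(prec), where=(prec + rec) > 0)

# Drop the last point so that f1_at_thr[i] belongs to thr[i]
f1_at_thr = f1[:-1]
best = int(np.argmax(f1_at_thr))
best_thr = thr[best]

# Cross-check: recompute F1 by applying the threshold directly
check = f1_score(y_true, (scores >= best_thr).astype(int))

# Shifting the index by one position selects a different, worse threshold
shifted_thr = thr[int(np.argmax(f1[1:]))]
shifted_f1 = f1_score(y_true, (scores >= shifted_thr).astype(int))

print(f"best threshold {best_thr:.2f} -> F1 {check:.3f}")
print(f"shifted threshold {shifted_thr:.2f} -> F1 {shifted_f1:.3f}")
```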



This block trains a Random Forest with class balancing to predict compliance (≤2 hours) in Triage III. Performance is reported with a 0.5 threshold and with a threshold tuned on the training set to maximize the F1 score of class 0 (did not meet the 2-hour standard).

In [39]:
# Block 15 — Random Forest (balanced) + threshold tuning for class 0
# Train a random forest with class weighting and tune threshold to improve class-0 detection

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve

# Rebuild features consistently with Block 12 (Triage III subset already applied)
features = ['EdadAtencion', 'Genero', 'min_registro_a_triage']
X_train_rf = X_train[features].copy()
X_test_rf  = X_test[features].copy()
y_train_rf = y_train.astype(int).copy()
y_test_rf  = y_test.astype(int).copy()

# Preprocessor: imputation + scaling (numeric), imputation + one-hot (categorical)
num_features = ['EdadAtencion', 'min_registro_a_triage']
cat_features = ['Genero']

preprocessor_rf = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('ohe', OneHotEncoder(drop='if_binary', handle_unknown='ignore'))
        ]), cat_features),
    ],
    remainder='drop'
)

# Model: Random Forest with class balancing
rf_pipe = Pipeline(steps=[
    ('prep', preprocessor_rf),
    ('clf', RandomForestClassifier(
        n_estimators=300,
        max_depth=10,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

# Training
rf_pipe.fit(X_train_rf, y_train_rf)

# --- Evaluation with threshold 0.5 ---
proba_test = rf_pipe.predict_proba(X_test_rf)[:, 1]  # P(y=1: cumple ≤2h)
y_pred_default = (proba_test >= 0.5).astype(int)

print("=== Confusion Matrix (RF balanced, thr=0.5) ===")
print(confusion_matrix(y_test_rf, y_pred_default))
print("\n=== Classification Report (RF balanced, thr=0.5) ===")
print(classification_report(y_test_rf, y_pred_default, digits=3))

# --- Threshold tuning to optimize class-0 F1 on TRAIN ---
proba_train = rf_pipe.predict_proba(X_train_rf)[:, 1]
p0_train = 1.0 - proba_train                 # probability of class 0
y0_train = (y_train_rf == 0).astype(int)     # 1 when the true label is class 0

prec, rec, thr = precision_recall_curve(y0_train, p0_train)
with np.errstate(divide='ignore', invalid='ignore'):
    f1_0 = np.where((prec + rec) > 0, 2 * prec * rec / (prec + rec), 0.0)

if thr.size > 0:
    # prec[i] and rec[i] correspond to thr[i]; drop the final (1, 0) point, which has no threshold
    f1_aligned = f1_0[:-1]
    best_idx = int(np.argmax(f1_aligned))
    best_thr_for_class0 = thr[best_idx]
else:
    best_thr_for_class0 = 0.5

print(f"\nBest threshold for class 0 (train): {best_thr_for_class0:.4f}")

# Apply the tuned threshold on TEST: predict class 0 if p0 >= thr*
p0_test = 1.0 - proba_test
y_pred_tuned0 = (p0_test >= best_thr_for_class0).astype(int)   # 1 -> class 0
y_pred_tuned = np.where(y_pred_tuned0 == 1, 0, 1)              # map back to the original labels

print("\n=== Confusion Matrix (RF balanced, tuned threshold) ===")
print(confusion_matrix(y_test_rf, y_pred_tuned))
print("\n=== Classification Report (RF balanced, tuned threshold) ===")
print(classification_report(y_test_rf, y_pred_tuned, digits=3))

=== Confusion Matrix (RF balanced, thr=0.5) ===
[[ 972 1148]
 [3299 5272]]

=== Classification Report (RF balanced, thr=0.5) ===
              precision    recall  f1-score   support

           0      0.228     0.458     0.304      2120
           1      0.821     0.615     0.703      8571

    accuracy                          0.584     10691
   macro avg      0.524     0.537     0.504     10691
weighted avg      0.703     0.584     0.624     10691


Best threshold for class 0 (train): 0.5024

=== Confusion Matrix (RF balanced, tuned threshold) ===
[[ 920 1200]
 [3065 5506]]

=== Classification Report (RF balanced, tuned threshold) ===
              precision    recall  f1-score   support

           0      0.231     0.434     0.301      2120
           1      0.821     0.642     0.721      8571

    accuracy                          0.601     10691
   macro avg      0.526     0.538     0.511     10691
weighted avg      0.704     0.601     0.638     10691
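
The threshold-tuning step depends on how `precision_recall_curve` lines up its outputs: `precision` and `recall` have one more element than `thresholds`, and the final pair (precision 1, recall 0) has no threshold. The sketch below uses a four-row toy example that is not taken from the dataset to show one way of aligning F1 with the thresholds; `y_toy` and `scores_toy` are illustrative names.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Toy labels and scores (illustrative, not from the notebook's data)
y_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

prec, rec, thr = precision_recall_curve(y_toy, scores_toy)

# F1 for every (precision, recall) pair; 0 where both are 0
denom = prec + rec
f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(denom), where=denom > 0)

# Drop the last pair, which has no threshold, so that f1_aligned[i] <-> thr[i]
f1_aligned = f1[:-1]
best_thr = float(thr[int(np.argmax(f1_aligned))])

# Cross-check: recompute F1 from hard predictions at the chosen threshold
f1_check = f1_score(y_toy, (scores_toy >= best_thr).astype(int))
print(len(prec), len(thr), best_thr, round(f1_check, 3))
```

The cross-check at the end is a cheap way to confirm that the selected threshold really produces the F1 value that was maximized.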



This block checks that the required columns exist, that the numeric features contain no null or infinite values, that both classes are present in the target, and that the shapes and indices of X and y match.

In [46]:
# Block 16 — Quick diagnostic check (corrected version)
# This block verifies required columns, shapes, data types, missing values, and index alignment

import numpy as np
import pandas as pd

req_cols = ['EdadAtencion', 'Genero', 'min_registro_a_triage', 'target']
missing = [c for c in req_cols if c not in df_triage3.columns]
print("Missing columns in df_triage3:", missing)

# Check shapes
print("Shapes -> X_train:", X_train.shape, " y_train:", y_train.shape,
      " | X_test:", X_test.shape, " y_test:", y_test.shape)

# Check dtypes and NaN/inf
X_probe = X_train.copy()
print("\nDtypes of X_train:\n", X_probe.dtypes)

num_cols = [c for c in ['EdadAtencion','min_registro_a_triage'] if c in X_probe.columns]
for c in num_cols:
    s = pd.to_numeric(X_probe[c], errors='coerce')
    n_nan = s.isna().sum()
    n_inf = np.isinf(s.to_numpy()).sum()
    print(f"NaN/inf in {c} -> NaN:{n_nan} | Inf:{n_inf}")

# Categorical column
if 'Genero' in X_probe.columns:
    print("\nUnique values in Genero:", X_probe['Genero'].dropna().unique()[:10])

# Target distribution
print("\nDistribution of y_train:")
print(y_train.value_counts(dropna=False))
print("Distribution of y_test:")
print(y_test.value_counts(dropna=False))

# Index alignment (same labels in the same order, not just membership)
ok_index_train = X_train.index.equals(y_train.index)
ok_index_test = X_test.index.equals(y_test.index)
print("\nIndex alignment -> Train:", ok_index_train, " | Test:", ok_index_test)


Missing columns in df_triage3: []
Shapes -> X_train: (24944, 3)  y_train: (24944,)  | X_test: (10691, 3)  y_test: (10691,)

Dtypes of X_train:
 EdadAtencion               int64
Genero                    object
min_registro_a_triage    float64
dtype: object
NaN/inf in EdadAtencion -> NaN:0 | Inf:0
NaN/inf in min_registro_a_triage -> NaN:0 | Inf:0

Unique values in Genero: ['Masculino' 'Femenino']

Distribution of y_train:
target
1    19997
0     4947
Name: count, dtype: int64
Distribution of y_test:
target
1    8571
0    2120
Name: count, dtype: int64

Index alignment -> Train: True  | Test: True
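
Index checks differ in strictness: a membership test with `isin` ignores order and length, whereas `Index.equals` requires the same labels in the same order. The sketch below uses made-up data (`X_demo` and `y_demo` are hypothetical) to show the difference and one way to realign by label.

```python
import pandas as pd

# Hypothetical toy data: same labels, different order
X_demo = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 11, 12])
y_demo = pd.Series([0, 1, 0], index=[12, 10, 11])

membership_ok = bool(X_demo.index.isin(y_demo.index).all())  # ignores order
strict_ok = X_demo.index.equals(y_demo.index)                # labels and order

# Realign y to X by label before any positional operation
y_fixed = y_demo.reindex(X_demo.index)
print(membership_ok, strict_ok, X_demo.index.equals(y_fixed.index))
```

The distinction matters mostly when arrays are later combined by position, for example after calling `.to_numpy()` or `.values`.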


This block builds context variables for Triage III patients: hour of triage, day of the week, a weekend indicator, month, and a congestion measure (`arrivals_60m`, the number of arrivals in the previous 60 minutes). These features are meant to capture operational factors that may influence compliance with the two-hour standard.

In [47]:
# Block 17 — Feature engineering for Triage III (time & congestion)
# Add hour, day-of-week, weekend, month, and arrivals in prior 60 minutes

import pandas as pd
import numpy as np

# Ensure datetime
df_triage3['FechaTriage'] = pd.to_datetime(df_triage3['FechaTriage'], errors='coerce')

# Time-based features
df_triage3['hour']       = df_triage3['FechaTriage'].dt.hour
df_triage3['dow']        = df_triage3['FechaTriage'].dt.dayofweek   # 0=Mon ... 6=Sun
df_triage3['is_weekend'] = (df_triage3['dow'] >= 5).astype(int)
df_triage3['month']      = df_triage3['FechaTriage'].dt.month

# Congestion proxy: arrivals in the previous 60 minutes (excluding current)
df_tmp = df_triage3[['FechaTriage']].copy()
df_tmp = df_tmp.sort_values('FechaTriage').set_index('FechaTriage')
df_tmp['ones'] = 1
df_tmp['arrivals_60m'] = df_tmp['ones'].rolling('60min').sum() - 1
df_tmp['arrivals_60m'] = df_tmp['arrivals_60m'].clip(lower=0).fillna(0)

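# Merge back
# Caution: df_tmp is indexed by FechaTriage, so this join matches every row that
# shares a timestamp. Rows with repeated timestamps are duplicated, which is why
# the shapes printed below are larger than those reported in Block 16.
df_triage3 = df_triage3.join(df_tmp['arrivals_60m'], on='FechaTriage')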

# Feature set with engineered features
features_fe = [
    'EdadAtencion', 'Genero',
    'min_registro_a_triage',     # existing signal
    'hour', 'dow', 'is_weekend', 'month',  # calendar/time
    'arrivals_60m'               # congestion proxy
]

# Align to existing train/test split
X_train_fe = df_triage3.loc[X_train.index, features_fe].copy()
X_test_fe  = df_triage3.loc[X_test.index,  features_fe].copy()
y_train_fe = df_triage3.loc[y_train.index, 'target'].astype(int).copy()
y_test_fe  = df_triage3.loc[y_test.index,  'target'].astype(int).copy()

print("Shapes -> X_train_fe:", X_train_fe.shape, "X_test_fe:", X_test_fe.shape)
print("Preview engineered columns:\n", X_train_fe.head(3))

Shapes -> X_train_fe: (24959, 8) X_test_fe: (10702, 8)
Preview engineered columns:
        EdadAtencion     Genero  min_registro_a_triage  hour  dow  is_weekend  month  arrivals_60m
10330            19  Masculino              11.866667    21    4           0      6           3.0
12759            25   Femenino              11.016667    16    0           0      1           5.0
10463            37   Femenino               3.050000     0    2           0      9           4.0
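
The shapes printed above (24959 and 10702 rows) exceed those of Block 16 (24944 and 10691), which is consistent with the join on `FechaTriage` matching several rows that share a timestamp. As a possible alternative, the sketch below computes the same kind of trailing count and maps it back by the original row labels instead of joining on the timestamp. `arrivals_prior_window` is a hypothetical helper; it assumes the input index is unique, and the toy timestamps are invented.

```python
import pandas as pd

def arrivals_prior_window(ts: pd.Series, window: str = "60min") -> pd.Series:
    """Per row, count the other rows inside the trailing window (t - window, t].

    Hypothetical helper: it returns one value per input row and never joins
    on the timestamp, so repeated timestamps cannot multiply rows.
    """
    valid = ts.dropna().sort_values(kind="stable")
    ones = pd.Series(1.0, index=pd.DatetimeIndex(valid.to_numpy()))
    counts = ones.rolling(window).sum().to_numpy() - 1.0  # exclude the row itself
    out = pd.Series(counts, index=valid.index)            # back to original labels
    return out.reindex(ts.index).fillna(0.0).clip(lower=0.0)

# Toy timestamps (illustrative); two rows share 10:30
ts_demo = pd.Series(
    pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:30",
                    "2024-01-01 10:30", "2024-01-01 12:00"]),
    index=[5, 3, 8, 1],
)
arr_demo = arrivals_prior_window(ts_demo)
print(arr_demo)
```

Rows that share a timestamp are counted in sort order, as in the rolling approach of the block above, so tied rows can receive different counts. Rows with a missing timestamp get 0.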


This block trains a class-balanced Random Forest on the context variables built in the previous block. The model is evaluated first at the standard threshold (0.5) and then at a threshold tuned on the training set to maximize the F1 of class 0 (patients who do not meet the two-hour standard).

In [48]:
# Block 18 — Random Forest (balanced) with engineered features + tuned threshold
# Refit RF using engineered features and compare performance

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
import numpy as np

# Split features into numeric and categorical
num_features = ['EdadAtencion','min_registro_a_triage','hour','dow','is_weekend','month','arrivals_60m']
cat_features = ['Genero']

preprocessor_rf_fe = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('ohe', OneHotEncoder(drop='if_binary', handle_unknown='ignore'))
        ]), cat_features),
    ],
    remainder='drop'
)

rf_pipe_fe = Pipeline(steps=[
    ('prep', preprocessor_rf_fe),
    ('clf', RandomForestClassifier(
        n_estimators=350,
        max_depth=10,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1
    ))
])

# Fit
rf_pipe_fe.fit(X_train_fe, y_train_fe)

# --- thr=0.5 ---
proba_test = rf_pipe_fe.predict_proba(X_test_fe)[:, 1]
y_pred_default = (proba_test >= 0.5).astype(int)

print("=== Confusion Matrix (RF+FE, thr=0.5) ===")
print(confusion_matrix(y_test_fe, y_pred_default))
print("\n=== Classification Report (RF+FE, thr=0.5) ===")
print(classification_report(y_test_fe, y_pred_default, digits=3))

# --- threshold tuning for class 0 (maximize F1_0 on train) ---
proba_train = rf_pipe_fe.predict_proba(X_train_fe)[:, 1]
p0_train = 1.0 - proba_train
y0_train = (y_train_fe == 0).astype(int)

prec, rec, thr = precision_recall_curve(y0_train, p0_train)
with np.errstate(divide='ignore', invalid='ignore'):
    f1_0 = np.where((prec + rec) > 0, 2 * prec * rec / (prec + rec), 0.0)

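if thr.size > 0:
    # prec/rec have one more entry than thr; the last pair (precision=1, recall=0)
    # has no threshold, so drop it to align F1 with thr
    f1_aligned = f1_0[:-1]
    best_idx = int(np.argmax(f1_aligned))
    best_thr_for_class0 = thr[best_idx]
else:
    best_thr_for_class0 = 0.5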

print(f"\nBest threshold for class 0 (train): {best_thr_for_class0:.4f}")

# Apply tuned threshold on TEST
p0_test = 1.0 - proba_test
y_pred_tuned0 = (p0_test >= best_thr_for_class0).astype(int)  # 1 -> class 0
y_pred_tuned = np.where(y_pred_tuned0 == 1, 0, 1)

print("\n=== Confusion Matrix (RF+FE, tuned threshold) ===")
print(confusion_matrix(y_test_fe, y_pred_tuned))
print("\n=== Classification Report (RF+FE, tuned threshold) ===")
print(classification_report(y_test_fe, y_pred_tuned, digits=3))

=== Confusion Matrix (RF+FE, thr=0.5) ===
[[1702  422]
 [2076 6502]]

=== Classification Report (RF+FE, thr=0.5) ===
              precision    recall  f1-score   support

           0      0.451     0.801     0.577      2124
           1      0.939     0.758     0.839      8578

    accuracy                          0.767     10702
   macro avg      0.695     0.780     0.708     10702
weighted avg      0.842     0.767     0.787     10702


Best threshold for class 0 (train): 0.5721

=== Confusion Matrix (RF+FE, tuned threshold) ===
[[1476  648]
 [1477 7101]]

=== Classification Report (RF+FE, tuned threshold) ===
              precision    recall  f1-score   support

           0      0.500     0.695     0.581      2124
           1      0.916     0.828     0.870      8578

    accuracy                          0.801     10702
   macro avg      0.708     0.761     0.726     10702
weighted avg      0.834     0.801     0.813     10702
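
The class-0 figures in the report can be recovered by hand from the confusion matrix, which helps when reading the later blocks where the same formulas are coded explicitly. A small worked example using the tuned-threshold matrix printed above (rows are the true class, columns the predicted class):

```python
# Confusion matrix printed above for RF+FE at the tuned threshold:
# rows = true class (0, 1), columns = predicted class (0, 1)
tn, fp = 1476, 648
fn, tp = 1477, 7101

precision_0 = tn / (tn + fn)   # of the rows predicted as 0, how many are truly 0
recall_0 = tn / (tn + fp)      # of the rows that are truly 0, how many were found
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(round(precision_0, 3), round(recall_0, 3), round(f1_0, 3), round(accuracy, 3))
```

Treating class 0 as the class of interest means that the "true negatives" of the usual binary layout play the role of hits, which is why `tn` appears in the numerators.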



This block trains XGBoost on the context variables (time and congestion) inside a preprocessing Pipeline. Results are reported at threshold 0.5 and at a threshold tuned to maximize the F1 of class 0 (non-compliance with the standard).

In [50]:
# Block 19 — XGBoost with engineered features (no early stopping) + tuned threshold

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve

try:
    from xgboost import XGBClassifier
except ImportError as e:
    raise RuntimeError("XGBoost is not installed. Install it with: pip install xgboost") from e

# Features consistent with Block 18
num_features = ['EdadAtencion','min_registro_a_triage','hour','dow','is_weekend','month','arrivals_60m']
cat_features = ['Genero']

# Preprocessor
preprocessor_xgb = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('ohe', OneHotEncoder(drop='if_binary', handle_unknown='ignore'))
        ]), cat_features),
    ],
    remainder='drop'
)

# XGBoost model (no early stopping)
xgb = XGBClassifier(
    n_estimators=600,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    eval_metric='logloss'
)

xgb_pipe = Pipeline(steps=[
    ('prep', preprocessor_xgb),
    ('clf', xgb)
])

# Training
xgb_pipe.fit(X_train_fe, y_train_fe)

# === Evaluation at threshold 0.5 ===
proba_test = xgb_pipe.predict_proba(X_test_fe)[:, 1]  # P(y=1: meets ≤2h)
y_pred_default = (proba_test >= 0.5).astype(int)

print("=== Confusion Matrix (XGB+FE, thr=0.5) ===")
print(confusion_matrix(y_test_fe, y_pred_default))
print("\n=== Classification Report (XGB+FE, thr=0.5) ===")
print(classification_report(y_test_fe, y_pred_default, digits=3))

# === Tune the threshold to maximize class-0 F1 (does not meet ≤2h) ===
proba_train = xgb_pipe.predict_proba(X_train_fe)[:, 1]
p0_train = 1.0 - proba_train
y0_train = (y_train_fe == 0).astype(int)

prec, rec, thr = precision_recall_curve(y0_train, p0_train)
with np.errstate(divide='ignore', invalid='ignore'):
    f1_0 = np.where((prec + rec) > 0, 2 * prec * rec / (prec + rec), 0.0)

if thr.size > 0:
    # prec/rec have one more entry than thr; the last pair (precision=1, recall=0)
    # has no threshold, so drop it to align F1 with thr
    f1_aligned = f1_0[:-1]
    best_idx = int(np.argmax(f1_aligned))
    best_thr_for_class0 = thr[best_idx]
else:
    best_thr_for_class0 = 0.5

print(f"\nBest threshold for class 0 (train): {best_thr_for_class0:.4f}")

# Apply the tuned threshold on TEST
p0_test = 1.0 - proba_test
y_pred_tuned0 = (p0_test >= best_thr_for_class0).astype(int)  # 1 -> class 0
y_pred_tuned = np.where(y_pred_tuned0 == 1, 0, 1)

print("\n=== Confusion Matrix (XGB+FE, tuned threshold) ===")
print(confusion_matrix(y_test_fe, y_pred_tuned))
print("\n=== Classification Report (XGB+FE, tuned threshold) ===")
print(classification_report(y_test_fe, y_pred_tuned, digits=3))

=== Confusion Matrix (XGB+FE, thr=0.5) ===
[[ 948 1176]
 [ 515 8063]]

=== Classification Report (XGB+FE, thr=0.5) ===
              precision    recall  f1-score   support

           0      0.648     0.446     0.529      2124
           1      0.873     0.940     0.905      8578

    accuracy                          0.842     10702
   macro avg      0.760     0.693     0.717     10702
weighted avg      0.828     0.842     0.830     10702


Best threshold for class 0 (train): 0.3517

=== Confusion Matrix (XGB+FE, tuned threshold) ===
[[1345  779]
 [1033 7545]]

=== Classification Report (XGB+FE, tuned threshold) ===
              precision    recall  f1-score   support

           0      0.566     0.633     0.598      2124
           1      0.906     0.880     0.893      8578

    accuracy                          0.831     10702
   macro avg      0.736     0.756     0.745     10702
weighted avg      0.839     0.831     0.834     10702
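
The tuned threshold is expressed on P(y=0), whereas a cut-off is often quoted on P(y=1). Assuming the two probabilities sum to 1, predicting class 0 when p0 >= t is the same rule as predicting class 1 when p1 > 1 - t, so the 0.3517 printed above corresponds to roughly 0.6483 on P(y=1). A small sketch with made-up probabilities (`p1_demo` is illustrative):

```python
import numpy as np

# Made-up P(y=1) values, chosen away from the cut-off to avoid float ties
p1_demo = np.array([0.10, 0.40, 0.60, 0.70, 0.95])
thr_p0 = 0.3517  # threshold expressed on P(y=0)

# Rule as written in the notebook: predict 0 when p0 >= thr_p0
pred_from_p0 = np.where((1.0 - p1_demo) >= thr_p0, 0, 1)

# Same rule expressed on P(y=1): predict 1 when p1 > 1 - thr_p0
pred_from_p1 = (p1_demo > 1.0 - thr_p0).astype(int)

print(pred_from_p0, pred_from_p1)
```

Keeping track of which probability a threshold refers to avoids misreading it when comparing against a sweep defined on P(y=1).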



None of the models above use the free text in MotivoConsulta, which is recorded at triage.

From Block 21 onward, that text is modelled as topics and added as features to the ML models.

This block builds a comparison table for the Random Forest and XGBoost models trained on the context variables. It also sweeps decision thresholds for XGBoost on the test set to find the operating point that maximizes the F1 of class 0 and to show how precision, recall and F1 of both classes change with the threshold. In this sweep the threshold applies to P(y=1), not to P(y=0) as in the tuned thresholds of Blocks 18 and 19.

In [51]:
# Block 20 — Model comparison (RF vs XGB) + threshold sweep for XGBoost

import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve

# === 1. Compare RF vs XGB at tuned thresholds ===
# Metrics for RF (already computed in Block 18 tuned)
rf_metrics = {
    "Model": "Random Forest + FE (tuned)",
    "Accuracy": 0.801,
    "Recall_0": 0.695,
    "Precision_0": 0.500,
    "F1_0": 0.581,
    "Recall_1": 0.828,
    "Precision_1": 0.916,
    "F1_1": 0.870
}

# Metrics for XGB (already computed in Block 19 tuned)
xgb_metrics = {
    "Model": "XGBoost + FE (tuned)",
    "Accuracy": 0.831,
    "Recall_0": 0.633,
    "Precision_0": 0.566,
    "F1_0": 0.598,
    "Recall_1": 0.880,
    "Precision_1": 0.906,
    "F1_1": 0.893
}

df_compare = pd.DataFrame([rf_metrics, xgb_metrics])
print("=== Model comparison ===")
print(df_compare)

# === 2. Threshold sweep for XGBoost ===
proba_test_xgb = xgb_pipe.predict_proba(X_test_fe)[:, 1]
y_true = y_test_fe

thresholds = np.linspace(0.1, 0.9, 17)  # from 0.1 to 0.9 step 0.05
rows = []

for thr in thresholds:
    y_pred = (proba_test_xgb >= thr).astype(int)
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    prec0 = tn / (tn + fn) if (tn + fn) > 0 else 0
    rec0 = tn / (tn + fp) if (tn + fp) > 0 else 0
    f1_0 = (2*prec0*rec0)/(prec0+rec0) if (prec0+rec0) > 0 else 0
    prec1 = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec1 = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1_1 = (2*prec1*rec1)/(prec1+rec1) if (prec1+rec1) > 0 else 0
    acc = (tn+tp)/(tn+fp+fn+tp)
    
    rows.append({
        "Threshold": thr,
        "Accuracy": acc,
        "Recall_0": rec0,
        "Precision_0": prec0,
        "F1_0": f1_0,
        "Recall_1": rec1,
        "Precision_1": prec1,
        "F1_1": f1_1
    })

df_thr = pd.DataFrame(rows)
print("\n=== Threshold sweep for XGBoost ===")
print(df_thr.round(3).head(10))  # preview first 10 rows

# Show best threshold for F1_0
best_idx = df_thr['F1_0'].idxmax()
best_thr = df_thr.loc[best_idx, 'Threshold']
print(f"\nBest threshold for F1_0: {best_thr:.3f}")
print(df_thr.loc[best_idx])


=== Model comparison ===
                        Model  Accuracy  Recall_0  Precision_0   F1_0  Recall_1  Precision_1   F1_1
0  Random Forest + FE (tuned)     0.801     0.695        0.500  0.581     0.828        0.916  0.870
1        XGBoost + FE (tuned)     0.831     0.633        0.566  0.598     0.880        0.906  0.893

=== Threshold sweep for XGBoost ===
   Threshold  Accuracy  Recall_0  Precision_0   F1_0  Recall_1  Precision_1   F1_1
0       0.10     0.806     0.024        0.912  0.048     0.999        0.805  0.892
1       0.15     0.811     0.058        0.866  0.109     0.998        0.811  0.894
2       0.20     0.818     0.098        0.850  0.176     0.996        0.817  0.897
3       0.25     0.826     0.158        0.821  0.265     0.991        0.826  0.901
4       0.30     0.834     0.219        0.792  0.343     0.986        0.836  0.905
5       0.35     0.837     0.267        0.749  0.394     0.978        0.843  0.906
6       0.40     0.841     0.330        0.716  0.451     
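
The comparison table in this block is typed in from earlier printouts. One possible alternative, sketched here under the assumption that each model's test predictions are still in memory, is to compute every row from the predictions themselves. `summarize` is a hypothetical helper and the arrays are toy data, not results from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(name, y_true, y_pred):
    """Return one comparison-table row computed from hard predictions."""
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=[0, 1], zero_division=0
    )
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Recall_0": r[0], "Precision_0": p[0], "F1_0": f[0],
        "Recall_1": r[1], "Precision_1": p[1], "F1_1": f[1],
    }

# Toy labels and predictions (illustrative only)
y_true_demo = np.array([0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 1, 1, 1, 1, 0])

row_demo = summarize("toy model", y_true_demo, y_pred_demo)
print(pd.DataFrame([row_demo]).round(3))
```

Computing the rows this way keeps the table in step with the models if any block is rerun with different settings.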

This block builds topic features from the text in the MotivoConsulta column. The records are first deduplicated by index to avoid index inconsistencies, then aligned with the previously defined training and test matrices. The texts are represented with TF-IDF, and NMF is applied to that representation to identify ten latent topics. The per-topic weights are added as new predictors to the training and test matrices, extending the features available to the models.

In [57]:
# Block 21 — Topic features from MotivoConsulta (TF-IDF + NMF) — DEDUP & ALIGN

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

assert 'MotivoConsulta' in df_triage3.columns, "MotivoConsulta column is required"
assert isinstance(X_train_fe, pd.DataFrame) and isinstance(X_test_fe, pd.DataFrame), "Run Block 17 first"

# 1) Deduplicate df_triage3 by index (keep first) to guarantee 1:1 mapping by label
df_triage3_dedup = (
    df_triage3[['MotivoConsulta']]
    .astype(str)
    .groupby(level=0, sort=False)
    .nth(0)
)

# 2) Now align EXACTLY to the indices and order of X_train_fe / X_test_fe
idx_train = X_train_fe.index
idx_test  = X_test_fe.index

# Reindex is safe now because df_triage3_dedup has unique index labels
text_train = df_triage3_dedup.reindex(idx_train)['MotivoConsulta'].fillna("").astype(str)
text_test  = df_triage3_dedup.reindex(idx_test)['MotivoConsulta'].fillna("").astype(str)

# Hard checks
assert len(text_train) == len(X_train_fe), f"Train length mismatch: {len(text_train)} vs {len(X_train_fe)}"
assert len(text_test)  == len(X_test_fe),  f"Test length mismatch: {len(text_test)} vs {len(X_test_fe)}"
assert text_train.index.equals(X_train_fe.index)
assert text_test.index.equals(X_test_fe.index)

# 3) TF-IDF (no stop_words='spanish': scikit-learn only ships a built-in English list)
tfidf = TfidfVectorizer(
    lowercase=True,
    strip_accents='unicode',
    min_df=10,
    max_df=0.9,
    ngram_range=(1,2),
    max_features=20000
)
X_tfidf_train = tfidf.fit_transform(text_train)
X_tfidf_test  = tfidf.transform(text_test)

# 4) NMF topics
n_topics = 10
nmf = NMF(n_components=n_topics, init='nndsvd', random_state=42, max_iter=400)
W_train = nmf.fit_transform(X_tfidf_train)
W_test  = nmf.transform(X_tfidf_test)

topic_cols = [f"topic_{i+1}" for i in range(n_topics)]
topic_train_df = pd.DataFrame(W_train, index=X_train_fe.index, columns=topic_cols)
topic_test_df  = pd.DataFrame(W_test,  index=X_test_fe.index,  columns=topic_cols)

# 5) Concatenate with engineered features
X_train_topics = pd.concat([X_train_fe, topic_train_df], axis=1)
X_test_topics  = pd.concat([X_test_fe,  topic_test_df],  axis=1)

print("Aligned shapes OK ->",
      "TF-IDF:", X_tfidf_train.shape, X_tfidf_test.shape,
      "| Topics:", W_train.shape, W_test.shape,
      "| Matrices:", X_train_topics.shape, X_test_topics.shape)

Aligned shapes OK -> TF-IDF: (24959, 14276) (10702, 14276) | Topics: (24959, 10) (10702, 10) | Matrices: (24959, 18) (10702, 18)
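
The fit-on-train, transform-on-test pattern used for the topic features can be seen in isolation on a tiny corpus. The sketch below uses invented short texts (not taken from MotivoConsulta) and two topics instead of ten; all names ending in `_demo` are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (invented short texts, not from MotivoConsulta)
train_texts = [
    "dolor abdominal y vomito",
    "dolor abdominal intenso",
    "trauma en mano con edema",
    "trauma en pie con edema",
    "vomito y dolor abdominal",
    "edema y trauma en mano",
]
test_texts = ["trauma en mano", "dolor abdominal"]

tfidf_demo = TfidfVectorizer(lowercase=True, strip_accents="unicode")
A_train = tfidf_demo.fit_transform(train_texts)   # vocabulary learned on train only
A_test = tfidf_demo.transform(test_texts)         # reuse the train vocabulary

nmf_demo = NMF(n_components=2, init="nndsvd", random_state=42, max_iter=400)
W_train_demo = nmf_demo.fit_transform(A_train)    # document-topic weights (train)
W_test_demo = nmf_demo.transform(A_test)          # project test docs on the same topics

print(W_train_demo.shape, W_test_demo.shape)
```

Fitting both the vectorizer and the factorization on the training texts only keeps information from the test set out of the topic definitions.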


This block inspects the topics derived from MotivoConsulta. It first lists the most representative words of each topic, then computes each topic's prevalence in the training set by assigning every record to its highest-weight topic, and finally prints the texts that contribute most to each topic. No plots are included, to keep the pipeline traceable.

In [65]:
# Block 21A — Topic inspection (no plots)
# Requires Block 21 to have run (tfidf, nmf, topic_train_df)

import numpy as np
import pandas as pd

# --- Guard rails ---
required = ['tfidf', 'nmf', 'topic_train_df', 'X_train_fe', 'df_triage3', 'df_triage3_dedup']
missing = [name for name in required if name not in globals()]
assert not missing, f"Missing objects from Block 21: {missing}"

# 1) Top words per topic
feature_names = np.array(tfidf.get_feature_names_out())

def top_words_per_topic(model, feature_names, topn=12):
    rows = []
    for i, comp in enumerate(model.components_):
        idx = np.argsort(comp)[::-1][:topn]
        words = [feature_names[j] for j in idx]
        rows.append({
            "topic": f"topic_{i+1}",
            "top_words": ", ".join(words)
        })
    return pd.DataFrame(rows)

df_top_words = top_words_per_topic(nmf, feature_names, topn=12)
print("=== Top palabras por tópico ===")
print(df_top_words)

# 2) Topic prevalence (hard assignment by argmax)
W_train = topic_train_df.values  # document-topic weights (train)
topic_idx = W_train.argmax(axis=1)  # winning topic per doc
counts = pd.Series(topic_idx).value_counts().sort_index()
prevalence = (counts / counts.sum()).rename("proportion")
df_prev = pd.DataFrame({
    "topic": [f"topic_{i+1}" for i in range(W_train.shape[1])],
    "count": counts.reindex(range(W_train.shape[1]), fill_value=0).values,
    "proportion": prevalence.reindex(range(W_train.shape[1]), fill_value=0.0).values
})
print("\n=== Prevalencia de tópicos (train) ===")
print(df_prev)

# 3) Show top examples per topic
# Utilities
def truncate(text, n=180):
    t = str(text).replace("\n", " ").strip()
    return (t[:n] + "…") if len(t) > n else t

# Retrieve the train texts aligned row by row with W_train. Block 21 built the topics
# from df_triage3_dedup, so the same source is used here; df_triage3 can carry repeated
# index labels, in which case a label-based .loc lookup would return several rows.
text_train = df_triage3_dedup.reindex(X_train_fe.index)['MotivoConsulta'].fillna("").astype(str)

n_examples = 5  # how many samples to print per topic
print("\n=== Ejemplos por tópico (top contribuciones) ===")
for t in range(W_train.shape[1]):
    topic_name = f"topic_{t+1}"
    # positions of the documents with the largest weight on topic t
    order = np.argsort(W_train[:, t])[::-1][:n_examples]
    print(f"\n-- {topic_name}: ejemplos (n={n_examples}) --")
    for i, pos in enumerate(order, 1):
        idx = X_train_fe.index[pos]
        w = W_train[pos, t]
        txt = truncate(text_train.iloc[pos], 220)
        print(f"{i:>2}. weight={w:.3f} | idx={idx} | {txt}")

=== Top palabras por tópico ===
      topic                                          top_words
0   topic_1  niega_x000d_, por, realiza, se realiza, _x000d...
1   topic_2  refiere, niega_x000d_, el siguiente, siguiente...
2   topic_3  los, covid 19, 19, 15, valoracion, covid, _x00...
3   topic_4  riesgo, riesgo de, _x000d_ riesgo, farmacologi...
4   topic_5  niega _x000d_, niega, por, alergico niega, _x0...
5   topic_6  el, refiere, aproximado paciente, aproximado, ...
6   topic_7  trauma, limitacion, trauma en, mano, la, edema...
7   topic_8  _x000d_ _x000d_, paciente al, ingresa paciente...
8   topic_9  15, valoracion, para, paciente refiere, brinda...
9  topic_10  normales, signos vitales, vitales, con signos,...

=== Prevalencia de tópicos (train) ===
      topic  count  proportion
0   topic_1   4449    0.178252
1   topic_2   2687    0.107657
2   topic_3    704    0.028206
3   topic_4    770    0.030851
4   topic_5   5809    0.232742
5   topic_6   1778    0.071237
6   topic_7   2073
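
Several of the top terms above contain `_x000d_`. Tokens of the form `_xHHHH_` are the escape that Excel's XML format uses for control characters (`_x000D_` is a carriage return), so they are likely export residue rather than clinical vocabulary. A minimal sketch of a cleaning step that could run before TF-IDF follows; `clean_motivo` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches Excel/OOXML control-character escapes such as _x000D_ (any letter case)
_OOXML_ESCAPE = re.compile(r"_x[0-9A-Fa-f]{4}_")

def clean_motivo(text):
    """Replace _xHHHH_ escapes with a space and collapse repeated whitespace."""
    text = _OOXML_ESCAPE.sub(" ", str(text))
    return re.sub(r"\s+", " ", text).strip()

# Invented example strings that imitate the residue seen in the topic words
raw_demo = "niega_x000D_\n refiere dolor"
clean_demo = clean_motivo(raw_demo)
print(clean_demo)
```

Applied to the text column before vectorization, for example with `Series.map(clean_motivo)`, a step like this would keep the escapes from forming their own unigrams and bigrams. Whether the resulting topics improve would need to be checked by rerunning the blocks.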

This block trains an XGBoost model that combines the engineered variables with the topic features derived from MotivoConsulta. The model is evaluated first at the standard threshold of 0.5 and then at a threshold tuned to maximize performance on the non-compliance class (Triage III > 2 hours). The confusion matrices and classification reports are shown, together with a compact comparison of key metrics.

In [58]:
# Block 22 — XGBoost with engineered + topic features (+ tuned threshold)
# Train XGBoost using contextual features plus NMF topic features and compare to previous XGB+FE.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve

from xgboost import XGBClassifier

# --- 1) Guard rails: check matrices with topics exist ---
assert 'topic_1' in X_train_topics.columns, "Topic features not found. Run Block 21 first."
assert set(X_train_fe.index) == set(X_train_topics.index), "Index mismatch between FE and topic matrices (train)."
assert set(X_test_fe.index)  == set(X_test_topics.index),  "Index mismatch between FE and topic matrices (test)."

# --- 2) Define feature lists ---
# Base contextual features (same as Block 18/19)
base_num = ['EdadAtencion','min_registro_a_triage','hour','dow','is_weekend','month','arrivals_60m']
cat_features = ['Genero']

# Topic feature columns
topic_cols = sorted([c for c in X_train_topics.columns if c.startswith('topic_')])

# Final features: numeric includes base_num + topics; categorical keeps 'Genero'
num_features = base_num + topic_cols

# --- 3) Build train/test with topics (aligned indices) ---
Xtr_topics = X_train_topics[cat_features + num_features].copy()
Xte_topics = X_test_topics[cat_features + num_features].copy()
ytr_topics = y_train_fe.copy()
yte_topics = y_test_fe.copy()

print("Shapes -> X_train_topics:", Xtr_topics.shape, " X_test_topics:", Xte_topics.shape)

# --- 4) Preprocessor: impute + scale numeric; impute + OHE categorical ---
preprocessor_xgb_topics = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('ohe', OneHotEncoder(drop='if_binary', handle_unknown='ignore'))
        ]), cat_features),
    ],
    remainder='drop'
)

# --- 5) XGB model (no early stopping to avoid version issues) ---
xgb_topics = XGBClassifier(
    n_estimators=700,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    eval_metric='logloss'
)

xgb_topics_pipe = Pipeline(steps=[
    ('prep', preprocessor_xgb_topics),
    ('clf', xgb_topics)
])

# --- 6) Fit ---
xgb_topics_pipe.fit(Xtr_topics, ytr_topics)

# --- 7) Evaluate @ thr=0.5 ---
proba_test = xgb_topics_pipe.predict_proba(Xte_topics)[:, 1]   # P(y=1: meets ≤2h)
y_pred_default = (proba_test >= 0.5).astype(int)

print("=== Confusion Matrix (XGB+FE+Topics, thr=0.5) ===")
print(confusion_matrix(yte_topics, y_pred_default))
print("\n=== Classification Report (XGB+FE+Topics, thr=0.5) ===")
print(classification_report(yte_topics, y_pred_default, digits=3))

# --- 8) Tune threshold to maximize F1 for class 0 (does not meet ≤2h) ---
proba_train = xgb_topics_pipe.predict_proba(Xtr_topics)[:, 1]
p0_train = 1.0 - proba_train
y0_train = (ytr_topics == 0).astype(int)

prec, rec, thr = precision_recall_curve(y0_train, p0_train)
with np.errstate(divide='ignore', invalid='ignore'):
    f1_0 = np.where((prec + rec) > 0, 2 * prec * rec / (prec + rec), 0.0)

if thr.size > 0:
    # prec/rec have one more entry than thr; the last pair (precision=1, recall=0)
    # has no threshold, so drop it to align F1 with thr
    f1_aligned = f1_0[:-1]
    best_idx = int(np.argmax(f1_aligned))
    best_thr_for_class0 = thr[best_idx]
else:
    best_thr_for_class0 = 0.5

print(f"\nBest threshold for class 0 (train): {best_thr_for_class0:.4f}")

# Apply tuned threshold on TEST
p0_test = 1.0 - proba_test
y_pred_tuned0 = (p0_test >= best_thr_for_class0).astype(int)  # 1 -> class 0
y_pred_tuned = np.where(y_pred_tuned0 == 1, 0, 1)

print("\n=== Confusion Matrix (XGB+FE+Topics, tuned threshold) ===")
print(confusion_matrix(yte_topics, y_pred_tuned))
print("\n=== Classification Report (XGB+FE+Topics, tuned threshold) ===")
print(classification_report(yte_topics, y_pred_tuned, digits=3))

# --- 9) Compact comparison vs XGB+FE (Block 19 tuned) ---
def metrics_from_report(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    prec0 = tn / (tn + fn) if (tn + fn) > 0 else 0
    rec0  = tn / (tn + fp) if (tn + fp) > 0 else 0
    f1_0  = (2*prec0*rec0)/(prec0+rec0) if (prec0+rec0) > 0 else 0
    prec1 = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec1  = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1_1  = (2*prec1*rec1)/(prec1+rec1) if (prec1+rec1) > 0 else 0
    acc   = (tn+tp)/(tn+fp+fn+tp)
    return acc, rec0, prec0, f1_0, rec1, prec1, f1_1

acc_tuned, rec0_tuned, prec0_tuned, f10_tuned, rec1_tuned, prec1_tuned, f11_tuned = metrics_from_report(yte_topics, y_pred_tuned)

comparison = pd.DataFrame([
    {
        "Model": "XGBoost + FE (tuned, Block 19)",
        # Fill in your best Block 19 tuned metrics here if you want to pin them:
        # (optional) If you did not save them, you can reprint them from Block 19.
        "Accuracy": np.nan, "Recall_0": np.nan, "Precision_0": np.nan, "F1_0": np.nan,
        "Recall_1": np.nan, "Precision_1": np.nan, "F1_1": np.nan
    },
    {
        "Model": "XGBoost + FE + Topics (tuned, Block 22)",
        "Accuracy": acc_tuned,
        "Recall_0": rec0_tuned, "Precision_0": prec0_tuned, "F1_0": f10_tuned,
        "Recall_1": rec1_tuned, "Precision_1": prec1_tuned, "F1_1": f11_tuned
    }
])

print("\n=== Compact comparison (tuned) ===")
print(comparison)

Shapes -> X_train_topics: (24959, 18)  X_test_topics: (10702, 18)
=== Confusion Matrix (XGB+FE+Topics, thr=0.5) ===
[[1000 1124]
 [ 411 8167]]

=== Classification Report (XGB+FE+Topics, thr=0.5) ===
              precision    recall  f1-score   support

           0      0.709     0.471     0.566      2124
           1      0.879     0.952     0.914      8578

    accuracy                          0.857     10702
   macro avg      0.794     0.711     0.740     10702
weighted avg      0.845     0.857     0.845     10702


Best threshold for class 0 (train): 0.3927

=== Confusion Matrix (XGB+FE+Topics, tuned threshold) ===
[[1263  861]
 [ 718 7860]]

=== Classification Report (XGB+FE+Topics, tuned threshold) ===
              precision    recall  f1-score   support

           0      0.638     0.595     0.615      2124
           1      0.901     0.916     0.909      8578

    accuracy                          0.852     10702
   macro avg      0.769     0.755     0.762     10702
weighted

This block trains a LightGBM model on the engineered variables plus the topic features derived from MotivoConsulta. Performance is evaluated at the standard threshold (0.5) and at a threshold tuned to maximize F1 for the non-compliance class (Triage III > 2 hours). The compact comparison against XGBoost with topics follows in Block 24.
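
The threshold tuning described above can be sketched without any library. This is a minimal illustration, not the notebook's implementation (which relies on `precision_recall_curve`): the helper names `f1_class0` and `best_threshold_for_class0`, and the toy labels and probabilities, are assumptions made up for the example.

```python
def f1_class0(y_true, p1, thr):
    """F1 for class 0 when class 0 is predicted whenever p0 = 1 - p1 >= thr."""
    tp = fp = fn = 0
    for y, p in zip(y_true, p1):
        pred0 = (1.0 - p) >= thr
        if pred0 and y == 0:
            tp += 1          # correctly flagged as class 0
        elif pred0 and y == 1:
            fp += 1          # flagged as class 0 but actually class 1
        elif not pred0 and y == 0:
            fn += 1          # missed class 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold_for_class0(y_true, p1):
    """Try every observed p0 as a cut-off and keep the one with the highest F1."""
    candidates = sorted({1.0 - p for p in p1})
    return max(candidates, key=lambda t: f1_class0(y_true, p1, t))


# Toy data (hypothetical): 4 non-compliant cases (0) and 6 compliant cases (1)
y_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
p1_toy = [0.1, 0.3, 0.55, 0.6, 0.65, 0.7, 0.8, 0.9, 0.95, 0.58]

thr_toy = best_threshold_for_class0(y_toy, p1_toy)
print("F1_0 at thr=0.5:", round(f1_class0(y_toy, p1_toy, 0.5), 3))
print("Tuned threshold:", round(thr_toy, 3),
      "F1_0:", round(f1_class0(y_toy, p1_toy, thr_toy), 3))
```

On this toy data, lowering the cut-off on P(class 0) recovers the two non-compliant cases that the default 0.5 misses, at the cost of one false alarm. The notebook tunes on training predictions; tuning on a held-out validation split is a common alternative.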

In [60]:
%pip install lightgbm


Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-win_amd64.whl.metadata (17 kB)
Downloading lightgbm-4.6.0-py3-none-win_amd64.whl (1.5 MB)
   ---------------------------------------- 0.0/1.5 MB ? eta -:--:--
   ---------------------------------------- 1.5/1.5 MB 32.9 MB/s  0:00:00
Installing collected packages: lightgbm
Successfully installed lightgbm-4.6.0
Note: you may need to restart the kernel to use updated packages.


In [63]:
# Block 23 — LightGBM with engineered + topic features (no warnings, no wrappers)

import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
from lightgbm import LGBMClassifier

# 1) Make every transformer return a DataFrame with column names
set_config(transform_output="pandas")

# 2) Inputs
assert isinstance(X_train_topics, pd.DataFrame) and isinstance(X_test_topics, pd.DataFrame)
ytr = y_train_fe.astype(int).copy()
yte = y_test_fe.astype(int).copy()

cat_features = ['Genero']
base_num = ['EdadAtencion','min_registro_a_triage','hour','dow','is_weekend','month','arrivals_60m']
topic_cols = sorted([c for c in X_train_topics.columns if c.startswith('topic_')])
num_features = base_num + topic_cols

Xtr = X_train_topics[cat_features + num_features].copy()
Xte = X_test_topics[cat_features + num_features].copy()

# 3) Preprocessing (sklearn now returns DataFrames automatically)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), num_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            # Dense OHE so the encoded columns survive in the DataFrame output
            ('ohe', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse_output=False))
        ]), cat_features),
    ],
    remainder='drop'
)

# 4) Model
lgbm = LGBMClassifier(
    n_estimators=700,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

pipe = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', lgbm)
])

# 5) Train
pipe.fit(Xtr, ytr)

# 6) Evaluate (thr=0.5)
proba_test = pipe.predict_proba(Xte)[:, 1]   # input stays a DataFrame through the pipeline, so no feature-name warnings
y_pred = (proba_test >= 0.5).astype(int)

print("=== Confusion Matrix (LGBM+FE+Topics, thr=0.5) ===")
print(confusion_matrix(yte, y_pred))
print("\n=== Classification Report (LGBM+FE+Topics, thr=0.5) ===")
print(classification_report(yte, y_pred, digits=3))

# 7) Tune the threshold to maximize F1 for class 0
proba_train = pipe.predict_proba(Xtr)[:, 1]
p0_train = 1.0 - proba_train
y0_train = (ytr == 0).astype(int)

prec, rec, thr = precision_recall_curve(y0_train, p0_train)
with np.errstate(divide='ignore', invalid='ignore'):
    f1_0 = np.where((prec + rec) > 0, 2 * prec * rec / (prec + rec), 0.0)
# thr[i] pairs with prec[i] and rec[i]; the final (prec=1, rec=0) point has no threshold, so drop it
best_thr = thr[np.argmax(f1_0[:-1])] if thr.size > 0 else 0.5
print(f"\nBest threshold for class 0 (train): {best_thr:.4f}")

p0_test = 1.0 - proba_test
y_pred_tuned0 = (p0_test >= best_thr).astype(int)
y_pred_tuned = np.where(y_pred_tuned0 == 1, 0, 1)

print("\n=== Confusion Matrix (LGBM+FE+Topics, tuned threshold) ===")
print(confusion_matrix(yte, y_pred_tuned))
print("\n=== Classification Report (LGBM+FE+Topics, tuned threshold) ===")
print(classification_report(yte, y_pred_tuned, digits=3))

[LightGBM] [Info] Number of positive: 20011, number of negative: 4948
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001147 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2966
[LightGBM] [Info] Number of data points in the train set: 24959, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.801755 -> initscore=1.397299
[LightGBM] [Info] Start training from score 1.397299
=== Confusion Matrix (LGBM+FE+Topics, thr=0.5) ===
[[ 986 1138]
 [ 391 8187]]

=== Classification Report (LGBM+FE+Topics, thr=0.5) ===
              precision    recall  f1-score   support

           0      0.716     0.464     0.563      2124
           1      0.878     0.954     0.915      8578

    accuracy                          0.857     10702
   macro avg      0.797     0.709     0.739     10702
weighted avg      0.846     0.857     0.845     10702


Best threshold for class 0 (train): 0.3967

==

This block builds a comparison table of the models evaluated (Random Forest, XGBoost, and LightGBM). Each model is reported at its tuned threshold, so the table shows the relative strengths in precision, recall, and F1 for both the non-compliance class (Triage III > 2 hours) and the compliance class.

In [64]:
# Block 24 — Comparative performance table

import pandas as pd

# Results observed in earlier blocks (filled in by hand from their outputs)
# Note: update these values if you rerun with new thresholds
results = [
    {
        "Model": "Random Forest + FE (tuned, Block 19)",
        "Accuracy": 0.801,
        "Recall_0": 0.695,
        "Precision_0": 0.500,
        "F1_0": 0.581,
        "Recall_1": 0.828,
        "Precision_1": 0.916,
        "F1_1": 0.870
    },
    {
        "Model": "XGBoost + FE + Topics (tuned, Block 22)",
        "Accuracy": 0.852,
        "Recall_0": 0.595,
        "Precision_0": 0.638,
        "F1_0": 0.615,
        "Recall_1": 0.916,
        "Precision_1": 0.901,
        "F1_1": 0.909
    },
    {
        "Model": "LightGBM + FE + Topics (tuned, Block 23)",
        "Accuracy": 0.855,
        "Recall_0": 0.589,
        "Precision_0": 0.648,
        "F1_0": 0.617,
        "Recall_1": 0.921,
        "Precision_1": 0.901,
        "F1_1": 0.910
    }
]

df_comparison = pd.DataFrame(results)
print("=== Final Model Comparison ===")
print(df_comparison)

=== Final Model Comparison ===
                                      Model  Accuracy  Recall_0  Precision_0   F1_0  Recall_1  Precision_1   F1_1
0      Random Forest + FE (tuned, Block 19)     0.801     0.695        0.500  0.581     0.828        0.916  0.870
1   XGBoost + FE + Topics (tuned, Block 22)     0.852     0.595        0.638  0.615     0.916        0.901  0.909
2  LightGBM + FE + Topics (tuned, Block 23)     0.855     0.589        0.648  0.617     0.921        0.901  0.910
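
The hand-typed values in the table can also be derived from the confusion matrices printed earlier, which avoids transcription slips. A minimal sketch, assuming the scikit-learn layout `[[TN, FP], [FN, TP]]` with class 1 as the positive class; `row_from_confusion` is a hypothetical helper, and the matrix is the tuned XGBoost one from the Block 22 output.

```python
def row_from_confusion(model, cm, digits=3):
    """Build one comparison-table row from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm

    def ratio(num, den):
        return num / den if den else 0.0

    def f1(p, r):
        return 2 * p * r / (p + r) if (p + r) else 0.0

    prec0, rec0 = ratio(tn, tn + fn), ratio(tn, tn + fp)   # class 0 as the target
    prec1, rec1 = ratio(tp, tp + fp), ratio(tp, tp + fn)   # class 1 as the target
    row = {
        "Accuracy": ratio(tn + tp, tn + fp + fn + tp),
        "Recall_0": rec0, "Precision_0": prec0, "F1_0": f1(prec0, rec0),
        "Recall_1": rec1, "Precision_1": prec1, "F1_1": f1(prec1, rec1),
    }
    return {"Model": model, **{k: round(v, digits) for k, v in row.items()}}


# Tuned-threshold confusion matrix printed for XGB+FE+Topics
xgb_row = row_from_confusion(
    "XGBoost + FE + Topics (tuned, Block 22)",
    [[1263, 861], [718, 7860]],
)
print(xgb_row)
```

The resulting row matches the XGBoost line of the table, so the same helper could feed `pd.DataFrame` directly once each model's confusion matrix is kept in a variable.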


This block persists the artifacts needed for production inference: (i) the full LightGBM pipeline with its tabular preprocessing, (ii) the operating threshold selected for the non-compliance class, and (iii) the text models (TF-IDF and NMF) used to generate the topic features from MotivoConsulta. A metadata file is also saved with the pipeline version, the expected columns, and a usage example.

In [66]:
# Block 25 — Persist artifacts for production (model, threshold, TF-IDF, NMF, metadata)

import os
import json
import joblib
import pandas as pd
from datetime import datetime

# ====== 0) Preconditions ======
# Assumes you already ran:
# - Block 21 (created `tfidf`, `nmf`, and topic features)
# - Block 23 (trained LightGBM pipeline `pipe` and tuned threshold `best_thr`)
# - You have X_train_topics / X_test_topics and y_train_fe / y_test_fe
assert 'pipe' in globals(), "Missing `pipe` (trained LightGBM pipeline). Run Block 23."
assert 'best_thr' in globals(), "Missing `best_thr` (chosen operating threshold). Run Block 23."
assert 'tfidf' in globals() and 'nmf' in globals(), "Missing TF-IDF / NMF artifacts. Run Block 21."

# ====== 1) Output directory & filenames ======
SAVE_DIR = r"C:\Users\wilmerbelza\Documents\Prediction model\artifacts"
os.makedirs(SAVE_DIR, exist_ok=True)

FN_PIPELINE = os.path.join(SAVE_DIR, "lgbm_topics_pipeline.pkl")
FN_TFIDF    = os.path.join(SAVE_DIR, "tfidf_vectorizer.pkl")
FN_NMF      = os.path.join(SAVE_DIR, "nmf_model.pkl")
FN_META     = os.path.join(SAVE_DIR, "model_metadata.json")

# ====== 2) Collect schema & metadata ======
# Expected columns (order matters for convenience)
cat_features = ['Genero']
base_num = ['EdadAtencion','min_registro_a_triage','hour','dow','is_weekend','month','arrivals_60m']
topic_cols = sorted([c for c in X_train_topics.columns if c.startswith('topic_')])
num_features = base_num + topic_cols
expected_columns = cat_features + num_features

metadata = {
    "model_name": "LightGBM + FE + Topics",
    "version": "1.0.0",
    "created_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "operating_threshold_class0": float(best_thr),  # threshold applied on P(class0) = 1 - proba(class1)
    "target_definition": {
        "name": "target",
        "description": "Triage III cumple ≤ 2h (1) vs no cumple > 2h (0), solo adultos y Triage III."
    },
    "features": {
        "categorical": cat_features,
        "numeric": num_features,
        "topic_features": topic_cols
    },
    "expects_columns_in_order": expected_columns,
    "notes": [
        "Pipeline (`pipe`) contains: ColumnTransformer(impute/scale/OHE) + LightGBM.",
        "TF-IDF and NMF are stored separately to compute topic features from MotivoConsulta on new data.",
        "Operating threshold is on class 0 (no cumplimiento), computed on train to maximize F1_0."
    ],
    "example_inference": {
        "steps": [
            "1) Compute engineered features (hour, dow, is_weekend, month, arrivals_60m).",
            "2) Transform MotivoConsulta -> TF-IDF -> NMF -> topic_1..topic_k with saved artifacts.",
            "3) Build DataFrame with expected columns in order.",
            "4) Call pipeline.predict_proba -> get P(class1).",
            "5) Convert to class with tuned threshold on P(class0)=1-P(class1): class0 if p0>=thr else class1."
        ]
    }
}

# ====== 3) Persist artifacts ======
joblib.dump(pipe, FN_PIPELINE)
joblib.dump(tfidf, FN_TFIDF)
joblib.dump(nmf, FN_NMF)

with open(FN_META, "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)

print("Saved artifacts:")
print(" - Pipeline:", FN_PIPELINE)
print(" - TF-IDF  :", FN_TFIDF)
print(" - NMF     :", FN_NMF)
print(" - Metadata:", FN_META)

# ====== 4) Quick smoke test: reload and score few rows ======
pipe_loaded = joblib.load(FN_PIPELINE)
proba_sample = pipe_loaded.predict_proba(X_test_topics[expected_columns].head(5))[:, 1]
print("\nSmoke test — proba(class1) on 5 samples:", proba_sample)

Saved artifacts:
 - Pipeline: C:\Users\wilmerbelza\Documents\Prediction model\artifacts\lgbm_topics_pipeline.pkl
 - TF-IDF  : C:\Users\wilmerbelza\Documents\Prediction model\artifacts\tfidf_vectorizer.pkl
 - NMF     : C:\Users\wilmerbelza\Documents\Prediction model\artifacts\nmf_model.pkl
 - Metadata: C:\Users\wilmerbelza\Documents\Prediction model\artifacts\model_metadata.json

Smoke test — proba(class1) on 5 samples: [0.99471273 0.99367893 0.48729575 0.85319557 0.64802657]
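
On the loading side, it may help to fail loudly when the metadata file is incomplete rather than fall back to defaults. A minimal sketch using only the standard library; `load_metadata` and `REQUIRED_KEYS` are hypothetical names, the file is written to a temporary directory, and the threshold 0.3967 is the rounded value printed by Block 23, not the exact stored float.

```python
import json
import os
import tempfile

# Keys the inference code cannot work without (assumed minimal set)
REQUIRED_KEYS = ("operating_threshold_class0", "expects_columns_in_order")


def load_metadata(path):
    """Load the metadata JSON and validate the fields used at inference time."""
    with open(path, "r", encoding="utf-8") as f:
        meta = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in meta]
    if missing:
        raise KeyError(f"metadata is missing keys: {missing}")
    thr = float(meta["operating_threshold_class0"])
    if not 0.0 <= thr <= 1.0:
        raise ValueError(f"threshold out of range: {thr}")
    return meta


# Round trip in a throwaway directory so nothing touches the real artifacts
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "model_metadata.json")
    demo = {
        "operating_threshold_class0": 0.3967,
        "expects_columns_in_order": ["Genero", "EdadAtencion", "min_registro_a_triage"],
    }
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(demo, f, ensure_ascii=False, indent=2)
    meta_demo = load_metadata(demo_path)

print(meta_demo["operating_threshold_class0"], meta_demo["expects_columns_in_order"])
```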


This block shows how to load the previously saved artifacts (the LightGBM pipeline, the TF-IDF and NMF models, and the metadata) to run predictions in a production setting. The procedure rebuilds the expected features, including the topic features generated from the MotivoConsulta field, and applies the model with the tuned operating threshold to separate patients who meet the Triage III standard (≤ 2 hours) from those who do not. This keeps the pipeline reproducible and makes the model usable on new clinical data.

In [3]:
# Block 26 — Load artifacts and run inference with tuned operating threshold

import os
import json
import joblib
import numpy as np
import pandas as pd

SAVE_DIR = r"C:\Users\wilmerbelza\Documents\Prediction model\artifacts"
FN_PIPELINE = os.path.join(SAVE_DIR, "lgbm_topics_pipeline.pkl")
FN_TFIDF    = os.path.join(SAVE_DIR, "tfidf_vectorizer.pkl")
FN_NMF      = os.path.join(SAVE_DIR, "nmf_model.pkl")
FN_META     = os.path.join(SAVE_DIR, "model_metadata.json")

# 1) Load artifacts
pipe = joblib.load(FN_PIPELINE)
tfidf = joblib.load(FN_TFIDF)
nmf = joblib.load(FN_NMF)
with open(FN_META, "r", encoding="utf-8") as f:
    meta = json.load(f)

expected_cols = meta["expects_columns_in_order"]
thr = meta["operating_threshold_class0"]

# 2) Suppose you have a new batch with columns:
# ['Genero','EdadAtencion','min_registro_a_triage','hour','dow','is_weekend','month','arrivals_60m','MotivoConsulta']
# You must transform `MotivoConsulta` to topic_i using the saved tfidf+nmf, then assemble the matrix.

def texts_to_topics(motivos: pd.Series, tfidf, nmf, n_topics: int) -> pd.DataFrame:
    """Convert MotivoConsulta series into topic features using saved TF-IDF and NMF."""
    X_tfidf = tfidf.transform(motivos.fillna("").astype(str))
    W = nmf.transform(X_tfidf)
    cols = [f"topic_{i+1}" for i in range(n_topics)]
    return pd.DataFrame(W, index=motivos.index, columns=cols)

# Example: new_df contains raw engineered + text
# new_df = pd.DataFrame({...})
# topics = texts_to_topics(new_df['MotivoConsulta'], tfidf, nmf, n_topics=len([c for c in expected_cols if c.startswith('topic_')]))
# X_new = pd.concat([new_df.drop(columns=['MotivoConsulta']), topics], axis=1)
# X_new = X_new[expected_cols]  # ensure column order

def predict_with_operating_point(X_new: pd.DataFrame, pipeline, thr_class0: float) -> pd.DataFrame:
    """Return proba, class using tuned threshold on class0 (p0 = 1 - p1)."""
    proba1 = pipeline.predict_proba(X_new)[:, 1]
    p0 = 1.0 - proba1
    yhat = np.where(p0 >= thr_class0, 0, 1)  # 0: no cumple, 1: cumple
    out = pd.DataFrame({
        "proba_cumple_<=2h": proba1,
        "proba_no_cumple": p0,
        "pred_clase": yhat
    }, index=X_new.index)
    return out

# Example usage:
# preds = predict_with_operating_point(X_new, pipe, thr)
# print(preds.head())
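
To make the operating-point rule concrete, the sketch below applies it to the five probabilities printed by the Block 25 smoke test. It is plain Python rather than the pandas version above; `classify_with_operating_point` is a hypothetical name, and 0.3967 is the rounded threshold from the Block 23 output.

```python
def classify_with_operating_point(proba1, thr_class0):
    """Return 0 (no cumple) when P(class 0) = 1 - P(class 1) reaches the threshold, else 1 (cumple)."""
    return [0 if (1.0 - p1) >= thr_class0 else 1 for p1 in proba1]


# P(class 1) values copied from the smoke-test output
smoke_proba1 = [0.99471273, 0.99367893, 0.48729575, 0.85319557, 0.64802657]
labels = classify_with_operating_point(smoke_proba1, thr_class0=0.3967)
print(labels)
```

Only the third sample is flagged as non-compliant, since its P(class 0) of about 0.513 is the only one at or above the threshold.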

This block receives from the Digital Twin (FlexSim) the Registro-to-Triage time in hh:mm:ss format and, optionally, other variables (age, gender, calendar, and chief complaint).
It converts the time to minutes, builds a record with the columns expected by the pipeline persisted in Block 25, and runs the prediction online: the probability of meeting the Triage III ≤ 2 hours standard and the class assigned by the tuned operating threshold.
If MotivoConsulta is not provided but the model expects topic features, they are filled with 0 (the prediction is still valid but less accurate).
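
The fill rule for absent inputs can be sketched on a plain dictionary before it becomes a DataFrame row. This is an illustration only: `build_row` is a hypothetical helper, and the column list is shortened to two topic columns for brevity.

```python
import math


def build_row(inputs, expected_cols):
    """Order the inputs as the pipeline expects; absent topics become 0.0, other absent fields NaN."""
    row = {}
    for col in expected_cols:
        if col in inputs:
            row[col] = inputs[col]
        elif col.startswith("topic_"):
            row[col] = 0.0        # no MotivoConsulta: neutral topic weights
        else:
            row[col] = math.nan   # left for the pipeline's imputer to fill
    return row


# Shortened, illustrative column list (the real one comes from the metadata file)
expected_demo = ["Genero", "EdadAtencion", "min_registro_a_triage",
                 "hour", "dow", "topic_1", "topic_2"]
row_demo = build_row(
    {"Genero": "Masculino", "EdadAtencion": 33, "min_registro_a_triage": 49.92},
    expected_demo,
)
print(row_demo)
```

Leaving the calendar fields as NaN means the median imputer inside the saved pipeline supplies their values, so the prediction reflects a typical arrival time rather than the actual one.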

In [8]:
# Block 27 — Online inference with interactive input (Registro→Triage in hh:mm:ss)

import os
import json
import joblib
import numpy as np
import pandas as pd
import warnings
from sklearn.exceptions import DataConversionWarning

# ---------- 1) Helper: hh:mm:ss → minutes ----------
def hhmmss_to_minutes(hhmmss: str) -> float:
    h, m, s = hhmmss.strip().split(":")
    return int(h) * 60 + int(m) + int(s) / 60.0

# ---------- 2) INPUTS FROM USER ----------
tt_reg_tri_hhmmss = input("Ingrese el tiempo Registro→Triage en formato hh:mm:ss: ")
edad_opt   = int(input("Edad del paciente: "))
genero_opt = input("Género del paciente (Femenino/Masculino): ")

# ---------- 3) Load artifacts ----------
SAVE_DIR   = r"C:\Users\wilmerbelza\Documents\Prediction model\artifacts"
FN_PIPE    = os.path.join(SAVE_DIR, "lgbm_topics_pipeline.pkl")
FN_TFIDF   = os.path.join(SAVE_DIR, "tfidf_vectorizer.pkl")
FN_NMF     = os.path.join(SAVE_DIR, "nmf_model.pkl")
FN_META    = os.path.join(SAVE_DIR, "model_metadata.json")

pipe  = joblib.load(FN_PIPE)
tfidf = joblib.load(FN_TFIDF)
nmf   = joblib.load(FN_NMF)
with open(FN_META, "r", encoding="utf-8") as f:
    meta = json.load(f)

expected_cols = meta.get("expects_columns_in_order", [])
thr_class0    = float(meta.get("operating_threshold_class0", 0.5))

# ---------- 4) Build base row ----------
min_registro_a_triage_val = hhmmss_to_minutes(tt_reg_tri_hhmmss)
base_row = {
    'Genero': genero_opt,
    'EdadAtencion': edad_opt,
    'min_registro_a_triage': min_registro_a_triage_val,
}
X_new = pd.DataFrame([base_row])

# Add missing columns (topics filled with 0, everything else with NaN)
for c in expected_cols:
    if c not in X_new.columns:
        X_new[c] = 0.0 if c.startswith("topic_") else np.nan
X_new = X_new[expected_cols]

# ---------- 5) Fix dtypes ----------
if 'Genero' in X_new.columns:
    X_new['Genero'] = X_new['Genero'].astype('category')

for c in X_new.columns:
    if c != 'Genero':
        X_new[c] = pd.to_numeric(X_new[c], errors='coerce')

# ---------- 6) Predict ----------
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="X does not have valid feature names, but .* was fitted with feature names"
    )
    warnings.filterwarnings("ignore", category=DataConversionWarning)
    proba1 = pipe.predict_proba(X_new)[:, 1]

p0     = 1.0 - proba1
yhat0  = (p0 >= thr_class0).astype(int)
yhat   = np.where(yhat0 == 1, 0, 1)

print("\n=== ONLINE PREDICTION (Triage III ≤ 2 horas) ===")
print(f"Input Digital Twin — Registro→Triage: {tt_reg_tri_hhmmss}  ->  {min_registro_a_triage_val:.2f} min")
print(f"Edad: {edad_opt} | Género: {genero_opt}")
print(f"Umbral operativo (class 0): {thr_class0:.3f}")
print(f"Prob. CUMPLE ≤2h (clase 1): {proba1[0]:.3f}")
print(f"Predicción: {'CUMPLE (1)' if yhat[0]==1 else 'NO CUMPLE (0)'}")

Ingrese el tiempo Registro→Triage en formato hh:mm:ss:  00:49:55
Edad del paciente:  33
Género del paciente (Femenino/Masculino):  Masculino



=== ONLINE PREDICTION (Triage III ≤ 2 horas) ===
Input Digital Twin — Registro→Triage: 00:49:55  ->  49.92 min
Edad: 33 | Género: Masculino
Umbral operativo (class 0): 0.397
Prob. CUMPLE ≤2h (clase 1): 0.951
Predicción: CUMPLE (1)
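
Since the time string arrives from an external system, a stricter parser may be worth considering. The variant below is a sketch, not part of the notebook: `hhmmss_to_minutes_checked` is a hypothetical name, and rejecting minutes or seconds outside 0 to 59 is an assumption about what FlexSim emits.

```python
def hhmmss_to_minutes_checked(hhmmss: str) -> float:
    """Convert 'hh:mm:ss' to minutes, raising ValueError on malformed input."""
    parts = hhmmss.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected hh:mm:ss, got {hhmmss!r}")
    h, m, s = (int(p) for p in parts)   # int() itself raises ValueError on non-digits
    if h < 0 or not 0 <= m < 60 or not 0 <= s < 60:
        raise ValueError(f"field out of range in {hhmmss!r}")
    return h * 60 + m + s / 60.0


minutes_demo = hhmmss_to_minutes_checked("00:49:55")
print(f"{minutes_demo:.2f} min")
```

For the input used in the run above, this gives the same 49.92 minutes, while a string such as `49:55` is rejected instead of failing later with an unpacking error.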
