# Réplica de resultados: 
- Publicación: Explainable LightGBM Approach for Predicting Myocardial Infarction Mortality
- Autores: A. Garcés, R. Malaquias, R. Romero
- Referencia: https://www.semanticscholar.org/paper/Explainable-LightGBM-Approach-for-Predicting-Vicente-Junior/3c3fcdc079b4b94b11f0ddf80c3140fed936a39f

## Carga de datos

In [1]:
# Código obtenido desde https://archive.ics.uci.edu/dataset/579/myocardial+infarction+complications

import numpy as np
import pandas as pd
from ucimlrepo import fetch_ucirepo 
  
# Extraer dataset
myocardial_infarction_complications = fetch_ucirepo(id=579);

# Data (como DataFrames de Pandas) 
X = myocardial_infarction_complications.data.features 
y = myocardial_infarction_complications.data.targets 

In [2]:
X_1 = X.copy()

# Definición de target: LET_IS==0 -> Paciente vive, LET_IS!=0 -> Paciente fallece
y_1 = y['LET_IS']!=0

## Selección de variables
Se eliminan variables con más de 10% de datos faltantes

In [3]:
columns_to_delete = X_1.isna().mean()
columns_to_delete = X_1.columns[columns_to_delete>0.1]

X_1.drop(columns=columns_to_delete, inplace=True)
X_1.shape

(1700, 94)

Se eliminan datos con prevalencia de clase dominante mayor a 95%. Deben quedar 61 variables

In [4]:
columns_to_delete = []
for col in X_1.columns:
    max_freq = X_1[col].fillna(-1).value_counts(normalize=True).values[0]
    if max_freq > 0.95:
        columns_to_delete.append(col)

X_1.drop(columns=columns_to_delete, inplace=True)
X_1.shape

(1700, 61)

In [5]:
y_1.value_counts()

LET_IS
False    1429
True      271
Name: count, dtype: int64

## Transformación de variables

### Variables categóricas

In [6]:
myocardial_infarction_complications.variables['type'].value_counts()

type
Binary         89
Categorical    17
Integer        11
Continuous      7
Name: count, dtype: int64

In [7]:
cat_vars = myocardial_infarction_complications.variables.query('type=="Categorical"')
print(cat_vars[['name','description']].values)

[['INF_ANAM'
  'Quantity of myocardial infarctions in the anamnesis. \n\n0: zero\n\n1: one\n\n2: two\n\n3: three and more']
 ['STENOK_AN'
  'Exertional angina pectoris in the anamnesis. \n\n0: never\n\n1: during the last year \n\n2: one year ago\n\n3: two years ago\n\n4: three years ago\n\n5: 4-5 years ago']
 ['FK_STENOK'
  'Functional class (FC) of angina pectoris in the last year. \n\n0: there is no angina pectoris\n\n1: I FC\n\n2: II FC\n\n3: III FC\n\n4: IV FC']
 ['IBS_POST'
  'Coronary heart disease (CHD) in recent weeks, days before admission to hospital \n\n0: none\n\n1: exertional angina pectoris\n\n2: unstable angina pectoris']
 ['GB'
  'Presence of an essential hypertension \n\n0: there is no essential hypertension \n\n1: Stage 1 \n\n2: Stage 2\n\n3: Stage 3']
 ['DLIT_AG'
  'there was no arterial hypertension\n\n1: one year\n\n2: two years\n\n3: three years\n\n4: four years\n\n5: five years\n\n6: 6-10 years\n\n7: more than 10 years']
 ['ZSN_A'
  'Presence of chronic Heart fai

A las variables categóricas nominales se les aplica One Hot Encoding

In [8]:
from sklearn.preprocessing import OneHotEncoder

nominal_cat_vars = ['FK_STENOK','DLIT_AG','ZSN_A','ant_im','lat_im','inf_im','post_im']
nominal_cat_vars = [x for x in nominal_cat_vars if x in X_1.columns]

ohe = OneHotEncoder()
ohe_data = ohe.fit_transform(X_1[nominal_cat_vars]).toarray()
ohe_df = pd.DataFrame(ohe_data, columns=ohe.get_feature_names_out())

ohe_df.head()

X_1 = pd.concat([X_1, ohe_df], axis=1)
X_1.drop(columns=nominal_cat_vars, inplace=True)

### Estandarización de variables

In [9]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_1 = pd.DataFrame(scaler.fit_transform(X_1), columns=X_1.columns)

### Imputación de nulos
Categóricas con la moda, numéricas con la mediana

In [10]:
cat_vars = [x for x in cat_vars if x in X_1.columns]

for col in X_1.columns:
    if col in cat_vars:
        mode = X_1[col].mode().values[0]
        X_1[col] = X_1[col].fillna(mode)
    else:
        median = X_1[col].median()
        X_1[col] = X_1[col].fillna(median)


### División Entrenamiento / Prueba
- Tamaño de test: 20%.
- Semilla aleatoria no mencionada en el paper.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X_1, y_1, test_size=0.2, random_state=1)

### Undersampling
- Se hace undersampling del 50%.
- Semilla aleatoria no mencionada en el paper.

In [12]:
alpha = 0.5

X_train_0 = X_train[y_train==0]
X_train_0.sample(int(alpha*X_train_0.shape[0]), random_state=1)

X_train_1 = X_train[y_train==1]

X_train_sampled = pd.concat([X_train_0, X_train_1], axis=0).sort_index()

y_train_sampled = y_train.loc[X_train_sampled.index.tolist()]

## Selección de variables
- Top K variables según correlación

In [13]:
def select_top_k_features(data, target, k=50):
    correlations = data.corrwith(target).abs().sort_values(ascending=False)
    top_k = correlations.head(k).index.tolist()
    return data[top_k]

## Entrenamiento del modelo
Se utilizan los parámetros obtenidos mediante grid-search.

In [14]:
from sklearn.ensemble import RandomForestClassifier

model =RandomForestClassifier(
    max_features=None,
    min_samples_leaf=4,
    min_samples_split=2,
    criterion='entropy',
    random_state=123,
    n_jobs=-1
)

X_sel = select_top_k_features(X_train_sampled, y_train_sampled, k=50)
model.fit(X_sel, y_train_sampled)

## Resultados

In [15]:
from sklearn.model_selection import cross_val_score

wF1_score = cross_val_score(model, X_sel, y_train_sampled, n_jobs=-1, cv=10, scoring='f1_weighted')
precision_score = cross_val_score(model, X_sel, y_train_sampled, n_jobs=-1, cv=10, scoring='precision_weighted')
recall_score = cross_val_score(model, X_sel, y_train_sampled, n_jobs=-1, cv=10, scoring='recall_weighted')
accuracy_score = cross_val_score(model, X_sel, y_train_sampled, n_jobs=-1, cv=10, scoring='accuracy')

print(f'F1 ponderado: {np.round(wF1_score.mean(), 3)}. Resultado Paper: 0.900')
print(f'Precisión: {np.round(precision_score.mean(), 3)}. Resultado Paper: 0.899')
print(f'Recall: {np.round(recall_score.mean(), 3)}. Resultado Paper: 0.903')
print(f'Accuracy: {np.round(accuracy_score.mean(), 3)}. Resultado Paper: 0.903')

F1 ponderado: 0.843. Resultado Paper: 0.900
Precisión: 0.871. Resultado Paper: 0.899
Recall: 0.871. Resultado Paper: 0.903
Accuracy: 0.871. Resultado Paper: 0.903
