## Heart Attack

In [4]:
!pip install pycaret


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import joblib
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from pycaret.classification import setup, compare_models, save_model
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [3]:
# Load the data
df = pd.read_csv("ai-dl.csv")

About this dataset
* age : age of the patient
* sex : sex of the patient
* exng : exercise-induced angina (1 = yes; 0 = no)
* caa : number of major vessels (0-3)
* cp : chest pain type (the values below follow the original documentation; in this file the column ranges from 0 to 3)
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
* trtbps : resting blood pressure (in mm Hg)
* chol : cholesterol in mg/dl fetched via BMI sensor
* fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg : resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
* thalachh : maximum heart rate achieved
* output : 0 = less chance of heart attack; 1 = more chance of heart attack

## 1. Business understanding

- Define the business objective according to what you consider most useful given the chosen data.

- Define the analytical objective.


**Business objective:**
The business objective is to reduce the risk of heart attacks by implementing a system that identifies, in time, the patients most likely to suffer one. This makes it possible to prioritize medical care, optimize hospital resources, and improve patients' quality of life through early intervention.

**Analytical objective in detail:**
Within the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology, the analytical objective is to minimize false negatives, that is, cases in which a patient is wrongly classified as "not at risk of a heart attack" when they actually are at risk. This is achieved by focusing on the Recall metric, the share of truly at-risk patients that the model flags: TP / (TP + FN).

**Reason for this choice:**

In a medical context a false negative has critical implications, because it means failing to identify a person who needs medical intervention, which can have fatal consequences. For example, a patient who is not treated in time could suffer serious complications or even die.

Prioritizing Recall pushes the model to identify as many at-risk patients as possible, even if that produces more false positives. In this case it is preferable to classify a patient as "at risk" (even if they are not) than to leave unattended someone who really needs urgent care.

**Benefits of the approach:**

- It gives broader coverage of at-risk patients, reducing the probability of undetected cases.
- It increases confidence in the system as a complementary tool for health professionals.
- It is aligned with the mission of saving lives and improving health services, which are essential goals in any medical context.
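
To make the Recall focus concrete, the following is a minimal sketch that computes recall and precision from confusion-matrix counts. The helper name `recall_precision` is an assumption made for this example; the counts are the ones printed for the Ridge classifier on the test set further down in this notebook, read in scikit-learn's `[[TN, FP], [FN, TP]]` layout.

```python
def recall_precision(tn, fp, fn, tp):
    """Return (recall, precision) for the positive class."""
    recall = tp / (tp + fn)      # share of at-risk patients that were caught
    precision = tp / (tp + fp)   # share of "at risk" alerts that were correct
    return recall, precision

# Counts in scikit-learn's confusion_matrix layout: [[TN, FP], [FN, TP]]
recall, precision = recall_precision(tn=8, fp=6, fn=2, tp=15)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

A model can trade precision for recall, which is why both values are worth reading together when false negatives are the costlier error.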

In [None]:
df.describe()

| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


There are no null values in d

In [None]:
df.shape

(303, 14)

In [None]:
df.dtypes

| Column | dtype |
|---|---|
| age | int64 |
| sex | int64 |
| cp | int64 |
| trtbps | int64 |
| chol | int64 |
| fbs | int64 |
| restecg | int64 |
| thalachh | int64 |
| exng | int64 |
| oldpeak | float64 |


## 2. Data preparation

Most common transformations at this stage
- Encoding of categorical variables (OHE): a technique that turns categorical variables into numeric representations so that machine learning algorithms can process them, since most do not work directly with text or categories.
- Scaling of numeric variables: the process of normalizing or standardizing numeric variables so that they all share the same scale, preventing some of them from dominating model training because of their range of values.
- Class balancing (not applicable here): the process of adjusting the proportion of classes in the dataset, particularly when there is a significant imbalance between them (for example, 90% class 0 and 10% class 1).
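
The first two transformations can be combined in a single preprocessing step. The sketch below is one possible way to do it with scikit-learn's `ColumnTransformer`, not the approach used later in this notebook (which relies on `pd.get_dummies` and PyCaret's `normalize=True`). The toy rows are made-up values; only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (made-up values) with one categorical and two numeric columns
toy = pd.DataFrame({
    "cp": [0, 1, 2, 3, 1, 0],
    "trtbps": [120, 130, 140, 150, 125, 135],
    "chol": [200, 240, 260, 300, 220, 210],
})

preprocess = ColumnTransformer(
    transformers=[
        # Four categories minus the dropped first one give three dummy columns
        ("ohe", OneHotEncoder(drop="first"), ["cp"]),
        # Zero mean and unit variance for each numeric column
        ("scale", StandardScaler(), ["trtbps", "chol"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocess.fit_transform(toy)
print(encoded.shape)
```

Keeping both steps inside one transformer makes it easy to place them in a `Pipeline`, so that the scaler is fitted on training rows only.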

In [None]:
# Check for null values
df.isna().sum()

| Column | Null count |
|---|---|
| age | 0 |
| sex | 0 |
| cp | 0 |
| trtbps | 0 |
| chol | 0 |
| fbs | 0 |
| restecg | 0 |
| thalachh | 0 |
| exng | 0 |
| oldpeak | 0 |


In [None]:
# Check class balance
df.output.value_counts()

| output | count |
|---|---|
| 1 | 165 |
| 0 | 138 |
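
As a quick check of why class balancing is treated as not applicable, this small sketch turns the two counts above into class shares and a majority-to-minority ratio. The variable names are assumptions for the example.

```python
# Counts reported by df.output.value_counts()
counts = {1: 165, 0: 138}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares, round(imbalance_ratio, 2))
```

A ratio close to 1 indicates that the two classes are roughly balanced, far from the 90%/10% situation where resampling is usually considered.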


In [None]:
# One-hot encode the variables that are categorical in nature
variables_naturaleza_categorica = ['cp','restecg','thall','slp']
df = pd.get_dummies(df, columns=variables_naturaleza_categorica, drop_first=True)

## 3. Modeling

### **PyCaret**

In [None]:
clf_setup = setup(
    data=df,
    target='output',
    preprocess=True,
    train_size=0.9,   # Train on 90% of the data, since the dataset is small
    normalize=True,
    session_id=42     # Fixed seed; matches the Session id shown in the setup summary
)

best_model = compare_models(sort='Accuracy')

| | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | output |
| 2 | Target type | Binary |
| 3 | Original data shape | (303, 20) |
| 4 | Transformed data shape | (303, 20) |
| 5 | Transformed train set shape | (272, 20) |
| 6 | Transformed test set shape | (31, 20) |
| 7 | Numeric features | 9 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| ridge | Ridge Classifier | 0.8423 | 0.9098 | 0.8919 | 0.8409 | 0.8621 | 0.678 | 0.6885 | 0.069 |
| lda | Linear Discriminant Analysis | 0.8423 | 0.9093 | 0.8919 | 0.8409 | 0.8621 | 0.678 | 0.6885 | 0.048 |
| lr | Logistic Regression | 0.8313 | 0.9032 | 0.8719 | 0.8356 | 0.8499 | 0.6571 | 0.6661 | 0.05 |
| nb | Naive Bayes | 0.8272 | 0.8797 | 0.871 | 0.835 | 0.8466 | 0.6471 | 0.6608 | 0.045 |
| knn | K Neighbors Classifier | 0.8204 | 0.8796 | 0.8438 | 0.8398 | 0.8357 | 0.6356 | 0.6446 | 0.048 |
| rf | Random Forest Classifier | 0.805 | 0.8986 | 0.8181 | 0.8338 | 0.8211 | 0.6065 | 0.6149 | 0.667 |
| ada | Ada Boost Classifier | 0.8049 | 0.8741 | 0.8376 | 0.8155 | 0.8246 | 0.6035 | 0.6072 | 0.158 |
| et | Extra Trees Classifier | 0.798 | 0.8903 | 0.8167 | 0.8277 | 0.815 | 0.5902 | 0.6022 | 0.184 |
| lightgbm | Light Gradient Boosting Machine | 0.7942 | 0.8846 | 0.8167 | 0.8158 | 0.8121 | 0.5827 | 0.589 | 0.101 |
| gbc | Gradient Boosting Classifier | 0.7794 | 0.8647 | 0.8038 | 0.804 | 0.8003 | 0.5528 | 0.5591 | 0.165 |



### **Hyperparameter search with GridSearchCV on the best models returned by PyCaret**

In [None]:
X = df.drop('output', axis=1)
y = df['output']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)
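
The effect of `stratify=y` in the split above can be checked in isolation. This sketch rebuilds only the label vector (165 positives, 138 negatives, as reported by `value_counts`) with placeholder features and applies the same call; the `_demo` names are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features; only the labels matter for this check
y_demo = np.array([1] * 165 + [0] * 138)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)

n_pos_test = int(y_te.sum())         # positives in the test partition
n_neg_test = int((y_te == 0).sum())  # negatives in the test partition
print(len(y_tr), len(y_te), n_pos_test, n_neg_test)
```

The test partition keeps roughly the same class proportion as the full data, which is what the `support` column of the classification reports reflects.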

In [None]:
from sklearn.linear_model import RidgeClassifier

param_grid = {
    'alpha': [0.1, 1, 10],
    'max_iter':[None, 10],
    'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag', 'saga']
}

# GridSearchCV setup for RidgeClassifier
grid_search = GridSearchCV(RidgeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best hyperparameter combination
print('Best Parameters:', grid_search.best_params_)
print('Best Cross-Validation Accuracy:', grid_search.best_score_)

# Model with the best parameters
ridge_model = grid_search.best_estimator_
ridge_model.fit(X_train, y_train)

# Evaluation on the test set
y_pred = ridge_model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))

Best Parameters: {'alpha': 0.1, 'max_iter': None, 'solver': 'lsqr'}
Best Cross-Validation Accuracy: 0.8383164983164983
Accuracy: 0.7419354838709677
Confusion Matrix:
 [[ 8  6]
 [ 2 15]]
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.57      0.67        14
           1       0.71      0.88      0.79        17

    accuracy                           0.74        31
   macro avg       0.76      0.73      0.73        31
weighted avg       0.75      0.74      0.73        31
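
The search above ranks candidates by accuracy (the default scoring for classifiers) and is fitted on unscaled features, whereas the stated objective is Recall and the PyCaret run used `normalize=True`. The following is a hedged sketch of how both points could be addressed, using synthetic stand-in data from `make_classification` rather than the heart dataset; the step names `scale` and `clf` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (303 rows, binary target)
X_demo, y_demo = make_classification(n_samples=303, n_features=10, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),   # refitted on the training part of each fold
    ("clf", RidgeClassifier()),
])

search = GridSearchCV(
    pipe,
    param_grid={"clf__alpha": [0.1, 1, 10]},
    scoring="recall",   # rank candidates by recall of the positive class
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Placing the scaler inside the pipeline means each fold is scaled with statistics from its own training rows, so no information from the validation rows leaks into the fit.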



In [None]:
# Linear Discriminant Analysis

# Note: shrinkage='auto' is not supported by the 'svd' solver. Those candidates
# fail to fit and, with GridSearchCV's default error_score, are scored as NaN.
param_grid_lda = {
    'solver': ['svd', 'lsqr', 'eigen'],
    'shrinkage': [None, 'auto'],
    'tol':[0.001,0.0001]
}

# GridSearchCV setup for LDA
grid_search_lda = GridSearchCV(LinearDiscriminantAnalysis(), param_grid_lda, cv=5)
grid_search_lda.fit(X_train, y_train)

# Best hyperparameter combination
print('LDA Best Parameters:', grid_search_lda.best_params_)
print('LDA Best Cross-Validation Accuracy:', grid_search_lda.best_score_)

# Model with the best parameters
lda_model = grid_search_lda.best_estimator_
lda_model.fit(X_train, y_train)

# Evaluation on the test set
y_pred_lda = lda_model.predict(X_test)
print('LDA Accuracy:', accuracy_score(y_test, y_pred_lda))
print('LDA Confusion Matrix:\n', confusion_matrix(y_test, y_pred_lda))
print('LDA Classification Report:\n', classification_report(y_test, y_pred_lda))

LDA Best Parameters: {'shrinkage': None, 'solver': 'svd', 'tol': 0.001}
LDA Best Cross-Validation Accuracy: 0.8346801346801346
LDA Accuracy: 0.7419354838709677
LDA Confusion Matrix:
 [[ 8  6]
 [ 2 15]]
LDA Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.57      0.67        14
           1       0.71      0.88      0.79        17

    accuracy                           0.74        31
   macro avg       0.76      0.73      0.73        31
weighted avg       0.75      0.74      0.73        31
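
The LDA grid pairs every solver with every shrinkage value, although `shrinkage` only works with the `lsqr` and `eigen` solvers and, as far as the scikit-learn documentation states, `tol` is only used by `svd`. A possible alternative, sketched here on synthetic stand-in data, is to pass a list of sub-grids so that only valid combinations are tried. The name `param_grid_valid` is an assumption for the example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Split the grid so that shrinkage is only paired with solvers that support it
param_grid_valid = [
    {"solver": ["svd"], "tol": [0.001, 0.0001]},
    {"solver": ["lsqr", "eigen"], "shrinkage": [None, "auto"]},
]
n_candidates = len(ParameterGrid(param_grid_valid))
print(n_candidates)

# Synthetic stand-in data without redundant (collinear) features
X_demo, y_demo = make_classification(
    n_samples=303, n_features=8, n_informative=5, n_redundant=0, random_state=42
)
search = GridSearchCV(LinearDiscriminantAnalysis(), param_grid_valid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With this layout the search spends no fits on combinations that can only fail.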



In [None]:
from sklearn.naive_bayes import GaussianNB

param_grid_nb = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6]
}

# GridSearchCV setup for GaussianNB
grid_search_nb = GridSearchCV(GaussianNB(), param_grid_nb, cv=5)
grid_search_nb.fit(X_train, y_train)

# Best hyperparameter combination
print('Naive Bayes Best Parameters:', grid_search_nb.best_params_)
print('Naive Bayes Best Cross-Validation Accuracy:', grid_search_nb.best_score_)

# Model with the best parameters
nb_model = grid_search_nb.best_estimator_
nb_model.fit(X_train, y_train)

# Evaluation on the test set
y_pred_nb = nb_model.predict(X_test)
print('Naive Bayes Accuracy:', accuracy_score(y_test, y_pred_nb))
print('Naive Bayes Confusion Matrix:\n', confusion_matrix(y_test, y_pred_nb))
print('Naive Bayes Classification Report:\n', classification_report(y_test, y_pred_nb))

Naive Bayes Best Parameters: {'var_smoothing': 1e-07}
Naive Bayes Best Cross-Validation Accuracy: 0.8346127946127947
Naive Bayes Accuracy: 0.7419354838709677
Naive Bayes Confusion Matrix:
 [[ 9  5]
 [ 3 14]]
Naive Bayes Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.64      0.69        14
           1       0.74      0.82      0.78        17

    accuracy                           0.74        31
   macro avg       0.74      0.73      0.74        31
weighted avg       0.74      0.74      0.74        31



In [None]:
from sklearn.linear_model import LogisticRegression

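# Note: many of these combinations are invalid (for example 'l1' with 'lbfgs',
# or max_iter=None). GridSearchCV scores failed fits as NaN by default, so they
# are counted among the candidates but can never be selected.
param_grid_lr = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': [0.1, 1, 10, 100],
    'solver': ['liblinear', 'saga','lbfgs'],
    'l1_ratio': [None,0.1, 0.5, 0.9],
    'max_iter': [None,100, 500, 1000]
}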


grid_search_lr = GridSearchCV(
    LogisticRegression(),
    param_grid_lr,
    cv=5,
    scoring='recall',
    n_jobs=-1,
    verbose=1
)

grid_search_lr.fit(X_train, y_train)

# Best hyperparameter combination
print('Logistic Regression Best Parameters:', grid_search_lr.best_params_)
# best_score_ is the mean cross-validated recall, because scoring='recall'
print('Logistic Regression Best Cross-Validation Recall:', grid_search_lr.best_score_)

# Model with the best parameters
lr_model = grid_search_lr.best_estimator_
lr_model.fit(X_train, y_train)

# Evaluation on the test set
y_pred_lr = lr_model.predict(X_test)
print('Logistic Regression Accuracy:', accuracy_score(y_test, y_pred_lr))
print('Logistic Regression Confusion Matrix:\n', confusion_matrix(y_test, y_pred_lr))
print('Logistic Regression Classification Report:\n', classification_report(y_test, y_pred_lr))

Fitting 5 folds for each of 768 candidates, totalling 3840 fits
Logistic Regression Best Parameters: {'C': 100, 'l1_ratio': None, 'max_iter': 500, 'penalty': 'l2', 'solver': 'lbfgs'}
Logistic Regression Best Cross-Validation Recall: 0.8493602693602693
Logistic Regression Accuracy: 0.7419354838709677
Logistic Regression Confusion Matrix:
 [[ 8  6]
 [ 2 15]]
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.57      0.67        14
           1       0.71      0.88      0.79        17

    accuracy                           0.74        31
   macro avg       0.76      0.73      0.73        31
weighted avg       0.75      0.74      0.73        31
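
The 768 candidates reported for logistic regression come from multiplying every list in the grid, including combinations the estimator rejects. The sketch below only counts candidates with `ParameterGrid`; it fits nothing. The penalty/solver pairings in `valid_grid` follow the compatibility described in the scikit-learn documentation as best understood here, and should be treated as an assumption to verify against the installed version.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100]
iters = [100, 500, 1000]

# The grid as written in the cell above: every combination, valid or not
full_grid = {
    "penalty": ["l1", "l2", "elasticnet", "none"],
    "C": C_values,
    "solver": ["liblinear", "saga", "lbfgs"],
    "l1_ratio": [None, 0.1, 0.5, 0.9],
    "max_iter": [None, 100, 500, 1000],
}

# One sub-grid per penalty, listing only solvers assumed to support it
valid_grid = [
    {"penalty": ["l1"], "solver": ["liblinear", "saga"],
     "C": C_values, "max_iter": iters},
    {"penalty": ["l2"], "solver": ["liblinear", "saga", "lbfgs"],
     "C": C_values, "max_iter": iters},
    {"penalty": ["elasticnet"], "solver": ["saga"], "l1_ratio": [0.1, 0.5, 0.9],
     "C": C_values, "max_iter": iters},
    {"penalty": [None], "solver": ["saga", "lbfgs"], "max_iter": iters},
]

n_full = len(ParameterGrid(full_grid))
n_valid = len(ParameterGrid(valid_grid))
print(n_full, n_valid)
```

Restricting the search to valid combinations shortens the run considerably and removes the failed-fit warnings from the output.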



In [None]:
# LGBM Classifier

param_grid_lgbm = {
    'num_leaves': [31, 50, 70],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [50, 100, 200],
    'max_depth': [-1, 10, 20],
}

lgbm = LGBMClassifier(verbosity=-1)

# GridSearchCV setup for LightGBM
grid_search_lgbm = GridSearchCV(lgbm, param_grid_lgbm, cv=5)
grid_search_lgbm.fit(X_train, y_train)

# Best hyperparameter combination
print('LightGBM Best Parameters:', grid_search_lgbm.best_params_)
print('LightGBM Best Cross-Validation Accuracy:', grid_search_lgbm.best_score_)

# Model with the best parameters
lgbm_model = grid_search_lgbm.best_estimator_
lgbm_model.fit(X_train, y_train)

# Evaluation on the test set
y_pred_lgbm = lgbm_model.predict(X_test)
print('LightGBM Accuracy:', accuracy_score(y_test, y_pred_lgbm))
print('LightGBM Confusion Matrix:\n', confusion_matrix(y_test, y_pred_lgbm))
print('LightGBM Classification Report:\n', classification_report(y_test, y_pred_lgbm))

LightGBM Best Parameters: {'learning_rate': 0.05, 'max_depth': -1, 'n_estimators': 50, 'num_leaves': 31}
LightGBM Best Cross-Validation Accuracy: 0.8162962962962963
LightGBM Accuracy: 0.7419354838709677
LightGBM Confusion Matrix:
 [[ 8  6]
 [ 2 15]]
LightGBM Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.57      0.67        14
           1       0.71      0.88      0.79        17

    accuracy                           0.74        31
   macro avg       0.76      0.73      0.73        31
weighted avg       0.75      0.74      0.73        31



### **Best model**

We now test the best model reported by PyCaret, **Ridge Classifier**, with tuned hyperparameters.

In [None]:
from pycaret.classification import tune_model
tuned_model = tune_model(best_model, optimize='Recall')

| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8214 | 0.9128 | 0.8667 | 0.8125 | 0.8387 | 0.6392 | 0.6408 |
| 1 | 0.7857 | 0.8615 | 0.8667 | 0.7647 | 0.8125 | 0.5648 | 0.5708 |
| 2 | 0.8519 | 0.8833 | 1.0 | 0.7895 | 0.8824 | 0.6897 | 0.7255 |
| 3 | 0.8889 | 0.9444 | 0.8667 | 0.9286 | 0.8966 | 0.7769 | 0.779 |
| 4 | 0.7778 | 0.8278 | 0.8 | 0.8 | 0.8 | 0.55 | 0.55 |
| 5 | 0.7037 | 0.8056 | 0.8 | 0.7059 | 0.75 | 0.3898 | 0.3944 |
| 6 | 0.8519 | 0.9722 | 1.0 | 0.7895 | 0.8824 | 0.6897 | 0.7255 |
| 7 | 0.963 | 1.0 | 0.9333 | 1.0 | 0.9655 | 0.9256 | 0.9282 |
| 8 | 0.8889 | 0.9505 | 0.9286 | 0.8667 | 0.8966 | 0.7769 | 0.779 |
| 9 | 0.8889 | 0.9341 | 0.7857 | 1.0 | 0.88 | 0.7793 | 0.799 |



Fitting 10 folds for each of 10 candidates, totalling 100 fits


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


## Do the results differ from those obtained with scikit-learn in the previous workshop?

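The metrics differ somewhat between PyCaret and the hyperparameter search. This is partly due to the training and test data each process uses: PyCaret reports the mean over 10 cross-validation folds on normalized features, whereas the test metrics printed after each grid search come from a single hold-out set of 31 rows with unscaled features.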
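
Part of that gap is simply the size of the hold-out set. As a rough illustration, this sketch computes the binomial standard error of an accuracy estimated on 31 rows, using the diagonal of the confusion matrix printed above for the Ridge classifier. Treating the test predictions as independent draws is a simplifying assumption.

```python
import math

n_test = 31          # rows in the hold-out set
n_correct = 8 + 15   # diagonal of the reported confusion matrix

accuracy = n_correct / n_test
# Binomial standard error of an accuracy estimated on n_test rows
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
# Change in accuracy caused by a single prediction flipping
per_case = 1 / n_test

print(f"accuracy={accuracy:.4f} se={std_error:.4f} per_case={per_case:.4f}")
```

With an uncertainty of this size, a hold-out accuracy of about 0.74 and a cross-validated accuracy of about 0.84 are not as far apart as they first appear.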

As for the exercise itself, with a recall close to 0.8 in the best model, we consider that it should not be used in a production environment, especially in a medical setting where decisions have a critical impact on patients' lives. A recall of 0.8 means that, out of every 10 patients who really are at risk of a heart attack, the model correctly identifies 8 and, on average, misses 2, wrongly classifying them as "not at risk". This error rate is unacceptable in a clinical context, since the consequences of overlooking these cases can include serious health complications, delayed treatment, and even death.
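
One common way to raise recall without retraining is to lower the decision threshold applied to predicted probabilities, at the cost of more false positives. This is not something the notebook does; the sketch below only illustrates the idea on synthetic stand-in data with a plain `LogisticRegression`, and the thresholds are arbitrary example values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the heart dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # estimated probability of class 1

recalls = {}
for threshold in (0.5, 0.3, 0.1):
    # Flag a case as positive when its probability reaches the threshold
    y_hat = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print(threshold, round(recalls[threshold], 3))
```

Recall can only stay the same or increase as the threshold drops, so the choice of threshold becomes a decision about how many false alarms are acceptable per missed patient.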