# Exploratory analysis: "Heart Disease Prediction" dataset
I will explore the data to identify the main conditions presented across age ranges, note the patterns in which they tend to appear, and check for possible correlations between them.

## Paths and imports

In [45]:
import pandas as pd
import numpy as np
import plotly.express as px

original_data = '../data/raw/Heart_Disease_Prediction.csv'

In [46]:
raw_data = pd.read_csv(original_data)
raw_data.to_parquet('../data/processed/raw_data_p')
raw_data.head()


Unnamed: 0,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence


### Important details about the data:
1. Age: age in completed years
2. Sex: 1 = male, 0 = female
3. Chest pain type: type of chest pain reported by the patient, coded 1 to 4. Assuming this file follows the UCI heart-disease encoding, the codes are categories (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic) rather than a severity scale
4. BP: blood pressure (mm Hg)
5. Cholesterol: cholesterol level (mg/dl)
6. FBS over 120: fasting blood sugar > 120 mg/dl (1 = yes, 0 = no)
7. EKG results: electrocardiogram results
8. Max HR: maximum heart rate recorded during the exercise test
9. Exercise angina: angina present during exercise (1 = yes, 0 = no)
10. ST depression: ST-segment depression (0.0 = normal, no depression; values > 0 = deeper ST depression, associated with higher cardiac risk)

## General inspection

In [47]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      270 non-null    int64  
 1   Sex                      270 non-null    int64  
 2   Chest pain type          270 non-null    int64  
 3   BP                       270 non-null    int64  
 4   Cholesterol              270 non-null    int64  
 5   FBS over 120             270 non-null    int64  
 6   EKG results              270 non-null    int64  
 7   Max HR                   270 non-null    int64  
 8   Exercise angina          270 non-null    int64  
 9   ST depression            270 non-null    float64
 10  Slope of ST              270 non-null    int64  
 11  Number of vessels fluro  270 non-null    int64  
 12  Thallium                 270 non-null    int64  
 13  Heart Disease            270 non-null    object 
dtypes: float64(1), int64(12), object(1)

In [48]:
raw_data.isnull().sum()

Age                        0
Sex                        0
Chest pain type            0
BP                         0
Cholesterol                0
FBS over 120               0
EKG results                0
Max HR                     0
Exercise angina            0
ST depression              0
Slope of ST                0
Number of vessels fluro    0
Thallium                   0
Heart Disease              0
dtype: int64

In [49]:
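# Note: .count() counts every row of the boolean mask, so the 270 below is the
# row count, not the number of duplicates; .sum() would count duplicated rows.
raw_data.duplicated().count()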

np.int64(270)
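
`duplicated()` returns one boolean per row, so `.count()` reports the number of rows while `.sum()` reports how many rows repeat an earlier one. A minimal sketch on a hypothetical toy frame (not the heart-disease data) illustrates the difference:

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one.
toy = pd.DataFrame({"age": [70, 67, 70], "bp": [130, 115, 130]})

dup_mask = toy.duplicated()          # boolean Series, one entry per row
n_rows = int(dup_mask.count())       # counts non-null entries, i.e. all rows
n_duplicates = int(dup_mask.sum())   # counts True values, i.e. repeated rows

print(n_rows, n_duplicates)
```

Applied to `raw_data`, the `.sum()` variant would give the actual number of duplicated rows.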

## String cleaning and normalization

In [50]:
raw_data.columns = raw_data.columns.str.lower().str.replace(' ', '_')
raw_data.head()

Unnamed: 0,age,sex,chest_pain_type,bp,cholesterol,fbs_over_120,ekg_results,max_hr,exercise_angina,st_depression,slope_of_st,number_of_vessels_fluro,thallium,heart_disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence
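
The renaming step can be tried in isolation. The sketch below uses a hypothetical empty frame with similar headers; the extra `.str.strip()` call is an assumption added here to guard against stray whitespace and is not part of the cell above.

```python
import pandas as pd

# Hypothetical headers, one of them padded with spaces on purpose.
toy = pd.DataFrame(columns=["Chest pain type", "FBS over 120", " Max HR "])

toy.columns = (
    toy.columns
    .str.strip()                         # assumption: drop surrounding whitespace
    .str.lower()                         # lowercase every header
    .str.replace(" ", "_", regex=False)  # literal space -> underscore
)

print(list(toy.columns))
```

Passing `regex=False` makes it explicit that the space is replaced literally rather than as a pattern.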


### Clean copy and saving the df to Parquet

In [51]:
data_hd = raw_data.copy()
data_hd.head()

Unnamed: 0,age,sex,chest_pain_type,bp,cholesterol,fbs_over_120,ekg_results,max_hr,exercise_angina,st_depression,slope_of_st,number_of_vessels_fluro,thallium,heart_disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence


## Transformations or aggregation

In [52]:
data_hd['sex_mf'] = np.where(data_hd['sex'] == 0, 'femenino', 'masculino')
data_hd['fbs_over_tf'] = np.where(data_hd['fbs_over_120'] == 1, 'yes', 'no')
data_hd.to_parquet('../data/processed/data_hd_ok')
data_hd.head()

Unnamed: 0,age,sex,chest_pain_type,bp,cholesterol,fbs_over_120,ekg_results,max_hr,exercise_angina,st_depression,slope_of_st,number_of_vessels_fluro,thallium,heart_disease,sex_mf,fbs_over_tf
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,Presence,masculino,no
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,Absence,femenino,no
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,Presence,masculino,no
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,Absence,masculino,no
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,Absence,femenino,no
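
`np.where` labels every value that fails the condition with the fallback string, so an unexpected code would be silently labelled `masculino` or `no`. One possible alternative, sketched on a hypothetical toy frame, is an explicit dictionary with `Series.map`, which turns unknown codes into `NaN` so they can be detected:

```python
import pandas as pd

# Hypothetical toy frame using the same column names as data_hd.
toy = pd.DataFrame({"sex": [1, 0, 1], "fbs_over_120": [0, 1, 0]})

sex_labels = {0: "femenino", 1: "masculino"}
fbs_labels = {0: "no", 1: "yes"}

toy["sex_mf"] = toy["sex"].map(sex_labels)
toy["fbs_over_tf"] = toy["fbs_over_120"].map(fbs_labels)

# Codes missing from the dictionaries would show up as NaN here.
unmapped = int(toy[["sex_mf", "fbs_over_tf"]].isna().sum().sum())
print(toy, unmapped)
```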


## Groupings
1. Mean BP by age, computed separately for each sex.
2. Mean cholesterol by sex.
3. Mean age by FBS-over-120 status, computed separately for each sex.

In [53]:
bp_edad_hombres = (
    data_hd[data_hd['sex_mf']=='masculino']
    .groupby('age')
    .agg(bp_hombres=('bp', 'mean'),
         pop_hombres=('bp', 'count')
        )
    .reset_index()
    )

bp_edad_mujeres = (
    data_hd[data_hd['sex_mf']=='femenino']
    .groupby('age')
    .agg(bp_mujeres=('bp', 'mean'),
         pop_mujeres=('bp', 'count')
        )
    .reset_index()
)
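
Grouping by exact age can leave very few patients per group in a dataset of 270 rows. Since the stated goal is to look at age ranges, one option is to bin ages first with `pd.cut`. The sketch below uses a hypothetical toy frame, and the 10-year bands and the `age_band` column name are illustrative assumptions, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical toy frame using the same column names as data_hd.
toy = pd.DataFrame({
    "sex_mf": ["masculino", "masculino", "femenino", "femenino", "masculino"],
    "age": [41, 47, 52, 58, 63],
    "bp": [120, 130, 118, 128, 140],
})

# Assumed 10-year bands; right=False makes each band [start, end).
bins = [20, 30, 40, 50, 60, 70, 80]
labels = ["20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
toy["age_band"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

bp_by_band = (
    toy
    .groupby(["sex_mf", "age_band"], observed=True)  # keep only bands that occur
    .agg(bp_mean=("bp", "mean"), n=("bp", "count"))
    .reset_index()
)
print(bp_by_band)
```

Keeping the count `n` next to each mean makes it easier to spot bands that rest on only a handful of patients.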


In [54]:
chol_sex = (
    data_hd
    .groupby('sex_mf')
    .agg(
        chol_promedio=('cholesterol', 'mean')
    )
    .reset_index()
)
chol_sex.head()

Unnamed: 0,sex_mf,chol_promedio
0,femenino,264.747126
1,masculino,242.486339


In [55]:
fbs_edad_hombres = (
    data_hd[data_hd['sex_mf']=='masculino']
    .groupby('fbs_over_tf')
    .agg(
        edad_promedio_hombres=('age', 'mean')
    )
    .reset_index()
)
# Only the last expression in a cell is rendered, so this call shows no output.
fbs_edad_hombres.head()

fbs_edad_mujeres = (
    data_hd[data_hd['sex_mf']=='femenino']
    .groupby('fbs_over_tf')
    .agg(
        edad_promedio_mujeres=('age', 'mean')
    )
    .reset_index()
)
fbs_edad_mujeres.head()

Unnamed: 0,fbs_over_tf,edad_promedio_mujeres
0,no,55.223684
1,yes,58.818182
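
Because the cell renders only the table for women, a single grouped table can make both sexes visible at once. A minimal sketch on a hypothetical toy frame follows; the output column names `edad_promedio` and `n` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frame using the same column names as data_hd.
toy = pd.DataFrame({
    "sex_mf": ["masculino", "masculino", "femenino", "femenino", "femenino"],
    "fbs_over_tf": ["no", "yes", "no", "no", "yes"],
    "age": [50, 60, 52, 56, 61],
})

age_by_fbs_sex = (
    toy
    .groupby(["sex_mf", "fbs_over_tf"])
    .agg(edad_promedio=("age", "mean"), n=("age", "count"))
    .reset_index()
)
print(age_by_fbs_sex)
```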


## Visualization and EDA

1. Mean BP in men and women by age:

In [59]:
fig_bp_mean = px.bar(
    bp_edad_hombres, 
    x='age',
    y='bp_hombres',
    title='Mean BP by age (men)'
)
# Reference line at 120 mm Hg; the annotation text matches the y value.
fig_bp_mean.add_hline(
    y=120,
    line_dash="dash",
    line_color="red",
    annotation_text="Limit 120",
    annotation_position="top left"
)
fig_bp_mean.show()