<div style="width: 100%; clear: both;">
<div style="float: left; width: 50%;">
<img src="https://www.uoc.edu/content/dam/news/images/noticies/2016/202-nova-marca-uoc.jpg" align="left" width="45%">
</div>
<div style="float: right; width: 50%;">
<p style="margin: 0; padding-top: 22px; text-align:right;">M2.859 - Data Visualization - PRACTICAL 2</p>
<p style="margin: 0; text-align:right;">2023-2 - University Master's Degree in Data Science</p>
<p style="margin: 0; text-align:right; padding-button: 100px;">Computer Science, Multimedia and Telecommunications Studies</p>
</div>
</div>
<div style="width:100%;">&nbsp;</div>

# Practical 2. Visualization Project

**Author:** JLL

**June 2024**

## Dataset description

We carry out a thorough analysis of the heart health dataset, focusing on the variables that have a direct impact on cardiac health.

### Most relevant variables for the analysis

1. **Demographics and lifestyle**:
   - **State**: Enables geographic analysis.
   - **Sex**: Comparison between sexes.
   - **AgeCategory**: Risk analysis by age group.
   - **RaceEthnicityCategory**: Comparison between racial and ethnic groups.
   - **SmokerStatus**: Impact of smoking.
   - **ECigaretteUsage**: Use of electronic cigarettes.
   - **AlcoholDrinkers**: Alcohol consumption.
   - **PhysicalActivities**: Level of physical activity.
   - **SleepHours**: Sleep duration in hours.

2. **Medical history**:
   - **HadHeartAttack**: History of heart attacks.
   - **HadAngina**: History of angina.
   - **HadStroke**: History of stroke.
   - **HadAsthma**: History of asthma.
   - **HadCOPD**: History of COPD.
   - **HadDiabetes**: History of diabetes.
   - **HadKidneyDisease**: History of kidney disease.
   - **HadArthritis**: History of arthritis.
   - **HadDepressiveDisorder**: History of depressive disorders.

3. **Physical measurements**:
   - **HeightInMeters**: Height.
   - **WeightInKilograms**: Weight.
   - **BMI**: Body mass index (BMI).

4. **Health evaluations and checkups**:
   - **GeneralHealth**: General health status.
   - **PhysicalHealthDays**: Number of days on which physical health was not good.
   - **MentalHealthDays**: Number of days on which mental health was not good.
   - **LastCheckupTime**: Time elapsed since the last medical checkup.
   - **CardiacRiskScore**: Cardiac risk score.


In [1]:
## Libraries

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Data manipulation and analysis
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
## Path
path = '/kaggle/input/heart-data/heart_clean-2024.csv'

In [3]:
## Load the file
df = pd.read_csv(path, delimiter=';', skipinitialspace=True)
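
To illustrate what the `delimiter=';'` and `skipinitialspace=True` arguments do, here is a minimal sketch that parses a small in-memory sample instead of the Kaggle file. The sample rows and the name `sample_df` are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample; note the space after each ';'.
sample = "State; Sex; SleepHours\nAlabama; Female; 9\nAlabama; Male; 6\n"

# delimiter=';' splits fields on semicolons;
# skipinitialspace=True drops the spaces that follow each delimiter.
sample_df = pd.read_csv(io.StringIO(sample), delimiter=';', skipinitialspace=True)

print(list(sample_df.columns))
print(sample_df['Sex'].tolist())
```

Without `skipinitialspace=True`, the column names and string values in this sample would keep their leading space.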

In [4]:
## Dataset information.
## It contains all the variables, which we filter later.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246013 entries, 0 to 246012
Data columns (total 54 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      246013 non-null  object 
 1   Sex                        246013 non-null  object 
 2   GeneralHealth              246013 non-null  object 
 3   PhysicalHealthDays         246013 non-null  int64  
 4   MentalHealthDays           246013 non-null  int64  
 5   LastCheckupTime            246013 non-null  object 
 6   PhysicalActivities         246013 non-null  object 
 7   SleepHours                 246013 non-null  int64  
 8   RemovedTeeth               246013 non-null  object 
 9   HadHeartAttack             246013 non-null  object 
 10  HadAngina                  246013 non-null  object 
 11  HadStroke                  246013 non-null  object 
 12  HadAsthma                  246013 non-null  object 
 13  HadSkinCancer              24

In [5]:
## Dataset preview.
df.head()

| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | AgeCategory_level | PhysicalActivities_bin | SmokerStatus_level | HadDiabetes_level | AgeCategory_NM | BMI_NM | PhysicalActivities_NM | SmokerStatus_NM | HadDiabetes_NM | CardiacRiskScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 4 | 0 | Within past year (anytime less than 12 months ... | Yes | 9 | None of them | No | ... | 10 | 1 | 2 | 0.0 | 0.75 | 0.1865 | 1.0 | 0.333333 | 0.0 | 73.984.999.999.999.900 |
| 1 | Alabama | Male | Very good | 0 | 0 | Within past year (anytime less than 12 months ... | Yes | 6 | None of them | No | ... | 11 | 1 | 2 | 1.0 | 0.833333 | 0.211491 | 1.0 | 0.333333 | 1.0 | 81.695 |
| 2 | Alabama | Male | Very good | 0 | 0 | Within past year (anytime less than 12 months ... | No | 8 | 6 or more, but not all | No | ... | 12 | 0 | 2 | 0.0 | 0.916667 | 0.229359 | 0.0 | 0.333333 | 0.0 | 8.349 |
| 3 | Alabama | Female | Fair | 5 | 0 | Within past year (anytime less than 12 months ... | Yes | 9 | None of them | No | ... | 13 | 1 | 1 | 0.0 | 1.0 | 0.225388 | 1.0 | 0.0 | 0.0 | 8.348 |
| 4 | Alabama | Female | Good | 3 | 15 | Within past year (anytime less than 12 months ... | Yes | 5 | 1 to 5 | No | ... | 13 | 1 | 1 | 0.0 | 1.0 | 0.245825 | 1.0 | 0.0 | 0.0 | 86.105 |


## Variable exploration and cleaning
We narrow the dataset down to the variables that are relevant to our project and work with them in separate groups: demographics, medical history, physical measurements, and health evaluations.
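
As a minimal sketch of how the variables could be split into one dataframe per group, the snippet below uses a dictionary of column lists. The miniature frame `df_demo`, the dictionary `grupos`, and the group names are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical miniature frame with a few of the columns named in this notebook.
df_demo = pd.DataFrame({
    'State': ['Alabama', 'Alabama'],
    'Sex': ['Female', 'Male'],
    'HadAngina': ['No', 'No'],
    'BMI': [27.99, 30.13],
})

# Column groups following the sections of the variable list (names are illustrative).
grupos = {
    'demograficas': ['State', 'Sex'],
    'historial': ['HadAngina'],
    'mediciones': ['BMI'],
}

# One dataframe per group; .copy() keeps later edits from touching the original.
dataframes = {nombre: df_demo[cols].copy() for nombre, cols in grupos.items()}

print({nombre: list(sub.columns) for nombre, sub in dataframes.items()})
```

Keeping the column lists in one place makes it easy to check that every selected variable actually exists before plotting.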

In [6]:
## Recode exact 'Yes'/'No' answers as 1/0 in every column; all other values stay unchanged.
for columna in df.columns:
    df[columna] = df[columna].apply(lambda x: 1 if x == 'Yes' else (0 if x == 'No' else x))
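
The same recoding can be sketched with `DataFrame.replace`, which avoids calling a Python function on every cell. This is an alternative for illustration rather than the code used in this notebook, and the sample frame `muestra` is hypothetical.

```python
import pandas as pd

# Hypothetical sample with exact 'Yes'/'No' answers and one longer answer.
muestra = pd.DataFrame({
    'HadAngina': ['No', 'Yes', 'No'],
    'HadDiabetes': ['No', 'Yes', 'Yes, but only during pregnancy (female)'],
    'State': ['Alabama', 'Alaska', 'Ohio'],
})

# DataFrame.replace matches whole cell values, so only exact 'Yes'/'No' change;
# longer answers that merely contain 'Yes' or 'No' are left as they are.
recodificado = muestra.replace({'Yes': 1, 'No': 0})

print(recodificado['HadAngina'].tolist())
print(recodificado['HadDiabetes'].tolist())
```

Depending on the pandas version, the recoded columns may keep the `object` dtype, so an explicit `astype('int')` may be needed on columns that end up fully numeric.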

### Demographic variables

In [7]:
## Demographic variables.
columnas = ['State', 'Sex', 'AgeCategory', 'RaceEthnicityCategory', 'SmokerStatus', 'ECigaretteUsage', 'AlcoholDrinkers', 'PhysicalActivities', 'SleepHours']
df[columnas].head()

| | State | Sex | AgeCategory | RaceEthnicityCategory | SmokerStatus | ECigaretteUsage | AlcoholDrinkers | PhysicalActivities | SleepHours |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Age 65 to 69 | White only, Non-Hispanic | Former smoker | Never used e-cigarettes in my entire life | 0 | 1 | 9 |
| 1 | Alabama | Male | Age 70 to 74 | White only, Non-Hispanic | Former smoker | Never used e-cigarettes in my entire life | 0 | 1 | 6 |
| 2 | Alabama | Male | Age 75 to 79 | White only, Non-Hispanic | Former smoker | Never used e-cigarettes in my entire life | 1 | 0 | 8 |
| 3 | Alabama | Female | Age 80 or older | White only, Non-Hispanic | Never smoked | Never used e-cigarettes in my entire life | 0 | 1 | 9 |
| 4 | Alabama | Female | Age 80 or older | White only, Non-Hispanic | Never smoked | Never used e-cigarettes in my entire life | 0 | 1 | 5 |


In [8]:
## Adjust the variables for clearer visualization.
### State
df['State'] = df['State'].fillna('unknown')
df['State'] = df['State'].str.title()
df['State'] = df['State'].astype('category')

### Sex
df['Sex'] = df['Sex'].fillna('unknown')
df['Sex'] = df['Sex'].str.lower()
df['Sex'] = df['Sex'].astype('category')

### AgeCategory
categorias_ordenadas = ['Age 18 to 24', 'Age 25 to 29', 'Age 30 to 34', 'Age 35 to 39', 'Age 40 to 44', 'Age 45 to 49', 'Age 50 to 54', 'Age 55 to 59', 'Age 60 to 64', 'Age 65 to 69', 'Age 70 to 74', 'Age 75 to 79', 'Age 80 or older']
categorias_simplificadas = [categoria.replace('Age ', '').replace(' to ', '-').replace(' or older', '+') for categoria in categorias_ordenadas]

mapeo_categorias = dict(zip(categorias_ordenadas, categorias_simplificadas))

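df['AgeCategory'] = df['AgeCategory'].map(mapeo_categorias)
### Labels missing from the mapping become NaN and stay NaN in the ordered categorical,
### because 'unknown' is not one of its categories (filling with 'unknown' would have no effect).
df['AgeCategory'] = pd.Categorical(df['AgeCategory'], categories=categorias_simplificadas, ordered=True)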

### RaceEthnicityCategory
df['RaceEthnicityCategory'] = df['RaceEthnicityCategory'].replace({
    'White only, Non-Hispanic': 'white',
    'Black only, Non-Hispanic': 'black',
    'Other race only, Non-Hispanic': 'other',
    'Multiracial, Non-Hispanic': 'multiracial',
    'Hispanic': 'hispanic'
})

df['RaceEthnicityCategory'] = df['RaceEthnicityCategory'].fillna('unknown')
df['RaceEthnicityCategory'] = df['RaceEthnicityCategory'].astype('category')

### SmokerStatus
df['SmokerStatus'] = df['SmokerStatus'].replace({
    'Former smoker': 'ex-smoker',
    'Never smoked': 'non-smoker',
    'Current smoker - now smokes every day': 'current smoker',
    'Current smoker - now smokes some days': 'current smoker'
})

df['SmokerStatus'] = df['SmokerStatus'].fillna('unknown')
df['SmokerStatus'] = df['SmokerStatus'].astype('category')

### ECigaretteUsage
df['ECigaretteUsage'] = df['ECigaretteUsage'].replace({
    'Never used e-cigarettes in my entire life': 'never used',
    'Use them some days': 'occasional user',
    'Not at all (right now)': 'not currently using',
    'Use them every day': 'daily user'
})
df['ECigaretteUsage'] = df['ECigaretteUsage'].fillna('unknown')
df['ECigaretteUsage'] = df['ECigaretteUsage'].astype('category')

### SleepHours
df['SleepHours'] = df['SleepHours'].astype('int')
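
The ordered categorical used for `AgeCategory` can be illustrated with a minimal sketch. The labels below are a shortened subset and the values are invented. The point is only to show how ordering, comparisons, and values outside the category list behave.

```python
import pandas as pd

# Simplified age labels in ascending order (a subset, for brevity).
etiquetas = ['18-24', '25-29', '65-69', '80+']

# 'unknown' is not one of the categories, so it is stored as a missing value.
edades = pd.Series(pd.Categorical(['80+', '18-24', 'unknown', '65-69'],
                                  categories=etiquetas, ordered=True))

print(edades.sort_values().tolist())   # sorted by category order, missing value last
print((edades >= '65-69').tolist())    # comparisons follow the declared order
print(edades.isna().sum())             # number of values outside the categories
```

Declaring the order once means that plots and group-by summaries list the age bands from youngest to oldest instead of alphabetically.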

In [9]:
### df_demograficas
df[columnas].head()

| | State | Sex | AgeCategory | RaceEthnicityCategory | SmokerStatus | ECigaretteUsage | AlcoholDrinkers | PhysicalActivities | SleepHours |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | female | 65-69 | white | ex-smoker | never used | 0 | 1 | 9 |
| 1 | Alabama | male | 70-74 | white | ex-smoker | never used | 0 | 1 | 6 |
| 2 | Alabama | male | 75-79 | white | ex-smoker | never used | 1 | 0 | 8 |
| 3 | Alabama | female | 80+ | white | non-smoker | never used | 0 | 1 | 9 |
| 4 | Alabama | female | 80+ | white | non-smoker | never used | 0 | 1 | 5 |


### Medical history variables

In [10]:
## Medical history variables
columnas = ['HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma', 'HadCOPD', 'HadDiabetes', 'HadKidneyDisease', 'HadArthritis', 'HadDepressiveDisorder']
df[columnas].head()

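| | HadHeartAttack | HadAngina | HadStroke | HadAsthma | HadCOPD | HadDiabetes | HadKidneyDisease | HadArthritis | HadDepressiveDisorder |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |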


In [11]:
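def map_values(value):
    # Strings containing 'Yes' map to 1 and strings containing 'No' map to 0;
    # any other value (including numbers already recoded) is returned unchanged.
    if isinstance(value, str):
        if 'Yes' in value:
            return 1
        elif 'No' in value:
            return 0
    return value

# HadDiabetes can still hold longer answers that the exact Yes/No recoding did not match.
df['HadDiabetes'] = df['HadDiabetes'].apply(map_values)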
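
To make the behaviour of `map_values` concrete, the sketch below applies it to a few sample answers. The longer strings are assumed examples of qualified responses, and the list `respuestas` is hypothetical.

```python
def map_values(value):
    """Map strings containing 'Yes' to 1 and strings containing 'No' to 0."""
    if isinstance(value, str):
        if 'Yes' in value:
            return 1
        elif 'No' in value:
            return 0
    return value

# Hypothetical answers; the longer strings are assumed examples of qualified responses.
respuestas = [
    'Yes, but only during pregnancy (female)',
    'No, pre-diabetes or borderline diabetes',
    1,      # already recoded by the earlier Yes/No pass
    None,   # non-string values pass through unchanged
]

print([map_values(r) for r in respuestas])
```

The substring test is case-sensitive and checks `'Yes'` first. It also collapses qualified answers into a plain 0 or 1, which simplifies plotting but discards the qualification.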

In [12]:
### df_historial
df[columnas].head()

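| | HadHeartAttack | HadAngina | HadStroke | HadAsthma | HadCOPD | HadDiabetes | HadKidneyDisease | HadArthritis | HadDepressiveDisorder |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |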


### Physical measurement variables

In [13]:
## Physical measurement variables
columnas = ['HeightInMeters', 'WeightInKilograms', 'BMI']
df[columnas].head()

| | HeightInMeters | WeightInKilograms | BMI |
|---|---|---|---|
| 0 | 1.6 | 71.67 | 27.99 |
| 1 | 1.78 | 95.25 | 30.13 |
| 2 | 1.85 | 108.86 | 31.66 |
| 3 | 1.7 | 90.72 | 31.32 |
| 4 | 1.55 | 79.38 | 33.07 |


### Health evaluation variables

In [14]:
## Health evaluation variables
columnas = ['GeneralHealth', 'PhysicalHealthDays', 'MentalHealthDays', 'LastCheckupTime', 'CardiacRiskScore']
df[columnas].head()

| | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | CardiacRiskScore |
|---|---|---|---|---|---|
| 0 | Very good | 4 | 0 | Within past year (anytime less than 12 months ... | 73.984.999.999.999.900 |
| 1 | Very good | 0 | 0 | Within past year (anytime less than 12 months ... | 81.695 |
| 2 | Very good | 0 | 0 | Within past year (anytime less than 12 months ... | 8.349 |
| 3 | Fair | 5 | 0 | Within past year (anytime less than 12 months ... | 8.348 |
| 4 | Good | 3 | 15 | Within past year (anytime less than 12 months ... | 86.105 |


In [15]:
## Adjust the variables for clearer visualization.

### GeneralHealth
### Convert to lowercase before any other transformation.
df['GeneralHealth'] = df['GeneralHealth'].str.lower()

### Define the categories in the desired order.
categorias = ['poor', 'fair', 'good', 'very good', 'excellent']
df_evaluaciones_cat = CategoricalDtype(categories=categorias, ordered=True)
df['GeneralHealth'] = df['GeneralHealth'].astype(df_evaluaciones_cat)

### PhysicalHealthDays
df['PhysicalHealthDays'] = df['PhysicalHealthDays'].astype('int')

### MentalHealthDays
df['MentalHealthDays'] = df['MentalHealthDays'].astype('int')

### LastCheckupTime
df['LastCheckupTime'] = df['LastCheckupTime'].replace({
    'Within past year (anytime less than 12 months ago)': '<1',
    '5 or more years ago': '5+',
    'Within past 2 years (1 year but less than 2 years ago)': '1-2',
    'Within past 5 years (2 years but less than 5 years ago)': '2-5'
})

categorias = ['<1', '1-2', '2-5', '5+']
df_evaluaciones_cat = CategoricalDtype(categories=categorias, ordered=True)
df['LastCheckupTime'] = df['LastCheckupTime'].astype(df_evaluaciones_cat)

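### CardiacRiskScore
### The raw values are strings with stray dots (e.g. '73.984.999.999.999.900').
### regex=False treats '.' as a literal dot; as a regular expression it would match every character.
### Caution: stripping every dot leaves the scores on inconsistent scales
### (e.g. 7.3985e+16 next to 8349.0), which distorts the standardized column below.
df['CardiacRiskScore'] = df['CardiacRiskScore'].str.replace('.', '', regex=False)
df['CardiacRiskScore'] = df['CardiacRiskScore'].astype('float')

### Standardize the score to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['CardiacRiskScore_scaler'] = scaler.fit_transform(df[['CardiacRiskScore']])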
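
As a minimal sketch of what the standardization computes, the snippet below writes the z-score formula out by hand with NumPy. The five values are illustrative, chosen to resemble the parsed scores in the preview rows, where one entry is many orders of magnitude larger than the rest.

```python
import numpy as np

# Illustrative values: one score many orders of magnitude above the others,
# similar in spirit to the 7.3985e+16 entry that appears after parsing.
scores = np.array([7.3985e16, 81695.0, 8349.0, 8348.0, 86105.0])

# StandardScaler standardizes as (x - mean) / std, using the population
# standard deviation (ddof=0); the same formula is written out here by hand.
z = (scores - scores.mean()) / scores.std(ddof=0)

print(np.round(z, 5))
```

With a value this extreme, the mean and the standard deviation are dominated by it, so all the other scores collapse onto practically the same z-score. This is consistent with the repeated `-0.52955` in the notebook output, and it suggests the parsed scores should be checked before the scaled column is used in a plot.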

In [16]:
## df_evaluaciones
## Round the standardized score to five decimal places.
df['CardiacRiskScore_scaler'] = df['CardiacRiskScore_scaler'].round(5)

In [17]:
df.head()

| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | PhysicalActivities_bin | SmokerStatus_level | HadDiabetes_level | AgeCategory_NM | BMI_NM | PhysicalActivities_NM | SmokerStatus_NM | HadDiabetes_NM | CardiacRiskScore | CardiacRiskScore_scaler |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | female | very good | 4 | 0 | <1 | 1 | 9 | None of them | 0 | ... | 1 | 2 | 0.0 | 0.75 | 0.1865 | 1.0 | 0.333333 | 0.0 | 7.3985e+16 | 3.50591 |
| 1 | Alabama | male | very good | 0 | 0 | <1 | 1 | 6 | None of them | 0 | ... | 1 | 2 | 1.0 | 0.833333 | 0.211491 | 1.0 | 0.333333 | 1.0 | 81695.0 | -0.52955 |
| 2 | Alabama | male | very good | 0 | 0 | <1 | 0 | 8 | 6 or more, but not all | 0 | ... | 0 | 2 | 0.0 | 0.916667 | 0.229359 | 0.0 | 0.333333 | 0.0 | 8349.0 | -0.52955 |
| 3 | Alabama | female | fair | 5 | 0 | <1 | 1 | 9 | None of them | 0 | ... | 1 | 1 | 0.0 | 1.0 | 0.225388 | 1.0 | 0.0 | 0.0 | 8348.0 | -0.52955 |
| 4 | Alabama | female | good | 3 | 15 | <1 | 1 | 5 | 1 to 5 | 0 | ... | 1 | 1 | 0.0 | 1.0 | 0.245825 | 1.0 | 0.0 | 0.0 | 86105.0 | -0.52955 |


In [18]:
## Save the dataset.
df.to_csv('heart-2024.csv', index=False, sep=";")
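
One thing to keep in mind when saving to CSV is that the format stores plain text, so the ordered categorical dtypes defined earlier are not preserved. The following minimal sketch shows the round trip on a hypothetical two-row frame, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical small frame with an ordered categorical column.
orden = CategoricalDtype(categories=['poor', 'fair', 'good'], ordered=True)
original = pd.DataFrame({'GeneralHealth': pd.Series(['good', 'poor'], dtype=orden)})

# Write to an in-memory buffer with the same options as the cell above.
buffer = io.StringIO()
original.to_csv(buffer, index=False, sep=';')
buffer.seek(0)

# On reload the column comes back as plain strings, not as a categorical...
recargado = pd.read_csv(buffer, sep=';')
print(recargado['GeneralHealth'].dtype)

# ...so the ordered dtype has to be reapplied explicitly.
recargado['GeneralHealth'] = recargado['GeneralHealth'].astype(orden)
print(recargado['GeneralHealth'].cat.ordered)
```

Whatever tool reads `heart-2024.csv` afterwards would therefore need to redefine the category order for variables such as `GeneralHealth`, `AgeCategory`, and `LastCheckupTime`.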