# Data Transformation with Pandas

We will work with the `DS_Clase_10_Heart.csv` dataset. Below is a brief description of what some of its columns represent.

* `slope_of_peak_exercise_st_segment:` the slope of the peak exercise ST segment, an electrocardiography read out indicating quality of blood flow to the heart
* `thal:` results of thallium stress test measuring blood flow to the heart
* `resting_blood_pressure:` resting blood pressure
* `chest_pain_type:` chest pain type
* `num_major_vessels:` number of major vessels colored by fluoroscopy
* `fasting_blood_sugar_gt_120_mg_per_dl:` fasting blood sugar > 120 mg/dl
* `resting_ekg_results:` resting electrocardiographic results
* `serum_cholesterol_mg_per_dl:` serum cholesterol in mg/dl
* `oldpeak_eq_st_depression:` oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
* `max_heart_rate_achieved:` maximum heart rate achieved (beats per minute)
* `exercise_induced_angina:` exercise-induced chest pain (0: False, 1: True)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

We load the data and drop the `patient_id` column because it adds no information to this analysis.

In [2]:
data = pd.read_csv('DS_Clase_10_Heart.csv')
data.head()

Unnamed: 0,patient_id,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina,heart_disease_present
0,0z64un,1,normal,128,2,0,0,2,308,0.0,male,45,170,0,0
1,ryoo3j,2,normal,110,3,0,0,0,214,1.6,female,54,158,0,0
2,yt1s1x,1,normal,125,4,3,0,2,304,0.0,male,77,162,1,1
3,l2xjde,1,reversible_defect,152,4,0,0,0,223,0.0,male,40,181,0,1
4,oyt4ek,3,reversible_defect,178,1,0,0,2,270,4.2,male,59,145,0,0


In [3]:
data.drop(columns = 'patient_id', inplace = True)
data.head()

Unnamed: 0,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina,heart_disease_present
0,1,normal,128,2,0,0,2,308,0.0,male,45,170,0,0
1,2,normal,110,3,0,0,0,214,1.6,female,54,158,0,0
2,1,normal,125,4,3,0,2,304,0.0,male,77,162,1,1
3,1,reversible_defect,152,4,0,0,0,223,0.0,male,40,181,0,1
4,3,reversible_defect,178,1,0,0,2,270,4.2,male,59,145,0,0


We use Seaborn's `pairplot` function to take a first look at the dataset.

**Data transformation**

We transform the `sex` column into a numeric encoding, mapping `female` to 0 and `male` to 1.

In [4]:
diccionario = {'female': 0, 'male': 1}
data['sex'] = data.sex.map(diccionario)
data.head()

Unnamed: 0,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina,heart_disease_present
0,1,normal,128,2,0,0,2,308,0.0,1,45,170,0,0
1,2,normal,110,3,0,0,0,214,1.6,0,54,158,0,0
2,1,normal,125,4,3,0,2,304,0.0,1,77,162,1,1
3,1,reversible_defect,152,4,0,0,0,223,0.0,1,40,181,0,1
4,3,reversible_defect,178,1,0,0,2,270,4.2,1,59,145,0,0
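
One thing worth checking after a dictionary-based `map` is whether every value was covered: `Series.map` returns `NaN` for any value that is not a key of the dictionary. The sketch below illustrates this on a hypothetical toy column (the names `sex`, `mapping`, `encoded`, and `n_unmapped` are made up for the example, and the capitalized `'Male'` is included on purpose).

```python
import pandas as pd

# Hypothetical toy column; 'Male' (capitalized) is deliberately missing from the mapping
sex = pd.Series(['male', 'female', 'Male'])
mapping = {'female': 0, 'male': 1}

# Values without a matching key become NaN instead of raising an error
encoded = sex.map(mapping)

# Counting the NaNs reveals how many values were left unmapped
n_unmapped = int(encoded.isna().sum())
```

If `n_unmapped` is greater than zero, the labels probably need cleaning (for instance with `str.lower()`) before mapping.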


And the `thal` column, which we one-hot encode with `pd.get_dummies`:

In [5]:
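# One-hot encode `thal`; dtype=int keeps the 0/1 output on pandas >= 2.0, where the default dtype is bool
data = pd.concat([data, pd.get_dummies(data['thal'], dtype=int)], axis=1)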
data.head()

Unnamed: 0,slope_of_peak_exercise_st_segment,thal,resting_blood_pressure,chest_pain_type,num_major_vessels,fasting_blood_sugar_gt_120_mg_per_dl,resting_ekg_results,serum_cholesterol_mg_per_dl,oldpeak_eq_st_depression,sex,age,max_heart_rate_achieved,exercise_induced_angina,heart_disease_present,fixed_defect,normal,reversible_defect
0,1,normal,128,2,0,0,2,308,0.0,1,45,170,0,0,0,1,0
1,2,normal,110,3,0,0,0,214,1.6,0,54,158,0,0,0,1,0
2,1,normal,125,4,3,0,2,304,0.0,1,77,162,1,1,0,1,0
3,1,reversible_defect,152,4,0,0,0,223,0.0,1,40,181,0,1,0,0,1
4,3,reversible_defect,178,1,0,0,2,270,4.2,1,59,145,0,0,0,0,1
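
Note that the original `thal` column is still present next to the new indicator columns. A possible variant, sketched here under the assumption that the text column is no longer needed, lets `pd.get_dummies` replace it directly and prefixes the new column names. The frame `toy` and the `thal_` prefix are illustrative choices, using the `thal` values from the five rows shown above.

```python
import pandas as pd

# The `thal` values of the five rows shown above; `toy` is a hypothetical name
toy = pd.DataFrame({
    'thal': ['normal', 'normal', 'normal', 'reversible_defect', 'reversible_defect'],
})

# columns=['thal'] replaces the original column with its indicators;
# prefix='thal' yields names such as 'thal_normal'; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=['thal'], prefix='thal', dtype=int)
```

Only categories that actually occur in the data get a column, so `fixed_defect` does not appear in this five-row example.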
