## Chosen dataset

- https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction


## About Dataset

### Context

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management, where a machine learning model can be of great help.

### Attribute Information
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: oldpeak = ST [Numeric value measured in depression]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]
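
As a rough sketch, the categorical codes in the attribute list above can be written down as a small lookup and used to sanity-check a row before any encoding. The names `ALLOWED_VALUES` and `invalid_fields` are illustrative and not part of the dataset or the notebook; the sample values correspond to the first row of `heart.csv`.

```python
# Allowed codes for the categorical attributes, copied from the attribute list.
ALLOWED_VALUES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
}


def invalid_fields(row):
    """Return the categorical field names whose value is not an allowed code."""
    return [
        name
        for name, allowed in ALLOWED_VALUES.items()
        if row.get(name) not in allowed
    ]


# Hypothetical row dictionary holding only the categorical fields.
sample = {
    "Sex": "M",
    "ChestPainType": "ATA",
    "RestingECG": "Normal",
    "ExerciseAngina": "N",
    "ST_Slope": "Up",
}
print(invalid_fields(sample))                  # no invalid fields expected
print(invalid_fields({**sample, "Sex": "X"}))  # "Sex" should be reported
```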


## Source
This dataset was created by combining different datasets already available independently but not combined before. In this dataset, 5 heart datasets are combined over 11 common features which makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

- Cleveland: 303 observations
- Hungarian: 294 observations
- Switzerland: 123 observations
- Long Beach VA: 200 observations
- Statlog (Heart) Data Set: 270 observations
- Total: 1190 observations
- Duplicated: 272 observations
- Final dataset: 918 observations
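
The counts above are consistent with each other, which a few lines of arithmetic can confirm. This is only a check of the figures quoted in this section; the variable names are illustrative.

```python
# Observation counts per source dataset, as listed above.
sources = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
    "Statlog (Heart)": 270,
}
duplicated = 272

total = sum(sources.values())
final = total - duplicated
print(total, final)  # 1190 918
```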

Every dataset used can be found under the Index of heart disease datasets from UCI Machine Learning Repository on the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/

### Citation
fedesoriano. (September 2021). Heart Failure Prediction Dataset. Retrieved [Date Retrieved] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.

### Acknowledgements
Creators:

- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Donor:

- David W. Aha (aha '@' ics.uci.edu) (714) 856-8779

## Installing dependencies

In [2]:
!pip install -r requirements.txt

Collecting pandas
  Using cached pandas-1.4.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting matplotlib
  Downloading matplotlib-3.5.2-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.3 MB)
Collecting seaborn
  Using cached seaborn-0.11.2-py3-none-any.whl (292 kB)
Collecting numpy>=1.18.5
  Downloading numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.9 MB)
Collecting cycler>=0.10
  Using cached cycler-0.11.0-py3-none-any.whl (6.4 kB)
Collecting pillow>=6.2.0
  Downloading Pillow-9.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)

In [2]:
import pandas as pd

In [7]:
!unzip "../base/archive (1).zip"

Archive:  ../base/archive (1).zip
  inflating: heart.csv               
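
The same extraction can also be done from Python with the standard-library `zipfile` module, which sidesteps shell quoting for the space in the archive name. A minimal sketch: the helper `extract_archive` is a hypothetical name, and the path simply mirrors the cell above.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Path taken from the notebook cell above; skipped if the file is absent.
archive_path = Path("../base/archive (1).zip")
if archive_path.exists():
    print(extract_archive(archive_path))
```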


## Loading the dataset

In [15]:
df = pd.read_csv("heart.csv")

In [17]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [18]:
df.groupby("RestingECG").count()

RestingECG,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
LVH,188,188,188,188,188,188,188,188,188,188,188
Normal,552,552,552,552,552,552,552,552,552,552,552
ST,178,178,178,178,178,178,178,178,178,178,178


In [19]:
df.groupby("ChestPainType").count()

ChestPainType,Age,Sex,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
ASY,496,496,496,496,496,496,496,496,496,496,496
ATA,173,173,173,173,173,173,173,173,173,173,173
NAP,203,203,203,203,203,203,203,203,203,203,203
TA,46,46,46,46,46,46,46,46,46,46,46


In [20]:
df.groupby("ST_Slope").count()

ST_Slope,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,HeartDisease
Down,63,63,63,63,63,63,63,63,63,63,63
Flat,460,460,460,460,460,460,460,460,460,460,460
Up,395,395,395,395,395,395,395,395,395,395,395


In [21]:
df.groupby("Sex").count()

Sex,Age,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
F,193,193,193,193,193,193,193,193,193,193,193
M,725,725,725,725,725,725,725,725,725,725,725
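
In the outputs above, `groupby(...).count()` shows the same count in every column, so a single frequency per category carries the same information. One compact alternative is `Series.value_counts()`. The sketch below runs on a tiny hand-built frame (the five rows shown by `df.head()`), not on the full dataset, and `sample` is an illustrative name.

```python
import pandas as pd

# Stand-in for df: the Sex and ChestPainType values of the first five rows.
sample = pd.DataFrame(
    {
        "Sex": ["M", "F", "M", "F", "M"],
        "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
    }
)

# One frequency table per categorical column.
for column in ["Sex", "ChestPainType"]:
    counts = sample[column].value_counts()
    print(column, counts.to_dict())
```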


## Data dictionary for the transformation


| Column | Original | Transformed |
|--------|----------|-------------|
| Sex | F, M | F=1, M=2 |
| ST_Slope | Down, Flat, Up | Down=1, Flat=2, Up=3 |
| ChestPainType | ASY, ATA, NAP, TA | ASY=1, ATA=2, NAP=3, TA=4 |
| RestingECG | LVH, Normal, ST | LVH=1, Normal=2, ST=3 |
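
One way to keep the transformation aligned with this table is to store the table as a nested dictionary and apply it with `Series.map`, failing loudly when a label has no code. This is a sketch under that assumption, not the approach the notebook cells take; `ENCODINGS` and `encode` are hypothetical names, and the two-row frame stands in for `df`.

```python
import pandas as pd

# The data dictionary above, one mapping per column.
ENCODINGS = {
    "Sex": {"F": 1, "M": 2},
    "ST_Slope": {"Down": 1, "Flat": 2, "Up": 3},
    "ChestPainType": {"ASY": 1, "ATA": 2, "NAP": 3, "TA": 4},
    "RestingECG": {"LVH": 1, "Normal": 2, "ST": 3},
}


def encode(frame, encodings=ENCODINGS):
    """Return a copy of frame with each listed column replaced by its code."""
    encoded = frame.copy()
    for column, mapping in encodings.items():
        codes = frame[column].map(mapping)
        if codes.isna().any():
            # A label missing from the mapping would otherwise become NaN.
            unknown = frame.loc[codes.isna(), column].unique().tolist()
            raise ValueError(f"unmapped values in {column}: {unknown}")
        encoded[column] = codes.astype(int)
    return encoded


sample = pd.DataFrame(
    {
        "Sex": ["M", "F"],
        "ST_Slope": ["Up", "Flat"],
        "ChestPainType": ["ATA", "NAP"],
        "RestingECG": ["Normal", "ST"],
    }
)
print(encode(sample))
```

Raising on unmapped labels is a deliberate choice here: a mapping applied to the wrong column then stops with an error instead of filling the column with missing values.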

## Transform column Sex

In [5]:
df['Sex'] = df['Sex'].apply(lambda item: 1 if item == "F" else 2)

In [6]:
df

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,2,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,1,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,2,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,1,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,2,NAP,150,195,0,Normal,122,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,2,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,2,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,2,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,1,ATA,130,236,0,LVH,174,N,0.0,Flat,1


## Transform column ST_Slope

In [9]:

def normalize_st_slope(val):
    if val == 'Up':
        return 3
    if val == 'Flat':
        return 2
    if val == 'Down':    
        return 1

df['ST_Slope'] = df['ST_Slope'].apply(normalize_st_slope)

## Transform column RestingECG

In [14]:

def normalize_RestingECG(val):
    # Codes follow the data dictionary: LVH=1, Normal=2, ST=3
    if val == 'LVH':
        return 1
    if val == 'Normal':
        return 2
    if val == 'ST':
        return 3

df['RestingECG'] = df['RestingECG'].apply(normalize_RestingECG)


In [13]:
df

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,2,ATA,140,289,0,2,172,N,0.0,3,0
1,49,1,NAP,160,180,0,2,156,N,1.0,2,1
2,37,2,ATA,130,283,0,3,98,N,0.0,3,0
3,48,1,ASY,138,214,0,2,108,Y,1.5,2,1
4,54,2,NAP,150,195,0,2,122,N,0.0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,2,TA,110,264,0,2,132,N,1.2,2,1
914,68,2,ASY,144,193,1,2,141,N,3.4,2,1
915,57,2,ASY,130,131,0,2,115,Y,1.2,2,1
916,57,1,ATA,130,236,0,1,174,N,0.0,2,1


## Transform column ChestPainType
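
The notebook leaves this section without a cell. A minimal sketch of what it could contain, assuming the codes from the data dictionary (ASY=1, ATA=2, NAP=3, TA=4), is shown below. The names `CHEST_PAIN_CODES`, `normalize_chest_pain_type`, and `df_sample` are hypothetical, and `df_sample` is a small stand-in for `df`.

```python
import pandas as pd

# Codes from the data dictionary: ASY=1, ATA=2, NAP=3, TA=4
CHEST_PAIN_CODES = {"ASY": 1, "ATA": 2, "NAP": 3, "TA": 4}


def normalize_chest_pain_type(val):
    # An unknown label raises KeyError instead of silently returning None.
    return CHEST_PAIN_CODES[val]


# Stand-in for the notebook's df, using labels that occur in the dataset.
df_sample = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ATA", "ASY", "TA"]})
df_sample["ChestPainType"] = df_sample["ChestPainType"].apply(normalize_chest_pain_type)
print(df_sample["ChestPainType"].tolist())  # [2, 3, 2, 1, 4]
```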

## Normalized dataset

In [None]:
df.to_csv("base_normalizada.csv",index=False)
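
Passing `index=False` matters when the file is read back: without it, the row index is written as an unnamed first column, which `pd.read_csv` then loads as `Unnamed: 0`. The sketch below illustrates the difference with an in-memory buffer and a two-row illustrative frame, so no file is written.

```python
import io

import pandas as pd

frame = pd.DataFrame({"Age": [40, 49], "Sex": [2, 1]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)  # ['Unnamed: 0', 'Age', 'Sex']

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_without_index)  # ['Age', 'Sex']
```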