## Dataset basics

- size: 918 rows, 12 columns
- attributes:

| Index | Attribute | Description | Type |
| --- | --- | --- | --- |
| 1 | Age | Age of the patient in years | Numeric (int) |
| 2 | Sex | Sex of the patient | Categorical (M: male, F: female) |
| 3 | ChestPainType | Chest pain type | Categorical (TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic) |
| 4 | RestingBP | Resting blood pressure | Numeric (int) |
| 5 | Cholesterol | Serum cholesterol | Numeric (int) |
| 6 | FastingBS | Fasting blood sugar | Binary (1: fasting blood sugar > 120 mg/dl, 0: otherwise) |
| 7 | RestingECG | Resting electrocardiogram results | Categorical (Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria) |
| 8 | MaxHR | Maximum heart rate achieved | Numeric (int) |
| 9 | ExerciseAngina | Exercise-induced angina | Categorical (Y: Yes, N: No) |
| 10 | Oldpeak | ST depression induced by exercise relative to rest | Numeric (float) |
| 11 | ST_Slope | The slope of the peak exercise ST segment | Categorical (Up: upsloping, Flat: flat, Down: downsloping) |
| 12 | HeartDisease | Output class | Binary (1: heart disease, 0: normal) |

In [1]:
import pandas as pd

BASE_PATH = '../data/complete-heart.csv'
df_complete = pd.read_csv(BASE_PATH)
df_complete.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [2]:
from sklearn.model_selection import train_test_split

# Separate the feature columns from the target column
attributes = df_complete.drop(columns=['HeartDisease']).columns
values = df_complete.drop(columns=['HeartDisease']).values
labels = df_complete['HeartDisease'].values

# Stratified 80/20 split: both parts keep the class balance of the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    values,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels
)

# Rebuild DataFrames so each split can be saved with its column names
df_train = pd.DataFrame(
    data=X_train,
    columns=attributes
)
df_train['HeartDisease'] = y_train

df_test = pd.DataFrame(
    data=X_test,
    columns=attributes
)
df_test['HeartDisease'] = y_test

# Persist both splits; the row index is not written
df_train.to_csv('../data/train-heart.csv', index=False)
df_test.to_csv('../data/test-heart.csv', index=False)
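
A quick way to confirm what `stratify` does is to compare the share of positive labels in each split. The sketch below uses a hypothetical toy frame (`toy`, 100 rows, 60% positives) instead of the real CSV, so the numbers are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the heart dataset: 60 positives, 40 negatives.
toy = pd.DataFrame({
    "Age": range(100),
    "HeartDisease": [1] * 60 + [0] * 40,
})

# Same split settings as above, applied to the toy frame.
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,
    random_state=42,
    stratify=toy["HeartDisease"],
)

# The mean of a 0/1 column is the share of positives in that split.
train_ratio = toy_train["HeartDisease"].mean()
test_ratio = toy_test["HeartDisease"].mean()
print(f"train positives: {train_ratio:.2f}, test positives: {test_ratio:.2f}")
```

The same comparison can be run on `df_train` and `df_test` to check the real split.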

In [3]:
df = pd.read_csv('../data/train-heart.csv')

## Missing values and correlation
- There are no missing (null) values.

In [4]:
# check for missing (null) values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 734 entries, 0 to 733
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             734 non-null    int64  
 1   Sex             734 non-null    object 
 2   ChestPainType   734 non-null    object 
 3   RestingBP       734 non-null    int64  
 4   Cholesterol     734 non-null    int64  
 5   FastingBS       734 non-null    int64  
 6   RestingECG      734 non-null    object 
 7   MaxHR           734 non-null    int64  
 8   ExerciseAngina  734 non-null    object 
 9   Oldpeak         734 non-null    float64
 10  ST_Slope        734 non-null    object 
 11  HeartDisease    734 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 68.9+ KB
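
`df.info()` shows non-null counts that have to be compared against the row count by eye. A more direct check, sketched here on a hypothetical toy frame with two deliberately missing entries, is to count nulls per column with `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing number and one missing string.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Cholesterol": [289.0, np.nan, 283.0],
    "Sex": ["M", "F", None],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# True if at least one cell anywhere in the frame is missing.
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("any missing:", has_missing)
```

Running `df.isna().sum()` on the training frame should report zero for every column if the conclusion above holds.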


## Categorical attributes handling

- `Sex`: `OrdinalEncoder`
- `ChestPainType`: `OneHotEncoder`
- `RestingECG`: `OrdinalEncoder`
- `ExerciseAngina`: `OrdinalEncoder`
- `ST_Slope`: `OrdinalEncoder`

In [32]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

ordinal_columns = ['Sex', 'ExerciseAngina', 'ST_Slope']

# Map each category of the ordinal columns to an integer code
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(df[ordinal_columns])

processed_df = df.copy()
processed_df[ordinal_columns] = ordinal_encoder.transform(df[ordinal_columns])

# One-hot encode ChestPainType (feature_name_combiner requires scikit-learn >= 1.3)
one_hot_encoder = OneHotEncoder(feature_name_combiner='concat')
one_hot_encoder.fit(processed_df[['ChestPainType']])

encoded_data = one_hot_encoder.transform(processed_df[['ChestPainType']]).toarray()

# categories_ is a list with one array per encoded column; take the first
# so the frame gets flat column names (ASY, ATA, NAP, TA)
encoded_df = pd.DataFrame(
    data=encoded_data,
    columns=one_hot_encoder.categories_[0]
)

encoded_df.head()


Unnamed: 0,ASY,ATA,NAP,TA
0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0
