# Feature Engineering Class

## Dataset: Heart Disease UCI

**Link**: [https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

---

## Tasks

### 1. Data Loading and Inspection

* Load the dataset using pandas.
* Display the first 5 rows.
* Check for missing values.
* Inspect data types and basic statistics.

---

### 2. Creating New Features

* Create an **age group** feature by segmenting age into bins:

  * Young (0-35)
  * Middle-aged (36-50)
  * Senior (51-65)
  * Elderly (66+)

* Create a new feature that represents the **ratio of cholesterol to age** (`chol_per_age`).

* Create an **interaction feature** by multiplying `thal` and `slope`.
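
As a rough illustration of the first two bullets, the binning can be expressed with `pd.cut` and the ratio with plain column division. This is only a sketch: the small `sample` frame is hypothetical, and the column names `Age` and `Cholesterol` are assumed to match the file being loaded.

```python
import pandas as pd

# Hypothetical values, for illustration only
sample = pd.DataFrame({
    "Age": [28, 35, 36, 50, 51, 65, 66, 77],
    "Cholesterol": [180, 210, 250, 200, 255, 260, 198, 231],
})

# Right-inclusive bins: (0, 35], (35, 50], (50, 65], (65, inf]
bins = [0, 35, 50, 65, float("inf")]
labels = ["Young", "Middle-aged", "Senior", "Elderly"]
sample["age_group"] = pd.cut(sample["Age"], bins=bins, labels=labels)

# Ratio of cholesterol to age
sample["chol_per_age"] = sample["Cholesterol"] / sample["Age"]

print(sample)
```

Because the bins are right-inclusive, an age of exactly 35, 50, or 65 falls into the lower group, which matches the ranges listed above.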

---

### 3. Encoding Categorical Features

* Identify all categorical features.
* Apply **One-Hot Encoding** to the categorical features.
* Concatenate the encoded features back into the dataset.

---

### 4. Scaling Numerical Features

* Identify all numerical features.
* Apply **StandardScaler** to normalize these features.
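
A minimal sketch of this step follows. It assumes the target column is named `HeartDisease` and should be left unscaled. The `sample` values are the `Age`, `MaxHR`, and `HeartDisease` entries of the first five rows shown later in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

sample = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "MaxHR": [172, 156, 98, 108, 122],
    "HeartDisease": [0, 1, 0, 1, 0],
})

target = "HeartDisease"  # assumption: the target is excluded from scaling

# Numeric columns other than the target
numerical_cols = [
    c for c in sample.columns
    if pd.api.types.is_numeric_dtype(sample[c]) and c != target
]

# Cast to float first so the scaled values can be stored without dtype issues
scaled = sample.astype({c: "float64" for c in numerical_cols})
scaler = StandardScaler()
scaled[numerical_cols] = scaler.fit_transform(sample[numerical_cols])

# After scaling, each column has mean 0 and population standard deviation 1
print(scaled[numerical_cols].mean().round(6))
print(scaled[numerical_cols].std(ddof=0).round(6))
```

`StandardScaler` divides by the population standard deviation, which is why the check uses `ddof=0` rather than the pandas default.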

---

### 5. Final Verification

* Show the transformed dataset.
* Check that all features are now numeric.
* Confirm the dataset shape and readiness for modeling.
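
The numeric check can be automated rather than read off the `dtypes` listing by eye. The helper `non_numeric_columns` below is a hypothetical name introduced for this sketch; it treats boolean dummy columns as numeric, which is how `pd.api.types.is_numeric_dtype` classifies them.

```python
import pandas as pd


def non_numeric_columns(frame: pd.DataFrame) -> list[str]:
    """Return the names of columns whose dtype is neither numeric nor boolean."""
    return [c for c in frame.columns if not pd.api.types.is_numeric_dtype(frame[c])]


# Hypothetical frame with one column that still needs encoding
sample = pd.DataFrame({
    "Age": [-1.43, -0.48],
    "Sex": ["M", "F"],
    "CP_ASY": [False, True],
})

remaining = non_numeric_columns(sample)
print(sample.shape)
print(remaining)  # an empty list would mean every feature is numeric
```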

---

## Objective

By completing these tasks, you will learn how to perform essential **feature engineering** steps:

* Creating new informative features
* Encoding categorical variables
* Scaling for machine learning models



In [27]:
# Imports
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Code - Task 1: Data Loading and Inspection

In [9]:
# Load the dataset
df = pd.read_csv("heart.csv")

# Display the first few rows of the dataset
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [10]:
# Check for missing values
print("Missing values by column:")
print(df.isnull().sum())

Missing values by column:
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64


In [11]:
# Inspect data types and basic statistics
print("Basic statistics of the dataset:")
print(df.describe())
print("\nInformation about the dataset:")
df.info()  # info() prints its report directly and returns None

Basic statistics of the dataset:
              Age   RestingBP  Cholesterol   FastingBS       MaxHR  \
count  918.000000  918.000000   918.000000  918.000000  918.000000   
mean    53.510893  132.396514   198.799564    0.233115  136.809368   
std      9.432617   18.514154   109.384145    0.423046   25.460334   
min     28.000000    0.000000     0.000000    0.000000   60.000000   
25%     47.000000  120.000000   173.250000    0.000000  120.000000   
50%     54.000000  130.000000   223.000000    0.000000  138.000000   
75%     60.000000  140.000000   267.000000    0.000000  156.000000   
max     77.000000  200.000000   603.000000    1.000000  202.000000   

          Oldpeak  HeartDisease  
count  918.000000    918.000000  
mean     0.887364      0.553377  
std      1.066570      0.497414  
min     -2.600000      0.000000  
25%      0.000000      0.000000  
50%      0.600000      1.000000  
75%      1.500000      1.000000  
max      6.200000      1.000000  

Information about the dataset:

In [None]:
# Display the columns of the dataset
print(df.columns)

Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
       'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope',
       'HeartDisease', 'age_group', 'cholesterol_age_ratio'],
      dtype='object')


# Code - Task 2 and 3: Creating New Features and Encoding Categorical Features

In [23]:
# Create a new column 'age_group' based on the 'Age' column
def categorize_age(age):
    if age <= 35:
        return "Young"
    elif age <= 50:
        return "Middle-aged"
    elif age <= 65:
        return "Senior"
    else:
        return "Elderly"
    
df['age_group'] = df['Age'].apply(categorize_age)

# Create the cholesterol/age ratio
df['chol_per_age'] = df['Cholesterol'] / df['Age']

# The column 'thal' doesn't exist in this dataset, so the column 'ChestPainType' is used instead

# One-Hot Encoding for the categorical variables 'ChestPainType' and 'ST_Slope'
df_encoded = pd.get_dummies(df, columns=['ChestPainType', 'ST_Slope'], prefix=['CP', 'Slope'])

# Example of interaction between a ChestPainType category and an ST_Slope category
df_encoded['CP_ASY_Slope_Flat'] = (
    df_encoded.get('CP_ASY', 0) * df_encoded.get('Slope_Flat', 0)
)

df_encoded.head()


Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,HeartDisease,...,chol_per_age,thal,CP_ASY,CP_ATA,CP_NAP,CP_TA,Slope_Down,Slope_Flat,Slope_Up,CP_ASY_Slope_Flat
0,40,M,140,289,0,Normal,172,N,0.0,0,...,7.225,ATA,False,True,False,False,False,False,True,False
1,49,F,160,180,0,Normal,156,N,1.0,1,...,3.673469,NAP,False,False,True,False,False,True,False,False
2,37,M,130,283,0,ST,98,N,0.0,0,...,7.648649,ATA,False,True,False,False,False,False,True,False
3,48,F,138,214,0,Normal,108,Y,1.5,1,...,4.458333,ASY,True,False,False,False,False,True,False,True
4,54,M,150,195,0,Normal,122,N,0.0,0,...,3.611111,NAP,False,False,True,False,False,False,True,False


# Code - Task 3 and 4: Encoding Categorical Features and Scaling Numerical Features

In [24]:
# Identify categorical features for one-hot encoding
categorical_features = ["ChestPainType", "ST_Slope"]
df_encoded = pd.get_dummies(df, columns=categorical_features, prefix=['CP', 'Slope'])
df_encoded.head()

Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,HeartDisease,...,cholesterol_age_ratio,chol_per_age,thal,CP_ASY,CP_ATA,CP_NAP,CP_TA,Slope_Down,Slope_Flat,Slope_Up
0,40,M,140,289,0,Normal,172,N,0.0,0,...,7.225,7.225,ATA,False,True,False,False,False,False,True
1,49,F,160,180,0,Normal,156,N,1.0,1,...,3.673469,3.673469,NAP,False,False,True,False,False,True,False
2,37,M,130,283,0,ST,98,N,0.0,0,...,7.648649,7.648649,ATA,False,True,False,False,False,False,True
3,48,F,138,214,0,Normal,108,Y,1.5,1,...,4.458333,4.458333,ASY,True,False,False,False,False,True,False
4,54,M,150,195,0,Normal,122,N,0.0,0,...,3.611111,3.611111,NAP,False,False,True,False,False,False,True


In [29]:
# Identify the numerical columns to scale
numerical_features = ["Age", "Cholesterol", "RestingBP", "MaxHR"]

scaler = StandardScaler()
df_encoded[numerical_features] = scaler.fit_transform(df_encoded[numerical_features])
df_encoded[numerical_features].head()

Unnamed: 0,Age,Cholesterol,RestingBP,MaxHR
0,-1.43314,0.82507,0.410909,1.382928
1,-0.478484,-0.171961,1.491752,0.754157
2,-1.751359,0.770188,-0.129513,-1.525138
3,-0.584556,0.13904,0.302825,-1.132156
4,0.051881,-0.034755,0.951331,-0.581981


In [32]:
# Verifying the changes
print(f"Dataset shape: {df_encoded.shape}")
print("Checking if all columns are numeric:")
print(df_encoded.dtypes)
df_encoded.head()

Dataset shape: (918, 21)
Checking if all columns are numeric:
Age                      float64
Sex                       object
RestingBP                float64
Cholesterol              float64
FastingBS                  int64
RestingECG                object
MaxHR                    float64
ExerciseAngina            object
Oldpeak                  float64
HeartDisease               int64
age_group                 object
cholesterol_age_ratio    float64
chol_per_age             float64
thal                      object
CP_ASY                      bool
CP_ATA                      bool
CP_NAP                      bool
CP_TA                       bool
Slope_Down                  bool
Slope_Flat                  bool
Slope_Up                    bool
dtype: object


Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,HeartDisease,...,cholesterol_age_ratio,chol_per_age,thal,CP_ASY,CP_ATA,CP_NAP,CP_TA,Slope_Down,Slope_Flat,Slope_Up
0,-1.43314,M,0.410909,0.82507,0,Normal,1.382928,N,0.0,0,...,7.225,7.225,ATA,False,True,False,False,False,False,True
1,-0.478484,F,1.491752,-0.171961,0,Normal,0.754157,N,1.0,1,...,3.673469,3.673469,NAP,False,False,True,False,False,True,False
2,-1.751359,M,-0.129513,0.770188,0,ST,-1.525138,N,0.0,0,...,7.648649,7.648649,ATA,False,True,False,False,False,False,True
3,-0.584556,F,0.302825,0.13904,0,Normal,-1.132156,Y,1.5,1,...,4.458333,4.458333,ASY,True,False,False,False,False,True,False
4,0.051881,M,0.951331,-0.034755,0,Normal,-0.581981,N,0.0,0,...,3.611111,3.611111,NAP,False,False,True,False,False,False,True


# Code - Task 5: Final Verification

In [33]:
# Display the first few rows of the modified dataset
df_encoded.head()

Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,HeartDisease,...,cholesterol_age_ratio,chol_per_age,thal,CP_ASY,CP_ATA,CP_NAP,CP_TA,Slope_Down,Slope_Flat,Slope_Up
0,-1.43314,M,0.410909,0.82507,0,Normal,1.382928,N,0.0,0,...,7.225,7.225,ATA,False,True,False,False,False,False,True
1,-0.478484,F,1.491752,-0.171961,0,Normal,0.754157,N,1.0,1,...,3.673469,3.673469,NAP,False,False,True,False,False,True,False
2,-1.751359,M,-0.129513,0.770188,0,ST,-1.525138,N,0.0,0,...,7.648649,7.648649,ATA,False,True,False,False,False,False,True
3,-0.584556,F,0.302825,0.13904,0,Normal,-1.132156,Y,1.5,1,...,4.458333,4.458333,ASY,True,False,False,False,False,True,False
4,0.051881,M,0.951331,-0.034755,0,Normal,-0.581981,N,0.0,0,...,3.611111,3.611111,NAP,False,False,True,False,False,False,True


In [34]:
# Check whether all columns are numeric
df_encoded.dtypes

Age                      float64
Sex                       object
RestingBP                float64
Cholesterol              float64
FastingBS                  int64
RestingECG                object
MaxHR                    float64
ExerciseAngina            object
Oldpeak                  float64
HeartDisease               int64
age_group                 object
cholesterol_age_ratio    float64
chol_per_age             float64
thal                      object
CP_ASY                      bool
CP_ATA                      bool
CP_NAP                      bool
CP_TA                       bool
Slope_Down                  bool
Slope_Flat                  bool
Slope_Up                    bool
dtype: object

In [36]:
# Display the shape of the modified dataset
df_encoded.shape

(918, 21)
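
The `dtypes` listing above still contains `object` columns (`Sex`, `RestingECG`, `ExerciseAngina`, `age_group`, `thal`), so the dataset is not yet fully numeric. One way to finish the encoding is to pass every remaining non-numeric column to `pd.get_dummies`. The sketch below is an assumption about how that could look, run on a small hypothetical frame instead of `df_encoded`.

```python
import pandas as pd

# Hypothetical frame mixing scaled, categorical, and dummy columns
sample = pd.DataFrame({
    "Age": [-1.43314, -0.478484, -1.751359],
    "Sex": ["M", "F", "M"],
    "RestingECG": ["Normal", "Normal", "ST"],
    "ExerciseAngina": ["N", "N", "N"],
    "CP_ASY": [False, False, False],
})

# Columns that are still non-numeric after the earlier encoding step
remaining = [
    c for c in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[c])
]

# Encode them as 0/1 integer columns
final = pd.get_dummies(sample, columns=remaining, dtype=int)

# Confirm that every column is now numeric (booleans count as numeric)
all_numeric = all(pd.api.types.is_numeric_dtype(final[c]) for c in final.columns)
print(final.dtypes)
print(all_numeric)
```

On the real data, the `thal` column shown in the outputs holds the same values as the original `ChestPainType`, so it may be simpler to drop it than to encode it a second time.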