[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mobadara/<your_repository_name>/blob/main/notebooks/feature_engineering.ipynb)
[![Kaggle Notebook](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/new?source=https://github.com/mobadara/<your_repository_name>/blob/main/notebooks/feature_engineering.ipynb)
[![Python](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)


# **Feature Engineering - Cardiovascular Disease Risk Prediction**

## **Introduction**

Cardiovascular disease (CVD) is a leading cause of death globally. This project aims to build a machine learning model to predict the risk of CVD based on patient data. This notebook focuses on the crucial step of feature engineering, where we create new features from the existing dataset to potentially enhance the predictive power of our model.

The dataset used for this project is the "Cardiovascular Disease Dataset" by Sulianova, available on Kaggle.

You can find the Exploratory Data Analysis (EDA) notebook for this project, which provides a deeper understanding of the data, here:

[![Open In GitHub](https://img.shields.io/badge/View%20EDA%20Notebook-blue?logo=github)](https://github.com/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/exploratory-data-analysis.ipynb)

In this notebook, we will engineer features such as Body Mass Index (BMI), convert age to years, and explore creating categories from blood pressure readings. Let's begin!

## **Setup**

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

warnings.filterwarnings('ignore')

## **Data Loading**

The following code cell loads the dataset that was used in the Exploratory Data Analysis (EDA) phase. It is read directly from the GitHub repository that hosts the EDA notebook.

In [13]:
df = pd.read_csv('https://raw.githubusercontent.com/mobadara/'
                'cardiovascular-disease-risk-prediction'
                '/refs/heads/main/data/cardio-train.csv', sep=';')
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


## **Body Mass Index (BMI)**

Body Mass Index (BMI) is a widely used single number to evaluate body weight in relation to height. It is calculated as weight in kilograms divided by the square of height in meters. BMI can be an important indicator of potential health risks associated with weight.

The formula for BMI is:

$\qquad \text{BMI} = \frac{\text{weight (kg)}}{(\text{height (m)})^2}$

We will create a new feature, `bmi`, using the `weight` and `height` columns from our dataset. This new feature might provide the model with a more direct measure of a patient's weight category relative to their height, potentially improving prediction accuracy.

In [14]:
# We are dividing `height` by 100 to convert the height
# from `cm` unit to `m`.
df['bmi'] = df['weight'] / ((df['height'] / 100) ** 2)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,bmi
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,21.96712
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,34.927679
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,23.507805
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,28.710479
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,23.011177
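
As a quick sanity check on the formula, the first row (62 kg, 168 cm) should give 62 / 1.68² ≈ 21.97, which matches the output above. The sketch below recomputes that value on a tiny hand-made frame and flags implausible results. The `demo` frame, its third row, and the 10 to 60 plausibility bounds are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Tiny stand-in frame: the first two rows are copied from the notebook output,
# the third is a made-up row whose height looks like a data-entry error.
demo = pd.DataFrame({
    'height': [168, 156, 16],      # centimetres
    'weight': [62.0, 85.0, 70.0],  # kilograms
})

# Same formula as above: weight (kg) divided by height (m) squared.
demo['bmi'] = demo['weight'] / ((demo['height'] / 100) ** 2)

# Hypothetical plausibility bounds, chosen only for illustration.
BMI_MIN, BMI_MAX = 10, 60
demo['bmi_plausible'] = demo['bmi'].between(BMI_MIN, BMI_MAX)

print(demo.round(2))
```

A check like this could be run on the full dataset before modelling, since a single mistyped height or weight produces an extreme BMI.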


## **Age in Years**

The `age` column in the current dataset is represented in days. For better interpretability and potential alignment with common medical understanding, we will convert the age from days to years.

To do this, we will divide the `age` column by the approximate number of days in a year (365.25, to account for leap years).

This conversion is a linear rescaling, so it does not by itself add information for the model. Its main benefit is that age is expressed in a familiar unit, which makes the age groups defined later and any plots easier to read.

In [15]:
df['age'] = df['age'] / 365.25
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,bmi
0,0,50.35729,2,168,62.0,110,80,1,1,0,0,1,0,21.96712
1,1,55.381246,1,156,85.0,140,90,3,1,0,0,1,1,34.927679
2,2,51.627652,1,165,64.0,130,70,3,1,0,0,0,1,23.507805
3,3,48.249144,2,169,82.0,150,100,1,1,0,0,1,1,28.710479
4,4,47.841205,1,156,56.0,100,60,1,1,0,0,0,0,23.011177


Before engineering the remaining features, we replace the integer codes in the categorical columns with readable labels for consistency.

In [16]:
df['gender'] = df['gender'].map({1: 'Female', 2: 'Male'})
df['cholesterol'] = df['cholesterol'].map({1: 'Normal', 2: 'Above Normal', 3: 'Well Above Normal'})
df['gluc'] = df['gluc'].map({1: 'Normal', 2: 'Above Normal', 3: 'Well Above Normal'})
df['smoke'] = df['smoke'].map({0: 'No', 1: 'Yes'})
df['alco'] = df['alco'].map({0: 'No', 1: 'Yes'})
df['active'] = df['active'].map({0: 'Inactive', 1: 'Active'})
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,bmi
0,0,50.35729,Male,168,62.0,110,80,Normal,Normal,No,No,Active,0,21.96712
1,1,55.381246,Female,156,85.0,140,90,Well Above Normal,Normal,No,No,Active,1,34.927679
2,2,51.627652,Female,165,64.0,130,70,Well Above Normal,Normal,No,No,Inactive,1,23.507805
3,3,48.249144,Male,169,82.0,150,100,Normal,Normal,No,No,Active,1,28.710479
4,4,47.841205,Female,156,56.0,100,60,Normal,Normal,No,No,Inactive,0,23.011177
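
One caveat with `Series.map` and a dictionary: any value missing from the dictionary silently becomes `NaN`, and running the mapping cell a second time turns every value into `NaN` because the labels no longer match the integer keys. The sketch below shows one way to guard against that. The helper name `map_strict` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map codes to labels, raising if any non-null code has no label."""
    mapped = series.map(mapping)
    # Values that were present but became NaN had no entry in `mapping`.
    unmapped = series[mapped.isna() & series.notna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped codes in '{series.name}': {sorted(unmapped)}")
    return mapped

codes = pd.Series([1, 2, 3, 1], name='cholesterol')
labels = map_strict(
    codes, {1: 'Normal', 2: 'Above Normal', 3: 'Well Above Normal'}
)
print(labels.tolist())
```

Calling `map_strict` on a column that has already been converted to labels raises a `ValueError` instead of quietly producing an all-`NaN` column.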


## **Age Group**

To potentially capture non-linear relationships between age and cardiovascular disease risk, and to make the age feature more interpretable, we will create a new categorical feature called 'age_group'. We will categorize the age (now in years) into the following groups:

* **Child/Teen:** 0-17 years
* **Young Adult:** 18-35 years
* **Middle-Aged:** 36-55 years
* **Senior:** 56 years and older

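These categories are broadly defined and aim to group individuals into common life stages where the likelihood of certain health conditions might differ. Because `age` is now a fractional number of years and `pd.cut` uses right-closed intervals by default, the bins below are effectively (0, 17], (17, 35], (35, 55], and (55, ∞); a patient aged 55.4 years, for example, is labeled Senior.

We will use the `age` column (which we converted to years in an earlier step) to create this new `age_group` feature.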

In [17]:
df['age_group'] = pd.cut(df['age'], bins=[0, 17, 35, 55, np.inf],
                           labels=['Child/Teen', 'Young Adult',
                                   'Middle-Aged', 'Senior'])
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,bmi,age_group
0,0,50.35729,Male,168,62.0,110,80,Normal,Normal,No,No,Active,0,21.96712,Middle-Aged
1,1,55.381246,Female,156,85.0,140,90,Well Above Normal,Normal,No,No,Active,1,34.927679,Senior
2,2,51.627652,Female,165,64.0,130,70,Well Above Normal,Normal,No,No,Inactive,1,23.507805,Middle-Aged
3,3,48.249144,Male,169,82.0,150,100,Normal,Normal,No,No,Active,1,28.710479,Middle-Aged
4,4,47.841205,Female,156,56.0,100,60,Normal,Normal,No,No,Inactive,0,23.011177,Middle-Aged
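
The second row above (age 55.38) is labeled Senior even though the list describes Middle-Aged as 36-55 years. This follows from how `pd.cut` treats bin edges with fractional ages. The sketch below compares the default right-closed bins with a left-closed alternative on a few hand-picked ages. The alternative edges `[0, 18, 36, 56, inf]` are an assumption about what "completed years" would look like, not the notebook's choice.

```python
import numpy as np
import pandas as pd

ages = pd.Series([17.5, 35.9, 55.38, 56.0])
labels = ['Child/Teen', 'Young Adult', 'Middle-Aged', 'Senior']

# Default right=True: intervals are (0, 17], (17, 35], (35, 55], (55, inf].
right_closed = pd.cut(ages, bins=[0, 17, 35, 55, np.inf], labels=labels)

# Alternative: left-closed intervals [0, 18), [18, 36), [36, 56), [56, inf),
# so that someone aged 55.38 still counts as 55 completed years.
left_closed = pd.cut(
    ages, bins=[0, 18, 36, 56, np.inf], labels=labels, right=False
)

print(pd.DataFrame({
    'age': ages,
    'right_closed': right_closed,
    'left_closed': left_closed,
}))
```

Either convention can be reasonable; what matters is that the written description of the groups matches the edges actually passed to `pd.cut`.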


## **Blood Pressure Categories**

Blood pressure is typically measured with two numbers: systolic (the pressure when the heart beats) over diastolic (the pressure when the heart rests between beats). In this dataset they are stored as `ap_hi` and `ap_lo`. To make these continuous numerical features potentially more informative and to align with medical understanding, we can categorize them based on standard blood pressure ranges (see the [American Heart Association](https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings)).

For someone who isn't a medical professional, think of it like classifying how "high" or "low" these pressure readings are according to general health guidelines. Instead of just having the raw numbers, we'll create labels like **"Normal"**, **"Elevated"**, or different stages of **"Hypertension"** (high blood pressure). This categorization can sometimes help machine learning models identify patterns more effectively.

We will use the guidelines provided by the American Heart Association (AHA) for adults to create these categories.

Based on these guidelines, we can define categories such as:

* **Normal:** less than 120 systolic AND less than 80 diastolic
* **Elevated:** 120-129 systolic AND less than 80 diastolic
* **Hypertension Stage 1:** 130-139 systolic OR 80-89 diastolic
* **Hypertension Stage 2:** 140 or higher systolic OR 90 or higher diastolic
* **Hypertensive Crisis (Seek medical attention):** higher than 180 systolic AND/OR higher than 120 diastolic

We will create a new categorical feature called `blood_pressure_category` based on these ranges.

In [18]:
def categorize_blood_pressure(row: pd.Series) -> str:
    """
    Categorize a blood pressure reading from its systolic and diastolic values.

    Parameters
    ----------
    row : pd.Series
        A DataFrame row containing `ap_hi` (systolic) and `ap_lo` (diastolic).

    Returns
    -------
    str
        The blood pressure category.
    """
    systolic = row['ap_hi']
    diastolic = row['ap_lo']

    if systolic < 120 and diastolic < 80:
        return 'Normal'
    elif (120 <= systolic <= 129) and diastolic < 80:
        return 'Elevated'
    elif (130 <= systolic <= 139) or (80 <= diastolic <= 89):
        return 'Hypertension Stage 1'
    elif systolic >= 140 or diastolic >= 90:
        return 'Hypertension Stage 2'
    elif systolic > 180 or diastolic > 120:
        return 'Hypertensive Crisis'
    else:
        return 'Undefined' # Catch any edge cases

# Apply the function to each row to create the new column
df['blood_pressure_category'] = df.apply(categorize_blood_pressure, axis=1)

# Display the value counts of the new category
print(df['blood_pressure_category'].value_counts())

# Display a few rows with the new column
print("\nSample of DataFrame with blood_pressure_category:")
df[['ap_hi', 'ap_lo', 'blood_pressure_category']].head()

blood_pressure_category
Hypertension Stage 1    39934
Hypertension Stage 2    17333
Normal                   9608
Elevated                 3125
Name: count, dtype: int64

Sample of DataFrame with blood_pressure_category:


Unnamed: 0,ap_hi,ap_lo,blood_pressure_category
0,110,80,Hypertension Stage 1
1,140,90,Hypertension Stage 2
2,130,70,Hypertension Stage 1
3,150,100,Hypertension Stage 2
4,100,60,Normal
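
One thing to watch in a first-match-wins chain like the function above: the Stage 1 test runs before the Stage 2 and crisis tests. A reading such as 150/85 therefore returns Hypertension Stage 1 (its diastolic value falls in 80-89), and any reading above 180 systolic already satisfies the Stage 2 test, so the crisis branch is never reached. That is consistent with the counts above, which contain no Hypertensive Crisis row. The sketch below evaluates the same AHA-style ranges from most severe to least severe using `np.select`. The helper name `categorize_bp_vectorized` and the sample readings are hypothetical, and the sketch is one possible ordering rather than the notebook's method.

```python
import numpy as np
import pandas as pd

def categorize_bp_vectorized(ap_hi: pd.Series, ap_lo: pd.Series) -> pd.Series:
    """Label readings most-severe-first (hypothetical helper, AHA-style ranges)."""
    conditions = [
        (ap_hi > 180) | (ap_lo > 120),   # Hypertensive Crisis
        (ap_hi >= 140) | (ap_lo >= 90),  # Hypertension Stage 2
        (ap_hi >= 130) | (ap_lo >= 80),  # Hypertension Stage 1
        (ap_hi >= 120) & (ap_lo < 80),   # Elevated
    ]
    choices = [
        'Hypertensive Crisis',
        'Hypertension Stage 2',
        'Hypertension Stage 1',
        'Elevated',
    ]
    # np.select picks the choice of the first condition that is True;
    # rows matching none of the conditions fall back to 'Normal'.
    labels = np.select(conditions, choices, default='Normal')
    return pd.Series(labels, index=ap_hi.index, name='blood_pressure_category')

# Made-up readings covering each category.
readings = pd.DataFrame({
    'ap_hi': [110, 150, 190, 125, 100],
    'ap_lo': [80, 85, 100, 75, 60],
})
readings['blood_pressure_category'] = categorize_bp_vectorized(
    readings['ap_hi'], readings['ap_lo']
)
print(readings)
```

Besides changing the order of the checks, this version works on whole columns at once, which is usually faster than `df.apply(..., axis=1)` on a dataset of this size. Applying it to the real data would change the category counts shown above.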


## **Pulse Pressure**

Pulse pressure is the difference between systolic and diastolic blood pressure (`ap_hi` - `ap_lo`). It can be a useful indicator of heart health, especially in older adults (see [Mayo Clinic](https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/expert-answers/pulse-pressure/faq-20058189)).

In [19]:
df['pulse_pressure'] = df['ap_hi'] - df['ap_lo']
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,bmi,age_group,blood_pressure_category,pulse_pressure
0,0,50.35729,Male,168,62.0,110,80,Normal,Normal,No,No,Active,0,21.96712,Middle-Aged,Hypertension Stage 1,30
1,1,55.381246,Female,156,85.0,140,90,Well Above Normal,Normal,No,No,Active,1,34.927679,Senior,Hypertension Stage 2,50
2,2,51.627652,Female,165,64.0,130,70,Well Above Normal,Normal,No,No,Inactive,1,23.507805,Middle-Aged,Hypertension Stage 1,60
3,3,48.249144,Male,169,82.0,150,100,Normal,Normal,No,No,Active,1,28.710479,Middle-Aged,Hypertension Stage 2,50
4,4,47.841205,Female,156,56.0,100,60,Normal,Normal,No,No,Inactive,0,23.011177,Middle-Aged,Normal,40
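
Because systolic pressure should exceed diastolic pressure, a pulse pressure of zero or below points to a recording problem rather than a real measurement. The sketch below flags such rows on a small made-up frame. The `demo` values and the `pp_suspect` column name are illustrative assumptions.

```python
import pandas as pd

demo = pd.DataFrame({
    'ap_hi': [110, 140, 80, 120],
    'ap_lo': [80, 90, 120, 120],  # third row looks swapped, fourth has equal values
})
demo['pulse_pressure'] = demo['ap_hi'] - demo['ap_lo']

# A non-positive pulse pressure suggests swapped or mistyped readings.
demo['pp_suspect'] = demo['pulse_pressure'] <= 0
print(demo)
```

How to treat flagged rows (drop, swap, or keep) is a modelling decision that this sketch does not make.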


## **End of Feature Engineering**

In this notebook, we successfully engineered several new features from the original dataset, including:

* **Body Mass Index (BMI):** A measure of body weight relative to height.
* **Age in Years:** Converting age from days to a more interpretable unit.
* **Age Group:** Categorizing age into broader life stages.
* **Blood Pressure Category:** Segmenting blood pressure readings into clinically relevant groups based on AHA guidelines.
* **Pulse Pressure:** The difference between systolic and diastolic blood pressure.

These newly created features aim to provide the machine learning model with potentially more informative representations of the underlying data, which could lead to improved predictive performance for cardiovascular disease risk.

## **Next Step**

The subsequent steps in the model development phase should include:

1.  **One-Hot Encoding:** Convert the categorical features (like `gender`, `cholesterol`, `gluc`, `smoke`, `alco`, `active`, `age_group`, and `blood_pressure_category`) into a numerical format suitable for machine learning algorithms.

2.  **Standard Scaling:** Scale the numerical features (like `age`, `height`, `weight`, `bmi`, `ap_hi`, `ap_lo`, and `pulse_pressure`) to have zero mean and unit variance. This matters for algorithms that are sensitive to feature scale, where features with larger ranges can otherwise dominate the learning process.

3.  **Label Encoding:** The target variable `cardio` is already binary (0 and 1), so it needs no further encoding.
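
The first two steps above can be sketched with pandas alone. The frame below is made up and uses only a few of this notebook's column names; `categorical_cols` and `numeric_cols` are illustrative variable names, and `drop_first=True` is one possible design choice rather than a requirement.

```python
import pandas as pd

# Small made-up frame using column names from this notebook.
demo = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'smoke': ['No', 'No', 'Yes', 'No'],
    'age': [50.4, 55.4, 51.6, 48.2],
    'bmi': [21.97, 34.93, 23.51, 28.71],
    'cardio': [0, 1, 1, 1],
})

categorical_cols = ['gender', 'smoke']
numeric_cols = ['age', 'bmi']

# One-hot encode the categorical columns; drop_first removes one
# redundant indicator column per feature.
encoded = pd.get_dummies(
    demo, columns=categorical_cols, drop_first=True, dtype=int
)

# Standardize numeric columns to zero mean and unit variance
# (population standard deviation, ddof=0).
encoded[numeric_cols] = (
    encoded[numeric_cols] - encoded[numeric_cols].mean()
) / encoded[numeric_cols].std(ddof=0)

print(encoded.round(2))
```

In an actual modelling workflow, the scaling statistics should be computed on the training split only and then reused on the test split. scikit-learn's `StandardScaler` and `OneHotEncoder` support this through separate `fit` and `transform` steps.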


This concludes the feature engineering part of the project.

**Reference:**

> American Heart Association. (n.d.). *Understanding Blood Pressure Readings*. Retrieved from [https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings](https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings)

> Mayo Clinic (n.d.). *Pulse pressure: An indicator of a heart health*. Retrieved from [https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/expert-answers/pulse-pressure/faq-20058189](https://www.mayoclinic.org/diseases-conditions/high-blood-pressure/expert-answers/pulse-pressure/faq-20058189)

Muyiwa J. Obadara<br>
[https://linkedin.com/in/obadara-m](https://linkedin.com/in/obadara-m)<br>
[https://github.com/mobadara](https://github.com/mobadara)