# Heart disease data preparation
First deliverable in the course *Artificial intelligence applied to engineering* at ETSEIB, UPC, spring 2024. The team members contributing to the deliverable are
- Lise Jakobsen
- Julie Sørlie Lund
- Magnus Ingnes Sagmo

The dataset used in this deliverable can be retrieved from [Kaggle](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset).

### 1. Check the dimensions of the dataset

In [None]:
import pandas as pd

# Load the heart dataset
heart_data = pd.read_csv("heart.csv")

# Check the dimensions of the dataset
dataset_dimensions = heart_data.shape

print("The dataset has {} rows and {} columns.".format(dataset_dimensions[0], dataset_dimensions[1]))


### 2. Understand the data structure
*Peek at the first few rows to understand the data structure.*

In [None]:
# Adjust pandas display settings
pd.set_option('display.max_columns', 12)  # Show up to 12 columns before the output is truncated
pd.set_option('display.width', 1000)  # Widen the output to avoid line wrapping


# Display the first few rows of heart.csv
print(heart_data.head())
print('\n')
print(heart_data.describe())

### 3. Examine the types of data present in each column 
*Examine the types of data present in each column (numerical, categorical, datetime, ...). Verify that the data types assigned to each column align with the actual nature of the data. Convert data types if necessary.*

In the original dataset, the columns `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina` and `ST_Slope` have the pandas `object` data type. Since all of their values are categorical, they are converted to the `category` data type. `FastingBS` is not of `object` type but is treated as categorical too, so it is included in the conversion.

In [None]:
# Function to display data types 
def display_dtypes(dataframe):
    # Create a DataFrame from the dtypes
    dtypes_df = pd.DataFrame(dataframe.dtypes, columns=['Data Type'])
    # Reset index to get the column names as a separate column
    dtypes_df.reset_index(inplace=True)
    # Rename columns 
    dtypes_df.columns = ['Column Name', 'Data Type']
    # Display the DataFrame
    display(dtypes_df)  

In [None]:
# Display original data types
print("Original Data Types:")
display_dtypes(heart_data)

# Convert categorical variables to 'category' data type
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope', 'FastingBS']
# categorical_columns = heart_data.select_dtypes(include='object').columns.to_list()
heart_data[categorical_columns] = heart_data[categorical_columns].astype('category')

# Display data types after conversion for comparison
print("\nData Types after Conversion:")
display_dtypes(heart_data)
print(heart_data.describe())
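
As a minimal sketch of what the conversion does, the toy frame below (invented values, not read from `heart.csv`) turns two columns into the `category` data type and lists the distinct labels each column stores afterwards. The names `toy`, `sex_labels` and `fasting_labels` are only used for this illustration.

```python
import pandas as pd

# Toy data for illustration only; the values are made up
toy = pd.DataFrame({
    "Sex": ["M", "F", "M", "M"],
    "FastingBS": [0, 1, 0, 0],
})

# Convert each column to the category dtype
for col in ["Sex", "FastingBS"]:
    toy[col] = toy[col].astype("category")

# A categorical column stores its distinct labels once, in sorted order
sex_labels = list(toy["Sex"].cat.categories)
fasting_labels = list(toy["FastingBS"].cat.categories)

print(toy.dtypes)
print(sex_labels, fasting_labels)
```

Note that `describe()` only summarizes numerical columns by default, so a numerical column converted to `category` no longer appears in its output.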

### 4. Filter the data 
*Filter the data by removing variables that are not relevant for the analysis.*

Given the context of predicting heart failure, all 12 variables have relevant associations with cardiovascular health and could provide valuable insight when building a predictive model. No variables are therefore removed at this stage.

### 5. Univariate analysis
*Summarize statistics for numerical variables and frequency distribution for categorical ones. Create visualizations (histograms, box plots, bar plots, etc).*

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style of the plots
sns.set_style("whitegrid")

# Summary statistics for numerical variables
print("Summary Statistics for Numerical Variables:")
print(heart_data.describe())

# Histograms for numerical variables
numerical_columns = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
for col in numerical_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(heart_data[col], kde=True, bins=30)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()

# Frequency distribution and bar plots for categorical variables
categorical_columns = ['Sex', 'ChestPainType', 'FastingBS', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
for col in categorical_columns:
    print(f"Frequency distribution for {col}:")
    print(heart_data[col].value_counts())
    
    plt.figure(figsize=(8, 4))
    sns.countplot(x=col, data=heart_data)
    plt.title(f'Bar Plot of {col}')
    plt.ylabel('Count')
    plt.show()

### 6. Outliers
*Identify outliers and decide whether to remove, transform, or keep them.*

In [None]:
def IQR_method(dataframe, column, lower_quantile=0.25, upper_quantile=0.75):
    '''Remove outliers for a given column using the IQR method

    The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.
    Outliers are typically considered data points that fall below (Q1 - 1.5*IQR) or above (Q3 + 1.5*IQR).
    
    Parameters
    ----------
    dataframe : pandas.DataFrame
        The data to locate outliers in 
    column : str
        Name of column to locate outliers in
    lower_quantile : float
        Quantile to represent the lower bound (default is 0.25 for Q1)
    upper_quantile : float
        Quantile to represent the upper bound (default is 0.75 for Q3)

    Returns
    -------
    pandas.DataFrame
        Dataframe without entries containing outliers for the provided column
    '''

    Q1 = dataframe[column].quantile(lower_quantile)
    Q3 = dataframe[column].quantile(upper_quantile)

    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Selecting rows that are not considered outliers
    return dataframe[(dataframe[column] >= lower_bound) & (dataframe[column] <= upper_bound)]
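
To make the bounds concrete, here is a small worked example on a made-up series (the numbers are not taken from the dataset). It computes Q1, Q3 and the two bounds step by step, in the same way as `IQR_method`, and separates inliers from outliers.

```python
import pandas as pd

# Toy series for illustration; 100 is an obvious outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)  # 12.75 with pandas' default linear interpolation
q3 = values.quantile(0.75)  # 16.5
iqr = q3 - q1               # 3.75

lower_bound = q1 - 1.5 * iqr  # 7.125
upper_bound = q3 + 1.5 * iqr  # 22.125

# Rows inside the bounds are kept, the rest are flagged as outliers
inliers = values[(values >= lower_bound) & (values <= upper_bound)]
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Outliers: {outliers.tolist()}")
```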


We start by removing the zero values for the columns described below.

##### RestingBP

- Normal blood pressure for adults ranges from 90/60 mm Hg to 120/80 mm Hg.
- A resting blood pressure of zero is impossible. The occurrences of `RestingBP=0` in the dataset are therefore most likely due to an error. As it is difficult to estimate the true blood pressure for these records, and blood pressure is an important risk factor for heart disease, we have decided to remove the rows where `RestingBP=0`.

##### Cholesterol
- High cholesterol levels are clinically relevant and can indicate a significant risk for cardiovascular diseases. 
- The instances where cholesterol is recorded as 0 are clearly errors or missing data, as it's physiologically impossible for someone to have no cholesterol. All records where `Cholesterol=0` were removed.

In [None]:
old_len = heart_data.shape[0]
heart_data = heart_data[(heart_data['RestingBP'] > 0)]
intermediate_len = heart_data.shape[0]
resting_bp_removed = old_len - intermediate_len

heart_data = heart_data[(heart_data['Cholesterol'] > 0)]
new_len = heart_data.shape[0]
cholesterol_removed = intermediate_len - new_len

print(f'Removed {resting_bp_removed} rows for RestingBP and {cholesterol_removed} rows for Cholesterol.')
print(f'The new dataframe has {new_len} entries.')

In [None]:
def display_outliers(dataframe, variables):
    '''Display outliers for dataframe using IQR method

    Apply the IQR method to given dataframe for each of the given variables.
    
    Parameters
    ----------
    dataframe : pandas.DataFrame
        The data to locate outliers in 
    variables : list of str
        List of column names for numerical variables to locate outliers in
    '''
    for variable in variables:
        # IQR_method returns the rows within the bounds, so the outliers are the remaining rows
        inliers_df = IQR_method(dataframe, variable, 0.25, 0.75)
        outliers_df = dataframe.drop(inliers_df.index)
        print(f"Number of outliers in {variable}: {len(outliers_df)}")
        if len(outliers_df) > 0:
            display(outliers_df[[variable]])

We then proceed by locating and removing outliers for the attributes below.

#### RestingBP

- High blood pressure is a critical risk factor for heart disease. We will therefore only look for lower quantile outliers for `RestingBP`.

#### Cholesterol
- High cholesterol levels are clinically relevant and can indicate a significant risk for cardiovascular diseases. We will therefore only look for lower quantile outliers for `Cholesterol`.

#### MaxHR
- Outliers are removed on both sides of the distribution, using the default Q1 and Q3 bounds.

#### Oldpeak 
- *Oldpeak* refers to the ST depression induced by exercise relative to rest, measured in millimeters (mm) on an electrocardiogram (ECG). ST depression can indicate myocardial ischemia, a condition where part of the heart does not receive enough oxygen, often due to blockages in the coronary arteries. In the context of a stress test, a more pronounced depression can suggest a higher likelihood of coronary artery disease.
- ST elevation: Depending on the criterion used, an ST amplitude of ≥0.1 mV, ≥0.15 mV or ≥0.2 mV is considered significant for ST elevation. ST elevation can indicate acute myocardial infarction (heart attack) and other conditions that lead to an elevated risk of heart disease. At the standard ECG calibration of 10 mm/mV, high positive `Oldpeak` values (e.g., 4.0, 5.6, 6.2) correspond to 0.4 mV or more, well above the ≥0.2 mV threshold, indicating substantial ST segment deviations. These are critical for identifying potential heart disease risk and should be retained for their clinical significance.
- ST depression: Thresholds of ≤–0.05 mV or ≤–0.1 mV are used to denote clinically significant ST depression. Assuming the negative sign indicates depression below the baseline, a negative `Oldpeak` value reflects ST depression and is thus clinically relevant. Such values suggest myocardial ischemia and should be included in analyses concerning heart disease risk.

In [None]:
def remove_low_outliers(dataframe, column):
    '''Keep rows at or above Q1 - 1.5*IQR, retaining all high values.'''
    Q1 = dataframe[column].quantile(0.25)
    Q3 = dataframe[column].quantile(0.75)
    return dataframe[dataframe[column] >= Q1 - 1.5 * (Q3 - Q1)]

# RestingBP: high values are clinically relevant, so only low outliers are removed
new_df = remove_low_outliers(heart_data, 'RestingBP')
restingbp_removed = heart_data.shape[0] - new_df.shape[0]
intermediate_len = new_df.shape[0]

# Cholesterol: high values are clinically relevant, so only low outliers are removed
new_df = remove_low_outliers(new_df, 'Cholesterol')
cholesterol_removed = intermediate_len - new_df.shape[0]
intermediate_len = new_df.shape[0]

# MaxHR: the filter is two-sided, so both low and high outliers are removed
new_df = IQR_method(new_df, 'MaxHR', lower_quantile=0.25, upper_quantile=0.75)
maxhr_removed = intermediate_len - new_df.shape[0]
intermediate_len = new_df.shape[0]

# Oldpeak: the filter is two-sided, so extreme positive and negative values are dropped
new_df = IQR_method(new_df, 'Oldpeak', lower_quantile=0.25, upper_quantile=0.75)
oldpeak_removed = intermediate_len - new_df.shape[0]

heart_data = new_df

print(f"Removed {restingbp_removed} values for 'RestingBP', {cholesterol_removed} values for 'Cholesterol', {maxhr_removed} values for 'MaxHR' and {oldpeak_removed} values for 'Oldpeak'")
print(f"The resulting dataframe has {heart_data.shape[0]} entries.")

### 7. Create new features 
*Create new features from existing ones if necessary (e.g., extracting date components, ratios between variables, etc.).*

Resting blood pressure varies with sex and age. We therefore create a new feature that categorises each patient's resting blood pressure as `low`, `normal` or `high`, relative to the reference values in a table from the [Heart Research Institute](https://www.hri.org.au/health/learn/risk-factors/what-is-normal-blood-pressure-by-age):

| Age Range   | F             | M             |
|-------------|---------------|---------------|
| 18–39 years | 110/68 mm Hg  | 119/70 mm Hg  |
| 40–59 years | 122/74 mm Hg  | 124/77 mm Hg  |
| 60+ years   | 139/68 mm Hg  | 133/69 mm Hg  |

In [None]:
def get_bp_category(row):
    age = row['Age']
    sex = row['Sex']
    bp = row['RestingBP']

    normal_bp_values = {
        'young': {'F': 110, 'M': 119},
        'middle': {'F': 122, 'M': 124},
        'old': {'F': 139, 'M': 133}
    }

    if age <= 39:
        age_group = 'young'
    elif age <= 59:
        age_group = "middle"
    else:
        age_group = "old"

    # Normalise the sex label to the keys used in normal_bp_values
    sex = 'M' if sex in ('M', 'm', '1') else 'F'

    # Retrieve the normal BP for the age and sex
    normal_bp = normal_bp_values[age_group][sex]

    if bp <= (normal_bp - 3):
        return 'low'
    elif normal_bp - 3 < bp < normal_bp + 3:
        return 'normal'
    else:
        return 'high'


# Work on a copy so that adding a column does not write to a filtered view
heart_data = heart_data.copy()

# Add the new feature as a categorical column
heart_data['BloodPressure'] = heart_data.apply(get_bp_category, axis=1).astype('category')
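
The thresholds can be checked by hand for a single reference value. The sketch below uses a hypothetical helper, `categorise`, that mirrors the comparison logic of `get_bp_category` for one reference value. It takes the systolic reference for men aged 40 to 59 from the table (124 mm Hg), so that readings up to 121 are `low`, readings from 122 to 126 are `normal`, and readings from 127 upwards are `high`.

```python
def categorise(bp, normal_bp, tolerance=3):
    """Hypothetical helper mirroring the thresholds in get_bp_category."""
    if bp <= normal_bp - tolerance:
        return "low"
    if bp < normal_bp + tolerance:
        return "normal"
    return "high"


reference = 124  # systolic reference for men aged 40 to 59 in the table above

# Classify a few example readings around the reference value
labels = {bp: categorise(bp, reference) for bp in (118, 121, 122, 126, 127, 140)}
print(labels)
```

With a tolerance of only 3 mm Hg the `normal` band is narrow, so most readings end up as `low` or `high`.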

### 8. Encode categorical variables

In [None]:
# Locate categorical columns
categorical_columns = heart_data.select_dtypes(include='category').columns.to_list()

# Create new columns with encoded values
for column in categorical_columns:
    heart_data[column] = heart_data[column].cat.codes
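
Replacing a categorical column by its codes discards the original labels. As a hedged sketch on a toy column (the labels `Up`, `Flat` and `Down` are assumed for illustration), the mapping from code to label can be saved before the column is overwritten:

```python
import pandas as pd

# Toy column for illustration; the labels are assumed, not read from heart.csv
st_slope = pd.Series(["Up", "Flat", "Down", "Flat"], dtype="category")

# Save the code -> label mapping before the labels are replaced
code_to_label = dict(enumerate(st_slope.cat.categories))

# Integer codes follow the sorted order of the categories
codes = st_slope.cat.codes

print(code_to_label)
print(codes.tolist())
```

The codes are assigned in the sorted order of the labels, so they imply an ordering that may not be meaningful for nominal variables such as `ChestPainType`.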

### 9. Feature importance
*Identify variables that provide little to no information and remove them.*

To estimate the importance of each individual feature, we use `sklearn.ensemble.RandomForestClassifier`, as the fitted model exposes a built-in `feature_importances_` attribute.

The code below is inspired by dialogue with ChatGPT.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Copy data
X = heart_data.copy()
y = X['HeartDisease']
X = X.drop('HeartDisease', axis=1)

model = RandomForestClassifier(n_estimators=1000)

# Split data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Get feature importance
df = pd.DataFrame({'Feature': X_train.columns.tolist(), 'Importance': model.feature_importances_})
df.sort_values(by='Importance', ascending=False, inplace=True)

print(df)

From the output above, `Sex`, `RestingECG` and `FastingBS` appear to have lower importance than the other features, so we remove them. The derived `BloodPressure` feature is dropped in the same step.

In [None]:
heart_data = heart_data.drop(columns=['Sex', 'RestingECG', 'FastingBS', 'BloodPressure'])
print(heart_data.columns)
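
Instead of picking the columns by eye, the selection could also be made with an explicit cutoff. The sketch below runs on synthetic data, not on the heart dataset. The column names `signal`, `noise_a` and `noise_b` and the 10 % threshold are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration: only 'signal' is related to the target
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y_toy = (X_toy["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_toy, y_toy)

# Importances are normalised so that they sum to 1
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)

# Hypothetical cutoff: keep features with at least 10 % of the total importance
threshold = 0.10
selected = importances[importances >= threshold].index.tolist()

print(importances.sort_values(ascending=False))
print(selected)
```

Since the forest is randomised, fixing `random_state` makes the ranking reproducible between runs.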

### 10. Scale numerical features 
*Scale numerical features to a similar range in order to improve model performance on some models.*

We apply [min-max scaling](https://towardsdatascience.com/everything-you-need-to-know-about-min-max-normalization-in-python-b79592732b79) to the originally numerical features, which maps each of them to the range from 0 to 1.

In [None]:
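# Select the originally numerical columns
numerical_columns = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
df = heart_data[numerical_columns]

# Min-max scaling: map each column to the range [0, 1]
df_scaled = (df - df.min()) / (df.max() - df.min())

# Write the scaled values back so that the following steps use them
heart_data[numerical_columns] = df_scaled

print(df_scaled.head())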
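
As a small worked example of the formula, consider the toy frame below with invented values. Each column is shifted by its minimum and divided by its range, so the smallest value becomes 0 and the largest becomes 1.

```python
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({"Age": [30, 45, 60], "MaxHR": [100, 150, 200]})

# x_scaled = (x - min) / (max - min), applied column by column
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_scaled)
```

A column with a single constant value has a range of zero, so this formula would produce `NaN` for it.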

### 11. Divide the dataset into training and test sets

In [None]:
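# Rebuild features and target from the prepared dataframe
X = heart_data.drop(columns='HeartDisease')
y = heart_data['HeartDisease']

# A fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)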
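
One optional refinement, shown here only as a sketch on made-up data, is to pass `stratify` so that the training and test sets keep the same class balance as the full dataset. The toy frame and the names `X_toy`, `y_toy`, `X_tr` and `X_te` are assumptions for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration: 40 rows with a 70/30 class balance
toy = pd.DataFrame({"feature": range(40), "HeartDisease": [0] * 28 + [1] * 12})
X_toy = toy.drop(columns="HeartDisease")
y_toy = toy["HeartDisease"]

# stratify=y_toy keeps the share of positives equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print(f"Share of positives in train: {y_tr.mean():.2f}, in test: {y_te.mean():.2f}")
```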