# Predicting Heart Disease Using a Support Vector Classifier

## 1. Introduction:

### Background:

The purpose

### Objective:

### Datasets:

### Tech Stack:

The following tools and libraries are used in this project:
- Python
- NumPy
- Pandas
- Matplotlib
- Seaborn
- SciPy
- Statsmodels
- scikit-learn

## 2. Setup and Imports:

### Library Imports:

In [None]:
# Core data-handling, plotting, and standard-library imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statistics import mean

# Scipy and Statsmodels imports for statistical analysis
from scipy.stats import pointbiserialr, chi2_contingency
import statsmodels.formula.api as smf

# Scikit-learn imports for machine learning models, metrics, and preprocessing
from sklearn.model_selection import (GridSearchCV, train_test_split, StratifiedKFold,
                                     cross_val_score, StratifiedShuffleSplit, cross_validate)
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import (accuracy_score, recall_score, precision_score, f1_score, 
                             confusion_matrix, classification_report, roc_curve, auc)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.decomposition import PCA

# IPython helpers for HTML and rich display
from IPython.display import HTML, display

### CSS Styling:

In [None]:
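# Importing custom CSS for styling

# Read the stylesheet inside a context manager so the file handle is closed
with open('style.css') as css_file:
    css = css_file.read()
HTML('<style>{}</style>'.format(css))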

## 3. Data Processing & Exploration - Kaggle Dataset

### 3.1 Data Processing

#### Data Loading

1. Load the dataset from 'kaggle-heart.csv' into a pandas DataFrame, treating " ", "?", and "NA" as missing values.
2. Preview the first 2 rows to ensure the data has been loaded correctly.

In [None]:
# Load the kaggle-heart.csv dataset into a DataFrame called "df_kaggle"
# We treat " ", "?", and "NA" as missing values and replace them with NaN
df_kaggle = pd.read_csv('kaggle-heart.csv', na_values=[" ","?","NA"])

# Display the first two rows of the dataset for inspection
df_kaggle.head(2)
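
To illustrate what the `na_values` argument does during loading, here is a minimal sketch on an in-memory sample. The three-column CSV text and the name `df_sample` are invented for illustration and are not taken from 'kaggle-heart.csv'.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file; values are made up
sample_csv = io.StringIO(
    "age,chol,thal\n"
    "63,233,?\n"
    "67, ,3\n"
)

# Same placeholder list as the loading cell above
df_sample = pd.read_csv(sample_csv, na_values=[" ", "?", "NA"])

# The "?" and " " placeholders are read as NaN, so isnull() can count them
print(df_sample.isnull().sum())
```

Because the placeholders become NaN at load time, the affected columns are parsed as numeric rather than as text.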

#### Data Dictionary

1. Extract column names and data types from the dataset.
2. Add descriptions for each column based on the dataset documentation.
3. Calculate the min and max values for each numerical column.
4. Combine all information into a single DataFrame.

In [None]:
# Creating a Data Dictionary for the dataset:
# We will collect the following:
# - Field names (column names)
# - Data types
# - Descriptions (based on Kaggle's dataset page)
# - Max and Min values for numerical columns

# Column names and data types
kaggle_field_list = df_kaggle.columns.tolist()  # List of column names
kaggle_dtype_list = df_kaggle.dtypes.astype(str).tolist()  # List of column data types as strings

# Description of each field based on Kaggle's dataset page
kaggle_description_list = [
    "age",
    "sex",
    "chest pain type (4 values)",
    "resting blood pressure",
    "serum cholesterol in mg/dl",
    "fasting blood sugar > 120 mg/dl",
    "resting electrocardiographic results (values 0,1,2)",
    "maximum heart rate achieved",
    "exercise induced angina",
    "oldpeak = ST depression induced by exercise relative to rest",
    "the slope of the peak exercise ST segment",
    "number of major vessels (0-3) colored by fluoroscopy",
    "thal: 0 = normal; 1 = fixed defect; 2 = reversible defect",
    "presence of heart disease. 0 = no disease and 1 = disease."
]

# Max and min values for each column
kaggle_max_list = df_kaggle.max().to_list()  # List of max values for each column
kaggle_min_list = df_kaggle.min().to_list()  # List of min values for each column

# Combine all lists into one DataFrame for easier reference
kaggle_concat_list = [
    kaggle_field_list,
    kaggle_dtype_list,
    kaggle_description_list,
    kaggle_min_list,
    kaggle_max_list
]

# The lists need to be converted to Series before concatenating
df_kaggle_data_dictionary = pd.DataFrame(pd.concat([pd.Series(x) for x in kaggle_concat_list], axis=1))

# Set column names for the new DataFrame
df_kaggle_data_dictionary.columns = ["FieldName", "DataType", "Description", "Min", "Max"]

# Display the data dictionary rounded to 1 decimal point for readability
df_kaggle_data_dictionary.round(1)

#### Summary Statistics

Generate summary statistics for the numerical columns in the dataset, rounding the results to two decimal places.

In [None]:
# Generate summary statistics for numerical columns and round the results to two decimal places
df_kaggle.describe().round(2)

**Notes:**
- The dataset appears to contain more rows than the 303 we expected.
- This discrepancy may indicate data issues, such as extra or duplicated rows, that need to be investigated and cleaned.

#### Count Null Values per Column

This step counts the missing (null) values in each column to assess the completeness of the dataset and guide decisions on handling missing data.

In [None]:
# Count the number of missing (null) values per column in the dataset
df_kaggle.isnull().sum()

#### Count Duplicated Rows

Calculate the number of duplicated rows in the dataset, helping identify copied data.

In [None]:
# Calculate the number of duplicated rows in the dataset
df_kaggle.duplicated().sum()

#### Count Unique Rows

Identify and count the unique rows in the dataset.

In [None]:
# Get the unique rows in the DataFrame (removing duplicates)
unique_rows = np.unique(df_kaggle, axis=0)

# Display the shape of the unique rows (number of unique records)
unique_rows.shape
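
The duplicate count and the unique-row count should agree with each other: total rows minus duplicated rows equals unique rows. The sketch below checks this on a small hypothetical frame (`df_toy` and its values are invented), and also shows `drop_duplicates` as a pandas alternative to `np.unique`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: three distinct records, one of them entered three times
df_toy = pd.DataFrame({
    "age": [63, 67, 67, 67, 41],
    "chol": [233, 286, 286, 286, 204],
})

n_rows = len(df_toy)                              # all rows, repeats included
n_duplicated = int(df_toy.duplicated().sum())     # repeats after each first occurrence
n_unique_pd = len(df_toy.drop_duplicates())       # distinct rows via pandas
n_unique_np = np.unique(df_toy, axis=0).shape[0]  # distinct rows via NumPy

print(n_rows, n_duplicated, n_unique_pd, n_unique_np)
```

`drop_duplicates` returns a DataFrame that keeps the column names, whereas `np.unique` returns a plain array.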

## 4. Data Processing & Exploration - UCI Dataset

### 4.1 Data Processing

#### Data Loading

1. Load 'processed.cleveland.data' into a DataFrame, treating " ", "?", and "NA" as missing (NaN) values.
2. View the first 2 rows.

In [None]:
# Load the Cleveland dataset into a DataFrame
df_cleveland = pd.read_csv(
    "processed.cleveland.data",
    na_values=[" ", "?", "NA"],
    encoding="ISO-8859-1",
    header=None,
    delimiter=","
)

# View the first two rows of the Cleveland dataset
df_cleveland.head(2)

#### Viewing information on nulls and datatypes

Display the dataset structure, including non-null counts and data types, to check for missing values and overall composition.

In [None]:
# Display information about the dataset
df_cleveland.info()

#### Data Dictionary

Create a data dictionary for the Cleveland dataset by extracting column names, data types, descriptions, and min/max values.

In [None]:
# Assign the dataset to df for reference
df = df_cleveland

# Creating a DataFrame showing column details: name, type, description, and min/max values
column_details = list(zip(
    df.columns,
    df.dtypes.astype(str),
    [
        "Age in years",
        "Sex (1 = male; 0 = female)",
        "Chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)",
        "Resting blood pressure (mmHg on admission)",
        "Serum cholesterol (mg/dL)",
        "Fasting blood sugar > 120 mg/dL (1 = true, 0 = false)",
        "Resting electrocardiographic results (0: normal, 1: ST-T wave abnormality, 2: probable left ventricular hypertrophy)",
        "Maximum heart rate achieved",
        "Exercise-induced angina (1 = yes; 0 = no)",
        "ST depression induced by exercise relative to rest",
        "Slope of peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)",
        "Number of major vessels (0–3) colored by fluoroscopy",
        "Thalassemia (3: normal, 6: fixed defect, 7: reversible defect)",
        "Diagnosis of heart disease (0: <50% narrowing, no disease; 1-4: disease present)"
    ],
    df.min().tolist(),
    df.max().tolist()
))

# Creating the DataFrame
df_cleveland_data_dictionary = pd.DataFrame(
    column_details, 
    columns=["FieldName", "DataType", "Description", "Min", "Max"]
)

# Display the DataFrame
df_cleveland_data_dictionary

#### Loading the other datasets

Load the Hungarian, Switzerland, and Long Beach datasets into DataFrames, handling missing values, encoding, and delimiters.

In [None]:
# Common parameters for reading datasets
read_params = {
    "na_values": [" ", "?", "NA"],
    "encoding": "ISO-8859-1",
    "header": None,
    "delimiter": ","
}

# Loading the Hungarian dataset into a DataFrame
df_hungarian = pd.read_csv("processed.hungarian.data", **read_params)

# Loading the Switzerland dataset into a DataFrame
df_switzerland = pd.read_csv("processed.switzerland.data", **read_params)

# Loading the Long Beach dataset into a DataFrame
df_longbeach = pd.read_csv("processed.va.data", **read_params)

#### Creating dictionary of location and df name

Create a dictionary to store all datasets, making them easier to access by name.

In [None]:
# Creating a dictionary to store all datasets for easier access by name
df_all_datasets_dict = {
    "Cleveland": df_cleveland,
    "Hungarian": df_hungarian,
    "Switzerland": df_switzerland,
    "Longbeach": df_longbeach
}

#### Getting shape of each location dataframe

Iterate through the datasets, retrieve the shape (rows and columns) of each, and store the results in a summary DataFrame for easy comparison.

In [None]:
# Create a list to store the shape of each dataset
df_shape_list = [
    {
        "DataFrame": name, 
        "Rows": frame.shape[0], 
        "Columns": frame.shape[1]
    }
    for name, frame in df_all_datasets_dict.items()
]

# Convert the list of dictionaries into a DataFrame
df_all_shapes = pd.DataFrame(df_shape_list).set_index("DataFrame")

# Display the shape summary of all datasets
df_all_shapes


#### New Column Names

Define a list of new column names that will be applied to all datasets to ensure consistency and clarity.

In [None]:
# Define a list of new column names for the datasets
new_column_names = [
    "Age",           # Age in years
    "Sex",           # 1 = male, 0 = female
    "ChestPain",     # Chest pain type
    "RestingBP",     # Resting blood pressure
    "Chol",          # Serum cholesterol
    "FastingBS",     # Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
    "RestingECG",    # Resting electrocardiographic results
    "HeartRateMax",  # Maximum heart rate achieved
    "ExeAngina",     # Exercise induced angina (1 = yes, 0 = no)
    "STDep",         # ST depression induced by exercise relative to rest
    "STSlope",       # Slope of peak exercise ST segment
    "ColouredMV",    # Number of major vessels (0-3) colored by fluoroscopy
    "Thalass",       # Thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect)
    "Diagnosis"      # Diagnosis of heart disease (0 = no disease, 1-4 = disease present)
]

#### Renaming columns

Apply the new column names to all datasets in the df_all_datasets_dict dictionary, ensuring consistency across datasets.

In [None]:
# Apply the new column names to all datasets
for name, frame in df_all_datasets_dict.items():
    # Assign the new column names to each DataFrame
    frame.columns = new_column_names

#### Summarising datatypes for each column in each dataset

Generate a summary of the data types for each column in each dataset, concatenating the results into a single DataFrame for easier comparison.

In [None]:
# Create a list of DataFrames containing the data types of each dataset's columns
df_dtype_list = []

for name, frame in df_all_datasets_dict.items():
    # Create a DataFrame of column data types for each dataset
    df_dtype = pd.DataFrame(frame.dtypes)
    
    # Rename the column to the dataset name for clarity
    df_dtype.columns = [name]
    
    # Append the DataFrame to the list
    df_dtype_list.append(df_dtype)

# Concatenate all DataFrames in the list into one DataFrame
df_all_dtypes = pd.concat(df_dtype_list, axis=1)

# Display the data types summary
df_all_dtypes

#### Summarising null values in each column in each dataset

Count the number of null (missing) values in each column of all datasets, combine the counts into a summary DataFrame, and save it to "UCI_location_nulls.csv". Then display the null counts for the columns of interest.

In [None]:
# Create a list of DataFrames containing the null value counts for each dataset's columns
df_null_list = []

for name, frame in df_all_datasets_dict.items():
    # Count the null values in each column of the current dataset
    df_null = pd.DataFrame(frame.isna().sum())
    
    # Rename the column to the dataset name for clarity
    df_null.columns = [name]
    
    # Append the DataFrame to the list
    df_null_list.append(df_null)

# Concatenate all DataFrames in the list into one DataFrame
df_all_null = pd.DataFrame(pd.concat(df_null_list, axis=1))

# Save the null values summary to a CSV file
df_all_null.to_csv("UCI_location_nulls.csv", index=True)

# Display the null values count for specific columns of interest
df_all_null.loc[["RestingBP", "Chol", "FastingBS", "RestingECG", "HeartRateMax", 
                 "ExeAngina", "STDep", "STSlope", "ColouredMV", "Thalass"]].T

#### Adding Location column

Add a new column, "Location," to each dataset, which stores the name of the dataset as a string to track its source.

In [None]:
# Add a "Location" column to each dataset to identify the dataset source
for name, frame in df_all_datasets_dict.items():
    # Assign the dataset name to the new "Location" column
    frame["Location"] = name
    
    # Ensure that the "Location" column is stored as a string
    frame["Location"] = frame["Location"].astype(str)

#### Combining datasets

Concatenate all datasets into a single DataFrame and display the first three rows.

In [None]:
# Concatenate all datasets into one combined DataFrame
df_combined = pd.concat(df_all_datasets_dict.values(), ignore_index=True)

# Display the first three rows of the combined dataset
df_combined.head(3)

#### Removing rows with missing values

Count the number of rows in the combined dataset that contain any missing values.

In [None]:
# Count the number of rows in the combined dataset with any missing values
missing_rows_count = len(df_combined[df_combined.isna().any(axis=1)])
missing_rows_count

Drop rows with any missing values from the combined dataset and display the shape of the cleaned dataset.

In [None]:
# Drop rows with any missing values from the combined dataset
df_combined = df_combined.dropna()

# Display the shape of the cleaned dataset
df_combined.shape

#### Counting remaining rows from each dataset

Count the number of rows for each dataset in the combined dataset and display the results in a DataFrame.

In [None]:
# Count the number of rows for each dataset in the combined dataset
location_count_list = []

for name, frame in df_all_datasets_dict.items():
    # Get the count of rows for each location (dataset)
    location_count = (name, df_combined[df_combined["Location"] == name].shape[0])
    location_count_list.append(location_count)

# Create a DataFrame to display the count of rows per dataset location
df_location_count = pd.DataFrame(location_count_list, columns=["DataFrame", "NoNulls"]).style.hide(axis="index")

# Display the count of rows for each dataset
df_location_count
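
As a compact cross-check of the tag, combine, drop, and count steps, here is a minimal sketch on two hypothetical frames. The names `SiteA`, `SiteB`, `frames`, and all values are invented for illustration. It uses `value_counts` in place of the loop, which is a swapped-in alternative and not the method used in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the per-location DataFrames
frames = {
    "SiteA": pd.DataFrame({"Age": [63, 67], "Chol": [233.0, None]}),
    "SiteB": pd.DataFrame({"Age": [41], "Chol": [204.0]}),
}

# Tag each frame with its source before combining
for name, frame in frames.items():
    frame["Location"] = name

# Stack the frames, then keep only the fully populated rows
combined = pd.concat(list(frames.values()), ignore_index=True)
complete = combined.dropna()

# Rows remaining per location after dropping incomplete rows
rows_per_site = complete["Location"].value_counts().to_dict()
print(rows_per_site)
```

One difference worth noting: `value_counts` omits a location that has no complete rows left, while the loop above reports it with a count of 0.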

#### Creating dictionary of datatypes for columns

Convert specific columns in the df_combined DataFrame to their appropriate data types and display the resulting data types of all columns.

In [None]:
# Convert columns to appropriate data types
df_combined = df_combined.astype({
    "Age": "int64",
    "Sex": "category",
    "ChestPain": "category",
    "RestingBP": "int64",
    "Chol": "int64",
    "FastingBS": "category",
    "RestingECG": "category",
    "HeartRateMax": "int64",
    "ExeAngina": "category",
    "STDep": "float64",
    "STSlope": "category",
    "ColouredMV": "int64",
    "Thalass": "category",
    "Diagnosis": "category"
})

# Confirm the data types of each column in a structured format
df_combined_dtypes = pd.DataFrame(df_combined.dtypes, columns=["Dtype"])
df_combined_dtypes
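
The order of operations matters here: the integer conversion only works because rows with missing values were dropped first. A minimal sketch, assuming a hypothetical three-value series (`s` is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column containing one missing value
s = pd.Series([63.0, np.nan, 41.0], name="Age")

# Casting to int64 fails while NaN is present
try:
    s.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# After removing the missing value, the same cast succeeds
clean = s.dropna().astype("int64")
print(cast_failed, clean.tolist(), clean.dtype)
```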

View the first five rows.

In [None]:
df_combined.head()

#### Adding SexMF column

Add a new column SexMF to the df_combined DataFrame, replacing 0 and 1 in the Sex column with F and M.

In [None]:
# Ensure 'Sex' column is categorical and rename categories to 'F' and 'M'
if df_combined["Sex"].dtype.name == "category":
    df_combined["SexMF"] = df_combined["Sex"].cat.rename_categories({0: "F", 1: "M"})

# Confirm the first few rows of the newly created 'SexMF' column
df_combined[["Sex", "SexMF"]].head()

#### Changing categorical values in cat columns

Relabel the categories of the categorical columns using a predefined dictionary, replacing the float labels (e.g. 1.0) with integers (e.g. 1) in the df_combined DataFrame.

In [None]:
# Dictionary for category renaming
dict_column_changes = {
    "Sex": {1.0: 1, 0.0: 0},
    "ChestPain": {1.0: 1, 2.0: 2, 3.0: 3, 4.0: 4},
    "FastingBS": {0.0: 0, 1.0: 1},
    "RestingECG": {0.0: 0, 1.0: 1, 2.0: 2},
    "ExeAngina": {1.0: 1, 0.0: 0},
    "STSlope": {1.0: 1, 2.0: 2, 3.0: 3},
    "Thalass": {3.0: 3, 6.0: 6, 7.0: 7}
}

# Rename the categories safely
for col, change in dict_column_changes.items():
    if df_combined[col].dtype.name == "category":
        df_combined[col] = df_combined[col].cat.rename_categories(change)

# Display a sample to confirm changes
df_combined[list(dict_column_changes)].head(5)
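
To show what `rename_categories` changes, here is a minimal sketch on a hypothetical three-value series (`sex` and its values are invented). It relabels float categories as integers, and separately as "F"/"M" labels in the style of the SexMF column.

```python
import pandas as pd

# Hypothetical column: float codes converted to a categorical
sex = pd.Series([1.0, 0.0, 1.0]).astype("category")
before = sex.cat.categories.tolist()  # float labels

# Relabel the float categories as integers; the data positions are unchanged
as_int = sex.cat.rename_categories({0.0: 0, 1.0: 1})

# Integer keys also match the float categories, because 0 == 0.0 in Python
as_label = sex.cat.rename_categories({0: "F", 1: "M"})

print(before, as_int.cat.categories.tolist(), as_label.tolist())
```

Only the labels change; no rows are added, removed, or reordered.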

#### Checking if categorical data is ordered

Check each column in df_combined to identify if categorical data is ordered and display the category labels.

In [None]:
# Inspect each column for data type and order status
for col in df_combined:
    if df_combined[col].dtype == "category":
        print(f"Column: {col}\nCategories: {df_combined[col].cat.categories.tolist()}\nOrdered: {df_combined[col].cat.ordered}\n")
    else:
        print(f"Column: {col}\nData Type: {df_combined[col].dtype}\n")

#### Making "Diagnosis" ordered

Convert the Diagnosis column to an ordered categorical variable with specified categories.

In [None]:
# Convert 'Diagnosis' column to an ordered categorical variable with defined levels
df_combined["Diagnosis"] = pd.Categorical(
    df_combined["Diagnosis"], 
    categories=[0, 1, 2, 3, 4], 
    ordered=True
)

# Confirm the updated categorical properties
print(f"Categories: {df_combined['Diagnosis'].cat.categories.tolist()}")
print(f"Ordered: {df_combined['Diagnosis'].cat.ordered}")

#### Making "DiagnosisYN" column

Create a new column DiagnosisYN to classify Diagnosis into binary categories (1: heart disease, 0: no heart disease) and convert it to a categorical variable.

In [None]:
# Create a new binary column 'DiagnosisYN' based on 'Diagnosis'
# 1 for heart disease (1-4), 0 for no heart disease (0)
df_combined["DiagnosisYN"] = np.where(df_combined["Diagnosis"].isin([1, 2, 3, 4]), 1, 0)

# Convert 'DiagnosisYN' to a categorical variable with defined categories
df_combined["DiagnosisYN"] = pd.Categorical(df_combined["DiagnosisYN"], categories=[0, 1], ordered=False)

# Verify the new column's categorical properties
print(f"Categories: {df_combined['DiagnosisYN'].cat.categories.tolist()}")
print(f"Ordered: {df_combined['DiagnosisYN'].cat.ordered}")
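
A minimal sketch of the two diagnosis encodings, assuming a hypothetical five-value series (`diagnosis` and its values are invented). It shows the binary flag derived with `np.where`, and one thing the ordered categorical allows: comparisons against a level.

```python
import numpy as np
import pandas as pd

# Hypothetical diagnosis values using the same five ordered levels
diagnosis = pd.Series(
    pd.Categorical([0, 2, 4, 0, 1], categories=[0, 1, 2, 3, 4], ordered=True)
)

# Binary flag: 1 for any of the disease levels 1-4, otherwise 0
diagnosis_yn = np.where(diagnosis.isin([1, 2, 3, 4]), 1, 0)

# Because the categorical is ordered, level comparisons are permitted
at_least_3 = diagnosis >= 3

print(diagnosis_yn.tolist(), at_least_3.tolist())
```

Level 3 is declared in `categories` even though it does not occur in this sample, so it remains a valid level.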

#### Check Dtypes and summaries

Display the data types of all columns in the df_combined DataFrame.

In [None]:
# Display the data types of all columns in the DataFrame
df_combined.dtypes

Generate and round summary statistics for all numeric columns in the df_combined DataFrame to two decimal places.

In [None]:
# Generate summary statistics for numeric columns and round to 2 decimal places
df_combined.describe().round(2)

#### Summary statistics for each Diagnosis type

Generate separate summary statistics for rows in df_combined where DiagnosisYN is 0 and 1, rounding the results to two decimal places.

In [None]:
# Summary for DiagnosisYN == 0
summary_no_disease = df_combined[df_combined["DiagnosisYN"] == 0].describe().round(2)

# Summary for DiagnosisYN == 1
summary_disease = df_combined[df_combined["DiagnosisYN"] == 1].describe().round(2)

# Optionally display summaries
display(summary_no_disease, summary_disease)

# Alternative: Group by DiagnosisYN for concise summaries
grouped_summary = df_combined.groupby("DiagnosisYN", observed=False).describe().round(2)
grouped_summary
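
The grouped summary has two-level column labels (column name, statistic). As a minimal sketch of how a single statistic can be pulled out of it, assuming a hypothetical four-row frame (`df_toy` is invented for illustration):

```python
import pandas as pd

# Hypothetical frame with a categorical group column and one numeric column
df_toy = pd.DataFrame({
    "DiagnosisYN": pd.Categorical([0, 0, 1, 1], categories=[0, 1]),
    "Age": [40, 50, 60, 70],
})

# describe() per group yields columns such as ("Age", "mean")
grouped = df_toy.groupby("DiagnosisYN", observed=False).describe()

# Select one statistic with a (column, statistic) tuple
age_means = grouped[("Age", "mean")]
print(age_means.tolist())
```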

### 4.2 Data Exploration - Visualisations

#### Making lists of categorical data and numerical data

Create separate lists for categorical and numerical columns to organize the dataset for further analysis.

In [None]:
# Lists for categorical and numerical columns for easy access
columns_for_categorical = [
    "SexMF",    # Gender (Male/Female)
    "ChestPain", # Type of chest pain
    "FastingBS", # Fasting blood sugar
    "RestingECG", # Resting electrocardiographic results
    "ExeAngina",  # Exercise induced angina
    "STSlope",    # Slope of the peak exercise ST segment
    "Thalass",    # Thalassemia
    "Diagnosis",  # Diagnosis (original classes)
    "DiagnosisYN" # Diagnosis (binary: 1 for heart disease, 0 for no heart disease)
]

columns_for_numerical = [
    "Age",          # Age of the patient
    "RestingBP",    # Resting blood pressure
    "Chol",         # Cholesterol levels
    "HeartRateMax", # Maximum heart rate achieved
    "STDep",        # Depression of the ST segment
    "ColouredMV"    # Number of colored major vessels
]
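
The lists above are written by hand. As an alternative sketch, similar lists can be derived from the column dtypes with `select_dtypes`; this is not the approach taken in this notebook, and `df_toy` is a hypothetical frame invented for illustration.

```python
import pandas as pd

# Hypothetical frame mixing numeric, categorical, and text columns
df_toy = pd.DataFrame({
    "Age": [63, 41],
    "STDep": [2.3, 0.0],
    "Sex": pd.Categorical([1, 0]),
    "Location": ["A", "B"],
})

# Columns stored with the pandas "category" dtype
categorical_cols = df_toy.select_dtypes(include="category").columns.tolist()

# Columns stored as integers or floats
numerical_cols = df_toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols, numerical_cols)
```

A hand-written list remains useful when a column's role differs from its dtype, as with ColouredMV, which is stored as an integer but plotted as a bar chart.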

#### Bar Chart - Sex

In [None]:
# Create a count plot for 'SexMF' grouped by 'DiagnosisYN'
plt.figure(figsize=(4, 5))

sns.countplot(
    x="SexMF", 
    order=["M", "F"], 
    palette="Set3", 
    hue="DiagnosisYN", 
    data=df_combined
)

# Adding title, labels, and styling
plt.title("Sex by Heart Disease Diagnosis", fontsize=10, fontweight='bold')
plt.xlabel('Sex', fontsize=12)
plt.xticks(
    ticks=[0, 1],
    labels=["M", "F"],
    fontsize=8
)
plt.ylabel('Count', fontsize=12)
plt.legend(
    title='Diagnosis', 
    labels=['No', 'Yes'], 
    loc='upper right',
    fontsize=10
)

# Adjusting layout to avoid clipping
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Bar Chart - ChestPain

In [None]:
# Create a count plot for the 'DiagnosisYN' column, grouped by 'ChestPain'
plt.figure(figsize=(4,5))

sns.countplot(
    x="DiagnosisYN", 
    order=[0, 1],  # DiagnosisYN categories are the integers 0 and 1
    palette="Set3", 
    hue="ChestPain", 
    data=df_combined
)

# Adding title, labels and styling
plt.title("Heart Disease Diagnosis by Chest Pain Type", fontsize=10, fontweight='bold')
plt.xlabel('Diagnosis', fontsize=12)
plt.xticks(
    ticks = [0, 1],
    labels = ["N","Y"],
    fontsize = 8
)

# Customize legend to show chest pain types
plt.legend(
    title='Pain Type', 
    labels=["Typical Angina", "Atypical Angina", "Non-Anginal Pain", "Asymptomatic"], 
    loc='upper left',
    fontsize=8
)

# Adjusting layout to avoid clipping
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Bar Chart - FastingBS

In [None]:
# Create a count plot for 'FastingBS', grouped by 'DiagnosisYN'
plt.figure(figsize=(5, 6))

sns.countplot(
    x="FastingBS", 
    order=[0, 1],  # FastingBS: 0 = False, 1 = True
    palette="Set3", 
    hue="DiagnosisYN",  # Hue by heart disease diagnosis
    data=df_combined
)

# Add title, labels, and styling
plt.title("Fasting Blood Sugar by Heart Disease Diagnosis", fontsize=10, fontweight='bold')
plt.xlabel('Fasting blood sugar ( > 120 mg/dL)', fontsize=12)
plt.xticks(
    ticks=[0, 1],
    labels=["False", "True"],  # Labels for FastingBS: False = 0, True = 1
    fontsize=10
)
plt.ylabel('Count', fontsize=12)

# Customize legend to show diagnosis categories
plt.legend(
    title='Diagnosis', 
    labels=['No', 'Yes'],  # Legend for diagnosis: No = 0, Yes = 1
    loc='upper right',
    fontsize=10
)

# Adjust layout
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Bar Chart - RestingECG

In [None]:
# Create a count plot for 'RestingECG', grouped by 'DiagnosisYN'
plt.figure(figsize=(5, 6))

sns.countplot(
    x="RestingECG", 
    order=[0, 1, 2],  # RestingECG: 0 = Normal, 1 = ST-T Abnormality, 2 = Hypertrophy
    palette="Set3", 
    hue="DiagnosisYN",  # Hue by heart disease diagnosis
    data=df_combined
)

# Add title, labels, and styling
plt.title("Resting ECG Results by Heart Disease Diagnosis", fontsize=10, fontweight='bold')
plt.xlabel('Resting ECG Results', fontsize=12)
plt.xticks(
    ticks=[0, 1, 2],  # x-axis ticks for ECG results
    labels=["Normal", "ST-T Abnormality", "Hypertrophy"],  # RestingECG categories
    fontsize=10
)
plt.ylabel('Count', fontsize=12)

# Customize legend to show diagnosis categories
plt.legend(
    title='Diagnosis', 
    labels=['No', 'Yes'],  # Legend for diagnosis: No = 0, Yes = 1
    loc='upper center',
    fontsize=10
)

# Adjust layout
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Bar Chart - ExeAngina

In [None]:
# Create a count plot for 'ExeAngina', grouped by 'DiagnosisYN'
plt.figure(figsize=(5, 6))

sns.countplot(
    x="ExeAngina", 
    order=[0, 1],  # ExeAngina: 0 = No, 1 = Yes
    palette="Set3", 
    hue="DiagnosisYN",  # Hue by heart disease diagnosis
    data=df_combined
)

# Add title, labels, and styling
plt.title("Exercise Induced Angina by Heart Disease Diagnosis", fontsize=12, fontweight='bold')
plt.xlabel('Exercise Induced Angina', fontsize=12)
plt.xticks(
    ticks=[0, 1],  # x-axis ticks for angina types
    labels=['No', 'Yes'],  # Labels for ExeAngina categories
    fontsize=12
)
plt.ylabel('Count', fontsize=12)

# Customize legend to show diagnosis categories
plt.legend(
    title='Diagnosis', 
    labels=['No', 'Yes'],  # Legend for diagnosis: No = 0, Yes = 1
    loc='upper right',
    fontsize=10
)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Bar Chart - STSlope

In [None]:
# Create a count plot for 'STSlope', grouped by 'DiagnosisYN'
plt.figure(figsize=(5, 6))

sns.countplot(
    x="STSlope", 
    order=[1, 2, 3],  # Order for ST Slope: 1 = Upsloping, 2 = Flat, 3 = Downsloping
    palette="Set3", 
    hue="DiagnosisYN",  # Hue by heart disease diagnosis
    data=df_combined
)

# Add title, labels, and styling
plt.title("ST Slope by Heart Disease Diagnosis", fontsize=12, fontweight='bold')
plt.xlabel('ST Slope', fontsize=12)
plt.xticks(
    ticks=[0, 1, 2],  # x-axis ticks for slope types
    labels=["Upsloping", "Flat", "Downsloping"],  # Labels for Slope categories
    fontsize=10
)
plt.ylabel('Count', fontsize=12)

# Customize legend to show diagnosis categories
plt.legend(
    title='Diagnosis', 
    labels=['No', 'Yes'],  # Legend for diagnosis: No = 0, Yes = 1
    loc='upper right',
    fontsize=10
)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()


#### Bar Chart - Thalass

In [None]:
# Create a count plot for 'Thalass', grouped by 'DiagnosisYN'
plt.figure(figsize=(5, 6))

sns.countplot(
    x="Thalass", 
    order=[3, 6, 7],  # Order for Thalassemia status: 3 = Normal, 6 = Fixed Defect, 7 = Reversible Defect
    palette="Set3", 
    hue="DiagnosisYN",  # Hue by heart disease diagnosis
    data=df_combined
)

# Add title, labels, and styling
plt.title("Thalassemia Status by Heart Disease Diagnosis", fontsize=12, fontweight='bold')
plt.xlabel('Thalassemia Status', fontsize=12)
plt.xticks(
    ticks=[0, 1, 2],  # x-axis ticks for Thalassemia categories
    labels=["Normal", "Fixed Defect", "Reversible Defect"],  # Labels for Thalassemia categories
    fontsize=10
)

plt.ylabel('Count', fontsize=12)

# Customize legend to show diagnosis categories
plt.legend(
    title='Diagnosis', 
    labels=['No', 'Yes'],  # Legend for diagnosis: No = 0, Yes = 1
    loc='upper right',
    fontsize=10
)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Bar Chart - ColouredMV

In [None]:
# Create a count plot for 'ColouredMV', grouped by 'DiagnosisYN'
plt.figure(figsize=(5,6))

sns.countplot(
    x="ColouredMV", 
    order=[0, 1, 2, 3],  # Order for coloured major vessels: 0 = No vessels, 1-3 = Different numbers of vessels
    palette="Set3", 
    hue="DiagnosisYN",  # Hue by heart disease diagnosis
    data=df_combined
)

# Add title, labels, and styling
plt.title("Fluoroscopy Results by Heart Disease Diagnosis", fontsize=12, fontweight='bold')
plt.xlabel('Coloured Major Vessels', fontsize=12)
plt.xticks(
    ticks=[0, 1, 2, 3],  # x-axis ticks for coloured vessels
    labels=["0", "1", "2", "3"],  # Labels for number of vessels
    fontsize=10
)

plt.ylabel('Count', fontsize=12)

# Customize legend to show diagnosis categories
plt.legend(
    title='Diagnosis', 
    labels=['No', 'Yes'],  # Legend for diagnosis: No = 0, Yes = 1
    loc='upper right',
    fontsize=10
)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Histogram - Age

In [None]:
att = "Age"
att_name = "Age (Years)"
title = "Overlaid Histograms of Age\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(5,6))

# Create a histogram of 'Age' with KDE, grouped by heart disease diagnosis
sns.histplot(
    data=df_combined,
    x=att,
    hue="DiagnosisYN",  # Group by heart disease diagnosis
    hue_order=[0,1],  # Order for the hue: No heart disease first
    kde=True,  # Kernel Density Estimate for smooth distribution curve
    bins=20,  # Increase the number of bins for more granularity
    palette="Set2",  # Color palette for the groups
    multiple="dodge",  # Separate the histograms for each group
    alpha=0.6  # Transparency for better overlay visibility
)

# Title, labels, and styling
plt.title(title, fontsize=12, fontweight='bold')
plt.xlabel(att_name, fontsize=12)
plt.ylabel("Frequency", fontsize=12)

# Customize the legend
plt.legend(
    title='Diagnosis', 
    labels=["Yes", "No"],  # Labels for the diagnosis groups
    loc='upper right',
    fontsize=10
)

# Adjust layout
plt.subplots_adjust(top=0.92)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Box Plot - Age

In [None]:
att = "Age"
att_name = "Age (Years)"
title = "Box Plots of Age\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(4,5))

# Create a box plot for 'Age', grouped by heart disease diagnosis
sns.boxplot(
    data=df_combined,
    x="DiagnosisYN",  # Group by heart disease diagnosis
    y=att,  # The variable 'Age'
    hue="DiagnosisYN",  # Coloring by diagnosis
    palette="Set2",  # Color palette for the groups
    showmeans=True,  # Show the mean values on the plot
    width=0.5  # Adjust the width for better spacing
)

# Title, labels, and styling
plt.title(title, fontsize=12, fontweight='bold')
plt.xlabel("Diagnosis", fontsize=12)
plt.ylabel(att_name, fontsize=12)

# Customize the x-ticks to show 'No' and 'Yes' instead of 0 and 1
plt.xticks(
    ticks=[0, 1], 
    labels=['No', 'Yes'],  # Replace '0' with 'No' and '1' with 'Yes'
    fontsize=12
)

# Remove the legend
plt.legend([], [], frameon=False)

# Adjust layout for better spacing
plt.subplots_adjust(top=0.9)
plt.subplots_adjust(left=0.2)

# Display the plot
plt.show()

#### Histogram - RestingBP

In [None]:
att = "RestingBP"
att_name = "Resting Blood Pressure (mmHg on admission)"
title = "Overlaid Histograms of Resting Blood Pressure\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(5,6))

# Create a histogram of 'RestingBP' with KDE, grouped by heart disease diagnosis
sns.histplot(
    data=df_combined,
    x=att,
    hue="DiagnosisYN",  # Group by heart disease diagnosis
    hue_order=[0,1],  # Order for the hue: No heart disease first
    kde=True,  # Kernel Density Estimate for smooth distribution curve
    bins=20,  # Increase the number of bins for more granularity
    palette="Set2",  # Color palette for the groups
    multiple="dodge",  # Separate the histograms for each group
    alpha=0.6  # Transparency for better overlay visibility
)

# Title, labels, and styling
plt.title(title, fontsize=12, fontweight='bold')
plt.xlabel(att_name, fontsize=12)
plt.ylabel("Frequency", fontsize=12)

# Customize the legend
plt.legend(
    title='Diagnosis', 
    labels=["Yes", "No"],  # Labels for the diagnosis groups
    loc='upper right',
    fontsize=10
)

# Adjust layout
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Box Plot - RestingBP

In [None]:
att = "RestingBP"
att_name = "Resting Blood Pressure (mm Hg on admission)"
title = "Box Plots of Resting Blood Pressure\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(4,5))

# Create a box plot for 'RestingBP', grouped by 'DiagnosisYN'
sns.boxplot(
    data=df_combined,
    x="DiagnosisYN",  # Group by heart disease diagnosis
    y=att,  # Plot RestingBP on the y-axis
    hue="DiagnosisYN",  # Add color based on diagnosis
    palette="Set2",  # Color palette for the groups
    showmeans=True  # Show mean values on the box plot
)

# Title, labels, and styling
plt.title(title, fontsize=12, fontweight='bold')
plt.xlabel("Diagnosis", fontsize=12)
plt.ylabel(att_name, fontsize=10)

# Customize x-ticks
plt.xticks(
    ticks=[0, 1], 
    labels=['No', 'Yes'],  # Replace 0 with 'No' and 1 with 'Yes'
    fontsize=12
)

# Remove the legend
plt.legend([],[],frameon=False)

# Adjust layout
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.2)

# Display the plot
plt.show()

#### Histogram - Chol

In [None]:
att = "Chol"
att_name = "Serum Cholesterol (mg/dL)"
title = "Overlayed Histograms of Serum Cholesterol\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(4,5))

# Create a histogram with KDE for 'Chol', grouped by 'DiagnosisYN'
sns.histplot(
    data=df_combined,
    x=att,  # Plot Serum Cholesterol on the x-axis
    hue="DiagnosisYN",  # Group by heart disease diagnosis
    hue_order=[0, 1],  # '0' for No, '1' for Yes
    kde=True,  # Add KDE for smoother distribution visualization
    bins=15,  # Number of bins
    palette="Set2",  # Color palette for the groups
    multiple="dodge",  # Separate histograms for different hues
    alpha=0.5  # Transparency to overlay the histograms
)

# Add title, labels, and styling
plt.title(title, fontsize=12, fontweight='bold')
plt.xlabel(att_name, fontsize=12)
plt.ylabel("Frequency", fontsize=12)

# Customize the legend
plt.legend(
    title='Diagnosis', 
    labels=["Yes", "No"], 
    loc='upper right', 
    fontsize=10
)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.92)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Box Plot - Chol

In [None]:
att = "Chol"
att_name = "Serum Cholesterol (mg/dL)"
title = "Box Plots of Serum Cholesterol\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(4,5))

# Create a boxplot for 'Chol', grouped by 'DiagnosisYN'
sns.boxplot(
    data=df_combined,
    x="DiagnosisYN",  # Group by diagnosis (Heart Disease vs. No Heart Disease)
    y=att,  # Plot Serum Cholesterol on the y-axis
    hue="DiagnosisYN",  # Color the boxes by diagnosis
    palette="Set2",  # Color palette for the groups
    showmeans=True  # Show the mean value in the boxplot
)

# Add title and axis labels
plt.title(title, fontsize=12, fontweight='bold')
plt.xlabel("Diagnosis", fontsize=12)
plt.ylabel(att_name, fontsize=12)

# Customize x-ticks to display "No" and "Yes" for diagnosis
plt.xticks(
    ticks=[0, 1], 
    labels=['No', 'Yes'],  # Replace '0' with 'No' and '1' with 'Yes'
    fontsize=12
)

# Remove the legend as it is redundant with the labels on the x-axis
plt.legend([],[],frameon=False)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.9)
plt.subplots_adjust(left=0.25)

# Display the plot
plt.show()

#### Histogram - HeartRateMax

In [None]:
att = "HeartRateMax"
att_name = "Maximum Achieved Heart Rate (BPM)"
title = "Overlayed Histograms of Maximum Heart Rate\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(4,6))

# Create a histogram with KDE for 'HeartRateMax', grouped by 'DiagnosisYN'
sns.histplot(
    data=df_combined,
    x=att,  # Plot the 'HeartRateMax' column on the x-axis
    hue="DiagnosisYN",  # Color the bars by diagnosis (Heart Disease vs. No Heart Disease)
    hue_order=[0,1],  # Order of hue labels (No Heart Disease first)
    kde=True,  # Display a kernel density estimate on the histogram
    bins=20,  # Adjusted number of bins for better clarity
    palette="Set2",  # Color palette for the groups
    multiple="dodge",  # Place the bars for each group next to each other
    alpha=0.5  # Transparency of the bars
)

# Add title and axis labels
plt.title(title, fontsize=12, fontweight='bold')
plt.xlabel(att_name, fontsize=12)
plt.ylabel("Frequency", fontsize=12)

# Customize the legend
plt.legend(
    title='Diagnosis', 
    labels=["Yes","No"], 
    loc='upper right',  # Position the legend in the upper right
    fontsize=10
)

# Adjust layout to avoid clipping of elements
plt.subplots_adjust(top=0.92)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Box Plot - HeartRateMax

In [None]:
att = "HeartRateMax"
att_name = "Maximum Achieved Heart Rate (BPM)"
title = "Box Plots of Maximum Heart Rate\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(4,5))  # Increased figure size for better clarity

# Create a boxplot comparing 'HeartRateMax' for different diagnoses ('Yes' or 'No')
sns.boxplot(
    data=df_combined,
    x="DiagnosisYN",  # x-axis represents diagnosis (No/Yes)
    y=att,  # y-axis represents 'HeartRateMax'
    hue="DiagnosisYN",  # Color boxes based on diagnosis (Heart Disease vs. No Heart Disease)
    palette="Set2",  # Color palette for the boxes
    showmeans=True  # Display the mean on the boxplot
)

# Set the title and axis labels
plt.title(title, fontsize=12, fontweight='bold')  # Increased title font size
plt.xlabel("Diagnosis", fontsize=12)
plt.ylabel(att_name, fontsize=12)  # Increased y-axis label font size

# Customize tick labels on x-axis
plt.xticks(
    ticks=[0, 1], 
    labels=['No', 'Yes'],  # Replace '0' with 'No' and '1' with 'Yes'
    fontsize=12
)

# Remove the legend as it is not needed for this plot
plt.legend([],[],frameon=False)

# Adjust the layout to avoid clipping of elements
plt.subplots_adjust(top=0.9)
plt.subplots_adjust(left=0.2)

# Display the plot
plt.show()

#### Histogram - STDep

In [None]:
att = "STDep"
att_name = "ST Depression"
title = "Overlayed Histograms of ST Depression\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(4,6))

# Create a histogram with KDE for 'STDep' for both diagnoses ('Yes' and 'No')
sns.histplot(
    data=df_combined,
    x=att,  # x-axis represents 'STDep'
    hue="DiagnosisYN",  # Color by diagnosis (Heart Disease vs. No Heart Disease)
    hue_order=[0,1],  # Order of hue categories (No = 0, Yes = 1)
    kde=True,  # Include KDE for smoothed distribution
    bins=20,  # Adjusted number of bins for better resolution
    palette="Set2",  # Color palette for the histogram
    multiple="dodge",  # Display bars side by side
    alpha=0.5  # Transparency level for the bars
)

# Set the title and axis labels
plt.title(title, fontsize=12, fontweight='bold')  # Increased title font size
plt.xlabel(att_name, fontsize=12)
plt.ylabel("Frequency", fontsize=12)

# Customize the legend
plt.legend(
    title='Diagnosis', 
    labels=["Yes","No"], 
    loc='upper right',
    fontsize=12  # Increased font size for legend
)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

#### Box Plot - STDep

In [None]:
att = "STDep"
att_name = "ST Depression"
title = "Box Plots of ST Depression\n for Heart Disease vs. No Heart Disease"

plt.figure(figsize=(3,4))

# Create a box plot to show the distribution of 'STDep' for each diagnosis group
sns.boxplot(
    data=df_combined,
    x="DiagnosisYN",  # x-axis represents diagnosis categories (No vs. Yes)
    y=att,  # y-axis represents 'STDep'
    hue="DiagnosisYN",  # Color by diagnosis (Heart Disease vs. No Heart Disease)
    palette="Set2",  # Color palette for the plot
    showmeans=True  # Show the mean value in the box plot
)

# Set the title and axis labels with increased font size
plt.title(title, fontsize=12, fontweight='bold')  # Increased title font size
plt.xlabel("Diagnosis", fontsize=14)  # Increased label font size
plt.ylabel(att_name, fontsize=14)  # Increased label font size

# Customize x-axis ticks and labels
plt.xticks(
    ticks=[0, 1],  # x-axis positions
    labels=['No', 'Yes'],  # Labels for the diagnosis categories
    fontsize=12
)

# Remove the legend as it's not needed for this box plot
plt.legend([], [], frameon=False)

# Adjust layout to avoid clipping
plt.subplots_adjust(top=0.8)
plt.subplots_adjust(left=0.15)

# Display the plot
plt.show()

### 4.3 Data Exploration - Correlations

#### Correlation matrix

Generate a heatmap of the correlation matrix for numerical features in the dataset. The correlations are annotated inside the heatmap with two decimal places.

In [None]:
# Calculate the correlation matrix for selected numerical columns
corr_matrix = df_combined[columns_for_numerical].corr()

# Set up the figure for the heatmap
plt.figure(figsize=(7,7))  # Adjusted figure size for clearer view

# Create the heatmap of the correlation matrix
sns.heatmap(
    corr_matrix,              # Data to plot (correlation matrix)
    annot=True,               # Annotate the cells with correlation values
    fmt=".2f",                # Format for displaying correlation values
    cmap="coolwarm",          # Colormap for the heatmap
    linewidths=0.5,           # Line width between the cells
    vmin=-1, vmax=1,          # Color scale limits (min and max values)
    square=True,              # Ensure the heatmap is square-shaped
    cbar_kws={"shrink": .8}   # Shrink color bar for better visibility
)

# Set the title of the plot
plt.title("Correlation Matrix of Numerical Features", fontsize=12)

# Adjust the layout to avoid clipping of labels and titles
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(left=0.2)

# Display the heatmap
plt.show()
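
A heatmap is easy to scan, but a ranked list of pairs can make the strongest relationships explicit. The sketch below is one way to do that. It runs on a small synthetic DataFrame (`demo`, with made-up columns `a`, `b`, `c`) standing in for `df_combined[columns_for_numerical]`, so none of the numbers relate to the heart disease data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_combined[columns_for_numerical]; values are made up
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 0.8 + rng.normal(scale=0.5, size=200),  # built to correlate with a
    "c": rng.normal(size=200),                        # independent noise
})

corr = demo.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the real correlation matrix, the same masking step would list each feature pair once, which is convenient when checking for strongly correlated predictors.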

#### Point Biserial Correlation Coefficient Calculation

This section calculates the point biserial correlation coefficient between each numerical variable and the target variable `DiagnosisYN` (which indicates the presence of heart disease). The point biserial correlation is used to measure the strength and direction of the association between a continuous and a binary variable.

The results are displayed as a table with the correlation coefficient and the corresponding p-value for each numerical feature. This is useful for identifying which continuous variables have a significant relationship with the presence of heart disease.

In [None]:
# Initialize a list to store results
pointbi_list = []

# Loop through each numerical column in the dataset
for col in columns_for_numerical:
    # Calculate point-biserial correlation and p-value between the numerical column and 'DiagnosisYN'
    correlation, p_value = pointbiserialr(df_combined[col], df_combined['DiagnosisYN'])
        
    # Append the results (column name, correlation, and p-value) to the list
    pointbi_list.append([col, correlation, p_value])

# Create a DataFrame to store the results for better visualisation
df_pointbi = pd.DataFrame(pointbi_list, columns=["Variable", "Correlation", "P-value"])

# Display the resulting DataFrame in a clean format
df_pointbi
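
The point biserial coefficient is mathematically the Pearson correlation computed with the binary variable coded as 0/1. The sketch below illustrates this on synthetic data; the variables `group` and `value` are made up for the illustration and are not columns of `df_combined`.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

rng = np.random.default_rng(1)

# Binary variable (0/1) and a continuous variable whose mean is shifted for group 1
group = rng.integers(0, 2, size=300)
value = rng.normal(loc=50 + 5 * group, scale=10)

# Both functions return a correlation coefficient and a two-sided p-value
r_pb, p_pb = pointbiserialr(group, value)
r_p, p_p = pearsonr(group, value)

print(f"Point biserial r = {r_pb:.4f}, Pearson r = {r_p:.4f}")
```

Because the two coefficients coincide, the sign of each value in `df_pointbi` can be read as in any Pearson correlation: positive when the feature tends to be higher in the group coded 1.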

#### Visualisation of Point Biserial Correlation Coefficients

This section visualises the point biserial correlation coefficients between the numerical features and the target variable `DiagnosisYN` using a bar plot. Each bar represents the correlation between a numerical feature and the diagnosis outcome (whether or not the individual has heart disease). The y-axis displays the correlation coefficients ranging from -1 to 1, indicating the strength and direction of the relationship.

The visualisation helps to quickly identify which features have stronger associations with heart disease diagnosis and the nature of these relationships (positive or negative).

In [None]:
# Visualize the point biserial correlation coefficients with a bar plot
plt.figure(figsize=(4, 5))

# Create a bar plot for the correlation coefficients of each numerical feature with DiagnosisYN
sns.barplot(
    data=df_pointbi,  # Data for the plot
    x="Variable",  # Numerical features
    y="Correlation",  # Correlation coefficients
    color='skyblue'  # Bar color
)

# Set the title for the plot
plt.title("Point Biserial Correlation Coefficients with DiagnosisYN", fontsize=10, fontweight='bold')

# Set the x-axis label and adjust the font size
plt.xlabel('Numerical Features', fontsize=12)

# Rotate the x-axis labels to avoid overlap and adjust the font size
plt.xticks(rotation=60, fontsize=10)

# Set y-axis limits to ensure the correlation coefficients are within the range [-1, 1]
plt.ylim(-1, 1)

# Adjust the layout for better spacing
plt.subplots_adjust(left=0.25, bottom=0.25)

# Set the y-axis label for correlation coefficients
plt.ylabel('Correlation Coefficient', fontsize=10)

# Display the plot
plt.show()

#### PairGrid Visualisation

This code creates a PairGrid visualisation to explore relationships between continuous numerical variables in the dataset. The grid displays histograms for individual variables and scatter plots for pairwise comparisons, with the colour representing whether the patient has heart disease (DiagnosisYN).

In [None]:
# Create a PairGrid for continuous numerical variables, excluding "ColouredMV"
ax = sns.PairGrid(
    df_combined.drop(["ColouredMV"], axis=1),  # Drop "ColouredMV" column
    hue="DiagnosisYN",  # Colour by DiagnosisYN
    hue_order=[0, 1]    # Ensure correct order for hue (No = 0, Yes = 1)
)

# Apply histograms to diagonal elements (individual variable distributions)
ax.map_diag(sns.histplot)

# Apply scatter plots to off-diagonal elements (pairwise comparisons)
ax.map_offdiag(sns.scatterplot)

# Add legend and set title
ax.add_legend()
ax.figure.suptitle("Pair Grid of Continuous Numerical Variables", fontsize=10, fontweight='bold')

# Adjust layout: fit the elements first, then leave room at the top for the title
plt.tight_layout()  # Ensure elements fit well into the plot area
plt.subplots_adjust(top=0.95)

plt.show()

#### Chi-squared test

This code runs a chi-squared test of independence for each categorical variable against the DiagnosisYN variable (heart disease diagnosis). The test checks whether there is a statistically significant association between the categorical variable and heart disease diagnosis. The results are stored and displayed in a DataFrame.

In [None]:
# Convert 'ColouredMV' to a categorical variable with ordered categories
# The column 'ColouredMV' contains multiple values, and we want to treat it as a categorical variable with specific order.
df_combined["ColouredMVCat"] = pd.Categorical(df_combined["ColouredMV"], categories=[0, 1, 2, 3], ordered=True)

# Check the dtype of the new categorical column to verify the conversion
print(df_combined["ColouredMVCat"].dtype)  # This will print the dtype to confirm the conversion

In [None]:
# List of categorical variables to check against DiagnosisYN
columns_for_chi = [
    "Sex", 
    "ChestPain", 
    "FastingBS", 
    "RestingECG", 
    "ExeAngina", 
    "STSlope", 
    "Thalass", 
    "ColouredMVCat"
]

chi_list = []

# Loop through each categorical variable and perform the Chi-squared test
for cat in columns_for_chi:
    # Create a contingency table for the categorical variable vs DiagnosisYN
    contingency_table = pd.crosstab(df_combined[cat], df_combined['DiagnosisYN'])
    
    # Perform the Chi-squared test
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    # Output the results with formatted p-value
    print(f"For {cat}:")
    print(f"Chi-squared statistic: {chi2:.4f}")
    print(f"P-value: {p:.4f}")
    print(f"Degrees of freedom: {dof}")
    print(f"Expected frequencies:\n{expected}")
    
    # Interpretation based on p-value
    if p < 0.05:
        print("The result is statistically significant (reject the null hypothesis).\n")
    else:
        print("The result is not statistically significant (fail to reject the null hypothesis).\n")
    
    # Append the results to the list
    chi_list.append([cat, chi2, dof, p])

# Display the results in a DataFrame
df_chi = pd.DataFrame(chi_list, columns=["Variable", "Chi-squared", "Degrees of Freedom", "P-value"])
df_chi.style.hide(axis="index")
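
The chi-squared statistic grows with sample size, so it does not by itself indicate how strong an association is. The sketch below uses a made-up 2x2 table (not the heart disease data) to show how the expected frequencies are derived and how Cramér's V, a common effect-size measure that is not part of the original analysis, could be computed from the same output. Note that `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, which is the case for any 2x2 table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts: rows are levels of a categorical feature, columns are the diagnosis
table = pd.DataFrame(
    [[30, 10], [20, 40]],
    index=["level_a", "level_b"],
    columns=["No", "Yes"],
)

chi2, p, dof, expected = chi2_contingency(table)

# Expected counts under independence: row total * column total / grand total
n = table.to_numpy().sum()
manual_expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n

# Cramér's V rescales chi-squared to the range [0, 1]
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(f"chi2 = {chi2:.3f}, dof = {dof}, Cramér's V = {cramers_v:.3f}")
```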

### 4.4 Data Exploration - Decision Tree & Naive Bayes

#### Decision Tree

Train a Decision Tree classifier on categorical variables to predict heart disease diagnosis (DiagnosisYN). The decision tree is visualised to understand how different categorical variables are used to classify patients into heart disease categories (No/Yes). The plot helps in interpreting the decision-making process of the model.

In [None]:
# List of features used for training the decision tree model
features_for_tree = [
    "Sex", 
    "ChestPain", 
    "FastingBS", 
    "RestingECG", 
    "ExeAngina", 
    "STSlope", 
    "Thalass", 
    "ColouredMVCat"
]

# Train the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42).fit(
    X=df_combined[features_for_tree],  # Feature data
    y=df_combined["DiagnosisYN"]       # Target variable (heart disease diagnosis)
)

# Create a plot of the trained decision tree
plt.figure(figsize=(80, 20))  # Adjusted figure size for better readability

# Visualise the decision tree
plot_tree(
    clf,  # Decision tree classifier
    filled=True,  # Colour the nodes based on the majority class
    max_depth=11,  # Limit the depth for simplicity
    feature_names=features_for_tree,  # Feature names used in the model
    class_names=["No", "Yes"],  # Heart disease diagnosis (No, Yes)
    label='all',  # Display all labels (class, samples, value, etc.)
    fontsize=10,  # Font size for labels
    impurity=False  # Don't show impurity values
)

# Display the plot
plt.show()

Use GridSearchCV to find the optimal number of leaf nodes for a Decision Tree classifier by evaluating cross-validated performance across different values of max_leaf_nodes. The best value is the one with the highest mean cross-validated score (GridSearchCV uses 5-fold cross-validation by default). Afterward, the decision tree is visualised with the chosen number of leaf nodes.

In [None]:
# Define the parameters for GridSearchCV
parameters = {'max_leaf_nodes': range(3, 20)}  # Searching for optimal number of leaf nodes

# Initialize GridSearchCV with Decision Tree classifier and defined parameters
clf = GridSearchCV(DecisionTreeClassifier(random_state=42), parameters, n_jobs=4)  # Random state for reproducibility
clf.fit(X=df_combined[features_for_tree], y=df_combined["DiagnosisYN"])

# Get the best model with the optimal leaf nodes
tree_model = clf.best_estimator_

# Print the best score and the corresponding parameters
print("Best score was {:.4f} at {}".format(clf.best_score_, clf.best_params_))

# Plot the relationship between the number of leaf nodes and the mean cross-validated score
plt.figure(figsize=(10, 6))
plt.plot(range(3, 20), clf.cv_results_['mean_test_score'], color='b', marker='o')  # Plot mean CV scores
plt.xlabel('Max Number of Leaf Nodes', fontsize=12)  # X-axis label
plt.ylabel('Mean CV Score (Accuracy)', fontsize=12)  # Y-axis label
plt.title('Mean CV Score vs. Max Leaf Nodes', fontsize=14, fontweight='bold')  # Plot title
plt.grid(True)  # Add grid for better readability
plt.show()
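
The grid search results can also be inspected as a table, which avoids having to keep a hard-coded `range(3, 20)` in sync with the parameter grid. The sketch below runs on synthetic data from `make_classification` (an assumption made so the example is self-contained); the names `X_demo`, `y_demo`, `search`, and `cv_table` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for df_combined[features_for_tree] and DiagnosisYN
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_leaf_nodes": range(3, 20)},
    cv=5,  # 5-fold cross-validation, stated explicitly
)
search.fit(X_demo, y_demo)

# cv_results_ stores one row per candidate, including the parameter value itself
cv_table = pd.DataFrame(search.cv_results_)[
    ["param_max_leaf_nodes", "mean_test_score", "std_test_score"]
]
print(cv_table.sort_values("mean_test_score", ascending=False).head())
```

Plotting `cv_table["param_max_leaf_nodes"]` against `cv_table["mean_test_score"]` would give the same curve as above, and `std_test_score` gives a sense of how much the score varies between folds.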

In [None]:
# Refit the decision tree with 7 leaf nodes; set_params alone does not retrain the model
tree_model.set_params(max_leaf_nodes=7)
tree_model.fit(X=df_combined[features_for_tree], y=df_combined["DiagnosisYN"])
plt.figure(figsize=(12, 5))

# Plot the decision tree with the new leaf nodes
plot_tree(
    tree_model, 
    filled=True, 
    max_depth=5, 
    feature_names=features_for_tree, 
    class_names=["No", "Yes"],
    label='all', 
    fontsize=10, 
    impurity=False
)

plt.title('Decision Tree Model with 7 Leaf Nodes', fontsize=14, fontweight='bold')  # Plot title

# Display the plot
plt.show()
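
One detail worth keeping in mind when changing `max_leaf_nodes` on an existing model: in scikit-learn, `set_params` only updates the hyperparameter, and the fitted tree stays as it was until `fit` is called again. The sketch below demonstrates this on synthetic data (`X_demo`, `y_demo` are illustrative and unrelated to the heart disease dataset).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for this demonstration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(random_state=42, max_leaf_nodes=15).fit(X_demo, y_demo)
leaves_fitted = tree.get_n_leaves()

# Changing the hyperparameter does not alter the already fitted tree
tree.set_params(max_leaf_nodes=7)
leaves_after_set = tree.get_n_leaves()

# Only refitting produces a tree that respects the new limit
tree.fit(X_demo, y_demo)
leaves_after_refit = tree.get_n_leaves()

print(leaves_fitted, leaves_after_set, leaves_after_refit)
```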

#### Naive Bayes

Implement a Naive Bayes classifier using categorical data to predict the likelihood of a diagnosis. The model is evaluated on 500 random splits of the data (using different random seeds) to assess the accuracy of predictions on the test set. The accuracy scores are collected for further analysis of the model’s performance.

In [None]:
# Split into features for Naive Bayes classifier
columns_for_nb = [
    "Sex",
    "ChestPain",
    "FastingBS",
    "RestingECG",
    "ExeAngina",
    "STSlope",
    "Thalass",
    "ColouredMVCat"
]

X = df_combined[columns_for_nb]  # Feature variables
y = df_combined["Diagnosis"]     # Target variable

scores = []  # List to store accuracy scores

# Loop for 500 random splits to evaluate accuracy
for seed in range(1, 501):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    
    # Initialize and train the Naive Bayes classifier
    model = CategoricalNB()
    model.fit(X_train, y_train)
    
    # Predict on the test set and calculate accuracy
    y_pred = model.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    
    # Append the accuracy score to the list
    scores.append(score)

# Convert the list of scores to a numpy array for easier analysis
scores = np.array(scores)

# Print the summary statistics of the accuracy scores
print(f"Mean accuracy: {scores.mean() * 100:.2f}%")
print(f"Standard deviation of accuracy: {scores.std() * 100:.2f}%")

# Optionally: plot the distribution of accuracy scores
plt.figure(figsize=(8, 6))
plt.hist(scores, bins=30, color='skyblue', edgecolor='black')
plt.title("Distribution of Accuracy Scores for Naive Bayes Classifier", fontsize=14, fontweight='bold')
plt.xlabel('Accuracy', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.grid(True)
plt.show()
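
The repeated random splits can also be expressed with `StratifiedShuffleSplit` and `cross_val_score`, which keeps the class balance the same in every split. The sketch below is an alternative formulation on synthetic data, not the procedure used above. The features `f1` and `f2` are made up, and `min_categories` is set so that a category level missing from a training split does not cause an error at prediction time.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 300
y_demo = rng.integers(0, 2, size=n)

# Two made-up categorical features coded as non-negative integers
f1 = np.where(rng.random(n) < 0.7, y_demo, 1 - y_demo)  # binary, agrees with y about 70% of the time
f2 = rng.integers(0, 4, size=n)                          # four levels, uninformative
X_demo = np.column_stack([f1, f2])

# 50 stratified random splits, each holding out 20% of the rows
splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

# min_categories gives the number of levels per feature, seen or not in a given split
model = CategoricalNB(min_categories=[2, 4])
scores_demo = cross_val_score(model, X_demo, y_demo, cv=splitter, scoring="accuracy")

print(f"Mean accuracy: {scores_demo.mean() * 100:.2f}%")
print(f"Standard deviation: {scores_demo.std() * 100:.2f}%")
```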

Visualise the distribution of these scores using a box plot, which displays the spread of accuracy scores along with the mean. This helps assess the consistency of the classifier's performance across multiple random seeds.

In [None]:
# Store accuracy scores in a DataFrame and convert to percentage
accuracy_scores_df = pd.DataFrame(scores, columns=["Score"])
accuracy_scores_df["Score"] = accuracy_scores_df["Score"] * 100  # Convert to percentage

# Display the DataFrame with accuracy scores
accuracy_scores_df

# Create the box plot for accuracy scores
plt.figure(figsize=(5, 2))  # Slightly larger figure for better readability
ax = sns.boxplot(
    x=accuracy_scores_df["Score"],
    showmeans=True
)

# Set title and labels
plt.title("Accuracy Scores for Naive Bayes Classifier \nOver 500 Seeds", fontsize=12, fontweight='bold')
plt.xlabel("Accuracy Score (%)", fontsize=10)  # Clarify that this is in percentage
plt.xlim(0, 100)  # Set x-axis limits to 0-100 for percentage scale

# Improve axis labels and ticks
plt.xticks(fontsize=12)
plt.tight_layout()  # Adjust layout to avoid clipping
plt.subplots_adjust(left=0.1)  # Adjust left margin for aesthetics

# Remove the legend, since it's not needed
plt.legend([],[], frameon=False)

# Show the plot
plt.show()

### 4.5 Data Exploration - Logistic Regression

#### Logit

Fit a logistic regression model using statsmodels to assess the significance of various features in predicting the likelihood of heart disease. The model evaluates multiple predictors, including both continuous and categorical variables. The coefficients, p-values, and confidence intervals are extracted and stored in a DataFrame.

In [None]:
# Convert DiagnosisYN to integer (binary)
df_combined["DiagnosisYNInt"] = df_combined["DiagnosisYN"].astype("int64")

# Define the logistic regression formula
formula = """
DiagnosisYNInt ~ 
Age + 
C(Sex) + 
C(ChestPain) + 
RestingBP + 
Chol + 
C(FastingBS) + 
C(RestingECG) + 
HeartRateMax + 
C(ExeAngina) + 
STDep + 
C(STSlope) + 
ColouredMV + 
C(Thalass)
"""

# Fit the logistic regression model
mod_2 = smf.logit(formula, data=df_combined)
res_2 = mod_2.fit()

# Display the summary of the model
print(res_2.summary())

# Extract coefficients, p-values, and confidence intervals
coefficients = res_2.params
p_values = res_2.pvalues
conf_int = res_2.conf_int()

# Create a DataFrame for results
results_df = pd.DataFrame({
    "Coefficient": coefficients,
    "P-Value": p_values,
    "Conf_Int_Lower": conf_int[0],
    "Conf_Int_Upper": conf_int[1],
})

The categorical features are transformed into binary groups (0 or 1) to facilitate the analysis. A logistic regression model is then fitted using these transformed features and continuous variables. The results, including coefficients, p-values, and confidence intervals, are displayed.

In [None]:
# Collapse multi-level categorical features into binary groups by merging category levels
df_combined["ChestPainGroup"] = np.where(df_combined["ChestPain"].isin([1, 2, 3]), 0, 1)
df_combined["ChestPainGroup"] = pd.Categorical(df_combined["ChestPainGroup"], categories=[0, 1], ordered=False)

df_combined["RestingECGGroup"] = np.where(df_combined["RestingECG"].isin([0]), 0, 1)
df_combined["RestingECGGroup"] = pd.Categorical(df_combined["RestingECGGroup"], categories=[0, 1], ordered=False)

df_combined["ThalassGroup"] = np.where(df_combined["Thalass"].isin([3, 6]), 0, 1)
df_combined["ThalassGroup"] = pd.Categorical(df_combined["ThalassGroup"], categories=[0, 1], ordered=False)

df_combined["ColouredMVGroup"] = np.where(df_combined["ColouredMV"].isin([0]), 0, 1)
df_combined["ColouredMVGroup"] = pd.Categorical(df_combined["ColouredMVGroup"], categories=[0, 1], ordered=False)

df_combined["STSlopeGroup"] = np.where(df_combined["STSlope"].isin([2]), 0, 1)
df_combined["STSlopeGroup"] = pd.Categorical(df_combined["STSlopeGroup"], categories=[0, 1], ordered=False)

# Define the logistic regression formula with transformed binary features
formula = """
DiagnosisYNInt ~ 
Age + 
C(Sex) + 
C(ChestPainGroup) + 
RestingBP + 
Chol + 
HeartRateMax + 
C(ExeAngina) + 
STDep + 
C(STSlopeGroup) + 
C(ColouredMVGroup) + 
C(ThalassGroup)
"""

# Fit the logistic regression model
mod_2 = smf.logit(formula, data=df_combined)
res_2 = mod_2.fit()

# Display the summary of the model
print(res_2.summary())

# Extract coefficients, p-values, and confidence intervals
coefficients = res_2.params
p_values = res_2.pvalues
conf_int = res_2.conf_int()

# Create a DataFrame for results
results_df = pd.DataFrame({
    "Coefficient": coefficients,
    "P-Value": p_values,
    "Conf_Int_Lower": conf_int[0],  # Lower bound of confidence interval
    "Conf_Int_Upper": conf_int[1],  # Upper bound of confidence interval
})
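
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. The sketch below shows the conversion on a small table laid out like `results_df`. The row names and all numbers in `demo_results` are hypothetical and are not taken from the fitted model.

```python
import numpy as np
import pandas as pd

# Hypothetical values laid out like results_df; not output from the fitted model
demo_results = pd.DataFrame(
    {
        "Coefficient": [0.05, 1.20, -0.80],
        "Conf_Int_Lower": [0.01, 0.40, -1.50],
        "Conf_Int_Upper": [0.09, 2.00, -0.10],
    },
    index=["x_continuous", "C(group)[T.1]", "C(flag)[T.1]"],
)

# exp() maps log-odds to odds ratios; the interval bounds transform the same way
odds_ratios = np.exp(demo_results).rename(columns={
    "Coefficient": "Odds_Ratio",
    "Conf_Int_Lower": "OR_Lower",
    "Conf_Int_Upper": "OR_Upper",
})

print(odds_ratios.round(3))
```

An odds ratio above 1 means the odds of the outcome increase with the predictor, and below 1 means they decrease. A confidence interval that excludes 1 corresponds to a coefficient interval that excludes 0.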

## 5. SVM Model

### 5.1 Creating Dummy Variables

In this section, we create dummy variables for the features that will be used in a Support Vector Classifier (SVC) model. This includes converting categorical variables into binary features using one-hot encoding, removing unnecessary columns, and converting the data types to ensure compatibility with the SVC model. The result is a cleaned and processed dataset ready for model training.

In [None]:
# Check the data types to see which columns require conversion
df_combined.dtypes

In [None]:

# Create dummy variables for categorical features and drop unnecessary columns
df_combined_svc = pd.get_dummies(df_combined, columns=[
    "ChestPain", 
    "RestingECG", 
    "STSlope", 
    "ColouredMVCat", 
    "Thalass"
], drop_first=False)

# Drop columns that are not needed for the model
df_combined_svc = df_combined_svc.drop(columns=[
    "SexMF",  # Dropped as we will treat Sex as binary (0/1)
    "DiagnosisYNInt",  # This column is redundant with DiagnosisYN
    "ChestPainGroup",  # Grouped versions of ChestPain not needed
    "RestingECGGroup",  # Similar to ChestPainGroup, not required
    "ThalassGroup",  # Grouped version of Thalass not required
    "ColouredMVGroup",  # Grouped version of ColouredMV not needed
    "STSlopeGroup",  # Grouped version of Slope not necessary
    "Diagnosis"  # Not used as a feature; the model target is DiagnosisYN
])

# Convert relevant columns to appropriate data types for SVC model
df_combined_svc = df_combined_svc.astype({
    "Age": "int64",
    "Sex": "bool",  # Sex is binary (0 or 1)
    "RestingBP": "int64",
    "Chol": "int64",
    "FastingBS": "bool",  # FastingBS is binary
    "HeartRateMax": "int64",
    "ExeAngina": "bool",  # ExeAngina is binary
    "STDep": "float64",
    "DiagnosisYN": "bool"  # Target variable is binary (0 or 1)
})

# Confirm the changes in data types
df_combined_svc.dtypes

In [None]:
# List of all features for the SVC model
all_svc_features = [
    'Age',
    'Sex',
    'ChestPain_1',
    'ChestPain_2',
    'ChestPain_3',
    'ChestPain_4',
    'RestingBP',
    'Chol',
    'FastingBS',
    'RestingECG_0',
    'RestingECG_1',
    'RestingECG_2',
    'HeartRateMax',
    'ExeAngina',
    'STDep',
    'STSlope_1',
    'STSlope_2',
    'STSlope_3',
    'ColouredMVCat_0',
    'ColouredMVCat_1',
    'ColouredMVCat_2',
    'ColouredMVCat_3',
    'Thalass_3',
    'Thalass_6',
    'Thalass_7',
]
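
The hard-coded list above depends on the exact column names that `pd.get_dummies` generates, which in turn depend on the values and dtype of each source column (a column stored as float would produce names such as `ChestPain_1.0`). As a sketch of an alternative, the feature list could be derived from the encoded DataFrame itself. The tiny `demo` frame below is made up for the illustration.

```python
import pandas as pd

# Made-up rows with the same kind of columns as df_combined
demo = pd.DataFrame({
    "Age": [54, 61, 47],
    "ChestPain": [1, 4, 2],
    "DiagnosisYN": [0, 1, 0],
})

# One-hot encode the categorical column; names follow the "<column>_<value>" pattern
encoded = pd.get_dummies(demo, columns=["ChestPain"], drop_first=False)

# Every column except the target is treated as a feature
feature_cols = [c for c in encoded.columns if c != "DiagnosisYN"]
print(feature_cols)
```

Only levels that occur in the data receive a dummy column for integer-coded columns, so comparing a derived list with the hard-coded one is a quick way to spot a missing or unexpected level.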

### 5.2 Linear vs. RBF vs. Poly // Unscaled

In this section, we compare the performance of three SVC kernel types (linear, RBF, and polynomial) without scaling the features. A helper function trains an SVC with the given kernel on the selected features and records the test accuracy for each seed value (from 1 to 100), so that the stability of the model's performance can be observed. The function is then run for each kernel and the results are collected for comparison.

In [None]:
# Define a function to train and evaluate the model for a given kernel type
def evaluate_svc_kernel(kernel_type, X, y, n_runs=100):
    accuracy_list = []
    for seed in range(1, n_runs + 1):
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
        
        # SVC model with the specified kernel
        model = SVC(kernel=kernel_type)
        
        # Train the model
        model.fit(X_train, y_train)
        
        # Predict the labels for the test set
        y_pred = model.predict(X_test)
       
        # Calculate the accuracy score
        accuracy_svc = accuracy_score(y_test, y_pred)
        
        # Append the accuracy score
        accuracy_list.append(accuracy_svc)
    
    # Return the accuracy list and average accuracy
    average_acc = mean(accuracy_list)
    return accuracy_list, average_acc

# Define features and target variable
X = df_combined_svc[all_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Evaluate each kernel and store results
kernels = ['linear', 'rbf', 'poly']
kernel_results = {}

for kernel in kernels:
    accuracy_list, avg_acc = evaluate_svc_kernel(kernel, X, y)
    kernel_results[kernel] = {
        'accuracy_list': accuracy_list,
        'average_accuracy': avg_acc
    }
    print(f"{kernel.capitalize()} Kernel Accuracy: {avg_acc * 100:.2f}%\n")

# Convert results into DataFrame for easy comparison
df_kernel_comparison = pd.DataFrame({
    kernel: results['accuracy_list'] for kernel, results in kernel_results.items()
})

df_kernel_comparison

In [None]:
# Melt the DataFrame to prepare for plotting
df_kernel_comparison_melted = pd.melt(df_kernel_comparison)
df_kernel_comparison_melted.columns = ["Kernel", "AccuracyScore"]

# Create the boxplot for visual comparison
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_kernel_comparison_melted, x="Kernel", y="AccuracyScore", palette="Set2", hue="Kernel", showmeans=True)
plt.title("Comparison of Accuracy Scores Based on Kernel Type (Unscaled SVC Model)", fontsize=14)
plt.xlabel("Kernel", fontsize=12)
plt.ylabel("Accuracy (proportion)", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()

# Show plot
plt.show()

### 5.3 Linear vs. RBF vs. Poly // Scaled

We now compare the performance of three different SVC kernel types (Linear, RBF, and Poly) when scaling the features.

In [None]:
# Define a function to train and evaluate the SVC model with a specified kernel type and scaling
def evaluate_svc_kernel_scaled(kernel_type, X, y, n_runs=100):
    accuracy_list = []
    for seed in range(1, n_runs + 1):
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

        # Standardize the features using StandardScaler
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Initialize and train the SVC model
        model = SVC(kernel=kernel_type)
        model.fit(X_train_scaled, y_train)
        
        # Predict the labels for the test set and calculate accuracy
        y_pred = model.predict(X_test_scaled)
        accuracy_svc = accuracy_score(y_test, y_pred)
        
        # Append the accuracy score
        accuracy_list.append(accuracy_svc)
    
    # Return the accuracy list and average accuracy
    average_acc = mean(accuracy_list)
    return accuracy_list, average_acc

# Define features and target variable
X = df_combined_svc[all_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Evaluate each kernel with scaling and store results
kernels = ['linear', 'rbf', 'poly']
kernel_results_scaled = {}

for kernel in kernels:
    accuracy_list, avg_acc = evaluate_svc_kernel_scaled(kernel, X, y)
    kernel_results_scaled[kernel] = {
        'accuracy_list': accuracy_list,
        'average_accuracy': avg_acc
    }
    print(f"{kernel.capitalize()} Kernel (Scaled) Accuracy: {avg_acc * 100:.2f}%\n")

# Convert results into DataFrame for easy comparison
df_kernel_comparison_scaled = pd.DataFrame({
    kernel: results['accuracy_list'] for kernel, results in kernel_results_scaled.items()
})

df_kernel_comparison_scaled

In [None]:
# Melt the DataFrame to prepare for plotting
df_kernel_comparison_scaled_melted = pd.melt(df_kernel_comparison_scaled)
df_kernel_comparison_scaled_melted.columns = ["Kernel", "AccuracyScore"]

# Create the boxplot for visual comparison
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_kernel_comparison_scaled_melted, x="Kernel", y="AccuracyScore", palette="Set2", hue="Kernel", showmeans=True)
plt.title("Comparison of Accuracy Scores Based on Kernel Type (Scaled SVC Model)", fontsize=14)
plt.xlabel("Kernel", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()

# Show plot
plt.show()

### 5.4 Linear vs. RBF vs. Poly // Scaled // Reduced Variables

We now compare the performance of three different SVC kernel types (Linear, RBF, and Poly) when scaling the features and using a reduced set of variables.

In [None]:
# Function to train and evaluate a scaled SVC model with a specified kernel
# (same procedure as in Section 5.3; it is applied below to the reduced feature set)
def evaluate_svc_kernel_scaled_reduced(kernel_type, X, y, n_runs=100):
    accuracy_list = []
    for seed in range(1, n_runs + 1):
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

        # Standardize the features using StandardScaler
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Initialize and train the SVC model
        model = SVC(kernel=kernel_type)
        model.fit(X_train_scaled, y_train)
        
        # Predict the labels for the test set and calculate accuracy
        y_pred = model.predict(X_test_scaled)
        accuracy_svc = accuracy_score(y_test, y_pred)
        
        # Append the accuracy score
        accuracy_list.append(accuracy_svc)
    
    # Return the accuracy list and average accuracy
    average_acc = mean(accuracy_list)
    return accuracy_list, average_acc

# Define the reduced features and target variable
reduced_svc_features = [
    'Age', 'Sex', 'ChestPain_2', 'ChestPain_4', 'RestingBP', 'Chol', 'HeartRateMax',
    'ExeAngina', 'STDep', 'STSlope_2', 'ColouredMVCat_0', 'ColouredMVCat_1', 'ColouredMVCat_2',
    'ColouredMVCat_3', 'Thalass_3', 'Thalass_7'
]

X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Evaluate each kernel with scaling and store results
kernels = ['linear', 'rbf', 'poly']
kernel_results_scaled_reduced = {}

for kernel in kernels:
    accuracy_list, avg_acc = evaluate_svc_kernel_scaled_reduced(kernel, X, y)
    kernel_results_scaled_reduced[kernel] = {
        'accuracy_list': accuracy_list,
        'average_accuracy': avg_acc
    }
    print(f"{kernel.capitalize()} Kernel (Scaled, Reduced) Accuracy: {avg_acc * 100:.2f}%\n")

# Convert results into DataFrame for easy comparison
df_kernel_comparison_scaled_reduced = pd.DataFrame({
    kernel: results['accuracy_list'] for kernel, results in kernel_results_scaled_reduced.items()
})

df_kernel_comparison_scaled_reduced

In [None]:
# Melt the DataFrame to prepare for plotting
df_kernel_comparison_scaled_reduced_melted = pd.melt(df_kernel_comparison_scaled_reduced)
df_kernel_comparison_scaled_reduced_melted.columns = ["Kernel", "AccuracyScore"]

# Create the boxplot for visual comparison
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_kernel_comparison_scaled_reduced_melted, x="Kernel", y="AccuracyScore", palette="Set2", hue="Kernel", showmeans=True)
plt.title("Comparison of Accuracy Scores Based on Kernel Type (Scaled SVC Model with Reduced Variables)", fontsize=14)
plt.xlabel("Kernel", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()

# Show plot
plt.show()

### 5.5 Linear vs. RBF vs. Poly // Scaled // Reduced Variables // Balanced

To account for any class imbalance, the SVC is trained with `class_weight='balanced'`, which weights each class inversely to its frequency in the training data so that errors on the less frequent class are penalized more heavily. The train/test splits themselves are generated exactly as in Section 5.4. This comparison shows how each kernel type performs with balanced class weights on the scaled, reduced feature set.
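
As a minimal sketch of what `class_weight='balanced'` (used in the cell below) does, the following example reproduces the weighting heuristic documented by scikit-learn, `n_samples / (n_classes * np.bincount(y))`, on a made-up label vector. The 6-to-4 class split is an assumption for illustration, not the class distribution of the dataset.

```python
import numpy as np

# Hypothetical labels: 6 negatives (0) and 4 positives (1)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

n_samples = len(y)
class_counts = np.bincount(y)   # number of samples per class
n_classes = len(class_counts)

# Heuristic documented for class_weight='balanced' in scikit-learn
balanced_weights = n_samples / (n_classes * class_counts)

for label, weight in enumerate(balanced_weights):
    print(f"class {label}: count={class_counts[label]}, weight={weight:.3f}")
```

The less frequent class receives the larger weight, so misclassifying it costs more during training. No samples are added or removed.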

In [None]:
# Function to evaluate SVC with balanced class weights and scaling
def evaluate_svc_kernel_balanced_scaled(kernel_type, X, y, n_runs=100):
    accuracy_list = []
    for seed in range(1, n_runs + 1):
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

        # Standardize the features using StandardScaler
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Initialize and train the SVC model with class weight balancing
        model = SVC(kernel=kernel_type, class_weight='balanced')
        model.fit(X_train_scaled, y_train)
        
        # Predict the labels for the test set and calculate accuracy
        y_pred = model.predict(X_test_scaled)
        accuracy_svc = accuracy_score(y_test, y_pred)
        
        # Append the accuracy score
        accuracy_list.append(accuracy_svc)
    
    # Return the accuracy list and average accuracy
    average_acc = mean(accuracy_list)
    return accuracy_list, average_acc

# Define the reduced features and target variable
reduced_svc_features = [
    'Age', 'Sex', 'ChestPain_2', 'ChestPain_4', 'RestingBP', 'Chol', 'HeartRateMax',
    'ExeAngina', 'STDep', 'STSlope_2', 'ColouredMVCat_0', 'ColouredMVCat_1', 'ColouredMVCat_2',
    'ColouredMVCat_3', 'Thalass_3', 'Thalass_7'
]

X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Evaluate each kernel with scaling and class balancing
kernels = ['linear', 'rbf', 'poly']
kernel_results_balanced_scaled = {}

for kernel in kernels:
    accuracy_list, avg_acc = evaluate_svc_kernel_balanced_scaled(kernel, X, y)
    kernel_results_balanced_scaled[kernel] = {
        'accuracy_list': accuracy_list,
        'average_accuracy': avg_acc
    }
    print(f"{kernel.capitalize()} Kernel (Balanced, Scaled) Accuracy: {avg_acc * 100:.2f}%\n")

# Convert results into DataFrame for easy comparison
df_kernel_comparison_balanced_scaled = pd.DataFrame({
    kernel: results['accuracy_list'] for kernel, results in kernel_results_balanced_scaled.items()
})

df_kernel_comparison_balanced_scaled

In [None]:
# Melt the DataFrame to prepare for plotting
df_kernel_comparison_balanced_scaled_melted = pd.melt(df_kernel_comparison_balanced_scaled)
df_kernel_comparison_balanced_scaled_melted.columns = ["Kernel", "AccuracyScore"]

# Create the boxplot for visual comparison
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_kernel_comparison_balanced_scaled_melted, x="Kernel", y="AccuracyScore", palette="Set2", hue="Kernel", showmeans=True)
plt.title("Comparison of Accuracy Scores Based on Kernel Type (Scaled SVC Model with Reduced Variables and Balanced Class Weight)", fontsize=14)
plt.xlabel("Kernel", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()

# Show plot
plt.show()

### 5.6 Linear vs. RBF vs. Poly // Scaled // Reduced Variables // StratKFold

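We now evaluate each kernel with Stratified K-Fold cross-validation. Every fold preserves the class proportions of the full dataset, and every sample is used for testing exactly once per run, so the accuracy estimate depends less on any single train/test split.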
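
The stratification property can be checked with a small sketch on made-up labels (60 negatives and 40 positives, an assumed split chosen only because it divides evenly into five folds). The placeholder feature matrix is hypothetical as well; only the labels influence how `StratifiedKFold` assigns rows to folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 60 negatives and 40 positives (40% positive overall)
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_positive_rates = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    rate = y[test_idx].mean()  # share of positives in this test fold
    fold_positive_rates.append(rate)
    print(f"Fold {fold}: test size={len(test_idx)}, positive rate={rate:.2f}")
```

Each test fold should show the same positive rate as the full label vector, which is the behavior the evaluation below relies on.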

In [None]:
# Function to evaluate SVC kernels with StratifiedKFold cross-validation
def evaluate_svc_kernel_stratified_kfold(kernel_type, X, y, n_splits=5, n_runs=100):
    accuracy_list = []
    
    # Create a pipeline for scaling and SVC model
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', SVC(kernel=kernel_type))
    ])
    
    for seed in range(1, n_runs + 1):
        # StratifiedKFold for preserving class distribution in splits
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

        # Perform cross-validation on the pipeline
        # (named fold_scores to avoid shadowing sklearn's accuracy_score function)
        fold_scores = cross_val_score(pipeline, X, y, cv=skf, scoring='accuracy')

        # Append the mean accuracy across the folds for this seed
        accuracy_list.append(fold_scores.mean())

    # Return the accuracy list and average accuracy
    average_acc = mean(accuracy_list)
    return accuracy_list, average_acc

# Define the reduced features and target variable
reduced_svc_features = [
    'Age', 'Sex', 'ChestPain_2', 'ChestPain_4', 'RestingBP', 'Chol', 'HeartRateMax',
    'ExeAngina', 'STDep', 'STSlope_2', 'ColouredMVCat_0', 'ColouredMVCat_1', 'ColouredMVCat_2',
    'ColouredMVCat_3', 'Thalass_3', 'Thalass_7'
]

X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Evaluate each kernel with StratifiedKFold cross-validation
kernels = ['linear', 'rbf', 'poly']
kernel_results_stratified_kfold = {}

for kernel in kernels:
    accuracy_list, avg_acc = evaluate_svc_kernel_stratified_kfold(kernel, X, y)
    kernel_results_stratified_kfold[kernel] = {
        'accuracy_list': accuracy_list,
        'average_accuracy': avg_acc
    }
    print(f"{kernel.capitalize()} Kernel (StratifiedKFold) Accuracy: {avg_acc * 100:.2f}%\n")

# Convert results into DataFrame for easy comparison
df_kernel_comparison_stratified_kfold = pd.DataFrame({
    kernel: results['accuracy_list'] for kernel, results in kernel_results_stratified_kfold.items()
})

df_kernel_comparison_stratified_kfold

In [None]:
# Melt the DataFrame to prepare for plotting
df_kernel_comparison_stratified_kfold_melted = pd.melt(df_kernel_comparison_stratified_kfold)
df_kernel_comparison_stratified_kfold_melted.columns = ["Kernel", "AccuracyScore"]

# Create the boxplot for visual comparison
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_kernel_comparison_stratified_kfold_melted, x="Kernel", y="AccuracyScore", palette="Set2", hue="Kernel", showmeans=True)
plt.title("Comparison of Accuracy Scores Based on Kernel Type, Scaling, Reduced Variables, and StratifiedKFold Validator", fontsize=14)
plt.xlabel("Kernel", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()

# Show plot
plt.show()

### 5.7 Linear vs. RBF vs. Poly // Scaled // Reduced Variables // StratShuffleSplit

The evaluation is also conducted using Stratified Shuffle Split cross-validation, which randomly splits the dataset into training and test sets while preserving the class distribution in each split.
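
A minimal sketch of how this differs from K-fold, using made-up labels (60 negatives and 40 positives, an assumed split): each `StratifiedShuffleSplit` test set is drawn independently, so the test sets taken together need not cover every row and may share rows between splits.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: 60 negatives and 40 positives
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # placeholder features; only the labels drive the split

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=1)

test_sets = []
for split, (train_idx, test_idx) in enumerate(sss.split(X, y), start=1):
    test_sets.append(set(test_idx))
    print(f"Split {split}: test size={len(test_idx)}, positive rate={y[test_idx].mean():.2f}")

# Count how many distinct rows were ever used for testing across the five splits
covered = set().union(*test_sets)
print(f"Distinct rows used for testing: {len(covered)} of {len(y)}")
```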

In [None]:
# Function to evaluate SVC kernels with StratifiedShuffleSplit cross-validation
def evaluate_svc_kernel_stratified_shufflesplit(kernel_type, X, y, n_splits=5, test_size=0.2, n_runs=100):
    accuracy_list = []
    
    # Create a pipeline for scaling and SVC model
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', SVC(kernel=kernel_type))
    ])
    
    for seed in range(1, n_runs + 1):
        # StratifiedShuffleSplit for preserving class distribution in splits
        sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)

        # Perform cross-validation on the pipeline
        # (named split_scores to avoid shadowing sklearn's accuracy_score function)
        split_scores = cross_val_score(pipeline, X, y, cv=sss, scoring='accuracy')

        # Append the mean accuracy across the splits for this seed
        accuracy_list.append(split_scores.mean())

    # Return the accuracy list and average accuracy
    average_acc = mean(accuracy_list)
    return accuracy_list, average_acc

# Define the reduced features and target variable
reduced_svc_features = [
    'Age', 'Sex', 'ChestPain_2', 'ChestPain_4', 'RestingBP', 'Chol', 'HeartRateMax',
    'ExeAngina', 'STDep', 'STSlope_2', 'ColouredMVCat_0', 'ColouredMVCat_1', 'ColouredMVCat_2',
    'ColouredMVCat_3', 'Thalass_3', 'Thalass_7'
]

X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Evaluate each kernel with StratifiedShuffleSplit cross-validation
kernels = ['linear', 'rbf', 'poly']
kernel_results_stratified_shufflesplit = {}

for kernel in kernels:
    accuracy_list, avg_acc = evaluate_svc_kernel_stratified_shufflesplit(kernel, X, y)
    kernel_results_stratified_shufflesplit[kernel] = {
        'accuracy_list': accuracy_list,
        'average_accuracy': avg_acc
    }
    print(f"{kernel.capitalize()} Kernel (StratifiedShuffleSplit) Accuracy: {avg_acc * 100:.2f}%\n")

# Convert results into DataFrame for easy comparison
df_kernel_comparison_stratified_shufflesplit = pd.DataFrame({
    kernel: results['accuracy_list'] for kernel, results in kernel_results_stratified_shufflesplit.items()
})

df_kernel_comparison_stratified_shufflesplit

In [None]:

# Melt the DataFrame to prepare for plotting
df_kernel_comparison_stratified_shufflesplit_melted = pd.melt(df_kernel_comparison_stratified_shufflesplit)
df_kernel_comparison_stratified_shufflesplit_melted.columns = ["Kernel", "AccuracyScore"]

# Create the boxplot for visual comparison
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_kernel_comparison_stratified_shufflesplit_melted, x="Kernel", y="AccuracyScore", palette="Set2", hue="Kernel", showmeans=True)
plt.title("Comparison of Accuracy Scores Based on Kernel Type, Scaling, Reduced Variables, and StratifiedShuffleSplit Validator", fontsize=14)
plt.xlabel("Kernel", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()

# Show plot
plt.show()

### 5.8 GridSearchCV

In this section, we use **GridSearchCV** to identify the optimal hyperparameters for an SVC model based on multiple performance metrics: accuracy, recall, precision, and F1 score. 
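
For reference, the four metrics can be illustrated on a small made-up set of labels. The values below are hypothetical, and the coding of 1 as the positive (disease) class is an assumption made for this sketch.

```python
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Hypothetical true labels and predictions (1 = positive class, 0 = negative class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Here: 3 true positives, 1 false negative, 1 false positive, 5 true negatives
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all samples
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, precision={precision:.2f}, f1={f1:.2f}")
```

Recall penalizes missed positive cases, while precision penalizes false alarms, which is why the grid search is repeated for each metric.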

The model is evaluated using cross-validation, and the best parameter combinations are found for each score. Additionally, various kernels, class weight options, regularization strengths, gamma values, and PCA component selections are explored through a systematic grid search. This method helps identify the most effective configuration for achieving the best scores.
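
To give a sense of the size of this search, the sketch below counts the candidate combinations using only the standard library. The grid values are copied from the search cells later in this section (the single-entry `scaler` list is left out because it does not change the count). The fold and seed counts match the 5-fold, 100-seed setup used there.

```python
from itertools import product

# Grid values as used in the search cells of this section
param_grid = {
    'svc__kernel': ['linear', 'rbf', 'poly'],
    'svc__class_weight': [None, 'balanced'],
    'svc__C': [0.05, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'pca__n_components': [None, 0.80, 0.90, 6, 7],
}

# A grid search evaluates every combination (the Cartesian product) of the listed values
n_candidates = len(list(product(*param_grid.values())))
n_folds = 5
n_seeds = 100

fits_per_seed = n_candidates * n_folds
print(f"Candidates per search: {n_candidates}")
print(f"Cross-validation fits per seed: {fits_per_seed}")
print(f"Cross-validation fits over {n_seeds} seeds: {fits_per_seed * n_seeds}")
```

Because all four metrics are scored in every search, the cross-validation table from a single run could in principle be re-sorted by each metric instead of repeating the search per metric. The notebook keeps the runs separate.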

In [None]:
results_dict = {}  # Initialize dictionary to store results

Define the function that runs the grid search:

In [None]:
# Function to run grid search and return results for different metrics
def run_grid_search(X, y, param_grid, score, seed):
    """
    Run GridSearchCV for the given parameters and return the best hyperparameters and scores.
    
    Args:
    X (DataFrame): Feature set for the model.
    y (Series): Target labels.
    param_grid (dict): Dictionary of hyperparameters for the grid search.
    score (str): The metric to optimize when refitting the model (e.g., 'accuracy', 'recall', 'precision', 'f1').
    seed (int): Random seed for reproducibility.
    
    Returns:
    tuple: A tuple containing the best parameters (dict) and a DataFrame with cross-validation results.
    """
    # Create the pipeline with scaling, PCA, and SVC
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA()),
        ('svc', SVC())
    ])
    
    # StratifiedKFold for cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    
    # Initialize GridSearchCV with cross-validation and scoring metrics
    grid_search = GridSearchCV(
        estimator=pipeline,
        param_grid=param_grid,
        cv=skf,
        scoring=['accuracy', 'recall', 'precision', 'f1'],
        refit=score,     # Refit the best model using the metric passed in `score`
        verbose=0,       # Set verbosity to 0 for minimal output
        n_jobs=-1        # Use all available CPU cores
    )
    
    # Fit the model with grid search
    grid_search.fit(X, y)
    
    # Get the best model from grid search
    best_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # Collect cross-validation results and extract relevant columns
    cv_results = pd.DataFrame(grid_search.cv_results_)
    cv_results_filtered = cv_results[[
        'param_svc__kernel', 
        'param_svc__C', 
        'param_svc__class_weight', 
        'param_svc__gamma', 
        'param_pca__n_components',
        'mean_test_accuracy',
        'mean_test_recall', 
        'mean_test_precision',
        'mean_test_f1'
    ]]
    
    # Rename columns for clarity
    cv_results_filtered.columns = ["kernel", "C", "weight", "gamma", "pca", "accuracy", "recall", "precision", "f1"]
    cv_results_filtered = cv_results_filtered.sort_values(by=score, ascending=False)
    # Return the best parameters and filtered cross-validation results
    return best_params, cv_results_filtered

Running grid search using **recall** as the score:

In [None]:
# Define features and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Define the parameter grid for grid search
param_grid = {
    'scaler': [StandardScaler()],
    'svc__kernel': ['linear', 'rbf', 'poly'],
    'svc__class_weight': [None, 'balanced'],
    'svc__C': [0.05, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'pca__n_components': [None, 0.80, 0.90, 6, 7]
}

#####################################
# Choose the score for best_params_ #
score = 'recall'
#####################################

# Choose the column for the final table to sort by
column = f'best_{score}'

# Name the dataframe that will be created
df_name = f"grid_results_{score}"

# Initialize a list to store results
results = []

# Loop over 100 random seeds for cross-validation
for seed in range(1, 101):
    # Print which seed is being used
    print(f"Running with random seed: {seed}")
    
    # Run the grid search and get the results
    best_params, cv_results_filtered = run_grid_search(X, y, param_grid, score, seed)

    # Ensure cv_results_filtered is not empty before accessing .iloc[0]
    if not cv_results_filtered.empty:
        top_row = cv_results_filtered.iloc[0]
    
        # Append the best results to the results list
        results.append({
            'seed': seed,
            'kernel': top_row['kernel'],
            'C': top_row['C'],
            'weight': top_row['weight'],
            'gamma': top_row['gamma'],
            'pca': top_row['pca'],
            'best_accuracy': top_row['accuracy'],
            'best_recall': top_row['recall'],
            'best_precision': top_row['precision'],
            'best_f1': top_row['f1']
        })

        # Print the best hyperparameters for the current seed
        print(f"Best {score}: {top_row[score]:.4f}, Kernel: {top_row['kernel']}, C: {top_row['C']}, Weight: {top_row['weight']}, Gamma: {top_row['gamma']}, PCA: {top_row['pca']}")
        print()
    else:
        print(f"Warning: No valid results for seed {seed}")
        print()

results_dict[score] = pd.DataFrame(results).sort_values(by=column, ascending=False)

# Display the ten highest-scoring seeds for this score
print(f"Displaying the top 10 results for {score}:")
display(results_dict[score].head(10))

Running grid search using **accuracy** as the score:

In [None]:
# Define features and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Define the parameter grid for grid search
param_grid = {
    'scaler': [StandardScaler()],
    'svc__kernel': ['linear', 'rbf', 'poly'],
    'svc__class_weight': [None, 'balanced'],
    'svc__C': [0.05, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'pca__n_components': [None, 0.80, 0.90, 6, 7]
}

#####################################
# Choose the score for best_params_ #
score = 'accuracy'
#####################################

# Choose the column for the final table to sort by
column = f'best_{score}'

# Name the dataframe that will be created
df_name = f"grid_results_{score}"

# Initialize a list to store results
results = []

# Loop over 100 random seeds for cross-validation
for seed in range(1, 101):
    # Print which seed is being used
    print(f"Running with random seed: {seed}")
    
    # Run the grid search and get the results
    best_params, cv_results_filtered = run_grid_search(X, y, param_grid, score, seed)

    # Ensure cv_results_filtered is not empty before accessing .iloc[0]
    if not cv_results_filtered.empty:
        top_row = cv_results_filtered.iloc[0]
    
        # Append the best results to the results list
        results.append({
            'seed': seed,
            'kernel': top_row['kernel'],
            'C': top_row['C'],
            'weight': top_row['weight'],
            'gamma': top_row['gamma'],
            'pca': top_row['pca'],
            'best_accuracy': top_row['accuracy'],
            'best_recall': top_row['recall'],
            'best_precision': top_row['precision'],
            'best_f1': top_row['f1']
        })

        # Print the best hyperparameters for the current seed
        print(f"Best {score}: {top_row[score]:.4f}, Kernel: {top_row['kernel']}, C: {top_row['C']}, Weight: {top_row['weight']}, Gamma: {top_row['gamma']}, PCA: {top_row['pca']}")
        print()
    else:
        print(f"Warning: No valid results for seed {seed}")
        print()

results_dict[score] = pd.DataFrame(results).sort_values(by=column, ascending=False)

# Display the ten highest-scoring seeds for this score
print(f"Displaying the top 10 results for {score}:")
display(results_dict[score].head(10))

Running grid search using **precision** as the score:

In [None]:
# Define features and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Define the parameter grid for grid search
param_grid = {
    'scaler': [StandardScaler()],
    'svc__kernel': ['linear', 'rbf', 'poly'],
    'svc__class_weight': [None, 'balanced'],
    'svc__C': [0.05, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'pca__n_components': [None, 0.80, 0.90, 6, 7]
}

#####################################
# Choose the score for best_params_ #
score = 'precision'
#####################################

# Choose the column for the final table to sort by
column = f'best_{score}'

# Name the dataframe that will be created
df_name = f"grid_results_{score}"

# Initialize a list to store results
results = []

# Loop over 100 random seeds for cross-validation
for seed in range(1, 101):
    # Print which seed is being used
    print(f"Running with random seed: {seed}")
    
    # Run the grid search and get the results
    best_params, cv_results_filtered = run_grid_search(X, y, param_grid, score, seed)

    # Ensure cv_results_filtered is not empty before accessing .iloc[0]
    if not cv_results_filtered.empty:
        top_row = cv_results_filtered.iloc[0]
    
        # Append the best results to the results list
        results.append({
            'seed': seed,
            'kernel': top_row['kernel'],
            'C': top_row['C'],
            'weight': top_row['weight'],
            'gamma': top_row['gamma'],
            'pca': top_row['pca'],
            'best_accuracy': top_row['accuracy'],
            'best_recall': top_row['recall'],
            'best_precision': top_row['precision'],
            'best_f1': top_row['f1']
        })

        # Print the best hyperparameters for the current seed
        print(f"Best {score}: {top_row[score]:.4f}, Kernel: {top_row['kernel']}, C: {top_row['C']}, Weight: {top_row['weight']}, Gamma: {top_row['gamma']}, PCA: {top_row['pca']}")
        print()
    else:
        print(f"Warning: No valid results for seed {seed}")
        print()

results_dict[score] = pd.DataFrame(results).sort_values(by=column, ascending=False)

# Display the ten highest-scoring seeds for this score
print(f"Displaying the top 10 results for {score}:")
display(results_dict[score].head(10))

Running grid search using **f1** as the score:

In [None]:
# Define features and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Define the parameter grid for grid search
param_grid = {
    'scaler': [StandardScaler()],
    'svc__kernel': ['linear', 'rbf', 'poly'],
    'svc__class_weight': [None, 'balanced'],
    'svc__C': [0.05, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'pca__n_components': [None, 0.80, 0.90, 6, 7]
}

#####################################
# Choose the score for best_params_ #
score = 'f1'
#####################################

# Choose the column for the final table to sort by
column = f'best_{score}'

# Name the dataframe that will be created
df_name = f"grid_results_{score}"

# Initialize a list to store results
results = []

# Loop over 100 random seeds for cross-validation
for seed in range(1, 101):
    # Print which seed is being used
    print(f"Running with random seed: {seed}")
    
    # Run the grid search and get the results
    best_params, cv_results_filtered = run_grid_search(X, y, param_grid, score, seed)

    # Ensure cv_results_filtered is not empty before accessing .iloc[0]
    if not cv_results_filtered.empty:
        top_row = cv_results_filtered.iloc[0]
    
        # Append the best results to the results list
        results.append({
            'seed': seed,
            'kernel': top_row['kernel'],
            'C': top_row['C'],
            'weight': top_row['weight'],
            'gamma': top_row['gamma'],
            'pca': top_row['pca'],
            'best_accuracy': top_row['accuracy'],
            'best_recall': top_row['recall'],
            'best_precision': top_row['precision'],
            'best_f1': top_row['f1']
        })

        # Print the best hyperparameters for the current seed
        print(f"Best {score}: {top_row[score]:.4f}, Kernel: {top_row['kernel']}, C: {top_row['C']}, Weight: {top_row['weight']}, Gamma: {top_row['gamma']}, PCA: {top_row['pca']}")
        print()
    else:
        print(f"Warning: No valid results for seed {seed}")
        print()

results_dict[score] = pd.DataFrame(results).sort_values(by=column, ascending=False)

# Display the ten highest-scoring seeds for this score
print(f"Displaying the top 10 results for {score}:")
display(results_dict[score].head(10))

## 6. Final Models

### 6.1 Final 1 // linear // weight balanced // C 10 // gamma scale
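
One detail about this configuration: `gamma` is a kernel coefficient for the RBF, polynomial, and sigmoid kernels, so it should have no effect on a linear-kernel SVC. The sketch below checks this on synthetic stand-in data generated with `make_classification` (an assumption for illustration, not the heart disease dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the heart disease dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Two linear-kernel models that differ only in their gamma setting
svc_scale = SVC(kernel='linear', C=10, gamma='scale').fit(X, y)
svc_auto = SVC(kernel='linear', C=10, gamma='auto').fit(X, y)

# Compare the decision values of the two fitted models on the same rows
same_decisions = np.allclose(svc_scale.decision_function(X), svc_auto.decision_function(X))
print("Identical decision values for gamma='scale' and gamma='auto':", same_decisions)
```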

In [None]:
# Define feature set and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Store results for each seed
results = []

# Loop through 500 different random seeds
for seed in range(1, 501):
    print(f"Running with seed: {seed}")
    
    # Define pipeline: StandardScaler + SVC with predefined parameters
    # (gamma is kept for completeness; the linear kernel does not use it)
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  
        ('svc', SVC(kernel='linear', class_weight='balanced', C=10, gamma='scale'))
    ])
    
    # Perform 5-fold stratified cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Evaluate model performance using multiple metrics
    scores = cross_validate(
        pipeline, X, y, cv=skf, 
        scoring=['accuracy', 'recall', 'precision', 'f1'],
        return_train_score=False
    )

    # Compute mean scores across folds
    mean_accuracy = scores['test_accuracy'].mean()
    mean_recall = scores['test_recall'].mean()
    mean_precision = scores['test_precision'].mean()
    mean_f1 = scores['test_f1'].mean()

    # Display key results
    print(f"Seed {seed} - Accuracy: {mean_accuracy:.4f}, Recall: {mean_recall:.4f}")

    # Append results to list
    results.append([seed, mean_accuracy, mean_recall, mean_precision, mean_f1])

# Convert results list to DataFrame
linear_10_balanced_scale_nopca = pd.DataFrame(results, columns=['Seed', 'Accuracy', 'Recall', 'Precision', 'F1'])

# Display final DataFrame
linear_10_balanced_scale_nopca

In [None]:
# Select only relevant performance metrics
linear_10_balanced_scale_nopca = linear_10_balanced_scale_nopca.filter(['Accuracy', 'Recall', 'Precision', 'F1'], axis=1)

# Compute mean values for each metric
linear_10_balanced_scale_nopca_means = linear_10_balanced_scale_nopca.mean()
print("Summary of Results:")
print(linear_10_balanced_scale_nopca_means)

# Reshape data for visualization
linear_10_balanced_scale_nopca_melted = pd.melt(linear_10_balanced_scale_nopca, var_name="Metric", value_name="Score")

# Create boxplot to visualize distribution of metric scores
plt.figure(figsize=(3, 4))
sns.boxplot(
    data=linear_10_balanced_scale_nopca_melted,
    x="Metric",
    y="Score",
    hue="Metric",
    palette="Set2",
    showmeans=True
)

# Set plot titles and labels
plt.title("Metric Scores for Final \nSVC Model Over 500 \nRandom States", fontsize=10, fontweight='bold')
plt.xlabel("Metric", fontsize=12)
plt.ylabel("Score", fontsize=10)
plt.ylim(0.7, 0.9)

# Adjust x-axis labels
plt.xticks(
    ticks=[0, 1, 2, 3], 
    labels=['Accuracy', 'Recall', 'Precision', 'F1'],
    fontsize=8
)

# Remove legend
plt.legend([], [], frameon=False)

# Adjust layout for better spacing
plt.tight_layout()
plt.subplots_adjust(left=0.25)

# Show the plot
plt.show()

### 6.2 Final 2 // poly // C 1 // PCA 6 // gamma auto

In [None]:
# Define feature set and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Store results for each seed
results = []

# Loop through 500 different random seeds
for seed in range(1, 501):
    print(f"Running with seed: {seed}")
    
    # Define pipeline: StandardScaler + PCA (6 components) + SVC with predefined parameters
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  
        ('pca', PCA(n_components=6)),  
        ('svc', SVC(kernel='poly', class_weight=None, C=1, gamma='auto'))
    ])
    
    # Perform 5-fold stratified cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Evaluate model performance using multiple metrics
    scores = cross_validate(
        pipeline, X, y, cv=skf, 
        scoring=['accuracy', 'recall', 'precision', 'f1'],
        return_train_score=False
    )

    # Compute mean scores across folds
    mean_accuracy = scores['test_accuracy'].mean()
    mean_recall = scores['test_recall'].mean()
    mean_precision = scores['test_precision'].mean()
    mean_f1 = scores['test_f1'].mean()

    # Display key results
    print(f"Seed {seed} - Accuracy: {mean_accuracy:.4f}, Recall: {mean_recall:.4f}")

    # Append results to list
    results.append([seed, mean_accuracy, mean_recall, mean_precision, mean_f1])

# Convert results list to DataFrame
poly_1_none_auto_pca6 = pd.DataFrame(results, columns=['Seed', 'Accuracy', 'Recall', 'Precision', 'F1'])

# Display final DataFrame
poly_1_none_auto_pca6

In [None]:
# Select only relevant performance metrics
poly_1_none_auto_pca6 = poly_1_none_auto_pca6.filter(['Accuracy', 'Recall', 'Precision', 'F1'], axis=1)

# Compute mean values for each metric
poly_1_none_auto_pca6_means = poly_1_none_auto_pca6.mean()
print("Summary of Results:")
print(poly_1_none_auto_pca6_means)

# Reshape data for visualization
poly_1_none_auto_pca6_melted = pd.melt(poly_1_none_auto_pca6, var_name="Metric", value_name="Score")

# Create boxplot to visualize distribution of metric scores
plt.figure(figsize=(3, 4))
sns.boxplot(
    data=poly_1_none_auto_pca6_melted,
    x="Metric",
    y="Score",
    hue="Metric",
    palette="Set2",
    showmeans=True
)

# Set plot titles and labels
plt.title("Metric Scores for Final \nSVC Model Over 500 \nRandom States", fontsize=10, fontweight='bold')
plt.xlabel("Metric", fontsize=12)
plt.ylabel("Score", fontsize=10)
plt.ylim(0.7, 0.9)

# Adjust x-axis labels
plt.xticks(
    ticks=[0, 1, 2, 3], 
    labels=['Accuracy', 'Recall', 'Precision', 'F1'],
    fontsize=8
)

# Remove legend
plt.legend([], [], frameon=False)

# Adjust layout for better spacing
plt.tight_layout()
plt.subplots_adjust(left=0.25)

# Show the plot
plt.show()

### 6.3 Final 3 // rbf // C 1 // gamma scale

In [None]:
# Define feature set and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Store results for each seed
results = []

# Loop through 500 different random seeds
for seed in range(1, 501):
    print(f"Running with seed: {seed}")
    
    # Define pipeline: StandardScaler + SVC with predefined parameters
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  
        ('svc', SVC(kernel='rbf', class_weight=None, C=1, gamma='scale'))
    ])
    
    # Perform 5-fold stratified cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Evaluate model performance using multiple metrics
    scores = cross_validate(
        pipeline, X, y, cv=skf, 
        scoring=['accuracy', 'recall', 'precision', 'f1'],
        return_train_score=False
    )

    # Compute mean scores across folds
    mean_accuracy = scores['test_accuracy'].mean()
    mean_recall = scores['test_recall'].mean()
    mean_precision = scores['test_precision'].mean()
    mean_f1 = scores['test_f1'].mean()

    # Display key results
    print(f"Seed {seed} - Accuracy: {mean_accuracy:.4f}, Recall: {mean_recall:.4f}")

    # Append results to list
    results.append([seed, mean_accuracy, mean_recall, mean_precision, mean_f1])

# Convert results list to DataFrame
rbf_1_scale_nopca = pd.DataFrame(results, columns=['Seed', 'Accuracy', 'Recall', 'Precision', 'F1'])

# Display final DataFrame
rbf_1_scale_nopca

In [None]:
# Select only relevant performance metrics
rbf_1_scale_nopca = rbf_1_scale_nopca.filter(['Accuracy', 'Recall', 'Precision', 'F1'], axis=1)

# Compute mean values for each metric
rbf_1_scale_nopca_means = rbf_1_scale_nopca.mean()
print("Summary of Results:")
print(rbf_1_scale_nopca_means)

# Reshape data for visualization
rbf_1_scale_nopca_melted = pd.melt(rbf_1_scale_nopca, var_name="Metric", value_name="Score")

# Create boxplot to visualize distribution of metric scores
plt.figure(figsize=(3, 4))
sns.boxplot(
    data=rbf_1_scale_nopca_melted,
    x="Metric",
    y="Score",
    hue="Metric",
    palette="Set2",
    showmeans=True
)

# Set plot titles and labels
plt.title("Metric Scores for Final \nSVC Model Over 500 \nRandom States", fontsize=10, fontweight='bold')
plt.xlabel("Metric", fontsize=12)
plt.ylabel("Score", fontsize=10)
plt.ylim(0.7, 0.9)

# Adjust x-axis labels
plt.xticks(
    ticks=[0, 1, 2, 3], 
    labels=['Accuracy', 'Recall', 'Precision', 'F1'],
    fontsize=8
)

# Remove legend
plt.legend([], [], frameon=False)

# Adjust layout for better spacing
plt.tight_layout()
plt.subplots_adjust(left=0.25)

# Show the plot
plt.show()


### 6.4 Final 4 // rbf // C 0.05 // PCA 6 // gamma auto // balanced

In [None]:
# Define feature set and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Store results for each seed
results = []

# Loop through 500 different random seeds
for seed in range(1, 501):
    print(f"Running with seed: {seed}")
    
    # Define pipeline: StandardScaler + PCA (6 components) + SVC with predefined parameters
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  
        ('pca', PCA(n_components=6)),  
        ('svc', SVC(kernel='rbf', class_weight='balanced', C=0.05, gamma='auto'))
    ])
    
    # Perform 5-fold stratified cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Evaluate model performance using multiple metrics
    scores = cross_validate(
        pipeline, X, y, cv=skf, 
        scoring=['accuracy', 'recall', 'precision', 'f1'],
        return_train_score=False
    )

    # Compute mean scores across folds
    mean_accuracy = scores['test_accuracy'].mean()
    mean_recall = scores['test_recall'].mean()
    mean_precision = scores['test_precision'].mean()
    mean_f1 = scores['test_f1'].mean()

    # Display key results
    print(f"Seed {seed} - Accuracy: {mean_accuracy:.4f}, Recall: {mean_recall:.4f}")

    # Append results to list
    results.append([seed, mean_accuracy, mean_recall, mean_precision, mean_f1])

# Convert results list to DataFrame
rbf_005_balanced_pca6_auto = pd.DataFrame(results, columns=['Seed', 'Accuracy', 'Recall', 'Precision', 'F1'])

# Display final DataFrame
rbf_005_balanced_pca6_auto


In [None]:
# Select only relevant performance metrics
rbf_005_balanced_pca6_auto = rbf_005_balanced_pca6_auto.filter(['Accuracy', 'Recall', 'Precision', 'F1'], axis=1)

# Compute mean values for each metric
rbf_005_balanced_pca6_auto_means = rbf_005_balanced_pca6_auto.mean()
print("Summary of Results:")
print(rbf_005_balanced_pca6_auto_means)

# Reshape data for visualization
rbf_005_balanced_pca6_auto_melted = pd.melt(rbf_005_balanced_pca6_auto, var_name="Metric", value_name="Score")

# Create boxplot to visualize distribution of metric scores
plt.figure(figsize=(3, 4))
sns.boxplot(
    data=rbf_005_balanced_pca6_auto_melted,
    x="Metric",
    y="Score",
    hue="Metric",
    palette="Set2",
    showmeans=True
)

# Set plot titles and labels
plt.title("Metric Scores for Final \nSVC Model Over 500 \nRandom States", fontsize=10, fontweight='bold')
plt.xlabel("Metric", fontsize=12)
plt.ylabel("Score", fontsize=10)
plt.ylim(0.65, 1.0)

# Adjust x-axis labels
plt.xticks(
    ticks=[0, 1, 2, 3], 
    labels=['Accuracy', 'Recall', 'Precision', 'F1'],
    fontsize=8
)

# Remove legend
plt.legend([], [], frameon=False)

# Adjust layout for better spacing
plt.tight_layout()
plt.subplots_adjust(left=0.25)

# Show the plot
plt.show()
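
The evaluation cells in sections 6.2 to 6.4 repeat the same seed loop and differ only in the pipeline they score. A minimal sketch of how that loop could be factored into a single helper is shown here. `evaluate_over_seeds` is a hypothetical name that does not appear in the notebook, and the data comes from `make_classification` as a stand-in for `df_combined_svc`, so the printed scores say nothing about the heart-disease models.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

METRICS = ['accuracy', 'recall', 'precision', 'f1']


def evaluate_over_seeds(pipeline, X, y, seeds, n_splits=5):
    """Hypothetical helper: mean cross-validated scores, one row per seed."""
    rows = []
    for seed in seeds:
        # Same splitter as the notebook cells: shuffled, stratified k-fold
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores = cross_validate(pipeline, X, y, cv=skf, scoring=METRICS)
        row = {'Seed': seed}
        for metric in METRICS:
            # 'accuracy' -> 'Accuracy', 'f1' -> 'F1'
            row[metric.capitalize()] = scores[f'test_{metric}'].mean()
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic stand-in data (assumption: 11 features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

rbf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', C=1, gamma='scale'))
])

# Three seeds keep the demo fast; the notebook uses range(1, 501)
demo_results = evaluate_over_seeds(rbf_pipeline, X_demo, y_demo, seeds=range(1, 4))
print(demo_results.drop(columns='Seed').mean())
```

With a helper like this, each "Final" model would need only its own pipeline definition, and the returned frame has the same columns the plotting cells expect.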


## 7. Confusion Matrix & ROC AUC

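Fit the top configuration (linear kernel, C = 10, balanced class weights) on an 80/20 train/test split with `random_state=42`, then plot the confusion matrix for the held-out test set.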

In [None]:
# Define feature set and target variable
X = df_combined_svc[reduced_svc_features]
y = df_combined_svc["DiagnosisYN"]

# Split dataset into training and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

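# Train SVC model with predefined parameters
# (gamma is ignored by the linear kernel; probability=True enables predict_proba for the ROC curve)
model = SVC(kernel="linear", class_weight='balanced', C=10, gamma='scale', probability=True)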
model.fit(X_train_scaled, y_train)

# Generate predictions on test set
y_pred = model.predict(X_test_scaled)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix as heatmap
plt.figure(figsize=(3, 3))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Negative', 'Positive'], 
            yticklabels=['Negative', 'Positive'], 
            cbar=False)

# Set plot titles and labels
plt.title('Confusion Matrix', fontsize=16, fontweight='bold')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)

# Adjust layout for better spacing
plt.subplots_adjust(left=0.2)

# Save and display confusion matrix
plt.savefig("images/Confusion_Matrix.png")
plt.show()

# Print classification report
print(classification_report(y_test, y_pred))
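
To make the link between the heatmap and the classification report explicit, the sketch below derives recall, precision, and accuracy directly from the four cells of a confusion matrix. The labels are a small invented example, not the notebook's test set, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (assumption): 4 true negatives-class cases, 6 true positive-class cases
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, sklearn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_demo = tp / (tp + fn)        # share of actual positives that were found
precision_demo = tp / (tp + fp)     # share of predicted positives that were correct
accuracy_demo = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Recall={recall_demo:.3f}, Precision={precision_demo:.3f}, Accuracy={accuracy_demo:.3f}")

# The hand-computed values agree with sklearn's scorers
assert np.isclose(recall_demo, recall_score(y_true_demo, y_pred_demo))
assert np.isclose(precision_demo, precision_score(y_true_demo, y_pred_demo))
```

In a screening context, the false-negative cell (bottom-left of the heatmap) is the one that recall penalises.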


Plot the ROC curve for the same fitted model and report the area under the curve (AUC).

In [None]:
# Predict probabilities for the test set
y_prob = model.predict_proba(X_test_scaled)[:, 1]

# Compute ROC curve values: false positive rate (FPR) and true positive rate (TPR)
fpr, tpr, _ = roc_curve(y_test, y_prob)

# Calculate the area under the ROC curve (AUC)
roc_auc = auc(fpr, tpr)
print(f"ROC AUC: {roc_auc:.4f}")

# Plot ROC Curve
plt.figure(figsize=(4, 3))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {roc_auc:.2f})')

# Add diagonal reference line (random classifier)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

# Set plot titles and labels
plt.title('ROC Curve', fontsize=12, fontweight='bold')
plt.xlabel('False Positive Rate', fontsize=10)
plt.ylabel('True Positive Rate', fontsize=10)

# Configure legend and grid
plt.legend(loc='lower right', fontsize=8)
plt.grid(True, linestyle='--', alpha=0.6)

# Save and display ROC curve
plt.savefig("images/ROC_Curve.png")
plt.show()
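
A ROC curve only needs a ranking score, so an alternative to `predict_proba` is the signed distance from the separating hyperplane returned by `decision_function`. This avoids `probability=True`, which makes scikit-learn fit an extra internally cross-validated calibration step. The sketch below is an assumption-laden illustration on synthetic data from `make_classification`, not a re-run of the notebook's model; all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Same kernel, C, and class weighting as the cell above, without probability=True
svc_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', class_weight='balanced', C=10))
])
svc_pipeline.fit(X_tr, y_tr)

# Signed margin: larger values mean "more confidently positive"
margin_scores = svc_pipeline.decision_function(X_te)

fpr_demo, tpr_demo, _ = roc_curve(y_te, margin_scores)
auc_from_curve = auc(fpr_demo, tpr_demo)
auc_direct = roc_auc_score(y_te, margin_scores)

print(f"AUC from curve: {auc_from_curve:.4f}, AUC from roc_auc_score: {auc_direct:.4f}")
```

Wrapping the scaler and classifier in a `Pipeline` also keeps the scaling step fitted on the training split only, as the manual `fit_transform` / `transform` calls above do.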


## 8. Reload dataset after removing columns

### 8.1 Processing

In [None]:
# Load the four processed UCI files into DataFrames using shared read options
read_options = dict(na_values=[" ", "?", "NA"], encoding="ISO-8859-1", header=None, delimiter=",")
df_cleveland_new = pd.read_csv("processed.cleveland.data", **read_options)
df_hungarian_new = pd.read_csv("processed.hungarian.data", **read_options)
df_switzerland_new = pd.read_csv("processed.switzerland.data", **read_options)
df_longbeach_new = pd.read_csv("processed.va.data", **read_options)

# Creating new dictionary of names and dataframes
df_all_new_datasets_dict = {
    "Cleveland": df_cleveland_new, 
    "Hungarian": df_hungarian_new,
    "Switzerland": df_switzerland_new, 
    "Longbeach": df_longbeach_new
}

# Creating new column name list and updating names in each dataset
new_column_names = [
    "Age",
    "Sex",
    "ChestPain",
    "RestingBP",
    "Chol",
    "FastingBS",
    "RestingECG",
    "HeartRateMax",
    "ExeAngina",
    "STDep",
    "STSlope",
    "ColouredMV",
    "Thalass",
    "Diagnosis"
]
for name, frame in df_all_new_datasets_dict.items():
    frame.columns = new_column_names

# Dropping columns from datasets
for name, frame in df_all_new_datasets_dict.items():   
    frame.drop(columns=[
        "FastingBS",
        "RestingECG",
        "STSlope",
        "ColouredMV"
    ], inplace=True)

In [None]:
# Summarise null values in each column for all datasets
null_summary_list = [
    pd.DataFrame(frame.isna().sum(), columns=[name]) 
    for name, frame in df_all_new_datasets_dict.items()
]

# Combine null value summaries into a single DataFrame
df_null_summary = pd.concat(null_summary_list, axis=1)

# Save null summary to CSV for further analysis
df_null_summary.to_csv("UCI_location_new_nulls.csv", index=True)

# Display null counts for selected key columns across all datasets
df_null_summary.loc[
    ["Age", "Sex", "ChestPain", "RestingBP", "Chol", 
     "HeartRateMax", "ExeAngina", "STDep", "Thalass", "Diagnosis"]
].T

In [None]:
# Add 'Location' column to each dataset
for name, frame in df_all_new_datasets_dict.items():
    frame["Location"] = name  # Assign dataset name as location

# Combine all datasets into a single DataFrame
df_combined_again = pd.concat(df_all_new_datasets_dict.values(), ignore_index=True)
print("Shape before dropping NAs:", df_combined_again.shape)

# Drop rows with missing values
df_combined_again = df_combined_again.dropna()
print("Shape after dropping NAs:", df_combined_again.shape)
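
The effect of `na_values` followed by `dropna()` can be seen on a tiny in-memory file. This is a hedged sketch with three invented rows and four of the column names used above; it is not an excerpt from the UCI files.

```python
import io
import pandas as pd

# Three made-up rows (assumption); "?" marks a missing value
raw_text = "63,1,145,?\n67,1,?,3\n41,0,130,0\n"

df_demo = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    na_values=[" ", "?", "NA"]
)
df_demo.columns = ["Age", "Sex", "RestingBP", "Diagnosis"]

# "?" entries are now NaN, so they show up in the null counts
print(df_demo.isna().sum())

# Columns that contain NaN are read as float64, not int64
print(df_demo.dtypes)

# dropna() removes every row with at least one missing value
df_clean = df_demo.dropna()
print("Rows before:", len(df_demo), "Rows after:", len(df_clean))
```

Because a single missing field discards the whole row, dropping sparse columns before calling `dropna()` (as done in 8.1) retains more rows.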

In [None]:
# Count remaining rows for each dataset after dropping NAs
location_count_list = [
    (name, df_combined_again[df_combined_again["Location"] == name].shape[0])
    for name in df_all_new_datasets_dict.keys()
]

# Create a DataFrame to display counts per dataset
df_location_count = pd.DataFrame(location_count_list, columns=["DataFrame", "NoNulls"]).style.hide(axis="index")
df_location_count


### 8.2 Making Dummies

In [None]:
# Apply one-hot encoding to categorical features "ChestPain" and "Thalass"
df_combined_again = pd.get_dummies(
    df_combined_again, 
    columns=["ChestPain", "Thalass"], 
    drop_first=False  # Keep all categories to retain full information
)

# Display the data types of all columns after encoding
df_combined_again.dtypes
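
The dummy column names depend on the dtype of the encoded column: float categories produce suffixes such as `_2.0`, which is the form the feature list later in this section uses. A minimal sketch with invented values:

```python
import pandas as pd

# Made-up category codes stored as floats (assumption)
df_demo = pd.DataFrame({"ChestPain": [1.0, 2.0, 4.0, 2.0]})

dummies_demo = pd.get_dummies(df_demo, columns=["ChestPain"], drop_first=False)

# One indicator column per observed category, named <column>_<value>
print(list(dummies_demo.columns))
```

Only categories that actually occur in the data get a column, so a feature list written by hand should be checked against `df.columns` after encoding.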


In [None]:
# Convert "Diagnosis" into a binary outcome: 
# 1 if Diagnosis is in {1, 2, 3, 4} (indicating heart disease), else 0 (no heart disease)
df_combined_again["DiagnosisYN"] = np.where(
    df_combined_again["Diagnosis"].isin([1, 2, 3, 4]), 1, 0
)

# Convert "DiagnosisYN" to a categorical type with explicit categories
df_combined_again["DiagnosisYN"] = pd.Categorical(
    df_combined_again["DiagnosisYN"], categories=[0, 1], ordered=False
)

# Ensure appropriate data types for numerical and boolean columns
df_combined_again = df_combined_again.astype({
    "Age": "int64",
    "Sex": "bool",               # Convert Sex to boolean (True = male, False = female)
    "RestingBP": "int64",
    "Chol": "int64",
    "HeartRateMax": "int64",
    "ExeAngina": "bool",         # Convert Exercise Angina to boolean
    "STDep": "float64",
    "DiagnosisYN": "bool"        # Ensure DiagnosisYN is boolean
})

# Display updated data types
df_combined_again.dtypes

In [None]:
# Define selected features for the SVC model
reduced_svc_features = [
    "Age",
    "Sex",
    "ChestPain_2.0",
    "ChestPain_4.0",
    "RestingBP",
    "Chol",
    "HeartRateMax",
    "ExeAngina",
    "STDep",
    "Thalass_3.0",
    "Thalass_7.0",
]

# Define feature matrix (X) and target variable (y)
X = df_combined_again[reduced_svc_features]  # Subset dataset using selected features
y = df_combined_again["DiagnosisYN"]         # Target variable (heart disease diagnosis)


### 8.3 GridSearchCV with reduced variables

In [None]:
# Define features and target variable
X = df_combined_again[reduced_svc_features]
y = df_combined_again["DiagnosisYN"]

# Define the parameter grid for grid search
param_grid = {
    'scaler': [StandardScaler()],
    'svc__kernel': ['linear', 'rbf', 'poly'],
    'svc__class_weight': [None, 'balanced'],
    'svc__C': [0.05, 0.1, 1, 10],
    'svc__gamma': ['scale', 'auto'],
    'pca__n_components': [None, 0.80, 0.90, 6, 7]
}

#####################################
# Choose the score for best_params_ #
score = 'accuracy'
#####################################

# Choose the column for the final table to sort by
column = f'best_{score}'

# Name the dataframe that will be created
df_name = f"grid_results_{score}"

# Initialize a list to store results
results = []

# Loop over 100 random seeds for cross-validation
for seed in range(1, 101):
    # Print which seed is being used
    print(f"Running with random seed: {seed}")
    
    # Run the grid search and get the results
    best_params, cv_results_filtered = run_grid_search(X, y, param_grid, score, seed)

    # Ensure cv_results_filtered is not empty before accessing .iloc[0]
    if not cv_results_filtered.empty:
        top_row = cv_results_filtered.iloc[0]
    
        # Append the best results to the results list
        results.append({
            'seed': seed,
            'kernel': top_row['kernel'],
            'C': top_row['C'],
            'weight': top_row['weight'],
            'gamma': top_row['gamma'],
            'pca': top_row['pca'],
            'best_accuracy': top_row['accuracy'],
            'best_recall': top_row['recall'],
            'best_precision': top_row['precision'],
            'best_f1': top_row['f1']
        })

        # Print the best hyperparameters for the current seed
        print(f"Best {score}: {top_row[score]:.4f}, Kernel: {top_row['kernel']}, C: {top_row['C']}, Weight: {top_row['weight']}, Gamma: {top_row['gamma']}, PCA: {top_row['pca']}")
        print()
    else:
        print(f"Warning: No valid results for seed {seed}")
        print()

results_dict[score] = pd.DataFrame(results).sort_values(by=column, ascending=False)

# Display the ten best per-seed results for this score
print(f"Displaying top 10 results for {score}:")
display(results_dict[score].head(10))
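
Sorting by the best score shows the strongest single seeds, but it does not show which configuration wins most often. One way to summarise that is to group the per-seed winners by their hyperparameters. The sketch below uses a small frame of placeholder values with the same column names as the results above; the numbers are invented for illustration and are not results from this notebook.

```python
import pandas as pd

# Placeholder per-seed winners (assumption: values are illustrative only)
demo_winners = pd.DataFrame({
    'seed': [1, 2, 3, 4, 5],
    'kernel': ['rbf', 'rbf', 'linear', 'rbf', 'linear'],
    'C': [1, 1, 10, 1, 10],
    'best_accuracy': [0.80, 0.82, 0.81, 0.79, 0.83],
})

# Count how often each configuration was the per-seed winner
config_summary = (
    demo_winners
    .groupby(['kernel', 'C'])
    .agg(wins=('seed', 'count'), mean_accuracy=('best_accuracy', 'mean'))
    .sort_values('wins', ascending=False)
    .reset_index()
)

print(config_summary)
```

The same grouping could be applied to `results_dict[score]` with `weight`, `gamma`, and `pca` added to the group keys.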

### 8.4 Supplementary Model

In [None]:
# Define feature set and target variable
X = df_combined_again[reduced_svc_features]
y = df_combined_again["DiagnosisYN"]

# Store results for each seed
results = []

# Loop through 500 different random seeds
for seed in range(1, 501):
    print(f"Running with seed: {seed}")
    
    # Define pipeline: StandardScaler + SVC with predefined parameters (PCA step disabled)
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  
        #('pca', PCA(n_components=6)),  
        ('svc', SVC(kernel='rbf', class_weight=None, C=0.05, gamma='scale'))
    ])
    
    # Perform 5-fold stratified cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Evaluate model performance using multiple metrics
    scores = cross_validate(
        pipeline, X, y, cv=skf, 
        scoring=['accuracy', 'recall', 'precision', 'f1'],
        return_train_score=False
    )

    # Compute mean scores across folds
    mean_accuracy = scores['test_accuracy'].mean()
    mean_recall = scores['test_recall'].mean()
    mean_precision = scores['test_precision'].mean()
    mean_f1 = scores['test_f1'].mean()

    # Display key results
    print(f"Seed {seed} - Accuracy: {mean_accuracy:.4f}, Recall: {mean_recall:.4f}")

    # Append results to list
    results.append([seed, mean_accuracy, mean_recall, mean_precision, mean_f1])

# Convert results list to DataFrame
new_model = pd.DataFrame(results, columns=['Seed', 'Accuracy', 'Recall', 'Precision', 'F1'])

# Display final DataFrame
new_model


In [None]:
# Select only relevant performance metrics
new_model = new_model.filter(['Accuracy', 'Recall', 'Precision', 'F1'], axis=1)

# Compute mean values for each metric
new_model_means = new_model.mean()
print("Summary of Results:")
print(new_model_means)

# Reshape data for visualization
new_model_melted = pd.melt(new_model, var_name="Metric", value_name="Score")

# Create boxplot to visualize distribution of metric scores
plt.figure(figsize=(3, 4))
sns.boxplot(
    data=new_model_melted,
    x="Metric",
    y="Score",
    hue="Metric",
    palette="Set2",
    showmeans=True
)

# Set plot titles and labels
plt.title("Metric Scores for Final \nSVC Model Over 500 \nRandom States", fontsize=10, fontweight='bold')
plt.xlabel("Metric", fontsize=12)
plt.ylabel("Score", fontsize=10)
plt.ylim(0.65, 1.0)

# Adjust x-axis labels
plt.xticks(
    ticks=[0, 1, 2, 3], 
    labels=['Accuracy', 'Recall', 'Precision', 'F1'],
    fontsize=8
)

# Remove legend
plt.legend([], [], frameon=False)

# Adjust layout for better spacing
plt.tight_layout()
plt.subplots_adjust(left=0.25)

# Show the plot
plt.show()
