<img src="https://www.endocrine.org/-/media/endocrine/images/patient-engagement-webpage/condition-page-images/cardiovascular-disease/cardio_disease_t2d_pe_1796x943.jpg" alt="Alternative text" />

# ❤️ Topic: Cardiovascular Disease (CVD)

Dataset taken from: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

---

## ✨ Introduction:
### 💡 What is Cardiovascular Disease?
Cardiovascular disease (CVD) is a general term for diseases of the heart or blood vessels. It covers a set of conditions that can be diagnosed when blood flow to the heart, brain or body is reduced by blood clots (thrombosis) or by a build-up of fatty deposits inside an artery, which usually causes the artery to harden and narrow.

### 4️⃣ Main Types of CVD:
1. Coronary Heart Disease
2. Stroke
3. Peripheral Arterial Disease
4. Aortic Disease

### 📈 Risk Factors:

Risk factors for CVD include smoking, high blood pressure, high cholesterol levels, obesity, diabetes, and a family history of the disease. Lifestyle changes, such as maintaining a healthy diet and exercise regimen, quitting smoking, and managing these risk factors, can help prevent CVD. Additionally, medications and medical procedures may be recommended to treat or prevent CVD in certain cases.

### 📊 Dataset:
The Dataset that we have chosen is taken from Kaggle and contains data of a group of 70,000 patients, with CVD present in some of them. The data depicts certain factors and characteristics of these patients, some of which may have contributed to their CVD. 

We will explore this Dataset in this project.

---

## ❗Problem Description:

The aim of this project is to apply different Machine Learning models to the cardiovascular dataset, compare their accuracy, and determine which is the most accurate and suitable for predicting the presence of cardiovascular disease in a person.

We will do this by first applying Exploratory Data Analysis techniques to clean and prepare the dataset obtained from Kaggle, and then passing it through the Machine Learning models.

---
## 📖 Table of Contents:

### ⬇️ 1.0 Importing Essential Libraries & Dataset
1.1 Importing Essential Functions   
1.2 Describing the Variables

---

### 🔎 2.0 Variable Analysis
2.1 Numerical Variable Analysis   
2.2 Categorical Variable Analysis

---

### 🗺️ 3.0 Exploratory Data Analysis:
3.1 Cleaning of Data and Removal of Outliers   
3.2 Cleaned Variable Conclusion

---

### 🤖 4.0 Machine Learning Models:
4.1 XGBoost    
4.2 XGBoost Analysis and Conclusion    
4.3 Logistic Regression    
4.4 Logistic Regression Analysis and Conclusion  

---

### 🖊️ 5.0 Conclusion:
5.1 Comparing XGBoost and Logistic Regression    
5.2 Epilogue and Conclusion    
5.3 Models Used    
5.4 What did we learn from this project?    
5.5 References    

---
## ⬇️1.0 Importing Essential Libraries & Dataset

In [None]:
# General purpose
# import pickle
# import requests
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
# from scipy.stats import shapiro
# from scikitplot import metrics as mt

# Hypothesis tests (used by hipo_test and point_bi_corr)
from scipy import stats
from scipy.stats import f_oneway
from scipy.stats import ttest_ind


# Pre-processing
# from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin  # base classes for the custom transformers
from sklearn.compose import ColumnTransformer
# from sklearn.feature_selection import VarianceThreshold
from imblearn.combine import SMOTETomek
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder

# Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from xgboost import XGBClassifier


# Evaluation
from sklearn.metrics import classification_report, accuracy_score, cohen_kappa_score, precision_score, f1_score, recall_score
# from yellowbrick.classifier.threshold import discrimination_threshold

df = pd.read_excel("cardio_train.xlsx")
df = df.reset_index(drop=True)
df

In [None]:
df.describe()

In [None]:
df.info()

---
## 1.1 Importing essential functions: 

In [None]:
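def remove_outliers(df):
    """Drop rows with any value outside 1.5 * IQR of its column and print the outlier count per column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # True where a value lies outside [q1 - 1.5 * iqr, q3 + 1.5 * iqr]
    is_outlier = (df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))
    df_out = df[~is_outlier.any(axis=1)]
    print(is_outlier.sum())
    return df_out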
    

# Helper Functions
def balanced_target(target, dataset, hue=None):
    """
    Plot the class counts of a column to check how balanced it is.

    :target:  Name of the column to count (plotted on the x-axis).
    :dataset: A DataFrame object.
    :hue:     Optional name of a column used to split each bar.
    """
    sb.set(style='darkgrid', palette='Accent')
    ax = sb.countplot(x=target, hue=hue, data=dataset)
    ax.figure.set_size_inches(10, 6)
    ax.set_title('Cardio Distribution', fontsize=18, loc='left')
    ax.set_xlabel(target, fontsize=14)
    ax.set_ylabel('Count', fontsize=14)


def univariate_analysis(target, df):
    """
    Plot one histogram per column of df, coloured by the target column.

    target: name of the column used as hue
    df: DataFrame
    """
    for col in df.columns.to_list():

        fig = sb.displot(x=col, hue=target, data=df, kind='hist')
        fig.set_titles(f'{col}\n distribution', fontsize=16)
        fig.set_axis_labels(col, fontsize=14)


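def multi_histogram(data: pd.DataFrame, variables: list) -> None:
    """
    Plot a histogram with a density curve for each numerical variable.

    data: DataFrame

    variables: list of numerical variables (at most 9, one per cell of the 3x3 grid)
    """

    # set the initial plot position
    plt.figure(figsize=(18, 10))
    n = 1
    for column in data[variables].columns:
        plt.subplot(3, 3, n)
        # histplot replaces distplot, which is deprecated in recent seaborn releases
        _ = sb.histplot(x=data[column], bins=50, kde=True, stat='density')
        n += 1

    plt.subplots_adjust(hspace=0.3)

    plt.show()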



def multi_boxplot(data: pd.DataFrame, variables: list) -> None:

    """
    Function to check for outliers visually through a boxplot

    data: DataFrame

    variables: list of numerical variables
    """

    # set the initial plot position
    plt.figure(figsize=(18, 10))
    n = 1
    for column in data[variables].columns:
        plt.subplot(3, 3, n)
        _ = sb.boxplot(x=column, data=data)
        n += 1

    plt.subplots_adjust(hspace=0.3)

    plt.show()


def hipo_test(*samples):
    """
    Test whether the sample means differ: t-test for two samples,
    one-way ANOVA for three or more.
    """
    if len(samples) < 2:
        raise ValueError("At least two samples must be provided!")

    if len(samples) == 2:
        stat, p = ttest_ind(*samples)
    else:
        stat, p = f_oneway(*samples)

    print(f'The p-value is: {p}')
    if p < 0.05:
        print('There is likely a difference')
    else:
        print('There is likely no difference')

    return stat, p


def point_bi_corr(a, b):

    """
    Plot a point-biserial correlation heatmap for a binary and a continuous variable.
    Credits: Bruno Santos - Comunidade DS

    :a: single-column DataFrame with the binary variable
    :b: single-column DataFrame with the continuous variable
    """

    # Column names are used as the heatmap labels
    a_name, b_name = a.columns[0], b.columns[0]

    # Flatten both inputs to 1-D arrays
    a_values = a.values.reshape(-1)
    b_values = b.values.reshape(-1)

    # apply scipy's point-biserial (coefficient and p-value)
    stats.pointbiserialr(a_values, b_values)

    # correlation coefficient array (2 x 2)
    c = np.corrcoef(a_values, b_values)

    # dataframe for heatmap
    df = pd.DataFrame(c, columns=[a_name, b_name], index=[a_name, b_name])

    # return heatmap
    ax = sb.heatmap(df, annot=True)
    ax.set_title('{} x {} correlation heatmap'.format(a_name, b_name))
    return ax


def change_threshold_lgbm(X, y, model, n_splits, thresh):
    """Evaluate an already fitted model at a custom probability threshold on each CV fold."""

    # cross-validation
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)

    acc = []
    kappa = []
    recall = []
    for linhas_treino, linhas_valid in skf.split(X, y):

        X_treino, X_valid = X.iloc[linhas_treino], X.iloc[linhas_valid]
        y_treino, y_valid = y.iloc[linhas_treino], y.iloc[linhas_valid]

        pred_prob = model.predict_proba(X_valid)

        # Predict class 1 whenever its probability reaches the threshold
        y_pred = (pred_prob[:, 1] >= thresh).astype(int)

        Acc = accuracy_score(y_valid, y_pred)
        Kappa = cohen_kappa_score(y_valid, y_pred)
        Recall = recall_score(y_valid, y_pred)
        acc.append(Acc)
        kappa.append(Kappa)
        recall.append(Recall)

    print('####### Business Metrics #######')
    print('\n')
    acc_inc = np.mean(acc) - 0.50
    prc_inc = round((acc_inc/0.05)*500, 2)
    print(f'Increased precision: {round(acc_inc,2)}')
    print(f'Price Increased in: {prc_inc}')
    print(f'Percentage of price increase: {round(prc_inc/500,2)}')
    print('\n')

    # Classification report for the last fold
    print('####### Machine Learning Metrics #######\n')
    print(classification_report(y_valid, y_pred, digits=2))

    # Confusion matrix for the last fold, drawn with scikit-learn
    from sklearn.metrics import ConfusionMatrixDisplay
    _, ax = plt.subplots(figsize=(10, 8))
    ConfusionMatrixDisplay.from_predictions(y_valid, y_pred, ax=ax)

    return y_pred


def change_threshold_lr(X, y, model, n_splits, thresh):
    # cross-validation
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1)

    acc = []
    kappa = []
    recall = []
    for linhas_treino, linhas_valid in skf.split(X, y):

        X_treino, X_valid = X.iloc[linhas_treino], X.iloc[linhas_valid]
        y_treino, y_valid = y.iloc[linhas_treino], y.iloc[linhas_valid]

        y_scores_final = model.decision_function(X_valid)
        y_pred_recall = (y_scores_final > thresh)

        Acc = accuracy_score(y_valid, y_pred_recall)
        Kappa =  cohen_kappa_score(y_valid, y_pred_recall)
        Recall = recall_score(y_valid, y_pred_recall)
        acc.append(Acc)
        kappa.append(Kappa)
        recall.append(Recall)

    print('####### Business Metrics #######\n')

    acc_inc = np.mean(acc) - 0.50
    prc_inc = round((acc_inc/0.05)*500, 2)
    print(f'Increased precision: {round(acc_inc,2)}')
    print(f'Price Increased in: {prc_inc}')
    print(f'Percentage of price increase: {round(prc_inc/500,2)}')
    print('\n')

    print('####### Machine Learning Metrics #######\n')
    print(f'New kappa: {cohen_kappa_score(y_valid,y_pred_recall)}\n')
    print(classification_report(y_valid, y_pred_recall, digits=2))


    return y_pred_recall


################################################# Custom Transformers ###########################################################

class PreProcessingTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        Xtemp = X.copy()

        # Height
        index_height = Xtemp.loc[Xtemp['height'] > 230, ['height']].index
        Xtemp.drop(index_height, inplace=True)
        index_height1 = Xtemp.loc[Xtemp['height'] < 112, ['height']].index
        Xtemp.drop(index_height1, inplace=True)

        # Weight
        index_weight = Xtemp.loc[Xtemp['weight'] < 40, ['weight']].index
        Xtemp.drop(index_weight, inplace=True)

        # ap_hi
        index_ap_hi = Xtemp.loc[Xtemp['ap_hi'] < 10, ['ap_hi']].index
        Xtemp.drop(index_ap_hi, inplace=True)

        # ap_lo
        index_ap_lo = Xtemp.loc[Xtemp['ap_lo'] < 5, ['ap_lo']].index
        Xtemp.drop(index_ap_lo, inplace=True)

        # SMOTE + TOMEKLINK
        X = Xtemp.drop('cardio', axis=1)
        y = Xtemp['cardio']

        smt = SMOTETomek(random_state=42)
        Xres, yres = smt.fit_resample(X, y)
        Xtemp = pd.concat([Xres, yres], axis=1)

        return Xtemp


class FeatureEngineeringTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        Xtemp = X.copy()

        # Cluster based var
        kmeans = KMeans(n_clusters=2, init='k-means++',n_init=20, random_state=0).fit(Xtemp)
        Xtemp['kmeans_cat'] = kmeans.labels_

        # # Cluster GMM
        # gmm = GaussianMixture(n_components=3).fit(Xtemp)
        # Xtemp['gauss_cat'] = gmm.predict(Xtemp)

        # Year_age
        Xtemp['year_age'] = Xtemp['age'] / 365

        # drop 'id' and 'age' 'smoke','alco','gluc', 'ap_lo', 'cholesterol', 'height', 'active', 'weight'
        Xtemp.drop(['id', 'age'], inplace=True, axis=1)

        # BMI (stored in the 'imc' column)
        Xtemp['imc'] = Xtemp['weight']/(Xtemp['height']/100)**2

        # cat_dwarfism
        Xtemp['cat_Dwarfism'] = [1 if value < 145 else 0 for value in Xtemp['height']]

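        # Readings far above the plausible range look like mis-scaled entries
        # (for example 1400 instead of 140). Each step below divides by 10 only the
        # values still above the limit, so repeating it also handles entries that
        # are off by a factor of 100 or 1000.

        # ap_hi divide 10
        Xtemp.loc[Xtemp['ap_hi'] > 220, ['ap_hi']] = Xtemp.loc[Xtemp['ap_hi'] > 220, ['ap_hi']]/10

        # ap_lo divide 10
        Xtemp.loc[Xtemp['ap_lo'] > 190, ['ap_lo']] = Xtemp.loc[Xtemp['ap_lo'] > 190, ['ap_lo']]/10

        # ap_hi divide 10 (second pass)
        Xtemp.loc[Xtemp['ap_hi'] > 220, ['ap_hi']] = Xtemp.loc[Xtemp['ap_hi'] > 220,['ap_hi']]/10

        # ap_hi divide 10 (third pass)
        Xtemp.loc[Xtemp['ap_hi'] > 220, ['ap_hi']] = Xtemp.loc[Xtemp['ap_hi'] > 220,['ap_hi']]/10

        # ap_lo divide 10 (second pass)
        Xtemp.loc[Xtemp['ap_lo'] > 190, ['ap_lo']] = Xtemp.loc[Xtemp['ap_lo'] > 190,['ap_lo']]/10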

        return Xtemp


class CatBloodPressureTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        Xtemp = X.copy()

        # cat_bloodpressure
        def cat_bloodpressure(df):

            if df['ap_hi'] < 90 and df['ap_lo'] < 60:
                return 1  # Hypotension
            elif 90 <= df['ap_hi'] < 140 and 60 <= df['ap_lo'] < 90:
                return 2  # Normal / pre-hypertension range
            elif 140 <= df['ap_hi'] < 160 and 90 <= df['ap_lo'] < 100:
                return 3  # Hypertension stage 1
            elif df['ap_hi'] >= 160 and df['ap_lo'] >= 100:
                return 4  # Hypertension stage 2
            else:
                return 5  # no category

        # cat_bloodpressure
        Xtemp['cat_bloodpressure'] = Xtemp.apply(cat_bloodpressure, axis=1)

        return Xtemp


class TotalPressureTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):

        return self

    def transform(self, X, y=None):

        Xtemp = X.copy()

        # total_pressure
        Xtemp['total_pressure'] = Xtemp['ap_hi'] + Xtemp['ap_lo']

        return Xtemp


class MyRobustScalerTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):

        Xtemp = X.copy()

        scaler = RobustScaler()
        Xscaled = scaler.fit_transform(Xtemp)
        Xtemp = pd.DataFrame(Xscaled, columns=Xtemp.columns.to_list())

        return Xtemp

    
# XGBoost Helper Functions

## 1.2 Describing the Variables:
---

### Different Variables:

    1. Age - Numerical - Days
    2. Height - Numerical - Cm
    3. Weight - Numerical - Kg
    4. Gender - Categorical - 1/2
    5. Systolic Blood Pressure (ap_hi) - Numerical - mmHg
    6. Diastolic Blood Pressure (ap_lo) - Numerical - mmHg
    7. Cholesterol - Categorical - 1-3
    8. Glucose - Categorical - 1-3
    9. Smoke - Categorical - 1/0
    10. Alcohol Intake - Categorical - 1/0
    11. Physical Activity - Categorical - 1/0
    12. Cardio - Categorical - 1/0

# 🔎 2.0 Variable Analysis
---

First we will explore the Numerical Variables, then the Categorical Variables.

## 2.1 Numerical Variable Analysis

For Numerical Variables, we will use a boxplot and plot them against cardio (i.e. the presence of cardiovascular disease).

After that, we will plot a histogram for each of the variables to check their skew.

In [None]:
df_new = df[['ap_hi', 'cardio']].copy()

df_new = remove_outliers(df_new)
f = plt.figure(figsize=(20,10))
sb.boxplot(data=df_new, orient='v',x='cardio',y='ap_hi')

Based on the American Heart Association guidelines for blood pressure, we treat any Systolic BP (ap_hi) reading outside the range 50 to 200 as abnormal and remove those rows from the dataset. For Diastolic BP (ap_lo), readings outside the range 60 to 90 are abnormal, but only values below 50 or above 120 are extreme and uncommon, so those are the rows we remove.

- Taken from https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings

In [None]:
# Removing the Outliers based on the above description
df = df.copy()
index = df.loc[(df['ap_hi']<50)|(df['ap_hi']>200),['ap_hi']].index
df.drop(index, inplace=True)
index = df.loc[(df['ap_lo']<50)|(df['ap_lo']>120),['ap_lo']].index
df.drop(index, inplace=True)

### Plotting the Boxplots and Histograms for the Numerical Variables.

Boxplots show the spread of each variable and help identify any significant outliers, while histograms show the general skew of each distribution.
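As a small illustration of how outliers are flagged, the sketch below applies the 1.5 × IQR rule, the same rule used by the `remove_outliers` helper and by the default boxplot whiskers. The function name `iqr_bounds` and the readings are made up for this example and are not rows from the dataset.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative readings only (hypothetical, not taken from the dataset)
readings = [110, 120, 120, 125, 130, 130, 135, 140, 150, 1400]
lower, upper = iqr_bounds(readings)
outliers = [v for v in readings if v < lower or v > upper]
print(lower, upper, outliers)
```

Only the value 1400 falls outside the fences here, which is the kind of point a boxplot would draw as an isolated marker.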

In [None]:
# One boxplot per numerical variable, split by cardio
for column in ['age', 'height', 'weight', 'ap_hi', 'ap_lo']:
    f = plt.figure(figsize=(20,10))
    sb.boxplot(data=df, orient='v', x='cardio', y=column)

In [None]:
# One histogram with a density curve per numerical variable
for column in ['age', 'height', 'weight', 'ap_hi', 'ap_lo']:
    f = plt.figure(figsize=(16, 8))
    sb.histplot(data=df[column], kde=True)

Looking at the ap_hi and ap_lo histograms, we can infer that these variables are very unevenly distributed, hence we will not be using these columns in our model prediction.

Looking at the height and weight plots, it seems that BMI is a better indicator as it takes into consideration both the height and weight to predict cardio. 

Hence, we will create a BMI column using the formula: weight (kg) / (height (m))^2. Since height is recorded in centimetres, it is divided by 100 first.
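As a quick worked example of the formula, the sketch below computes the BMI for a single hypothetical person. The function name `bmi` and the input values are assumptions made for illustration.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and height in cm."""
    height_m = height_cm / 100  # the dataset stores height in centimetres
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 175 cm, i.e. 70 / 1.75**2
print(round(bmi(70, 175), 2))  # 22.86
```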

In [None]:
# Creation of BMI Column using weight and height(m) 
df['BMI'] = df['weight'] / ((df['height'])/100)**2
df

In [None]:
# Plotting the box plots for BMI
f = plt.figure(figsize=(20,10))
sb.boxplot(data=df, orient='v',x='cardio',y='BMI')

In [None]:
# .shape[0] counts rows; .size would count rows * columns
print("Rows with BMI > 130:", df.loc[df['BMI']>130,:].shape[0])
print("Rows with BMI < 10:", df.loc[df['BMI']<10,:].shape[0])

Since it is extremely rare for anyone's BMI to be above 130 or below 10, these values are probably errors in the data. Hence, we will remove them from the dataset.

In [None]:
df = df.copy()
index = df.loc[(df['BMI']<10)|(df['BMI']>130),['BMI']].index
df.drop(index, inplace=True)
df

In [None]:
df[['age','height','weight','ap_hi','ap_lo']].skew(axis=0,skipna=True)

From the data, ap_hi deviates the most from a normal distribution, with a high positive skew of 85.296214.
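To give a feel for why a few extreme readings can produce a large positive skew, here is a minimal sketch that computes the plain population skewness by hand. The `skewness` helper and the values are illustrative assumptions. pandas' `.skew()` applies a small-sample correction, so its numbers differ slightly from this formula.

```python
def skewness(values):
    """Population skewness g1 = m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Illustrative values only (hypothetical, not rows from the dataset)
typical = [110, 115, 120, 120, 125, 130]
with_error = typical + [14000]  # one mis-scaled entry

print(skewness(typical))     # symmetric sample, skew of zero
print(skewness(with_error))  # a single extreme value gives a large positive skew
```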

---

### Plotting Correlation & Heatmap

In [None]:
numvars=df[['age','height','weight','ap_hi','ap_lo', 'BMI']]
numvars.corr()

In [None]:
sb.heatmap(numvars.corr(),vmin=-1,vmax=1,annot=True,fmt=".4f")

As seen in the heatmap above, height and weight have a strong correlation with the newly created BMI column, hence we will be using the BMI column instead of the height and weight columns to train our models.

---
## 2.2 Categorical Variable Analysis:

For Categorical Variables, we will use a heatmap and correlation matrix.

We will first put these categorical variables into one separate DataFrame so that they are easier to compare and contrast with 'cardio'.

The 6 Categorical Variables we are exploring are:
1. Gender
2. Cholesterol
3. Glucose
4. Smoking
5. Alcohol Intake
6. Physical Activity

In [None]:
# cardio distribution by 'cholesterol' (Cholesterol)
balanced_target(target='cholesterol', hue='cardio', dataset=df)

total_cardio_0 = df['cardio'].value_counts()[0]
total_cardio_1 = df['cardio'].value_counts()[1]

print('Total count of cardio = 0:', total_cardio_0)
print('Total count of cardio = 1:', total_cardio_1)

In [None]:
# cardio distribution by 'gluc' (Glucose)
balanced_target(target='gluc',hue='cardio', dataset=df)

In [None]:
# cardio distribution by 'alco' (Alcohol)
balanced_target(target='alco',hue='cardio', dataset=df)

In [None]:
# cardio distribution by 'active' (Physical Activity)
balanced_target(target='active',hue='cardio', dataset=df)

# 🗺️ 3.0 Exploratory Data Analysis:
---

## 3.1 Cleaning of Data and Removal of Outliers

In this section, we will encode our categorical variables as numbers, represented by the values 0, 1, 2, and so on.

This makes the dataset easier to use with the Machine Learning Models that we will implement in the next sections.

In [None]:
# Changing column names and variable description to make the data more readable
df_cleaned = df.copy()
df_cleaned.rename(columns={
    'ap_hi': 'Systolic_BP',
    'ap_lo': 'Diastolic_BP',
    'gluc': 'Glucose',
    'alco': 'Alcohol',
    'cardio': 'Cardio_Patient',
}, inplace=True)

# Dropping unneeded ID column
df_cleaned.drop("id", axis=1, inplace=True)

# Replacing data in columns for readability
df_cleaned['gender'] = df_cleaned['gender'].replace(1,0) # 0 is Male
df_cleaned['gender'] = df_cleaned['gender'].replace(2, 1) # 1 is Female
# df_cleaned['cholesterol'] = df_cleaned['cholesterol'].replace(1, 'Normal')
# df_cleaned['cholesterol'] = df_cleaned['cholesterol'].replace(2, 'Above_Normal')
# df_cleaned['cholesterol'] = df_cleaned['cholesterol'].replace(3, 'Well_Above_Normal')
# df_cleaned['Glucose'] = df_cleaned['Glucose'].replace(1, 'Normal')
# df_cleaned['Glucose'] = df_cleaned['Glucose'].replace(2, 'Above_Normal')
# df_cleaned['Glucose'] = df_cleaned['Glucose'].replace(3, 'Well_Above_Normal')
# df_cleaned['smoke'] = df_cleaned['smoke'].replace(1, 'Yes')
# df_cleaned['smoke'] = df_cleaned['smoke'].replace(0, 'No')
# df_cleaned['Alcohol'] = df_cleaned['Alcohol'].replace(1, 'Yes')
# df_cleaned['Alcohol'] = df_cleaned['Alcohol'].replace(0, 'No')
# df_cleaned['active'] = df_cleaned['active'].replace(1, 'Yes')
# df_cleaned['active'] = df_cleaned['active'].replace(0, 'No')
# df_cleaned['Cardio_Patient'] = df_cleaned['Cardio_Patient'].replace(1, 'Yes')
# df_cleaned['Cardio_Patient'] = df_cleaned['Cardio_Patient'].replace(0, 'No')

# Set all columns to upper case
df_cleaned.columns = df_cleaned.columns.str.upper()
df_cleaned

## 3.2 Cleaned Variable Conclusion:

Format: Column Description (COLUMN NAME):

    1. Age (AGE) - Numerical - Days
    2. Height (HEIGHT) - Numerical - Cm
    3. Weight (WEIGHT) - Numerical - Kg
    4. Gender (GENDER) - Categorical - 0 (Male), 1 (Female)
    5. Systolic Blood Pressure (SYSTOLIC_BP) - Numerical - mmHg
    6. Diastolic Blood Pressure (DIASTOLIC_BP) - Numerical - mmHg
    7. Cholesterol (CHOLESTEROL) - Categorical - 1 (Normal), 2 (Above Normal), 3 (Well Above Normal)
    8. Glucose (GLUCOSE) - Categorical - 1 (Normal), 2 (Above Normal), 3 (Well Above Normal) 
    9. Smoking (SMOKE) - Categorical - 1 (Yes), 0 (No)
    10. Alcohol Intake (ALCOHOL) - Categorical - 1 (Yes), 0 (No)
    11. Physical Activity (ACTIVE) - Categorical - 1 (Yes), 0 (No)
    12. Patient of Cardio Disease (CARDIO_PATIENT) - Categorical - 1 (Yes), 0 (No)
    13. BMI - Numerical - Kg/m**2

Based on the information provided, a suitable reason to justify the exclusion of heart rate from the variables used in the cardiovascular disease prediction model is that heart rate may not be a direct or accurate measure of cardiovascular disease risk. Although heart rate is commonly used as an indicator of overall cardiovascular health, there are many other factors that can affect heart rate, such as stress, physical activity, and even medication. Additionally, heart rate is not always a reliable indicator of heart disease risk, as some individuals with a normal heart rate may still be at increased risk for cardiovascular events.

Therefore, to build a more accurate and reliable cardiovascular disease prediction model, it may be best to focus on the other variables mentioned, such as age, BMI, cholesterol, glucose, smoking, and alcohol consumption. These factors have been shown to be more directly linked to cardiovascular disease risk and can provide a more comprehensive assessment of an individual's overall health and risk for developing cardiovascular disease.

In conclusion, the exclusion of heart rate from the cardiovascular disease prediction model is justifiable based on its limited direct correlation with cardiovascular disease risk. By focusing on the other variables that have been shown to be more closely linked to cardiovascular disease risk, we can build a more accurate and reliable model for predicting an individual's risk of developing this condition.

---
# 🤖 4.0 Machine Learning Models:
---
## 4.1 XGBoost

#### What is XGBoost?

XGBoost (Extreme Gradient Boosting) is an open-source machine learning library that implements gradient boosting algorithms. It was described by Tianqi Chen and Carlos Guestrin in a 2016 paper and has become one of the most popular machine-learning libraries, featuring in many winning Kaggle solutions.

Gradient boosting is a technique that combines multiple weak models to form a strong model. It does this by iteratively adding new models to the ensemble, with each model trying to correct the errors of the previous models. XGBoost is an implementation of gradient boosting that uses a tree-based model.

XGBoost can handle both regression and classification problems and is known for its speed and accuracy. It uses a number of techniques to reduce overfitting, including regularization and early stopping. It also supports parallel processing, making it possible to train models on large datasets in a reasonable amount of time.

Overall, XGBoost is a powerful machine-learning library widely used in industry and academia for various applications, including data mining, natural language processing, and computer vision.
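The iterative, error-correcting idea behind gradient boosting described above can be sketched in a few lines. This is a simplified illustration for squared error with one-split trees ("stumps"). It is not XGBoost's actual algorithm, which adds regularization and second-order information. The helper names `fit_stump` and `boost`, and the toy data, are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Find the threshold split that best fits the residuals with two constants."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on the right
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds it to the ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    mse_history = []  # training error measured before each round
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        mse_history.append(sum(r * r for r in residuals) / len(ys))
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return preds, mse_history


# Toy data: a step-like curve, purely illustrative
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
_, history = boost(xs, ys)
print(history[0], history[-1])
```

The training error shrinks from round to round because every new stump is fitted to what the previous ones still get wrong.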

---
#### What is Overfitting?

Overfitting is a phenomenon that occurs when a machine learning model is trained too well on the training data, to the point that it begins to memorize it instead of generalizing the patterns. This means that the model has learned the training data so well that it does not perform well on new, unseen data.

Overfitting occurs when the model is too complex, and it has too many parameters relative to the amount of training data. As a result, the model becomes too specialized to the training data and fails to capture the underlying patterns in the data that generalize well to new data.

Overfitted models tend to have good performance with the data used to fit them (the training data), but they behave poorly with unseen data (or test data, which is data not used to fit the model).

Overfitting can be detected by comparing the performance of the model on the training data with its performance on the testing data. If the performance on the training data is significantly better, it is a sign of overfitting. To prevent overfitting, various techniques can be used, including regularization, cross-validation, and early stopping. Regularization adds a penalty term to the loss function to discourage overly complex models, cross-validation evaluates the model on different subsets of the data, and early stopping halts training when performance on held-out validation data starts to deteriorate.
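The train-versus-test comparison can be illustrated with a tiny nearest-neighbour classifier on synthetic data. This is only a sketch under stated assumptions: the data generator, the 20% label noise and the helper names are invented for the example and have nothing to do with the cardio dataset. With k=1 the model memorizes every training point, noise included.

```python
import random


def make_data(rng, n):
    """x in [0, 1); label is 1 when x > 0.5, flipped 20% of the time (noise)."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, data, k):
    """Share of points in `data` whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


rng = random.Random(0)
train, test = make_data(rng, 200), make_data(rng, 200)

for k in (1, 25):
    train_acc = accuracy(train, train, k)
    test_acc = accuracy(train, test, k)
    print(f"k={k}: train={train_acc:.2f} test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
```

A large gap between the training and test accuracy, as expected for k=1, is the sign of overfitting described above. A smoother model such as k=25 typically shows a much smaller gap.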

---
#### What is Regularization?

Regularization is a technique used in machine learning to prevent the overfitting of the model to the training data. It involves adding a penalty term to the loss function that the model is optimizing during training. The penalty term discourages the model from fitting the training data too well and instead encourages it to find a more generalizable solution that can perform well on new, unseen data.

There are two common types of regularization techniques used in machine learning:

L1 Regularization (Lasso Regression): This technique adds a penalty term to the loss function that is proportional to the absolute value of the weights of the model. This penalty term shrinks the weights of the model towards zero, effectively removing some of the less important features from the model. This makes the model simpler and less likely to overfit.

L2 Regularization (Ridge Regression): This technique adds a penalty term to the loss function that is proportional to the square of the weights of the model. This penalty term shrinks the weights of the model towards zero as well, but unlike L1 regularization, it does not remove any features completely from the model. Instead, it reduces the impact of less important features, making the model more robust to noise in the data.

In summary, regularization is a powerful technique to combat overfitting in machine learning models. By adding a penalty term to the loss function, it encourages the model to find a simpler solution that generalizes well to new data.
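To make the difference between the two penalties concrete, the sketch below computes both penalty terms and the resulting shrinkage for a hypothetical weight vector. It assumes a squared-error loss of the form ½‖y − Xw‖² with orthonormal features, a textbook simplification under which the L1 solution is soft-thresholding and the L2 solution is proportional shrinkage. All function names are illustrative.

```python
def l1_penalty(weights, strength):
    """Lasso-style penalty: strength * sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Ridge-style penalty: (strength / 2) * sum of squared weights."""
    return 0.5 * strength * sum(w * w for w in weights)


def l1_shrink(weights, strength):
    """Soft-thresholding: weights no larger than `strength` become exactly zero."""
    return [(abs(w) - strength) * (1 if w > 0 else -1) if abs(w) > strength else 0.0
            for w in weights]


def l2_shrink(weights, strength):
    """Proportional shrinkage: every weight gets smaller, none becomes zero."""
    return [w / (1 + strength) for w in weights]


weights = [3.0, -0.5, 0.25, -2.0]  # hypothetical unregularized weights
print(l1_penalty(weights, 1.0), l2_penalty(weights, 1.0))
print(l1_shrink(weights, 1.0))  # small weights are removed
print(l2_shrink(weights, 1.0))  # all weights are halved
```

The L1 result drops the two small weights entirely, while the L2 result keeps every feature with a reduced weight, which matches the behaviour described for Lasso and Ridge.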

XGBoost supports both L1 and L2 regularization on the leaf weights of its trees. The strength of each is set through the "alpha" (L1) and "lambda" (L2) hyperparameters, exposed as `reg_alpha` and `reg_lambda` in the scikit-learn wrapper `XGBClassifier`. Larger values apply more regularization.

In practice, L1 and L2 regularization are often used together in XGBoost, in the same spirit as "Elastic Net" regularization, which combines the benefits of both. By tuning alpha and lambda, users can balance the two penalties and reduce overfitting on their data.

---
#### What is Early Stopping?

Early stopping is a technique used in machine learning to prevent overfitting of the model to the training data. It involves monitoring the performance of the model on a validation set during training and stopping the training process when the performance of the model on the validation set starts to deteriorate.

During training, the model is evaluated on both the training set and the validation set at regular intervals (usually after each epoch). The training process is stopped when the performance of the model on the validation set starts to worsen. This is because continuing to train the model after this point will result in overfitting as the model begins to memorize the training data.

The point at which the training is stopped is determined using a stopping criterion. One common stopping criterion is to stop training when the validation loss (i.e., the loss on the validation set) stops decreasing for a certain number of epochs. Another common stopping criterion is to stop training when the difference between the training loss and validation loss exceeds a certain threshold.

Early stopping is a simple yet effective technique for preventing overfitting in machine learning models. By stopping the training process at the right time, it helps the model generalize better to new, unseen data and improves the overall performance of the model.
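The first stopping criterion above (stop when the validation loss has not improved for a number of rounds) can be sketched as a small "patience" rule. The function name `early_stopping_round` and the loss values are hypothetical. Real libraries implement the same idea with their own options.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stop_round) for a patience-based stopping rule.

    Training stops once `patience` consecutive rounds fail to improve on the
    best validation loss seen so far.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_losses) - 1  # rule never triggered: ran to the end


# Hypothetical validation losses, one per training round
losses = [0.70, 0.61, 0.55, 0.52, 0.53, 0.54, 0.56, 0.60]
print(early_stopping_round(losses, patience=3))  # (3, 6)
```

Here the best validation loss is reached at round 3, and training stops at round 6 after three rounds without improvement. The model from round 3 is the one to keep.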

---
#### What is a hyperparameter?

In machine learning, a hyperparameter is a parameter that is set before the model is trained and remains fixed throughout the training process. Unlike model parameters, which are learned from the training data during the training process, hyperparameters are set by the data scientist or machine learning engineer before training begins.

Examples of hyperparameters include the learning rate of the model, the regularization strength, the number of hidden layers in a neural network, the number of trees in a random forest, and so on. These hyperparameters are set before training begins and are typically tuned through a process called hyperparameter tuning or hyperparameter optimization, where different values of the hyperparameters are tried to find the best combination that maximizes the performance of the model on a validation set or through cross-validation.

Choosing the right hyperparameters can have a significant impact on the performance of a machine-learning model. Hyperparameters that are not set properly can lead to overfitting or underfitting, resulting in poor performance of the model. Therefore, selecting the right hyperparameters is an important step in building an accurate and reliable machine-learning model.
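
To make the idea of hyperparameter tuning more concrete, the sketch below enumerates every combination in a small grid, which is the set of candidates an exhaustive grid search would try. The values mirror the grid searched for XGBoost in this notebook; the helper `expand_grid` is a hypothetical name used only for this illustration.

```python
from itertools import product

# Candidate values for each hyperparameter
param_grid = {
    "learning_rate": [0.1, 0.2, 0.3, 0.4],
    "max_depth": [3, 4, 5],
    "n_estimators": [50, 100, 200],
}


def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


candidates = expand_grid(param_grid)
print(len(candidates))      # 4 * 3 * 3 = 36 combinations
print(len(candidates) * 3)  # 108 model fits with 3-fold cross-validation
print(candidates[0])        # {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is why grids are usually kept small.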

In [None]:
# Helper function for XGBoost: compare training and validation accuracy
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score


def check(X_train_train, X_train_val, y_train_train, y_train_val):
    """Fit an XGBoost classifier and return (training accuracy, validation accuracy).

    The training data is expected to be split into training and validation
    sets beforehand, e.g. with
    train_test_split(X_train, y_train, test_size=0.2, random_state=42).
    """
    model = XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=42, colsample_bytree=0.8)

    # Fit the model to the training data
    model.fit(X_train_train, y_train_train)

    # Make predictions on the training and validation data
    y_train_pred = model.predict(X_train_train)
    y_val_pred = model.predict(X_train_val)

    # Calculate accuracy on the training and validation data
    train_acc = accuracy_score(y_train_train, y_train_pred)
    val_acc = accuracy_score(y_train_val, y_val_pred)

    # A large gap between the two scores is a sign of overfitting
    return train_acc, val_acc

In [None]:
# XGBoost Classifier: baseline fit followed by a hyperparameter search
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Split the cleaned data into training and test sets
features = ['AGE', 'GENDER', 'CHOLESTEROL', 'GLUCOSE', 'SMOKE', 'ALCOHOL', 'ACTIVE', 'BMI']
X_train, X_test, y_train, y_test = train_test_split(df_cleaned[features], df_cleaned['CARDIO_PATIENT'], test_size=0.2, random_state=69)

# Baseline model with hand-picked hyperparameters
xgb = XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=42, colsample_bytree=0.8)
xgb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = xgb.predict(X_test)

# Evaluate the performance of the baseline XGBoost model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Define a parameter grid for the XGBoost model
# These hyperparameters affect both the accuracy of the model and how much it overfits
param_grid = {
    'learning_rate': [0.1, 0.2, 0.3, 0.4],
    'max_depth': [3, 4, 5],
    'n_estimators': [50, 100, 200]
}

# Optional: compare training and validation accuracy with the helper function
# train_acc, val_acc = check(X_train, X_test, y_train, y_test)
# print("Training Accuracy:", train_acc)
# print("Validation Accuracy", val_acc)

# Create an instance of the GridSearchCV class (3-fold cross-validation)
grid_search = GridSearchCV(xgb, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit the GridSearchCV instance to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and their mean cross-validated accuracy
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

## 4.2 XGBoost Conclusion:

Based on the accuracy score, the model reaches an accuracy of 64.57% after tuning the hyperparameters.

---

## 4.3 Logistic Regression:

#### What is Logistic Regression?

Logistic regression is a transformation of the linear regression model that allows us to model binary variables probabilistically. It is a generalized linear model that uses a logit link. Logistic regression works well when we want to model binary data, as we do here, when we want class probability predictions, or when we want some interpretability: through its coefficients, we can quantify the impact of each feature on the model's predictions via the odds ratio. On the other hand, logistic regression performs poorly when the data is not linearly separable.
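
As a small illustration of the odds-ratio interpretation, exponentiating a coefficient gives the factor by which the odds of the positive class are multiplied when that feature increases by one unit. The coefficient value below is hypothetical and is not taken from the model fitted in this notebook.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a feature."""
    return math.exp(coefficient)


# Hypothetical coefficient, used only to show the calculation
example_coefficient = 0.4
print(round(odds_ratio(example_coefficient), 3))  # 1.492
```

An odds ratio above 1 means the feature raises the odds of the positive class, a value below 1 means it lowers them, and a coefficient of 0 gives an odds ratio of exactly 1 (no effect).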

---
#### How does it work?
Logistic regression works very much like linear regression: the inputs (x) are combined linearly using weights, or coefficient values, to predict an output (y). The key difference is that the linear combination is passed through the logistic function, so the output is a probability between 0 and 1 that can then be turned into a binary class:

$$y = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$$

The coefficients (Beta values b) of the logistic regression algorithm are estimated by maximum-likelihood estimation.

Maximum-likelihood estimation is a common fitting procedure used by a variety of machine learning algorithms. The best coefficients result in a model that predicts a value very close to 1 for the default class (here, CVD present) and a value very close to 0 for the other class (CVD absent).
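
The logistic function described above can be sketched directly in Python. The function name `predict_probability`, the coefficient values and the 0.5 decision threshold are assumptions made for this illustration; the code uses the algebraically equivalent form 1 / (1 + e^(-z)).

```python
import math


def predict_probability(x, b0, b1):
    """Probability of the positive class for a single input x."""
    z = b0 + b1 * x                    # Linear combination of the input
    return 1.0 / (1.0 + math.exp(-z))  # Same as e**z / (1 + e**z)


# Hypothetical coefficients, not fitted to the CVD dataset
b0, b1 = -1.0, 0.5
for x in (-4.0, 2.0, 8.0):
    p = predict_probability(x, b0, b1)
    label = 1 if p >= 0.5 else 0       # Threshold the probability to get a class
    print(x, round(p, 3), label)
```

When the linear combination is 0 the probability is exactly 0.5, which is why 0.5 is a common default threshold.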

---
#### Advantages of using Logistic Regression:
1. Very simple and easy to understand.
2. Interpretable.
3. Probabilistic outputs.
4. Low cost of maintenance.

---
#### Disadvantages of using Logistic Regression:
1. Sensitive to highly correlated inputs.
2. Assumes a linear relationship between the input features and the log-odds of the outcome.

---
To make the approach work, you need to pre-process all the features after creating dummy variables. This includes:

1. Centering and scaling.
2. Transforming the data to remove skewness.
3. Removing highly correlated features.
4. Removing features that have near-zero variance.

To prevent overfitting, you can also use a penalized logistic regression model.
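
As a rough illustration of the centering and scaling step listed above, the sketch below standardizes a single column by subtracting its mean and dividing by its population standard deviation. The helper `standardize` and the BMI values are made up for this example.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Center a column on 0 and scale it to unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up BMI values, used only to show the transformation
bmi = [22.0, 27.5, 31.0, 24.5]
scaled = standardize(bmi)
print([round(v, 3) for v in scaled])
```

After this transformation the column has a mean of 0 and a standard deviation of 1, so features measured on different scales (such as age and BMI) contribute comparably to the model.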

In [None]:
## Logistic Regression Code:
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

features = ['AGE', 'GENDER', 'CHOLESTEROL', 'GLUCOSE', 'SMOKE', 'ALCOHOL', 'ACTIVE', 'BMI']
numeric = ['AGE', 'BMI']

# Split first, so that the scaler only learns from the training data
X_train, X_test, y_train, y_test = train_test_split(df_cleaned[features], df_cleaned['CARDIO_PATIENT'], test_size=0.2, random_state=69)
X_train, X_test = X_train.copy(), X_test.copy()

# Normalizing Numerical Data
# Fit the StandardScaler on the training set, then transform both sets
scaler = StandardScaler()
X_train[numeric] = scaler.fit_transform(X_train[numeric])
X_test[numeric] = scaler.transform(X_test[numeric])

# Training Model
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

# Plot the confusion matrix as a heatmap
sb.heatmap(confusion_matrix(y_test, predictions), annot=True, fmt=".0f", annot_kws={"size": 18})
plt.show()

### What does each of the Performance Metrics mean?

Performance metrics such as **Precision, Recall, F1 score, and Support** are commonly used to evaluate the performance of classification models. Here's what each of these metrics means:

> **1. Precision:** Precision is the ratio of true positive predictions to the total number of positive predictions made by the model. In other words, it measures how many of the predicted positive instances are actually positive. A high precision score means that the model has a low false positive rate and is good at predicting true positive instances.

> **2. Recall:** Recall is the ratio of true positive predictions to the total number of actual positive instances in the dataset. In other words, it measures how many of the actual positive instances are correctly identified by the model. A high recall score means that the model has a low false negative rate and is good at identifying all positive instances.

> **3. F1 score:** The F1 score is the harmonic mean of precision and recall. It provides a single score that balances precision and recall. A high F1 score means that the model has both high precision and high recall.

> **4. Support:** The support is the number of samples in each class. It gives an indication of how many samples in the dataset belong to each class.

When evaluating a model, it is important to consider all of these metrics together, as they provide a more complete picture of the model's performance. For example, a model with high precision but low recall is usually right when it predicts the positive class, but it misses many of the actual positive instances. On the other hand, a model with high recall but low precision finds most of the actual positive instances, but it also generates many false positives. The F1 score provides a balanced view of both precision and recall, and can help you determine the overall performance of the model.
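
The definitions above can be checked with a short calculation from the counts of a confusion matrix. The counts used here are made up for illustration and are not the results of the model in this notebook.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall and F1 score for the positive class."""
    precision = tp / (tp + fp)  # Share of predicted positives that are correct
    recall = tp / (tp + fn)     # Share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # Harmonic mean
    return precision, recall, f1


# Made-up counts: 60 true positives, 20 false positives, 40 false negatives
precision, recall, f1 = classification_metrics(tp=60, fp=20, fn=40)
support = 60 + 40               # Number of actual positive samples
print(precision, recall, round(f1, 3), support)  # 0.75 0.6 0.667 100
```

In this example the model is right 75% of the time when it predicts the positive class, but it only finds 60% of the actual positives, and the F1 score of about 0.667 sits between the two.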


## 4.4 Logistic Regression Conclusion:

Based on the accuracy score, the model has an accuracy of 64%.

## 4.5 Neural Network Classifier:

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Split the cleaned data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df_cleaned[['AGE', 'GENDER', 'CHOLESTEROL', 'GLUCOSE', 'SMOKE','ALCOHOL', 'ACTIVE', 'BMI']], df_cleaned['CARDIO_PATIENT'], test_size=0.2, random_state=69)

# Standardize the features: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define the parameter grid
param_grid = {
    'hidden_layer_sizes': [(10,), (50,), (100,)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate': ['constant', 'adaptive'],
    'batch_size': [32, 64, 128],
}

# Create the MLPClassifier model (hidden_layer_sizes is overridden by the grid search)
model = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=1000)

# Create the GridSearchCV object (5-fold cross-validation, using all CPU cores)
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit the model with GridSearchCV
grid_search.fit(X_train, y_train)

# Print the best hyperparameters and their mean cross-validated score
print("Best hyperparameters: ", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Alternative: fit a single fixed architecture without a grid search
# clf = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=1000)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
# accuracy = clf.score(X_test, y_test)
# print('Accuracy:', accuracy)

---
# 🖊️ 5.0 Conclusion:

## 5.1 Comparing XGBoost and Logistic Regression:

XGBoost Model Accuracy: 64.53%

Logistic Regression Accuracy: 64%


---
## 5.2 Epilogue and Conclusion:

Comparing the two models, XGBoost has the higher accuracy (64.53% versus 64% for Logistic Regression), so it is the better model for predicting CVD on this dataset, although the margin is small.

---
## 5.3 Models Used:
1. XGBoost
2. Logistic Regression


---
## 5.4 What did we learn from this project?
- Handling imbalanced datasets using resampling methods and imblearn package
- Logistic Regression from sklearn
- XGBoost from xgboost
- Other packages such as tqdm and json
- Collaborating using GitHub
- The concepts of Precision, Recall, F1 Score and Support

---
## 5.5 References
- https://www.nickmccullum.com/python-machine-learning/logistic-regression-python/
- https://towardsdatascience.com/quick-and-easy-explanation-of-logistics-regression-709df5cc3f1e
- https://www.digitalocean.com/community/tutorials/normalize-data-in-python