<a href="https://colab.research.google.com/github/kirtiver22/cadrio_vascular_risk/blob/main/final_cardio_vascular_risk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Cardiovascular risk**



##### **Project Type**    - EDA/Regression ,**Classification**
##### **Contribution**    - Team
##### **Team Member 1 -** SURAJ THAKUR
##### **Team Member 2 -** Manjeet Sulekh
##### **Team Member 3 -** Kirti Verma

# **Project Summary -**

The dataset comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients' information. The source data is described as including over 4,000 records and 15 attributes; the file loaded in this notebook has 3,390 rows. Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk factors.

Cardiovascular disease (CVD) is a leading cause of death worldwide, and early detection of individuals at high risk of developing CVD is essential for effective prevention and treatment. Traditional methods of cardiovascular risk prediction rely on risk factors such as age, gender, and blood pressure, but these methods have limitations in accurately predicting individual risk.

Machine learning algorithms have shown promise in improving the accuracy of cardiovascular risk prediction by incorporating a wider range of variables, including lifestyle factors and biomarkers. However, there is a need for a robust and reliable machine learning model that can accurately predict an individual's risk of developing CVD.

Therefore, the problem statement is to develop a machine learning model that can accurately predict an individual's risk of developing CVD by incorporating a wide range of variables, including demographic, clinical, and lifestyle factors. The model should be interpretable, reliable, and able to provide personalized risk predictions that can aid in the development of prevention strategies for high-risk individuals.

The project aims to develop a machine learning model for the classification of cardiovascular risk in individuals. The model will be trained on a dataset of demographic, clinical, and lifestyle variables of patients, including age, gender, blood pressure, cholesterol levels, smoking status, and family history of cardiovascular disease.

The dataset will be preprocessed to handle missing values, normalize the data, and perform feature engineering to extract relevant features for the model. Several machine learning algorithms will be evaluated, such as logistic regression, decision trees, random forests, and support vector machines, to determine the best-performing algorithm for the task.

The performance of the model will be evaluated using metrics such as accuracy, precision, recall, and F1-score. The model's interpretability will be analyzed by identifying the most important features contributing to the prediction of cardiovascular risk.
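To make the four evaluation metrics concrete, here is a minimal plain-Python sketch that derives them from the confusion counts. The function name `classification_metrics` and the labels are made up for illustration. In the notebook itself, scikit-learn's `classification_report` reports the same quantities.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = at risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For a risk model, recall on the at-risk class is often the metric to watch, since a missed high-risk patient is usually costlier than a false alarm.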

In this project we will perform EDA and train seven ML models:

* K-Nearest Neighbors
* Logistic Regression
* Naïve Bayes
* Decision Tree
* Support Vector Machine
* Random Forest
* XGBoost

# **GitHub Link -**

https://github.com/kirtiver22

# **Problem Statement**


Cardiovascular diseases (CVDs) are the leading cause of death worldwide. According to the WHO, 17.9 million people died from CVDs in 2019, accounting for 32% of all global deaths. Although CVDs often cannot be cured, predicting the risk of the disease and taking the necessary precautions and medications can help to avoid severe symptoms and, in some cases, even death. It is therefore critical to predict the risk of heart disease accurately in order to avert as many deaths as possible.

Therefore, the problem statement is to develop a machine learning model that can accurately predict an individual's risk of developing CVD by incorporating a wide range of variables, including demographic, clinical, and lifestyle factors. The model should be interpretable, reliable, and able to provide personalized risk predictions that can aid in the development of prevention strategies for high-risk individuals.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import
# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and feature selection
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Model evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, GridSearchCV


### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
CVD_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/cardiovascular_risk.csv')


### Dataset First View

In [None]:
# Dataset First Look
CVD_df.head()



In [None]:
CVD_df.tail()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
num_rows, num_columns = CVD_df.shape

print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")


In [None]:
#columns name
columns = CVD_df.columns
print(columns)

### Dataset Information

In [None]:
# Dataset Info
CVD_df.info()

In [None]:

CVD_df.describe().transpose()

In [None]:
df_pivot = CVD_df.pivot_table(index=['id', 'age', 'education', 'sex', 'is_smoking', 'cigsPerDay', 'BPMeds',
                                      'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
                                      'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
                               aggfunc='size')
print(df_pivot)


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
num_duplicates = CVD_df.duplicated().sum()

print(f"Number of duplicate rows: {num_duplicates}")



#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values_count = CVD_df.isnull().sum()

print(f"Missing values count:\n{missing_values_count}\n")



In [None]:
missing_values_count.sum()

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(CVD_df, color='skyblue', figsize=(10, 6))
# msno.bar draws one bar per column whose height is the number of non-missing values
plt.title('Non-Missing Values by Column')
plt.show()


### What did you know about your dataset?

# **Demographic:**
Sex: male or female ('M' or 'F')

Age: age of the patient (continuous; although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

Education: the level of education of the patient (categorical values: 1, 2, 3, 4)

# **Behavioral:**
is_smoking: whether or not the patient is a current smoker ('YES' or 'NO')

Cigs Per Day: the number of cigarettes that the person smoked on average in one day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette)

# **Medical (history):**
BP meds: whether or not the patient was on blood pressure medication (nominal)

Prevalent stroke: whether or not the patient had previously had a stroke (nominal)

Prevalent Hyp: whether or not the patient was hypertensive (nominal)

Diabetes: whether or not the patient had diabetes (nominal)

# **Medical (current):**
Tot Chol: total cholesterol level (continuous)

Sys BP: systolic blood pressure (continuous)

Dia BP: diastolic blood pressure (continuous)

BMI: Body Mass Index (continuous)

Heart rate: heart rate (continuous; in medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values)

Glucose: glucose level (continuous)

# **Predicted variable (desired target):**
10-year risk of coronary heart disease, CHD (binary: '1' means 'Yes', '0' means 'No')

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
CVD_df.columns


In [None]:
# Dataset Describe
CVD_df.describe(include='all')

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check unique values for each variable
for column in CVD_df.columns:
    unique_values = CVD_df[column].nunique()
    print(f"Number of Unique values for {column}: {unique_values}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# column rename
CVD_df.rename(columns={'cigsPerDay':'cigs_per_day','BPMeds':'bp_meds',
                   'prevalentStroke':'prevalent_stroke','prevalentHyp':'prevalent_hyp',
                   'totChol':'total_cholesterol','sysBP':'systolic_bp','diaBP':'diastolic_bp',
                   'BMI':'bmi','heartRate':'heart_rate','TenYearCHD':'ten_year_chd'},
          inplace = True)

In [None]:
CVD_df.columns

In [None]:
# Defining 3 lists containing the column names of
# a. dependent variables
# b. continuous independent variables
# c. categorical independent variables
# This is defined based on the number of unique values for each attribute

dependent_var = ['ten_year_chd']
continuous_var = ['age','cigs_per_day','total_cholesterol','systolic_bp', 'diastolic_bp', 'bmi', 'heart_rate', 'glucose']
categorical_var = ['education', 'sex', 'is_smoking','bp_meds','prevalent_stroke', 'prevalent_hyp', 'diabetes']

In [None]:
# Encoding the binary columns

CVD_df['sex'] = np.where(CVD_df['sex'] == 'M',1,0)
CVD_df['is_smoking'] = np.where(CVD_df['is_smoking'] == 'YES',1,0)


In [None]:
# All missing values in the cigs_per_day column
CVD_df[CVD_df['cigs_per_day'].isna()]



* **From the above table, we find that for every instance of missing values in cigs per day, the patients reported that they smoke.**
* **Let's check the mean and median number of cigarettes smoked by patients, who reported that they smoke.**


In [None]:
# mean and median number of cigarettes per day for a smoker (excluding non-smokers)
CVD_df[CVD_df['is_smoking']==1]['cigs_per_day'].mean(),CVD_df[CVD_df['is_smoking']==1]['cigs_per_day'].median()

* Mean number of cigarettes for a smoker = 18.34
* Median number of cigarettes for a smoker = 20



### What all manipulations have you done and insights you found?

The column names were converted to snake_case, and the two text columns (sex and is_smoking) were encoded as 0/1.

On a first look at the dataset, we find that:

* There are 3390 rows and 17 columns, one of which, TenYearCHD, is the dependent variable to be predicted.
* Two of the features are not stored in a numerical (int/float) datatype.
* There are no duplicate rows in the dataset, and all values in the id column are unique.
* There are 510 missing values in the dataset, 304 of them in the glucose column.
* Every missing value in cigs per day belongs to a patient who reported smoking.
* Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk factors.


## ***Replacing the missing values in the categorical columns with the most frequent entry:***

In [None]:

# Replacing the missing values in the categorical columns with its mode
CVD_df['education'] = CVD_df['education'].fillna(CVD_df['education'].mode()[0])
CVD_df['bp_meds'] = CVD_df['bp_meds'].fillna(CVD_df['bp_meds'].mode()[0])

In [None]:
# education distribution after mode imputation
CVD_df.education.value_counts()


In [None]:
# bp_meds distribution after mode imputation
CVD_df.bp_meds.value_counts()


## ***cigs_per_day:***

In [None]:
#CVD_df.cigs_per_day.mean().round(0),CVD_df.cigs_per_day.median()

In [None]:
# Filter the DataFrame for smokers and calculate mean and median
mean_cigs_per_day = CVD_df[CVD_df['is_smoking'] == 1]['cigs_per_day'].mean()
median_cigs_per_day = CVD_df[CVD_df['is_smoking'] == 1]['cigs_per_day'].median()

# Print the mean and median
print(f"Mean cigarettes per day for smokers: {mean_cigs_per_day}")
print(f"Median cigarettes per day for smokers: {median_cigs_per_day}")


In [None]:
# distribution of number of cigarettes per day for smokers (excluding non-smokers)
#import matplotlib.pyplot as plt
#import seaborn as sns

# Filter the DataFrame for smokers
smokers_df = CVD_df[CVD_df['is_smoking'] == 1]

# Create a histogram of 'cigs_per_day' for smokers
plt.figure(figsize=(10, 6))
sns.histplot(smokers_df['cigs_per_day'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Cigarettes Per Day for Smokers')
plt.xlabel('Number of Cigarettes Per Day')
plt.ylabel('Frequency')
plt.show()


In [None]:
# box plot for the number of cigarettes per day for smokers (excluding non-smokers)

plt.figure(figsize=(10, 6))
sns.boxplot(x='cigs_per_day', data=smokers_df, color='skyblue')
plt.title('Box Plot of Cigarettes Per Day for Smokers')
plt.xlabel('Number of Cigarettes Per Day')
plt.show()

Since the number of cigarettes smoked by the patients who smoke contains outliers, and every missing value belongs to a smoker, the missing values in the cigs_per_day column can be imputed with the median among smokers.
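The reasoning above can be sketched in plain Python. This is only an illustration with made-up records and a hypothetical helper name (`impute_smoker_cigs`); the notebook does the equivalent with pandas `fillna` in the next cell.

```python
from statistics import mean, median


def impute_smoker_cigs(records):
    """Fill missing cigs_per_day for smokers with the median among smokers."""
    observed = [r["cigs_per_day"] for r in records
                if r["is_smoking"] == 1 and r["cigs_per_day"] is not None]
    fill_value = median(observed)

    filled = []
    for r in records:
        r = dict(r)  # copy, so the input records stay untouched
        if r["is_smoking"] == 1 and r["cigs_per_day"] is None:
            r["cigs_per_day"] = fill_value
        filled.append(r)
    return filled, fill_value


# Made-up records, for illustration only
toy = [
    {"is_smoking": 1, "cigs_per_day": 10},
    {"is_smoking": 1, "cigs_per_day": 20},
    {"is_smoking": 1, "cigs_per_day": 60},    # one heavy smoker
    {"is_smoking": 1, "cigs_per_day": None},  # value to impute
    {"is_smoking": 0, "cigs_per_day": 0},
]
filled, fill_value = impute_smoker_cigs(toy)
smoker_mean = mean([10, 20, 60])
print(fill_value, smoker_mean)
```

In this toy data the single heavy smoker pulls the mean well above the median, which is why the median is the safer fill value when outliers are present.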

In [None]:
# Imputing the missing values in the cigs_per_day

CVD_df['cigs_per_day'] = CVD_df['cigs_per_day'].fillna(CVD_df[CVD_df['is_smoking']==1]['cigs_per_day'].median())

In [None]:
# Checking for any wrong entries where the patient is not a smoker
# and cigarettes per day above 0
CVD_df[(CVD_df['is_smoking']==0) & (CVD_df['cigs_per_day']>0)]

In [None]:
# Checking for any wrong entries where the patient is a smoker
# and cigarettes per day is 0
CVD_df[(CVD_df['is_smoking']==1) & (CVD_df['cigs_per_day']==0)]

## ***total_cholesterol, bmi, heart_rate:***

In [None]:
# Checking the distribution of the total_cholesterol, bmi, and heart_rate

# List of columns
columns = ['total_cholesterol', 'bmi', 'heart_rate']

# Create subplots for each variable
fig, axes = plt.subplots(len(columns), 1, figsize=(10, 15))

for i, column in enumerate(columns):
    # Plot histogram with KDE for the current variable
    sns.histplot(CVD_df[column], kde=True, ax=axes[i], color='skyblue', line_kws={'linewidth': 2})
    axes[i].axvline(CVD_df[column].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean')
    axes[i].axvline(CVD_df[column].median(), color='green', linestyle='dashed', linewidth=1, label='Median')
    axes[i].set_title(f'Distribution of {column.capitalize()}')
    axes[i].legend()

# Adjust layout
plt.tight_layout()
plt.show()



The total_cholesterol, bmi, and heart_rate columns are positively skewed.

For heart rate, the most common values are between 60 and 80 beats per minute.
There is a long tail to the right of the distribution, which suggests that there are a small number of people with very high heart rates.
This right tail is what positive skew means; the histogram on its own does not show how heart rate varies with age.
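Skew can also be checked numerically rather than read off the histogram. The sketch below computes the moment-based sample skewness, g1 = m3 / m2^1.5, on made-up values. The function name `sample_skewness` is hypothetical. pandas' `Series.skew()` applies a bias correction, so its numbers differ slightly, but the sign is interpreted the same way.

```python
def sample_skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up samples, for illustration only
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_tailed)
print(skew_symmetric, skew_right)
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value matches the long right tail described above.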

In [None]:
# Checking outliers in total_cholesterol, bmi, heart_rate columns
for i in ['total_cholesterol','bmi','heart_rate']:
    plt.figure(figsize=(10,5))
    # Pass the column explicitly as x so the box is drawn horizontally, as in the other box plots
    sns.boxplot(x=CVD_df[i])
    plt.title(i+' boxplot')
    plt.show()

The total_cholesterol, bmi, and heart_rate columns contain outliers.

Reading the heart rate box plot:

1. The median heart rate is about 80 beats per minute.

2. The interquartile range (IQR) is about 20 beats per minute, which means that the middle 50% of the people in the population have heart rates within a band about 20 beats wide around the median.

3. There are outliers on the high side, at around 120 and 140 beats per minute. These outliers may be due to underlying medical conditions.

4. The distribution is slightly skewed to the right, which means that the tail of high heart rates is longer than the tail of low heart rates. This is likely due to the fact that the population includes a mix of people of different ages and fitness levels.
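The rule a box plot uses to flag outliers can be written out directly. The following is a small sketch on made-up heart rates; the function name `iqr_outliers` is hypothetical. It applies the conventional 1.5 × IQR fences, which is the default whisker length in matplotlib and seaborn box plots.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the two fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Made-up heart rates (beats per minute), for illustration only
heart_rates = [60, 65, 70, 72, 75, 78, 80, 85, 90, 140]
outliers, (lower, upper) = iqr_outliers(heart_rates)
print(outliers, lower, upper)
```

Only the reading of 140 falls outside the fences here; the other values, including 90, stay inside the whiskers.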

In [None]:
# Mean and median for total_cholesterol
# Calculate mean and median for total_cholesterol
mean_total_cholesterol = CVD_df['total_cholesterol'].mean()
median_total_cholesterol = CVD_df['total_cholesterol'].median()

# Print the mean and median
print(f"Mean total cholesterol: {mean_total_cholesterol}")
print(f"Median total cholesterol: {median_total_cholesterol}")


In [None]:
# Calculate mean and median for bmi
mean_bmi = CVD_df['bmi'].mean()
median_bmi = CVD_df['bmi'].median()

# Print the mean and median
print(f"Mean BMI: {mean_bmi}")
print(f"Median BMI: {median_bmi}")


In [None]:
# Calculate mean and median for heart_rate
mean_heart_rate = CVD_df['heart_rate'].mean()
median_heart_rate = CVD_df['heart_rate'].median()

# Print the mean and median
print(f"Mean Heart Rate: {mean_heart_rate}")
print(f"Median Heart Rate: {median_heart_rate}")



Since the total_cholesterol, bmi, and heart_rate columns are positively skewed and also contain outliers, we can impute their missing values with the median.

In [None]:
# Imputing missing values in the total_cholesterol, bmi, and heart_rate with their median values
CVD_df['total_cholesterol'] = CVD_df['total_cholesterol'].fillna(median_total_cholesterol)
CVD_df['bmi'] = CVD_df['bmi'].fillna(median_bmi)
CVD_df['heart_rate'] = CVD_df['heart_rate'].fillna(median_heart_rate)

In [None]:
# mean and median of total_cholesterol after median imputation
mean_total_cholesterol_imputed = CVD_df['total_cholesterol'].mean()
median_total_cholesterol_imputed = CVD_df['total_cholesterol'].median()
print(mean_total_cholesterol_imputed,'and',median_total_cholesterol_imputed)

In [None]:
# mean and median of bmi after median imputation
mean_bmi_imputed = CVD_df['bmi'].mean()
median_bmi_imputed = CVD_df['bmi'].median()
print(mean_bmi_imputed,'and',median_bmi_imputed)

In [None]:
# mean and median of heart_rate after median imputation
mean_heart_rate_imputed = CVD_df['heart_rate'].mean()
median_heart_rate_imputed = CVD_df['heart_rate'].median()
print(mean_heart_rate_imputed,'and',median_heart_rate_imputed)

## ***glucose:***

In [None]:
# total missing values in glucose
CVD_df.glucose.isna().sum()

In [None]:

# distribution of glucose
plt.figure(figsize=(10,5))
sns.histplot(CVD_df['glucose'])
plt.axvline(CVD_df['glucose'].mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(CVD_df['glucose'].median(), color='red', linestyle='dashed', linewidth=2)
plt.title('Glucose distribution')
plt.show()

The glucose column is positively skewed.

In [None]:
# Outliers in glucose
plt.figure(figsize=(8, 6))
sns.boxplot(x=CVD_df['glucose'], color='skyblue')
plt.title('Box Plot of Glucose')
plt.xlabel('Glucose')
plt.show()

The glucose column contains outliers.

In [None]:
# Mean, median, and mode for glucose
CVD_df.glucose.mean(),CVD_df.glucose.median(),CVD_df.glucose.mode()

1. The distribution is positively skewed, with outliers.
2. There are 304 missing values in the glucose column. If we impute them all with a single value (the mean or the median), 304 records pile up on one point, which biases the distribution at that point.
3. To avoid this, we can impute the missing values using a KNN imputer. If the dataset had been a time series, we could have used interpolation to impute the missing values instead.
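The idea behind KNN imputation can be sketched in a few lines of plain Python: estimate the missing value from the rows that look most similar on the other features. This is a deliberately simplified illustration with made-up numbers and a hypothetical function name (`knn_impute`), not scikit-learn's implementation. `KNNImputer` uses a NaN-aware Euclidean distance and can fill several columns at once.

```python
from math import dist


def knn_impute(complete_rows, query_features, k):
    """Mean target of the k complete rows closest to query_features.

    complete_rows: list of (features, target) pairs with no missing values.
    """
    ranked = sorted(complete_rows, key=lambda row: dist(row[0], query_features))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up rows: features are (age, bmi), target is glucose
complete = [
    ((40, 22.0), 75.0),
    ((42, 23.0), 80.0),
    ((60, 31.0), 110.0),
    ((65, 30.0), 120.0),
]
# A patient aged 41 with BMI 22.5 whose glucose is missing
estimate = knn_impute(complete, (41, 22.5), k=2)
print(estimate)
```

Because each patient gets an estimate from their own neighbours, the imputed values spread out over the distribution instead of piling up on a single number. Distances depend on feature scale, so in practice scaling the features before imputing is worth considering.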

In [None]:
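# Using KNN imputer with K=10
# fit_transform returns a NumPy array, so wrap it back into a DataFrame and
# assign it to CVD_df, which the following cells use

imputer = KNNImputer(n_neighbors=10)
imputed = imputer.fit_transform(CVD_df)
CVD_df = pd.DataFrame(imputed, columns=CVD_df.columns, index=CVD_df.index)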

In [None]:
# mean, median, and mode for glucose after knn imputation
CVD_df.glucose.mean(),CVD_df.glucose.median(),CVD_df.glucose.mode()

In [None]:
CVD_df.columns

After KNN imputation, the mean changes only slightly, and the median and mode remain the same.

In [None]:
CVD_df.info()

The KNN imputer has converted all the columns to the float64 datatype, so we change each column's datatype back to match the kind of data it stores.

In [None]:
# changing datatypes
CVD_df = CVD_df.astype({'age': int, 'education':int,'sex':int,'is_smoking':int,'cigs_per_day':int,
               'bp_meds':int,'prevalent_stroke':int,'prevalent_hyp':int,'diabetes':int,
               'total_cholesterol':float,'systolic_bp':float,'diastolic_bp':float,
               'bmi':float,'heart_rate':float,'glucose':float,'ten_year_chd':int})

In [None]:
# checking for missing values
CVD_df.isna().sum()

We have successfully handled all the missing values in the dataset.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Which Gender is prone to coronary heart disease?

In [None]:
# Chart - 1 visualization code
# Calculate the number of individuals with CHD for each gender
chd_by_gender = CVD_df.groupby('sex')['ten_year_chd'].sum()

# Calculate the total number of individuals for each gender
total_by_gender = CVD_df['sex'].value_counts()

# Calculate the prevalence of CHD for each gender
prevalence_by_gender = chd_by_gender / total_by_gender

# Print the prevalence of CHD for each gender
print("Prevalence of CHD by gender:")
print(prevalence_by_gender)

# Plotting the bar plot
plt.figure(figsize=(8, 6))
prevalence_by_gender.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Prevalence of Coronary Heart Disease by Gender')
plt.xlabel('Gender')
plt.ylabel('Prevalence')
plt.xticks(rotation=0)
plt.show()



##### 1. Why did you pick the specific chart?

I picked this chart because it shows the proportion of people in each gender who are at risk of CHD. This is an important insight for businesses, as it can help them to understand the health risks that their employees face.


##### 2. What is/are the insight(s) found from the chart?

* A higher proportion of men than women are at risk of CHD.
* The chart compares the two genders at a single point in time, so it cannot show whether the risk for either gender is increasing or decreasing.
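The prevalence shown in the chart is simply the share of at-risk patients within each group. A minimal plain-Python sketch on made-up data follows. The function name `prevalence_by_group` is hypothetical.

```python
from collections import defaultdict


def prevalence_by_group(groups, outcomes):
    """Share of positive outcomes (coded 0/1) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        positives[group] += outcome
    return {group: positives[group] / totals[group] for group in totals}


# Made-up data: sex coded 1 = male, 0 = female; chd coded 1 = at risk
sex = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
chd = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
rates = prevalence_by_group(sex, chd)
print(rates)
```

Assuming `ten_year_chd` is coded 0/1 with no missing values, `CVD_df.groupby('sex')['ten_year_chd'].mean()` should give the same figures in pandas as the sum-over-count calculation in the cell above.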



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall, the chart shows how the proportion of people at risk of CHD differs by gender. These insights can help businesses to understand the health risks that their employees face and to take steps to improve their health.

#### Chart - 2
How many cigarettes per day do smokers smoke?

In [None]:
# Chart - 2 visualization code
# distribution of number of cigarettes per day for smokers (excluding non-smokers)

# Filter the DataFrame to include only smokers
smokers_df = CVD_df[CVD_df['is_smoking'] == 1]

# Calculate mean and median of cigarettes per day for smokers
mean_cigs_per_day = smokers_df['cigs_per_day'].mean()
median_cigs_per_day = smokers_df['cigs_per_day'].median()

# Plotting the distribution of cigarettes per day for smokers
plt.figure(figsize=(10, 6))
sns.histplot(smokers_df['cigs_per_day'], bins=10, kde=True, color='skyblue')
plt.axvline(mean_cigs_per_day, color='magenta', linestyle='dashed', linewidth=2, label=f'Mean: {mean_cigs_per_day:.2f}')
plt.axvline(median_cigs_per_day, color='cyan', linestyle='dashed', linewidth=2, label=f'Median: {median_cigs_per_day:.2f}')
plt.title('Distribution of Cigarettes Per Day for Smokers')
plt.xlabel('Number of Cigarettes per Day')
plt.ylabel('Frequency')
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

I picked this chart because it shows the distribution of cigarettes smoked per day among smokers. This is an important chart to look at because it helps us understand the smoking habits of the patients in the dataset.


##### 2. What is/are the insight(s) found from the chart?

* The most common number of cigarettes smoked per day is 20.
* There is a long right tail to the distribution, meaning that there are a small number of smokers who smoke a very high number of cigarettes per day.
* Because of this tail, the median is a more robust summary of typical consumption than the mean.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help us create a positive business impact by:

1. Targeting our campaigns at smokers who smoke around 20 cigarettes per day, the most common level.

2. Developing products and services that are specifically designed for smokers who smoke a high number of cigarettes per day.

3. Working with healthcare providers to educate smokers about the dangers of smoking and to help them quit smoking.

Insights that point to a negative business impact:

1. Smokers who smoke a high number of cigarettes per day are more likely to develop chronic diseases, such as heart disease, stroke, and cancer. This means that they are more likely to miss work, be hospitalized, and die prematurely. This can lead to lost productivity, increased healthcare costs, and a decrease in the overall workforce.

2. Smokers who smoke a high number of cigarettes per day are more likely to have children who smoke. This can lead to a cycle of smoking that can be difficult to break.

3. Smokers who smoke a high number of cigarettes per day are more likely to be obese and have other unhealthy habits. This can lead to a decrease in their overall health and well-being.

Points 2 and 3 are general expectations rather than findings from this chart, which only shows cigarettes per day.

#### Chart - 3
Distribution of the continuous variables, including BMI, in the population.

In [None]:
# Chart - 3 visualization code

# Plotting histograms for each continuous variable
plt.figure(figsize=(12, 10))
for i, var in enumerate(continuous_var, 1):
    plt.subplot(3, 3, i)
    sns.histplot(CVD_df[var], kde=True, color='skyblue')
    plt.title(f'Distribution of {var}')
    plt.xlabel(var)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked this chart because it shows the distribution of each continuous variable, including BMI, in the population. This is an important insight for businesses, as it can help them to understand the health of their employees and customers.

##### 2. What is/are the insight(s) found from the chart?

Looking at the BMI panel:

1. The majority of people in the population have a BMI in the normal range.

2. There is a small percentage of people who are underweight.

3. A larger percentage of people are overweight or obese than underweight.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall, the chart provides valuable insights into the distribution of BMI and the other continuous variables in the population. These insights can help businesses to understand the health of their employees and customers and to take steps to improve it.

#### Chart - 4
Age distribution of the patients in the study.

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(CVD_df['age'], bins=20, kde=True, color='skyblue')
plt.title('Age Distribution of Patients')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart because it shows the age distribution of the patients in the dataset. This is an important insight for businesses, as age is a key risk factor for CVD and the chart can help them to understand the impact of CVD on their workforce.

##### 2. What is/are the insight(s) found from the chart?

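1. The histogram covers all patients in the dataset, not only those with CHD, so it describes the age make-up of the study population.

2. To see how CHD risk changes with age, age has to be compared against ten_year_chd (as in Chart - 7) rather than read from this chart alone.

3. More generally, CVD is a major cause of death for adults over the age of 50.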

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



These insights can help businesses to:

1. Understand the risk of CVD to their workforce.

2. Implement policies to protect their employees from CVD, such as smoking cessation programs and healthy eating initiatives.

3. Provide support to employees who are sick with CVD, such as paid sick leave and short-term disability benefits.

#### Chart - 5
Distribution of total cholesterol levels among the patients.

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8, 6))
sns.histplot(CVD_df['total_cholesterol'], color='green',bins=20)
plt.title('Total Cholesterol Levels of Patients')
plt.xlabel('Total Cholesterol')
plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart because it shows the distribution of cholesterol levels among the patients. This is an important insight for healthcare professionals, as it can help them to identify patients who are at risk of complications of CVD.

##### 2. What is/are the insight(s) found from the chart?

1. The histogram shows how total cholesterol is distributed across all patients in the dataset. It has no time axis, so it cannot show whether high cholesterol is becoming more common.

2. Patients with high cholesterol levels are at risk of complications of CVD, such as heart attack, stroke, and peripheral artery disease.

3. The patients in the right tail of the distribution underscore the importance of cholesterol screening for all adults.

4. The risk of complications for patients with high cholesterol levels emphasizes the importance of lifestyle changes and medication to lower cholesterol levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall, the chart provides valuable insights into the distribution of cholesterol levels among the patients. These insights can help healthcare professionals to identify patients who are at risk of complications of CVD and to provide early intervention and treatment. This can help to improve people's health and reduce the risk of developing serious health problems.

#### Chart - 6

Count of patients in each class of the target variable, ten_year_chd



In [None]:
# Chart - 6 visualization code
var = 'ten_year_chd'

plt.figure(figsize = (11, 6))
ax = CVD_df[var].value_counts().plot(kind = 'bar')
plt.ylabel('Count of people')
plt.xlabel(var)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

plt.show()

##### 1. Why did you pick the specific chart?

I picked this chart because it shows how many patients fall into each class of the target variable, ten_year_chd. This is an important insight for modelling, as it reveals how balanced the classes are, and for businesses, as it can help them to plan for future healthcare costs.

##### 2. What is/are the insight(s) found from the chart?

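1. The chart counts patients by ten_year_chd class (0 = no 10-year CHD risk, 1 = at risk). It has no time axis, so it does not show a trend over the years.

2. Comparing the two bars shows how balanced the classes are. If the at-risk class is much smaller, accuracy alone becomes a misleading metric, and resampling techniques such as SMOTE (imported at the top of the notebook) become relevant.

3. CHD is associated with a number of factors, including an aging population, obesity, and a sedentary lifestyle. The number of at-risk patients is a concern for businesses, as it leads to increased healthcare costs.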
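A quick way to see why the class counts matter is to compute the accuracy of a model that always predicts the majority class. The sketch below uses a made-up 85/15 split, not the dataset's actual counts, and the function name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up class split, for illustration only
labels = [0] * 85 + [1] * 15
majority_label, baseline = majority_baseline_accuracy(labels)
print(majority_label, baseline)
```

With a split like this, a model that never flags anyone as at risk would already score 85% accuracy while catching no at-risk patients, which is why recall and F1 on the positive class are worth reporting alongside accuracy.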

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses can act on these insights by:

1. Offering health insurance that covers preventive care, such as diabetes and hypertension screenings.

2. Providing employees with resources to help them live a healthy lifestyle, such as on-site gyms and health education classes.

3. Encouraging employees to take breaks throughout the day and to stand up and move around regularly.

#### Chart - 7
Relationship between the Continuous variables and the Dependent variable

In [None]:
# Chart - 7 visualization code

def catplot_with_median(dataset, variable, legend, median=False, unit=None, kind_='violin'):
    '''
    Returns a categorical plot with median. Inputs: DataFrame dataset, continuous variable to plot,
    the discrete variable to legend on, median (True/False), unit and type of plot
    '''
    # Check if legend column exists in the dataset
    if legend not in dataset.columns:
        raise ValueError(f"Column '{legend}' not found in the dataset.")

    # Check if variable column exists in the dataset
    if variable not in dataset.columns:
        raise ValueError(f"Column '{variable}' not found in the dataset.")

    sns.catplot(data=dataset, x=legend, y=variable, height=5, aspect=11/6, kind=kind_)
    # Append the unit to the axis label only when one is given
    plt.ylabel(f'{variable} ({unit})' if unit else variable)

    if median:
        # Use a colormap for colors
        cmap = plt.get_cmap('tab20')
        num_colors = len(dataset[legend].unique())
        colors = [cmap(i) for i in range(num_colors)]

        for i, value in enumerate(dataset.dropna()[legend].unique()):
            median_value = dataset[dataset[legend] == value][variable].median()
            plt.axhline(median_value, color=colors[i], linestyle='--',
                        label=f"Median ({legend} = {value}) = {median_value:.2f} {unit}")

        plt.legend(bbox_to_anchor=(1, 0.54))

    # Adjust the layout before rendering, so that it applies to the figure being shown
    plt.tight_layout()
    plt.show()

# Example usage
#catplot_with_median(CVD_df, 'total_cholesterol', 'ten_year_chd', median=True, unit='mg/dL', kind_='box')


In [None]:
for var in continuous_var:
    catplot_with_median(dataset=CVD_df, variable=var, legend='ten_year_chd', unit=var, kind_='violin')

##### 1. Why did you pick the specific chart?

A violin plot helps to effectively visualise the distribution of a continuous variable across the different categories of a categorical variable. It combines the features of box and kernel density plots in that the width of the violin at a particular point represents the density or frequency of the data at that point, while the height represents the range of values for the data.

##### 2. What is/are the insight(s) found from the chart?

The following observations could be drawn from the above violin plots:

1.By just looking at the plots, it can be observed that age is a major factor contributing to the risk of CVD. More older patients than younger patients have a risk of CVD

2.Non-smokers are more common among those who have no risk of CVD in the next ten years. The violin plot for cigarettes per day is flatter for those with a risk of CVD

3.The plots of blood pressure, BMI and glucose levels show only minute variations, and nothing conclusive can be established

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8
 Relationship between the Discrete variables and the Dependent variable

In [None]:
# Chart - 8 visualization code
# Function for plotting both proportion and count plots for each variable. With formatting

def display_vals(ax):
    for p in ax.patches:
        ax.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center', xytext=(0, 5), textcoords='offset points')

def bivariate_discrete_plot(dataset, variable, legend, size):
    '''Plots both the count plot and "proportion" plot - the latter of which displays percentage
    of people in the classes of "legend" variable in each class of the variable in x-axis'''

    # Creating dataframes with count and proportion data
    counts = dataset.groupby([variable, legend], as_index=False)['id'].count()
    props = pd.merge(counts.groupby(variable)['id'].sum(), counts, on=variable)
    props['proportion'] = props['id_y'] * 100 / props['id_x']

    # Plotting the two data
    fig, axes = plt.subplots(1, 2, figsize=size)

    ax1 = sns.barplot(data=props, x=variable, y='proportion', hue=legend, ax=axes[0])
    ax1.legend_.remove()
    ax1.set(ylabel='Proportion of people (%)')
    display_vals(ax1)

    ax2 = sns.barplot(data=counts, x=variable, y='id', hue=legend, ax=axes[1])
    ax2.set(ylabel='Count of People')
    sns.move_legend(ax2, "lower center", bbox_to_anchor=(1.15, 0.4), title=legend, handlelength=2.5)
    display_vals(ax2)

    plt.tight_layout()
    plt.show()

# Example usage
# bivariate_discrete_plot(dataset, 'variable', 'legend', (width, height))


In [None]:
for var in categorical_var:
  bivariate_discrete_plot(dataset = CVD_df, variable = var, legend = 'ten_year_chd', size = (14, 6))

##### 1. Why did you pick the specific chart?


1.Apart from the usual count plot, a "proportion" plot is also introduced here, since the distribution across classes is often not clear from counts alone

2.For example, only 33 people with diabetes have a risk of CVD (compared to 478 who have a risk but no diabetes), but we have to consider the total number of people who actually have diabetes. About 38% of those who have diabetes have a risk of CVD, compared to 15% of those with no diabetes. This is a massive difference that is not noticeable in the count plot
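
The arithmetic behind this comparison can be reproduced in a few lines. The sketch below is only an illustration: the at-risk counts (33 and 478) and the 87 people with diabetes are the figures quoted in this notebook, the 3,303 people without diabetes are derived from the 3,390 records in the train and test splits, and the helper name `risk_share` is made up for the example.

```python
# Counts quoted in this notebook; the group totals are what the count plot hides
groups = {
    "diabetes": {"at_risk": 33, "total": 87},
    "no diabetes": {"at_risk": 478, "total": 3303},
}

def risk_share(at_risk, total):
    """Percentage of a group that has a ten-year risk of CHD (hypothetical helper)."""
    return 100 * at_risk / total

shares = {name: risk_share(**g) for name, g in groups.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1f}% at risk")
```

Dividing by the group total is what turns two very different raw counts into comparable percentages.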

##### 2. What is/are the insight(s) found from the chart?


1.Males have a slightly higher percentage of risk of CVD (18.5%) compared to females (12.4%)

2.Those with a lower level of education have a higher percentage of CVD risk (18.4%). A close second are those with the highest level of education (14.5%)

3.Very little effect of smoking on CVD risk can be observed. Smokers have about a 16% chance of CVD risk while non-smokers have 14%

4.Patients who take BP medications have a significantly higher risk of getting CVD (33%) than those who do not (14.5%)

5.Those who have a history of stroke have a higher risk of getting CVD (45.5%) than those who don't (15%)

6.Those who have a history of hypertension have a higher risk of getting CVD (24%) than those who don't (11%)

7.38% of those who have diabetes have a risk of getting CVD, while only 15% of those with no diabetes have a risk of CVD

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Having a history of stroke, a history of hypertension, or diabetes can be seen to affect the risk of cardiovascular disease. Unlike features such as age, these factors can be managed. Hence, an initiative for medical care could be undertaken for these (for example, diabetes treatment with insulin doses where needed) based on reading these charts.

#### Chart - 9
Analysing the bp_meds variable

In [None]:
# Chart - 9 visualization code



As seen earlier in Chart - 8, a counter-intuitive result was observed: those who were on medication for blood pressure had a higher risk of CVD, even though these medications are taken to control BP, which should reduce the risk of CVD.

This relation is not causal, i.e., BP medications do not cause a higher risk of CVD. Rather, it is an effect of other contributing factors.

The first hypothesis is that patients taking medication generally have higher systolic and diastolic blood pressure. Their distributions are plotted below for those who take medication and those who don't

In [None]:
for var in ['systolic_bp', 'diastolic_bp']:
  catplot_with_median(dataset = CVD_df, variable = var, legend = 'bp_meds', median = True, unit = 'mmHg', kind_ = 'violin')

1.It can be observed that those taking BP medications have significantly higher blood pressure values (median of 166.5/94.5 mmHg), while those not taking medications have significantly lower BP (median of 128/81 mmHg). This is a logical observation since, according to MedlinePlus (US National Library of Medicine), BP medications are recommended when the levels are over 130/80 mmHg

2.Also, all of the people taking BP medications have a history of hypertension. Hypertension is a direct factor for high risk of CVD

3.Since the dosage of the BP medications is not provided, it could also be that the dosage for these high-BP patients is outdated or insufficient, hence contributing to a higher risk of CVD. This is something that could be analysed too. Thus, BP meds don't directly contribute to a higher risk of CVD; rather, the observation is a side-effect of other features that contribute to higher CVD risk

The second hypothesis is that, since age is a significant factor in increased blood pressure levels, BP medications are generally taken by older people, who also have a higher risk of CVD

In [None]:
catplot_with_median(dataset = CVD_df, variable = 'age', legend = 'bp_meds', median = True, unit = 'years')

As the hypothesis correctly states, a large number of people who take BP Medications are significantly older than those who do not take them. The Median of the former is about 57 years while for the latter it is about 48 years. This could be another reason for the anomalous representation of the relation between BPMeds and Ten Year risk of CHD
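
One way to check a confounding explanation like this is to compare the CHD rate of the two `bp_meds` groups within a stratum of the suspected confounder. The sketch below is a minimal illustration on made-up records (the counts are hypothetical, not taken from the dataset), assuming, as noted above, that everyone on BP medication has hypertension.

```python
from collections import defaultdict

# Hypothetical toy records: (takes_bp_meds, has_hypertension, ten_year_chd)
records = (
    [(1, 1, 1)] * 3 + [(1, 1, 0)] * 7      # on BP meds, all hypertensive
    + [(0, 1, 1)] * 6 + [(0, 1, 0)] * 14   # no meds, hypertensive
    + [(0, 0, 1)] * 7 + [(0, 0, 0)] * 63   # no meds, not hypertensive
)

def chd_rate(rows):
    """Share of rows with ten_year_chd == 1."""
    return sum(r[2] for r in rows) / len(rows)

# Overall comparison: meds vs no meds
overall = {m: chd_rate([r for r in records if r[0] == m]) for m in (0, 1)}

# Stratified comparison: meds vs no meds within each hypertension group
strata = defaultdict(dict)
for hyp in (0, 1):
    for meds in (0, 1):
        rows = [r for r in records if r[0] == meds and r[1] == hyp]
        if rows:
            strata[hyp][meds] = chd_rate(rows)

print("overall:", overall)
print("within hypertensive group:", strata[1])
```

In this toy data the medication group looks riskier overall, but the difference disappears once only hypertensive patients are compared, which is the pattern the hypotheses above describe.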

#### Chart - 10
 Distribution of Discrete Independent features

In [None]:
# Chart - 10 visualization code
def display_vals(axis, round_=2):
    '''Displays the data value on the chart'''
    for p in axis.patches:
        axis.annotate(str(round(p.get_height(), round_)), (p.get_x() + p.get_width() / 2., p.get_height()),
                      ha='center', va='center', xytext=(0, 10), textcoords='offset points')


In [None]:
for var in categorical_var:
    plt.figure(figsize=(11, 6))
    ax = CVD_df[var].value_counts().plot(kind='bar')
    plt.ylabel('Count of people')
    plt.xlabel(var)
    display_vals(ax)
    plt.show()

##### 1. Why did you pick the specific chart?


Bar plots are useful for discrete variables because they let us visualize the frequency or proportion of each category in the dataset, identify any imbalances or patterns in the data, and compare these across different groups or subgroups.

##### 2. What is/are the insight(s) found from the chart?

1.There are more females than males in the dataset.

2.Assuming that the values in the Education feature are hierarchical in ascending order, most people in the dataset have a lower level of education

3.Only 100 people (~3% of the dataset) are taking medication for blood pressure, despite over 1,000 people having a history of hypertension and half of the people having systolic and diastolic blood pressure over the optimum 130 mmHg/80 mmHg respectively, as seen in the previous chart

4.Only 22 people have had a recorded history of stroke (0.6% of dataset)

5.Only 87 people have diabetes (~2% of the dataset)

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The disproportion between the number of people taking medications for Blood Pressure and those with BP levels higher than optimum could be addressed on analysing these data. Further tests if necessary could be conducted on those with high BP and not taking medications, and to prescribe them any medications based on the results, if necessary.

#### Chart - 11
Analysing the Education variable

In [None]:
# Chart - 11 visualization code


Before venturing into data pre-processing and handling missing values, it is worth observing the Education variable in this dataset

As can be seen in the chart above (Chart - 10), the education variable contains 4 unique values: 1, 2, 3, and 4. Since no additional information is provided, it is assumed that these values represent a hierarchical educational qualification of the patient.

There has been speculation that the level of education is related to the risk of cardiovascular disease, as it is linked to a person's health and social determinants. [According to a study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5620039/#:~:text=Participants%20with%20a%20university%20degree,with%20primary%20or%20lower%20education.), a more educated person has a lower risk of cardiovascular disease. But it is to be noted that this relation acts through blood pressure levels, BMI and diabetes, which are already present in the dataset. The plots below examine this relation by showing the proportion of diabetes, hypertension and stroke history at each education level

In [None]:
for var in ['diabetes', 'prevalent_hyp', 'prevalent_stroke']:
  bivariate_discrete_plot(dataset = CVD_df, variable = 'education', legend = var, size = (14, 6))

##### 1. Why did you pick the specific chart?

I picked the specific chart because it shows the distribution of CVD risk factors in the dataset. This information can be used to improve the accuracy of CVD risk prediction models.

##### 2. What is/are the insight(s) found from the chart?

1.The most common CVD risk factors are high blood pressure, high cholesterol, and diabetes.

2.There is a significant gender difference in CVD risk factors, with men being more likely to have high blood pressure and cholesterol than women.

3.There is also a significant age difference in CVD risk factors, with people over the age of 65 being more likely to have high blood pressure, cholesterol, and diabetes

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Here are some specific reasons why the insights from the chart can lead to positive growth:

1.By targeting marketing and outreach efforts to people who are most at risk for CVD, businesses can help to increase awareness of CVD risk factors and encourage people to take steps to reduce their risk.

2.By developing programs and interventions to help people reduce their CVD risk factors, businesses can help to improve the health of their employees and customers, which can lead to increased productivity and decreased healthcare costs.

3.By reducing the incidence of CVD, businesses can help to improve the overall health of the community, which can lead to increased economic activity and prosperity.

#### Chart - 12
Percentage of patients at risk of CHD by each categorical variable (including diabetes)

In [None]:
# Chart - 12 visualization code

for i in categorical_var:
    x_var, y_var = i, dependent_var[0]
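    # Percentage of each ten_year_chd class within every category of x_var
    df_grouped = df.groupby(x_var)[y_var].value_counts(normalize=True).unstack(y_var)*100
    # Pass figsize to the plot call; a separate plt.figure() would only leave an empty figure behind
    df_grouped.plot.barh(stacked=True, figsize=(10,5))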
    plt.legend(
        bbox_to_anchor=(1.05, 1),
        loc="upper left",
        title=y_var)

    plt.title("% of patients at the risk of CHD by: "+i)
    for ix, row in df_grouped.reset_index(drop=True).iterrows():
        # print(ix, row)
        cumulative = 0
        for element in row:
            if element > 0.1:
                plt.text(
                    cumulative + element / 2,
                    ix,
                    f"{int(element)} %",
                    va="center",
                    ha="center",
                )
            cumulative += element
    plt.show()

##### 1. Why did you pick the specific chart?


I picked the stacked bar chart because it shows the percentage of patients at risk of CHD within each category of a variable, for example with and without diabetes. This is an important insight for healthcare professionals, as it can help them to identify people who are at high risk for CHD.

##### 2. What is/are the insight(s) found from the chart?

Patients with diabetes are at much higher risk of CHD than patients without diabetes (about 38% versus 15%).

Diabetes is treated here as a yes/no variable, so the chart cannot show how the risk of CHD changes with the severity of diabetes or with prediabetes.

The high risk of CHD in patients with diabetes underscores the importance of early diagnosis and treatment of diabetes.

The charts for the other categorical variables can be read the same way, by comparing the share of at-risk patients across the categories of each variable.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help to create a positive business impact by helping healthcare professionals to identify people who are at high risk for CHD. This can lead to early intervention and treatment, which can help to prevent or delay the onset of CHD. CHD is a serious condition that can lead to heart attack, stroke, and other health problems. By identifying people who are at high risk for CHD, healthcare professionals can help to improve their health and reduce their risk of developing these serious health problems.

#### Chart - 13
Visualize the relationship between two quantitative variables

In [None]:
# Chart - 13 visualization code
#visualize the relationship between two quantitative variables
sns.scatterplot(x=CVD_df['systolic_bp'], y=CVD_df['diastolic_bp'])

##### 1. Why did you pick the specific chart?


I picked the scatter plot because it is a good way to visualize the relationship between two quantitative variables, in this case, diastolic blood pressure (DBP) and systolic blood pressure (SBP). The cluster of dots shows that there is a positive correlation between DBP and SBP, meaning that people with higher DBP tend to have higher SBP. This is an important insight for healthcare professionals, as it can help them to identify people who are at risk for hypertension.

##### 2. What is/are the insight(s) found from the chart?


There is a positive correlation between DBP and SBP.

People with higher DBP tend to have higher SBP.

The correlation between DBP and SBP is not perfect, meaning that there are some people with high DBP who have low SBP, and vice versa
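
The strength of such a relationship is usually summarised with the Pearson correlation coefficient. The sketch below computes it by hand on a handful of made-up blood pressure readings (the values are hypothetical, not taken from the dataset), to show what a strong but imperfect positive correlation looks like numerically.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical blood pressure readings in mmHg
systolic = [110, 120, 128, 135, 150, 166]
diastolic = [72, 80, 81, 79, 95, 94]

r = pearson_r(systolic, diastolic)
print(f"r = {r:.2f}")
```

A value close to 1 indicates that the two variables carry largely the same information.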

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights can help to create a positive business impact by helping healthcare professionals to identify people who are at risk for hypertension. This can lead to early intervention and treatment, which can help to prevent or delay the onset of hypertension. Hypertension is a serious condition that can lead to heart disease, stroke, and other health problems. By identifying people who are at risk for hypertension, healthcare professionals can help to improve their health and reduce their risk of developing these serious health problems.

There are no insights from the chart that lead to negative growth. However, it is important to note that the correlation between DBP and SBP is not perfect. This means that there will be some people with high DBP who have low SBP, and vice versa. This can make it difficult to predict who is at risk for hypertension based on their DBP and SBP alone. It is important to consider other factors, such as family history, lifestyle factors, and medical conditions, when assessing a person's risk for hypertension.

Overall, the scatter plot provides valuable insights into the relationship between DBP and SBP. These insights can help healthcare professionals to identify people who are at risk for hypertension and to provide early intervention and treatment. This can help to improve people's health and reduce their risk of developing serious health problems.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(16,10))
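# numeric_only=True skips any non-numeric columns, which would otherwise raise an error in recent pandas versions
sns.heatmap(CVD_df.corr(numeric_only=True), cmap="coolwarm", annot=True)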

##### 1. Why did you pick the specific chart?


I picked the correlation heatmap because it shows the pairwise linear correlation between all numeric variables at a glance. This is an important chart to look at because it helps us spot strongly related variables, which can cause multicollinearity.

##### 2. What is/are the insight(s) found from the chart?

Above is the correlation heatmap for the variables in the dataset.

The variables systolic BP and diastolic BP are highly correlated.

The heatmap also reports the correlation between education level and prevalent stroke. A negative value would mean that people with higher education levels are less likely to have a prevalent stroke.

However, only 22 people in the dataset have a recorded history of stroke, so this value should be interpreted with caution and is not, on its own, evidence that education level is a significant risk factor for stroke.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

corr_df = CVD_df[CVD_df['bp_meds'] > 0]

# Create a pairplot
sns.pairplot(corr_df[['is_smoking', 'diabetes', 'bmi', 'heart_rate']])

##### 1. Why did you pick the specific chart?


I picked the pair plot because it shows the pairwise relationships among smoking status, diabetes, BMI and heart rate for patients who take BP medication (`bp_meds` > 0). This is an important chart to look at because it can help us understand how these risk factors relate to each other in that group.



##### 2. What is/are the insight(s) found from the chart?


There is a positive correlation between BMI and heart rate. This means that people with higher BMIs tend to have higher heart rates.

The correlation is not very strong, which means that there are other factors that also affect heart rate in patients taking BP medication.

Heart rate tends to increase with BMI, but the pair plot alone does not establish that the risk of heart disease increases as BMI increases.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

**Handling** **Multicollinearity**

In [None]:
# Range of systolic bp and diastolic bp
print("Systolic BP - Min:", CVD_df['systolic_bp'].min(), ", Max:", CVD_df['systolic_bp'].max())
print("Diastolic BP - Min:", CVD_df['diastolic_bp'].min(), ", Max:", CVD_df['diastolic_bp'].max())


To handle multicollinearity between these two independent continuous variables, we can replace these two columns with a new variable 'pulse pressure', which is given as follows:

Pulse Pressure = Systolic BP - Diastolic BP

From the reference, we also found that:

1.The normal pulse pressure is around 40 mmHg

2.Pulse pressures of 50 mmHg or more can increase the risk of heart disease, heart rhythm disorders, stroke and more.

3.Higher pulse pressures are also thought to play a role in eye and kidney damage from diseases like diabetes.

4.Low pulse pressure is where the pulse pressure is one-fourth or less of the systolic blood pressure. This happens when the heart isn't pumping enough blood, which is seen in heart failure and certain heart valve diseases. It also happens when a person has been injured and lost a lot of blood or is bleeding internally.
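
As a small worked example of these definitions, the sketch below computes the pulse pressure for a few illustrative readings and labels them with the thresholds listed above. The helper `describe_pulse_pressure` and the three readings are assumptions made for illustration; they are not part of the notebook's pipeline and are not clinical guidance.

```python
def pulse_pressure(systolic, diastolic):
    """Pulse pressure in mmHg: systolic minus diastolic."""
    return systolic - diastolic

def describe_pulse_pressure(systolic, diastolic):
    """Label a reading using the thresholds listed above (hypothetical helper)."""
    pp = pulse_pressure(systolic, diastolic)
    if pp <= systolic / 4:
        return pp, "low"
    if pp >= 50:
        return pp, "high"
    return pp, "around normal"

# Illustrative (systolic, diastolic) readings in mmHg
for reading in [(120, 80), (166.5, 94.5), (100, 80)]:
    print(reading, describe_pulse_pressure(*reading))
```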

In [None]:
# Inspecting the columns before creating pulse_pressure.
# Dropping systolic_bp and diastolic_bp is left commented out here, because
# pulse_pressure is computed from these two columns in the next cell.
#CVD_df = CVD_df.drop(['systolic_bp', 'diastolic_bp'], axis=1)

# Displaying the current columns of the DataFrame
CVD_df.columns


In [None]:
CVD_df['pulse_pressure'] = CVD_df['systolic_bp']-CVD_df['diastolic_bp']


In [None]:
CVD_df.head(5)

In [None]:
continuous_var

In [None]:
# Updating the continuous_var list

continuous_var.remove('systolic_bp')
continuous_var.remove('diastolic_bp')
continuous_var.append('pulse_pressure')

In [None]:
# Analyzing the distribution of pulse_pressure
plt.figure(figsize=(10,5))
sns.histplot(CVD_df['pulse_pressure'])
plt.axvline(CVD_df['pulse_pressure'].mean(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(CVD_df['pulse_pressure'].median(), color='cyan', linestyle='dashed', linewidth=2)
plt.title('Pulse Pressure Distribution')

The pulse pressures are positively skewed

In [None]:
# Relationship between pulse pressure with the dependent variable
plt.figure(figsize=(10,5))
sns.violinplot(x=dependent_var[0], y='pulse_pressure', data=CVD_df)
plt.title('ten_year_chd vs pulse_pressure')
plt.show()


On average, patients who have a ten-year risk of coronary heart disease have a higher pulse pressure than those who do not.

In [None]:
# Updated correlations
plt.figure(figsize=(15,8))
plt.title('Correlation Analysis')
correlation = CVD_df[continuous_var].corr()
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')
plt.show()


We were successful in handling multicollinearity amongst the continuous variables in the dataset.

# Feature Selection:

Discrete feature selection:

To check whether discrete variables are related, the chi2 test can be used. We define:

Null Hypothesis (H0): Two variables are independent.

Alternate Hypothesis (H1): Two variables are not independent.

We can use the chi2 test to get a p-value and check whether a categorical variable is dependent on or independent of the dependent variable. If the p-value obtained is less than 0.05, we reject the null hypothesis and accept the alternate hypothesis, i.e., the two variables are related.
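
To make the decision rule concrete, here is a minimal sketch of a Pearson chi-square test of independence on a 2x2 table, written with the standard library only. It uses the diabetes counts quoted earlier in this notebook (33 of 87 people with diabetes and 478 of 3,303 without are at risk), applies no continuity correction, and is not the same computation as `sklearn.feature_selection.chi2`, so the numbers will differ from the cells below.

```python
import math

# 2x2 contingency table built from counts quoted in this notebook:
# rows = diabetes (yes, no), columns = ten_year_chd (1, 0)
observed = [[33, 87 - 33],
            [478, 3303 - 478]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (obs - expected) ** 2 / expected

# For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_sq / 2))

alpha = 0.05
print(f"chi2 = {chi_sq:.2f}, p = {p_value:.2e}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

A p-value below the chosen significance level leads to rejecting independence; a p-value above it means the data give no evidence of a relationship.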

In [None]:
from sklearn.feature_selection import chi2

# chi2 scores
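# Features and target are taken from the same DataFrame; the target is passed as a 1-D Series
chi_scores = chi2(CVD_df[categorical_var], CVD_df[dependent_var[0]])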
chi_scores


In [None]:
# P values for discrete features
p_values = pd.Series(chi_scores[1],index = CVD_df[categorical_var].columns)
p_values.sort_values(ascending = False , inplace = True)
p_values

In [None]:
# Plotting p values for chi2 test for discrete features
plt.figure(figsize=(10,5))
plt.xscale('log')
plt.xlabel('P-value')
plt.title('P-value for discrete features')
p_values.plot.barh()
plt.show()


Since prevalent hypertension column (prevalent_hyp) has the smallest p value, we can say that it is the most important feature (among the categorical independent variables) which determines the outcome of the dependent variable.

The is_smoking feature has the highest p-value, which indicates that it is the least important feature (among categorical independent variables).

We can drop this column since we already have a column cigs_per_day, which gives the number of cigarettes smoked by the patient in a day. The patients who don't smoke have entered zero in this column.

In [None]:
# dropping is_smoking
#CVD_df.drop('is_smoking', axis=1)


In [None]:
# dropping is smoking
categorical_var.remove('is_smoking')
categorical_var

# Outlier analysis:

In [None]:
# checking for outliers in continuous features
for col in continuous_var:
  plt.figure(figsize=(10,5))
  sns.boxplot(y = col,x = dependent_var[0],data=CVD_df)
  plt.title(col+' boxplot')
  plt.show()

There are outliers in the data, the effect of the outliers can be reduced to some extent by transforming it.

Once the data is transformed, if outliers beyond 3 standard deviations from the mean still remain, then they can be imputed with its respective median value.

This is done on the train data only to prevent data leakage.

 **Transforming continuous variables to reduce skew:**

In [None]:

# skewness along the index axis
skewness=(CVD_df[continuous_var]).skew(axis = 0)
skewness

Many continuous variables are skewed. By log transformation, we aim to reduce the magnitude of skew in these variables to a certain extent.

In [None]:
# Skew for log10 transformation
np.log10(CVD_df[continuous_var]+1).skew(axis = 0)

We can clearly see that by log transformation of the continuous variables, we are able to reduce their skew to some extent.
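
The effect can be illustrated on a small made-up sample. The sketch below computes the population (Fisher-Pearson) skewness by hand before and after `log1p`; the sample values are hypothetical, and pandas' `.skew()` uses a bias-adjusted formula, so its numbers differ slightly from this version.

```python
import math

def skewness(values):
    """Population (Fisher-Pearson) skewness: third central moment / std^3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample (a few large values stretch the right tail)
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20, 50]
logged = [math.log1p(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

The logarithm compresses large values more than small ones, which pulls in the long right tail and lowers the skew without necessarily removing it.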

In [None]:
# Apply log transformation to reduce skewness
for var in continuous_var :
    df[var] = np.log1p(CVD_df[var])


In [None]:
# Checking skew after log transformation
df[continuous_var].skew(axis = 0)


Analyzing the distribution of transformed features:

In [None]:
# Analysing the distribution of continuous variables after transformation
for col in continuous_var:
  plt.figure(figsize=(10,5))
  sns.histplot(df[col])
  plt.axvline(df[col].mean(), color='magenta', linestyle='dashed', linewidth=2)
  plt.axvline(df[col].median(), color='cyan', linestyle='dashed', linewidth=2)
  plt.title(col+' distribution')
  plt.show()

Except cigs_per_day, we have successfully been able to reduce the skewness in the continuous variables. Now these distributions are closer to symmetric distribution.

 **Checking for outliers in transformed features:**

In [None]:
for col in continuous_var:
    plt.figure(figsize=(10,5))
    # df already holds the log1p-transformed values, so they are plotted directly
    sns.boxplot(y=col, x=dependent_var[0], data=df)
    plt.title(col+' boxplot')
    plt.show()


Except for the age and cigs_per_day columns, the rest of the numerical columns contain outliers even after log transformation.
To handle this, we can impute the outliers beyond 3 standard deviations from the mean with the median value, computed on the train data.

# **Data_Preprocessing**

In [None]:
# Defining dependent and independent variables
X = CVD_df.drop('ten_year_chd',axis=1)
y = CVD_df[dependent_var]

Choice of prediction model:

We are working on binary classification problem.

Here we can start with a simple model, as a baseline model, which is interpretable, ie, Logistic Regression

Try other standard binary classification models like K nearest neighbors, Naive Bayes, decision tree classifier, and support vector machines.

Use ensemble models, with hyperparameter tuning to check whether they give better predictions.

 Evaluation metrics:

Since the data we are dealing with is unbalanced, accuracy may not be the best metric to evaluate model performance. Also, since we are dealing with data related to healthcare, False Negatives are of higher concern than False Positives. In other words, a false alarm is acceptable, but actual positive cases should not go undetected. With these points in mind, we use Recall as the model evaluation metric.

Recall = True Positive (TP) / [True Positive (TP) + False Negative (FN)]
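
A short worked example shows why accuracy is misleading here. The sketch below uses the test-set class counts from this notebook (153 positives, 864 negatives) and a hypothetical model that predicts "no risk" for everyone; the helper names are made up for the illustration.

```python
def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Test-set class counts from this notebook: 153 positives, 864 negatives.
# Hypothetical model that predicts "no risk" for everyone:
tp, fn = 0, 153
tn, fp = 864, 0

print(f"accuracy = {accuracy_from_counts(tp, tn, fp, fn):.3f}")
print(f"recall   = {recall_from_counts(tp, fn):.3f}")
```

Such a model looks acceptable by accuracy while missing every positive case, which is exactly what recall penalises.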

**Train Test Split:**

In [None]:
# function to get recall score
def recall(actual,predicted):
  '''
  recall(actual,predicted)
  '''
  return recall_score(y_true=actual, y_pred=predicted, average='binary')

In [None]:

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y, shuffle=True)

In [None]:
# Checking the train distribution of dependent variable
y_train.value_counts()

In [None]:
# Proportion of positive outcomes in train dataset
358/(358+2015)

In [None]:
# Checking the test distribution of dependent variable
y_test.value_counts()

In [None]:
# Proportion of positive outcomes in test dataset
153/(153+864)

The train and test set contain almost equal proportion of results.

 Handling outliers in the train data:

Imputing the outliers in 'total_cholesterol', 'bmi', 'heart_rate', 'glucose', 'pulse_pressure' beyond 3 standard deviations from the mean with its median value.
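
The idea can be sketched on a single made-up column using only the standard library. This is a simplified illustration, not the notebook's implementation: the limits and the median are computed once from the train values, and the helper names and the sample data are hypothetical.

```python
import statistics

def fit_outlier_limits(train_values, n_std=3):
    """Compute mean +/- n_std * std and the median from the train data only."""
    mean = statistics.mean(train_values)
    std = statistics.stdev(train_values)  # sample std, like pandas .std()
    return mean - n_std * std, mean + n_std * std, statistics.median(train_values)

def impute_outliers(values, lower, upper, median):
    """Replace values outside [lower, upper] with the train median."""
    return [median if (v < lower or v > upper) else v for v in values]

# Hypothetical train column: 30 ordinary readings plus one extreme value
train = [100 + (i % 5) for i in range(30)] + [1000]
lower, upper, median = fit_outlier_limits(train)
cleaned = impute_outliers(train, lower, upper, median)
print(max(train), "->", max(cleaned))
```

Keeping the fitted limits separate from the imputation step makes it explicit that nothing is learned from the test data.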

In [None]:
# imputing the outliers beyond 3 standard deviations from the mean with its median value
for i in ['total_cholesterol', 'bmi', 'heart_rate', 'glucose','pulse_pressure']:
  upper_lim = X_train[i].mean() + 3 * X_train[i].std()
  lower_lim = X_train[i].mean() - 3 * X_train[i].std()
  X_train.loc[(X_train[i] > upper_lim),i] = X_train[i].median()
  X_train.loc[(X_train[i] < lower_lim),i] = X_train[i].median()

In [None]:
X_train[continuous_var].skew(axis = 0)

 Oversampling:

Since we are dealing with imbalanced data, i.e., only ~15% of the patients were diagnosed with coronary heart disease, we oversample the train dataset using SMOTE (Synthetic Minority Oversampling Technique).
This gives the model an equal number of examples from both classes during training, so that it is less biased towards the majority class.
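
To show the idea behind SMOTE, here is a heavily simplified sketch: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function `smote_like_sample` and the toy points are assumptions made for this illustration; the actual resampling in this notebook is done by the `SMOTE` class.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Other minority points, sorted by squared Euclidean distance to `base`
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Toy minority-class points with two features each
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)

# With the train counts above (2015 negatives, 358 positives), and assuming the
# minority class is resampled up to the size of the majority class:
n_synthetic = 2015 - 358
n_total = 2015 * 2
print(n_synthetic, n_total)
```

Every synthetic point lies between two real minority samples, so the new rows stay inside the region already occupied by the positive class.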

In [None]:
# visualize the target variable before SMOTE
y_train.value_counts().plot(kind='bar', title='Target variable before SMOTE')

In [None]:
print(X_train.isnull().sum())


In [None]:
# Mean, median, and mode for glucose
CVD_df.glucose.mean(),CVD_df.glucose.median(),CVD_df.glucose.mode()

In [None]:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
X_train['glucose'] = imputer.fit_transform(X_train[['glucose']])
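
A minimal sketch of median imputation, assuming the usual practice of learning the fill value on the training data and reusing it for the test data. The helper names and glucose values are invented for the example.

```python
import math
import statistics


def fit_median(values):
    """Median of the observed (non-missing) values."""
    return statistics.median(v for v in values if not math.isnan(v))


def impute(values, fill_value):
    """Replace missing values with a previously learned fill value."""
    return [fill_value if math.isnan(v) else v for v in values]


nan = float("nan")
train_glucose = [78.0, 85.0, nan, 92.0, 110.0]   # hypothetical values
test_glucose = [nan, 80.0, 101.0]                # hypothetical values

median = fit_median(train_glucose)               # learned on train only
train_imputed = impute(train_glucose, median)
test_imputed = impute(test_glucose, median)      # same value reused on test
print(median, train_imputed, test_imputed)
```

Reusing the training median on the test set keeps test-set statistics out of the preprocessing.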


In [None]:
# Oversampling using SMOTE
smote = SMOTE()

# fit predictor and target variable
X_smote, y_smote = smote.fit_resample(X_train, y_train)

print('Samples in the original dataset', len(y_train))
print('Samples in the resampled dataset', len(y_smote))



In [None]:
# visualize the target variable after SMOTE
y_smote.value_counts().plot(kind='bar', title='Target variable after SMOTE')

We have successfully oversampled the minority class using SMOTE. The resampled training set now contains an equal number of samples from both classes, which reduces the model's bias towards the majority class.

 **Scaling the data:**

Since the predictions of distance-based models are affected when the attributes are on different scales, we need to scale them.
We use StandardScaler to standardize the variables to zero mean and unit variance.
The scaled data is stored and used while building those models.
Tree-based algorithms do not necessarily require scaling.
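
The sketch below shows standardization on a single column, with the mean and standard deviation learned on the training values and then applied to the test values. The function names and ages are assumptions for this illustration.

```python
import statistics


def fit_standardizer(column):
    """Learn the mean and population standard deviation of a column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (ddof=0), as StandardScaler uses
    return mean, std


def standardize(column, mean, std):
    """Apply z = (x - mean) / std with previously learned parameters."""
    return [(v - mean) / std for v in column]


train_age = [40.0, 50.0, 60.0]   # hypothetical values
test_age = [50.0, 70.0]          # hypothetical values

mean, std = fit_standardizer(train_age)
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)   # reuses the train parameters
print(train_scaled, test_scaled)
```

The scaled training column has zero mean, while the test column generally does not, because it is expressed relative to the training distribution.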

In [None]:
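# Scaling data
scaler = StandardScaler()

# Fit the scaler once, on the resampled data the models are trained on,
# and reuse the same fitted scaler for the other sets
X_smote_scaled = scaler.fit_transform(X_smote)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)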

# Converting array to dataframe
X_train_scaled = pd.DataFrame(X_train_scaled,columns=X_train.columns)
X_smote_scaled = pd.DataFrame(X_smote_scaled,columns=X_smote.columns)
X_test_scaled = pd.DataFrame(X_test_scaled,columns=X_test.columns)

In [None]:
# Scaled train values
X_train_scaled.head()

In [None]:
# scaled SMOTE values
X_smote_scaled.head()

In [None]:
# scaled test values
X_test_scaled.head()

We have successfully standardized the variables using StandardScaler.

 **ML Model Implementation**


 Logistic Regression

In [None]:
# Instantiating the model
lr_model = LogisticRegression()

In [None]:
# training the model
lr_model.fit(X_smote_scaled, y_smote.values.ravel())

In [None]:
from sklearn.metrics import recall_score

# Train predictions
lr_train_pred = lr_model.predict(X_smote_scaled)

In [None]:
# training set recall
lr_train_recall = recall(y_smote,lr_train_pred)
lr_train_recall

In [None]:
# Check for NaN values in X_test_scaled
print(X_test_scaled.isnull().sum())


In [None]:
from sklearn.impute import SimpleImputer

# Fill any remaining NaNs in the scaled test features with the column mean.
# Note: this imputer is fitted on the test features themselves; fitting it on
# the training features instead would avoid using test-set statistics.
imputer = SimpleImputer(strategy='mean')
X_test_scaled_imputed = pd.DataFrame(imputer.fit_transform(X_test_scaled), columns=X_test_scaled.columns)


In [None]:
# Test predictions
lr_test_pred = lr_model.predict(X_test_scaled_imputed)

In [None]:
# Test recall
lr_test_recall = recall(y_test,lr_test_pred)
lr_test_recall

In [None]:
# Classification report
print(classification_report(y_test,lr_test_pred))

In [None]:
# Confusion matrix
from sklearn.metrics import confusion_matrix as cm
from sklearn.metrics import ConfusionMatrixDisplay as cmd
lr_confusion_matrix = cm(y_test, lr_test_pred)
cm_display = cmd(confusion_matrix = lr_confusion_matrix, display_labels = [False, True])

font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)

cm_display.plot(cmap='Oranges')
plt.title('Confusion matrix: LOGISTIC REGRESSION')
plt.show()

**K Nearest Neighbors:**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# Values of k are tried up to roughly sqrt(n),
# where n is the number of records in the resampled train dataset:
# sqrt(4030) = 63.48
knn_test_res = []
knn_train_res = []
for k in range(1,65):
  knn_model = KNeighborsClassifier(n_neighbors=k)
  knn_model.fit(X_smote_scaled, y_smote.values.ravel())
  knn_train_pred = knn_model.predict(X_smote_scaled)
  knn_train_recall = recall(y_smote,knn_train_pred)
  knn_test_pred = knn_model.predict(X_test_scaled_imputed)
  knn_test_recall = recall(y_test,knn_test_pred)
  knn_test_res.append(knn_test_recall)
  knn_train_res.append(knn_train_recall)

In [None]:
# Plotting the train and test recalls for different values of k
plt.figure(figsize=(10,5))
x_ = range(1,65)
y1 = knn_train_res
y2 = knn_test_res
plt.plot(x_, y1, label='Train Recall')
plt.plot(x_, y2, label = 'Test Recall')
plt.xlabel('K')
plt.ylabel('Recall')
plt.legend()
plt.show()


In [None]:
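# Best k is where the test recall is the highest
# (note: this uses the test set for model selection, so the test recall reported for KNN is optimistic)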
best_k = knn_test_res.index(max(knn_test_res))+1
best_k
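
An alternative that keeps the test set untouched is to choose k on a separate validation split. The sketch below illustrates this with a tiny hand-written nearest-neighbour classifier on made-up one-dimensional data. All names and values are assumptions for the example, not the notebook's method.

```python
def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (1-D features)."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def recall(actual, predicted):
    """Recall of the positive class (label 1)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn)


def choose_k(train_X, train_y, val_X, val_y, candidates):
    """Return the candidate k with the highest validation recall."""
    scores = {}
    for k in candidates:
        preds = [knn_predict(train_X, train_y, x, k) for x in val_X]
        scores[k] = recall(val_y, preds)
    best = max(candidates, key=lambda k: scores[k])  # first best wins on ties
    return best, scores


# Hypothetical training and validation data
train_X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_y = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
val_X = [3.5, 6.5, 8.2, 9.5]
val_y = [0, 1, 1, 1]

best_k_val, val_scores = choose_k(train_X, train_y, val_X, val_y, candidates=[1, 3, 5])
print(best_k_val, val_scores)
```

With k chosen this way, the test recall is measured on data that played no part in model selection.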

In [None]:
# building knn model with best parameters
knn_model = KNeighborsClassifier(n_neighbors=best_k)

In [None]:
# training the model
knn_model.fit(X_smote_scaled, y_smote.values.ravel())

In [None]:
# Train predictions
knn_train_pred = knn_model.predict(X_smote_scaled)

In [None]:
# training set recall
knn_train_recall = recall(y_smote,knn_train_pred)
knn_train_recall

In [None]:
# Test predictions
knn_test_pred = knn_model.predict(X_test_scaled_imputed)

In [None]:
# Test recall
knn_test_recall = recall(y_test,knn_test_pred)
knn_test_recall

In [None]:
# Classification report
print(classification_report(y_test,knn_test_pred))

In [None]:
# Confusion matrix
knn_confusion_matrix = cm(y_test, knn_test_pred)
cm_display = cmd(confusion_matrix = knn_confusion_matrix, display_labels = [False, True])

font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)

cm_display.plot(cmap='Oranges')
plt.title('Confusion matrix: K NEAREST NEIGHBORS')
plt.show()

**Naive Bayes:**

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
# Using stratified k fold cross validation so that each split
# has almost equal proportion of classification results
cv_method = RepeatedStratifiedKFold(n_splits=4,
                                    n_repeats=3,
                                    random_state=0)
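
As a rough picture of what stratification does, the sketch below deals the sample indices of each class into folds in turn, so every fold ends up with the same class mix. It is a simplified stand-in written for this example, not scikit-learn's implementation.

```python
import random


def stratified_folds(labels, n_splits, seed=0):
    """Assign each sample index to a fold so that every fold receives
    (almost) the same number of samples from each class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds


# Toy balanced labels, like the resampled training set
labels = [1] * 8 + [0] * 8
folds = stratified_folds(labels, n_splits=4)

for fold in folds:
    n_pos = sum(labels[i] for i in fold)
    print(len(fold), "samples,", n_pos, "positive")
```

With `n_splits=4` and `n_repeats=3`, the procedure is repeated three times with different shuffles, giving 12 train/validation splits per parameter combination.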

In [None]:
from sklearn.naive_bayes import GaussianNB
# Fitting model
nb_model = GaussianNB()

In [None]:
# Hyperparameter grid for Gaussian Naive Bayes
nb_model = GaussianNB()
nb_params = {'var_smoothing': np.logspace(0, -9, num=100)}

In [None]:
nb_gridsearch = GridSearchCV(nb_model,
                             nb_params,
                             cv=cv_method,
                             scoring= 'recall')
nb_gridsearch.fit(X_smote_scaled,y_smote.values.ravel())
nb_best_params = nb_gridsearch.best_params_

In [None]:
# model best parameters
nb_best_params

In [None]:

# building naive Bayes model with best parameters
nb_model = GaussianNB(var_smoothing=nb_best_params['var_smoothing'])

In [None]:
# training the model
nb_model.fit(X_smote_scaled, y_smote.values.ravel())

In [None]:
# Train predictions
nb_train_pred = nb_model.predict(X_smote_scaled)

In [None]:
# training set recall
nb_train_recall = recall(y_smote,nb_train_pred)
nb_train_recall

In [None]:

# Test predictions
nb_test_pred = nb_model.predict(X_test_scaled_imputed)

In [None]:
# Test recall
nb_test_recall = recall(y_test,nb_test_pred)
nb_test_recall

In [None]:
# Classification report
print(classification_report(y_test,nb_test_pred))

In [None]:
# Confusion matrix
nb_confusion_matrix = cm(y_test, nb_test_pred)
cm_display = cmd(confusion_matrix = nb_confusion_matrix, display_labels = [False, True])

font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)

cm_display.plot(cmap='Oranges')
plt.title('Confusion matrix: NAIVE BAYES')
plt.show()

**Decision tree:**

In [None]:
# Max depth of dt without hyperparameter tuning = 28 and min samples leaf = 1
dt_model = DecisionTreeClassifier()
dt_params = {
    'max_depth': np.arange(1, 10),
    'min_samples_split': np.linspace(0.1, 1.0, 10, endpoint=True),
    'min_samples_leaf': np.linspace(0.1, 0.5, 5, endpoint=True)
}
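
To get a feel for the cost of this grid, the sketch below counts the parameter combinations and the resulting number of model fits under the 4-fold, 3-repeat cross-validation defined earlier. The value lists are written out by hand to mirror the grid above.

```python
from itertools import product

max_depth = list(range(1, 10))                                  # 9 values
min_samples_split = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1 ... 1.0, 10 values
min_samples_leaf = [round(0.1 * i, 1) for i in range(1, 6)]     # 0.1 ... 0.5, 5 values

candidates = list(product(max_depth, min_samples_split, min_samples_leaf))

n_splits, n_repeats = 4, 3   # RepeatedStratifiedKFold settings used in this notebook
n_fits = len(candidates) * n_splits * n_repeats

print(len(candidates), "combinations,", n_fits, "fits")
```

That is 450 combinations and 5,400 fits, plus one final refit of the best combination on the full training data when `GridSearchCV` keeps its default `refit=True`.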

In [None]:
# using gridsearchcv to find best parameters
dt_gridsearch = GridSearchCV(dt_model,
                             dt_params,
                             cv=cv_method,
                             scoring= 'recall')
dt_gridsearch.fit(X_smote,y_smote)
dt_best_params = dt_gridsearch.best_params_

In [None]:
# model best parameters
dt_best_params


In [None]:
# building decision tree model with best parameters
dt_model = DecisionTreeClassifier(max_depth=dt_best_params['max_depth'],
                                  min_samples_split=dt_best_params['min_samples_split'],
                                  min_samples_leaf=dt_best_params['min_samples_leaf'])

In [None]:
# training the model
dt_model.fit(X_smote_scaled, y_smote)

In [None]:

# Train predictions
dt_train_pred = dt_model.predict(X_smote_scaled)

In [None]:
# training set recall
dt_train_recall = recall(y_smote,dt_train_pred)
dt_train_recall

In [None]:
# Test predictions
dt_test_pred = dt_model.predict(X_test_scaled_imputed)

In [None]:
# Test recall
dt_test_recall = recall(y_test,dt_test_pred)
dt_test_recall

In [None]:
# Classification report
print(classification_report(y_test,dt_test_pred))

In [None]:
# Feature importances

dt_feat_imp = pd.Series(dt_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10,5))
plt.title('Feature Importances: DECISION TREE')
plt.xlabel('Relative Importance')
dt_feat_imp.nlargest(20).plot(kind='barh')

In [None]:
# Confusion matrix
dt_confusion_matrix = cm(y_test, dt_test_pred)
cm_display = cmd(confusion_matrix = dt_confusion_matrix, display_labels = [False, True])

font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)

cm_display.plot(cmap='Oranges')
plt.title('Confusion matrix: DECISION TREE')
plt.show()

# **Support Vector Machines:**

In [None]:
# SVM model parameters
svm_model = SVC()
svm_params = {'C': [0.1, 1, 10],
              'gamma': [0.01, 0.001, 0.0001],
              'kernel': ['rbf']
             }

In [None]:
# Using gridsearchcv to find best parameters
svm_gridsearch = GridSearchCV(svm_model,
                              svm_params,
                              cv=cv_method,
                              scoring= 'recall')
svm_gridsearch.fit(X_smote_scaled,y_smote.values.ravel())
svm_best_params = svm_gridsearch.best_params_

In [None]:
# model best parameters
svm_best_params

In [None]:
# building SVM model with best parameters
svm_model = SVC(C=svm_best_params['C'],
                gamma=svm_best_params['gamma'],
                kernel=svm_best_params['kernel']
                )

In [None]:
# training the model
svm_model.fit(X_smote_scaled, y_smote.values.ravel())

In [None]:
# Train predictions
svm_train_pred = svm_model.predict(X_smote_scaled)

In [None]:

# training set recall
svm_train_recall = recall(y_smote,svm_train_pred)
svm_train_recall

In [None]:

# Test predictions
svm_test_pred = svm_model.predict(X_test_scaled_imputed)

In [None]:
# Test recall
svm_test_recall = recall(y_test,svm_test_pred)
svm_test_recall

In [None]:
# Classification report
print(classification_report(y_test,svm_test_pred))

In [None]:
# Confusion matrix
svm_confusion_matrix = cm(y_test, svm_test_pred)
cm_display = cmd(confusion_matrix = svm_confusion_matrix, display_labels = [False, True])

font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)

cm_display.plot(cmap='Oranges')
plt.title('Confusion matrix: SUPPORT VECTOR MACHINES')
plt.show()

# **Random forests:**

In [None]:
# random forest model (original, wider grid; superseded by the reduced grid in the next cell)
rf_model = RandomForestClassifier(random_state=0)
rf_params = {'n_estimators':[500],                    # limited due to computational power availability
             'max_depth':np.arange(1,6),
             'min_samples_split':np.arange(0.1,1,0.1),
             'min_samples_leaf':np.arange(0.1,0.6,0.1)}
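
Note that float values for `min_samples_split` and `min_samples_leaf` are interpreted by scikit-learn as fractions of the training rows, rounded up. The sketch below converts a few fractions into row counts, assuming the 4,030 resampled training rows quoted in the KNN section; the helper name is made up for this example.

```python
import math


def min_samples_from_fraction(fraction, n_samples):
    """Row count implied by a fractional setting: ceil(fraction * n_samples)."""
    return math.ceil(fraction * n_samples)


n_rows = 4030  # resampled training rows, assumed from the KNN cell's comment

for frac in (0.1, 0.3, 0.5):
    print(frac, "->", min_samples_from_fraction(frac, n_rows), "rows")
```

Even the smallest fraction of 0.1 requires roughly 400 rows per leaf, so these grids only allow very shallow, coarse trees. During cross-validation the counts are smaller still, since each training fold holds only part of the rows.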

In [None]:
# Random forest model with a reduced grid, tuned using GridSearchCV
rf_model = RandomForestClassifier(random_state=0)
rf_params = {'n_estimators': [100],  # Limit the number of estimators for faster computation
             'max_depth': [5, 10],       # Reduce the number of values to search
             'min_samples_split': [0.1, 0.3],  # Reduce the number of values to search
             'min_samples_leaf': [0.1, 0.3]}   # Reduce the number of values to search

rf_gridsearch = GridSearchCV(rf_model, rf_params, cv=cv_method, scoring='recall', n_jobs=-1)
rf_gridsearch.fit(X_smote, y_smote.values.ravel())
rf_best_params = rf_gridsearch.best_params_


In [None]:

# best parameters for random forests
rf_best_params

In [None]:
# Fitting RF model with best parameters
rf_model = RandomForestClassifier(n_estimators=rf_best_params['n_estimators'],
                                  min_samples_leaf=rf_best_params['min_samples_leaf'],
                                  min_samples_split=rf_best_params['min_samples_split'],
                                  max_depth=rf_best_params['max_depth'],
                                  random_state=0)

In [None]:
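# fit on the scaled data, so that the features match the scaled test set used for prediction
rf_model.fit(X_smote_scaled, y_smote.values.ravel())
# train predictions
rf_train_pred = rf_model.predict(X_smote_scaled)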

In [None]:
# train recall
rf_train_recall = recall(y_smote,rf_train_pred)
rf_train_recall

In [None]:
# Test predictions
rf_test_pred = rf_model.predict(X_test_scaled_imputed)

In [None]:
from sklearn.metrics import recall_score
# test recall
rf_test_recall = recall(y_test,rf_test_pred)
rf_test_recall

In [None]:
# Classification report
print(classification_report(y_test,rf_test_pred))

In [None]:
# Feature importances

rf_feat_imp = pd.Series(rf_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10,5))
plt.title('Feature Importances: RANDOM FORESTS')
plt.xlabel('Relative Importance')
rf_feat_imp.nlargest(20).plot(kind='barh')

In [None]:
# Confusion matrix
rf_confusion_matrix = cm(y_test, rf_test_pred)
cm_display = cmd(confusion_matrix = rf_confusion_matrix, display_labels = [False, True])

font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)

cm_display.plot(cmap='Oranges')
plt.title('Confusion matrix: RANDOM FORESTS')
plt.show()

# **XG Boost:**

In [None]:
# XGBoost model
import xgboost as xgb
xgb_model = xgb.XGBClassifier(random_state=0)
xgb_params = {'n_estimators': [100],  # Limit the number of estimators for faster computation
              'max_depth': [1],       # Reduce the number of values to search
              'learning_rate': [0.1],  # Add a learning rate parameter
              }

xgb_gridsearch = GridSearchCV(xgb_model, xgb_params, cv=cv_method, scoring='recall', n_jobs=-1)
xgb_gridsearch.fit(X_smote, y_smote.values.ravel())
xgb_best_params = xgb_gridsearch.best_params_



In [None]:
# using gridsearchcv to find best parameters
xgb_gridsearch = GridSearchCV(xgb_model,xgb_params,cv=cv_method,scoring='recall')
xgb_gridsearch.fit(X_smote_scaled,y_smote)
xgb_best_params = xgb_gridsearch.best_params_

In [None]:
# best parameters for XGBoost
xgb_best_params

In [None]:
# Fitting xgb with best parameters
xgb_model = xgb.XGBClassifier(n_estimators=xgb_best_params['n_estimators'],
                               max_depth=xgb_best_params['max_depth'],
                               learning_rate=xgb_best_params['learning_rate'],
                               random_state=0)


In [None]:

# fit
xgb_model.fit(X_smote_scaled,y_smote)

In [None]:
# train predictions
xgb_train_pred = xgb_model.predict(X_smote_scaled)
xgb_train_pred = [round(value) for value in xgb_train_pred]

In [None]:
# train recall
xgb_train_recall = recall(y_smote,xgb_train_pred)
xgb_train_recall

In [None]:
# Test predictions
xgb_test_pred = xgb_model.predict(X_test_scaled_imputed)
xgb_test_pred = [round(value) for value in xgb_test_pred]

In [None]:
# test recall
xgb_test_recall = recall(y_test,xgb_test_pred)
xgb_test_recall

In [None]:
# Classification report
print(classification_report(y_test,xgb_test_pred))

In [None]:
# Feature importances

xgb_feat_imp = pd.Series(xgb_model.feature_importances_, index=X.columns)
plt.figure(figsize=(10,5))
plt.title('Feature Importances: XG BOOST')
plt.xlabel('Relative Importance')
xgb_feat_imp.nlargest(20).plot(kind='barh')

In [None]:
print(classification_report(y_test,xgb_test_pred,target_names=['Negative','Positive']))

In [None]:
# Confusion matrix
xgb_confusion_matrix = cm(y_test, xgb_test_pred)
cm_display = cmd(confusion_matrix = xgb_confusion_matrix, display_labels = [False, True])

font = {'family' : 'DejaVu Sans',
        'weight' : 'bold',
        'size'   : 22}
plt.rc('font', **font)

cm_display.plot(cmap='Oranges')
plt.title('Confusion matrix: XG BOOST')
plt.show()

# **Results:**

The train and test recall scores obtained for different models built are as follows:


In [None]:
!pip install prettytable
from prettytable import PrettyTable

# Summarizing the results obtained
test = PrettyTable(['Sl. No.','Classification Model', 'Train Recall (%)','Test Recall (%)'])
test.add_row(['1','Logistic Regression',lr_train_recall*100,lr_test_recall*100])
test.add_row(['2','K Nearest Neighbors',knn_train_recall*100,knn_test_recall*100])
test.add_row(['3','Naive Bayes',nb_train_recall*100,nb_test_recall*100])
test.add_row(['4','Decision Tree',dt_train_recall*100,dt_test_recall*100])
test.add_row(['5','Random Forest',rf_train_recall*100,rf_test_recall*100])
test.add_row(['6','Support Vector Machines',svm_train_recall*100,svm_test_recall*100])
test.add_row(['7','XGBoost',xgb_train_recall*100,xgb_test_recall*100])

print(test)


In [None]:

# Plotting Recall scores

ML_models = ['Logistic Regression','K Nearest Neighbors','Naive Bayes','Decision Tree','Support Vector Machines','Random Forests','XG Boost']
train_recalls = [lr_train_recall,knn_train_recall,nb_train_recall,dt_train_recall,svm_train_recall,rf_train_recall,xgb_train_recall]
test_recalls = [lr_test_recall,knn_test_recall,nb_test_recall,dt_test_recall,svm_test_recall,rf_test_recall,xgb_test_recall]

X_axis = np.arange(len(ML_models))

plt.figure(figsize=(10,5))
plt.barh(X_axis - 0.2, train_recalls, 0.4, label = 'Train Recall')
plt.barh(X_axis + 0.2, test_recalls, 0.4, label = 'Test Recall')

plt.yticks(X_axis,ML_models)
plt.xlabel("Recall score")
plt.title("Recall score for each model")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',title='Legend')
plt.show()

Random forests has the highest recall score on both the train and test sets, which makes it the best-performing model on this metric. Support vector machines and decision trees also have high recall scores on both sets.

Naive Bayes, K nearest neighbors, and logistic regression have lower recall scores on both sets, so they miss more of the positive cases. This may be because these simpler models capture less of the structure in the data.

High recall matters for this dataset because the goal is to identify as many of the positive cases as possible.

The train and test recall scores are similar for all models, which suggests that the models are not overfitting and are likely to generalize to new data.

# **Conclusion**

We trained 7 machine learning models using the training dataset, and hyperparameter tuning was used in some models to improve performance.
To build the models, missing values were handled, feature engineering and feature selection were performed, and the training dataset was oversampled using SMOTE to reduce bias towards one outcome.

Recall was chosen as the model evaluation metric because it was very important to reduce the number of false negatives.

An initial set of predictions was obtained using the baseline model, i.e., logistic regression, and other commonly used classification models were also built in search of better predictions.

Predicting the risk of coronary heart disease is critical for reducing fatalities caused by this illness. We can avert deaths by taking the required medications and precautions if we can foresee the danger of this sickness ahead of time.

It is critical that the model we develop has a high recall score. It is OK if the model incorrectly identifies a healthy patient as a high risk patient because it will not result in death, but if a high risk patient is incorrectly labelled as healthy, it may result in fatality.

We were able to create a model with a recall of only 0.77 because of the limited data and limited computational power available.

A recall score of 0.77 indicates that out of 100 individuals with the illness, our model will classify only 77 as high-risk patients, while the remaining 23 will be misclassified.

Future developments must include a strategy to improve the model's recall score, enabling us to save even more lives from this disease. This includes involving more people in the study, including people with different medical histories, and building an application with a better recall score.

From our analysis, age was found to be the most important feature in determining a patient's risk of developing CHD, followed by pulse pressure, prevalent hypertension and total cholesterol.
Diabetes, prevalent stroke and BP medication were the least important features in determining the risk of CHD.
