<a href="https://colab.research.google.com/github/puttadharani/cardiovascular-risk-prediction/blob/main/cardiovascular_risk_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Cardiovascular Risk Prediction**


**Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1 -** Sunil Kumar
##### **Team Member 2 -** Dharani Putta
##### **Team Member 3 -** Vivek Singh


# **Project Summary -**

This project is a classification task that builds predictive models to identify individuals at risk of developing cardiovascular disease (CVD). It applies machine learning and data analysis techniques to medical and health-related data and classifies individuals into risk categories. The workflow of such a project is summarized here:

The project begins with the collection of relevant medical data, which may include patient demographics, lifestyle factors, medical history, and vital signs (such as blood pressure and heart rate).

The collected data is cleaned, processed, and transformed into a suitable format for analysis. Missing values are addressed, and outliers may be treated.

Relevant features (variables) are selected, and new ones are engineered where they may improve the predictive power of the model. For example, features like body mass index (BMI), cholesterol levels, and family history of CVD are often important.

The data is divided into training, validation, and test sets to build and evaluate the predictive models.

Various classification algorithms are considered, such as logistic regression, decision trees, random forests, and support vector machines. The choice of model depends on the nature of the data and the complexity of the problem.

The selected model is trained on the training dataset using historical data and target labels (CVD outcomes).

The model's performance is assessed on the validation dataset using evaluation metrics such as accuracy, precision, recall, F1 score, and ROC AUC. Hyperparameter tuning may be performed to optimize the model's performance.

The final model is tested on the independent test dataset to evaluate its real-world performance.

To enhance the project's clinical relevance, models may be designed to provide explanations or feature importance rankings to healthcare professionals, explaining why a specific prediction was made.

Raising awareness among patients about CVD risk factors, prevention, and the role of predictive models is an important aspect of such projects.

CVD prediction classification projects have the potential to significantly impact public health by enabling early identification of at-risk individuals and facilitating timely interventions to prevent or manage cardiovascular diseases. They align with the broader goals of personalized medicine and data-driven healthcare to improve patient outcomes and reduce healthcare costs.
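
The workflow summarized above (split, train, validate, test) can be sketched end to end. This is a minimal illustration rather than the project's final pipeline: it uses a synthetic imbalanced dataset from `make_classification` in place of the CVD data, and the 60/20/20 split, the logistic regression model, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CVD data: roughly 15% positive cases (assumed ratio).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.85, 0.15], random_state=42)

# Hold out 20% as the test set, then take 25% of the remainder as the
# validation set, which gives a 60/20/20 split overall.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

# Scaling and the classifier are wrapped together so the scaler is fitted
# on the training split only.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"))
model.fit(X_train, y_train)


def evaluate(name, X_part, y_part):
    """Print and return precision, recall, F1 and ROC AUC for one split."""
    pred = model.predict(X_part)
    proba = model.predict_proba(X_part)[:, 1]
    scores = {
        "precision": precision_score(y_part, pred),
        "recall": recall_score(y_part, pred),
        "f1": f1_score(y_part, pred),
        "roc_auc": roc_auc_score(y_part, proba),
    }
    print(name, {k: round(v, 3) for k, v in scores.items()})
    return scores


val_scores = evaluate("validation", X_val, y_val)
test_scores = evaluate("test", X_test, y_test)
```

Stratifying both splits keeps the share of positive cases similar across the three sets, which matters when the positive class is rare.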

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

Answer Here.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE


from sklearn.metrics import classification_report,accuracy_score,recall_score
from prettytable import PrettyTable

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df=pd.read_csv('/content/drive/MyDrive/data_cardiovascular_risk.csv')

In [None]:
new=pd.DataFrame(df['education'])
new.value_counts()

### Dataset First View

In [None]:
# Dataset First Look
df

In [None]:
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('shape of given dataframe=',df.shape)

print('No.of rows=',df.index.size)
print('No.of columns=',df.columns.size)

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isna())

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

**id:** Patient ID

#### Demographic:

**Age:** Age of the patient (discrete)<br>
**Education:** The level of education of the patient (categorical values: 1, 2, 3, 4)<br>
**Sex:** Gender ("M" or "F")<br>

#### Behavioral:

**is_smoking:** Whether the patient currently smokes ("YES" or "NO")<br>
**Cigs Per Day:** Cigarettes smoked per day (discrete)<br>

#### Medical (history):

**BP Meds:** Whether the patient takes BP medication (nominal)<br>
**Prevalent Stroke:** Whether the patient has a history of stroke (nominal)<br>
**Prevalent Hyp:** Whether the patient has a history of hypertension (nominal)<br>
**Diabetes:** Whether the patient has diabetes (nominal)

#### Medical (current):

**Tot Chol:** Total cholesterol measure (continuous)<br>
**Sys BP:** Systolic BP measure (continuous)<br>
**Dia BP:** Diastolic BP measure (continuous)<br>
**BMI:** Body Mass Index (continuous)<br>
**Heart Rate:** Heart rate measure (continuous; in medical research, variables such as heart rate, although in fact discrete, are treated as continuous because of the large number of possible values)<br>
**Glucose:** Glucose level (continuous)

#### Predict variable (desired target):

**TenYearCHD:** 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes”, “0” means “No”)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns:
  print('No.of unique values in',i,'=',df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.head()


In [None]:
df.isna().sum()


In [None]:
# mode() returns a Series, so take its first entry as the fill value
df['education']=df['education'].fillna(df['education'].mode()[0])
df['BPMeds']=df['BPMeds'].fillna(df['BPMeds'].mode()[0])

In [None]:
non_smokers=df[(df['is_smoking']=='NO')&(df['cigsPerDay'].isna())]
non_smokers

In [None]:
sns.displot(df['cigsPerDay'],kde=True)
plt.axvline(df[df['is_smoking']=='YES']['cigsPerDay'].mean(),color='cyan',linestyle='dashed',linewidth=2)
plt.axvline(df[df['is_smoking']=='YES']['cigsPerDay'].median(),color='violet',linestyle='dashed',linewidth=2)
plt.title('cigsPerDay distribution of smokers')
plt.show()


In [None]:
plt.title('boxplot of cigsPerDay')
sns.boxplot(df[df['is_smoking']=='YES']['cigsPerDay'])

Let's fill the missing values with the median of the smokers' values, since the 'cigsPerDay' column contains outliers.

In [None]:
df['cigsPerDay']=df['cigsPerDay'].fillna(df[df['is_smoking']=='YES']['cigsPerDay'].median())

In [None]:
for i in ['totChol','BMI','heartRate','glucose']:
  sns.displot(df[i],kde=True)
  plt.axvline(df[i].mean(),linestyle='dashed',color='cyan',linewidth=2)
  plt.axvline(df[i].median(),linestyle='dashed',color='magenta',linewidth=2)
  plt.show()

In [None]:
for i in ['totChol','BMI','heartRate','glucose']:
  print(i)
  print('mean',round(df[i].mean(),2),' median',df[i].median(),'  mode',df[i].mode())
  print('')

In [None]:
sns.boxplot(df[['totChol','BMI','heartRate','glucose']])

From the above values and displots, it can be observed that 'totChol', 'BMI', and 'heartRate' are approximately normally distributed. <br>However, they also contain outliers, so the median is a safer choice than the mean for imputing their missing values.

In [None]:
for i in ['totChol','BMI','heartRate']:
  df[i]=df[i].fillna(df[i].median())
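
The preference for the median over the mean when outliers are present can be illustrated with a small example. The readings below are invented for illustration and are not taken from the dataset; the variable names are likewise assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical cholesterol-like readings with one extreme value and two gaps.
chol = pd.Series([190.0, 205.0, 210.0, 220.0, 235.0, 600.0, np.nan, np.nan])

mean_value = chol.mean()      # pulled upward by the single 600 reading
median_value = chol.median()  # midpoint of the observed values

# Fill the gaps both ways to compare the imputed values.
filled_with_mean = chol.fillna(mean_value)
filled_with_median = chol.fillna(median_value)

print("mean used for filling:  ", round(mean_value, 2))
print("median used for filling:", median_value)
```

Here the mean lands well above five of the six observed readings, while the median stays in the middle of them, so the median-filled gaps look like typical patients rather than like the outlier.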

'glucose' has 304 missing values, which is high. <br>
Filling that many gaps with a single mean or median value would flatten the distribution, so it may not be the best choice.
<br>So, let's impute these missing values using KNNImputer.

In [None]:
df.head()

In [None]:
df['is_smoking']=np.where(df['is_smoking']=='YES',1,0)
df['sex']=np.where(df['sex']=='M',1,0)

In [None]:
df.loc[3388]

In [None]:
df['BPMeds']=np.where(df['BPMeds']==1,1,0)

In [None]:
# drop() returns a new DataFrame, so df itself keeps its 'education' column
df_copy=df.drop(columns=['education'])

In [None]:
for i in df_copy.columns:
  print(i,df_copy[i].unique())

In [None]:
imputer=KNNImputer(n_neighbors=10)
imputed=imputer.fit_transform(df_copy)
df_copy=pd.DataFrame(imputed,columns=df_copy.columns)
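
To make the behaviour of `KNNImputer` concrete, here is a small sketch on a toy table. The values and the choice of `n_neighbors=2` are assumptions for illustration only; the notebook itself uses `n_neighbors=10` on the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy table with two clearly separated groups of patients (invented values).
toy = pd.DataFrame({
    "age":     [40.0, 42.0, 41.0, 65.0, 66.0, 64.0],
    "BMI":     [22.0, 23.0, 22.5, 31.0, 30.5, 31.5],
    "glucose": [75.0, 77.0, np.nan, 120.0, np.nan, 118.0],
})

# Each missing glucose value is replaced by the average glucose of the two
# rows that are closest in the remaining columns (age and BMI here).
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
print("column mean of observed glucose:", toy["glucose"].mean())
```

The two gaps are filled from their own group (76.0 and 119.0) instead of the overall column mean of 97.5. Because the neighbours are found by distance, columns with large numeric ranges dominate the search, so scaling the features before imputing is worth considering.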

In [None]:
df_copy.isna().sum()

**Exploratory Data Analysis:**

In [None]:
diabetic=df_copy[df_copy['diabetes']==1]
diabetic.head()

In [None]:
sns.displot(diabetic['age']).set(title='displot of diabetes with respect to age')
plt.show()

In [None]:
df_copy['is_smoking'].value_counts()

In [None]:
diabetic['sex'].value_counts()

In [None]:
smokers_df=df_copy[df_copy['is_smoking']==1]
smokers_df.head()

In [None]:
df_copy

In [None]:
smokers_df['cigsPerDay'].value_counts()

In [None]:
smokers_genderwise=smokers_df['sex'].value_counts()
smokers_genderwise

In [None]:
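plt.title('Percentage of smokers by gender')
# value_counts() sorts by frequency, so build the labels from its index (1 = Male, 0 = Female)
plt.pie(smokers_genderwise,labels=['Male' if s==1 else 'Female' for s in smokers_genderwise.index],autopct="%1.0f%%")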
plt.show()

In [None]:
hyper_tension=df_copy[df_copy['prevalentHyp']==1]

In [None]:
plt.title('hypertension vs age')
sns.histplot(hyper_tension['age'],kde=True)
plt.show()

In [None]:
sns.countplot(hyper_tension,x='sex').set(title='countplot of hypertension with respect to sex')
plt.show()

In [None]:
sns.catplot(hyper_tension,x='glucose').set(title='catplot of glucose for patients with hypertension')
plt.show()

### 6. Data Scaling

In [None]:
scale=StandardScaler()
#df_copy=scale.fit_transform(df)
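
The scaler above is created but not yet applied. One possible way to apply it while keeping a DataFrame (so that later calls such as `.corr()` still work) is sketched below. The small frame and the list of columns to scale are assumptions for illustration; binary columns and the target are deliberately left untouched.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small invented frame standing in for the cleaned data.
frame = pd.DataFrame({
    "age": [39.0, 46.0, 50.0, 64.0],
    "sysBP": [110.0, 121.0, 148.0, 168.0],
    "TenYearCHD": [0, 0, 1, 1],
})

# Assumed list of continuous columns; the target is not scaled.
continuous_cols = ["age", "sysBP"]

scaler = StandardScaler()
scaled = frame.copy()
# fit_transform returns a NumPy array, so assign it back column-wise
# to preserve the column names and the untouched columns.
scaled[continuous_cols] = scaler.fit_transform(frame[continuous_cols])

print(scaled)
```

In a modelling pipeline the scaler would normally be fitted on the training split only and then reused to transform the validation and test splits.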

In [None]:
df_copy

From the 'TenYearCHD' values obtained above, it can be observed that the dataset is imbalanced.<br>
Let's try using the SMOTE technique to balance the classes.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df_copy.corr(),annot=True,cmap='BuPu')

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(x):
  # Compute the variance inflation factor (VIF) for every column of x
  vif=pd.DataFrame()
  vif['variables']=x.columns
  vif['VIF']=[variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
  return vif

In [None]:
column=df.columns
column

In [None]:
df_copy.columns

In [None]:
#df_copy['education']=new

In [None]:
calculate_vif(df_copy)

In [None]:
chd=df[df['TenYearCHD']==1]

In [None]:

sns.histplot(chd['age'],kde=True).set(title='histplot of "TenYearCHD" vs age')
plt.show()

In [None]:
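chd_genderwise=chd['sex'].value_counts()
print(chd_genderwise)
plt.title('pie chart of "TenYearCHD" vs sex')
# value_counts() sorts by frequency, so build the labels from its index (1 = Male, 0 = Female)
plt.pie(chd_genderwise,labels=['Male' if s==1 else 'Female' for s in chd_genderwise.index],autopct='%1.0f%%')
plt.show()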

In [None]:
sns.countplot(chd,x="is_smoking")

In [None]:
sns.countplot(chd,x='prevalentHyp')


### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***