# **Project Name**    -  **Cardiovascular Risk Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1** - Nitin Pal
##### **Team Member 2** - Shristhi Patel


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


Cardiovascular disease is one of the most prevalent chronic diseases and the leading cause of death in the United States. Every year about 1 million Americans die of cardiovascular-related problems, roughly 42% of all deaths per year. Cardiovascular disease affects the heart and vascular system and includes the narrowing or blocking of blood vessels (arteries, veins and capillaries), which can lead to a range of conditions such as coronary artery disease, arrhythmia, congenital heart defects, angina (chest pain) and stroke.


If cardiovascular disease could be predicted in advance, a large share of these deaths might be prevented.

#### **Define Your Business Objective?** 

***Reducing risk of cardiovascular diseases***

# GitHub Link

https://github.com/palnitin12345/Cardio-Vascular-Disease-Prediction.git

# Let's begin

## ***1. Know Your Data***

In [1]:
# Importing required libraries

import io
import math  # replaces `from numpy import math`, which newer NumPy releases no longer provide
import warnings
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import requests
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

%pip install squarify
import squarify

warnings.filterwarnings("ignore")

# Preprocessing and splitting data into train and test sets
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Libraries for modelling and evaluation
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score, roc_curve,
    r2_score, mean_squared_error,
)

# Importing libraries for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
 




In [2]:
# Downloading the CSV file from GitHub
# (the URL must point to the raw version of the file)
url = "https://raw.githubusercontent.com/palnitin12345/Cardio-Vascular-Disease-Prediction/main/data_cardiovascular_risk.csv"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
download = response.content

# Reading the downloaded content into a pandas DataFrame
dataset = pd.read_csv(io.StringIO(download.decode('utf-8')))


In [None]:
#First look of the data.

dataset.head()

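|   | id | age | education | sex | is_smoking | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 64 | 2.0 | F | YES | 3.0 | 0.0 | 0 | 0 | 0 | 221.0 | 148.0 | 85.0 | NaN | 90.0 | 80.0 | 1 |
| 1 | 1 | 36 | 4.0 | M | NO | 0.0 | 0.0 | 0 | 1 | 0 | 212.0 | 168.0 | 98.0 | 29.77 | 72.0 | 75.0 | 0 |
| 2 | 2 | 46 | 1.0 | F | YES | 10.0 | 0.0 | 0 | 0 | 0 | 250.0 | 116.0 | 71.0 | 20.35 | 88.0 | 94.0 | 0 |
| 3 | 3 | 50 | 1.0 | M | YES | 20.0 | 0.0 | 0 | 1 | 0 | 233.0 | 158.0 | 88.0 | 28.26 | 68.0 | 94.0 | 1 |
| 4 | 4 | 64 | 1.0 | F | YES | 30.0 | 0.0 | 0 | 0 | 0 | 241.0 | 136.5 | 85.0 | 26.42 | 70.0 | 77.0 | 0 |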


In [None]:
# Dataset Rows & Columns count

print(f'Shape of first dataset {dataset.shape}')

Shape of first dataset (3390, 17)


In [None]:
# Dataset info
# Inspecting column data types and non-null counts

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               3390 non-null   int64  
 1   age              3390 non-null   int64  
 2   education        3303 non-null   float64
 3   sex              3390 non-null   object 
 4   is_smoking       3390 non-null   object 
 5   cigsPerDay       3368 non-null   float64
 6   BPMeds           3346 non-null   float64
 7   prevalentStroke  3390 non-null   int64  
 8   prevalentHyp     3390 non-null   int64  
 9   diabetes         3390 non-null   int64  
 10  totChol          3352 non-null   float64
 11  sysBP            3390 non-null   float64
 12  diaBP            3390 non-null   float64
 13  BMI              3376 non-null   float64
 14  heartRate        3389 non-null   float64
 15  glucose          3086 non-null   float64
 16  TenYearCHD       3390 non-null   int64  
dtypes: float64(9), int64(6), object(2)

### Duplicate Values

In [None]:

#Number of duplicated Rows

len(dataset[dataset.duplicated()])

0
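
One caveat about the check above: when a frame has a per-row identifier such as `id`, a full-row `duplicated()` call cannot flag anything, because the identifier alone makes every row distinct. The sketch below illustrates the difference on a small toy frame (the values are made up for illustration, not taken from the dataset) by repeating the check on the non-`id` columns only.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 2 match on every column except `id`.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "age": [64, 36, 64],
    "sysBP": [148.0, 168.0, 148.0],
})

# Full-row check: the unique `id` makes every row distinct.
full_row_dupes = int(toy.duplicated().sum())

# Same check restricted to the measurement columns.
feature_cols = toy.columns.drop("id").tolist()
feature_dupes = int(toy.duplicated(subset=feature_cols).sum())

print(full_row_dupes, feature_dupes)
```

Applied to the real data, the same `subset=` call would show whether any two patients share identical measurements.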

### Missing/Null Values

In [None]:
#Finding null columns in the dataset and percentage of null values.
print("Col     Null values percent")
for col in dataset.columns:
  if dataset[col].notnull().sum() != len(dataset):
    print(f"{col}  :- {round(dataset[col].isnull().sum()*100/len(dataset),2)}")

Col     Null values percent
education  :- 2.57
cigsPerDay  :- 0.65
BPMeds  :- 1.3
totChol  :- 1.12
BMI  :- 0.41
heartRate  :- 0.03
glucose  :- 8.97
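
The same summary can be computed without an explicit loop. The following is a minimal sketch, assuming a helper named `null_percentages` (a hypothetical name, not part of the notebook) and a toy frame with made-up values; it returns only the columns that have gaps, sorted from most to least missing.

```python
import numpy as np
import pandas as pd

def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Percent of missing values per column, for columns with at least one gap."""
    pct = df.isnull().mean() * 100   # mean of a boolean mask = fraction missing
    pct = pct[pct > 0]               # keep only columns that have nulls
    return pct.round(2).sort_values(ascending=False)

# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "glucose": [80.0, np.nan, 94.0, np.nan],
    "BMI": [np.nan, 29.77, 20.35, 28.26],
    "age": [64, 36, 46, 50],
})

summary = null_percentages(toy)
print(summary)
```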


### What did you know about your dataset?

The dataset contains records for 3,390 patients, with heart rate, cholesterol level and other clinical parameters for each. The task is to predict whether a patient is at risk of coronary heart disease (CHD) in the next ten years.
Seven columns contain null values. `glucose` has the most (8.97%), and every other column is below 3%, so the gaps can be treated and no column needs to be dropped. There are no duplicated rows in the dataset.
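
The text above says the null values can be treated but does not say how. One common option, shown here only as a sketch, is median imputation for continuous columns and mode imputation for coded columns. The helper name `impute_simple`, the choice of columns and the toy values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def impute_simple(df, median_cols, mode_cols):
    """Return a copy with median/mode imputation (one possible treatment, not the only one)."""
    out = df.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "glucose": [80.0, np.nan, 94.0, 75.0],
    "education": [1.0, 1.0, np.nan, 4.0],
})

filled = impute_simple(toy, median_cols=["glucose"], mode_cols=["education"])
print(filled)
```

When a model is trained later, the fill values should ideally be computed on the training split only and then applied to the test split, so that no information leaks from the test data.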

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

dataset.columns

Index(['id', 'age', 'education', 'sex', 'is_smoking', 'cigsPerDay', 'BPMeds',
       'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
       'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
      dtype='object')

In [None]:
# Dataset Describe

dataset.describe(include='all')

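|   | id | age | education | sex | is_smoking | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3390.0 | 3390.0 | 3303.0 | 3390 | 3390 | 3368.0 | 3346.0 | 3390.0 | 3390.0 | 3390.0 | 3352.0 | 3390.0 | 3390.0 | 3376.0 | 3389.0 | 3086.0 | 3390.0 |
| unique | NaN | NaN | NaN | 2 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | NaN | F | NO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | 1923 | 1703 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 1694.5 | 49.542183 | 1.970936 | NaN | NaN | 9.069477 | 0.029886 | 0.00649 | 0.315339 | 0.025664 | 237.074284 | 132.60118 | 82.883038 | 25.794964 | 75.977279 | 82.08652 | 0.150737 |
| std | 978.753033 | 8.592878 | 1.019081 | NaN | NaN | 11.879078 | 0.170299 | 0.080309 | 0.464719 | 0.158153 | 45.24743 | 22.29203 | 12.023581 | 4.115449 | 11.971868 | 24.244753 | 0.357846 |
| min | 0.0 | 32.0 | 1.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 107.0 | 83.5 | 48.0 | 15.96 | 45.0 | 40.0 | 0.0 |
| 25% | 847.25 | 42.0 | 1.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 206.0 | 117.0 | 74.5 | 23.02 | 68.0 | 71.0 | 0.0 |
| 50% | 1694.5 | 49.0 | 2.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 234.0 | 128.5 | 82.0 | 25.38 | 75.0 | 78.0 | 0.0 |
| 75% | 2541.75 | 56.0 | 3.0 | NaN | NaN | 20.0 | 0.0 | 0.0 | 1.0 | 0.0 | 264.0 | 144.0 | 90.0 | 28.04 | 83.0 | 87.0 | 0.0 |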


### Variables Description

1.  **id** :- Unique ID identifying a patient record
2.  **age** :- Age of the patient
3.  **education** :- Education level of the patient (coded 1 to 4)
4.  **sex** :- Sex of the patient (F or M)
5.  **is_smoking** :- Whether the patient smokes (YES or NO)
6.  **cigsPerDay** :- Number of cigarettes smoked per day
7.  **BPMeds** :- Whether the patient is taking blood pressure medication
8.  **prevalentStroke** :- Whether the patient has a history of stroke
9.  **prevalentHyp** :- Whether the patient has a history of hypertension
10. **diabetes** :- Whether the patient has diabetes
11. **totChol** :- Total cholesterol level of the patient
12. **sysBP** :- Systolic blood pressure
13. **diaBP** :- Diastolic blood pressure
14. **BMI** :- Body Mass Index, a measure of body weight relative to height
15. **heartRate** :- Heart rate
16. **glucose** :- Glucose level of the patient
17. **TenYearCHD** :- Target variable: whether the patient is at risk of coronary heart disease (CHD) within ten years (1 = yes, 0 = no)
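
Two of the variables above, `sex` and `is_smoking`, are stored as text, so most scikit-learn models would need them converted to numbers first. A minimal sketch of one way to do that with an explicit mapping follows. The 0/1 assignment is an arbitrary choice made for illustration, and the toy frame only reuses the category labels seen in the data.

```python
import pandas as pd

# Toy frame reusing the category labels from the dataset (F/M, YES/NO).
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "is_smoking": ["YES", "NO", "YES"],
})

# Explicit mappings keep the encoding readable and reproducible.
encoded = toy.assign(
    sex=toy["sex"].map({"F": 0, "M": 1}),
    is_smoking=toy["is_smoking"].map({"NO": 0, "YES": 1}),
)

print(encoded)
```

Any label missing from the mapping becomes `NaN` under `Series.map`, which makes unexpected categories easy to spot.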

### Unique Values

In [None]:
# Check the number of unique values for each variable.
for i in dataset.columns.tolist():
  print("No. of unique values in ",i,"is",dataset[i].nunique())

No. of unique values in  id is 3390
No. of unique values in  age is 39
No. of unique values in  education is 4
No. of unique values in  sex is 2
No. of unique values in  is_smoking is 2
No. of unique values in  cigsPerDay is 32
No. of unique values in  BPMeds is 2
No. of unique values in  prevalentStroke is 2
No. of unique values in  prevalentHyp is 2
No. of unique values in  diabetes is 2
No. of unique values in  totChol is 240
No. of unique values in  sysBP is 226
No. of unique values in  diaBP is 142
No. of unique values in  BMI is 1259
No. of unique values in  heartRate is 68
No. of unique values in  glucose is 132
No. of unique values in  TenYearCHD is 2


In [None]:
# Columns with at most 10 unique values are treated as categorical

for col in dataset.columns:
  if dataset[col].nunique()<=10:
    print(col)
    print(dataset[col].value_counts())
    print('')

education
education
1.0    1391
2.0     990
3.0     549
4.0     373
Name: count, dtype: int64

sex
sex
F    1923
M    1467
Name: count, dtype: int64

is_smoking
is_smoking
NO     1703
YES    1687
Name: count, dtype: int64

BPMeds
BPMeds
0.0    3246
1.0     100
Name: count, dtype: int64

prevalentStroke
prevalentStroke
0    3368
1      22
Name: count, dtype: int64

prevalentHyp
prevalentHyp
0    2321
1    1069
Name: count, dtype: int64

diabetes
diabetes
0    3303
1      87
Name: count, dtype: int64

TenYearCHD
TenYearCHD
0    2879
1     511
Name: count, dtype: int64

