# TASK #1: UNDERSTAND THE PROBLEM STATEMENT


The aim is to predict the presence or absence of cardiovascular disease in a person based on the given features.
The available features are:


| Feature | Feature type | Column | Data type / values |
| --- | --- | --- | --- |
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

Note that:
- Objective: factual information;
- Examination: results of medical examination;
- Subjective: information given by the patient.

Data Source: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset

# TASK #2: IMPORT LIBRARIES AND DATASETS

In [None]:
# import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# read the CSV file (fields are separated by semicolons)
cardio_df = pd.read_csv("cardio_train.csv", sep=";")

In [None]:
cardio_df.head()

# TASK #3: PERFORM EXPLORATORY DATA ANALYSIS

In [None]:
# Drop id

cardio_df = cardio_df.drop(columns = 'id')

In [None]:
# since the age is given in days, we convert it into years

cardio_df['age'] = cardio_df['age']/365
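
As a quick sanity check on the conversion above, the arithmetic can be worked through for a single value. The sample value of 18393 days is an assumption chosen for illustration, not a figure taken from this notebook's output. Dividing by 365 ignores leap days, so dividing by 365.25 gives a slightly smaller age; which divisor is preferable is a judgment call.

```python
# Hypothetical sample value, for illustration only
age_days = 18393

age_years_365 = age_days / 365        # conversion used in the notebook
age_years_36525 = age_days / 365.25   # alternative that accounts for leap days

print(round(age_years_365, 2))
print(round(age_years_36525, 2))
print(int(age_years_365))  # whole years, if an integer age is preferred
```

Either way the result is a float, so the `age` column changes from an integer type to a floating-point type after this step.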

In [None]:
cardio_df.head()

In [None]:
# checking the null values
cardio_df.isnull().sum()

In [None]:
# Checking the dataframe information

cardio_df.info()

In [None]:
# Statistical summary of the dataframe
cardio_df.describe()

# TASK #4: VISUALIZE DATASET

In [None]:
sns.pairplot(cardio_df)

# TASK #5: CREATE TRAINING AND TESTING DATASET

In [None]:
# split the dataframe into target and features

df_target = cardio_df['cardio']
df_features = cardio_df.drop(columns =['cardio'])

In [None]:
cardio_df.columns

In [None]:
df_features.shape

In [None]:
df_features.columns

In [None]:
df_target.shape

In [None]:
# splitting the data into train and test sets (80% train, 20% test)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size = 0.2)
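
The split above is drawn at random on every run, so the reported metrics may vary slightly between runs. One possible way to make it repeatable is sketched below, using the `random_state` and `stratify` parameters of `train_test_split`. The toy lists stand in for `df_features` and `df_target` and are purely hypothetical; the seed value 42 is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for df_features / df_target (hypothetical: 100 rows, 30% positives)
toy_features = [[i, i % 7] for i in range(100)]
toy_target = [1 if i < 30 else 0 for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features,
    toy_target,
    test_size=0.2,
    random_state=42,      # fixed seed: the same split on every run
    stratify=toy_target,  # keep the class ratio the same in both sets
)

print(len(X_tr), len(X_te))
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))
```

With stratification, the share of positive cases in the train and test sets matches the share in the full data, which matters most when the classes are imbalanced.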



In [None]:
X_train.shape

In [None]:
X_train.columns

In [None]:
y_train.shape

In [None]:
X_test.shape

# TASK #6: TRAIN AND TEST XGBOOST MODEL IN LOCAL MODE

In [None]:
# install xgboost

!pip install xgboost

In [None]:
# use xgboost model in local mode

# Note that we have not normalized or scaled the features, since XGBoost is not sensitive to feature scale.
# XGBoost is an ensemble of decision trees; each tree splits a node by choosing a threshold (cut point) on a feature.
# Rescaling a feature moves the thresholds but leaves the resulting splits unchanged.


from xgboost import XGBClassifier

# model = XGBClassifier(learning_rate=0.01, n_estimators=100, objective='binary:logistic')
model = XGBClassifier()

model.fit(X_train, y_train)
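
The claim that threshold-based splits do not depend on feature scale can be illustrated with a single-split "decision stump" written in plain Python. This is a deliberately simplified sketch and not XGBoost's actual split-finding algorithm; the function names and the toy data are hypothetical.

```python
def best_stump(xs, ys):
    """Return (threshold, errors) for the best rule 'predict 1 if x > threshold'."""
    values = sorted(set(xs))
    # Candidate cut points: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


def stump_predict(xs, threshold):
    """Apply the rule 'predict 1 if x > threshold' to every value."""
    return [int(x > threshold) for x in xs]


# Hypothetical toy data: systolic pressure values and a made-up label
ap_hi = [100, 110, 120, 130, 140, 150, 160]
label = [0, 0, 0, 1, 0, 1, 1]

# Min-max scale the same feature to the range [0, 1]
lo, hi = min(ap_hi), max(ap_hi)
ap_hi_scaled = [(x - lo) / (hi - lo) for x in ap_hi]

t_raw, err_raw = best_stump(ap_hi, label)
t_scaled, err_scaled = best_stump(ap_hi_scaled, label)

print(t_raw, t_scaled)  # the thresholds differ in value
print(stump_predict(ap_hi, t_raw) == stump_predict(ap_hi_scaled, t_scaled))
```

The chosen threshold changes numerically after scaling, but it separates the same rows, so the predictions are identical. The same reasoning applies to any order-preserving transformation of a feature.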

In [None]:
# make predictions on test data

predict = model.predict(X_test)

In [None]:
predict

In [None]:
# print metrics for testing dataset
from sklearn.metrics import precision_score, recall_score, accuracy_score

print("Precision = {}".format(precision_score(y_test, predict)))
print("Recall = {}".format(recall_score(y_test, predict)))
print("Accuracy = {}".format(accuracy_score(y_test, predict)))
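
To make the three metrics above concrete, the sketch below computes them by hand from the four confusion-matrix counts. The labels and predictions are hypothetical values chosen for illustration; they are not output from the model trained in this notebook.

```python
# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)            # of predicted positives, how many are correct
recall = tp / (tp + fn)               # of actual positives, how many were found
accuracy = (tp + tn) / len(pairs)     # share of all predictions that are correct

print("Precision = {}".format(precision))
print("Recall = {}".format(recall))
print("Accuracy = {}".format(accuracy))
```

For a disease-screening task, recall is often the metric to watch, since a false negative means a patient with the disease goes undetected.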

In [None]:
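# save_model() writes XGBoost's own format, chosen from the file extension; it never produces a pickle,
# so use .json (text) or .ubj (binary, XGBoost 1.6+) rather than a misleading .pkl name
model.save_model('bst_save_model.json')
model.save_model('bst_save_model.ubj')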