[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/<your_github_username>/<your_repository_name>/blob/main/notebooks/model_development.ipynb)
[![Kaggle Notebook](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/new?source=https://github.com/mobadara/<your_repository_name>/blob/main/notebooks/model_development.ipynb)
[![Python](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)

# **Model Development - Cardiovascular Disease Risk Prediction**

## **Introduction**
This notebook marks the beginning of the model development phase for our Cardiovascular Disease Risk Prediction project. Having thoroughly explored the dataset in the Exploratory Data Analysis (EDA) and enriched it with new features during Feature Engineering, we are now ready to train and compare various machine learning models.

* **EDA Notebook:** [![Open In GitHub](https://img.shields.io/badge/View%20EDA%20Notebook-blue?logo=github)](https://github.com/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/exploratory-data-analysis.ipynb)
* **Feature Engineering Notebook:** [![Open In GitHub](https://img.shields.io/badge/View%20FE%20Notebook-blue?logo=github)](https://github.com/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/feature-engineering.ipynb)

In this notebook, we will focus on building and evaluating several classification models to predict cardiovascular disease (`cardio`), including:

* **Logistic Regression**
* **Decision Tree / Random Forest**
* **Gradient Boosting Machines (e.g., LightGBM, XGBoost)**
* And potentially others like **Support Vector Machines (SVM)** or **K-Nearest Neighbors (KNN)**.

We will also implement essential preprocessing steps such as One-Hot Encoding and Standard Scaling, and carefully evaluate each model's performance using relevant metrics. Let's get started!
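As a rough illustration of how One-Hot Encoding, Standard Scaling, and metric-based evaluation fit together, here is a minimal sketch. It runs on a small synthetic DataFrame, not the project data: the column subset, the category values, and the rule that generates the toy target are assumptions made purely for demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the project dataset); names mirror a few of its columns.
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "age": rng.uniform(30, 65, n),
    "bmi": rng.normal(27, 4, n),
    "cholesterol": rng.choice(["Normal", "Well Above Normal"], n),
    "smoke": rng.choice(["No", "Yes"], n),
})
# Hypothetical target loosely tied to age and BMI so there is some signal to learn.
logit = 0.08 * (toy["age"] - 50) + 0.15 * (toy["bmi"] - 27)
toy["cardio"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

numeric_cols = ["age", "bmi"]
categorical_cols = ["cholesterol", "smoke"]

# Scale numeric columns and one-hot encode categorical columns in one step.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Chaining preprocessing and the classifier means the scaler and encoder
# are fitted on the training data only.
model = Pipeline([
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_toy = toy.drop(columns="cardio")
y_toy = toy["cardio"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

model.fit(X_tr, y_tr)
pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]
print(f"accuracy={accuracy_score(y_te, pred):.3f}  roc_auc={roc_auc_score(y_te, proba):.3f}")
```

Wrapping the preprocessing in a `Pipeline` is one way to keep test-set statistics from leaking into the fitted transformers; the same pattern should carry over to the other classifiers listed above by swapping the `clf` step.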

## **Notebook Setup**

Before diving into model development, we import the libraries needed for data manipulation, numerical operations, model building, evaluation, and visualization. The following code cell also suppresses warnings and configures pandas to display all columns.

In [1]:
# Data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing, data splitting, tuning, and pipelines
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Candidate classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
)

import warnings

# Note: this silences all warnings, including scikit-learn convergence warnings.
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)  # show every column in DataFrame previews

## **Data Loading**

The initial step in this model development phase is to load the dataset that has undergone the complete feature engineering process. This dataset, enriched with new features like BMI, age in years, and blood pressure categories, was the output of our previous feature engineering notebook and has been saved to the GitHub repository.

The following cell will load this prepared dataset directly from its raw URL on GitHub into a pandas DataFrame.


In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/mobadara/cardiovascular-disease-risk-prediction/main/data/cardio-engineered.csv')
df.head()

| Unnamed: 0 | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | bmi | age_group | blood_pressure_category | pulse_pressure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 50.35729 | Male | 168 | 62.0 | 110 | 80 | Normal | Normal | No | No | Active | 0 | 21.96712 | Middle-Aged | Hypertension Stage 1 | 30 |
| 1 | 1 | 55.381246 | Female | 156 | 85.0 | 140 | 90 | Well Above Normal | Normal | No | No | Active | 1 | 34.927679 | Senior | Hypertension Stage 2 | 50 |
| 2 | 2 | 51.627652 | Female | 165 | 64.0 | 130 | 70 | Well Above Normal | Normal | No | No | Inactive | 1 | 23.507805 | Middle-Aged | Hypertension Stage 1 | 60 |
| 3 | 3 | 48.249144 | Male | 169 | 82.0 | 150 | 100 | Normal | Normal | No | No | Active | 1 | 28.710479 | Middle-Aged | Hypertension Stage 2 | 50 |
| 4 | 4 | 47.841205 | Female | 156 | 56.0 | 100 | 60 | Normal | Normal | No | No | Inactive | 0 | 23.011177 | Middle-Aged | Normal | 40 |
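The `Unnamed: 0` column in the preview is what pandas typically produces when a DataFrame's index was written to the CSV along with the data. Assuming that is what happened here, the sketch below shows one way to read such a file so the saved index is not treated as a feature. It parses two rows copied from the preview through `io.StringIO` rather than the real URL, and setting aside `id` is an assumption about the intended feature set.

```python
import io

import pandas as pd

# Two rows copied from the preview above. The empty first header field is an
# assumption about how the real file stores the saved index.
sample_csv = io.StringIO(
    ",id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,"
    "cardio,bmi,age_group,blood_pressure_category,pulse_pressure\n"
    "0,0,50.35729,Male,168,62.0,110,80,Normal,Normal,No,No,Active,0,21.96712,"
    "Middle-Aged,Hypertension Stage 1,30\n"
    "1,1,55.381246,Female,156,85.0,140,90,Well Above Normal,Normal,No,No,Active,1,"
    "34.927679,Senior,Hypertension Stage 2,50\n"
)

# index_col=0 uses the first column as the index instead of creating 'Unnamed: 0'.
sample = pd.read_csv(sample_csv, index_col=0)

# 'id' is a row identifier, so it is set aside together with the target.
features = sample.drop(columns=["id", "cardio"])
print(features.dtypes)
```

Inspecting `features.dtypes` also makes it easy to see which columns are text categories that would need encoding and which are numeric.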


## **Data Splitting**

Before applying any further preprocessing steps or training models, it is essential to split the dataset into training and testing sets. This division ensures that we evaluate our models on data they have not seen during the training phase, providing a more reliable estimate of their generalization performance.

We will split the data into an 80% training set and a 20% testing set, stratified on the target variable (**`cardio`**) so that both sets keep the same proportion of positive and negative cases. Stratification matters most for imbalanced targets; our target is balanced, so here it mainly guards against the split drifting away from that balance by chance.

In [6]:
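# Separate the features from the target.
# Note: X still contains the identifier columns 'Unnamed: 0' and 'id'.
X = df.drop('cardio', axis=1)
y = df['cardio']

# 80/20 split, stratified on the target and seeded for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Sanity checks: matching row counts, matching feature counts, correct target name.
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]
assert X_train.shape[1] == X_test.shape[1]
assert y_train.name == y_test.name == 'cardio'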
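To see what `stratify=y` does in practice, the following sketch compares the positive-class rate across splits. It uses synthetic labels with a 50/50 class ratio as a stand-in for `cardio`; the names `X_demo` and `y_demo` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced labels (a stand-in, not the project data).
y_demo = pd.Series(np.repeat([0, 1], 500), name="cardio")
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify set, the positive rate in each split matches the full data.
print(f"full={y_demo.mean():.3f}  train={y_tr.mean():.3f}  test={y_te.mean():.3f}")
```

The same comparison can be run on `y`, `y_train`, and `y_test` from the cell above to confirm that the real split preserved the class ratio.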