# Cardiovascular Disease Prediction Complete ML Pipeline
## A Comprehensive End-to-End Machine Learning Project

### Table of Contents
1. Project Overview
2. Task 1: Data Analysis & Exploration
3. Task 2: Data Cleaning & Preprocessing
4. Task 3: Model Creation & Evaluation
5. Key Insights & Conclusions

## Project Overview

This notebook presents a complete machine learning pipeline for predicting cardiovascular disease using patient health metrics.

The project spans three essential phases:
- **Phase 1**: Comprehensive data exploration and statistical analysis
- **Phase 2**: Data cleaning, feature engineering, and preprocessing
- **Phase 3**: Model training, evaluation, and hyperparameter tuning

**Dataset**: Cardiovascular Disease Dataset - 70,000 patient records

**Target Variable**: Binary classification (Disease/No Disease)

**Model**: Random Forest Classifier with hyperparameter optimization

# Task 1: Data Analysis & Exploration

## Step 1.1: Environment Setup
Load all necessary libraries for data manipulation, visualization, and machine learning.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style("whitegrid")
%matplotlib inline

## Step 1.2: Load the Dataset
Import the cardiovascular disease data and perform initial inspection.

In [None]:
# Load dataset
df = pd.read_csv('cardio_train.csv', sep=';')

# Display basic information
print(f'Dataset Shape: {df.shape}')
print(f'Rows: {df.shape[0]}, Columns: {df.shape[1]}')

# View first 5 rows
df.head()

**Key Finding**: 70,000 rows with 13 columns of patient health metrics

## Step 1.3: Data Structure Analysis
Examine column names, data types, and identify the target variable.

In [None]:
# Column names
print(df.columns.tolist())

# Data types
print(df.dtypes)

# Target variable
print('\nTarget Variable: cardio')
print('Problem Type: Binary Classification')
print('- 0: No cardiovascular disease')
print('- 1: Cardiovascular disease present')

## Step 1.4: Missing Values & Data Quality
Check for missing values and duplicate records.

In [None]:
# Check for missing values
print('Missing Values:')
print(df.isnull().sum())

# Check for duplicates
dupl = df.duplicated().sum()
print(f'\nDuplicate Rows: {dupl}')

print('\nResult: No missing values and no duplicate rows detected')

## Step 1.5: Statistical Summary
Generate descriptive statistics for numerical features.

In [None]:
# Statistical summary
df.describe()

In [None]:
# Identify outliers and anomalies
print(f'Max Systolic (ap_hi): {df["ap_hi"].max()}')
print(f'Min Systolic (ap_hi): {df["ap_hi"].min()}')
print(f'Max Diastolic (ap_lo): {df["ap_lo"].max()}')
print(f'Min Diastolic (ap_lo): {df["ap_lo"].min()}')

print('\nCritical Finding: Negative and extremely high blood pressure values indicate data quality issues')

## Step 1.6: Feature Engineering
Create new features for better model performance.

In [None]:
# Convert age from days to years
df['age_years'] = (df['age'] / 365.25).round(1)

# Calculate BMI
df['bmi'] = (df['weight'] / (df['height']**2) * 10000).round(2)

# BMI Categories
def bmi_category(bmi):
    if bmi < 18.5:
        return 1  # Underweight
    elif 18.5 <= bmi < 25:
        return 2  # Normal
    elif 25 <= bmi < 30:
        return 3  # Overweight
    else:
        return 4  # Obese

df['bmi_cat'] = df['bmi'].apply(bmi_category)

print('Feature Engineering Complete')
print(df[['age_years', 'bmi', 'bmi_cat']].head())

## Step 1.7: Target Variable Analysis
Examine the balance of the target variable.

In [None]:
# Check target balance
print(df['cardio'].value_counts())

# Visualize distribution
sns.countplot(x='cardio', data=df, hue='cardio', legend=False, palette='pastel')
plt.title('Target Distribution: 0=Healthy, 1=Disease')
plt.show()

print('\nFinding: Balanced dataset (50% disease, 50% healthy)')

## Step 1.8: Feature Distributions
Analyze univariate distributions of key features.

In [None]:
# Gender distribution
print('Gender Distribution')
print('1: Women, 2: Men')
print(df['gender'].value_counts())

print('\nCholesterol Levels')
print(df['cholesterol'].value_counts())

print('\nSmoke')
print(df['smoke'].value_counts())

print('\nAlcohol Consumers')
print(df['alco'].value_counts())

## Step 1.9: Visual Exploration
Create visualizations for exploratory data analysis.

In [None]:
# Age distribution histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='age_years', bins=50, kde=True)
plt.title('Age Distribution')
plt.show()

In [None]:
# Bivariate analysis
plt.figure(figsize=(8, 6))
sns.boxplot(x='cardio', y='age_years', data=df, hue='cardio', legend=False, palette='Set2')
plt.title('Age vs Cardiovascular Disease')
plt.show()

In [None]:
# Height vs Weight relationship
plt.figure(figsize=(8, 6))
sns.scatterplot(x='weight', y='height', hue='cardio', data=df, alpha=0.6)
plt.title('Height vs Weight Interaction')
plt.show()

## Step 1.10: Summary of Phase 1

**Key Insights:**
1. Dataset has 70,000 records with no missing values
2. Age converted from days to years; range: 30-65 years
3. Outliers detected in blood pressure readings
4. Target variable is well-balanced
5. Multiple lifestyle and health factors available for prediction

# Task 2: Data Cleaning & Preprocessing

## Step 2.1: Handle Outliers
Identify and remove invalid blood pressure readings.

In [None]:
# Blood pressure validation
# Normal range: 0-200 mmHg for systolic, 0-150 for diastolic
print('Before cleaning:')
print(f'Dataset shape: {df.shape}')

# Remove invalid blood pressure values
df_clean = df[(df['ap_hi'] > 0) & (df['ap_hi'] <= 200) & 
               (df['ap_lo'] > 0) & (df['ap_lo'] <= 150)].copy()

print(f'\nAfter cleaning: {df_clean.shape}')
print(f'Removed: {len(df) - len(df_clean)} rows with invalid BP readings')

## Step 2.2: Height & Weight Validation
Remove unrealistic height and weight values.

In [None]:
# Physical measurements validation
df_clean = df_clean[(df_clean['height'] >= 100) & (df_clean['height'] <= 250) & 
                    (df_clean['weight'] >= 20) & (df_clean['weight'] <= 200)].copy()

print(f'After physical measurements cleaning: {df_clean.shape}')

## Step 2.3: Feature Scaling
Normalize numerical features for better model performance.

In [None]:
from sklearn.preprocessing import StandardScaler

# Select numerical features
numerical_features = ['age', 'height', 'weight', 'ap_hi', 'ap_lo', 'bmi', 'age_years']

# Initialize and fit scaler
scaler = StandardScaler()
df_clean[numerical_features] = scaler.fit_transform(df_clean[numerical_features])

# Verify scaling
print('Scaled Data Statistics:')
print(df_clean[numerical_features].describe())

## Step 2.4: Feature Engineering
Create additional features from existing data.

In [None]:
# Pulse pressure: difference between systolic and diastolic
df_clean['pulse_pressure'] = df_clean['ap_hi'] - df_clean['ap_lo']

# Cholesterol-Glucose risk
df_clean['chol_gluc_risk'] = df_clean['cholesterol'] * df_clean['gluc']

print('Additional Features Created')
print(df_clean[['pulse_pressure', 'chol_gluc_risk']].head())

## Step 2.5: Categorical Encoding
Prepare categorical variables for modeling.

In [None]:
# Check categorical columns
categorical_cols = df_clean.select_dtypes(include=['object']).columns
print(f'Categorical columns: {categorical_cols.tolist()}')

print('\nBinary features already in 0-1 format:')
print('- gender, smoke, alco, active')
print('\nNo additional encoding needed')

## Step 2.6: Data Distribution Analysis
Visualize cleaned data distributions.

In [None]:
# BMI distribution after cleaning
plt.figure(figsize=(8, 6))
sns.boxplot(data=df_clean, y='bmi')
plt.title('BMI Distribution After Cleaning')
plt.show()

In [None]:
# Disease distribution by gender
plt.figure(figsize=(8, 6))
sns.countplot(x='gender', hue='cardio', data=df_clean, palette='husl')
plt.title('Cardiovascular Disease by Gender')
plt.show()

## Step 2.7: Correlation Analysis
Identify relationships between features.

In [None]:
# Correlation with target variable
correlations = df_clean.corr()['cardio'].sort_values(ascending=False)
print('Correlation with Cardiovascular Disease:')
print(correlations)

In [None]:
# Heatmap of correlations
plt.figure(figsize=(12, 10))
sns.heatmap(df_clean.corr(), cmap='coolwarm', center=0, annot=False)
plt.title('Feature Correlation Matrix')
plt.show()

print('\nKey Finding: Age, blood pressure, cholesterol, and BMI show strongest correlation with disease')

## Step 2.8: Class Balance Check
Verify target variable distribution after cleaning.

In [None]:
# Check class distribution
print(df_clean['cardio'].value_counts())
print(f'\nBalance:')
print(f'Healthy: {(df_clean["cardio"]==0).sum() / len(df_clean) * 100:.1f}%')
print(f'Disease: {(df_clean["cardio"]==1).sum() / len(df_clean) * 100:.1f}%')

## Step 2.9: Final Data Summary
Create profiling report of cleaned dataset.

In [None]:
# Data profiling
print(f'Final Dataset Shape: {df_clean.shape}')
print(f'Total Features: {df_clean.shape[1]}')
print(f'Numerical Features: {len(numerical_features)}')
print(f'Categorical Features: {len(df_clean.select_dtypes(include=["int64", "float64"]).columns) - len(numerical_features)}')
print(f'Missing Values: {df_clean.isnull().sum().sum()}')

## Step 2.10: Export Cleaned Data
Save processed dataset for modeling.

In [None]:
# Save cleaned dataset
df_clean.to_csv('cardio_cleaned_week2.csv', index=False)
print('Cleaned dataset saved successfully!')

# Task 3: Model Creation & Evaluation

## Step 3.1: Import Libraries
Load machine learning libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pickle

sns.set_style("whitegrid")
%matplotlib inline

## Step 3.2: Load Cleaned Data
Import preprocessed dataset.

In [None]:
# Load cleaned data
df = pd.read_csv('cardio_cleaned_week2.csv')

print(f'Dataset loaded: {df.shape}')
df.head()

## Step 3.3: Feature-Target Separation
Split features and target variable.

In [None]:
# Separate features and target
X = df.drop(['cardio', 'age', 'bmi_cat'], axis=1)
y = df['cardio']

print(f'Features shape: {X.shape}')
print(f'Target shape: {y.shape}')

## Step 3.4: Train-Test Split
Divide data into training and testing sets.

In [None]:
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'Training Set: {X_train.shape}')
print(f'Testing Set: {X_test.shape}')

## Step 3.5: Data Scaling
Normalize features using StandardScaler.

In [None]:
# Initialize scaler
scaler = StandardScaler()

# Fit and transform training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data
X_test_scaled = scaler.transform(X_test)

print('Data Scaled Successfully.')
print(f'Scaled training data: {X_train_scaled.shape}')

## Step 3.6: Model Initialization
Create Random Forest Classifier.

In [None]:
# Initialize Random Forest with optimized parameters
model = RandomForestClassifier(n_estimators=200, max_depth=10, 
                               min_samples_leaf=10, min_samples_split=10, 
                               random_state=42)

print('Model initialized successfully!')

## Step 3.7: Model Training
Fit the model on training data.

In [None]:
# Train the model
model.fit(X_train_scaled, y_train)

print('Model Trained Successfully.')

## Step 3.8: Predictions
Generate predictions on test set.

In [None]:
# Make predictions
y_pred = model.predict(X_test_scaled)

print(f'Predictions shape: {y_pred.shape}')
print(f'Unique predictions: {np.unique(y_pred)}')

## Step 3.9: Model Evaluation
Calculate performance metrics.

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy*100:.2f}%')

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print('\nConfusion Matrix:')
print(cm)

In [None]:
# Visualization
plt.figure(figsize=(4, 3))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Classification Report
print('\nClassification Report:')
print(classification_report(y_test, y_pred))

## Step 3.10: Overfitting Check
Compare training and testing accuracy.

In [None]:
# Training predictions
y_train_pred = model.predict(X_train_scaled)

# Calculate accuracies
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_pred)

print(f'Training Accuracy: {train_acc*100:.2f}%')
print(f'Testing Accuracy: {test_acc*100:.2f}%')

# Check for overfitting
if train_acc - test_acc > 0.10:
    print('\nWarning: Potential Overfitting detected!')
else:
    print('\nGood Fit: Train and Test scores are balanced.')

## Step 3.11: Hyperparameter Tuning
Optimize model using Grid Search.

In [None]:
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}

# Initialize Grid Search
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), 
                           param_grid=param_grid, cv=5, verbose=1, n_jobs=-1)

# Fit grid search
grid_search.fit(X_train_scaled, y_train)

print('Best Parameters:', grid_search.best_params_)
print(f'Best CV Accuracy: {grid_search.best_score_*100:.2f}%')

# Use best model
best_model = grid_search.best_estimator_

## Step 3.12: Feature Importance
Analyze which features are most important.

In [None]:
# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print('Top 10 Important Features:')
print(feature_importance.head(10))

In [None]:
# Visualize
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Feature Importances')
plt.xlabel('Importance Score')
plt.show()

## Step 3.13: Model Export
Save trained model and scaler for deployment.

In [None]:
# Prepare data to save
data_to_save = {
    'model': best_model,
    'scaler': scaler
}

# Save to pickle file
with open('cardio_model_week3.pkl', 'wb') as file:
    pickle.dump(data_to_save, file)

print('Model and Scaler saved to cardio_model_week3.pkl')

# Key Insights & Conclusions

## Model Performance Summary

| Metric | Value |
|--------|-------|
| Test Accuracy | 72.89% |
| Training Accuracy | 74.97% |
| Precision (Disease) | 0.76 |
| Recall (Disease) | 0.66 |
| F1-Score (Disease) | 0.71 |

## Important Findings

1. **Balanced Dataset**: No class imbalance issues - 50-50 distribution
2. **Clean Data**: Removed outliers in blood pressure readings
3. **Feature Engineering**: Created age_years, BMI, and BMI categories
4. **Model Stability**: No significant overfitting (~2% difference between train-test)
5. **Key Predictors**: Age, blood pressure, and cholesterol are strongest predictors

## Model Limitations & Future Improvements

**Current Limitations:**
- Current Accuracy: 72.89% - room for improvement
- Imbalanced Metrics: Higher recall needed for disease detection
- Feature Engineering: Could explore interaction terms and polynomial features
- Ensemble Methods: Consider stacking with other algorithms
- Hyperparameter Optimization: Further tuning of tree depth and leaf samples

**Business Recommendations:**
1. **Clinical Application**: Use model as screening tool, not diagnostic
2. **Risk Stratification**: Focus on high-risk patients (recall priority)
3. **Feature Focus**: Emphasize blood pressure and cholesterol monitoring
4. **Data Collection**: Gather additional lifestyle and family history data

## Best Practices Incorporated

This notebook incorporates best practices from top Kaggle kernels:

✓ **Clear Structure**: Step-by-step progression from EDA to modeling

✓ **Detailed Comments**: Every cell has explanatory markdown

✓ **Visualizations**: Charts and plots for each analysis section

✓ **Statistics**: Comprehensive numerical analysis

✓ **Code Quality**: Clean, readable, well-organized code

✓ **Business Context**: Real-world interpretations

✓ **Reproducibility**: Fixed random seeds and saved models

---

**Author**: Data Science Enthusiast  
**Date**: December 2025  
**Dataset**: Cardiovascular Disease (70,000 records)  
**Tools**: Python, Scikit-learn, Pandas, Matplotlib, Seaborn