# Student Information
**Name:** Vishal Baraiya  
**Enrollment No.:** 23010101014  
**Roll No.:** C3-635  
**Course:** Machine Learning & Deep Learning Project  

---

# **Objectives**

- Load the cleaned data from Week 2
- Split it into train and test sets
- Train a Random Forest model
- Evaluate using accuracy, precision, recall, F1 score, and ROC AUC
- Plot the confusion matrix and ROC curve
- Analyze feature importance
- Save the model for deployment

# **1. Import Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, roc_auc_score,
)

# **2. Load Dataset**

In [None]:
df = pd.read_csv("../data/processed/clean_cardio.csv")
df.head()

In [None]:
df.info()

# **3. Define Features & Target**

In [None]:
feature_cols = [
    'age_years', 'ap_hi', 'ap_lo', 'bmi',
    'cholesterol', 'gluc',
    'smoke', 'alco', 'active',
    'smoke_age', 'smoke_bmi',
    'alco_age', 'alco_bmi'
]

X = df[feature_cols].values
y = df['cardio'].values

X.shape, y.shape

# **4. Split Dataset**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_test.shape
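
The split above passes `stratify=y`, which should keep the class balance nearly identical in the train and test sets. The sketch below checks that on synthetic data; `X_demo`, `y_demo`, and the roughly 30% positive rate are illustrative assumptions, not values from the cardio dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the `cardio` column (illustrative only).
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.3).astype(int)  # roughly 30% positives
X_demo = rng.normal(size=(1000, 4))            # placeholder features

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify set, the positive rate should be almost the same in both splits.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}")
print(f"test positive rate:  {test_rate:.3f}")
```

The same two-line check (`y_train.mean()`, `y_test.mean()`) can be run on the real split.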

# **5. Train Random Forest Model**

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
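
A single train/test split gives one estimate of performance. As an optional sanity check, k-fold cross-validation shows how much the score varies across folds. This is a minimal sketch on synthetic data from `make_classification`; the sample size, the 25 trees, and the 5 folds are assumptions chosen to keep it fast, not settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (stand-in for the cardio features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=25, random_state=42)

# One accuracy value per fold.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```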

# **6. Make Predictions**

In [None]:
y_pred = rf_model.predict(X_test)
y_proba = rf_model.predict_proba(X_test)[:, 1]

# **7. Evaluate Model**

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

# **8. Confusion Matrix**

In [None]:
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
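
The metrics printed in section 7 can all be derived from the four cells of this matrix. For binary labels, `confusion_matrix` returns rows as actual classes and columns as predicted classes, i.e. `[[TN, FP], [FN, TP]]`. The sketch below uses made-up counts (`example_cm` is hypothetical, not this model's result) to show the arithmetic.

```python
def metrics_from_confusion(cm):
    """Derive metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
example_cm = [[50, 10],
              [5, 35]]
metrics = metrics_from_confusion(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On the real matrix, `cm.tolist()` can be passed to the same function and compared against the scikit-learn scores.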

# **9. ROC Curve**

In [None]:
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)  # area under the ROC curve

plt.figure(figsize=(6,5))
plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc:.3f})")
plt.plot([0,1], [0,1], '--', color='gray', label="Chance")  # diagonal = random guessing
plt.title("ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

print("AUC Score:", auc)
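
One way to read the AUC score: it equals the probability that a randomly chosen positive sample receives a higher predicted probability than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python; `pairwise_auc` is a hypothetical helper written for illustration, and the four labels and scores are toy values.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy labels and scores for illustration only.
demo_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("Pairwise AUC:", demo_auc)  # 3 of 4 pairs are ordered correctly -> 0.75
```

This loop is quadratic in the number of samples, so it is only meant to explain the number; `roc_auc_score` is the practical choice on the full test set.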

# **10. Feature Importance**

In [None]:
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

feature_importance

In [None]:
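plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.gca().invert_yaxis()  # barh draws the first row at the bottom; flip so the top feature is on top
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()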
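
The values in `feature_importances_` are impurity-based and computed on the training data. Permutation importance on held-out data is a commonly used cross-check: it measures how much the score drops when one feature's values are shuffled. The sketch below runs it on synthetic data; the dataset, the 25 trees, and `n_repeats=5` are assumptions chosen for speed, not settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, 3 of them informative (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature several times on the test set and record the accuracy drop.
result = permutation_importance(
    model, X_te, y_te, scoring="accuracy", n_repeats=5, random_state=42
)

# Print features from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

With the real model, the equivalent call would be `permutation_importance(rf_model, X_test, y_test, ...)`, and the printed names would come from `feature_cols`.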

# **11. Save Model**

In [None]:
import pickle
import os

In [None]:
os.makedirs("../models", exist_ok=True)

with open('../models/random_forest_model.pkl', 'wb') as file:
    pickle.dump(rf_model, file)

print("Model saved successfully!")
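
Before deployment it is worth confirming that the saved file loads and reproduces the original predictions. The sketch below does a round trip with a small model on synthetic data and a temporary directory; the data, the 10 trees, and the temporary path are assumptions for illustration, so the real `../models/random_forest_model.pkl` is not touched.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for rf_model (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "random_forest_model.pkl")

    # Save the fitted model.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Load it back. Only unpickle files from a trusted source.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give exactly the same predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickled scikit-learn models are generally expected to be loaded with the same scikit-learn version that saved them, so recording the version alongside the file helps at deployment time.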

# **Week 3 Completed**
- Loaded the cleaned dataset
- Split the data into train and test sets
- Trained a Random Forest model with 100 trees
- Evaluated using accuracy, precision, recall, F1 score, and ROC AUC
- Plotted the confusion matrix and ROC curve
- Analyzed feature importance
- Saved the model for deployment