# Heart Disease Prediction Model

## Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Loading the Dataset

In [2]:
data = pd.read_csv('Heart_Disease_Prediction.csv')
data.head() 

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   52    1   0       125   212    0        1      168      0      1.0      2   2     3       0
1   53    1   0       140   203    1        0      155      1      3.1      0   0     3       0
2   70    1   0       145   174    0        1      125      1      2.6      0   0     3       0
3   61    1   0       148   203    0        1      161      0      0.0      2   1     3       0
4   62    0   0       138   294    1        1      106      0      1.9      1   3     2       0


## Data Cleaning and Preprocessing

### Checking for Missing Values

In [3]:
data.isnull().sum()  # Check for missing values

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Understanding Data Types 

In [4]:
data.dtypes 

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

## Splitting Dataset into Features and Target

In [5]:
X = data.iloc[:, :-1].values  # Features
y = data.iloc[:, -1].values  # Target

## Splitting Dataset into Training and Testing Sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
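
The split above is not stratified, so the class balance of the test set can drift from that of the training set. The following is a minimal sketch of a stratified split on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's dataset, and `stratify` is a standard parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: binary target, 13 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 13))
y_demo = np.array([0] * 140 + [1] * 60)

# stratify=y_demo keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"train positive rate: {train_pos_rate:.3f}")
print(f"test positive rate:  {test_pos_rate:.3f}")
```

Whether stratification changes the results reported in this notebook is not known; it mainly matters when classes are imbalanced or the dataset is small.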

## Model Training

### Logistic Regression

In [8]:
lr_params = {'C': [0.01, 0.1, 1, 10], 'solver': ['liblinear', 'lbfgs']}
lr_grid = GridSearchCV(LogisticRegression(max_iter=5000), lr_params, cv=5, scoring='accuracy')
lr_grid.fit(X_train, y_train)
best_lr = lr_grid.best_estimator_

### K-Nearest Neighbors

In [9]:
knn_params = {'n_neighbors': range(1, 21), 'metric': ['minkowski'], 'p': [1, 2]}
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=5, scoring='accuracy')
knn_grid.fit(X_train, y_train)
best_knn = knn_grid.best_estimator_

### Feature Scaling for the SVM Classifiers

In [10]:
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
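
Because the scaler above is fitted on all of `X_train` before `GridSearchCV` runs, each cross-validation fold's validation portion has already influenced the scaling statistics. One common way to avoid this is to put the scaler and the classifier in a `Pipeline` so that scaling is refitted inside every fold. The sketch below uses synthetic data and hypothetical names (`svm_pipeline`, `pipeline_grid`); it is an illustration of the idea, not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 13 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler is refitted on the training part of each CV fold.
svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as "<step name>__<parameter>".
pipeline_params = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 0.01, 0.001]}
pipeline_grid = GridSearchCV(svm_pipeline, pipeline_params, cv=5, scoring="accuracy")
pipeline_grid.fit(X_tr, y_tr)

# The fitted pipeline scales raw test data itself before predicting.
test_accuracy = pipeline_grid.score(X_te, y_te)
print(pipeline_grid.best_params_, f"test accuracy: {test_accuracy:.2f}")
```

A side benefit of this design is that the evaluation loop no longer needs a special case for scaled versus unscaled inputs.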

### Support Vector Machine (Linear Kernel)

In [11]:
svm_linear_params = {'C': [0.1, 1, 10]}
svm_linear_grid = GridSearchCV(SVC(kernel='linear'), svm_linear_params, cv=5, scoring='accuracy')
svm_linear_grid.fit(X_train_scaled, y_train)
best_svm_linear = svm_linear_grid.best_estimator_

### Support Vector Machine (RBF Kernel)

In [12]:
svm_rbf_params = {'C': [0.1, 1, 10], 'gamma': [0.1, 0.01, 0.001]}
svm_rbf_grid = GridSearchCV(SVC(kernel='rbf'), svm_rbf_params, cv=5, scoring='accuracy')
svm_rbf_grid.fit(X_train_scaled, y_train)
best_svm_rbf = svm_rbf_grid.best_estimator_

### Naive Bayes

In [13]:
best_naive_bayes = GaussianNB()
best_naive_bayes.fit(X_train, y_train)

### Decision Tree

In [14]:
dt_params = {'criterion': ['gini', 'entropy'], 'max_depth': [None, 10, 20, 30]}
dt_grid = GridSearchCV(DecisionTreeClassifier(random_state=0), dt_params, cv=5, scoring='accuracy')
dt_grid.fit(X_train, y_train)
best_dt = dt_grid.best_estimator_

### Random Forest

In [15]:
rf_params = {'n_estimators': [10, 50, 100, 200], 'criterion': ['gini', 'entropy'], 'max_depth': [None, 10, 20, 30]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=0), rf_params, cv=5, scoring='accuracy')
rf_grid.fit(X_train, y_train)
best_rf = rf_grid.best_estimator_

## Model Evaluation

### Collecting the Trained Models

In [16]:
models = {
    'Logistic Regression': best_lr,
    'KNN': best_knn,
    'SVM (Linear)': best_svm_linear,
    'SVM (RBF)': best_svm_rbf,
    'Naive Bayes': best_naive_bayes,
    'Decision Tree': best_dt,
    'Random Forest': best_rf
}

### Test-Set Accuracy

In [19]:
results = []

for name, model in models.items():
    if 'SVM' in name:  # SVM requires scaled data
        y_pred = model.predict(X_test_scaled)
    else:
        y_pred = model.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    results.append((name, accuracy))
    print(f"{name} Accuracy: {accuracy:.2f}")

Logistic Regression Accuracy: 0.86
KNN Accuracy: 0.98
SVM (Linear) Accuracy: 0.84
SVM (RBF) Accuracy: 1.00
Naive Bayes Accuracy: 0.85
Decision Tree Accuracy: 1.00
Random Forest Accuracy: 1.00
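
Accuracy alone does not show which kind of error a model makes. The imported `confusion_matrix` and `classification_report` functions can provide that breakdown. The sketch below applies them to a single model trained on synthetic data; `demo_model` and the data are hypothetical stand-ins, so the printed numbers say nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
y_pred_demo = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred_demo)
print(cm)

# Per-class precision, recall, and F1 score.
print(classification_report(y_te, y_pred_demo))
```

In a medical screening setting, the false-negative count (true class 1, predicted class 0) is usually the figure to watch most closely.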


## Comparing Models

In [18]:
results_df = pd.DataFrame(results, columns=['Model', 'Accuracy']).sort_values(by='Accuracy', ascending=False)
print(results_df)

                 Model  Accuracy
3            SVM (RBF)  1.000000
5        Decision Tree  1.000000
6        Random Forest  1.000000
1                  KNN  0.980488
0  Logistic Regression  0.863415
4          Naive Bayes  0.853659
2         SVM (Linear)  0.839024
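
The ranking above rests on a single 80/20 split. A cross-validated comparison gives a mean and a spread for each model, which makes ties and small gaps easier to interpret. The following is a minimal sketch on synthetic data; the `candidates` dictionary and its two models are illustrative choices, not the tuned estimators from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: 13 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_summary = {}
for name, model in candidates.items():
    # Five accuracy scores, one per fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```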


# Recommendations

Based on the accuracy results above:

1. **Best Models: SVM (RBF), Decision Tree, and Random Forest**
   * **Accuracy: 100.00%**
   * **Recommendation**: All three models achieved perfect accuracy on the test set. A perfect score on held-out data is unusual and should be treated with caution rather than as proof of generalization. It can mean that the test set is not fully independent of the training set (for example, duplicate records that land in both splits) or that this particular 80/20 split happens to be easy. In either case, performance on genuinely unseen data may be lower.
   * **Decision**: Among the tied models, **Random Forest** is preferred over Decision Tree because averaging the predictions of many trees reduces the variance of a single tree. Note that the saving step at the end of this notebook exports the first row of the sorted results, which is SVM (RBF), so the exported file matches this preference only if the selection is changed. Further tuning (e.g., increasing `n_estimators`, adjusting `max_depth`) and cross-validated evaluation are recommended to confirm robustness.

2. **Second Best Model: K-Nearest Neighbors (KNN)**
   * **Accuracy: 98.05%**
   * **Recommendation**: KNN performs very well, with accuracy only slightly below the top models. However, its prediction cost grows with the size of the training set, and its results are sensitive to the choice of `k`, the distance metric, and feature scale (the model here was trained on unscaled features). It is a strong performer on this dataset but may not suit large-scale or real-time applications.

3. **Third Best Model: Logistic Regression**
   * **Accuracy: 86.34%**
   * **Recommendation**: Logistic Regression offers solid performance and interpretability. It is a good choice when you need a linear model that is easy to understand and explain. However, it cannot capture non-linear relationships between features without additional feature engineering, which may explain the gap to the non-linear models here.

4. **Fourth Best Model: Naive Bayes**
   * **Accuracy: 85.37%**
   * **Recommendation**: Naive Bayes performs well and is simple and fast to train. It is suitable for scenarios where interpretability and speed are prioritized. However, its assumption of feature independence may limit performance for more complex datasets.

## Additional Considerations

* **Random Forest (100%)**:
   * Random Forest also achieved perfect test accuracy. Its ensemble approach generally makes it more robust than a single decision tree, which is why it is the recommended choice for this problem, subject to the caution about perfect scores noted above.

* **Support Vector Machine (Linear Kernel)**
   * **Accuracy: 83.90%**
   * SVM with a linear kernel is the least accurate model in this comparison. The gap to the RBF kernel suggests the classes are not linearly separable in the scaled feature space, so a non-linear kernel or further feature engineering is needed to improve on it.

* **Kernel SVM (RBF) (100%)**
   * With the tuned hyperparameters (`C=10`, `gamma=0.1`) and scaled features, the RBF-kernel SVM reaches perfect test accuracy in this run. Because it was trained on standardized features, any deployment must apply the same fitted `StandardScaler` to new inputs before prediction.
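
Since several models reach 100% test accuracy, one sanity check worth running is whether identical rows appear in both the training and test sets. Whether this notebook's dataset contains duplicates is not established here; the sketch below only shows how such a check might look, using a small hypothetical DataFrame (`demo`) built with deliberate repeats.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 distinct records, each repeated 4 times.
demo = pd.DataFrame({
    "age": [52, 53, 70, 61, 62] * 4,
    "chol": [212, 203, 174, 203, 294] * 4,
    "target": [0, 0, 0, 0, 1] * 4,
})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(demo.duplicated().sum())

train_df, test_df = train_test_split(demo, test_size=0.25, random_state=0)

# Count test rows that also occur, value for value, in the training set.
train_rows = set(map(tuple, train_df.to_numpy()))
n_leaked = sum(tuple(row) in train_rows for row in test_df.to_numpy())

# Dropping duplicates before splitting removes this source of leakage.
deduped = demo.drop_duplicates()

print(f"duplicate rows: {n_duplicates}")
print(f"test rows also present in train: {n_leaked} of {len(test_df)}")
print(f"rows after de-duplication: {len(deduped)}")
```

If a real dataset shows a high overlap, accuracy measured after de-duplication is the more trustworthy figure.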

## Save .pkl Files

In [20]:
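# results_df is sorted by accuracy, so the first row is the top-ranked model.
# Three models tie at 1.00; iloc[0] takes whichever of them sorts first.
best_model_name, best_model_accuracy = results_df.iloc[0]
best_model = models[best_model_name]

print(best_model)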


SVC(C=10, gamma=0.1)


In [21]:
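# Note: this saves only the classifier. The SVM models expect inputs transformed
# by the fitted StandardScaler `sc`, which must be saved and applied separately.
with open('heart_disease_prediction_best_model.pkl', 'wb') as model_file:
    pickle.dump(best_model, model_file)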

print(f"Best Model: {best_model_name} with Accuracy: {best_model_accuracy:.2f}")

Best Model: SVM (RBF) with Accuracy: 1.00
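
The saved SVM was trained on standardized features, so it gives meaningful predictions only when new inputs pass through the same fitted scaler. One way to keep the two together is to pickle a `Pipeline` that contains both. The sketch below is an assumption-laden illustration: it uses synthetic data, a hypothetical `deployable` pipeline, and serializes in memory with `pickle.dumps` instead of writing a file.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: 13 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)

# Scaler and classifier travel together as one object.
# C and gamma mirror the values printed for the selected model.
deployable = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=10, gamma=0.1)),
])
deployable.fit(X_demo, y_demo)

# Round-trip through pickle, as a file-based save and load would do.
payload = pickle.dumps(deployable)
restored = pickle.loads(payload)

# The restored pipeline accepts raw, unscaled inputs.
same_predictions = bool(
    np.array_equal(deployable.predict(X_demo), restored.predict(X_demo))
)
print(f"restored pipeline reproduces predictions: {same_predictions}")
```

Pickle files should be loaded only from trusted sources and with a scikit-learn version compatible with the one used for saving.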
