# Heart Disease Prediction Model

## Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

## Loading the Dataset

In [2]:
data = pd.read_csv('Heart_Disease_Prediction.csv')
data.head() 

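|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |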


## Data Cleaning and Preprocessing

### Checking for Missing Values

In [3]:
data.isnull().sum()  # Check for missing values

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

### Understanding Data Types 

In [4]:
data.dtypes 

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

## Splitting Dataset into Features and Target

In [5]:
X = data.iloc[:, :-1].values  # Features
y = data.iloc[:, -1].values  # Target

## Splitting Dataset into Training and Testing Sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Model Training

### Logistic Regression

In [8]:
logistic_classifier = LogisticRegression(max_iter=1000)
logistic_classifier.fit(X_train, y_train)

### K-Nearest Neighbors

In [9]:
KNN_classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
KNN_classifier.fit(X_train, y_train)

### Feature Scaling for the SVM Classifiers

In [10]:
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
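
Because the scaler is fitted separately from the classifier, every later call to `predict` has to be given features transformed by the same scaler. One way to make that automatic is to wrap both steps in a scikit-learn `Pipeline`. The block below is a minimal sketch of that idea, not the notebook's own method: it uses synthetic data from `make_classification` as a stand-in for the heart disease features, and the names `X_demo`, `y_demo`, and `svm_pipeline` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: this is NOT the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_demo = X_demo * 50 + 100  # shift and stretch so the features are clearly unscaled

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# fit() learns the scaling from the training data only;
# predict() reapplies that same scaling before calling the SVM
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=0))
svm_pipeline.fit(X_tr, y_tr)

pipeline_pred = svm_pipeline.predict(X_te)  # raw features in, scaling handled internally
print("Pipeline SVM accuracy:", (pipeline_pred == y_te).mean())
```

With this arrangement there is a single object to train, evaluate, and save, so the scaled and unscaled arrays cannot be mixed up at prediction time.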

### Support Vector Machine (Linear Kernel)

In [11]:
SVM_classifier = SVC(kernel='linear', random_state=0)
SVM_classifier.fit(X_train_scaled, y_train)

### Support Vector Machine (RBF Kernel)

In [12]:
kernal_SVM_classifier = SVC(kernel='rbf', random_state=0)
kernal_SVM_classifier.fit(X_train_scaled, y_train)

### Naive Bayes

In [13]:
Naive_Bayes_classifier = GaussianNB()
Naive_Bayes_classifier.fit(X_train, y_train)

### Decision Tree

In [14]:
Decision_Tree_classifier = DecisionTreeClassifier(random_state=0)
Decision_Tree_classifier.fit(X_train, y_train)

### Random Forest

In [15]:
Random_Forest_classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
Random_Forest_classifier.fit(X_train, y_train)

## Model Evaluation

### Predictions

In [16]:
logistic_y_pred = logistic_classifier.predict(X_test)
KNN_y_pred = KNN_classifier.predict(X_test)
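# Both SVMs were trained on standardized features, so they must predict on X_test_scaled.
# The SVM accuracies recorded below (0.6 and 0.52) came from predicting on unscaled X_test.
SVM_y_pred = SVM_classifier.predict(X_test_scaled)
kernal_SVM_y_pred = kernal_SVM_classifier.predict(X_test_scaled)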
naive_bayes_y_pred = Naive_Bayes_classifier.predict(X_test)
decision_tree_y_pred = Decision_Tree_classifier.predict(X_test)
random_forest_y_pred = Random_Forest_classifier.predict(X_test)

### Accuracy Scores

In [17]:
logistic_accuracy = accuracy_score(y_test, logistic_y_pred)
KNN_accuracy = accuracy_score(y_test, KNN_y_pred)
SVM_accuracy = accuracy_score(y_test, SVM_y_pred)
KSVM_accuracy = accuracy_score(y_test, kernal_SVM_y_pred)
naive_bayes_accuracy = accuracy_score(y_test, naive_bayes_y_pred)
decision_tree_accuracy = accuracy_score(y_test, decision_tree_y_pred)
random_forest_accuracy = accuracy_score(y_test, random_forest_y_pred)

print("Logistic Regression Accuracy:", logistic_accuracy)
print("KNN Accuracy:", KNN_accuracy)
print("SVM Accuracy:", SVM_accuracy)
print("Kernel SVM Accuracy:", KSVM_accuracy)
print("Naive Bayes Accuracy:", naive_bayes_accuracy)
print("Decision Tree Accuracy:", decision_tree_accuracy)
print("Random Forest Accuracy:", random_forest_accuracy)

Logistic Regression Accuracy: 0.8634146341463415
KNN Accuracy: 0.7463414634146341
SVM Accuracy: 0.6
Kernel SVM Accuracy: 0.5219512195121951
Naive Bayes Accuracy: 0.8536585365853658
Decision Tree Accuracy: 1.0
Random Forest Accuracy: 1.0
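
Accuracy alone does not show which kind of error a classifier makes, which matters for a medical target where a missed positive and a false alarm have different costs. The following sketch shows how a confusion matrix could be computed alongside accuracy with `sklearn.metrics.confusion_matrix`. The labels and predictions are made-up illustration values (`y_true_demo` and `y_pred_demo` are hypothetical names), not results from the models above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels, rows are true classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print("Accuracy:", accuracy_score(y_true_demo, y_pred_demo))
print("False negatives (missed positives):", fn)
```

In the notebook, the same call could be applied to `y_test` and any of the `*_y_pred` arrays.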


## Comparing Models

In [18]:
models = pd.DataFrame(
    {
        'Classifier': ['Logistic Regression', 'KNN', 'SVM', 'Kernel SVM', 'Naive Bayes', 'Decision Tree', 'Random Forest'],
        'Accuracy': [logistic_accuracy, KNN_accuracy, SVM_accuracy, KSVM_accuracy, naive_bayes_accuracy, decision_tree_accuracy, random_forest_accuracy]
    }
)

models.sort_values(by='Accuracy', ascending=False)

|   | Classifier | Accuracy |
|---|------------|----------|
| 5 | Decision Tree | 1.0 |
| 6 | Random Forest | 1.0 |
| 0 | Logistic Regression | 0.863415 |
| 4 | Naive Bayes | 0.853659 |
| 1 | KNN | 0.746341 |
| 2 | SVM | 0.6 |
| 3 | Kernel SVM | 0.521951 |


## Recommendations

Based on the accuracy results above:

1. **Best Model: Decision Tree and Random Forest**
   - **Accuracy: 100.00%**
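   - **Recommendation**: Both Decision Tree and Random Forest achieved perfect accuracy on the test set. While this might seem ideal, a perfect score on held-out data is unusual and should be treated with suspicion. It can arise when the test set is not truly independent of the training set (for example, when duplicated rows land on both sides of the split), in which case the models may simply be memorizing records, a form of overfitting that a single train/test split cannot reveal. Such a model may perform poorly in real-world scenarios where the data differs from what it has already seen. Check the dataset for duplicate rows and confirm the result with cross-validation before relying on these numbers.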
   - **Decision**: **Random Forest** is preferred over Decision Tree due to its ensemble nature, which reduces overfitting by aggregating the predictions of multiple trees. Hyperparameter tuning (e.g., increasing `n_estimators`, adjusting `max_depth`) is recommended to ensure robustness.

2. **Second Best Model: Logistic Regression**
   - **Accuracy: 86.34%**
   - **Recommendation**: Logistic Regression offers strong performance and interpretability. It is an excellent choice when you need a linear model that is easy to understand and explain. However, it may not capture complex patterns in the data.

3. **Third Best Model: Naive Bayes**
   - **Accuracy: 85.37%**
   - **Recommendation**: Naive Bayes performs well and is simple and fast to train. It is suitable for scenarios where interpretability and speed are prioritized. However, its assumption of feature independence may limit performance for more complex datasets.
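
The two checks suggested for the perfect tree-based scores, counting duplicate rows and cross-validating, can be sketched as follows. This is an illustration under stated assumptions rather than a result for this dataset: the DataFrame `demo` is synthetic, 50 duplicate rows are injected on purpose, and the 5-fold setting is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart disease DataFrame (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
demo = pd.DataFrame(X_demo)
demo['target'] = y_demo
demo = pd.concat([demo, demo.iloc[:50]], ignore_index=True)  # inject 50 duplicate rows

# Step 1: count exact duplicate rows, which can leak from training into test data
n_duplicates = demo.duplicated().sum()
print("Duplicate rows:", n_duplicates)

# Step 2: drop duplicates, then estimate accuracy over several folds
deduped = demo.drop_duplicates()
X_clean = deduped.iloc[:, :-1].values
y_clean = deduped.iloc[:, -1].values

forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
scores = cross_val_score(forest, X_clean, y_clean, cv=5, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

If the cross-validated accuracy on de-duplicated data is clearly below the single-split score, the perfect result most likely reflects leakage rather than genuine generalization.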

## Additional Considerations

- **K-Nearest Neighbors (74.63%)**:
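  - The model performs reasonably but is less accurate than the models above. It is computationally expensive for large datasets and sensitive to the choice of `k` and distance metric. Because it is distance-based, it is also sensitive to feature scale, and it was trained here on unscaled features. Not recommended for deployment in this scenario.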

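- **Support Vector Machine (Linear Kernel, 60.00%)**:
  - SVM performs poorly here, but the recorded score is not a fair measure of the model. The classifier was trained on standardized features (`X_train_scaled`), while the recorded accuracy came from predictions on unscaled `X_test`. The score should be recomputed on `X_test_scaled` before concluding that the features are not linearly separable or that further feature engineering, tuning, or kernel adjustments are needed. Not recommended for deployment on the current evidence.

- **Kernel SVM (52.20%)**:
  - This is the least accurate model, barely above 50% on a binary target. The same scaling mismatch applies (trained on `X_train_scaled`, recorded score from unscaled `X_test`), and it is a more likely cause than inappropriate hyperparameters or overfitting to noise in the data. It is not suitable for deployment in its current form.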

## Saving the Model as a .pkl File

In [19]:
best_model = Random_Forest_classifier

# Save the model
with open('heart_disease_prediction_model.pkl', 'wb') as model_file:
    pickle.dump(best_model, model_file)
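
For completeness, here is a minimal sketch of the reverse step: loading a pickled model and using it for prediction. It is self-contained and therefore trains a small stand-in model on synthetic data; the file name `demo_model.pkl` and the variables `model` and `loaded_model` are hypothetical. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data (assumption: not the heart disease model)
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name

    # Save the model
    with open(model_path, 'wb') as model_file:
        pickle.dump(model, model_file)

    # Load it back, as a separate prediction script would
    with open(model_path, 'rb') as model_file:
        loaded_model = pickle.load(model_file)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), loaded_model.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

New inputs passed to the loaded model must have the same columns, in the same order, as the features used for training.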