Step 1: Understand the Dataset
The dataset includes the following fields:

Age, RestingBP, Cholesterol, MaxHR, Oldpeak: Continuous numerical features.
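Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope: Categorical features. (FastingBS is stored as a 0/1 integer, so the dtype-based split in the code below treats it as numerical.)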
HeartDisease: Target variable (binary).

Step 2: Preprocessing the Data
Encoding:

Convert categorical features to numerical representations using one-hot encoding or label encoding. This notebook uses label encoding (LabelEncoder).
Feature Scaling:

Scale numerical features such as Age, RestingBP, Cholesterol, MaxHR, and Oldpeak using Min-Max scaling or standard scaling. This notebook uses standard scaling (StandardScaler).
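
As an illustration of the one-hot encoding option mentioned above, the following is a minimal sketch that scales and one-hot encodes in a single step with scikit-learn's ColumnTransformer. The four-row sample frame and the names `sample`, `numeric_cols`, `categorical_cols`, and `preprocess` are made up for the example; the notebook itself uses LabelEncoder instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample that borrows column names from heart.csv (values are illustrative).
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "MaxHR": [172, 156, 98, 108],
    "Sex": ["M", "F", "M", "F"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat"],
})
numeric_cols = ["Age", "MaxHR"]
categorical_cols = ["Sex", "ST_Slope"]

preprocess = ColumnTransformer(
    transformers=[
        # Standardize the continuous columns to zero mean and unit variance.
        ("scale", StandardScaler(), numeric_cols),
        # Expand each categorical column into one 0/1 column per category.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(sample)
print(features.shape)  # 2 scaled columns + 2 one-hot columns per categorical feature
```

One-hot encoding avoids implying an order between categories, which label encoding does; the trade-off is a wider feature matrix.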

Step 3: Train the Naive Bayes Model
Split the processed dataset into training and testing sets.
Train a Naive Bayes model. GaussianNB suits continuous features and CategoricalNB suits categorical ones; this notebook fits a single GaussianNB on all features after encoding and scaling.
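
The notebook below trains only GaussianNB. For readers who want to use both model types, here is a rough sketch of one way to combine them: fit each model on its own feature group and add their log-posteriors, subtracting the log-prior once so it is not counted twice. The data is synthetic and all variable names are illustrative, so treat this as a sketch under those assumptions rather than the notebook's method.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data (an assumption for illustration, not the heart dataset).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Two continuous features whose mean shifts with the class.
X_num = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
# Two integer-coded categorical features; the first one depends on the class.
X_cat = np.column_stack([
    y + rng.integers(0, 2, size=n),
    rng.integers(0, 3, size=n),
])

gnb = GaussianNB().fit(X_num, y)
cnb = CategoricalNB().fit(X_cat, y)

# Under the naive Bayes independence assumption:
# log P(y | x_num, x_cat) = log P(y | x_num) + log P(y | x_cat) - log P(y) + const
log_prior = np.log(gnb.class_prior_)
joint_log_posterior = (
    gnb.predict_log_proba(X_num) + cnb.predict_log_proba(X_cat) - log_prior
)
combined_pred = gnb.classes_[np.argmax(joint_log_posterior, axis=1)]

accuracy = float((combined_pred == y).mean())
print(accuracy)
```

CategoricalNB expects non-negative integer codes, and categories unseen during fitting can cause errors at prediction time, so the encoders would need to cover every category the app may receive.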


Step 4: Save the Artifacts
Save the scaler, the encoders, and the trained ML model.

Step 5: Create a Web App Using Flask
Build the ML pipeline in Python.
Deploy the trained model and provide an interface for users to input feature values and get predictions.
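
The notebook does not include the Flask code, so the following is only a minimal sketch of what the prediction endpoint could look like. The `create_app` factory, the `/predict` route, the JSON payload shape, and the `NUMERICAL` / `CATEGORICAL` / `FEATURE_ORDER` lists are all assumptions for illustration; in particular, `NUMERICAL` must match the columns the saved scaler was actually fitted on.

```python
import pandas as pd
from flask import Flask, jsonify, request

# Assumed column groups; NUMERICAL must match what the scaler was fitted on.
NUMERICAL = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]
CATEGORICAL = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
# Column order of heart.csv without the target, as used for training.
FEATURE_ORDER = [
    "Age", "Sex", "ChestPainType", "RestingBP", "Cholesterol", "FastingBS",
    "RestingECG", "MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope",
]


def create_app(model, scaler, encoders):
    """Build a Flask app around already-loaded artifacts (hypothetical helper)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(force=True)
        missing = [col for col in FEATURE_ORDER if col not in payload]
        if missing:
            return jsonify({"error": f"missing fields: {missing}"}), 400

        # One-row frame in the training column order.
        row = pd.DataFrame([payload])[FEATURE_ORDER]
        # Apply the same label encoding and scaling used during training.
        for col in CATEGORICAL:
            row[col] = encoders[col].transform(row[col])
        row = row.astype({col: float for col in NUMERICAL})
        row[NUMERICAL] = scaler.transform(row[NUMERICAL])

        prediction = int(model.predict(row)[0])
        return jsonify({"HeartDisease": prediction})

    return app

# To serve locally, load model.pkl, scaler.pkl and encoders.pkl with pickle.load,
# then call create_app(model, scaler, encoders).run().
```

A LabelEncoder raises a ValueError for a category it did not see during fitting, so a production endpoint would likely validate categorical inputs before transforming them.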


In [27]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,LabelEncoder
import warnings
warnings.filterwarnings('ignore')

In [28]:
# load the dataset
data=pd.read_csv('heart.csv')
print(data.head())

   Age Sex ChestPainType  ...  Oldpeak  ST_Slope  HeartDisease
0   40   M           ATA  ...      0.0        Up             0
1   49   F           NAP  ...      1.0      Flat             1
2   37   M           ATA  ...      0.0        Up             0
3   48   F           ASY  ...      1.5      Flat             1
4   54   M           NAP  ...      0.0        Up             0

[5 rows x 12 columns]


In [29]:
# Data preprocessing

# 1. Data cleaning: missing values, duplicate entries, data type checking
# 2. Feature scaling
# 3. Encoding

# missing values
data.isna().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

In [30]:
# duplicated entries
data.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
913    False
914    False
915    False
916    False
917    False
Length: 918, dtype: bool
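
The per-row boolean output above is hard to scan for 918 rows. A minimal sketch of a more compact check, shown on a made-up three-row frame (`df` is illustrative, not the heart dataset): summing the mask gives the number of duplicated rows, and `drop_duplicates` removes them.

```python
import pandas as pd

# Made-up frame in which the third row repeats the first.
df = pd.DataFrame({"Age": [40, 49, 40], "Sex": ["M", "F", "M"]})

# True counts as 1, so the sum is the number of rows that repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```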

In [31]:
# check column data types and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [32]:
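# Encode categorical features
print(list(data.columns))

# Split columns by dtype: object columns are categorical, the rest numerical
# (the numerical list still includes the target, HeartDisease, at this point)
categorical_features = [col for col in data.columns if data[col].dtype == 'O']
numerical_features = [col for col in data.columns if data[col].dtype != 'O']

# One LabelEncoder per column, kept so the same mapping can be reused later
encoders = {col: LabelEncoder() for col in categorical_features}

for col in categorical_features:
    data[col] = encoders[col].fit_transform(data[col])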

['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']
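
LabelEncoder assigns integer codes in sorted order of the labels, which is why M becomes 1 and F becomes 0 in the encoded data. A small sketch of how the mapping can be inspected through the fitted encoder's `classes_` attribute (the two-value list here is illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the two Sex labels that appear in the dataset preview.
encoder = LabelEncoder().fit(["M", "F", "M"])

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'F': 0, 'M': 1}
```

The same loop over `encoders.items()` would show the mapping for every categorical column in the notebook.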


In [33]:
data

     Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0     40    1              1        140          289          0           1    172               0      0.0         2             0
1     49    0              2        160          180          0           1    156               0      1.0         1             1
2     37    1              1        130          283          0           2     98               0      0.0         2             0
3     48    0              0        138          214          0           1    108               1      1.5         1             1
4     54    1              2        150          195          0           1    122               0      0.0         2             0
...
913   45    1              3        110          264          0           1    132               0      1.2         1             1
914   68    1              0        144          193          1           1    141               0      3.4         1             1
915   57    1              0        130          131          0           1    115               1      1.2         1             1
916   57    0              1        130          236          0           0    174               0      0.0         1             1


In [36]:
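# Scale numerical features
scaler = StandardScaler()
# Exclude the target by name rather than by position: slicing with [0:-1]
# would drop one more column each time the cell is re-run, which may be why
# Oldpeak appears unscaled in the output below.
numerical_features = [col for col in numerical_features if col != 'HeartDisease']
data[numerical_features] = scaler.fit_transform(data[numerical_features])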

In [37]:
data

           Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG      MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0    -1.433140    1              1   0.410909     0.825070  -0.551341           1   1.382928               0      0.0         2             0
1    -0.478484    0              2   1.491752    -0.171961  -0.551341           1   0.754157               0      1.0         1             1
2    -1.751359    1              1  -0.129513     0.770188  -0.551341           2  -1.525138               0      0.0         2             0
3    -0.584556    0              0   0.302825     0.139040  -0.551341           1  -1.132156               1      1.5         1             1
4     0.051881    1              2   0.951331    -0.034755  -0.551341           1  -0.581981               0      0.0         2             0
...
913  -0.902775    1              3  -1.210356     0.596393  -0.551341           1  -0.188999               0      1.2         1             1
914   1.536902    1              0   0.627078    -0.053049   1.813758           1   0.164684               0      3.4         1             1
915   0.370100    1              0  -0.129513    -0.620168  -0.551341           1  -0.857069               1      1.2         1             1
916   0.370100    0              1  -0.129513     0.340275  -0.551341           0   1.461525               0      0.0         1             1


In [38]:
# separate the features (X) from the target (Y)
X=data.drop('HeartDisease',axis=1)
Y=data['HeartDisease']

In [39]:
# split the dataset
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=42)
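
The notebook fits the scaler on the full dataset before splitting, so the test rows influence the scaling statistics. A common alternative, sketched here on synthetic data with illustrative variable names, is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (an assumption for illustration).
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Learn mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_train)
# ...then apply the same transformation to both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the saved scaler reflects only training data, which matches how it would be applied to unseen user input in the web app.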

In [40]:
# train the naive bayes model
model=GaussianNB()
model.fit(X_train,Y_train)

In [41]:
# Save model, encoders, scaler

import pickle
with open('model.pkl','wb') as f:
    pickle.dump(model,f)
with open('scaler.pkl','wb') as f:
    pickle.dump(scaler,f)
with open('encoders.pkl','wb') as f:
    pickle.dump(encoders,f)
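
To confirm that a saved artifact can be restored, a short round trip helps. The sketch below writes a stand-in encoders dictionary to a temporary directory and loads it back; the temporary path and the single-column dictionary are assumptions for illustration, and the same `pickle.load` pattern would apply to model.pkl and scaler.pkl.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.preprocessing import LabelEncoder

# Stand-in for the notebook's encoders dictionary (one column only).
encoders = {"Sex": LabelEncoder().fit(["M", "F"])}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "encoders.pkl"
    with open(path, "wb") as f:
        pickle.dump(encoders, f)
    # Read the file back in binary mode, as the web app would at startup.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored encoder reproduces the original label-to-integer mapping.
codes = restored["Sex"].transform(["F", "M"]).tolist()
print(codes)  # [0, 1]
```

Pickle files should only be loaded from trusted sources, and they are most reliable when loaded with the same scikit-learn version that created them.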

In [42]:
y_pred=model.predict(X_test)

In [43]:
accuracy=accuracy_score(Y_test,y_pred)
accuracy

0.842391304347826
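
Accuracy alone does not show which kind of error the model makes, which matters for a disease classifier. A minimal sketch of a fuller evaluation using a confusion matrix and recall; the five labels below are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 1 = heart disease, 0 = no heart disease.
y_true = [0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_hat)
accuracy = accuracy_score(y_true, y_hat)
# Recall for class 1: the share of actual disease cases the model caught.
recall = recall_score(y_true, y_hat)

print(matrix)
print(accuracy, recall)
```

In the notebook, the same calls would take `Y_test` and `y_pred` as arguments.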