**OBJECTIVE: To predict whether a patient has diabetes using a Naive Bayes classifier**

The Naive Bayes classifier is a widely used machine learning algorithm for classification tasks. It is based on Bayes' theorem and assumes that the features are conditionally independent, given the class label. This "naive" assumption simplifies the calculations, making the classifier efficient and effective, especially in text classification and spam filtering.

The Naive Bayes classifier calculates the probability of each class given the observed features and assigns the class with the highest probability as the predicted label. It can handle both binary and multi-class classification problems.
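To make the decision rule concrete, here is a minimal sketch of a Gaussian Naive Bayes prediction written in plain Python. The toy measurements and the helper names (`fit_gaussian_nb`, `log_gaussian`, `predict`) are hypothetical and are not part of this notebook; scikit-learn's `GaussianNB` also applies a small variance-smoothing term that this sketch omits.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate per-class priors and per-feature means/variances."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        variances = [
            sum((v - m) ** 2 for v in col) / n
            for col, m in zip(zip(*members), means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        # Naive assumption: feature likelihoods multiply, so their logs add.
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(scores, key=scores.get)

# Hypothetical toy data: (glucose, BMI) pairs with labels 0 and 1.
rows = [(85, 26.0), (90, 28.0), (95, 24.0), (150, 34.0), (160, 38.0), (170, 33.0)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict(model, (88, 27.0)), predict(model, (165, 35.0)))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow when there are many features.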

In [1]:
import pandas as pd

In [3]:
#Importing the dataset
data = pd.read_csv('/Users/prashastisaraf/Downloads/diabetes.csv')
data.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [4]:
data.shape

(768, 9)

The dataset has nine columns: the first eight are features and the last one (Outcome) is the label. Outcome takes two values: 0 (non-diabetic) and 1 (diabetic).

Next, we separate the columns into independent (X) and dependent (Y) variables:

In [10]:
X = data.drop('Outcome',axis=1)
Y = data[['Outcome']]

In [11]:
X

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0              6      148             72             35        0  33.6                     0.627   50
1              1       85             66             29        0  26.6                     0.351   31
2              8      183             64              0        0  23.3                     0.672   32
3              1       89             66             23       94  28.1                     0.167   21
4              0      137             40             35      168  43.1                     2.288   33
...          ...      ...            ...            ...      ...   ...                       ...  ...
763           10      101             76             48      180  32.9                     0.171   63
764            2      122             70             27        0  36.8                     0.340   27
765            5      121             72             23      112  26.2                     0.245   30
766            1      126             60              0        0  30.1                     0.349   47


In [12]:
Y

     Outcome
0          1
1          0
2          1
3          0
4          1
...      ...
763        0
764        0
765        0
766        1


Then we split these variables into training and test sets:

In [22]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0)
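
As a quick check on what a test fraction of 0.20 means for this dataset: scikit-learn rounds the test count up, so 768 rows should yield 154 test rows and 614 training rows, which agrees with the support of 154 in the classification report further down. The sketch below reproduces only that arithmetic plus a seeded shuffle. The helper name `split_indices` is hypothetical, and its shuffle will not select the same rows as `train_test_split`.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut them into train and test parts."""
    n_test = math.ceil(test_size * n_samples)  # test fraction is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_samples=768, test_size=0.20, seed=0)
print(len(train_idx), len(test_idx))  # 614 154
```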

Then, we standardise the data:

In [14]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
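
The scaler is fitted on the training set and then only applied to the test set, so that the mean and standard deviation come from the training data alone. A minimal plain-Python sketch of that idea follows. The function names and sample values are hypothetical, and it uses the population standard deviation, which is what `StandardScaler` computes.

```python
import math

def fit_scaler(values):
    """Return the mean and population standard deviation of a feature."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

def transform(values, mean, std):
    """Standardise values using previously fitted statistics."""
    return [(v - mean) / std for v in values]

train_glucose = [85.0, 89.0, 137.0, 148.0, 183.0]  # hypothetical training column
test_glucose = [101.0, 126.0]                      # hypothetical test column

mean, std = fit_scaler(train_glucose)              # statistics from training data only
train_scaled = transform(train_glucose, mean, std)
test_scaled = transform(test_glucose, mean, std)   # reuses the training statistics
print(round(mean, 1), [round(z, 2) for z in test_scaled])
```

After this step the training column has mean 0 and unit variance, while the test column generally does not, because it is scaled with statistics it did not contribute to.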

After splitting the data into training and test sets, we fit a Naive Bayes model on the training set and predict labels for the test set:

In [15]:
# Training the Naive Bayes model on the Training set
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, Y_train.values.ravel())  # flatten the one-column label frame to 1-D

# Predicting the Test set results
Y_pred = model.predict(X_test)

Next, we evaluate the model by comparing the actual and predicted values, starting with accuracy.

In [16]:
from sklearn.metrics import confusion_matrix, accuracy_score
ac = accuracy_score(Y_test,Y_pred)
ac

0.7922077922077922

Accuracy is the overall performance metric that measures the proportion of correct predictions (both true positives and true negatives) out of all instances. It is calculated as the ratio of the total number of correct predictions to the total number of instances.
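
As a worked check, the reported accuracy can be reproduced by hand from the confusion matrix shown later in this notebook: 93 + 29 = 122 of the 154 test predictions are correct. The snippet below is a plain-Python illustration; `accuracy` is a hypothetical helper, not a scikit-learn call.

```python
def accuracy(correct, total):
    """Fraction of predictions that match the true labels."""
    return correct / total

correct = 93 + 29   # true negatives + true positives from the confusion matrix
total = 154         # size of the test set
print(round(accuracy(correct, total), 4))  # 0.7922
```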

In [20]:
from sklearn import metrics
test_pred=model.predict(X_test)

print(metrics.classification_report(Y_test,test_pred))

              precision    recall  f1-score   support

           0       0.84      0.87      0.85       107
           1       0.67      0.62      0.64        47

    accuracy                           0.79       154
   macro avg       0.76      0.74      0.75       154
weighted avg       0.79      0.79      0.79       154



Precision measures the accuracy of the positive predictions made by the model. It is calculated as the ratio of true positives (correctly predicted positive instances) to the sum of true positives and false positives (instances predicted as positive that are actually negative). A higher precision indicates fewer false positives.

Recall measures the model's ability to identify positive instances correctly out of all the actual positive instances. It is calculated as the ratio of true positives to the sum of true positives and false negatives (instances that are actually positive but predicted as negative). A higher recall indicates fewer false negatives.

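The F1 score is the harmonic mean of precision and recall. It is high only when both precision and recall are high, which makes it a useful single summary when the classes are imbalanced.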
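
The class 1 row of the report can be recomputed from the confusion-matrix counts reported in this notebook (29 true positives, 14 false positives, 18 false negatives). F1 is taken here as the harmonic mean of precision and recall, which is the standard definition. The helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Return precision, recall and F1 for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = prf(tp=29, fp=14, fn=18)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.62 0.64
```

Swapping the roles of the classes (`tp=93, fp=18, fn=14`) gives the class 0 row in the same way.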

In [18]:
cm = confusion_matrix(Y_test, Y_pred)
cm

array([[93, 14],
       [18, 29]])

True Positives (TP): The number of instances correctly predicted as class 1. In this case, there are 29 true positives.

True Negatives (TN): The number of instances correctly predicted as class 0. Here, there are 93 true negatives.

False Positives (FP): The number of instances incorrectly predicted as class 1 when the actual class is 0. In this case, there are 14 false positives.

False Negatives (FN): The number of instances incorrectly predicted as class 0 when the actual class is 1. There are 18 false negatives.
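
scikit-learn lays out the binary confusion matrix with actual classes as rows and predicted classes as columns, so the four counts can be read off as sketched below. This is a plain-Python illustration using the matrix values above; the variable names are chosen for this example only.

```python
cm = [[93, 14],
      [18, 29]]  # rows: actual class 0, 1; columns: predicted class 0, 1

(tn, fp), (fn, tp) = cm

# Row sums give the number of actual instances per class (the "support").
support_0 = tn + fp
support_1 = fn + tp
print(tn, fp, fn, tp)        # 93 14 18 29
print(support_0, support_1)  # 107 47
```

The row sums match the support column of the classification report, which is a quick way to confirm that the matrix is being read in the right orientation.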