# Predicting Diabetes Using Naive Bayes

### Objective
- Apply Naive Bayes for binary classification.
- Practice data exploration and preprocessing.
- Evaluate model performance using appropriate metrics.
- Understand and interpret the log probabilities used in Naive Bayes.

### Dataset
This lab uses the Pima Indians Diabetes Dataset from the UCI Machine Learning Repository. It contains 8 features based on medical information, with a binary target indicating the presence of diabetes (1) or absence (0).

### Features
- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration
- `BloodPressure`: Diastolic blood pressure (mm Hg)
- `SkinThickness`: Triceps skinfold thickness (mm)
- `Insulin`: 2-Hour serum insulin (mu U/ml)
- `BMI`: Body mass index (weight in kg/(height in m)^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function (a function based on family history)
- `Age`: Age (years)
- `Outcome`: Class variable (1 if patient has diabetes, 0 otherwise)

### Import packages

In [70]:
# Import the necessary libraries for data manipulation, model training, and evaluation.
# your code here

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, f1_score, recall_score
import matplotlib.pyplot as plt
import seaborn as sns

### Data Loading and Exploration

In [44]:
data = pd.read_csv("pima_diabetes.csv")
print(data.describe())  # describe() already returns a DataFrame
data.head()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

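| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |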


### Data Cleaning

1. *Handling Missing Values*: Replace 0 values in Glucose, BloodPressure, SkinThickness, Insulin, and BMI columns with their respective median values.
2. *Split Data*: Separate the feature columns (X) and target (y), and then split into training and test sets with an 80-20 split.

In [33]:
#handling of missing data
missing_data_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
for col in missing_data_columns:
    median_value = data[col][data[col] != 0].median() #to exclude zero when calculating the median
    data[col] = data[col].replace(0, median_value)
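
The cell above computes each median over the full dataset, before the train/test split. One possible variation, shown here only as a sketch on a hypothetical toy frame (the values, the `toy` name, and the 4/2 row split are all assumptions), is to compute the medians on the training rows alone and reuse them for the test rows, so that no test information leaks into preprocessing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; zeros stand in for missing measurements.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1, 25.6],
})

# Mark zeros as missing so pandas skips them when computing the median.
toy_nan = toy.replace(0, np.nan)

# Assumed split for illustration: first four rows train, last two rows test.
train, test = toy_nan.iloc[:4], toy_nan.iloc[4:]

# Medians come from the training rows only, then fill both parts.
train_medians = train.median()
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(train_medians)
print(train_filled)
```

In the lab itself, the same idea would mean calling `train_test_split` first and imputing afterwards; whether the difference matters for a dataset of this size is an empirical question.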

In [56]:
X = data.drop(columns = ["Outcome"])
y = data["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 2)

### Train a Naïve Bayes Classifier

Since the features are continuous, we use the `GaussianNB` model, which assumes each feature is normally distributed within each class, instead of the Naive Bayes variants we have used so far for categorical variables.

In [92]:
# your code here
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train,y_train)



### Model Evaluation

Evaluate the model's accuracy, precision, and recall. Analyse the confusion matrix.
Given the setting of the problem, which metric would you prioritize?

In [84]:
#perform the prediction
y_pred = model.predict(X_test)

#compute the classification metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test,y_pred)
cr = classification_report(y_test, y_pred)

#print the results
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print("Confusion Matrix:", cm)
# Detailed classification report
print("\nClassification Report:\n", cr)

Accuracy: 0.7597402597402597
Precision: 0.6
Recall: 0.5333333333333333
F1: 0.5647058823529412
Confusion Matrix: [[93 16]
 [21 24]]

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.85      0.83       109
           1       0.60      0.53      0.56        45

    accuracy                           0.76       154
   macro avg       0.71      0.69      0.70       154
weighted avg       0.75      0.76      0.76       154
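
As a cross-check, the headline metrics can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the four cells read as TN, FP, FN, TP.

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions.
cm = np.array([[93, 16],
               [21, 24]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all test patients classified correctly
precision = tp / (tp + fp)         # of those predicted diabetic, share truly diabetic
recall = tp / (tp + fn)            # of the truly diabetic, share the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

Read this way, the 21 false negatives are diabetic patients the model missed, and the 16 false positives are healthy patients flagged as diabetic.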



In [None]:
#here one would prioritize precision, because of the high cost of a false positive in a diabetes diagnosis

### Exploring Log Probabilities in Naïve Bayes

Naive Bayes calculates log probabilities (logprobs) for each class to make predictions. Let's use `predict_log_proba` to calculate the log probabilities for each class (diabetes vs. no diabetes) for a few samples in the test set.

Question: For a given instance in the test set, calculate the log probabilities for each class (diabetes vs. no diabetes) and interpret the values. How does Naive Bayes decide the predicted class based on these log probabilities?

In [102]:
# Select a few samples from the test set
sample_indices = [0, 12, 20]  # Change these indices as desired
X_sample = X_test.iloc[sample_indices]

# Calculate log probabilities for each class
log_probs = model.predict_log_proba(X_sample)

# Display results
for i, index in enumerate(sample_indices):
    print(f"Sample {index} - Log Probabilities:")
    print(f"No Diabetes (0): {log_probs[i][0]:.4f}, Diabetes (1): {log_probs[i][1]:.4f}")
    print(f"Predicted Class: {model.predict(X_sample.iloc[[i]])[0]}")
    print()

Sample 0 - Log Probabilities:
No Diabetes (0): -0.0147, Diabetes (1): -4.2275
Predicted Class: 0

Sample 12 - Log Probabilities:
No Diabetes (0): -3.6964, Diabetes (1): -0.0251
Predicted Class: 1

Sample 20 - Log Probabilities:
No Diabetes (0): -5.5508, Diabetes (1): -0.0039
Predicted Class: 1



- Interpretation of Log Probabilities: Log probabilities represent the logarithm of the probability for each class. A higher log probability (closer to zero, since log values are negative) indicates a higher likelihood for that class.
- Decision-Making: The model predicts the class with the highest log probability. If the log probability for Diabetes (1) is higher (closer to zero) than for No Diabetes (0), the model will predict Diabetes (1).
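
To make the two points above concrete, here is a minimal from-scratch sketch of how a Gaussian Naive Bayes log posterior can be computed with NumPy only. The synthetic data, the helper names `class_stats` and `log_posterior`, and the test point are all assumptions for illustration; scikit-learn's `GaussianNB` additionally applies a small variance smoothing term that this sketch omits, so its numbers would differ slightly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-feature data: class 0 centred at 0, class 1 centred at 2.
X0 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=2.0, scale=1.0, size=(30, 2))

def class_stats(X):
    """Per-feature mean and variance for one class."""
    return X.mean(axis=0), X.var(axis=0)

# Class priors are the class frequencies in the training data.
priors = np.array([len(X0), len(X1)]) / (len(X0) + len(X1))
stats = [class_stats(X0), class_stats(X1)]

def log_posterior(x):
    """Return log P(class | x) for both classes."""
    joint = []
    for prior, (mean, var) in zip(priors, stats):
        # Sum of per-feature Gaussian log densities (the "naive" independence step).
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        joint.append(np.log(prior) + log_lik)
    joint = np.array(joint)
    # Normalise in log space (log-sum-exp) so the posteriors sum to 1.
    m = joint.max()
    log_evidence = m + np.log(np.sum(np.exp(joint - m)))
    return joint - log_evidence

x_new = np.array([1.8, 2.1])
log_post = log_posterior(x_new)
predicted = int(np.argmax(log_post))  # class with the highest log probability

print("Log posteriors:", log_post)
print("Predicted class:", predicted)
```

The prediction is simply the `argmax` over the log posteriors, which is the same rule described in the Decision-Making bullet.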

Convert the log probabilities back to regular probabilities using `np.exp(log_probs)` to see that the log transformation aids computation without changing the predictions.

In [104]:
# Select a few samples from the test set
sample_indices = [0, 12, 20]  # Change these indices as desired
X_sample = X_test.iloc[sample_indices]

# Calculate log probabilities for each class
log_probs = model.predict_log_proba(X_sample)

#convert log probabilities back to regular probabilities
probabilities = np.exp(log_probs)

# Display results
for i, index in enumerate(sample_indices):
    print(f"Sample {index} - Log Probabilities:")
    print(f"No Diabetes (0): {log_probs[i][0]:.4f}, Diabetes (1): {log_probs[i][1]:.4f}")
    
    print(f"Sample {index} - Regular Probabilities:")
    print(f"No Diabetes (0): {probabilities[i][0]:.4f}, Diabetes (1): {probabilities[i][1]:.4f}")
    
    print(f"Predicted Class: {model.predict(X_sample.iloc[[i]])[0]}")
    print()


Sample 0 - Log Probabilities:
No Diabetes (0): -0.0147, Diabetes (1): -4.2275
Sample 0 - Regular Probabilities:
No Diabetes (0): 0.9854, Diabetes (1): 0.0146
Predicted Class: 0

Sample 12 - Log Probabilities:
No Diabetes (0): -3.6964, Diabetes (1): -0.0251
Sample 12 - Regular Probabilities:
No Diabetes (0): 0.0248, Diabetes (1): 0.9752
Predicted Class: 1

Sample 20 - Log Probabilities:
No Diabetes (0): -5.5508, Diabetes (1): -0.0039
Sample 20 - Regular Probabilities:
No Diabetes (0): 0.0039, Diabetes (1): 0.9961
Predicted Class: 1
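
Why work in log space at all? A small sketch, using made-up likelihood values, suggests the reason: multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays well behaved. Because the logarithm is monotonic, the ranking of the classes is unchanged.

```python
import numpy as np

# Hypothetical: 500 per-feature likelihoods, each equal to 0.01.
likelihoods = np.full(500, 0.01)

direct_product = np.prod(likelihoods)    # true value 1e-1000 is below float64 range
log_sum = np.sum(np.log(likelihoods))    # equals 500 * log(0.01), easily representable

print("Direct product:", direct_product)
print("Sum of logs:   ", log_sum)

# Monotonicity: the winning class is the same for probabilities and their logs.
probs = np.array([0.0248, 0.9752])       # Sample 12 from the output above
same_winner = np.argmax(probs) == np.argmax(np.log(probs))
print("Same predicted class:", same_winner)
```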



Change your decision threshold so that either class 1 or class 0 is predicted more often, in order to optimize your preferred metric (precision or recall) for this problem. Try multiple thresholds until you are satisfied with your choice.

In [116]:
threshold = 0.5 # predict No Diabetes (0) when P(no diabetes) >= threshold; raise it to classify more patients as diabetic


# Calculate log probabilities for each class
log_probs = model.predict_log_proba(X_test)

#compute probability of No Diabetes (0) vs Diabetes (1) and apply the threshold
probabilities = []
for i, index in enumerate(X_test.index):
    prediction = 0 if np.exp(log_probs[i][0]) >= threshold else 1
    probabilities.append({'index':index,'no_diab_prob': np.exp(log_probs[i][0]), 'diab_prob': np.exp(log_probs[i][1]), 'prediction':prediction})

probabilities = pd.DataFrame(probabilities)
probabilities.head()

| | index | no_diab_prob | diab_prob | prediction |
|---|---|---|---|---|
| 0 | 158 | 0.985411 | 0.014589 | 0 |
| 1 | 251 | 0.925025 | 0.074975 | 0 |
| 2 | 631 | 0.938409 | 0.061591 | 0 |
| 3 | 757 | 0.762709 | 0.237291 | 0 |
| 4 | 689 | 0.170097 | 0.829903 | 1 |


In [None]:
# create the confusion matrix for the adjusted problem
y_pred_adjusted = probabilities['prediction']
cm = confusion_matrix(y_test, y_pred_adjusted)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Diabetes', 'Diabetes'], yticklabels=['No Diabetes', 'Diabetes'])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()
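
Rather than changing the threshold by hand and rerunning the cells, one option is to sweep several thresholds and compare precision and recall side by side. The sketch below uses small hypothetical arrays (`y_true` and `p_diab` are placeholders, not the lab's results) and applies the threshold to the probability of diabetes; with two classes this is equivalent to the rule above, since P(no diabetes) >= t is the same as P(diabetes) <= 1 - t.

```python
import numpy as np

# Hypothetical labels and predicted P(diabetes); stand-ins for y_test
# and np.exp(log_probs[:, 1]) in the lab.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_diab = np.array([0.10, 0.40, 0.35, 0.80, 0.55, 0.90, 0.20, 0.60])

def precision_recall(y_true, y_hat):
    """Precision and recall for the positive class (1)."""
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    precision = float(tp / (tp + fp)) if (tp + fp) > 0 else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) > 0 else 0.0
    return precision, recall

results = {}
for t in [0.3, 0.5, 0.7]:
    # Predict diabetes when P(diabetes) is at least t.
    y_hat = (p_diab >= t).astype(int)
    results[t] = precision_recall(y_true, y_hat)
    print(f"threshold={t:.1f}  precision={results[t][0]:.3f}  recall={results[t][1]:.3f}")
```

On this toy data, lowering the threshold on P(diabetes) raises recall at the expense of precision, and raising it does the opposite; the same trade-off should be expected, in some form, on the real test set.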