# Lab 4. Probabilistic Inference
# Task 4.2 Diabetes Diagnosis Using Naïve Bayes
## Problem Descriptions
In this task, we implement a Gaussian Naive Bayes model to classify diabetes patients into 3 classes by disease progression measure (Class 1: 0-150, Class 2: 150-300, Class 3: 300+). The diabetes diagnosis problem can be formulated as follows.


---
We assume conditional independence between the 10 features (f1=age, f2=sex, f3=bmi, f4=average blood pressure, f5=total serum cholesterol, f6=low-density lipoproteins, f7=high-density lipoproteins, f8=total cholesterol / HDL, f9=possibly log of serum triglycerides level, f10=blood sugar level) given the 3 progression measure classes (c1=0-150, c2=150-300, c3=300+). Under this assumption, the posterior probability of each class is:

    𝑃(𝑐𝑖|𝑓1, 𝑓2, ..., f10) = 𝛼𝑃(𝑓1|𝑐𝑖) 𝑃(𝑓2|𝑐𝑖) ... P(𝑓10|𝑐𝑖) 𝑃(𝑐𝑖)

* 𝑐𝑖: progression measure class, where i=1,2,3.
* 𝑃(𝑐𝑖|𝑓1, 𝑓2, ..., f10): posterior probability of progression measure class ci given the features f1 through f10.
* 𝑃(𝑓1|𝑐𝑖): conditional probability of observing f1 given that the class is ci.
* 𝑃(𝑓2|𝑐𝑖): conditional probability of observing f2 given that the class is ci, and likewise for the remaining features up to P(𝑓10|𝑐𝑖).
* 𝑃(𝑐𝑖): prior probability of progression measure class ci.
* 𝛼: normalization constant that makes the posterior probabilities of the three classes sum to 1.

To estimate the posterior probability 𝑃(𝑐𝑖| 𝑓1, 𝑓2, ..., f10) for each of the three progression measure classes (i=1,2,3), we first need the prior probability 𝑃(𝑐𝑖) of each class and the conditional probabilities 𝑃(𝑓1|𝑐𝑖), 𝑃(𝑓2|𝑐𝑖), ..., P(𝑓10|𝑐𝑖). In Gaussian Naive Bayes, each conditional probability is modelled as a normal distribution whose mean and variance are estimated per class from the training data.
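A minimal sketch of how this posterior could be evaluated by hand, assuming each conditional probability is a normal density with a per-class mean and variance. The function name `gaussian_nb_posterior` and the toy numbers are illustrative assumptions, not part of the lab code, and scikit-learn's `GaussianNB` additionally applies a small variance smoothing term that is omitted here.

```python
import math

def gaussian_nb_posterior(x, priors, means, variances):
    """Return P(c_i | x) for each class under a Gaussian Naive Bayes model.

    x         : list of feature values
    priors    : priors[i] = P(c_i)
    means     : means[i][j] = mean of feature j in class i
    variances : variances[i][j] = variance of feature j in class i
    """
    log_joint = []
    for prior, mu, var in zip(priors, means, variances):
        # log P(c_i) + sum over features of log N(x_j; mean, variance)
        total = math.log(prior)
        for xj, m, v in zip(x, mu, var):
            total += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        log_joint.append(total)
    # Normalize (the alpha in the formula); subtracting the max avoids underflow.
    top = max(log_joint)
    unnormalized = [math.exp(v - top) for v in log_joint]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Toy example with two features and three classes (made-up numbers).
priors = [0.5, 0.4, 0.1]
means = [[-0.02, -0.02], [0.02, 0.02], [0.06, 0.04]]
variances = [[0.002, 0.002], [0.002, 0.002], [0.002, 0.002]]
posterior = gaussian_nb_posterior([0.05, 0.03], priors, means, variances)
print(posterior)
```

Working in log space keeps the product of ten small densities from underflowing; the predicted class is the index of the largest posterior.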




## Implementation and Results

In [None]:
# The PyPI package is named scikit-learn; it is imported as sklearn.
!pip install scikit-learn
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_validate

# import numpy as np
import matplotlib.pyplot as plt
# from matplotlib import patches
import math


In [None]:
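diabetes = datasets.load_diabetes()
# X = diabetes.data[:,[2,3,9]]
X = diabetes.data
# Bin the progression measure into labels 0, 1, 2 (Class 1, 2, 3 in the text):
# [0, 150) -> 0, [150, 300) -> 1, 300 and above -> 2.
Y = [math.floor(x/150) for x in diabetes.target]

# Hold out 25% of the samples for testing. The split is random, so the numbers
# below change from run to run.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

# Fit the model and evaluate it on the held-out samples.
nb = GaussianNB()
nb.fit(X_train, Y_train)
Y_pred = nb.predict(X_test)
acc = accuracy_score(Y_test, Y_pred)
cm = confusion_matrix(Y_test, Y_pred)  # rows: true class, columns: predicted class
cr = classification_report(Y_test, Y_pred)  # computed but not printed below

print("Accuracy:", acc)
print("Confusion Matrix:\n", cm)
# Parameters learned from the training set: one row per class.
print("Prior:\n", nb.class_prior_)
print("Mean:\n", nb.theta_)
print("Variance:\n", nb.var_)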


Accuracy: 0.7117117117117117
Confusion Matrix:
 [[42 17  0]
 [ 9 36  3]
 [ 0  3  1]]
Prior:
 [0.5407855  0.42900302 0.03021148]
Mean:
 [[-0.0078686  -0.00203973 -0.01786921 -0.01664023 -0.00848716 -0.00598924
   0.01384965 -0.01667859 -0.02296223 -0.0133189 ]
 [ 0.00594585 -0.00369356  0.02146801  0.01849403  0.00803364  0.00581356
  -0.01573671  0.01659892  0.02441361  0.01260292]
 [ 0.0042933   0.01255142  0.06536077  0.03564379  0.00503561 -0.00125125
  -0.02646531  0.02545259  0.04148047  0.04655653]]
Variance:
 [[0.00217604 0.00224597 0.00154143 0.00162569 0.00190696 0.00194978
  0.00204143 0.00157131 0.00151421 0.00175995]
 [0.0020812  0.0022265  0.00212051 0.00238253 0.00233841 0.00233336
  0.00153695 0.00223229 0.00184442 0.00234737]
 [0.00265239 0.0021807  0.00137571 0.0024228  0.00076896 0.00095186
  0.00106592 0.00206759 0.00127187 0.00179549]]


## Discussions

In this task, the Gaussian Naive Bayes model is trained and tested. Its performance is evaluated in terms of accuracy and the confusion matrix. The fitted model is summarized by the prior probability of each class and by the per-class mean and variance of each feature.

The Naive Bayes classifier achieves an accuracy of 71.17% (79 of 111 test instances). It correctly classifies 42 instances of class 1, but 17 instances of class 1 are incorrectly predicted as class 2. For class 2, it correctly classifies 36 instances but incorrectly predicts 9 instances as class 1 and 3 instances as class 3. For class 3, it correctly classifies 1 instance, while the other 3 instances are incorrectly predicted as class 2.
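The figures above can be recomputed directly from the printed confusion matrix. The sketch below also derives per-class recall and precision, which are not reported in the original output; the variable names are illustrative.

```python
# Confusion matrix from the run above: rows are true classes, columns are predictions.
cm = [
    [42, 17, 0],
    [9, 36, 3],
    [0, 3, 1],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: fraction of each true class that was predicted correctly.
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
# Precision: fraction of each predicted class that was actually that class.
precision = [
    cm[i][i] / sum(cm[r][i] for r in range(n_classes)) for i in range(n_classes)
]

print(f"accuracy = {accuracy:.4f}")
for i in range(n_classes):
    print(f"class {i + 1}: recall = {recall[i]:.2f}, precision = {precision[i]:.2f}")
```

Per-class recall makes the weak class 3 result visible in a way that the overall accuracy does not, because class 3 contributes only 4 of the 111 test instances.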

From the prior probabilities, we can infer that class 1 makes up 54.08% of the training data, class 2 makes up 42.90%, and class 3 makes up 3.02%. The mean values of each feature are close across the classes relative to the spread within each class: the class means typically differ by a few hundredths, while the variances of roughly 0.001 to 0.003 correspond to standard deviations of roughly 0.03 to 0.05. The per-class feature distributions therefore overlap considerably, which makes it harder for the model to tell the classes apart. Hence, there is a risk that the model might not generalize well to new, unseen data.
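One way to quantify how close the classes are is to express the gap between two class means in units of standard deviation. The sketch below does this for bmi (f3, column index 2) using the means and variances printed above; the helper name `standardized_gap` is hypothetical, and pooling the two variances by a simple average is a simplifying assumption.

```python
import math

# Per-class mean and variance of bmi (f3), copied from the printed output above.
bmi_mean = [-0.01786921, 0.02146801, 0.06536077]
bmi_var = [0.00154143, 0.00212051, 0.00137571]

def standardized_gap(a, b):
    """Gap between the means of classes a and b, divided by a pooled standard deviation."""
    pooled_std = math.sqrt((bmi_var[a] + bmi_var[b]) / 2)
    return abs(bmi_mean[a] - bmi_mean[b]) / pooled_std

gaps = {(a + 1, b + 1): standardized_gap(a, b) for a, b in [(0, 1), (1, 2), (0, 2)]}
for (a, b), gap in gaps.items():
    print(f"class {a} vs class {b}: {gap:.2f} standard deviations")
```

On this feature, adjacent classes sit roughly one standard deviation apart and class 1 and class 3 roughly two, so two Gaussians for neighbouring classes share a large part of their mass.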

In conclusion, the classifier is fairly good at classifying patients into class 1 and class 2, but class 3 prediction needs improvement. This is likely due to the small sample size of class 3, since patients with a progression measure of 300+ are very rare. The imbalanced dataset is likely to affect the classifier's performance. Also, class 3 does not appear to be well separated from class 2 in feature space, although it is well separated from class 1: no class 3 instance was predicted as class 1, or vice versa.
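To get a rough sense of how few class 3 examples the model saw, the printed priors can be converted back into training counts. This sketch assumes the dataset holds 442 samples in total and that the 111 instances in the confusion matrix are the whole test set, which leaves 331 for training; neither number is printed by the original code.

```python
# Class priors printed by the fitted model.
priors = [0.5407855, 0.42900302, 0.03021148]

# Assumption: 442 samples in total, of which the confusion-matrix total was held out.
n_total = 442
n_test = 42 + 17 + 0 + 9 + 36 + 3 + 0 + 3 + 1
n_train = n_total - n_test

# With no priors supplied, GaussianNB uses each class's share of the training
# set, so multiplying back recovers the approximate per-class counts.
train_counts = [round(p * n_train) for p in priors]
for i, count in enumerate(train_counts, start=1):
    print(f"class {i}: about {count} training samples")
```

Under these assumptions, the class 3 mean and variance of all ten features are estimated from only about ten patients, which makes those estimates much noisier than the ones for the other two classes.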
