### Naive Bayes using Gaussian Quantile Transformer
This notebook implements Naive Bayes using a Gaussian Quantile Transformer from sklearn. 

The dataset's input features are transformed with sklearn's QuantileTransformer, configured to output a Gaussian (normal) distribution. This topic was covered on machinelearningmastery.com and in an offline ebook. I covered Naive Bayes in my MSc course using the separate Weka GUI application, so this notebook lets me use it in Python with sklearn.

Refer to the sklearn documentation below for the impact of different scalers on input features.
- https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

We import pandas, the sklearn utilities (train_test_split, the GaussianNB Naive Bayes classifier and the metrics helpers), seaborn and matplotlib.

Note: Naive Bayes assumes the features are conditionally independent given the class, i.e. they do not interact. We know from applied statistics that this is unlikely to hold in real-world datasets, but Naive Bayes often performs well in practice regardless.

In [1]:
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import seaborn as sns
from matplotlib import pyplot

#### Exploratory Data Analysis
We use the diabetes dataset. For information about this dataset, refer to the Kaggle dataset library summary:
- https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

This will predict whether a patient does or does not have diabetes. We have multiple predictor variables and one target variable, Outcome.

In [2]:
diabetes_df = pd.read_csv("../datasets/diabetes.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/diabetes.csv'

In [None]:
diabetes_df.head()

In [None]:
diabetes_df

In [None]:
diabetes_df.info()

In [None]:
diabetes_df.shape

In [None]:
diabetes_df.size

In [None]:
diabetes_df.ndim

In [None]:
# The .T at the end displays the transpose i.e. we flip the columns to be the rows and rows become the columns
diabetes_df.describe().T

In [None]:
diabetes_df.sample(n=5)

Check each column of the dataframe for null values.

In [None]:
diabetes_df.isnull().sum()

In [None]:
diabetes_df.nunique()

In [None]:
diabetes_df.columns

In [None]:
diabetes_df.dtypes

The following calculates the pairwise correlation of columns, excluding NA/null values.
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

In [None]:
diabetes_df.corr()
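The full correlation matrix can be hard to scan. As a small sketch, the column for the target can be pulled out and sorted. The dataframe below contains made-up toy values with hypothetical column names borrowed from the dataset, purely to keep the example self-contained.

```python
import pandas as pd

# Made-up toy values, not rows from the diabetes dataset
toy_df = pd.DataFrame({
    "Glucose": [90, 150, 180, 95, 140, 110],
    "Age": [30, 52, 35, 22, 41, 29],
    "Outcome": [0, 1, 1, 0, 1, 0],
})

# Correlation of each input feature with the target, strongest first
outcome_corr = toy_df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
print(outcome_corr)
```

The same expression should work on `diabetes_df`, assuming all of its columns are numeric.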

Produce histogram plots for each column of the diabetes dataframe. We focus on the input feature distributions, as we are going to scale the input data before training the model.

In [None]:
# Get histogram for each numeric variable - 9 variables so layout = 3x3
features_including_output_label = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
diabetes_df[features_including_output_label].hist(bins=15, figsize=(15, 10), layout=(3, 3));

##### Transform the dataset input features
Transform the data input features. Import the QuantileTransformer. For more information, refer here:
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html

The goal of the Quantile Transformer here is to map each selected input feature onto an approximately normal distribution.

In [None]:
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(output_distribution='normal', n_quantiles=10, random_state=0)
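As a rough sketch of what this transformer does, the example below applies the same settings to a synthetic, heavily right-skewed column and compares the skewness before and after. The lognormal data and the `skewness` helper are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

def skewness(values):
    """Sample skewness: third central moment divided by variance to the power 1.5."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    return float(np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5)

# Synthetic right-skewed feature, shaped (n_samples, 1) as sklearn expects
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

qt_demo = QuantileTransformer(output_distribution="normal", n_quantiles=10, random_state=0)
gaussian_like = qt_demo.fit_transform(skewed)

skew_before = skewness(skewed)
skew_after = skewness(gaussian_like)
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

With only 10 quantiles the mapping is a coarse piecewise-linear approximation, so the output is closer to normal rather than exactly normal.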

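Target specific columns for scaling: Age, SkinThickness, Insulin and DiabetesPedigreeFunction. "Pregnancies" is omitted, based on data type considerations on Kaggle samples. Glucose, BloodPressure and BMI are also left unscaled in the cells below.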

In [None]:
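# make a copy of the original dataframe; .copy() is required because
# plain assignment would only create a second reference to the same object
diabetes_df_scaled = diabetes_df.copy()
diabetes_df_scaled["Age"] = qt.fit_transform(diabetes_df[["Age"]])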

In [None]:
diabetes_df_scaled["SkinThickness"] = qt.fit_transform(diabetes_df[["SkinThickness"]])

In [None]:
diabetes_df_scaled["Insulin"] = qt.fit_transform(diabetes_df[["Insulin"]])

In [None]:
diabetes_df_scaled["DiabetesPedigreeFunction"] = qt.fit_transform(diabetes_df[["DiabetesPedigreeFunction"]])

In [None]:
# Get histogram for each numeric variable - 9 variables so layout = 3x3
features_scaled = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
diabetes_df_scaled[features_scaled].hist(bins=15, figsize=(15, 10), layout=(3, 3));

Compare the scaled output above with the earlier histograms of the raw dataset. The histograms for the transformed columns (Age, SkinThickness, Insulin and DiabetesPedigreeFunction) are now much closer to a normal distribution than they were before the transformation step.

##### Create a train:test split on the dataset
Next, we define our train:test split using the sklearn library. This lets us define our supervised learning training set and a holdout test subset. We will use 33% of the dataset as the test set. It lets us measure the accuracy of the ML model on data it has not seen in training, and so check whether it has overfit or can generalize to unseen data.

In [None]:
# split into input features (first 8 columns) and the binary target variable Outcome (0 or 1)
X = diabetes_df_scaled.iloc[: , :8]
y = diabetes_df_scaled.iloc[: , -1]

# split into train and test sets with sklearn native train_test_split 33% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
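The split above is random, so the accuracy will change from run to run. As a hedged sketch on synthetic arrays, the optional `random_state` and `stratify` arguments of `train_test_split` make the split repeatable and keep the class balance similar in both subsets. The arrays and the seed value 42 are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 2 features, 35% positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 65 + [1] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.33,      # same test fraction as in the notebook
    random_state=42,     # fixed seed so the split is reproducible
    stratify=y_demo,     # preserve the class ratio in train and test
)

print(X_tr.shape, X_te.shape)
print("positive rate in train/test:", y_tr.mean(), y_te.mean())
```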

In [None]:
X_train

In [None]:
X_test

In [None]:
X_train.shape

In [None]:
X_test.shape

##### Machine Learning Model and Training
Now we create a GaussianNB machine learning model and fit it to our training dataset.

In [None]:
gnb_ml_clf = GaussianNB()

In [None]:
gnb_ml_clf.fit(X_train, y_train)
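An alternative arrangement, sketched here on synthetic data, is to chain the transformer and the classifier in an sklearn `Pipeline`. The transformer is then fitted on the training split only and reused automatically at prediction time. Note the assumptions: this sketch transforms every column rather than the four selected in this notebook, and `make_classification` stands in for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in dataset with 8 input features
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.33, random_state=0)

# The pipeline fits the transformer on X_tr only, then trains GaussianNB on the transformed data
pipe = make_pipeline(
    QuantileTransformer(output_distribution="normal", n_quantiles=10, random_state=0),
    GaussianNB(),
)
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted transformer to X_te before predicting
pipe_accuracy = pipe.score(X_te, y_te)
print(f"accuracy on held-out data: {pipe_accuracy:.3f}")
```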

##### Gaussian Naive Bayes Model using Transformed Data: Run Predictions, Evaluate Performance
We now run predictions against the test dataset and calculate the accuracy of the ML model on new data it did not see during training.

In [None]:
y_pred = gnb_ml_clf.predict(X_test)

In [None]:
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

In [None]:
gnb_ml_clf.score(X_test, y_test)

Include the sklearn classification report for precision, recall, f1-score, and support metrics.

In [None]:
metrics = classification_report(y_test, y_pred, output_dict=False)
print(metrics)

In [None]:
cm = confusion_matrix(y_test, y_pred)
cm
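The layout of the array returned by `confusion_matrix` can be checked on a tiny hand-made example: rows are true labels, columns are predicted labels. The label lists below are invented for illustration. The sketch also derives precision and recall for the positive class from the four cells and compares them with sklearn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels: 6 negatives and 4 positives
y_true_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were correct
recall = tp / (tp + fn)     # of the true positives, how many were found

print(cm_demo)
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
```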

Below is a useful sklearn utility for plotting the confusion matrix. I refer to it whenever I need to check which array dimension holds the true labels and which holds the predicted labels. Seaborn is used afterwards for an alternative plot with text labels.

In [None]:
ConfusionMatrixDisplay(cm).plot()

In [None]:
# rows are true labels, columns are predicted labels
df_confusion_matrix = pd.DataFrame(cm, index=['True No Diabetes', 'True Diabetes'],
                     columns=['Predicted No Diabetes', 'Predicted Diabetes'])

sns.heatmap(df_confusion_matrix, annot=True, fmt='g')

##### Gradio User Interface Layer
In this section, we add a user interface layer. While we can create synthetic data functions in Python code to test the ML model, it helps for a human stakeholder to be able to test out a machine learning model in a web browser, especially when evaluating a prototype. This was a really nice utility demonstrated on the [serverless-ml MLOps](https://github.com/niallguerin/serverless-ml-course/tree/main/src/01-module) course.

The code below reuses the base code from module 1 of the serverless-ml course on GitHub. However, it uses a different dataset with more input values: the diabetes dataset with 8 input fields. I also changed the returned output from an Image result object to a Text field, so the imports and code differ from the original template at those points.

I have also added a function that transforms the input values, because users enter raw values rather than transformed ones. The function applies the same type of transformation we used on the data during training and testing, so the model receives data in the format it was trained on; otherwise there is a mismatch.
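One point worth illustrating: a single user sample should be mapped with quantiles learned from the training data, which means calling `transform` on an already fitted transformer. Calling `fit_transform` on one row would instead re-learn the quantiles from that single value. The sketch below uses a hypothetical `Age` column holding the values 1 to 100; `train_df`, `age_qt` and `transform_age` are illustrative names, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Hypothetical training column: the values 1..100 (not the diabetes data)
train_df = pd.DataFrame({"Age": np.arange(1, 101, dtype=float)})

# Fit once on the training data and keep the fitted object
age_qt = QuantileTransformer(output_distribution="normal", n_quantiles=10, random_state=0)
age_qt.fit(train_df[["Age"]])

def transform_age(age):
    """Map one raw value using the quantiles learned from train_df."""
    sample = pd.DataFrame({"Age": [float(age)]})
    return float(age_qt.transform(sample)[0, 0])

young_score = transform_age(20)
median_score = transform_age(50.5)
old_score = transform_age(80)
print(young_score, median_score, old_score)
```

Values below the training median map to negative scores and values above it to positive scores, so the ordering of the raw inputs is preserved.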

The model has only about 75% accuracy, and the confusion matrix shows it is better at predicting true negatives. Two test cases are included. Test case 2 is expected to be a true positive, but the model does not reliably return that result, which is consistent with the evaluation metrics above.

In [None]:
# reuses template code from the serverless-ml course to scaffold the gradio UI
import gradio as gr
import numpy as np

# Fit one QuantileTransformer per scaled column on the raw dataset, with the same
# settings used in training. A single user sample is then mapped with quantiles
# learned from the full data; calling fit_transform on one row would re-learn the
# quantiles from that single value. The file is re-read so the fit does not depend
# on any in-place changes made to diabetes_df earlier in the notebook.
raw_diabetes_df = pd.read_csv("../datasets/diabetes.csv")
scaled_columns = ["SkinThickness", "Insulin", "DiabetesPedigreeFunction", "Age"]
fitted_transformers = {
    column: QuantileTransformer(output_distribution='normal', n_quantiles=10,
                                random_state=0).fit(raw_diabetes_df[[column]])
    for column in scaled_columns
}

def convert_label(label_num):
    if label_num == 0:
        return "Outcome: This patient does not have Diabetes."
    if label_num == 1:
        return "Outcome: This patient has Diabetes."

def transform_input_sample(input_lst):
    # build a one-row dataframe with the same column names the model was trained on
    input_sample_df = pd.DataFrame([input_lst], columns=X.columns, dtype=float)

    # apply the fitted quantile transformers to the input sample before calling the predictive analytics service
    for column, transformer in fitted_transformers.items():
        input_sample_df[column] = transformer.transform(input_sample_df[[column]]).ravel()

    return input_sample_df

def diabetes(Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age):
    input_list = [Pregnancies, Glucose, BloodPressure, SkinThickness,
                  Insulin, BMI, DiabetesPedigreeFunction, Age]

    transformed_input_sample = transform_input_sample(input_list)
    # predict returns an array with one element for the single input row
    result = gnb_ml_clf.predict(transformed_input_sample)
    patient_status = convert_label(result[0])

    return patient_status

demo = gr.Interface(
    fn=diabetes,
    title="Diabetes Patient Predictive Analytics",
    description="Experiment with inputs to predict whether the patient has diabetes. Test Case 1: Expected Result: No Diabetes - [1,85,66,29,0,26.6,0.351,31], Test Case 2: Expected Result: Has Diabetes [6,148,72,35,0,33.6,0.627,50]",
    allow_flagging="never",
    inputs=[
        # gr.Number(value=...) replaces the older gr.inputs.Number(default=...) form,
        # which is not available in newer Gradio releases
        gr.Number(value=1, label="Pregnancies"),
        gr.Number(value=85, label="Glucose"),
        gr.Number(value=66, label="BloodPressure"),
        gr.Number(value=29, label="SkinThickness"),
        gr.Number(value=0, label="Insulin"),
        gr.Number(value=26.6, label="BMI"),
        gr.Number(value=0.351, label="DiabetesPedigreeFunction"),
        gr.Number(value=31, label="Age"),
        ],
    outputs=gr.Textbox(label="Outcome"))

demo.launch(share=False)

#### Web References
- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
- https://gradio.app/docs/