# Exploring Classical Machine Learning

Let's load in any libraries we will use in this notebook.

In [None]:
import sklearn #great machine learning library
import pandas as pd #we'll use this to read in our data in a csv file nicely
import numpy as np #lets us do lots of math operations
import matplotlib.pyplot as plt #for plotting data!
from sklearn.model_selection import train_test_split

# Loading in the Dataset

We're going to be using a publicly available dataset -- the 'Maternal Health Risk Data', available from https://www.kaggle.com/datasets/csafrit2/maternal-health-risk-data

From the dataset website: "Data has been collected from different hospitals, community clinics, maternal health cares through the IoT based risk monitoring system.

* Age: Age in years when a woman is pregnant.
* SystolicBP: Upper value of Blood Pressure in mmHg, another significant attribute during pregnancy.
* DiastolicBP: Lower value of Blood Pressure in mmHg, another significant attribute during pregnancy.
* BS: Blood glucose levels is in terms of a molar concentration, mmol/L.
* HeartRate: A normal resting heart rate in beats per minute.
* Risk Level: Predicted Risk Intensity Level during pregnancy considering the previous attributes."

We're going to see if we can predict the Risk Level of a patient -- low risk, medium risk, or high risk -- based on the other variables provided.

Below, I'm going to load in the dataset and do some initial processing. There's nothing for you to change here, but I'll leave comments in case you're interested in what's going on.

In [None]:
all_data = pd.read_csv('Maternal Health Risk Data Set.csv')   #read the file into a pandas data frame

print(all_data.info())   #we can call this command to get some stats on the dataset, including the features we have, the number of data points for each category, and the data type for each category

#convert the inputs and the labels to numpy arrays, which are easier to work with from here on
input_features = ['Age', 'SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate']
input_data = all_data[input_features].to_numpy()

gt_output = all_data['RiskLevel'].to_numpy()

Above, we can see that there are 1014 data points for a variety of features. 

We're interested in using features 0-5 to help us predict the patient's risk level -- 'low risk', 'mid risk', or 'high risk'.

### Question: What ML Task are we performing here?

# Inspect the Data

Let's look at some of the input_data and gt_output to get a feel for its current format and what we're working with.
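One possible way to do this is sketched below. To keep the snippet runnable on its own, it uses a tiny made-up stand-in for input_data and gt_output (an assumption, not rows from the real dataset); in the notebook you would run the same calls on the real arrays.

```python
import numpy as np

# Made-up stand-in arrays (assumption) so the snippet runs on its own;
# in the notebook, use the real input_data and gt_output instead.
input_data = np.array([[20.0, 110.0, 70.0, 6.0, 98.0, 70.0],
                       [30.0, 120.0, 80.0, 7.0, 98.0, 75.0],
                       [40.0, 140.0, 90.0, 9.0, 99.0, 80.0],
                       [25.0, 100.0, 60.0, 6.5, 98.0, 65.0]])
gt_output = np.array(['high risk', 'low risk', 'mid risk', 'low risk'])

print(input_data.shape, gt_output.shape)   # (rows, features) and (rows,)
print(input_data[:3])                      # first few feature rows
print(gt_output[:3])                       # the matching labels

# How many examples of each class are there?
labels, counts = np.unique(gt_output, return_counts=True)
print(dict(zip(labels, counts)))
```

The class counts are worth a look at this stage, because an imbalanced dataset affects how we should split the data and read the results later on.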

# Normalise the data

The different features in input_data have very different scales - find the minimum and maximum values, and then apply min-max scaling to normalise the data to be between 0 and 1.

$$ x_{norm} = {x-x_{min}\over x_{max}-x_{min}}$$
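As a minimal sketch of the formula above, the snippet below scales each column of a small made-up array (an assumption standing in for input_data). The names x_min, x_max and input_norm are illustrative choices.

```python
import numpy as np

# Made-up stand-in for input_data (assumption): 3 rows, 2 features.
input_data = np.array([[20.0, 100.0],
                       [30.0, 120.0],
                       [40.0, 160.0]])

# Minimum and maximum of each feature (column), taken over all rows.
x_min = input_data.min(axis=0)
x_max = input_data.max(axis=0)

# Broadcasting applies the min-max formula to every column at once.
input_norm = (input_data - x_min) / (x_max - x_min)
print(input_norm)
```

Using axis=0 matters here: it gives one minimum and one maximum per feature, so every feature ends up in the range 0 to 1 regardless of its original units.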

# Training, Validation and Test Subsets 

We have 1014 data points, and we are going to split this data in the following way:
- we have 50% for the training subset, and 25% each for the validation and test subsets
- we want to do so randomly with a random state of 0
- we want to create a stratified split

Use the same approach that we used in Week 1 -- the sklearn train_test_split function -- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. 
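train_test_split only produces two subsets per call, so one way to get three is to call it twice. The sketch below does this on made-up stand-in data (an assumption); the intermediate names input_rest and gt_rest are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data (assumption): 40 rows, 6 features, 3 classes.
rng = np.random.default_rng(0)
input_data = rng.random((40, 6))
gt_output = np.array(['low risk'] * 20 + ['mid risk'] * 12 + ['high risk'] * 8)

# Step 1: keep 50% for training and set the other 50% aside.
input_train, input_rest, gt_train, gt_rest = train_test_split(
    input_data, gt_output, test_size=0.5, random_state=0, stratify=gt_output)

# Step 2: halve the remainder into validation and test (25% each overall).
input_val, input_test, gt_val, gt_test = train_test_split(
    input_rest, gt_rest, test_size=0.5, random_state=0, stratify=gt_rest)

print(input_train.shape, input_val.shape, input_test.shape)
```

Passing the labels to stratify keeps the class proportions roughly the same in every subset, which is what the histogram further down lets you check.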

In [None]:
print('Total Dataset shape:')
print(f'    Input shape: {input_data.shape}   GT shape: {gt_output.shape}')
print('Train Subset shape:')
print(f'    Input shape: {input_train.shape}   GT shape: {gt_train.shape}')
print('Validation Subset shape:')
print(f'    Input shape: {input_val.shape}   GT shape: {gt_val.shape}')
print('Test Subset shape:')
print(f'    Input shape: {input_test.shape}   GT shape: {gt_test.shape}')


In [None]:
plt.hist([gt_train, gt_val, gt_test]) #can add density = True to see normalised densities
plt.xlabel('GT Classification')
plt.ylabel('Count')
plt.show()

# Model 1: Implementing a K Nearest Neighbour Model

## K=1 Nearest Neighbour
Let's use the sklearn KNeighborsClassifier, which takes a K Nearest Neighbour approach to classification. Start with K = 1 to create a simple nearest neighbour classifier.
You can read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
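Here is a rough sketch of the create, fit and score pattern. It runs on synthetic data from make_classification (an assumption, used only so the snippet is self-contained); in the notebook you would use your own input_train, gt_train, input_val and gt_val.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 6 features and 3 classes.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
input_train, input_val, gt_train, gt_val = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Create the model with K = 1 and fit it to the training subset.
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(input_train, gt_train)

# score() returns the mean accuracy on the data it is given.
val_accuracy = knn_model.score(input_val, gt_val)
print(f'K=1 validation accuracy: {val_accuracy:.3f}')
```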


In [None]:
from sklearn.neighbors import KNeighborsClassifier


## Use the validation dataset to find hyperparameter K

Let's use the validation dataset to find the best value of K! You can adapt the code above to search through a range of K values, store the validation accuracy, and then store the best value of K in a variable called *K_best*.

It's also a good idea to plot the results you get, using something like plt.plot() -- see here: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html

Sometimes, you'll get similar performance with a high value of K and a low value of K -- remember: the lower value is usually the better choice in this case (see Occam's razor)
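A possible shape for this search is sketched below, again on synthetic stand-in data (an assumption). The search range of 1 to 20 and the names k_values and val_accuracies are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 6 features and 3 classes.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
input_train, input_val, gt_train, gt_val = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

k_values = list(range(1, 21))   # arbitrary search range (assumption)
val_accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(input_train, gt_train)
    val_accuracies.append(model.score(input_val, gt_val))

# np.argmax returns the first maximum, so ties go to the smallest K.
K_best = k_values[int(np.argmax(val_accuracies))]
print(f'K_best = {K_best}, validation accuracy = {max(val_accuracies):.3f}')
```

Calling plt.plot(k_values, val_accuracies) afterwards gives the accuracy curve, which is often more informative than the single best value.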

## Find KNN Performance on the Test Data 

Now that we've used our validation dataset to find the best value of K, let's use this value of K to create a model, and then test it on the test data to see the final 'real-world' performance.

It'll be very similar to the approach from above -- you're still using a KNeighborsClassifier and fitting it to the training data subset. This time, use the K_best variable and test on the test data to find the accuracy of the model.

## Visualise performance with a confusion matrix

Create a Confusion Matrix based on the performance of the KNN model on the test dataset.

Looking at the Confusion Matrix, reflect on the following questions:
1. Is performance consistent across the classes, or is there a clear discrepancy for some classes? If there is, why do you think this might be?
2. Given the potential use of this ML model, are some types of errors worse or more dangerous than others? How does the KNN model perform for these types of errors? (e.g. if a patient is medium risk, is it better or worse for them to be misclassified as low risk or high risk?)

Sklearn has a useful function -- ConfusionMatrixDisplay.from_predictions() -- that creates a confusion matrix if given an array of predicted labels and an array of true labels. Read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_predictions

Note: you may want to use the normalize argument in the above function to allow easy interpretation in the presence of class imbalance.
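To see what the normalize argument does to the numbers, the sketch below computes a row-normalised confusion matrix with sklearn's confusion_matrix function on synthetic stand-in data (an assumption). The value K_best = 5 is a placeholder, not a result from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 6 features and 3 classes.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
input_train, input_test, gt_train, gt_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

K_best = 5   # placeholder (assumption); use the value from your validation search
knn_model = KNeighborsClassifier(n_neighbors=K_best).fit(input_train, gt_train)
pred_test = knn_model.predict(input_test)

# normalize='true' divides each row by the number of true samples in that
# class, so every row sums to 1 and the diagonal holds the per-class recall.
cm = confusion_matrix(gt_test, pred_test, normalize='true')
print(cm)
```

ConfusionMatrixDisplay.from_predictions(gt_test, pred_test, normalize='true') plots these same values, with true labels on the rows and predicted labels on the columns.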

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay


# Model 2: Implementing a Decision Tree

## Create a decision tree
In the cell below, implement the sklearn DecisionTreeClassifier using a random_state of 0. 
Read the sklearn documentation on DecisionTreeClassifier to see how to implement -- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

You will follow a similar process to the KNN model: creating the model, fitting it to the training data, and finding the accuracy on the validation dataset.
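A minimal sketch of that process for the decision tree follows, using synthetic stand-in data (an assumption) so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 6 features and 3 classes.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
input_train, input_val, gt_train, gt_val = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# With no depth limit, the tree keeps splitting until its leaves are pure.
dt_model = DecisionTreeClassifier(random_state=0)
dt_model.fit(input_train, gt_train)

print(f'Tree depth: {dt_model.get_depth()}')
print(f'Training accuracy:   {dt_model.score(input_train, gt_train):.3f}')
print(f'Validation accuracy: {dt_model.score(input_val, gt_val):.3f}')
```

Comparing the training and validation accuracy is a quick way to see how much the unrestricted tree has overfitted.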

In [None]:
from sklearn.tree import DecisionTreeClassifier


## Use the validation dataset to select the best maximum depth for the tree

In the lecture, we explored how overfitting in decision trees can be mitigated by 'pruning' the decision tree. One technique is 'pre-pruning', where we prevent overfitting while creating the decision tree. You can do this by specifying the maximum depth the tree can reach.

Test over a range of maximum depths to find the best performance on the validation dataset.
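The search can follow the same pattern as the search over K. Here is one possible sketch on synthetic stand-in data (an assumption); the depth range of 1 to 15 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 6 features and 3 classes.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
input_train, input_val, gt_train, gt_val = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

depths = list(range(1, 16))   # arbitrary search range (assumption)
val_accuracies = []
for depth in depths:
    model = DecisionTreeClassifier(random_state=0, max_depth=depth)
    model.fit(input_train, gt_train)
    val_accuracies.append(model.score(input_val, gt_val))

# np.argmax returns the first maximum, so ties go to the shallowest tree.
best_depth = depths[int(np.argmax(val_accuracies))]
print(f'best_depth = {best_depth}, validation accuracy = {max(val_accuracies):.3f}')
```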

## (For your interest) Visualise the Decision Tree!

You can use the code below to visualise the decision tree. 

Things to note: 
* Normally the features would not be normalised, making this more interpretable.
* It's still a very busy decision tree! This is not necessarily a sign of overfitting; it can also indicate a complex decision boundary between the input data and the output predictions.

In [None]:
from sklearn import tree

dt_model = DecisionTreeClassifier(random_state=0, max_depth = best_depth)
dt_model.fit(input_train, gt_train)

plt.figure(figsize=(12,12))
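#class_names must follow the sorted label order stored in dt_model.classes_, otherwise the leaves are mislabelled
tree.plot_tree(dt_model, feature_names = input_features, class_names = list(dt_model.classes_), fontsize = 6)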
plt.show()

## Find the performance on the test dataset with your selected best maximum depth

## Visualise performance with a confusion matrix

Use the same approach as earlier.

How does performance compare with the KNN classifier? Does it have a similar distribution of errors, or different? 
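One way to put the two models side by side is to compare their per-class recall, which is the diagonal of a row-normalised confusion matrix. The sketch below uses synthetic stand-in data, and K_best = 5 and best_depth = 4 are placeholders (all assumptions, not results from this dataset).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 6 features and 3 classes.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
input_train, input_test, gt_train, gt_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

K_best, best_depth = 5, 4   # placeholders (assumption); use your own values

knn_model = KNeighborsClassifier(n_neighbors=K_best).fit(input_train, gt_train)
dt_model = DecisionTreeClassifier(random_state=0, max_depth=best_depth)
dt_model.fit(input_train, gt_train)

# average=None returns one recall value per class, in sorted label order.
knn_recall = recall_score(gt_test, knn_model.predict(input_test), average=None)
dt_recall = recall_score(gt_test, dt_model.predict(input_test), average=None)

for label, r_knn, r_dt in zip(knn_model.classes_, knn_recall, dt_recall):
    print(f'class {label}: KNN recall = {r_knn:.2f}, tree recall = {r_dt:.2f}')
```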

# Make a Recommendation!
Q: The dataset has been collected from different hospitals, community clinics, and maternal health cares through the IoT-based risk monitoring system. You have tested the K Nearest Neighbour classifier and the Decision Tree classifier for Queensland Health -- your client. The client is planning to deploy an ML model that automatically classifies pregnancy risk level in community clinics that have less Obstetrics* expertise available or do not have enough staff to cope with current demand. The client is asking for your opinion on the following:

(i) Which classifier should we use, and why?

(ii) Are there any characteristics of performance that we should be aware of (i.e. differences in performance based on risk level, etc.)?

*Obstetrics is the field of study concentrated on pregnancy, childbirth and the postpartum period. 