# Non-linear Models

Next we will try some non-linear models. We will use decision tree and KNN models to see whether we can achieve a higher accuracy than with our logistic regression models.

In [1]:
# importing relevant packages

# generic
import numpy as np
import pandas as pd

# for training models 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# pickling models
import joblib

# finding best model
from sklearn.model_selection import GridSearchCV, cross_val_score

# plots
import plotly.express as px
import plotly.graph_objects as go

# packages for model evaluation
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report

We will follow the same flow to load in data and get our train/test split as in the notebook '04-modelling'.

In [2]:
# loading the data
df = pd.read_parquet('../data/cleaned/dataCleanWMedicalUrgency.parquet')

# setting y to be the target variable
y = df['medical_urgency']

# importing the preprocessed data
X = pd.read_parquet('../data/cleaned/featuresPreprocessed.parquet')

# splitting the data
X_rem, X_test, y_rem, y_test = train_test_split(X, y, test_size=0.25, random_state=1234, stratify=y)

# resetting indices to allow models to run properly
X_rem.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_rem.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

# setting up training and validation sets to guide choices for hyperparameters
X_train, X_val, y_train, y_val = train_test_split(X_rem, y_rem, test_size=0.25, random_state=1234, stratify=y_rem)
X_train.reset_index(drop=True, inplace=True)
X_val.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_val.reset_index(drop=True, inplace=True)

# making an experimental dataframe that has 0.1% of the rows of X_rem (test_size=0.999 discards the rest)
X_exp, X_bin, y_exp, y_bin = train_test_split(X_rem, y_rem, test_size=0.999, random_state=1234, stratify=y_rem)
X_exp.reset_index(drop=True, inplace=True)
y_exp.reset_index(drop=True, inplace=True)
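
As a quick sanity check on the splits above, the arithmetic behind the nested `train_test_split` calls can be sketched in plain Python. The helper name `split_fractions` is hypothetical, and the values are shares of the full dataset implied by the `test_size` arguments, not row counts read from the actual data.

```python
def split_fractions(test_size=0.25, val_size=0.25, exp_keep=0.001):
    """Return each split's share of the full dataset for the nested hold-out splits."""
    rem = 1.0 - test_size   # X_rem: what is left after removing the test set
    val = rem * val_size    # X_val: carved out of X_rem
    train = rem - val       # X_train: the remainder of X_rem
    exp = rem * exp_keep    # X_exp: the part of X_rem kept when test_size=0.999
    return {"test": test_size, "rem": rem, "train": train, "val": val, "exp": exp}


fractions = split_fractions()
for name, share in fractions.items():
    print(f"{name:>5}: {share:.4%}")
```

Under these settings the training set holds a little over half of all rows and the validation set just under a fifth, while the experimental sample is a very small slice of X_rem.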

Now we are ready to set up a pipeline with our non-linear models. 

## Decision Tree

Let's first fit a decision tree, which we will optimise over the hyperparameters 'criterion', 'max_depth', and 'min_samples_leaf'. To decide which values of max_depth to optimise over, let's plot the accuracy against max_depth using the default values for 'criterion' and 'min_samples_leaf'.

To do this we use the training and validation sets split from X_rem and y_rem above, so that we can see the effect of different max_depth values without looking at the test set and introducing data leakage.

In [44]:
# setting up list of accuracy scores
train_acc_scores_decision_tree = list()
val_acc_scores_decision_tree = list()

# setting up list of max_depths that we will investigate
depths = np.arange(5, 51, 5)

# fitting scaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

# creating a for loop to cycle through the different values of max_depth
for i in depths:
    decision_tree_model = DecisionTreeClassifier(max_depth=i)
    decision_tree_model.fit(X_train_s, y_train)
    train_acc_scores_decision_tree.append(decision_tree_model.score(X_train_s, y_train))
    val_acc_scores_decision_tree.append(decision_tree_model.score(X_val_s, y_val))


# plotting the max_depth scores against train and validation accuracy scores 
fig1 = go.Figure()
train_scores_decision_tree = go.Scatter(x=depths, y=train_acc_scores_decision_tree, name='Train accuracy')
val_scores_decision_tree = go.Scatter(x=depths, y=val_acc_scores_decision_tree, name='Validation accuracy')
fig1.add_trace(train_scores_decision_tree)
fig1.add_trace(val_scores_decision_tree)
fig1.update_layout(title_text='Accuracy Score Against Max Depth of Decision Tree', title_x=0.5, xaxis_title='Max Depth', yaxis_title='Accuracy')
fig1.show()


From the figure we can see that the validation accuracy peaks at a max depth of 25. After that the model begins to overfit and the validation accuracy starts to drop. Since we sampled max depth in steps of 5, the optimum lies somewhere between 21 and 29, so we will run our grid search over this range of values.
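
The range can also be derived programmatically rather than read off the plot. The following is a minimal sketch, assuming a hypothetical helper `search_range` and placeholder validation scores; the numbers are illustrative only and are not the notebook's results.

```python
def search_range(depths, val_scores):
    """Return (low, high): the values strictly between the neighbours of the best depth."""
    best = max(range(len(depths)), key=lambda i: val_scores[i])
    # step inwards by one from the coarse-grid neighbours on either side
    low = depths[best - 1] + 1 if best > 0 else depths[best]
    high = depths[best + 1] - 1 if best < len(depths) - 1 else depths[best]
    return low, high


depths = list(range(5, 51, 5))
# placeholder scores with a peak at depth 25 (illustrative, not measured)
illustrative_val_scores = [0.60, 0.62, 0.63, 0.64, 0.65, 0.64, 0.63, 0.62, 0.62, 0.61]

low, high = search_range(depths, illustrative_val_scores)
print(low, high)
```

In the notebook, `val_acc_scores_decision_tree` would take the place of the placeholder list.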

Let's set up the pipeline.

In [50]:
# setting up the estimators for the pipeline
estimators = [
    ('normalise', StandardScaler()),
    ('decision_tree', DecisionTreeClassifier())
]

# making the pipeline
pipe = Pipeline(estimators)

# setting up the values for the parameters
criterion = ['gini', 'entropy']
max_depth = range(21, 30)
min_samples_leaf = [1, 2, 3, 4, 5]

# setting up the parameters dictionary
params = dict(
    decision_tree__criterion = criterion,
    decision_tree__max_depth = max_depth,
    decision_tree__min_samples_leaf = min_samples_leaf
)

Time for the gridsearch.
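
Before fitting, it is worth counting how many models the grid implies, since that drives the run time. This is a small sketch using only the standard library; it mirrors the parameter values defined above, and `cv_folds` is a name introduced here for the `cv=5` argument.

```python
from itertools import product

# same values as the parameter grid defined above
criterion = ['gini', 'entropy']
max_depth = range(21, 30)
min_samples_leaf = [1, 2, 3, 4, 5]
cv_folds = 5  # assumed to match cv=5 in GridSearchCV

# every combination of the three hyperparameters
combinations = list(product(criterion, max_depth, min_samples_leaf))
total_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations, {total_fits} cross-validation fits")
```

GridSearchCV also refits the best configuration once on all of the data it is given, because `refit` defaults to `True`, so the search performs one more fit than the cross-validation total.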

In [9]:
# making the gridsearch
gridsearch_model_decision_tree = GridSearchCV(pipe, param_grid=params, cv=5)

# fitting the grid search model
gridsearch_model_decision_tree.fit(X_rem, y_rem)

# saving the gridsearch
joblib.dump(gridsearch_model_decision_tree, '../model/gridsearch_decision_tree.pkl')

['../model/gridsearch_decision_tree.pkl']

Let's load the model in.

In [8]:
# loading the model in
decision_tree_optimised = joblib.load('../model/gridsearch_decision_tree.pkl')

In [3]:
decision_tree_optimised.best_estimator_

Now we can look at how good the model is. 

In [9]:
decision_tree_optimised.score(X_test, y_test)

0.6430449029563213

Now let's look at the classification report.

In [9]:
# getting the classification report
y_pred = decision_tree_optimised.predict(X_test)
class_report = classification_report(y_test, y_pred)

# printing the classification report
print(class_report)

              precision    recall  f1-score   support

           0       0.71      0.56      0.63     42006
           1       0.63      0.60      0.62     58779
           2       0.61      0.79      0.69     38070

    accuracy                           0.64    138855
   macro avg       0.65      0.65      0.65    138855
weighted avg       0.65      0.64      0.64    138855



The class 0 recall is 56% and the accuracy is lower than the logistic regression models. We will try KNN to see if we get any improvement. 
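
For reference, the per-class recall quoted above is the share of a class's true members that the model labels correctly. Below is a minimal sketch with a hypothetical helper `recall_per_class` and toy labels that are for illustration only, not the notebook's data.

```python
from collections import Counter


def recall_per_class(y_true, y_pred):
    """Recall for each class: correctly predicted members / all true members."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in sorted(support)}


# toy labels, illustrative only
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

recalls = recall_per_class(y_true_toy, y_pred_toy)
for label, value in recalls.items():
    print(f"class {label}: recall = {value:.2f}")
```

On the real predictions, `sklearn.metrics.recall_score(y_test, y_pred, average=None)` returns the same per-class values as an array.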

## KNN

For our KNN model, we will optimise the 'n_neighbors' parameter.
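
To see why the number of neighbours matters, here is a minimal pure-Python sketch of the voting rule behind KNN. The function `knn_predict` and the toy points are hypothetical and serve only as an illustration; scikit-learn's `KNeighborsClassifier` is far more efficient but follows the same idea with its default uniform weights.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k training points nearest to query."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# toy data: two points of class 0 near the origin, three of class 1 near (1, 1)
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.9, 1.0)]
labels = [0, 0, 1, 1, 1]
query = (0.2, 0.1)

pred_k1 = knn_predict(points, labels, query, k=1)
pred_k5 = knn_predict(points, labels, query, k=5)
print(pred_k1, pred_k5)
```

With a small k the prediction follows the closest points, while a large k lets the more common class outvote them. This is the trade-off the sweep over `n_neighbors` explores.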

In [5]:
# setting up list of accuracy scores
train_acc_scores_knn = list()
val_acc_scores_knn = list()

# setting up list of n_neighbors values that we will investigate
num_neighbours = np.arange(5, 51, 5)

# fitting scaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

# creating a for loop to cycle through the different values of n_neighbors
for i in num_neighbours:
    knn_model = KNeighborsClassifier(n_neighbors=i)
    knn_model.fit(X_train_s, y_train)
    train_acc_scores_knn.append(knn_model.score(X_train_s, y_train))
    val_acc_scores_knn.append(knn_model.score(X_val_s, y_val))

# plotting the number of neighbours against train and validation accuracy scores
fig2 = go.Figure()
train_scores_knn = go.Scatter(x=num_neighbours, y=train_acc_scores_knn, name='Train accuracy')
val_scores_knn = go.Scatter(x=num_neighbours, y=val_acc_scores_knn, name='Validation accuracy')
fig2.add_trace(train_scores_knn)
fig2.add_trace(val_scores_knn)
fig2.update_layout(title_text='Accuracy Score Against Number of Neighbours in KNN Model', title_x=0.5, xaxis_title='Number of Neighbours', yaxis_title='Accuracy')
fig2.show()

It looks like we will not get much improvement as we increase the number of neighbours, and the score is worse than that of our logistic regression model.

We will use n_neighbors = 50 for our final KNN model to have a look at the class 0 recall.

First we have to set up a scaler.

In [4]:
# fitting scaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

Now we make the model.

In [5]:
# making the model
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train_s, y_train)

Now let's look at the classification report.

In [6]:
# getting the classification report
y_pred = knn.predict(X_test_s)
class_report = classification_report(y_test, y_pred)

# printing the classification report
print(class_report)

              precision    recall  f1-score   support

           0       0.75      0.51      0.60     42006
           1       0.60      0.73      0.66     58779
           2       0.71      0.73      0.72     38070

    accuracy                           0.66    138855
   macro avg       0.69      0.66      0.66    138855
weighted avg       0.68      0.66      0.66    138855



The class 0 recall is 51%, much worse than any model we have seen previously. We will not investigate KNN further, as it seems it will not be optimal for our tool.

## Conclusion

Neither KNN nor the decision tree has offered any improvement on the logistic regression models. We will try random forest and XGBoost in our next modelling notebook to see if we can get an improvement there.