In [1]:
# importing relevant packages
import numpy as np
import pandas as pd

First we will load our target variable and our preprocessed features.

In [4]:
# loading the data
df = pd.read_parquet('../data/cleaned/dataCleanWMedicalUrgency.parquet')

# setting y to be the target variable
y = df['medical_urgency']

# importing the preprocessed data
X = pd.read_parquet('../data/cleaned/featuresPreprocessed.parquet')


Now we will split the data into remainder data and testing data. We will keep the test data separate and only use it for testing purposes.

We will train on 75% of the data and leave 25% to test on. Our test set will be a random sample stratified by the values of the target variable, y, so that each value in y keeps its proportional representation in both sets.

We will measure the success of our models by their accuracy scores on the test data.

In [6]:
# importing relevant packages
from sklearn.model_selection import train_test_split

# splitting the data
X_rem, X_test, y_rem, y_test = train_test_split(X, y, test_size=0.25, random_state=1234, stratify=y)
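To illustrate what stratification does, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the class probabilities are assumptions made up for the example; they are not the notebook's data. The sketch uses the same `train_test_split` arguments as above and prints the class proportions of the full data, the remainder set and the test set, which should be nearly identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 rows, three imbalanced classes
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.30, 0.42, 0.28])
X_demo = rng.normal(size=(1000, 4))

# Same call pattern as above: 25% test split, stratified on the target
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1234, stratify=y_demo
)

# Class proportions in the full data, the remainder set and the test set
for name, arr in [("full", y_demo), ("remainder", y_a), ("test", y_b)]:
    props = np.bincount(arr, minlength=3) / len(arr)
    print(name, np.round(props, 3))
```

Without `stratify`, the proportions in a random split can drift from those of the full data, which matters more when some classes are rare.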

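The split shuffles the rows, so each set keeps its original, now out-of-order, row labels. We will reset the indices so that they run from 0 again, which keeps later label-based indexing consistent.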

In [8]:
# resetting indices
X_rem.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_rem.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
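As a small illustration of what `reset_index(drop=True)` does, here is a sketch on a hypothetical three-element series (the values and labels are invented for the example). The old labels are discarded and replaced with 0, 1, 2, while the values and their order stay the same.

```python
import pandas as pd

# Hypothetical series whose labels are out of order, as after a shuffled split
s = pd.Series([10, 20, 30], index=[7, 2, 5])

# drop=True discards the old labels instead of keeping them as a new column
s_reset = s.reset_index(drop=True)

print(list(s.index))        # [7, 2, 5]
print(list(s_reset.index))  # [0, 1, 2]
print(list(s_reset))        # [10, 20, 30]
```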

We will quickly check the accuracy score of a naive model that predicts the mode of y.

In [18]:
# checking proportions of each value in y
y.value_counts() / len(y) 

medical_urgency
1    0.423311
0    0.302515
2    0.274174
Name: count, dtype: float64

1 is the most common value in the target variable. A naive model that guesses 1 every time would have an accuracy of 42.33%.
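The same kind of baseline can be expressed as a model object. The sketch below uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"` on a tiny made-up target (`y_demo` is an assumption for illustration, not the notebook's data). Its accuracy equals the share of the most common class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up target with a known mode: class 1 makes up 5 of 10 labels
y_demo = np.array([1, 1, 1, 1, 1, 0, 0, 0, 2, 2])
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# "most_frequent" always predicts the most common class seen during fit
naive = DummyClassifier(strategy="most_frequent")
naive.fit(X_demo, y_demo)

print(naive.predict(X_demo[:3]))    # [1 1 1]
print(naive.score(X_demo, y_demo))  # 0.5
```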

# Initial logistic regression model

Our first model will be a logistic regression. It will serve as a baseline and hopefully give us a starting point for further models and optimisation. It will be basic and will not use cross-validation. Its purpose is just to light the way for future models.

In [10]:
# importing relevant package
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Let's make our first model. We will use a pipeline that includes scaling. The parameters for the logistic regression are chosen more or less arbitrarily and serve as starting points for future models.

In [11]:
# setting up the estimators 
estimators = [
    ('normalise', StandardScaler()),
    ('log', LogisticRegression(C=0.1, max_iter=10000))
]

# making the pipeline
pipe = Pipeline(estimators)

In [12]:
# fitting the pipeline model to the remainder data
pipe.fit(X_rem, y_rem)

Pipeline(steps=[('normalise', StandardScaler()),
                ('log', LogisticRegression(C=0.1, max_iter=10000))])
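One reason to put the scaler inside the pipeline is that its statistics are learned only from the data passed to `fit`. The sketch below checks this on synthetic data; `demo_pipe`, `X_fit` and `y_fit` are names invented for the example, and the step names simply mirror the ones used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three features centred at 5, binary target from feature 0
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_fit = (X_fit[:, 0] > 5.0).astype(int)

demo_pipe = Pipeline([
    ("normalise", StandardScaler()),
    ("log", LogisticRegression(C=0.1, max_iter=10000)),
])
demo_pipe.fit(X_fit, y_fit)

# The scaler's mean and standard deviation come from the fitting data only
scaler = demo_pipe["normalise"]
print(np.allclose(scaler.mean_, X_fit.mean(axis=0)))  # True
print(np.allclose(scaler.scale_, X_fit.std(axis=0)))  # True
```

When the pipeline later scores or predicts on other data, it reuses these stored statistics rather than recomputing them.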

In [27]:
# importing joblib to save and load the fitted pipeline
import joblib

In [28]:
# saving the fitted pipeline to disk
joblib.dump(pipe, '../model/something.pkl')

['../model/something.pkl']

In [29]:
# loading the pipeline back to check that the save worked
test = joblib.load('../model/something.pkl')

In [31]:
test

Pipeline(steps=[('normalise', StandardScaler()),
                ('log', LogisticRegression(C=0.1, max_iter=10000))])
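Printing the reloaded object shows the same steps, but a stricter check is to compare predictions. The following is a minimal sketch on synthetic data, assuming a temporary directory and a hypothetical file name `demo_pipe.pkl` rather than the notebook's own path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

demo_pipe = Pipeline([
    ("normalise", StandardScaler()),
    ("log", LogisticRegression(C=0.1, max_iter=10000)),
]).fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipe.pkl")  # hypothetical file name
    joblib.dump(demo_pipe, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should predict exactly what the original predicts
same_predictions = np.array_equal(
    demo_pipe.predict(X_demo), reloaded.predict(X_demo)
)
print(same_predictions)  # True
```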

In [15]:
# printing the logistic regression model's accuracy score for the remainder set and the test set
print(pipe.score(X_rem, y_rem))
print(pipe.score(X_test, y_test))

0.74145265290641
0.73941161643441


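We have achieved 74.15% accuracy on the remainder set and 73.94% accuracy on the test set. The two scores are close, which suggests the model is not overfitting. This is a substantial improvement on the naive model's 42.33%, a gain of about 31.6 percentage points on the test set. Hopefully we can do even better during model optimisation.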

In [26]:
# viewing the fitted coefficients, one row per class and one column per feature
pd.DataFrame(columns=X.columns, data=pipe['log'].coef_)

Unnamed: 0,age,2ndarymalig,abdomhernia,abdomnlpain,abortcompl,acqfootdef,acrenlfail,acutecvd,acutemi,acutphanm,...,arrivalhour_bin_23-02,previousdispo_Admit,previousdispo_Discharge,previousdispo_Eloped,previousdispo_LWBS after Triage,previousdispo_LWBS before Triage,previousdispo_No previous dispo,previousdispo_Observation,previousdispo_Send to L&D,previousdispo_Transfer to Another Facility
0,0.260844,0.033072,-0.008434,-0.00056,-0.000371,-0.005881,-0.002112,0.012781,0.017828,-0.009147,...,0.035033,0.014309,-0.12454,-0.009484,-0.024235,-0.017866,-0.038646,0.002259,0.001696,-0.000255
1,-0.028251,0.000162,0.008843,-0.002241,-0.000775,-0.003925,-0.002045,0.001595,0.000789,-0.010264,...,-0.014412,0.002644,-0.020903,0.000525,0.003922,-0.000825,-0.015867,-9.6e-05,0.001118,0.018326
2,-0.232593,-0.033235,-0.000409,0.002801,0.001146,0.009806,0.004157,-0.014376,-0.018617,0.019411,...,-0.020622,-0.016953,0.145443,0.008959,0.020313,0.018691,0.054513,-0.002163,-0.002814,-0.01807
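With this many columns the table is hard to read directly. One possible way to summarise it is to rank the features for each class by absolute coefficient size, which is reasonable here because the features were standardised before fitting. The sketch below uses a small hypothetical coefficient table; the values are rounded for illustration and `feature_b` and `feature_c` are made-up column names.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient table shaped like the one above: one row per class
coefs = pd.DataFrame(
    data=np.array([
        [0.26, 0.03, -0.12],
        [-0.03, 0.00, -0.02],
        [-0.23, -0.03, 0.15],
    ]),
    columns=["age", "feature_b", "feature_c"],  # made-up names apart from age
    index=[0, 1, 2],
)

# For each class, order the features by absolute coefficient size
for cls, row in coefs.iterrows():
    ranked = row.abs().sort_values(ascending=False)
    print(cls, list(ranked.index))
```

In this toy table `age` ranks first for every class; on the real table the same loop could be applied to the DataFrame built from `pipe['log'].coef_`.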
