### Fit Baseline Model

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
import pandas as pd
import os
import dtale
import numpy as np

Load the dataset.

In [33]:
project_dir = os.path.dirname(os.path.abspath(''))
df = pd.read_json(os.path.join(project_dir, 'model_prepped_dataset.json'))
dtale.show(df).open_browser()

Assign the feature columns to the matrix X and the target column `Outcome` to the vector y.

In [34]:
# Features: every column except the target.
X = df.drop(columns='Outcome')
# Target: the label to predict.
y = df['Outcome']
X.tail()

Split into training set and test set.

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
print(len(X_train))
print(len(X_test))
print(len(X))

83509
14738
98247
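
If the classes in `Outcome` are imbalanced, a stratified split may give a more representative test set. The following is a minimal sketch on synthetic data; the arrays `X_demo` and `y_demo` and the use of `stratify` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: 1000 rows, 3 imbalanced classes.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.repeat([0, 1, 2], [600, 300, 100])

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42, stratify=y_demo
)

# Share of each class in the training and test subsets.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(2), test_share.round(2))
```

Comparing the two printed arrays is a quick check that neither subset over-represents a class.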


Scale X by standardising it. Use the mean and standard deviation of the training set to standardise the test set.

In [37]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
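
To make the scaling step concrete, the sketch below checks on synthetic data that `transform` reuses the statistics learned by `fit_transform` on the training set. The arrays `train` and `test` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Equivalent manual computation with the training mean and (population) std.
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)
same = np.allclose(test_scaled, manual_test)
print(same)
```

Fitting the scaler on the training set alone avoids leaking information about the test set into the model.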

#### Option 1) Logistic Regression

Train model.

In [38]:
lgr = LogisticRegression()
lgr.fit(X_train, y_train)

LogisticRegression()

Evaluate the model on the test set by calculating the accuracy and MSE of the predicted classes. MSE treats the class labels as numbers, so it is only meaningful if the classes are ordered.

In [39]:
y_pred_log = lgr.predict(X_test)
mse_log = mean_squared_error(y_test, y_pred_log)
accu_log = accuracy_score(y_test, y_pred_log) * 100
# R2 score is not a good measure for classification.
print(f'Logistic Regression Model\nAccuracy: {accu_log:.2f}%.\nMSE: {mse_log:.2f}')

Logistic Regression Model
Accuracy: 49.83%.
MSE: 1.16
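
An accuracy figure is easier to interpret next to a trivial reference point. One possible check, sketched here on synthetic labels, is to score a classifier that always predicts the most frequent training class. The label arrays and class counts below are assumptions for illustration and say nothing about the real class distribution of `Outcome`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: three classes with a 50/30/20 split.
y_train_demo = np.repeat([0, 1, 2], [500, 300, 200])
y_test_demo = np.repeat([0, 1, 2], [50, 30, 20])

# The dummy model ignores the features, so placeholder columns are enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during training.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_demo, y_train_demo)

floor_acc = accuracy_score(y_test_demo, dummy.predict(X_test_demo)) * 100
print(f'Majority-class accuracy: {floor_acc:.2f}%')
```

A model that does not beat this floor has learned little beyond the class frequencies.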


#### Option 2) Linear Regression

Train model.

In [40]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

Evaluate the model on the test set. The continuous predictions are rounded to the nearest integer so that they can be compared with the class labels, then the MSE, accuracy and R2 score are calculated on the rounded values.

In [41]:
y_pred_lin = np.rint(lr.predict(X_test))
mse_lin = mean_squared_error(y_test, y_pred_lin)
accu_lin = accuracy_score(y_test, y_pred_lin) * 100
r2_lin = r2_score(y_test, y_pred_lin)
print(f'Linear Regression Model\nAccuracy: {accu_lin:.2f}%.\nMSE: {mse_lin:.2f}\nR2 score: {r2_lin:.2f}')

Linear Regression Model
Accuracy: 34.46%.
MSE: 0.69
R2 score: -0.03
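
One caveat with rounding regression output is that it can produce values outside the set of valid labels. The sketch below shows this on made-up numbers and clips the result to the label range; the values in `raw_pred` and the label set `labels` are hypothetical, and clipping is an optional extra step that the notebook does not perform.

```python
import numpy as np

# Hypothetical continuous predictions and a hypothetical set of valid labels.
raw_pred = np.array([-0.7, 0.2, 1.5, 2.4, 3.6])
labels = np.array([0, 1, 2])

# np.rint rounds to the nearest integer (ties go to the nearest even value).
rounded = np.rint(raw_pred)

# Clip to the valid label range so every prediction maps to a real class.
clipped = np.clip(rounded, labels.min(), labels.max()).astype(int)

print(rounded)
print(clipped)
```

Without the clipping step, out-of-range predictions simply count as misclassifications in `accuracy_score`.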

#### Conclusion:
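The baseline model is a simple logistic regression, which reaches 49.83% accuracy on the test set, compared with 34.46% for the rounded linear regression. The final model should therefore be a classification model, since that fits the outcome better than a regression.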