### Training a Model Locally - A
Mico Ellerich M. Comia

This notebook trains scikit-learn's Logistic Regression model to predict a binary output from a multi-dimensional input. No hyperparameter optimization was applied, so the default values were used as is.

---

- SELECT 2 MACHINE LEARNING ALGORITHMS 
- FOR EACH OF THE ALGORITHMS
    - PERFORM TRAINING ON THE TRAINING DATASET
    - EVALUATE ON THE VALIDATION DATASET
    - TEST THE TRAINED MODEL ON THE TEST SET
    - SAVE THE MODEL USING JOBLIB (OR ALTERNATIVE)
- COMPARE THE “PERFORMANCE” OF THE 2 MODELS USING THE EVALUATION METRICS

---

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
import joblib
import time

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

In [None]:
pd.options.display.float_format = '{:,.2f}'.format
warnings.filterwarnings(action="ignore")

### I. Import dataset splits
---

First, we import the generated synthetic dataset from the previous notebook using Pandas' read_csv. This imports the CSV files as dataframes.

In [2]:
X_train = pd.read_csv('data/X_train.csv')
X_test = pd.read_csv('data/X_test.csv')
X_val = pd.read_csv('data/X_val.csv')
y_train = pd.read_csv('data/y_train.csv')
y_test = pd.read_csv('data/y_test.csv')
y_val = pd.read_csv('data/y_val.csv')
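
Because read_csv returns a DataFrame even when the file holds a single label column, it can help to check the shapes after loading. The following is a minimal sketch that uses small in-memory CSV text in place of the real files; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-ins for data/X_train.csv and data/y_train.csv.
x_csv = io.StringIO("f0,f1\n0.1,1.2\n0.4,0.8\n0.9,0.3\n")
y_csv = io.StringIO("target\n0\n1\n1\n")

X_demo = pd.read_csv(x_csv)
y_demo = pd.read_csv(y_csv)

# read_csv returns a 2-D DataFrame, even for a single column.
print(X_demo.shape, y_demo.shape)

# Flatten the single label column into a 1-D array for use as a target.
y_flat = y_demo.to_numpy().ravel()
print(y_flat.shape)

# Features and labels should have the same number of rows.
assert len(X_demo) == len(y_flat)
```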

### II. Training the Logistic Regression model
---

#### A. Training on the training dataset

Using the fit method, we use the training set split to train our model. 

In [4]:
logistic_model = LogisticRegression()
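# read_csv returns y_train as a one-column DataFrame; flatten it to the 1-D target that fit expects
logistic_model.fit(X_train, y_train.values.ravel())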

LogisticRegression()

Since we did not use automated hyperparameter tuners or assign different hyperparameter values, the model uses its default values. The get_params method shows us these defaults.

In [15]:
logistic_model.get_params()

{'C': 1.00,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.00,
 'verbose': 0,
 'warm_start': False}
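
For contrast, the sketch below shows how overriding a default would appear in get_params. The values C=0.5 and max_iter=200 are arbitrary illustrations, not tuned settings and not something this notebook uses.

```python
from sklearn.linear_model import LogisticRegression

default_model = LogisticRegression()
# Illustrative values only; they are not tuned for this dataset.
custom_model = LogisticRegression(C=0.5, max_iter=200)

default_params = default_model.get_params()
custom_params = custom_model.get_params()

# Collect the parameters whose values differ between the two models,
# as (default value, custom value) pairs.
changed = {
    name: (default_params[name], custom_params[name])
    for name in default_params
    if default_params[name] != custom_params[name]
}
print(changed)
```

Only the parameters passed to the constructor should differ; everything else keeps its default.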

In [26]:
logistic_pred_train =  logistic_model.predict(X_train)

In [28]:
logi_train_scores = [metrics.accuracy_score(y_train, logistic_pred_train)*100,
                     metrics.precision_score(y_train, logistic_pred_train)*100,
                     metrics.recall_score(y_train, logistic_pred_train)*100] 

df_logi_train = pd.DataFrame(logi_train_scores, columns = ['Scores'], index = ['Accuracy', 'Precision', 'Recall'])
df_logi_train

| | Scores |
|---|---|
| Accuracy | 89.33 |
| Precision | 86.30 |
| Recall | 94.57 |


The trained model achieved an accuracy of 89.33% on the training set. We can check whether the model is overfitting or underfitting once we compare these values with the validation and test set scores.

---
#### B. Evaluate on the validation set

To evaluate the performance of our model, we use the accuracy, precision, and recall metrics. Higher values are desirable for all three. The evaluation steps are the same for both the validation and test sets. In practice, the validation set is typically used for hyperparameter tuning and model selection, for example alongside cross-validation techniques.
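
As a reminder of what the three metrics measure, the sketch below computes them by hand from confusion-matrix counts and compares the results against scikit-learn's functions. The labels are toy values made up for illustration, not taken from the dataset.

```python
import math

from sklearn import metrics

# Toy labels, made up for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

# The hand-computed values should agree with scikit-learn's implementations.
assert math.isclose(accuracy, metrics.accuracy_score(y_true, y_pred))
assert math.isclose(precision, metrics.precision_score(y_true, y_pred))
assert math.isclose(recall, metrics.recall_score(y_true, y_pred))
```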

In [13]:
logistic_pred_val =  logistic_model.predict(X_val)

In [20]:
logi_val_scores = [metrics.accuracy_score(y_val, logistic_pred_val)*100,
                   metrics.precision_score(y_val, logistic_pred_val)*100,
                   metrics.recall_score(y_val, logistic_pred_val)*100] 

df_logi_val = pd.DataFrame(logi_val_scores, columns = ['Scores'], index = ['Accuracy', 'Precision', 'Recall'])
df_logi_val

| | Scores |
|---|---|
| Accuracy | 90.50 |
| Precision | 85.19 |
| Recall | 96.84 |


For our validation set, we can see that the trained model attained respectable scores, garnering an accuracy of 90.50%.

---
#### C. Testing on test set

In [23]:
logistic_pred_test =  logistic_model.predict(X_test)

In [24]:
logi_test_scores = [metrics.accuracy_score(y_test, logistic_pred_test)*100,
                    metrics.precision_score(y_test, logistic_pred_test)*100,
                    metrics.recall_score(y_test, logistic_pred_test)*100] 

df_logi_test = pd.DataFrame(logi_test_scores, columns = ['Scores'], index = ['Accuracy', 'Precision', 'Recall'])
df_logi_test

| | Scores |
|---|---|
| Accuracy | 88.50 |
| Precision | 84.00 |
| Recall | 92.31 |


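The differences between the three sets are small: accuracy varies by at most two percentage points (89.33%, 90.50%, and 88.50%), and precision and recall by under five. Since the training scores are not markedly higher than the validation and test scores, the model does not appear to be overfitting, and the consistently high scores suggest it is not underfitting either.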
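
The comparison can be made explicit with a few lines of pandas. This sketch re-enters the scores reported above by hand; the names spread and train_test_gap are introduced here only for illustration.

```python
import pandas as pd

# Scores reported above, in percent.
scores = pd.DataFrame(
    {
        "Training": [89.33, 86.30, 94.57],
        "Validation": [90.50, 85.19, 96.84],
        "Test": [88.50, 84.00, 92.31],
    },
    index=["Accuracy", "Precision", "Recall"],
)

# Spread: highest minus lowest score for each metric across the three splits.
spread = scores.max(axis=1) - scores.min(axis=1)

# Training minus test score; a large positive gap would hint at overfitting.
train_test_gap = scores["Training"] - scores["Test"]

print(spread.round(2))
print(train_test_gap.round(2))
```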

---
#### D. Saving metrics and model

For future use and reference, we save the scores of the model and the model itself. For the metrics, we concatenate the different scores into a single dataframe and export it as a CSV file with a timestamp in its name. We also use IPython's store magic so that we can access the dataframe in other notebooks.

In [35]:
# Getting the current time to serve as timestamps
timestr = time.strftime("%m%d-%H%M")

In [None]:
df_logi_scores = pd.concat([df_logi_train, df_logi_val, df_logi_test], axis = 1)
df_logi_scores.columns = ['Training','Validation','Test']
df_logi_scores

In [None]:
metrics_filename = 'model/results/logistic_' + timestr + '.csv'
# Keep the index so the metric names (Accuracy, Precision, Recall) are written to the CSV
df_logi_scores.to_csv(metrics_filename, index = True)
%store df_logi_scores
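
To read a saved metrics file back later, the row labels have to be present in the CSV. The following is a minimal round-trip sketch, assuming the index is written as the first column; it uses a temporary directory and a shortened, hand-entered table rather than the notebook's real output path.

```python
import os
import tempfile
import time

import pandas as pd

# Hand-entered subset of the scores, for illustration only.
scores = pd.DataFrame(
    {"Training": [89.33, 94.57], "Test": [88.50, 92.31]},
    index=["Accuracy", "Recall"],
)

# Same timestamp format as used in the notebook.
timestr = time.strftime("%m%d-%H%M")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logistic_" + timestr + ".csv")
    # to_csv writes the index by default, so the metric names are kept.
    scores.to_csv(path)
    # index_col=0 restores the first column as the row labels.
    restored = pd.read_csv(path, index_col=0)

print(restored)
```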

Likewise, we do the same for the model itself, attaching the timestamp for easy reference.

In [7]:
# Logistic regression model saving
model_filename = 'model/logistic_' + timestr + '.sav'
joblib.dump(logistic_model, model_filename)

['model/logistic_0528-1815.sav']
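
A saved model can be restored with joblib.load. The sketch below trains a throwaway model on synthetic data from make_classification and writes it to a temporary directory, so the data, the file name logistic_demo.sav, and the variable names are all stand-ins rather than this notebook's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the dataset used in this notebook.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "logistic_demo.sav")
    joblib.dump(model, path)
    # Load the model back from disk while the file still exists.
    restored_model = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), restored_model.predict(X))
print(same_predictions)
```

Since joblib files are based on pickle, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.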