# 3a- Training with built-in LinearLearner <a class="anchor" id="top"></a>
* [Introduction](#intro)
* [Setup](#setup)
* [Estimator creation](#estim)
    * [Define and train estimator](#define)
* [Create endpoint to test model](#endpoint)
* [Evaluate training results](#eval)
* [Cleanup resources](#clean)

## Introduction <a class="anchor" id="intro"></a>
In this notebook, we will train a Linear Learner model, evaluate the training performance, and output model artifacts.
Linear Learner fits linear models; for binary classification, as used here, the model is a logistic regression.
We will be using Amazon's built-in Linear Learner implementation, which has an internal model hyperparameter tuning mechanism.
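
To make the idea of a logistic regression concrete, here is a minimal sketch of how such a model turns features into a probability and a label. The weights, bias, and feature values are hypothetical and chosen only for illustration; this is not the internal implementation of the built-in algorithm.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class under a linear (logistic) model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_label(weights, bias, features, threshold=0.5):
    """Turn the probability into a 0/1 label using a decision threshold."""
    return int(predict_proba(weights, bias, features) >= threshold)


# Hypothetical weights and features, for illustration only.
example_weights = [0.8, -1.2]
example_bias = 0.1
print(predict_proba(example_weights, example_bias, [1.0, 0.5]))
print(predict_label(example_weights, example_bias, [1.0, 0.5]))
```

The decision threshold is the quantity that a model selection criterion adjusts when it trades precision against recall.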

## Setup <a class="anchor" id="setup"></a>
First, we import SageMaker SDK dependencies as well as the modules used in the application below.
We also get relevant sessions and read in local environment data.

In [1]:
import json
import uuid
import boto3
import random
import tarfile
import pickle as pkl
import datetime as dt
import sagemaker as sm

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
sns.set_style("darkgrid")

In [2]:
sm_session = sm.Session()
role = sm.get_execution_role()
boto3_session = boto3.session.Session()
now = dt.datetime.now().strftime(r"%Y%m%dT%H%M%S")

In [3]:
# Get boto3 session attributes.
account = boto3_session.client("sts").get_caller_identity()["Account"]
region = boto3_session.region_name

# Create clients to access S3.
s3_client = boto3_session.client("s3")
s3_resource = boto3_session.resource("s3")

In [4]:
# Retrieve data and model bucket names.
with open("/home/ec2-user/.aiml-bb/stack-data.json", "r") as f:
    data = json.load(f)
    data_bucket = data["data_bucket"]
    model_bucket = data["model_bucket"]
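
If the stack data file is missing a key, the cell above fails with a bare `KeyError`. A small sketch of a loader that reports missing keys explicitly is shown here. The helper name `load_stack_data` and the demo bucket names are hypothetical, and the demo writes a temporary file so that it runs anywhere.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("data_bucket", "model_bucket")


def load_stack_data(path):
    """Load the stack data file and check that the expected keys are present."""
    with open(path, "r") as f:
        data = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return data


# Demo with a temporary file and made-up bucket names.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stack-data.json")
    with open(demo_path, "w") as f:
        json.dump({"data_bucket": "example-data", "model_bucket": "example-models"}, f)
    demo = load_stack_data(demo_path)
    print(demo["data_bucket"], demo["model_bucket"])
```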

## Estimator creation <a class="anchor" id="estim"></a>
We can now create the Linear Learner estimator, using Amazon's built-in implementation.
Because the model is managed for us, there is little to do in the way of setup.

### Define and train estimator <a class="anchor" id="define"></a>
Here we create the `Estimator` object, and all resources that are required to do so.

In [5]:
# Get Linear Learner container image for current region.
ll_container_image = sm.image_uris.retrieve("linear-learner", region)

# Create a unique training job name.
training_job_name = f"ll-{str(uuid.uuid4())[:8]}"
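
Training job names are restricted to alphanumeric characters and hyphens, with at most 63 characters; any other character, such as a quote, causes the job request to be rejected. The sketch below checks a generated name before it is used. The pattern is the one we assume from the `CreateTrainingJob` API documentation, and `make_job_name` is a hypothetical helper.

```python
import re
import uuid

# Naming rule assumed from the CreateTrainingJob API documentation.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix: str) -> str:
    """Build a unique job name and verify it against the naming rule."""
    name = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid training job name: {name!r}")
    return name


print(make_job_name("ll"))
```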

In [9]:
train_input = sm.inputs.TrainingInput(
    s3_data=f"s3://{data_bucket}/preprocessing_output/train/", 
    content_type="text/csv"
)
validation_input = sm.inputs.TrainingInput(
    s3_data=f"s3://{data_bucket}/preprocessing_output/validation/",
    content_type="text/csv"
)
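
For `text/csv` input, the built-in algorithms expect files without a header row and with the label in the first column, which is also how the inference loop later in this notebook splits each line. A minimal sketch of producing that layout is shown below; the helper name `to_training_csv` and the sample rows are hypothetical.

```python
import csv
import io


def to_training_csv(rows):
    """Render (label, features) pairs as CSV text: label first, no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, features in rows:
        writer.writerow([label, *features])
    return buffer.getvalue()


# Made-up rows, for illustration only.
sample = to_training_csv([(1, [0.5, 2.0]), (0, [1.5, 0.0])])
print(sample)
```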

In [10]:
# Create estimator running the Linear Learner container.
ll_estimator = sm.estimator.Estimator(
    ll_container_image,
    role, 
    instance_count=1, 
    instance_type="ml.m5.4xlarge",
    volume_size=50,
    output_path=f"s3://{model_bucket}/sagemaker-linear-learner/"
)

In [None]:
# Define starting hyperparameters for the model.
ll_estimator.set_hyperparameters(
    predictor_type="binary_classifier",
    binary_classifier_model_selection_criteria="precision_at_target_recall",
)
ll_estimator.fit(
    {"train": train_input, "validation": validation_input}
)
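
The `precision_at_target_recall` criterion asks for the decision threshold with the highest precision among those that still reach the target recall (the default `target_recall` is 0.8, as the training log shows). The sketch below illustrates the meaning of that criterion on made-up labels and scores; it is not the container's internal implementation.

```python
def precision_at_target_recall(labels, scores, target_recall=0.8):
    """Return (threshold, precision, recall) with the best precision
    among thresholds whose recall is at least target_recall."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("At least one positive label is required.")
    best = None
    for threshold in sorted(set(scores)):
        predicted = [int(s >= threshold) for s in scores]
        tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
        recall = tp / total_positives
        # tp + fp > 0 because the threshold is itself one of the scores.
        precision = tp / (tp + fp)
        if recall >= target_recall and (best is None or precision > best[1]):
            best = (threshold, precision, recall)
    return best


# Hypothetical labels and scores, for illustration only.
example_labels = [0, 0, 1, 1, 1]
example_scores = [0.1, 0.6, 0.4, 0.7, 0.9]
best = precision_at_target_recall(example_labels, example_scores)
print(best)
```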

2022-01-27 06:38:53 Starting - Starting the training job...
2022-01-27 06:38:55 Starting - Launching requested ML instancesProfilerReport-1643265533: InProgress
......
2022-01-27 06:40:14 Starting - Preparing the instances for training............
2022-01-27 06:42:15 Downloading - Downloading input data..................
2022-01-27 06:45:27 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[01/27/2022 06:45:33 INFO 140589668607808] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scal

## Create endpoint to test model <a class="anchor" id="endpoint"></a>
To test the model, we must now create an endpoint to which we can send the test data that was set aside during preprocessing.

In [None]:
# Create model and endpoint from the fitted estimator above.
# The artifact is written under a job-specific prefix of the output path,
# so take its location from the estimator instead of hard-coding it.
ll_model = sm.model.Model(
    image_uri=ll_container_image,
    model_data=ll_estimator.model_data,
    role=role
)
endpoint_name = f"ll-test-endpt-{now}"
ll_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name=endpoint_name
)
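
A training job writes its model artifact below the estimator's output path, under a prefix named after the job: `<output_path>/<job_name>/output/model.tar.gz`. The sketch below builds that location from its parts. The helper name, bucket name, and job name are hypothetical.

```python
def model_artifact_uri(output_path: str, job_name: str) -> str:
    """Build the S3 URI where a training job stores model.tar.gz."""
    return f"{output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


# Made-up bucket and job name, for illustration only.
example_uri = model_artifact_uri(
    "s3://example-model-bucket/sagemaker-linear-learner/",
    "ll-1a2b3c4d",
)
print(example_uri)
```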

In [None]:
# Connect a predictor to the endpoint for inference.
# Linear Learner returns JSON, so deserialize responses into dictionaries.
ll_predictor = sm.predictor.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sm_session,
    serializer=sm.serializers.CSVSerializer(
        content_type="text/csv"
    ),
    deserializer=sm.deserializers.JSONDeserializer()
)

In [None]:
# Iterate over testing data and collect predictions.
# Assumes the test split was written under "preprocessing_output/test/",
# mirroring the train/ and validation/ prefixes used above.
list_objs_response = s3_client.list_objects_v2(
    Bucket=data_bucket, 
    Prefix="preprocessing_output/test/"
)

# Lists to keep track of results.
max_samples = 100_000
test_actuals = []
test_predictions = []
for obj in list_objs_response["Contents"]:
    # Stop once enough samples have been collected.
    if len(test_actuals) >= max_samples:
        break

    # Iterate over lines in object contents via stream.
    obj_resource = s3_resource.Object(data_bucket, obj["Key"])
    for line in obj_resource.get()["Body"].iter_lines():
        target, features = line.decode("utf-8").split(",", maxsplit=1)
        response = ll_predictor.predict(features.strip())
        prediction = response["predictions"][0]["predicted_label"]

        test_actuals.append(float(target))
        test_predictions.append(float(prediction))

        if len(test_actuals) >= max_samples:
            break
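
For a binary classifier, a Linear Learner endpoint answers with a JSON document that carries a `score` and a `predicted_label` per input row. The sketch below parses such a response body offline. The sample payload is made up and `parse_binary_prediction` is a hypothetical helper.

```python
import json


def parse_binary_prediction(response_body):
    """Extract (predicted_label, score) for the first row of a JSON response."""
    payload = json.loads(response_body)
    first = payload["predictions"][0]
    return int(first["predicted_label"]), float(first["score"])


# Made-up response body in the shape described above.
sample_response = b'{"predictions": [{"score": 0.93, "predicted_label": 1}]}'
label, score = parse_binary_prediction(sample_response)
print(label, score)
```

Keeping the score alongside the label is useful for threshold-dependent analyses such as a ROC curve.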

## Evaluate training results <a class="anchor" id="eval"></a>
Lastly, we evaluate the results of training against the testing data set.
Note that this data set is not included in training and has never been seen by the model.

In [None]:
# Wrap lists in numpy arrays for analysis.
test_actuals_np = np.array(test_actuals)
test_predictions_np = np.array(test_predictions)

In [None]:
# Compute summary statistics on performance.
performance_statistics = {
    "accuracy": metrics.accuracy_score(test_actuals_np, test_predictions_np),
    "precision": metrics.precision_score(test_actuals_np, test_predictions_np),
    "recall": metrics.recall_score(test_actuals_np, test_predictions_np),
    "f1": metrics.f1_score(test_actuals_np, test_predictions_np),
    "auc": metrics.roc_auc_score(test_actuals_np, test_predictions_np),
}
print(json.dumps(performance_statistics, indent=4))
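
To show what the summary statistics above measure, here is a plain-Python sketch that derives accuracy, precision, recall, and F1 from the four confusion counts. The sample labels are made up, and returning 0.0 for an undefined ratio is a convention chosen for this sketch.

```python
def binary_metrics(actuals, predictions):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(actuals, predictions))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
example_actuals = [1, 1, 1, 0, 0, 0, 0, 1]
example_predictions = [1, 1, 0, 0, 0, 1, 0, 1]
example_metrics = binary_metrics(example_actuals, example_predictions)
print(example_metrics)
```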

In [None]:
# Compute confusion matrix.
confusion_df = pd.crosstab(
    test_actuals_np, 
    test_predictions_np, 
    rownames=["Actuals"], 
    colnames=["Predictions"]
)
# Normalize each row by its total, so that rows (actual classes) sum to 1.
norm_confusion_df = confusion_df.div(confusion_df.sum(axis=1), axis=0)

# Show confusion matrix.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    norm_confusion_df, 
    vmin=0.0, vmax=1.0, annot=True, fmt=".2f", 
    ax=ax
)
ax.set_title("Confusion matrix of testing results")
plt.show()
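
In a row-normalized confusion matrix, every row is divided by its own total, so each row sums to 1 and the diagonal holds the per-class recall. A small plain-Python sketch of that normalization follows; the counts are made up.

```python
def row_normalize(matrix):
    """Divide every row by its own sum; empty rows stay at zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Hypothetical counts: rows are actual classes, columns are predicted classes.
counts = [[90, 10], [5, 15]]
rates = row_normalize(counts)
print(rates)
```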

In [None]:
# Compute ROC curve.
fpr, tpr, thresholds = metrics.roc_curve(test_actuals_np, test_predictions_np)

# Plot ROC curve.
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, label='ROC')
ax.plot([0, 1], [0, 1], linestyle='--')
ax.set_xlabel('False positive rate')
ax.set_ylabel('True positive rate')
ax.set_title('Receiver operating characteristic curve')
ax.legend()
plt.show()
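
A ROC curve is traced by sweeping the decision threshold over the model's scores and recording the false and true positive rates at each step. The sketch below does this in plain Python on made-up data and integrates the curve with the trapezoidal rule. It also shows that hard 0/1 predictions leave only a single operating point between the corners, which is why continuous scores give a more informative curve.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points, starting at (0, 0), for descending thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("Both classes must be present.")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and scores, for illustration only.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
score_points = roc_points(example_labels, example_scores)
print(score_points, area_under_curve(score_points))

# The same data after thresholding at 0.5: only one intermediate point remains.
hard_labels = [int(s >= 0.5) for s in example_scores]
hard_points = roc_points(example_labels, hard_labels)
print(hard_points)
```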

## Cleanup resources <a class="anchor" id="clean"></a>
Because this is a temporary project and a running endpoint continues to incur charges, we delete the endpoint.

In [None]:
sm_session.delete_endpoint(endpoint_name)
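
Deleting an endpoint leaves its endpoint configuration and model registered in the account. The sketch below removes all three with the boto3 SageMaker client operations `describe_endpoint`, `describe_endpoint_config`, `delete_endpoint`, `delete_endpoint_config`, and `delete_model`. The helper `delete_endpoint_resources` is hypothetical, and the stub class is only an offline stand-in for the client so that the example runs without AWS access.

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name):
    """Delete an endpoint together with its configuration and models."""
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sagemaker_client.describe_endpoint_config(
        EndpointConfigName=config_name
    )
    model_names = [v["ModelName"] for v in config["ProductionVariants"]]

    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sagemaker_client.delete_model(ModelName=model_name)
    return config_name, model_names


class StubSageMakerClient:
    """Offline stand-in for boto3.client("sagemaker"), used only in this sketch."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": f"{EndpointName}-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "example-model"}]}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.calls.append(("delete_model", ModelName))


stub = StubSageMakerClient()
deleted = delete_endpoint_resources(stub, "ll-test-endpt-example")
print(deleted, stub.calls)
```

In the notebook, the same function could be called with `boto3_session.client("sagemaker")` and `endpoint_name` in place of the stub.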