# Predict Total Loss Based on Structured Claim Data

## Establish some parameters

In [None]:
# use firstname-lastname, or initials, or username
# this will be used as a prefix for your training job name, making it easier for you to find the job
# later on using the console
USER_PREFIX = '<your user name goes here>' 

## First, generate a dataset

We synthesize a binary classification dataset to stand in for structured insurance claim data. Of the 13 features, only two are informative; the rest are noise.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import make_classification

plt.figure(figsize=(8, 8))
plt.subplots_adjust(bottom=.05, top=.9, left=.05, right=.95)

plt.subplot(321)
plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_samples=1000, n_features=13, n_redundant=0, n_informative=2,
                             n_classes=2, n_clusters_per_class=1, shuffle=False,
                             class_sep=2.0)

# scatter plot of the first 2 features, highlighting separation of Loss/Not-Loss classes
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1,
            s=25, edgecolor='k')

plt.show()

## Next, split the dataset into 3 subsets: training, testing, and validation

Here we split the synthetic dataset into three subsets: 70% for training, 20% for validation, and 10% for testing.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# split data into train and test sets
seed = 7
val_size  = 0.20
test_size = 0.10

# Give 70% to train
X_train, X_test, y_train, y_test = train_test_split(X1, Y1, 
                                                    test_size=(test_size + val_size), 
                                                    random_state=seed)

# Of the remaining 30%, give 2/3 to validation and 1/3 to test.
# train_test_split returns the (1 - test_size) part first, so validation is unpacked first.
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, 
                                                test_size=(test_size / (test_size + val_size)),
                                                random_state=seed)
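
The fractions above translate into row counts roughly as follows. This is a plain-Python sketch, assuming a fractional `test_size` is rounded up to a whole number of rows; the helper name `held_out_rows` is hypothetical and not part of scikit-learn.

```python
import math

def held_out_rows(n_samples, fraction):
    """Rows placed in the held-out part, assuming the count is rounded up."""
    return math.ceil(fraction * n_samples)

n_samples = 1000
val_size, test_size = 0.20, 0.10

# First split: hold out the test and validation rows together.
n_held = held_out_rows(n_samples, test_size + val_size)
n_train = n_samples - n_held

# Second split: test takes its share of the held-out rows, validation gets the rest.
n_test = held_out_rows(n_held, test_size / (test_size + val_size))
n_val = n_held - n_test

print(n_train, n_val, n_test)
```

Because `0.10 + 0.20` is not exactly `0.30` in floating point, the counts can differ by a row from an exact 700/200/100 split. Passing integer sizes to `train_test_split` avoids that.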

## Now, normalize the features

In [None]:
# standardize every subset using the mean and std of the training set only
mean = X_train.mean(axis=0)
X_train -= mean
std = X_train.std(axis=0)
X_train /= std

X_test -= mean
X_test /= std

X_val -= mean
X_val /= std
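
The same standardization can be written as a pair of small helpers, which makes the key point explicit: the mean and standard deviation come from the training set only and are reused for the other subsets. This is a sketch rather than part of the original notebook. The names `fit_standardizer` and `apply_standardizer` are hypothetical, the random arrays are stand-ins for the real data, and the guard against a zero standard deviation is an added assumption for constant columns.

```python
import numpy as np

def fit_standardizer(train):
    """Compute per-feature mean and std from the training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Assumption: leave constant columns unscaled instead of dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_standardizer(data, mean, std):
    """Return a standardized copy; the input array is not modified."""
    return (data - mean) / std

# Stand-in data with the same number of features as the notebook's dataset.
rng = np.random.default_rng(7)
train = rng.normal(loc=5.0, scale=2.0, size=(700, 13))
test = rng.normal(loc=5.0, scale=2.0, size=(100, 13))

mean, std = fit_standardizer(train)
train_n = apply_standardizer(train, mean, std)
test_n = apply_standardizer(test, mean, std)
```

Unlike the in-place `-=` and `/=` operations in the cell above, these helpers return new arrays, so the raw data stays available for inspection.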

Display the shape of each dataset and show a sample observation and target.

In [None]:
print('Train shape: {}, Test shape: {}, Val shape: {}'.format(X_train.shape, 
                                                              X_test.shape, X_val.shape))
print('Train target: {}, Test target: {}, Val target: {}'.format(y_train.shape, 
                                                                 y_test.shape, y_val.shape))

print('\nSample observation: {}\nSample target: {}'.format(X_test[0], y_test[0]))

## Save the datasets in files to feed the training script

Here we use CSV format for files that will be passed to training. We could have just as easily used binary numpy files.

In [None]:
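import os
import pandas as pd

data_dir = os.path.join(os.getcwd(), 'data')
print('Local data dir: {}'.format(data_dir))

# make sure the per-channel folders exist before writing to them
for subset in ('train', 'test', 'val'):
    os.makedirs(os.path.join(data_dir, subset), exist_ok=True)
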
xtrain = pd.DataFrame(X_train)
xtrain.to_csv(f'{data_dir}/train/xtrain.csv', header=None, index=False)
ytrain = pd.DataFrame(y_train)
ytrain.to_csv(f'{data_dir}/train/ytrain.csv', header=None, index=False)

xtest = pd.DataFrame(X_test)
xtest.to_csv(f'{data_dir}/test/xtest.csv', header=None, index=False)
ytest = pd.DataFrame(y_test)
ytest.to_csv(f'{data_dir}/test/ytest.csv', header=None, index=False)

xval = pd.DataFrame(X_val)
xval.to_csv(f'{data_dir}/val/xval.csv', header=None, index=False)
yval = pd.DataFrame(y_val)
yval.to_csv(f'{data_dir}/val/yval.csv', header=None, index=False)
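
For comparison, here is a minimal sketch of the binary NumPy alternative mentioned above. It writes to a temporary directory rather than the notebook's `data` folder, the small arrays are stand-ins, and the `.npy` file names are assumptions. A training script would have to read these with `np.load` instead of parsing CSV.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays; in the notebook these would be X_train and y_train.
X = np.arange(12, dtype=np.float64).reshape(4, 3)
y = np.array([0, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    train_dir = os.path.join(tmp_dir, 'train')
    os.makedirs(train_dir, exist_ok=True)

    # np.save writes one array per .npy file, preserving dtype and shape.
    np.save(os.path.join(train_dir, 'xtrain.npy'), X)
    np.save(os.path.join(train_dir, 'ytrain.npy'), y)

    # Read the arrays back to confirm the round trip.
    X_loaded = np.load(os.path.join(train_dir, 'xtrain.npy'))
    y_loaded = np.load(os.path.join(train_dir, 'ytrain.npy'))
```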

## Prepare the training script

When using Amazon SageMaker's script mode, you provide a training script. In our example, we provide a complete script that you can see below. If you want to experiment with the script, you can edit it directly in the `scripts` folder.

In [1]:
!pygmentize 'scripts/loss_train.py'

import os
import argparse
from os import listdir
from os.path import isfile, join
import tensorflow as tf
from tensorflow.keras import models, layers
import numpy as np

def prep_data():
    base_dir = os.environ.get('SM_INPUT_DIR') + '/data'
    import pandas as pd
    xtest_df = pd.read_csv(f'{base_dir}/test/xtest.csv', header=None)
    

## Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location and creating a SageMaker training job. 
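
In script mode, hyperparameters reach the training script as command-line arguments, and each data channel is exposed through an `SM_CHANNEL_<NAME>` environment variable. The sketch below shows one way a script might read them. It is not the contents of `loss_train.py`; the function name `parse_args` and the default values are assumptions made for illustration.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse hyperparameters and channel locations the way a script-mode script might."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as --key value pairs.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--data_dir', type=str, default='/opt/ml/input/data')
    # Channel directories are published as environment variables inside the container.
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--val', type=str, default=os.environ.get('SM_CHANNEL_VAL'))
    # Ignore any extra arguments the framework may pass.
    args, _ = parser.parse_known_args(argv)
    return args

# Simulate the arguments produced by {'epochs': 35, 'data_dir': '/opt/ml/input/data'}.
args = parse_args(['--epochs', '35', '--data_dir', '/opt/ml/input/data'])
print(args.epochs, args.data_dir)
```

Using `parse_known_args` keeps the script tolerant of arguments it does not declare.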

## Prepare to initiate training

Here we get ready to kick off the training. We start by using local mode, which executes training directly in the context of your notebook instance. This is a handy way to iteratively build and debug your training script without the overhead of creating separate training infrastructure. 

This matters most for large-scale training, where each remote attempt is slow and costly, so you would typically iterate on a subset of the data. Once the script works reliably, switch to the standard mode for training and hosting, which automatically provisions infrastructure on your behalf. SageMaker also tears down the training infrastructure when training completes, so you are not charged for idle servers.
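
One simple way to build such a subset is to sample rows at random while keeping features and targets aligned. This is a sketch under stated assumptions: the helper `sample_rows` and the 10% fraction are hypothetical choices, not part of the original notebook.

```python
import numpy as np

def sample_rows(X, y, fraction, seed=7):
    """Return a random subset of rows, keeping X and y aligned."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(X) * fraction))
    # Sample row indices without replacement so no row appears twice.
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx], y[idx]

# Stand-in data: row i holds the values [2i, 2i + 1] and has target i % 2.
X_full = np.arange(200).reshape(100, 2)
y_full = np.arange(100) % 2

X_small, y_small = sample_rows(X_full, y_full, fraction=0.10)
print(X_small.shape, y_small.shape)
```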

To train in local mode, docker-compose (or nvidia-docker-compose for GPU) must be installed on the notebook instance. Running the following script installs the appropriate tool and configures the notebook environment for you.

In [None]:
!/bin/bash ./setup.sh

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

from sagemaker.tensorflow import TensorFlow

In [None]:
local = True  # start in local mode; set to False later for remote training
if local:
    train_instance_type = 'local'
    serve_instance_type = 'local'
else:
    train_instance_type = 'ml.c5.xlarge' 
    serve_instance_type = 'ml.m4.xlarge'

hyperparameters = {'epochs': 35, 'data_dir': '/opt/ml/input/data'}

loss_estimator = TensorFlow(entry_point='loss_train.py',
                       source_dir='scripts',
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       metric_definitions=[
                           {'Name' : 'validation:acc', 
                            'Regex': '.*step.* - val_acc: (\\S+)\n'},
                           {'Name' : 'validation:loss', 
                            'Regex': '- val_loss: (.*?) '}],
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(), # Pass notebook role to container
                       framework_version='1.12',
                       py_version='py3',
                       base_job_name=USER_PREFIX,
                       script_mode=True)

# In training script, you have to save the model in 'saved model' format to use TF serving
#https://www.tensorflow.org/guide/saved_model#structure_of_a_savedmodel_directory
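
To see what the two `metric_definitions` patterns capture, it helps to run them against a log line. The sketch below uses a made-up line in the style Keras prints at the end of an epoch; the numbers are invented and the helper `extract_metric` is hypothetical.

```python
import re

# Same patterns as in metric_definitions above.
VAL_ACC_REGEX = '.*step.* - val_acc: (\\S+)\n'
VAL_LOSS_REGEX = '- val_loss: (.*?) '

# Hypothetical log line; the values are invented for illustration.
sample_log = ('700/700 [==============================] - 0s 63us/step'
              ' - loss: 0.4012 - acc: 0.8500'
              ' - val_loss: 0.3521 - val_acc: 0.8800\n')

def extract_metric(pattern, text):
    """Return the first captured group as a float, or None when nothing matches."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None

val_acc = extract_metric(VAL_ACC_REGEX, sample_log)
val_loss = extract_metric(VAL_LOSS_REGEX, sample_log)
print(val_acc, val_loss)
```

The first pattern only matches lines that contain the text `step` and end with a newline, so it depends on the training script logging Keras progress in that format.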

Now start training by calling the `fit` method, passing in the locations of the train, test, and val data channels. In local mode, training starts immediately. In non-local mode, expect to wait several minutes while the infrastructure is launched; it is removed automatically once training finishes.

In [None]:
if (local):
    loss_estimator.fit({'train': f'file://{data_dir}/train',
                    'test' : f'file://{data_dir}/test',
                    'val'  : f'file://{data_dir}/val'}) 
else:
    # upload the files to the s3 bucket
    s3_base = sagemaker_session.upload_data(path=data_dir, 
                                           bucket=sagemaker_session.default_bucket(),
                                           key_prefix='loss')
    print(s3_base)
    loss_estimator.fit({'train': f'{s3_base}/train',
                    'test' : f'{s3_base}/test',
                    'val'  : f'{s3_base}/val'})

## Deploy the model for real-time inference

With training complete, we now deploy the model for real-time inference. The same command either deploys locally or stands up dedicated hosting infrastructure, depending on the `instance_type` parameter. In non-local mode, this takes several minutes.

When deploying locally, you may get some errors. You can ignore them safely as long as the prediction call works in the next cell.

In [None]:
loss_predictor = loss_estimator.deploy(initial_instance_count=1, 
                                       instance_type=serve_instance_type,
                                       endpoint_type='tensorflow-serving')

## Make predictions

Now that the endpoint is deployed, we execute a set of inferences against the testing subset of our data. The identical interface is supported for local and non-local modes.

In [None]:
results = loss_predictor.predict(X_test)
print('Results: {}\n'.format(results))

Here we display the results, highlighting which predictions were accurate.

In [None]:
tmp_results = results['predictions']
fail_count = 0
test_count = len(X_test)
for i in range(test_count):
    if (tmp_results[i][0] > 0.5):
        class_predict = 1
    else:
        class_predict = 0
    if (class_predict == y_test[i]):
        result = 'PASS'
    else:
        result = '*FAIL'
        fail_count += 1
    print('Prediction: {:.3f}, Target: {}, Result: {}'.format(tmp_results[i][0], 
                                                              y_test[i],
                                                              result))
print('Tests: {}, Fails: {}'.format(test_count, fail_count))
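
The thresholding logic in the loop above can also be packaged as a small function, which makes it easy to check on a toy response before calling a real endpoint. This is a sketch: `score_predictions` is a hypothetical helper, and the sample response only mimics the `{'predictions': [[p], ...]}` shape used in the cell above.

```python
def score_predictions(predictions, targets, threshold=0.5):
    """Turn probabilities into 0/1 labels and count the mismatches."""
    predicted = [1 if row[0] > threshold else 0 for row in predictions]
    fail_count = sum(1 for p, t in zip(predicted, targets) if p != t)
    accuracy = 1.0 - fail_count / len(predicted)
    return predicted, fail_count, accuracy

# Toy response with invented probabilities; the third one is a deliberate miss.
sample_response = {'predictions': [[0.91], [0.12], [0.55], [0.49]]}
sample_targets = [1, 0, 0, 0]

predicted, fail_count, accuracy = score_predictions(
    sample_response['predictions'], sample_targets)
print('Tests: {}, Fails: {}, Accuracy: {:.2f}'.format(
    len(predicted), fail_count, accuracy))
```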

## Clean up

We can now delete our endpoint. This is important in non-local mode, as you will otherwise be billed for an idle endpoint.

In [None]:
if not local:
    sagemaker.Session().delete_endpoint(loss_predictor.endpoint)

## After local mode, perform remote training, deployment, and inference

The first time through the notebook, try everything in local mode. Once you have completed that successfully, go back to the cell that defines `local` and set that to `False`. Re-run the rest of the notebook to experience non-local mode training, deployment, and inference. Be sure to clean up when complete.