## Train, tune, deploy and review ML algorithm/model VBfFA (Variational Bayesian filtering Factor Analysis) from AWS Marketplace

#### Overview of the algorithm  
  
The variational Bayesian filtering factor analysis (VBfFA) algorithm/model is a dimension-reduction (or rank-reduction) filter that extracts a number of continuously evolving, unobserved common factors (signals from common sources) that underlie and influence a large number of related time-series.
  
Relevant examples of time-series data include: economic indicators in a nation, a region, or an international economic sector; prices of assets in a national, regional or global asset class(es), or marketplace; performance measurement time-series with various lags related to a business marketing campaign; and time-series signals from an array of radar or sonar sensors tracking a number of moving targets; etc.
  
By applying (variational) Bayesian filtering instead of the traditional moving/rolling data windows of frequentist time-dependent analysis, the VBfFA algorithm can update its predictions using only the newly arrived time-series data point rather than all data points in a data window. As a result, it speeds up the real-time prediction process, predicts underlying changes in the time-series early, and avoids over- or under-fitting by setting a reasonable “estimation error reduction target”.
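
To make the contrast between a rolling window and a filtering update concrete, here is a minimal sketch. It is not the VBfFA formulation: it assumes a scalar Gaussian random-walk level with hypothetical variances (`drift_var`, `noise_var`) and only illustrates that a filter needs just the newest data point, whereas a rolling estimate re-reads its whole window.

```python
from collections import deque


def rolling_mean(window):
    """Rolling-window estimate: re-reads every point held in the window."""
    return sum(window) / len(window)


class ScalarBayesFilter:
    """Conjugate Gaussian filter for a slowly drifting level.

    Illustrative assumption: a scalar random-walk model, not the VBfFA equations.
    """

    def __init__(self, mean=0.0, var=1.0, drift_var=0.01, noise_var=1.0):
        self.mean, self.var = mean, var
        self.drift_var, self.noise_var = drift_var, noise_var

    def update(self, y):
        # Predict: the level may have drifted since the last observation.
        prior_var = self.var + self.drift_var
        # Correct: weight the new observation against the prior belief.
        gain = prior_var / (prior_var + self.noise_var)
        self.mean = self.mean + gain * (y - self.mean)
        self.var = (1.0 - gain) * prior_var
        return self.mean


# Hypothetical data with a level shift half-way through.
data = [0.1, -0.2, 0.05, 0.3, 1.1, 1.3, 0.9, 1.2]
window = deque(maxlen=4)
bayes = ScalarBayesFilter()
for y in data:
    window.append(y)
    print(f"rolling={rolling_mean(window):+.3f}  filtered={bayes.update(y):+.3f}")
```

The filter keeps only two numbers (posterior mean and variance) between updates, which is the property that makes one-point-at-a-time updating possible.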
  
In addition to serving as a stand-alone filtering package for time-varying factor analysis on multiple time-series data, the VBfFA algorithm will be employed as the underlying factor analysis engine of other machine learning packages introduced earlier by i4cast LLC: LMDFM (long memory dynamic factor model); YWpcAR (Yule-Walker-PCA autoregressive model); LMVAR (long memory vector autoregressive model); and CTVARF (continuously trained vector autoregressive forecast model).
  
The current version of the VBfFA algorithm estimates: the time-series of posteriors from the (variational) Bayesian filtering; predicted values and time-dependent variances of common factors (or common signals) and time-dependent factor loadings; the predicted time-dependent (or time-varying) variance-covariance matrix of multiple time-series; and evaluation scores of the predicted time-series of the variance-covariance matrix.
  
A notable application of the VBfFA estimates is the timely detection of changes in the time-dependent variance-covariance matrix of financial instruments. Presented in a particular form, it can serve as an early warning system indicating potential trouble in the financial market.
  
This VBfFA module implements the VBfFA formulation presented in Figure 1 through Figure 4 of a paper in the journal Quantitative Finance, https://doi.org/10.1080/14697688.2016.1268708, or, https://www.tandfonline.com/doi/abs/10.1080/14697688.2016.1268708, or, https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/publication/VBfFA_Publication.pdf, or in the manuscript of the publication, https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/publication/VBfFA_Manuscript.pdf.

#### Publications on variational Bayesian filtering factor analysis (VBfFA) modeling  

HFL & CF. (2017) "Online learning of time-varying stochastic factor structure by variational sequential Bayesian factor analysis", Quantitative Finance, Vol. 17 (8), pp. 1277-1304. Publication: https://doi.org/10.1080/14697688.2016.1268708, or, https://www.tandfonline.com/doi/abs/10.1080/14697688.2016.1268708, or, https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/publication/VBfFA_Publication.pdf. Manuscript: https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/publication/VBfFA_Manuscript.pdf.

#### This notebook
   
This sample notebook shows you how to train, tune, deploy and understand a custom ML algorithm/model: [variational Bayesian filtering factor analysis (VBfFA)](https://aws.amazon.com/marketplace/pp/prodview-vdwcbntcsnu72?sr=0-5&ref_=beagle&applicationId=AWSMPContessa), guided by common practices to [Use Algorithm and Model Package Resources](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-mkt-buy.html).
   
> **Note**: This is a reference notebook and it cannot run unless you make changes suggested in the notebook.

#### Pre-requisites

1. **Note**: This notebook contains elements which render correctly in Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has **AmazonSageMakerFullAccess**.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    1. or your AWS account has a subscription to [variational Bayesian filtering factor analysis (VBfFA)](https://aws.amazon.com/marketplace/pp/prodview-vdwcbntcsnu72?sr=0-5&ref_=beagle&applicationId=AWSMPContessa)

#### Contents

1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
    1. [Subscription](#1.1.-Subscription)
    1. [Prepare relevant environment](#1.2.-Prepare-relevant-environment)
1. [Prepare dataset](#2.-Prepare-dataset)
    1. [Dataset format expected by the algorithm](#2.1.-Dataset-format-expected-by-the-algorithm)
    1. [Configure and visualize training dataset](#2.2.-Configure-and-visualize-training-dataset)
    1. [Upload datasets to Amazon S3](#2.3.-Upload-datasets-to-Amazon-S3)
1. [Train a machine learning model](#3.-Train-a-machine-learning-model)
    1. [Set hyperparameters](#3.1.-Set-hyperparameters)
    1. [Train a model](#3.2.-Train-a-model)
    1. [Update a model](#3.3.-Update-a-model-with-"trained-model-retrieval")
1. [Tune your model (optional)](#4.-Tune-your-model-(optional))
    1. [Tuning Guidelines](#4.1.-Tuning-guidelines)
    1. [Define Tuning configuration](#4.2.-Define-tuning-configuration)
    1. [Run a model tuning job](#4.3.-Run-a-model-tuning-job)
1. [Deploy model and verify results](#5.-Deploy-model-and-verify-results)
    1. [Trained or tuned model](#5.1.-Trained-or-tuned-model)
    1. [Deploy trained or tuned model](#5.2.-Deploy-trained-or-tuned-model)
    1. [Create input payload](#5.3.-Create-input-payload)
    1. [Perform real-time inference](#5.4.-Perform-real-time-inference)
1. [Perform Batch inference](#6.-Perform-batch-inference)
    1. [Batch transform](#6.1.-Batch-transform)
    1. [Delete the model](#6.2.-Delete-the-model)
1. [Model review by using Transformer (optional)](#7.-Model-review-by-using-Transformer-(optional))
    1. [VBfFA predictions and goodness scores](#7.1.-VBfFA-predictions-and-goodness-scores-of-the-predictions)
    1. [Select prediction or score for review](#7.2.-Select-prediction-or-score-for-review)
    1. [Model output review with Transformer](#7.3.-Model-output-review-with-Transformer)
1. [Clean-up](#8.-Clean-up)
    1. [Delete endpoint and model](#8.1.-Delete-endpoint-and-model)
    1. [Unsubscribe to the listing (optional)](#8.2.-Unsubscribe-to-the-listing-(optional))

#### Usage instructions

You can run this notebook one cell at a time (press Shift+Enter to run a cell).

#### Sagemaker Notebook

For readers who would like to review how to use SageMaker notebooks in general, the following SageMaker documentation pages are good resources.  
    [Get Started with Amazon SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html)  
    [Step 1: Create an Amazon SageMaker Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)  
    [Step 2: Create a Jupyter Notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html)  
    [Step 3: Download, Explore, and Transform a Dataset](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data.html)  
    [Step 4: Train a Model](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model.html)  
    [Step 5: Deploy the Model to Amazon EC2](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-model-deployment.html)  
    [Step 6: Evaluate the Model](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-test-model.html)  
    [Step 7: Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html)

### 1. Subscribe to the algorithm

#### 1.1. Subscription

To subscribe to the algorithm:
  
1. Open the algorithm listing page,
[variational Bayesian filtering factor analysis (VBfFA)](https://aws.amazon.com/marketplace/pp/prodview-vdwcbntcsnu72?sr=0-5&ref_=beagle&applicationId=AWSMPContessa)
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review and click **"Accept Offer"** if you agree with the EULA, pricing, and support terms. 
1. Once you click the **Continue to configuration** button and then choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify while training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [None]:
# get the newest boto3
get_ipython().run_line_magic('pip', 'install --upgrade boto3')

In [None]:
# specify your valid algorithm ARN
# my_algorithm_arn = 'arn:aws:sagemaker:{region}:123456789012:algorithm/{vbffa_algorithm}'
my_algorithm_arn = 'arn:aws:sagemaker:{}:{}:algorithm/{}'.format(
    'your_region', 'your_aws_account_number', 'your_vbffa_algorithm_label')
my_algorithm_arn = 'arn:aws:sagemaker:us-east-1:123456789012:algorithm/vbffa'
my_prefix = 'vbffa'

#### 1.2. Prepare relevant environment

In [None]:
# Python packages
import sagemaker
import os

# remind
print('Wait for Sagemaker values assigned to TWO important variables: my_bucket and my_role.\n')

# sagemaker session
my_session = sagemaker.session.Session()

# sagemaker attributes
my_bucket = my_session.default_bucket()
my_role = sagemaker.session.get_execution_role()

# review
print('my_bucket = {}'.format(my_bucket))
print('my_role = {}'.format(my_role))

To run this SageMaker machine learning ('ml') notebook example, the following S3 folders are expected to be in place:

1. {my_bucket}/{my_prefix}/input/data/train/
1. {my_bucket}/{my_prefix}/input/data/inference/
1. {my_bucket}/{my_prefix}/model/
1. {my_bucket}/{my_prefix}/output/data/inference/

In [None]:
# aws s3 paths
my_input_data_train_path = 's3://{}/{}/input/data/train'.format(my_bucket, my_prefix)
my_model_path = 's3://{}/{}/model'.format(my_bucket, my_prefix)
my_input_data_infer_path = 's3://{}/{}/input/data/inference'.format(my_bucket, my_prefix)
my_output_data_infer_path = 's3://{}/{}/output/data/inference'.format(my_bucket, my_prefix)

# vbffa Docker container input data channel
input_data_train_channel = 'train'
input_data_model_channel = 'model'

# aws computing instance type: 'ml.m5.xlarge'
my_EC2 = 'ml.m5.xlarge'

# input CSV data file name
my_input_data_file = 'Weekly_VTS_4Yr.csv'

# information available model and endpoint
my_model_data = str()  # to be assigned / defined
my_model_name = str()  # to be assigned / defined 
my_endpoint_name = 'my-endpoint'

If you are revisiting this demo notebook, and your model training job and/or hyperparameter tuning job (to be defined later) has already run successfully at least once, you can avoid running the job(s) again: copy the resulting SageMaker S3 path of your trained model artifact data file, model.tar.gz, and/or the S3 path of your tuned model artifact data file, model.tar.gz, into the variables my_trained_model_data and/or my_tuned_model_data in the cell below.

In [None]:
# trained model placeholder
# my_trained_model_data = str()
my_trained_model_data = str()
my_trained_model_name = 'my-trained-model'

# AVAILABLE trained model
# IF your model is trained and not to be trained again, copy-paste or type the s3 path of
# your trained model artifact data file as the value of variable my_trained_model_data
# my_trained_model_data = 's3://{my_bucket}/{my_prefix}/model/{some_path}/model.tar.gz'
my_trained_model_data = ''

# review
print('Model artifact data file of trained model:')
print(my_trained_model_data)
print('Name of trained model:')
print(my_trained_model_name)

# ----------------------------------------------------------------------------------------------------

# tuned model placeholder
# my_tuned_model_data = str()
my_tuned_model_data = str()
my_tuned_model_name = 'my-tuned-model'

# AVAILABLE tuned model
# IF your model is tuned and not to be tuned again, copy-paste or type the s3 path of
# your tuned model artifact data file as the value of variable my_tuned_model_data
# my_tuned_model_data = 's3://{my_bucket}/{my_prefix}/model/{some_path}/model.tar.gz'
my_tuned_model_data = ''

# review
print('\nModel artifact data file of tuned model:')
print(my_tuned_model_data)
print('Name of tuned model:')
print(my_tuned_model_name)

### 2. Prepare dataset

#### 2.1. Dataset format expected by the algorithm

The VBfFA (variational Bayesian filtering factor analysis) algorithm/model takes, as input data, multiple time-series data contained in a CSV (comma separated value) data table, in the form of a CSV text-string or a CSV text-file.  
    
Each row of the data table is for values of an individual time-series (TS). Row header is the label or symbol of the time-series.  
  
Each column is for values of all time-series at a specific moment in time. Column header is the time-index or time-stamp of the moment. The first data column is for the earliest data point and the last column for the most recent data point. The current version of VBfFA requires equally spaced time-stamps.  
    
Since VBfFA performs factor analysis on multiple time-series, that is, on a vector time-series, the input data is essentially in the format of "Row Time-Series of Column Vector".  
    
One of the simplest methods to generate such a CSV text-file is to save a Microsoft Excel spreadsheet as (or into) a CSV file.  
    
You can also find more information about dataset format in **Usage Information** section of [variational Bayesian filtering factor analysis (VBfFA)](https://aws.amazon.com/marketplace/pp/prodview-[xxx999]=beagle&applicationId=AWSMPContessa)  
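
As an illustration of the layout described above, the following sketch writes a small table with one row per time-series and one column per time-stamp, and checks that the time-stamps are equally spaced. The fund labels, the values, and the empty top-left corner cell are hypothetical choices made for this example; consult the product's usage information for the exact header conventions.

```python
import csv
import io
from datetime import date, timedelta

# Hypothetical labels and values, for illustration only.
start, num_weeks = date(2021, 12, 10), 4
stamps = [(start + timedelta(weeks=k)).isoformat() for k in range(num_weeks)]
values = {
    "FUND_A": [0.011, -0.004, 0.007, 0.002],
    "FUND_B": [0.003, 0.001, -0.002, 0.005],
    "FUND_C": [-0.006, 0.009, 0.004, -0.001],
}


def to_csv_text(stamps, values, corner=""):
    """Rows are time-series, columns are time-stamps (earliest first)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([corner] + list(stamps))      # column headers
    for symbol, row in values.items():
        if len(row) != len(stamps):
            raise ValueError(f"{symbol}: expected {len(stamps)} values")
        writer.writerow([symbol] + list(row))     # row header, then data
    return buf.getvalue()


def equally_spaced(stamps):
    """True when consecutive ISO dates are all the same number of days apart."""
    days = [date.fromisoformat(s) for s in stamps]
    gaps = {(b - a).days for a, b in zip(days, days[1:])}
    return len(gaps) <= 1


csv_text = to_csv_text(stamps, values)
print(csv_text)
print("equally spaced:", equally_spaced(stamps))
```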

#### 2.2. Configure and visualize training dataset

The [sample data](https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/input/Weekly_VTS_4Yr.csv) provided with this product/example contains six years of weekly (logarithmic) performance of mutual funds traded in the U.S. and invested in equities, fixed income, and commodities. Each row corresponds to an individual mutual fund. Each column corresponds to a specific calendar week in history. The last week (the last column) has the time-stamp "2021-12-31". By following the simple steps below, you can upload this sample data to your AWS S3 folder.  

#### 2.3. Upload datasets to Amazon S3

To download the sample dataset from https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/input/Weekly_VTS_4Yr.csv, and then upload the dataset to
  
1. {my_bucket}/{my_prefix}/input/data/train/ for training
1. {my_bucket}/{my_prefix}/input/data/inference/ for inference
  
the following simple steps can be used:

1. Open webpage https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/input/Weekly_VTS_4Yr.csv
1. Click the [Raw] option located at the top right of the data table
1. In the Raw data window, right-click and choose [Save as]
1. Set the local file folder and file name in the "Save As" window, then click [Save]
  
1. Open AWS S3 Console
1. Go to S3 folder: {my_bucket}/{my_prefix}/input/data/train/
1. Upload the saved local data file to your AWS S3 folder
1. Go to S3 folder: {my_bucket}/{my_prefix}/input/data/inference/
1. Upload the saved local data file to your AWS S3 folder
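
The manual steps above can also be scripted. The sketch below is one possible approach under stated assumptions: `to_raw_url` and `upload_sample` are hypothetical helpers written for this example, the raw-content address follows GitHub's usual `raw.githubusercontent.com` pattern, and `s3_client` is expected to behave like a boto3 S3 client (only its `upload_file(Filename, Bucket, Key)` method is used).

```python
def to_raw_url(blob_url):
    """Convert a github.com '/blob/' page URL to its raw-content URL."""
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))


def upload_sample(s3_client, local_file, bucket, prefix, file_name):
    """Upload one local file to both the train and the inference folders."""
    keys = [f"{prefix}/input/data/{channel}/{file_name}"
            for channel in ("train", "inference")]
    for key in keys:
        s3_client.upload_file(local_file, bucket, key)
    return keys


sample_page = ("https://github.com/i4cast/aws/blob/main/"
               "variational_Bayesian_filtering_factor_analysis/input/Weekly_VTS_4Yr.csv")
print(to_raw_url(sample_page))

# Possible use inside the notebook (needs network access and AWS credentials,
# so it is left commented out here):
# import boto3, urllib.request
# urllib.request.urlretrieve(to_raw_url(sample_page), my_input_data_file)
# upload_sample(boto3.client('s3'), my_input_data_file,
#               my_bucket, my_prefix, my_input_data_file)
```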

### 3. Train a machine learning model

#### 3.1. Set hyperparameters



Hyperparameters of the VBfFA algorithm/model are for model learning and inference.  
   
The parameter, len_leaveout_window >= 0, is the length of the LAST data window of the input vector time-series whose data points are left out of model fitting and inference (to be used later for model validation and/or testing).  
  
Here, we set a larger len_leaveout_window (e.g. 145) for the initial model training and re-set a smaller len_leaveout_window (e.g. 132) for the second training. The TWO separate model trainings illustrate the capability of the VBfFA algorithm to update a trained model when new data points become available.  
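
The effect of the two settings can be sketched with simple index arithmetic. This is only an illustration of the description above: `training_span` is a hypothetical helper, and `num_columns = 208` is an assumed number of weekly columns, not a property read from the sample file.

```python
def training_span(num_columns, len_leaveout_window):
    """Column indices available for fitting when the LAST window is held out."""
    if not 0 <= len_leaveout_window <= num_columns:
        raise ValueError("len_leaveout_window must be between 0 and num_columns")
    return range(num_columns - len_leaveout_window)


num_columns = 208  # hypothetical number of weekly columns in the input table
first = training_span(num_columns, 145)
second = training_span(num_columns, 132)
print("initial training uses columns 0 ..", first[-1])
print("updated training uses columns 0 ..", second[-1])
print("new data points seen by the update:", len(second) - len(first))
```

Whatever the table length, lowering len_leaveout_window from 145 to 132 releases 13 additional data points to the model update.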
    
You can also find more information about hyperparameters in the **Hyperparameters** section of [variational Bayesian filtering factor analysis (VBfFA)](https://aws.amazon.com/marketplace/pp/prodview-[xxx999]=beagle&applicationId=AWSMPContessa).

A training oriented format of hyperparameters is presented and utilized in this section.

In [None]:
# define hyperparameters
# all individual elements must be individual strings
my_hyperparam = {
    'num_factors': '10',
    'error_reduct_target': '0.9',
    'num_data_points': '52',
    'num_va_iteration': '5',
    'len_moving_window': '52',
    'ts_standardization': 'exp',
    # 'len_leaveout_window': '145',
    'len_leaveout_window': '132',
    'max_len_output_ts': '156',
    'score_target_type': 'S',
    'max_predict_step': '3',
    'weight_dict': "dict: {}".format({'*': 1.0}),
    'max_num_ts_add_del' : '2'
}

# metric_list
metric_list = [
    'avg_fitvar', 'avg_aggvar', 'avg_zscore',
    'avg_bias', 'avg_loglik', 'avg_qstat',
    'diff_avg_fitvar', 'diff_avg_aggvar', 'diff_avg_zscore',
    'diff_avg_bias', 'diff_avg_loglik', 'diff_avg_qstat']

# metrics (all individual elements must be individual strings)
my_metrics = list()
for metric in metric_list:
    my_metrics.append(dict({
        'Name': '{}'.format(metric),
        'Regex': '{}=(.*?);'.format(metric)
    }))

# review
print('Hyperparameters: my_hyperparam =')
print(my_hyperparam)

# review
print('\nEvaluation metrics: my_metrics =')
print(my_metrics)
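
To see how a metric definition of this form picks a value out of a training log, consider the sketch below. The log lines are hypothetical; the actual format emitted by the algorithm container may differ. The second part shows a caveat of this regex shape under that assumed format: `avg_loglik=` is also a substring of `diff_avg_loglik=`, so the order of entries on a line can matter unless a word boundary is used.

```python
import re

metric_names = ["avg_loglik", "diff_avg_loglik"]
definitions = [{"Name": n, "Regex": "{}=(.*?);".format(n)} for n in metric_names]

# Hypothetical log line; the real container output may look different.
log_line = "avg_loglik=-1.2345; diff_avg_loglik=0.0123;"

extracted = {}
for d in definitions:
    match = re.search(d["Regex"], log_line)
    extracted[d["Name"]] = float(match.group(1)) if match else None
print(extracted)

# Caveat under the same assumed format, with the entries in the other order.
reordered = "diff_avg_loglik=0.0123; avg_loglik=-1.2345;"
plain = re.search(r"avg_loglik=(.*?);", reordered).group(1)
bounded = re.search(r"\bavg_loglik=(.*?);", reordered).group(1)
print("without word boundary:", plain)
print("with word boundary:   ", bounded)
```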

In [None]:
# create an estimator object for running a training job
# Information on sagemaker.algorithm.AlgorithmEstimator():
# https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html
my_estimator = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=my_algorithm_arn,
    role=my_role,
    instance_count=1,
    instance_type=my_EC2,
    input_mode='File',
    output_path=my_model_path,
    base_job_name='my-training-job',
    sagemaker_session=my_session,
    hyperparameters=my_hyperparam,
    model_channel_name='model',
    metric_definitions=my_metrics
)

With the above my_estimator object, the method my_estimator.fit(inputs=my_training_input) will be called either to train a new model or to update an existing trained model.  
  
Both training and updating a model need input vector time-series data. Updating a trained model needs one additional input: the previously trained model artifacts generated by an earlier model training/updating job.  
  
The argument of fit(), my_training_input, defined in the following cell makes the input data (located in s3 path my_input_data_train_path) available in channel "train", and makes the previously trained model artifacts (located in s3 path my_model_path) available in channel "model".  

In [None]:
# Information on sagemaker.inputs.TrainingInput():
# https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html
my_training_input = dict(
{
    input_data_train_channel:
        sagemaker.inputs.TrainingInput(
            # input vector time-series data
            s3_data=my_input_data_train_path,
            content_type='text/csv',
            s3_data_type='S3Prefix',
            input_mode='File'),
    
    input_data_model_channel:
        sagemaker.inputs.TrainingInput(
            # previous trained model artifacts
            s3_data=my_model_path,
            content_type='application/gzip',
            s3_data_type='S3Prefix',
            input_mode='File')
})

#### 3.2. Train a model

In the following cell, set the boolean indicator, run_training_job, to TRUE, in order to
1. run VBfFA model training job
1. save model artifacts of trained model

In [None]:
# run_training_job = True | False
run_training_job = False

When training a NEW model, we do NOT have (and do NOT need) previously trained model artifacts. Therefore, in my_training_input, the argument of fit(), we assign "s3_data=my_input_data_train_path" to both channels, "train" and "model".  
  
After you set run_training_job to TRUE above and start the model training job in the cell below, you can use the waiting time to re-set run_training_job back to FALSE, in order to avoid accidentally running the model training job again.  

In [None]:
# if TRUE then train the model and save the result
if run_training_job and not my_trained_model_data:
    
    # prompt
    print()
    print(my_estimator.hyperparameters())
    print()
    print('Setting a larger value to len_leaveout_window for initial model training')
    
    # https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html
    # sagemaker.estimator.Estimator().set_hyperparameters(**kwargs)
    my_estimator.set_hyperparameters(len_leaveout_window="145")
    
    # prompt
    print()
    print(my_estimator.hyperparameters())
    print()
    
    # training a NEW model
    # NOTE: CHANGE s3 path of input_data_model_channel in order
    #       to prevent getting any model data from my_model_path
    my_training_input = dict(
    {
        input_data_train_channel:
            sagemaker.inputs.TrainingInput(
                # input vector time-series data
                s3_data=my_input_data_train_path,
                content_type='text/csv',
                s3_data_type='S3Prefix',
                input_mode='File'),
        
        input_data_model_channel:
            sagemaker.inputs.TrainingInput(
                # NOT previous trained model artifacts
                s3_data=my_input_data_train_path,
                content_type='application/gzip',
                s3_data_type='S3Prefix',
                input_mode='File')
    })
    
    # remind
    print('Train the model. Wait for training job completes with information:')
    print('Model data of trained model\n')
    
    # Information on sagemaker.algorithm.AlgorithmEstimator().fit()
    # https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html
    my_estimator.fit(
        inputs=my_training_input,
        wait=True,
        logs='All')
    
    # model data information
    my_trained_model_data = my_estimator.model_data
    
    # review
    print('\nModel data of trained model:')
    print(my_trained_model_data)

For more information on how to visualize metrics during the process, see [Easily monitor and visualize metrics while training models on Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/).

You can also open the training job from [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics/logs in **Monitor** section.

#### 3.3. Update a model with "trained model retrieval"

A variational Bayesian filtering factor analysis (VBfFA) model is continuously fitted or updated by filtering: when a new data point becomes available after the model has been fitted, the model is updated based on this new data point and the existing trained model.  
   
The trained model artifact data file, e.g. {my_bucket}/{my_prefix}/model/{some_path}/model.tar.gz, in this example, generated at the end of the training process contains all model artifacts based on the input vector time-series data points earmarked as available for the model training.  
    
When new data points of the time-series become available after the training ended, an inference process with the new data and trained model will make an additional model fitting and updating first, and then make inference based on the fully updated model artifacts.  
     
But the inference process does not generate a new model.tar.gz file to be utilized later again.  
  
Fortunately, the argument, inputs=my_training_input, of the model training method, sagemaker.algorithm.AlgorithmEstimator().fit(), is able to implement a "trained model retrieval" mechanism for updating a trained model: both (a) the newly available data points and (b) the previous trained model artifacts can be fed together into the model updating process.  
  
The argument of fit(), my_training_input, defined in the following cell will make input data (located in s3 path my_input_data_train_path) available in channel "train", and make previously trained model artifacts (located in s3 path my_model_path) available in channel "model". The only requirement for a successful trained model retrieval process is to keep only the latest model data file and delete the older one(s).  
  
This way, an updated trained model data file, model.tar.gz, will be generated containing the newly updated model artifacts.  
     
In the VBfFA modeling example shown in this notebook, we can delete the older model.tar.gz file(s) in the s3 model path, my_model_path = 's3://{}/{}/model'.format(my_bucket, my_prefix), keeping only the latest model data file, and then run the model fitting function, my_estimator.fit(), again.  
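
Deciding which model.tar.gz to keep can be sketched as follows. `split_latest` is a hypothetical helper for this example; it assumes a listing of dicts with `Key` and `LastModified` entries, which is the shape of the `Contents` list returned by boto3's `list_objects_v2`. The keys and dates below are made up.

```python
from datetime import datetime


def split_latest(objects, suffix="model.tar.gz"):
    """Return (latest_key, older_keys) among keys ending with `suffix`."""
    artifacts = sorted(
        (o for o in objects if o["Key"].endswith(suffix)),
        key=lambda o: o["LastModified"])
    if not artifacts:
        return None, []
    return artifacts[-1]["Key"], [o["Key"] for o in artifacts[:-1]]


# Hypothetical listing of the model prefix.
listing = [
    {"Key": "vbffa/model/my-training-job-001/output/model.tar.gz",
     "LastModified": datetime(2022, 1, 7, 9, 0)},
    {"Key": "vbffa/model/my-training-job-002/output/model.tar.gz",
     "LastModified": datetime(2022, 1, 14, 9, 0)},
]
latest, older = split_latest(listing)
print("keep:  ", latest)
print("delete:", older)
```

Reviewing the `older` list before deleting anything is advisable, since deleting S3 objects cannot be undone in a bucket without versioning.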

In [None]:
# if TRUE then train the model and save the result
if run_training_job:
    
    print()
    print(my_estimator.hyperparameters())
    print()
    print('Now, re-setting a smaller value to len_leaveout_window for updating model')
    
    # https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html
    # sagemaker.estimator.Estimator().set_hyperparameters(**kwargs)
    my_estimator.set_hyperparameters(len_leaveout_window="132")
    
    # prompt
    print()
    print(my_estimator.hyperparameters())
    print()
    
    # updating a TRAINED model
    # NOTE: KEEP ONLY the last model data in s3 model path of my_model_path
    # NOTE: DELETE ALL older model data in s3 model path of my_model_path
    my_training_input = dict(
    {
        input_data_train_channel:
            sagemaker.inputs.TrainingInput(
                # input vector time-series data
                s3_data=my_input_data_train_path,
                content_type='text/csv',
                s3_data_type='S3Prefix',
                input_mode='File'),
        
        input_data_model_channel:
            sagemaker.inputs.TrainingInput(
                # previous trained model artifacts
                s3_data=my_model_path,
                content_type='application/gzip',
                s3_data_type='S3Prefix',
                input_mode='File')
    })
    
    # remind
    print('Train the model. Wait for training job completes with information:')
    print('Model data of trained model\n')
    
    # Information on sagemaker.algorithm.AlgorithmEstimator().fit()
    # https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html
    my_estimator.fit(
        inputs=my_training_input,
        wait=True,
        logs='All')
    
    # model data information
    my_trained_model_data = my_estimator.model_data
    
    # review
    print('\nModel data of trained model:')
    print(my_trained_model_data)

### 4. Tune your model (optional)

#### 4.1. Tuning guidelines

Modeling and/or predicting different sets of multiple time-series require different values of hyperparameters: num_factors, error_reduct_target, num_data_points, num_va_iteration, and len_moving_window. 
   
Therefore, decisions on specific (integer or decimal) values of these hyperparameters need to be made before making meaningful training and inference. There are a variety of commonly practiced methods to estimate the appropriate hyperparameter values. When using AWS Sagemaker, it is natural to use Sagemaker's HyperparameterTuner class to search for appropriate hyperparameter values which result in better forecasts.  
   
For information about Automatic model tuning, also see [Perform Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)  

In [None]:
# define hyperparameters
# all individual elements must be individual strings
my_hyperparam = {
    'num_factors': '10',
    'error_reduct_target': '0.9',
    'num_data_points': '52',
    'num_va_iteration': '5',
    'len_moving_window': '52',
    'ts_standardization': 'exp',
    # 'len_leaveout_window': '145',
    'len_leaveout_window': '132',
    'max_len_output_ts': '156',
    'score_target_type': 'S',
    'max_predict_step': '3',
    'weight_dict': "dict: {}".format({'*': 1.0}),
    'max_num_ts_add_del' : '2'
}

# metric_list
metric_list = [
    'avg_fitvar', 'avg_aggvar', 'avg_zscore',
    'avg_bias', 'avg_loglik', 'avg_qstat',
    'diff_avg_fitvar', 'diff_avg_aggvar', 'diff_avg_zscore',
    'diff_avg_bias', 'diff_avg_loglik', 'diff_avg_qstat']

# metrics (all individual elements must be individual strings)
my_metrics = list()
for metric in metric_list:
    my_metrics.append(dict({
        'Name': '{}'.format(metric),
        'Regex': '{}=(.*?);'.format(metric)
    }))

# create an estimator object for running a training job
# Information on sagemaker.algorithm.AlgorithmEstimator():
# https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html
my_estimator = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=my_algorithm_arn,
    role=my_role,
    instance_count=1,
    instance_type=my_EC2,
    input_mode='File',
    output_path=my_model_path,
    base_job_name='my-training-job',
    sagemaker_session=my_session,
    hyperparameters=my_hyperparam,
    model_channel_name='model',
    metric_definitions=my_metrics
)

#### 4.2. Define tuning configuration

Possible ranges of appropriate hyperparameter values depend on the specific dataset at hand. For the sample dataset used in this example, a set of reasonable ranges of hyperparameter values is as follows.

In [None]:
# Information on sagemaker.parameter.IntegerParameter():
# https://sagemaker.readthedocs.io/en/stable/api/training/parameter.html
tuning_hyperpar_range_example = dict({
    'num_factors':
        sagemaker.parameter.IntegerParameter(
        min_value=5, max_value=50, scaling_type='Auto'),
    'error_reduct_target':
        sagemaker.parameter.ContinuousParameter(
        min_value=0.75, max_value=0.99, scaling_type='Auto'),
    'num_data_points':
        sagemaker.parameter.IntegerParameter(
        min_value=10, max_value=100, scaling_type='Auto'),
    'num_va_iteration':
        sagemaker.parameter.IntegerParameter(
        min_value=2, max_value=10, scaling_type='Auto'),
    'len_moving_window':
        sagemaker.parameter.IntegerParameter(
        min_value=52, max_value=157, scaling_type='Auto')
})

The natural seasonality of the time-series and some "rule of thumb for choices" may be utilized to focus on a few reasonable values within reasonable ranges. The following example can be used for a simpler model tuning.

For general information about AWS SageMaker Hyperparameter Tuning, refer to [How Hyperparameter Tuning Works](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html) and [Define Hyperparameter Ranges](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html).

In [None]:
# Information on sagemaker.parameter.CategoricalParameter():
# https://sagemaker.readthedocs.io/en/stable/api/training/parameter.html
my_hyperparam_range = dict({
    'num_factors':
        sagemaker.parameter.CategoricalParameter(['5', '10', '20', '30']),
    'error_reduct_target':
        sagemaker.parameter.CategoricalParameter(['0.85', '0.9', '0.95']),
    'num_data_points':
        sagemaker.parameter.CategoricalParameter(['26', '52', '78']),
    'num_va_iteration':
        sagemaker.parameter.CategoricalParameter(['5', '10']),
    'len_moving_window':
        sagemaker.parameter.CategoricalParameter(['52', '65'])
})
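
To get a feel for the size of this categorical search space, the candidate values can be enumerated in plain Python. The following is a minimal sketch that mirrors the values above as ordinary lists; the name `candidate_values` is hypothetical and not part of the notebook, and the code does not call SageMaker.

```python
import itertools

# Hypothetical plain-Python mirror of the categorical choices above
candidate_values = {
    'num_factors': ['5', '10', '20', '30'],
    'error_reduct_target': ['0.85', '0.9', '0.95'],
    'num_data_points': ['26', '52', '78'],
    'num_va_iteration': ['5', '10'],
    'len_moving_window': ['52', '65'],
}

# Every combination of one value per hyperparameter
names = list(candidate_values)
grid = [dict(zip(names, combo))
        for combo in itertools.product(*candidate_values.values())]

print('Number of distinct combinations:', len(grid))
print('First combination:', grid[0])
```

A tuning job whose max_jobs is much smaller than the number of combinations explores only a small part of this grid.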

Different modeling and inference applications need different metrics to measure the relevant goodness of fit.  
  
With the VBfFA (variational Bayesian filtering factor analysis) model, we try to predict a number of unobserved common factors underlying a large set of related time-series.  
  
Covariance between time-series can be estimated from the common factors, while variance of each time-series can be estimated from both the common factors and the residual variance. Therefore, evaluation scores of the time-varying (or time-dependent) variance-covariance matrix of the time-series can serve as measures of goodness of fit of a factor model, such as our VBfFA model.  
  
In this example, we will use the log-likelihood of the variance of the aggregated time-series as the evaluation score.
Details are discussed in the publication on the VBfFA model, https://doi.org/10.1080/14697688.2016.1268708, or https://www.tandfonline.com/doi/abs/10.1080/14697688.2016.1268708, or https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/publication/VBfFA_Publication.pdf, or https://github.com/i4cast/aws/blob/main/variational_Bayesian_filtering_factor_analysis/publication/VBfFA_Manuscript.pdf.
  
For general information about AWS SageMaker metrics, refer to [Define Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-metrics.html).  
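
The idea of scoring a predicted variance-covariance matrix can be illustrated with a small worked example. The sketch below is not the VBfFA implementation: it assumes a textbook factor-model covariance (loadings, factor variances, residual variances), equal aggregation weights, and a zero-mean Gaussian log-likelihood. All function names and numbers are hypothetical; the exact score definition is given in the publication.

```python
import math

def factor_model_covariance(loadings, factor_var, resid_var):
    """Covariance implied by a factor model: B diag(f) B' + diag(d)."""
    n, k = len(loadings), len(factor_var)
    cov = [[sum(loadings[i][m] * factor_var[m] * loadings[j][m]
                for m in range(k))
            for j in range(n)]
           for i in range(n)]
    for i in range(n):
        cov[i][i] += resid_var[i]  # residual variance affects variances only
    return cov

def aggregated_variance(cov, weights):
    """Variance of the weighted sum of the time-series: w' C w."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))

def gaussian_log_likelihood(value, variance):
    """Log-density of a zero-mean Gaussian with the given variance."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + value ** 2 / variance)

# Hypothetical toy numbers: three time-series, one common factor
loadings = [[1.0], [0.5], [2.0]]
factor_var = [4.0]
resid_var = [1.0, 1.0, 1.0]
weights = [1.0 / 3.0] * 3

cov = factor_model_covariance(loadings, factor_var, resid_var)
agg_var = aggregated_variance(cov, weights)
score = gaussian_log_likelihood(1.0, agg_var)  # observed aggregate = 1.0

print('Predicted covariance matrix:', cov)
print('Predicted variance of aggregate:', agg_var)
print('Log-likelihood of observed aggregate:', score)
```

A predicted variance that is far too small or far too large for the observed aggregate lowers the log-likelihood, which is why a larger score indicates a better fit.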

In [None]:
# available choices for objective tuning metric
print('Available model evaluation metrics:')
print(my_metrics)

# name of objective tuning metric
my_objective_metric = my_metrics[4]['Name']

# review
print('\nObjective tuning metric')
print(my_objective_metric)
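
Selecting the metric by position depends on the order of the list. As a minimal sketch, assuming the metric list is a list of dictionaries with a 'Name' key (as the indexing above implies), the metric could also be looked up by keyword. The helper `find_metric_name` and the example metric names are hypothetical and do not come from the algorithm.

```python
def find_metric_name(metric_definitions, keyword):
    """Return the first metric name containing keyword (case-insensitive)."""
    for definition in metric_definitions:
        if keyword.lower() in definition['Name'].lower():
            return definition['Name']
    raise ValueError('No metric name contains: ' + keyword)

# Hypothetical metric definitions, only to illustrate the lookup
example_metrics = [
    {'Name': 'example:score_a', 'Regex': 'score_a=(.*?);'},
    {'Name': 'example:log_likelihood', 'Regex': 'log_likelihood=(.*?);'},
]

print(find_metric_name(example_metrics, 'log_likelihood'))
```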

In general, minimizing error and/or maximizing similarity or likelihood are desirable tuning directions. Therefore, we will maximize the objective metric selected above in this hyperparameter tuning example.

In [None]:
# direction of hyperparameter optimization
my_objective_type = 'Maximize'

#### 4.3. Run a model tuning job

In [None]:
# setting up hyperparameter tuning job
# Information on sagemaker.tuner.HyperparameterTuner():
# https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html
my_tuner = sagemaker.tuner.HyperparameterTuner(
    estimator=my_estimator,
    objective_metric_name=my_objective_metric,
    hyperparameter_ranges=my_hyperparam_range,
    objective_type=my_objective_type,
    max_jobs=1,
    max_parallel_jobs=1,
    base_tuning_job_name='my-tuning-job',
    early_stopping_type='Auto'
)

In the following cell, set the boolean indicator, run_tuning_job, to True in order to
1. run the hyperparameter optimization job
1. save the optimal model artifacts

In [None]:
# run_tuning_job = True | False
run_tuning_job = False

While waiting for the tuning job to finish (after setting run_tuning_job above to True and running the cell below), you can reset run_tuning_job to False in order to avoid accidentally running the hyperparameter tuning job again.

In [None]:
# if True then optimize model and save the result
if run_tuning_job and (len(my_tuned_model_data) == 0):
    
    # Tuning Job
    # NOTE: CHANGE the S3 path of input_data_model_channel in order
    #       to prevent getting any model data from my_model_path
    my_training_input = dict(
    {
        input_data_train_channel:
            # input vector time-series data
            sagemaker.inputs.TrainingInput(
                s3_data=my_input_data_train_path,
                content_type='text/csv',
                s3_data_type='S3Prefix',
                input_mode='File'),
        
        input_data_model_channel:
            # NOT previously trained model artifacts
            sagemaker.inputs.TrainingInput(
                s3_data=my_input_data_train_path,
                content_type='application/gzip',
                s3_data_type='S3Prefix',
                input_mode='File')
    })
    
    # remind
    print('Tune the model. Wait for the tuning job to complete with information:')
    print('Model data of tuned model\n')
    
    # tuning and waiting
    # Information on sagemaker.tuner.HyperparameterTuner().fit():
    # https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html
    my_tuner.fit(
        inputs=my_training_input)
    my_tuner.wait()
    
    # get tuned model and artifacts of the tuned model
    # Information on sagemaker.tuner.HyperparameterTuner().best_estimator():
    # https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html
    my_tuned_estimator = my_tuner.best_estimator()
    
    # optimized hyperparameters
    my_tuned_hyperparam = my_tuned_estimator.hyperparameters()
    
    # optimal model artifacts
    my_tuned_model_data = my_tuned_estimator.model_data
    
    # review
    print('\nTuned hyperparameters:')
    print(my_tuned_hyperparam)
    
    # review
    print('\nModel data of tuned model:')
    print(my_tuned_model_data)

As recommended by the AWS SageMaker team, once you have completed a tuning job (or even while the job is still running), you can [clone and use this notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb) to analyze the results and understand how each hyperparameter affects the quality of the model.

### 5. Deploy model and verify results

#### 5.1. Trained or tuned model

In [None]:
# available trained model
if len(my_trained_model_data) > len('s3://.tar.gz'):
    my_model_data = my_trained_model_data
    my_model_name = my_trained_model_name

# available tuned model
if len(my_tuned_model_data) > len('s3://.tar.gz'):
    my_model_data = my_tuned_model_data
    my_model_name = my_tuned_model_name

# Information on sagemaker.model.ModelPackage():
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html
my_model = sagemaker.model.ModelPackage(
    role=my_role,
    model_data=my_model_data,
    algorithm_arn=my_algorithm_arn,
    name=my_model_name
)

# review
print('Name of model:')
print(my_model_name)

# review
print('\nArtifacts of model:')
print(my_model_data)

# review
print('\nModel package')
print(my_model)
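
The length comparison against 's3://.tar.gz' in the cell above is a compact way to test whether a model artifact path has been set. A more explicit check is sketched below under the assumption that model artifacts are S3 URIs ending in '.tar.gz'. The helper `looks_like_model_artifact` and the example URI are hypothetical.

```python
from urllib.parse import urlparse

def looks_like_model_artifact(s3_uri):
    """True if s3_uri has the form s3://<bucket>/<key ending in .tar.gz>."""
    parsed = urlparse(s3_uri)
    return (parsed.scheme == 's3'
            and bool(parsed.netloc)
            and parsed.path.endswith('.tar.gz'))

# Hypothetical examples
print(looks_like_model_artifact('s3://example-bucket/output/model.tar.gz'))
print(looks_like_model_artifact(''))
```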

#### 5.2. Deploy trained or tuned model

In [None]:
# remind
print('Start endpoint for inference. Wait for the endpoint to become ready')

# Information on sagemaker.model.Model().deploy():
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html
my_endpoint = my_model.deploy(
    initial_instance_count=1,
    instance_type=my_EC2,
    endpoint_name=my_endpoint_name
)

# review
print('\nSagemaker endpoint, ' + my_endpoint_name + ', is ready NOW')

In [None]:
# Information on sagemaker.serializers.IdentitySerializer():
# https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html
my_serializer = sagemaker.serializers.IdentitySerializer()

# Information on sagemaker.deserializers.StreamDeserializer():
# https://sagemaker.readthedocs.io/en/stable/api/inference/deserializers.html
my_deserializer = sagemaker.deserializers.StreamDeserializer()

# Predictor
# Information on sagemaker.predictor.Predictor():
# https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html
my_predictor = sagemaker.predictor.Predictor(
    endpoint_name=my_endpoint_name,
    sagemaker_session=my_session,
    serializer=my_serializer,
    deserializer=my_deserializer
)

# review
print(my_predictor)

#### 5.3. Create input payload

Input payload can be created with the following functions from [S3 Utilities](https://sagemaker.readthedocs.io/en/stable/api/utility/s3.html)

1. **sagemaker.s3.s3_path_join(*args)**: similar to os.path.join()
1. **sagemaker.s3.S3Downloader.read_file(s3_uri, sagemaker_session=None)**: returns the contents of an s3 uri file body as a string

In [None]:
# data file for inference
my_infer_input_file = sagemaker.s3.s3_path_join(
    my_input_data_infer_path,
    my_input_data_file)

# CSV data: string
my_infer_input_str = sagemaker.s3.S3Downloader.read_file(
    my_infer_input_file, 
    sagemaker_session=my_session)

# CSV data: byte stream object
my_inference_input_obj = my_infer_input_str.encode()

# review
print('my_infer_input_file:')
print(my_infer_input_file + '\n')

# review
print('my_infer_input_str: ' + str(type(my_infer_input_str)))
print('my_inference_input_obj: ' + str(type(my_inference_input_obj)))
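
Before sending the payload to an endpoint, it can be useful to check that the CSV text parses cleanly and to see how large the encoded payload is. The following is a minimal sketch using only the standard library. The helper `summarize_csv_payload`, the example CSV layout, and the default `max_bytes` value are assumptions, so check the current SageMaker quota for real-time endpoint payloads and the data format expected by the algorithm.

```python
import csv
import io

def summarize_csv_payload(csv_text, max_bytes=6 * 1024 * 1024):
    """Summarize a CSV string: rows, column counts and encoded size."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    payload = csv_text.encode()
    return {
        'num_rows': len(rows),
        # more than one entry here means the rows have unequal lengths
        'column_counts': sorted({len(row) for row in rows}),
        'num_bytes': len(payload),
        'fits_limit': len(payload) <= max_bytes,  # assumed limit
    }

# Hypothetical CSV layout, only to illustrate the check
example_csv = ('id,2020-01-03,2020-01-10\n'
               'series_a,0.1,0.2\n'
               'series_b,-0.3,0.4\n')
print(summarize_csv_payload(example_csv))
```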

#### 5.4. Perform real-time inference

In [None]:
# Information on sagemaker.predictor.Predictor().predict():
# https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html
my_predict = my_predictor.predict(
    data=my_inference_input_obj)

In [None]:
# review
print('Output of real-time inference:')
print(my_predict)

# review
# Information on botocore.response.StreamingBody()
# https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html
print('\nReal-time prediction')
print(my_predict[0].read())
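
Note that a streaming body can be read only once, so it is often convenient to store the bytes returned by read() and parse them afterwards. The sketch below assumes the response is UTF-8 text in one of the two output types used in this notebook ('text/csv' or 'application/json'). The helper `parse_prediction` and the example bytes are hypothetical.

```python
import csv
import io
import json

def parse_prediction(raw_bytes, content_type):
    """Decode response bytes and parse them according to content_type."""
    text = raw_bytes.decode('utf-8')
    if content_type == 'application/json':
        return json.loads(text)
    if content_type == 'text/csv':
        return [row for row in csv.reader(io.StringIO(text)) if row]
    raise ValueError('Unsupported content type: ' + content_type)

# Hypothetical response bodies, only to illustrate the parsing
print(parse_prediction(b'{"factor_1": [0.1, 0.2]}', 'application/json'))
print(parse_prediction(b'factor_1,0.1,0.2\n', 'text/csv'))
```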

Now that you have successfully performed a real-time inference, you do not need the endpoint any more. You can terminate it to avoid being charged.

In [None]:
# Information on sagemaker.predictor.Predictor().delete_endpoint():
# https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html
my_predictor.delete_endpoint(
    delete_endpoint_config=True)

### 6. Perform batch inference

#### 6.1. Batch transform

In [None]:
# default inference ENV variables
my_ENV = dict({
    'MODELOUTPUT': 'fit_output'
})

# available output type
output_type_choice = dict({
    1: 'text/csv',
    2: 'application/json'
})

# output type
output_type = output_type_choice[
    2
]

# Information on sagemaker.transformer.Transformer():
# https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html
my_transformer = sagemaker.transformer.Transformer(
    model_name=my_model_name,
    instance_count=1,
    instance_type=my_EC2,
    output_path=my_output_data_infer_path,
    accept=output_type,
    env=my_ENV,
    sagemaker_session=my_session
)

Note: Batch-transform job input file is located in the S3 folder: {my_bucket}/{my_prefix}/input/data/inference/

In [None]:
# Information on sagemaker.inputs.TransformInput():
# https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html
my_transform_data_path = my_input_data_infer_path
my_transform_data_type = 'S3Prefix'
my_transform_content_type = 'text/csv'

In [None]:
# remind
print('Run batch transform. Wait for the transform job to complete with information:')
print('Batch transform output path')

# Information on sagemaker.transformer.Transformer().transform():
# https://sagemaker.readthedocs.io/en/stable/api/inference/transformer.html
my_transformer.transform(
    data=my_transform_data_path,
    data_type=my_transform_data_type,
    content_type=my_transform_content_type,
    # wait=False,
    logs=True
)

# wait
my_transformer.wait()

# output is available on following path
my_transform_output_path = my_transformer.output_path
print('Batch transform output path:')
print(my_transform_output_path)

Now you can display and review output generated by the batch transform job available in S3.

In [None]:
# transform output file name = {input_data_file}.csv.out
my_transform_output_file = my_input_data_file + '.out'

# data file for inference
my_inference_file = sagemaker.s3.s3_path_join(
    my_transform_output_path,
    my_transform_output_file)

# CSV data string
my_inference = sagemaker.s3.S3Downloader.read_file(
    my_inference_file, 
    sagemaker_session=my_session)

# review
print('Output of batch transform job:\n')
print(my_inference)

You may change the transform output file name to keep the file from being overwritten.

Open the AWS S3 Console, go to the batch transform output path shown above, and rename the file "{inference_input_data_file_name}.csv.out" to
1. "fit_output.csv", if accept = output_type = 'text/csv', or
1. "fit_output.json", if accept = output_type = 'application/json'

#### 6.2. Delete the model

Now you have successfully performed a batch inference. If you plan to review the trained or tuned model structure by using Transformer as demonstrated later, do NOT run the cell below. Otherwise, you can delete the model.

In [None]:
# need more batch transform?
more_batch_transform = True

# Information on sagemaker.session.Session().delete_model():
# https://sagemaker.readthedocs.io/en/stable/api/utility/session.html
if not more_batch_transform:
    my_session.delete_model(my_model_name)

### 7. Model review by using Transformer (optional)

#### 7.1. VBfFA predictions and goodness scores of the predictions

**Labels**  
  
1. 'fit_output': Output time-series from fitting variational Bayesian filtering factor analysis (VBfFA) model
1. 'posterior': Posterior of variational Bayesian filtering as of the last time stamp of sample data vector time-series
1. 'pred_output': Output time-series from prediction by fitted variational Bayesian filtering factor analysis (VBfFA) model
1. 'last_pred': Predictions as of the last output time-stamp by fitted variational Bayesian filtering factor analysis (VBfFA) model
1. 'eval_score': Variational Bayesian filtering factor analysis (VBfFA) model evaluation score time-series
1. 'eval_stats': Variational Bayesian filtering factor analysis (VBfFA) model evaluation statistics, by averages over time of VBfFA model evaluation score time-series
1. 'diff_score': Arithmetic or logarithmic difference of VBfFA model evaluation score time-series minus OBS model evaluation score time-series
1. 'diff_stats': VBfFA vs. OBS model evaluation statistics, by averages over time of arithmetic or logarithmic difference of VBfFA model evaluation score time-series minus OBS model evaluation score time-series
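
The relationship between the score, difference and statistics labels can be illustrated with a small sketch. It assumes that "arithmetic difference" means a - b, that "logarithmic difference" means log(a) - log(b) for positive scores, and that the statistics are plain averages over time. These interpretations, the function names and the toy numbers are assumptions, not the algorithm's implementation.

```python
import math

def diff_score(model_scores, reference_scores, logarithmic=False):
    """Per-time-stamp difference: model score minus reference score."""
    if logarithmic:
        # assumes strictly positive scores
        return [math.log(m) - math.log(r)
                for m, r in zip(model_scores, reference_scores)]
    return [m - r for m, r in zip(model_scores, reference_scores)]

def average_over_time(values):
    """Average of a score or difference time-series."""
    return sum(values) / len(values)

# Hypothetical score time-series
model_scores = [2.0, 4.0]
reference_scores = [1.0, 1.0]

arithmetic = diff_score(model_scores, reference_scores)
print('diff_score (arithmetic):', arithmetic)
print('diff_stats (arithmetic):', average_over_time(arithmetic))
```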

#### 7.2. Select prediction or score for review

Trained or tuned VBfFA predictions or scores can be reviewed using Transformer with specific values of environment variable,
MODELOUTPUT, and choice of output type.  
   
Choices of values of model output and output type are:  

In [None]:
# available choices for MODELOUTPUT
model_output_choice = dict({
    1: 'fit_output',
    2: 'posterior',
    3: 'pred_output',
    4: 'last_pred',
    5: 'eval_score',
    6: 'eval_stats',
    7: 'diff_score',
    8: 'diff_stats'
})

# available choices for output type
output_type_choice = dict({
    1: 'text/csv',
    2: 'application/json'
})

You can make any valid pair of choices, as in the following cell:

In [None]:
# choice for MODELOUTPUT (an integer between 1 and 8)
model_output = model_output_choice[
    3
]

# output type
output_type = output_type_choice[
    2
]

# review
print('model_output = ' + model_output)
print('output_type = ' + output_type)
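
If several outputs are to be reviewed one after another, the settings can be prepared in a single pass. The following is a minimal sketch in plain Python that repeats the labels and output types listed above. The names `review_labels`, `accept_types` and `review_settings` are hypothetical, and whether every pair is accepted by the algorithm container is not verified here.

```python
# Labels and output types as listed in this notebook
review_labels = [
    'fit_output', 'posterior', 'pred_output', 'last_pred',
    'eval_score', 'eval_stats', 'diff_score', 'diff_stats',
]
accept_types = ['text/csv', 'application/json']

# One (environment dict, accept type) pair per combination
review_settings = [({'MODELOUTPUT': label}, accept)
                   for label in review_labels
                   for accept in accept_types]

for env, accept in review_settings[:2]:
    print(env, accept)
print('Number of settings:', len(review_settings))
```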

#### 7.3. Model output review with Transformer

In [None]:
# ENV variables
my_ENV = dict({
    'MODELOUTPUT': model_output})

# sagemaker.transformer.Transformer()
my_transformer = sagemaker.transformer.Transformer(
    model_name=my_model_name,
    instance_count=1,
    instance_type=my_EC2,
    output_path=my_output_data_infer_path,
    accept=output_type,
    env=my_ENV,
    sagemaker_session=my_session)

In [None]:
# sagemaker.inputs.TransformInput()
my_transform_data_path = my_input_data_infer_path
my_transform_data_type = 'S3Prefix'
my_transform_content_type = 'text/csv'

# remind
print('Run batch transform. Wait for the transform job to complete with information:')
print('Batch transform output path')

# sagemaker.transformer.Transformer()
my_transformer.transform(
    data=my_transform_data_path,
    data_type=my_transform_data_type,
    content_type=my_transform_content_type,
    # wait=False,
    logs=True)

# wait
my_transformer.wait()

# output is available on following path
my_transform_output_path = my_transformer.output_path
print('Batch transform output path:')
print(my_transform_output_path)

You can display and review output generated by the batch transform job available in S3.

In [None]:
# transform output file name = {input_data_file}.csv.out
my_transform_output_file = my_input_data_file + '.out'

# data file for inference
my_inference_file = sagemaker.s3.s3_path_join(
    my_transform_output_path,
    my_transform_output_file)

# CSV data string
my_inference = sagemaker.s3.S3Downloader.read_file(
    my_inference_file, 
    sagemaker_session=my_session)

# display
print('Selected output:\n')
print(my_inference)

You may change the selected output file name to keep the file from being overwritten.

Open the AWS S3 Console, go to the batch transform output path shown above, and rename the file "{inference_input_data_file_name}.csv.out" to
1. "{model_output}.csv", if accept = output_type = 'text/csv', or
1. "{model_output}.json", if accept = output_type = 'application/json'
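
The naming rule above can be written as a small helper. This is a sketch only: the function `renamed_output_name` is hypothetical, and it only builds the new file name. S3 itself has no in-place rename, so a rename amounts to copying the object to the new key and deleting the old one.

```python
def renamed_output_name(model_output, accept):
    """Build the suggested file name from MODELOUTPUT and the output type."""
    extensions = {'text/csv': '.csv', 'application/json': '.json'}
    if accept not in extensions:
        raise ValueError('Unsupported output type: ' + accept)
    return model_output + extensions[accept]

print(renamed_output_name('pred_output', 'application/json'))
print(renamed_output_name('eval_stats', 'text/csv'))
```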

### 8. Clean-up

#### 8.1. Delete endpoint and model

In [None]:
# Information on sagemaker.predictor.Predictor().delete_endpoint():
# https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html
my_predictor.delete_endpoint(
    delete_endpoint_config=True)

In [None]:
# Information on sagemaker.session.Session().delete_model():
# https://sagemaker.readthedocs.io/en/stable/api/utility/session.html
my_session.delete_model(my_model_name)
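
If the endpoint or the model has already been deleted in an earlier section, the clean-up calls above may raise an error. A minimal sketch of a tolerant clean-up wrapper follows. The helper `run_cleanup_step` is hypothetical, and catching every exception is a simplification chosen for illustration.

```python
def run_cleanup_step(description, action):
    """Run one clean-up action; report instead of stopping on failure."""
    try:
        action()
    except Exception as error:  # broad on purpose for this sketch
        print('Skipped "' + description + '": ' + str(error))
        return False
    print('Done: ' + description)
    return True

# Stand-in actions, only to illustrate the wrapper
def already_deleted():
    raise RuntimeError('resource not found')

results = [
    run_cleanup_step('delete endpoint', already_deleted),
    run_cleanup_step('delete model', lambda: None),
]

# In this notebook the actions could be, for example:
# run_cleanup_step('delete endpoint',
#     lambda: my_predictor.delete_endpoint(delete_endpoint_config=True))
# run_cleanup_step('delete model',
#     lambda: my_session.delete_model(my_model_name))
```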

#### 8.2. Unsubscribe from the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:  

1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing for which you want to cancel the subscription, and then choose __Cancel Subscription__.

