# Train the AutoModel with SageMaker training jobs

This notebook is a walk-through on how to train and evaluate the AutoModel within a SageMaker training job. SageMaker training jobs have several advantages over a normal notebook. They:

- provide an overview of all the training runs you have started
- automatically store the results of a training run (metrics, [logs](https://console.aws.amazon.com/cloudwatch) and models)
- allow running multiple training jobs in parallel if sufficient GPUs are allocated

## Setup

First, we need to import required libraries and functions.

In [None]:
import sys                                                                             # Python system library needed to load custom functions
import numpy as np                                                                     # for performing calculations on numerical arrays
import pandas as pd                                                                    # home of the DataFrame construct, _the_ most important object for Data Science
import seaborn as sns                                                                  # additional plotting library
import matplotlib.pyplot as plt                                                        # allows creation of insightful plots
import os                                                                              # for changing the directory

import sagemaker                                                                       # dedicated sagemaker library to execute training jobs
import boto3                                                                           # for interacting with S3 buckets

from sagemaker.huggingface import HuggingFace                                           # for executing the training jobs
from sklearn.metrics import precision_recall_fscore_support, accuracy_score             # tools to understand how our model is performing

sys.path.append('../src')                                                               # Add the source directory to the PYTHONPATH so that local functions and modules can be imported.
from gdsc_utils import create_encrypted_bucket, download_and_extract_model, PROJECT_DIR # functions to create S3 buckets and to help with downloading models, plus the project root directory
from gdsc_eval import plot_confusion_matrix                                             # function for creating confusion matrix
from config import DEFAULT_BUCKET, DEFAULT_REGION                                       # importing the bucket name that contains data for the challenge and the default region
os.chdir(PROJECT_DIR)                                                                   # changing our directory to root

## Running the Training Script

The training job will run on a virtual machine (called an instance) in the AWS cloud. We first need to set the name of our experiment. Every experiment should have a unique name.

In [None]:
entry_point = 'auto_train.py'
exp_name = entry_point.split('.')[0].replace('_', '-')  # AWS does not allow . and _ in experiment names
exp_name
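
The name derivation above can be wrapped in a small helper that also checks the result. This is a sketch, not part of the project: ```to_job_name``` is a hypothetical helper, and the pattern reflects the documented constraint for SageMaker training job names (alphanumeric characters and hyphens, at most 63 characters), which we assume also applies to the experiment name used here.

```python
import re
from pathlib import Path

# Assumed naming rule: letters, digits and hyphens, at most 63 characters,
# starting and ending with an alphanumeric character.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def to_job_name(script_name: str) -> str:
    """Turn a script file name such as 'auto_train.py' into a job-name prefix."""
    stem = Path(script_name).stem  # drop the extension: 'auto_train'
    # Replace every run of characters that is not a letter or digit with one hyphen
    name = re.sub(r"[^a-zA-Z0-9]+", "-", stem).strip("-")
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"{name!r} is not a valid job name")
    return name


print(to_job_name("auto_train.py"))  # auto-train
```

Failing early with a ```ValueError``` is cheaper than having the job rejected after the source code has been packaged and uploaded.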

Next, we need to define the AWS settings for the job.

In [None]:
account_id = boto3.client('sts').get_caller_identity().get('Account')
role = sagemaker.get_execution_role()

SageMaker has built-in functionality for downloading the data needed to train a model.
Via the ```input_channels``` parameter we can specify multiple S3 locations. The contents are downloaded in the training job and made available under the provided name (dictionary key).
Here, SageMaker will download the complete content of the training data location, store it on the instance, and save its location in an environment variable called ```SM_CHANNEL_DATA```.

In [None]:
input_channels = {
    "data": f"s3://{DEFAULT_BUCKET}/data"
}
input_channels
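
Inside the training script, the channel location can be read back from that environment variable. A minimal sketch, assuming the script resolves its data folders this way: the helper ```resolve_data_dir``` is illustrative and not taken from ```auto_train.py```, and the fallback is the default channel mount point inside a SageMaker training container.

```python
import os
from pathlib import Path


def resolve_data_dir(subfolder: str, channel: str = "data") -> Path:
    """Return the folder for one data split inside a SageMaker input channel."""
    # SageMaker sets SM_CHANNEL_<NAME> for every key of the input channels dictionary.
    env_var = f"SM_CHANNEL_{channel.upper()}"
    root = Path(os.environ.get(env_var, f"/opt/ml/input/data/{channel}"))
    return root / subfolder


# Simulate the variable locally; in a training job SageMaker sets it for us.
os.environ["SM_CHANNEL_DATA"] = "/opt/ml/input/data/data"
print(resolve_data_dir("train"))
```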

In [None]:
# We need to create our own s3 bucket if it doesn't exist yet:
sagemaker_bucket = f"sagemaker-{DEFAULT_REGION}-{account_id}"
create_encrypted_bucket(sagemaker_bucket)

s3_output_location = f"s3://{sagemaker_bucket}/{exp_name}"
s3_output_location

The training script uses the <b>argparse</b> module to define the parameters that can be passed to it. We set their values here as hyperparameters.

In [None]:
hyperparameters={
    "epochs":12,                                                   # number of training epochs
    "patience":2,                                                  # early stopping: number of epochs without improvement before training stops
    "train_batch_size":4,                                          # training batch size
    "eval_batch_size":4,                                           # evaluation batch size
    "model_name":"MIT/ast-finetuned-audioset-10-10-0.4593",        # name of the pretrained model from HuggingFace
    "train_dir":"train",                                           # folder name with training data
    "val_dir":"val",                                               # folder name with validation data
    "test_dir":"test",                                             # folder name with test data
    "sampling_rate":44100,                                         # sampling rate
    "learning_rate":float(3e-5),                                   # learning rate
    "gradient_accumulation_steps":8,                               # the number of gradient accumulation steps to be used during training.
    "num_hidden_layers":8,                                         # number of hidden layers to prune the model
}
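
SageMaker passes each entry of the hyperparameters dictionary to the entry point as a command-line argument (```--epochs 12``` and so on). The sketch below shows how a script could parse a few of them with argparse. It is an assumption about how ```auto_train.py``` is structured, not a copy of it, and the defaults are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse a subset of the hyperparameters that arrive as --key value pairs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=12)
    parser.add_argument("--patience", type=int, default=2)
    parser.add_argument("--train_batch_size", type=int, default=4)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--model_name", type=str,
                        default="MIT/ast-finetuned-audioset-10-10-0.4593")
    # parse_known_args skips arguments that this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the command line that a training job would receive
args = parse_args(["--epochs", "3", "--learning_rate", "3e-05", "--val_dir", "val"])
print(args.epochs, args.learning_rate)
```

Declaring a ```type``` for every argument matters because all values arrive as strings on the command line.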

Finally, we need to specify which metrics we want SageMaker to automatically track. For this, we need to set up [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) that will be applied to the logs.
The corresponding values will then be stored and made visible in the training job.

In [None]:
# Matches plain and scientific notation, e.g. 12, 0.25 or 2.9e-05.
# The outer group captures the whole number, including the exponent.
number = r"([0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?)"

metric_definitions=[
    {'Name': 'loss', 'Regex': rf"'loss': {number},?"},
    {'Name': 'learning_rate', 'Regex': rf"'learning_rate': {number},?"},
    {'Name': 'eval_loss', 'Regex': rf"'eval_loss': {number},?"},
    {'Name': 'eval_accuracy', 'Regex': rf"'eval_accuracy': {number},?"},
    {'Name': 'eval_f1', 'Regex': rf"'eval_f1': {number},?"},
    {'Name': 'eval_precision', 'Regex': rf"'eval_precision': {number},?"},
    {'Name': 'eval_recall', 'Regex': rf"'eval_recall': {number},?"},
    {'Name': 'epoch', 'Regex': rf"'epoch': {number},?"}]
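
A quick way to sanity-check a metric regex before launching a job is to run it against a sample log line with Python's ```re``` module. The log line below is made up for illustration, the number pattern is one possible choice, and we assume SageMaker interprets the expression in the same way as ```re``` does, taking the first capture group as the value.

```python
import re

# Made-up log line in the style of a metrics dictionary printed during training
sample_log = "{'loss': 1.2345, 'learning_rate': 2.9e-05, 'epoch': 1.0}"

# One possible number pattern: digits, optional decimals, optional exponent
number = r"([0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?)"


def extract_metric(name: str, log_line: str):
    """Return the value captured for a metric, or None if the line has no match."""
    match = re.search(rf"'{name}': {number}", log_line)
    # group(1) is the outer group, i.e. the complete number
    return float(match.group(1)) if match else None


for metric in ("loss", "learning_rate", "epoch", "eval_loss"):
    print(metric, extract_metric(metric, sample_log))
```

Checking values in scientific notation is worthwhile: a pattern that stops at the decimal part would record ```2.9``` instead of ```2.9e-05``` for the learning rate.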

With everything defined, we create the estimator and call its *fit* method to start the training job. As this might take a while, we could set ```wait=False``` so that the notebook does not block until the training job finishes and we can continue working. For the sake of the tutorial, let's set it to ```True```.

In [None]:
image_uri = '954362353459.dkr.ecr.us-east-1.amazonaws.com/sm-training-custom:latest'

huggingface_estimator = HuggingFace(
    entry_point=entry_point,                # fine-tuning script to use in training job
    source_dir="./src",                     # directory where fine-tuning script is stored. This directory will be copied to the training instance
    instance_type="ml.g4dn.xlarge",         # instance type - ml.g4dn.xlarge is a GPU instance so the training will be faster
    output_path = s3_output_location,       # S3 location where our model is stored after training
    instance_count=1,                       # number of instances. We are limited to 1 instance
    role=role,                              # IAM role used in training job to access AWS resources (S3)
    image_uri = image_uri,                  # passing our custom image with the required libraries
    py_version="py310",                     # Python version
    hyperparameters=hyperparameters,        # hyperparameters to use in training job
    metric_definitions = metric_definitions # metrics we want to extract from logs. It will be visible in SageMaker training job UI
)

In [None]:
huggingface_estimator.fit(input_channels, wait=True)
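
If you start the job with ```wait=False``` instead, you can check on it later. The sketch below polls the job status through the ```describe_training_job``` call of the boto3 SageMaker client. The helper name ```wait_for_job``` and the polling interval are our own choices, and a stub client stands in for boto3 so that the example runs without AWS access.

```python
import time

# Statuses after which a training job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(client, job_name: str, poll_seconds: float = 30.0, max_polls: int = 1000) -> str:
    """Poll a training job until it reaches a terminal status and return that status."""
    for _ in range(max_polls):
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


class StubClient:
    """Stand-in for boto3.client('sagemaker') that reports completion on the third poll."""

    def __init__(self):
        self.calls = 0

    def describe_training_job(self, TrainingJobName):
        self.calls += 1
        status = "InProgress" if self.calls < 3 else "Completed"
        return {"TrainingJobStatus": status}


stub = StubClient()
print(wait_for_job(stub, "auto-train-example", poll_seconds=0))
```

With real credentials you would pass ```boto3.client('sagemaker')``` and the name of the job you started in place of the stub and the example name.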

In [None]:
# build the S3 location of the trained model so that we can use it later
model_location = f'{s3_output_location}/{huggingface_estimator.latest_training_job.name}/output/model.tar.gz'
print(model_location)

In [None]:
# save the model location as a text file in the model folder; create the folder if it doesn't exist
model_folder_path = f"models/{huggingface_estimator.latest_training_job.name}"

os.makedirs(model_folder_path, exist_ok=True)

with open(f'{model_folder_path}/model_location.txt', 'w') as f:
    f.write(model_location)

## The newly trained model!

After the training job has finished, you can download its results.

First, read the model location that we saved earlier.

In [None]:
# read the model location from the filesystem
with open(f'{model_folder_path}/model_location.txt', 'r') as f:
    model_location = f.read()

Next, a custom function downloads the results to a local directory.

In [None]:
local_model_dir = download_and_extract_model(model_uri=model_location, local_dir='models')
local_model_dir

With everything set up, let's proceed to loading the test set predictions!

In [None]:
test_preds = pd.read_csv(f'{local_model_dir}/prediction_test.csv', index_col = False)
test_preds.head()