# Fine-Tune and Deploy a Text Classification Model Using Amazon SageMaker

Disclaimer: This code is adapted from the official AWS SageMaker Starter learnings. 

## 1. Set Up

Before executing the notebook, a few setup steps are required. This notebook requires the latest version of `sagemaker` and `ipywidgets`.

In [2]:
!pip -qqq install -U sagemaker ipywidgets

In [4]:
import sagemaker, boto3, json
import sys
import time
import importlib

sys.path.append('./my_config')

import my_config

In [5]:
# Run this code after making changes in the config file

importlib.reload(my_config)

<module 'my_config' from '/Users/zoumanakeita/Desktop/Perso/Freelance/YouTube/AWS-SageMaker/notebooks/my_config.py'>
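The notebook reads all of its settings from `my_config.py`, which is not shown here. As a rough sketch of what such a file could look like: the bucket name and role match the values printed in the cells below, the solution prefix is inferred from the training job name in the logs, and everything else (model ID, instance types, tag key) is a placeholder assumption to replace with your own values.

```python
# my_config.py: hypothetical sketch, not the author's actual file.

# Values that match the output printed by this notebook
BUCKET_NAME = "sagemaker-classification-bucket"
MY_AWS_ROLE = "arn:aws:iam::654654631565:role/sagemaker-classification-role"

# Assumed: inferred from the training job name that appears in the logs
SOLUTION_PREFIX = "sagemaker-soln-documents-"

# Assumed placeholders: replace with the JumpStart model ID,
# instance types, and tag key that you actually use
MODEL_ID = "<jumpstart-text-classification-model-id>"
TRAINING_INSTANCE_TYPE = "ml.m5.xlarge"
INFERENCE_INSTANCE_TYPE = "ml.m5.xlarge"
TAG_KEY = "project"
```

Keeping these values in one module means the notebook only needs `importlib.reload(my_config)` after a change, rather than edits scattered across cells.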

In [6]:
aws_region = boto3.Session().region_name
sess = sagemaker.Session()
BUCKET_NAME = my_config.BUCKET_NAME

In [7]:
print(f"AWS Region: {aws_region}")
print(f"SageMaker Session: {sess}")
print(f"AWS Role: {my_config.MY_AWS_ROLE}")
print(f"Default Bucket: {BUCKET_NAME}")

AWS Region: us-east-1
SageMaker Session: <sagemaker.session.Session object at 0x13a107f40>
AWS Role: arn:aws:iam::654654631565:role/sagemaker-classification-role
Default Bucket: sagemaker-classification-bucket


In [8]:
model_id = my_config.MODEL_ID

## 2. Fine-tune the pre-trained model on the Stanford Sentiment Treebank 2 (SST-2) dataset for Movie Reviews

The target values represent the sentiment of the sentences:

- 0: Negative sentiment  
- 1: Positive sentiment  

This dataset is used for binary classification tasks, where the goal is to determine whether a given sentence expresses a positive or negative sentiment.


### 2.1. Retrieve JumpStart training artifacts

Here, for the selected model, we retrieve the training Docker container, the training algorithm source, and the pre-trained model. The Python dictionary of training hyper-parameters that the algorithm accepts, with their default values, is fetched in Section 2.2. Note that `model_version="*"` fetches the latest model version, whereas the cell below pins version `1.1.2`. We also need to specify `training_instance_type` to fetch `train_image_uri`.

In [9]:
from sagemaker import image_uris, model_uris, script_uris

# Pin the model version; "*" would fetch the latest available version instead
model_version = "1.1.2"

training_instance_type = my_config.TRAINING_INSTANCE_TYPE

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

### 2.2. Set training parameters

Now that the setup is complete, we are ready to fine-tune our Text Classification model. To begin, let us create a `sagemaker.estimator.Estimator` object. This estimator launches the training job.

There are two kinds of parameters that need to be set for training.

The first set contains the parameters for the training job. These include:
- Training data path: the S3 folder in which the input data is stored.
- Output path: the S3 folder in which the training output is stored.
- Training instance type: the type of machine on which to run the training. Typically, GPU instances are used for this kind of training. We defined the training instance type above to fetch the correct `train_image_uri`.

The second set contains the algorithm-specific training hyper-parameters.

In [10]:
# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = BUCKET_NAME
output_prefix = "TC"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

For algorithm-specific hyper-parameters, we start by fetching a Python dictionary of the training hyper-parameters that the algorithm accepts, with their default values. These defaults can then be overridden with custom values.

In [11]:
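from sagemaker import hyperparameters as jumpstart_hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model.
# The module is imported under an alias so that the `hyperparameters`
# dictionary below does not shadow it when this cell is re-run.
hyperparameters = jumpstart_hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)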

# [Optional] Override default hyperparameters with custom values
hyperparameters["batch-size"] = "64"
hyperparameters["adam-learning-rate"] = "1e-6"
hyperparameters["epochs"] = "1"

In [12]:
print(hyperparameters)

{'epochs': '1', 'adam-learning-rate': '1e-6', 'batch-size': '64', 'reinitialize-top-layer': 'Auto', 'train-only-top-layer': 'False'}
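A mistyped key (for example `batch_size` instead of `batch-size`) would silently add a new entry to the dictionary instead of overriding the intended one. The helper below is a minimal sketch of a safer override step; `override_hyperparameters` is a hypothetical name, not part of the SageMaker SDK. It rejects unknown keys and stores every value as a string, matching the format of the dictionary printed above.

```python
def override_hyperparameters(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied as strings.

    Raises KeyError if an override uses a key the algorithm does not list.
    """
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown hyper-parameter(s): {unknown}")
    merged = dict(defaults)
    merged.update({name: str(value) for name, value in overrides.items()})
    return merged


# The dictionary printed above, used here as the starting point
current = {
    "epochs": "1",
    "adam-learning-rate": "1e-6",
    "batch-size": "64",
    "reinitialize-top-layer": "Auto",
    "train-only-top-layer": "False",
}

# Illustrative override values, chosen only for this example
updated = override_hyperparameters(current, {"epochs": 3, "batch-size": 32})
print(updated["epochs"], updated["batch-size"])  # prints: 3 32
```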


### 2.3. Download, preprocess, and upload the training data

In [13]:
!aws s3 cp --recursive $training_dataset_s3_path data/sst2

download: s3://jumpstart-cache-prod-us-east-1/training-datasets/SST/data.csv to data/sst2/data.csv


In [14]:
import pandas as pd

data = pd.read_csv("data/sst2/data.csv", header=None)
data.columns = ["Target", "Sentence Input"]

View the first five observations of the training data

In [15]:
data.head(5)

|   | Target | Sentence Input |
|---|--------|----------------|
| 0 | 0 | hide new secretions from the parental units |
| 1 | 0 | contains no wit , only labored gags |
| 2 | 1 | that loves its characters and communicates som... |
| 3 | 0 | remains utterly satisfied to remain the same t... |
| 4 | 0 | on the worst revenge-of-the-nerds clichés the ... |


In [16]:
data.shape

(68221, 2)

In [17]:
data.Target.unique()

array([0, 1])

We have a very large dataset, and fine-tuning would take a lot of time since I am not using any GPU. Let's take only 2% of the data for simplicity's sake.

In [18]:
# Take a random 2% sample of the data
sampled_data = data.sample(frac=0.02, random_state=2024)

# Optionally, reset the index
sampled_data = sampled_data.reset_index(drop=True)

sampled_data.shape

(1364, 2)

In [19]:
sampled_data.Target.unique()

array([0, 1])
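`unique()` only confirms that both classes survived the sampling; it says nothing about how balanced they are. A small sketch of a balance check follows. The helper name `label_shares` and the example labels are made up for illustration.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of examples per label, ordered by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Made-up labels, only to show the output format
example_labels = [0, 1, 1, 0, 1, 1, 0, 1]
shares = label_shares(example_labels)
print(shares)  # prints: {0: 0.375, 1: 0.625}
```

In the notebook, the same function could be called as `label_shares(sampled_data.Target.tolist())`.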

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
train_data, test_data = train_test_split(sampled_data, test_size=0.01, random_state=2024)
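It helps to work out how many rows each step leaves. The arithmetic below is a sketch that assumes pandas rounds `frac * n` to the nearest integer and scikit-learn takes the ceiling of `test_size * n`; the sampled size it produces matches the shape printed above.

```python
import math

n_total = 68221      # rows in the full dataset
sample_frac = 0.02   # fraction kept by data.sample
test_size = 0.01     # fraction held out by train_test_split

n_sampled = round(n_total * sample_frac)    # 1364.42 -> 1364
n_test = math.ceil(n_sampled * test_size)   # 13.64 -> 14
n_train = n_sampled - n_test                # 1364 - 14 = 1350

print(n_sampled, n_train, n_test)  # prints: 1364 1350 14
```

With only 14 held-out sentences, each prediction moves accuracy by about 7 percentage points (1/14), so the evaluation metrics computed later are a coarse estimate.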

In [22]:
train_data.to_csv("data/sst2/split_train.csv", header=False, index=False)
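The file written above has no header and no index: each row is the label followed by the sentence. The sketch below reproduces that layout with the standard `csv` module, using two sentences from the preview earlier, to show that a sentence containing a comma is quoted so it still parses as a single field.

```python
import csv
import io

# Two rows taken from the data preview: (label, sentence)
rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
]

# Write without a header, mirroring header=False, index=False
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm that each row still has exactly two fields
parsed = [
    (int(label), sentence)
    for label, sentence in csv.reader(io.StringIO(csv_text))
]
```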

Upload the split training data to the S3 bucket. The training data is further split into training and validation sets during training. The test data is used as hold-out data to evaluate the model's performance.

In [23]:
import boto3

prefix = "TC"
file_path = "train/data.csv"
local_file_path = "data/sst2/split_train.csv"

# Manually construct the S3 path to ensure forward slashes
s3_path = f"{prefix}/{file_path}"

boto3.Session().resource("s3").Bucket(BUCKET_NAME).Object(s3_path).upload_file(local_file_path)
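S3 keys always use forward slashes, so building them with `os.path.join` can produce backslashes on Windows. A minimal sketch of a helper that joins key parts explicitly is shown below; `build_s3_key` and `build_s3_uri` are hypothetical names introduced for illustration.

```python
def build_s3_key(*parts):
    """Join key parts with single forward slashes, skipping empty parts."""
    cleaned = [part.strip("/") for part in parts if part and part.strip("/")]
    return "/".join(cleaned)


def build_s3_uri(bucket, *parts):
    """Build an s3:// URI for the bucket and key parts."""
    return f"s3://{bucket}/{build_s3_key(*parts)}"


print(build_s3_key("TC", "train/data.csv"))
# prints: TC/train/data.csv
print(build_s3_uri("sagemaker-classification-bucket", "TC/", "/train"))
# prints: s3://sagemaker-classification-bucket/TC/train
```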

### 2.4. Fine-tuning without hyperparameter optimization

We start by creating the estimator object with all the required assets and then launch the training job.

In [24]:
from sagemaker.estimator import Estimator

In [25]:
training_job_name = f"{my_config.SOLUTION_PREFIX}-tc-finetune"

# Create SageMaker Estimator instance
tc_estimator = Estimator(
    role=my_config.MY_AWS_ROLE,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=3500,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    tags=[{"Key": my_config.TAG_KEY, "Value": my_config.SOLUTION_PREFIX}],
    base_job_name=training_job_name,
)

training_data_path_updated = f"s3://{BUCKET_NAME}/{prefix}/train"

# Launch a SageMaker Training job by passing s3 path of the training data
tc_estimator.fit({"training": training_data_path_updated}, logs=True)

INFO:sagemaker:Creating training-job with name: sagemaker-soln-documents--tc-finetune-2024-08-09-08-16-42-330


2024-08-09 08:16:43 Starting - Starting the training job...
2024-08-09 08:16:58 Starting - Preparing the instances for training...
2024-08-09 08:17:25 Downloading - Downloading input data...
2024-08-09 08:18:11 Training - Training image download completed. Training in progress...
2024-08-09 08:18:21.535255: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2024-08-09 08:18:21.535411: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2024-08-09 08:18:21.562322: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2024-08-09 08:18:22,803 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2024-08-09 08:18:22,813 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-08-09 08:18:23,1

### 2.5. Deploy & run Inference on the fine-tuned model

We now want to use the model to perform inference, meaning predicting the class label of an input sentence. 

We retrieve the JumpStart artifacts for deploying an endpoint, and then deploy the fine-tuned `tc_estimator` rather than the pre-trained base model.



In [26]:
import uuid

inference_instance_type = my_config.INFERENCE_INSTANCE_TYPE 

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)
unique_hash = str(uuid.uuid4())[:6]
endpoint_name_tc_finetune = f"{my_config.SOLUTION_PREFIX}-{unique_hash}-tc-finetune-endpoint"


# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = tc_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=endpoint_name_tc_finetune,
)

time.sleep(10)

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py37.
INFO:sagemaker:Repacking model artifact (s3://sagemaker-classification-bucket/TC/output/sagemaker-soln-documents--tc-finetune-2024-08-09-08-16-42-330/output/model.tar.gz), script artifact (s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/tensorflow/inference/tc/v1.1.1/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-654654631565/sagemaker-jumpstart-2024-08-09-08-27-34-562/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: sagemaker-jumpstart-2024-08-09-08-27-34-562
INFO:sagemaker:Creating endpoint-config with name sagemaker-soln-documents--6883e3-tc-finetune-endpoint
INFO:sagemaker:Creating endpoint with name sagemaker-soln-documents--6883e3-tc-finetune-endpoint


-----!
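The endpoint name above combines the solution prefix, a short random hash, and a fixed suffix. As I understand the SageMaker endpoint naming constraints (an assumption worth checking against the API reference), names are limited to 63 characters and may contain only alphanumerics and hyphens, starting and ending with an alphanumeric. The sketch below validates a generated name before deployment; `make_endpoint_name` is a hypothetical helper.

```python
import re
import uuid

# Assumed constraints: at most 63 characters, alphanumerics and hyphens only,
# with an alphanumeric first and last character
MAX_ENDPOINT_NAME_LENGTH = 63
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def make_endpoint_name(prefix, suffix="tc-finetune-endpoint"):
    """Build '<prefix>-<6 hex chars>-<suffix>' and validate the result."""
    unique_hash = uuid.uuid4().hex[:6]
    name = f"{prefix}-{unique_hash}-{suffix}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH:
        raise ValueError(f"Endpoint name too long ({len(name)} chars): {name}")
    if not ENDPOINT_NAME_PATTERN.match(name):
        raise ValueError(f"Endpoint name has invalid characters: {name}")
    return name


print(make_endpoint_name("sagemaker-soln-documents-"))
```

Failing early on an invalid name is cheaper than waiting for the deployment call to reject it.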

Next, we query each of the examples in the test data to get its predicted label.

In [27]:
ground_truth, test_examples = (
    test_data.iloc[:, 0].values.tolist(),
    test_data.iloc[:, 1].values.tolist(),
)

In [28]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"


def query_endpoint(encoded_text, predictor):
    response = predictor.predict(
        encoded_text,
        {"ContentType": "application/x-text", "Accept": "application/json;verbose"},
    )
    return response


def parse_response(query_response):
    model_predictions = json.loads(query_response)
    probabilities, labels, predicted_label = (
        model_predictions["probabilities"],
        model_predictions["labels"],
        model_predictions["predicted_label"],
    )
    return probabilities, labels, predicted_label


predict_prob, predict_label = [], []
for text in test_examples:
    query_response = query_endpoint(text.encode("utf-8"), finetuned_predictor)
    probabilities, labels, predicted_label = parse_response(query_response)
    predict_prob.append(probabilities)
    predict_label.append(predicted_label)
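To make the response format concrete without calling the endpoint, the sketch below parses a hand-written payload with the same three keys that `parse_response` reads. The probability values are invented for illustration, and `parse_verbose_response` is a hypothetical stand-in for the function above.

```python
import json

# Invented payload using the keys the notebook reads from the verbose response
sample_response = json.dumps(
    {"probabilities": [0.38, 0.62], "labels": [0, 1], "predicted_label": 1}
).encode("utf-8")


def parse_verbose_response(query_response):
    """Return (probabilities, labels, predicted_label) from a JSON payload."""
    model_predictions = json.loads(query_response)
    return (
        model_predictions["probabilities"],
        model_predictions["labels"],
        model_predictions["predicted_label"],
    )


probabilities, labels, predicted_label = parse_verbose_response(sample_response)

# In this example, the predicted label is the one with the highest probability
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print(labels[best_index], predicted_label)  # prints: 1 1
```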

### 2.6. Compute evaluation metrics
Since it is a binary classification task, we use [accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) and [f1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) as the evaluation metrics.

In [29]:
from sklearn.metrics import accuracy_score, f1_score

# scikit-learn metrics expect the arguments in (y_true, y_pred) order
f1 = f1_score(ground_truth, predict_label)
accuracy = accuracy_score(ground_truth, predict_label)
result = {"Accuracy": [accuracy], "F1 Score": [f1]}

In [30]:
result = pd.DataFrame.from_dict(result, orient="index", columns=["No HPO"])

In [31]:
result

|          | No HPO   |
|----------|----------|
| Accuracy | 0.571429 |
| F1 Score | 0.727273 |


For both accuracy and F1 score, a larger value indicates better performance.
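Both metrics can be computed directly from the confusion-matrix counts, which helps when interpreting the numbers above. Assuming 14 test sentences (1% of the 1,364 sampled rows, rounded up), the reported accuracy and F1 score are consistent with 8 true positives, no true negatives, and 6 errors in total. How those 6 errors split between false positives and false negatives cannot be recovered from the two metrics, so the split used below (the model predicting "positive" for every sentence) is only one possibility.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and the F1 score of the positive class from counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Assumed counts: tp and tn follow from the reported metrics,
# while the fp/fn split is a guess
accuracy, f1 = accuracy_and_f1(tp=8, fp=6, fn=0, tn=0)
print(round(accuracy, 6), round(f1, 6))  # prints: 0.571429 0.727273
```

If the model is indeed predicting a single class, that would fit the very small learning rate (`1e-6`) and single epoch used for this run.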

## 3. Clean Up the endpoint

When you've finished with the text classification endpoint (and associated
endpoint-config), make sure that you delete it to avoid accidental
charges.

In [None]:
# Delete the SageMaker endpoint and the attached resources
#finetuned_predictor.delete_model()
#finetuned_predictor.delete_endpoint()
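The two calls above can be wrapped in a small helper so that clean-up is a single step. The sketch below keeps the same order as the commented-out cell (model first, then endpoint). `delete_endpoint_resources` is a hypothetical name, and the stub class exists only so the example runs without AWS access.

```python
def delete_endpoint_resources(predictor):
    """Delete the model, then the endpoint, in the order used above."""
    predictor.delete_model()
    predictor.delete_endpoint()


class _StubPredictor:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


stub = _StubPredictor()
delete_endpoint_resources(stub)
print(stub.calls)  # prints: ['delete_model', 'delete_endpoint']
```

In the notebook, the call would be `delete_endpoint_resources(finetuned_predictor)`.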