# Introduction to JumpStart - Text Classification

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

---

---
Welcome to Amazon [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html)! You can use JumpStart to solve many machine learning tasks with one click in SageMaker Studio, or programmatically through the [SageMaker JumpStart API](https://sagemaker.readthedocs.io/en/stable/overview.html#use-prebuilt-models-with-sagemaker-jumpstart).

In this demo notebook, we demonstrate how to use the JumpStart API for Text Classification. Text Classification refers to assigning an input sentence to one of the class labels of the training dataset. We demonstrate two use cases of Text Classification models:

* How to use a Transformer model pre-trained on an English dataset and fine-tuned on the [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset to perform Sentiment Analysis.
* How to fine-tune a pre-trained Transformer model on a custom dataset, and then run inference on the fine-tuned model.

Note: This notebook was tested on an ml.t3.medium instance in Amazon SageMaker Studio with the Python 3 (Data Science) kernel and in an Amazon SageMaker Notebook instance with the conda_python3 kernel.

---

1. [Set Up](#1.-Set-Up)
2. [Select a pre-trained model](#2.-Select-a-pre-trained-model)
3. [Run inference on the pre-trained model](#3.-Run-inference-on-the-pre-trained-model)
    * [Retrieve JumpStart Artifacts & Deploy an Endpoint](#3.1.-Retrieve-JumpStart-Artifacts-&-Deploy-an-Endpoint)
    * [Example input sentences for inference](#3.2.-Example-input-sentences-for-inference)
    * [Query endpoint and parse response](#3.3.-Query-endpoint-and-parse-response)
    * [Clean up the endpoint](#3.4.-Clean-up-the-endpoint)
4. [Finetune the pre-trained model on a custom dataset](#4.-Finetune-the-pre-trained-model-on-a-custom-dataset)
    * [Retrieve JumpStart Training artifacts](#4.1.-Retrieve-JumpStart-Training-artifacts)
    * [Set Training parameters](#4.2.-Set-Training-parameters)
    * [Train with Automatic Model Tuning (HPO)](#AMT)
    * [Start Training](#4.4.-Start-Training)
    * [Deploy & run Inference on the fine-tuned model](#4.5.-Deploy-&-run-Inference-on-the-fine-tuned-model)

## 1. Set Up
***
Before executing the notebook, a few setup steps are required. This notebook requires the latest versions of sagemaker and ipywidgets.
***

In [2]:
!pip install sagemaker ipywidgets --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autovizwidget 0.20.5 requires pandas<2.0.0,>=0.20.1, but you have pandas 2.0.3 which is incompatible.
hdijupyterutils 0.20.5 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.0.3 which is incompatible.
sparkmagic 0.20.5 requires nest-asyncio==1.5.5, but you have nest-asyncio 1.5.6 which is incompatible.
sparkmagic 0.20.5 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.0.3 which is incompatible.

---

To train and host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has the necessary permissions, including access to your data in S3.

---

In [3]:
import sagemaker, boto3, json
from sagemaker import get_execution_role

aws_role = get_execution_role()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

## 2. Select a pre-trained model
***
You can continue with the default model, or choose a different model from the dropdown generated by the cells below. A complete list of JumpStart models can also be accessed at [JumpStart Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/jumpstart.html#).
***

In [4]:
model_id = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2"

***
[Optional] Select a different JumpStart model. Here, we download the JumpStart models_manifest file from the JumpStart S3 bucket, keep only the Text Classification models, and select one of them.
***

In [5]:
import IPython
from ipywidgets import Dropdown

# download JumpStart model_manifest file.
boto3.client("s3").download_file(
    f"jumpstart-cache-prod-{aws_region}", "models_manifest.json", "models_manifest.json"
)
with open("models_manifest.json", "rb") as json_file:
    model_list = json.load(json_file)

# Keep only the Text Classification models from the manifest. The manifest lists a model
# once per version, so duplicate model ids are dropped while preserving their order.
tc_models = list(
    dict.fromkeys(model["model_id"] for model in model_list if "-tc-" in model["model_id"])
)

# display the model-ids in a dropdown, for user to select a model.
dropdown = Dropdown(
    value=model_id,
    options=tc_models,
    description="JumpStart Text Classification Models:",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(IPython.display.Markdown("## Select a JumpStart pre-trained model from the dropdown below"))
display(dropdown)

## Select a JumpStart pre-trained model from the dropdown below

Dropdown(description='JumpStart Text Classification Models:', index=24, layout=Layout(width='max-content'), op…
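
The filtering step above can be tried offline on a small in-memory list. The sketch below is a minimal illustration, assuming each manifest entry is a dictionary with a `model_id` key, as the cell above does. The `version` key, the two `example-...` ids, and the helper name `text_classification_model_ids` are hypothetical and only serve the example.

```python
# Hypothetical manifest excerpt: the same model_id can appear once per version.
sample_manifest = [
    {"model_id": "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "version": "1.0.0"},
    {"model_id": "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "version": "2.0.0"},
    {"model_id": "example-ic-some-image-model", "version": "1.0.0"},
    {"model_id": "example-tc-some-text-model", "version": "1.0.0"},
]


def text_classification_model_ids(manifest):
    """Return the unique model ids containing '-tc-', in first-seen order."""
    ids = (entry["model_id"] for entry in manifest if "-tc-" in entry["model_id"])
    return list(dict.fromkeys(ids))  # dict keys keep insertion order and drop repeats


tc_ids = text_classification_model_ids(sample_manifest)
print(tc_ids)
```

Matching on the `-tc-` substring relies on the naming convention of the model ids, so an id that does not follow that convention would be missed.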

## 3. Run inference on the pre-trained model
***
Using JumpStart, we can perform inference on the pre-trained model, even without fine-tuning it first on a custom dataset. For this example, that means predicting which of the 2 classes of the [SST2](https://nlp.stanford.edu/sentiment/index.html) dataset an input sentence belongs to.

***

### 3.1. Retrieve JumpStart Artifacts & Deploy an Endpoint
***
We retrieve the deploy_image_uri, deploy_source_uri, and base_model_uri for the pre-trained model. To host the pre-trained model, we create an instance of [`sagemaker.model.Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) and deploy it.
***

In [6]:
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.utils import name_from_base

# model_version="*" fetches the latest version of the model.
infer_model_id, infer_model_version = dropdown.value, "*"

endpoint_name = name_from_base(f"jumpstart-example-{infer_model_id}")

inference_instance_type = "ml.m5.xlarge"

# Retrieve the inference docker container uri.
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=infer_model_id,
    model_version=infer_model_version,
    instance_type=inference_instance_type,
)
# Retrieve the inference script uri.
deploy_source_uri = script_uris.retrieve(
    model_id=infer_model_id, model_version=infer_model_version, script_scope="inference"
)
# Retrieve the base model uri.
base_model_uri = model_uris.retrieve(
    model_id=infer_model_id, model_version=infer_model_version, model_scope="inference"
)
# Create the SageMaker model instance. Note that we need to pass Predictor class when we deploy model through Model class,
# for being able to run inference through the sagemaker API.
model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=base_model_uri,
    entry_point="inference.py",
    role=aws_role,
    predictor_cls=Predictor,
    name=endpoint_name,
)
# deploy the Model.
base_model_predictor = model.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    endpoint_name=endpoint_name,
)

----!

### 3.2. Example input sentences for inference
***
These examples are taken from SST2 dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html). 
***

In [7]:
text1 = "astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment"
text2 = "simply stupid , irrelevant and deeply , truly , bottomlessly cynical "

### 3.3. Query endpoint and parse response
***
Input to the endpoint is a single sentence. Response from the endpoint is a dictionary containing the predicted class label, and a list of class label probabilities.
***

In [8]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"


def query_endpoint(encoded_text):
    response = base_model_predictor.predict(
        encoded_text, {"ContentType": "application/x-text", "Accept": "application/json;verbose"}
    )
    return response


def parse_response(query_response):
    model_predictions = json.loads(query_response)
    probabilities, labels, predicted_label = (
        model_predictions["probabilities"],
        model_predictions["labels"],
        model_predictions["predicted_label"],
    )
    return probabilities, labels, predicted_label


for text in [text1, text2]:
    query_response = query_endpoint(text.encode("utf-8"))
    probabilities, labels, predicted_label = parse_response(query_response)
    print(
        f"Inference:{newline}"
        f"Input text: '{text}'{newline}"
        f"Model prediction: {probabilities}{newline}"
        f"Labels: {labels}{newline}"
        f"Predicted Label: {bold}{predicted_label}{unbold}{newline}"
    )

Inference:
Input text: 'astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment'
Model prediction: [0.000452965876, 0.999547064]
Labels: [0, 1]
Predicted Label: 1

Inference:
Input text: 'simply stupid , irrelevant and deeply , truly , bottomlessly cynical '
Model prediction: [0.998723, 0.0012769578]
Labels: [0, 1]
Predicted Label: 0
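
The response parsing can also be exercised without a live endpoint. The following sketch assumes the response body has the same three keys as the output above (`probabilities`, `labels`, `predicted_label`) and that the two lists are aligned position by position, which is consistent with the printed results but is an assumption here. The sample payload and the helper name `top_prediction` are hypothetical.

```python
import json

# Hypothetical response body, shaped like the verbose JSON output printed above.
sample_response = json.dumps(
    {"probabilities": [0.000453, 0.999547], "labels": [0, 1], "predicted_label": 1}
)


def top_prediction(response_body):
    """Return (label, probability) for the highest-probability class."""
    payload = json.loads(response_body)
    probabilities, labels = payload["probabilities"], payload["labels"]
    # Index of the largest probability; labels[best] is the matching class label.
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]


label, confidence_score = top_prediction(sample_response)
print(label, confidence_score)
```

Looking the label up through the position of the largest probability avoids assuming that a label's value equals its index in the list.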



### 3.4. Clean up the endpoint

In [9]:
# Delete the SageMaker endpoint and the attached resources
base_model_predictor.delete_model()
base_model_predictor.delete_endpoint()

## 4. Finetune the pre-trained model on a custom dataset
***
Previously, we saw how to run inference on a pre-trained model that was fine-tuned on the SST2 dataset. Next, we discuss how a model can be fine-tuned on a custom dataset with any number of classes.

The Text Embedding model can be fine-tuned on any text classification dataset in the same way the 
model available for inference has been fine-tuned on the SST2 movie review dataset.

The model available for fine-tuning attaches a classification layer to the Text Embedding model 
and initializes the layer parameters to random values. 
The output dimension of the classification layer is determined based on the number of classes 
detected in the input data. The fine-tuning step fine-tunes all the model 
parameters to minimize prediction error on the input data and returns the fine-tuned model. 
The model returned by fine-tuning can be further deployed for inference. 
Below are the instructions for how the training data should be formatted for input to the model.


- **Input:** A directory containing a 'data.csv' file. 
    - Each row of the first column of 'data.csv' should have an integer class label between 0 and the number of classes.
    - Each row of the second column should have the corresponding text. 
- **Output:** A trained model that can be deployed for inference. 
 
Below is an example of 'data.csv' file showing values in its first two columns. Note that the file should not have any header.

|   |   |
|---|---|
|0|hide new secretions from the parental units|
|0|contains no wit , only labored gags|
|1|that loves its characters and communicates something rather beautiful about human nature|
|...|...|
 
source: [TensorFlow Hub](model_url). License:[Apache 2.0 License](https://jumpstart-cache-alpha-us-west-2.s3-us-west-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt).
 
SST2 dataset is downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2).
 [Apache 2.0 License](https://jumpstart-cache-prod-us-west-2.s3-us-west-2.amazonaws.com/licenses/Apache-License/LICENSE-2.0.txt).
  [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html). 
***
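
As a quick illustration of the layout described above, the sketch below writes a header-less, two-column CSV (label first, text second) into an in-memory buffer using only the standard library, then reads it back. The two rows are taken from the example table. Writing to a buffer instead of a real 'data.csv' file is a choice made for this sketch.

```python
import csv
import io

# (label, text) pairs taken from the example table above.
rows = [
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather beautiful about human nature"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)  # no header row: label first, text second
csv_text = buffer.getvalue()
print(csv_text)

# Read the data back to confirm the layout and count the distinct classes.
parsed = list(csv.reader(io.StringIO(csv_text)))
num_classes = len({int(label) for label, _ in parsed})
print(num_classes)
```

Because the text column may contain commas, the writer quotes such fields automatically, and a CSV reader restores them unchanged.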

### 4.1. Retrieve JumpStart Training artifacts
***
Here, for the selected model, we retrieve the training docker container, the training algorithm source, the pre-trained model, and a Python dictionary of the training hyper-parameters that the algorithm accepts, with their default values. Note that model_version="*" fetches the latest model. We also need to specify the training_instance_type to fetch train_image_uri.
***

In [10]:
from sagemaker import image_uris, model_uris, script_uris, hyperparameters

model_id, model_version = dropdown.value, "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)
# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)
# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

In [11]:
train_source_uri

's3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/tensorflow/transfer_learning/tc/v2.0.1/sourcedir.tar.gz'

### 4.2. Set Training parameters
***
Now that the setup is complete, we are ready to fine-tune our Text Classification model. To begin, let us create a [``sagemaker.estimator.Estimator``](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) object. This estimator will launch the training job.

There are two kinds of parameters that need to be set for training.

The first kind are the parameters for the training job. These include: (i) Training data path: the S3 folder in which the input data is stored. (ii) Output path: the S3 folder in which the training output is stored. (iii) Training instance type: the type of machine on which to run the training. Typically, we use GPU instances for this kind of training. We defined the training instance type above to fetch the correct train_image_uri.

The second kind are the algorithm-specific training hyper-parameters.
***

#### Load, preprocess and store training data in S3

In [33]:
import pandas as pd
import boto3
from sklearn.preprocessing import LabelEncoder

In [34]:
df_train = pd.read_csv('train.csv')

In [35]:
import re
df_train = df_train[pd.notnull(df_train.cmdb_ci)]
df_train = df_train[pd.notnull(df_train.short_description)]

df_train.short_description = df_train.short_description.apply(lambda text: re.sub(' +', ' ', text))

df_train.cmdb_ci = LabelEncoder().fit_transform(df_train.cmdb_ci)
df_train

| | short_description | cmdb_ci |
|---|---|---|
| 0 | Excel Macro Progrem not Working | 13 |
| 1 | # [REM-GSB] I've received this message 3-4 tim... | 3 |
| 2 | *[SDSS ONS] Wireless Network troubles in Mitch... | 4 |
| 3 | iPad - how to mark as a shared device? | 13 |
| 4 | [SWEEP] - SRWC A323 | 2 |
| ... | ... | ... |
| 25384 | Image Macbook 13" c02zj0x7lvdp | 13 |
| 25385 | trying to ensure my computer is compliant | 4 |
| 25386 | 3 Cisco phones need to be picked up | 9 |
| 25387 | [SWEEP] - SRWC U565 | 2 |
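
The preprocessing in the cell above (drop rows with missing values, collapse repeated spaces, encode class names as integers) can be sketched in plain Python on a few rows. The ticket texts and class names below are hypothetical. The sorted-order mapping mirrors how scikit-learn's `LabelEncoder` assigns ids to its sorted list of classes.

```python
import re

# Hypothetical ticket rows: (short_description, cmdb_ci), with one missing text.
raw_rows = [
    ("Excel   Macro not  working", "Software"),
    ("Wireless network  troubles", "Network"),
    (None, "Network"),
    ("iPad - shared device?", "Software"),
]

# 1. Drop rows where either field is missing.
rows = [(text, ci) for text, ci in raw_rows if text is not None and ci is not None]

# 2. Collapse runs of spaces into a single space.
rows = [(re.sub(" +", " ", text), ci) for text, ci in rows]

# 3. Map each class name to an integer id, following the sorted order of the names.
class_to_id = {name: idx for idx, name in enumerate(sorted({ci for _, ci in rows}))}

# Label first, text second, matching the 'data.csv' layout expected for training.
encoded = [(class_to_id[ci], text) for text, ci in rows]
print(encoded)
```

Keeping the `class_to_id` mapping around is useful later, because the endpoint returns integer labels that have to be translated back to class names.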


In [36]:
df_train = df_train[['cmdb_ci', 'short_description']]
df_train

| | cmdb_ci | short_description |
|---|---|---|
| 0 | 13 | Excel Macro Progrem not Working |
| 1 | 3 | # [REM-GSB] I've received this message 3-4 tim... |
| 2 | 4 | *[SDSS ONS] Wireless Network troubles in Mitch... |
| 3 | 13 | iPad - how to mark as a shared device? |
| 4 | 2 | [SWEEP] - SRWC A323 |
| ... | ... | ... |
| 25384 | 13 | Image Macbook 13" c02zj0x7lvdp |
| 25385 | 4 | trying to ensure my computer is compliant |
| 25386 | 9 | 3 Cisco phones need to be picked up |
| 25387 | 2 | [SWEEP] - SRWC U565 |


In [37]:
df_train.to_csv('data.csv', header=False, index=False)

In [38]:
s3 = boto3.client('s3')
training_data_filepath = 'data.csv'
training_data_bucket = sess.default_bucket()
training_data_prefix = f'jumpstart_tc/training/{training_data_filepath}'
training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

s3.upload_file(training_data_filepath, training_data_bucket, training_data_prefix)

In [39]:
# S3 location where the training job stores its output (the fine-tuned model artifacts)


output_bucket = sess.default_bucket()
output_prefix = "jumpstart-tc-training"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

***
For algorithm-specific hyper-parameters, we start by fetching a Python dictionary of the training hyper-parameters that the algorithm accepts, with their default values. These can then be overridden with custom values.
***

In [40]:
from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["batch_size"] = "64"
hyperparameters["epochs"] = "4"

hyperparameters

{'train_only_top_layer': 'False',
 'epochs': '4',
 'batch_size': '64',
 'optimizer': 'adamw',
 'learning_rate': '2e-05',
 'warmup_steps_fraction': '0.1',
 'beta_1': '0.9',
 'beta_2': '0.999',
 'momentum': '0.9',
 'epsilon': '1e-06',
 'rho': '0.95',
 'initial_accumulator_value': '0.1',
 'early_stopping': 'False',
 'early_stopping_patience': '5',
 'early_stopping_min_delta': '0.0',
 'dropout_rate': '0.2',
 'regularizers_l2': '0.01',
 'validation_split_ratio': '0.2',
 'reinitialize_top_layer': 'Auto'}
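
The override step can be sketched as a small helper that copies the defaults, applies custom values as strings (the dictionary above stores every value as a string), and rejects names the algorithm does not list. The helper name `override`, the strict name check, and the default values for `epochs` and `batch_size` below are assumptions of this sketch, not SageMaker behavior.

```python
# Hypothetical subset of the defaults; all values are stored as strings.
defaults = {"epochs": "5", "batch_size": "32", "learning_rate": "2e-05"}


def override(default_values, **custom):
    """Return a copy of `default_values` with overrides applied, rejecting unknown names."""
    unknown = set(custom) - set(default_values)
    if unknown:
        raise KeyError(f"Unsupported hyperparameters: {sorted(unknown)}")
    merged = dict(default_values)  # copy, so the defaults stay untouched
    merged.update({name: str(value) for name, value in custom.items()})
    return merged


tuned = override(defaults, batch_size=64, epochs=4)
print(tuned)
```

Rejecting unknown names early catches typos such as a misspelled hyperparameter before a training job is launched.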

### 4.3. Train with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) <a id='AMT'></a>
***
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.
***

In [41]:
from sagemaker.tuner import ContinuousParameter

# Use AMT for tuning and selecting the best model
use_amt = False

# Define objective metric per framework, based on which the best model will be selected.
metric_definitions_per_model = {
    "tensorflow": {
        "metrics": [{"Name": "val_accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"}],
        "type": "Maximize",
    }
}

# You can select from the hyperparameters supported by the model, and configure ranges of values to be searched
# for training the optimal model (https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html).
# The name must match a hyperparameter accepted by the algorithm; the defaults retrieved above list `learning_rate`.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.00001, 0.01, scaling_type="Logarithmic")
}

# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 6
# Change parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# if max_jobs=max_parallel_jobs then Bayesian search turns to Random.
max_parallel_jobs = 2
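
The objective metric is read from the training logs with the regular expression given in `metric_definitions_per_model`. The sketch below applies that same pattern to a single log line to show what gets captured. The log line itself is hypothetical, and real training logs may be laid out differently.

```python
import re

# Regex copied from metric_definitions_per_model above.
pattern = re.compile("val_accuracy: ([0-9\\.]+)")

# Hypothetical Keras-style log line.
log_line = "Epoch 2/4 - loss: 0.4312 - accuracy: 0.8611 - val_loss: 0.3977 - val_accuracy: 0.8742"

match = pattern.search(log_line)
# The first capture group holds the numeric value of the metric.
val_accuracy = float(match.group(1)) if match else None
print(val_accuracy)
```

The pattern includes the `val_` prefix, so the training `accuracy` value earlier in the same line is not picked up by mistake.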

### 4.4. Start Training
***
We start by creating the estimator object with all the required assets and then launch the training job.
***

In [42]:
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
from sagemaker.tuner import HyperparameterTuner

training_job_name = name_from_base(f"jumpstart-example-{model_id}-transfer-learning")

# Create SageMaker Estimator instance
tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    base_job_name=training_job_name,
)

if use_amt:
    metric_definitions = next(
        value for key, value in metric_definitions_per_model.items() if model_id.startswith(key)
    )

    hp_tuner = HyperparameterTuner(
        tc_estimator,
        metric_definitions["metrics"][0]["Name"],
        hyperparameter_ranges,
        metric_definitions["metrics"],
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
        objective_type=metric_definitions["type"],
        base_tuning_job_name=training_job_name,
    )

    # Launch a SageMaker Tuning job to search for the best hyperparameters
    hp_tuner.fit({"training": training_dataset_s3_path})
else:
    # Launch a SageMaker Training job by passing s3 path of the training data
    tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)

INFO:sagemaker:Creating training-job with name: jumpstart-example-tensorflow-tc-bert-en-2023-08-31-17-26-54-055


2023-08-31 17:26:54 Starting - Starting the training job......
2023-08-31 17:27:27 Starting - Preparing the instances for training......
2023-08-31 17:28:37 Downloading - Downloading input data...
2023-08-31 17:29:17 Training - Downloading the training image........................
2023-08-31 17:33:03 Training - Training image download completed. Training in progress..
2023-08-31 17:33:25.624280: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2023-08-31 17:33:25.624493: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2023-08-31 17:33:25.668424: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2023-08-31 17:33:28,235 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training


'_input_model_extracted/__models_info__.json' file could not be found.
No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Attaching a randomly initialized classification layer on top of the original encoder layer model to classify input text to one of the 22 classes.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
 text (InputLayer)              [(None,)]            0           []
 keras_layer (KerasLayer)       {'input_type_ids':   0           ['text[0][0]']
                                (None, 128),
                                 'input_word_ids':
                                (None, 128),

### 4.5. Deploy & run Inference on the fine-tuned model
***
A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the class label of an input sentence. We follow the same steps as in [3. Run inference on the pre-trained model](#3.-Run-inference-on-the-pre-trained-model). We start by retrieving the JumpStart artifacts for deploying an endpoint. However, instead of the base model, we deploy the `tc_estimator` that we fine-tuned (or `hp_tuner`, if `use_amt` is enabled).
***

In [43]:
inference_instance_type = "ml.m5.xlarge"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)
# Retrieve the inference script uri
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)

endpoint_name = name_from_base(f"jumpstart-example-FT-{model_id}-")

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = (hp_tuner if use_amt else tc_estimator).deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    entry_point="inference.py",
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    endpoint_name=endpoint_name,
)

INFO:sagemaker.image_uris:Ignoring unnecessary Python version: py39.
INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-305283204878/jumpstart-tc-training/output/jumpstart-example-tensorflow-tc-bert-en-2023-08-31-17-26-54-055/output/model.tar.gz), script artifact (s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/tensorflow/inference/tc/v2.0.0/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-305283204878/sagemaker-jumpstart-2023-08-31-17-54-37-003/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: sagemaker-jumpstart-2023-08-31-17-54-37-003
INFO:sagemaker:Creating endpoint-config with name jumpstart-example-FT-tensorflow-tc-bert-2023-08-31-17-54-37-003
INFO:sagemaker:Creating endpoint with name jumpstart-example-FT-tensorflow-tc-bert-2023-08-31-17-54-37-003


----!

---
Next, we prepare the input sentences for running inference. Here, the inputs are the `short_description` values read from `test.csv`.
The two examples that are commented out in the cell below are taken from the SST2 dataset downloaded from [TensorFlow](https://www.tensorflow.org/datasets/catalog/glue#gluesst2). [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0). [Dataset Homepage](https://nlp.stanford.edu/sentiment/index.html).

---

In [48]:
# text1 = "astonishing ... ( frames ) profound ethical and philosophical questions in the form of dazzling pop entertainment"
# text2 = "simply stupid , irrelevant and deeply , truly , bottomlessly cynical "
df_test = pd.read_csv("test.csv")
predictor_input = df_test.drop('cmdb_ci', axis=1).short_description.values

In [49]:
predictor_input

array(["trying to call southbay clinic (ccsb clinic) and it's asking for ID and pin",
       '++logged into Stanford VPN to access lab\'s server, gene. However, whenever when trying to access the server through the powershell, we receive an time out error. "ssh: connect to host gene.stanford.edu port 22: Connection timed out"',
       '* [ONSITE - ORA - CARD HALL LOBBY] Latitude 7420 not working - duplicate ticket of INC01804449 (ORA-LOANER-01 assigned) (loaner handoff appt. 1/31 @ 11:30 AM)',
       ..., 'Updating SUNet password and Cardinal Key questions',
       'SUNet ID PW Reset',
       'Citrix workspace issue | 800 Welch West | CCTO 3rd floor Rad Dept'],
      dtype=object)

---
Next, we query the finetuned model, parse the response and print the predictions.

---

In [54]:
from tqdm import tqdm

In [55]:
newline, bold, unbold = "\n", "\033[1m", "\033[0m"


def query_endpoint(encoded_text):
    response = finetuned_predictor.predict(
        encoded_text, {"ContentType": "application/x-text", "Accept": "application/json;verbose"}
    )
    return response


def parse_response(query_response):
    model_predictions = json.loads(query_response)
    probabilities, labels, predicted_label = (
        model_predictions["probabilities"],
        model_predictions["labels"],
        model_predictions["predicted_label"],
    )
    return probabilities, labels, predicted_label

predictions = []
confidence = []

for text in tqdm(predictor_input):
    query_response = query_endpoint(text.encode("utf-8"))
    probabilities, labels, predicted_label = parse_response(query_response)
    predictions.append(predicted_label)
    # Confidence is the probability assigned to the predicted class.
    confidence.append(probabilities[int(predicted_label)])

100%|██████████| 3700/3700 [12:38<00:00,  4.88it/s]


In [59]:
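def measure_coverage_accuracy(true_labels, predicted_labels_and_scores, eval_fn):
    """Print coverage and accuracy at several confidence thresholds.

    Coverage is the fraction of samples whose confidence is at or above the threshold;
    eval_fn (for example, accuracy) is computed on those covered samples only.
    """
    thresholds = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
    total = len(predicted_labels_and_scores)
    print("Threshold     |     Coverage     |     Accuracy    ")
    for threshold in thresholds:
        # Indexes of the predictions that are confident enough at this threshold.
        indexes = [
            i for i, (_, conf) in enumerate(predicted_labels_and_scores) if float(conf) >= threshold
        ]

        # Use the actual number of samples rather than a hard-coded test-set size.
        coverage = len(indexes) / total

        true_covered = [true_labels[index] for index in indexes]
        pred_covered = [predicted_labels_and_scores[index][0] for index in indexes]

        score = eval_fn(true_covered, pred_covered)

        print(threshold, "    |     ", coverage, "    |    ", score)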


In [83]:
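import numpy as np


def measure_coverage_ci(true_labels, predicted_labels, scores, le, threshold=0.75):
    """Per-class accuracy over the samples whose confidence is at or above `threshold`."""
    # Convert to arrays so the comparisons below are element-wise. Comparing a plain
    # Python list with a scalar yields a single False instead of a boolean mask.
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    covered_scores = np.asarray(scores) >= threshold
    coverage_ci = {}

    for label in np.unique(true_labels):
        y_true_bool = true_labels == label
        y_pred_bool = predicted_labels == label
        covered_and_correct = covered_scores & y_true_bool & y_pred_bool

        # Fraction of the covered samples of this class that were predicted correctly.
        label_accuracy = np.sum(covered_and_correct) / np.sum(y_true_bool & covered_scores)
        coverage_ci[le.inverse_transform([label])[0]] = label_accuracy

    return coverage_ci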
    

In [84]:
from sklearn.metrics import accuracy_score
import numpy as np

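# LabelEncoder assigns ids following the sorted order of the class names, so fitting a new
# encoder on the test set matches the training encoding only if both files contain the
# same set of classes.
test_label_encoder = LabelEncoder()
y_true = test_label_encoder.fit_transform(df_test.cmdb_ci)
y_pred = predictions
confidence = np.array(confidence)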

predicted_label_and_scores = list(zip(y_pred, confidence))
measure_coverage_accuracy(y_true, predicted_label_and_scores, accuracy_score)

Threshold     |     Coverage     |     Accuracy    
0.7     |      0.657027027027027     |     0.8819415878239407
0.75     |      0.6167567567567568     |     0.8965819456617002
0.8     |      0.5659459459459459     |     0.9140401146131805
0.85     |      0.5202702702702703     |     0.9324675324675324
0.9     |      0.46054054054054056     |     0.9471830985915493
0.95     |      0.3608108108108108     |     0.9737827715355806
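
The table above shows the usual trade-off: a higher confidence threshold keeps fewer predictions (lower coverage), but the ones it keeps are more often correct. The sketch below reproduces that computation on five hypothetical predictions in plain Python. The function name `coverage_and_accuracy` and the toy data are made up for illustration.

```python
def coverage_and_accuracy(true_labels, predicted_labels, scores, threshold):
    """Coverage: share of samples scored >= threshold. Accuracy: computed on those samples only."""
    kept = [i for i, score in enumerate(scores) if score >= threshold]
    coverage = len(kept) / len(scores)
    if not kept:
        return coverage, None  # accuracy is undefined when nothing is covered
    correct = sum(true_labels[i] == predicted_labels[i] for i in kept)
    return coverage, correct / len(kept)


# Hypothetical toy data: five predictions with their confidence scores.
y_true_toy = [0, 1, 1, 2, 0]
y_pred_toy = [0, 1, 2, 2, 1]
scores_toy = [0.95, 0.90, 0.60, 0.85, 0.72]

for threshold in (0.70, 0.90):
    print(threshold, coverage_and_accuracy(y_true_toy, y_pred_toy, scores_toy, threshold))
```

In a ticket-routing setting like this one, the predictions below the threshold would typically be left for manual handling, so the threshold balances automation rate against error rate.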


In [85]:
label_accuracies = measure_coverage_ci(y_true, y_pred, confidence, test_label_encoder)

In [86]:
label_accuracies.values()

dict_values([0.98, 0.8085106382978723, 0.9974025974025974, 0.9547511312217195, 0.06818181818181818, 0.9507042253521126, 0.0, 0.09090909090909091, 0.5833333333333334, 0.9469026548672567, 0.9117647058823529, 0.7368421052631579, 0.8571428571428571, 0.8648648648648649, 0.9714285714285714, 0.9722222222222222, 0.88, 0.8, 1.0, 0.9761904761904762, 0.9831932773109243, 1.0])
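
The per-class values above vary widely: most classes are above 0.8, while a few are at or near zero, which the overall accuracy hides. A minimal sketch for surfacing such classes follows. The class names `class_a` to `class_d`, the 0.5 cutoff, and the helper name `weak_classes` are hypothetical; the four accuracy values are copied from the output above.

```python
# Hypothetical class names paired with a few of the per-class values printed above.
label_accuracies_toy = {
    "class_a": 0.98,
    "class_b": 0.06818181818181818,
    "class_c": 0.0,
    "class_d": 0.9547511312217195,
}


def weak_classes(per_class_accuracy, cutoff=0.5):
    """Return (name, accuracy) pairs below `cutoff`, worst first."""
    weak = [(name, acc) for name, acc in per_class_accuracy.items() if acc < cutoff]
    return sorted(weak, key=lambda item: item[1])


print(weak_classes(label_accuracies_toy))
```

Classes flagged this way are candidates for more training examples or for merging with a closely related class.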

---
Next, we clean up the deployed endpoint.

---

In [87]:
# Delete the SageMaker endpoint and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: sagemaker-jumpstart-2023-08-31-17-54-37-003
INFO:sagemaker:Deleting endpoint configuration with name: jumpstart-example-FT-tensorflow-tc-bert-2023-08-31-17-54-37-003
INFO:sagemaker:Deleting endpoint with name: jumpstart-example-FT-tensorflow-tc-bert-2023-08-31-17-54-37-003


## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/introduction_to_amazon_algorithms|jumpstart_text_classification|Amazon_JumpStart_Text_Classification.ipynb)
