# GPT-2 News Classifier

## Introduction

In this notebook, I will demonstrate how to train and deploy a fine-tuned GPT-2 text classification model using Amazon SageMaker.

Here I picked text classification as a sample use case, but the same approach can be applied to any kind of machine learning model, from RandomForest classification models to YOLO object-detection models. In this news classifier application, a user types or pastes a news paragraph, and the model predicts its category along with a probability. To achieve this, we will fine-tune a GPT-2 model and deploy it from this SageMaker Notebook.

I used PyTorch and the HuggingFace transformers library for model development. For more information about PyTorch in SageMaker, please visit the [sagemaker-pytorch-training-toolkit](https://github.com/aws/sagemaker-pytorch-training-toolkit) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

## Packages

First, install all of the required Python packages on this SageMaker Notebook instance.

In [19]:
# !pip install -U sagemaker
# !pip3 install torch==1.10.1
# !pip3 install transformers==4.15.0
# !pip3 install pandas
# !pip3 install numpy
# !pip3 install tqdm==4.62.3
# !pip3 install streamlit

Then, import all of the required Python packages for this notebook.

In [34]:
# imports
import pandas as pd
import torch
import sagemaker

from sagemaker.pytorch import PyTorch
from sagemaker.pytorch.model import PyTorchModel
from sagemaker.predictor import json_deserializer, json_serializer
from transformers import GPT2Tokenizer

import utils

## Setup

This notebook was created and tested on an ml.t2.medium notebook instance.

Let's start by creating a **SageMaker session** and specifying two parameters: the **S3 bucket** and **prefix** that you want to use for training and model data. The bucket should be in the same region as the Notebook Instance, training, and hosting.

The **IAM role** is used to give training and hosting access to your data. See the documentation for how to create one. Note that if more than one role is required for notebook instances, training, and/or hosting, please replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

In [21]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = "gpt2-news-classifier"
prefix = "data"

## Read Data from S3 bucket

Before doing this step, make sure you create an S3 bucket ("gpt2-news-classifier") in your AWS console (see [https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html)), and upload our `data` folder (located in the same directory as this notebook) to the S3 bucket. S3 bucket names are globally unique, so you may need to choose a different name and update `bucket` above accordingly.
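If you prefer to upload the `data` folder from code rather than through the console, the following is a minimal sketch. The helper name `upload_folder` is hypothetical (it is not part of this project), and it assumes `boto3` is installed and that the notebook's role is allowed to write to the bucket. It relies only on the boto3 S3 client's `upload_file(Filename, Bucket, Key)` call.

```python
from pathlib import Path


def upload_folder(local_dir, bucket, prefix, s3_client):
    """Upload every file under local_dir to s3://<bucket>/<prefix>/<relative path>.

    s3_client is expected to behave like boto3.client("s3").
    """
    local_dir = Path(local_dir)
    uploaded = []
    for path in sorted(local_dir.rglob("*")):
        if not path.is_file():
            continue
        # Preserve the folder layout: train/train.csv under local_dir
        # becomes <prefix>/train/train.csv in the bucket.
        key = f"{prefix}/{path.relative_to(local_dir).as_posix()}"
        s3_client.upload_file(str(path), bucket, key)
        uploaded.append(f"s3://{bucket}/{key}")
    return uploaded


def main():
    # boto3 is imported here so the helper above stays importable without it.
    import boto3

    for uri in upload_folder("data", "gpt2-news-classifier", "data", boto3.client("s3")):
        print(uri)
```

Calling `main()` from a notebook cell would upload `data/train/train.csv` and the other files to the locations that the cells below read from.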

In [22]:
train_data_key = "train/train.csv"
val_data_key = "val/val.csv"
test_data_key = "test/test.csv"

train_data_location = 's3://{}/{}/{}'.format(bucket, prefix, train_data_key)
val_data_location = 's3://{}/{}/{}'.format(bucket, prefix, val_data_key)
test_data_location = 's3://{}/{}/{}'.format(bucket, prefix, test_data_key)

In [23]:
pd.read_csv(train_data_location).head()

Unnamed: 0.1,Unnamed: 0,category,text
0,848,tech,slimmer playstation triple sales sony playstat...
1,899,sport,williams stays on despite dispute matt william...
2,553,tech,games win for blu-ray dvd format the next-gene...
3,2071,business,lesotho textile workers lose jobs six foreign-...
4,1155,politics,bid to cut court witness stress new targets to...


In [24]:
pd.read_csv(val_data_location).head()

Unnamed: 0.1,Unnamed: 0,category,text
0,820,sport,hantuchova in dubai last eight daniela hantuch...
1,800,tech,more women turn to net security older people a...
2,98,business,japan narrowly escapes recession japan s econo...
3,310,politics,mps issued with blackberry threat mps will be ...
4,623,sport,bryan twins keep us hopes alive the united sta...


In [25]:
pd.read_csv(test_data_location).head()

Unnamed: 0.1,Unnamed: 0,category,text
0,1510,business,soros group warns of kazakh close the open soc...
1,962,entertainment,german music in a zombie state the german mu...
2,588,sport,mourinho takes swipe at arsenal chelsea boss j...
3,1673,entertainment,mogul wilson backing uk rap band tony wilson ...
4,2171,sport,palace threat over cantona masks manchester un...


## Train GPT-2 model (Fine-tuning a pre-trained GPT-2 model)

### Training Script

Our training script `train_deploy.py` provides all the code we need for training and hosting a SageMaker model (including the `model_fn` function that loads the model). The training script is very similar to one you might run outside of SageMaker, but you can access useful properties of the training environment through various environment variables, such as:

- `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
- `SM_NUM_GPUS`: The number of GPUs available in the current container.
- `SM_CURRENT_HOST`: The name of the current container on the container network.
- `SM_HOSTS`: A JSON-encoded list containing all the hosts.

Supposing an input channel named `train` was used in the call to the PyTorch estimator's `fit()` method, the following will be set, following the format `SM_CHANNEL_[channel_name]`:

- `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the `train` channel.

The `val` and `test` channels used in this notebook are exposed in the same way, as `SM_CHANNEL_VAL` and `SM_CHANNEL_TEST`.

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to model_dir so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.

Because SageMaker imports the training script, you should put your training code in a main guard (`if __name__ == '__main__':`) if you use the same script to host your model, as we do in this example, so that SageMaker does not inadvertently run your training code at the wrong point in execution.
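As a rough illustration of these conventions (not the actual contents of `train_deploy.py`), a training script skeleton might read hyperparameters with `argparse`, take paths from the `SM_*` environment variables, and keep the training call behind a main guard. The local fallback paths and the `train` placeholder below are assumptions made for this sketch.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments,
    # e.g. --epochs 1 --batch-size 2 --lr 1e-05.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--batch-size", type=int, default=2)
    parser.add_argument("--lr", type=float, default=1e-5)

    # Paths and resources come from environment variables set in the container.
    # The fallback values are assumptions so the script can also run locally.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--val-dir", default=os.environ.get("SM_CHANNEL_VAL", "./data/val"))
    parser.add_argument("--test-dir", default=os.environ.get("SM_CHANNEL_TEST", "./data/test"))
    parser.add_argument("--num-gpus", type=int, default=int(os.environ.get("SM_NUM_GPUS", "0")))

    # parse_known_args ignores any extra arguments instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


def train(args):
    """Placeholder for the real training loop (hypothetical)."""
    print(f"Training for {args.epochs} epoch(s), batch size {args.batch_size}, lr {args.lr}")
    print(f"Reading training data from {args.train_dir}; saving model to {args.model_dir}")


if __name__ == "__main__":
    # Runs only when the file is executed as a script,
    # not when it is imported to host the model.
    train(parse_args())
```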

For example, the script run by this notebook (uncomment to see the code):

In [29]:
# Uncomment to see the script code. 
# !pygmentize ./code/train_deploy.py

### Run training in SageMaker Notebook

SageMaker's `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. In this case we are going to run our training job on one `ml.m5.4xlarge` instance, but this example can be run on one or multiple CPU or GPU instances ([see the full list of available instances](https://aws.amazon.com/sagemaker/pricing/)). The `hyperparameters` parameter is a dict of values that will be passed to your training script; you can see how to access these values in the `train_deploy.py` script above.

In [31]:
pytorch_estimator = PyTorch(entry_point='train_deploy.py',
                            source_dir="code",
                            role=role,
                            instance_type='ml.m5.4xlarge',
                            instance_count=1,
                            framework_version='1.10',
                            py_version='py38',
                            hyperparameters = {
                                'epochs': 1, 
                                'batch-size': 2, 
                                'lr': 1e-5
                            })

After we've constructed our PyTorch object, we can fit it using the data we uploaded to the S3 bucket. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [32]:
pytorch_estimator.fit({
    'train': train_data_location,
    'val': val_data_location,
    'test': test_data_location
                      })

2022-01-26 20:00:11 Starting - Starting the training job...
2022-01-26 20:00:23 Starting - Launching requested ML instancesProfilerReport-1643227211: InProgress
......
2022-01-26 20:01:41 Starting - Preparing the instances for training.........
2022-01-26 20:03:13 Downloading - Downloading input data...
2022-01-26 20:03:38 Training - Downloading the training image.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-26 20:04:25,754 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-01-26 20:04:25,756 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-01-26 20:04:25,763 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.

2022-01-26 20:04:38 Training - Training image download completed. Training in progress.
2022-01-26 20:04:31,986 sagemaker_pytorch_co

## Deploy the model to a SageMaker Endpoint

After training, the model is saved automatically to an S3 bucket, and its location is available through the estimator's `model_data` attribute. We can either go to this bucket in the AWS console and download the model for deployment elsewhere (e.g. an EC2 instance), or deploy the model right here from this notebook. I will show the latter here.

As mentioned above, `train_deploy.py` implements the required `model_fn` function. We also adapted the `input_fn`, `predict_fn` and `output_fn` functions to our specific text processing steps, as instructed [here](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).
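To make the roles of these four functions concrete, here is a simplified sketch of how they fit together. It is not the code in `train_deploy.py`: the model is replaced by a hypothetical stand-in that returns a fixed, uniform distribution over the five categories, so the example runs without PyTorch. Only the function names and argument order follow the SageMaker PyTorch serving convention.

```python
import json

CATEGORIES = ["politics", "business", "sport", "entertainment", "tech"]


def model_fn(model_dir):
    """Load the model from model_dir. Here: a stand-in that ignores model_dir."""
    # Hypothetical placeholder; the real function would load the
    # fine-tuned GPT-2 weights saved during training.
    def dummy_model(text):
        return {category: 1.0 / len(CATEGORIES) for category in CATEGORIES}

    return dummy_model


def input_fn(request_body, request_content_type):
    """Deserialize the request; only JSON of the form {"text": ...} is accepted."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)["text"]


def predict_fn(input_data, model):
    """Run the model on the deserialized input."""
    return model(input_data)


def output_fn(prediction, accept):
    """Serialize the prediction back to JSON for the caller."""
    return json.dumps(prediction)


# Chain the handlers the way the serving container would for one request.
model = model_fn("unused-model-dir")
text = input_fn(json.dumps({"text": "some news paragraph"}), "application/json")
response_body = output_fn(predict_fn(text, model), "application/json")
print(response_body)
```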

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances and then deploy the Endpoint to a fleet of CPU-based instances, but you need to make sure that you return or save your model as a CPU model, similar to what we did in `train_deploy.py`. Here we will deploy the model to a single `ml.t2.medium` instance.

To deploy the model from this notebook instance, we first need to use SageMaker's `PyTorchModel` object to load the saved model. This step can be skipped, since we could directly deploy the `pytorch_estimator` object trained in the last step. However, in practice we often need to rerun this notebook for debugging or testing, and to avoid retraining the model every time we open the notebook, I created this "breakpoint" here.

In [12]:
model_data = pytorch_estimator.model_data
print(model_data)

NameError: name 'pytorch_estimator' is not defined

In [12]:
model_data = "s3://sagemaker-us-east-1-439373211214/pytorch-training-2022-01-26-20-00-11-378/output/model.tar.gz"

In [13]:
pytorch_model = PyTorchModel(model_data=model_data,
                             role=role,
                             framework_version="1.10",
                             source_dir="code",
                             py_version="py38",
                             entry_point="train_deploy.py")

Then, we use the loaded model object to deploy a `PyTorchPredictor`. This creates a SageMaker Endpoint, a hosted prediction service that we can use to perform inference.

In [14]:
predictor = pytorch_model.deploy(initial_instance_count=1, 
                                 instance_type="ml.t2.medium")

-------------------------!

## Invoke the Endpoint within the SageMaker Notebook

After deployment, we can evaluate the Endpoint using example input text. In our `input_fn` function in `train_deploy.py`, we require the input data format to be `JSON`, so we need to convert our example input text to `JSON` format and then feed it to our predictor.

In [15]:
predictor.serializer = json_serializer
predictor.deserializer = json_deserializer

We use an excerpt from a [BBC Politics](https://www.bbc.com/news/uk-60095459) news article as an example.

In [16]:
example_text = """
The UK has accused President Putin of plotting to install a pro-Moscow figure to lead Ukraine's government.

The Foreign Office took the unusual step of naming former Ukrainian MP Yevhen Murayev as a potential Kremlin candidate.

Russia has moved 100,000 troops near to its border with Ukraine but denies it is planning an invasion.

UK ministers have warned that the Russian government will face serious consequences if there is an incursion.

In a statement, Foreign Secretary Liz Truss said: "The information being released today shines a light on the extent of Russian activity designed to subvert Ukraine, and is an insight into Kremlin thinking.

"Russia must de-escalate, end its campaigns of aggression and disinformation, and pursue a path of diplomacy."

The Russian Ministry of Foreign Affairs tweeted that the Foreign Office was "circulating disinformation" and urged it to "cease these provocative activities" and "stop spreading nonsense".

"""

Let's see if our predictor can correctly classify this news:

In [17]:
result = predictor.predict({"text": example_text})
print(result)

The json_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The json_deserializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


{'politics': 0.9923005104064941, 'business': 0.005291244480758905, 'sport': 0.0014019005466252565, 'entertainment': 0.0005219860468059778, 'tech': 0.00048434155178256333}


Great! It did classify the news as "politics"! "politics" has the highest probability of 0.99. Good job, GPT-2!
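Since the endpoint returns a plain dictionary of category probabilities, picking the predicted label is a one-liner. The small helper below, `rank_categories`, is a hypothetical convenience written for this illustration; it sorts the response so that the top prediction comes first.

```python
def rank_categories(result):
    """Return (category, probability) pairs sorted from most to least likely."""
    return sorted(result.items(), key=lambda item: item[1], reverse=True)


# Probabilities copied from the endpoint response shown above.
result = {
    "politics": 0.9923005104064941,
    "business": 0.005291244480758905,
    "sport": 0.0014019005466252565,
    "entertainment": 0.0005219860468059778,
    "tech": 0.00048434155178256333,
}

ranking = rank_categories(result)
top_category, top_probability = ranking[0]
print(f"{top_category} ({top_probability:.2%})")
```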

## Invoke the endpoint from a Streamlit app

With our model endpoint deployed from this SageMaker Notebook Instance, we can also develop a Streamlit app on the instance for quick prototyping. [Streamlit](https://streamlit.io/) is a really nice tool for this purpose because of its simplicity, flexibility and integrations with a wide variety of visualisation tools such as [Altair](https://altair-viz.github.io/), [Bokeh](https://docs.bokeh.org/en/latest/index.html) and [Plotly](https://plotly.com/). The source code for our Streamlit app can be found at `./streamlit_app/src/app.py`.

To avoid differences between development and deployment environments, we use Docker containers to simplify the deployment. Our Dockerfile is defined at `./streamlit_app/Dockerfile` and has a `CMD` set to `streamlit run src/app.py` to start the app server within the container. We start off by building the Docker image and setting the `SAGEMAKER_ENDPOINT_NAME` build argument to `predictor.endpoint_name`, so the app invokes the correct model endpoint.
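The app's source is not reproduced in this notebook. As a rough sketch of how code running inside the container could call the endpoint, the function below sends the same `{"text": ...}` JSON payload through the boto3 `sagemaker-runtime` client. The names `classify` and `make_runtime_client` are hypothetical, and the sketch assumes `boto3` is installed in the image and that AWS credentials and a region are available in the environment.

```python
import json
import os


def classify(text, runtime_client, endpoint_name):
    """Send text to the endpoint and return the parsed category probabilities.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"text": text}).encode("utf-8"),
    )
    # The response body is a stream; read it fully before parsing the JSON.
    return json.loads(response["Body"].read())


def make_runtime_client():
    # boto3 is imported here so the function above can be tested without it.
    import boto3

    return boto3.client("sagemaker-runtime")


def classify_from_env(text):
    """Use the endpoint name baked into the image via SAGEMAKER_ENDPOINT_NAME."""
    endpoint_name = os.environ["SAGEMAKER_ENDPOINT_NAME"]
    return classify(text, make_runtime_client(), endpoint_name)
```

In a Streamlit app, `classify_from_env` could be called with the contents of a text area, and the returned dictionary rendered as a table or bar chart.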

In [20]:
print(predictor.endpoint_name)

pytorch-inference-2022-01-27-03-52-16-369


In [5]:
sagemaker_endpoint_name = "pytorch-inference-2022-01-27-03-52-16-369"

In [8]:
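# `!export` runs in a temporary subshell, so the variable would not persist;
# `%env` sets it for the notebook kernel instead.
%env SAGEMAKER_ENDPOINT_NAME=pytorch-inference-2022-01-27-03-52-16-369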

In [12]:
image_name = "streamlit:gpt2-news-classifier-app"

In [23]:
!(cd streamlit_app && docker build -t {image_name} --build-arg SAGEMAKER_ENDPOINT_NAME={sagemaker_endpoint_name} .)

Sending build context to Docker daemon  8.192kB
Step 1/15 : FROM python:3.7.4-slim-stretch
 ---> fad2b9f06d3b
Step 2/15 : ARG SAGEMAKER_ENDPOINT_NAME
 ---> Running in 5d1d27361b30
Removing intermediate container 5d1d27361b30
 ---> fa39be725832
Step 3/15 : ENV SAGEMAKER_ENDPOINT_NAME=$SAGEMAKER_ENDPOINT_NAME
 ---> Running in a487ce73a924
Removing intermediate container a487ce73a924
 ---> f5e513440b92
Step 4/15 : WORKDIR /usr/src/app
 ---> Running in c8fde7602857
Removing intermediate container c8fde7602857
 ---> 002d541ef861
Step 5/15 : COPY requirements.txt requirements.txt
 ---> 4271a85c9653
Step 6/15 : COPY src/ src/
 ---> 7f41933884a1
Step 7/15 : RUN pip install --upgrade pip
 ---> Running in efe8d37135a5
Collecting pip
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
Installing collected packages: pip
  Found existing installation: pip 19.3
    Uninstalling pip-19.3:
      Su

In [30]:
!docker images

REPOSITORY   TAG                        IMAGE ID       CREATED       SIZE
streamlit    gpt2-news-classifier-app   ddd0674e030c   6 hours ago   737MB
python       3.7.4-slim-stretch         fad2b9f06d3b   2 years ago   155MB


With our Docker image built, we can now choose a local port that will be used by the dashboard server. Although the Streamlit server will be running on port 80 inside the container (defined in the `Dockerfile`), we can map this to a different local port on our Amazon SageMaker Notebook Instance. We'll use port 8501 in our example, which is the Streamlit default.

In [15]:
port = 8501

We're now ready to start the Docker container. We have defined a utility function, called `get_docker_run_command`, that can be used to construct the correct `docker run` command. It handles a number of different things including:

- port forwarding: access the server running inside the Docker container.
- local directory mounts: edit source files without having to rebuild the Docker container.
- debug modes: change the verbosity of error messages.
- permissions: pass the IAM role from the SageMaker Notebook Instance to the Docker container.

After running the cell below, you should click on the `Dashboard URL` link that appears at the top, rather than the `URL` output by the Streamlit server. All local ports must be accessed via the [Jupyter Server Proxy](https://github.com/jupyterhub/jupyter-server-proxy) at `https://{notebook-url}/proxy/{port}`.
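The `utils` module itself is not shown in this notebook. To give an idea of what such helpers involve, here is a hypothetical re-creation (deliberately using different function names) that builds a proxy URL of the form described above and assembles a `docker run` command with port forwarding, an optional source mount, and environment variables. The container port of 80 and the mount target `/usr/src/app/src` are taken from the output shown further down; everything else is an assumption.

```python
import shlex


def build_dashboard_url(notebook_host, port):
    """Build the Jupyter Server Proxy URL for a local port."""
    return f"https://{notebook_host}/proxy/{port}/"


def build_docker_run_command(port, image_name, local_dir_mount=None, env=None, container_port=80):
    """Assemble a multi-line `docker run` command string."""
    # Port forwarding: local port -> port the Streamlit server uses in the container.
    parts = [f"docker run -p {port}:{container_port}"]
    if local_dir_mount:
        # Mount local sources so edits do not require rebuilding the image.
        parts.append(f"-v {shlex.quote(local_dir_mount)}:/usr/src/app/src")
    for name, value in (env or {}).items():
        # Pass settings such as the region or temporary credentials to the container.
        parts.append(f"--env {shlex.quote(f'{name}={value}')}")
    parts.append(shlex.quote(image_name))
    # Join with backslash-newline so the command stays readable when printed.
    return " \\\n".join(parts)


print(build_dashboard_url("example.notebook.us-east-1.sagemaker.aws", 8501))
print(
    build_docker_run_command(
        8501,
        "streamlit:gpt2-news-classifier-app",
        local_dir_mount="/home/ec2-user/SageMaker/streamlit_app",
        env={"AWS_DEFAULT_REGION": "us-east-1"},
    )
)
```

Because the generated command can embed credentials, it is best not to print or commit it outside a private session.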

In [35]:
url = utils.get_dashboard_url(port)
command = utils.get_docker_run_command(port, image_name, local_dir_mount='./streamlit_app', debug=True)

In [36]:
print(url)

https://gpt2-news-classifier-2.notebook.us-east-1.sagemaker.aws/proxy/8501/


In [37]:
print(command)

docker run -p 8501:80 \
-v /home/ec2-user/SageMaker/streamlit_app:/usr/src/app/src \
--env AWS_DEFAULT_REGION=us-east-1 \
--env AWS_ACCESS_KEY_ID=<redacted> \
--env AWS_SECRET_ACCESS_KEY=<redacted> \
--env AWS_SESSION_TOKEN=<redacted>

In [None]:
!echo APP URL: {url}
!{command}

APP URL: https://gpt2-news-classifier.notebook.us-east-1.sagemaker.aws/proxy/8501/

  You can now view your Streamlit app in your browser.

  URL: http://0.0.0.0:80



You should be able to try our GPT-2 News Classifier in this web app!

When you're finished developing the Streamlit app, interrupt the Jupyter kernel by clicking on the stop button or clicking 'Kernel' -> 'Interrupt'.

## Clean Up

After you have finished with this example, make sure to delete the prediction endpoint to release the instance(s) associated with it and avoid unnecessary charges!

**Caution**: You also need to manually delete any extra resources that you may have created in this notebook. Examples include extra Amazon S3 buckets (in addition to the solution's default bucket), extra Amazon SageMaker endpoints (using a custom name), and extra Amazon ECR repositories.

In [38]:
# delete endpoint to release instance
# sagemaker_session.delete_endpoint(
#     endpoint_name = sagemaker_endpoint_name
# )
# sagemaker_session.delete_model(
#     model_name=sagemaker_endpoint_name
# )