# Fine-tune TinyLlama-1.1B for movie review classification

## Introduction

In this workshop module, you will learn how to fine-tune a Llama-based LLM using causal language modelling so that the model learns to classify the sentiment of movie reviews. Your fine-tuning job will be launched using SageMaker Training, which provides a fully managed training environment, so you do not need to provision or maintain the underlying infrastructure. You will learn how to configure a PyTorch training job using [SageMaker's PyTorch estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html), and how to use the [Hugging Face Optimum Neuron](https://github.com/huggingface/optimum-neuron) package to run the PyTorch training job on AWS Trainium accelerators via an [Amazon EC2 trn1.2xlarge instance](https://aws.amazon.com/ec2/instance-types/trn1/).

For this module, you will be using a custom dataset based upon the popular [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), which consists of 50,000 text-based movie reviews, each classified as `positive` or `negative`. Our custom dataset consists of descriptive prompts that allow the LLM to learn how to respond to queries for movie review classification. The dataset examples look like the following:

*Positive example:*
```
###Query: Classify the following movie review as positive or negative
###Review: "Tulip" is on the "Australian All Shorts" video from "Tribe First Rites" showcasing the talents of \
first time directors.<br /><br />I wish more scripts had such excellent dialogue.<br /><br />I hope Rachel \
Griffiths has more stories to tell, she does it so well.
###Classification: positive</s>\n\n
```

*Negative example:*
```
###Query: Classify the following movie review as positive or negative
###Review: I only watched this film from beginning to end because I promised a friend I would. It lacks even \
unintentional entertainment value that many bad films have. It may be the worst film I have ever seen. I'm \
surprised a distributor put their name on it.
###Classification: negative</s>\n\n
```

After fine-tuning on several hundred of these prompt examples, the model learns to predict `positive` or `negative` when presented with queries containing new movie review content. For example, once the model has been fine-tuned you can present it with the following prompt:
```
###Query: Classify the following movie review as positive or negative
###Review: This movie is very funny. Amitabh Bachan and Govinda are absolutely hilarious. Acting is good. Comedy is great. \
They are up to their usual thing. It would be good to see a sequel to this :)<br /><br />Watch it. Good time-pass movie
###Classification:
```
and if your model is trained properly it will generate `positive` as the output.
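
The prompt layout shown above can be reproduced with a small helper. The following is a minimal sketch, not part of the workshop code: `build_prompt` and `parse_classification` are hypothetical names, the template is inferred from the examples above, and the trailing `\n\n` in those examples is assumed to represent two real newline characters.

```python
from typing import Optional

# Template inferred from the dataset examples shown above (an assumption).
PROMPT_TEMPLATE = (
    "###Query: Classify the following movie review as positive or negative\n"
    "###Review: {review}\n"
    "###Classification:"
)
LABELS = ("positive", "negative")


def build_prompt(review: str, label: Optional[str] = None) -> str:
    """Return an inference prompt, or a training example when a label is given."""
    prompt = PROMPT_TEMPLATE.format(review=review.strip())
    if label is None:
        return prompt
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    # Training examples end with the end-of-sequence token and two newlines.
    return f"{prompt} {label}</s>\n\n"


def parse_classification(generated_text: str) -> Optional[str]:
    """Return the label that follows the last '###Classification:' marker, if any."""
    _, marker, tail = generated_text.rpartition("###Classification:")
    if not marker:
        return None
    words = tail.replace("</s>", " ").split()
    return words[0] if words and words[0] in LABELS else None


print(build_prompt("This movie is very funny."))
```

Calling `build_prompt` without a label produces a prompt that ends at `###Classification:`, which is the point where the fine-tuned model is expected to continue with its answer.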


This movie review classification use case was selected so that you can fine-tune your model in a reasonably short amount of time (~12 minutes), which suits this workshop. Although this is a relatively simple use case, the same techniques and components used in this module can also be applied to fine-tune LLMs for more advanced use cases such as writing poetry, summarizing documents, or creating blog posts.

## Prerequisites

This notebook uses the SageMaker Python SDK to prepare, launch, and monitor the progress of a PyTorch-based training job. Before we get started, it is important to upgrade the SageMaker SDK to ensure that you are using the latest version. Run the next two cells to upgrade the SageMaker SDK and set up your session.

In [None]:
# Upgrade SageMaker SDK to the latest version
%pip install -U sagemaker -q
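
To confirm which SDK version the kernel will pick up after the upgrade, you can query the installed package metadata. This is an optional sketch using only the Python standard library; `installed_version` is a hypothetical helper name. If `sagemaker` was already imported before the upgrade, the kernel may need a restart before the new version is used.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(f"sagemaker SDK version: {installed_version('sagemaker')}")
```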

In [None]:
import logging 
sagemaker_config_logger = logging.getLogger("sagemaker.config") 
sagemaker_config_logger.setLevel(logging.WARNING)

# Import the SageMaker SDK and set up our session
from sagemaker import get_execution_role, Session
from sagemaker.pytorch import PyTorch

sess = Session()
default_bucket = sess.default_bucket()

## Specify the Neuron deep learning container (DLC) image

The SageMaker Training service uses containers to execute your training script, allowing you to fully customize your training script environment and any required dependencies. For this workshop, you will use a recent Neuron deep learning container (DLC) image, an AWS-maintained image containing the Neuron SDK, PyTorch, and commonly used Python packages.

In [None]:
# Specify the Neuron DLC that we will use for training
training_image = "763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:1.13.1-neuronx-py310-sdk2.15.0-ubuntu20.04"
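
The image URI above is pinned to the `us-east-2` region. If you run this notebook elsewhere, the URI can be derived from the session's region instead. The sketch below is an illustration rather than part of the workshop: `neuron_dlc_uri` is a hypothetical helper, and it assumes that the same registry account hosts this image tag in your region and that Trainium instances are available there, both of which you should verify.

```python
# Components of the Neuron DLC image URI used in this notebook.
NEURON_DLC_ACCOUNT = "763104351884"
NEURON_DLC_REPO = "pytorch-training-neuronx"
NEURON_DLC_TAG = "1.13.1-neuronx-py310-sdk2.15.0-ubuntu20.04"


def neuron_dlc_uri(region: str) -> str:
    """Build the ECR image URI for a given AWS region (assumes the image exists there)."""
    return (
        f"{NEURON_DLC_ACCOUNT}.dkr.ecr.{region}.amazonaws.com/"
        f"{NEURON_DLC_REPO}:{NEURON_DLC_TAG}"
    )


# In the notebook you could write: training_image = neuron_dlc_uri(sess.boto_region_name)
print(neuron_dlc_uri("us-east-2"))
```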

## Configure the PyTorch Estimator

The SageMaker SDK includes a [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html) class that you can use to define a PyTorch training job to be executed in the SageMaker managed environment.

In the following cell, you will create a PyTorch Estimator which will run the attached `run_clm.py` training script on a trn1.2xlarge instance. The `run_clm.py` script is an Optimum Neuron example training script that can be used for causal language modelling with AWS Trainium.

The PyTorch Estimator has many parameters that can be used to configure your training job. A few of the most important parameters include:

- *entry_point*: refers to the name of the training script that will be executed as part of this training job
- *source_dir*: the path to the local source code directory (relative to your notebook) that will be packaged up and included inside your training container
- *instance_count*: defines how many EC2 instances to use for this training job
- *instance_type*: determines which type of EC2 instance will be used for training
- *image_uri*: defines which training DLC will be used to run the training job (see Neuron DLC, above)
- *distribution*: determines which type of distribution to use for the training job - you will need 'torch_distributed' for this workshop
- *environment*: provides a dictionary of environment variables which will be applied to your training environment
- *hyperparameters*: provides a dictionary of command-line arguments to pass to your training script (here, `run_clm.py`)

In the `hyperparameters` section, you can see the specific command-line arguments that are used to control the behavior of the `run_clm.py` training script. Notably:
- *model_name_or_path*: specifies which model you will be fine-tuning, in this case a recent checkpoint from the TinyLlama-1.1B project
- *dataset_name*: specifies which dataset you will use for fine-tuning, in this case our customized IMDB review prompts dataset
- *per_device_train_batch_size*: the microbatch size to be used for fine-tuning
- *max_steps*: the maximum number of steps of fine-tuning that we want to perform
- *tensor_parallel_size*: the tensor parallel degree to use for training. In this case we use '2' to shard the model across the 2 NeuronCores available in the trn1.2xlarge instance
- *gradient_accumulation_steps*: the number of micro-batches over which gradients are accumulated before each optimizer update
- *bf16*: enables BFloat16 training
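
To see how these settings interact, the following sketch works through the arithmetic for the values used in this notebook's estimator configuration. `training_budget` is a hypothetical helper. It assumes that the data-parallel degree equals the number of NeuronCores divided by `tensor_parallel_size`, that `max_steps` counts optimizer updates, and that `block_size` is the number of tokens per training sequence.

```python
def training_budget(
    neuron_cores: int,
    tensor_parallel_size: int,
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    max_steps: int,
    block_size: int,
) -> dict:
    """Estimate how many sequences and tokens a training run will process."""
    if neuron_cores % tensor_parallel_size != 0:
        raise ValueError("tensor_parallel_size must evenly divide the number of NeuronCores")
    # Cores not consumed by tensor parallelism form data-parallel replicas.
    data_parallel_size = neuron_cores // tensor_parallel_size
    sequences_per_update = (
        per_device_train_batch_size * gradient_accumulation_steps * data_parallel_size
    )
    total_sequences = sequences_per_update * max_steps
    return {
        "data_parallel_size": data_parallel_size,
        "sequences_per_update": sequences_per_update,
        "total_sequences": total_sequences,
        "max_tokens": total_sequences * block_size,
    }


# Values from this notebook: trn1.2xlarge (2 NeuronCores), TP=2, batch 1, 8 accumulation steps.
budget = training_budget(
    neuron_cores=2,
    tensor_parallel_size=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=100,
    block_size=150,
)
print(budget)
```

Under these assumptions, each optimizer update sees 8 sequences and the full run processes 800 sequences of up to 150 tokens each, which is why the job finishes in minutes.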

The estimator below has been pre-configured for you, so you do not need to make any changes.

In [None]:
# Set up the PyTorch estimator
# Note that the hyperparameters are just command-line args passed to the run_clm.py script to control its behavior
pt_estimator = PyTorch(
        entry_point="run_clm.py",
        role=get_execution_role(),
        source_dir='./',
        instance_count=1,
        instance_type="ml.trn1.2xlarge",
        framework_version='1.13.1',
        py_version='py310',
        disable_profiler=True,
        output_path=f"s3://{default_bucket}/reinvent2023",
        base_job_name="trn1-tinyllama",
        sagemaker_session=sess,
        code_location=f"s3://{default_bucket}/reinvent2023_code",  # S3 prefix where the packaged source code is uploaded
        checkpoint_s3_uri=f"s3://{default_bucket}/reinvent_output",
        image_uri=training_image,
        distribution={"torch_distributed": {"enabled": True} },  # Required for torchrun-based job launch
        environment={ "FI_EFA_FORK_SAFE": "1", },
        disable_output_compression=True,
        hyperparameters={
            "model_name_or_path": "PY007/TinyLlama-1.1B-intermediate-step-715k-1.5T",
            "dataset_name": "5cp/imdb_review_prompts",
            "per_device_train_batch_size": 1,
            "do_train": "",
            "max_steps": 100,
            "block_size": 150,
            "tensor_parallel_size": 2,
            "output_dir": "/opt/ml/model",
            "gradient_accumulation_steps": 8,
            "logging_steps": 5,
            "bf16": "",
            "disable_tqdm": True
        }
    )

## Launch the training job

Once the estimator has been created, you can then launch your training job by calling `.fit()` on the estimator:

In [None]:
# Call fit() on the estimator to initiate the training job
pt_estimator.fit(wait=False, logs=False)

## Monitor the training job

When the training job has been launched, the SageMaker Training service will then take care of:
- launching and configuring the requested EC2 infrastructure for your training job
- launching the requested container image on each of the EC2 instances
- copying your source code directory and running your training script within the container(s)
- storing your trained model artifacts in Amazon Simple Storage Service (S3)
- decommissioning the training infrastructure

While the training job is running, the following cell will periodically check and output the job status. When you see 'Completed', you know that your training job is finished and you can proceed to the remainder of the notebook. The training job typically takes about 12 minutes to complete.

If you are interested in viewing the output logs from your training job, navigate to the Amazon CloudWatch console, select `Logs -> Log Groups` in the left-hand menu, open the `/aws/sagemaker/TrainingJobs` log group, and look for the log streams whose names begin with your training job name. **Note:** it will usually take 4-5 minutes before the infrastructure is running and the output logs begin to be populated in CloudWatch.
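
The same logs can also be read programmatically. The sketch below is an optional illustration that takes a boto3 CloudWatch Logs client as an argument; `tail_training_logs` is a hypothetical helper. It assumes the job writes to the standard `/aws/sagemaker/TrainingJobs` log group with log streams prefixed by the job name, and it will return nothing until the log streams exist.

```python
from typing import List

LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def tail_training_logs(logs_client, job_name: str, limit: int = 20) -> List[str]:
    """Return the most recent log messages from each log stream of a training job."""
    streams = logs_client.describe_log_streams(
        logGroupName=LOG_GROUP,
        logStreamNamePrefix=f"{job_name}/",
    )["logStreams"]
    messages = []
    for stream in streams:
        events = logs_client.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            limit=limit,
            startFromHead=False,  # read from the end of the stream
        )["events"]
        messages.extend(event["message"] for event in events)
    return messages


# Usage in the notebook (requires boto3 and AWS credentials):
# import boto3
# job_name = pt_estimator.latest_training_job.name
# for line in tail_training_logs(boto3.client("logs"), job_name):
#     print(line)
```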

In [None]:
# Periodically check job status until it shows 'Completed' (ETA ~12 minutes)
#  You can also monitor job status in the SageMaker console, and view the SageMaker Training job logs in the CloudWatch console
from time import sleep
from datetime import datetime

# 'Completed', 'Failed', and 'Stopped' are the terminal states reported by SageMaker
while (job_status := pt_estimator.jobs[-1].describe()['TrainingJobStatus']) not in ['Completed', 'Failed', 'Stopped']:
    print(f"{datetime.now().isoformat()} Training job status: {job_status}!")
    sleep(30)
    
print(f"\n{datetime.now().isoformat()} Training job status: {job_status}!")
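
If you prefer a loop that cannot run indefinitely, a variant with a timeout can be sketched as follows. `wait_for_training_job` is a hypothetical helper, not part of the SageMaker SDK. It takes any callable that returns the current status string, which keeps it independent of the estimator object, and the one-hour default timeout is an arbitrary choice.

```python
import time
from typing import Callable

# Terminal values of TrainingJobStatus reported by SageMaker.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30,
    timeout_seconds: float = 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until a terminal state is reached or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Usage in the notebook:
# final_status = wait_for_training_job(
#     lambda: pt_estimator.jobs[-1].describe()["TrainingJobStatus"]
# )
# If the job failed, describe()["FailureReason"] explains why.
```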

## Determine location of fine-tuned model artifacts

Once the training job has completed, SageMaker will copy your fine-tuned model artifacts to a specified location in S3.

In the following cell, you can see how to programmatically determine the location of your model artifacts:

In [None]:
# Show where the fine-tuned model is stored - previous job must be 'Completed' before running this cell
model_archive_path = pt_estimator.jobs[-1].describe()['ModelArtifacts']['S3ModelArtifacts']
print(f"Your fine-tuned model is available here:\n\n{model_archive_path}")

Please copy the above S3 path, as it will be required in the subsequent workshop module.


Lastly, run the following cell to list the model artifacts stored under your `model_archive_path` in S3:

In [None]:
# View the contents of the fine-tuned model path in S3
!aws s3 ls {model_archive_path}/
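
If you would rather list the artifacts from Python than from the AWS CLI, the S3 URI can be split into a bucket and key prefix and passed to boto3. This is a minimal sketch: `split_s3_uri` and `list_artifacts` are hypothetical helpers, and the listing assumes the model path is a prefix (a "directory") rather than a single archive file, which matches the `disable_output_compression=True` setting used for the estimator.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key/prefix' into ('bucket', 'key/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_artifacts(s3_client, uri: str) -> List[str]:
    """Return all object keys stored under the given S3 prefix."""
    bucket, prefix = split_s3_uri(uri)
    # A trailing slash avoids matching sibling prefixes that share the same stem.
    prefix = prefix.rstrip("/") + "/"
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage in the notebook (requires boto3 and AWS credentials):
# import boto3
# for key in list_artifacts(boto3.client("s3"), model_archive_path):
#     print(key)
```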

Congratulations on completing the LLM fine-tuning module!

In the next notebook, you will learn how to deploy your fine-tuned model in a SageMaker hosted endpoint, and leverage AWS Inferentia accelerators to perform model inference. Have fun!