# SageMaker JumpStart Foundation Models - Fine-tuning a text generation model on a domain-specific dataset

---
Welcome to [Amazon SageMaker Built-in Algorithms](https://sagemaker.readthedocs.io/en/stable/algorithms/index.html)! You can use SageMaker Built-in algorithms to solve many Machine Learning tasks through [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html). You can also use these algorithms through one-click in SageMaker Studio via [JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html).

In this demo notebook, we demonstrate how to use the SageMaker Python SDK to fine-tune foundation models and deploy the trained model for inference. The foundation models perform the text generation task: each takes a text string as input and predicts the next words in the sequence.

* **How to run inference on a large language model without finetuning.**
* **How to fine-tune a large language model on a domain-specific dataset, and then run inference on the fine-tuned model. The example dataset is the [publicly available SEC filings](https://www.sec.gov/edgar/searchedgar/companysearch) of Amazon from 2021 to 2022. After fine-tuning, we expect the model to generate more relevant text in the financial domain.**
* **We compare the inference result for GPT-J 6B before finetuning and after finetuning.**

Note: This notebook was tested on an ml.t3.medium instance in Amazon SageMaker Studio with the Python 3 (Data Science) kernel and in an Amazon SageMaker Notebook instance with the conda_python3 kernel.

---

1. [Set up](#1.-Set-up)
2. [Select text generation model](#2.-Select-text-generation-model)
3. [Run inference on the pre-trained model without finetuning](#3.-Run-inference-on-the-pre-trained-model-without-finetuning)
    * [Retrieve artifacts & deploy an endpoint](#3.1.-Retrieve-artifacts-&-deploy-an-endpoint)
    * [Query endpoint and parse response](#3.2.-Query-endpoint-and-parse-response)
    * [Clean up the endpoint](#3.3.-Clean-up-the-endpoint)
4. [Fine-tune the pre-trained model on a custom dataset](#4.-Fine-tune-the-pre-trained-model-on-a-custom-dataset)
    * [Set training parameters](#4.1.-Set-training-parameters)
    * [Train with Automatic Model Tuning](#4.2.-Train-with-Automatic-Model-Tuning-([HPO]))
    * [Start training](#4.3.-Start-training)
    * [Extract training performance metrics](#4.4.-Extract-training-performance-metrics)
    * [Deploy & run inference on the fine-tuned model](#4.5.-Deploy-&-run-inference-on-the-fine-tuned-model)

## 1. Set up
Before executing the notebook, there are some initial steps required for setup.

In [2]:
!pip install ipywidgets==7.0.0 --quiet
!pip install --upgrade sagemaker --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sagemaker-datawrangler 0.4.3 requires sagemaker-data-insights==0.4.0, but you have sagemaker-data-insights 0.3.3 which is incompatible.

To train and host on Amazon SageMaker, we need to set up and authenticate the use of AWS services. Here, we use the execution role associated with the current notebook instance as the AWS account role with SageMaker access. It has the necessary permissions, including access to your data in S3.

In [3]:
import sagemaker, boto3, json
from sagemaker.session import Session

sagemaker_session = Session()
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


## 2. Select text generation model

You can continue with the default model or choose a different model from the dropdown generated by running the next cell. A complete list of JumpStart pre-trained models is also available at [JumpStart Pre-Trained Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#).

In [4]:
model_id, model_version = "huggingface-textgeneration1-gpt-j-6b", "*"

In [5]:
import IPython
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
from sagemaker.jumpstart.filters import And


filter_value = And("task == textgeneration1", "framework == huggingface")
text_generation_models = list_jumpstart_models(filter=filter_value)

dropdown = Dropdown(
    value=model_id,
    options=text_generation_models,
    description="Sagemaker Pre-Trained Text Generation Models:",
    style={"description_width": "initial"},
    layout={"width": "max-content"},
)
display(IPython.display.Markdown("## Select a pre-trained model from the dropdown below"))
display(dropdown)

## Select a pre-trained model from the dropdown below

A Jupyter Widget

## 3. Run inference on the pre-trained model without finetuning

Using SageMaker, we can directly perform inference on a pre-trained text generation model. For example, [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6b) is an open source 6 billion parameter model released by Eleuther AI. GPT-J 6B has been trained on a large corpus of text data ([the Pile](https://pile.eleuther.ai/) dataset) and is capable of performing various natural language processing tasks such as text generation, text classification, and text summarization. 

### 3.1. Retrieve artifacts & deploy an endpoint

To host the pre-trained model, we create an instance of [`sagemaker.jumpstart.model.JumpStartModel`](https://sagemaker.readthedocs.io/en/stable/overview.html#deploy-a-pre-trained-model-directly-to-a-sagemaker-endpoint) and deploy it.

In [6]:
model_id, model_version = dropdown.value, "3.*"

In [7]:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer
from sagemaker.utils import name_from_base

model = JumpStartModel(model_id=model_id, model_version=model_version)

base_model_predictor = model.deploy()

Using model 'huggingface-textgeneration1-gpt-j-6b' with wildcard version identifier '3.*'. You can pin to version '3.1.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


--------------!

### 3.2. Query endpoint and parse response
The model takes a text string as input and predicts the next words in the sequence. We use the following three input examples.

1. `This Form 10-K report shows that`
2. `We serve consumers through`
3. `Our vision is`

**The input examples relate to a company's performance in a financial report. As you will see, the outputs from the model without fine-tuning offer limited insight.**
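
The request and response shapes used in the next cell can be sketched without an endpoint. This is a minimal illustration, assuming the response is a list of dictionaries with a `generated_text` key, as the notebook's own indexing suggests. The helper names `build_payload` and `extract_generated_text` and the sample response are hypothetical and not part of the SageMaker SDK.

```python
import json


def build_payload(prompt, parameters):
    """Return the request body for one prompt (the notebook appends ':' to each prompt)."""
    return {"inputs": f"{prompt}:", "parameters": dict(parameters)}


def extract_generated_text(response):
    """Return the first generated sequence, or raise if the shape is unexpected."""
    if not response or "generated_text" not in response[0]:
        raise ValueError(f"Unexpected response shape: {response!r}")
    return response[0]["generated_text"]


# Hypothetical response, shaped like the value the notebook indexes with [0]["generated_text"].
sample_response = [{"generated_text": " (1) we are unable to predict ..."}]

payload = build_payload("This Form 10-K report shows that", {"max_length": 400})
print(json.dumps(payload))
print(extract_generated_text(sample_response))
```

Checking the response shape explicitly gives a readable error if a different model version returns a different output signature.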

In [12]:
parameters = {
    "max_length": 400,
    "num_return_sequences": 1,
    "top_k": 250,
    "top_p": 0.8,
    "do_sample": True,
    "temperature": 1,
}

res_gpt_before_finetune = []
for quota_text in [
    "This Form 10-K report shows that",
    "We serve consumers through",
    "Our vision is",
]:
    payload = {"inputs": f"{quota_text}:", "parameters": parameters}
    generated_texts = base_model_predictor.predict(payload)[0]["generated_text"]
    res_gpt_before_finetune.append(generated_texts)
    print(generated_texts)
    print('-'*20)
    print("\n")

 (1) we are unable to predict the actual results of our
business in any future period or state of our business, (2) our actual results may differ materially from those anticipated in our consolidated financial statements due to a number of factors, including, without limitation, our inability to predict the impact of federal or state tax law changes, changes in accounting rules, our ability to manage our capital, competition, our ability to generate sufficient operating income to make distributions to our stockholders, adverse fluctuations in
--------------------




Our company was founded in the year 2014, and since then, we have made our presence in the global market by offering a wide range of Vat Compliant & Custom Printed Shoes, Shirts, T-shirts, Sweaters, etc. We manufacture a broad array of products for many different types of business like sports shoes, accessories, clothing, etc. to help you get your business. We provide our services at reasonable rates.
--------------------


### 3.3. Clean up the endpoint

In [13]:
# Delete the SageMaker endpoint and the attached resources
base_model_predictor.delete_model()
base_model_predictor.delete_endpoint()

## 4. Fine-tune the pre-trained model on a custom dataset

Fine-tuning refers to the process of taking a pre-trained language model and retraining it for a different but related task using specific data. This approach is also known as transfer learning, which involves transferring the knowledge learned from one task to another. Large language models (LLMs) like GPT-J 6B are trained on massive amounts of unlabeled data and can be fine-tuned on domain-specific datasets, making the model perform better in that specific domain.

We will use financial text from SEC filings to fine-tune the GPT-J 6B LLM for financial applications.



- **Input**: A train and an optional validation directory. Each directory contains a CSV/JSON/TXT file.
    - For CSV/JSON files, the train or validation data is used from the column called 'text' or the first column if no column called 'text' is found.
    - The train directory and the validation directory (if provided) should each contain exactly one file.
- **Output**: A trained model that can be deployed for inference.

Below is an example of a TXT file for fine-tuning the text generation model. The TXT file contains SEC filings of Amazon from 2021 to 2022.

---
```
This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.

GENERAL

Embracing Our Future ...
```
---
The SEC filings of Amazon are downloaded from the publicly available [EDGAR](https://www.sec.gov/edgar/searchedgar/companysearch) database. Instructions for accessing the data are available [here](https://www.sec.gov/os/accessing-edgar-data).
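
If you bring your own text, the input layout described above (one file per directory) can be prepared locally before uploading to S3. The following is a minimal sketch under stated assumptions: the helper names, the file names `train.txt` and `validation.txt`, the 20% validation fraction, and the naive line-based split are all illustrative choices, not requirements of JumpStart.

```python
import tempfile
from pathlib import Path

ALLOWED_SUFFIXES = {".csv", ".json", ".txt"}


def split_text_file(source, out_dir, validation_fraction=0.2):
    """Write <out_dir>/train/train.txt and <out_dir>/validation/validation.txt, split by line."""
    lines = source.read_text(encoding="utf-8").splitlines(keepends=True)
    cut = max(1, len(lines) - int(len(lines) * validation_fraction))
    for name, chunk in {"train": lines[:cut], "validation": lines[cut:]}.items():
        target = out_dir / name
        target.mkdir(parents=True, exist_ok=True)
        (target / f"{name}.txt").write_text("".join(chunk), encoding="utf-8")
    return out_dir / "train", out_dir / "validation"


def check_channel_dir(directory):
    """Return the single data file in a directory, or raise if the layout is invalid."""
    files = [p for p in directory.iterdir() if p.is_file()]
    if len(files) != 1:
        raise ValueError(f"{directory} must contain exactly one file, found {len(files)}")
    if files[0].suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"{files[0].name} is not a CSV/JSON/TXT file")
    return files[0]


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source_file = tmp_path / "filings.txt"  # hypothetical local copy of the raw text
    source_file.write_text("".join(f"Sentence {i}.\n" for i in range(10)), encoding="utf-8")

    train_dir, validation_dir = split_text_file(source_file, tmp_path / "dataset")
    train_lines = len(check_channel_dir(train_dir).read_text(encoding="utf-8").splitlines())
    validation_lines = len(check_channel_dir(validation_dir).read_text(encoding="utf-8").splitlines())
    print(train_lines, validation_lines)
```

For real filings, a split on document or section boundaries would likely be a better choice than a split by line count.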

### 4.1. Set training parameters
With the setup complete, we are ready to fine-tune the text generation model. Here, we define the parameters for the training job:

- **Training data path**: the S3 folder in which the input data is stored.
- **Output path**: the S3 folder in which the training output is stored.
- **Training instance type**: the type of machine on which to run the training, typically a GPU instance. This notebook does not set it explicitly, so `JumpStartEstimator` uses the default training instance type for the selected model.

In [20]:
# Sample training data is available in this bucket
data_bucket = f"jumpstart-cache-prod-{aws_region}"
data_prefix = "training-datasets/sec_data"

training_dataset_s3_path = f"s3://{data_bucket}/{data_prefix}/train/"
validation_dataset_s3_path = f"s3://{data_bucket}/{data_prefix}/validation/"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tg-train"

s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
print(training_dataset_s3_path)

s3://jumpstart-cache-prod-us-east-1/training-datasets/sec_data/train/


In [19]:
total_size = 0
bucket = boto3.resource("s3").Bucket(data_bucket)
# Sum the size in bytes of every object under the sample-data prefix
for obj in bucket.objects.filter(Prefix=data_prefix):
    total_size += obj.size
    print(obj.size)
print(total_size / 1e6)  # total size in MB (10^6 bytes)

5556620
1114952
6.671572


### 4.2. Train with Automatic Model Tuning ([HPO](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)) <a id='AMT'></a>
***
Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose. We will use a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html) object to interact with Amazon SageMaker hyperparameter tuning APIs.
***

In [15]:
from sagemaker.tuner import ContinuousParameter

# Use AMT for tuning and selecting the best model
use_amt = False

# Define objective metric, based on which the best model will be selected.
amt_metric_definitions = {
    # Raw string so that "\." reaches the regex engine without an invalid-escape warning
    "metrics": [{"Name": "eval:loss", "Regex": r"'eval_loss': ([0-9]+\.[0-9]+)"}],
    "type": "Minimize",
}

# You can select from the hyperparameters supported by the model and configure the ranges of values
# to search when training for the optimal model. See
# https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-define-ranges.html
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.00001, 0.0001, scaling_type="Logarithmic")
}

# Increase the total number of training jobs run by AMT, for increased accuracy (and training time).
max_jobs = 6
# Change the number of parallel training jobs run by AMT to reduce total training time, constrained by your account limits.
# If max_jobs == max_parallel_jobs, Bayesian search effectively becomes random search.
max_parallel_jobs = 2
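
To see what `scaling_type="Logarithmic"` means for the `learning_rate` range above, the sketch below draws values whose logarithm is uniformly distributed between the two bounds. It is plain random sampling for illustration only, not the Bayesian search that automatic model tuning runs, and `sample_log_uniform` is a hypothetical helper rather than a SageMaker function.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
max_jobs, max_parallel_jobs = 6, 2

# One candidate learning rate per training job in the range used by the notebook.
candidates = [sample_log_uniform(1e-5, 1e-4, rng) for _ in range(max_jobs)]
for learning_rate in candidates:
    print(f"{learning_rate:.2e}")

# With 2 jobs running at a time, 6 jobs need at least 3 sequential batches.
sequential_batches = math.ceil(max_jobs / max_parallel_jobs)
print(sequential_batches)
```

On a logarithmic scale, values between 1e-5 and 2e-5 are as likely as values between 5e-5 and 1e-4, which suits parameters such as learning rates that vary over orders of magnitude. The batch count also shows why parallelism matters for Bayesian search: later batches can only learn from results of earlier ones.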

### 4.3. Start training
***
We start by creating the estimator object with all the required assets and then launch the training job. For algorithm specific hyper-parameters, we override default `JumpStartEstimator` values for `epoch` and `per_device_train_batch_size`. You can view the full list of hyperparameters via `tg_estimator.hyperparameters`.
***

In [21]:
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker.tuner import HyperparameterTuner
from sagemaker.utils import name_from_base


training_job_name = name_from_base(f"jumpstart-example-{model_id}-transfer-learning")

# Raw strings keep "\." as a regex escape and avoid Python's invalid-escape warning
metric_definitions = [
    {"Name": "train:loss", "Regex": r"'loss': ([0-9]+\.[0-9]+)"},
    {"Name": "eval:loss", "Regex": r"'eval_loss': ([0-9]+\.[0-9]+)"},
    {"Name": "eval:runtime", "Regex": r"'eval_runtime': ([0-9]+\.[0-9]+)"},
    {"Name": "eval:samples_per_second", "Regex": r"'eval_samples_per_second': ([0-9]+\.[0-9]+)"},
    {"Name": "eval:eval_steps_per_second", "Regex": r"'eval_steps_per_second': ([0-9]+\.[0-9]+)"},
]


# Create SageMaker Estimator instance
tg_estimator = JumpStartEstimator(
    model_id=model_id,
    hyperparameters={
        "epoch": "3",
        "per_device_train_batch_size": "4",
    },
    output_path=s3_output_location,
    base_job_name=training_job_name,
    metric_definitions=metric_definitions,
)

if use_amt:
    hp_tuner = HyperparameterTuner(
        tg_estimator,
        amt_metric_definitions["metrics"][0]["Name"],
        hyperparameter_ranges,
        amt_metric_definitions["metrics"],
        max_jobs=max_jobs,
        max_parallel_jobs=max_parallel_jobs,
        objective_type=amt_metric_definitions["type"],
        base_tuning_job_name=training_job_name,
    )

    # Launch a SageMaker Tuning job to search for the best hyperparameters
    hp_tuner.fit({"train": training_dataset_s3_path, "validation": validation_dataset_s3_path})
else:
    # Launch a SageMaker Training job by passing s3 path of the training data
    tg_estimator.fit(
        {"train": training_dataset_s3_path, "validation": validation_dataset_s3_path}, logs=True
    )

INFO:sagemaker:Creating training-job with name: hf-textgeneration1-gpt-j-6b-2024-05-09-01-53-02-406


2024-05-09 01:53:02 Starting - Starting the training job...
2024-05-09 01:53:02 Pending - Training job waiting for capacity......
2024-05-09 01:54:20 Pending - Preparing the instances for training...
2024-05-09 01:54:59 Downloading - Downloading input data................................................................................................
2024-05-09 02:10:58 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-05-09 02:11:00,249 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-05-09 02:11:00,286 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-05-09 02:11:00,296 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-05-09 02:11:00,298 sagemaker_pytorch_container.train

### 4.4. Extract training performance metrics
***
Performance metrics such as training loss and validation accuracy/loss can be accessed through Amazon CloudWatch during training. We can also fetch these metrics and analyze them within the notebook.
***
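
These metrics exist because the `metric_definitions` passed to the estimator tell SageMaker which regular expressions to apply to the training log. The sketch below applies two of those patterns locally to show what each one captures. The log lines are hypothetical examples written in the dictionary style that the patterns expect, and `extract_metrics` is an illustrative helper.

```python
import re

# Two of the patterns used in the training cell, written as raw strings.
metric_definitions = [
    {"Name": "train:loss", "Regex": r"'loss': ([0-9]+\.[0-9]+)"},
    {"Name": "eval:loss", "Regex": r"'eval_loss': ([0-9]+\.[0-9]+)"},
]

# Hypothetical log lines; real training logs contain many other lines as well.
log_lines = [
    "{'loss': 1.6971, 'learning_rate': 5e-06, 'epoch': 0.33}",
    "{'eval_loss': 0.3635, 'eval_runtime': 25.19, 'epoch': 2.95}",
]


def extract_metrics(lines, definitions):
    """Return (metric name, value) pairs for every pattern that matches a line."""
    found = []
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found.append((definition["Name"], float(match.group(1))))
    return found


metrics = extract_metrics(log_lines, metric_definitions)
print(metrics)
```

The leading quote in `'loss'` matters: it keeps the training-loss pattern from also matching `'eval_loss'`.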

In [22]:
from sagemaker import TrainingJobAnalytics

if use_amt:
    training_job_name = hp_tuner.best_training_job()
else:
    training_job_name = tg_estimator.latest_training_job.job_name

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)

| | timestamp | metric_name | value |
|---|---|---|---|
| 0 | 0.0 | train:loss | 1.6971 |
| 1 | 360.0 | train:loss | 1.3914 |
| 2 | 720.0 | train:loss | 1.1307 |
| 3 | 1080.0 | train:loss | 0.8149 |
| 4 | 1440.0 | train:loss | 0.6572 |
| 5 | 1740.0 | train:loss | 0.5648 |
| 6 | 2100.0 | train:loss | 0.4614 |
| 7 | 2460.0 | train:loss | 0.3456 |
| 8 | 2820.0 | train:loss | 0.3468 |
| 9 | 0.0 | eval:loss | 1.270508 |


```
***** eval metrics *****
  epoch                   =       2.95
  eval_loss               =     0.3635
  eval_runtime            = 0:00:25.19
  eval_samples            =        198
  eval_samples_per_second =      7.859
  eval_steps_per_second   =      0.278
  perplexity              =     1.4384
```
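
The `perplexity` and `eval_loss` values in this summary are related. Assuming perplexity is computed as the exponential of the evaluation loss, which is the usual convention for causal language models, the reported figure can be reproduced from the rounded loss:

```python
import math

eval_loss = 0.3635  # rounded value from the eval metrics summary
perplexity = math.exp(eval_loss)

print(f"{perplexity:.4f}")
```

A small difference from the logged value is expected because the loss shown in the summary is itself rounded.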

### 4.5. Deploy & run inference on the fine-tuned model
***
A trained model does nothing on its own; we now want to use it to perform inference. For this example, that means generating text that continues an input prompt. We follow the same steps as in [3. Run inference on the pre-trained model without finetuning](#3.-Run-inference-on-the-pre-trained-model-without-finetuning). However, instead of deploying the pre-trained model behind `base_model_predictor`, we deploy the `tg_estimator` that we fine-tuned, reusing the instance type and image URI of the pre-trained `model` for the endpoint.
***

In [23]:
endpoint_name_after_finetune = name_from_base(f"jumpstart-example-{model_id}-")

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = (hp_tuner if use_amt else tg_estimator).deploy(
    initial_instance_count=1,
    instance_type=model.instance_type,
    image_uri=model.image_uri,
    endpoint_name=endpoint_name_after_finetune,
)

INFO:sagemaker:Creating model with name: hf-textgeneration1-gpt-j-6b-2024-05-09-03-26-40-668
INFO:sagemaker:Creating endpoint-config with name jumpstart-example-huggingface-textgener-2024-05-09-03-26-40-668
INFO:sagemaker:Creating endpoint with name jumpstart-example-huggingface-textgener-2024-05-09-03-26-40-668


----------!

Next, we query the fine-tuned model using the same set of examples as above, parse the response, and print the predictions. The outputs of the fine-tuned model are shown below. After fine-tuning, the model generates text that is more closely related to the financial domain.

In [24]:
parameters = {
    "max_length": 400,
    "num_return_sequences": 1,
    "top_k": 250,
    "top_p": 0.8,
    "do_sample": True,
    "temperature": 1,
}

res_gpt_finetune = []
for quota_text in [
    "This Form 10-K report shows that",
    "We serve consumers through",
    "Our vision is",
]:
    payload = {"inputs": f"{quota_text}:", "parameters": parameters}
    generated_texts = finetuned_predictor.predict(payload)[0]["generated_text"]
    res_gpt_finetune.append(generated_texts)
    print(generated_texts)
    print("\n")

(1)The following table provides information about our agreements, includinglegal provisions, regarding lock-up and conversion of certain types ofinvestments (in thousands, except percentages):The following table provides information about collateralization of certainof our liabilities (in thousands):Amazon.com Int’l Sales, Inc.  AMAZON.COM, INC.NOTES TO CONSOLIDATED FINANCIAL STATEMENTS-(Continued)The following table summarizes contractual maturities of our


 Amazon Web Services, which provides technology infrastructure to start-ups and enterprises of all sizes, and to developers building all types of applications; Amazon Books, which offers customers access to over a million new, used, and out-of-print books; Amazon Game Store, which offers quality games for the Amazon.com platform; Amazon Elastic Compute Cloud (Amazon EC2), which provides on-demand compute, storage, database, and other service capabilities for developers and enterprises of all sizes; and digital


 To be the world’s

We compare the outputs between the model before fine-tuning and after fine-tuning.

In [25]:
import pandas as pd

pd.DataFrame(
    {
        "Input example": [
            "This Form 10-K report shows that",
            "We serve consumers through",
            "Our vision is",
        ],
        "Output before finetuning": res_gpt_before_finetune,
        "Output after finetuning": res_gpt_finetune,
    }
)

| | Input example | Output before finetuning | Output after finetuning |
|---|---|---|---|
| 0 | This Form 10-K report shows that | (1) we are unable to predict the actual resul... | (1)The following table provides information ab... |
| 1 | We serve consumers through | \n\nOur company was founded in the year 2014, ... | Amazon Web Services, which provides technolog... |
| 2 | Our vision is | \n\nAll of our students should be able to atte... | To be the world’s best media and entertainmen... |


In [28]:
quota_text = "The SEC filing data consists of"
payload = {"inputs": f"{quota_text}:", "parameters": parameters}
generated_text = finetuned_predictor.predict(payload)[0]["generated_text"]
generated_text   

' (1) our Code of Business\nEthics and Conduct ("Code") and compliance certification page; (2) the\nrisk factors of our Annual Report on Form 10-K for the year ended December\n31, 2017; (3) the restatement of our Annual Report on Form 10-K for the year\nended December 31, 2016; (4) our Proxy Statement for our Annual Meeting of\nShareholders, to be filed with the SEC in connection with our 2019 Annual\n'

---
Next, we clean up the deployed endpoint.

---

In [29]:
# Delete the SageMaker endpoint and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: hf-textgeneration1-gpt-j-6b-2024-05-09-03-26-40-668
INFO:sagemaker:Deleting endpoint configuration with name: jumpstart-example-huggingface-textgener-2024-05-09-03-26-40-668
INFO:sagemaker:Deleting endpoint with name: jumpstart-example-huggingface-textgener-2024-05-09-03-26-40-668
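
Because a running endpoint incurs cost until it is deleted, it can help to make cleanup automatic even when a query fails. The sketch below wraps the two delete calls used in this notebook in a context manager. The `cleaned_up` helper and the `FakePredictor` class are hypothetical; the fake stands in for a real predictor so the example runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def cleaned_up(predictor):
    """Yield the predictor, then delete its model and endpoint even if an error occurs."""
    try:
        yield predictor
    finally:
        predictor.delete_model()
        predictor.delete_endpoint()


class FakePredictor:
    """Hypothetical stand-in that records calls instead of talking to SageMaker."""

    def __init__(self):
        self.calls = []

    def predict(self, payload):
        raise RuntimeError("simulated inference failure")

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


fake = FakePredictor()
try:
    with cleaned_up(fake) as predictor:
        predictor.predict({"inputs": "Our vision is:"})
except RuntimeError:
    pass  # the failure still propagates after cleanup; it is swallowed here only for the demo

print(fake.calls)
```

With a real predictor, the same pattern would be `with cleaned_up(finetuned_predictor) as predictor: ...`, with all queries placed inside the block.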
