# Fine-tune and deploy LLaMA V2 models on [AWS Trainium](https://aws.amazon.com/ec2/instance-types/trn1/) and [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf2/) based instances in SageMaker JumpStart - Evaluate responses with LLaMa Index

____

In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy a pre-trained Llama 2 model and to fine-tune it on your own dataset, in either domain adaptation or instruction tuning format, on [AWS Trainium](https://aws.amazon.com/ec2/instance-types/trn1/) and [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf2/) based instances.

### Model License information
---
To perform inference on these models, you need to pass `accept_eula=True` to the `model.deploy()` call. Doing so means you have read and accepted the end-user license agreement (EULA) of the model. The EULA can be found in the model card description or at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. The deployment cell in this notebook passes `accept_eula=True`, so run it only after you have read and accepted the EULA.

Similarly, to fine-tune these models, you need to pass the environment variable `{"accept_eula": "true"}` to the `JumpStartEstimator` class.

---

### Set up

---
We begin by installing and upgrading necessary packages. Restart the kernel after executing the cell below for the first time.

---

In [2]:
%pip install --upgrade sagemaker datasets

Note: you may need to restart the kernel to use updated packages.


## Deploy Pre-trained Model


***
You can now deploy the model using SageMaker JumpStart in one of two ways. Option 1 deploys the endpoint quickly with default settings in two lines of code. Option 2 lets you customize the configuration.

To deploy a model on [AWS Trainium](https://aws.amazon.com/ec2/instance-types/trn1/) or [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf2/) based instances, we first need to call PyTorch Neuron ([torch-neuronx](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html?highlight=graph#neuron-persistent-cache)) to compile the model into a Neuron-specific graph. At runtime, the graph is executed on the **NeuronCores** of the [AWS Trainium](https://aws.amazon.com/ec2/instance-types/trn1/) or [AWS Inferentia](https://aws.amazon.com/ec2/instance-types/inf2/) based instances. Compilation applies optimizations so that the graph uses the NeuronCores efficiently.

In SageMaker JumpStart, we pre-compile the Neuron graphs for a variety of configurations so that you do not wait for graph compilation during endpoint deployment, as long as the deployment parameters (**environment variables**) match one of the configurations listed below. Otherwise, compilation is triggered during endpoint deployment, which lengthens the deployment.

**LLaMA V2 7B and 7B Chat**

|Instance type|Context length|Batch size|Tensor parallel degree|Data type|
|---|---|---|---|---|
|ml.inf2.xlarge|1024|1|2|fp16|
|ml.inf2.8xlarge|2048|1|2|fp16|
|ml.inf2.24xlarge|4096|4|4|fp16|
|ml.inf2.24xlarge|4096|4|8|fp16|
|ml.inf2.24xlarge|4096|4|12|fp16|
|ml.inf2.48xlarge|4096|4|4|fp16|
|ml.inf2.48xlarge|4096|4|8|fp16|
|ml.inf2.48xlarge|4096|4|12|fp16|
|ml.inf2.48xlarge|4096|4|24|fp16|


**LLaMA V2 13B and 13B Chat**

|Instance type|Context length|Batch size|Tensor parallel degree|Data type|
|---|---|---|---|---|
|ml.inf2.8xlarge|1024|1|2|fp16|
|ml.inf2.24xlarge|2048|4|4|fp16|
|ml.inf2.24xlarge|4096|4|8|fp16|
|ml.inf2.24xlarge|4096|4|12|fp16|
|ml.inf2.48xlarge|2048|4|4|fp16|
|ml.inf2.48xlarge|4096|4|8|fp16|
|ml.inf2.48xlarge|4096|4|12|fp16|
|ml.inf2.48xlarge|4096|4|24|fp16|

***
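The lookup described above amounts to a membership test against the table. The sketch below copies only the 7B rows. The names `PRECOMPILED_7B` and `is_precompiled` are illustrative helpers and are not part of the SageMaker SDK.

```python
# Pre-compiled configurations for LLaMA V2 7B / 7B Chat, copied from the table above.
# Each entry: (instance type, context length, batch size, tensor parallel degree, data type)
PRECOMPILED_7B = {
    ("ml.inf2.xlarge", 1024, 1, 2, "fp16"),
    ("ml.inf2.8xlarge", 2048, 1, 2, "fp16"),
    ("ml.inf2.24xlarge", 4096, 4, 4, "fp16"),
    ("ml.inf2.24xlarge", 4096, 4, 8, "fp16"),
    ("ml.inf2.24xlarge", 4096, 4, 12, "fp16"),
    ("ml.inf2.48xlarge", 4096, 4, 4, "fp16"),
    ("ml.inf2.48xlarge", 4096, 4, 8, "fp16"),
    ("ml.inf2.48xlarge", 4096, 4, 12, "fp16"),
    ("ml.inf2.48xlarge", 4096, 4, 24, "fp16"),
}


def is_precompiled(instance_type, context_length, batch_size,
                   tensor_parallel_degree, dtype="fp16"):
    """Return True if the configuration matches a row of the 7B table."""
    # Environment variables are passed as strings, so normalize to int first.
    key = (
        instance_type,
        int(context_length),
        int(batch_size),
        int(tensor_parallel_degree),
        dtype,
    )
    return key in PRECOMPILED_7B


# A listed row, then a configuration that would trigger compilation at deploy time.
print(is_precompiled("ml.inf2.8xlarge", "2048", "1", "2"))
print(is_precompiled("ml.inf2.8xlarge", "4096", "1", "2"))
```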

In [3]:
model_id = "meta-textgenerationneuron-llama-2-7b"

In [4]:
model_version = "1.*"

In [5]:
EXAMPLE_ENV = {
    "meta-textgenerationneuron-llama-2-13b-f": {
        "context_length": "1024",
        "batch_size": "1",
        "tensor_parallel_degree": "2",
        "instance_type": "ml.inf2.8xlarge",
    },
    "meta-textgenerationneuron-llama-2-13b": {
        "context_length": "1024",
        "batch_size": "1",
        "tensor_parallel_degree": "2",
        "instance_type": "ml.inf2.8xlarge",
    },
    "meta-textgenerationneuron-llama-2-7b-f": {
        "context_length": "2048",
        "batch_size": "1",
        "tensor_parallel_degree": "2",
        "instance_type": "ml.inf2.8xlarge",
    },
    "meta-textgenerationneuron-llama-2-7b": {
        "context_length": "2048",
        "batch_size": "1",
        "tensor_parallel_degree": "2",
        "instance_type": "ml.inf2.8xlarge",
    }
}

In [9]:
from sagemaker.jumpstart.model import JumpStartModel

option = "option1"

if option == "option1":
    model = JumpStartModel(model_id=model_id)
    
else:
    model = JumpStartModel(
        model_id=model_id,
        env={
            "OPTION_DTYPE": "fp16",  # corresponds to the `Data type` column
            "OPTION_N_POSITIONS": EXAMPLE_ENV[model_id]["context_length"],  # corresponds to the `Context length` column
            "OPTION_TENSOR_PARALLEL_DEGREE": EXAMPLE_ENV[model_id]["tensor_parallel_degree"],  # corresponds to the `Tensor parallel degree` column
            "OPTION_MAX_ROLLING_BATCH_SIZE": EXAMPLE_ENV[model_id]["batch_size"],  # corresponds to the `Batch size` column
        },
        instance_type=EXAMPLE_ENV[model_id]["instance_type"],  # corresponds to the `Instance type` column
    )
    
pretrained_predictor = model.deploy(accept_eula=True) 

Your model is not compiled. Please compile your model before using Inferentia.


---------------!

## Invoke the endpoint

---
Next, we invoke the endpoint with some sample queries. Later in this notebook, we fine-tune this model on a custom dataset and run inference with the fine-tuned model. We also compare the results from the pre-trained and the fine-tuned models.

---

In [10]:
def print_response(payload, response):
    print(payload["inputs"])
    print(f"> {response['generated_text']}")
    print("\n==================================\n")

In [11]:
payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
    },
}
try:
    response = pretrained_predictor.predict(payload)
    print_response(payload, response)
except Exception as e:
    print(e)

I believe the meaning of life is
>  to be happy. I believe that happiness is a choice. I believe that happiness is a state of mind. I believe that happiness is a state of being. I believe that happiness is a state of being. I believe that happiness is a state of being. I believe that happiness is a state of being. I believe




## Dataset preparation for fine-tuning

---

You can fine-tune on the dataset with domain adaptation format or instruction tuning format. Below are the instructions for how the training data should be formatted for input to the model.

- **Input:** A train directory containing either a JSON lines (`.jsonl`) or text (`.txt`) formatted file.
  - For a JSON lines (JSONL) file, each line is a JSON object (a dictionary) whose key must be `text`.
  - The train directory must contain exactly one file.
- **Output:** A trained model that can be deployed for inference.
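The JSONL requirement above can be checked locally before uploading. The following is a minimal sketch using only the standard library. The helper name `validate_jsonl` is hypothetical, and skipping blank lines is a choice made here, not something the training job is known to do.

```python
import json
import os
import tempfile


def validate_jsonl(path):
    """Return the number of records; raise ValueError on the first malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # this sketch ignores blank lines
            record = json.loads(line)  # raises a ValueError subclass on invalid JSON
            if not isinstance(record, dict) or "text" not in record:
                raise ValueError(
                    f"line {line_number}: expected a JSON object with a 'text' key"
                )
            count += 1
    return count


# Usage on a throwaway file with two well-formed records.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": "first example"}) + "\n")
        f.write(json.dumps({"text": "second example"}) + "\n")
    print(validate_jsonl(path))
```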

In this demo, we will use a subset of the [Dolly dataset](https://huggingface.co/datasets/databricks/databricks-dolly-15k) in an instruction tuning format. The Dolly dataset contains roughly 15,000 instruction-following records across categories such as question answering, summarization, and information extraction. It is available under the Apache 2.0 license. We will select the information extraction examples for fine-tuning.

For a demonstration of using a text file as input, please see [Appendix 2](#2.-Use-text-file-as-input-to-fine-tune-LLaMA-2).


---

In [12]:
from datasets import load_dataset

dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

task = "information_extraction"
# To train for summarization or closed question answering instead, set `task` to "summarization" or "closed_qa".
task_dataset = dolly_dataset.filter(lambda example: example["category"] == task)
task_dataset = task_dataset.remove_columns("category")

# We split the dataset into two, where the test data is used for evaluation at the end.
train_and_test_dataset = task_dataset.train_test_split(test_size=0.1)

# Dumping the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

2225888

In [13]:
train_and_test_dataset["train"][0]

{'instruction': 'Extract all of the names of people mentioned in this paragraph and list them using bullets in the format {Name}',
 'context': 'The magazine was part of Mondadori and was based in Milan. Its first editor was Alberto Mondadori who was succeeded in the post by Enzo Biagi in 1953. During the period until 1960 when Enzo Biagi edited Epoca the magazine covered current affairs news, social attitudes as well as TV news. The magazine also included frequent and detailed articles about Hollywood stars of the period and Italian movie stars such as Gina Lollobrigida. The weekly had offices in New York City, Paris and Tokyo. From June 1952 to the late 1958 the Cuban-Italian writer Alba de Céspedes wrote an agony column, called Dalla parte di lei, in the magazine.',
 'response': '• Enzo Biagi\n• Alberto Mondadori\n• Gina Lollobrigida\n• Alba de Céspedes'}

---
Next, we use a prompt template for preprocessing the data in an instruction / input format for the training job, and also for inferencing the deployed endpoint.

---

In [14]:
prompt = ("""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n{response}\n\n<s>""")

In [15]:
def apply_prompt_template(sample):
    return {
        "text": prompt.format(instruction=sample["instruction"], context=sample["context"], response=sample["response"])
    }

In [16]:
dataset_processed = train_and_test_dataset.map(apply_prompt_template, remove_columns=list(train_and_test_dataset["train"].features))

Map:   0%|          | 0/1355 [00:00<?, ? examples/s]

Map:   0%|          | 0/151 [00:00<?, ? examples/s]

In [17]:
import os

os.makedirs("dolly", exist_ok=True)  # make sure the output directory exists
dataset_processed["train"].to_json(f"dolly/processed-train-{task}.jsonl")
dataset_processed["test"].to_json(f"dolly/processed-test-{task}.jsonl")

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

295876

### Upload dataset to S3
---

We will upload the prepared dataset to S3, where the fine-tuning job will read it.

---

In [18]:
from sagemaker.s3 import S3Uploader
import sagemaker

output_bucket = sagemaker.Session().default_bucket()
local_data_file = f"dolly/processed-train-{task}.jsonl"
train_data_location = f"s3://{output_bucket}/dolly_dataset_trn1"
S3Uploader.upload(local_data_file, train_data_location)
print(f"Training data: {train_data_location}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
Training data: s3://sagemaker-us-west-2-390840497958/dolly_dataset_trn1


## Train the model
---
Next, we fine-tune the LLaMA v2 model on the information extraction subset of Dolly on an [AWS Trainium](https://aws.amazon.com/ec2/instance-types/trn1/) instance. You have two instance options: `ml.trn1.32xlarge` (default) and `ml.trn1n.32xlarge`. The fine-tuning scripts are based on scripts provided by [Neuronx-Nemo-Megatron](https://github.com/aws-neuron/neuronx-nemo-megatron). For a list of supported hyperparameters and their default values, please see [supported hyperparameters for fine-tuning](#3.-Supported-Hyper-parameters-for-fine-tuning).

---

In [19]:
from sagemaker import hyperparameters

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

print(my_hyperparameters)

{'max_input_length': '2048', 'preprocessing_num_workers': 'None', 'learning_rate': '6e-06', 'min_learning_rate': '1e-06', 'max_steps': '20', 'global_train_batch_size': '256', 'per_device_train_batch_size': '1', 'layer_norm_epilson': '1e-05', 'weight_decay': '0.1', 'lr_scheduler_type': 'CosineAnnealing', 'warmup_steps': '10', 'constant_steps': '0', 'adam_beta1': '0.9', 'adam_beta2': '0.95', 'mixed_precision': 'True', 'tensor_parallel_degree': '8', 'pipeline_parallel_degree': '1', 'append_eod': 'False'}


Overwrite some of the hyperparameters

In [20]:
#my_hyperparameters["max_input_length"] = "4096" # you can increase it up to 4096 for sequence length.
my_hyperparameters["max_steps"] = "25"
my_hyperparameters["learning_rate"] = "0.0001"
print(my_hyperparameters)

{'max_input_length': '2048', 'preprocessing_num_workers': 'None', 'learning_rate': '0.0001', 'min_learning_rate': '1e-06', 'max_steps': '25', 'global_train_batch_size': '256', 'per_device_train_batch_size': '1', 'layer_norm_epilson': '1e-05', 'weight_decay': '0.1', 'lr_scheduler_type': 'CosineAnnealing', 'warmup_steps': '10', 'constant_steps': '0', 'adam_beta1': '0.9', 'adam_beta2': '0.95', 'mixed_precision': 'True', 'tensor_parallel_degree': '8', 'pipeline_parallel_degree': '1', 'append_eod': 'False'}
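To get a feel for how `learning_rate`, `min_learning_rate`, `warmup_steps` and `max_steps` interact, the sketch below evaluates a generic linear-warmup plus cosine-decay schedule with the values set above. It is an illustration only. The exact formula used by the Neuronx-Nemo-Megatron `CosineAnnealing` scheduler may differ, `constant_steps` is ignored (it defaults to 0), and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, max_lr=1e-4, min_lr=1e-6, warmup_steps=10, max_steps=25):
    """Generic schedule: linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 5, 10, 17, 25):
    print(step, f"{lr_at_step(step):.2e}")
```

With only 25 steps and 10 of them spent in warmup, the learning rate in this sketch peaks briefly and then decays quickly, which is worth keeping in mind when changing `max_steps`.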


Validate hyperparameters

In [21]:
hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)

In [None]:
from sagemaker.jumpstart.estimator import JumpStartEstimator


estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    hyperparameters=my_hyperparameters,
    environment={"accept_eula": "true"},  # set to "true" only after you have read and accepted the EULA
    # instance_type="ml.trn1n.32xlarge",  # if not specified, the default `ml.trn1.32xlarge` is used
)

estimator.fit({"train": train_data_location})

INFO:sagemaker.jumpstart:No instance type selected for training job. Defaulting to ml.trn1.32xlarge.
INFO:sagemaker:Creating training-job with name: meta-textgenerationneuron-llama-2-7b-2023-12-09-17-28-58-457


2023-12-09 17:28:58 Starting - Starting the training job.

Studio kernel idle issue: If your Studio kernel goes idle and you lose the reference to the estimator object, see section [4. Studio Kernel goes idle/Creating JumpStart Model from the training Job](#4.-Studio-Kernel-goes-idle/Creating-JumpStart-Model-from-the-training-Job) for how to deploy an endpoint using the training job name and the model id.


### Deploy the fine-tuned model
---
Next, we deploy the fine-tuned model. We will then compare the performance of the fine-tuned and pre-trained models.

---

In [None]:
finetuned_predictor = estimator.deploy()

### Evaluate the pre-trained and fine-tuned model
---
Next, we use the test data to evaluate the performance of the fine-tuned model and compare it with the pre-trained model. 

---

In [None]:
prompt_inference = ("""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}""")
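Because the fine-tuned model is prompted in the same format it was trained on, the inference prompt plus the response marker should be an exact prefix of the corresponding training text. The sketch below checks that property. The templates are shortened stand-ins chosen for illustration, and the helper name `inference_prompt_is_prefix` is hypothetical.

```python
def inference_prompt_is_prefix(train_template, inference_template, marker, sample):
    """Return True if the rendered inference prompt is a prefix of the training text."""
    train_text = train_template.format(**sample)
    prompt_text = (
        inference_template.format(
            instruction=sample["instruction"], context=sample["context"]
        )
        + marker
    )
    return train_text.startswith(prompt_text)


# Shortened stand-in templates (assumptions, not the notebook's full prompts).
TRAIN = "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n{response}"
INFER = "### Instruction:\n{instruction}\n\n### Input:\n{context}"
MARKER = "\n\n### Response:\n"

sample = {
    "instruction": "List the colors mentioned.",
    "context": "The flag is red and blue.",
    "response": "- red\n- blue",
}

print(inference_prompt_is_prefix(TRAIN, INFER, MARKER, sample))
```

A result of `False` usually points at a whitespace or newline mismatch between the two templates.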

In [None]:
import pandas as pd
from IPython.display import display, HTML

test_dataset = train_and_test_dataset["test"]

inputs, ground_truth_responses, responses_before_finetuning, responses_after_finetuning = (
    [],
    [],
    [],
    [],
)


def predict_and_print(datapoint):
    # For instruction fine-tuning, we insert a special key between input and output
    input_output_demarkation_key = "\n\n### Response:\n"

    payload = {
        "inputs": prompt_inference.format(
            instruction=datapoint["instruction"], context=datapoint["context"]
        )
        + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 100},
    }
    inputs.append(payload["inputs"])
    ground_truth_responses.append(datapoint["response"])
    pretrained_response = pretrained_predictor.predict(
        payload
    )
    responses_before_finetuning.append(pretrained_response["generated_text"])
    finetuned_response = finetuned_predictor.predict(payload)
    responses_after_finetuning.append(finetuned_response["generated_text"])


try:
    for i, datapoint in enumerate(test_dataset.select(range(5))):
        predict_and_print(datapoint)

    df = pd.DataFrame(
        {
            "Inputs": inputs,
            "Ground Truth": ground_truth_responses,
            "Response from non-finetuned model": responses_before_finetuning,
            "Response from fine-tuned model": responses_after_finetuning,
        }
    )
    display(HTML(df.to_html()))
except Exception as e:
    print(e)
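The table above is meant to be inspected by eye. If a rough numeric signal is also wanted for the information extraction task, one option (not part of the original notebook) is an F1 score over the sets of bullet items. The helper names and the example prediction below are hypothetical. The reference string is the ground-truth response of the training example shown earlier.

```python
def bullet_items(text):
    """Split a response into a set of normalized bullet items."""
    items = set()
    for line in text.splitlines():
        # Drop leading bullet characters and surrounding whitespace.
        item = line.strip().lstrip("•-* ").strip().lower()
        if item:
            items.add(item)
    return items


def item_f1(prediction, reference):
    """F1 score between the bullet items of a prediction and a reference."""
    predicted, expected = bullet_items(prediction), bullet_items(reference)
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


reference = "• Enzo Biagi\n• Alberto Mondadori\n• Gina Lollobrigida\n• Alba de Céspedes"
prediction = "• Enzo Biagi\n• Gina Lollobrigida\n• Milan"  # made-up model output
print(round(item_f1(prediction, reference), 3))
```

Exact string matching is strict, so this score penalizes harmless variations in spelling or formatting. Treat it as a coarse indicator alongside the side-by-side table.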

### Clean up resources

In [None]:
# Delete resources
# pretrained_predictor.delete_model()
# pretrained_predictor.delete_endpoint()
# finetuned_predictor.delete_model()
# finetuned_predictor.delete_endpoint()

# Appendix

### 1. Supported Inference Parameters

---
This model supports the following inference payload parameters:


* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end of sentence token. If specified, it must be boolean.
* **do_sample:** If True, sample the next word as per the likelihood. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **stop**: If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.

We may specify any subset of the parameters mentioned above while invoking an endpoint.
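The constraints listed above can be checked on the client before a request is sent. The sketch below covers the type and range rules as stated, except the `num_beams` versus `num_return_sequences` relation. Treating the `top_p` bounds as inclusive is an assumption, and `check_parameters` is a hypothetical helper, not an SDK function.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_parameters(params):
    """Return a list of human-readable problems found in a `parameters` dict."""
    problems = []
    for name in ("max_length", "max_new_tokens", "top_k"):
        if name in params and not (_is_int(params[name]) and params[name] > 0):
            problems.append(f"{name} must be a positive integer")
    if "no_repeat_ngram_size" in params:
        value = params["no_repeat_ngram_size"]
        if not (_is_int(value) and value > 1):
            problems.append("no_repeat_ngram_size must be an integer greater than 1")
    if "temperature" in params:
        value = params["temperature"]
        if not (_is_number(value) and value > 0):
            problems.append("temperature must be a positive float")
    if "top_p" in params:
        value = params["top_p"]
        # Inclusive bounds are an assumption of this sketch.
        if not (_is_number(value) and 0 <= value <= 1):
            problems.append("top_p must be a float between 0 and 1")
    for name in ("early_stopping", "do_sample"):
        if name in params and not isinstance(params[name], bool):
            problems.append(f"{name} must be boolean")
    if "stop" in params:
        value = params["stop"]
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            problems.append("stop must be a list of strings")
    return problems


print(check_parameters({"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6}))
print(check_parameters({"top_p": 1.5, "stop": "###"}))
```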

---

### 2. Use text file as input to fine-tune LLaMA-2

In [8]:
# import boto3
# model_id = "meta-textgenerationneuron-llama-2-7b" #or  "meta-textgenerationneuron-llama-2-13b"

# estimator = JumpStartEstimator(model_id=model_id,  environment={"accept_eula": "false"})
# estimator.set_hyperparameters(max_steps=30)
# estimator.fit({"training": f"s3://jumpstart-cache-prod-{boto3.Session().region_name}/training-datasets/sec_amazon"})

### 3. Supported Hyper-parameters for fine-tuning
---
- max_input_length: Maximum total input sequence length after tokenization. Sequences longer than this will be truncated. Default: 2048.
- learning_rate: The rate at which the model weights are updated after working through each batch of training examples. Must be a positive float greater than 0. Default: 6e-6.
- min_learning_rate: The learning rate at the last step of learning rate scheduler 'CosineAnnealing'. Default: 1e-06.
- global_train_batch_size: The global batch size for training. Based on global_train_batch_size, the gradient accumulation is calculated as global_train_batch_size / (data_parallel_degree * per_device_train_batch_size), where data_parallel_degree is calculated as total number of neuron cores / (tensor_parallel_degree * pipeline_parallel_degree). Default: 256.
- per_device_train_batch_size: The batch size per Neuron core for training. Default: 1.
- layer_norm_epilson: During layer normalization, a value added to the denominator for numerical stability. See [documentation](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html). Default: 0.00001.
- preprocessing_num_workers: The number of processes to use for preprocessing. If None, all workers (one per vCPU) are used for preprocessing. Default: "None".
- weight_decay: The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights in `AdamW` optimizer. Default: 0.1.
- lr_scheduler_type: Learning rate scheduler type. Default: 'CosineAnnealing' (currently we only support 'CosineAnnealing' scheduler type).
- warmup_steps: Linear warmup over warmup steps. Default: 10.
- constant_steps: The number of steps for learning rate to be constant after warmup_steps in 'CosineAnnealing' scheduler type. Default: 0.
- adam_beta1: The beta1 hyperparameter (exponential decay rate for the first moment estimates) for the AdamW optimizer. Default: 0.9.
- adam_beta2: The beta2 hyperparameter (exponential decay rate for the second moment estimates) for the AdamW optimizer. Default: 0.95.
- mixed_precision: Whether to use mixed precision. If set to 'True', master weights and optimizer states are stored in fp32, and model weights are saved in bf16. For details, see [reference](https://arxiv.org/pdf/1710.03740.pdf). Default: 'True'.
- tensor_parallel_degree: The number of neuron cores which specific model weights, gradients, and optimizer states are split across. For details, see [reference](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html). Default: "8" (currently we only support parallel degree as 8).
- pipeline_parallel_degree: The number of neuron cores which the layers of a model are partitioned across. Default: "1" (currently we only support "1" for LLaMA-2 7B and "4" for LLaMA-2 13B).
- append_eod: Whether to append an `<eod>` token to the end of each example. By setting it to 'True', the fine-tuned model tends to generate succinct output. Default: 'False'.
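The gradient accumulation formula given under `global_train_batch_size` can be worked through numerically. The sketch below assumes 32 NeuronCores in total, which corresponds to a single `ml.trn1.32xlarge` instance (16 Trainium chips with 2 NeuronCores each). The function name is hypothetical.

```python
def gradient_accumulation(total_neuron_cores, global_train_batch_size,
                          per_device_train_batch_size,
                          tensor_parallel_degree, pipeline_parallel_degree):
    """Return (data_parallel_degree, gradient_accumulation_steps)."""
    model_parallel = tensor_parallel_degree * pipeline_parallel_degree
    if total_neuron_cores % model_parallel != 0:
        raise ValueError("total cores must be divisible by tensor * pipeline degree")
    data_parallel_degree = total_neuron_cores // model_parallel

    samples_per_step = data_parallel_degree * per_device_train_batch_size
    if global_train_batch_size % samples_per_step != 0:
        raise ValueError("global batch size must be divisible by dp * per-device batch")
    return data_parallel_degree, global_train_batch_size // samples_per_step


# Defaults for LLaMA-2 7B: tensor parallel 8, pipeline parallel 1.
print(gradient_accumulation(32, 256, 1, 8, 1))
# LLaMA-2 13B uses pipeline parallel degree 4.
print(gradient_accumulation(32, 256, 1, 8, 4))
```

Under these assumptions the 7B defaults give a data parallel degree of 4 and 64 accumulation steps, while the 13B setting gives a data parallel degree of 1 and 256 accumulation steps.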

---

### 4. Studio Kernel goes idle/Creating JumpStart Model from the training Job
---
A training job may take several hours depending on the hyperparameter settings, and the Studio kernel may go idle during training. The training job keeps running in SageMaker regardless. If this happens, you can still deploy the endpoint using the training job name with the following code:

How to find the training job name? Go to Console -> SageMaker -> Training -> Training Jobs -> Identify the training job name and substitute in the following cell. 

---

In [None]:
# from sagemaker.jumpstart.estimator import JumpStartEstimator
# training_job_name = <<training_job_name>>

# attached_estimator = JumpStartEstimator.attach(training_job_name, model_id)
# attached_estimator.logs()
# attached_estimator.deploy()