# Finetune an LLM on Amazon SageMaker
In this notebook, we are going to focus on three topics:

1. Process a publicly available dataset for LLM training/finetuning.
2. Finetune an LLM using QLoRA, an efficient finetuning technique reported to match the performance of full 16-bit finetuning.
3. Deploy the finetuned LLM for inference using SageMaker.

For preprocessing the dataset, we use a SageMaker Processing job to provide the compute resources required to complete the processing steps.

For model finetuning, we'll use a SageMaker Training job, which automatically spins up compute resources, executes the model training steps, and shuts down the resources when the job is complete.

To deploy the finetuned model, we'll use the SageMaker Python SDK to deploy the model to a fully managed SageMaker HTTPS endpoint in a single command.

Let's get started!

First, we need to install the dependencies needed to run the notebook end to end.

In [5]:
!pip install sagemaker boto3 datasets pygments -U -q

In [2]:
import sagemaker
import boto3
from sagemaker.local import LocalSession
import os
from datetime import datetime
from sagemaker.experiments.run import Run
import uuid

sess = sagemaker.Session()
region = sess.boto_region_name
sm_client = boto3.client("sagemaker")

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']


print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {region}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
sagemaker role arn: arn:aws:iam::866824485776:role/service-role/AmazonSageMaker-ExecutionRole-20240725T121088
sagemaker bucket: sagemaker-us-east-1-866824485776
sagemaker session region: us-east-1


In [3]:
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

In [None]:
%store -r

Define the variables to be used for the notebook. 

In [4]:
locals()

{'__name__': '__main__',
 '__doc__': 'Automatically created module for IPython interactive environment',
 '__package__': None,
 '__loader__': None,
 '__spec__': None,
 '__builtin__': <module 'builtins' (built-in)>,
 '__builtins__': <module 'builtins' (built-in)>,
 '_ih': ['',
  "get_ipython().run_line_magic('pip', 'install sagemaker boto3 datasets pygments -U -q')",
  'import sagemaker\nimport boto3\nfrom sagemaker.local import LocalSession\nimport os\nfrom datetime import datetime\nfrom sagemaker.experiments.run import Run\nimport uuid\n\nsess = sagemaker.Session()\nregion = sess.boto_region_name\nsm_client = boto3.client("sagemaker")\n\n# sagemaker session bucket -> used for uploading data, models and logs\n# sagemaker will automatically create this bucket if it not exists\nsagemaker_session_bucket=None\nif sagemaker_session_bucket is None and sess is not None:\n    # set to default bucket if a bucket name is not given\n    sagemaker_session_bucket = sess.default_bucket()\n\ntry:\n  

In [6]:
if "base_model_pkg_group_name" not in locals():
    base_model_pkg_group_name = "None"

In [7]:
rand_id = uuid.uuid4().hex[:5] # this is the random-id assigned for each run. 
training_dataset_s3_loc = f"s3://{sagemaker_session_bucket}/data/bootcamp-{rand_id}/train"
validation_dataset_s3_loc = f"s3://{sagemaker_session_bucket}/data/bootcamp-{rand_id}/eval"
model_output_s3_loc = f"s3://{sagemaker_session_bucket}/data/bootcamp-{rand_id}/model"
model_eval_s3_loc = f"s3://{sagemaker_session_bucket}/data/bootcamp-{rand_id}/modeleval"
model_id = "NousResearch/Llama-2-7b-chat-hf"
hf_dataset_name = "hotpot_qa"

print(f"training_dataset_s3_loc: {training_dataset_s3_loc}")
print(f"validation_dataset_s3_loc: {validation_dataset_s3_loc}")
print(f"model artifact S3 location: {model_output_s3_loc}")
print(f"model evaluation output S3 location: {model_eval_s3_loc}")
print(f"model_id: {model_id}")
print(f"base model package group name: {base_model_pkg_group_name}")
print(f"Hugging Face dataset name: {hf_dataset_name}")

training_dataset_s3_loc: s3://sagemaker-us-east-1-866824485776/data/bootcamp-4854f/train
validation_dataset_s3_loc: s3://sagemaker-us-east-1-866824485776/data/bootcamp-4854f/eval
model artifact S3 location: s3://sagemaker-us-east-1-866824485776/data/bootcamp-4854f/model
model evaluation output S3 location: s3://sagemaker-us-east-1-866824485776/data/bootcamp-4854f/modeleval
model_id: NousResearch/Llama-2-7b-chat-hf
base model package group name: None
Hugging Face dataset name: hotpot_qa


# Preprocessing Data
In our bootcamp, we'll build a generative AI chatbot application, which requires an LLM that can understand instructions and provide accurate answers to user queries in natural language.
For this reason, we choose an open source Llama2 base model, [NousResearch-Llama-2-7b-chat-hf](https://huggingface.co/NousResearch/Llama-2-7b-chat-hf), which has been instruction tuned. We will finetune this model using a good-quality Q&A dataset.

For our hands-on, we'll use a public dataset called [hotpotQA](https://hotpotqa.github.io/) as the data source. Here's a short summary of the dataset: 

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. It is collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

## SageMaker Processing

To analyze data and evaluate machine learning models on Amazon SageMaker, we use an Amazon SageMaker Processing job. With Processing, you can use a simplified, managed experience on SageMaker to run your data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation. You can also use the Amazon SageMaker Processing APIs during the experimentation phase and after the code is deployed in production to evaluate performance.

Here's a diagram that depicts how SageMaker Processing works:

![sagemaker-processing](images/sagemaker-processing-diagram.png)

In particular, we'll use a Python script that contains the code required to handle the dataset. The script is executed in a SageMaker Processing job to automate the task end to end. The processing script is shown later in this section and is available in [src/preprocess/preprocess.py](src/preprocess/preprocess.py).

We'll process the data by running this script as a SageMaker Processing job.

To launch a processing job, we use a PyTorch container and call the `PyTorchProcessor.run()` method. The `run()` method supports passing arguments to the script.
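As a rough sketch of how a processing script could receive those arguments, the example below parses command-line flags with `argparse`. The flag names mirror the ones passed to `run()` later in this notebook, but the helper names `parse_range` and `build_parser`, the default values, and the reading of `1:50` as a `start:end` range are assumptions made for illustration; the actual `preprocess.py` may handle them differently.

```python
import argparse
from typing import Tuple


def parse_range(spec: str) -> Tuple[int, int]:
    """Parse a 'start:end' string such as '1:50' into a pair of integers.

    Treating the value as a start/end pair is an assumption for this sketch.
    """
    start_text, end_text = spec.split(":")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise argparse.ArgumentTypeError(f"start must not exceed end: {spec}")
    return start, end


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags that the processing job passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-data-split", type=parse_range, default=(1, 50))
    parser.add_argument("--eval-data-split", type=parse_range, default=(51, 100))
    parser.add_argument("--hf-dataset-name", type=str, default="hotpot_qa")
    return parser


# Simulate the argument list that the processing job hands to the script
args = build_parser().parse_args(
    ["--train-data-split", "1:50",
     "--eval-data-split", "51:100",
     "--hf-dataset-name", "hotpot_qa"]
)
print(args.train_data_split, args.eval_data_split, args.hf_dataset_name)
```

Using a `type=` callable keeps validation next to the flag definition, so a malformed value fails before any data is downloaded.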

You can optionally pass input data to the `run()` method to provide an input dataset stored in an S3 bucket. By default, the SageMaker Processing job downloads the data from the specified S3 location to a local path inside the processing container, in the `/opt/ml/processing/input` directory.

You can also provide an S3 location for the output data via the `run()` method by configuring a `ProcessingOutput` object. If not provided, the SageMaker Processing job defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`.
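To make the default output location concrete, the sketch below assembles a URI from the pattern described above. The helper name `default_processing_output_uri` and every example value are hypothetical; the SDK builds this path internally, so this only illustrates the naming scheme.

```python
def default_processing_output_uri(region: str, account_id: str,
                                  job_name: str, output_name: str) -> str:
    """Assemble the default S3 URI pattern described above (illustrative helper only)."""
    bucket = f"sagemaker-{region}-{account_id}"
    return f"s3://{bucket}/{job_name}/output/{output_name}/"


# Hypothetical values, used only to show the shape of the resulting URI
example_uri = default_processing_output_uri(
    "us-east-1", "123456789012", "my-processing-job", "train_data"
)
print(example_uri)
```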

The following code shows the Python script used for the processing job.

In [8]:
!pygmentize src/preprocess/preprocess.py

import argparse
import os
from datasets import load_dataset

def format_hotpot(sample):
    """
    Function that takes a single data sample derived from Huggingface datasets API: (https://huggingface.co/docs/datasets/index)
    and formats it into llama2 prompt format. For more information about llama2 prompt format, 
    please refer to https://huggingface.co/blog/llama2#how-to-prompt-llama-2 
    
    An example prompt is shown in the following:
    <s>
      [INST] <<SYS>>
        {{system}}
      <</SYS>>

      ### Question
      {{ques

In [9]:
# Initialize the PyTorchProcessor
from sagemaker.pytorch.processing import PyTorchProcessor

torch_processor = PyTorchProcessor(
    framework_version='2.0',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    # instance_type='local', # uncomment for local mode
    instance_count=1,
    base_job_name='frameworkprocessor-PT',
    py_version="py310",
    sagemaker_session=sess
)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.


In [10]:
torch_processor.run(
    code="preprocess.py",
    source_dir="src/preprocess",
    outputs=[
        ProcessingOutput(output_name="train_data",
                         source="/opt/ml/processing/train",
                         destination=training_dataset_s3_loc),
        ProcessingOutput(output_name="eval_data",
                         source="/opt/ml/processing/eval",
                         destination=validation_dataset_s3_loc),

    ],
    arguments=["--train-data-split", "1:50",
               "--eval-data-split", "51:100",
               "--hf-dataset-name", hf_dataset_name]
)

INFO:sagemaker.processing:Uploaded src/preprocess to s3://sagemaker-us-east-1-866824485776/frameworkprocessor-PT-2024-08-07-02-06-34-123/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-us-east-1-866824485776/frameworkprocessor-PT-2024-08-07-02-06-34-123/source/runproc.sh
INFO:sagemaker:Creating processing-job with name frameworkprocessor-PT-2024-08-07-02-06-34-123


Collecting datasets (from -r requirements.txt (line 1))
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets->-r requirements.txt (line 1))
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting pyarrow-hotfix (from datasets->-r requirements.txt (line 1))
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting requests>=2.32.2 (from datasets->-r requirements.txt (line 1))
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm>=4.66.3 (from datasets->-r requirements.txt (line 1))
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 8.5 MB/s eta 0:00:00
Collecting xxhash (from datasets->-r requirements.txt (line 1))
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collect

# Fine-Tune Llama2-7b model on Amazon SageMaker
We are going to use the method introduced in the paper "[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)" by Tim Dettmers et al. 
QLoRA is a new technique to reduce the memory footprint of large language models during finetuning, without sacrificing performance. 
The TL;DR; of how QLoRA works is:

* Quantize the pretrained model to 4 bits and freeze it.
* Attach small, trainable adapter layers. (LoRA)
* Finetune only the adapter layers, while using the frozen quantized model for context.
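To give a feel for why training only the adapters is cheap, here is a back-of-the-envelope sketch. It assumes a single 4096 x 4096 projection matrix (4096 is the hidden size of Llama-2-7b) and uses the `lora_r` and `lora_alpha` values set later in this notebook. The function name `lora_param_counts` is hypothetical, and which layers actually receive adapters depends on the training script.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (frozen, trainable) parameter counts for one adapted weight matrix.

    LoRA keeps the d_in x d_out matrix frozen and learns two low-rank factors
    of shapes d_in x rank and rank x d_out.
    """
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)
    return frozen, trainable


LORA_R = 64       # matches 'lora_r' in this notebook's hyperparameters
LORA_ALPHA = 16   # matches 'lora_alpha' in this notebook's hyperparameters

frozen, trainable = lora_param_counts(4096, 4096, LORA_R)
scaling = LORA_ALPHA / LORA_R  # LoRA scales the adapter output by alpha / r

print(f"frozen weights in the matrix: {frozen:,}")
print(f"trainable adapter weights:    {trainable:,}")
print(f"trainable share:              {trainable / frozen:.1%}")
print(f"adapter scaling factor:       {scaling}")
```

Under these assumptions the adapter adds roughly 3% as many parameters as the matrix it wraps, and only those parameters need gradients and optimizer state.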

We prepared a train.py, which implements QLoRA using PEFT to train our model. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code.

Here's an animation that shows how QLoRA works in general.

![lora-animated](images/lora-animated.gif)

In [11]:
!pygmentize src/train/train.py

import os
import torch
import argparse
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from datasets import load_from_disk
import tarfile
import boto3

# Setting up Hyperparameters for the fine tuning job
The following section sets up the hyperparameters required for finetuning a QLoRA model. 

To learn more about the hyperparameter settings for quantization and PEFT, please refer to the [quantization documentation](https://huggingface.co/docs/transformers/main_classes/quantization) and the [LoRA config source](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/config.py).



In [12]:
import time
from sagemaker.huggingface import HuggingFace

# define the SageMaker Experiments name and run name
time_suffix = datetime.now().strftime('%y%m%d%H%M')
experiments_name = f"exp-{model_id.replace('/', '-')}"
run_name = f"qlora-finetune-run-{time_suffix}-{rand_id}"

# define Training Job Name 
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}-{rand_id}'

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                                # pre-trained model
  'epochs': 2,                                         # number of training epochs
  'per_device_train_batch_size': 8,                    # Batch size per GPU for training
  'per_device_eval_batch_size' : 8,                    # Batch size per GPU for evaluation
  'learning_rate' : 2e-4,                              # Initial learning rate (AdamW optimizer)
  'optimizer' : "paged_adamw_32bit",                   # Optimizer to use
  'logging_steps' : 5,                                 # Log every X updates steps
  'lora_r': 64,                                        # LoRA attention dimension.
  'lora_alpha' : 16,                                   # The alpha parameter for Lora scaling
  'lora_dropout' : 0.1,                                # The dropout probability for Lora layers
  'use_4bit' : True,                                   # Activate 4-bit precision base model loading
  'bnb_4bit_compute_dtype' : "float16",                # Compute dtype for 4-bit base models
  'bnb_4bit_quant_type' : "nf4",                       # Quantization type (fp4 or nf4)
  'base_model_group_name' : base_model_pkg_group_name, # Base model registered in SageMaker Model Registry
  'region': region,                                    # AWS region where the training is run
  'model_eval_s3_loc' : model_eval_s3_loc              # S3 location for uploading the model evaluation metrics
}

print(f"SageMaker experiment name: {experiments_name}")
print(f"SageMaker experiment run name: {run_name}")
print(f"SageMaker training job name: {job_name}")

SageMaker experiment name: exp-NousResearch-Llama-2-7b-chat-hf
SageMaker experiment run name: qlora-finetune-run-2408070215-4854f
SageMaker training job name: huggingface-qlora-2024-08-07-02-15-57-4854f
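In script mode, SageMaker hands the `hyperparameters` dictionary to the entry point as command-line arguments of the form `--key value`, and every value arrives as a string. The sketch below shows one way a training script could read a few of them back. The helper names `str2bool` and `parse_hyperparameters` and the default values are assumptions for illustration, not a copy of the actual `train.py`.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert 'True'/'False' style strings, since CLI arguments arrive as text."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_hyperparameters(argv):
    """Parse a subset of the hyperparameters defined in the notebook."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--lora_r", type=int, default=64)
    parser.add_argument("--use_4bit", type=str2bool, default=True)
    # parse_known_args tolerates flags that this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the argument list built from the hyperparameters dictionary
args = parse_hyperparameters(
    ["--model_id", "NousResearch/Llama-2-7b-chat-hf",
     "--epochs", "2",
     "--learning_rate", "0.0002",
     "--lora_r", "64",
     "--use_4bit", "True",
     "--lora_alpha", "16"]
)
print(args)
```

Passing booleans through an explicit converter avoids the common pitfall where `bool("False")` evaluates to `True`.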


## Run a SageMaker Training Job
In this lab, we'll use a SageMaker Training job to finetune a Llama2-7b model. The training job includes the following information:

* The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training data.
* The compute resources that you want SageMaker to use for model training. Compute resources are machine learning (ML) compute instances that are managed by SageMaker.
* The URL of the S3 bucket where you want to store the output of the job.
* The Amazon Elastic Container Registry path where the training code is stored.

To create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the underlying infrastructure. SageMaker starts and manages all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`.
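Inside the training container, each input channel is placed in its own subdirectory of `/opt/ml/input/data`, and SageMaker exposes the paths through environment variables named `SM_CHANNEL_<CHANNEL NAME IN UPPER CASE>`. The sketch below shows how a script might resolve them for the `training` and `validation` channels used in this notebook. The function name `resolve_channel_dirs` is hypothetical, and the fallback paths are the conventional container locations, included so the snippet also runs outside SageMaker.

```python
import os


def resolve_channel_dirs(env=os.environ) -> dict:
    """Look up the data and model directories that SageMaker exposes to the script."""
    return {
        "training": env.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
        "validation": env.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"),
        # Files written to the model directory are packaged as the job's model artifact
        "model": env.get("SM_MODEL_DIR", "/opt/ml/model"),
    }


# With an empty environment the conventional container paths are returned
dirs = resolve_channel_dirs({})
print(dirs)
```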

After the training job, we'll use the estimator object to deploy the model for inference. 

In [None]:
with Run(
    experiment_name=experiments_name,
    run_name=run_name,
    sagemaker_session=sess
) as run:

    # create the Estimator
    huggingface_estimator = HuggingFace(
        entry_point='train.py',         # train script
        source_dir='src/train',         # directory which includes all the files needed for training
        instance_type='ml.g5.2xlarge', # instances type used for the training job
        # instance_type='local_gpu',      # use local 
        instance_count=1,               # the number of instances used for training
        base_job_name=job_name,         # the name of the training job
        role=get_execution_role(),      # IAM role used in training job to access AWS resources, e.g. S3
        volume_size=300,                # the size of the EBS volume in GB
        transformers_version='4.28.1',  # the transformers version used in the training job
        pytorch_version='2.0.0',        # the PyTorch version used in the training job
        py_version='py310',             # the python version used in the training job
        hyperparameters=hyperparameters,
        environment={ "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
        sagemaker_session=sess,         # specifies a sagemaker session object
        output_path=model_output_s3_loc # s3 location for model artifact
    )
    
    # define a data input dictionary with our uploaded s3 uris
    data = { 'training': training_dataset_s3_loc,
             'validation': validation_dataset_s3_loc}

    # starting the train job with our uploaded datasets as input
    huggingface_estimator.fit(data, wait=True)
    run.log_parameters(data)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-qlora-2024-08-07-02-15-57-4-2024-08-07-02-16-02-570


2024-08-07 02:16:03 Starting - Starting the training job
2024-08-07 02:16:03 Pending - Training job waiting for capacity......

# Deploy the finetuned Llama2 model in SageMaker
State-of-the-art deep learning models for applications such as natural language processing (NLP) are large, typically with tens or hundreds of billions of parameters. Larger models are often more accurate, which makes them attractive to machine learning practitioners. However, these models are often too large to fit on a single accelerator or GPU device, making it difficult to achieve low-latency inference. You can avoid this memory bottleneck by using model parallelism techniques to partition a model across multiple accelerators or GPUs.

Amazon SageMaker includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference (LMI). In the following sections, you can find resources to get started with LMI on SageMaker.

With these DLCs you can use third party libraries such as [DeepSpeed](https://github.com/microsoft/DeepSpeed), [Accelerate](https://huggingface.co/docs/accelerate), and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) to partition model parameters using model parallelism techniques to leverage the memory of multiple GPUs for inference.

After the training job, we will deploy the QLoRA finetuned model to SageMaker for inference. In our example, we use a Large Model Inference (LMI) container provided by AWS, built on `DJL Serving` and `DeepSpeed`. Given the size of the Llama2-7b model, it can fit on a single `ml.g5.2xlarge` instance on Amazon SageMaker.
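A quick estimate supports the single-instance claim. The sketch below computes the memory needed for the weights alone, assuming a round figure of 7 billion parameters and the single 24 GB NVIDIA A10G GPU on `ml.g5.2xlarge`. It ignores the KV cache, activations, and framework overhead, so treat the result as a lower bound rather than a sizing guarantee. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


N_PARAMS = 7e9         # round figure for a 7B-parameter model (assumption)
GPU_MEMORY_GIB = 24    # approximate memory of one A10G on ml.g5.2xlarge

fp16_gib = weight_memory_gib(N_PARAMS, 2)    # 16-bit weights: 2 bytes each
int4_gib = weight_memory_gib(N_PARAMS, 0.5)  # 4-bit weights: half a byte each

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
print(f"fp16 weights fit on one GPU: {fp16_gib < GPU_MEMORY_GIB}")
```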

### Deep Java Library (DJL) 
Deep Java Library (DJL) Serving is a high performance universal stand-alone model serving solution powered by DJL. DJL Serving supports loading models trained with a variety of different frameworks. With the SageMaker Python SDK you can use DJL Serving to host large models using backends like DeepSpeed and HuggingFace Accelerate.

For more information about using `DJL Serving` model server for hosting LLMs in SageMaker, please refer to the following:

* [DeepSpeed and Accelerate](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-deepspeed-djl.html)
* [FasterTransformer](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-fastertransformer.html)

In [None]:
from sagemaker.huggingface import get_huggingface_llm_image_uri

llm_image = sagemaker.image_uris.retrieve(
    "djl-deepspeed", region=region, version="0.23.0"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

In [None]:
import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.2xlarge"
number_of_gpu = 1
health_check_timeout = 3600

# create HuggingFaceModel with the DJL image uri
huggingface_model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,
    image_uri=llm_image,
    transformers_version="4.28.1",
    pytorch_version="2.0.0",
    py_version="py310",
    model_server_workers=1,
    role=role,
    sagemaker_session=sess,
)

Trigger a SageMaker deployment by invoking `huggingface_model.deploy()`:

In [None]:
endpoint_name_random_id = uuid.uuid4().hex[:5]
endpoint_name = f"llama2-7b-djl-deepspeed-{endpoint_name_random_id}"

print(f"endpoint name: {endpoint_name}")
llm = huggingface_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, 
  endpoint_name=endpoint_name
)

# Test the model
In the following section, we'll run a test against the deployed endpoint. Here we format the 
prompt using the llama2 [standard prompt](https://huggingface.co/blog/llama2#how-to-prompt-llama-2).

In the test data, we provide a system prompt along with a question and some contextual information that might be relevant to the answer. Let's see how the model performs!


In [None]:
prompt_template = """<s>
[INST] <<SYS>>
{{system}}
<</SYS>>

### Question
{{question}}

### Context
{{context}}[/INST] """

In [None]:
system_message = "Given the following context, answer the question as accurately as possible:"
def build_llama2_prompt(message):
    question = message['question']
    context = message['context']
    formatted_message = prompt_template.replace("{{system}}", system_message)
    formatted_message = formatted_message.replace("{{question}}", question)
    formatted_message = formatted_message.replace("{{context}}", context)
    return formatted_message

In [None]:
message = {}
message['question'] = "The Oberoi family is part of a hotel company that has a head office in what city?"
message['context'] = """The Ritz-Carlton Jakarta is a hotel and skyscraper in Jakarta, Indonesia and 14th Tallest building in Jakarta. It is located in city center of Jakarta, near Mega Kuningan, adjacent to the sister JW Marriott Hotel. It is operated by The Ritz-Carlton Hotel Company. The complex has two towers that comprises a hotel and the Airlangga Apartment respectively. The hotel was opened in 2005.
The Oberoi family is an Indian family that is famous for its involvement in hotels, namely through The Oberoi Group.
The Oberoi Group is a hotel company with its head office in Delhi. Founded in 1934, the company owns and/or operates 30+ luxury hotels and two river cruise ships in six countries, primarily under its Oberoi Hotels & Resorts and Trident Hotels brands.
The 289th Military Police Company was activated on 1 November 1994 and attached to Hotel Company, 3rd Infantry (The Old Guard), Fort Myer, Virginia. Hotel Company is the regiment\'s specialty company.\nThe Glennwanis Hotel is a historic hotel in Glennville, Georgia, Tattnall County, Georgia, built on the site of the Hughes Hotel. The hotel is located at 209-215 East Barnard Street. The old Hughes Hotel was built out of Georgia pine circa 1905 and burned in 1920. The Glennwanis was built in brick in 1926. The local Kiwanis club led the effort to get the replacement hotel built, and organized a Glennville Hotel Company with directors being local business leaders. The wife of a local doctor won a naming contest with the name "Glennwanis Hotel", a suggestion combining "Glennville" and "Kiwanis".'"""

In [None]:
input = build_llama2_prompt(message)

In [None]:
print(input)

Run a prediction with inference configuration as shown below:

In [None]:
params = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 512,
    "repetition_penalty": 1.03,
  }

output = llm.predict({"text":input, "properties" : params})

In [None]:
print(output['outputs'][0]["generated_text"][len(input):]) # automatically removed the bos_token and eos_token_id
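Slicing by `len(input)` assumes the endpoint echoes the prompt verbatim at the start of `generated_text`. As a more defensive sketch, the helper below removes the prompt only when it is actually present as a prefix. The function name `strip_prompt` and the mock response are hypothetical; only the `outputs[0]["generated_text"]` shape is taken from the cell above.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion, dropping the prompt if the model echoed it."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # The prompt was not echoed, so the text is already the completion
    return generated_text


# Mock response with the same shape as the endpoint output used above
mock_prompt = "[INST] Where is the head office? [/INST] "
mock_output = {"outputs": [{"generated_text": mock_prompt + "Delhi"}]}

completion = strip_prompt(mock_output["outputs"][0]["generated_text"], mock_prompt)
print(completion)
```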

# Clean up

In [None]:
llm.delete_endpoint()