# Deploy Llama2 13B QLoRA on Amazon SageMaker

In this notebook, we use the [Large Model Inference (LMI) container](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-dlc.html) from [SageMaker Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) to host Llama2 13b on Amazon SageMaker.

We'll also look at the configuration parameters that can be used to tune the endpoint for throughput and latency. We deploy on an ml.g5.12xlarge instance, whose four GPUs let the model be sharded with tensor parallelism.

### Import the relevant libraries and configure several global variables using boto3

In [1]:
%pip install sagemaker boto3 awscli --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import boto3
import sagemaker
import jinja2
import json
from pathlib import Path
from sagemaker import Model, image_uris, serializers, deserializers



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [6]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
jinja_env = jinja2.Environment()

# load parameters stored by a previous notebook run
%store -r model_data_s3_location
%store -r model_name


code_dir = "llama2_13b_src"
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

## Step 1: Prepare the model artifacts
The LMI container expects the following artifacts for hosting the model
- `serving.properties` (required): Defines the model server settings and configurations.
- `model.py` (optional): A python script that defines the inference logic.
- `requirements.txt` (optional): Any additional pip wheels that need to be installed.

SageMaker expects the model artifacts in a tarball with the following structure:

```
code
├── serving.properties
├── model.py
└── requirements.txt
```


In this notebook, we'll only provide a `serving.properties`. By default, the container runs the [huggingface.py module](https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/huggingface.py) from the djl python repository as the entry point code. 

In [7]:
!rm -rf {code_dir}
!mkdir -p {code_dir}

### Create the serving.properties
This configuration file tells DJL Serving which model parallelization and inference optimization techniques to use. Set the options according to your workload.

The settings used in this configuration file are:
- `option.model_id`: The Hugging Face model ID or S3 location from which to download the model.
- `option.tensor_parallel_degree`: Set to the number of GPU devices over which to partition the model.
- `option.max_rolling_batch_size`: The maximum batch size for rolling (iteration-level) batching. Limits the number of concurrent requests.
- `option.rolling_batch`: Select a rolling batch strategy. `auto` makes the handler choose the strategy based on the provided configuration. `scheduler` is a native rolling batch strategy supported for a single GPU. `lmi-dist` and `vllm` support multi-GPU rolling (iteration-level) batching.
- `option.paged_attention`: Enabling this preallocates more GPU memory for caching. This is only supported when `option.rolling_batch=lmi-dist` or `option.rolling_batch=auto`.
- `option.max_rolling_batch_prefill_tokens`: Only supported for `option.rolling_batch=lmi-dist`. Limits the number of tokens for caching. Tune this for your workload, based on batch size and input sequence length, to avoid GPU out-of-memory errors.
- `engine`: The runtime engine for the code. `MPI` refers to the parallel processing framework, which is also used by engines such as `DeepSpeed` and `FasterTransformer`.
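As a rough sanity check on `option.tensor_parallel_degree` and `option.dtype`, the sketch below estimates how much GPU memory the model weights alone need on each device. It assumes 13 billion parameters and 2 bytes per fp16 parameter; the helper name `weight_gb_per_gpu` is made up for illustration, and the estimate ignores the KV cache and activations, which also consume GPU memory.

```python
def weight_gb_per_gpu(num_params: float, bytes_per_param: int, tensor_parallel_degree: int) -> float:
    """Approximate weight memory per GPU in GB (1e9 bytes), weights only."""
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_degree / 1e9


# Assumed values: 13e9 parameters, fp16 (2 bytes each), sharded over 4 GPUs.
per_gpu = weight_gb_per_gpu(13e9, 2, 4)
print(f"{per_gpu:.1f} GB of weights per GPU")
```

Whatever remains of each GPU's memory after the weights is what paged attention and the rolling batch can draw on for caching.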


In [8]:
%%writefile {code_dir}/serving.properties
engine = MPI
option.model_id = {{s3_url}}
option.trust_remote_code = true
option.tensor_parallel_degree = 4
option.max_rolling_batch_size = 64
option.rolling_batch = auto
option.dtype = fp16
option.max_rolling_batch_prefill_tokens = 1024
option.paged_attention = True

Writing llama2_13b_src/serving.properties
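Before packaging, it can help to sanity-check the file that was just written. The following is a minimal sketch, not part of DJL Serving: `parse_properties` and `check_properties` are hypothetical helpers, and the GPU count of 4 is an assumption matching the ml.g5.12xlarge instance used here.

```python
def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values such as S3 URLs stay intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_properties(props: dict, gpu_count: int) -> list:
    """Return a list of human-readable problems found in the parsed settings."""
    problems = []
    if "{{" in props.get("option.model_id", ""):
        problems.append("option.model_id still contains an unrendered placeholder")
    degree = int(props.get("option.tensor_parallel_degree", "1"))
    if degree > gpu_count:
        problems.append(f"tensor_parallel_degree {degree} exceeds {gpu_count} GPUs")
    return problems


# Sample text mirroring part of the file above, placeholder not yet rendered.
sample = """engine = MPI
option.model_id = {{s3_url}}
option.tensor_parallel_degree = 4
"""
props = parse_properties(sample)
problems = check_properties(props, gpu_count=4)
print(problems)
```

Running the same check again after the placeholder has been rendered should return an empty list.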


Replace the `{{s3_url}}` placeholder in `serving.properties` with the S3 location of the model.

In [9]:
# read the template, render the S3 location into it, and write it back
properties_path = Path(f"{code_dir}/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3_url=model_data_s3_location))
!pygmentize {code_dir}/serving.properties | cat -n

     1	engine = MPI
     2	option.model_id = s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf/models
     3	option.trust_remote_code = true
     4	option.tensor_parallel_degree = 4
     5	option.max_rolling_batch_size = 64
     6	option.rolling_batch = auto
     7	option.dtype = fp16
     8	option.max_rolling_batch_prefill_tokens = 1024

### Create a tarball with the model artifacts

In [10]:
code_file_name = "llama2_13b_code.tar.gz"
!tar czvf {code_file_name} {code_dir}/

llama2_13b_src/
llama2_13b_src/serving.properties
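If shell commands are not available in your environment, the same archive can be built with Python's standard `tarfile` module. This is a sketch under that assumption; `make_code_tarball` is a hypothetical helper, and the demo writes to a temporary directory rather than to the notebook's working directory.

```python
import tarfile
import tempfile
from pathlib import Path


def make_code_tarball(src_dir: Path, tar_path: Path) -> list:
    """Archive src_dir under its own directory name and return the member names."""
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src_dir, arcname=src_dir.name)
    # Re-open the archive to confirm what was actually written.
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "llama2_13b_src"
    src.mkdir()
    (src / "serving.properties").write_text("engine = MPI\n")
    members = make_code_tarball(src, Path(tmp) / "llama2_13b_code.tar.gz")
    print(members)
```

In the notebook you would call `make_code_tarball(Path(code_dir), Path(code_file_name))` instead of the temporary-directory demo.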


### Upload artifact to S3 and create a SageMaker model

In [11]:
s3_code_prefix = f"{model_name}/code"
bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_artifact = sess.upload_data(code_file_name, bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-376678947624/NousResearch/Llama-2-13b-hf/code/llama2_13b_code.tar.gz


## Step 2: Create the SageMaker endpoint

Retrieve the URI of the LMI container image to use for model inference.

In [12]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.25.0"
)
inference_image_uri

'763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.25.0-deepspeed0.11.0-cu118'

In [13]:
from sagemaker.utils import name_from_base

# keep only the last path segment of the stored model name and add a timestamp suffix
model_name = name_from_base(model_name.split("/")[-1])
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

Llama-2-13b-hf-2023-12-18-14-47-29-177
Created Model: arn:aws:sagemaker:us-west-2:376678947624:model/llama-2-13b-hf-2023-12-18-14-47-29-177


In [14]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"
instance_type = "ml.g5.12xlarge"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:376678947624:endpoint-config/llama-2-13b-hf-2023-12-18-14-47-29-177-config',
 'ResponseMetadata': {'RequestId': 'a7ae18a6-ff01-4fdd-8857-002357b53ede',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a7ae18a6-ff01-4fdd-8857-002357b53ede',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '126',
   'date': 'Mon, 18 Dec 2023 14:47:29 GMT'},
  'RetryAttempts': 0}}

In [15]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-west-2:376678947624:endpoint/llama-2-13b-hf-2023-12-18-14-47-29-177-endpoint


### This step can take about 10 minutes or longer, so please be patient

In [16]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-west-2:376678947624:endpoint/llama-2-13b-hf-2023-12-18-14-47-29-177-endpoint
Status: InService


## Step 3: Invoke the Endpoint

Start with a few general invocations to test speed and throughput.

In [17]:
def get_realtime_response(sagemaker_runtime, endpoint_name, payload):
    """Invoke the endpoint with a JSON payload and return the raw response"""

    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        CustomAttributes='accept_eula=true'
    )
    
    return response
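Single requests measure latency, but rolling batching mainly pays off when several requests are in flight at once. The sketch below shows one possible way to measure that with a thread pool. `measure_concurrent` is a hypothetical helper, and `fake_invoke` is a stub that sleeps instead of calling the endpoint, so the numbers it prints say nothing about real endpoint performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_concurrent(invoke, payloads, max_workers=8):
    """Send payloads concurrently; return (per-request latencies in seconds, requests per second)."""

    def timed(payload):
        start = time.perf_counter()
        invoke(payload)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        latencies = list(pool.map(timed, payloads))
    wall = time.perf_counter() - wall_start
    return latencies, len(payloads) / wall


def fake_invoke(payload):
    """Stub that simulates a short network call."""
    time.sleep(0.01)
    return {"generated_text": "stub"}


latencies, rps = measure_concurrent(fake_invoke, [{"inputs": "hi"}] * 16)
print(f"{len(latencies)} requests, {rps:.0f} requests/s")
```

To run it against the endpoint, pass `lambda p: get_realtime_response(smr_client, endpoint_name, p)` as `invoke` and keep `max_workers` at or below `option.max_rolling_batch_size`.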

### > Generation

In [18]:
%%time
payload = {
        "inputs": "Building a website can be done in 10 simple steps:",
        "parameters": {"max_new_tokens": 126, "no_repeat_ngram_size": 3},
    }

response = get_realtime_response(smr_client, endpoint_name, payload)


generated_text = response["Body"].read().decode("utf8")
print(generated_text)

{"generated_text": "\nChoose a domain name and web hosting provider.\nChoose a content management system (CMS) or build your website from scratch.\nDesign your website's layout and structure.\nCreate high-quality content for your website.\nOptimize your website for search engines (SEO).\nIntegrate social media and other marketing channels.\nTest and launch your website.\nMonitor and maintain your website's performance.\nAnalyze and improve your website's effectiveness.\nWhat are the steps to build a website?\nThe steps to build a website are as follows:"}
CPU times: user 18.2 ms, sys: 0 ns, total: 18.2 ms
Wall time: 3.25 s
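The response body is a JSON document with a `generated_text` field, as the output above shows. A small sketch for pulling out just the text follows; `extract_generated_text` is a hypothetical helper, and the sample bytes are a shortened copy of the output above.

```python
import json


def extract_generated_text(body: bytes) -> str:
    """Decode the response body and return only the generated text."""
    data = json.loads(body.decode("utf-8"))
    return data["generated_text"]


# Shortened sample of the body returned by the endpoint.
sample_body = b'{"generated_text": "\\nChoose a domain name and web hosting provider."}'
text = extract_generated_text(sample_body)
print(text.strip())
```

With a live response you would pass `response["Body"].read()` as `body`.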


### > Translation

In [19]:
%%time
payload = {
        "inputs": """Translate English to French:
                                sea otter => loutre de mer
                                peppermint => menthe poivrée
                                plush girafe => girafe peluche
                                cheese => """,
        "parameters": {"max_new_tokens": 3},
    }

response = get_realtime_response(smr_client, endpoint_name, payload)


generated_text = response["Body"].read().decode("utf8")
print(generated_text)

{"generated_text": " fromage\n"}
CPU times: user 2.66 ms, sys: 0 ns, total: 2.66 ms
Wall time: 98.6 ms


### > Classification

In [20]:
%%time
payload = {
        "inputs": """Tweet: "I hate it when my phone battery dies."
                                Sentiment: Negative
                                ###
                                Tweet: "My day has been :+1:"
                                Sentiment: Positive
                                ###
                                Tweet: "This is the link to the article"
                                Sentiment: Neutral
                                ###
                                Tweet: "This new music video was incredible"
                                Sentiment:""",
        "parameters": {"max_new_tokens": 2},
    }


response = get_realtime_response(smr_client, endpoint_name, payload)


generated_text = response["Body"].read().decode("utf8")
print(generated_text)

{"generated_text": " Positive"}
CPU times: user 2.82 ms, sys: 0 ns, total: 2.82 ms
Wall time: 88.3 ms


### > Question answering

In [21]:
%%time
payload = {
        "inputs": "Could you remind me when was the C programming language invented?",
        "parameters": {"max_new_tokens": 50},
    }


response = get_realtime_response(smr_client, endpoint_name, payload)


generated_text = response["Body"].read().decode("utf8")
print(generated_text)

{"generated_text": "\nThe C programming language was invented in 1972 by Dennis Ritchie.\nThe C programming language was invented in 1972 by Dennis Ritchie. It was developed as a system programming language to write"}
CPU times: user 2.34 ms, sys: 651 µs, total: 2.99 ms
Wall time: 1.23 s


### > Summarization

In [22]:
%%time
payload = {
        "inputs": """Starting today, the state-of-the-art Falcon 40B foundation model from Technology
                                Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
                                that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
                                started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
                                programmatically through the SageMaker Python SDK.
                                Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
                                ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
                                benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
                                exceptional performance without specialized fine-tuning. To make it easier for customers to access this
                                state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMaker JumpStart.
                                Now customers can quickly and easily deploy their own Falcon 40B model and customize it to fit their specific
                                needs for applications such as translation, question answering, and summarizing information.
                                Falcon 40B are generally available today through Amazon SageMaker JumpStart in US East (Ohio),
                                US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai),
                                Europe (London), Europe (Frankfurt), Europe (Ireland), and Canada (Central),
                                with availability in additional AWS Regions coming soon. To learn how to use this new feature,
                                please see SageMaker JumpStart documentation, the Introduction to SageMaker JumpStart –
                                Text Generation with Falcon LLMs example notebook, and the blog Technology Innovation Institute trains the
                                state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker. Summarize the article above:""",
        "parameters": {"max_new_tokens": 200},
}


response = get_realtime_response(smr_client, endpoint_name, payload)


generated_text = response["Body"].read().decode("utf8")
print(generated_text)

{"generated_text": "\n                                Starting today, the state-of-the-art Falcon 40B foundation model from Technology Innovation Institute (TII) is\n                                available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub that offers pre-trained models,\n                                built-in algorithms, and pre-built solution templates to help you quickly get started with ML. You can deploy\n                                and use this Falcon LLM with a few clicks in SageMaker Studio or programmatically through the SageMaker Python\n                                SDK. Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that\n                                ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple\n                                benchmarks to identify top"}
CPU times: user 1.26 ms, sys: 1.74 ms, total: 3 ms
Wall tim

### > Test a Llama2 instruction prompt

In [23]:
def build_llama2_prompt(instructions):
    stop_token = "</s>"
    start_token = "<s>"
    startPrompt = f"{start_token}[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, instruction in enumerate(instructions):
        if instruction["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{instruction['content']}\n<</SYS>>\n\n")
        elif instruction["role"] == "user":
            conversation.append(instruction["content"].strip())
        else:
            conversation.append(f"{endPrompt} {instruction['content'].strip()} {stop_token}{startPrompt}")

    return startPrompt + "".join(conversation) + endPrompt

def get_instructions(user_content):
    
    '''
    Note: We build a fresh instruction list for every user_content.
    This keeps earlier user_content out of the prompt when you run inference several times with a new ask each time.
    ''' 
    
    system_content = '''
    You are a friendly assistant. Your goal is to answer user questions.'''

    instructions = [
        { "role": "system","content": f"{system_content} "},
    ]
    
    instructions.append({"role": "user", "content": f"{user_content}"})
    
    return instructions
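To make the prompt layout concrete, the sketch below writes out the string that `build_llama2_prompt` produces for the single-turn case of one system message followed by one user message. `single_turn_llama2_prompt` is a hypothetical helper written only for illustration, and the system text is shortened.

```python
def single_turn_llama2_prompt(system: str, user: str) -> str:
    """Wrap one system message and one user message in the Llama 2 chat markers."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user.strip()} [/INST]"


prompt_text = single_turn_llama2_prompt(
    "You are a friendly assistant.",
    "What is machine learning?",
)
print(prompt_text)
```

The system block sits inside the first `[INST]` section, and the model's answer is expected after the closing `[/INST]`.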

In [24]:
user_ask = "What is machine learning?"
instructions = get_instructions(user_ask)
prompt = build_llama2_prompt(instructions)


inference_params = {
        "do_sample": True,
        "top_p": 0.6,
        "temperature": 1.0,
        "top_k": 50,
        "max_new_tokens": 100,
        "repetition_penalty": 1.03,
        "stop": ["</s>"],
        "return_full_text": False
    }


payload = {
    "inputs":  prompt,
    "parameters": inference_params,
}

In [25]:
%%time
response = get_realtime_response(smr_client, endpoint_name, payload)


generated_text = response["Body"].read().decode("utf8")
print(generated_text)

{"generated_text": "\n[ML]\nMachine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.\n\nMachine learning algorithms build a mathematical model based on sample data, known as \"training data\", in order to make predictions or decisions without being explicitly programmed to do so.\n\nThere are four main types of machine learning:\n\n1. Supervised learning:"}
CPU times: user 0 ns, sys: 3.19 ms, total: 3.19 ms
Wall time: 2.31 s


## Step 4: Invoke the Endpoint with a Streaming Response

In [39]:
import sys, os
module_path = ".."
sys.path.append(os.path.abspath(module_path))
from utils.LineIterator import LineIterator

def print_response_stream(response_stream):
    event_stream = response_stream.get('Body')
    for line in LineIterator(event_stream):
        print(line, end='')

def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload), 
        ContentType="application/json",
        CustomAttributes='accept_eula=true'
    )
    return response_stream
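`LineIterator` is imported from a local `utils` module that is not shown in this notebook. If you do not have that module, the sketch below illustrates the general idea: buffer the bytes from each `PayloadPart` event and emit text whenever a full line has arrived. `iter_stream_lines` is a hypothetical stand-in and is not the code from `utils.LineIterator`; the fake events only mimic the event shape so the example runs offline.

```python
def iter_stream_lines(event_stream):
    """Yield decoded lines from an iterable of response-stream events."""
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # ignore events that carry no payload bytes
        buffer += part["Bytes"]
        # Emit every complete line currently in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8")
    # Flush whatever is left once the stream ends without a trailing newline.
    if buffer:
        yield buffer.decode("utf-8")


# Fake events: a line may be split across two payload parts.
fake_events = [
    {"PayloadPart": {"Bytes": b"Machine lear"}},
    {"PayloadPart": {"Bytes": b"ning\nis fun"}},
]
lines = list(iter_stream_lines(fake_events))
print(lines)
```

With a live response you would iterate over `response_stream["Body"]` in place of `fake_events`.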

In [40]:
%%time
resp = get_realtime_response_stream(smr_client, endpoint_name, payload)
print_response_stream(resp)


[ML] Machine learning is the subfield of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy. 

Machine learning algorithms build a mathematical model based on sample data, known as \"training data\", in order to make predictions or decisions without being explicitly programmed to do so. 

There are several types of machine learning algorithms, including supervised learning, un
CPU times: user 10.4 ms, sys: 542 µs, total: 11 ms
Wall time: 2.5 s


## Clean up the environment

In [30]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)
sess.delete_model(model_name)

