# How to optimize the Meta Llama-3 70B Amazon JumpStart model for inference using Amazon SageMaker model optimization jobs
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to apply state-of-the-art optimization techniques to an Amazon JumpStart model (JumpStart model ID: `meta-textgeneration-llama-3-70b`) using Amazon SageMaker ahead-of-time (AOT) model optimization capabilities. Each example includes the deployment of the optimized model to an Amazon SageMaker endpoint. In all cases, the inference image is the SageMaker-managed [LMI (Large Model Inference)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) Docker image. LMI images feature a [DJL Serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

You will successively:
1. Deploy a pre-optimized variant of the Amazon JumpStart model with speculative decoding enabled (using a SageMaker-provided draft model). For popular models, the JumpStart team selects and applies the best optimization configurations for you.
2. Customize speculative decoding with an open-source draft model.
3. Quantize the model weights using the AWQ algorithm.
4. Compile the model for deployment on AWS Inferentia 2 accelerated hardware.

**Notices:**
* Make sure that the `ml.p4d.24xlarge` and `ml.inf2.48xlarge` instance types required for this tutorial are available in your AWS Region.
* Make sure that the value of your "ml.p4d.24xlarge for endpoint usage" and "ml.inf2.48xlarge for endpoint usage" Amazon SageMaker service quotas allow you to deploy at least one Amazon SageMaker endpoint using these instance types.

This notebook leverages the [`ModelBuilder` class](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-modelbuilder-creation.html) within the [`sagemaker` Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to abstract away container and model server management and tuning. With `ModelBuilder`, you can work with JumpStart models, Hugging Face Hub models, and custom models by pointing to an S3 path that holds your model data. This sample focuses on the JumpStart optimization path.

### License agreement
* This model is under the Meta license; please refer to the original model card.
* This notebook is a sample notebook and not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html#)
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html) with a version greater than or equal to 2.225.0 

Let's install or upgrade these dependencies using the following command:

In [1]:
# Quote the version specifier so the shell does not treat ">" as a redirection
%pip install "sagemaker>=2.225.0" boto3 huggingface_hub --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.
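
After restarting the kernel, it can be useful to confirm that the installed `sagemaker` version meets the 2.225.0 minimum stated above. The following is a minimal sketch using only the standard library; the helper names (`parse_version`, `meets_minimum`, `installed_version`) are illustrative and not part of the SDK, and the parser deliberately ignores pre-release suffixes.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.225.0' into (2, 225, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad so that '2.225' compares equal to '2.225.0'
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="2.225.0"):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package="sagemaker"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("sagemaker is not installed in this kernel")
else:
    print(f"sagemaker {version} meets minimum: {meets_minimum(version)}")
```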


### Setup

In [2]:
import boto3
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.session import Session
import logging
import huggingface_hub
from pathlib import Path

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [3]:
sagemaker_session = Session()

artifacts_bucket_name = sagemaker_session.default_bucket()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

js_model_id = "meta-textgeneration-llama-3-70b"
gpu_instance_type = "ml.p4d.24xlarge"
neuron_instance_type = "ml.inf2.48xlarge"

In [4]:
response = "Hello, I'm a language model, and I'm here to help you with your English."

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}

sample_output = [{"generated_text": response}]

schema_builder = SchemaBuilder(sample_input, sample_output)

In [5]:
schema_builder

SchemaBuilder(
input_serializer=<sagemaker.serve.builder.schema_builder.JSONSerializerWrapper object at 0x7fe62015f040>
output_serializer=<sagemaker.serve.builder.schema_builder.JSONSerializerWrapper object at 0x7fe533428c40>
input_deserializer=<sagemaker.base_deserializers.JSONDeserializer object at 0x7fe5df08caf0>
output_deserializer=<sagemaker.base_deserializers.JSONDeserializer object at 0x7fe5df08ca60>)
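
The `SchemaBuilder` above infers the request and response format from one sample pair, so every later request should keep the same shape as `sample_input`. As a minimal sketch, a small helper can build such payloads consistently. The helper name `build_payload` and its validation bounds are assumptions made for illustration; they are not limits documented by the container.

```python
import json


def build_payload(prompt, max_new_tokens=128, top_p=0.9, temperature=0.6):
    """Build a request with the same shape as the notebook's sample_input."""
    if not isinstance(prompt, str) or not prompt:
        raise ValueError("prompt must be a non-empty string")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    # Assumed sanity bounds, chosen for this sketch only
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


payload = build_payload("Hello, I'm a language model,")
# The payload must survive a JSON round trip, since it is sent as JSON
print(json.dumps(payload, indent=2))
```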

## 1. Deploy a pre-optimized deployment configuration with speculative decoding (SageMaker-provided draft model)
The `meta-textgeneration-llama-3-70b` JumpStart model is available with multiple pre-optimized deployment configurations. Optimized model artifacts for each configuration have already been created by the JumpStart team and are readily available for deployment. In this section, you will deploy one of these pre-optimized configurations to an Amazon SageMaker endpoint.

### What is speculative decoding?
Speculative decoding is an inference optimization technique introduced by [Y. Leviathan et al. (ICML 2023)](https://arxiv.org/abs/2211.17192) to accelerate the decoding process of large, and therefore slow, LLMs in latency-critical applications. The key idea is to use a smaller, less powerful but faster model, called the ***draft model***, to generate candidate tokens that are then validated by the larger, more powerful but slower model, called the ***target model***. At each iteration, the draft model generates $K>1$ candidate tokens. Then, using a single forward pass of the target model, none, some, or all of the candidate tokens are accepted. The more closely the draft model is aligned with the target model, the better its guesses, the higher the candidate token acceptance rate, and therefore the higher the speedup. The larger the size gap between the target and the draft model, the larger the potential speedup.
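
The propose-then-verify loop can be sketched with two toy deterministic "models". This is a simplified greedy variant written for illustration only: the actual method uses a rejection-sampling rule to preserve the target distribution, and the target's single forward pass over all candidates is simulated here with a plain loop. The functions `target_next` and `draft_next` are made-up stand-ins, not real models.

```python
def target_next(context):
    """Toy target model: a deterministic rule over the last token."""
    return (context[-1] * 3 + 1) % 7


def draft_next(context):
    """Toy draft model: agrees with the target unless the last token is a multiple of 3."""
    guess = target_next(context)
    return guess if context[-1] % 3 else (guess + 1) % 7


def greedy_decode(prompt, num_new_tokens):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, num_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    target_passes = 0
    while len(tokens) < goal:
        # 1) The draft model proposes k candidate tokens autoregressively
        candidates = []
        for _ in range(k):
            candidates.append(draft_next(tokens + candidates))
        # 2) One (simulated) target pass checks every candidate position
        target_passes += 1
        accepted = []
        for candidate in candidates:
            expected = target_next(tokens + accepted)
            if candidate != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(candidate)
        else:
            # All candidates accepted: the same pass also yields one extra token
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


spec_tokens, passes = speculative_decode([1], 20, k=4)
print("matches target-only decoding:", spec_tokens == greedy_decode([1], 20))
print("target passes:", passes, "for 20 new tokens")
```

Because every accepted token is checked against the target, the output is identical to target-only greedy decoding; the gain comes from needing fewer target passes than generated tokens.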

Let's start by creating a `ModelBuilder` instance for the model:

In [6]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

For each optimization configuration, the JumpStart team has computed key performance metrics, such as time-to-first-token (TTFT) latency and throughput, for multiple hardware types and levels of concurrent invocations. Let's visualize these metrics using the `display_benchmark_metrics` method:

In [7]:
model_builder.display_benchmark_metrics()

Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
INFO:sagemaker.jumpstart:Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
Using model 'meta-textgeneration-llama-3-70b' with wildcard version identifier '*'. You can pin to version '2.2.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.
No instance type selected for inference hosting endpoint. Defaulting to ml.p4d.24xlarge.
INFO:sagemaker.jumpstart:No instance type selected for inference hosting endpoint. Defaulting to ml.p4d.24xlarge.
ModelBuilder: INFO:     JumpStart ID meta-textgeneration-llama-3-70b is packaged with Image URI: 763104351884.dkr

Instance rate metrics will be omitted. Reason: User: arn:aws:sts::057716757052:assumed-role/new-sagemaker-studio-gonsoomoon/SageMaker is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action


| Instance Type             | Config Name   |   Concurrent Users |   Latency, TTFT (P50 in sec) |   Throughput (P50 in tokens/sec/user) |
|:--------------------------|:--------------|-------------------:|-----------------------------:|--------------------------------------:|
| ml.g5.48xlarge            | lmi           |                  1 |                         2.02 |                                 18.80 |
| ml.g5.48xlarge            | lmi           |                  2 |                         2.10 |                                 15.40 |
| ml.g5.48xlarge            | lmi           |                  4 |                         2.15 |                                  9.40 |
| ml.g5.48xlarge            | lmi           |                  8 |                         2.93 |                                  6.70 |
| ml.p4d.24xlarge           | lmi           |                 64 |                         0.20 |                                  9.70 |
| ml.p4d.24xlarge           | lmi 
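
The throughput column is reported per user, so a rough aggregate figure can be obtained by multiplying it by the number of concurrent users. The sketch below does this for the complete rows shown above; treating the product of P50 values as the endpoint's aggregate throughput is an approximation made for illustration, not a metric reported by JumpStart.

```python
# (instance type, config name, concurrent users, TTFT P50 in sec, tokens/sec/user P50)
benchmark_rows = [
    ("ml.g5.48xlarge", "lmi", 1, 2.02, 18.80),
    ("ml.g5.48xlarge", "lmi", 2, 2.10, 15.40),
    ("ml.g5.48xlarge", "lmi", 4, 2.15, 9.40),
    ("ml.g5.48xlarge", "lmi", 8, 2.93, 6.70),
    ("ml.p4d.24xlarge", "lmi", 64, 0.20, 9.70),
]


def aggregate_throughput(concurrent_users, tokens_per_sec_per_user):
    """Rough aggregate tokens/sec across all concurrent users."""
    return concurrent_users * tokens_per_sec_per_user


for instance, config, users, ttft, per_user in benchmark_rows:
    total = aggregate_throughput(users, per_user)
    print(f"{instance:16s} {config:4s} users={users:3d} "
          f"ttft={ttft:.2f}s aggregate~{total:.1f} tokens/sec")
```

Read this way, per-user throughput falls as concurrency rises on the same instance, while the estimated aggregate throughput keeps growing.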

Now, let's pick the `lmi-optimized` pre-optimized configuration and deploy it to an `ml.p4d.24xlarge` instance. The `lmi-optimized` configuration enables speculative decoding with a SageMaker-provided draft model, so you don't have to supply a draft model yourself.

In [8]:
model_builder.set_deployment_config(config_name="lmi-optimized", instance_type=gpu_instance_type)

In [9]:
gpu_instance_type

'ml.p4d.24xlarge'

The currently selected deployment configuration can be inspected using the `get_deployment_config` method:

In [10]:
model_builder.get_deployment_config()

Instance rate metrics will be omitted. Reason: User: arn:aws:sts::057716757052:assumed-role/new-sagemaker-studio-gonsoomoon/SageMaker is not authorized to perform: pricing:GetProducts because no identity-based policy allows the pricing:GetProducts action


{'DeploymentConfigName': 'lmi-optimized',
 'DeploymentArgs': {'ImageUri': '763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124',
  'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-east-1/meta-textgeneration/meta-textgeneration-llama-3-70b/artifacts/inference-prepack/v1.1.0/',
    'S3DataType': 'S3Prefix',
    'CompressionType': 'None'}},
  'ModelPackageArn': None,
  'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
   'ENDPOINT_SERVER_TIMEOUT': '3600',
   'MODEL_CACHE_ROOT': '/opt/ml/model',
   'SAGEMAKER_ENV': '1',
   'HF_MODEL_ID': '/opt/ml/model',
   'OPTION_SPECULATIVE_DRAFT_MODEL': '/opt/ml/additional-model-data-sources/draft_model',
   'SAGEMAKER_MODEL_SERVER_WORKERS': '1',
   'OPTION_GPU_MEMORY_UTILIZATION': '0.65'},
  'InstanceType': 'ml.p4d.24xlarge',
  'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 589824,
   'NumberOfAcceleratorDevicesRequired': 8},
  'ModelDataDownloadTimeout': 1200,
  'ContainerStartup
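
The returned dictionary can also be inspected programmatically, for example to confirm that a draft model is configured before deploying. The sketch below works on a hand-copied subset of the fields visible in the output above (the output is truncated, so the remaining fields are left out); `summarize_config` is an illustrative helper, not an SDK function.

```python
# Subset of the deployment configuration shown above, copied by hand
deployment_config = {
    "DeploymentConfigName": "lmi-optimized",
    "DeploymentArgs": {
        "Environment": {
            "OPTION_SPECULATIVE_DRAFT_MODEL": "/opt/ml/additional-model-data-sources/draft_model",
            "OPTION_GPU_MEMORY_UTILIZATION": "0.65",
        },
        "InstanceType": "ml.p4d.24xlarge",
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 589824,
            "NumberOfAcceleratorDevicesRequired": 8,
        },
    },
}


def summarize_config(config):
    """Extract the fields most relevant to speculative decoding and sizing."""
    args = config["DeploymentArgs"]
    env = args.get("Environment", {})
    requirements = args.get("ComputeResourceRequirements", {})
    return {
        "name": config["DeploymentConfigName"],
        "instance_type": args.get("InstanceType"),
        "uses_draft_model": "OPTION_SPECULATIVE_DRAFT_MODEL" in env,
        "gpu_memory_utilization": float(env.get("OPTION_GPU_MEMORY_UTILIZATION", "nan")),
        "min_memory_gib": requirements.get("MinMemoryRequiredInMb", 0) / 1024,
        "accelerators": requirements.get("NumberOfAcceleratorDevicesRequired"),
    }


for key, value in summarize_config(deployment_config).items():
    print(f"{key}: {value}")
```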

Now, let's build the `Model` instance and use it to deploy the selected optimized configuration. This operation may take a few minutes.

In [11]:
optimized_model = model_builder.build()

ModelBuilder: INFO:     Either inference spec or model is provided. ModelBuilder is not handling MLflow model input
Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
INFO:sagemaker.jumpstart:Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
Using model 'meta-textgeneration-llama-3-70b' with wildcard version identifier '*'. You can pin to version '2.2.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.
ModelBuilder: INFO:     JumpStart Model ID detected.
ModelBuilder: INFO:     JumpStart Model ID detected.


In [12]:
optimized_model

<sagemaker.jumpstart.model.JumpStartModel at 0x7fe62015f0a0>

In [13]:
predictor = optimized_model.deploy(accept_eula=True)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
INFO:sagemaker.jumpstart:Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
Using model 'meta-textgeneration-llama-3-70b' with wildcard version identifier '*'. You can pin to version '2.2.0' for more stable results. Note that models may have di

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [14]:
predictor.predict(sample_input)

{'generated_text': " and I'm here to help you with your questions about the world of technology. Today, I want to talk about the importance of having a strong online presence for your business. In today's digital age, having a website is no longer a luxury, but a necessity. It's the first place potential customers will go to learn more about your business, and it's essential to make a good first impression.\nBut having a website is just the first step. You also need to make sure that your website is optimized for search engines, so that it can be easily found by potential customers. This is where search engine optimization (SEO) comes in"}
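
Note that the endpoint returned a single dictionary here, whereas the `sample_output` passed to `SchemaBuilder` was a list containing one dictionary. A minimal sketch of a helper that tolerates both shapes is shown below; `extract_generated_text` is an illustrative name, and the assumption that only these two shapes occur comes from this notebook alone.

```python
def extract_generated_text(response):
    """Return the generated text from a dict response or a one-element list response."""
    if isinstance(response, dict):
        return response["generated_text"]
    if isinstance(response, list) and response:
        return response[0]["generated_text"]
    raise ValueError(f"unexpected response shape: {type(response).__name__}")


prompt = "Hello, I'm a language model,"
as_dict = {"generated_text": " and I'm here to help you."}
as_list = [{"generated_text": " and I'm here to help you."}]

# The model returns only the continuation, so prepend the prompt for display
print(prompt + extract_generated_text(as_dict))
print(prompt + extract_generated_text(as_list))
```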

In [15]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting model with name: meta-textgeneration-llama-3-70b-2024-07-21-01-53-51-512


INFO:sagemaker:Deleting endpoint configuration with name: meta-textgeneration-llama-3-70b-2024-07-21-01-53-51-920
INFO:sagemaker:Deleting endpoint with name: meta-textgeneration-llama-3-70b-2024-07-21-01-53-51-920
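
Since these endpoints run on large instances, it helps to make sure cleanup happens even when a cell fails halfway. The following sketch wraps the two cleanup calls used above in a context manager. `cleanup_on_exit` is an illustrative helper, and `FakePredictor` is a stand-in used only to demonstrate the behavior without creating AWS resources.

```python
from contextlib import contextmanager


@contextmanager
def cleanup_on_exit(predictor):
    """Yield the predictor and always delete its model and endpoint afterwards."""
    try:
        yield predictor
    finally:
        predictor.delete_model()
        predictor.delete_endpoint(delete_endpoint_config=True)


class FakePredictor:
    """Stand-in that records calls instead of touching AWS."""

    def __init__(self):
        self.calls = []

    def predict(self, payload):
        self.calls.append("predict")
        return {"generated_text": " ..."}

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self, delete_endpoint_config=True):
        self.calls.append(f"delete_endpoint(delete_endpoint_config={delete_endpoint_config})")


fake = FakePredictor()
with cleanup_on_exit(fake) as predictor:
    predictor.predict({"inputs": "Hello"})
print(fake.calls)
```

With a real predictor, the same `with` block would wrap the `predict` calls so that the endpoint is removed whether or not they succeed.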


## 2. Customize the speculative decoding with an open-source draft model, then deploy the optimized model
In this section, instead of relying on a pre-optimized model, you will use the Amazon SageMaker optimization toolkit to enable speculative decoding on the `meta-textgeneration-llama-3-70b` JumpStart model. In this example, the draft model comes from the Hugging Face model hub. We use the `huggingface_hub` package to download the draft model artifacts locally and then upload them to Amazon S3; optionally, you can provide your Hugging Face model ID instead. The draft model used here is Meta-Llama-3-8B, so make sure you have been granted access to its artifacts on Hugging Face.
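
When choosing a draft model, it can help to reason about the trade-off between acceptance rate and draft cost. The sketch below implements the two closed-form estimates from the Leviathan et al. paper cited earlier, under that paper's simplifying assumption of a constant per-token acceptance rate `alpha`: the expected number of tokens produced per target pass, and the expected wall-time improvement for `gamma` draft tokens when one draft step costs `cost_ratio` times a target step. The numeric values in the example are hypothetical and were not measured for Meta-Llama-3-8B or the 70B target.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens generated per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # limit of the formula as alpha approaches 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_walltime_speedup(alpha, gamma, cost_ratio):
    """Expected wall-time improvement factor over target-only decoding."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1)


# Hypothetical values, for illustration only
for alpha in (0.6, 0.8):
    for gamma in (3, 5):
        speedup = expected_walltime_speedup(alpha, gamma, cost_ratio=0.1)
        print(f"alpha={alpha} gamma={gamma} "
              f"tokens/pass={expected_tokens_per_pass(alpha, gamma):.2f} "
              f"speedup~{speedup:.2f}x")
```

The estimate makes the trade-off explicit: a better-aligned draft model raises `alpha`, while a larger draft model raises `cost_ratio`, and a speedup only materializes when the first effect outweighs the second.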

In [16]:
from huggingface_hub import notebook_login

# Log in to Hugging Face
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [17]:
custom_draft_model_id="meta-llama/Meta-Llama-3-8B"

hf_local_download_dir = Path.cwd() / "model_repo"
hf_local_download_dir.mkdir(exist_ok=True)

huggingface_hub.snapshot_download(
    repo_id=custom_draft_model_id,
    revision="main",
    local_dir=hf_local_download_dir,
    local_dir_use_symlinks=False,
)

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]

'/home/sagemaker-user/sagemaker-genai-hosting-examples-1/Llama3/llama3-70b/model_repo'

In [18]:
hf_local_download_dir.as_posix()

'/home/sagemaker-user/sagemaker-genai-hosting-examples-1/Llama3/llama3-70b/model_repo'

In [19]:
! ls -al '/home/sagemaker-user/sagemaker-genai-hosting-examples-1/Llama3/llama3-70b/model_repo'

total 15693168
drwxr-xr-x 4 sagemaker-user users       4096 Jul 20 14:31 .
drwxr-xr-x 3 sagemaker-user users        105 Jul 20 14:30 ..
drwxr-xr-x 3 sagemaker-user users         25 Jul 20 14:28 .cache
-rw-r--r-- 1 sagemaker-user users       1519 Jul 20 14:28 .gitattributes
-rw-r--r-- 1 sagemaker-user users       7801 Jul 20 14:28 LICENSE
-rw-r--r-- 1 sagemaker-user users      36547 Jul 20 14:28 README.md
-rw-r--r-- 1 sagemaker-user users       4696 Jul 20 14:28 USE_POLICY.md
-rw-r--r-- 1 sagemaker-user users        654 Jul 20 14:28 config.json
-rw-r--r-- 1 sagemaker-user users        177 Jul 20 14:28 generation_config.json
-rw-r--r-- 1 sagemaker-user users 4976698672 Jul 20 14:30 model-00001-of-00004.safetensors
-rw-r--r-- 1 sagemaker-user users 4999802720 Jul 20 14:30 model-00002-of-00004.safetensors
-rw-r--r-- 1 sagemaker-user users 4915916176 Jul 20 14:30 model-00003-of-00004.safetensors
-rw-r--r-- 1 sagemaker-user users 1168138808 Jul 20 14:29 model-00004-of-00004.safetensors
-rw-r

In [20]:
custom_draft_model_uri = sagemaker_session.upload_data(
    path=hf_local_download_dir.as_posix(),
    bucket=artifacts_bucket_name,
    key_prefix="spec-dec-custom-draft-model",
)

In [21]:
# The trailing slash makes the URI point to the prefix holding the uncompressed model artifacts
draft_uri = custom_draft_model_uri + "/"
draft_uri

's3://sagemaker-us-east-1-057716757052/spec-dec-custom-draft-model/'
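
Appending the slash by hand is easy to get wrong, for example by adding it twice. A small, idempotent helper is sketched below; the function names and the example bucket are made up for illustration.

```python
from urllib.parse import urlparse


def as_s3_prefix(uri):
    """Return the S3 URI with exactly one trailing slash, validating the scheme."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return uri.rstrip("/") + "/"


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


# 'example-bucket' is a placeholder, not a bucket used by this notebook
uri = as_s3_prefix("s3://example-bucket/spec-dec-custom-draft-model")
print(uri)
print(split_s3_uri(uri))
```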

In [22]:
!aws s3 ls {draft_uri} #verify model artifacts

                           PRE .cache/
                           PRE original/
2024-07-21 02:01:59       1519 .gitattributes
2024-07-21 02:01:59       7801 LICENSE
2024-07-21 02:01:59      36547 README.md
2024-07-21 02:01:59       4696 USE_POLICY.md
2024-07-21 02:01:59        654 config.json
2024-07-21 02:01:59        177 generation_config.json
2024-07-21 02:03:35 4976698672 model-00001-of-00004.safetensors
2024-07-21 02:02:53 4999802720 model-00002-of-00004.safetensors
2024-07-21 02:02:11 4915916176 model-00003-of-00004.safetensors
2024-07-21 02:02:00 1168138808 model-00004-of-00004.safetensors
2024-07-21 02:01:59      23950 model.safetensors.index.json
2024-07-21 02:01:59         73 special_tokens_map.json
2024-07-21 02:01:59    9085698 tokenizer.json
2024-07-21 02:01:59      50566 tokenizer_config.json


In [23]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

The optimization operation may take a few minutes.

In [24]:
gpu_instance_type

'ml.p4d.24xlarge'

In [25]:
optimized_model = model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    speculative_decoding_config={
        "ModelSource": draft_uri
    },
)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
INFO:sagemaker.jumpstart:Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
Using model 'meta-textgeneration-llama-3-70b' with wildcard version identifier '*'. You can pin to version '2.2.0' for more stable results. Note that models may have di

Now let's deploy the optimized model, which has speculative decoding enabled, to an Amazon SageMaker endpoint. This operation may take a few minutes.

In [26]:
predictor = optimized_model.deploy(accept_eula=True)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
INFO:sagemaker:Creating model with name: meta-textgeneration-llama-3-70b-2024-07-21-02-06-38-904
INFO:sagemaker:Creating endpoint-config with name meta-textgeneration-llama-3-70b-2024-07-21-02-06-39-418
INFO:sagemaker:Creating endpoint with name meta-textgeneration-llama-3-70b-2024-07-21-02-06-39-418
INFO:sagemaker:CUDA compat package requires Nvidia driver ⩽550.90.07
INFO:sagemaker:Current installed Nvidia driver version is 535.183.01
INFO:sagemaker:Setup CUDA compatibility libs path to LD_LIBRARY_PATH
INFO:sagemaker:/usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
INFO:sagemaker:[INFO ] Model

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [27]:
predictor.predict(sample_input)

{'generated_text': " and I'm here to help you with your questions about the world of technology. Today, I want to talk about the importance of having a strong online presence for your business. In today's digital age, having a website is no longer a luxury, but a necessity. It's the first place potential customers will go to learn more about your business, and it's essential to make a good first impression.\nBut having a website is just the first step. You also need to make sure that your website is optimized for search engines, so that it can be easily found by potential customers. This is where search engine optimization (SEO) comes in"}

In [28]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting model with name: meta-textgeneration-llama-3-70b-2024-07-21-02-06-38-904


INFO:sagemaker:Deleting endpoint configuration with name: meta-textgeneration-llama-3-70b-2024-07-21-02-06-39-418
INFO:sagemaker:Deleting endpoint with name: meta-textgeneration-llama-3-70b-2024-07-21-02-06-39-418


## 3. Run optimization job to quantize the model using AWQ, then deploy the quantized model
In this section, you will quantize the `meta-textgeneration-llama-3-70b` JumpStart model with the AWQ quantization algorithm by running an Amazon SageMaker optimization job. 

### What is quantization?
In our particular context, quantization means casting the weights of a pre-trained LLM to a data type with a lower number of bits and therefore a smaller memory footprint. The benefits of LLM quantization include:
* Reduced hardware requirements for model serving: A quantized model can be served using less expensive and more available GPUs or even made accessible on consumer devices or mobile platforms.
* Increased space for the KV cache to enable larger batch sizes and/or sequence lengths.
* Lower decoding latency. Because the decoding process is memory-bandwidth bound, moving less data thanks to smaller weights directly reduces decoding latency, unless the gain is offset by dequantization overhead.
* A higher compute-to-memory access ratio (through reduced data movement), known as arithmetic intensity. This allows for fuller utilization of available compute resources during decoding.  
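
The first benefit can be made concrete with back-of-the-envelope arithmetic. The sketch below estimates the weight-only footprint at 16 and 4 bits per weight and compares it with the combined GPU memory of an `ml.g5.12xlarge` (four 24 GB A10G GPUs), the instance used for deployment later in this section. The parameter count is rounded to 70 billion, and the estimate ignores the KV cache, activations, and the scales and zero-points stored alongside quantized weights.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


APPROX_NUM_PARAMS = 70e9           # rounded parameter count, an approximation
G5_12XLARGE_GPU_MEMORY_GB = 4 * 24  # four A10G GPUs with 24 GB each

for label, bits in [("FP16", 16), ("INT4", 4)]:
    size_gb = weight_footprint_gb(APPROX_NUM_PARAMS, bits)
    fits = size_gb < G5_12XLARGE_GPU_MEMORY_GB
    print(f"{label}: ~{size_gb:.0f} GB of weights, "
          f"fits in {G5_12XLARGE_GPU_MEMORY_GB} GB of GPU memory: {fits}")
```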

AWQ (Activation-aware Weight Quantization) is a post-training, weight-only quantization algorithm introduced by [J. Lin et al. (MLSys 2024)](https://arxiv.org/abs/2306.00978) that makes it possible to quantize LLMs to low-bit integer types such as INT4 with virtually no loss in model accuracy.
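
To give a feel for what mapping weights to INT4 means, here is a minimal sketch of plain round-to-nearest quantization of one small group of weights. This is not AWQ: AWQ additionally uses activation statistics to rescale the most salient weight channels before quantizing, which this toy omits. The weight values and helper names are invented for illustration.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    levels = 2 ** bits - 1          # 15 steps for 4 bits, codes 0..15
    low, high = min(weights), max(weights)
    scale = (high - low) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [round((w - low) / scale) for w in weights]
    return codes, scale, low


def dequantize_group(codes, scale, low):
    """Map integer codes back to approximate floating-point weights."""
    return [code * scale + low for code in codes]


group = [-0.8, -0.1, 0.0, 0.3, 0.7]  # made-up weights
codes, scale, low = quantize_group(group)
restored = dequantize_group(codes, scale, low)
max_error = max(abs(w - r) for w, r in zip(group, restored))

print("codes:", codes)
print("max reconstruction error:", max_error, "(at most half a step:", scale / 2, ")")
```

Each weight is stored as a 4-bit code plus a shared scale and offset per group, and the reconstruction error of any weight is bounded by half a quantization step.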

In [29]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    log_level=logging.ERROR
)

Quantizing the model is as easy as supplying the following inputs:
* The location of the unquantized model artifacts, here the Amazon SageMaker JumpStart model ID.
* The quantization configuration.
* The Amazon S3 URI where the output quantized artifacts need to be stored.

Everything else (compute provisioning and configuration, for example) is managed by SageMaker. This operation takes around 120 minutes.

In [30]:
gpu_instance_type

'ml.p4d.24xlarge'

In [31]:
optimized_model = model_builder.optimize(
    instance_type=gpu_instance_type,
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq",
        },
    },
    output_path=f"s3://{artifacts_bucket_name}/awq-quantization/",
)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
INFO:sagemaker.jumpstart:Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
Using model 'meta-textgeneration-llama-3-70b' with wildcard version identifier '*'. You can pin to version '2.2.0' for more stable results. Note that models may have di

ModelBuilder: DEBUG:     ModelBuilder metrics emitted.





Now let's deploy the quantized model to an Amazon SageMaker endpoint. This operation may take a few minutes.

In [32]:
quantized_instance_type = "ml.g5.12xlarge"  # We can use a smaller instance type once quantized
predictor = optimized_model.deploy(instance_type=quantized_instance_type, accept_eula=True)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
INFO:sagemaker:Creating model with name: meta-textgeneration-llama-3-70b-2024-07-21-02-15-50-465
INFO:sagemaker:Creating endpoint-config with name meta-textgeneration-llama-3-70b-2024-07-21-04-12-18-616
INFO:sagemaker:Creating endpoint with name meta-textgeneration-llama-3-70b-2024-07-21-04-12-18-616
INFO:sagemaker:CUDA compat package requires Nvidia driver ⩽550.90.07
INFO:sagemaker:Current installed Nvidia driver version is 535.183.01
INFO:sagemaker:Setup CUDA compatibility libs path to LD_LIBRARY_PATH
INFO:sagemaker:/usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
INFO:sagemaker:[INFO ] Model

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [33]:
predictor.predict(sample_input)

{'generated_text': " and I'm here to help you with your English language needs. Whether you're a student, a professional, or just someone who wants to improve their English skills, I can provide you with the guidance and support you need to succeed.\nI can help you with a wide range of language-related tasks, including:\nGrammar: I can help you understand the rules of English grammar and how to use them correctly in your writing and speaking.\nVocabulary: I can help you expand your vocabulary and learn new words and phrases that will make your writing and speaking more effective.\nPronunciation: I can help you improve your pronunciation of English words and phrases"}

In [34]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting model with name: meta-textgeneration-llama-3-70b-2024-07-21-02-15-50-465
INFO:sagemaker:Deleting endpoint configuration with name: meta-textgeneration-llama-3-70b-2024-07-21-04-12-18-616
INFO:sagemaker:Deleting endpoint with name: meta-textgeneration-llama-3-70b-2024-07-21-04-12-18-616


## 4. Run optimization job to compile the model using the AWS Neuron Compiler, then deploy the compiled model to an Inferentia 2 endpoint
In this section, you will use an Amazon SageMaker optimization job as a managed model compiler to compile the `meta-textgeneration-llama-3-70b` JumpStart model for [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) hardware. The optimization job lets you decouple compilation from deployment: the model is compiled once, and the compiled artifacts can be deployed many times. In other words, the compilation overhead is paid once instead of occurring upon each deployment.

In [35]:
model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
)

Now let's compile the model for Inferentia 2. This operation takes around 40 minutes.

In [36]:
optimized_model = model_builder.optimize(
    instance_type=neuron_instance_type,
    accept_eula=True,
    compilation_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "24",
            "OPTION_N_POSITIONS": "8192",
            "OPTION_DTYPE": "fp16",
            "OPTION_ROLLING_BATCH": "auto",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
            "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
        }
    },
    output_path=f"s3://{artifacts_bucket_name}/neuron/",
)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
INFO:sagemaker.jumpstart:Model 'meta-textgeneration-llama-3-70b' requires accepting end-user license agreement (EULA). See https://jumpstart-cache-prod-us-east-1.s3.us-east-1.amazonaws.com/fmhMetadata/eula/llama3Eula.txt for terms of use.
Using model 'meta-textgeneration-llama-3-70b' with wildcard version identifier '*'. You can pin to version '2.2.0' for more stable results. Note that models may have di

ModelBuilder: DEBUG:     ModelBuilder metrics emitted.
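
Because a failed compilation costs a long wait, a quick sanity check of the environment overrides before submitting the job can save time. The sketch below checks the settings used above against the 24 NeuronCores of an `ml.inf2.48xlarge` (12 Inferentia 2 chips with 2 NeuronCores each). The specific checks are assumptions chosen for illustration, not constraints taken from the LMI documentation.

```python
INF2_48XLARGE_NEURON_CORES = 12 * 2  # 12 Inferentia 2 chips x 2 NeuronCores each

compilation_env = {
    "OPTION_TENSOR_PARALLEL_DEGREE": "24",
    "OPTION_N_POSITIONS": "8192",
    "OPTION_DTYPE": "fp16",
    "OPTION_ROLLING_BATCH": "auto",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
    "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
}


def check_compilation_env(env, available_cores):
    """Return a list of problems found in the overrides (empty when none)."""
    problems = []
    # Environment variables are strings, so every override should be a string
    for key, value in env.items():
        if not isinstance(value, str):
            problems.append(f"{key} should be a string, got {type(value).__name__}")
    tensor_parallel = int(env["OPTION_TENSOR_PARALLEL_DEGREE"])
    if tensor_parallel > available_cores:
        problems.append(
            f"tensor parallel degree {tensor_parallel} exceeds {available_cores} cores"
        )
    if int(env["OPTION_MAX_ROLLING_BATCH_SIZE"]) < 1:
        problems.append("max rolling batch size must be at least 1")
    if int(env["OPTION_N_POSITIONS"]) < 1:
        problems.append("n_positions must be at least 1")
    return problems


issues = check_compilation_env(compilation_env, INF2_48XLARGE_NEURON_CORES)
print("no problems found" if not issues else issues)
```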





Now let's deploy the compiled model to an Amazon SageMaker endpoint powered by Inferentia 2 hardware. This operation may take a few minutes.

In [37]:
predictor = optimized_model.deploy(accept_eula=True, model_data_download_timeout=3600, volume_size=512)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
INFO:sagemaker:Creating model with name: meta-textgeneration-llama-3-70b-2024-07-21-04-23-19-484
INFO:sagemaker:Creating endpoint-config with name meta-textgeneration-llama-3-70b-2024-07-21-05-03-18-831
INFO:sagemaker:Creating endpoint with name meta-textgeneration-llama-3-70b-2024-07-21-05-03-18-831
INFO:sagemaker:[INFO ] ModelServer - Starting model server ...
INFO:sagemaker:[INFO ] ModelServer - Starting djl-serving: 0.28.0 ...
INFO:sagemaker:[INFO ] ModelServer - 
INFO:sagemaker:Model server home: /opt/djl
INFO:sagemaker:Current directory: /opt/djl
INFO:sagemaker:Temp directory: /tmp
INFO:sagemaker:Command lin

Once the deployment has finished successfully, you can send queries to the model by simply using the predictor's `predict` method:

In [38]:
predictor.predict(sample_input)

{'generated_text': " and I'm here to help you with your questions about the world of technology. Today, I want to talk about the importance of having a strong online presence for your business. In today's digital age, having a website is no longer a luxury, but a necessity. It's the first place potential customers will go to learn more about your business, and it's essential to make a good first impression.\nBut having a website is just the first step. You also need to make sure that your website is optimized for search engines, so that it can be easily found by potential customers. This is where search engine optimization (SEO) comes in"}

In [39]:
# Clean up
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

INFO:sagemaker:Deleting model with name: meta-textgeneration-llama-3-70b-2024-07-21-04-23-19-484
INFO:sagemaker:Deleting endpoint configuration with name: meta-textgeneration-llama-3-70b-2024-07-21-05-03-18-831
INFO:sagemaker:Deleting endpoint with name: meta-textgeneration-llama-3-70b-2024-07-21-05-03-18-831
