# Llama 2 Hugging Face Hub - ModelBuilder

This notebook was tested with the `conda_python3` kernel on an Amazon SageMaker notebook instance of type `g5.2xl` with a 100 GB EBS volume attached. If you are not using local mode, you can use a smaller CPU instance type and EBS volume.

In [5]:
!pip install boto3 sagemaker -U --quiet

In [6]:
! pip list | grep sagemaker

sagemaker                     2.228.0
sagemaker_pyspark             1.4.5


In [7]:
! pip install python-dotenv==1.0.1



# SageMaker Model Builder experience

In the new experience, we have introduced a few new constructs. Here we will focus on the following: 

1. ModelBuilder
2. SchemaBuilder
3. InferenceSpec

In the following section, we will define these constructs and provide examples to elaborate on each one.

4.1 ModelBuilder:

ModelBuilder is a Python class that takes a framework model (such as XGBoost or PyTorch) or an InferenceSpec (more details below) and converts it into a SageMaker deployable model. ModelBuilder provides a `build` function that generates the artifacts for deployment. The generated model artifact is specific to the model server, which you can also customize through one of the inputs.

Class definition:

```python
class ModelBuilder(
    model_path: str | None = '/tmp/sagemaker/model-builder/' + uuid.uuid1().hex,
    role_arn: str | None = None,
    sagemaker_session: Session | None = None,
    name: str | None = 'model-name-' + uuid.uuid1().hex,
    mode: Mode | None = Mode.SAGEMAKER_ENDPOINT,
    shared_libs: List[str] = lambda : [],
    dependencies: Dict[str, Any] | None = lambda : { "auto": False },
    env_vars: Dict[str, str] | None = lambda : {},
    log_level: int | None = logging.DEBUG,
    content_type: str | None = None,
    accept_type: str | None = None,
    s3_model_data_url: str | None = None,
    instance_type: str | None = "ml.c5.xlarge",
    schema_builder: SchemaBuilder | None = None,
    model: Any | None = None,
    inference_spec: InferenceSpec = None,
    image_uri: str | None = None,
    model_server: str | None = None
)
```
Example:

The class definition above lists all the options for customization. However, to deploy a framework model, ModelBuilder only needs the model, a sample input and output, and the role.

```python
model_builder = ModelBuilder(
    model=model,  # Pass in the actual model object. Its "predict" method will be invoked in the endpoint.
    schema_builder=SchemaBuilder(input, output), # Pass in a "SchemaBuilder" which will use the sample test input and output objects to infer the serialization needed.
    role_arn=role, # Pass in the role arn or update intelligent defaults.
    )
```

4.2 SchemaBuilder:

The SchemaBuilder enables you to define the input and output for your endpoint. From the sample input and output objects you provide, it generates the corresponding marshalling functions for serializing and deserializing the payloads. For further details, please consult the notebook or refer to the video.

Class definition:
```python
class SchemaBuilder(
    sample_input: Any,
    sample_output: Any,
    input_translator: CustomPayloadTranslator = None,
    output_translator: CustomPayloadTranslator = None
)
```
Example:

The CustomPayloadTranslator class provides all the options for customization. However, for [common inference data formats](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html), you can just provide the sample input and output to the SchemaBuilder.
```python
input = "How is the demo going?"
output = "Comment la démo va-t-elle?"
schema = SchemaBuilder(input, output)
```
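To make the idea of inferring marshalling functions from a sample more concrete, here is a minimal standard-library sketch. It is not the SDK's implementation: `infer_json_marshallers` is a hypothetical helper, and it assumes that only JSON-compatible samples need to be handled.

```python
import json
from typing import Any, Callable, Tuple


def infer_json_marshallers(
    sample: Any,
) -> Tuple[Callable[[Any], bytes], Callable[[bytes], Any]]:
    """Hypothetical helper: choose a serializer/deserializer pair from a sample."""
    if isinstance(sample, (str, dict, list, int, float, bool)):
        # JSON-compatible samples round-trip through UTF-8 encoded JSON.
        def serialize(obj: Any) -> bytes:
            return json.dumps(obj, ensure_ascii=False).encode("utf-8")

        def deserialize(payload: bytes) -> Any:
            return json.loads(payload.decode("utf-8"))

        return serialize, deserialize
    # Anything else would need a custom translator in this sketch.
    raise TypeError(f"No marshaller sketched for type {type(sample).__name__}")


sample_text = "Comment la démo va-t-elle?"
serialize, deserialize = infer_json_marshallers(sample_text)
wire_bytes = serialize(sample_text)      # what would travel over the wire
round_tripped = deserialize(wire_bytes)  # what the receiving side reconstructs
```

The point of the sketch is only that the sample's type drives the choice of functions; samples that fall outside the supported types are where a custom payload translator would come in.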

4.3 InferenceSpec

If you want to specify custom functions to load and invoke the model instead of relying on the framework's default functions, you can pass an InferenceSpec with your own implementation of the `load` and `invoke` functions.

Class definition:
```python
class InferenceSpec(abc.ABC):
    @abc.abstractmethod
    def load(self, model_dir: str):
        pass

    @abc.abstractmethod
    def invoke(self, input_object: object, model: object):
        pass
```
Example:
```python
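from sagemaker.serve.spec.inference_spec import InferenceSpec
from transformers import pipeline


class MyInferenceSpec(InferenceSpec):
    def load(self, model_dir: str):
        # Build the Hugging Face translation pipeline that serves requests.
        return pipeline("translation_en_to_fr", model="t5-small")

    def invoke(self, input, model):
        # "model" is the object returned by load().
        return model(input)


inf_spec = MyInferenceSpec()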
```
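The `load`/`invoke` contract can be exercised without any SDK or model download. The following is a minimal sketch under stated assumptions: `InferenceSpecSketch` is a stand-in that mirrors the two abstract methods shown above, and `UppercaseSpec` is a hypothetical toy implementation whose "model" is a plain callable.

```python
import abc


class InferenceSpecSketch(abc.ABC):
    """Stand-in with the same two abstract methods as the class definition above."""

    @abc.abstractmethod
    def load(self, model_dir: str):
        ...

    @abc.abstractmethod
    def invoke(self, input_object: object, model: object):
        ...


class UppercaseSpec(InferenceSpecSketch):
    """Toy spec: the 'model' is just the str.upper function."""

    def load(self, model_dir: str):
        # A real spec would read artifacts from model_dir; this toy ignores it.
        return str.upper

    def invoke(self, input_object: object, model: object):
        # Apply the loaded model to the request payload.
        return model(input_object)


spec = UppercaseSpec()
toy_model = spec.load("/tmp/unused-model-dir")  # returns the model object
result = spec.invoke("hello", toy_model)        # passes that object back in
```

Keeping `load` and `invoke` separate means the object returned by `load` can be reused across calls to `invoke`, which is the reason the model is passed in as an argument rather than rebuilt inside `invoke`.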

In this example, we are using ModelBuilder to deploy a Llama 2 model directly. You can use `Mode` to switch between local testing and deploying to a SageMaker Endpoint. 

### SageMaker ModelBuilder: Local deployment

Now we will use the SageMaker ModelBuilder class to prepare the model for local and remote deployment.

In [8]:
from sagemaker import get_execution_role, Session, image_uris
import boto3
import os
sagemaker_session = Session()
region = boto3.Session().region_name

# Get the execution role.
# On a notebook instance the attached role is used; otherwise replace the placeholder with your role ARN.
try:
    execution_role = get_execution_role()
except ValueError:  # raised when the current AWS identity is not a role
    execution_role = "your-role-arn"

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/SageMaker/.xdg/config/sagemaker/config.yaml


In [9]:
path = os.path.join(os.getcwd(), 'llama2-7b') 
os.makedirs(path, exist_ok=True)
print(path)

/home/ec2-user/SageMaker/sagemaker-hosting/SageMaker-Model-Builder/foundation-models/llama2-7b


[Llama 2 is a gated model](https://huggingface.co/meta-llama/Llama-2-7b-hf) and requires access to be approved. Once approved, you will need to pass your [Hugging Face access token](https://huggingface.co/docs/hub/security-tokens) as shown in the cell below. 

In [10]:
import os

def set_hf_key_env_vars(hf_key_name, key_val):
    os.environ[hf_key_name] = key_val

def get_hf_key_env_vars(hf_key_name):
    HF_key_value = os.environ.get(hf_key_name)

    return HF_key_value


is_sagemaker_notebook = True
# is_sagemaker_notebook = False

if is_sagemaker_notebook:
    hf_key_name = "HF_KEY"
    key_val = "<Type Your HF Key>"  # Paste your token here; never commit a real token.
    set_hf_key_env_vars(hf_key_name, key_val)
    HF_TOKEN = get_hf_key_env_vars(hf_key_name)
else: # VS Code
    from dotenv import load_dotenv
    load_dotenv()  # Load HF_TOKEN from a local .env file into the environment.
    HF_TOKEN = os.getenv('HF_TOKEN')


# Log in to HF
# !huggingface-cli login --token {HF_TOKEN}
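Because the access token is a secret, it helps to fail early when it is missing and to avoid printing it in full. The following is a minimal sketch using only the standard library; `read_hf_token` and `mask_token` are hypothetical helpers, and the `HF_TOKEN` variable name is an assumption you can change.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Hypothetical helper: return the token from the environment or fail loudly."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty.")
    return token


def mask_token(token: str, visible: int = 4) -> str:
    """Hypothetical helper: keep the first few characters and mask the rest."""
    if len(token) <= visible:
        return "*" * len(token)
    return token[:visible] + "*" * (len(token) - visible)


# Usage with a dummy value; a real token should come from outside the notebook.
os.environ["HF_TOKEN_SKETCH"] = "hf_example123"
masked = mask_token(read_hf_token("HF_TOKEN_SKETCH"))
print(masked)
```

Printing only the masked form makes it possible to confirm that a token was picked up without leaving the full value in the notebook output.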


In [11]:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.serve import Mode
import json

prompt = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the"
response = "The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the east coast."

sample_input = {
    "inputs": prompt,
    "parameters": {}
}

sample_output = [
    {
        "generated_text": response
    }
]

model_builder = ModelBuilder(
    model="meta-llama/Llama-2-7b-hf",
    # model = "meta-llama/Meta-Llama-3-8B",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    model_path=path, #local path where artifacts will be saved
    mode=Mode.LOCAL_CONTAINER, # The model will be deployed locally. Change to Mode.SAGEMAKER_ENDPOINT to deploy to a SageMaker endpoint. 
    env_vars={
        "HUGGING_FACE_HUB_TOKEN": HF_TOKEN # Llama 2 is a gated model and requires a Hugging Face Hub token. 
    }
)

By default, ModelBuilder uses the [TGI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers) container as the underlying container for Hugging Face models. If you would like to use the [LMI containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) instead, you can configure the ModelBuilder as follows:

```python
from sagemaker.serve import ModelServer

model_builder = ModelBuilder(
    model="meta-llama/Llama-2-7b-hf",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    model_path=path, #local path where artifacts will be saved
    mode=Mode.LOCAL_CONTAINER, # The model will be deployed locally. Change to Mode.SAGEMAKER_ENDPOINT to deploy to a SageMaker endpoint. 
    model_server=ModelServer.DJL_SERVING,
    env_vars={
        "HUGGING_FACE_HUB_TOKEN": "<YourHuggingFaceToken>" # Llama 2 is a gated model and requires a Hugging Face Hub token. 
    }
)
```

In [12]:
model = model_builder.build()

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
ModelBuilder: INFO:     Either inference spec or model is provided. ModelBuilder is not handling MLflow model input
ModelBuilder: INFO:     Local instance_type ml.g5.12xlarge detected. ml.g5.12xlarge will be default when deploying to a SageMaker Endpoint. This default can be overriden in model.deploy()
ModelBuilder: INFO:     CUDA enabled hardware on the device: ['NVIDIA A10G, 22723 MiB', 'NVIDIA A10G, 22723 MiB', 'NVIDIA A10G, 22723 MiB', 'NVIDIA A10G, 22723 MiB']
ModelBuilder: INFO:     Max GPU parallelism of 4 is allowed. Total attention heads 32
INFO:sagemaker.image_uris:Defaulting to only available Python version: py310
INFO:sagemaker.image_uris:Defaulting to only supported image scope: gpu.
ModelBuilder: INFO:     Auto detected 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0. Proceeding with the the deploymen

### Tune the model container locally to get the best configuration for deployment

A neat feature of ModelBuilder is the ability to tune the container parameters locally when you use the `LOCAL_CONTAINER` mode.

In [13]:
%%time
tuned_model = model.tune()

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
ModelBuilder: INFO:     CUDA enabled hardware on the device: ['NVIDIA A10G, 22723 MiB', 'NVIDIA A10G, 22723 MiB', 'NVIDIA A10G, 22723 MiB', 'NVIDIA A10G, 22723 MiB']
ModelBuilder: INFO:     Model can be sharded across [4, 2, 1] GPUs
ModelBuilder: INFO:     Trying num shard: 4, dtype: bfloat16...
ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using

CPU times: user 614 ms, sys: 207 ms, total: 821 ms
Wall time: 7min 10s
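The log reports 4 GPUs, 32 attention heads, and candidate shard counts of `[4, 2, 1]`. One plausible reading, and only an assumption about how the candidates are chosen, is that a shard count is admissible when it evenly divides the number of attention heads. The hypothetical helper below reproduces the list under that assumption.

```python
from typing import List


def candidate_shard_counts(num_gpus: int, num_attention_heads: int) -> List[int]:
    """Hypothetical helper: GPU counts, largest first, that evenly divide the heads."""
    return [
        n
        for n in range(num_gpus, 0, -1)
        if num_attention_heads % n == 0
    ]


# Values taken from the log output above: 4 GPUs and 32 attention heads.
candidates = candidate_shard_counts(num_gpus=4, num_attention_heads=32)
print(candidates)
```

Under this reading, 3 is skipped because 32 attention heads cannot be split evenly across 3 GPUs.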


In [15]:
local_predictor = tuned_model.deploy()

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
ModelBuilder: INFO:     Pulling image 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0 from repository...
ModelBuilder: DEBUG:     Stopping currently running container...
ModelBuilder: INFO:     Waiting for model server TGI to start up...
ModelBuilder: DEBUG:     2024-08-11T03:10:11.559651Z  INFO text_generation_launcher: Args {
ModelBuilder: DEBUG:         model_id: "meta-llama/Llama-2-7b-hf",
ModelBuilder: DEBUG:         revision: None,
ModelBuilder: DEBUG:         validation_workers: 2,
ModelBuilder: D

In [16]:
updated_sample_input = model_builder.schema_builder.sample_input
updated_sample_input

{'inputs': 'The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the',
 'parameters': {'max_new_tokens': 128}}
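The payload is a plain dictionary, so generation settings can be varied per request by editing the `parameters` entry. The sketch below uses a hypothetical helper, `with_generation_params`, that copies a payload before changing it; `temperature` is included on the assumption that the serving container accepts it.

```python
import copy
from typing import Any, Dict


def with_generation_params(payload: Dict[str, Any], **params: Any) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the payload with updated parameters."""
    updated = copy.deepcopy(payload)  # leave the caller's payload untouched
    updated.setdefault("parameters", {}).update(params)
    return updated


base_payload = {
    "inputs": "The diamondback terrapin or simply terrapin is a species of turtle",
    "parameters": {"max_new_tokens": 128},
}

# Shorter completion with a sampling temperature; base_payload is not modified.
custom_payload = with_generation_params(base_payload, max_new_tokens=64, temperature=0.7)
```

Copying first keeps the sample input stored on the schema builder unchanged, so later cells that reuse it behave the same way.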

In [17]:
%%time
print(local_predictor.predict(updated_sample_input))

[{'generated_text': ' eastern United States. The diamondback terrapin is the only species in the genus Malaclemys.\nThe diamondback terrapin is a medium-sized turtle, with a carapace length of 10–15 cm (4–6 in). The shell is oval in shape, with a high, domed carapace and a flattened plastron. The carapace is dark brown to black, with a yellowish-brown to orange-brown stripe running down the middle of each scute. The plastron is yellowish-'}]
CPU times: user 4.08 ms, sys: 0 ns, total: 4.08 ms
Wall time: 1.44 s


### SageMaker ModelBuilder: Deploy to a SageMaker Endpoint

Now that we have tested the model prediction locally, we can deploy the model to a SageMaker endpoint.

In [19]:
predictor = tuned_model.deploy(mode=Mode.SAGEMAKER_ENDPOINT, role=execution_role)

ModelBuilder: INFO:     ModelBuilder will collect telemetry to help us better understand our user's needs, diagnose issues, and deliver additional features. To opt out of telemetry, please disable via TelemetryOptOut in intelligent defaults. See https://sagemaker.readthedocs.io/en/stable/overview.html#configuring-and-using-defaults-with-the-sagemaker-python-sdk for more info.
ModelBuilder: DEBUG:     Uploading TGI Model Resources uncompressed to: s3://sagemaker-us-east-1-057716757052/huggingface-pytorch-tgi-inference-2024-08-11-03-10-46-380/code
INFO:sagemaker:Creating model with name: huggingface-pytorch-tgi-inference-2024-08-11-03-15-02-850
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-tgi-inference-2024-08-11-03-15-03-574
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-tgi-inference-2024-08-11-03-15-03-574


-----------!

ModelBuilder: DEBUG:     ModelBuilder metrics emitted.


In [23]:
updated_sample_input

{'inputs': 'The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the',
 'parameters': {'max_new_tokens': 128}}

In [20]:
%%time
print(predictor.predict(updated_sample_input)[0]["generated_text"])

The diamondback terrapin or simply terrapin is a species of turtle native to the brackish coastal tidal marshes of the eastern United States. The diamondback terrapin is the only species in the genus Malaclemys.
The diamondback terrapin is a medium-sized turtle, with a carapace length of 10–15 cm (4–6 in). The shell is oval in shape, with a high, rounded keel running down the middle of the back. The shell is dark brown to black, with a yellowish-orange stripe running down the middle of the back. The head is small and triangular in shape, with a pointed snout. The eyes are
CPU times: user 11.6 ms, sys: 1.66 ms, total: 13.2 ms
Wall time: 1.54 s


## Clean up

In [22]:
local_predictor.delete_predictor()
predictor.delete_model()
predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: huggingface-pytorch-tgi-inference-2024-08-11-03-15-02-850
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-pytorch-tgi-inference-2024-08-11-03-15-03-574
INFO:sagemaker:Deleting endpoint with name: huggingface-pytorch-tgi-inference-2024-08-11-03-15-03-574
