# Serve and stream tokens from LMSYS's `vicuna-7b-v1.3` hosted on Amazon SageMaker using LMI (Large Model Inference) DJL-based container

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

---

**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

This notebook focuses on deploying the [`lmsys/vicuna-7b-v1.3`](https://huggingface.co/lmsys/vicuna-7b-v1.3) HuggingFace model to a SageMaker Endpoint for a text generation task. In this example, you will use the SageMaker-managed [LMI (Large Model Inference)](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html) Docker image as the inference image. LMI images feature a [DJL Serving](https://github.com/deepjavalibrary/djl-serving) stack powered by the [Deep Java Library](https://djl.ai/).

Once the model has been deployed, you will submit text generation requests and get a streamed response in return using SageMaker's native response streaming capability.

In this notebook, we make extensive use of the higher-level abstractions provided by the [`sagemaker` Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html), to which we delegate the management of as many resources and as much configuration as we can. This demonstrates that deploying LLMs to SageMaker can be done simply and with a minimal amount of code.

You will deploy the `lmsys/vicuna-7b-v1.3` model twice in succession, using the HuggingFace Accelerate engine on an `ml.g5.2xlarge` GPU instance (1 device with 24 GiB of device memory):
* Once without writing any custom server-side Python handler script and therefore leveraging the fact that the default Python handlers of the LMI DLC natively support streaming for the HuggingFace Accelerate engine (among others),
* Once with a custom server-side Python handler script. 

Note that when using the default handlers, streaming cannot be disabled once the endpoint has been deployed with `option.enable_streaming` set to `true`, i.e. the endpoint can only be invoked using `sagemaker:InvokeEndpointWithResponseStream` (and not `sagemaker:InvokeEndpoint`). On the other hand, when implementing a custom handler script, we will be able to choose between streaming our responses or not on a per-request basis.

**Notices:**
* Make sure that the `ml.g5.2xlarge` instance type is available in your AWS Region.
* Make sure that the value of your "ml.g5.2xlarge for endpoint usage" Amazon SageMaker service quota allows you to deploy one Endpoint using this instance type.

### License agreement
* This model is not intended for commercial use, cf. model card for more information about licensing.
* This notebook is a sample notebook and not intended for production use.

### Execution environment setup
This notebook requires the following third-party Python dependencies:
* AWS [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html#)
* AWS [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/index.html). Since we use version 0.23.0 of the DJL LMI DLC, the minimum required SDK version is 2.173.0.
* HuggingFace [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index)

Let's install or upgrade these dependencies using the following commands:

In [None]:
!pip install pip --upgrade --quiet

In [None]:
# TODO: TO BE REMOVED
!pip install dependencies/botocore-*-py3-none-any.whl dependencies/boto3-*-py3-none-any.whl --force-reinstall --quiet

In [None]:
# TODO: TO BE REMOVED 
import boto3
assert boto3.__version__ == "1.26.157"

In [None]:
!pip install "sagemaker>=2.173.0" huggingface_hub --upgrade --quiet

### Imports & global variables assignment

In [None]:
import io
import json
import os
from pathlib import Path
import shutil
from typing import Any, Dict, List

import boto3
import huggingface_hub
import sagemaker

In [None]:
SM_DEFAULT_EXECUTION_ROLE_ARN = sagemaker.get_execution_role()
SM_SESSION = sagemaker.session.Session()
SM_ARTIFACT_BUCKET_NAME = SM_SESSION.default_bucket()

REGION_NAME = SM_SESSION._region_name
S3_CLIENT = boto3.client("s3", region_name=REGION_NAME)
# TODO: TO BE MODIFIED: sagemaker-runtime-demo -> sagemaker-runtime
SAGEMAKER_RUNTIME_CLIENT = boto3.client("sagemaker-runtime-demo", region_name=REGION_NAME)

In [None]:
HOME_DIR = os.environ["HOME"]

# HuggingFace local model storage
HF_LOCAL_CACHE_DIR = Path(HOME_DIR) / ".cache" / "huggingface" / "hub"
HF_LOCAL_DOWNLOAD_DIR = Path.cwd() / "model_repo"
HF_LOCAL_DOWNLOAD_DIR.mkdir(exist_ok=True)

# Inference code local storage
SOURCE_DIR = Path.cwd() / "code"
SOURCE_DIR.mkdir(exist_ok=True)

# Selected HuggingFace model
HF_HUB_MODEL_NAME = "lmsys/vicuna-7b-v1.3"
PROMPT_TEMPLATE = "USER: {prompt}\nAssistant:"

# HuggingFace remote model storage (Amazon S3)
HF_MODEL_KEY_PREFIX = f"hf-large-model-djl/{HF_HUB_MODEL_NAME}"

# Other global constants
DJL_VERSION = "0.23.0" # requires sagemaker>=2.173.0 
INSTANCE_TYPE = "ml.g5.2xlarge"

### Utilities

In [None]:
def list_s3_objects(bucket: str, key_prefix: str) -> List[Dict[str, Any]]:
    paginator = S3_CLIENT.get_paginator("list_objects")
    operation_parameters = {"Bucket": bucket, "Prefix": key_prefix}
    page_iterator = paginator.paginate(**operation_parameters)
    # A page has no "Contents" key when no object matches the prefix
    return [obj for page in page_iterator for obj in page.get("Contents", [])]


def delete_s3_objects(bucket: str, keys: List[str]) -> None:
    if not keys:
        return  # Nothing to delete
    S3_CLIENT.delete_objects(Bucket=bucket, Delete={"Objects": [{"Key": key} for key in keys]})


def get_local_model_cache_dir(hf_model_name: str) -> Path:
    # The HF cache stores each repository in a directory whose name ends with "<org>--<model>"
    for dir_name in os.listdir(HF_LOCAL_CACHE_DIR):
        if dir_name.endswith(hf_model_name.replace("/", "--")):
            break
    else:
        raise ValueError(f"Could not find HF local cache directory for model {hf_model_name}")
    return HF_LOCAL_CACHE_DIR / dir_name

In [None]:
from typing import Iterator


class StreamScanner:
    """
    A helper class for parsing the InvokeEndpointWithResponseStream event stream.

    The output of the model will be in the following format:
    ```
    b'{"outputs": [" a"]}\n'
    b'{"outputs": [" challenging"]}\n'
    b'{"outputs": [" problem"]}\n'
    ...
    ```

    While each PayloadPart event from the event stream usually contains a byte array
    with a full JSON object, this is not guaranteed and some of the JSON objects may be
    split across PayloadPart events. For example:
    ```
    {'PayloadPart': {'Bytes': b'{"outputs": '}}
    {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
    ```

    This class accounts for this by concatenating the bytes written via the 'write' method
    and by exposing a 'readlines' method which returns the complete lines (ending with a
    '\n' character) found in the buffer. It keeps track of the last read position to ensure
    that previous bytes are not exposed again.
    """

    def __init__(self) -> None:
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content: bytes) -> None:
        # Always append at the end of the buffer
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def readlines(self) -> Iterator[bytes]:
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Indexing bytes returns an int, so compare with endswith rather than line[-1].
            # Incomplete lines are left in the buffer until the rest of their bytes arrive.
            if line.endswith(b"\n"):
                self.read_pos += len(line)
                yield line[:-1]

    def reset(self) -> None:
        self.read_pos = 0
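
To make the split-payload case described in the docstring concrete, here is a minimal offline sketch. It uses a hypothetical `LineBuffer` class that follows the same buffering idea as `StreamScanner` (it is redefined here only so that the snippet is self-contained), and the event dictionaries are hand-written stand-ins for what the event stream might return, not real endpoint output.

```python
import io
import json


class LineBuffer:
    """Hypothetical stand-in for StreamScanner: buffers bytes and yields only complete lines."""

    def __init__(self) -> None:
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content: bytes) -> None:
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            if line.endswith(b"\n"):  # skip a trailing, still incomplete line
                self.read_pos += len(line)
                yield line[:-1]


# Simulated event stream: the second JSON object is split across two PayloadPart events
events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n'}},
    {"PayloadPart": {"Bytes": b'{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]

line_buffer = LineBuffer()
tokens = []
for event in events:
    line_buffer.write(event["PayloadPart"]["Bytes"])
    for line in line_buffer.readlines():
        tokens.append(json.loads(line)["outputs"][0])

print(tokens)
```

The second event yields nothing because its bytes do not end with a newline yet; the token is only emitted once the third event completes the line.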

## 1. Model upload to Amazon S3
Models served by an LMI container can be made available to the container in different ways:
* As with all SageMaker Inference containers, the container can download the model from Amazon S3 as a single `model.tar.gz` file. For LLMs, this approach is discouraged since download and decompression times can become unreasonably high.
* The container can download the model directly from the HuggingFace Hub for you. This option may also involve long download times.
* The container can download the uncompressed model from Amazon S3 with maximal throughput by using the [`s5cmd`](https://github.com/peak/s5cmd) utility. This option is specific to LMI containers and is the recommended one. It requires, however, that the model has previously been uploaded to an S3 bucket.

In this section, you will:
1. Download the model from the HuggingFace Hub to your local host,
2. Upload the downloaded model to an S3 bucket. This notebook uses SageMaker's default regional bucket. Feel free to upload the model to the bucket of your choice by modifying the `SM_ARTIFACT_BUCKET_NAME` global variable accordingly.

Each operation takes a few minutes.

In [None]:
huggingface_hub.snapshot_download(
    repo_id=HF_HUB_MODEL_NAME,
    revision="main",
    local_dir=HF_LOCAL_DOWNLOAD_DIR,
    local_dir_use_symlinks="auto",  # Files larger than 5MB are actually symlinked to the local HF cache
    allow_patterns=["*.json", "*.pt", "*.bin", "*.txt", "*.model", "*.py"],
);

In [None]:
MODEL_ID = SM_SESSION.upload_data(
    path=HF_LOCAL_DOWNLOAD_DIR.as_posix(),
    bucket=SM_ARTIFACT_BUCKET_NAME,
    key_prefix=HF_MODEL_KEY_PREFIX,
)
print(f"Model artifacts have been successfully uploaded to: {MODEL_ID}")

The `huggingface_hub.snapshot_download` function downloaded the model repository to a cache located in your home directory. Downloaded files were duplicated in the target local download directory, except for large files (larger than 5 MB), which were simply symlinked. Still, uncompressed LLM artifacts consume disk space. The following two cells remove the downloaded files from your local host.

In [None]:
# Remove HF model artifacts from the local download directory
shutil.rmtree(HF_LOCAL_DOWNLOAD_DIR)

In [None]:
# Remove HF model artifacts from the local HF cache directory
hf_local_cache_dir = get_local_model_cache_dir(hf_model_name=HF_HUB_MODEL_NAME)
shutil.rmtree(hf_local_cache_dir)

## 2. Deployment to a SageMaker Endpoint using a SageMaker LMI Docker image and the HuggingFace Accelerate engine
Startup of LLM inference containers can take longer than for smaller models, mainly because of longer model download and loading times. Timeout values need to be increased from their defaults accordingly. Each endpoint deployment takes a few minutes.

In [None]:
MODEL_ARTIFACTS_DOWNLOAD_TIMEOUT_IN_SECS = 9 * 60
CONTAINER_STARTUP_TIMEOUT_IN_SECS = 9 * 60

In [None]:
CONTAINER_STARTUP_CONFIGURATION = {
    "model_data_download_timeout": MODEL_ARTIFACTS_DOWNLOAD_TIMEOUT_IN_SECS,
    "container_startup_health_check_timeout": CONTAINER_STARTUP_TIMEOUT_IN_SECS,
}

### 2.1. Inference using the default HuggingFace Accelerate handler
In this section, you deploy the `lmsys/vicuna-7b-v1.3` model to a SageMaker endpoint consisting of a single `ml.g5.2xlarge` instance. The inference engine used by the DJL Serving stack is HuggingFace Accelerate (referred to as the `Python` engine in the [DJL Serving general settings](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html)). The chosen precision is FP16 (native precision).

Each engine has its own dedicated `sagemaker.model.Model` class. In the present case, you will use the `sagemaker.djl_inference.HuggingFaceAccelerateModel` class. The model server configuration is generated by the `HuggingFaceAccelerateModel` class from the arguments we pass to its constructor and from an optional, already-existing `serving.properties` file.

Since the HuggingFace Accelerate [default handler script](https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/huggingface.py) natively supports response streaming, we do not implement any custom server-side handler script.
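
To give a feel for what such a model server configuration can look like, the sketch below renders a few DJL Serving options as the text of a `serving.properties` file. This is an illustration under assumptions, not the file the SDK actually generates: the helper `render_serving_properties` is hypothetical, the S3 URI is a placeholder, and the exact set of options written by `HuggingFaceAccelerateModel` may differ.

```python
from typing import Dict


def render_serving_properties(options: Dict[str, str], engine: str = "Python") -> str:
    """Render DJL Serving options as the text of a serving.properties file (hypothetical helper)."""
    lines = [f"engine={engine}"]
    lines += [f"option.{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


# Illustrative values only; the bucket name is a placeholder
properties_text = render_serving_properties(
    {
        "model_id": "s3://my-bucket/hf-large-model-djl/lmsys/vicuna-7b-v1.3/",
        "task": "text-generation",
        "dtype": "fp16",
        "device_map": "auto",
        "low_cpu_mem_usage": "true",
        "enable_streaming": "true",
    }
)
print(properties_text)
```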

In [None]:
from sagemaker.djl_inference import HuggingFaceAccelerateModel

In [None]:
SOURCE_DIR_ACCELERATE = Path("code-accelerate")
SOURCE_DIR_ACCELERATE.mkdir(exist_ok=True)

In [None]:
%%writefile code-accelerate/serving.properties
option.enable_streaming=true

In [None]:
%%writefile code-accelerate/requirements.txt
protobuf<3.20
transformers>=4.31.0

In [None]:
hf_accelerate_model = HuggingFaceAccelerateModel(
    djl_version=DJL_VERSION,
    model_id=MODEL_ID,
    source_dir=SOURCE_DIR_ACCELERATE.as_posix(),
    role=SM_DEFAULT_EXECUTION_ROLE_ARN,
    task="text-generation",
    # HF Accelerate configuration arguments
    dtype="fp16",
    number_of_partitions=1,
    device_map="auto",
    low_cpu_mem_usage=True,
    load_in_8bit=False,
)

In [None]:
hf_accelerate_predictor = hf_accelerate_model.deploy(
    instance_type=INSTANCE_TYPE, initial_instance_count=1, **CONTAINER_STARTUP_CONFIGURATION
)

***Notices:***
* Requests with response streaming currently do not support multiple input prompts
* The `Predictor` object returned by the `deploy` method is currently not capable of invoking the endpoint it is tied to with response streaming. We therefore use the lower-level `boto3` client to invoke the endpoint.
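
Since the streaming call has to go through the `boto3` client, it can be convenient to wrap it in a generator. The following is a minimal sketch, assuming the endpoint returns JSON Lines of the form `{"outputs": ["..."]}`. The names `iter_stream_lines`, `stream_tokens` and `FakeRuntimeClient` are hypothetical; the fake client only mimics the shape of the response so that the sketch runs offline.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_stream_lines(event_stream: Iterable[Dict[str, Any]]) -> Iterator[bytes]:
    """Yield complete newline-terminated lines, even when they span several events."""
    pending = b""
    for event in event_stream:
        pending += event.get("PayloadPart", {}).get("Bytes", b"")
        # Everything before the last newline is complete; the remainder stays pending
        *complete, pending = pending.split(b"\n")
        for line in complete:
            if line:
                yield line


def stream_tokens(runtime_client, endpoint_name: str, request_body: Dict[str, Any]) -> Iterator[str]:
    """Invoke the endpoint with response streaming and yield the generated text pieces."""
    response = runtime_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(request_body),
        ContentType="application/json",
        Accept="application/jsonlines",
    )
    for line in iter_stream_lines(response["Body"]):
        yield json.loads(line)["outputs"][0]


class FakeRuntimeClient:
    """Hypothetical offline stub returning a hand-written event stream."""

    def invoke_endpoint_with_response_stream(self, **kwargs):
        return {
            "Body": [
                {"PayloadPart": {"Bytes": b'{"outputs": [" Amazon"]}\n{"outputs": '}},
                {"PayloadPart": {"Bytes": b'[" is"]}\n'}},
            ]
        }


text = "".join(stream_tokens(FakeRuntimeClient(), "my-endpoint", {"inputs": ["Hi"]}))
print(text)
```

With a real endpoint, the stub would be replaced by the SageMaker runtime client and `"my-endpoint"` by the deployed endpoint's name.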

In [None]:
endpoint_name = hf_accelerate_predictor.endpoint_name

In [None]:
%%time
prompts = [
    "What is Amazon? Be concise."
]
request_content_type = "application/json"
response_content_type = "application/jsonlines"

request_body = {"inputs": [PROMPT_TEMPLATE.format(prompt=prompt) for prompt in prompts], 
                "parameters": {
                    "max_new_tokens": 128,
                    "do_sample": True,
                    "temperature": 1.1,
                    "top_p": 0.85,
                },
               }

response = SAGEMAKER_RUNTIME_CLIENT.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name, 
    Body=json.dumps(request_body), 
    ContentType=request_content_type,
    Accept=response_content_type,
)

event_stream = response['Body']
scanner = StreamScanner()
for event in event_stream:
    scanner.write(event['PayloadPart']['Bytes'])
    for line in scanner.readlines():
        deserialized_line = json.loads(line)
        print(deserialized_line.get("outputs")[0], end='')

Now let's delete the endpoint to redeploy the model with a custom server-side handler script.

In [None]:
# Clean-up
hf_accelerate_predictor.delete_endpoint(delete_endpoint_config=True)
hf_accelerate_model.delete_model()

### 2.2. Inference using a custom server-side handler script
In this section, you will redeploy the same model, but first you will add a custom Python server-side handler script to the code artifacts that are deployed to the container (gathered in the `source_dir` / `SOURCE_DIR_ACCELERATE` directory).

The custom handler script below lets you enable or disable streaming on a per-request basis. The default behavior is set by the `option.enable_streaming` field (here `true`) in the model server's configuration file, `serving.properties`.
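
As a rough illustration of this per-request switch, the sketch below separates a `stream_response` flag from the remaining generation parameters and falls back to a server-side default when the flag is absent. The function name `split_request_parameters` is hypothetical and the snippet is independent of the DJL and `transformers` libraries.

```python
from typing import Any, Dict, Tuple


def split_request_parameters(
    payload: Dict[str, Any], default_streaming: bool
) -> Tuple[bool, Dict[str, Any]]:
    """Return (is_streaming_enabled, generation_parameters) for one request."""
    # Copy so that the caller's payload is left untouched
    parameters = dict(payload.get("parameters", {}))
    is_streaming_enabled = parameters.pop("stream_response", default_streaming)
    return is_streaming_enabled, parameters


# No flag in the request: the server-side default applies
print(split_request_parameters({"inputs": ["Hi"], "parameters": {"max_new_tokens": 8}}, True))
# Flag present: it overrides the default and is removed from the generation parameters
print(split_request_parameters({"inputs": ["Hi"], "parameters": {"stream_response": False}}, True))
```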

The custom handler script showcases the main differences between enabling streaming in the LMI container and sending the full generated sequences at once:
* We use a streamer object. The handler below uses `transformers.generation.streamers.TextIteratorStreamer`; `djl_python.streaming_utils.HFStreamer`, which implements the interface defined by [`transformers.TextStreamer`](https://huggingface.co/docs/transformers/v4.31.0/en/internal/generation_utils#transformers.TextStreamer), can be used as an alternative. The streamer object uses the model's tokenizer to decode the generated token IDs before pushing the decoded text to the streamer's internal queue.
* Instead of adding the result to the `djl_python.Output` object, we attach the streamer object to it using its `add_stream_content` method.
* Generation is executed in a background thread. The streamer object is passed to the [`generate` method](https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/text_generation#transformers.GenerationMixin.generate) together with the `GenerationConfig`. The generation logic then uses the streamer to post tokens to the streamer's queue. On the other side, since the `Output` object has access to the streamer object, it is able to retrieve the tokens from its queue and dispatch them to the client.
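
The producer/consumer mechanism described in the list above can be sketched with the standard library alone. In this simplified illustration, `QueueStreamer` and `fake_generate` are hypothetical stand-ins for the real streamer class and for the model's `generate` method; the real classes additionally handle tokenization and decoding.

```python
from queue import Queue
from threading import Thread
from typing import Iterator, List


class QueueStreamer:
    """Hypothetical, simplified streamer: a producer pushes text, a consumer iterates."""

    _STOP = object()  # sentinel marking the end of the generation

    def __init__(self) -> None:
        self._queue: Queue = Queue()

    def put_text(self, text: str) -> None:
        self._queue.put(text)

    def end(self) -> None:
        self._queue.put(self._STOP)

    def __iter__(self) -> Iterator[str]:
        while True:
            item = self._queue.get()  # blocks until the producer posts something
            if item is self._STOP:
                return
            yield item


def fake_generate(tokens: List[str], streamer: QueueStreamer) -> None:
    """Stand-in for the generation call: posts text pieces to the streamer."""
    try:
        for token in tokens:
            streamer.put_text(token)
    finally:
        streamer.end()  # always unblock the consumer, even on error


streamer = QueueStreamer()
# Generation runs in a background thread while the main thread consumes the queue
Thread(target=fake_generate, args=(["Amazon", " is", " a", " company"], streamer)).start()
received = list(streamer)
print("".join(received))
```

Calling `end()` in a `finally` block matters: without the sentinel, the consumer would block forever if generation failed midway.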

In [None]:
%%writefile code-accelerate/handler.py
import os
from threading import Thread
from typing import Any, Dict, List, Optional

from djl_python import Input, Output
from djl_python.streaming_utils import HFStreamer
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizer, GenerationConfig, TextGenerationPipeline
from transformers.generation.streamers import BaseStreamer, TextIteratorStreamer

def get_torch_dtype_from_str(dtype: str) -> torch.dtype:
    if dtype == "fp32":
        return torch.float32
    if dtype == "fp16":
        return torch.float16
    if dtype == "bf16":
        return torch.bfloat16
    if dtype == "int8":
        return torch.int8
    if dtype is None:
        return None
    raise ValueError(f"Data type cannot be parsed as valid Torch data type: {dtype}")

    
def start_generation_thread(model: TextGenerationPipeline, streamer: BaseStreamer, input_sequences: List[str], generation_config: GenerationConfig) -> None:
    def run_generation_with_streaming(model: TextGenerationPipeline, streamer: BaseStreamer, input_sequences: List[str], generation_config: GenerationConfig) -> None:
        try:
            model.generate(input_sequences, streamer=streamer, generation_config=generation_config)
        except Exception as e:
            streamer.put_text(str(e))
        finally:
            streamer.end()

    thread = Thread(target=run_generation_with_streaming,
                    args=[model, streamer, input_sequences, generation_config],
                   )
    thread.start()    


class ConfigFactory:
    
    def __init__(self, properties: Dict[str, Any]) -> None:
        self._properties = properties
        self._dtype = get_torch_dtype_from_str(properties.get("dtype", "fp16"))
        
    def build_model_loading_config(self) -> Dict[str, Any]:
        return {
            "low_cpu_mem_usage": (self._properties.get("low_cpu_mem_usage", "true").lower() == "true"),
            "trust_remote_code": (self._properties.get("trust_remote_code", "false").lower() == "true"), 
            "local_files_only": (self._properties.get("local_files_only", "false").lower() == "true"),
            "torch_dtype": self._dtype,
            "revision": self._properties.get("revision", "main"),
            "device_map": self._properties.get("device_map", "auto"),
        }
    
    def build_tokenizer_loading_config(self) -> Dict[str, Any]:
        return {
            "trust_remote_code": (self._properties.get("trust_remote_code", "false").lower() == "true"), 
            "revision": self._properties.get("revision", "main"),
            "legacy": (self._properties.get("tokenizer_legacy_behavior", "false").lower() == "true"),
        }
    
    def build_tokenizer_encoding_config(self) -> Dict[str, Any]:
        return {
            "padding": True, 
            "return_tensors": "pt"
        }
    
    def build_tokenizer_decoding_config(self) -> Dict[str, Any]:
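        return {
            # Properties are passed as strings, so parse the flag like the other boolean options
            "skip_special_tokens": (str(self._properties.get("skip_special_tokens", "true")).lower() == "true")
        }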


class HuggingFaceAccelerateInferenceService:
    def __init__(self) -> None:
        self.model_location = None
        self._config_factory = None
        self._tokenizer = None
        self._model = None
        self.initialized = False
        self.default_is_streaming_enabled = None
        self._default_generation_parameters = {}
        self.device = None
    
    def _load_tokenizer(self) -> PreTrainedTokenizer:
        tokenizer_loading_config = self._config_factory.build_tokenizer_loading_config()
        tokenizer = AutoTokenizer.from_pretrained(self.model_location, **tokenizer_loading_config)
        if not tokenizer.pad_token:
            tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
            self._default_generation_parameters.update({"pad_token_id": tokenizer.pad_token_id})
        return tokenizer
    
    def initialize(self, properties: Dict[str, str]) -> None:
        self._config_factory = ConfigFactory(properties=properties) 
        # model_id can point to huggingface model_id or local directory.
        # If option.model_id points to a s3 bucket, the DJL model server downloads it and set option.model_id to the local download directory.
        # If option.model_id is not available, it is assumed model artifacts are in the option.model_dir (by default set to /opt/ml/model, which is also the cwd)
        self.model_location = properties.get("model_id") or properties.get("model_dir")
        self.default_is_streaming_enabled = (properties.get("enable_streaming", "true").lower() == "true")
        device_id = properties.get("device_id", None)
        self.device = f"cuda:{device_id}" if device_id is not None else "cuda:0"
        
        self._tokenizer = self._load_tokenizer()
        model_loading_config = self._config_factory.build_model_loading_config()
        self._model = AutoModelForCausalLM.from_pretrained(self.model_location, **model_loading_config).to(self.device)
        self.initialized = True
        
    def handle_generation_request(self, inputs: Input) -> Output:
        request_payload = inputs.get_as_json()
        input_sequences = request_payload["inputs"]
        request_parameters = request_payload.get("parameters", {})  # "parameters" is optional
        is_streaming_enabled = request_parameters.pop("stream_response", self.default_is_streaming_enabled)
        generation_parameters = self._default_generation_parameters.copy()
        generation_parameters.update(request_parameters)
        generation_config=GenerationConfig(**generation_parameters)
        outputs = Output()
        encoding_config = self._config_factory.build_tokenizer_encoding_config()
        decoding_config = self._config_factory.build_tokenizer_decoding_config()
        inputs = self._tokenizer(input_sequences, **encoding_config).input_ids.to(self.device)
        if is_streaming_enabled:
            assert len(input_sequences) == 1, "Only one sequence can be processed at a time when stream_response=True"
            # Alternative streamer: HFStreamer(tokenizer=self._tokenizer, **decoding_config)
            streamer = TextIteratorStreamer(tokenizer=self._tokenizer, skip_prompt=True, **decoding_config)
            start_generation_thread(model=self._model, streamer=streamer, input_sequences=inputs, generation_config=generation_config)
            outputs.add_property("content-type", "application/jsonlines")
            outputs.add_stream_content(streamer)
        else:
            output_ids = self._model.generate(inputs, generation_config=generation_config)
            output_sequences = self._tokenizer.batch_decode(output_ids, **decoding_config)
            output_sequences = [output_seq[len(input_seq):] for input_seq, output_seq in zip(input_sequences, output_sequences)]
            outputs.add_property("content-type", "application/json")
            outputs.add(output_sequences)
        return outputs
    
    
_service = HuggingFaceAccelerateInferenceService()

def handle(inputs: Input) -> Optional[Output]:
    if not _service.initialized:
        print("Initializing inference service")
        _service.initialize(properties=inputs.get_properties())

    if inputs.is_empty():
        return None

    return _service.handle_generation_request(inputs=inputs)

In [None]:
hf_accelerate_model = HuggingFaceAccelerateModel(
    djl_version=DJL_VERSION,
    model_id=MODEL_ID,
    source_dir=SOURCE_DIR_ACCELERATE.as_posix(),
    entry_point="handler.py",
    role=SM_DEFAULT_EXECUTION_ROLE_ARN,
    task="text-generation",
    # HF Accelerate configuration arguments
    dtype="fp16",
    device_id=0,
    device_map="auto",
    low_cpu_mem_usage=True,
    load_in_8bit=False,
)

In [None]:
hf_accelerate_predictor = hf_accelerate_model.deploy(
    instance_type=INSTANCE_TYPE, initial_instance_count=1, **CONTAINER_STARTUP_CONFIGURATION
)

In [None]:
endpoint_name = hf_accelerate_predictor.endpoint_name

The custom handler script lets you either stream the response tokens (default behavior), i.e. invoke the endpoint using `sagemaker:InvokeEndpointWithResponseStream`, or disable streaming at the request level, i.e. invoke the endpoint using `sagemaker:InvokeEndpoint`. Let's first invoke the endpoint with the streaming feature enabled.

In [None]:
%%time
prompts = [
    "What is Amazon? Be concise."
]
request_content_type = "application/json"
response_content_type = "application/jsonlines"

request_body = {"inputs": [PROMPT_TEMPLATE.format(prompt=prompt) for prompt in prompts], 
                "parameters": {
                    "max_new_tokens": 128,
                    "do_sample": True,
                    "temperature": 1.1,
                    "top_p": 0.85,
                },
               }

response = SAGEMAKER_RUNTIME_CLIENT.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name, 
    Body=json.dumps(request_body), 
    ContentType=request_content_type,
    Accept=response_content_type,
)

event_stream = response['Body']
scanner = StreamScanner()
for event in event_stream:
    scanner.write(event['PayloadPart']['Bytes'])
    for line in scanner.readlines():
        deserialized_line = json.loads(line)
        print(deserialized_line.get("outputs"), end='')
print("\n")

Now let's add a `"stream_response": False` entry to our request parameters so that the endpoint can be invoked using the `sagemaker:InvokeEndpoint` API call, and let's use the `Predictor` object returned by `Model.deploy` to perform this call.

In [None]:
%%time

prompts = [
    "What is Amazon? Be concise."
]

generation_config = {
    "max_new_tokens": 128,
    "do_sample": True,
    "temperature": 1.1,
    "top_p": 0.85,
    "stream_response": False,
}

hf_accelerate_predictor.predict(
    data={
        "inputs": [PROMPT_TEMPLATE.format(prompt=prompt) for prompt in prompts],
        "parameters": generation_config,
    }
)

In [None]:
# Clean-up
hf_accelerate_predictor.delete_endpoint(delete_endpoint_config=True)
hf_accelerate_model.delete_model()
shutil.rmtree(SOURCE_DIR_ACCELERATE.as_posix())

## 3. Clean-up
At this stage:
* All your SageMaker Endpoint resources are supposed to be deleted, along with the SageMaker EndpointConfig and SageMaker Model resources they were associated with,
* You have freed the disk space of your local host from the large model artifacts downloaded from the HuggingFace Hub.

The only remaining cleanup task consists of removing the model artifacts from Amazon S3, which the next and last cell of this notebook does.

In [None]:
# Remove HF model artifacts from S3
hf_s3_objects = list_s3_objects(bucket=SM_ARTIFACT_BUCKET_NAME, key_prefix=HF_MODEL_KEY_PREFIX)
hf_s3_objects_keys = [obj["Key"] for obj in hf_s3_objects]
delete_s3_objects(bucket=SM_ARTIFACT_BUCKET_NAME, keys=hf_s3_objects_keys)
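
A single S3 `DeleteObjects` request accepts at most 1,000 keys. The model repository used in this notebook contains far fewer files, but if you reuse the cleanup utility with a larger prefix, the keys may need to be split into batches first. The sketch below shows one way to do so; `batch_keys` is a hypothetical helper and the example keys are made up.

```python
from typing import Iterator, List

S3_DELETE_BATCH_SIZE = 1000  # maximum number of keys accepted by one DeleteObjects request


def batch_keys(keys: List[str], batch_size: int = S3_DELETE_BATCH_SIZE) -> Iterator[List[str]]:
    """Split a list of keys into consecutive batches of at most batch_size keys."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


# Made-up keys, for illustration only
example_keys = [f"hf-large-model-djl/example/file-{i}.bin" for i in range(2500)]
batch_sizes = [len(batch) for batch in batch_keys(example_keys)]
print(batch_sizes)
```

Each batch could then be passed to `delete_s3_objects` in turn.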

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|lab6-llm-token-streaming|lab6-token-streaming-lmsys-vicuna-7b-lmi.ipynb)
