# 01_Getting Started with Amazon Nova Models

Amazon Nova is a new generation of multimodal understanding and creative content generation models that offer state-of-the-art quality, unparalleled customization, and the best price-performance. Amazon Nova models incorporate the same secure-by-design approach as all AWS services, with built-in controls for the safe and responsible use of AI.

Amazon Nova has two categories of models:
- **Understanding models**: These models reason over several input modalities, including text, video, and image, and output text.
- **Creative Content Generation models**: These models generate images or videos based on a text or image prompt.
  
### Amazon Nova Models at a Glance
![media/model_intro.png](media/model_intro.png)

**Multimodal Understanding Models**
- **Amazon Nova Micro**: Lightning-fast, cost-effective text-only model
- **Amazon Nova Lite**: Fastest, most affordable multimodal FM in the industry for its intelligence tier
- **Amazon Nova Pro**: The fastest, most cost-effective, state-of-the-art multimodal model in the industry

**Creative Content Generation Models**
- **Amazon Nova Canvas**: State-of-the-art image generation model
- **Amazon Nova Reel**: State-of-the-art video generation model


The following notebooks focus primarily on the Amazon Nova understanding models.

**Amazon Nova Multimodal understanding** foundation models (FMs) are a family of models that are capable of reasoning over several input modalities, including text, video, documents and/or images, and output text. You can access these models through the Bedrock Converse API and InvokeModel API.


---

### 1. Setup

**Step 1: Gain Access to the Model**: If you have not yet requested model access in Amazon Bedrock, [request access by following these instructions](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html).

![media/model_access.png](media/model_access.png)

## 2 When to Use What?

## 2.1 When to Use Amazon Nova Micro Model

Amazon Nova Micro (Text Input Only) is the fastest and most affordable option, optimized for large-scale, latency-sensitive deployments like conversational interfaces, chats, and high-volume tasks, such as classification, routing, entity extraction, and document summarization.

## 2.2 When to Use Amazon Nova Lite Model

Amazon Nova Lite balances intelligence, latency, and cost-effectiveness. It’s optimized for complex scenarios where low latency (minimal delay) is crucial, such as interactive agents that need to orchestrate multiple tool calls simultaneously. Amazon Nova Lite supports image, video, and text inputs and outputs text. 

## 2.3 When to Use Amazon Nova Pro Model
Amazon Nova Pro is designed for highly complex use cases requiring advanced reasoning, creativity, and code generation. Amazon Nova Pro supports image, video, and text inputs and outputs text.
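
The guidance in sections 2.1 to 2.3 can be summarized as a small selection rule. The sketch below is only one possible reading of that guidance: `choose_model` and its two flags are hypothetical names introduced for illustration, and the model IDs are the ones used later in this notebook.

```python
# Model IDs as used later in this notebook.
PRO_MODEL_ID = "us.amazon.nova-pro-v1:0"
LITE_MODEL_ID = "us.amazon.nova-lite-v1:0"
MICRO_MODEL_ID = "us.amazon.nova-micro-v1:0"


def choose_model(has_media=False, needs_advanced_reasoning=False):
    """Pick a model ID following the guidance in sections 2.1 to 2.3 (hypothetical helper)."""
    if needs_advanced_reasoning:
        return PRO_MODEL_ID    # highly complex reasoning, creativity, code generation
    if has_media:
        return LITE_MODEL_ID   # image or video input where low latency matters
    return MICRO_MODEL_ID      # text-only input, fastest and most affordable


print(choose_model())                 # text-only task
print(choose_model(has_media=True))   # image or video input
```

Real workloads will usually weigh latency, cost, and quality on their own data rather than rely on two flags.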

---

## Prerequisites

Run the cells in this section to install the packages needed by the notebooks in this workshop. ⚠️ You will see pip dependency errors; you can safely ignore them. ⚠️

_IGNORE ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts._


In [None]:
%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

In [None]:
# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
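
Once the kernel has restarted, it can be worth confirming that the installed versions meet the minimums requested in the install cell. The sketch below is one way to do that with the standard library only. `parse_version`, `meets_minimum`, and `check_packages` are hypothetical helper names, and the parsing assumes plain dotted numeric versions such as `1.28.57`.

```python
from importlib import metadata

# Minimum versions taken from the %pip install cell above.
MINIMUM_VERSIONS = {"boto3": "1.28.57", "awscli": "1.29.57", "botocore": "1.31.57"}


def parse_version(version):
    """Turn '1.28.57' into (1, 28, 57); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split(".")[:3])


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_packages(minimums=MINIMUM_VERSIONS):
    """Map each package name to True, False, or None (package not installed)."""
    results = {}
    for name, minimum in minimums.items():
        try:
            results[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


for name, ok in check_packages().items():
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```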

### 2. Text Understanding [Applicable for Amazon Nova Micro, Amazon Nova Lite, Amazon Nova Pro]

The example below demonstrates how to use a text-based prompt, including a simple system prompt and message list.


Note: The examples below use Nova Lite for illustrative purposes, but you can also use Nova Micro for text understanding use cases.


In [None]:
import boto3
import json
import base64
from datetime import datetime

PRO_MODEL_ID = "us.amazon.nova-pro-v1:0"
LITE_MODEL_ID = "us.amazon.nova-lite-v1:0"
MICRO_MODEL_ID = "us.amazon.nova-micro-v1:0"

# Create a Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client("bedrock-runtime", region_name="us-east-1")


### InvokeModel body and output

The `invoke_model()` method of the Amazon Bedrock runtime client (InvokeModel API) will be the primary method we use for most of our text generation and processing tasks.

Although the method is shared, the format of input and output varies depending on the foundation model used - as described below:


```python
{
  "system": [
    {
      "text": string
    }
  ],
  "messages": [
    {
      "role": "user",# first turn should always be the user turn
      "content": [
        {
          "text": string
        },
        {
          "image": {
            "format": "jpeg"| "png" | "gif" | "webp",
            "source": {
              "bytes": "base64EncodedImageDataHere..."#  base64-encoded binary
            }
          }
        },
        {
          "video": {
            "format": "mkv" | "mov" | "mp4" | "webm" | "three_gp" | "flv" | "mpeg" | "mpg" | "wmv",
            "source": {
            # source can be an S3 location or base64 bytes, depending on the size of the input file.
               "s3Location": {
                "uri": string, #  example: s3://my-bucket/object-key
                "bucketOwner": string #  (Optional) example: 123456789012
               },
              "bytes": "base64EncodedVideoDataHere..." #  base64-encoded binary
            }
          }
        },
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "text": string # prefilling assistant turn
        }
      ]
    }
  ],
 "inferenceConfig":{ # all Optional
    "max_new_tokens": int, #  greater than 0, equal or less than 5k (default: dynamic*)
    "temperature": float, # greater than 0 and less than 1.0 (default: 0.7)
    "top_p": float, #  greater than 0, equal or less than 1.0 (default: 0.9)
    "top_k": int, #  0 or greater (default: 50)
    "stopSequences": [string]
  },
  "toolConfig": { #  all Optional
        "tools": [
                {
                    "toolSpec": {
                        "name": string, # meaningful tool name (Max char: 64)
                        "description": string, # meaningful description of the tool
                        "inputSchema": {
                            "json": { # The JSON schema for the tool. For more information, see JSON Schema Reference
                                "type": "object",
                                "properties": {
                                    <args>: { # arguments 
                                        "type": string, # argument data type
                                        "description": string # meaningful description
                                    }
                                },
                                "required": [
                                    string # args
                                ]
                            }
                        }
                    }
                }
            ],
   "toolChoice": "auto" # Amazon Nova models currently ONLY support tool choice of "auto"
        }
}
```
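
As a concrete instance of the schema above, here is a minimal sketch that assembles a text-only request body, optionally with a system prompt and a prefilled assistant turn. `build_text_request` is a hypothetical helper name, and the prompt strings are illustrative only.

```python
import json


def build_text_request(user_text, system_text=None, prefill=None, **inference_params):
    """Build an InvokeModel body for a text-only conversation (hypothetical helper)."""
    # The first turn is always the user turn.
    messages = [{"role": "user", "content": [{"text": user_text}]}]
    if prefill is not None:
        # A trailing assistant turn prefills the start of the model's reply.
        messages.append({"role": "assistant", "content": [{"text": prefill}]})

    body = {"messages": messages}
    if system_text is not None:
        body["system"] = [{"text": system_text}]
    if inference_params:
        body["inferenceConfig"] = inference_params
    return body


request = build_text_request(
    "List three primary colors as a JSON array.",
    system_text="You are a concise assistant.",
    prefill="[",
    max_new_tokens=100,
    temperature=0.3,
)
print(json.dumps(request, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and passed as the `body` argument of `invoke_model()`.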

The request parameters are described below.

* `system` – (Optional) The system prompt for the request.
    A system prompt is a way of providing context and instructions to Amazon Nova, such as specifying a particular goal or role.
* `messages` – (Required) The input messages.
    * `role` – The role of the conversation turn. Valid values are user and assistant. 
    * `content` – (required) The content of the conversation turn.
        * `type` – (required) The type of the content. Valid values are `text`, `image`, and `video`.
            * If text is chosen (text content)
                * `text` – The content of the conversation turn.
            * If image is chosen (image content)
                * `source` – (required) The base64-encoded bytes of the image.
                * `format` – (required) The type of the image. You can specify the following image formats:
                    * `jpeg`
                    * `png`
                    * `webp`
                    * `gif`
            * If video is chosen (video content)
                * `source` – (required) The base64-encoded bytes of the video, or an S3 URI and bucket owner, as shown in the schema above.
                * `format` – (required) The type of the video. You can specify the following video formats:
                    * `mkv`
                    * `mov`
                    * `mp4`
                    * `webm`
                    * `three_gp`
                    * `flv`
                    * `mpeg`
                    * `mpg`
                    * `wmv`
* `inferenceConfig` – (Optional) Inference configuration values that can be passed with the request.
    * `max_new_tokens` – (Optional) The maximum number of tokens to generate before stopping.
        Note that Amazon Nova models might stop generating tokens before reaching the value of `max_new_tokens`. The maximum allowed value is 5K.
    * `temperature` – (Optional) The amount of randomness injected into the response.
    * `top_p` – (Optional) Use nucleus sampling. Amazon Nova computes the cumulative distribution over all the options for each subsequent token in decreasing probability order and cuts it off once it reaches the probability specified by `top_p`. You should alter either `temperature` or `top_p`, but not both.
    * `top_k` – (Optional) Only sample from the top K options for each subsequent token. Use `top_k` to remove long-tail, low-probability responses.
    * `stopSequences` – (Optional) Array of strings containing stop sequences. If the model generates any of those strings, generation stops and the response is returned up to that point.
* `toolConfig` – (Optional) JSON object following the ToolConfig schema, containing the tool specification and tool choice. This is the same schema used by the Converse API.
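
To make the `toolConfig` structure more tangible, the sketch below declares a single tool and attaches it to a request body. The `get_weather` tool, its arguments, and the `check_tool_spec` helper are all hypothetical and exist only for illustration. `toolChoice` is left out here, on the assumption that it is optional as the schema indicates.

```python
import json

# Hypothetical tool: the name, description, and arguments are illustrative only.
weather_tool = {
    "toolSpec": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, for example Paris"},
                },
                "required": ["city"],
            }
        },
    }
}


def check_tool_spec(tool):
    """Raise ValueError if the tool breaks the constraints noted in the schema."""
    spec = tool["toolSpec"]
    if len(spec["name"]) > 64:
        raise ValueError("tool name must be at most 64 characters")
    schema = spec["inputSchema"]["json"]
    missing = set(schema.get("required", [])) - set(schema.get("properties", {}))
    if missing:
        raise ValueError(f"required arguments not declared: {sorted(missing)}")
    return tool


request_body = {
    "messages": [{"role": "user", "content": [{"text": "What is the weather in Paris?"}]}],
    "toolConfig": {"tools": [check_tool_spec(weather_tool)]},
}
print(json.dumps(request_body, indent=2))
```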





#### 2.1 Synchronous API Call

In [None]:
# Define your system prompt(s).
system_list = [
    { "text": "You should respond to all messages in french" }
]

# Define one or more messages using the "user" and "assistant" roles.
message_list = [
    {"role": "user", "content": [{"text": "tell me a joke"}]},
]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 300, "top_p": 0.9, "top_k": 20, "temperature": 0.7}

native_request = {
    "messages": message_list,
    "system": system_list,
    "inferenceConfig": inf_params,
}

# Invoke the model and extract the response body.
response = client.invoke_model(modelId=LITE_MODEL_ID, body=json.dumps(native_request))
request_id = response["ResponseMetadata"]["RequestId"]
print(f"Request ID: {request_id}")
model_response = json.loads(response["body"].read())

# Pretty print the response JSON.
print("\n[Full Response]")
print(json.dumps(model_response, indent=2))

# Print the text content for easy readability.
content_text = model_response["output"]["message"]["content"][0]["text"]
print("\n[Response Content Text]")
print(content_text)
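
The same kind of request can also be sent through the Converse API mentioned in the introduction. The sketch below shows roughly how that might look: `build_converse_kwargs` and `converse_text` are hypothetical helper names. Note that Converse takes its inference settings in camelCase (`maxTokens`, `topP`), and that `top_k` is not part of its common `inferenceConfig`, so it is omitted here.

```python
def build_converse_kwargs(model_id, user_text, system_text=None,
                          max_tokens=300, temperature=0.7, top_p=0.9):
    """Assemble keyword arguments for the Bedrock Runtime converse() call."""
    kwargs = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature, "topP": top_p},
    }
    if system_text is not None:
        kwargs["system"] = [{"text": system_text}]
    return kwargs


def converse_text(client, **kwargs):
    """Call converse() on a Bedrock Runtime client and return the first text block."""
    response = client.converse(**kwargs)
    return response["output"]["message"]["content"][0]["text"]


kwargs = build_converse_kwargs(
    "us.amazon.nova-lite-v1:0",
    "tell me a joke",
    system_text="You should respond to all messages in french",
)
# With the `client` created earlier in this notebook:
# print(converse_text(client, **kwargs))
```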

#### 2.2 Streaming API Call

The example below demonstrates how to use a text-based prompt with the streaming API.

In [None]:

# Define your system prompt(s).
system_list = [
    { "text": "Act as a creative writing assistant. When the user provides you with a topic, write a short story about that topic." }
]

# Define one or more messages using the "user" and "assistant" roles.
message_list = [{"role": "user", "content": [{"text": "A camping trip"}]}]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 500, "top_p": 0.9, "top_k": 20, "temperature": 0.7}

request_body = {
    "messages": message_list,
    "system": system_list,
    "inferenceConfig": inf_params,
}

start_time = datetime.now()

# Invoke the model with the response stream
response = client.invoke_model_with_response_stream(
    modelId=LITE_MODEL_ID, body=json.dumps(request_body)
)

request_id = response.get("ResponseMetadata").get("RequestId")
print(f"Request ID: {request_id}")
print("Awaiting first token...")

chunk_count = 0
time_to_first_token = None

# Process the response stream
stream = response.get("body")
if stream:
    for event in stream:
        chunk = event.get("chunk")
        if chunk:
            # Print the response chunk
            chunk_json = json.loads(chunk.get("bytes").decode())
            # Pretty print JSON
            # print(json.dumps(chunk_json, indent=2, ensure_ascii=False))
            content_block_delta = chunk_json.get("contentBlockDelta")
            if content_block_delta:
                if time_to_first_token is None:
                    time_to_first_token = datetime.now() - start_time
                    print(f"Time to first token: {time_to_first_token}")

                chunk_count += 1
                current_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S:%f")
                # print(f"{current_time} - ", end="")
                print(content_block_delta.get("delta").get("text"), end="")
    print(f"\nTotal chunks: {chunk_count}")
else:
    print("No response stream received.")
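
When the full text is needed after streaming (for logging or post-processing, say), the deltas can be accumulated instead of only printed. The sketch below assumes the event shape used in the cell above (`chunk` → `bytes` → `contentBlockDelta` → `delta` → `text`). `collect_stream_text` is a hypothetical helper name.

```python
import json


def collect_stream_text(events):
    """Join the text deltas from an InvokeModel response stream.

    Returns (full_text, chunk_count). Events that carry no text delta
    are skipped and not counted.
    """
    parts = []
    for event in events:
        chunk = event.get("chunk")
        if not chunk:
            continue
        payload = json.loads(chunk["bytes"].decode("utf-8"))
        delta = payload.get("contentBlockDelta", {}).get("delta", {})
        text = delta.get("text")
        if text is not None:
            parts.append(text)
    return "".join(parts), len(parts)


# Usage with a streaming response from invoke_model_with_response_stream():
# full_text, chunk_count = collect_stream_text(response["body"])
```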

### 3. Multimodal Understanding [Applicable only for Amazon Nova Lite and Amazon Nova Pro]

The following examples show how to pass various media types to the model. *(Reminder: this is supported only by the Lite and Pro models.)*

#### 3.1 Image Understanding

Let's see how a Nova model handles an image understanding use case.

Here we pass an image of a sunset and ask the model to suggest three art titles for it.

![A Sunset Image](media/sunset.png)

In [None]:


# Open the image you'd like to use and encode it as a Base64 string.
with open("media/sunset.png", "rb") as image_file:
    binary_data = image_file.read()
    base_64_encoded_data = base64.b64encode(binary_data)
    base64_string = base_64_encoded_data.decode("utf-8")

# Define your system prompt(s).
system_list = [
    { "text": "You are an expert artist. When the user provides you with an image, provide 3 potential art titles" }
]

# Define a "user" message including both the image and a text prompt.
message_list = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "format": "png",
                    "source": {"bytes": base64_string},
                }
            },
            {"text": "Provide art titles for this image."},
        ],
    }
]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 300, "top_p": 0.1, "top_k": 20, "temperature": 0.3}

native_request = {
    "messages": message_list,
    "system": system_list,
    "inferenceConfig": inf_params,
}

# Invoke the model and extract the response body.
response = client.invoke_model(modelId=LITE_MODEL_ID, body=json.dumps(native_request))
model_response = json.loads(response["body"].read())

# Pretty print the response JSON.
print("[Full Response]")
print(json.dumps(model_response, indent=2))

# Print the text content for easy readability.
content_text = model_response["output"]["message"]["content"][0]["text"]
print("\n[Response Content Text]")
print(content_text)

There can be multiple image contents. In this example we ask the model to find what two images have in common:

![](media/cat.jpeg)

![](media/dog.jpeg)

In [None]:

# Open the images you'd like to use and encode each one as a Base64 string.
with open("media/dog.jpeg", "rb") as image_file:
    binary_data = image_file.read()
    base_64_encoded_data = base64.b64encode(binary_data)
    dog_base64_string = base_64_encoded_data.decode("utf-8")

with open("media/cat.jpeg", "rb") as image_file:
    binary_data = image_file.read()
    base_64_encoded_data = base64.b64encode(binary_data)
    cat_base64_string = base_64_encoded_data.decode("utf-8")

# Define a "user" message including both images and a text prompt.
message_list = [
    {
        "role": "user",
        "content": [
            {
                "image": {
                    "format": "jpeg",
                    "source": {"bytes": dog_base64_string},
                }
            },
            {
                "image": {
                    "format": "jpeg",
                    "source": {"bytes": cat_base64_string},
                }
            },
            {"text": "What do these two images have in common?"},
        ],
    }
]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 300, "top_p": 0.1, "top_k": 20, "temperature": 0.3}

native_request = {
    "messages": message_list,
    "inferenceConfig": inf_params,
}

# Invoke the model and extract the response body.
response = client.invoke_model(modelId=LITE_MODEL_ID, body=json.dumps(native_request))
model_response = json.loads(response["body"].read())

# Pretty print the response JSON.
print("[Full Response]")
print(json.dumps(model_response, indent=2))

# Print the text content for easy readability.
content_text = model_response["output"]["message"]["content"][0]["text"]
print("\n[Response Content Text]")
print(content_text)
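
The read-and-encode steps above repeat for every image, so they can be folded into a small helper. The sketch below is one way to do this: `image_block` and `images_message` are hypothetical names, and the format is inferred from the file extension, which assumes the extension matches the actual file contents.

```python
import base64
from pathlib import Path

# File extension -> "format" value accepted by the request schema.
IMAGE_FORMATS = {".jpeg": "jpeg", ".jpg": "jpeg", ".png": "png", ".gif": "gif", ".webp": "webp"}


def image_block(path):
    """Return an image content block for the file at `path`."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in IMAGE_FORMATS:
        raise ValueError(f"unsupported image type: {suffix!r}")
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return {"image": {"format": IMAGE_FORMATS[suffix], "source": {"bytes": encoded}}}


def images_message(paths, prompt):
    """Build one user message holding several images followed by a text prompt."""
    content = [image_block(p) for p in paths]
    content.append({"text": prompt})
    return {"role": "user", "content": content}


# Usage with the files from this notebook:
# message_list = [images_message(["media/dog.jpeg", "media/cat.jpeg"],
#                                "What do these two images have in common?")]
```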

#### 3.2 Video Understanding

Let's now see how Nova handles a video understanding use case.

Here we pass in a short video (`media/the-sea.mp4`) and ask the model to suggest titles for it.


In [None]:
from IPython.display import Video

Video("media/the-sea.mp4")

In [None]:

# Open the video you'd like to use and encode it as a Base64 string.
with open("media/the-sea.mp4", "rb") as video_file:
    binary_data = video_file.read()
    base_64_encoded_data = base64.b64encode(binary_data)
    base64_string = base_64_encoded_data.decode("utf-8")

# Define your system prompt(s).
system_list = [
    { "text": "You are an expert media analyst. When the user provides you with a video, provide 3 potential video titles" }
]

# Define a "user" message including both the video and a text prompt.
message_list = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {"bytes": base64_string},
                }
            },
            {"text": "Provide video titles for this clip."},
        ],
    }
]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 300, "top_p": 0.1, "top_k": 20, "temperature": 0.3}

native_request = {
    "messages": message_list,
    "system": system_list,
    "inferenceConfig": inf_params,
}

# Invoke the model and extract the response body.
response = client.invoke_model(modelId=LITE_MODEL_ID, body=json.dumps(native_request))
model_response = json.loads(response["body"].read())

# Pretty print the response JSON.
print("[Full Response]")
print(json.dumps(model_response, indent=2))

# Print the text content for easy readability.
content_text = model_response["output"]["message"]["content"][0]["text"]
print("\n[Response Content Text]")
print(content_text)

### Video Understanding using S3 Path

#### Replace the S3 URI below with the S3 URI where your video is located

In [None]:

# Define your system prompt(s).
system_list = [
    { "text": "You are an expert media analyst. When the user provides you with a video, provide 3 potential video titles" }
]

# Define a "user" message including both the video and a text prompt.
message_list = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {
                        "s3Location": {
                            # Replace the S3 URI
                            "uri": "s3://demo-bucket/the-sea.mp4"
                        }
                    },
                }
            },
            {"text": "Provide video titles for this clip."},
        ],
    }
]

# Configure the inference parameters.
inf_params = {"max_new_tokens": 300, "top_p": 0.1, "top_k": 20, "temperature": 0.3}

native_request = {
    "schemaVersion": "messages-v1",
    "messages": message_list,
    "system": system_list,
    "inferenceConfig": inf_params,
}

# Invoke the model and extract the response body.
response = client.invoke_model(modelId=LITE_MODEL_ID, body=json.dumps(native_request))
model_response = json.loads(response["body"].read())

# Pretty print the response JSON.
print("[Full Response]")
print(json.dumps(model_response, indent=2))

# Print the text content for easy readability.
content_text = model_response["output"]["message"]["content"][0]["text"]
print("\n[Response Content Text]")
print(content_text)
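
As the request schema notes, the video source can be inline base64 bytes or an S3 location, depending on the size of the input file. The sketch below picks between the two automatically. `video_block` is a hypothetical helper, and the 25 MB inline threshold is an assumption rather than a value stated in this notebook, so check the current service limits before relying on it.

```python
import base64
import os

# Assumed inline limit; verify against the current service quotas.
MAX_INLINE_BYTES = 25 * 1024 * 1024


def video_block(path, s3_uri=None, bucket_owner=None, video_format="mp4",
                max_inline_bytes=MAX_INLINE_BYTES):
    """Build a video content block: inline for small files, via S3 otherwise."""
    if os.path.getsize(path) <= max_inline_bytes:
        with open(path, "rb") as video_file:
            encoded = base64.b64encode(video_file.read()).decode("utf-8")
        source = {"bytes": encoded}
    elif s3_uri is not None:
        location = {"uri": s3_uri}
        if bucket_owner is not None:
            location["bucketOwner"] = bucket_owner
        source = {"s3Location": location}
    else:
        raise ValueError("file is too large to send inline; provide s3_uri")
    return {"video": {"format": video_format, "source": source}}


# Usage with the file and placeholder bucket from this notebook:
# block = video_block("media/the-sea.mp4", s3_uri="s3://demo-bucket/the-sea.mp4")
```

The helper assumes that the object at `s3_uri` is an already-uploaded copy of the local file; it does not upload anything itself.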