# Run Llama 2 7B in SageMaker using MK1 Flywheel

---

MK1 is building the world’s most efficient generative AI platform to run foundation models in the cloud. We routinely save customers more than 50% in cloud costs and offer the lowest latency for a given throughput. Take it for a spin, and contact us when you are ready to run it in production.

This demo notebook shows how to use the SageMaker Python SDK to deploy the MK1 Flywheel runtime with Llama 2 7B Chat, the Llama 2 model fine-tuned for dialogue use cases.

By using this example, you agree to all terms and conditions described in the Llama 2 end-user license agreement (EULA) (https://ai.meta.com/llama/license/).

---

## Setup

---

To run this example, the following prerequisites must be met:
1. The IAM role requires **AmazonSageMakerFullAccess** permissions and the `AssumeRole` trust relationship (see below).
2. One of the following:
    1. The IAM role has the necessary permissions to automatically subscribe to the corresponding AWS Marketplace listing.
    2. The AWS account already has an active subscription.

#### Trust Relationship for IAM role
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
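
A trust policy like the one above can be checked programmatically before deploying. The following is a minimal sketch, not part of the original notebook: the helper name `allows_sagemaker_assume_role` is hypothetical, and it only inspects a policy document that has already been loaded into a Python dictionary.

```python
import json

SAGEMAKER_PRINCIPAL = "sagemaker.amazonaws.com"


def allows_sagemaker_assume_role(trust_policy: dict) -> bool:
    """Return True if an Allow statement lets SageMaker call sts:AssumeRole.

    Hypothetical helper for illustration; it does not evaluate conditions.
    """
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement object is also valid
        statements = [statements]

    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # e.g. the wildcard string "*"
            continue
        actions = statement.get("Action", [])
        services = principal.get("Service", [])
        # Both fields may be a single string or a list of strings.
        actions = [actions] if isinstance(actions, str) else actions
        services = [services] if isinstance(services, str) else services
        if "sts:AssumeRole" in actions and SAGEMAKER_PRINCIPAL in services:
            return True
    return False


trust_policy = json.loads("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}
""")
print(allows_sagemaker_assume_role(trust_policy))
```

With boto3, the policy of an existing role should be available as `boto3.client("iam").get_role(RoleName=...)["Role"]["AssumeRolePolicyDocument"]`, assuming the caller has permission to read the role.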

---

In [None]:
%pip install --upgrade --quiet sagemaker

In [None]:
import sagemaker
import boto3
import json

try:
    execution_role_arn = sagemaker.get_execution_role()
except ValueError:
    execution_role_arn = None

if execution_role_arn is None:
    execution_role_arn = input("Enter your execution role ARN: ")

model_package_map = {
    "us-west-2": "arn:aws:sagemaker:us-west-2:123488637174:model-package/mk1-flywheel-v060-llama2-7b-chat-s0",
}

region = boto3.Session().region_name
if region not in model_package_map:
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

## Deploy model

---

You can now deploy the model using SageMaker. The following instance types are supported:
- ml.g5.xlarge
- ml.g5.2xlarge
- ml.g5.4xlarge
- ml.g5.8xlarge
- ml.g5.16xlarge

**NOTE:** All of these instances have a single GPU, but the CPU count still matters for high-throughput workloads, because the CPUs handle all the incoming connections.
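
To make the CPU trade-off concrete, the sketch below lists the vCPU count of each supported instance type, taken from the public AWS G5 specifications, and picks the smallest type that meets a target. The helper `smallest_instance_with` is a hypothetical name, and how many vCPUs a given workload needs is something to measure rather than assume.

```python
# vCPU counts per instance type (public AWS G5 specifications).
# The dictionary is ordered from smallest to largest instance.
G5_VCPUS = {
    "ml.g5.xlarge": 4,
    "ml.g5.2xlarge": 8,
    "ml.g5.4xlarge": 16,
    "ml.g5.8xlarge": 32,
    "ml.g5.16xlarge": 64,
}


def smallest_instance_with(min_vcpus: int) -> str:
    """Return the smallest supported instance type with at least min_vcpus."""
    for instance_type, vcpus in G5_VCPUS.items():
        if vcpus >= min_vcpus:
            return instance_type
    raise ValueError(f"No supported instance type has {min_vcpus} vCPUs.")


print(smallest_instance_with(8))
```

The returned string can be passed as `instance_type` when deploying the model.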

---

In [None]:
model = sagemaker.ModelPackage(
    model_package_arn=model_package_arn,
    role=execution_role_arn,
)
model.deploy(initial_instance_count=1, instance_type='ml.g5.xlarge')

## Invoke the endpoint

---

The example below shows the minimal code required to run inference against the endpoint.

### Supported Parameters
The plain model API supports the following inference parameters:

- **text (str):** The initial input text provided to the model. This text serves as the context or prompt based on which the model generates additional content.
- **max_tokens (int):** The maximum number of tokens that the model will generate in response to the input text. 
- **max_input_tokens (int, default 0):** This specifies the maximum number of tokens allowed in the input text. If the input text exceeds this number, it will be truncated.
- **num_samples (int, default 1):** The number of independent completions to generate for the given input text. Each sample is generated separately and may result in different outputs.
- **eos_token_ids (List[int], default [1, 2]):** A list of token IDs that signify the end of a sequence. When the model generates one of these tokens, it considers the output complete and stops generating further tokens.
- **stop (List[str], default []):** A list of strings that end generation as soon as the model produces any one of them. The stop string is not included in the returned output.
- **temperature (float, default 1.0):** Controls the randomness of the output. A higher temperature leads to more varied output, while a lower temperature makes the model more likely to choose high-probability tokens. At a temperature of 0, the model performs greedy decoding, always picking the most likely token.
- **top_k (int, default 50):** Restricts the choice of the next token to the 'k' most likely options, based on the model's predictions. This focuses generation on the more probable tokens.
- **top_p (float, range 0 to 1, default 1.0):** Restricts the choice of the next token to the smallest set of tokens whose cumulative probability reaches 'p'. This can create more diverse and less predictable text compared to top_k.
- **presence_penalty (float, range -2 to 2, default 0.0):** This parameter adjusts the likelihood of the model introducing new topics or entities during text generation. A positive value encourages the introduction of new concepts by reducing repetition.
- **frequency_penalty (float, range -2 to 2, default 0.0):** This parameter alters the likelihood of the model repeating the same line of thought or specific words. Positive values discourage repetition, encouraging the model to introduce more varied language and ideas.
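
The parameters above can be assembled into a request body with a small helper that rejects out-of-range values before they reach the endpoint. This is a sketch under assumptions: `build_payload` and `PARAMETER_RANGES` are hypothetical names, the ranges are copied from the list above, and treating `text` and `max_tokens` as required (and `max_tokens` as positive) is an assumption based on their having no documented default.

```python
import json
from typing import Any, Dict

# Documented inclusive ranges, copied from the parameter list.
PARAMETER_RANGES = {
    "top_p": (0.0, 1.0),
    "presence_penalty": (-2.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
}


def build_payload(text: str, max_tokens: int, **params: Any) -> Dict[str, Any]:
    """Build a request payload, checking the documented parameter ranges."""
    if max_tokens <= 0:  # assumption: a non-positive limit is never intended
        raise ValueError("max_tokens must be a positive integer.")
    for name, value in params.items():
        if name in PARAMETER_RANGES:
            low, high = PARAMETER_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}].")
    return {"text": text, "max_tokens": max_tokens, **params}


# Deterministic request: temperature 0 with a stop string.
payload = build_payload(
    "Write a haiku about llamas.",
    max_tokens=64,
    temperature=0.0,
    stop=["\n\n"],
)
body = json.dumps(payload).encode("utf-8")  # bytes suitable for the request Body
print(payload)
```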

---

In [None]:
prompt = "What is the difference between a Llama and an Alpaca?"

payload = {
    'text': prompt,
    'max_tokens': 500
}

response = model.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(
    EndpointName=model.endpoint_name,
    Body=bytes(json.dumps(payload), 'utf-8'),
    ContentType="application/json",
)

# Decode the JSON body returned by the endpoint
response = json.loads(response['Body'].read().decode('utf-8'))

print(prompt)
print(response["responses"][0]["text"])
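
The cell above reads only the first completion. When `num_samples` is greater than 1, a small helper can collect every completion. The sketch below assumes that the `responses` list holds one entry per sample, which the notebook implies but does not state; `completion_texts` is a hypothetical name and `sample_response` is a hand-written stand-in, not real endpoint output.

```python
from typing import Any, Dict, List


def completion_texts(response: Dict[str, Any]) -> List[str]:
    """Collect the generated text of every completion in a decoded response."""
    return [item["text"] for item in response.get("responses", [])]


# Hand-written stand-in for a decoded response; a real response may carry more fields.
sample_response = {
    "responses": [
        {"text": "first completion"},
        {"text": "second completion"},
    ]
}

for index, text in enumerate(completion_texts(sample_response)):
    print(f"[{index}] {text}")
```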

## Run examples

---

The following examples have been adapted to match the input format used by the Llama 2 JumpStart endpoints.

---

In [None]:
from typing import Dict, List


def format_instructions(instructions: List[Dict[str, str]]) -> str:
    """Format instructions for JumpStart Llama2 endpoint.

    The model only supports 'system', 'user' and 'assistant' roles: an optional 'system' message first,
    then 'user' and 'assistant' alternating (u/a/u/a/u...). The last message must be from 'user'.
    """

    T_BSYS, T_ESYS = "<<SYS>>", "<</SYS>>\n"
    T_BINST, T_EINST = "[INST]", "[/INST]\n"
    T_BOS, T_EOS = "<s>", "</s>\n"

    prompt: List[str] = []

    if instructions[0]["role"] == "system":
        prompt.extend([T_BSYS, instructions[0]["content"], T_ESYS])
        instructions = instructions[1:]

    for user, answer in zip(instructions[0::2], instructions[1::2]):
        prompt.extend([T_BOS, T_BINST, (user["content"]).strip(), T_EINST, (answer["content"]).strip(), T_EOS])

    prompt.extend([T_BOS, T_BINST, (instructions[-1]["content"]).strip(), T_EINST])

    return "".join(prompt)


def predict(payload):
    model_payload = {
        'text': format_instructions(payload["inputs"][0]),
    }
    model_payload.update(payload["parameters"])

    response = model.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(
        EndpointName=model.endpoint_name,
        Body=bytes(json.dumps(model_payload), 'utf-8'),
        ContentType="application/json",
    )

    return json.loads(response['Body'].read().decode('utf-8'))


def print_dialog(payload, response):
    dialog = payload["inputs"][0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(f"> ASSISTANT: {response['responses'][0]['text']}")
    print("\n==================================\n")
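
The docstring of `format_instructions` describes the expected role order, but the function does not enforce it. The sketch below shows one way to check a dialog before sending it. `validate_dialog` is a hypothetical helper, and it treats the leading 'system' message as optional, matching how the examples in this notebook are written.

```python
from typing import Dict, List


def validate_dialog(instructions: List[Dict[str, str]]) -> None:
    """Raise ValueError unless roles follow the expected order.

    Expected order: an optional 'system' message, then 'user' and
    'assistant' alternating, with the last message from 'user'.
    """
    if not instructions:
        raise ValueError("The dialog is empty.")

    # Skip the optional leading system message.
    turns = instructions[1:] if instructions[0]["role"] == "system" else instructions
    if not turns:
        raise ValueError("The dialog must contain at least one 'user' message.")

    for index, message in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"Turn {index} has role {message['role']!r}, expected {expected!r}."
            )

    if turns[-1]["role"] != "user":
        raise ValueError("The last message must be from 'user'.")


dialog = [
    {"role": "system", "content": "Always answer with Haiku"},
    {"role": "user", "content": "I am going to Paris, what should I see?"},
]
validate_dialog(dialog)  # passes silently for a well-formed dialog
print("dialog is valid")
```

Calling such a check at the top of `predict` would surface malformed dialogs locally instead of producing a badly formed prompt.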

### Example 1

In [None]:
%%time

payload = {
    "inputs": [[
        {"role": "user", "content": "What is the recipe of mayonnaise?"},
    ]],
    "parameters": {"max_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predict(payload)
print_dialog(payload, response)

### Example 2

In [None]:
%%time

payload = {
    "inputs": [[
        {"role": "user", "content": "I am going to Paris, what should I see?"},
        {
            "role": "assistant",
            "content": """\
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:

1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.

These are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world.""",
        },
        {"role": "user", "content": "What is so great about #1?"},
    ]],
    "parameters": {"max_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predict(payload)
print_dialog(payload, response)

### Example 3

In [None]:
%%time

payload = {
    "inputs": [[
        {"role": "system", "content": "Always answer with Haiku"},
        {"role": "user", "content": "I am going to Paris, what should I see?"},
    ]],
    "parameters": {"max_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predict(payload)
print_dialog(payload, response)

### Example 4

In [None]:
%%time

payload = {
    "inputs": [[
        {
            "role": "system",
            "content": "Always answer with emojis",
        },
        {"role": "user", "content": "How to go from Beijing to NY?"},
    ]],
    "parameters": {"max_tokens": 512, "top_p": 0.9, "temperature": 0.6}
}
response = predict(payload)
print_dialog(payload, response)

## Clean up the endpoint

In [None]:
# Delete the SageMaker endpoint
model.sagemaker_session.delete_endpoint(model.endpoint_name)
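
Deleting the endpoint stops the instance, but the endpoint configuration and the model resource remain in the account. As an alternative to the cell above, the sketch below removes all three with the low-level SageMaker client. `delete_endpoint_resources` is a hypothetical helper; it must run while the endpoint still exists, because it looks up the dependent resources first.

```python
def delete_endpoint_resources(sm_client, endpoint_name: str) -> None:
    """Delete an endpoint together with its endpoint configuration and models.

    sm_client is a boto3 SageMaker client, for example
    boto3.client("sagemaker") or model.sagemaker_session.sagemaker_client.
    """
    # Look up the dependent resources before deleting the endpoint itself.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)


# Usage in this notebook (requires AWS credentials, so it is left commented out):
# delete_endpoint_resources(model.sagemaker_session.sagemaker_client, model.endpoint_name)
```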