# REDIRECT NOTICE

## The Lemonade SDK project has moved to https://github.com/lemonade-sdk/lemonade

The new PyPI package name is `lemonade-sdk`

Please migrate to the new repository and package as soon as possible.

For example:

```bash
pip uninstall turnkeyml
# General form: pip install lemonade-sdk[YOUR_EXTRAS]
# For example:
pip install lemonade-sdk[llm-oga-hybrid]
```
Thank you for using Lemonade SDK!

# Lemonade Tools Tutorial

This notebook is a tutorial on how to measure an LLM's performance, memory usage, accuracy, and subjective quality on Ryzen AI hardware using Lemonade (LLM-Aide) tools.

The tutorial follows this flow:
1. Lemonade Overview.
    1. Command syntax.
    2. Installing Lemonade Tools and Ryzen AI SW support.
    3. Choosing your LLM-under-test.
2. Benchmarking.
    1. Benchmark the LLM's performance.
    2. Interpreting time to first token (TTFT), tokens/second (TPS), and memory usage data.
3. Subjective Quality Testing.
    1. Prompt the LLM using its chat template.
    2. How to assess the response as a human judge.
    3. How to automatically assess the response using an LLM judge.
4. Objective Quality Testing.
    1. Overview of the LM Evaluation Harness tool.
    2. Measuring log-probability accuracy with `MMLU`.
    3. Measuring generation accuracy with `GSM8k`.

## Lemonade Overview

Lemonade (LLM-Aide) is a software development kit (SDK) that expedites measurement, validation, and deployment of LLMs. It primarily supports OnnxRuntime-GenAI (OGA)-based LLMs but also provides support for llama.cpp and Hugging Face PyTorch LLMs as performance and accuracy baselines.

This tutorial will focus on measurement and validation tasks, using the `lemonade` command line interface (CLI).

### Command Syntax

The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.

You can read each command out loud to understand what it is doing. For example, a command like this:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

Can be read like this:

> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.

The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.
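To make the Tool-chaining idea concrete, here is a minimal sketch that assembles a `lemonade` command string from an input checkpoint and an ordered list of Tools. The helper `build_lemonade_command` is hypothetical and not part of Lemonade; it only uses the flags shown in the example above. Note that `shlex.join` applies POSIX-style quoting, which may differ from what a Windows shell expects.

```python
import shlex


def build_lemonade_command(checkpoint, tools):
    """Compose a `lemonade` command from a checkpoint and an ordered list of Tools.

    `tools` is a list of (tool_name, args) pairs, where `args` is a list of
    already-split command-line tokens for that Tool.
    """
    tokens = ["lemonade", "-i", checkpoint]
    for tool_name, args in tools:
        # Each Tool name is followed by its own arguments, in order
        tokens.append(tool_name)
        tokens.extend(args)
    # shlex.join quotes any token that contains spaces or shell metacharacters
    return shlex.join(tokens)


command = build_lemonade_command(
    "microsoft/Phi-3-mini-4k-instruct",
    [
        ("oga-load", ["--device", "igpu", "--dtype", "int4"]),
        ("llm-prompt", ["-p", "Hello, my thoughts are"]),
    ],
)
print(command)
```

Because the Tools are kept in a list, swapping `llm-prompt` for another Tool only requires changing one entry.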

### Installation

Before running any cell in this notebook, the following setup steps are required:
1. Install Conda (we suggest the [Miniforge](https://github.com/conda-forge/miniforge) flavor of Conda):
    1. Download: https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Windows-x86_64.exe
    2. Double click to install. Make sure to install for `Just Me` (not `All Users`).
    3. Open the `Miniforge Prompt` app that was just installed, and use that to run the following commands.
1. Create and activate a Python 3.10 environment. For example:
    ```bash
    conda create -n hybrid python=3.10
    conda activate hybrid
    ```
1. Install Lemonade + Hybrid:

    ```bash
    pip install turnkeyml[llm-oga-hybrid]
    lemonade-install --ryzenai hybrid
    ```
1. Configure this Jupyter notebook to use the `hybrid` environment as its Python kernel.

Additional backends, such as CPU-only and DirectML-only, are also available: https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/README.md#installing-from-pypi

### Environment Configuration

These commands customize the Lemonade environment for the experiments we will run in this notebook.

In [None]:
import os

# This improves the signal/noise ratio of Lemonade outputs in the notebook
os.environ["TURNKEY_BUILD_MONITOR"] = "False"
# This places all output data in the same directory as this notebook
# It can be customized on a per-experiment basis:
# For example, this would put the data in a new folder called `my-experiment-100`:
#    cache_dir = "./my-experiment-100"
cache_dir = "./tutorial-cache"
os.environ["LEMONADE_CACHE_DIR"] = cache_dir

### Choosing an LLM-Under-Test

Use the following code block to customize which LLM and device will be used for this tutorial. The table below links device names to Hugging Face model collections filled with compatible models. All of these models use the int4 data type.

| Device   | Collection                                                                                                        |
| -------- | ----------------------------------------------------------------------------------------------------------------- |
| `hybrid` | [Hybrid Collection](https://huggingface.co/collections/amd/ryzenai-14-llm-hybrid-models-67da31231bba0f733750a99c) |
| `npu`    | [NPU Collection](https://huggingface.co/collections/amd/ryzenai-14-llm-npu-models-67da3494ec327bd3aa3c83d7)       |
| `cpu`    | [CPU Collection](https://huggingface.co/collections/amd/oga-cpu-llm-collection-6808280dc18d268d57353be8)          |
| `igpu`   | [GPU Collection](https://huggingface.co/collections/amd/ryzenai-oga-dml-models-67f940914eee51cbd794b95b)          |



In [None]:
checkpoint = "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid"
device = "hybrid"
DTYPE = "int4"

## Benchmarking

This section shows you how to benchmark your LLM. Our goal is to measure the following 3 properties of the LLM-under-test:
1. Time to first token (TTFT): the amount of time the user has to wait for the LLM to prefill the prompt, before returning the first response token.
2. Tokens per second (TPS): the number of tokens the LLM outputs to the user each second, after the first token.
3. Memory utilization (GB): the amount of RAM required to hold the LLM in memory and calculate the response to the prompt.
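As a worked illustration of the first two definitions, the sketch below derives TTFT and TPS from token arrival timestamps. This is not how `oga-bench` is implemented; the function name and the timestamps are made up for the example.

```python
def summarize_generation(request_time, token_times):
    """Compute TTFT (seconds) and TPS (tokens/second) from wall-clock timestamps.

    request_time: when the prompt was submitted.
    token_times: arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    # TTFT: the wait between submitting the prompt and seeing the first token
    ttft = token_times[0] - request_time
    # TPS counts only the tokens produced after the first one
    decode_tokens = len(token_times) - 1
    decode_time = token_times[-1] - token_times[0]
    tps = decode_tokens / decode_time if decode_time > 0 else float("nan")
    return ttft, tps


# Hypothetical timestamps (seconds): prompt sent at t=0, first token at 0.5 s,
# then one more token every 0.05 s
times = [0.5 + 0.05 * i for i in range(5)]
ttft, tps = summarize_generation(0.0, times)
print(f"TTFT: {ttft:.2f} s, TPS: {tps:.1f}")
```

Keeping the first token out of the TPS calculation separates prefill cost (captured by TTFT) from steady-state generation speed.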

This section will leverage the following Lemonade commands:
1. `oga-load`: load an OGA LLM into memory.
2. `oga-bench`: benchmark an OGA LLM.
3. `report`: print the outcome of the experiment to the screen and save it to disk in a CSV file.

### Benchmark Command

Benchmarking an OGA LLM requires the `oga-load` and `oga-bench` commands. Both of these commands are configurable.

#### `oga-load` Configuration
The `oga-load` tool has settings to help you load your target model:
- The `-i` (input) argument from the `lemonade` command is passed into `oga-load`, and determines which LLM to load. We will pass the name of a Hugging Face checkpoint for a pre-optimized OGA LLM.
- `--device`: which device (e.g., `hybrid`, `cpu`) to load the LLM on (each OGA checkpoint is only compatible with one device).
- `--dtype`: the data type of the LLM's weights in memory (each OGA checkpoint only supports one data type).

#### `oga-bench` Configuration
The `oga-bench` tool has settings to customize the benchmarking experiment:
- `--prompts` (set from `input_sequence_lengths` below): list of input sizes to the model. Performance data will be collected for each item in the list.
- `--output-tokens` (set from `output_sequence_length`): number of output tokens to produce during generation.
- `--iterations` (set from `iterations`): how many times to repeat the experiment. Higher values take longer to run, but produce a more accurate result.
- `--warmup-iterations` (set from `warmup`): iterations that are not counted towards the experimental data. Used to warm up the system, such as the cache.

> Note: we're setting warmup to 0 in this tutorial to save demonstration time, but a typical value would be 5-10.
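The relationship between warmup iterations and measured iterations can be sketched with a generic timing loop. This is an illustration of the idea only, not `oga-bench`'s actual implementation; the `benchmark` helper and the stand-in workload are assumptions made for the example.

```python
import statistics
import time


def benchmark(fn, iterations, warmup_iterations=0):
    """Time `fn` and return the mean and standard deviation of the measured runs.

    Warmup runs execute `fn`, but their timings are discarded.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    for _ in range(warmup_iterations):
        fn()
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    mean = statistics.mean(durations)
    # A standard deviation needs at least two samples
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, stdev


# Stand-in workload; in practice this would be one generation call
calls = []
mean_s, stdev_s = benchmark(
    lambda: calls.append(sum(range(10_000))), iterations=5, warmup_iterations=2
)
print(f"{len(calls)} total calls, 5 of them measured, mean {mean_s * 1e6:.1f} us")
```

More measured iterations tighten the estimate of the mean, while warmup iterations keep one-time costs (such as cold caches) out of the data.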

In [None]:
input_sequence_lengths = "256 512 1024 2048"
output_sequence_length = 64
iterations = 5
warmup = 0

!lemonade -i {checkpoint} oga-load \
    --device {device} \
    --dtype {DTYPE} \
    oga-bench \
    --prompts {input_sequence_lengths} \
    --output-tokens {output_sequence_length} \
    --iterations {iterations} \
    --warmup-iterations {warmup}
    

### Interpreting the Results

The benchmark command outputs a lot of data to the terminal and also saves that data to the Lemonade Cache on disk.

The `report` command can help us visualize the data in a table format. We are customizing the `report` tool with these settings:
- `-i`: which Lemonade Cache to report on. We will use the tutorial cache directory we set up at the start of the notebook.
- `--no-save`: tells the `report` tool to not save a CSV file of the data to disk (since we are not using it in this tutorial).
- `--perf`: include performance information in the table printed to the screen.
- `--lean`: include minimal other information in the table printed to the screen, for a cleaner presentation.

> Note: a lot of additional data is saved to the cache, such as system information, that is not printed here.

In [None]:
!lemonade report -i {cache_dir} --no-save --perf --lean

## Subjective Quality Testing

This section shows you how to prompt the LLM-under-test and get the response. It is complementary to objective quality testing (the next section of this tutorial) because it lets us see how the LLM will react to basic real-world scenarios.

Subjective testing helps us quickly identify undesirable behaviors such as rambling responses, responses that erroneously include special tokens, etc.

We will also cover how to use a local LLM judge to automatically assess whether the model's response is reasonable.

### Prompt Command

The `llm-prompt` command sends your prompt to the LLM under test and then prints the response to the screen.

Options:
- Feel free to customize the `prompt` variable to anything you like.
- `--template` applies the model's chat template to the prompt, which improves output quality.
- `--max-new-tokens` limits the amount of output the LLM is allowed to produce.

> Note: The output you see will include the full prompt the LLM sees, including all special template tokens. 

In [None]:
prompt = "What is the capital of France?"

prompt_cmd_output = !lemonade -i {checkpoint} \
                        oga-load --device {device} --dtype {DTYPE} \
                        llm-prompt --template --max-new-tokens 64 -p "{prompt}"

print(prompt_cmd_output.n)

### Subjective Response Validation

One of the easiest ways to validate an LLM is to make sure that it responds to simple questions in a clear and concise way.

- In the previous code block, we prompted: "What is the capital of France?"

- The response should be something like: "The capital of France is Paris."

If you got a clear and concise response like that, consider the response a success ✅. If the response is rambling or nonsensical, consider it a fail ❌.
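Before involving a human or an LLM judge, a few of these failure modes can be screened with simple rules. The sketch below is a rough heuristic, not a Lemonade feature: the `quick_check` helper, the `<|...|>` token pattern, and the 50-word threshold are all assumptions chosen for illustration.

```python
import re

# Matches template markers of the form <|...|>, e.g. <|start_header_id|>
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|[^|>]+\|>")


def quick_check(response, max_words=50):
    """Return a list of problems found in `response` (an empty list means it passed)."""
    problems = []
    if not response.strip():
        problems.append("empty response")
    if SPECIAL_TOKEN_PATTERN.search(response):
        problems.append("contains special template tokens")
    if len(response.split()) > max_words:
        problems.append(f"longer than {max_words} words (possible rambling)")
    return problems


print(quick_check("The capital of France is Paris."))
print(quick_check("Paris.<|eot_id|><|start_header_id|>assistant"))
```

A check like this only catches obvious formatting failures; it says nothing about whether the answer is correct.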

### Automatic Response Validation

If you are validating LLMs at scale, you may want to automatically assess the response quality. The following code extracts the prompt and response from the previous cell and feeds it into an LLM judge, which assesses the response quality.

#### Accessing the Lemonade Database
The `Stats` class is useful for extracting experimental data from a specific `lemonade` command. In this case, we will use it to programmatically access the `prompt` and `response` from the last command. Later, we will also use `Stats` to store our analysis back to the database.

The `Stats` class requires the cache directory and build name of the command in question, which we will obtain by parsing the output of the last command. From there, we can load up the `Stats` from that command and access the `prompt` and `response` values.

In [None]:
from turnkeyml.common.filesystem import Stats

# Parse the output of the last cell to get the build directory
full_cache_dir, build_name = (
    next(line for line in prompt_cmd_output if "Build dir:" in line)  # find the line
    .split("Build dir:")[1].strip()                                   # drop the label
    .rsplit("\\builds\\", 1)                                          # split off last segment
)

# Load up the stats dictionary from the prompt command so that we can access them
stats_handle = Stats(cache_dir=full_cache_dir, build_name=build_name)
stats_dict = stats_handle.stats

prompt = stats_dict["prompt"]
response = stats_dict["response"]

# Print the values to make sure we captured them correctly
print("Prompt:\n", prompt)
print("Response:\n", response)
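The parsing step above relies on string splitting. As an alternative sketch, `pathlib.PureWindowsPath` can split the same kind of line and also check that the path has the expected `builds` folder. The helper name and the example path are hypothetical; the `Build dir:` label and the `\builds\` segment are taken from the cell above.

```python
from pathlib import PureWindowsPath


def split_build_dir(line):
    """Split a 'Build dir: <cache>\\builds\\<name>' line into (cache_dir, build_name)."""
    path = PureWindowsPath(line.split("Build dir:", 1)[1].strip())
    # The build folder is expected to sit directly inside a `builds` folder
    if path.parent.name != "builds":
        raise ValueError(f"unexpected build directory layout: {path}")
    return str(path.parent.parent), path.name


# Hypothetical output line; the real path depends on your cache location
example = r"Build dir: C:\work\tutorial-cache\builds\my_build_name"
print(split_build_dir(example))
```

`PureWindowsPath` does not touch the filesystem, so this parsing behaves the same regardless of the operating system running the notebook.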



#### Starting Lemonade Server

Lemonade Server is a tool that loads an LLM into a separate process and allows us to interact with it over a high-level OpenAI `Chat Completions` API.

Let's break that down a little:
- The OpenAI API is an industry standard protocol for communicating with server processes.
- The `openai.OpenAI` class is a convenient Python client for formatting requests and parsing the responses.
- The [Chat Completions API](https://platform.openai.com/docs/guides/text?api-mode=chat) allows for back-and-forth messaging with the served LLM.

We will use the server to load up our LLM judge, `Llama-3.1-8B-Instruct-Hybrid`, and interact with it.

The next cell will start Lemonade Server as a subprocess that will run alongside this notebook, until we shut it down. Note that the last cell in the notebook runs `!lemonade-server-dev stop`, which shuts down this process. Make sure to run that cell when you are done with the notebook!

> Note: the OpenAI API was originally designed for communication with LLM servers in the datacenter, but it has since been adopted by the community for server processes that run right on the local PC as well.

In [None]:
import subprocess
import time
from lemonade_server.cli import status

# Start the lemonade-server-dev serve command in a non-blocking manner
subprocess.Popen(['lemonade-server-dev', 'serve'])

# Wait until the server process is ready
while not status()[0]:
    time.sleep(5)
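The polling loop above waits indefinitely if the server never becomes ready. A bounded variant could look like the following sketch. The `wait_until` helper is hypothetical, and the readiness check here is a stand-in so that the example runs without a server; in the notebook it could be replaced with `lambda: status()[0]`.

```python
import time


def wait_until(is_ready, timeout_s=120.0, poll_interval_s=5.0):
    """Poll `is_ready()` until it returns True; raise TimeoutError after `timeout_s` seconds."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        time.sleep(poll_interval_s)


# Stand-in readiness check that succeeds on the third poll
attempts = []


def fake_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(fake_ready, timeout_s=1.0, poll_interval_s=0.01)
print(f"ready after {len(attempts)} polls")
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted while waiting.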

#### Seeking Judgement

Now that Lemonade Server is available, we can ask it to judge our LLM-under-test's response.

We will use a `system prompt` to give specific instructions to the LLM judge, ensuring that it returns its response in the form of a JSON object. This approach will make it easy to parse the judgement data and save it back to our database later.

Then, we will send our request for judgement as an OpenAI API chat completions request using the `OpenAI` Python library and parse the response.

> Note: if you already have the `Llama-3.1-8B-Instruct-Hybrid` model downloaded on your system, this step should only take 5-10 seconds. However, if you don't already have the model this step will download it for you (~8GB), which can take a few minutes.

In [None]:
# Provide a system prompt that will help the LLM judge give us an easy-to-parse response
system_prompt = """
<instructions> You are a judge that evaluates prompt/response pairs from an LLM under test. Your job is to determine if the response is reasonable, concise, and appropriate for a small local LLM.
Specifically, check for:

1. No insertion of special tokens or role markers (e.g., "assistant", "<|start_header_id|>", etc.) in the response.
2. The reply should not ramble or continue after giving a direct answer.
3. The LLM must not have a conversation with itself, repeat roles, or include multiple "assistant" tokens or role markers in the response.
4. The response should directly answer the prompt and not include unnecessary information.

If any of these issues are present, mark the response as invalid and explain the reason.

Return your answer as a JSON object: {"valid": bool, "detail": str} </instructions>
"""

# Provide the prompt/response pair to the judge
user_prompt = f"""
LLM's Prompt: {prompt}

LLM's Response: {response}

"""

messages = [
    {"role":"system", "content":system_prompt},
    {"role":"user", "content":user_prompt},
]

# Use the OpenAI API to send the messages to the LLM judge
from openai import OpenAI

base_url = "http://localhost:8000/api/v0"

client = OpenAI(
    base_url=base_url,
    api_key="lemonade",  # required, but unused
)

completion = client.chat.completions.create(
    model="Llama-3.1-8B-Instruct-Hybrid",
    messages=messages,
    max_completion_tokens=128,
)

judgement = completion.choices[0].message.content

# Print the response (which should be a JSON object)
print(judgement)

In [None]:
import json

# Parse the judge's response, which should be a JSON object
decoded_json = json.loads(judgement)

# Print the parsed values
print("The local LLM judge says the response is reasonable:", decoded_json["valid"])
print("and offers this explanation:", decoded_json["detail"])

# Save the results to the Stats object, so that they become part of the results database
stats_handle.save_stat("llm_judgement",decoded_json["valid"])
stats_handle.save_stat("llm_judgement_reason",decoded_json["detail"])
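Calling `json.loads` directly assumes the judge returns nothing but a JSON object. LLMs sometimes wrap the object in extra text, so a more forgiving parser may be useful. The sketch below is one possible approach, using the standard library's `json.JSONDecoder.raw_decode`; the helper name and the sample judge output are made up for illustration.

```python
import json


def extract_json_object(text):
    """Return the first JSON object found in `text`, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in judge output")


# Hypothetical judge output with surrounding chatter
raw = 'Here is my verdict:\n{"valid": true, "detail": "Direct and concise."}\nThanks!'
verdict = extract_json_object(raw)
print(verdict["valid"], "-", verdict["detail"])
```

If no object can be parsed, the `ValueError` makes it easy to record the judgement as failed rather than crashing a large validation run.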

## Objective Quality Testing

Our final set of experiments for our LLM-under-test is objective accuracy testing.

We will use [LM-Evaluation-Harness](https://github.com/EleutherAI/lm-evaluation-harness) (often called `lm-eval`), an open-source framework for evaluating language models across a wide variety of tasks and benchmarks. Developed by EleutherAI, it has become a standard tool in the AI research community for consistent evaluation of language model capabilities.

`lm-eval` works with OpenAI API-compatible servers, like the Lemonade Server we started in the last section.

### Loading the LLM-Under-Test

In this section, we'll load our LLM-under-test onto the Lemonade Server process we started in the last section.

To accomplish this, we'll use Lemonade Server's `load` endpoint, which we will access using the Python `requests` library. More documentation about Lemonade Server endpoints is available [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/server_spec.md).

The `load` endpoint allows any checkpoint to be loaded into the server; we just have to provide Lemonade Server with a `recipe` that lets it know which framework and device to use. Lemonade Recipe documentation is available [here](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/lemonade_api.md).

In [None]:
import requests
import json

# Note: change this to `hf-{device}` if you are using Hugging Face as your framework 
recipe = f"oga-{device}"
payload = {"checkpoint": checkpoint, "recipe": recipe}

response = requests.post(
    "http://localhost:8000/api/v0/load",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload)
)

# Make sure the correct model loaded
print(response.text)

### Log Probability Testing with MMLU

These tests evaluate a model's ability to assign probabilities to different possible answers. The model predicts which answer is most likely based on conditional probabilities.

In MMLU (Massive Multitask Language Understanding), the model is given a multiple-choice question and must assign probabilities to each answer choice. The model's performance is measured by how often it assigns the highest probability to the correct answer.
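The scoring rule described above can be sketched in a few lines. This is a simplified illustration, not `lm-eval`'s implementation, and the log-probability values are invented for the example.

```python
def pick_answer(choice_logprobs):
    """Return the answer label with the highest log-probability."""
    return max(choice_logprobs, key=choice_logprobs.get)


def accuracy(questions):
    """Fraction of questions where the top-scoring choice matches the correct label."""
    correct = sum(pick_answer(q["logprobs"]) == q["answer"] for q in questions)
    return correct / len(questions)


# Hypothetical log-probabilities for two four-choice questions
questions = [
    {"logprobs": {"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.9}, "answer": "B"},
    {"logprobs": {"A": -0.9, "B": -1.2, "C": -1.0, "D": -2.5}, "answer": "C"},
]
print(accuracy(questions))
```

In the second question the model narrowly prefers `A` over the correct `C`, so it counts as wrong even though the two scores are close.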

Options:
- `model`: `local-completions` tells `lm-eval` to use a local LLM server, like Lemonade Server.
- `model_args`: this points to our Lemonade Server process and tells `lm-eval` which model we have loaded up, and how to access it.
- `tasks`: these are the accuracy tests that will be run. Right now we just have one MMLU subject, `mmlu_abstract_algebra`.
- `limit`: run only the first N questions in each task.

> Note: this test takes about 2 minutes to run with a limit of 5. In a real-world testing scenario, we would remove the `limit` argument, which would run all questions in the subject(s). We would also suggest running multiple MMLU subjects besides `mmlu_abstract_algebra` to gather more accuracy data.

This command will print out a table of accuracy results. Since we are running a small number of questions (for the sake of demonstration time), we may not see a very high accuracy score.

In [None]:
model_args = f"model={checkpoint},base_url=http://localhost:8000/api/v0/completions,num_concurrent=1,max_retries=0,tokenized_requests=False"

!lm_eval \
    --model local-completions \
    --model_args {model_args} \
    --tasks mmlu_abstract_algebra \
    --limit 5
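The `model_args` value in the cell above is a comma-separated list of `key=value` pairs. If you want to vary these settings across experiments, building the string from a dictionary can be less error-prone. The `format_model_args` helper below is hypothetical; the keys and values are the ones used in this notebook, and the sketch assumes that no value contains a comma.

```python
def format_model_args(args):
    """Join a dict into the comma-separated `key=value` string passed to `--model_args`."""
    return ",".join(f"{key}={value}" for key, value in args.items())


model_args_example = format_model_args(
    {
        "model": "amd/Llama-3.2-1B-Instruct-awq-g128-int4-asym-fp16-onnx-hybrid",
        "base_url": "http://localhost:8000/api/v0/completions",
        "num_concurrent": 1,
        "max_retries": 0,
        "tokenized_requests": False,
    }
)
print(model_args_example)
```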

### Generation Testing with GSM8k

These tests evaluate a model's ability to generate full responses to prompts. The model generates text that is then evaluated against reference answers or using specific metrics.

In GSM8K (Grade School Math), the model is given a math problem and must generate a step-by-step solution. Performance is measured by whether the final answer is correct.
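To illustrate what "whether the final answer is correct" can mean in practice, the sketch below compares the last number in a generated solution with the number that follows `####` in a GSM8K-style reference answer. This is a simplified stand-in for the answer extraction that `lm-eval` performs, not its actual code; the helper names and the sample texts are assumptions for the example.

```python
import re

# A number with optional sign, thousands separators, and decimal part
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(generated, reference):
    """Compare the last number in the generation to the number after '####' in the reference."""
    target = final_number(reference.split("####")[-1])
    return target is not None and final_number(generated) == target


# Made-up generation and reference in the GSM8K answer format
generated = "She sells 16 - 3 - 4 = 9 eggs a day. At $2 each, that is 9 * 2 = $18."
reference = "16 - 3 - 4 = 9 eggs. 9 * 2 = 18 dollars. #### 18"
print(is_correct(generated, reference))
```

Only the final number is compared, so a solution with flawed intermediate steps can still be scored as correct, and a correct solution that ends with an unrelated number can be scored as wrong.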

> Note: this test takes about 2 minutes to run with a limit of 5. In a real-world testing scenario, we would remove the `limit` argument, which would run all questions in the test.


In [None]:
!lm_eval \
    --model local-completions \
    --model_args {model_args} \
    --tasks gsm8k \
    --limit 5

In [None]:
# Stop the Lemonade Server process that we started earlier in the notebook
!lemonade-server-dev stop