# LLMs on RyzenAI with TurnkeyML

This notebook will demonstrate how to bring up an example application that uses a RyzenAI processor to perform LLM inference. We will use the `turnkeyml.llm` APIs to make this as quick as possible. This notebook makes use of both the `RyzenAI NPU` and the `RyzenAI Radeon integrated GPU (iGPU)`.

## Application

Our example application will prompt the user for input and then return the LLM's response. This is the same technology stack used to create AMD GAIA, which shows how Retrieval Augmented Generation (RAG), agentic workflows, and other advanced techniques can be layered on top of RyzenAI.

In [None]:
# Define our application: prompt the user and print the LLM's response
# We define this in a way that makes the NPU and iGPU interchangeable

def application(model, tokenizer):
    while True:
        # Prompt the user
        user_prompt = input("What is your prompt to the LLM? ")
        print("Prompt:", user_prompt)

        # Exit the application if the user prompts "exit"
        if user_prompt == "exit":
            print("Done!")
            return

        # Tokenize the user's prompt
        input_ids = tokenizer(user_prompt, return_tensors="pt").input_ids

        # Generate the response
        # Limit the response length to 30 tokens so that we have time to
        # try a few prompts
        llm_response = model.generate(input_ids, max_new_tokens=30)

        # Decode the response into text
        decoded_response = tokenizer.decode(llm_response[0])

        # Print the response, then prompt for another input
        print("Response:", decoded_response)
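The `application` function relies only on a small contract: the tokenizer is callable as `tokenizer(text, return_tensors="pt")` and returns an object with `.input_ids`, the model provides `generate(input_ids, max_new_tokens=...)`, and `tokenizer.decode` turns one generated sequence back into text. That contract is what lets the NPU and iGPU objects be swapped. As a sketch, the loop below replaces `input()` with a list of prompts and runs against hypothetical stand-ins, so it can be tried without RyzenAI hardware. `EchoTokenizer`, `EchoModel`, and `run_prompts` are illustrative names, not part of `turnkeyml.llm`, and the stand-ins use plain Python lists where the real objects use PyTorch tensors.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Encoding:
    # Mirrors the `.input_ids` attribute that `application` reads
    input_ids: List[List[int]]


class EchoTokenizer:
    """Hypothetical stand-in tokenizer: one token per character (its code point)."""

    def __call__(self, text: str, return_tensors: str = "pt") -> Encoding:
        # return_tensors is accepted only for signature compatibility; lists are returned
        return Encoding(input_ids=[[ord(ch) for ch in text]])

    def decode(self, token_ids: Sequence[int]) -> str:
        return "".join(chr(t) for t in token_ids)


class EchoModel:
    """Hypothetical stand-in model: "generates" by repeating the prompt tokens."""

    def generate(self, input_ids: List[List[int]], max_new_tokens: int = 30) -> List[List[int]]:
        prompt = input_ids[0]
        new_tokens = prompt[:max_new_tokens]
        # Return the prompt followed by the new tokens, one sequence per batch row
        return [prompt + new_tokens]


def run_prompts(model, tokenizer, prompts: Sequence[str], max_new_tokens: int = 30) -> List[str]:
    """Same steps as `application`, but prompts come from a list instead of input()."""
    responses = []
    for user_prompt in prompts:
        if user_prompt == "exit":
            break
        input_ids = tokenizer(user_prompt, return_tensors="pt").input_ids
        llm_response = model.generate(input_ids, max_new_tokens=max_new_tokens)
        responses.append(tokenizer.decode(llm_response[0]))
    return responses


print(run_prompts(EchoModel(), EchoTokenizer(), ["hi", "exit", "never reached"]))
# prints ['hihi']
```

Because `run_prompts` uses the same calls as `application`, it could also be given `npu_model`/`npu_tokenizer` or `igpu_model`/`igpu_tokenizer` once they are loaded, for a scripted rather than interactive run.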


## RyzenAI NPU Model Initialization

### Prerequisites for NPU

- The `ryzenai-transformers` conda environment is installed and activated.
- Access to `meta-llama/Llama-2-7b-chat-hf` on Hugging Face.
- Install TurnkeyML-LLM in your `ryzenai-transformers` environment; see https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm/README.md#install-ryzenai-npu
- Also `pip install jupyter` in your `ryzenai-transformers` environment.

### Starting Up

- Run `conda activate ryzenai-transformers`
- Run `setup_phx.bat` on a PHX (RyzenAI 7000) system or `setup_stx.bat` on an STX (RyzenAI 300) system
- Run `jupyter notebook`
- Copy the URL printed by the previous command and use it as the kernel for this notebook.
  - Example:
    > Or copy and paste one of these URLs:
    >
    >      http://localhost:8888/tree?token=14796c43ce39ef9a3296b7c7c26335e01f7bdc8b0fd4efce
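Once the notebook is connected, a quick sanity check can catch a kernel that was started from the wrong environment. The sketch below assumes that `conda activate` exported `CONDA_DEFAULT_ENV` into the shell that launched `jupyter notebook`, and it only checks that the `turnkeyml` package can be found. `check_environment` is a hypothetical helper, not a TurnkeyML API, and it does not verify that `setup_phx.bat`/`setup_stx.bat` ran or that the NPU is available.

```python
import importlib.util
import os


def check_environment(expected_env="ryzenai-transformers", modules=("turnkeyml",)):
    """Hypothetical helper: return a list of problems; an empty list means the checks passed."""
    problems = []

    # `conda activate <name>` sets CONDA_DEFAULT_ENV to the active environment's name
    active = os.environ.get("CONDA_DEFAULT_ENV")
    if active != expected_env:
        problems.append(f"active conda env is {active!r}, expected {expected_env!r}")

    # find_spec returns None when a top-level module cannot be located
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"module {name!r} is not importable")

    return problems


for problem in check_environment():
    print("WARNING:", problem)
```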


In [None]:
# Import the turnkey APIs
from turnkeyml.llm import leap

# Load the model onto the RyzenAI NPU
# NOTE: this takes a couple of minutes, but after you've done it once
#   you can keep reusing the `npu_model` and `npu_tokenizer` instances in subsequent notebook cells
npu_model, npu_tokenizer = leap.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", recipe="ryzenai-npu"
)
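Because loading takes minutes, it can help to make re-running the load cell cheap. A minimal sketch using Python's `functools.lru_cache` remembers the `(model, tokenizer)` pair per device for the life of the kernel. The `RECIPES` table copies the checkpoint/recipe pairs used in this notebook; `load_model` and its injectable `from_pretrained` argument are hypothetical names, added so the cache can be exercised without the real `leap` backend.

```python
from functools import lru_cache

# Checkpoint and recipe pairs copied from this notebook's load cells
RECIPES = {
    "npu": ("meta-llama/Llama-2-7b-chat-hf", "ryzenai-npu"),
    "igpu": ("microsoft/Phi-3-mini-4k-instruct", "oga-dml-igpu"),
}


@lru_cache(maxsize=None)
def load_model(device, from_pretrained=None):
    """Hypothetical helper: load (model, tokenizer) for `device` once, then reuse it."""
    checkpoint, recipe = RECIPES[device]
    if from_pretrained is None:
        # Imported lazily so this cell can be defined without turnkeyml installed
        from turnkeyml.llm import leap

        from_pretrained = leap.from_pretrained
    return from_pretrained(checkpoint, recipe=recipe)
```

In a notebook cell, `npu_model, npu_tokenizer = load_model("npu")` would then load once and return the cached pair on later calls. `lru_cache` keys on the exact arguments, so `load_model("npu")` and `load_model(device="npu")` are cached separately.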

## NPU Application

In [None]:
# Run the application on NPU
# Enter "exit" at the prompt to stop the application
application(npu_model, npu_tokenizer)

## Radeon iGPU Initialization

### Prerequisites for iGPU

- `turnkeyml[llm-oga-dml]` is installed into an activated conda environment.
- Download a copy of `Phi-3-mini` (the cell below loads `microsoft/Phi-3-mini-4k-instruct`).
- See https://github.com/onnx/turnkeyml/tree/main/src/turnkeyml/llm/README.md#install-onnxruntime-genai for details

In [None]:
# Import the turnkey APIs
from turnkeyml.llm import leap

# Load the model onto the Radeon iGPU
igpu_model, igpu_tokenizer = leap.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", recipe="oga-dml-igpu"
)

## Radeon iGPU Application

In [None]:
# Run the application on iGPU
# Enter "exit" at the prompt to stop the application
application(igpu_model, igpu_tokenizer)