# Python LLM (Large Language Model) Tutorial

This tutorial demonstrates how to use the LLM Python API for running Large Language Models on Hailo hardware.

The LLM API provides streaming and non-streaming text generation, context management, and generation parameters for fine-grained control over model outputs.

**Best Practice: Structured Prompts**
This tutorial uses **structured prompts** (a list of role/content messages) throughout; the only raw prompt appears in the raw-vs-structured comparison section. Structured prompts provide better control and consistency, and make effective use of the model's chat template.
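
As a minimal sketch of what a structured prompt looks like in code, the helper below assembles the list of role/content dictionaries used in the cells of this tutorial. `build_prompt` is a hypothetical convenience function, not part of the Hailo API.

```python
from typing import Dict, List, Optional


def build_prompt(user: str, system: Optional[str] = None) -> List[Dict[str, str]]:
    """Return a structured prompt: a list of {"role", "content"} messages."""
    messages = []
    if system:
        # An optional system message sets the assistant's behavior.
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user})
    return messages


prompt = build_prompt("Explain machine learning in 2 sentences.",
                      system="You are a helpful AI assistant.")
```

The resulting `prompt` has the same shape as the `structured_prompt` lists passed to `llm.generate()` and `llm.generate_all()` later on.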

**Best Practice: Context Managers**
This tutorial creates `VDevice` and `LLM` without context managers so that the same objects can be shared across notebook cells. In your own code, create both with `with` statements whenever possible. When not using `with`, call `release()` on the LLM and on the VDevice to clean up resources, as the last cell of this tutorial does.
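
Outside a notebook, the recommended pattern might look like the sketch below. It assumes, as stated above, that both `VDevice` and `LLM` can be used in `with` statements, and it reuses the constructor arguments from the setup cell. `generate_once` is a hypothetical helper name.

```python
def generate_once(model_path, prompt, lora_name="", max_generated_tokens=50):
    """Run one non-streaming generation and release all resources on exit."""
    # Imported lazily so this sketch can be defined without the Hailo runtime.
    from hailo_platform import VDevice
    from hailo_platform.genai import LLM

    # Both objects are released automatically when their 'with' block exits,
    # even if generation raises an exception.
    with VDevice() as vdevice:
        with LLM(vdevice, model_path, lora_name) as llm:
            return llm.generate_all(prompt, max_generated_tokens=max_generated_tokens)
```

Because the LLM is created inside the VDevice block, it is released first, which mirrors the release order used at the end of this tutorial.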

**Requirements:**

* Run the notebook inside the Python virtual environment: ```source hailo_virtualenv/bin/activate```
* An LLM HEF file (Hailo Executable Format for Large Language Models)
* Optional: LoRA (Low-Rank Adaptation) name for fine-tuned models inside this HEF

**Memory Optimization (Optional):**

* For large models that may exceed device memory, enable client-side tokenization
* Requires libhailort to be compiled with `HAILO_BUILD_CLIENT_TOKENIZER=ON`
* Requires Rust toolchain (cargo, rustup) to be installed on the build machine
* Set `OPTIMIZE_MEMORY_ON_DEVICE = True` in the configuration section below

**Tutorial Structure:**

* Basic LLM initialization and simple generation (streaming vs. non-streaming)
* Generation parameters (temperature, top_p, top_k, etc.)
* Context management for multi-turn conversations
* Advanced features: templates, tokenization, stop tokens

When inside the ```virtualenv```, use the command ``jupyter-notebook <tutorial-dir>`` to open a Jupyter server that contains the tutorials (default folder on GitHub: ``hailort/libhailort/bindings/python/platform/hailo_tutorials/notebooks/``).



In [None]:
# LLM Tutorial: Setup and Configuration

from hailo_platform import VDevice
from hailo_platform.genai import LLM

# Configuration - Update these paths for your setup
MODEL_PATH = "/your/hef/path/llm.hef"  # Update this path
LORA_NAME = ""  # Optional: specify LoRA adapter name

print("Model path: {}".format(MODEL_PATH))
print("LoRA adapter: {}".format(LORA_NAME if LORA_NAME else 'None'))
vdevice = VDevice()
print("Initializing LLM... this may take a moment...")
llm = LLM(vdevice, MODEL_PATH, LORA_NAME)
print("LLM initialized successfully!")


## Streaming Generation Example
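
With streaming, tokens are yielded one at a time as strings, so they can be shown immediately and still be collected into the full response. The sketch below is one way to do both; `stream_and_collect` is a hypothetical helper that works on any iterable of strings, and a plain list stands in for a real generation object.

```python
def stream_and_collect(tokens, on_token=None):
    """Handle each token as it arrives and return the full text."""
    pieces = []
    for token in tokens:
        if on_token is None:
            # Default behavior: print the token immediately, without a newline.
            print(token, end="", flush=True)
        else:
            on_token(token)
        pieces.append(token)
    return "".join(pieces)


# Stand-in for a real generation object:
text = stream_and_collect(["Machine", " learning", " is", " fun."])

# With a real model (requires Hailo hardware):
# with llm.generate(structured_prompt, max_generated_tokens=50, seed=31) as generation:
#     text = stream_and_collect(generation)
```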



In [None]:
# Structured prompt (recommended approach)
structured_prompt = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain machine learning in 2 sentences."}
]

with llm.generate(structured_prompt, max_generated_tokens=50, seed=31) as generation:
    for token in generation:
        print(token, end="", flush=True)



## Non-Streaming Generation Example



In [None]:
print(llm.generate_all(structured_prompt, max_generated_tokens=50, seed=31))


## Multi-Turn Conversations Example



In [None]:
# Clear context
llm.clear_context()

# Turn 1: Introduction
conversation_1 = [
    {"role": "system", "content": "You are a helpful tutor."},
    {"role": "user", "content": "Hi! I'm learning Python. What's a list?"}
]

print(llm.generate_all(conversation_1, max_generated_tokens=60))
print()

# Turn 2: Follow-up (context maintained automatically)
followup = [
    {"role": "user", "content": "Can you show me an example?"}
]

print(llm.generate_all(followup, max_generated_tokens=60, seed=6))



## Context Management Example
Context is maintained automatically between generation calls (`generate()` and `generate_all()`).
Use `clear_context()` to start a fresh conversation.
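
One way to make the context handling explicit is to wrap each turn in a small helper that decides whether to reset the context first. The sketch below uses only `clear_context()` and `generate_all()` as they appear in this tutorial; the helper name `ask` and its parameters are hypothetical.

```python
def ask(llm, user_text, system=None, new_conversation=False, max_generated_tokens=60):
    """Send one user turn, optionally resetting the context first."""
    if new_conversation:
        # Drop everything the model remembers from earlier turns.
        llm.clear_context()
    messages = []
    if system is not None:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_text})
    return llm.generate_all(messages, max_generated_tokens=max_generated_tokens)


# Usage with a real model (requires Hailo hardware):
# ask(llm, "Hi! I'm learning Python. What's a list?",
#     system="You are a helpful tutor.", new_conversation=True)
# ask(llm, "Can you show me an example?")  # follow-up reuses the context
```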



In [None]:
question = [
    {"role": "user", "content": "What is your profession?"}
]

print(llm.generate_all(question, max_generated_tokens=30))
new_question = [
    {"role": "user", "content": "What were we just discussing?"}
]
print(llm.generate_all(new_question, max_generated_tokens=30))
print()

# Clear context and ask again
llm.clear_context()

print(llm.generate_all(new_question, max_generated_tokens=30))


## Generation Parameters Example
`seed`: fixes the random seed so that results are reproducible.
More configurable parameters can be found in the API documentation.
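
To see how a single parameter changes the output, it can help to hold the prompt and seed fixed and vary only that parameter. The sketch below does this for `temperature`, using the `seed`, `temperature`, and `max_generated_tokens` arguments that appear in this tutorial. `compare_temperatures` is a hypothetical helper name.

```python
def compare_temperatures(llm, prompt, temperatures, seed=42, max_generated_tokens=25):
    """Return {temperature: response} for the same prompt and seed."""
    results = {}
    for temperature in temperatures:
        # Reset the context so earlier runs do not influence this one.
        llm.clear_context()
        results[temperature] = llm.generate_all(
            prompt, seed=seed, temperature=temperature,
            max_generated_tokens=max_generated_tokens)
    return results


# Usage with a real model (requires Hailo hardware):
# for t, response in compare_temperatures(llm, test_prompt, [0.2, 0.8]).items():
#     print("temperature={}: {}".format(t, response))
```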



In [None]:
test_prompt = [
    {"role": "user", "content": "Tell me about AI."}
]
# Reproducible with seed
llm.clear_context()
response_seed1 = llm.generate_all(test_prompt, seed=42, temperature=0.8, max_generated_tokens=25)
llm.clear_context()
response_seed2 = llm.generate_all(test_prompt, seed=42, temperature=0.8, max_generated_tokens=25)
print("Seed=42 (run 1): {}".format(response_seed1))
print("Seed=42 (run 2): {}".format(response_seed2))


## Raw Prompts vs Structured Prompts Example
A raw prompt is a single string, usually wrapped in special tokens that differ from model to model.
Here we demonstrate the tokens for the Qwen family. The special tokens and prompt structure can be obtained with `llm.prompt_template()`.
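
To make the relation between the two prompt styles concrete, the sketch below renders a list of role/content messages into the `<|im_start|>` / `<|im_end|>` layout used by the raw prompt in this section. `to_chatml` is a hypothetical helper, and the layout is an assumption taken from this tutorial's example string. For other models, check the output of `llm.prompt_template()` instead.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render role/content messages in the layout of this tutorial's raw prompt."""
    parts = []
    for message in messages:
        parts.append("<|im_start|>{}\n{}<|im_end|>\n".format(
            message["role"], message["content"]))
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


raw = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
])
```

With a structured prompt, this wrapping is done for you through the model's chat template, which is why the structured form is recommended.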



In [None]:
llm.clear_context()
raw_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat is machine learning?<|im_end|>\n<|im_start|>assistant\n"
with llm.generate(raw_prompt, max_generated_tokens=30, seed=100) as generation:
    print("".join(generation))
print()

llm.clear_context()

# Structured prompt (recommended)
structured_prompt = [
    {"role": "user", "content": "What is machine learning?"}
]
print(llm.generate_all(structured_prompt, max_generated_tokens=30, seed=100))
print()
print(llm.prompt_template())


## Tokenization Example
The GenAI HEF comes with tokenization information, allowing the encoding of text into tokens.
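
A common use of tokenization is estimating how many tokens a prompt will consume. The sketch below sums the token counts of the message contents using `llm.tokenize()`, as used in this tutorial. `count_content_tokens` is a hypothetical helper, and the result is only a lower bound because it ignores any tokens added by the chat template.

```python
def count_content_tokens(llm, messages):
    """Sum the token counts of every message's content."""
    return sum(len(llm.tokenize(message["content"])) for message in messages)


# Usage with a real model (requires Hailo hardware):
# n = count_content_tokens(llm, [{"role": "user", "content": "Hello world"}])
# print("Prompt content uses at least {} tokens".format(n))
```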



In [None]:
test_texts = [
    "Hello world",
    "Machine learning with Hailo",
    "The quick brown fox jumps!"
]

for text in test_texts:
    tokens = llm.tokenize(text)
    print("'{}' {} tokens: {}".format(text, len(tokens), tokens))

## Stop Tokens and Recovery Sequence Example
Generation stops when one of these conditions is met:

**Max Tokens Reached**
* In this case, a recovery sequence is fed into the LLM.
* Most models come with a default recovery sequence, which can be retrieved and configured with `llm.get_generation_recovery_sequence()` and `llm.set_generation_recovery_sequence()`.
* The recovery-sequence tokens do not count toward `max_generated_tokens`.

**Logical End of Generation**
* When the model hits one of the stop sequences, it finishes its generation gracefully, without any recovery sequence.
* A stop token can be any string, including one that spans several tokens. Generation stops when that exact string is generated.
* Most models come with default stop tokens, which can be retrieved and configured with `llm.get_stop_tokens()` and `llm.set_stop_tokens()`.
* All stop tokens are checked after each generated token, so setting too many can affect performance.
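
Because stop tokens are a setting on the LLM object, custom values stay in effect until they are changed back. The sketch below wraps the get/set pair in a context manager so the previous stop tokens are always restored, even if generation fails. It uses only `llm.get_stop_tokens()` and `llm.set_stop_tokens()`; `temporary_stop_tokens` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def temporary_stop_tokens(llm, stop_tokens):
    """Apply stop_tokens inside the 'with' block, then restore the previous ones."""
    original = llm.get_stop_tokens()
    llm.set_stop_tokens(stop_tokens)
    try:
        yield llm
    finally:
        # Runs even if generation raises, so the previous setting is never lost.
        llm.set_stop_tokens(original)


# Usage with a real model (requires Hailo hardware):
# with temporary_stop_tokens(llm, ["END"]):
#     print(llm.generate_all(test_prompt, max_generated_tokens=10))
```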



In [None]:
# Get current stop tokens
original_stop_tokens = llm.get_stop_tokens()
print("Original stop tokens: {}".format(original_stop_tokens))

# Set custom stop tokens
custom_stop_tokens = [".", "END", "\n"]  # "\n" is a newline character
llm.set_stop_tokens(custom_stop_tokens)
print("Custom stop tokens: {}".format(llm.get_stop_tokens()))
print()

# Set empty stop tokens - the model will stop generation only when 'max_generated_tokens' is reached
llm.set_stop_tokens([])
test_prompt = [
    {"role": "user", "content": "Count to 5: 1, 2, 3, 4, 5. Done."}
]
print(llm.generate_all(test_prompt, max_generated_tokens=10))
print()

# Reset stop tokens
llm.set_stop_tokens(original_stop_tokens)
print("Reset stop tokens: {}".format(llm.get_stop_tokens()))



In [None]:
# Clean up resources (best practice: use context managers when possible)
llm.release()
vdevice.release()
print("Resources released successfully")
print("LLM tutorial completed!")
