# Llama Stack Inference Guide

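This guide shows how to use Llama Stack's `chat_completion` function to generate text. The examples were run with the `deepseek-ai/DeepSeek-R1-Distill-Qwen-14B` model set in the configuration step; substitute the identifier of any chat model your Llama Stack server provides.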

Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).


### Table of Contents
1. [Quickstart](#quickstart)
2. [Building Effective Prompts](#building-effective-prompts)
3. [Conversation Loop](#conversation-loop)
4. [Conversation History](#conversation-history)
5. [Streaming Responses](#streaming-responses)


## Quickstart

This section walks through each step to set up and make a simple text generation request.



### 0. Configuration
Set up your connection parameters:

In [1]:
HOST = "localhost"  # Replace with your host
PORT = 8321  # Replace with your port
# Use the identifier of a model available on your Llama Stack server, for example:
# MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
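
Hard-coding the connection details is fine for a local test. As a minimal sketch, the same values could instead be read from environment variables, with the values above as fallbacks. The variable names `LLAMA_STACK_HOST` and `LLAMA_STACK_PORT` and the helper `load_connection` are assumptions made for this example, not names that Llama Stack defines.

```python
import os


def load_connection(env=os.environ):
    """Return (host, port, base_url), falling back to the local defaults."""
    # LLAMA_STACK_HOST / LLAMA_STACK_PORT are hypothetical names for this sketch.
    host = env.get("LLAMA_STACK_HOST", "localhost")
    port = int(env.get("LLAMA_STACK_PORT", "8321"))
    return host, port, f"http://{host}:{port}"


HOST, PORT, BASE_URL = load_connection()
print(BASE_URL)
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.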

### 1. Set Up the Client

Begin by importing the necessary components from Llama Stack’s client library:

In [2]:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

### 2. Create a Chat Completion Request

Pass the conversation to `chat_completion` as a list of messages. Each message is a dictionary with a `role` (here `system` or `user`) and its `content`:

In [3]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
    model_id=MODEL_NAME,
)

print(response.completion_message.content)

Okay, so I need to write a two-sentence poem about llamas. Hmm, where do I start? I know llamas are animals, they're from South America, maybe the Andes region. They look kind of like camels but smaller, with those distinctive faces and long eyelashes. I remember they have soft fur, which makes them nice to touch. Llamas are often used as pack animals, carrying loads for people. They're also known for their gentle nature, though sometimes they can be a bit mischievous.

I should think about what makes llamas unique. Their appearance is striking, with that camel-like shape but smaller size. Their fur comes in various colors, which adds to their charm. Maybe I can mention their eyes or expressions, as they seem to have a certain look that's both curious and calm. Also, their behavior, like how they interact with humans or other animals, could be a good point to include.

For the first sentence, I want to capture their essence. Maybe something about their presence or how they stand out in
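
The role-and-content structure used in the request above can be checked before a request is sent. The helper below is a minimal sketch and is not part of the client library: `make_message` is a hypothetical name, and the set of accepted roles is an assumption chosen for this example.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed set for this sketch


def make_message(role, content):
    """Build one chat message dict, rejecting unknown roles and empty content."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    if not isinstance(content, str) or not content.strip():
        raise ValueError("content must be a non-empty string")
    return {"role": role, "content": content}


messages = [
    make_message("system", "You are a friendly assistant."),
    make_message("user", "Write a two-sentence poem about llama."),
]
print(messages)
```

The resulting list has the same shape as the `messages` argument in the request above, so a typo in a role name fails locally instead of reaching the server.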

## Building Effective Prompts

Careful prompt design (often called 'prompt engineering') has a large effect on response quality. The sample prompt below reuses the Quickstart request and changes only the system message, which sets the persona the model is asked to adopt.

### Sample Prompt

In [4]:
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are Shakespeare."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
    model_id=MODEL_NAME,
)
print(response.completion_message.content)

Okay, so I need to write a two-sentence poem about llamas. Hmm, where do I start? I know llamas are animals, they're from South America, right? They have that camel-like appearance but smaller. They're often used for their wool, I think. Also, they have a sort of gentle demeanor, but I've heard they can be a bit spitty too. 

First, I should think about the imagery. Maybe describe their appearance. They have soft fur, maybe in different colors. Their eyes are big and expressive. They graze on grass, so maybe include that. 

For the second sentence, I can talk about their behavior or something unique about them. They're social animals, so maybe mention herds. Also, they have a calm presence, which people appreciate. Maybe something about their gentle nature or how they interact with humans.

Putting it together, the first line could describe their appearance and the environment they're in. The second line could highlight their behavior or what they bring to that environment. I should ma
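
The sample above varies only the system message. One way to make such variations repeatable is to assemble the system message from a persona and a list of explicit constraints. The sketch below is illustrative only: `build_prompt` and its parameters are hypothetical names, and whether a given model follows the constraints depends on the model.

```python
def build_prompt(persona, task, constraints=()):
    """Return a [system, user] message list for a persona, task, and constraints."""
    system_lines = [f"You are {persona}."]
    # Each constraint becomes its own line so the instructions stay explicit.
    system_lines += [f"- {c}" for c in constraints]
    return [
        {"role": "system", "content": "\n".join(system_lines)},
        {"role": "user", "content": task},
    ]


messages = build_prompt(
    "Shakespeare",
    "Write a two-sentence poem about llama.",
    constraints=["Answer with the poem only.", "Use exactly two sentences."],
)
print(messages[0]["content"])
```

The returned list can be passed as `messages` to `client.inference.chat_completion` in the same way as in the sample prompt.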

## Conversation Loop

To let a user send several messages in one session, wrap the request in a loop. The example below defines the loop as a coroutine so that it can be awaited in a notebook cell; it ends when the user types 'exit', 'quit', or 'bye'. Each turn is sent on its own, so the model does not see earlier turns.

In [5]:
import asyncio
from llama_stack_client import LlamaStackClient
from termcolor import cprint

client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

async def chat_loop():
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        message = {"role": "user", "content": user_input}
        response = client.inference.chat_completion(
            messages=[message],
            model_id=MODEL_NAME
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

# Run the chat loop in a Jupyter Notebook cell using await
await chat_loop()
# To run it in a Python file, use this line instead
# asyncio.run(chat_loop())


User> quit

Ending conversation. Goodbye!


## Conversation History

Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session.
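
Because every turn is appended, the list grows with the length of the session, and a long session can eventually exceed what the model accepts in one request. A minimal sketch of one mitigation, assuming a simple cap on the number of retained messages, is shown here. `trim_history` and `max_messages` are hypothetical names, and counting messages is only a rough stand-in for counting tokens.

```python
def trim_history(history, max_messages=20):
    """Keep any leading system message plus the most recent messages."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    if len(history) <= max_messages:
        return list(history)
    if history[0]["role"] == "system":
        # Reserve one slot for the system message, then keep the newest turns.
        keep = max_messages - 1
        return [history[0]] + (history[-keep:] if keep else [])
    return history[-max_messages:]


demo = [{"role": "system", "content": "Be brief."}]
demo += [{"role": "user", "content": f"message {i}"} for i in range(5)]
print([m["content"] for m in trim_history(demo, max_messages=3)])
```

A helper like this could be applied to `conversation_history` just before each `chat_completion` call. The loop below keeps the full history without trimming.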

In [6]:
async def chat_loop():
    conversation_history = []
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model_id=MODEL_NAME,
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

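        # Append the model's reply as an assistant turn so the next request
        # carries the full exchange.
        assistant_message = {
            "role": "assistant",
            "content": response.completion_message.content,
            # Assumption: this client version requires `stop_reason` on
            # assistant messages; drop the field if yours does not.
            "stop_reason": response.completion_message.stop_reason,
        }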
        conversation_history.append(assistant_message)

# Use `await` in the Jupyter Notebook cell to call the function
await chat_loop()
# To run it in a Python file, use this line instead
# asyncio.run(chat_loop())


User> bye

Ending conversation. Goodbye!


## Streaming Responses

The `chat_completion` function accepts a `stream` parameter. When it is set to `True`, partial responses are returned progressively as they are generated, so the user sees output immediately instead of waiting for the entire response.
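
The pattern of consuming a response progressively can be illustrated without a server. In the sketch below, `fake_stream` is a hypothetical stand-in that yields plain text fragments. Real streamed chunks from the client are structured objects rather than strings, so this shows only the consumption pattern, not the library's chunk format.

```python
def fake_stream(text, size=5):
    """Yield `text` in fragments of at most `size` characters (stand-in for a stream)."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def consume(stream):
    """Print each fragment as it arrives and return the assembled text."""
    parts = []
    for fragment in stream:
        print(fragment, end="", flush=True)  # show partial output immediately
        parts.append(fragment)
    print()
    return "".join(parts)


full_text = consume(fake_stream("Llamas graze on quiet hills."))
```

The cell below does the equivalent for real responses by handing the stream to the client library's `EventLogger`.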

In [7]:
from llama_stack_client.lib.inference.event_logger import EventLogger

async def run_main(stream: bool = True):
    client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')

    message = {
        "role": "user",
        "content": 'Write me a 3 sentence poem about llama'
    }
    cprint(f'User> {message["content"]}', 'green')

    response = client.inference.chat_completion(
        messages=[message],
        model_id=MODEL_NAME,
        stream=stream,
    )

    if not stream:
        cprint(f'> Response: {response.completion_message.content}', 'cyan')
    else:
        for log in EventLogger().log(response):
            log.print()

# In a Jupyter Notebook cell, use `await` to call the function
await run_main()
# To run it in a Python file, use this line instead
# asyncio.run(run_main())


User> Write me a 3 sentence poem about llama
Assistant> Okay, so I need to write a three-sentence poem about llamas. Hmm, where do I start? I know llamas are animals, they're from South America, maybe the Andes? They look kind of like camels but smaller, with those distinctive faces and long eyelashes. I remember they have soft fur, which makes them