## Setting the Environment Variables

The environment variables are set according to the requirements of the Ollama template.


### Environment variables required for all demos
- `MODEL_NAME`: the model to use for inference. In this case, `llama3.2:3b-instruct-fp16`.
- `REMOTE_BASE_URL`: the URL of the remote Llama Stack server. In this case, the server runs locally.
- `TEMPERATURE` (optional): the temperature to use during inference. Defaults to 0.0.
- `TOP_P` (optional): the top_p parameter to use during inference. Defaults to 0.95.
- `MAX_TOKENS` (optional): the maximum number of tokens that can be generated in the completion. The code in this notebook falls back to 4096 when the variable is unset; the run shown here uses 512.
- `STREAM` (optional): set this to `False` to disable streaming of the model/agent output. Any other value enables streaming, and the code in this notebook enables it when the variable is unset.
- `VDB_PROVIDER`: the vector DB provider to be used. Must be supported by Llama Stack. For this demo, we use Faiss, which is the default for Ollama.
- `VDB_EMBEDDING`: the embedding model to be used for ingestion and retrieval. For this demo, we use all-MiniLM-L6-v2.
- `VDB_EMBEDDING_DIMENSION` (optional): the dimension of the embedding. Defaults to 384.
- `VECTOR_DB_CHUNK_SIZE` (optional): the chunk size for the vector DB. Defaults to 512.
- `USE_PROMPT_CHAINING`: determines whether the prompt is split into several separate prompts that isolate each step, or sent in a single turn.
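
Because `os.getenv` silently returns `None` for an unset variable, it can help to check the configuration before running the demos. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_vars` is hypothetical, and treating every variable not marked "(optional)" in the list above as required is an assumption.

```python
import os

# Names taken from the list above. Treating every entry that is not marked
# "(optional)" as required is an assumption of this sketch.
REQUIRED_VARS = (
    "MODEL_NAME",
    "REMOTE_BASE_URL",
    "VDB_PROVIDER",
    "VDB_EMBEDDING",
    "USE_PROMPT_CHAINING",
)


def find_missing_vars(env, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]


# Report the result against the real process environment.
missing = find_missing_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing the environment in as a mapping keeps the helper easy to try with a plain dictionary, for example `find_missing_vars({"MODEL_NAME": "llama3.2:3b-instruct-fp16"})`.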

## Necessary Imports

In [1]:
# for accessing the environment variables
import os
from dotenv import load_dotenv
load_dotenv()

# for communication with Llama Stack
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

In [2]:
model_name = os.getenv("MODEL_NAME")
model_name

'llama3.2:3b-instruct-fp16'

## Setting Up the Server Connection

Establish the connection to your Llama Stack server.

_Note: A Tavily search API key is required for some of our demos and must be provided to the client upon initialization. If you do not have one, you can set one up for free at https://app.tavily.com_

In [3]:
base_url = os.getenv("REMOTE_BASE_URL", "http://localhost:8321")

# Tavily search API key is required for some of our demos and must be provided to the client upon initialization.
# We will cover it in the agentic demos that use the respective tool. Please ignore this parameter for all other demos.
tavily_search_api_key = os.getenv("TAVILY_SEARCH_API_KEY")
if tavily_search_api_key is None:
    provider_data = None
else:
    provider_data = {"tavily_search_api_key": tavily_search_api_key}


client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data
)

print("Connected to Llama Stack server")

Connected to Llama Stack server


## Initializing the Inference Parameters

Fetch the inference-related parameters from the corresponding environment variables and convert them to the format Llama Stack expects.

In [4]:
temperature = float(os.getenv("TEMPERATURE", 0.0))
if temperature > 0.0:
    top_p = float(os.getenv("TOP_P", 0.95))
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}

max_tokens = int(os.getenv("MAX_TOKENS", 4096))

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

stream_env = os.getenv("STREAM", "True")
# the Boolean 'stream' parameter will later be passed to Llama Stack Agents/Inference APIs
# any value not equal to 'False' is treated as True
stream = (stream_env != "False")

print(f"Inference Parameters:\n\tSampling Parameters: {sampling_params}\n\tstream: {stream}")

Inference Parameters:
	Sampling Parameters: {'strategy': {'type': 'greedy'}, 'max_tokens': 512}
	stream: True
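
The conversion above can also be wrapped in small functions, which makes the greedy versus `top_p` choice easy to check in isolation. This is a sketch under stated assumptions rather than part of the original notebook: the names `build_sampling_params` and `parse_stream_flag` are hypothetical, and the case-insensitive flag parsing is a stricter variation on the notebook's rule (which treats only the exact string `False` as false).

```python
def build_sampling_params(temperature=0.0, top_p=0.95, max_tokens=4096):
    """Build the sampling_params dict in the same shape as the cell above."""
    if temperature > 0.0:
        # A positive temperature selects top_p (nucleus) sampling.
        strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
    else:
        # A temperature of zero (or below) selects greedy decoding.
        strategy = {"type": "greedy"}
    return {"strategy": strategy, "max_tokens": max_tokens}


def parse_stream_flag(value, default=True):
    """Hypothetical parser: accepts 'false', '0', 'no', 'off' in any case.

    This differs from the notebook, which only treats the exact string
    'False' as false.
    """
    if value is None:
        return default
    return value.strip().lower() not in {"false", "0", "no", "off"}


print(build_sampling_params(max_tokens=512))
# prints {'strategy': {'type': 'greedy'}, 'max_tokens': 512}
print(parse_stream_flag("false"))
# prints False
```

The resulting dictionary has the same keys as `sampling_params` above, so it could be passed to the same calls.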


Now, let's use the Llama Stack Inference API to greet our LLM.

In [5]:
# Prepare the message
message = UserMessage(
    content="Hi, what do you know about Red Hat?",
    role="user",
)

# Call the chat completion and get the response object
response = client.inference.chat_completion(
    model_id=model_name,
    messages=[message],
    sampling_params=sampling_params,
    timeout=600
)



INFO:httpx:HTTP Request: POST http://localhost:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"


In [6]:
# Extract the message content and save it
answer = response.completion_message.content

# Print the content
print("AI Response:", answer)


AI Response: Red Hat is a well-known American multinational software company that specializes in open-source software and cloud computing. Here's an overview:

**History**: Red Hat was founded in 1993 by Bob Young and Marc Ewing in Santa Clara, California. The company started as a small startup focused on developing and selling Linux operating systems.

**Linux and Open-Source**: Red Hat is perhaps best known for its distribution of the Linux operating system, which is an open-source software project that was created by Linus Torvalds in 1991. Red Hat's Enterprise Linux (RHEL) is a popular version of Linux that is widely used in enterprise environments.

**Products and Services**: Red Hat offers a range of products and services, including:

1. **Red Hat Enterprise Linux (RHEL)**: A commercial version of the Linux operating system designed for enterprise use.
2. **Red Hat OpenShift**: An open-source container application platform that allows developers to build, deploy, and manage appli

## Next

Now that we've set up our tutorial environment, let's get started building with Llama Stack! The next notebook will teach you how to build a [Simple RAG](./Level1_simple_RAG.ipynb) application.

#### Any Feedback?

If you have any feedback on this or any other notebook in this demo series, we'd love to hear it! Please go to https://www.feedback.redhat.com/jfe/form/SV_8pQsoy0U9Ccqsvk and help us improve our demos.