## LLM Workshop
By: Mohammed Alageel
\
Prerequisites:
- Basic understanding of Python programming concepts.
- A laptop equipped with:
    - Python (version 3.11 is recommended).
    - An IDE like VS Code or PyCharm.
    - Ollama installed.

# 1. What Are LLMs?
Large Language Models (LLMs) are AI models trained on massive amounts of text to understand and generate human-like language.

## Key Concepts
- Based on transformer architecture (e.g., GPT, BERT).
- Trained to predict the next word (or token) in a sequence.
- Can generate, translate, summarize, and answer questions.

## Tokens
LLMs process text by breaking it down into smaller units called **tokens**.\
A token can represent a whole word, a part of a word (sub-word), or even a single character.\
![Tokens](images/tokenizer.png)\
You can explore how text is tokenized using tools like the [OpenAI Tokenizer](https://platform.openai.com/tokenizer)
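
To make the idea of sub-word tokens concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `VOCAB` and the function `toy_tokenize` are made up for illustration; real tokenizers learn vocabularies of tens of thousands of entries from data and use more refined algorithms.

```python
# Toy greedy longest-match tokenizer over a tiny, made-up vocabulary.
VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5}

def toy_tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

pieces = toy_tokenize("unbelievable tokens")
ids = [VOCAB.get(p, -1) for p in pieces]  # -1 marks pieces outside the vocabulary
print(pieces)
print(ids)
```

The word "unbelievable" is not in the vocabulary, yet it can still be represented as a sequence of known pieces, which is the main appeal of sub-word tokenization.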

## Parameters
The complexity and capability of an LLM are often related to its number of **parameters**. These are the internal variables the model learns during training.\
Model sizes range from about 1B parameters (small) to 100B+ parameters (large).\
Generally, models with more parameters have greater capacity, but also require more computational resources.\
\
![Parameters](images/model_size.png)\
[Ollama Gemma3 Model](https://ollama.com/library/gemma3)
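
A rough way to see why larger models need more resources is to multiply the parameter count by the bytes used per parameter. The sketch below estimates the memory for the weights alone; it ignores activations, the KV cache, and framework overhead, so treat the numbers as lower bounds. The helper name `weight_memory_gib` is an assumption for this example.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

for size in (1.5e9, 7e9, 70e9):
    row = ", ".join(f"{d}: {weight_memory_gib(size, d):.1f} GiB" for d in BYTES_PER_PARAM)
    print(f"{size / 1e9:g}B params -> {row}")
```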

## Context Length
The **context length** defines the maximum amount of text (measured in tokens) that an LLM can consider at one time when processing input or generating output.\
This limit varies between different models, typically ranging from a few thousand (e.g., 4k) to over a hundred thousand (e.g., 128k) tokens.\
![Context](images/context_length.png)\
[Hugging Face Llama 4 Scout model card](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
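
When a conversation grows past the context length, something has to be dropped. One simple strategy, sketched below, keeps only the most recent tokens and reserves room for the reply. The function name `fit_to_context` and the numbers are assumptions for illustration; other strategies (such as summarizing older turns) are also common.

```python
def fit_to_context(token_ids: list[int], context_length: int, reserve_for_output: int) -> list[int]:
    """Keep only the most recent tokens that fit, leaving room for the reply."""
    budget = context_length - reserve_for_output
    if budget <= 0:
        raise ValueError("reserve_for_output must be smaller than context_length")
    return token_ids[-budget:]

history = list(range(10_000))  # pretend these are 10,000 token IDs
kept = fit_to_context(history, context_length=4096, reserve_for_output=512)
print(len(kept), kept[0], kept[-1])
```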

## Popular LLMs
- ChatGPT (OpenAI)
- Claude (Anthropic)
- Gemini (Google)
- LLaMA (Meta)
- Qwen (Alibaba)

## Common Use Cases
- Writing assistance (emails, blog posts)
- Code generation
- Chatbots & virtual assistants
- Language translation

# 2. Basic LLM Usage
How do you use an LLM?\
You give it a prompt (a message or instruction), and it returns a response.

## Examples
Prompt: “Write a birthday message for a 10-year-old.”\
Output: “Happy 10th Birthday! Hope your day is filled with fun and cake!”

Prompt: “What’s the capital of France?”\
Output: “Paris.”

## APIs
- Low-level: Transformers
- High-level: OpenAI (Ollama or Cloud)


## Low level: Transformers
Offer fine-grained control (e.g., Hugging Face `transformers`).

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

generate_ids = model.generate(**inputs, max_new_tokens=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

## High level: OpenAI
Provide simpler interfaces (e.g., OpenAI's API, usable with cloud services or local tools like Ollama).

In [None]:
from openai import OpenAI

client = OpenAI(
    api_key='...',
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What’s the capital of France?"},
    ]
)
print(response.choices[0].message.content)

## Ollama
Ollama provides a convenient way to run various open-source LLMs directly on your own machine.\
It exposes an API endpoint compatible with the OpenAI API standard, allowing you to use the same `openai` Python library to interact with local models.\
**Terminal Commands**:
```bash
ollama run qwen2.5:1.5b
```
To run a larger version (e.g., Qwen 7B):

```bash
ollama run qwen2.5:7b
```

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url = 'http://localhost:11434/v1',
    api_key='ollama',
)

response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What’s the capital of France?"},
    ]
)
print(response.choices[0].message.content)

## LLM Hyperparameters
When generating text, you can influence the output using several parameters:
- **`max_tokens`**: Sets the maximum number of tokens the model should generate in its response.
- **`temperature`**: Controls the randomness of the output. A lower value (e.g., 0) makes the output more deterministic and focused (greedy decoding), while a higher value increases randomness and creativity.
- **`top_p` (Nucleus Sampling)**: Selects tokens from a cumulative probability distribution. Only the most probable tokens whose probabilities add up to `top_p` are considered. `top_p=1` considers all tokens, while lower values restrict choices.
- **`top_k`**: Selects only the `k` most likely tokens at each step. `top_k=1` is equivalent to greedy decoding.
                
**Warning:** Setting `temperature` very high without constraints like `max_tokens` can sometimes lead to repetitive or nonsensical output loops.
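
The sketch below shows how these parameters could interact when picking a single next token. It is a simplified illustration in pure Python, not how any particular inference engine implements sampling; the function `sample_next_token` and the toy scores are assumptions.

```python
import math
import random
from typing import Dict, Optional

def sample_next_token(logits: Dict[str, float], temperature: float = 1.0,
                      top_k: Optional[int] = None, top_p: float = 1.0,
                      rng: Optional[random.Random] = None) -> str:
    rng = rng or random.Random()
    # Rank candidates from most to least likely.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if temperature == 0:
        return ranked[0][0]  # greedy decoding: always the top token
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k best candidates (k >= 1)
    # Softmax with temperature (subtract the max for numerical stability).
    max_logit = ranked[0][1]
    weights = [math.exp((logit - max_logit) / temperature) for _, logit in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus sampling: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for (token, _), p in zip(ranked, probs):
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    tokens, kept_probs = zip(*kept)
    return rng.choices(tokens, weights=kept_probs, k=1)[0]

toy_logits = {"Paris": 5.0, "Lyon": 2.0, "banana": -1.0}  # made-up scores
print(sample_next_token(toy_logits, temperature=0))
print([sample_next_token(toy_logits, temperature=5.0) for _ in range(5)])
```

With `temperature=0` the result is always the top-ranked token, while a high temperature flattens the distribution so unlikely tokens such as "banana" start to appear.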

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url = 'http://localhost:11434/v1',
    api_key='ollama',
)

response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What’s the capital of France?"},
    ],
    max_tokens=100, # cap the reply length; at such a high temperature, generation may otherwise run on for a very long time
    temperature=5,
    top_p=1,
    # top_k is not supported here
)
print(response.choices[0].message.content)

## Hallucination
LLMs can sometimes generate text that sounds plausible but is factually incorrect or nonsensical. This phenomenon is known as **hallucination**.

It occurs because models predict likely sequences of words based on patterns in their training data, without true understanding or access to real-time facts.

One way to mitigate this is to instruct the model in the prompt to state when it doesn't know an answer, for example:
```python
"Answer the following question. If you do not know the answer or cannot find it in the provided context, respond with 'I don’t know.'"
```

## Tokenizer

**Tokenization** is the fundamental process of converting raw text into a sequence of tokens (numerical IDs) that the model can understand.

The specific way text is broken down depends on the **tokenizer** used, which is typically paired with a specific LLM.
![tokenizer](images/tokenizer.png)

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

print(inputs)

print(inputs.input_ids[0][0], tokenizer.decode(inputs.input_ids[0][0]))

# 3. Prompt Engineering
Prompt Engineering is the craft of designing effective inputs to get useful outputs from an LLM.

## Techniques
1. Zero-shot: Ask directly.
```python
"Summarize this article. [ARTICLE]"
```
2. Few-shot: Give examples.
```python
"""Classify as Negative or Positive
example:
this drink made me vomit
output:
negative

[INPUT]
output:
"""
```
3. Chain-of-Thought: Ask the model to reason step by step before answering.
```python
"When I was 3 years old, my partner was 3 times my age. Now, I am 20 years old. How old is my partner? Think step by step before giving the final answer."
```
4. Role prompting:
```python
"Act like a legal advisor and explain this contract. [CONTRACT]"
```
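
Few-shot prompts like the one in the list above are often assembled from data rather than written by hand. The sketch below builds one from a list of labeled examples; the helper `build_few_shot_prompt` and the second example sentence are made up for illustration.

```python
def build_few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    parts = [instruction]
    for text, label in examples:
        parts.append(f"example:\n{text}\noutput:\n{label}")
    # End with the unanswered query so the model completes the final "output:".
    parts.append(f"{query}\noutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "Classify as Negative or Positive",
    [("this drink made me vomit", "negative"), ("best coffee in town", "positive")],
    "the service was slow and rude",
)
print(prompt)
```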

## Tips
- Provide examples.
- Be specific about the output.
- Use instructions over constraints.
- Experiment and iterate.

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url = 'http://localhost:11434/v1',
    api_key='ollama',
)

customer_order = "Now, I would like a large pizza, with the first half cheese and mozzarella. And the other tomato sauce, ham and pineapple."
prompt = """
Parse a customer's pizza order into valid JSON:
EXAMPLE:
I want a small pizza with cheese, tomato sauce, and pepperoni.
JSON Response:
```
{
"size": "small",
"type": "normal",
"ingredients": [["cheese", "tomato sauce", "pepperoni"]]
}
```""" + "\n" + customer_order + "\nJSON Response:"


response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[
        {"role": "user", "content": prompt},
    ],
    temperature=0,
)
print(response.choices[0].message.content)

## Parsing Structured Output
When you need the LLM's output to be used programmatically (e.g., feeding data into another system), it's crucial to get structured data like JSON.

You can prompt the model to generate JSON. However, the output might not always be perfectly valid.

Libraries like **`Pydantic`** are excellent for:
1.  Defining the expected data structure using Python classes.
2.  Parsing the LLM's JSON output.
3.  Validating that the parsed data conforms to the defined structure.


**JSON Repair Libraries:** Tools like **`json-repair`** can attempt to automatically fix common errors in malformed JSON strings before parsing.
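
Before validation, the JSON usually has to be separated from any surrounding text or code fences in the reply. The following is a minimal standard-library sketch of that step; the helper `extract_json` and the sample reply are assumptions, and it only handles a single JSON object.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to avoid a literal fence in this example

def extract_json(reply: str) -> dict:
    """Pull a JSON object out of a model reply, with or without code fences."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, reply, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Keep everything between the first "{" and the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])

# Made-up reply in the shape a model might return.
reply = ("Here is the order:\n" + FENCE
         + 'json\n{"size": "large", "type": "half-half", "ingredients": [["cheese"], ["ham"]]}\n'
         + FENCE)
order = extract_json(reply)
print(order["size"], order["ingredients"])
```

The resulting dictionary can then be handed to a validation layer such as Pydantic, which checks field names and types.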


In [None]:
from pydantic import BaseModel

class Order(BaseModel):
    size: str
    type: str
    ingredients: list[list[str]]

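# str.lstrip/rstrip remove a *set of characters*, not a prefix/suffix, so use removeprefix/removesuffix (Python 3.9+)
json_str = response.choices[0].message.content.strip()
json_str = json_str.removeprefix('```json').removeprefix('```').removesuffix('```').strip()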
print(json_str)
order = Order.model_validate_json(json_str)
print(order)

# 4. Retrieval-Augmented Generation (RAG)

**RAG** stands for **Retrieval-Augmented Generation**. It's a powerful technique that enhances LLM responses by providing them with relevant information retrieved from an external knowledge source (like a collection of documents or a database) before generation.

This helps produce more accurate, up-to-date, and context-aware answers, especially for domain-specific or recent information not present in the LLM's original training data.


## Text Embeddings
A core component of RAG is **text embedding**. This process converts pieces of text into numerical vectors (lists of numbers).

These vectors are designed to capture the semantic meaning of the text, such that texts with similar meanings have vectors that are close to each other in the vector space.

![embeddings](images/emb.png)
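
"Close to each other" is commonly measured with cosine similarity, which compares the direction of two vectors. The sketch below uses hypothetical 3-dimensional vectors chosen by hand; real embedding models produce hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings" for three short texts.
great_movie = [0.9, 0.1, 0.0]
loved_film = [0.8, 0.2, 0.1]
tax_form = [0.0, 0.1, 0.9]

print(cosine_similarity(great_movie, loved_film))  # similar meaning: close to 1
print(cosine_similarity(great_movie, tax_form))    # unrelated: close to 0
```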

### Usages
- Semantic Search
- Recommendation Systems
- Text Classification
- Text Clustering


In [None]:
from sentence_transformers import SentenceTransformer
# Load model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Convert text to text embeddings
vector = model.encode("Best movie ever!")
vector

## Vector DB
**Vector databases** are specialized databases designed to efficiently store and search through large collections of embedding vectors.

Their key capability is performing **similarity searches**. Given a query vector (representing a question or topic), the database can quickly find the stored vectors (representing documents or data chunks) that are most similar in meaning.

![vector2](images/vector_search.png)

### Typical RAG Indexing and Querying Workflow
1. All Text → Embed (e.g., via SentenceTransformers, OpenAI)
2. Store embeddings in vector DB (like FAISS, ChromaDB, Pinecone, Weaviate)
3. Query with new text → Embed
4. Search the vector DB with the query embedding → get the most similar documents
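
At its core, the search step compares the query vector against every stored vector and returns the closest ones. The brute-force sketch below illustrates this with L2 (Euclidean) distance; the 2-D vectors are made up, and production vector databases rely on optimized or approximate indexes instead of a plain loop.

```python
import math

def nearest(query: list[float], stored: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k stored vectors closest to the query (L2 distance)."""
    distances = [(math.dist(query, vec), i) for i, vec in enumerate(stored)]
    return [i for _, i in sorted(distances)[:k]]

docs = ["cats purr", "dogs bark", "stocks fell"]
doc_vectors = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]]  # made-up 2-D embeddings
query_vector = [0.9, 0.15]                           # pretend embedding of a question about pets

for i in nearest(query_vector, doc_vectors, k=2):
    print(i, docs[i])
```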


## Knowledge cutoffs
LLMs are trained on data up to a certain point in time (their **knowledge cutoff**). They typically lack information about events or developments occurring after that date.

RAG is an effective solution to this limitation, as it allows the model to access and incorporate current information retrieved from an up-to-date external knowledge source during the generation process.
            

In [None]:
import numpy as np
import faiss
from typing import List, Callable
from sentence_transformers import SentenceTransformer

def make_index(texts: List[str], embedding_model: Callable[[List[str]], np.ndarray]) -> faiss.Index:
    # Generate embeddings for all texts
    embeddings = embedding_model(texts)
    # Get dimensionality from the embeddings
    dimension = embeddings.shape[1]
    # Create a FAISS index - using L2 distance (Euclidean)
    index = faiss.IndexFlatL2(dimension)
    # Add embeddings to the index
    index.add(embeddings.astype(np.float32))
    return index

def make_query(query: str, embedding_model: Callable[[List[str]], np.ndarray],
               index: faiss.Index, k: int = 5) -> np.ndarray:
    # Generate embedding for the query
    query_embedding = embedding_model([query])
    # Search the index
    _, indices = index.search(query_embedding.astype(np.float32), k)
    # Return the indices
    return indices[0]  # Return the first (and only) result's indices

texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language for data science",
    "Neural networks have revolutionized natural language processing",
    "FAISS is a library for efficient similarity search",
    "Vector databases store and retrieve embeddings efficiently",
    "Climate change is affecting global weather patterns",
    "Renewable energy sources include solar and wind power",
    "Electric vehicles are becoming increasingly popular",
    "Quantum computing uses quantum bits or qubits",
    "Blockchain technology enables secure decentralized transactions",
    "Healthy eating involves consuming a balanced diet",
    "Regular exercise improves physical and mental health",
    "Space exploration has led to many technological advances",
    "The Great Barrier Reef is the world's largest coral reef system",
    "Digital transformation is changing how businesses operate",
    "Cybersecurity protects systems from digital attacks",
    "Artificial intelligence can solve complex problems",
    "Cloud computing delivers computing services over the internet",
    "Data privacy concerns are growing in the digital age"
]

model = SentenceTransformer("all-MiniLM-L6-v2")
faiss_index = make_index(texts, model.encode)
user_query = "What is artificial intelligence?"
results = make_query(user_query, model.encode, faiss_index, k=5)

for i in results:
    print(i, texts[i])

## RAG Code

### How It Works
1. User query → embedding
2. Search vector DB → find relevant documents
3. Combine prompt + documents + query → full LLM prompt
4. LLM generates answer using retrieved context

### Example
1. “What’s in the company’s refund policy?”
2. Vector DB retrieves policy snippet.
3. LLM answers: “You can request a refund within 30 days...”
\
\
![rag](images/rag.png)

### Benefits
- Reduces hallucination
- Up-to-date answers (even beyond model’s training data)
- Domain-specific accuracy

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url = 'http://localhost:11434/v1',
    api_key='ollama',
)

system_prompt = "You are a helpful assistant. Answer only from the documents given to you. If the answer is not in them, say 'I don't know'."
docs = [texts[i] for i in results]
docs_text = "\n".join(docs)
full_prompt = f"""{system_prompt}
DOCUMENTS:
{docs_text}"""

response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[
        {"role": "system", "content": full_prompt},
        {"role": "user", "content": user_query},
    ],
    temperature=0.7
)
print(full_prompt)
print('-----------------')
print(user_query)
print('-----------------')
print(response.choices[0].message.content)

### Chunking
For large documents, embedding the entire text at once can be inefficient and may dilute specific details. A common strategy is **chunking**: splitting the document into smaller, potentially overlapping, segments (chunks).

Each chunk is then embedded and stored individually. During retrieval, the system finds the most relevant chunks to provide focused context to the LLM.
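
A minimal sketch of overlapping chunking is shown below. It splits on characters to keep the example short; the function name `chunk_text` and the sizes are assumptions, and real pipelines often split on tokens, sentences, or paragraphs instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

document = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(document, chunk_size=10, overlap=3)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighboring chunk, at the cost of storing some text twice.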

# 5. Fine-tuning
Fine-tuning means training an existing LLM on your custom dataset to specialize it.

## When to Use
- Need highly specific outputs **(style, tone or format)**
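- Want to reduce cost (e.g., a smaller fine-tuned model may replace a larger general-purpose one, or need shorter prompts)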
- **Note:** Fine-tuning can be complex and resource-intensive. It's often considered after exploring prompt engineering and RAG.

## Example
A legal firm fine-tunes a model on case law and terminology.\
The LLM now speaks “legalese” better than a generic one.

## Tools
- OpenAI fine-tuning API (for smaller models)
- LoRA / PEFT (Parameter-Efficient Fine-Tuning)
- Hugging Face Transformers

[Youtube Link for finetuning](https://www.youtube.com/watch?v=S9VHQhC3HPc)

## Model Finetuning Example
Most models are released in two variants:
- Base model
- Instruct fine-tune

![instruct](images/instruct.png)

Base models only perform text completion: they continue whatever text they are given.\
Instruct models are further tuned to follow instructions and answer questions.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

prompt = "Hey."
inputs = tokenizer(prompt, return_tensors="pt")

generate_ids = model.generate(**inputs, max_new_tokens=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

## Quantization
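**Quantization** is a technique used to reduce the memory footprint and computational cost of running LLMs by storing the model's weights at lower numerical precision (e.g., 8-bit or 4-bit integers instead of 16- or 32-bit floating-point numbers).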


![quant](images/quantization.png)

This significantly decreases the model size and can speed up inference, often with only a small impact on performance. It's crucial for running larger models on consumer hardware.
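
The core idea can be sketched with a simple "absmax" scheme: scale the weights so the largest one maps to 127, round to integers, and multiply back by the scale when the values are needed. This is a toy illustration with made-up weights; libraries such as bitsandbytes use more elaborate schemes.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    return [round(w / scale) for w in weights], scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 3) for r in restored])
```

Each weight now fits in one byte instead of two or four, and the restored values differ from the originals by at most half a quantization step.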

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B",
    quantization_config=quantization_config
)
model_full = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
print(f'Q8 Quantization: {model_8bit.get_memory_footprint() / (1024 * 1024 * 1024):.2f}GB')
print(f'Full Precision: {model_full.get_memory_footprint() / (1024 * 1024 * 1024):.2f}GB')



## Special Tokens and Chat Templates
LLMs and their tokenizers use **special tokens** – unique symbols that don't represent regular words but serve structural or functional purposes.

Examples include:
- `[CLS]`, `<s>`: Mark the beginning of a sequence.
- `[SEP]`, `</s>`: Indicate separation between segments or the end of a sequence.
- `[PAD]`: Used to pad shorter sequences to a uniform length in a batch.
- `[UNK]`: Represents tokens that were not in the tokenizer's vocabulary.

For chat models, specific tokens and formatting rules (**chat templates**) are used to delineate between system messages, user turns, and assistant turns. Applying the correct chat template is crucial for getting instruct/chat models to behave as expected.
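
To show what a chat template does, the sketch below renders a conversation in a simplified ChatML-style layout, the format Qwen models are based on. The function `render_chatml` is an illustration only; the authoritative template ships with each tokenizer and may include details this sketch omits.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in a simplified ChatML-style layout."""
    text = ""
    for message in messages:
        # Each turn is wrapped in special tokens that mark where it starts and ends.
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should reply next.
        text += "<|im_start|>assistant\n"
    return text

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
print(render_chatml(chat))
```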
            

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")

chat = [
  {"role": "system", "content": "You are a helpful assistant."}, # Try commenting this
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

tokenizer.apply_chat_template(chat, tokenize=False)

In [None]:
tokenizer.special_tokens_map

# 6. Agents
LLM agents use an LLM as a thinking brain that plans, decides, and interacts with tools or APIs to complete tasks.

![agents](images/agentic-ai-workflow.png)

## Components
- **LLM**: reasoning engine
- **Tools**: external APIs, databases, calculators
- **Memory**: stores past interactions
- **Planner/Executor**: decides what to do next

## Example
An AI assistant that:
1. Receives a task: “Book me a flight to NY.”
2. Calls flight APIs.
3. Picks the cheapest option.
4. Sends a confirmation email.
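
The loop behind such an assistant can be sketched in a few lines. In the example below, a scripted function stands in for the LLM so the control flow is easy to follow; the tool names, prices, and `scripted_planner` are all made up for illustration.

```python
from typing import Callable

# Tool registry: names the planner may choose, mapped to plain Python functions.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_flights": lambda city: f"Flights to {city}: A=$320, B=$280",
    "send_email": lambda body: f"Email sent: {body}",
}

def scripted_planner(task: str, memory: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: picks the next (tool, argument) from what has happened so far."""
    if not memory:
        return "search_flights", "NY"
    if len(memory) == 1:
        return "send_email", "Booked flight B for $280"
    return "finish", "Your flight to NY is booked."

def run_agent(task: str, max_steps: int = 5) -> tuple[str, list[str]]:
    memory: list[str] = []
    for _ in range(max_steps):
        tool, argument = scripted_planner(task, memory)
        if tool == "finish":
            return argument, memory
        memory.append(TOOLS[tool](argument))  # execute the tool and remember the result
    return "Stopped: step limit reached.", memory

answer, memory = run_agent("Book me a flight to NY.")
print(answer)
print(memory)
```

The step limit matters: without it, a planner that never chooses "finish" would loop forever.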

## Security Warning
Granting LLM agents the ability to execute actions (especially interacting with **file systems**, **databases**, or **external APIs**) introduces significant **security risks**. The agent could be manipulated (via malicious prompts) into performing harmful actions.

Careful design, sandboxing, and validation of tool inputs/outputs are crucial.
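
As one example of validating tool inputs, a file-writing tool can refuse any path that resolves outside a dedicated directory. The sketch below assumes Python 3.9+ (for `Path.is_relative_to`); the function `resolve_in_sandbox` and the directory name `agent_workspace` are hypothetical, and a check like this is only one layer of defense.

```python
from pathlib import Path

def resolve_in_sandbox(sandbox: Path, requested: str) -> Path:
    """Return the resolved path if it stays inside the sandbox, else raise."""
    base = sandbox.resolve()
    target = (base / requested).resolve()  # collapses ".." and follows symlinks
    if not target.is_relative_to(base):
        raise PermissionError(f"{requested!r} escapes the sandbox")
    return target

workspace = Path("agent_workspace")  # hypothetical directory the agent may write to
print(resolve_in_sandbox(workspace, "notes.txt"))
try:
    resolve_in_sandbox(workspace, "../secrets.txt")
except PermissionError as error:
    print("Blocked:", error)
```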

In [None]:
from datetime import datetime
from typing import List, Dict, Tuple
from openai import OpenAI
from pydantic import BaseModel
import os


class Message:
    """Represents a message in the conversation."""
    def __init__(self, role: str, content: str):
        self.role = role
        self.content = content

    def to_dict(self) -> Dict[str, str]:
        """Convert message to dictionary format for OpenAI API."""
        return {
            "role": self.role,
            "content": self.content
        }


class ChatHistory:
    """Manages conversation history."""
    def __init__(self):
        self.messages: List[Message] = []

    def add_message(self, role: str, content: str) -> None:
        """Add a new message to the history."""
        self.messages.append(Message(role, content))

    def get_history_for_prompt(self) -> List[Dict[str, str]]:
        """Get formatted history for use in LLM prompts."""
        return [message.to_dict() for message in self.messages]


def generate_response(client: OpenAI, messages: List[Dict[str, str]]) -> str:
    """Generate a response from the language model."""
    response = client.chat.completions.create(
        model="qwen2.5:7b",
        messages=messages,
        temperature=0.3
    )

    return response.choices[0].message.content


class Agent:
    """Base class for all agents."""
    def __init__(self, name: str, system_prompt: str, client: OpenAI):
        self.name = name
        self.system_prompt = system_prompt
        self.client = client

    def process(self, user_input: str, chat_history: List[Dict[str, str]]) -> str:
        """Process a user request and return a response."""
        # implemented by subclass
        ...


class AgentResponse(BaseModel):
    timestamp: str
    name: str
    input: str
    output: str
    
    def to_dict(self) -> dict[str, str]:
        """convert agent response to dictionary format for OpenAI API."""
        return {
            "role": "assistant",
            "content": f"{self.timestamp} - agent={self.name}: input={self.input}\n\noutput={self.output}"
        }


class FileAgent(Agent):
    """Agent specializing in file operations."""
    def __init__(self, client: OpenAI):
        super().__init__(
            name="FileAgent",
            system_prompt="""You are a file operations agent. Your job is to 
1. include the filename in the first line
2. give the input to write to the file
example:
write xyz to a file
output:
file.txt
xyz""",
            client=client
        )
    
    def process(self, user_input: str, chat_history: List[Dict[str, str]]) -> str:
        """Process a file-related request and return instructions."""
        messages = [
            {"role": "system", "content": self.system_prompt}
        ] + chat_history
        resp = generate_response(self.client, messages)
        filename = resp.split('\n')[0]
        text = "\n".join(resp.split('\n')[1:])
        if os.path.isfile(filename):
            return f"Couldn't write to {filename}: a file with that name already exists"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(text)
        return f"Wrote to {filename}"


class DatabaseAgent(Agent):
    """Agent specializing in querying and retrieving data."""
    def __init__(self, client: OpenAI):
        super().__init__(
            name="DatabaseAgent",
            system_prompt="""You are a database query agent. Your job is to:
            1. Interpret the user's data retrieval needs
            2. Formulate appropriate database queries (SQL or other query language)
            3. Explain how to retrieve the requested information
            Be precise in your query syntax and explain the expected results.
            """,
            client=client
        )
    
    def process(self, user_input: str, chat_history: List[Dict[str, str]]) -> str:
        """Process a database query request and return instructions."""
        db = [
                "The quick brown fox jumps over the lazy dog",
                "Machine learning is a subset of artificial intelligence",
                "Python is a popular programming language for data science",
                "Neural networks have revolutionized natural language processing",
                "FAISS is a library for efficient similarity search",
                "Vector databases store and retrieve embeddings efficiently",
                "Climate change is affecting global weather patterns",
                "Renewable energy sources include solar and wind power",
                "Electric vehicles are becoming increasingly popular",
                "Quantum computing uses quantum bits or qubits",
                "Blockchain technology enables secure decentralized transactions",
                "Healthy eating involves consuming a balanced diet",
                "Regular exercise improves physical and mental health",
                "Space exploration has led to many technological advances",
                "The Great Barrier Reef is the world's largest coral reef system",
                "Digital transformation is changing how businesses operate",
                "Cybersecurity protects systems from digital attacks",
                "Artificial intelligence can solve complex problems",
                "Cloud computing delivers computing services over the internet",
                "Data privacy concerns are growing in the digital age"
        ]
        return "\n".join(db)


class FinalAnswerAgent(Agent):
    """Agent for providing final answers to the user."""
    def __init__(self, client: OpenAI):
        super().__init__(
            name="FinalAnswerAgent",
            system_prompt="""You are the final answer agent. Your job is to:
            1. Synthesize information from previous agent interactions
            2. Provide a clear, helpful response to the user's query
            3. Be concise but thorough in your explanations
            If you don't have enough information, say so clearly.
            """,
            client=client
        )
    
    def process(self, user_input: str, chat_history: List[Dict[str, str]]) -> str:
        """Process the request and provide a final answer to the user."""
        messages = [
            {"role": "system", "content": self.system_prompt}
        ] + chat_history
        return generate_response(self.client, messages)


class RouterAgent(Agent):
    """Meta-agent that routes requests to specialized agents."""
    def __init__(self, client: OpenAI):
        self.agents_memory = []
        self.chat_history = ChatHistory()
        super().__init__(
            name="RouterAgent",
            system_prompt="""You are a teacher agent that determines which specialized agent should handle a user's request.
Respond only with the name of the agent that should handle this request from the following options:
- FileAgent: For writing files
- DatabaseAgent: For querying and retrieving data
- FinalAnswerAgent: For final answer to user
Only rely on information from the database; if the information is not there, say you don't know.""",
            client=client
        )

        # Initialize agents
        self.file_agent = FileAgent(client)
        self.database_agent = DatabaseAgent(client)
        self.final_answer_agent = FinalAnswerAgent(client)
        
        # Mapping of agent names to agent instances
        self.agents = {
            "fileagent": self.file_agent,
            "databaseagent": self.database_agent,
            "finalansweragent": self.final_answer_agent
        }
    
    def process(self, user_input, chat_history):
        messages = [
            {"role": "system", "content": self.system_prompt}
        ] + chat_history
        return generate_response(self.client, messages)

    def route_request(self, user_input: str, chat_history: List[Dict[str, str]]) -> Tuple[str, str]:
        """Route the user request to the appropriate agent."""
        # Ask the teacher agent which specialized agent should handle this
        routing_decision = self.process(user_input, chat_history).lower()

        # Parse the routing decision to get the agent name
        selected_name = "finalansweragent"  # default when no specific agent is identified
        for agent_name in self.agents:
            if agent_name in routing_decision:
                selected_name = agent_name
                break
        selected_agent = self.agents[selected_name]

        # Get response from the selected agent
        response = selected_agent.process(user_input, chat_history)
        
        # Return the name of the agent that actually ran, not the loop variable
        return selected_name, response

    def process_input(self, user_input: str) -> str:
        """Process user input through the agent system."""
        # Add user message to chat history
        self.chat_history.add_message("user", user_input)
        agent = None
        max_tries = 3
        while max_tries > 0 and (agent is None or agent.lower() != 'finalansweragent'):
            max_tries -= 1          
            # Route the request to the appropriate agent
            agent, response = self.route_request(
                user_input,
                self.chat_history.get_history_for_prompt() + [m.to_dict() for m in self.agents_memory]
            )
            if agent.lower() != 'finalansweragent':
                agent_response = AgentResponse(
                    timestamp=str(datetime.now()),
                    name=agent,
                    input=user_input,
                    output=response
                )
                self.agents_memory.append(agent_response)

        # Add agent response to chat history
        self.chat_history.add_message("assistant", response)

        return response


agent_system = RouterAgent(
    client=OpenAI(  
        base_url = 'http://localhost:11434/v1',
        api_key='ollama', # required, but unused
    )
)

# Example interactions
# query = "Tell me what information you have about artificial intelligence."
query = "Create a file summarizing what you know about climate and energy."
print(f"User: {query}")
response = agent_system.process_input(query)
print(f"System: {response}")
print("="*50)
print("Agents memories:")
for a in agent_system.agents_memory:
    print(a)

# Thanks for listening