## LLM Workshop
By: Mohammed Alageel

Prerequisites:
- Basic understanding of Python programming concepts.
- A laptop

# 1. What Are LLMs?
Large Language Models (LLMs) are AI models trained on massive amounts of text to understand and generate human-like language.

## Key Concepts
- Based on transformer architecture (e.g., GPT, BERT).
- Trained to predict the next word (or token) in a sequence.
- Can generate, translate, summarize, and answer questions.
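
To make "predict the next word" concrete, here is a minimal sketch of a word-level bigram predictor. It is far simpler than a transformer and is only an illustration; the corpus and the `predict_next` helper are made up for this example.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real LLMs train on billions of tokens.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the most frequent follower of `word` in the corpus."""
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" more often than "mat" or "fish"
print(predict_next("sat"))  # "on" is the only word seen after "sat"
```

An LLM does the same kind of thing at a much larger scale: it assigns a probability to every possible next token given all the preceding tokens, not just the last word.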

## Tokens
LLMs process text by breaking it down into smaller units called **tokens**.

A token can represent a whole word, a part of a word (sub-word), or even a single character.

![Tokens](https://github.com/mo-100/LLM-workshop/blob/main/images/tokenizer.png?raw=1)

You can explore how text is tokenized using tools like the [OpenAI Tokenizer](https://platform.openai.com/tokenizer)
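
The sketch below shows how one text can split into whole-word, sub-word, and single-character tokens. It uses a greedy longest-match over a small hypothetical vocabulary (`VOCAB`); real tokenizers learn their vocabularies from data and use more refined algorithms, so treat this only as an illustration.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"token", "ization", "un", "believ", "able", " ", "is"}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match split; unknown spans fall back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # character-level fallback
            i += 1
    return tokens

print(tokenize("tokenization is unbelievable"))
print(tokenize("unable!"))  # "!" is not in VOCAB, so it becomes its own token
```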

## Parameters
The complexity and capability of an LLM are often related to its number of **parameters**. These are the internal variables the model learns during training.

Model sizes typically range from about 1B parameters (small) to 100B+ parameters (large).

Generally, models with more parameters have greater capacity, but also require more computational resources.

![Parameters](https://github.com/mo-100/LLM-workshop/blob/main/images/model_size.png?raw=1)

[Ollama Gemma3 Model](https://ollama.com/library/gemma3)
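
As a rough guide to why more parameters need more resources, the sketch below estimates the memory needed just to hold a model's weights. The helper name `weight_memory_gib` is made up for this example, and the estimate ignores activations, caches, and framework overhead.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

for name, params in [("1B", 1e9), ("7B", 7e9), ("70B", 70e9)]:
    row = ", ".join(
        f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```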

## Context Length
The **context length** defines the maximum amount of text (measured in tokens) that an LLM can consider at one time when processing input or generating output.


This limit varies between different models, typically ranging from a few thousand (e.g., 4k) to over a hundred thousand (e.g., 128k) tokens.


![Context](https://github.com/mo-100/LLM-workshop/blob/main/images/context_length.png?raw=1)


[HuggingFace LLama4 Link](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct)
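
When a conversation grows past the context length, something has to be dropped. One simple strategy, sketched here, keeps only the most recent messages that fit a token budget. The `count_tokens` helper is a crude assumption (one token per whitespace-separated word); a real application would count with the model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in: one token per whitespace-separated word (an assumption)."""
    return len(text.split())

def fit_to_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    "Hello there",                     # 2 tokens
    "Hi, how can I help you today?",   # 7 tokens
    "Summarize the plot of Hamlet",    # 5 tokens
]
print(fit_to_context(history, max_tokens=12))  # the oldest message is dropped
```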

## Popular LLMs
- ChatGPT (OpenAI)
- Claude (Anthropic)
- Gemini (Google)
- LLaMA (Meta)
- Qwen (Alibaba)

## Common Use Cases
- Writing assistance (emails, blog posts)
- Code generation
- Chatbots & virtual assistants
- Language translation

# 2. Basic LLM Usage
How do you use an LLM?

You give it a prompt (a message or instruction), and it returns a response.

## Examples
Prompt: “Write a birthday message for a 10-year-old.”\
Output: “Happy 10th Birthday! Hope your day is filled with fun and cake!”

Prompt: “What’s the capital of France?”\
Output: “Paris.”

## APIs
- Low-level: Transformers
- High-level: OpenAI-compatible APIs (cloud services or a local Ollama server)


To install the package requirements for this notebook:
```bash
pip install -r requirements.txt
```

## Low level: Transformers
Low-level libraries such as Hugging Face `transformers` offer fine-grained control over tokenization, model loading, and generation.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

In [None]:
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

generate_ids = model.generate(**inputs, max_new_tokens=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

## High level: OpenAI
High-level APIs provide simpler interfaces (e.g., OpenAI's API, usable with cloud services or local tools like Ollama).

Get a free API key from [Google AI Studio](https://aistudio.google.com). The cells in this section call Gemini through its OpenAI-compatible endpoint.

In [None]:
from google.colab import userdata
GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
OPENAI_BASE_URL = 'https://generativelanguage.googleapis.com/v1beta/openai/'

In [None]:
from openai import OpenAI

client = OpenAI(
    api_key=GEMINI_API_KEY,
    base_url=OPENAI_BASE_URL
)

In [None]:
response = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What’s the capital of France?"},
    ]
)
print(response.choices[0].message.content)

## Ollama
Ollama provides a convenient way to run various open-source LLMs directly on your own machine.\
It exposes an API endpoint compatible with the OpenAI API standard, allowing you to use the same `openai` Python library to interact with local models.\
**Terminal Commands**:
```bash
ollama run qwen2.5:1.5b
```
To run a larger version (e.g., Qwen 7B):

```bash
ollama run qwen2.5:7b
```

In [None]:
from openai import OpenAI

client = OpenAI(
    base_url = 'http://localhost:11434/v1',
    api_key= 'ollama',
)

response = client.chat.completions.create(
    model="qwen2.5:1.5b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What’s the capital of France?"},
    ]
)
print(response.choices[0].message.content)

## LLM Hyperparameters
When generating text, you can influence the output using several parameters:
- **`max_tokens`**: Sets the maximum number of tokens the model should generate in its response.
- **`temperature`**: Controls the randomness of the output. A lower value (e.g., 0) makes the output more deterministic and focused (greedy decoding), while a higher value increases randomness and creativity.
- **`top_p` (Nucleus Sampling)**: Selects tokens from a cumulative probability distribution. Only the most probable tokens whose probabilities add up to `top_p` are considered. `top_p=1` considers all tokens, while lower values restrict choices.
- **`top_k`**: Selects only the `k` most likely tokens at each step. `top_k=1` is equivalent to greedy decoding.
                
**Warning:** Setting `temperature` very high without constraints like `max_tokens` can sometimes lead to repetitive or nonsensical output loops.
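
The following sketch shows what `temperature`, `top_k`, and `top_p` do to a next-token distribution. The vocabulary, the logits, and the helper names are made up for illustration; libraries implement the same ideas on tensors and with more edge-case handling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0 here
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k(probs, k):
    """Keep the k most likely tokens, zero the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def filter_top_p(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

vocab = ["Paris", "London", "Rome", "banana"]
logits = [4.0, 2.0, 1.0, -1.0]  # made-up scores for the next token

# Low temperature sharpens the distribution; high temperature flattens it.
for t in (0.5, 1.0, 5.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])

probs = softmax(logits)
print(filter_top_k(probs, 1))    # only "Paris" survives (greedy decoding)
print(filter_top_p(probs, 0.9))  # "Paris" and "London" survive

rng = random.Random(0)
print(rng.choices(vocab, weights=filter_top_p(probs, 0.9), k=5))
```

A temperature of exactly 0 would divide by zero in this sketch; in practice it is treated as greedy decoding, which always picks the most likely token.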

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")

In [None]:
prompt = "What’s the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")

generate_ids = model.generate(
                        **inputs,
                        max_new_tokens=30,  # cap the output length; at very high temperature the model may not emit an end-of-sequence token on its own
                        do_sample=True,
                        temperature=5.0,
                        # top_k=1,
                        top_p=1,
                        )
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

## Hallucination
LLMs can sometimes generate text that sounds plausible but is factually incorrect or nonsensical. This phenomenon is known as **hallucination**.

It occurs because models predict likely sequences of words based on patterns in their training data, without true understanding or access to real-time facts.

One way to mitigate this is to instruct the model in the prompt to state when it doesn't know an answer, for example:
```python
"Answer the following question. If you do not know the answer or cannot find it in the provided context, respond with 'I don’t know.'"
```

## Tokenizer

**Tokenization** is the fundamental process of converting raw text into a sequence of tokens (numerical IDs) that the model can understand.

The specific way text is broken down depends on the **tokenizer** used, which is typically paired with a specific LLM.
![tokenizer](https://github.com/mo-100/LLM-workshop/blob/main/images/tokenizer.png?raw=1)

In [None]:
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

print(inputs.input_ids)

print(inputs.input_ids[0][0], tokenizer.decode(inputs.input_ids[0][0]))

# 3. Prompt Engineering
Prompt Engineering is the craft of designing effective inputs to get useful outputs from an LLM.

## Techniques
1. Zero-shot: Ask directly.
```python
"Summarize this article. [ARTICLE]"
```
2. Few-shot: Give examples.
```python
"""Classify as Negative or Positive
example:
this drink made me vomit
output:
negative

[INPUT]
output:
"""
```
3. Chain-of-Thought: Ask the model to reason step by step before answering.
```python
"Think step by step. When I was 3 years old, my partner was 3 times my age. Now, I am 20 years old. How old is my partner?"
```
4. Role prompting: Assign the model a persona.
```python
"Act like a legal advisor and explain this contract. [CONTRACT]"
```
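
Few-shot prompts are often assembled programmatically from a list of labeled examples. The sketch below shows one way to do that; the `build_few_shot_prompt` helper and the `Input:`/`Output:` layout are illustrative choices, not a required format.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, labeled examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    # End with an open "Output:" so the model fills in the label.
    lines += [f"Input: {new_input}", "Output:"]
    return "\n".join(lines)

examples = [
    ("this drink made me vomit", "negative"),
    ("best coffee I have had in years", "positive"),
]
prompt = build_few_shot_prompt(
    "Classify as Negative or Positive", examples, "the service was slow"
)
print(prompt)
```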

## Tips
- Provide examples.
- Be specific about the output.
- Prefer instructions (what to do) over constraints (what not to do).
- Experiment and iterate.

In [None]:
prompt = """Answer without explaining
When I was 3 years old, my partner was 3 times my age.
Now, I am 20 years old.
How old is my partner?
Age="""

# prompt = """Think step by step
# When I was 3 years old, my partner was 3 times my age.
# Now, I am 20 years old.
# How old is my partner?"""

response = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[
        {"role": "user", "content": prompt},
    ],
    temperature=0,
)
print(response.choices[0].message.content)

In [None]:
from openai import OpenAI

client = OpenAI(
    api_key=GEMINI_API_KEY,
    base_url=OPENAI_BASE_URL
)

customer_order = "Now, I would like a large pizza, with the first half cheese and mozzarella. And the other tomato sauce, ham and pineapple."
prompt = """Parse a customer's pizza order into valid JSON:
EXAMPLE:
I want a small pizza with cheese, tomato sauce, and pepperoni.
JSON Response:
```json
{
"size": "small",
"type": "normal",
"ingredients": [["cheese", "tomato sauce", "pepperoni"]]
}
```""" + "\n" + customer_order + "\nJSON Response:"


response = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[
        {"role": "user", "content": prompt},
    ],
    temperature=0,
)
print(response.choices[0].message.content)

## Parsing Structured Output
When you need the LLM's output to be used programmatically (e.g., feeding data into another system), it's crucial to get structured data like JSON.

You can prompt the model to generate JSON. However, the output might not always be perfectly valid.

Libraries like **`Pydantic`** are excellent for:
1.  Defining the expected data structure using Python classes.
2.  Parsing the LLM's JSON output.
3.  Validating that the parsed data conforms to the defined structure.


**JSON Repair Libraries:** Tools like **`json-repair`** can attempt to automatically fix common errors in malformed JSON strings before parsing.


In [None]:
from pydantic import BaseModel

class Order(BaseModel):
    size: str
    type: str
    ingredients: list[list[str]]

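raw = response.choices[0].message.content.strip()
# Remove the surrounding Markdown code fence, if present.
# (lstrip/rstrip remove sets of characters, not a prefix or suffix, so they are unreliable here.)
json_str = raw.removeprefix('```json').removesuffix('```').strip()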
print(json_str)
order = Order.model_validate_json(json_str)
print(order)
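
Model replies do not always arrive as bare JSON; they may be wrapped in a Markdown code fence or preceded by a sentence. The sketch below, which uses only the standard library, extracts and checks such a reply. The helper names `extract_json` and `validate_order` and the sample reply are made up for this example; Pydantic performs the validation step far more thoroughly.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable

def extract_json(reply: str) -> dict:
    """Parse the JSON inside the first code fence, or the whole reply if unfenced."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply
    return json.loads(payload.strip())

def validate_order(data: dict) -> dict:
    """Minimal manual checks on the expected structure."""
    for key in ("size", "type"):
        if not isinstance(data.get(key), str):
            raise ValueError(f"'{key}' must be a string")
    ingredients = data.get("ingredients")
    if not (isinstance(ingredients, list)
            and all(isinstance(part, list) for part in ingredients)):
        raise ValueError("'ingredients' must be a list of lists")
    return data

# Made-up reply in the style an LLM might produce.
reply = (
    "Here is the order:\n" + FENCE + "json\n"
    '{"size": "large", "type": "half-half", '
    '"ingredients": [["cheese", "mozzarella"], ["tomato sauce", "ham", "pineapple"]]}\n'
    + FENCE
)
order_data = validate_order(extract_json(reply))
print(order_data["size"], order_data["ingredients"])
```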

# 4. Retrieval-Augmented Generation (RAG)

**RAG** stands for **Retrieval-Augmented Generation**. It's a powerful technique that enhances LLM responses by providing them with relevant information retrieved from an external knowledge source (like a collection of documents or a database) before generation.

This helps produce more accurate, up-to-date, and context-aware answers, especially for domain-specific or recent information not present in the LLM's original training data.


## Text Embeddings
A core component of RAG is **text embedding**. This process converts pieces of text into numerical vectors (lists of numbers).

These vectors are designed to capture the semantic meaning of the text, such that texts with similar meanings have vectors that are close to each other in the vector space.

![embeddings](https://github.com/mo-100/LLM-workshop/blob/main/images/emb.png?raw=1)

### Usages
- Semantic Search
- Recommendation Systems
- Text Classification
- Text Clustering


In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Best movie ever!")
vector[:20] # print the first 20 of the 384 dimensions

### Measuring Similarity
How is **"similarity"** measured between vectors? Common metrics include:
- **Cosine Similarity:** Measures the cosine of the angle between two vectors. It focuses on the orientation, not the magnitude. Values range from -1 (opposite) to 1 (identical), with 0 indicating orthogonality (no similarity). It's very common for text embeddings.
- **Euclidean Distance (L2 Distance):** The straight-line distance between the endpoints of two vectors in the vector space. Lower values mean higher similarity. Vectors must typically be normalized for this to be effective for semantic similarity.
- **Dot Product:** Can be used, especially if vectors are normalized (then it becomes equivalent to Cosine Similarity).
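
The three metrics can be written in a few lines of plain Python. This sketch uses small made-up vectors instead of real embeddings, and the helper names are chosen for this example. It also checks the claim that, for normalized vectors, the dot product equals cosine similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    length = norm(a)
    return [x / length for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice the length
c = [-1.0, -2.0, -3.0]  # opposite direction to a

print(round(cosine_sim(a, b), 6))          # close to 1: same orientation
print(round(cosine_sim(a, c), 6))          # close to -1: opposite orientation
print(round(euclidean_distance(a, b), 3))  # nonzero: the magnitudes differ

# After normalization, the dot product matches cosine similarity.
print(round(dot(normalize(a), normalize(b)), 6))
```

Note that `a` and `b` are identical under cosine similarity yet far apart under Euclidean distance, which is why normalization matters when using L2 distance for semantic search.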

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

str1 = "i told the teacher i forgot to do the assignment"
emb1 = model.encode(str1)
str2 = "the dog ate my homework"
emb2 = model.encode(str2)
str3 = "my cat is missing"
emb3 = model.encode(str3)

print(f"""cos_similarity("{str1}", "{str2}") = {cosine_similarity([emb1], [emb2])[0][0]:.2f}""")
print(f"""cos_similarity("{str1}", "{str3}") = {cosine_similarity([emb1], [emb3])[0][0]:.2f}""")
print(f"""cos_similarity("{str2}", "{str3}") = {cosine_similarity([emb2], [emb3])[0][0]:.2f}""")

## Vector Database
**Vector databases** are specialized databases designed to efficiently store and search through large collections of embedding vectors.

Their key capability is performing **similarity searches**. Given a query vector (representing a question or topic), the database can quickly find the stored vectors (representing documents or data chunks) that are most similar in meaning.

![vector2](https://github.com/mo-100/LLM-workshop/blob/main/images/vector_search.png?raw=1)


### Typical RAG Indexing and Querying Workflow
1. Embed all texts (e.g., via SentenceTransformers or OpenAI)
2. Store the embeddings in a vector DB (e.g., FAISS, ChromaDB, Pinecone, Weaviate)
3. Embed the new query text
4. Search the vector DB with the query embedding → get the most similar documents


In [None]:
!pip install faiss-cpu

In [None]:
import numpy as np
import faiss
from typing import List, Callable
from sentence_transformers import SentenceTransformer

def make_index(texts: List[str], embedding_function: Callable[[List[str]], np.ndarray]) -> faiss.Index:
    embeddings = embedding_function(texts)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings.astype(np.float32))
    return index

def make_query(query: str, embedding_function: Callable[[List[str]], np.ndarray],
               index: faiss.Index, k: int = 5) -> np.ndarray:
    query_embedding = embedding_function([query])
    _, indices = index.search(query_embedding.astype(np.float32), k)
    return indices[0]

texts = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Python is a popular programming language for data science",
    "Neural networks have revolutionized natural language processing",
    "FAISS is a library for efficient similarity search",
    "Vector databases store and retrieve embeddings efficiently",
    "Climate change is affecting global weather patterns",
    "Renewable energy sources include solar and wind power",
    "Electric vehicles are becoming increasingly popular",
    "Quantum computing uses quantum bits or qubits",
    "Blockchain technology enables secure decentralized transactions",
    "Healthy eating involves consuming a balanced diet",
    "Regular exercise improves physical and mental health",
    "Space exploration has led to many technological advances",
    "The Great Barrier Reef is the world's largest coral reef system",
    "Digital transformation is changing how businesses operate",
    "Cybersecurity protects systems from digital attacks",
    "Artificial intelligence can solve complex problems",
    "Cloud computing delivers computing services over the internet",
    "Data privacy concerns are growing in the digital age"
]

model = SentenceTransformer("all-MiniLM-L6-v2")
faiss_index = make_index(texts, model.encode)
user_query = "What is artificial intelligence?"
results = make_query(user_query, model.encode, faiss_index, k=5)

for i in results:
    print(i, texts[i])

## RAG Code

### How It Works
1. User query → embedding
2. Search vector DB → find relevant documents
3. Combine prompt + documents + query → full LLM prompt
4. LLM generates answer using retrieved context

### Example
1. “What’s in the company’s refund policy?”
2. Vector DB retrieves policy snippet.
3. LLM answers: “You can request a refund within 30 days...”

![rag](https://github.com/mo-100/LLM-workshop/blob/main/images/rag.png?raw=1)

### Benefits
- Reduces hallucination
- Up-to-date answers (even beyond model’s training data)
- Domain-specific accuracy

In [None]:
from openai import OpenAI

client = OpenAI(
    api_key=GEMINI_API_KEY,
    base_url=OPENAI_BASE_URL
)

user_query = "What is artificial intelligence?"
results = make_query(user_query, model.encode, faiss_index, k=5)
system_prompt = "You are a helpful assistant. Answer only from the documents given to you. If the answer is not in them, say 'I don't know'."
docs = [texts[i] for i in results]
docs_text = "\n\n".join(docs)
full_prompt = f"""{system_prompt}
DOCUMENTS:
{docs_text}"""

response = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[
        {"role": "system", "content": full_prompt},
        {"role": "user", "content": user_query},
    ],
    temperature=0.7
)
print("Full Prompt:")
print(full_prompt)
print('-----------------')
print("User Query:")
print(user_query)
print('-----------------')
print("Response:")
print(response.choices[0].message.content)

## Knowledge cutoffs
LLMs are trained on data up to a certain point in time (their **knowledge cutoff**). They typically lack information about events or developments occurring after that date.

RAG is an effective solution to this limitation, as it allows the model to access and incorporate current information retrieved from an up-to-date external knowledge source during the generation process.

## Chunking
For large documents, embedding the entire text at once can be inefficient and may dilute specific details. A common strategy is **chunking**: splitting the document into smaller, potentially overlapping, segments (chunks).

Each chunk is then embedded and stored individually. During retrieval, the system finds the most relevant chunks to provide focused context to the LLM.
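
A minimal sketch of overlapping chunking is shown below. It splits on whitespace and counts words, which is an assumption made for simplicity; production pipelines often chunk by tokens, sentences, or document structure instead. The `chunk_text` helper is made up for this example.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

document = " ".join(f"w{i}" for i in range(12))  # "w0 w1 ... w11"
for chunk in chunk_text(document, chunk_size=5, overlap=2):
    print(chunk)
```

Each printed chunk starts with the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.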

# 5. Fine-tuning
Fine-tuning means training an existing LLM on your custom dataset to specialize it.

## When to Use
- Need highly specific outputs **(style, tone or format)**
- Want to reduce cost (e.g., by replacing a large general-purpose model with a smaller specialized one)
- **Note:** Fine-tuning can be complex and resource-intensive. It's often considered after exploring prompt engineering and RAG.

## Example
A customer support team might fine-tune an LLM on transcripts of successful support interactions.

The resulting model could then generate more empathetic and contextually appropriate responses tailored to their company's products and policies.

## Tools
- OpenAI fine-tuning API (for smaller models)
- LoRA / PEFT (Parameter-Efficient Fine-Tuning)
- Hugging Face Transformers

Learn more about fine-tuning [here](https://www.youtube.com/watch?v=S9VHQhC3HPc)

## Model Fine-tuning Example
Most models are released in two variants:
- Base model
- Instruct fine-tune

![instruct](https://github.com/mo-100/LLM-workshop/blob/main/images/instruct.png?raw=1)

Base models only perform text completion: they continue whatever text they are given.\
Instruct models are fine-tuned from a base model to follow instructions and answer questions.

## Quantization
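**Quantization** is a technique used to reduce the memory footprint and computational cost of running LLMs. It stores the model's weights at lower numerical precision, for example as 8-bit or 4-bit integers instead of 16-bit or 32-bit floating-point numbers.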


![quant](https://github.com/mo-100/LLM-workshop/blob/main/images/quantization.png?raw=1)

This significantly decreases the model size and can speed up inference, often with only a small impact on performance. It's crucial for running larger models on consumer hardware.
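
To show the idea on a handful of numbers, here is a sketch of a simple "absmax" 8-bit scheme: every weight is divided by a shared scale and rounded to an integer in [-127, 127]. The weights and the helper names `quantize_int8` and `dequantize` are made up for illustration; libraries such as bitsandbytes use more refined schemes.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (absmax)."""
    # Assumes at least one non-zero weight, otherwise the scale would be zero.
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -0.9]  # made-up weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)  # small integers that fit in one byte each
print([round(r, 4) for r in restored])
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each weight now needs one byte instead of four (for 32-bit floats), at the cost of a rounding error of at most half the scale.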

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B",
    quantization_config=quantization_config
)
model_full = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
print(f'Q8 Quantization: {model_8bit.get_memory_footprint() / (1024 * 1024 * 1024):.2f}GB')
# Q8 Quantization: 1.66GB

print(f'Full Precision: {model_full.get_memory_footprint() / (1024 * 1024 * 1024):.2f}GB')
# Full Precision: 5.75GB


## Special Tokens and Chat Templates
LLMs and their tokenizers use **special tokens** – unique symbols that don't represent regular words but serve structural or functional purposes.

Examples include:
- `[CLS]`, `<s>`: Mark the beginning of a sequence.
- `[SEP]`, `</s>`: Indicate separation between segments or the end of a sequence.
- `[PAD]`: Used to pad shorter sequences to a uniform length in a batch.
- `[UNK]`: Represents tokens that were not in the tokenizer's vocabulary.

For chat models, specific tokens and formatting rules (**chat templates**) are used to delineate between system messages, user turns, and assistant turns. Applying the correct chat template is crucial for getting instruct/chat models to behave as expected.
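
As an illustration of what a chat template produces, the sketch below hand-formats messages in a ChatML-style layout using the `<|im_start|>` and `<|im_end|>` special tokens that Qwen models use. The `to_chatml` helper is made up for this example and is a simplification: the real template lives in the tokenizer and may differ in details such as a default system prompt, so prefer `apply_chat_template` in practice.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Format chat messages in a simplified ChatML-style layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
print(to_chatml(chat))
```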
            

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")

chat = [
  {"role": "system", "content": "You are a helpful assistant."}, # Try commenting this
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

tokenizer.apply_chat_template(chat, tokenize=False)

In [None]:
tokenizer.special_tokens_map

# 6. Tool Calling

LLMs can be enhanced by giving them access to external tools to fetch information or perform actions on behalf of the user. This capability is known as **tool calling** or **function calling**.

## Usages
- Calling APIs.
- Querying a database.
- Running code.

In [None]:
def get_weather(city_name: str):
    if city_name.lower() in {'riyadh'}:
        temp = 40
    else:
        temp = 20
    return f"Temperature={temp} Celsius"

def call_function(name, args):
    if name == "get_weather":
        return get_weather(**args)

tools = [
  {
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current temperature for the city in celsius.",
      "parameters": {
        "type": "object",
        "properties": {
            "city_name": {
                "type": "string",
                "description": "The name of the city"
            }
        },
        "required": ["city_name"],
      },
    }
  }
]

In [None]:
import json

query = {"role": "user", "content": "What's the weather like in London today?"}
messages = [query]

response = client.chat.completions.create(
  model="gemini-2.5-flash-preview-04-17",
  messages=messages,
  tools=tools,
  tool_choice="auto",
  temperature=0
)

tool_call = response.choices[0].message.tool_calls[0].function
args = json.loads(tool_call.arguments)
print(tool_call)
result = call_function(tool_call.name, args)
print(result)
messages.append({"role": "assistant", "content": f"{tool_call.name} called with input {json.loads(tool_call.arguments)} output: {str(result)}"})

In [None]:
response = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    temperature=0
)
print(response.choices[0].message.content)

## Security Warning
Granting LLMs the ability to execute actions (especially interacting with **file systems**, **databases**, or **external APIs**) introduces significant **security risks**. The LLM could be manipulated (via malicious prompts) into performing harmful actions.

Careful design, sandboxing, and validation of tool inputs/outputs are crucial.
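
One basic safeguard is to validate every tool call before executing it. The sketch below checks the tool name against an allowlist and the arguments against the expected names and types. The `ALLOWED_TOOLS` table and the `safe_call` helper are made up for this example, and this is only one layer of defense, not a complete security solution.

```python
import json

def get_weather(city_name: str) -> str:
    """Stand-in tool that returns a fixed reading."""
    return f"Temperature=20 Celsius in {city_name}"

# Allowlist: tool name -> (function, expected argument names and types)
ALLOWED_TOOLS = {
    "get_weather": (get_weather, {"city_name": str}),
}

def safe_call(name: str, raw_arguments: str) -> str:
    """Run a model-requested tool only if its name and arguments pass validation."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool not allowed: {name}")
    func, schema = ALLOWED_TOOLS[name]
    args = json.loads(raw_arguments)  # arguments arrive from the model as a JSON string
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError(f"Unexpected arguments for {name}: {args}")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise TypeError(f"Argument '{key}' must be {expected_type.__name__}")
    return func(**args)

print(safe_call("get_weather", '{"city_name": "London"}'))

try:
    safe_call("delete_files", '{"path": "/"}')
except PermissionError as error:
    print("Blocked:", error)
```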

# 7. Agents
LLM **Agents** are systems that use a Large Language Model as their central **reasoning engine** or "brain" to understand tasks, decide on a plan, execute steps, and iterate until a goal is achieved.

![agents](https://github.com/mo-100/LLM-workshop/blob/main/images/agentic-ai-workflow.png?raw=1)

## Components
- **LLM**: reasoning engine
- **Tools**: external APIs, databases, calculators
- **Memory**: stores past interactions
- **Router/Orchestrator**: the code logic that manages the agent's loop

## Example
An AI assistant that:
1. Receives a task: “Book me a flight to NY.”
2. Calls flight APIs.
3. Picks the cheapest option.
4. Sends a confirmation email.
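
The loop behind such an assistant can be sketched in a few lines. Everything here is hypothetical: `fake_llm` is a scripted stand-in for a real model, the flight data is invented, and the confirmation email step is omitted. The point is only the shape of the loop: ask the model for the next action, run the matching tool, record the result, and repeat until the model says it is finished.

```python
def fake_llm(history):
    """Scripted stand-in for an LLM: picks the next action from what has been done."""
    done = {step["action"] for step in history}
    if "search_flights" not in done:
        return {"action": "search_flights", "args": {"destination": "NY"}}
    if "book_flight" not in done:
        flights = history[-1]["result"]
        cheapest = min(flights, key=lambda flight: flight["price"])
        return {"action": "book_flight", "args": {"flight_id": cheapest["id"]}}
    return {"action": "finish", "args": {"message": history[-1]["result"]}}

def search_flights(destination):
    """Hypothetical tool returning invented flight options."""
    return [{"id": "A1", "price": 420}, {"id": "B2", "price": 310}]

def book_flight(flight_id):
    """Hypothetical tool that pretends to book a flight."""
    return f"Booked flight {flight_id}"

TOOLS = {"search_flights": search_flights, "book_flight": book_flight}

def run_agent(max_steps=5):
    """Orchestrator loop: decide, act, remember, repeat."""
    history = []  # the agent's memory of past actions and results
    for _ in range(max_steps):
        decision = fake_llm(history)
        if decision["action"] == "finish":
            return decision["args"]["message"]
        result = TOOLS[decision["action"]](**decision["args"])
        history.append({"action": decision["action"], "result": result})
    raise RuntimeError("Agent did not finish within max_steps")

print(run_agent())
```

The `max_steps` limit is a deliberate design choice: it stops the loop if the model never decides to finish.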


Learn more about agents [here](https://www.anthropic.com/engineering/building-effective-agents)

Code modified from [this repo](https://github.com/daveebbelaar/ai-cookbook/tree/main)

In [None]:
from pydantic import BaseModel, Field
import logging

# Set up logging configuration
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)

model_name = "gemini-2.5-flash-preview-04-17"

# --------------------------------------------------------------
# Step 1: Define the data models
# --------------------------------------------------------------


class SubTask(BaseModel):
    """Blog section task defined by orchestrator"""

    section_type: str = Field(description="Type of blog section to write")
    description: str = Field(description="What this section should cover")
    style_guide: str = Field(description="Writing style for this section")
    target_length: int = Field(description="Target word count for this section")


class OrchestratorPlan(BaseModel):
    """Orchestrator's blog structure and tasks"""

    topic_analysis: str = Field(description="Analysis of the blog topic")
    target_audience: str = Field(description="Intended audience for the blog")
    sections: list[SubTask] = Field(description="List of sections to write")


class SectionContent(BaseModel):
    """Content written by a worker"""

    content: str = Field(description="Written content for the section")
    key_points: list[str] = Field(description="Main points covered")


class SuggestedEdits(BaseModel):
    """Suggested edits for a section"""

    section_name: str = Field(description="Name of the section")
    suggested_edit: str = Field(description="Suggested edit")


class ReviewFeedback(BaseModel):
    """Final review and suggestions"""

    cohesion_score: float = Field(description="How well sections flow together (0-1)")
    suggested_edits: list[SuggestedEdits] = Field(
        description="Suggested edits by section"
    )
    final_version: str = Field(description="Complete, polished blog post")


# --------------------------------------------------------------
# Step 2: Define prompts
# --------------------------------------------------------------

ORCHESTRATOR_PROMPT = """
Analyze this blog topic and break it down into logical sections.

Topic: {topic}
Target Length: {target_length} words
Style: {style}

Return your response in this format:

# Analysis
Analyze the topic and explain how it should be structured.
Consider the narrative flow and how sections will work together.

# Target Audience
Define the target audience and their interests/needs.

# Sections
## Section 1
- Type: section_type
- Description: what this section should cover
- Style: writing style guidelines

[Additional sections as needed...]
"""

WORKER_PROMPT = """
Write a blog section based on:
Topic: {topic}
Section Type: {section_type}
Section Goal: {description}
Style Guide: {style_guide}
Target Length: {target_length} words

Previously written sections:
{previous_sections}

Return your response in this format:

# Content
[Your section content here, following the style guide]

# Key Points
- Main point 1
- Main point 2
[Additional points as needed...]
"""

REVIEWER_PROMPT = """
Review this blog post for cohesion and flow:

Topic: {topic}
Target Audience: {audience}

Sections:
{sections}

Provide a cohesion score between 0.0 and 1.0, suggested edits for each section if needed, and a final polished version of the complete post.

The cohesion score should reflect how well the sections flow together, with 1.0 being perfect cohesion.
For suggested edits, focus on improving transitions and maintaining consistent tone across sections.
The final version should incorporate your suggested improvements into a polished, cohesive blog post.
"""

# --------------------------------------------------------------
# Step 3: Implement orchestrator
# --------------------------------------------------------------


class BlogOrchestrator:
    def __init__(self):
        self.sections_content = {}

    def get_plan(self, topic: str, target_length: int, style: str) -> OrchestratorPlan:
        """Get orchestrator's blog structure plan"""
        completion = client.beta.chat.completions.parse(
            model=model_name,
            messages=[
                {
                    "role": "user",
                    "content": ORCHESTRATOR_PROMPT.format(
                        topic=topic, target_length=target_length, style=style
                    ),
                }
            ],
            response_format=OrchestratorPlan,
        )
        return completion.choices[0].message.parsed

    def write_section(self, topic: str, section: SubTask) -> SectionContent:
        """Worker: Write a specific blog section with context from previous sections.

        Args:
            topic: The main blog topic
            section: SubTask containing section details

        Returns:
            SectionContent: The written content and key points
        """
        # Create context from previously written sections
        previous_sections = "\n\n".join(
            [
                f"=== {section_type} ===\n{content.content}"
                for section_type, content in self.sections_content.items()
            ]
        )

        completion = client.beta.chat.completions.parse(
            model=model_name,
            messages=[
                {
                    "role": "user",
                    "content": WORKER_PROMPT.format(
                        topic=topic,
                        section_type=section.section_type,
                        description=section.description,
                        style_guide=section.style_guide,
                        target_length=section.target_length,
                        previous_sections=previous_sections
                        if previous_sections
                        else "This is the first section.",
                    ),
                }
            ],
            response_format=SectionContent,
        )
        return completion.choices[0].message.parsed

    def review_post(self, topic: str, plan: OrchestratorPlan) -> ReviewFeedback:
        """Reviewer: Analyze and improve overall cohesion"""
        sections_text = "\n\n".join(
            [
                f"=== {section_type} ===\n{content.content}"
                for section_type, content in self.sections_content.items()
            ]
        )

        completion = client.beta.chat.completions.parse(
            model=model_name,
            messages=[
                {
                    "role": "user",
                    "content": REVIEWER_PROMPT.format(
                        topic=topic,
                        audience=plan.target_audience,
                        sections=sections_text,
                    ),
                }
            ],
            response_format=ReviewFeedback,
        )
        return completion.choices[0].message.parsed

    def write_blog(
        self, topic: str, target_length: int = 1000, style: str = "informative"
    ) -> dict:
        """Process the entire blog writing task"""
        logger.info(f"Starting blog writing process for: {topic}")

        # Get blog structure plan
        plan = self.get_plan(topic, target_length, style)
        logger.info(f"Blog structure planned: {len(plan.sections)} sections")
        logger.info(f"Blog plan details: {plan.model_dump_json(indent=2)}")

        # Write each section
        for section in plan.sections:
            logger.info(f"Writing section: {section.section_type}")
            content = self.write_section(topic, section)
            self.sections_content[section.section_type] = content

        # Review and polish
        logger.info("Reviewing full blog post")
        review = self.review_post(topic, plan)

        return {"structure": plan, "sections": self.sections_content, "review": review}


# --------------------------------------------------------------
# Step 4: Example usage
# --------------------------------------------------------------

orchestrator = BlogOrchestrator()

# Example: Technical blog post
topic = "The impact of AI on software development"
result = orchestrator.write_blog(
    topic=topic, target_length=200, style="technical but accessible"
)

print("\nFinal Blog Post:")
print(result["review"].final_version)

print("\nCohesion Score:", result["review"].cohesion_score)
if result["review"].suggested_edits:
    for edit in result["review"].suggested_edits:
        print(f"Section: {edit.section_name}")
        print(f"Suggested Edit: {edit.suggested_edit}")

# Thanks for listening