<div style="background: linear-gradient(135deg, #034694 0%, #1E8449 50%, #D4AC0D 100%); color: white; padding: 20px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.2);">
    <h1 style="color: #FFF; text-shadow: 1px 1px 3px rgba(0,0,0,0.5);">🔍 | Step 1: Understand Zava Scenario</h1>
    <p style="font-size: 16px; line-height: 1.6;">
        By now you have set up your infrastructure and have a valid Azure AI Foundry project with the required models deployed. You've created an Azure AI Search resource with a zava-products index containing 50 products. And you have connected the local development environment to your provisioned backend by using Azure CLI login to establish credentials and updating the .env file to configure local variables. <b>It's time to look at the data and begin our model customization journey!</b>
    </p>
</div>

## 0. Add More Models

This is the start of "Act 2", where we look at the Fine-Tuning options available in Azure AI Foundry. For this section, let's deploy a few additional models that we will be using over the next couple of demos, to get an intuitive sense for how to <em>pick</em> the best starting model for our needs - then <em>customize</em> it to provide the required optimization - and <em>evaluate</em> it to see if it has improved over the base model we started with.

For now, just add these models using the Azure AI Foundry Portal UI:
1. Visit the Azure AI Foundry portal and open your project page
1. Select the Models + Endpoints tab in the left sidebar
1. Deploy the required models until you have this specific set:
    - Reasoning Models → o3, o3-mini, o4-mini
    - Chat Models → gpt-4o, gpt-4.1, gpt-4.1-nano
    - Embedding Models → text-embedding-ada-002

Your `.env` variables should already be set to the required Azure OpenAI endpoint - we are ready to go!


In [1]:
# ........ Set up an Azure OpenAI client and test out different models
import os
import time
from openai import AzureOpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Initialize Azure OpenAI client
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2024-05-01-preview"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
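
Before calling any model, it can help to confirm that the settings read above are actually present. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical helper name, and the variable names are the ones used in the client setup. It assumes `load_dotenv()` has already run.

```python
import os

# Variable names taken from the client setup in this notebook.
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing settings: {', '.join(missing)} - check your .env file")
else:
    print("All required Azure OpenAI settings are present")
```

`AZURE_OPENAI_API_VERSION` is left out of the required list because the client setup already falls back to a default value for it.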

### 1. Define Test Prompt

In [2]:
# ........ Define a test prompt that we'll use for all models
test_prompt = """
You are a home improvement assistant for Zava, a fictional hardware store.
Please give me a brief recommendation for a paint color for my living room.
Include one key feature of the paint and a price range.
"""

# List of models to test
models_to_test = [
    "gpt-4o",
    "gpt-4.1",
    "gpt-4.1-nano",
    "o3",
    #"o3-mini",
    #"o4-mini"
]

# Function to call a model and measure performance
def test_model(model_name, prompt):
    print(f"..... Testing model: {model_name}")
    start_time = time.time()
    
    try:
        params = {
            "model": model_name,
            "messages": [
                {"role": "system", "content": "You are Cora, a polite, factual and helpful assistant for Zava, a DIY hardware store"},
                {"role": "user", "content": prompt}
            ]
        }
        
        # Add the appropriate token limit parameter based on model type
        if model_name.startswith("o"):
            params["max_completion_tokens"] = 300
        else:
            params["max_tokens"] = 300
        
        response = client.chat.completions.create(**params)
        
        end_time = time.time()
        latency = end_time - start_time
        
        # Extract response and token usage
        content = response.choices[0].message.content
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        total_tokens = response.usage.total_tokens
        
        return {
            "model": model_name,
            "latency": latency,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": total_tokens,
            "response": content
        }
    
    except Exception as e:
        print(f"❌ Error with model {model_name}: {str(e)}")
        return {
            "model": model_name,
            "error": str(e)
        }

### 2. Run Model Tests

In [3]:
# Test each model with the same prompt
results = []
for model in models_to_test:
    try:
        result = test_model(model, test_prompt)
        results.append(result)
    except Exception as e:
        print(f"Exception outside test_model for {model}: {str(e)}")
        results.append({"model": model, "error": str(e)})

# Store the detailed output for later, but don't display all of it 
detailed_outputs = {}
for result in results:
    if "error" not in result:
        detailed_outputs[result["model"]] = {
            "response": result["response"],
            "latency": result["latency"],
            "prompt_tokens": result["prompt_tokens"],
            "completion_tokens": result["completion_tokens"],
            "total_tokens": result["total_tokens"]
        }
    else:
        detailed_outputs[result["model"]] = {
            "response": f"ERROR: {result.get('error', 'Unknown error')}",
            "latency": None,
            "prompt_tokens": None,
            "completion_tokens": None,
            "total_tokens": None
        }


..... Testing model: gpt-4o
..... Testing model: gpt-4.1
..... Testing model: gpt-4.1-nano
..... Testing model: o3


### 3. Visualize Results

You may see something like this (taken from a previous run). Note how the same prompt yields different latency and token usage metrics for different models. In this instance, gpt-4.1 has the lowest total token usage (but the highest latency), while the o3 reasoning model has the highest token usage (likely due to the reasoning tokens used). Note how _gpt-4.1-nano_ has the lowest latency (with a slightly higher total token cost), while the gpt-4o model is in the middle.

While these results are not conclusive, they offer us some intuition into two metrics (token usage and latency) that are key optimization targets for our assistant. **Note** that these results are _not_ grounded in Zava data (and therefore not accurate). Orchestrating a RAG-based solution would incur added token costs (to capture context in the prompt) and latency (to retrieve and augment relevant results).

**Model performance metrics (previous run)**

| Model | Latency (s) | Prompt Tokens | Completion Tokens | Total Tokens |
|:---|:---|:---|:---|:---|
| gpt-4o | 1.36 | 74 | 93 | 167 |
| gpt-4.1 | 2.95 | 74 | 62 | 136 |
| gpt-4.1-nano | 1.10 | 74 | 76 | 150 |
| o3 | 1.81 | 73 | 144 | 217 |
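
The comparison described above can be reproduced directly from the numbers. This is a small sketch that hard-codes the latency and total-token values from the previous-run table (a single run, so treat the ranking as illustrative only); the variable names are made up for this example.

```python
# Metrics copied from the previous-run table above: (latency in seconds, total tokens).
previous_run = {
    "gpt-4o": (1.36, 167),
    "gpt-4.1": (2.95, 136),
    "gpt-4.1-nano": (1.10, 150),
    "o3": (1.81, 217),
}

# Pick the model with the smallest latency, and the smallest/largest token totals.
fastest = min(previous_run, key=lambda m: previous_run[m][0])
leanest = min(previous_run, key=lambda m: previous_run[m][1])
heaviest = max(previous_run, key=lambda m: previous_run[m][1])

print(f"Lowest latency:       {fastest}")
print(f"Lowest total tokens:  {leanest}")
print(f"Highest total tokens: {heaviest}")
```

Latency in particular varies from call to call, so averaging over several runs would give a more trustworthy picture than one measurement.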


In [4]:
# ........ Now display the two clean tables
from IPython.display import display, HTML
import pandas as pd

# First table: Model Responses with left-aligned text
response_data = []
for model, data in detailed_outputs.items():
    # Truncate long responses for cleaner display (treat a missing response as empty)
    response = data["response"] or ""
    if len(response) > 300:
        response = response[:297] + "..."
    response_data.append({"Model": model, "Response": response})

response_df = pd.DataFrame(response_data)
print("\n\n🤖 MODEL RESPONSES")
print("="*100)

# Custom HTML styling for left-aligned responses
html = response_df.to_html(index=False)
html = html.replace('<td>ERROR', '<td style="color:red">ERROR')
html = html.replace('<td>', '<td style="text-align: left;">')
display(HTML(html))

# Second table: Performance Metrics
metrics_data = []
for model, data in detailed_outputs.items():
    metrics_data.append({
        "Model": model,
        "Latency (s)": f"{data['latency']:.2f}" if data['latency'] is not None else "N/A",
        "Prompt Tokens": data['prompt_tokens'] if data['prompt_tokens'] is not None else "N/A",
        "Completion Tokens": data['completion_tokens'] if data['completion_tokens'] is not None else "N/A",
        "Total Tokens": data['total_tokens'] if data['total_tokens'] is not None else "N/A"
    })

metrics_df = pd.DataFrame(metrics_data)
print("\n\n📊 MODEL PERFORMANCE METRICS")
print("="*100)
display(HTML(metrics_df.to_html(index=False)))



🤖 MODEL RESPONSES


| Model | Response |
|:---|:---|
| gpt-4o | Certainly! A soft, warm gray paint like "Cloud Drift" is a fantastic choice for your living room—it creates a cozy and modern feel while complementing a variety of furniture styles. Look for a satin finish, which is durable and easy to clean. At Zava, similar paint options typically range between... |
| gpt-4.1 | I recommend "Calm Gray" for your living room—a versatile, light gray shade that brightens the space and pairs well with most décor styles. This paint features excellent washability, making it ideal for high-traffic areas. At Zava, our premium interior paint ranges from $28 to $42 per gallon. |
| gpt-4.1-nano | Certainly! I recommend considering "Soft Sage" for your living room. It’s a calming, versatile green hue that creates a welcoming atmosphere. A key feature is its low VOC formula, making it environmentally friendly and safe for indoor use. The price range for a 2.5-gallon can typically falls betw... |
| o3 | For a versatile, modern look, consider Zava’s Soft Dove Grey interior paint. \n• Key feature: low-VOC, washable finish—ideal for high-traffic areas like living rooms. \n• Price range: about $28–35 per gallon, depending on sheen (matte, eggshell, or satin). |




📊 MODEL PERFORMANCE METRICS


| Model | Latency (s) | Prompt Tokens | Completion Tokens | Total Tokens |
|:---|:---|:---|:---|:---|
| gpt-4o | 1.79 | 74 | 84 | 158 |
| gpt-4.1 | 3.54 | 74 | 65 | 139 |
| gpt-4.1-nano | 0.75 | 74 | 83 | 157 |
| o3 | 2.15 | 73 | 150 | 223 |


In [None]:
# Let's debug why o3-mini is failing
# Try invoking the o3-mini model directly to debug
# TODO: WHY DOES O3-MINI FAIL TO GENERATE OUTPUT??
'''
try:
    params = {
        "model": "o3-mini",
        "messages": [
            {"role": "system", "content": "You are Cora, a polite, factual and helpful assistant for Zava, a DIY hardware store"},
            {"role": "user", "content": test_prompt}
        ],
        "max_completion_tokens": 300
    }
    response = client.chat.completions.create(**params)
    print("o3-mini response:", response)
    print("Token usage:", response.usage)
except Exception as e:
    print("❌ Error invoking o3-mini:", str(e))

'''
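
One way to narrow down the o3-mini question is to inspect the finish reason and the token breakdown of the response instead of only its text. The sketch below rests on an assumption that this notebook does not confirm: that a reasoning model can spend its whole `max_completion_tokens` budget on hidden reasoning tokens and return no visible text. `diagnose_empty_output` is a hypothetical helper, and `completion_tokens_details.reasoning_tokens` may not be populated on every deployment or API version, so it is read defensively. A stand-in object is used so the helper can be tried without calling the service.

```python
from types import SimpleNamespace

def diagnose_empty_output(response):
    """Summarize why a chat completion may have come back with no visible text."""
    choice = response.choices[0]
    content = choice.message.content or ""
    # The reasoning-token breakdown is optional, so fall back to 0 when it is absent.
    details = getattr(response.usage, "completion_tokens_details", None)
    reasoning_tokens = getattr(details, "reasoning_tokens", None) or 0
    return {
        "finish_reason": choice.finish_reason,
        "visible_chars": len(content),
        "completion_tokens": response.usage.completion_tokens,
        "reasoning_tokens": reasoning_tokens,
        # Assumption: "length" with empty content suggests the token budget ran out.
        "likely_token_budget_issue": choice.finish_reason == "length" and not content,
    }

# Stand-in object shaped like a chat completion, so the helper can be tried offline.
fake = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="length", message=SimpleNamespace(content=""))],
    usage=SimpleNamespace(
        completion_tokens=300,
        completion_tokens_details=SimpleNamespace(reasoning_tokens=300),
    ),
)
print(diagnose_empty_output(fake))
```

If a real o3-mini response shows this pattern, raising `max_completion_tokens` for reasoning models would be a reasonable next experiment; if it does not, the cause lies elsewhere.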

<div style="height: 6px; margin: 30px 0; background: linear-gradient(90deg, #034694 0%, #1E8449 50%, #D4AC0D 100%); border-radius: 3px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);"></div>

## 1. Understand The Requirements

Our goal is to make the Cora chatbot **polite, factual, and helpful** to Zava shoppers. But what does this actually mean?

1. **Polite & Helpful** - This is about changing the _tone_ and _style_ of responses from Cora to follow a desired template.
1. **Factual** - This is about ensuring that responses are _grounded_ in Zava product data, typically using a RAG-based approach.
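
To make the "factual" requirement concrete, here is a rough sketch of how retrieved product facts could be placed in the prompt so that answers are grounded in them. It is an illustration under assumptions, not the approach this workshop prescribes: `build_grounded_messages` is a hypothetical helper, the `name`, `description` and `price` fields are assumed (the real zava-products schema is not shown here), and the sample record is a placeholder, not real Zava data. The retrieval step itself is out of scope.

```python
SYSTEM_PROMPT = "You are Cora, a polite, factual and helpful assistant for Zava, a DIY hardware store"

def build_grounded_messages(question, products):
    """Build chat messages whose system prompt carries retrieved product facts."""
    context_lines = [f"- {p['name']}: {p['description']} (${p['price']:.2f})" for p in products]
    context = "\n".join(context_lines) if context_lines else "(no matching products found)"
    system = (
        f"{SYSTEM_PROMPT}\n"
        "Answer only from the product list below. If it does not cover the question, say so.\n"
        f"Products:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# Hypothetical record for illustration only; not taken from the zava-products index.
sample_products = [{"name": "Example Interior Paint", "description": "placeholder description", "price": 0.0}]
messages = build_grounded_messages("What paint should I use for my living room?", sample_products)
print(messages[0]["content"])
```

The extra context lines are what drive the added prompt-token cost of a RAG-based solution, while tone and style remain a matter of how the model responds.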

**Desired Tone & Style**


<div style="height: 6px; margin: 30px 0; background: linear-gradient(90deg, #034694 0%, #1E8449 50%, #D4AC0D 100%); border-radius: 3px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);"></div>

## 2. Explore The Data

<div style="height: 6px; margin: 30px 0; background: linear-gradient(90deg, #034694 0%, #1E8449 50%, #D4AC0D 100%); border-radius: 3px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);"></div>

## 3. Try Prompt Engineering

<div style="height: 6px; margin: 30px 0; background: linear-gradient(90deg, #034694 0%, #1E8449 50%, #D4AC0D 100%); border-radius: 3px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);"></div>

## 4. Try Retrieval Augmented Generation

<div style="height: 6px; margin: 30px 0; background: linear-gradient(90deg, #034694 0%, #1E8449 50%, #D4AC0D 100%); border-radius: 3px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);"></div>

## 5. Time To Try Fine-Tuning!

<div style="display: flex; align-items: center; justify-content: center; height: 60px; margin: 30px 0; background: linear-gradient(90deg, #ff6ec4 0%, #7873f5 100%); border-radius: 8px; box-shadow: 0 2px 8px rgba(0,0,0,0.12); font-size: 1.5em; font-weight: bold; color: #fff;">
    Next: Be More Helpful With SFT
</div>