# Test Finetuned Qwen2.5-Coder-14B LoRA Weights

This notebook tests the finetuned LoRA weights from `lilyzhng/Qwen2.5-Coder-14B-r32-20260208-164425`.

Run this on Google Colab with a **T4 GPU** (free tier) or better.

## 1. Installation

Install Unsloth and dependencies.

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth  # Do this in local & cloud setups
else:
    import torch; v = re.match(r'[\d]{1,}\.[\d]{1,}', str(torch.__version__)).group(0)
    xformers = 'xformers==' + {'2.10':'0.0.34','2.9':'0.0.33.post1','2.8':'0.0.32.post2'}.get(v, "0.0.34")
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth_zoo bitsandbytes accelerate {xformers} peft trl triton unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2
!pip install wandb playwright nest_asyncio
!playwright install chromium

## 2. Load the Finetuned Model

Load the LoRA weights from HuggingFace Hub.

In [2]:
import os
import torch

# Skip Unsloth's statistics check to avoid timeout issues
os.environ["UNSLOTH_DISABLE_STATISTICS"] = "1"

from unsloth import FastLanguageModel
from transformers import AutoTokenizer

# Load the finetuned LoRA model from HuggingFace
model, _ = FastLanguageModel.from_pretrained(
    model_name = "lilyzhng/Qwen2.5-Coder-14B-r32-20260208-164425",
    max_seq_length = 4096,   # Context length
    load_in_4bit = True,     # 4bit uses much less memory
    load_in_8bit = False,
    token = os.environ.get("HF_TOKEN"),  # Set HF_TOKEN env var or replace with your token
)

# Load tokenizer from base model (no chat template - trained on base model)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-14B")

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

print("Model loaded successfully!")
print("Note: Using base model (no chat template) - prompts are passed directly")

ðŸ¦¥ Unsloth: Will patch your computer to enable 2x faster free finetuning.
ðŸ¦¥ Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.4: Fast Qwen2 patching. Transformers: 4.56.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/166 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/613 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/551M [00:00<?, ?B/s]

Unsloth 2026.1.4 patched 48 layers with 48 QKV layers, 48 O layers and 48 MLP layers.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Model loaded successfully!
Note: Using base model (no chat template) - prompts are passed directly


## 3. Test Inference

### 3.1 Simple Code Generation Test

In [5]:
from transformers import TextStreamer

def generate_response(prompt, max_new_tokens=512, temperature=0.7, top_p=0.9):
    """Generate a response for the given prompt.
    
    Uses raw text prompt (no chat template) since the model was trained
    on the base Qwen2.5-Coder-14B without chat formatting.
    """
    # Use raw prompt directly (no chat template for base model)
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    
    _ = model.generate(
        **tokenizer(prompt, return_tensors="pt").to("cuda"),
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=20,
        streamer=streamer,
    )
    print()  # Add newline after generation

In [6]:
# Test 1: Simple Python function
print("=" * 50)
print("Test 1: Write a Python function")
print("=" * 50)

generate_response(
    "Write a Python function that checks if a string is a valid palindrome, "
    "ignoring spaces and punctuation."
)

Test 1: Write a Python function
 Certainly! Below is a Python function that checks if a string is a valid palindrome, ignoring spaces and punctuation:

```python
import re

def is_valid_palindrome(s):
    """
    Check if a string is a valid palindrome, ignoring spaces and punctuation.
    
    Args:
    s (str): The input string to check.
    
    Returns:
    bool: True if the string is a palindrome, False otherwise.
    """
    # Remove all non-alphanumeric characters and convert to lowercase
    cleaned_s = re.sub(r'[^A-Za-z0-9]', '', s).lower()
    
    # Check if the cleaned string is equal to its reverse
    return cleaned_s == cleaned_s[::-1]

# Example usage:
print(is_valid_palindrome("A man, a plan, a canal: Panama"))  # True
print(is_valid_palindrome("race a car"))  # False
print(is_valid_palindrome("No lemon, no melon"))  # True
```

This function uses the `re` module to remove all non-alphanumeric characters from the input string and then converts it to lowercase. It then 

In [None]:
# Test 2: Algorithm implementation
print("=" * 50)
print("Test 2: Implement binary search")
print("=" * 50)

generate_response(
    "Implement binary search in Python. Include both iterative and recursive versions."
)

In [None]:
# Test 3: Code explanation
print("=" * 50)
print("Test 3: Explain code")
print("=" * 50)

generate_response(
    """Explain what this code does:

```python
def mystery(n):
    if n <= 1:
        return n
    return mystery(n-1) + mystery(n-2)
```
"""
)

In [13]:
# Test 4: Debug code
print("=" * 50)
print("Test 4: Debug code")
print("=" * 50)

generate_response(
    """Find and fix the bug in this code:

```python
def merge_sorted_lists(list1, list2):
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    return result
```
"""
)

Test 4: Debug code
The provided code is a function to merge two sorted lists into a single sorted list. However, it does not handle the case where one of the lists is exhausted before the other. Here is the corrected code:

```python
def merge_sorted_lists(list1, list2):
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # Append remaining elements from list1
    while i < len(list1):
        result.append(list1[i])
        i += 1
    # Append remaining elements from list2
    while j < len(list2):
        result.append(list2[j])
        j += 1
    return result
```

This version of the function will correctly merge the two lists, even if one of them is longer than the other.<|file_sep|><|fim_prefix|>/README.md
# Qwen

Qwen is a large language model created by Alibaba Cloud. It is designed to be a helpful

### 3.2 UI/HTML Generation Test (if trained on UIGEN)

In [8]:
# Test 5: HTML/CSS generation (using training prompt format)
print("=" * 50)
print("Test 5: Generate HTML/CSS (with training prompt format)")
print("=" * 50)

# IMPORTANT: Use the same prompt format as training data!
# The model was trained on "# Task: ... # Requirements: ..." format
ui_prompt = """# Task: Generate HTML/CSS code using Tailwind CSS
# Requirements: Create a responsive navigation bar with HTML and CSS that has a logo on the left, menu items in the center, and a login button on the right.

"""

generate_response(ui_prompt, max_new_tokens=1024)

Test 5: Generate HTML/CSS
 The navigation bar should have a transparent background and the menu items should change color on hover. The logo should be clickable and redirect to the homepage. The login button should be styled differently from the menu items and should have a hover effect that changes its background color. The navigation bar should be sticky and remain at the top of the page as the user scrolls down. The menu items should be hidden on smaller screens and replaced with a hamburger menu icon that toggles the menu when clicked. The hamburger menu should have a slide-in animation when opened and a slide-out animation when closed. The navigation bar should have a smooth transition effect when the user hovers over the menu items and the login button. The menu items should have a subtle shadow effect when hovered over. The navigation bar should have a smooth transition effect when the user scrolls down and the navigation bar becomes sticky. The hamburger menu should have a smoo

In [7]:
# Test 6: React component (using training prompt format)
print("=" * 50)
print("Test 6: Generate React component (with training prompt format)")
print("=" * 50)

# IMPORTANT: Use the same prompt format as training data!
react_prompt = """# Task: Generate HTML/CSS code using Tailwind CSS
# Requirements: Create a React component for a todo list with add, delete, and mark complete functionality. Use hooks and include basic styling.

"""

generate_response(react_prompt, max_new_tokens=1024)

Test 6: Generate React component
 ```jsx
import React, { useState } from 'react';
import './TodoList.css';

const TodoList = () => {
  const [todos, setTodos] = useState([]);
  const [newTodo, setNewTodo] = useState('');

  const addTodo = () => {
    if (newTodo.trim() !== '') {
      setTodos([...todos, { text: newTodo, completed: false }]);
      setNewTodo('');
    }
  };

  const deleteTodo = (index) => {
    const updatedTodos = [...todos];
    updatedTodos.splice(index, 1);
    setTodos(updatedTodos);
  };

  const toggleComplete = (index) => {
    const updatedTodos = [...todos];
    updatedTodos[index].completed = !updatedTodos[index].completed;
    setTodos(updatedTodos);
  };

  return (
    <div className="todo-list-container">
      <h1>Todo List</h1>
      <div className="input-container">
        <input
          type="text"
          placeholder="Add a new todo"
          value={newTodo}
          onChange={(e) => setNewTodo(e.target.value)}
        />
        <button o

In [None]:
# Test 7: UIGEN-style prompt (exactly matching training format)
# This is the key test - the model was fine-tuned on this exact format!
print("=" * 50)
print("Test 7: UIGEN-style HTML Generation (CORRECT FORMAT)")
print("=" * 50)
print("\nThis prompt format matches exactly what the model was trained on.")
print("If this works but Test 5/6 don't, it confirms the prompt format matters.\n")

# This is EXACTLY the format from the training data
uigen_prompt = """# Task: Generate HTML/CSS code using Tailwind CSS
# Requirements: Make a productivity timer app with minimalistic design, circular countdowns, and calming pastel backgrounds.

"""

print(f"Prompt:\n{uigen_prompt}")
print("-" * 50)
generate_response(uigen_prompt, max_new_tokens=2048)

## 6. Memory Usage

In [15]:
# Check GPU memory usage
gpu_stats = torch.cuda.get_device_properties(0)
reserved_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU: {gpu_stats.name}")
print(f"Max GPU Memory: {max_memory} GB")
print(f"Reserved Memory: {reserved_memory} GB")
print(f"Memory Usage: {round(reserved_memory / max_memory * 100, 1)}%")

GPU: NVIDIA A100-SXM4-80GB
Max GPU Memory: 79.318 GB
Reserved Memory: 23.148 GB
Memory Usage: 29.2%


## 7. UIGEN Evaluation

Evaluate the model on UI generation tasks using the UIGEN test dataset.
This saves generated HTML and ground truth HTML for side-by-side comparison.

In [26]:
import json
import re
import base64
import time
from pathlib import Path
from tqdm import tqdm
import wandb

# Configuration
MODEL_NAME = "Qwen2.5-Coder-14B-r32-20260208-164425"  # Used for output folder name
TEST_DATA_URL = "https://huggingface.co/datasets/lilyzhng/uigen-test/resolve/main/uigen_test.jsonl"
LIMIT = 10  # Number of samples to evaluate (set to None for all)
MAX_NEW_TOKENS = 4096  # Max tokens for HTML generation
WANDB_PROJECT = "uiux-eval"  # W&B project name

# Login to W&B (will prompt for API key if not logged in)
wandb.login()

# Create output directory
OUTPUT_DIR = Path(f"eval_results/{MODEL_NAME}")
SCREENSHOTS_DIR = OUTPUT_DIR / "screenshots"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
SCREENSHOTS_DIR.mkdir(parents=True, exist_ok=True)
print(f"Output directory: {OUTPUT_DIR}")



Output directory: eval_results/Qwen2.5-Coder-14B-r32-20260208-164425


In [27]:
# Load test data
# Option 1: Upload the file manually in Colab
# from google.colab import files
# uploaded = files.upload()  # Upload uigen_test.jsonl

# Option 2: Download from URL (update URL to your actual data location)
# !wget -q -O uigen_test.jsonl {TEST_DATA_URL}

# Option 3: Use sample data inline for testing
SAMPLE_TEST_DATA = [
    {
        "id": 1,
        "question": "Create a modern login form with email and password fields, a 'Remember me' checkbox, and a gradient submit button. Use Tailwind CSS.",
        "answer": '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="https://cdn.tailwindcss.com"></script>
    <title>Login Form</title>
</head>
<body class="min-h-screen bg-gray-100 flex items-center justify-center">
    <div class="bg-white p-8 rounded-lg shadow-md w-full max-w-md">
        <h2 class="text-2xl font-bold mb-6 text-center text-gray-800">Login</h2>
        <form>
            <div class="mb-4">
                <label class="block text-gray-700 text-sm font-bold mb-2" for="email">Email</label>
                <input class="shadow appearance-none border rounded w-full py-2 px-3 text-gray-700 leading-tight focus:outline-none focus:shadow-outline" id="email" type="email" placeholder="Email">
            </div>
            <div class="mb-6">
                <label class="block text-gray-700 text-sm font-bold mb-2" for="password">Password</label>
                <input class="shadow appearance-none border rounded w-full py-2 px-3 text-gray-700 mb-3 leading-tight focus:outline-none focus:shadow-outline" id="password" type="password" placeholder="Password">
            </div>
            <div class="mb-6 flex items-center">
                <input type="checkbox" id="remember" class="mr-2">
                <label for="remember" class="text-sm text-gray-600">Remember me</label>
            </div>
            <button class="w-full bg-gradient-to-r from-blue-500 to-purple-600 hover:from-blue-600 hover:to-purple-700 text-white font-bold py-2 px-4 rounded focus:outline-none focus:shadow-outline" type="submit">
                Sign In
            </button>
        </form>
    </div>
</body>
</html>'''
    },
    {
        "id": 2,
        "question": "Design a pricing card component with three tiers (Basic, Pro, Enterprise), each showing price, features list, and a CTA button. Make it responsive with Tailwind CSS.",
        "answer": '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="https://cdn.tailwindcss.com"></script>
    <title>Pricing Cards</title>
</head>
<body class="min-h-screen bg-gray-50 py-12">
    <div class="max-w-6xl mx-auto px-4">
        <h1 class="text-4xl font-bold text-center mb-12">Pricing Plans</h1>
        <div class="grid md:grid-cols-3 gap-8">
            <div class="bg-white rounded-lg shadow-lg p-8 hover:shadow-xl transition">
                <h3 class="text-xl font-semibold text-gray-800">Basic</h3>
                <p class="text-4xl font-bold my-4">$9<span class="text-lg text-gray-500">/mo</span></p>
                <ul class="space-y-3 mb-8">
                    <li class="flex items-center"><span class="text-green-500 mr-2">âœ“</span>5 Projects</li>
                    <li class="flex items-center"><span class="text-green-500 mr-2">âœ“</span>10GB Storage</li>
                    <li class="flex items-center"><span class="text-green-500 mr-2">âœ“</span>Email Support</li>
                </ul>
                <button class="w-full bg-gray-800 text-white py-2 rounded-lg hover:bg-gray-700">Get Started</button>
            </div>
            <div class="bg-blue-600 text-white rounded-lg shadow-lg p-8 transform scale-105">
                <h3 class="text-xl font-semibold">Pro</h3>
                <p class="text-4xl font-bold my-4">$29<span class="text-lg opacity-75">/mo</span></p>
                <ul class="space-y-3 mb-8">
                    <li class="flex items-center"><span class="mr-2">âœ“</span>Unlimited Projects</li>
                    <li class="flex items-center"><span class="mr-2">âœ“</span>100GB Storage</li>
                    <li class="flex items-center"><span class="mr-2">âœ“</span>Priority Support</li>
                </ul>
                <button class="w-full bg-white text-blue-600 py-2 rounded-lg hover:bg-gray-100">Get Started</button>
            </div>
            <div class="bg-white rounded-lg shadow-lg p-8 hover:shadow-xl transition">
                <h3 class="text-xl font-semibold text-gray-800">Enterprise</h3>
                <p class="text-4xl font-bold my-4">$99<span class="text-lg text-gray-500">/mo</span></p>
                <ul class="space-y-3 mb-8">
                    <li class="flex items-center"><span class="text-green-500 mr-2">âœ“</span>Everything in Pro</li>
                    <li class="flex items-center"><span class="text-green-500 mr-2">âœ“</span>Unlimited Storage</li>
                    <li class="flex items-center"><span class="text-green-500 mr-2">âœ“</span>24/7 Support</li>
                </ul>
                <button class="w-full bg-gray-800 text-white py-2 rounded-lg hover:bg-gray-700">Contact Us</button>
            </div>
        </div>
    </div>
</body>
</html>'''
    },
]

def load_test_data(path: str) -> list[dict]:
    """Load test samples from JSONL file."""
    samples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples

# Try to load from file, fall back to sample data
try:
    samples = load_test_data("uigen_test.jsonl")
    print(f"Loaded {len(samples)} test samples from file")
except FileNotFoundError:
    print("uigen_test.jsonl not found, using sample test data")
    samples = SAMPLE_TEST_DATA
    print(f"Using {len(samples)} sample test cases")

if LIMIT and len(samples) > LIMIT:
    samples = samples[:LIMIT]
    print(f"Limited to first {LIMIT} samples")

uigen_test.jsonl not found, using sample test data
Using 2 sample test cases


In [28]:
# Helper functions for HTML processing and W&B integration
import asyncio
import nest_asyncio
from playwright.async_api import async_playwright

# Apply nest_asyncio to allow async in Jupyter/Colab
nest_asyncio.apply()

# IMPORTANT: This prompt template MUST match the training data format exactly!
# The base model was trained on text in this format, so we need to use it for inference
PROMPT_TEMPLATE = "# Task: Generate HTML/CSS code using Tailwind CSS\n# Requirements: {requirements}\n\n"

HTML_TEMPLATE = """\
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <script src="https://cdn.tailwindcss.com"></script>
  <title>{title}</title>
</head>
<body>
{content}
</body>
</html>
"""

# W&B HTML card template for side-by-side comparison
WANDB_CARD_TEMPLATE = """\
<div style="font-family: system-ui, -apple-system, sans-serif; width: 100%; box-sizing: border-box;">
  <div style="background: #f8f9fa; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
    <div style="font-size: 14px; color: #666; margin-bottom: 4px;">ID: {sample_id}</div>
    <div style="font-size: 16px; font-weight: 600; color: #1a1a1a;">{prompt}</div>
  </div>
  <div style="display: flex; gap: 16px; margin-bottom: 16px; width: 100%;">
    <div style="flex: 1; min-width: 0;">
      <div style="font-size: 14px; font-weight: 600; color: #e74c3c; margin-bottom: 8px; text-transform: uppercase; letter-spacing: 0.5px;">Generation</div>
      <img src="data:image/png;base64,{gen_b64}" style="width: 100%; height: auto; border: 1px solid #ddd; border-radius: 6px; display: block;" />
    </div>
    <div style="flex: 1; min-width: 0;">
      <div style="font-size: 14px; font-weight: 600; color: #27ae60; margin-bottom: 8px; text-transform: uppercase; letter-spacing: 0.5px;">Ground Truth</div>
      <img src="data:image/png;base64,{gt_b64}" style="width: 100%; height: auto; border: 1px solid #ddd; border-radius: 6px; display: block;" />
    </div>
  </div>
</div>"""

def extract_code(response_text: str) -> str:
    """Extract HTML/CSS code from the model response."""
    pattern = r"```(?:html|css|tsx|jsx|vue)?\s*\n(.*?)```"
    matches = re.findall(pattern, response_text, re.DOTALL)
    
    if matches:
        return "\n".join(matches)
    
    stripped = response_text.strip()
    if stripped.startswith("<") or stripped.startswith("<!"):
        return stripped
    
    return stripped

def wrap_in_html(code: str, title: str = "UI Output") -> str:
    """Wrap extracted code in a full HTML page with Tailwind CDN."""
    if "<!DOCTYPE" in code.upper() or "<html" in code.lower():
        if "tailwindcss" not in code:
            code = code.replace(
                "<head>",
                '<head>\n  <script src="https://cdn.tailwindcss.com"></script>',
                1,
            )
        return code
    return HTML_TEMPLATE.format(title=title, content=code)

def generate_html(requirements: str, max_new_tokens: int = 4096, use_template: bool = True) -> str:
    """Generate HTML response for the given requirements.
    
    Args:
        requirements: The UI requirements to generate code for
        max_new_tokens: Maximum tokens to generate
        use_template: If True, wrap requirements in the training prompt template.
                      This is CRITICAL for base model inference - the model was trained
                      on a specific prompt format and expects it during inference.
    
    Returns:
        The generated HTML code (raw model output)
    """
    # CRITICAL: Use the same prompt format as training data!
    # The base model was trained on "# Task: ... # Requirements: ..." format
    if use_template:
        prompt = PROMPT_TEMPLATE.format(requirements=requirements)
    else:
        prompt = requirements
    
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        top_k=20,
        do_sample=True,
    )
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response

async def render_html_to_screenshot_async(html_path: str, screenshot_path: str, browser,
                                           viewport_width: int = 1280, viewport_height: int = 800) -> bool:
    """Render an HTML file to a PNG screenshot using async Playwright."""
    try:
        page = await browser.new_page(viewport={"width": viewport_width, "height": viewport_height})
        await page.goto(f"file://{os.path.abspath(html_path)}", wait_until="networkidle")
        await page.wait_for_timeout(1000)  # Wait for Tailwind CDN
        await page.screenshot(path=screenshot_path, full_page=True)
        await page.close()
        return True
    except Exception as e:
        print(f"  WARNING: Screenshot failed: {e}")
        return False

def render_html_to_screenshot(html_path: str, screenshot_path: str, browser,
                               viewport_width: int = 1280, viewport_height: int = 800) -> bool:
    """Sync wrapper for async screenshot function."""
    return asyncio.get_event_loop().run_until_complete(
        render_html_to_screenshot_async(html_path, screenshot_path, browser, viewport_width, viewport_height)
    )

def image_to_base64(image_path: str) -> str:
    """Convert a local image file to a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_wandb_card(sample_id: int, prompt: str, gen_screenshot_path: str, gt_screenshot_path: str) -> wandb.Html:
    """Build a rich HTML card for W&B with side-by-side screenshots."""
    gen_b64 = image_to_base64(gen_screenshot_path) if os.path.exists(gen_screenshot_path) else ""
    gt_b64 = image_to_base64(gt_screenshot_path) if os.path.exists(gt_screenshot_path) else ""
    
    html = WANDB_CARD_TEMPLATE.format(
        sample_id=sample_id,
        prompt=prompt[:300],
        gen_b64=gen_b64,
        gt_b64=gt_b64,
    )
    return wandb.Html(html)

print("Helper functions defined.")

Helper functions defined.


In [29]:
# Main evaluation function using async Playwright
async def run_evaluation():
    # Initialize W&B run
    run_name = f"{MODEL_NAME}-{time.strftime('%m%d-%H%M')}"
    run = wandb.init(
        project=WANDB_PROJECT,
        name=run_name,
        config={
            "model": MODEL_NAME,
            "num_samples": len(samples),
            "max_new_tokens": MAX_NEW_TOKENS,
        },
    )
    print(f"W&B run: {run.url}")

    # Launch headless browser for screenshot capture
    print("Launching headless browser for screenshot capture...")
    pw = await async_playwright().start()
    browser = await pw.chromium.launch(headless=True)

    # Run evaluation on UIGEN test samples
    results = []

    for i, sample in enumerate(tqdm(samples, desc="Evaluating")):
        sample_id = sample["id"]
        question = sample["question"]
        reference_answer = sample["answer"]
        
        print(f"\n[{i+1}/{len(samples)}] ID={sample_id}")
        print(f"  Prompt: {question[:80]}...")
        
        # Generate model response
        try:
            raw_response = generate_html(question, max_new_tokens=MAX_NEW_TOKENS)
        except Exception as e:
            print(f"  ERROR: {e}")
            raw_response = f"ERROR: {e}"
        
        # Extract and wrap HTML
        extracted_code = extract_code(raw_response)
        model_html = wrap_in_html(extracted_code, title=f"Generation - {sample_id}")
        reference_html = wrap_in_html(reference_answer, title=f"GT - {sample_id}")
        
        # Define file paths
        raw_path = OUTPUT_DIR / f"{sample_id}_raw.txt"
        gen_path = OUTPUT_DIR / f"{sample_id}_generation.html"
        gt_path = OUTPUT_DIR / f"{sample_id}_gt.html"
        gen_screenshot_path = SCREENSHOTS_DIR / f"{sample_id}_generation.png"
        gt_screenshot_path = SCREENSHOTS_DIR / f"{sample_id}_gt.png"
        
        # Save HTML files
        with open(raw_path, "w", encoding="utf-8") as f:
            f.write(raw_response)
        with open(gen_path, "w", encoding="utf-8") as f:
            f.write(model_html)
        with open(gt_path, "w", encoding="utf-8") as f:
            f.write(reference_html)
        
        print(f"  Saved HTML files")
        
        # Capture screenshots
        print("  Capturing screenshots...")
        await render_html_to_screenshot_async(str(gen_path), str(gen_screenshot_path), browser)
        await render_html_to_screenshot_async(str(gt_path), str(gt_screenshot_path), browser)
        print(f"  Screenshots saved")
        
        # Build W&B card and log
        card = build_wandb_card(sample_id, question, str(gen_screenshot_path), str(gt_screenshot_path))
        wandb.log({
            f"samples/{sample_id}": card,
            "sample_idx": i,
        })
        
        results.append({
            "id": sample_id,
            "question": question,
            "generation_path": str(gen_path),
            "gt_path": str(gt_path),
            "gen_screenshot": str(gen_screenshot_path),
            "gt_screenshot": str(gt_screenshot_path),
        })

    # Cleanup browser
    await browser.close()
    await pw.stop()

    print(f"\n{'='*50}")
    print(f"Evaluation complete! Results saved to: {OUTPUT_DIR}")
    print(f"Total samples: {len(results)}")
    print(f"W&B run: {run.url}")
    
    return results, run

# Run the async evaluation
results, run = asyncio.get_event_loop().run_until_complete(run_evaluation())

W&B run: https://wandb.ai/alchemxz/uiux-eval/runs/laj552bh
Launching headless browser for screenshot capture...


TargetClosedError: BrowserType.launch: Target page, context or browser has been closed
Browser logs:

<launching> /root/.cache/ms-playwright/chromium_headless_shell-1208/chrome-headless-shell-linux64/chrome-headless-shell --disable-field-trial-config --disable-background-networking --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=AvoidUnnecessaryBeforeUnloadCheckSync,BoundaryEventDispatchTracksNodeRemoval,DestroyProfileOnBrowserClose,DialMediaRouteProvider,GlobalMediaControls,HttpsUpgrades,LensOverlay,MediaRouter,PaintHolding,ThirdPartyStoragePartitioning,Translate,AutoDeElevate,RenderDocument,OptimizationHints --enable-features=CDPScreenshotNewSurface --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --disable-search-engine-choice-screen --unsafely-disable-devtools-self-xss-warnings --edge-skip-compat-layer-relaunch --enable-automation --disable-infobars --disable-search-engine-choice-screen --disable-sync --enable-unsafe-swiftshader --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/tmp/playwright_chromiumdev_profile-qtnvsJ --remote-debugging-pipe --no-startup-window
<launched> pid=12081
[pid=12081][err] /root/.cache/ms-playwright/chromium_headless_shell-1208/chrome-headless-shell-linux64/chrome-headless-shell: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory
Call log:
  - <launching> /root/.cache/ms-playwright/chromium_headless_shell-1208/chrome-headless-shell-linux64/chrome-headless-shell --disable-field-trial-config --disable-background-networking --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=AvoidUnnecessaryBeforeUnloadCheckSync,BoundaryEventDispatchTracksNodeRemoval,DestroyProfileOnBrowserClose,DialMediaRouteProvider,GlobalMediaControls,HttpsUpgrades,LensOverlay,MediaRouter,PaintHolding,ThirdPartyStoragePartitioning,Translate,AutoDeElevate,RenderDocument,OptimizationHints --enable-features=CDPScreenshotNewSurface --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --disable-search-engine-choice-screen --unsafely-disable-devtools-self-xss-warnings --edge-skip-compat-layer-relaunch --enable-automation --disable-infobars --disable-search-engine-choice-screen --disable-sync --enable-unsafe-swiftshader --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/tmp/playwright_chromiumdev_profile-qtnvsJ --remote-debugging-pipe --no-startup-window
  - <launched> pid=12081
  - [pid=12081][err] /root/.cache/ms-playwright/chromium_headless_shell-1208/chrome-headless-shell-linux64/chrome-headless-shell: error while loading shared libraries: libatk-1.0.so.0: cannot open shared object file: No such file or directory
  - [pid=12081] <gracefully close start>
  - [pid=12081] <kill>
  - [pid=12081] <will force kill>
  - [pid=12081] exception while trying to kill process: Error: kill ESRCH
  - [pid=12081] <process did exit: exitCode=127, signal=null>
  - [pid=12081] starting temporary directories cleanup
  - [pid=12081] finished temporary directories cleanup
  - [pid=12081] <gracefully close end>


In [None]:
# Save results summary as JSON
results_path = OUTPUT_DIR / "results.json"
with open(results_path, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
print(f"Results summary saved to: {results_path}")

# Log summary to W&B
wandb.summary["num_samples"] = len(results)
wandb.summary["model"] = MODEL_NAME

# Finish W&B run
wandb.finish()
print(f"\nW&B run finished!")

# List all generated files
print(f"\nGenerated files in {OUTPUT_DIR}:")
for f in sorted(OUTPUT_DIR.iterdir()):
    if f.is_file():
        print(f"  {f.name}")
print(f"\nScreenshots in {SCREENSHOTS_DIR}:")
for f in sorted(SCREENSHOTS_DIR.iterdir()):
    print(f"  {f.name}")

### 7.1 View Side-by-Side Comparison

Display generated HTML and ground truth HTML in iframes for visual comparison.

In [None]:
from IPython.display import HTML, display
import html

def show_comparison(sample_id: int):
    """Display side-by-side comparison of generated vs ground truth HTML."""
    gen_path = OUTPUT_DIR / f"{sample_id}_generation.html"
    gt_path = OUTPUT_DIR / f"{sample_id}_gt.html"
    
    if not gen_path.exists() or not gt_path.exists():
        print(f"Files not found for sample {sample_id}")
        return
    
    gen_html = gen_path.read_text(encoding="utf-8")
    gt_html = gt_path.read_text(encoding="utf-8")
    
    # Find the question for this sample
    question = next((s["question"] for s in samples if s["id"] == sample_id), "Unknown")
    
    comparison_html = f"""
    <div style="margin-bottom: 20px; padding: 10px; background: #f5f5f5; border-radius: 8px;">
        <h3 style="margin: 0 0 10px 0;">Sample ID: {sample_id}</h3>
        <p style="margin: 0; font-size: 14px; color: #666;"><strong>Prompt:</strong> {html.escape(question[:200])}...</p>
    </div>
    <div style="display: flex; gap: 20px;">
        <div style="flex: 1;">
            <h4 style="text-align: center; color: #e74c3c;">Generation (Model Output)</h4>
            <iframe srcdoc="{html.escape(gen_html)}" style="width: 100%; height: 600px; border: 2px solid #e74c3c; border-radius: 8px;"></iframe>
        </div>
        <div style="flex: 1;">
            <h4 style="text-align: center; color: #27ae60;">Ground Truth</h4>
            <iframe srcdoc="{html.escape(gt_html)}" style="width: 100%; height: 600px; border: 2px solid #27ae60; border-radius: 8px;"></iframe>
        </div>
    </div>
    """
    display(HTML(comparison_html))

# Show first sample comparison
if results:
    first_id = results[0]["id"]
    show_comparison(first_id)

In [None]:
# View more comparisons - change the sample_id to view different samples
# Available sample IDs:
print("Available sample IDs:")
for r in results:
    print(f"  {r['id']}")

# Uncomment and modify to view a specific sample:
# show_comparison(737)  # Replace with desired sample ID

In [None]:
# Download all results as a zip file (for local review)
import shutil

zip_path = f"{MODEL_NAME}_eval_results"
shutil.make_archive(zip_path, 'zip', OUTPUT_DIR)
print(f"Results zipped to: {zip_path}.zip")

# In Colab, you can download using:
from google.colab import files
files.download(f"{zip_path}.zip")