diff --git a/examples/gpt-5/prompt-optimization-cookbook/llm_as_judge.txt b/examples/gpt-5/prompt-optimization-cookbook/llm_as_judge.txt new file mode 100644 index 0000000000..7b5f3d04a3 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/llm_as_judge.txt @@ -0,0 +1,93 @@ +# SYSTEM PROMPT + +You are an expert judge responsible for evaluating the quality of outputs produced by language models, specifically focusing on how well they follow provided task instructions and the overall code quality (if the output is code). Your evaluation must be fair, thorough, and well-reasoned. + +First, carefully read and understand: +- The task instructions provided. +- The output (text or code) produced by the model. + +**Your tasks:** +1. **Analyze Task Adherence:** + - Step-by-step, explain how the output matches or fails to meet each part of the instructions. + - Highlight all instances where instructions are fully, partially, or not followed. + - Consider any ambiguities and how reasonable the model's choices are. + +2. **Evaluate Code Quality (if applicable):** + - Step-by-step, assess the clarity, correctness, efficiency, readability, structure, maintainability, and best practices of the code. + - Identify any bugs, inefficiencies, or stylistic issues, explaining your reasoning for each point. + - If the output is not code, skip this step and say so. + +**Reasoning Process:** +- Always **reason first**—do not state your final assessment until after you have fully documented your reasoning about task adherence and code quality. +- Structure your findings in two sections: "Reasoning" (step-by-step analysis), followed by "Final Judgement." + +**Output Format:** +Respond ONLY in the following JSON structure (replace bracketed areas with your content): + +{ + "reasoning": { + "task_adherence": "[Step-by-step analysis of how well the output follows all instructions, including any missed or ambiguous points.]", + "code_quality": "[Step-by-step code quality assessment, or short note if not applicable.]" + }, + "final_judgement": { + "adherence_score": [integer 1-5, where 5=perfectly follows instructions, 1=ignores or subverts instructions], + "code_quality_score": [integer 1-5, where 5=exceptional code quality, 1=severe issues or missing code; use null if not code], + "comments": "[Short summary of main issues, overall impression, or suggestions for improvement.]" + } +} + +**Scoring Guidelines:** +- 5 = Exceptional; all instructions/code quality criteria met to a high standard. +- 4 = Good; minor issues. +- 3 = Average; some issues or minor omissions. +- 2 = Major issues or omissions. +- 1 = Severe failure to follow task or produce usable code. + +**EXAMPLES:** + +**Example 1:** +Input Instructions: "Write a function that returns the sum of two numbers." +Model Output: +def add(a, b): +  return a + b + +JSON Output: +{ + "reasoning": { + "task_adherence": "The output defines a function named 'add' with two arguments and returns their sum as instructed.", + "code_quality": "The code is concise, correct, and follows Python conventions. No issues." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Task followed perfectly; code is clean and correct." + } +} + +**Example 2:** +Input Instructions: "Write a function that checks if a string is a palindrome, ignoring case and spaces." +Model Output: +def is_palindrome(s): +  return s == s[::-1] + +JSON Output: +{ + "reasoning": { + "task_adherence": "The output defines a function, but it does not ignore case and spaces, as required.", + "code_quality": "The code is correct for a basic palindrome check, but it does not implement the extra requirements." + }, + "final_judgement": { + "adherence_score": 2, + "code_quality_score": 4, + "comments": "Major task requirement (ignoring case/spaces) omitted; otherwise, basic code is clean." + } +} + +**Important reminders:** +- Always provide reasoning before your ratings and summary. +- Never start with a conclusion. +- Use the JSON schema strictly. +- Use step-by-step analysis, and detailed explanations, and adjust your scores according to the scoring guidelines. + +**Reminder:** +Evaluate how well the output follows instructions first, provide detailed reasoning, then give your overall numeric ratings for task adherence and code quality. Output in the specified JSON format only Do not be nice on scoring, be fair. \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/prompt-optimization-cookbook.ipynb b/examples/gpt-5/prompt-optimization-cookbook/prompt-optimization-cookbook.ipynb new file mode 100644 index 0000000000..91905adc47 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/prompt-optimization-cookbook.ipynb @@ -0,0 +1,943 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4c84498c", + "metadata": {}, + "source": [ + "# GPT-5 Prompt Migration and Improvement using the new prompt optimizer" + ] + }, + { + "cell_type": "markdown", + "id": "a3942231", + "metadata": {}, + "source": [ + "The GPT-5 Family of models are the smartest models we’ve released to date, representing a step change in the models’ capabilities across the board. GPT-5 is particularly specialized in agentic task performance, coding, and steerability, making it a great fit for everyone from curious users to advanced researchers. \n", + "\n", + "GPT-5 will benefit from all the traditional prompting best practices, and to help you construct the best prompt we are introducing a [Prompting Guide for GPT-5](#) explaining how to make the most of its state-of-the-art capabilities. Alongside that, we are introducing a [GPT-5 Specific Prompt Optimizer](#https://platform.openai.com/chat/edit?optimize=true) in our Playground to help users get started on **improving existing prompts** and **migrating prompts** for GPT-5 and other OpenAI models.\n", + "\n", + "In this cookbook we will go through how you can get spun up quickly to solve your task with GPT-5. We will share results of measurable improvements on common tasks and walk you through how you can use the Prompt Optimizer to do the same.\n" + ] + }, + { + "cell_type": "markdown", + "id": "f066a2db", + "metadata": {}, + "source": [ + "## Migrating and Optimizing Prompts\n", + "\n", + "Crafting effective prompts is a critical skill when working with LLMs. The goal of the Prompt Optimizer is to give your prompt the best practices and formatting most effective for our models. The Optimizer also removes common prompting failure modes such as: \n", + "\n", + "• Contradictions in the prompt instructions \n", + "•\tMissing or unclear format specifications \n", + "•\tInconsistencies between the prompt and few-shot examples \n", + "\n", + "Along with tuning the prompt for the target model, the Optimizer is cognizant of the specific task you are trying to accomplish and can apply crucial practices to boost performance in Agentic Workflows, Coding and Multi-Modality. Let's walk through some before-and-afters to see where prompt optimization shines. \n", + "\n", + "> [!NOTE]\n", + "> Remember that prompting is not a one-size-fits-all experience, so we recommend running thorough experiments and iterating to find the best solution for your problem." + ] + }, + { + "cell_type": "markdown", + "id": "8fcbc964", + "metadata": {}, + "source": [ + "> [!IMPORTANT]\n", + "> Ensure you have set up your OpenAI API Key set as `OPENAI_API_KEY` and have access to GPT-5\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a0d077c", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "required = ('OPENAI_API_KEY',)\n", + "missing = [k for k in required if not os.getenv(k)]\n", + "print('OPENAI_API_KEY is set!' if not missing else 'Missing environment variable: ' + ', '.join(missing) + '. Please set them before running the workflow.')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f664a575", + "metadata": {}, + "outputs": [], + "source": [ + "## Let's install our required packages\n", + "%pip install -r requirements.txt --quiet" + ] + }, + { + "cell_type": "markdown", + "id": "fad827dc", + "metadata": {}, + "source": [ + "----------------" + ] + }, + { + "cell_type": "markdown", + "id": "b750b040", + "metadata": {}, + "source": [ + "\n", + "### Coding and Analytics: Streaming Top‑K Frequent Words \n", + "\n", + "We start with a task in a field that model has seen significant improvements: Coding and Analytics. We will ask the model to generate a Python script that computes the exact Top‑K most frequent tokens from a large text stream using a specific tokenization spec. Tasks like these are highly sensitive to poor prompting as they can push the model toward the wrong algorithms and approaches (approximate sketches vs multi‑pass/disk‑backed exact solutions), dramatically affecting accuracy and runtime.\n", + "\n", + "For this task, we will evaluate:\n", + "1. Compilation/Execution success over 30 runs\n", + "2. Average runtime (successful runs)\n", + "3. Average peak memory (successful runs)\n", + "4. Exactness: output matches ground‑truth Top‑K with tie‑break: by count desc, then token asc\n", + "\n", + "Note: Evaluated on an M4 Max MacBook Pro; adjust constraints if needed.\n" + ] + }, + { + "cell_type": "markdown", + "id": "750300af", + "metadata": {}, + "source": [ + "### Our Baseline Prompt\n", + "For our example, let's look at a typical starting prompt with some minor **contradictions in the prompt**, and **ambiguous or underspecified instructions**. Contradictions in instructions often reduce performance and increase latency, especially in reasoning models like GPT-5, and ambiguous instructions can cause unwanted behaviors. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "377cc6f4", + "metadata": {}, + "outputs": [], + "source": [ + "baseline_prompt = \"\"\"\n", + "Write Python to solve the task on a MacBook Pro (M4 Max). Keep it fast and lightweight.\n", + "\n", + "- Prefer the standard library; use external packages if they make things simpler.\n", + "- Stream input in one pass to keep memory low; reread or cache if that makes the solution clearer.\n", + "- Aim for exact results; approximate methods are fine when they don't change the outcome in practice.\n", + "- Avoid global state; expose a convenient global like top_k so it's easy to check.\n", + "- Keep comments minimal; add brief explanations where helpful.\n", + "- Sort results in a natural, human-friendly way; follow strict tie rules when applicable.\n", + "\n", + "Output only a single self-contained Python script inside one Python code block, with all imports, ready to run.\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "markdown", + "id": "66ae7a26", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "id": "01b0e8b3", + "metadata": {}, + "source": [ + "This baseline prompt is something that you could expect from asking ChatGPT to write you a prompt, or talking to a friend who is knowledgeable about coding but not particularly invested in your specific use case. Our baseline prompt is intentionally shorter and friendlier, but it hides mixed signals that can push the model into inconsistent solution families.\n", + "\n", + "First, we say to prefer the standard library, then immediately allow external packages “if they make things simpler.” That soft permission can nudge the model toward non‑portable dependencies or heavier imports that change performance and even execution success across environments.\n", + "\n", + "Next, we encourage single‑pass streaming to keep memory low, but we also say it’s fine to reread or cache “if that makes the solution clearer.” That ambiguity opens the door to multi‑pass designs or in‑memory caches that defeat the original streaming constraint and can alter runtime and memory profiles.\n", + "\n", + "We also ask for exact results while permitting approximate methods “when they don’t change the outcome in practice.” This is a judgment call the model can’t reliably verify. It may introduce sketches or heuristics that subtly shift counts near the Top‑K boundary, producing results that look right but fail strict evaluation.\n", + "\n", + "We advise avoiding global state, yet suggest exposing a convenient global like `top_k`. That mixes interface contracts: is the function supposed to return data, or should callers read globals? Models may implement both, causing side effects that complicate evaluation and reproducibility.\n", + "\n", + "Documentation guidance is similarly split: “keep comments minimal” but “add brief explanations.” Depending on how the model interprets this, you can get under‑explained code or prose interleaved with logic, which sometimes leaks outside the required output format.\n", + "\n", + "Finally, we ask for “natural, human‑friendly” sorting while also mentioning strict tie rules. These aren’t always the same. The model might pick convenience ordering (e.g., `Counter.most_common`) and drift from the evaluator’s canonical `(-count, token)` sort, especially on ties—leading to subtle correctness misses.\n", + "\n", + "**Why this matters**: the softened constraints make the prompt feel easy to satisfy, but they create forks in the road. The model may pick different branches across runs—stdlib vs external deps, one‑pass vs reread/cache, exact vs approximate—yielding variability in correctness, latency, and memory.\n", + "\n", + "**Our evaluator remains strict**: fixed tokenization `[a-z0-9]+` on lowercased text and deterministic ordering by `(-count, token)`. Any divergence here will penalize exactness even if the rest of the solution looks reasonable.\n" + ] + }, + { + "cell_type": "markdown", + "id": "9377fe68", + "metadata": {}, + "source": [ + "### Let's see how it performs: Generating 30 code scripts with the baseline prompt \n", + "\n", + "Using the OpenAI Responses API we'll invoke the model 30 times with our baseline prompt and save each response as a Python file in the `results_topk_baseline`. This may take some time. \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1b3a3b39", + "metadata": {}, + "outputs": [], + "source": [ + "from scripts.gen_baseline import generate_baseline_topk\n", + "\n", + "MODEL = \"gpt-5\"\n", + "N_RUNS = 30\n", + "CONCURRENCY = 10\n", + "OUTPUT_DIR = \"results_topk_baseline\"\n", + "\n", + "USER_PROMPT = \"\"\"\n", + "Task:\n", + "Given globals text (str) and k (int), produce the Top-K most frequent tokens.\n", + "\n", + "Tokenization:\n", + "- Case-insensitive tokenization using an ASCII regex; produce lowercase tokens. Whole-string lowercasing is not required.\n", + "- Tokens are ASCII [a-z0-9]+ sequences; treat all other characters as separators.\n", + "\n", + "Output:\n", + "- Define top_k as a list of (token, count) tuples.\n", + "- Sort by count desc, then token asc.\n", + "- Length = min(k, number of unique tokens).\n", + "\n", + "Notes:\n", + "- Run as-is with the provided globals; no file or network I/O.\n", + "\"\"\"\n", + "\n", + "generate_baseline_topk(\n", + " model=MODEL,\n", + " n_runs=N_RUNS,\n", + " concurrency=CONCURRENCY,\n", + " output_dir=OUTPUT_DIR,\n", + " dev_prompt=baseline_prompt,\n", + " user_prompt=USER_PROMPT,\n", + ")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "53f063b6", + "metadata": {}, + "source": [ + "### Evaluate Generated Scripts - Baseline Prompt\n", + "\n", + "We then benchmark every script in ``results_topk_baseline`` On larger datasets this evaluation is intentionally heavy and can take several minutes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "391da952", + "metadata": {}, + "outputs": [], + "source": [ + "from scripts.topk_eval import evaluate_folder\n", + "\n", + "evaluate_folder(\n", + " folder_path=\"results_topk_baseline\",\n", + " k=500,\n", + " scale_tokens=5_000_000,\n", + " csv_path=\"run_results_topk_baseline.csv\",\n", + ")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "92a02c33", + "metadata": {}, + "source": [ + "### Optimizing our Prompt " + ] + }, + { + "cell_type": "markdown", + "id": "56da7b3f", + "metadata": {}, + "source": [ + "Now let's use the prompt optimization tool in the console to improve our prompt and then review the results. We can start by going to the [OpenAI Optimize Playground](#https://platform.openai.com/chat/edit?optimize=true), and pasting our existing prompt in the Developer Message section.\n", + "\n", + "From there press the **Optimize** button. This will open the optimization panel. At this stage, you can either provide specific edits you'd like to see reflected in the prompt or simply press **Optimize** to have it refined according to best practices for the target model and task. To start let's do just this.\n", + "\n", + "![optimize_image](../../../images/image_optimize_1.png)\n", + "\n", + "\n", + "\n", + "Once it's completed you'll see the result of the prompt optimization. In our example below you'll see many changes were made to the prompt. It will also give you snippets of what it changed and why the change was made. You can interact with these by opening the comments up or using the inline reviewer mode.\n", + "\n", + "We'll add an additional change we'd like which include:\n", + "\n", + "- Enforcing the single-pass streaming\n", + "\n", + "This is easy using the iterative process of the Prompt Optimizer.\n", + "\n", + "![optimize_image](../../../images/image_optimize_2.png)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "a983e50d", + "metadata": {}, + "source": [ + "Once we are happy with the optimized version of our prompt, we can save it as a [Prompt Object](#https://platform.openai.com/docs/guides/prompt-engineering#reusable-prompts) using a button on the top right of the optimizer. We can use this object within our API Calls which can help with future iteration, version management, and reusability across different applications. \n", + "\n", + "![optimize_image](../../../images/image_optimize_3.png)\n" + ] + }, + { + "cell_type": "markdown", + "id": "f5bc98ab", + "metadata": {}, + "source": [ + "### Let's see how it performs: Evaluating our improved prompt \n", + "\n", + "For visibility we will provide our new optimized prompt here, but you can also pass the ``prompt_id`` and ``version``. Let's start by writing out our optimized prompt. " + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "8bf4c55d", + "metadata": {}, + "outputs": [], + "source": [ + "optimized_prompt = \"\"\"\n", + "# Objective\n", + "Generate a single, self-contained Python script that exactly solves the specified task on a MacBook Pro (M4 Max).\n", + "\n", + "# Hard requirements\n", + "- Use only Python stdlib. No approximate algorithms.\n", + "- Tokenization: ASCII [a-z0-9]+ on the original text; match case-insensitively and lowercase tokens individually. Do NOT call text.lower() on the full string.\n", + "- Exact Top‑K semantics: sort by count desc, then token asc. No reliance on Counter.most_common tie behavior.\n", + "- Define `top_k` as a list of (token, count) tuples with length = min(k, number of unique tokens).\n", + "- When globals `text` (str) and `k` (int) exist, do not reassign them; set `top_k` from those globals. If you include a `__main__` demo, guard it to run only when globals are absent.\n", + "- No file I/O, stdin, or network access, except optionally printing `top_k` as the last line.\n", + "\n", + "# Performance & memory constraints\n", + "- Do NOT materialize the entire token stream or any large intermediate list.\n", + "- Do NOT sort all unique (token, count) items unless k >= 0.3 * number_of_unique_tokens.\n", + "- When k < number_of_unique_tokens, compute Top‑K using a bounded min‑heap of size k over counts.items(), maintaining the correct tie-break (count desc, then token asc).\n", + "- Target peak additional memory beyond the counts dict to O(k). Avoid creating `items = sorted(counts.items(), ...)` for large unique sets.\n", + "\n", + "# Guidance\n", + "- Build counts via a generator over re.finditer with re.ASCII | re.IGNORECASE; lowercase each matched token before counting.\n", + "- Prefer heapq.nsmallest(k, cnt.items(), key=lambda kv: (-kv[1], kv[0])) for exact selection without full sort; avoid heapq.nlargest.\n", + "- Do NOT wrap tokens in custom comparator classes (e.g., reverse-lex __lt__) or rely on tuple tricks for heap ordering.\n", + "- Keep comments minimal; include a brief complexity note (time and space).\n", + "\n", + "# Output format\n", + "- Output only one Python code block; no text outside the block.\n", + "\n", + "# Examples \n", + "```python\n", + "import re, heapq\n", + "from collections import Counter\n", + "from typing import List, Tuple, Iterable\n", + "\n", + "_TOKEN = re.compile(r\"[a-z0-9]+\", flags=re.ASCII | re.IGNORECASE)\n", + "\n", + "def _tokens(s: str) -> Iterable[str]:\n", + " # Case-insensitive match; lowercase per token to avoid copying the whole string\n", + " for m in _TOKEN.finditer(s):\n", + " yield m.group(0).lower()\n", + "\n", + "def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]:\n", + " if k <= 0:\n", + " return []\n", + " cnt = Counter(_tokens(text))\n", + " u = len(cnt)\n", + " key = lambda kv: (-kv[1], kv[0])\n", + " if k >= u:\n", + " return sorted(cnt.items(), key=key)\n", + " # Exact selection with bounded memory\n", + " return heapq.nsmallest(k, cnt.items(), key=key)\n", + "\n", + "# Compute from provided globals when available; demo only if missing and running as main\n", + "try:\n", + " text; k # type: ignore[name-defined]\n", + "except NameError:\n", + " if __name__ == \"__main__\":\n", + " demo_text = \"A a b b b c1 C1 c1 -- d! d? e\"\n", + " demo_k = 3\n", + " top_k = top_k_tokens(demo_text, demo_k)\n", + " print(top_k)\n", + "else:\n", + " top_k = top_k_tokens(text, k) # type: ignore[name-defined]\n", + "# Complexity: counting O(N tokens), selection O(U log k) via heapq.nsmallest; extra space O(U + k)\n", + "```\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "markdown", + "id": "95c97164", + "metadata": {}, + "source": [ + "### Generating 30 code scripts with the Optimized prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e003656a", + "metadata": {}, + "outputs": [], + "source": [ + "from scripts.gen_optimized import generate_optimized_topk\n", + "\n", + "MODEL = \"gpt-5\"\n", + "N_RUNS = 30\n", + "CONCURRENCY = 10\n", + "OUTPUT_DIR = \"results_topk_optimized\"\n", + "\n", + "USER_PROMPT = \"\"\"\n", + "Task:\n", + "Given globals text (str) and k (int), produce the Top-K most frequent tokens.\n", + "\n", + "Tokenization:\n", + "- Case-insensitive tokenization using an ASCII regex; produce lowercase tokens. Whole-string lowercasing is not required.\n", + "- Tokens are ASCII [a-z0-9]+ sequences; treat all other characters as separators.\n", + "\n", + "Output:\n", + "- Define top_k as a list of (token, count) tuples.\n", + "- Sort by count desc, then token asc.\n", + "- Length = min(k, number of unique tokens).\n", + "\n", + "Notes:\n", + "- Run as-is with the provided globals; no file or network I/O.\n", + "\"\"\"\n", + "\n", + "generate_optimized_topk(\n", + " model=MODEL,\n", + " n_runs=N_RUNS,\n", + " concurrency=CONCURRENCY,\n", + " output_dir=OUTPUT_DIR,\n", + " dev_prompt=optimized_prompt,\n", + " user_prompt=USER_PROMPT,\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "b2fe4c92", + "metadata": {}, + "source": [ + "### Evaluate Generated Scripts - Optimized Prompt\n", + "\n", + "We run the same evaluation as above, but now with our optimized prompt to see if there were any improvements" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eea51c83", + "metadata": {}, + "outputs": [], + "source": [ + "from scripts.topk_eval import evaluate_folder\n", + "\n", + "evaluate_folder(\n", + " folder_path=\"results_topk_optimized\",\n", + " k=500,\n", + " scale_tokens=5_000_000,\n", + " csv_path=\"run_results_topk_optimized.csv\",\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "bf35e47b", + "metadata": {}, + "source": [ + "### Adding LLM-as-a-Judge Grading \n", + "\n", + "Along with more quantitative evaluations we can measure the models performance on more qualitative metrics like code quality, and task adherence. We have created a sample prompt for this called ``llm_as_judge.txt``. " + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "cb68a647", + "metadata": {}, + "outputs": [], + "source": [ + "from scripts.llm_judge import judge_folder" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "40cdec99", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Run LLM-as-judge for baseline results\n", + "judge_folder(\n", + " results_dir=\"results_topk_baseline\",\n", + " out_dir=None, # auto-map to results_llm_as_judge_baseline\n", + " model=\"gpt-5\",\n", + " system_prompt_path=\"llm_as_judge.txt\",\n", + " task_text=None, # use default task description\n", + " concurrency=6,\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "626f4797", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Run LLM-as-judge for optimized results\n", + "judge_folder(\n", + " results_dir=\"results_topk_optimized\",\n", + " out_dir=None, # auto-map to results_llm_as_judge_optimized\n", + " model=\"gpt-5\",\n", + " system_prompt_path=\"llm_as_judge.txt\",\n", + " task_text=None,\n", + " concurrency=6,\n", + ")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "50361139", + "metadata": {}, + "source": [ + "### Summarizing the results \n", + "\n", + "We can now demonstrate from both a quantitative standpoint, along with a qualitative standpoint from our LLM as Judge results. " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "a6dd05b0", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAABcsAAAMQCAYAAAD4vT0AAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjUsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvWftoOwAAAAlwSFlzAAAPYQAAD2EBqD+naQAAzJtJREFUeJzs3Qd8VFX6//EngUDo0kEU6QhSlSIqiKhgwV2Rn7IKKihS1GUFKSogSBEUEBWkCYoKKCpYWFQUXayAYEVpUmWlSS+hhDD/1/f4v7OTkEBIJpmZzOftawyZcufMTO489z7nnOfE+Hw+nwEAAAAAAAAAEMViQ90AAAAAAAAAAABCjWQ5AAAAAAAAACDqkSwHAAAAAAAAAEQ9kuUAAAAAAAAAgKhHshwAAAAAAAAAEPVIlgMAAAAAAAAAoh7JcgAAAAAAAABA1CNZDgAAAAAAAACIeiTLAQSVz+cLdRMAAAAAADkc554AsgLJcoSthx9+2KpXr24vvfSSRZP//ve/7nWf7vL6669bOJowYYJNmzbN//u4ceNce7PaI488csb37M4777S5c+e6f+s9DqUFCxbYHXfcka77Hj9+3K677jr78ccfs7xdABAM0Rq/ly5dekrsufDCC+3iiy+2f/zjH/bZZ59lyfMqvumSkbjZrFmzNBMNo0eP9sfPnCwwJnvHYDpeSGnJkiVWr149u+mmm2z37t2pHq9ddNFF7j19/PHHbc+ePf7H7t2715o3b25btmzJ1tcGAOHku+++s3/+8592+eWXW+3ate3qq6+2AQMG2Pr16zO8vS5dulikS+1cVscOt912m3388cfZ2paMnr/rNbRo0SJL2gSEQu6QPCtwBgcPHrSFCxdatWrVbPbs2dapUyeLiYmxaNK9e3d3YpWa888/38LRc889Zw8++KD/91tvvdWaNm2a5c97//33u0REYNJ+5cqVNn78eP91BQsWtGLFirm/p1KlSlmo6AT7iSeesBdffDFd98+TJ4/17t3b+vXrZ++9957Fx8dneRsBIKOI3+YSpUqaihLR+/fvdx0HilWTJ0+2K6+80sJBbGys7dixw77//nu75JJLTrn9gw8+sJwuvTH522+/tW7dulnFihXdZ1m0aFF/x3vg8dqxY8ds48aNLtmwbt06mzVrlrte9+/YsaM99thj9uqrr0bdPgEAU6ZMsWeeecauuOIK911YsmRJ27x5sxsE1qZNGxsxYoTdeOONZ7XNt956K8OJ9nCj98M7dz158qQ7dvj3v/9tPXr0cIPR1MEAIPuQLEdYUmCQ/v3729133+1G8zRp0sSiSfny5d0IpkhWpkwZd8mO90oXj5LiSjKn9v7ptlCaOHGi1alTx59ISY9rrrnGnn32WXcwqcQTAIQr4rdZlSpVTok/DRo0cAlVJUrDJVletmxZl8z/8MMPT0mWazaTEunq9MjJ0hOTly1bZl27dnWfqxLlhQsXPu3xWuPGjS0uLs4lg3777TerWrWqu16j1/V8n3zyibVs2TILXxUAhJf//Oc/NmbMGDeqPHBgVaNGjezmm292M9I0Mlkxx/vOjDapnbvquOGHH35wgw9IlgPZizIsCEtz5sxxJ9eXXnqpXXDBBfbGG2/4b7vnnnvslltuOeUxGrH1t7/9zf/78uXLrUOHDla3bl0XiDUyN3BKrKbY1qxZ0/VIK/joPhoFlJSU5Hq+W7du7U6gFLQ0alkn/IEWLVrk2qH7tGrVyiUIrr32WjeayLNv3z43wuyyyy5zU800lWrx4sVBe590sKHtbtiwwX+dnr9GjRpuFJRHr1E99bVq1XJBV/fR6wz0+eefu9ep16sef7X7wIEDp52Opeu81+vdrh5x79+pPU4j1fS+1a9f373veh71nAe2X++j3l9NdVab9f6+++67mX6/UpZh0UHZvffe6w5AlJDWZ6n3QKPCdFCn59ffj0bIr1q1Ktm2zvT3lRrd/vbbb7u/rUCvvPKKK7Wiz1Ij8QcPHmyHDh1Kdh+15eWXX3ZlWQAgXBG/U6fZTRqVvHXr1rN6Dr1ujXy+6qqrXDzUa33ggQdOW07syy+/dPdVh8WZarkq9miKd8r7KVarXeecc84pjznTMUVmYuvXX3/tkspK3ivprATKtm3bTvvZz5w508V2bT+QZmPpeCjw8emJyYH0t6gp/tr+9OnTT0mUp6VIkSLuZ+AIciVC9Pem2QUAEE10flipUiUXv1JS5+KQIUMsV65c/lk+aZXECiz1oX+/88479scffyS7r86hhg4d6s6pdBzQtm1bF/c9ileKG4pFik+KYyo7pplBgc+TneeIaVEMKVSoULJYorZpMMKgQYNcqZYbbrjBvSaNRtcxkI5nvPPn1157Ldn20nucFEjHLXqPdNzk5QZ07v7oo4+619ewYUMbNWqUe/6UTnfer8EDKlWnMmWeF154wX2WgcdCmq2o+6kDPyvzBEBKJMsRdjQKZ8WKFa6XWfTz008/tV27drnfdUL966+/umlbHn1xf/HFF/b3v//dPwpI011VskIjcjW6R8nju+66y44ePZosYGiU0PDhw90XfuXKlV2wVBmPdu3a2dSpU12w1Qntv/71Lzty5Ih7nAKKTu41Kktf2u3bt3cBK/CETAFXgUxt79mzpztI0Cjrzp07p+uEWwHnxIkTp1wCT0iVVM2fP797bvnll19s0qRJLiGh4CU6KRs4cKBLXug2tVUHIrrOo6CvUVPFixd375fKfigwqd3ppYMJ+b//+z//v1PS+9qrVy8XmJ9//nl3wKRaoaqHGvi5/Pnnn+6gSZ+XAvp5553nDjSyYpqdeutnzJjhDjw0/U/PoRNj/VvviaYL6nPVe+JJ799XSkpI6DNU0sOjJI0OMPS5aIqd3hOd4OvvLmVCQwcJgZ0gABBOiN9pU0enTv69WVDpeQ4lsBWHlEBWDFKMUCe5bvfifkp6/3QfnUQOGzbsjOU+dJLtlWIJPP746KOPUp0On55jiozGVp3s6vhFn41u1+eq7ejzVLmUtD57nfTnzZvXxc5A2p7aqe2lNyanrIV73333uRN3vffq8DjT8Zr+RlevXu3+DtVhpNHoKWO5jtVSJvYBIKdSoljfe/quTSsmqWNWHbSKiemlWK6ZWipfonNPJXQVHxRH5s2b52KNvou9JL2S2KKErWKRkuCa7aM4pnil7QV2HGfnOaLHiyWJiYkuiayEso6tbr/99mT302vRcyu5rE5ldTQoL6Dzax1rKT4r3jz55JPuPp70HCcF0jm5Xo8+Hw3aUoexYp6OVTTQTufnI0eOdMcQKUu3nem8X5+X3u/ARL33b72PHh0jqpO8dOnS2Z4nQJTzAWFmxIgRvkaNGvmOHTvmft+6davvwgsv9E2cONH9fvjwYV+9evV848eP9z/mrbfecvfZvn27+71du3a+1q1b+06cOOG/z4YNG3w1atTwzZgxw/0+Z84cX7Vq1Xzvvvtusufv1auXb/r06cmuW7BggbvvDz/84H6/4447fH/72998J0+e9N/n3//+t7vP888/736fPXu2+/3HH3/030f3b9++ve+WW25J8/Vv2bLFPS6ti157oPnz57vr33zzTd+NN97ou/nmm/3v3YEDB3x16tTxPf7448keo/vqMWvXrnW/t2nTxj0u8PVouy1btvT9+eef7jXp/ikFvt7Ufg983L59+3y1atXyDRw4MNk2li1b5u7jfS7eY7755hv/ff744w933bRp03zp0a9fP99VV111yvXeZ6732Luffl+3bp3/PnqvUj6/nlfX7d+/P91/X6n517/+5f5uAun9aNWqlS8pKcl/3Xvvved79dVXT3l8w4YNfU8//XS63gMAyG7RHr+XLFnijx+JiYnucuTIERdj1Dbd9vnnn6f7OfSe3HnnnS5OBho6dKiLp54OHTq4y08//eSrX7++r0+fPsliypni5NVXX+226Vm6dKmvdu3avoMHD/q3fTbHFBmJrWrv5Zdf7rvnnnuSbXvz5s2+iy66yPfUU0+d8bPX6/E+123btrm/q3nz5p1VTPaOwRSb9V5Wr17dHQsdOnTorI7XtB+sWbPmlMfoPdTtM2fOTLNdAJCT/Pzzz8nO9dIycuRIdz+dM3rfr/rOP905XsrfP/vsM/e4Tz75xH+d4ouOLcaNG+f77bff3O2TJ09Otl3FFF2/aNGikJwjes+X2mXQoEHJYrp3X8W5wOdQvEr5usaOHevi+Z49e9J9nOSdi+sxyi3cdNNN/sfLf/7zn2THM97xXePGjf2fRXrP+3UO7N0nISHBxXvlJbzjDmnevLn/+CwYeQIgvRhZjrCiXtT333/f9fSqx1EjzgoUKOCm47755puuJ1MjqXV7YO/l/Pnz3egh9TiqV/Snn35yPc3qrfR6aLUopkaeaYRWIE3RDaR6ahrtpV5w9dpqSrna5I0M00U9zao3Gdg7rt7b3Ln/twyARn6pp1t1MANHhatXXb3rgaVHUqORYZoenPKiaWMpR4Vp+pF6ybds2eJ6jDXVV9ROvY+arhY4Ot2bvqb3QrdrMUy9p4GvR9tV72+JEiUsGFT/VO9dyunOquNarly5U0ZMB9Zs8+qeJyQkWLBpqrT+Ljze69XUOY83DV1/j2f79xVIn496vwNp5JlGmGmKmkYWalSmRgSq1z2lc88997RT7wEgVIjf/6NRWHqsLooliqfa5oABA6xZs2bpfg69JxpVpvdQ3/16/ZpSrRFcKUtyaZq0RkHrfdPxgBbvTC+1L7AUiz4TjfhKOZI6PccUGY2tioMaLZbyGEEj8TV9O+UxQsrPXrPaNBXfGzmoUeX6+9N07bOJyR6NUtTxiUbkaSaERrGl53hNZYfGjh3rSu5oartmUgTSdHqNzCOWA4gWXmxRuZXT0ejowPtnhGYE6Xm8uCSKh/pu1ne1F0tSzpzS73r+pUuXhuQcUXRMEHjOr9JfarOOZfr27ZvsvnrewHXBNCJbz5lafNZMNr0v6TlOCqTR4xrVrtHxWqTao8fpPVaZG4+O7wLXY0nveb+ONb755ptkn51GjOt91ONVYs8rAxOKPAGiGwt8Iqyo/pSm2npBIrU6nPoi1nRtfbFrqqsClwKbphl5wUon5ZoW7NU9C6SpuoH05R5IyUrVB9XPfPnyuSm0SlKKgpCmKumEViVLAinABtb21P104pfWolG6zatpmRoFEtUwTQ+tIK7EdoUKFdwJWmAbRFPGUrNz5053Qq7XlfL1BJuXXEgt+a7rDh48mOw6vfce76Q/MwdPaUlrWnXKvwvP2f59BVINvcDX5SUptL1Zs2a56WoqC6DPXlP6dFsgPTZlLXMACAfE7/9RG7zHatu6r9oRmKBP73PovfKmequNShJrendKSr5qvRG9n+p41bTx9FKsUXkVJeF1AqrEuaZzp5SeY4qMxlZv22kdI6hT/3TbUcezEt9Kkqt+qn7qdZ1tTPbob1XvowYfeFP0VW81sLZ+WsdrSu7r8V49d02HD0QsBxBN9B0p6tA8HXVgqpNTsS6j35GKJXp8Wh3G3vmoEtOB1GGuhHDg+Wh2niOK4k3Kc38NJlDbVNKlU6dO/mMGvU+pxdDUyqeJyq2l5zgpkJL/iqtKsKsD2XtP9R7qPU5ZUifwPU3veb9ipcq76BhGgwhUg12vWQl+Jcw1eEDbVW3yUOQJEN1IliOsqHdTva+qQxlIX37qWVWvsL5U9SWqL84PP/zQ/VTw0UgxL3joy1sju1ILGGmdGIkCs3pRVZ9SI6tU40xfwKrJpWS06CRbvZ5eDVaPgqMXqLzRQ0pea6R3atIazXS2FMhUO02rh69du9bV8dRrEG8hKrVBbUktWOlAQO9XyoVHFKTUS63ecy8YKsng9fofPnz4rNrpJRb0vul9TZkU0OceCTLz95XyIMyjXndddNtXX33lDrD69OnjRhN69dm8gzDvgAYAwgnx+3/UaX2mzu70PIdGb6kOp2YaaaExLx48/fTT/lFiHh0DKOGtE2odB+j9S2+HuxbOUptVp1wjxxX/U47iSu8xRUZ5nRUpPxvvGCFwVFtq9HejgQMaea/arhqp/tRTT532MWnFZG+2gTdLT/FYI9/UgaDOBK/u/Onob1l/g4H1+QNj+ZleDwDkFIq9+u5ULFZt7NQS2YrhGnntjQgPPPcMdKbRw4qtiuc69ghM5qrDVdd556OKK14SX7wa4Zn5bs7MMczpeIlixZO0Oti9+PzKK6+ckkgXnT+m5zgpkLalxUs1a02z3PS6RO+R3qvAvIAEHkel97xfI82Vi1CiXHkHzZbXsY6OMdT5r2MdHY+caf0VICtQhgVhQ1+cGnmm4NK4ceNkF40Y0omLvszVM6ovZpWq0MKUOrnTtG6vl1dfuFoEYsOGDe5E0btUrVrVjfAJnF6Vkh6jL3pN/1FPqxfMtbCEd0Kt51avZ8oFSD777DM33cmjBTY1EkwHCIHt0IGAFtQIDC6Zod7e7du3u9emlbe1gIa3wIUS3UoM6D0LbIN6qDVSTb24Cqgaqab3MpBes0aPaaSY17Ou5/GkPFGX0037Vlt04qkFLQMpGaDpVXpPI0Fm/r50oBK4iJw89NBD/pXhdYB3/fXXuwVm9LcUOEpPB3j6HAMP7AAgHBC/z156nkNlT9Tuf/7zn/5EuU5OvSnLui0w2azYrniiKckq+RL4mtJbikUlclS6JLURcOk5psgoJevVeZLyGEEjDTWdOz3HCCpnpkS0kuSa8h44VT69MTk1Gsmvhbg1JVyLsSqpciZKwithf8EFFyS7XqPtNMiBjm8A0USd5vpOVKxISXFNi1ars9Yb8OWde3ojokXfvT///PNpzz2VfNX9vNjvnUNpMWh1KCv2ipLFgfS72qGBShmVmWOY0/Fec8p4kvJ1i5LYgc+twXDPPfecOz5Kz3FSIMVklY7Tuam24cV4DXrQ8cXChQv991V8DCwzk97zfh1TaNaWjsuUmPc+Hx07asai7p/WItxAVmNkOcKGpszqizet6UM333yzvfXWW672qU4cNZVbo6f0RZ9yqpNWXlaiV6tDa8qsgp/uq+k8SkSe7mRNgU5TZnXyp4t6Wr0p5d4q0T169HAjvfRTdTL1pa8gIl7Pp07aNG1XU6a6detmZcuWdSe4aquS2meq2/b777+7E8TUqLdWbVW9Lz2HTt7UA6vE6yeffOKmX2sUn3p+ddChtqk3WYkLHXTod7VTo8m819O9e3f3vul9Vi+wDmaUxNBoNbVVo9dVB1Wj27zVt1P2XKtXW9O4tYK1F7QDT+T1mehx2p4Cn4Ku2qKArRFhkSKjf186GNBoSp1EKzHuHQzoAFEn9zog0Ym+pn7r8/Q+H9GsAT0usD4cAIQD4vfZS89z1KlTx91X9bLbtm3rEq1at0QlbLwRdimniWvkmmKK3sNp06ZZ165d050sV3x+7733XEmw1KT3mCIj9Legz14JDe+z10m/4qGOefQ+nYkS0JdddpmboaVSZmeSWkw+3cg+/f3pter4SCP+0zpe0zGUOjy8UXyBvIEGKpkDANFC5y86P9XMKCVFFdNKlSrlzgVff/11d51mpnlxRN/7Kmml2UJKEut3jW5WQj2wDIrOPfWdqw55Df7SKGQ9Ts+l82KNYFZc00CyoUOH+s85NbhMxwUq26XnVqxRTMvseVZGj2G8hHNgLNFxlc71J06c6GJGWqPKRaPF9XwDBw505W4Us9Q5oTU0NFNN55XeMcOZjpNSUs1yDYjQsYWOK5QsV3vUKa/yexrIpc9GiXmvzN3ZnPdr1qGeQ5+rN4pen4VyGeq4V1wHQoFkOcLG3LlzXc+rkrOpUU+vvux1wq1go2Cq++pkSl/agfQFri9zBT6dEOtLWgFGNbECF4RISSdLOklUINc0MW/UtU5oNQVJvZuaHqZEsHqI9YWvtihIKDgpae0lkPWFr5NajfzWiCSdjOl+Cp733HPPGd8PBUZdUnP11Ve7adA6qdR7oAS26LmV0FbiWydqOknWgYJ6hlUTW9fpYEPvl4K5d3KoAKbAqfdLo9KKFSvmRv4pqeElIZTMVXsU+DRiSwccugTSCb/eP71XgQu4ebQ9TdPW+6naZwqkGnGoNqZV/y0cZfTvS++zDkx0wOHVI9cCYBoBoQMCfUYawabPR9O+AxMy6vXX5xgpI/ABRA/i99lLz3PoZFExXa9do/AVP3WdF6uVeA1cUMuj6zSVWSep+playZSUdPKqz0SzBE53YpqeY4rMdCDoM9DoP70+ndQrcaFtp6wvmxYlSjSdWx0yZ5JaTD4dHVMpFuvz0HvkrRETeLympL/eB/3N6u845cABPV6dIMwSAxBt1OmpRLbKe+i8UslVfber41KJcsWhQCNHjnTnmkrKKh6og1vHEzqWCIwbSpQrZuiYQeep6nTWebLivBLASiQrYe11QOu5lIBX+TjdV0l7jbbWMcHZLI6dmowew4jib7t27fy/67GKFWqbNwv5dDSwTfFT55SaDa7EtWKb4rZmq6X3OCklvT+Kw+q41+AIDYDQ69N7rE4HlW7T89x2223JZu6l97xfxyzqcNc5rrfYuo51dJ1+ZrR8DZBZMT4q4QNnTYFA05wDe3i1WrTqTisIKZkNpEYHffpbUQ98eulrWgmPO+64w18vDgBw9ojfOZtGcmskmjoKsiomZ5RG9Sn5rySRZu4BAAAgPDGyHMgATfHVyGlN89XIIk1D1qgiLWDB1Fqcjkbfq1SB6s95IxzORHVkNY1Po9ABABlH/M6ZlBzXlHN9vhr1npUxOaM02k8zMOiQAQAACG+MLAcyQPXSNLVLdb60CKOmFWm0kKZPa7oRcDpK1GgUm06cz0T161QS58knn8zUojMAAOJ3TqX6t6odrjJ0Z1sq52xickap3ICmrnv1dwEAABC+SJYDAAAAABDBtNie6jyrDr/qCGvxQi1Gq7WGxFtE8ZdffnHrE6m0n+ohAwCA5DK3ggEAAAAAAAgpLQK4efNmmzJlir399ttu0XglxLXIoRZU1gKL5cuXdwsb6r5aoE//BgAAyVGzHAAAAACACLV//34rV66cde3a1apVq+auu//+++3vf/+7W8R28eLFFhcXZ0OGDLHcuXO70eZeYl1ljAAAwP8wshwAAAAAgAhVpEgRGzNmjD9Rrjr506dPtzJlyliVKlVs+fLl1qhRI5co91x66aW2adMm27VrVwhbDgBA+In4keU//PCDqey6esoBAIgEiYmJFhMTY/Xr17doRgwHAESacI/hAwcOtDfffNPy5MljEydOtPz589v27dv9iXRPqVKl3M9t27ad9QLHxG8AQE6O4RGfLFeQZo3SyKDPSX+YOqjSHyeA4GM/iwzErb8QwyMD3ytA1mM/ixzhHrfuvvtua9eunc2cOdPVJp81a5YdPXrUJc8D5c2b1/3UYqAZjd/Hjx8PWrtTtg+IFMHcD7Ia+xki1fFs3s8iPlnu9WbXrl071E3BGSQkJLhV2DUVUCMcAAQf+1lkWLFiRaibEBaI4ZGB7xUg67GfRY5wj+H6G5Lhw4fbTz/9ZDNmzHCLfaZMNHhJ8oz8vSl+K1nuPVdmqYMoX7589tq2hbbj+N6gbBPIaqXzFLU7y17jFtEN9040YT9DJCod5P1s3bp16RqUEPHJcgAAAAAAopVqlGsRz1atWvnrksfGxrpk9s6dO13tcv0M5P1eunTpDD2nkg3B7th5a+cX9uvhzUHdJpBVLipwgUviKQEdSdjPEM37WUw6Z++xwCcAAAAAABFKi3T26tXLJcw9KuuzcuVKq1y5sjVs2NC+++47S0pK8t++ZMkSq1ixohUvXjxErQYAIDyRLAcAAAAAIEJp8c5mzZrZsGHDbNmyZbZ27Vp75JFH7MCBA9axY0dr27atHTp0yPr37++moM+dO9emT59uXbt2DXXTAQAIOyTLAQAAAACIYM8884w1adLEevbsabfeeqvt27fPLfJ57rnnutHjU6dOtY0bN1qbNm1s/Pjx1rdvX/dvAACQHDXLAQAAAACIYIUKFbLBgwe7S2rq1Kljs2fPzvZ2AQAQaUKSLD9x4oS98MIL9u6777oe75o1a1qfPn2sXr16oWgOAAAAAAAAAFi8xVnRmAIWY+lbEBJZo7ivoB09ejRd942Li7NcuXJFbrJ84sSJ9tZbb9nIkSPt/PPPtxdffNE6d+5sH3zwgZUqVSoUTQIAAAAAAAAQpZQab5OroV0Vd5HFxeYmVR5icTG5XQmx9DrnnHOsTJkyFhMTE3nJ8oULF1rr1q3tiiuucL9r8RElz3/88Udr2bJlKJoEAAAAAAAAIEopUX5j/MV2ToliFhOfi5HlIZYnNs4q5Ct9xvv5fD5LSEiwnTt3ut/Lli0beclyLTDyn//8xzp06OBegGqn5cmTxy688MJQNAcAAAAAAABAlMpnedyIciXKcxXJG+rmwMxyxea2+Pj4dN03X7587qcS5qpakpmSLCFJlvfv39/+9a9/2dVXX+0aHxsba+PGjbPy5ctnaHteDwLC25EjR5L9BBB87GeRQXErs1PDAAAAAADBcU5M/r9Kr8QHp+41sl/+/Pndz8TExMhLlq9bt86t1q1FPkuXLu1KsPTu3dtmzJhhNWrUOOvt6U1YtWpVlrQVwbdp06ZQNwHI8djPwp9mVAEAAAAAQk8lVzScidIrkStYA9KyPVm+bds2e/jhh2369OnWoEEDd13t2rVdAl2jyydMmHDW29SKp1WqVMmC1iKYNNJVCbwKFSr4p0cACC72s8igmAcAAAAACH9xMbksV0xoRpwn+ZIs0ZcUkueOVtmeLP/pp5/cSHAlyAPVrVvXvvjiiwz3HHhD7RH+lMDj8wKyFvtZeKMECwAAAABERqK8Sr5yljs2NMnyEyeTbN2RPzKUMP/ig8/sw9nv2+/rNrpz0HIVz7erb77OWra9MUvamlNke7K8TJky7ueaNWusTp06/uvXrl3rRkIi59KOqQQeSSIAACIH8RsAAADRSiPKlSjvuXaSrT+yNVufu3K+c21stW6uDWebLP/0vQX28qiJ1qlPd6tR7yK3btZPS763l0ZNsv2799mtXdpnWbsjXbYny5Ugv+SSS6xfv342aNAglzx/9913bfHixfb6669bqLHoWtbRiXbNmjVD3Ywci79dANGO78GsQfzOevztAgAAhDclyn89vNkixYK3/m0t/t7Krv57K/915Sqcb3t27rb5r79DsjyckuWxsbE2ceJEe/bZZ+3RRx+1/fv3W7Vq1VwNc5ViCTWdqGz47y47ejwx1E0B0i0+T5xVOq9EqJsBACFFDEckIoYDAAAg2GJjY2zNzyvt0IGDVrBwIf/1bTrdZi3+3tL9u3vru6z5Tddau653+m/vnuK6db+usZnjX7bfVqy2vPnirfFVl9vdPe9z/9aAjw9ef88WvD3Pdm3/00qVK2P/d+/tdsV1V7nH7t65y14Z+6L9+M1yi80VaxfWrWl39+xiZcuXc7fv37PPXhw53n5d/rMdO3rUKl5Yxe54oKNddMlflUjW/LLKBj3bx1atWmW5c+e2Sy+91OWSzz333JyVLJciRYq4UeW6hCOdZCcc5UQbAIBIQwwHAAAAEO3+ftetNvbREdblug5Wq0Edq3FxbavdsK5VrlnNChQqmK5t7Phjuw3u2s8atbjcnpz+rCUcOmzjHh/tEtwPPtHb3nv1bXtrygy7p093l+D+/utl9vzjo+yc4sWsSq3qNqhLX6tUo6oNeXGUGzw9b+Zce/Tuf9mY2ZOseKkSNmXEOEs8nmhDXnzacufJY3OnvW5P9XrCpnw00+LyxNmQHv3t9nb/sKeeesoOHDhgjz/+uD322GNuwHWOS5YDAAAAAAAAAIKvyTVNrXjpEjb/9XddrXIlsuXcC8rZ/Y/3sgvrXXTGbSyc+4EVLFLYHni8l+XK/dcCp90HPuRGrGtU+fxZ79gNt9/sFg2VG/7xdzt+7LidOHHCvl6wyBIOHrZ/De2b7LG/Lv/JFr7zoRu5vuO/26x8lQpWqlxZyxuf1zr17mZNr7/KJdaPHE6wA/v2W6lSpaxcuXJ2/vnnuyolu3fvtqxGshwAAAAAAAAAcpBqtWu4y8mTJ23T2g32w9fL7MPZ79vwHgNt/LsvnfHxm9dtsko1qviT3VKrYV13ObB3v+3dtceq1b4w2WNuvvtW91Ojz1UC5u7mbZPdfvz4cftj4xb371vva2/PD3zalnz6lUve1730Epcsz5M3j7u0vbudDR061J5//nlXguXKK6+066+/3rIayXIAAHIA9bCPHDnSvvzySzt27Jg1bNjQLaZduXJld/uAAQPsrbfeSvYY9dB/9tln7t86gBo/fry7z8GDB93jNc1NPfge1YobPny4/fLLL1asWDHr2LGj3XXXXdn8SgEAAAAAadm940+b+/Jsu6VTOyteuqQbqV3pwiru0qh5E+t5Wzdb+f2KVB+blHTS/2/VCU9LrtPcJr6TPjv3gvOs39jBp9wWny/e/Wzc4nKb0nCW/bh4uf289Af798y59taLM23E9LF2fuUK1vFfXeyBu7va559/bosXL3aJ86lTp9q7775refLksawSm2VbBgAA2eaBBx6wzZs325QpU+ztt9+2+Ph4l8w+cuSIu33NmjXWrVs3++qrr/wX3c8zYcIEmzVrljsAeeONN1zyvHPnzq7nX/bu3WudOnWy8uXL25w5c9zzjR492v0bAAAAABAe4vLksU/f+ci++PA/p9yW///XKy9SvKjljotz5U48qkm+f/de/+/nVSpvG1evs6SkJP91Sz/72i0CqprixUoWt3W/rk22/dF9h9n0Zybb+VUusD+37bACBQtY2fPPdZeSZUrZzHEv2arvf7HE48fd/Xb8sc0ub3mlK9Ey/r2XLTYmxr77apn9sWmLvTB8rBUvXtxuv/12N7pcifL169fb6tWrLSuRLAcAIMLt37/fjRIfNmyY1alTx40mv//++23nzp3222+/uXpy69ats1q1alnJkiX9F40OFyXEX3rpJevRo4c1b97cLrzwQhs7dqxt377dPv74Y3efN9980+Li4mzIkCFu+23btnXJeCXnAQAAAADhoXDRIvb3u2+1Nya8YrNemG4b16x39cGXf7HURvUearUa1LWa9Wu5Ei3ffPyFrf5ppW3ZsNkmDBmbrOTKdbfdZAf3H7QpT46z/2783Y1Gf+25qVa7UT1XJuXmjre5muhffPCpbd+y1f172aLF1vDKJtbs+qutYJFCLnm+dsVqV3pl/KDRrhRM+aoVXEJfifbJw5+ztStW2c6t223RvE/s6JGjVq1ODSt8ThH74qPP3GxnJcg3btxo77zzjhUpUsQqVaqUpe8fZVgAAIhwOmAYM2aM//c9e/a4FcLLlCljVapUsd9//90SEhLSPKhQz/zhw4etSZMm/usKFy5sNWvWtGXLllnr1q1t+fLl1qhRo2RT8VQ3bvLkybZr1y4rUaJEFr9KAAAAAAiNyvnOjajnvP3+u61s+XJuMc2P3pxnx48esxJlS9nlLZtZm07/cPe548GONnn4ARvS/VErUKiA3dShrR0+eMi/DY0cH/jCcHvtuWnW544HrGDhQnZZyyvtjgc6utuvb/c3t6DnGxNfdfXL9Xw9Rz5qF11Sx90+5MXR9uqzL9qwB/u7mcsqAzNwwgg7r2J5d3uvkY/Z9DGTbWTPwZZwKMHKVTjPegzr6xL5Mnj8SHvzhVfttttuc6Pb69WrZy+//LIVLPjX6PisQrIcAIAcZODAgW4UuGq4TZw40fLnz29r1/41Ne61116zL774wtWsa9asmfXs2dMKFSrkRpBL2bJlk21LK497t+lntWrVTrldtm3bRrIcAAAAQI6T5EuyEyeTbGy1biF5fj232pARzVtf4y5pKV6qhD323JBk1/3tzuQLclavU9OGTfvfwKxAMTExbkFPb1HPlEqXK2N9Rg1M8/mVjFfCPC016l5kM2bMsOxGshwAgBzk7rvvtnbt2tnMmTNdXXHVIVeyXAlyJbcnTZrkRpo//fTTrkTLK6+84q9rnnKRlLx587oSL3L06NFUbxctKJpRKhGjUe+ZpQO1fPnyZXo7QKhoP9T+gOjkfQ97PxG+tJ8q5gAAokOiL8nWHfnDcsX8rzxJdlKiXG1A9iFZDgBADqKyKzJ8+HD76aefXE+8/n3HHXdY0aJF3W0aIa6a5ZrOtmLFCrcYqFe73Pu3lwT3EtC63lvsM/B20ej1jEpMTLRVq1ZZZqmdKhsDRCrVYSRRik2bNoW6CUiHlJ3HAICcTclqEtbRg2Q5AAARTjXKFy9ebK1atfLXFNdIciXOtcin/u0lyj1Vq1b1l1fxyq/ovuXL/1U/zvu9evXq7t+qf67fA3m/ly5dOsNt16KhXoI/Mxjlh0hXsWJFRpZHMXWUKFFeoUIFZsmEOS2YDQAAci6S5QAARDgtsNmrVy+bOnWqNW3a1D9ie+XKldaiRQvr27evS2xr0U+PRpSLEtXnn3++WyRl6dKl/mT5gQMH3OM7dOjgfm/YsKG98cYbbmGVXLn+moK4ZMkSl+ArXrx4ppLcmRmZDuQUJEjh/R3wnRje6JwFACBniw11AwAAQOaorIoW7Bw2bJgtW7bM1Sh/5JFHXMK7Y8eObsS5Rp6PHz/e1Sv//PPP7bHHHrPWrVtb5cqV3XRyJcVHjx5tn376qa1evdot/qnR5C1btnTP0bZtWzt06JD179/fjaqbO3euS7537do11C8fAAAAAICgYGQ5AAA5wDPPPGNjxoxxSe6DBw9agwYN3CKf5557rrs8++yzNmXKFHvxxRetUKFCdtNNN9lDDz3kf3yPHj3sxIkTNmDAALeYp0aST5s2zZVJEY0e18h11T9v06aNq3muEev6NwAAAAAAOQHJcgAAcgAlwAcPHuwuqbn++uvdJS0qrdKnTx93SUudOnVs9uzZQWkvAAAAAADhhjIsAAAAAAAAAICoR7IcAAAAAAAAABD1KMMCAAAAAAAAAKmIi8lluWJyheS5k3xJluhLCslzRyuS5QAAAAAAAACQSqK8Wr5zLTY2NCnUkydP2NojW9OdMB/Upa8dOZxgT88cn+rtE4c+a5/P/9ROJCba4MlPWa0Gdc+4zV+W/2SDu/azCfOmW6lzy6SrHat//NV8Pp/VqF/Ldm7dbvff1DHdzxdqJMsBAAAAAAAAIAWNKFei/Njn99jJ/Wuy9blji1S3vFe+5NqQ3mT51Te3sucHjrI/Nm6xchXPT3bb8WPHbfHCL61dtzuteetrrGCRQlnUcrMB9z5sDwzq5ZLlxUuXtBcXzMrS5wsmkuUAAAAAAAAAkAYlyn27f8re58zAYxq3uMLyPzXBvvjwM7v9/ruT3fbtom/s2JGjLlFetEQxyy65cuXK1ufLLBb4BAAAAAAAAIAIlzc+r13Rqrl99dGiU25b9O+FdvEVjSzx+HH7v0uuc+VVJCkpyebNnGs9brnXbm9yk/u54O35aT7HoQMHXTmXLte1t3aNbrR7rmnnflciXrRteeGJZ2z8oNGuDMvZPJ/ud1ujG2zZl0usdevWVqtWLbvuuuts4cKFlh1IlgMAAAAAAABADtDi7y1txx/bbM3PK/3X7d21x35e+r1dffNfiexAr4590eZMnWW33tfBnpk9yVrdepO9PHqS/XvWO6luf/zgMbZxzXrrPWqgjXt3mnXs1dXVQf/knQ/d7Sq5Ip0e7madenfP0POdTDppLz83xfr372///ve/rVq1atavXz87fPiwZTXKsAAAAAAAAABADlDloupWvkoF+/LD/1j1OjXddV988JmdU6yo1b+8ge3e8af/vgmHDtuCt/5td/fqYk2vv8pdd2P5crZz6w575+XZduPtN5+y/bqNL7aaF9e2C6pWdL9r0c8PZ79vv6/b6H73Sq7kL5jfChQqYIcPHszQ8935wD3WpEkT9+/777/fFixYYGvXrrX69etbVmJkOQAAAAAAAADkEC3+3sq++eQLSzrx18Kgn89faFe2vsbVDw/0x6b/2okTJ+zCehclu/6ii2vb/j373CWlVre2th3/3WavPDPFRvYcZA/8rZOt+3WNGw1+JmfzfOdXLO//d8GCBd3PxMREy2okywEAAAAAAAAgh2h2fQtLOJRgPy35zjasXmdb1m92CfSUfD5fqo8/6fsr8Z0rd/KiJCdPnrQRDw2yl0ZNdLdddu2V9uhzT1j1un+NYD+Ts3m+uDx50v34YKIMCwAAAAAAAADkEIWLFrEGzRrb1x9/YeeUKOrKppQ9/9xT7ndexfMtd+7ctvrHX61i9cr+61f98KudU7yoFSz814huz6Y16+2Hr5fZk9OftWq1L3TXnUg8Ydu3bLXS5cqesV1n+3yhQLIcAAAAAAAAAHIQLeb5XP+RVqBQIWvXrUOq98lfsIBd2/YGmz3pVStUpJBVvqi6/bh4uasrfscDHS0mJibZ/c8pUcyVcln8yRdWpNg5dnD/AZs77Q3bt3tvshIp8fnz2X83bbGD+w5k6vmiIlm+dOlSu+uuu1K97bzzzrNPP/00u5sEAAAAAAAAAKmKLVLdTobgOTOj7qUXW3y+fC6h3bjFFWner2OvrlbonMI24/mXbN+efVa2/Ll2b9/77dpbrj/lvsVKFrcHn+htsye/Zh+9Nc+NBr+kaWNr3b6NLf98if9+N3W4xd575W37Y+Pvdk+f7hl+vqhIlmvF0q+++irZdT/++KP985//dCubAgAAAAAAAECoJfmS7OTJE5b3ypdC8vx6brUhI2JjY23SB6+dcn2pc8vY29995P89V+5cdluXDu6SmloN6ia7f9Prr3KXlJQE97Treqe7eDLyfPGxeZINsF6zZo3lyGR5njx5rGTJkv7fExISbMSIEdamTRtr27ZtdjcHAAAAAAAAAE6R6EuytUe2Wq6YXCF5fiXK1QZEUc3ySZMm2ZEjR6xfv36hbgoAAAAAABFn37599swzz9iiRYvs0KFDVr16dXv44YetQYMG7vZOnTrZN998k+wxjRo1stdeO3XEIQAgOSWrSVhHj5Amy/fs2WPTp093Qfycc87J8HZ8Pp8boZ5ZKiKfL1++TG8HCBV1PGl/QHT/DQT+RHjSfhoOC5cAAICcoVevXvbnn3+6hHnx4sVdEvzee++1d955xypVquSmrg8ePNiuueYa/2Pi4uJC2mYAAMJRSJPls2bNskJakbVdu0xtR6utrlq1KtPtUaK8Zs2amd4OECobN26MiCSpDsxz5w75xJYcS99lO3bsCHUzcqQTJ04kW+E7s2XJAAAAMmvz5s329ddfu/PrSy65xF03cOBA+/LLL23evHnWoUMH2717t9WtWzdZSVQAAHCqkGar3n33Xbv55pstPj4+04m3KlWqZLo9jPJDpKtYsWLYjyzXfpYnb7zlimV/Q+RJOumz48eOZno/W7duXdDaBAAAolvRokVtypQpVrt27WTH3LocOHDAjSrXv3WuAAAAwjRZvnr1atuyZYvddNNNmd6WAn/+/PmD0i4gkkVSGaHJ35ywbfvDO7EPBCpbJMa6XpY7KPsZnbMAACBYChcubFdeeWWy6xYsWOBGnD/22GO2du1aN6N7yJAhbgS6zp2vu+46u//++5npBgBAuCTLly9f7mqpXXjhhaFqAoAQUqJ8895QtwI4G3TuAACA8Pf999/bo48+ai1btrTmzZu7hPmxY8esTp06bqFPlTB9+umnbevWre5nKNcNE9YOQySLlHXD2M8QyU6ePJmu/SwpKcndV/ulfmZ07bCQJctXrlzpVugGAAAAAACZt3DhQuvdu7ddfPHFNnr0aHedRpT369fPihQp4n6vVq2aK2Xas2dP69u3r5UoUSJk64YJa4chkkXKumHsZ4hkx44dSzX5ndr9tM7Yhg0b0rxPemZUhSxZrpW6zznnnFA9PQAAAAAAOcaMGTNs+PDhrsTKU0895U8I5M6d258o91StWtX93L59e4aS5cFaN0woT4dIFgnrhgn7GSJZ3rx5072fKeaVL1/ePSaja4eFLFn+4osvhuqpAQAAAADIMWbNmmVDhw61O++80/r3758sMabrzjvvPBsxYoT/uhUrVriEd4UKFTL0fKwbBvyF0ibRIS4ml+WKyRWS507yJVmiL+msHvN4lz628rsVqd52U4e2dnfP+077+F+W/2SDu/azCfOmW6lzy1j31ndZ85uutXZd77RQiI2NTdf9cuXK5e6r/TI+Pj7DnUYhS5YDAAAAAIDMl4F48skn7dprr7WuXbvarl27/LcpWdCqVSt3u2qWX3HFFS5Rrlrl9957rxUsWDCkbQeASEiUV4s/12JzhSaFejLphK09uvWsE+aXXdvMOvXudsr18flOTSKnVL1uTXtxwSwrXDT5rKRoQbIcAAAAAIAItWDBAldD/JNPPnGXQG3atLGRI0e60XSvvfaaS5qXLFnSOnbsaF26dAlZmwEgUmhEuRLlh0eMsJO//56tzx1bvrwVePRR14azTZbnyZvHipYolqHnjYuLy/BjcwKS5QAAAAAARKhu3bq5y+m0b9/eXQAAGaNEeVI6a16Hu0MHDtprz02zH75eZvv37LMChQtawyub2D29u1nefPGnlGGJNiTLAQAAAAAAACAKjB88xvbs3G29Rw20c4qfY6t/XGkThoy18ytfYK3vaGPRjmQ5AAAAAAAAAOQQX374H1vy6VfJrruwfi0bMG6Y1W18sdW8uLZdULWiu16jxz+c/b79vm5jiFobXkiWAwAAAAAAAEAO0eDKS61Dj3tPqWMurW5tbcs/X2KL5n1i27b8YVvW/247t263chXOC1Frw0tsqBsAAAAyb/fu3danTx+79NJLrX79+m7RrvXr1/tvX7VqlXXo0MHq1atnLVq0sFdffTXZ40+ePGnPP/+8NW3a1N3nvvvusy1btiS7z5m2AQAAAAAIvXz581nZ889NdileqoQ77xvx0CB7adREy5U7t1127ZX26HNPWPW6NUPd5LBBshwAgBzggQcesM2bN9uUKVPs7bfftvj4eOvYsaMdOXLE9u7da506dbLy5cvbnDlz3H1Hjx7t/u2ZMGGCzZo1y4YOHWpvvPGGO4jq3LmzHT9+3N2enm0AAAAAAMLXpjXr3cKevZ7qbx163GPNbmhhZc4717Zv2Wo+X6hbFx4owwIAQITbv3+/lStXzrp27WrVqlVz191///3297//3X777TdbvHixxcXF2ZAhQyx37txWuXJlf2K9bdu2LiH+0ksvWe/eva158+bu8WPHjnWjzD/++GNr3bq1vfnmm6fdBgAAAAAgvJ1TopjlypXLFn/yhRUpdo4d3H/A5k57w/bt3muJiYmhbl5YIFkOAECEK1KkiI0ZM8b/+549e2z69OlWpkwZq1Klio0bN84aNWrkktwelWuZPHmy7dq1y7Zu3WqHDx+2Jk2a+G8vXLiw1axZ05YtW+aS5cuXLz/tNkqUKJGNrxgAAAAAsk9s+fI54jmLlSxuDz7R22ZPfs0+emuenVO8qF3StLG1bt/G1TEHyXIAAHKUgQMHulHgefLksYkTJ1r+/Plt+/bt/hHnnlKlSrmf27Ztc7dL2bJlT7mPd9uZtpHRZLnP57OEhATLrJiYGMuXL1+mtwOEikomaX9A9H7+gT8RvrSfKuYAAKJDki/JTiadsAKPPhqS59dzqw1nY8iUUae9ven1V7lLSh17dXU/azWoa29/95H/+on/jq61qkiWAwCQg9x9993Wrl07mzlzpqsrrjrkR48edcnzQHnz5nU/jx075k/OpHYflXiRM20jozTVTwuHZpYS5RoJD0SqjRs3kiiFbdq0KdRNQDqkjIcAgJwr0Zdka49utVwxuULy/EqUqw3IPiTLAQDIQVR2RYYPH24//fSTzZgxwy326S3U6fES3Bp5rttF9/H+7d3HG619pm1klOqge23ODEb5IdJVrFiRkeVRTB0lSpRXqFCBWTJhbt26daFuAgAgmylZTcI6epAsBwAgwqlGuRbxbNWqlb+meGxsrEtC79y509Uu189A3u+lS5e2EydO+K8rH1AXT79Xr17d/ftM28hMkjszyXYgpyBBCu/vgO/E8EbnLAAAOVtsqBsAAAAyRwts9urVyyXMA8ubrFy50ipXrmwNGza07777zpKS/jcaYsmSJW4ka/Hixe3CCy+0ggUL2tKlS/23HzhwwD1ej5UzbQMAAAAAgEhHshwAgAinhTebNWtmw4YNs2XLltnatWvtkUcecQnvjh07Wtu2be3QoUPWv39/N3187ty5Nn36dOvatau/9mqHDh1s9OjR9umnn9rq1autZ8+ebjR5y5Yt3X3OtA0AAAAAACIdZVgAAMgBnnnmGRszZoxLch88eNAaNGjgFvk899xz3e1Tp051dczbtGljJUuWtL59+7p/e3r06OHKsQwYMMAt5qmR5NOmTXM1xUWjx8+0DQAAAACIRD73318/KbgVmYK1/g/JcgAAcoBChQrZ4MGD3SU1derUsdmzZ6f5+Fy5clmfPn3cJS1n2gYAAAAARKKDvqOW5DtplsSC65HKW4vLW8croyjDAgAAAAAAACBqHbajtvfkIUs6nBjqpiCDVIZUg8B0yQxGlgMAAAAAAACIWhpP/l7icuu6/xwrmieXxcTnshgKsoRUUuwJOxp7NF3lVw4fPuyS5WXLlrWYmMx9biTLAQAAAAAAAES15Sc32HnHvrdmO2pavtg8pMpDLC4mt/nyJKTrvkqQn3POOVakSJFMPy/JcgAAAAAAAAAW7aPL30labh8l/WznxORnZHmIVcl3rk2s2CNd942Li8t0+RUPyXIAAAAAAAAAMLMjdtyO+I6HuhlRr2hMYYuPj8/252WBTwAAAAAAAABA1CNZDgAAAAAAAACIeiTLAQAAAAAAAABRj2Q5AAAAAAAAACDqhSxZ/u6779oNN9xgtWvXthtvvNE+/PDDUDUFAAAAAAAAABDlQpIsf++996x///7Wvn17mz9/vrVu3dp69eplP/zwQyiaAwAAAAAAAACIctmeLPf5fPbcc8/ZXXfd5ZLl5cuXt+7du9tll11m3377bXY3BwAAAAAAAAAAy53dT7hx40b7448/7Kabbkp2/bRp07K7KQAAAAAAAAAAhGZkuZLlkpCQYPfee681adLEbr31Vvvss8+yuykAAAAAAAAAAIRmZPmhQ4fcz379+tmDDz5ovXv3tgULFtj9999vL7/8skueZ6S0i5LvmRUTE2P58uXL9HaAUDly5IjbH8IZ+xkiXTD2Mz1e+wIAAAAAAIjiZHlcXJz7qVHlbdq0cf+uUaOGrVy5MsPJ8sTERFu1alWm26YEXs2aNTO9HSBUNHNDibxwxn6GSBes/SxPnjxBaQ8AAAAAAIjQZHnp0qXdz2rVqiW7vkqVKrZo0aIMJ+D1+MxilB8iXcWKFSNiZDkQ7fvZunXrgtYeAAAAAAAQocnyiy66yAoUKGA//fSTNWjQwH/92rVrrXz58hlOvuXPnz+IrQQiE+VNgMjYz+g0AgAAAAAg/GR7sjw+Pt46d+5sL7zwghtlXqdOHZs/f759/fXXNn369OxuDgAAAAAAAAAA2Z8sFy3mqZF5Y8eOtR07dljlypVt3Lhx1rhx41A0BwAAAAAAAAAQ5UKSLJdOnTq5CwAAAAAAAAAAoRYb6gYAAAAAAAAAABBqJMsBAAAAAAAAAFGPZDkAAAAAAAAAIOqRLAcAAAAAAAAARD2S5QAAAAAAAACAqEeyHAAAAAAAAAAQ9UiWAwAAAAAQwfbt22ePP/64NWvWzC6++GK7/fbbbfny5f7bFy9ebLfccovVrVvXrrvuOps/f35I2wsAQLgiWQ4AAAAAQATr1auX/fDDD/bMM8/YnDlzrEaNGnbvvffahg0bbP369da1a1dr2rSpzZ0712699Vbr27evS6ADAIDkcqf4HQAAAAAARIjNmzfb119/bbNmzbJLLrnEXTdw4ED78ssvbd68ebZ7926rXr269ezZ091WuXJlW7lypU2dOtWaNGkS4tYDABBeGFkOAAAAAECEKlq0qE2ZMsVq167tvy4mJsZdDhw44MqxpEyKX3rppfbdd9+Zz+cLQYsBAAhfJMsBAIiCWqWdOnVyo8oCL3feeaf/9mPHjtkTTzzhTqbr169vDz/8sO3ZsyfZc1DvFACA8FO4cGG78sorLU+ePP7rFixY4Eacq/TK9u3brUyZMskeU6pUKTty5Ijt3bs3BC0GACB8UYYFAIAcUqv0zz//dLVKixcvbq+99pqrVfrOO+9YpUqVbM2aNTZ48GC75ppr/I+Ji4vz/1u3Kbk+btw4d7I9aNAg69Gjh82YMcPd7tU7VdJ91KhRtmjRIlfvtFixYkzhBgAgjHz//ff26KOPWsuWLa158+Z29OjRZIl08X4/fvx4hp5DI9ITEhKC0l6NgM+XL19QtgVkN3U6RcIMDfYzRLIjQdrPtA3tC2dCshwAgBxeq7RDhw6uXqlGhJcsWfKUx+/YscPeffddmzRpkjVo0MBdp6S7Ro9rsTCNNH/llVeodwoAQJhbuHCh9e7d280yGz16tLsub968pyTFvd8zmjxLTEy0VatWBaHFf7WhZs2aQdkWkN02btzoEnnhjv0MkWxjEPezlJ3HqSFZDgBADq9VqlHl+nfFihVTfbxqlnr1Sz26b+nSpW3ZsmUuWa5R54Gj0r37Dx8+PN099AAAIOtoNpjisjq7n3rqKX9CoGzZsrZz585k99Xv+fPnt0KFCmXouTQ7rUqVKkFpN8cQiGQ6Zo6UkeVAtO9n69atS9f9SJYDAJBDapUG8mqVPvbYY7Z27Vp3MjxkyBA3Al0nxzqRvv/++92JtEaWK+GukWcp65mqzqmcqd6pyrGEcho3U0sR6SJlGjeyhjdaKhJGJ0a7cO0g1uyyoUOHuvVI+vfvn6yNmjX27bffJrv/kiVL3Ojz2NiMLWOm7et4Aoh2HH8CkbOfpTd+kywHACCH1ypVwlwLeNapU8fVHNe06aefftq2bt3qfio5k9p0NCXP9TjJinqnwZzGzdRSRLpImcaNrLVp06ZQNwFBmsKd3d8fTz75pF177bVufZFdu3b5b4uPj3cJ9DZt2riyLPr5+eef20cffeRKqQEAgORIlgMAkMNrlWpEeb9+/axIkSLu92rVqrnp06o/rkU6dSKdWsJbiXKvFz8r6p0Gcxp3OI7yA3LiNG5kDXWUKFFeoUIFRimGufRO4c5Omk2mzudPPvnEXQIpOT5y5EibMGGCW6Bba5Ccd9557t+sOQIAwKlIlgMAkMNrlebOndufKPdUrVo1WXmVffv2ueR34Gg51TNV3fKsqncqTOMG/kKCFN7fAd+J4S0cO2e7devmLqfTrFkzdwEAAKeXsQJlAAAgLGuVtm/f3p555plkSW9Nv1ZZlkArVqxwo7o1ivGSSy6xkydP+hf69KZ0q5Z5w4YNs6zeKQAAAAAA4YSzWwAAIlxqtUr//PNPdzl48KC1atXK3nvvPXv99ddty5Yt9sEHH7ha5ffee68VLFjQjR6/8cYbbcCAAbZ06VL7+eefrVevXtaoUSOrV6+eP+Gu61XaZf369fbSSy+5eqedO3cO9csHAAAAACAoKMMCAECES0+tUk0bf+2111xSvWTJktaxY0fr0qWL/34ala7bHnzwQfe7pmoreR5YtoV6pwAAAACAnIxkOQAAES49tUpVnkWXtKhG7rBhw9wlLdQ7BQAAAADkZJRhAQAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNTLHYon3bFjhzVr1uyU60eMGGG33HJLKJoEAAAAAAAAAIhiIUmWr1692vLmzWsLFy60mJgY//WFChUKRXMAAAAAAAAAAFEuJMnytWvXWoUKFaxUqVKheHoAAAAAAAAAAEJfs3zNmjVWuXLlUDw1AAAAAAAAAADhkSzXyPI9e/ZY+/bt7bLLLrPbb7/dvvjii1A0BQAAAAAAAACA7C/DcuLECduwYYNVqVLFHnnkEStYsKDNnz/funTpYi+//LI1adLkrLfp8/ksISEh021T/fR8+fJlejtAqBw5csTtD+GM/QyRLhj7mR4fuGYHAAAAAACIwmR57ty5benSpZYrVy6Lj49319WqVct+++03mzZtWoaS5YmJibZq1apMt00JvJo1a2Z6O0CobNy40SXywhn7GSJdsPazPHnyBKU9AAAAAAAgghf4LFCgwCnXVa1a1b766qsMbS8uLs6NVM8sRvkh0lWsWDEiRpYD0b6frVu3LmjtAQAAAAAAEZos1wjydu3a2cSJE61x48b+63/55ZcMJ7yVfMufP38QWwlEJsqbAJGxn9FpBAAAAABA+Mn2BT4rV65slSpVsiFDhtjy5ctt/fr1NmLECPvxxx+te/fu2d0cAAAAAAAAAACyf2R5bGysTZo0ycaMGWMPPfSQHThwwNUv1uKe1apVy+7mAAAAAAAAAAAQmprlJUqUcKPJAQAAAAAAAACIyjIsAAAAAAAAAACEG5LlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AQA6wb98+e/zxx61Zs2Z28cUX2+23327Lly/337548WK75ZZbrG7dunbdddfZ/Pnzkz3+2LFj9sQTT1iTJk2sfv369vDDD9uePXuS3edM2wAAAAAAIJKRLAcAIAfo1auX/fDDD/bMM8/YnDlzrEaNGnbvvffahg0bbP369da1a1dr2rSpzZ0712699Vbr27evS357Bg8ebF999ZWNGzfOXnnlFfe4Hj16+G9PzzYAAAAAAIhkuUPdAAAAkDmbN2+2r7/+2mbNmmWXXHKJu27gwIH25Zdf2rx582z37t1WvXp169mzp7utcuXKtnLlSps6daobSb5jxw579913bdKkSdagQQN3HyXdNXpcCXiNNFcC/XTbAAAAAAAg0jGyHACACFe0aFGbMmWK1a5d239dTEyMuxw4cMCVY0mZ0L700kvtu+++M5/P535613kqVqxopUuXtmXLlrnfz7QNAAAAAAAiHclyAAAiXOHChe3KK6+0PHny+K9bsGCBG3Gusinbt2+3MmXKJHtMqVKl7MiRI7Z37143slwJ97x5855yHz1WzrQNAAAAAAAiHWVYAADIYb7//nt79NFHrWXLlta8eXM7evRoskS6eL8fP37cJbxT3i5KnmvhTznTNjJKo9ITEhIsszSKPl++fJneDhAq2g+ZpRHdn3/gT4Qv7aeKOQAAIGciWQ4AQA6ycOFC6927t1188cU2evRof9I7ZULb+10J5vj4+FQT3kqUewnoM20joxITE23VqlWWWWpDzZo1M70dIFQ2btxIohS2adOmUDcB6ZBaBzMAAMgZSJYDAJBDzJgxw4YPH+4W5nzqqaf8J/Nly5a1nTt3Jruvfs+fP78VKlTIlVfZt2+fS34HJgB0H9UtT882MiouLs6qVKlimcUoP0Q6rRPAyPLopY4SJcorVKjALJkwt27dulA3AQAAZCGS5QAA5ACzZs2yoUOH2p133mn9+/dPljxu0KCBffvtt8nuv2TJEjf6PDY21i655BI7efKkW6zTW8RTo1xVy7xhw4bp2kZGqZ1KuAPRjgQpvL8DvhPDG52zAADkbCzwCQBAhFNi+8knn7Rrr73Wunbtart27bI///zTXQ4ePOgS6D///LMry7J+/Xp76aWX7KOPPrLOnTu7x2v0+I033mgDBgywpUuXuvv26tXLGjVqZPXq1XP3OdM2AABAeJg8ebKL24EU46tXr57s0qJFi5C1EQCAcMXIcgAAItyCBQtc7e9PPvnEXQK1adPGRo4caRMmTLBRo0bZK6+8Yuedd577tzeKXDQqXQn3Bx980P3erFkzd2LtqVq16hm3AQAAQmvmzJn27LPPuhlhgdasWWPdunWzDh06+K/LlStXCFoIAEB4I1kOAECE08mvLqej5LcuadG0/2HDhrlLRrcBAABCQ6XTBg0a5GaIqfZ9IK2HoFrrXbp0sZIlS4asjQAARALKsAAAAAAAEMF+/fVXt2j2+++/b3Xr1k122++//24JCQlWqVKlkLUPAIBIwchyAAAAAAAimOqPp1WDfO3ate7na6+9Zl988YVbmFszxXr27GmFChXK0PNptLoS8MFaNJVFjhGpjhw54vaHcMd+hkh2JEj7mbaRnoW6SZYDAAAAAJBDKVmuBHmpUqVs0qRJbqT5008/bb/99ptbh0S3nS2tlbJq1aqgtE8JvJo1awZlW0B227hxo0vkhTv2M0SyjUHcz/LkyRPeyXK92FtuucUGDhzofgIAAAAAgODp3r273XHHHVa0aFH3e7Vq1Vzt8ttuu81WrFhxStmW9FDJlypVqgSlfekZ5QeEq4oVK0bMyHIg2vezdevWpet+IUuWqye6d+/eQZu6BQAAAAAAktPIcS9R7qlatar7uX379gwly5V40+LgQLSjtAkQOftZejuNQrbA57hx46xgwYKhenoAAAAAAHK8vn37WseOHZNdpxHlEqzR4QAA5BQhSZYvW7bMZs+ebSNHjgzF0wMAAAAAEBVatWplixcvtvHjx7t65Z9//rk99thj1rp1a6tcuXKomwcAQFjJ9jIsBw4ccD3bAwYMsLJlywZlm8FaiZvVgRHpImElbvYzRLpg7GfpXYUbAAAgs66++mp79tlnbcqUKfbiiy9aoUKF7KabbrKHHnoo1E0DACDsZHuyfPDgwVa/fn0XnIMlWCtxszowIl0krMTNfoZIF6z9LD2rcAMAAJyt1GZwX3/99e4CAADCKFn+7rvv2vLly23evHlB3W6wVuJmlB8iXSSsxM1+hkgXjP0svatwAwAAAACAHJosnzNnju3evduaN2+e7PpBgwbZBx98YFOnTs3QdlmJG/gL5U2AyNjP6DQCAAAAACDKk+WjR4+2o0ePJruuZcuW1qNHD/vb3/6WnU0BAAAAAAAAACA0yfLSpUunen3x4sXTvA0AAAAAAAAAgKwWm+XPAAAAAAAAAABAmMvWkeWpWbNmTaibAAAAAAAAAACIcowsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAgh5k8ebLdeeedya4bMGCAVa9ePdmlRYsW/ttPnjxpzz//vDVt2tTq1atn9913n23ZsiXZNlatWmUdOnRwt+uxr776ara9JgAAAAAAshrJcgAAcpCZM2fas88+e8r1a9assW7dutlXX33lv7z99tv+2ydMmGCzZs2yoUOH2htvvOGS5507d7bjx4+72/fu3WudOnWy8uXL25w5c+yBBx6w0aNHu38DAAAAAJAT5A51AwAAQObt2LHDBg0aZEuXLrUKFSoku83n89m6deusS5cuVrJkyVMeq4T4Sy+9ZL1797bmzZu768aOHetGmX/88cfWunVre/PNNy0uLs6GDBliuXPntsqVK9vmzZttypQp1rZt22x7nQAAAAAAZBVGlgMAkAP8+uuvLpn9/vvvW926dZPd9vvvv1tCQoJVqlQp1ceuXr3aDh8+bE2aNPFfV7hwYatZs6YtW7bM/b58+XJr1KiRS5R7Lr30Utu0aZPt2rUry14XAAAAAADZhZHlAADkAKohHliDPNDatWvdz9dee82++OILi42NtWbNmlnPnj2tUKFCtn37dnd72bJlkz2uVKlS/tv0s1q1aqfcLtu2bbMSJUpkyesCAAAAACC7kCwHACCHU7JcCXIltydNmuRGmj/99NP222+/2SuvvGJHjhxx98uTJ0+yx+XNm9f279/v/n306NFUb5djx45luG0qEaNR75kVExNj+fLly/R2gFDRfqj9AdHJ+x72fiJ8aT9VzAEAADkTyXIAAHK47t272x133GFFixZ1v2uEuGqX33bbbbZixQqLj4/31y73/u0lwb0EtK73FvsMvF3y58+f4bYlJibaqlWrLLPUTpWNASLVxo0bSZTClbZC+EvZeQwAAHIOkuUAAORwGlXuJco9VatW9ZdX8cqv7Ny508qXL++/j36vXr26+3eZMmXc74G830uXLp3htqnOepUqVSyzGOWHSFexYkVGlkcxdZQoUa4FmpklE960YDYAAMi5SJYDAJDD9e3b1yW2p0+f7r9OI8pFierzzz/fChYsaEuXLvUnyw8cOGArV660Dh06uN8bNmxob7zxhiUlJVmuXLncdUuWLHEJvuLFi2cqyZ2ZkelATkGCFN7fAd+J4Y3OWQAAcrbYUDcAAABkrVatWtnixYtt/Pjxrl75559/bo899pi1bt3aKleu7KaTKyk+evRo+/TTT2316tVu8U+NJm/ZsqXbRtu2be3QoUPWv39/N6pu7ty5LvnetWvXUL88AAAAAACCgpHlAADkcFdffbU9++yzNmXKFHvxxRetUKFCdtNNN9lDDz3kv0+PHj3sxIkTNmDAALeYp0aST5s2zZVJEY0enzp1qg0fPtzatGnjap5rxLr+DQAAAABAThCSZPnu3btt5MiR9uWXX7rFwXRC3q9fPze6DQAAZI5ibErXX3+9u6RFpVX69OnjLmmpU6eOzZ49O2jtBAAAAADAor0MywMPPGCbN292I9zefvtti4+Pt44dO7qFbQAAAAAAAAAAyPHJ8v3791u5cuVs2LBhboSaRpPff//9buGx3377LbubAwAAAAAAAABA9pdhKVKkiI0ZM8b/+549e9wCYVpErEqVKtndHAAAAAAAAAAAQrvA58CBA+3NN9+0PHny2MSJEy1//vwZ2o7P57OEhIRMtycmJsby5cuX6e0AoaJSRtofwhn7GSJdMPYzPV77AgAAAAAACB8hTZbffffd1q5dO5s5c6arYz5r1iy76KKLzno7iYmJtmrVqky3Rwm8mjVrZno7QKhs3Lgx7Gv/s58h0gVrP1NHMQAAAAAACB8hTZZ7ZVeGDx9uP/30k82YMcNGjBhx1tuJi4sLSgkXRvkh0lWsWDEiRpYD0b6frVu3LmjtAQAAAAAAEZosV43yxYsXW6tWrSx37r+ePjY21iW7tchnRpNvGS3hAuQklDcBImM/o9MIAAAAAIDwE5vdT7hr1y7r1auXS5gHllFZuXKlVa5cObubAwAAAAAAAABA9ifLq1WrZs2aNbNhw4bZsmXLbO3atfbII4/YgQMHrGPHjtndHAAAAAAAcozJkyfbnXfemew6rfHVoUMHq1evnrVo0cJeffXVkLUPAIBwlu3JcnnmmWesSZMm1rNnT7v11ltt3759bpHPc889NxTNAQAAAAAg4um8+tlnn0123d69e61Tp05Wvnx5mzNnjj3wwAM2evRo928AABAGC3wWKlTIBg8e7C4AAAAAACDjduzYYYMGDbKlS5dahQoVkt325ptvWlxcnA0ZMsStG6byp5s3b7YpU6ZY27ZtQ9ZmAADCUUhGlgMAAAAAgOD49ddfXUL8/ffft7p16ya7bfny5daoUSOXKPdceumltmnTJremGAAACPHIcgAAAAAAEByqQ65LarZv3+7WDgtUqlQp93Pbtm1WokSJs34+n89nCQkJFgwxMTGWL1++oGwLyG5Hjhxx+0O4Yz9DJDsSpP1M29C+cCYkywEAAAAAyKGOHj1qefLkSXZd3rx53c9jx45laJuJiYlu0dBgUAKvZs2aQdkWkN02btzoEnnhjv0MkWxjEPezlPEwNSTLAQAAAADIoeLj4+348ePJrvOS5Pnz58/QNlXypUqVKkFpX3pG+QHhqmLFihEzshyI9v1s3bp16bofyXIAAAAAAHKoMmXK2M6dO5Nd5/1eunTpDCfeMppoB3ISSpsAkbOfpbfTiAU+AQAAAADIoRo2bGjfffedJSUl+a9bsmSJG6lXvHjxkLYNAIBwQ7IcAAAAAIAcqm3btnbo0CHr37+/m4I+d+5cmz59unXt2jXUTQMAIOyQLAcAAAAAIIfS6PGpU6e6BdLatGlj48ePt759+7p/AwCA5KhZDgAAAABADjFy5MhTrqtTp47Nnj07JO0BACCSMLIcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AQA4zefJku/POO5Ndt2rVKuvQoYPVq1fPWrRoYa+++mqy20+ePGnPP/+8NW3a1N3nvvvusy1btpzVNgAAAAAAiGQkywEAyEFmzpxpzz77bLLr9u7da506dbLy5cvbnDlz7IEHHrDRo0e7f3smTJhgs2bNsqFDh9obb7zhkuedO3e248ePp3sbAAAAAABEstyhbgAAAMi8HTt22KBBg2zp0qVWoUKFZLe9+eabFhcXZ0OGDLHcuXNb5cqVbfPmzTZlyhRr27atS4i/9NJL1rt3b2vevLl7zNixY90o848//that259xm0AAAAAABDpGFkOAEAO8Ouvv7pk9vvvv29169ZNdtvy5cutUaNGLsntufTSS23Tpk22a9cuW716tR0+fNiaNGniv71w4cJWs2ZNW7ZsWbq2AQAAAABApGNkOQAAOYBqiOuSmu3bt1u1atWSXVeqVCn3c9u2be52KVu27Cn38W470zZKlCiRoXb7fD5LSEiwzIqJibF8+fJlejtAqBw5csTtD4jezz/wJ8KX9lPFHAAAkDORLAcAIIc7evSo5cmTJ9l1efPmdT+PHTvmT86kdp/9+/enaxsZlZiY6BYOzSwlyjUSHohUGzduJFEKN1sH4S9lPAQAADkHyXIAAHK4+Ph4/0KdHi/BnT9/fne76D7ev737eKO1z7SNjFLpmCpVqlhmMcoPka5ixYqMLI9i6ihRolxrTjBLJrytW7cu1E0AAAA5LVm+b98+e+aZZ2zRokV26NAhq169uj388MPWoEGDUDQHAIAcrUyZMrZz585k13m/ly5d2k6cOOG/rnz58snuoxidnm1kJsmdmWQ7kFOQIIX3d8B3YnijcxYAgJwtJAt89urVy3744QeXMJ8zZ47VqFHD7r33XtuwYUMomgMAQI7WsGFD++677ywpKcl/3ZIlS9xI1uLFi9uFF15oBQsWtKVLl/pvP3DggK1cudI9Nj3bAAAAAAAg0mV7snzz5s329ddf2+DBg91Icp1kDxw40C0SNm/evOxuDgAAOV7btm3dTK7+/fu76eNz58616dOnW9euXf21Vzt06GCjR4+2Tz/91FavXm09e/Z0o8lbtmyZrm0AAAAAABDpsr0MS9GiRW3KlClWu3btZFPZdNEoNgAAEFwa+T116lQbPny4tWnTxkqWLGl9+/Z1//b06NHDlWMZMGCAW8xTI8mnTZvmaoqndxsAAAAAAESybE+WFy5c2K688spk1y1YsMCNOH/ssccytE0thpSQkJDptilhT71IRPriUOG+OBj7GSJdMPYzPT4ra56OHDnylOvq1Kljs2fPTvMxuXLlsj59+rhLWs60DQAAAAAAIllIFvgM9P3339ujjz7qpnk3b948Q9tITEy0VatWZbotSuDVrFkz09sBQmXjxo0ukRfO2M8Q6YK1n6n0CQAAAAAACB8hTZYvXLjQevfubRdffLGrk5pRmiJepUqVTLeHlc0R6bQGQCSMLAeifT9TzW8AAAAAABBeQpYsnzFjhqt7et1119lTTz2VqRF2Sr7lz58/qO0DIhHlTYDI2M/oNAIAAAAAIPzEhuJJZ82aZUOHDrX27dvbM888w1R0AAAAAAAAAEB0jSxXrdcnn3zSrr32Wuvatavt2rXLf1t8fLwVKlQou5sEAAAAAAAAAIhy2Z4sX7BggVuQ85NPPnGXQG3atLGRI0dmd5MAAAAAAAAAAFEu25Pl3bp1cxcAAAAAAAAAAKK6ZjkAAAAAAAAAAOGEZDkAAAAAAAAAIOqRLAcAAAAAAAAARD2S5QAAAAAAAACAqEeyHAAAAAAAAAAQ9UiWAwAAAAAAAACiHslyAAAAAAAAAEDUI1kOAAAAAAAAAIh6JMsBAAAAAAAAAFGPZDkAAAAAAAAAIOqRLAcAAAAAAAAARD2S5QAAAAAAAACAqJc71A0AAAAAAABZZ8eOHdasWbNTrh8xYoTdcsstIWkTAADhiGQ5AAAAAAA52OrVqy1v3ry2cOFCi4mJ8V9fqFChkLYLAIBwQ7IcAAAAAIAcbO3atVahQgUrVapUqJsCAEBYo2Y5AAAAAAA52Jo1a6xy5cqhbgYAAGGPZDkAAAAAADl8ZPmePXusffv2dtlll9ntt99uX3zxRaibBQBA2KEMCwAAAAAAOdSJEydsw4YNVqVKFXvkkUesYMGCNn/+fOvSpYu9/PLL1qRJk7Peps/ns4SEhKC0TzXU8+XLF5RtAdntyJEjbn8Id+xniGRHgrSfaRuB63akhWQ5AAAAAAA5VO7cuW3p0qWWK1cui4+Pd9fVqlXLfvvtN5s2bVqGkuWJiYm2atWqoLRPCbyaNWsGZVtAdtu4caNL5IU79jNEso1B3M/y5MlzxvuQLAcAAAAAIAcrUKDAKddVrVrVvvrqqwxtLy4uzo1UD4b0jPIDwlXFihUjZmQ5EO372bp169J1P5LlAAAAAADkUBpB3q5dO5s4caI1btzYf/0vv/yS4YS3Em/58+cPYiuByERpEyBy9rP0dhqxwCcAAAAAhEEtWUb+IStUrlzZKlWqZEOGDLHly5fb+vXrbcSIEfbjjz9a9+7dQ908AADCCiPLAQAAAKTLSZ/PYknoBh21ZLNeNP/txsbG2qRJk2zMmDH20EMP2YEDB9zfmxb3rFatWqibBwBAWCFZDgAAACBdlGyc/M0J27Y//OuzAp6yRWKs62XRfepbokQJN5ocAACcXnQfMQAAEEV27NhhzZo1O+V6nTzfcssttmrVKhs+fLirYVqsWDHr2LGj3XXXXf77nTx50saPH29vvfWWHTx40Bo2bGiPP/64nX/++dn8SgCEkhLlm/eGuhXA2aBzBwAApA/JcgAAosTq1astb968tnDhwmR1cQsVKmR79+61Tp06WYsWLeyJJ55wdUz1s0CBAta2bVt3vwkTJtisWbNs5MiRVqZMGRs1apR17tzZ5s2bZ3ny5AnhKwMAAAAAIPNIlgMAECXWrl1rFSpUsFKlSp1y2yuvvGJxcXFu8a/cuXO7xcA2b95sU6ZMccny48eP20svvWS9e/e25s2bu8eMHTvWmjZtah9//LG1bt06BK8IAAAAAIDgibUQmzx5st15552hbgYAADnemjVrXBI8NcuXL7dGjRq5RLnn0ksvtU2bNtmuXbvcqPTDhw9bkyZN/LcXLlzYLRC2bNmybGk/AAAAAAA5dmT5zJkz7dlnn7UGDRqEshkAAETNyPKiRYta+/btbePGjXbBBRdY9+7dXR3z7du3W7Vq1ZLd3xuBvm3bNne7lC1b9pT7eLdlhM/ns4SEBMsslZXJly9fprcDhMqRI0fc/hDO2M8Q6YKxn+nxgaXMAABAzpI7VAuMDRo0yJYuXeqmgwMAgKx14sQJ27Bhg1WpUsUeeeQRK1iwoM2fP9+6dOliL7/8sh09evSUuuOqby7Hjh1zCQZJ7T779+/PcLsSExPdwqKZpQSeRrkDkUodWN5+Fq7YzxDpgrWfsU4HAAA5V0iS5b/++quri/r+++/bCy+8YH/88UcomgEAQNRQeRV1UufKlcvi4+PddbVq1bLffvvNpk2b5q5TXfJASpJL/vz5/Y/Rfbx/e/fJzEhTHQ8ogZ9ZjPJDpKtYsWJEjCwHon0/W7duXdDaAwAAwk9IkuUtWrRwFwAAkH0KFChwynVVq1a1r776ysqUKWM7d+5Mdpv3e+nSpd3IdO+68uXLJ7tP9erVM5V8UzIeiHaUNwEiYz+j0wgAgJwtpDXLg4V6p8BfqHcKZL1IrXeqEeTt2rWziRMnWuPGjf3X//LLL25kd40aNeyNN96wpKQkN/pclixZ4kbhFS9e3AoVKuRKt2h0upcsP3DggK1cudI6dOiQra8FAAAAAICskCOS5dQ7Bf5CvVMg60VqvdPKlStbpUqVbMiQIfbEE0+4hT7ffPNN+/HHH23OnDkuIT516lTr37+/de7c2X7++WebPn26u6/XXiXFR48ebcWKFbNy5crZqFGj3Ij0li1bZutrAQAAAAAgK+SIZDn1ToG/UO8UyHqRWu80NjbWJk2aZGPGjLGHHnrIjQpXx5UW96xWrZq7j5Llw4cPtzZt2ljJkiWtb9++7t+eHj16uHIsAwYMcAuCNmzY0NU7VxwGAAAAACDS5YhkOfVOgb9Q3gTIepFc77REiRI2YsSING+vU6eOzZ49O83bVZ6lT58+7gIAAAAAQE4TG+oGAAAAAAAAAAAQaiTLAQAAAAAAAABRL+RlWEaOHBnqJgAAAAAAAAAAohwjywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJYDAAAAAAAAAKIeyXIAAAAAAAAAQNQjWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEvZAky0+ePGnPP/+8NW3a1OrVq2f33XefbdmyJRRNAQAA6UT8BgAgMhHDAQAI42T5hAkTbNasWTZ06FB74403XODu3LmzHT9+PBTNAQAA6UD8BgAgMhHDAQAI02S5gvFLL71kPXr0sObNm9uFF15oY8eOte3bt9vHH3+c3c0BAADpQPwGACAyEcMBAAjjZPnq1avt8OHD1qRJE/91hQsXtpo1a9qyZcuyuzkAACAdiN8AAEQmYjgAAGGcLFfvtZQtWzbZ9aVKlfLfBgAAwgvxGwCAyEQMBwAg/XJbNjty5Ij7mSdPnmTX582b1/bv33/W20tMTDSfz2c///xzUNoXExNjJ5JOWqzPF5TtAdnh2PEYW7Fil9sXIoH2s5YlzU4UD3VLgPTLHWu2YoUFZT9T7NJ+EEmCHb+FGA4Qw4HsQAwP73Nw0XvaK6mVncibFLRtAlkpd1IuW7FiRcTEb2E/Q7TvZ4npjOHZniyPj4/3103z/i3Hjh2zfPnynfX2vBcZzAOW3LlCsu4pkGmRdOBeKG+oWwCEbj/TNiJpf82K+C3EcOB/Iuk7gRiOSEUMD99zcCkeVzio2wOyQ6R9H7CfIZr3s5h0xvBsT5Z7U7927txp5cuX91+v36tXr37W26tfv35Q2wcAALI+fgsxHACArMc5OAAA6Zftw6+08nbBggVt6dKl/usOHDhgK1eutIYNG2Z3cwAAQDoQvwEAiEzEcAAA0i/bR5arTlqHDh1s9OjRVqxYMStXrpyNGjXKypQpYy1btszu5gAAgHQgfgMAEJmI4QAAhHGyXHr06GEnTpywAQMG2NGjR11v9rRp0ywuLi4UzQEAAOlA/AYAIDIRwwEASJ8YXyQt3QsAAAAAAAAAQE6oWQ4AAAAAAAAAQLghWQ4AAAAAAAAAiHokywEAAAAAAAAAUY9kOQAAAAAAAAAg6pEsBwAAAAAAAABEPZLlAAAAAAAAAICoR7IcAAAAAAAAABD1SJZHuerVq9vcuXND2oY777zTHnnkEffvpUuXujb997//DWmbgFDaunWrzZ8/3/97ixYtbNy4cRneXmYffyb6DtF+CyB7EcOB8EMMB5AexHAg/BDD4ckd6gYAgerXr29fffWVFStWLNRNAUKmX79+Vq5cObvxxhvd72+//bblzZs3w9vL7OMBID2I4QAxHEBkIoYDxHD8D8lyhJU8efJYyZIlQ90MIKxk9qCVg14A2YEYDpyKGA4gEhDDgVMRw6MXZVhgGzZssH/84x9Wq1Ytu/766+3DDz/033by5EmbPHmytWrVyt1+8cUXW+fOne3333/33+fzzz+3W265xerWrWtNmjRxU7n279/vv339+vV23333ud7qK664wh5++GH7888/U21LyulfmrYybdo0++c//+ke37hxYxs2bJidOHHC/5jvv//e2rdvb3Xq1LHmzZvbE088YYcOHcqidws4s3379rm/wyuvvNL9XWr/0t+2aBrW7bffbi+88IL7e27QoIE9+uij/r9ZTYf89ttv7Z133nF//ymnb+lnx44dbfz48XbZZZe5/eLxxx+3bdu2WdeuXd1+eO2119qiRYv87Ql8vPav1C7anhw/ftxGjRplTZs2ddu+7bbb3CiTQJ988onddNNNVrt2bbvjjjvcdDUAoUEMB4KLGA4guxDDgeAihiNYSJbDXnnlFbv55ptt3rx5Lhj37NnTfvnlF3fbq6++6oKkAu+CBQvcF8umTZts5MiR7vY9e/bYgw8+aG3btrUPPvjA7ejLli2zp59+2t2+Y8cOtxNfcMEFbgrKpEmT3JdRu3btLCEhIV3te+6556xhw4b2/vvvW9++fW3GjBn273//2922evVq69Spk/tC0e2jR4+2X3/91e655x7z+XxZ9p4BaUlKSnJ/f8uXL3fBTnXEqlWrZvfee6/9/PPP7j4rVqxwge+ll15y+5T2mYceesjdpmCq4KgDZu0zqdG2N27caDNnzrQBAwbY7Nmz7f/+7//cY/R8lStXdvtsavuAnjfwcsMNN1ipUqXs1ltvdbfrgOHrr792+5IOFLTNbt26+YO+Dop10KzvCu1zbdq0sSlTpmThOwrgdIjhQPAQwwFkJ2I4EDzEcASVD1GtWrVqvieffDLZde3atfM9/PDD7t+ffvqp77PPPkt2+6hRo3xXX321+/fKlSvdNgLvs3btWt+qVavcv8eOHev729/+luzxCQkJvjp16vjmzJnjfu/QoYOvX79+7t9Llixx29uyZYv7/aqrrvJ179492eP//ve/+wYOHOj+3bt371Nu//333902tC0guy1atMj9/a1Zs8Z/3cmTJ30333yzr0ePHr7nn3/eV6tWLd/27dv9t3/++efuMevXrz9ln/D2Az1O9LNGjRq+gwcP+m9v3Lixr1evXqe0YceOHac8PtDLL7/sq1u3ru+XX35xv2/atMk9Tvt1oL59+7o2Sc+ePX233357stuHDRvmHgcgexHDgeAihgPILsRwILiI4QgmapbDLrnkkmS/a/rIkiVL/NNGfvrpJ9errB40XdatW2elS5d2t9eoUcNat27terxU4+zyyy93U7A0/URWrlxpv/32m+uhC3Ts2DE3LSw91DsXqFChQpaYmOjf/ubNm0/Zvmj7ml4DZKe1a9e6v1H1YntiYmLcNC/1IFepUsUqVKjg34dE0yq9x1aqVOmMz1G8eHErWLCg//f8+fNb+fLl/b/Hx8f7p3Kl5bPPPnM97mPHjrWLLrrIvz+JRqEE0v5WuHBhfxu1nwfS/qfRLwCyHzEcCB5iOIDsRAwHgocYjmAiWQ6LjY09ZfqKFvgQTevQ9BRN8VAdNNVo+vTTT23+/Pn++48ZM8YeeOAB++KLL+ybb76xPn36uMCvaWWqtXbppZfaoEGDTnlefZGlh9eWQN60Fm1fNZt0kJASiykgFNKadqjrc+f+6ys3Li7ulH1OcuXKla7nSPn41Pbj01m1apWrWdijRw9r2bLlKW3XtLICBQqkun0dcGi/O1N7AGQPYjgQPMRwANmJGA4EDzEcwUTNcrjaYoFUC6lq1aru36ptpgA8ePBgV9+sXr16rlaatzOrt/vJJ590vXAK4Arq+l094rt373bbUc9y2bJlXb00XYoUKeLuo56xzNL21cPubVsXLToyYsQIt9ACkN20SMfBgweT/X1rf/nuu+9cb7ZoZIju4/nhhx/cz5o1a2Z5+1S/UAuQKDjrZyBvv9fCP4H7lOqv6SIXXnihv70er7YigOxHDAeChxgOIDsRw4HgIYYjmEiWw6ZPn+4WENBq3F7w1KrZouCqRQYUCHW7pop8/PHH/mklmoIya9YsN41E07D0WC0wouktRYsWddNI9GXUu3dvtwiILlq4RAsrBE6PySgt4KApK1rxWAcD+vJQT50OJNQGILtppXlNi9TfoVbT1t/lkCFD3L5x9913u/toUR0tkqPrNApEt2uBj3Llyrnb1Zv8xx9/2Pbt24PaNj2vRn+ce+65rn27du1yAVkXrRyuIH3VVVe5ESiaHrZlyxZ78cUXbfLkyf7pZdrntB8/9dRT7mBDi4tosR8AoUEMB4KHGA4gOxHDgeAhhiOYSJbD7r//fnvttdfsb3/7m/tSUa90xYoV3W1aTfvo0aNule0OHTq4LxUFRPVWb9261dUx06rB6sHWSt633367m8KiHVvTRc4//3y3Ax8+fNjdpm1oqojqKgVjepZ62KdOneqms2iKWvfu3V3bdeCR2rQxIKvp71+ra6t32luhXvUC9Tepv1fv4FeBvH379tarVy+7+uqr/Svbyz/+8Q+3r2mf9KaGBYMOjnVQq4NZrVyvmmc6qNBFK2uLDsTV2/3444+7A4d3333Xhg8f7vYvUbu1fy9dutS1T68rtemXALIHMRwIHmI4gOxEDAeChxiOYIrRKp9B3SIAIE06qNUIEvUYAwCAyEEMBwAgMhHDcTYYWQ4AAAAAAAAAiHokywEAAAAAAAAAUY8yLAAAAAAAAACAqMfIcgAAAAAAAABA1CNZDgAAAAAAAACIeiTLAQAAAAAAAABRj2Q5AAAAAAAAACDqkSwHAAAAAAAAAEQ9kuUAAAAAAAAAgKhHshwAAAAAAAAAEPVIlgMAAAAAAAAAoh7JcgAAAAAAAABA1CNZDgAAAAAAAACIeiTLAQAAAAAAAABRj2Q5AAAAAAAAACDqkSwHAAAAAAAAAEQ9kuUAAAAAAAAAgKhHshwAAAAAAAAAEPVIlgMAAAAAAAAAoh7JciBEfD5fqJsQFm2IxrYDALIGsSFnv2+R0k4AAABELpLlyPEeeeQRq169epqXyy+/PNvb9Omnn1q/fv1Oe59x48a59tWuXdsOHTqU6n1ef/11d58WLVqc1fNv377dunTpYn/88cdZvY9n+zze41J73+vXr2833XSTvfzyy2e9zd9++81uv/12C6Zu3brZW2+95f6dmJhojz/+uDVs2NBatWpln3/+ebL7Hj161K688kr77rvvTtlOhw4d7IMPPghq2wAg0q1du9Z69uzpYm6tWrXsiiuusIceeshWr16drbE1vbwYnB6LFy+2Bx980Jo2bWp169Z1ceOpp56y3bt3WyRQ7FN7M/La00LsB4Doduedd7rL6aQn3ixdutQfQ7766qtU77N+/Xr/ff773/9auMjo+XNq9Nr0fmUXxTrFyMaNG7vjtubNm9tjjz1mW7ZsybY2AKGUO6TPDmSTkiVL2vjx41O9LS4uLtvbM3369HTf98SJE/bZZ5/Z3/72t1Nuy+iJ2TfffHPKSWB2vv8aGbZr1y574403bOTIkZY3b16744470r29jz76yH744YegtW/u3Lm2Y8cOa9u2rfv9zTfftE8++cRGjBhhK1ascAmehQsXWrFixdztr7zyitWsWdMuueSSU7alg4h7773XHVgUL148aG0EgEilJGe7du2sXr16NmDAAPfdqE7bGTNm2G233Wavvvqquy07Y2uwjB492qZOnWrXXXed9e/f38455xxbs2aNvfjii/bxxx+711i2bFkLZxMnTrRGjRoFfbvEfgBAsMTGxro4oM72lOisDC4NAujcubNde+21Nnz4cCtUqJD9/vvv9tJLL9n//d//uU7m8uXLh7qZQJZiZDmiQp48edyJeGqXiy66KNTNO62LL77YPvzww1Ou1wne8uXLrUaNGhZp779Glin4Tpo0ycqUKeNOWENFI8WU7FDPuQ7CvM6EG264wa655ho38lHX//zzz+62vXv3ugOFXr16pbo9nUjXqVPHJR8AAOZGERctWtQlkK+//nqXmFUHsJLbSi5PmDDBItH8+fPda9LIsWeffdYlzC+99FK7++67XZJ8z5497iQzWhH7AQDBPCdWh6YGkqWWLI+Ec+JIoTitmKZjG8VEdQTfeuutrtNY8TMjs8OASEOyHPj/fvnlF5c410mvR1OomzRpYp06dfLXyVy2bJkbPaRpupqSpKlVmhJ18uRJ/+NUNmXo0KFuSrZOEDVqadGiRe42TUf79ttv3UXTqTS17HR04qYpZylLsahnvWLFinbhhRcmuz4pKcmmTJlirVu3dkFOz/+Pf/zDlixZ4m7Xyemjjz7q/n311Vf7X69enxIXSmTocTqhnTZt2in1QfV4TU9WeRglOzIzQl2j+vPly2cxMTH+6xSAx4wZYy1btnTvrw6M9P6vWrXK3a732hupFjgdTe+/Xrfarcepja+99toZ2zBnzhw7duyYXXXVVf7r1B6NePP+nTt3bve+ipI6+syrVq2a5jY1xfztt992iRIAiHYaTaxYEhgnJX/+/G5EruJOypPeW265xSVXVbZFpTH279/vv13f+/quVyxQ4l2jzBSPUout+/btc4+/7LLLXNzSSHaNmAqkGKDRxHouPadipK47E8WcKlWquOR4ShUqVLA+ffq47XlxVNt84YUXXFJdbVGc0zYC3xcdI/Tu3dt69Ojh4rfin6aU6zXp5FSPVakXxS6vvE3Xrl1drNTlgQceOGWK9M6dO115Gh3PqD0qGeKN0FY8U0m2d955J9Xp6zNnznTXb9y4Mdn17733nktMbNu2zc4WsR8AcLZ0TqyY7p3TelTObdOmTaccS6RF39FPPPGE+/5X3NBxhGJnYPzTKGqvBIlirmbHBWNWtuJIYK7BO7dOGX91LKPn9Eq7qTM3JcV2zYBS+5WX0LHO2LFjTyn7olHgN954o7+UiuKnF9vOdNyWUqlSpdwMwcAytunJIXz99dduJplmZuk9ffjhh5MdP+g9UKez2qpt6zWtW7fO3aYZXjom1HGTbhs2bJglJCQkO34YPHiwNWvWzL1GHSfp+YHMIlmOqKFe6NQu3he5vlzvu+8+d8LonUgr6OhETNOFdeKkYNyxY0c3Ek7BSCOIGjRo4E7gvNHfCj733HOPzZs3z53A6gSrUqVKLghrJPigQYNcMNBl9uzZZxzZrgCpbaoUS8pkggJfShoppedUgNXUcCXtdWDxr3/9y44cOeKCZPfu3d191e7777/f/fvpp592FwVY9SZripW2pZNQj4Kafte2FGj1nuiEPj11WQPf8+PHj7sDAiUndAJ+8803++/Xt29fdxKrmuoaxaWkhabwK6jqs1Kvttomev/0uyhIPv/88y5hovYrUD755JMuMXE677//vntPNALOowSFOjc0el8BWgFZfx9KQCiY6zWfjt5DfWYa/QAA0U7fsVu3bnUdt0q+qraoF3v1Xd2mTRv/fRW/NHpX38P6TlfsXLBggUsi64TIo+3pxFWxWHHimWeeOSW2KhmqRLZqmeuEUjFPI5o1tTgwYa6ktkpwKGZrFJUS82cq6fLnn3+6YwK9tsCkbyCdGKpzXbfr9erEW3FZccuLU3o+HRcE0vFEgQIF3DGG2upR3NVximK1ThgVP/WeKgar5rhGsStOqa63F5cPHz7sflfngV6n3gMlhHWcouSCfle5FNXi1vumE+GUCWDdX8nxQO+++65Lvp+pxAyxHwAQDOqcVoelBoylnOWl5Kpi2ZkonijWK3mrjmklVbXmiI4JvFisc3/dR+fNirc6LtG5v86fN2/ebFnt119/dTFapU8U3+66665TZjUpnur45vvvv3eDDhRXdUyi+Blo8uTJNnDgQBevFSPbt2/vZsTputNRfFSnuo691Akc2Amv+KvR5p4z5RB0vKDXo+MFHaspvmvbylUE5hAUP9V+HcvoPpUrV3b5FB0HKpeiuK7PSvFb+QvvOFIx/4svvnCDAvR5ajCg2uMNKgAyiprliAoaNZVWUlonaDqZFX0ZKymt3madsOlk6bnnnrPSpUu72xWENDpt1KhR/mm7OmHVY3QiquS1vqx/+ukn94XuBRJNy1aQUU+4vuQLFizork9PjdYSJUq43mKdPHt1y/V69BwKBCmn/Hq9zIELquhE95///Kero6rn9GqMaVTYeeedZwcOHHA1YzXaTCfTotepZIBG0uuAwTt40OtS8PK2q86DH3/80QWms33/NfJOBybegl0K/DqxV4+1Rg+IDn40ql4dFurlVqJDl8D3TyfdSnToQEKfm2ikoRIUOkhQwkIlAFLSdlWXNOVIBL0Pek06UNBnpQ4H/Q3opF2jEnXApFEBCvTqHVdA1yi5wNGSeo904KUDAQCIZvoOVjzRScyQIUPcdfpO1ve0TgI1EkmUpFZM0/esOqs91apVcyd4OvHRT1HyVSdG6rD2pIytiguK2/qp0VmikUeKjzqR0/aUkFUyXklXLxZpVpiSxN6optR4I6IUQ9NDxwYaGaYTRa+jW8cP8fHx7jhD74M3alkjr3Uc4iVyvdFmilVefW1RTFLsUWLfe+06Idaxh5Lyen80AMAbOe5NUdeobSWqFd910qvnUV3u1I5JChcu7EaJ6eRUHeWKq6o3r+MZHQudDrEfABBM+t7WOatitmb/eAPI1BmdHjpP1vd24PGDvs81klwdsaIE7oYNG1xCVh3JouMUdS4rXmU1xS+tfaHjIW9tNcUynd97FJPVRh3HqFPXyzcEJrEPHjzoH0Cn+OrFSMUy/a7ZW2nNllK81+OVKNcod1EM1vuhc38lr+VMOQR18Ot4S8+r2WMeHYco3uu4ULkYjz5HxWBRMlyP1TGZfgYeQ6gNGjCh+6p9Op7yjq30eSoes34IMotkOaKCeprTqiMZOCpKAUmjs3TyqIW6NNpNo5Q8OrnURaPVdJKm3mVNEVZPaGJion/laG0ncAqUEuta0CqjFEw05UgneDqBUw+6TkAvuOCCU+7rBSJNMVMQVRv/85//uOvSCvA6OVTiQdOfA3mB1aNA7SXKA5MECqbpff8VVBW4dVCik2BNCffohN2bNqWRXXqPNfLtTO3XSbsCqt7zwDp2+l3Pq88k8OAhMNmhzy5lskPJCx0QaRSjOgR04q0yPV9++aVbsE0jAZUs0OtQQkO9/jroClSuXLmwWo0dAEJJJ146udH3qJKJ6mDWiKF///vfblSUksWKRfqeVxmxQDqh1XeqToi8ZLmcqT6pnkfxR/EyMDZo6rU6m5Wc14wvSRmzNavrdMly7yQ9ZWmZtKjtekzgMYWoE1zJct3unbTqJDRwxHNar1exT0llxSzv9ekYQe+XN2Vb8U8xLvCxShSogyC9NEpMn5PeK3Xea5SYRr4riX46xH4AQDDpnFjfvfr+VwJWg8cUN3QOq1lkgVLWNs+VK5frAFVyV7FD39U6T9b5skZoe7FGA9U0il2jr1UKVc+jjnavjGla205rltnZUuzScYqXKBe9Pj2HR6///PPP9yfKvfivx3ll6NSxq3iWWowUja5PK1muuKzBDRpsp6S0nk/bVYeCZlqp419tOlMOQTMJlThXp3MgDdzTcYCXiPcEHqvoc1HM1aC9wPbrOESvVe1XslzJceVZdF8l83XRAEggs0iWIyroC191rtJDX9KqG6YTpMBalqKAo5FGmo6sL22daOmLXifA3lQglTxRj6038jwYdEKqgKUR7Dqx1ihzjXpLjUZL6SROP3VCrGB/7rnnuttSqz3mtVk0sux01EsbyDsoOFOyIOX7r95kjY5Tb7Nqk6n2ukcnpZpOpQCpk3HVZPee90ztT60sjeggKjVekj/l6wo8cfZoBJ1mIOizVZJBveDqONAUePV2pzxh1nt/pk4EAIgmRYoUcYlwLxm+cuVKNxJJ36+KaV5dcp2opqTrUn6nKkacjmKDTtLSmlmm27znTDkC+UzTudXRrhio0dNp0bZ1fKB26t96jsCT3cDnCXxtab2ulLFKr08j6nRJyYvnuk9mR1dptJqOd5Qk95LlSlh49b3TQuwHAAST4obO1VWKRUlsxT/91PFFICXCU856VqkS1b7WqGwle9Vxqu92bS/we1+xXeVA1OmqslqKeUpcq/NV59j6jk9r28HgHS8E0rFE4HVadDq12B54nRcjvZlXqY2yPxMdo6jD3CuDpqS5jts0sl/vx5lyCN7taR3X6TgwUGBc9h6r91yXtNqvAY4a9a7PVXkaXZSfURtTru0GnA2S5UAK6jFVolxfrqqZpSnNmoYs+l0nSxpdpClG3he67uNRfTF9uevkLrCHWcFA152pRnlqFIB0sqoDA00D07Ty1EbKa+S56psq2a/R5xqdpqS9eoRPN4rMe30aje5Nq/JqwmoUmBbjCCadTGpkmaaFqZf+9ddfd++Vnks9wQq+moKmHnNdrxq3OpE+U/u1QndqSQavsyAl76BDI97ONH1ePeOqw+ZNz9PBlejgTFPEU9I2U5v+DQDRRAlLJUg1styrM+1RfXFNK/YWpfROdvWdGhiLvMS2YsLZUDzWdN3A6buBlAD2vqf1nIGxwjtJS4sep3iu2KQTx9RGlGmUskY7aYS0XptObjWiOTBh7p3sZSRe6PXpWERTqdMa+a77pDbSWaPo1KbA2WJp0WvTTDstnKnSKRr5rVl4Z4vYDwDILK98h8p56dxYtcdT0vobKiGSMuZrhpQ6OVWOTR2hXqlVzTbTiG6PrleyVc+h8249j2p96/td8Su1bWtGl2KrSq56FPNTdiynXFwzcLFKUZxJGV+UQwhc6Fzt0wyslAJrgHsxUsdAOhZKKbUEtmi0vuqzq7M4cCFPUT5C75s6B3RMc6YcQuAxVko6rjtdvPS2rU5qzaJLyTtmVMe82quLnlfHXJoBptHsyocAGcUCn0AAjRDTCaB6T3VypJ5jJcg9CqKa6qMTOi9RrsS6AoQ3ulrTn1WSRSdZgQFOgVUngZKRUec6MNBUMAVnJa+92p2BNCJLJ/iazq4R5d7zeG3x2pjy+ZWAV4+5N+XZo1511QJNORIuGPScqgGqKWLqsffeS5W4UQ+4pmd5yQfvZNkbXZay/V7NOQVtjWLzLvpcNL09raSHDjT02jRtKy16z3SQoQMfrzapeu0V4EU/U+vZ1zY1HRsAoplOxpS4nTVrlvt+Ty1u6URSZcVUV1wnPSr5EUgntzoB0sjk00kZG3RypZFj+o4OjA2auqua3vr+14mfpFwwLGU8TI1OGNeuXWszZsw45TaVcFEtUSWz9R6oLZqRlvJ5NBJKMtIprW3qeTQqznttmpKtGubeIpOKj+qIUG12jz4HTa32TvbTc0yiEXNKBOsYSQl2rwb82SL2AwAyW7dc3+86V1cCObV1s7yZTYEXJWYVe/T9rhjoJcqVvPZKl+k23Uex++eff3bxSDFWHftaP0XHImltW3kCtUkLg3p0DBJ4zq7yISljT2CS3huEp3P3wO0oHnolX734r45wlYMNnAEf2MGsOK3zew1aCGyrjsk0sj6tkmFKrOu5Va4mtdnj6jDXiHMN5jtTDkFlXnTflMd1Oi5RCZfTHdcp+a44q3YGtl+fm8rOaiCiXrPK5nkLm6qTXOX6NONMnxWQGYwsR1RQDTJ9IadFI7E1/UrTeHRSpB5M9VY+9NBDblqwvoRV30sBQSVQNBpKJ4veCG8FUi+gqXaWpv5oESg9XiOkVLZFo5M0LcjrKVUgVj1VjaxLOXUsrVIs6t3WSbDamdbUNAVhBWoFQl00otw7Ifba6PXU6mRaNdj0WpRg17Z1AKAArF5lvU69F8EsKRNI74/eTwU8vT6N0lOb1ZOtVbP1uaku2qJFi5L1vHvtV+DVgYA+P5WnUW05dXgoWaBAPnbsWNfTn1pvuqjDQ0FaBymqpZsafXY6ifemn3mfsd4rHRhpRFvKgzR1sigxodcAANFMSUmNztLIYY0w10mMYo7ikZLWGj2sUedeHFTCVAtJ6+RLpdB0kqTEpzqANbr5dFLGViV4lcjWyGstGqXSKToh1ugwLUal51CSXiOdFS+UzNZJsb73tSB2ejqxtT2tKaKYqXrkiis6wX755ZddjNBtolirznbV8dSJq2avqVan2qLXpdd3trT4mMqBqJ6nRnyr00Gz47Q4uWq6it4DjQjXiKsePXq4NukEWCfdWgDTe9900qn2eIutpqQTUCUP1Gmf2ii+s0HsB4DooMSwvjdTUuJZMcWT2n30nZ9aaROdWytpqkFoiiFpldRKjRfjVN5UxyRKtus4ROf0XrzR8YPyAjoHVlJdHd6K9UpM63w5LYoJGtGsmKRt63hEnf3e+b/ouEbt1kVxTCVWVdokkI6XFMfVIa8Z4+oA1qz2wBrmKmc3ZcoUd18dQ+m90nGHRpZ7s6oUq/R4HUNp9rmOQXT8od+Vu0irRImOxzT6XnkHHSeog1vvuWKccgdaMFydydqGEuZnyiEoaa5BgxrprZitDm7NvNPzpDYzLvD4UZ0UWvBd/9Z7561/otehYwd9Tvqp7en90XGBjgPURuVvgEzxATlcv379fNWqVTvtZeXKlb4ZM2a4f3/wwQf+xyYlJfnatm3ru/zyy3179+51l169evkaNWrkq1evnq9169a+V155xTdw4EB3nxMnTrjHHThwwPf444/7mjRp4u7Xrl0739KlS/3bXbx4sa958+a+iy66yPf++++n2u7nn3/etSdQ165dfTVr1vTt3r072eu76qqr/L8vWbLEd8stt/jq1Knjnv+ee+7xLV++3Fe/fn3fU0895e5z6NAhX8eOHd3z33fffe66kydP+qZOneq75pprfLVq1fJdd911vtdffz3N55EtW7a4Ns6ZM+e073/KxwXy3veRI0e63z/88EPfjTfe6Ktdu7bviiuu8D344IO+b7/91le9enV3X9m+fbv7XNT+QYMGuesSExN948eP91199dXu+mbNmrnb9Jmdzquvvupr2LCh7+jRo6fcpuuuvPJK16ZA2maXLl18F198se+BBx5wn3eg+fPnu/af6bkBIFr88ssvvp49e7rvZsUYfX926NDBt2DBglPuO2vWLN8NN9zgvssVWwcPHuzbt2/faeNjWrF1165dvkcffdTFQz1vq1atfC+++KKL7x7F7ueee87XtGlTFzv1vT5hwoRUnyM1ei69Fj2HHq/4OWrUKN+ePXuS3S8hIcHFOj2P2qi2KO4GtkXb0SW9sVbv67333utivI43brvtNt/ChQuT3UcxU8cuDRo0cO+7jgtWrVrlv33evHn+92fZsmVpvr+KlzVq1PDt2LHjjO8JsR8AoptiWVrn3o899pi7jxdvUrvonNQ7t9Xv+umZNm2au+6TTz7xX6cYqesUM09HMUUxQzFPxwyKV9qOHrto0SJ3n40bN7o4pNio2KL49MYbb5zxNSsOKD+gWKAYqBivc2zP4cOHfQMGDHDxRzH7oYce8n366aentFuxXe+fjim0HR1nXHbZZe798mzdutXFIm1H8X3IkCG+f/7zn+75U75e75hK23j44Yd9f/zxxxlfyzfffOPr1q2bOw7TY5X/6Ny5c7LPIT05BPnoo498bdq0cdtp3Lixr3fv3q796fns9J7qsdq22qA2rV692n/7wYMHfUOHDvUf/+k4QMcWR44cOeNrBE4nRv/LXLodACKXRjeqrI5qzt58881B2ebdd9/tRkykNQMAAIBIoxFqGr2ukf+RjtgPAIhUmsWkMnYtW7ZMtmaKZkOp7ItGWgPIHMqwAIhqKrujKXZaKOamm27KdH32FStWuKl8aS0oBwBAJFFyXNOaVYJFtedzAmI/ACBSqVyMyq+oTIpK0aju+gcffODWAMlsqTQAf2FkOQCY2X333edqzan+a2booEUX1ZIDACDSqfbq77//7uqe57R63MR+AEAk0oLh6vDVumhK6anWuuL0FVdcEeqmATkCyXIAAAAAAAAAQNSLDXUDAAAAAAAAAAAINZLlAAAAAAAAAICoR7IcAAAAAAAAABD1cluE++GHH9yCBnFxcaFuCgAA6ZKYmGgxMTFWv359i2bEcABApCGGE78BADk7hkf8yHIFadYojQz6nI4fP87nBWQh9rPIQOz6C+9DZOB7Bch67GeRg9jFexCO+A4BwhP7ZmTGr4gfWe71ZteuXTvUTcEZJCQk2KpVq6xKlSqWP3/+UDcHyJHYzyLDihUrQt2EsEAMjwx8rwBZj/0schDDid/hiO8QIDyxb0ZmDI/4keUAAAAAAAAAAGQWyXIAAAAAAAAAQNSL+DIsAAAAAAAgbTt27LBmzZqdcv2IESPslltuCUmbAAAIRyTLAQAAAADIwVavXm158+a1hQsXWkxMjP/6QoUKhbRdAACEG5LlAAAAAADkYGvXrrUKFSpYqVKlQt0UAADCGslyAAAAAABysDVr1ljlypVD3QwAOCtJSUmWmJhokerYsWP+n7GxLBuZleLi4ixXrlxB2RbJcgAAAAAAcvjI8qJFi1r79u1t48aNdsEFF1j37t1TrWOeHj6fzxISEoLeTmTMkSNHkv0EIp2+Y3bv3m0HDx60SH8duXPntj/++CNZCSxkDZUWK168eJrvtT6P9HwOJMsBAAAAAMihTpw4YRs2bLAqVarYI488YgULFrT58+dbly5d7OWXX7YmTZqc9TY10nPVqlVZ0l5k3KZNm0LdBCCoI4VLlCjh1lsg0YzTURJco/d37dplf/7552nvmydPHsvSZPnkyZPtq6++stdee81/nQLm8OHD7ZdffrFixYpZx44d7a677vLffvLkSRs/fry99dZbroeoYcOG9vjjj9v555+fmaYAAICzQAwHACA6aFTj0qVL3fT0+Ph4d12tWrXst99+s2nTpmUoWa4klpLvCA8aUa5EuerS58uXL9TNATJdeuX33393ayzonCQnJHFJ+GcPxaadO3da+fLlUy3Jsm7dunRtJ8PJ8pkzZ9qzzz5rDRo08F+3d+9e69Spk7Vo0cKeeOIJ+/HHH93PAgUKWNu2bd19JkyYYLNmzbKRI0damTJlbNSoUda5c2ebN29eurL7AAAgc4jhAABEF8XzlKpWreo6zjNCSZ/8+fMHoWUIJiXK+VwQ6Y4ePerqe2sWTLBqUIcy8e99Z0b6a4kE+pvR6HIlzb3O4UDp7bA46+ryO3bssG7dutno0aNdr2WgN9980zVoyJAhbvEQnVxrVNqUKVPc7cePH7eXXnrJevToYc2bN7cLL7zQxo4da9u3b7ePP/74bJsCAADOAjEcAIDooxHkF198sRtdHkgzyRgdDiBcMRIbofqbOetk+a+//upOpt9//32rW7dustuWL19ujRo1ctO8PJdeeqmbDqTM/urVq+3w4cPJpnkVLlzYatasacuWLcvsawEAAKdBDAcAIPqoE7xSpUquQ1zxfv369TZixAg3i0yLfAIAgEwkyzU9e9y4canWJ9XoMk3LDqQaQ7Jt2zZ3u5QtW/aU+3i3AQCArEEMBwAg+qicwaRJk6xOnTr20EMPWZs2beynn35yi3tWq1Yt1M0DgHRL8vki9rkPHTrkBixddtllbpHk9J67ydy5c6169eqZen6kX6YW+EytrlDKmqUqYi8qaK9FJyS1++zfvz9TBfMTEhIsGJjmkXU0hV811PST9zlraF+IBHz+WYf9LDL2M20j3D4fYjjSwvdK9iCGRzf2s+yRU2N4epQoUcKNJgeASJYrJsY6f7Pf1u7/qxZ4dqlWJJdNvazIWT9O8UIdlvo5f/58K168uP3555/2ySef2A033JAlbUWYJctVPF0HeIF0gi1aZMIrrq77BBZa130ys2KzemRWrVplmaWp6RddVMty5TrrAfdIB33G55xzTqibkWMlJZ20X3/9JV09lKGk/azmRbUsN/tZlmA/y1onkk7ayiDtZ+G2ICYxHGnheyXrEcPBfpb1cnIMB4CsHE2tBHU4UaL8p70nLBIoUa5zJyXL58yZY02bNrWtW7faG2+8QbI8jDupg5os1/TtnTt3JrvO+7106dJ24sQJ/3Xly5dPdp/MTCfQgXswFib5a3XaWNvw31129Hh4n6wAgeLzxFml80q4Fe3DfWSa9jOdZE/+5oRt2x/ebQUClS0SY10vyx2U/WzdunUWbojhQGgQw4Gsl9NjOADkpJHcpXIdtwdKnDTfwRMWe+x/SfG8sWYVCwU1jXnWNh48YcdOpv/+heNi7Nz8ueyHb39z5a9ua3OPVTjvgI0cPdC+X7Leyp9f0d3v0KGDNnb8cPvq68/cGlJ33tHFkk74LOGAz3b9cdIO7v0rdr0ybY5Nf22i7dq1wypWrGoP/2ugXVTjr7WoEhOP24svP28LPplnhw8fsooVq1jnjj2sccPL3e3zP3rHXpkxyS679Er7YME7dnG9xjZy6HjbtHm9jZv4lP3083dukNTF9S+1f3bva8WLlXSPe7DnXe459u3fY4u++MRO+k7a5U2usj49B1uB/AXcff77x2YbN/Fp++Gnby1XrlzW6JLL7aEHH7OiRYv/9dwfzrWZs6fZtu1/WNky5ezmm9rZ/7Xp4DoTTid3nNk5pbJ/kEZQ/8oaNmzoekeSkpLcmyNLliyxihUruqkGhQoVsoIFC7pVuL0T7QMHDtjKlSutQ4cOmTpw1wcaLDrJTjjKiTYiT2ZGd2Y3nWRv3hvqVgBnwxe0/Swcp28Tw4HQIoYDWSlnx3AAyEkjuc/Pk2THi/ns6AmfxcQGdHD+dYoSUseSfHbkLPoN4v9/m9+bN9fy5ctvDes1s2PHj1ru3ENs7ruz7cEuj7jbBwzuaTv+3GZPDp5g+fMVsAkvPm3bd2y1k0lmJ46Znfz/b/978960x/uNtrx5423MuME28Ime9uYrn7nbhj71mG3+fYMN6DvKShYvZV8vXWR9+3e3YQOftyaNmrtt/LH1dzfYaeq4ua4d27futPv/1cGuueomu7/zI3b06BF7ecY46/LA7TZ90vuWLz6/+U6azX77FbutbSeb/NybtnnLBhsysredV7aCdWz/gB08dMDu/9edVqlCNRs7YrqLk2qbXtNzT79q73/wpr04/Rl76P6BVqN6Hftt/Sp7dsJQ27Fjh3W/t4+Fo6Cm59u2besK1vfv39/1uKsA/fTp061r167+6Wo6oR49erR9+umntnr1auvZs6cbzdayZctgNgUAAJwFYjgAAAAABJdm6C745H27vHELl+QuXOgca3jx5bZg4bt27Pgx+/2/G23Z91/bQ90HWN1aDaxq5Ro2sN8oyxN3asmvvg8NcwlnJabb3dLJduzcZnv37bb/bt1sny6ab4/0Gm716zSy88pVsHa3dLSrr7zB3nj7pWTbuPuO7nZu2fOt4gVV7b35r1vJEmWsR7fH7ILzK1n1qhfZ4EfHum0u+nKB/zEXlK9iXTr2dNu9/NIW1uDiy+yXlT+42/7zxYeWkHDYBj0yxj2+WpWa1vehoXZRjXquhOerr0+0u27vblc3v9E975VXtLT7Ova0ue/PdK8/HAV1ZLlGnk2dOtWGDx/uVtguWbKk9e3b1/3b06NHD/eHMmDAALeYmEayTZs2zU3DBgAAoUEMBwAAAIDg+vzzz23P3l3W4sr/1ShX4njxt4ts0ZcfWd48f60HdWG12v7bixUtYWXLnH/Kts4vV8H/70KF/lpw9Nixo260tvyzd/IZvydOJFrBgoWTXXfeuRf4/7123UrbuOk3u67NJcnuc/z4Mdv8+3r/7165GE/BAoXs0OGD7t8bNq2188pd4G+PVK5Y3V327dtjf+7ablOmj7Vprz7vv12lXPQc27b/1yqUr2w5Klk+cuTIU66rU6eOzZ49O83HaGp3nz593AUAAIQGMRwAAAAAspZm7MrAof885bb3P5htt7Xp6E8gB1Lt8pS8cpmBVKjGd/Kvxz4/6jXL///riPsfE5v8MRrd7jnp81n9uo2t54OPn7LdggX+l2RPbZS7/f81QHLnSju17L0mlZu5pH6TU24vXbKshSOWsgcAAAAAAACAINq3Z7cbWX7jdbfY1BfmJrvc0PIWV8qk3Ll/rQf1y8rv/Y9THXDVF0+vihWqup979vzpRo57lw8/fsc++PivZH1qKl1Q1dUgL1WirP8xhQsVsfGTR7gR4+lxQfnKboFPb6S5rF33q/39H5e7ke3nFClmW7dvSdautb/9atNeec7Lt4cdkuUAAAAAAAAAEESffjDPlbFs/4/Ors544KVDu64WGxtr//7oLWve9Dp7dsIwW/7DNy5JPXxUP0tMPJ7u51H9cS3iOWb8E/b1kv/Y1m1bbNZbU23mm1OsXNm/kvGpubn17Xb48EEb9nQfW7dhtbsMHtHLVq/9xSXS0+Paq25yJVjU5vUb19ia3361MeOecK+xVMmydsetnW3uezNcjXJ1AHzx9Sf2zPghlidvvFsXK8fXLAcAAAAAAACArFCtSK6Iec6P33/XLrvsMrugfEU7kWItS40ov6LJ1fbJZ/Ps7RmLbNK00fbEiF6uNMrfrr/N9u/fc1bPNfjRZ2zqK8/amHGD7ODB/W4xTS0Iet21N6f5mLJlzrPnnn7Vprz8jD34cHtX5qVWzfr27Mjpds45xdL1vPHx+WzUsBfthRefsvt73u7KvDRpdKXdf19fd3u7tp0sT968LmGu+6ge+03X32qdOpxaliZcxPh84TroPX1WrFjhftau/b9C+Jm1csM2SziaGLTtAVktf3yc1awUnrWe0jL4w0TbvDfUrQDS74KiZoOvjwvb2BWJiOEAMRzIDsTw4OI9CD8JCQm2bt06q1q1quXLly/UzUEO0+zDPfbT3hPZ9nzn50m0Mf+vvfsAk6o6/wf+LixVwI4oFhCCggKCDRMLYmyxJEqiUbFjN/wiIsaOXWMXS1AsMcgfjSBq1BA10STGrrFiR6xIQBGVDvt/zjVsWEVdYHZnZ+/n8zzjzsyduXMW5+w78733nNPu01i17TpR9t/FL5NGDSLWW748GjUoi2KYu6AiXvtsXsytOrX4d1qxSVm0a1EeUz5Y8I2wnO9X3iRilbbVnxRl1qxZMWHChGjfvn00bfq/986S1i9nlgMAAACUqLKystigy4bRsNxMu8tiwYKKaFCkIJbvl0LqFFaXF+l/0byKr9pA/ScsBwAAAChhKSi/99o5MfUDad7SWLltg9jlqLo5fzL/k8JqY0ipacJyAAAAgBKXgvLJE0t6pt0icpAB+IoxOgAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMi98mI3AAAAAADguzRqEFFeVpzXnlcRMXdBcV6b2iUsBwAAAADqdFDeuVV5NGxYnLR8/vyKGD993hIH5vPmzYvbR4+IP//l7nj3/QnRuFGT+EGHzrHf3odHz+6bL1ObbhpxVfz5gTvjtt8/tEz7mTJ1coy644b41xMPx3+mfhzLt1whum64cezz80OiU8cNopDuf+DOuODSk+OR+8dnt/c+cLvYafs94uB+x0ZFRUWMe/Cu2HzTrWLFFVaOYhGWAwAAAAB1VjqjPAXl9147J6Z+ULuneK/ctkHsclTjrA1zl+B5c2bPjgMOPyzef+/DOKTfgNiwy0Yxe87suG/c6Dj+5EPi5EEXxvbb7hrF9Obbr8agU/rHOmutG8cdc3qstWa7LDy/856RcfRx+8SJA8+r0TYOu+KP0bhJk+z68y8+FedfelKMuvnBKCZhOQAAAABQ56WgfPLEilp+1aUL52/53dB47bXX4pbhd8fKy69eef+vjjw5vpzxZQz93bnxo17bRvNmy0UxzJ8/P866YFB06tglzh9ybTRs2DC7v81qbWPDLj1ilZVbx8VXnB4brN891lh9rRppwworrFR5vbb/r34bC3wCAAAAABTIvHlz4893jYk999wzVmv9v6B8of4H/l9ceNZ10aRx0+z29M+nxWVXnxU/33/b2P6nG8Uxx+8bz73wZJXn3H3f7bHvITtm208acnRM//yzKtu/+PLzuOiK02P3vX8YP+m7afz6NwfFq6+/9K1tfPq5f8XE996Kww46rjIoX9Qh+/8qyhqUxT333145hco2O3eu8piv3/fx5A/jzPMHxk9/+aPos2vX+Hm/3vG7Gy6OBQsWf8AhTcOSppNJv+uvTzwwu++XB/04e81d9vxRXHXVVVUeP2rUqNhyyy2z6W1qirAcAAAAAKBAJn/wfnz+2WfRs2fPxW5PZ213Xq9rFlKnM7zTVCgvvPRMnHrChXH9lXfEuu06ZfeNf+3F7PEPPnxvXH7N2fGLPQ+MG6++M7p26RFj/zSycn9pvu8TTz8iPpr0Xlxw5rVx7eW3RZf1u8exx+8br7/5ymLbkF6vWbPm0XHd9Re7vUmTprFh5x7x4ivPVfv3PvnMY+KLGV/EJefdECOuvy/27ntw/L87bohHH//rdz5vw84bxdmnXpFd/93lt8cOfXaPHX+8W9x9991VHjd27NjYfffdo7y85iZLEZYDAAAAABTIF9O/Out7+eWX/97HPvXso/HaGy/HaSdeFBt12yzardMxBh57RrRfp2OMGn1j9pjRd/0h+myzc+yx676x1prtY9+9Dosfbr5t5T6e/ffj8fL4f8eQky/LQvI0B/nhBx2XXU/PXZzPpn8aLZq3jLKyb180dflWK8a0aVOr9TvPnj0rC7lPGHBmFsCnqVt+sceBsdKKq8Tb77z+nc9t1KhxtGy5QuXULCmo33XnPWPixInx3HNfhfUTJkzIrqez9WuSOcsBAAAAAAqk1YpfzcU9bdq0iHW++7EpSG6xXMvsbPKFUoDdfcNN4slnH81uT3jnjdiu9y5VnrdB543izbfGZ9dff+uV7OzyvQ7Yrspj5sydk10WZ4XlV8ymbknP+7bAfPoXn0WLFq2q8ytnAfceu+0Xj/xzXLzy2gvxwYcT4+0Jr8cnn0751mlYvsu67TtF165ds7PJe/Tokf3s1q1bdOzYMWqSsBwAAAAAoEBar7FmrLjyyvHss8/GZt13+sb2d959K4b+7rw49ojfpDlUFruPBRUVUd7wv9FtWUTF1wLnRaciSduWa94irht6xzf207hR48Xuv9uGm8QfRg3L5jVPU8J83ew5s2P8qy/E9n12+9bfc/78/80dPnPWjBhwwv4xe/bs6L3VjrHT9ntE5/W6xa8G9Yul1bdv37jsssvilFNOiXvuuSf69+8fNc00LAAAAAAABdKgQYPY8ad9Y8yYMfHx5I++sT3N4/3qGy9Fm9Xaxrrt18vO8F50qpJ0tveLLz8T7dbukN3uuG7nb8wd/trrL1deb9/uB/HljC9i3ty5seYa61ReRv5xePzzscXPF77xRltkz7v+5ktj3n9D73cmvpktIpoW2PzD/7s2Zsz4In66yy+zbeXljbKfX375ReU+3v9gYuX1p575ZzY/+uUX3pwtDtpn651juebLxafTpma/z/dZ3Lntu+66axa+33TTTTFlypTsdk0TlgMAAAAAFNA+hxwe7dq1i6MG7BfjHrorPvjw3WzBzgsuPSX+8tBdccKAs6JZ0+axac8fZWH42ReeEP9+4cnsrPO0mOfb77wRP//ZAdm+9turf/zjXw9kIfv7H7wTo+8akU13stBmG2+V7WPIBQPj2eefiPc/nBhXXXdB/Pkvd1YG7l+XFhcd8ptLY+J7b8dxvzkonnrm0Wwqle16/yQuGTokO+t83736Vz5/g/W7Z9O13HTrVfHRxx/E3/7x5/jzg2Mr97fqKm2ynw/89Z6Y9PEH2QKiJ595bMybNzfmfstUMItq1my57GeaWmbGzC+z6y1btoztt98+rrnmmthuu+2iVavqTQmzLEzDAgAAAFCivmtxPqhvVm6bzvtdUITXXHJNmzWLESNGxFWX3xAjb78+O8M8hdGdOnaJyy/8fTYn+cLQ+pJzh8c1w38bp549IAuW1/vBBnHZ+Tdm85InW2zWO04bfFHcdOvVceMtV0aXzhvFXn0Pjof+9qf/7eO8G+LaGy6KIecdF7Nmz8xC7rNPGxo9N+r1rW1Mi4leP3R0jLrjxrjs6rPiP1MmZXOU/6hXn1h1ldXijrF/iFmzZsbRh52YLdg58Nghcettw+KuP/2/6LpBzzjy0EFx/iUnZftKU64cc/iJ8cc7b4nht1wRq668WrYoaetV22RTvXyfddv9IHptunWcef7AOOyg42K/fQ/O7k8LeqYpWGp6Yc+Fyiqqcx58Hfbiiy9mP9OE74XyytsfxYxZcwu2P6hpzZs2ii7rrh6lZMj9c2Pip8VuBVTfOitGDNn5q2FndbF2lSI1HNRwqA1qeGHVxL/B/IqKaCjwXWa3nDorJk8s6YinaFqvUxYHnNO02M2ok7a+/5N4/tP/zUtd09ZqPDcuafdprNp2nShr/L//J40aRHRuVR4NGxbnb8X8+RUxfvq8mLsEOf2KTcqiXYvymPLBgpg3O0rSu+9PiCef+Wf8/Kf71/prlzeJWKVtg2wqm6FDh8ZDDz2UTW/zbWbNmhUTJkyI9u3bR9OmTZe6fjmzHAAAACiaFJT3/9dn8fpn84vdlJL04zUaxendWxa7GVCjUkidwuryIh1Xm1fxVRvyZu0122eXYkhzsj/27Dtx5ZVXRr9+/b4zKC8kYTkAAABQVCkor82zV+uTTq0aFrsJUCtSWG0MaX68PP75uHrYRdG7d+848MADa+11heUAAAAAANQZe/503zj86H61/rq1c/46AAAAAADUYcJyAAAAAAByT1gOAAAAABRdRUX6T3at2E2hxFRkb55lJywHAAAAAIruk/kNY86CiJgzq9hNocTMmDEj+9moUaNl2o8FPgEAAACAopuxoEHc/2nT+Hmj/8QK6Y7GTSOiLErR/CiLWeXlMXfegpg3v9itKT0V8yJmzWpQrTPKU1A+efLkWGGFFaJhw4bL9LrCcgAAAACgThj5Scvs585zJ0fjBiWblceM8rKY06RBfDGtIhbMK3ZrSk+D8ohpM6r/Pz8F5W3atFnm1xWWAwAAAAB1QkWUxa2ftIoxn7aIlcvnR1mJhuU7rdE4zlm/ZYy9fHZ88qE52JfUSmuUxc9+3aRaj01TryzrGeULCcsBAAAAgDplZkWDeH9u6S63OK2icTRt2jTmfBEx41Nh+ZJq0aos+/erbaX7jgMAAAAAgAIRlgMAAAAAkHvCcgAAAAAAck9YDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAAAAAOSesBwAAAByYsKECdGjR48YM2ZMsZsCAHWOsBwAAAByYO7cuTFo0KCYMWNGsZsCAHWSsBwAAAByYOjQodGiRYtiNwMA6ixhOQAAANRzTz31VNx2221xwQUXFLspAFBnCcsBAACgHps+fXoMHjw4Tj311Fh99dWL3RwAqLPKC73DefPmxdVXXx1jx46NadOmRZcuXeKEE06IjTbaKNs+fvz4OPfcc+Oll16KlVZaKQ466KA44IADCt0MAGAJqeEAUD8NGTIkW9Rzt912K8j+KioqCjbveVlZWTRr1qwg+4JlNXPmzOz9jb5J/euXaR/pfV3rYfm1114bf/zjH7OhXWuttVZcf/310b9//7jvvvuiUaNGcfDBB0efPn3izDPPjH//+9/Zz+WWWy769u1b6KYAAEtADQeA+icdBH/66afjnnvuKehCoekgeiGkMC4doIe6YMKECVkwh75J/eyXjRs3rv2w/MEHH4xdd901ttxyy+z2b37zm+yLd/pSnX659GX7rLPOivLy8ujQoUNMnDgxrrvuOl+0AaDI1HAAqH9Gjx4dU6dOjd69e1e5/4wzzsgOiA8fPnyJ95k+E3Ts2LEg7avOWX5QW9q3b+/M8v/SN6lv/fLNN9+s1uMKHpavvPLK8be//S369euXzYWWFhBJqf3666+ffeHebLPNsi/ZC/Xq1SuGDRsWU6ZMiVVWWaXQzQEAqkkNB4D65+KLL45Zs2ZVuW+HHXaIAQMGxO67777UIVrz5s0L1EKoO0w7AvW3X1b3AFDBw/JTTjkl/u///i+22267aNiwYTRo0CCGDh0aa6+9dkyaNCk6depU5fGtW7fOfn700Ue+aANAEanhAFD/rLbaat96kPzbtgFAXhU8LE+ntLds2TJbICwV3nQm2qBBg2LEiBHZ0eyvzw3TpEmT7Ofs2bOLvriIxQsodaWwGIl+RqkrRD+r7sIitU0Nh+JRw6Hm1ecaDgDUwbA8nVl2/PHHx8033xybbLJJdl/Xrl2zL9/pzLSmTZvGnDlzqjxn4RfsZRnCVajFRSxeQKkrhcVI9DNKXaH6WXUWFqlNajgUlxoONa++1vCl8dprrxW7CQBQ/8Py559/PvvSm75cL6p79+7x97//PdZYY42YPHlylW0Lby/L8K9CLS7iDAFKXSksRqKfUeoK0c+qu7BIbVLDobjUcKh59bWGAwB1NCxv06ZN5VHqbt26Vd7/+uuvR7t27bIv3KNGjYr58+dnc6Emjz/+ePahJc2XtrQsLgJfMTQaSqOf1cXASQ2H4lLDoebV1xoOABROgwLuK/tyvfHGG8eJJ56YfYF+55134vLLL4/HHnssDj/88Ojbt2988cUX2QJi6Yj8mDFjsuHeRxxxRCGbAQAsITUcAACAvCvomeUNGjSIa6+9NvtyfdJJJ8Vnn30WnTp1yr5MpzPSkuHDh8e5554be+yxR6y66qoxePDg7DoAUDxqOAAAAHlX0LA8WX755eOMM87ILt925tptt91W6JcFAJaRGg4AAECeFXQaFgAAAAAAKEXCcgAAAAAAck9YDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAAAAAOSesBwAAAAAgNwTlgMAAAAAkHvCcgAAAAAAck9YDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAAAAAOSesBwAAAAAgNwTlgMAAAAAkHvCcgAAAAAAck9YDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAAAAAOSesBwAAAAAgNwTlgMAAAAAkHvCcgAAAAAAck9YDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAAAAAOSesBwAAAAAgNwTlgMAAAAAkHvCcgAAAAAAck9YDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAAAAAOSesBwAAAAAgNwTlgMAAAAAkHvCcgAAAAAAck9YDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAABQj02dOjVOOOGE6NWrV/To0SMOP/zweOutt4rdLACoc4TlAAAAUI8dc8wxMXHixLjuuuvijjvuiKZNm8ZBBx0UM2fOLHbTAKBOEZYDAABAPfXZZ59F27Zt45xzzolu3bpFhw4d4uijj47JkyfHG2+8UezmAUCdUl7sBgAAAAA1Y/nll49LLrmk8vYnn3wSN998c7Rp0yY6duxY1LYBQC7OLB87dmz85Cc/ia5du8Yuu+wS999/f+W2999/P4444ojo2bNnbLnllnH55ZfH/Pnza6IZAMASUsMBoP467bTTYosttoh77703zj333GjevHmxmwQA9fvM8rvuuitOOeWUOPnkk2OrrbbKivDAgQOzo9YbbrhhHHroodGuXbsYNWpUvPvuu9ljGzRoEAMGDCh0UwCAJaCGA0D9duCBB8bee+8dt956azaP+ciRI2ODDTZY4v1UVFTEjBkzCtKmsrKyaNasWUH2BcsqzeOf3t/om9S/fpn2kd7XtRqWpxe94oor4oADDoj99tsvu++oo46Kp59+Op588sn44IMP4sMPP4zbb789GwrWqVOnbFXu3/72t3HkkUdG48aNC9kcAKCa1HAAqP8WTruSzip//vnnY8SIEXH++ecv8X7mzp0b48ePL0ibUhjXpUuXguwLltWECRMsfPtf+ib1sV9W53treaEbn75M77bbblXuv+GGG7KfQ4YMyY5apy/ZC/Xq1Su++OKLrNB27969kM0BAKpJDQeA+inNUf7YY4/FjjvuGOXlX0UAaWRYCs7TIp9Lo1GjRgWb77w6Z/lBbWnfvr0zy/9L36S+9cs333yzWo8reFiepOFYaaj2K6+8EmuuuWZ2ZlqfPn1i0qRJ2VDuRbVu3Tr7+dFHH/miDQBFooYDQP00ZcqUbFq14cOHZ9OsLTwzPNX6VOOXNkQz3zn1kWlHoP72y+oeACpoWJ7OLktOPPHEOPbYY2PQoEExbty4OProo+Omm26KWbNmRatWrao8p0mTJtnP2bNnL/XrFmq+NPMxUepKYX41/YxSV4h+Vt250mqTGg7FpYZDzauvNfz7pKnTtt566zjnnHOySxolNmzYsJg+fXocdNBBxW4eANQpBQ3L01CsJJ2Rtscee2TXO3funB2xTl+0mzZtGnPmzKnynIVfsJflqHSh5kszHxOlrhTmV9PPKHWF6md1bY5vNRyKSw2Hmldfa3h1XHrppXHJJZfEcccdF59//nlssskm2SKfa6yxRrGbBgD1NyxfbbXVKo9cLyrNZfbwww/HZpttFq+//nqVbQvnSFv43GLOl1ZqZwhAKc6vpp9R6grRz6o7V1ptUsOhuNRwqHn1tYZXR8uWLbP1R9IFAKilsDwt/LXccstlq2qnI9ULpS/Xa6+9dmy66aYxduzYbKh3ixYtsm2PP/549pz1119/qV/XfGnwFUOjoeYVop/VxcBJDYfiUsOh5tXXGg4AFE6DAu4rG6Ldv3//uPrqq+NPf/pTvPvuu3HttdfGo48+GgcffHD8+Mc/jlVXXTV+/etfx6uvvhoPPvhgNhzskEMOKcmhbABQX6jhAAAA5F1BzyxP0kJg6Yj9ZZddFh9//HF06NAhhg4dGptvvnm2Pa3AfeaZZ8Zee+2VLSyy7777Zs8BAIpLDQcAACDPCh6WJ+kMtHRZnHXWWSduvPHGmnhZAGAZqeEAAADkVUGnYQEAAAAAgFIkLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAADUY9OmTYvTTz89tt566+jZs2fss88+8fTTTxe7WQBQ5wjLAQAAoB4bOHBgPPfcc3HppZfG6NGjo3PnznHooYfG22+/XeymAUB+wvIJEyZEjx49YsyYMZX3jR8/Pvr16xcbbbRR9OnTJ2655ZaabAIAsBTUcACoHyZOnBiPPvpoDBkyJDbZZJNo3759nHbaadG6deu45557it08AMhHWD537twYNGhQzJgxo/K+Tz/9NA4++OBYe+21s6PZxxxzTFx88cXZdQCgblDDAaD+WHHFFeO6666Lrl27Vt5XVlaWXaZPn17UtgFAXVNeUzseOnRotGjRosp9t99+ezRq1CjOOuusKC8vjw4dOmRHuVPh7tu3b001BQBYAmo4ANQfrVq1im222abKfePGjcvq+Mknn7xU+6yoqKhyUH1ZpNC+WbNmBdkXLKuZM2dm72/0Tepfv0z7SO/rooTlTz31VNx2220xduzY6N27d+X9aQGRzTbbLPuSvVCvXr1i2LBhMWXKlFhllVVqojkAQDWp4QBQvz377LNx0kknxQ477FCl1i/pKLQ0PVshpDCuS5cuBdkXFGIqwhTMoW9SP/tl48aNaz8sT8O4Bg8eHKeeemqsvvrqVbZNmjQpOnXqVOW+NE9a8tFHH/miDQBFpIYDQP324IMPZlOt9ezZM5tObWml0WYdO3YsSJuqc5Yf1JY0p78zy7+ib1Lf+uWbb75ZrccVPCxPi4akBcF22223b2ybNWvWNxL8Jk2aZD9nz5691K9ZqCFghphQ6kphyJh+RqkrRD+r7vCv2qaGQ/Go4VDz6nMNr44RI0bEueeeGzvttFNceOGF1Tq77tukf4PmzZsXtH1QF6hzUH/7ZXXrd0HD8jRkOw3T/rYVtZs2bRpz5sypct/CL9jLUmgLNQTMEBNKXSkMGdPPKHWF6mfL8gW1JqjhUFxqONS8+lrDq2PkyJFx9tlnx/777x+nnHJKyQb+AFDTChqWjx49OqZOnfqNec/OOOOMuO+++6JNmzYxefLkKtsW3l5ttdWKPgTMBwZKXSkMGdPPKHWF6GfVHf5Vm9RwKC41HGpefa3h1TlIcN5558X2228fRxxxRLbWyKIHw1u2bFnU9gFAvQ3L05xnaZj2otKiIQMGDIjdd9897rrrrhg1alTMnz8/GjZsmG1//PHHsw8tK6+88lK/riFg8BVDxqA0+lldDJzUcCguNRxqXn2t4d9n3Lhx2UiuBx54ILssao899ogLLrigaG0DgHodln/bmWXpS3Ta1rdv3xg+fHg27Kt///7xwgsvxM033xxnnnlmIZsBACwhNRwA6qcjjzwyuwAA369B1KL0hTt90U7DwNIR7KuuuioGDx6cXQcA6i41HAAAgPquoGeWL85rr71W5Xa3bt3itttuq+mXBQCWkRoOAABAntTqmeUAAAAAAFAXCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8JyAAAAAAByT1gOAAAAAEDuCcsBAAAgJ4YNGxb7779/sZsBAPkIy6dNmxann356bL311tGzZ8/YZ5994umnn67c/thjj8Wee+4Z3bt3j5122inuvffeQjcBAFgKajgA1G+33nprXH755cVuBgDkJywfOHBgPPfcc3HppZfG6NGjo3PnznHooYfG22+/HW+99VYcccQRsdVWW8WYMWPiF7/4RQwePDj78g0AFJcaDgD108cffxxHHnlkXHzxxdGuXbtiNwcA6qzyQu5s4sSJ8eijj8bIkSNj4403zu477bTT4h//+Efcc889MXXq1FhvvfXiuOOOy7Z16NAhXnnllRg+fHhsscUWhWwKALAE1HAAqL9efvnlaNSoUdx9991x9dVXxwcffFDsJgFA/T+zfMUVV4zrrrsuunbtWnlfWVlZdpk+fXo2lPvrX6h79eoVzzzzTFRUVBSyKQDAElDDAaD+6tOnTwwdOjTWWmutYjcFAPJzZnmrVq1im222qXLfuHHjsrPVTj755LjzzjujTZs2Vba3bt06Zs6cGZ9++mmstNJKhWwOAFBNajgAUF3pQPmMGTMKsq90YL5Zs2YF2Rcsq/TZ1okgX9E3qW/9Mu0jva9rNSz/umeffTZOOumk2GGHHaJ3794xa9asaNy4cZXHLLw9Z86cohdqfwgodaVQ2PUzSl0h+ll1i3QxqeFQu9RwqHl5qeG1Ye7cuTF+/PiC7Cv9XenSpUtB9gXLasKECdnfCvRN6me//Pp32loNyx988MEYNGhQ9OzZM1tEJGnSpMk3vlAvvL0sH7wLVaj9IaDUlUJh188odYXqZ9Up0sWihkPtU8Oh5uWhhteWNP95x44dC7IvBx+oS9q3b1/nD17XFn2T+tYv33zzzWo9rkbC8hEjRsS5554bO+20U1x44YWVHyZWX331mDx5cpXHptvNmzePli1bFr1Q+0NAqSuFwq6fUeoK0c+qW6SLQQ2H4lDDoebV9xpe238P0mcAqG+MoIL62y+r+1m24GH5yJEj4+yzz479998/TjnllCoN2WSTTeLJJ5+s8vjHH388O3OtQYOlX2tUoYavKOxQGv2srgZOajgUjxoONa8+13AAoDDKCz2s7bzzzovtt98+jjjiiJgyZUrltqZNm2ZfvvfYY49sSHf6+cgjj8Sf//znGD58eCGbAQAsITUcAACAvCtoWD5u3Lhs7tEHHngguywqfbG+4IIL4pprromLLroofv/738eaa66ZXd9iiy0K2QwAYAmp4QCQD6mmAwC1EJYfeeSR2eW7bL311tkFAKg71HAAAADybuknGQUAAAAAgHpCWA4AAAAAQO4JywEAAAAAyD1hOQAAAAAAuScsBwAAAAAg94TlAAAAAADknrAcAAAAAIDcE5YDAAAAAJB7wnIAAAAAAHJPWA4AAAAAQO4JywEAAAAAyD1hOQAAAAAAuScsBwAAAAAg94TlAAAAAADknrAcAAAAAIDcE5YDAAAAAJB7wnIAAAAAAHJPWA4AAAAAQO4JywEAAAAAyD1hOQAAAAAAuScsBwAAAAAg94TlAAAAAADknrAcAAAAAIDcE5YDAAAAAJB7wnIAAAAAAHJPWA4AAAAAQO4JywEAAAAAyD1hOQAAAAAAuScsBwAAAAAg94TlAAAAAADknrAcAAAAAIDcE5YDAAAAAJB7wnIAAAAAAHJPWA4AAAAAQO4JywEAAAAAyD1hOQAAAAAAuScsBwAAAAAg94TlAAAAAADknrAcAAAAAIDcE5YDAAAAAJB7wnIAAAAAAHJPWA4AAAAAQO4JywEAAAAAyD1hOQAAAAAAuScsBwAAAAAg94TlAAAAAADknrAcAAAAAIDcE5YDAAAAAJB7wnIAAAAAAHJPWA4AAAAAQO4VJSxfsGBBXHnllbHVVlvFRhttFIcddli89957xWgKAFBN6jcAlCY1HADqcFh+zTXXxMiRI+Pss8+OUaNGZYW7f//+MWfOnGI0BwCoBvUbAEqTGg4AdTQsT8X4xhtvjAEDBkTv3r1j/fXXj8suuywmTZoUf/nLX2q7OQBANajfAFCa1HAAqMNh+auvvhpffvllbLHFFpX3tWrVKrp06RJPPfVUbTcHAKgG9RsASpMaDgB1OCxPR6+T1Vdfvcr9rVu3rtwGANQt6jcAlCY1HACqrzxq2cyZM7OfjRs3rnJ/kyZN4rPPPlvi/c2dOzcqKirihRdeKEj7ysrKYt78BdGgoqIg+4PaMHtOWbz44pSsL5SC1M92WDVi3srFbglUX3mDiBdfjIL0s1S7Uj8oJYWu34kaDmo41AY1vG5/B0/Sv+mZqy6Iuf62LJVmDVMtKYv1dq6IH8wvdmtKU4OG6e9EWcnU49qiby4bfbNu9cvq1vBaD8ubNm1aOW/awuvJ7Nmzo1mzZku8v4W/ZCE/sJQ3LMq6p7DMSumDe8smxW4BFK+fpX2UUn+tifqdqOHwP6X0N0ENp1Sp4XX3O3iyShM1fFk1b1Va7826qNT6d23QN5edvlk3+mV1a3ith+ULh35Nnjw51l577cr70+311ltviffXo0ePgrYPAKj5+p2o4QBQ83wHB4Dqq/XDQ2nl7RYtWsQTTzxRed/06dPjlVdeiU033bS2mwMAVIP6DQClSQ0HgOqr9TPL0zxp/fr1i4svvjhWWmmlaNu2bVx00UXRpk2b2GGHHWq7OQBANajfAFCa1HAAqMNheTJgwICYN29enHrqqTFr1qzsaPYNN9wQjRo1KkZzAIBqUL8BoDSp4QBQPWUVlvoFAAAAACDnLGkLAAAAAEDuCcsBAAAAAMg9YTkAAAAAALknLAcAAAAAIPeE5QAAAAAA5J6wHAAAAACA3BOWAwAAAACQe8LynFtvvfVizJgxRW3D/vvvH7/5zW+y60888UTWpvfff7+obYJi+vDDD+Pee++tvN2nT58YOnToUu9vWZ//fdLfkNRvgdqlhkPdo4YDC6nTUH+o7/lSXuwGwKJ69OgR//znP2OllVYqdlOgaE488cRo27Zt7LLLLtntO+64I5o0abLU+1vW5wNUhxoOajhQd6nTsPTU93wRllOnNG7cOFZdddViNwPqlGX9QOsDMVAb1HD4JjUcqCvUaSgc9b1+Mw0L8fbbb8cvf/nL2HDDDWPnnXeO+++/v3LbggULYtiwYbHjjjtm23v27Bn9+/ePd999t/IxjzzySOy5557RvXv32GKLLbJhXp999lnl9rfeeisOO+yw7Ej2lltuGccff3z85z//WWxbvj40LA1NueGGG+JXv/pV9vzNN988zjnnnJg3b17lc5599tnYb7/9olu3btG7d+8488wz44svvqihfy34ftOmTcveh9tss032vkz9K723kzTUap999omrr746ez9vsskmcdJJJ1W+Z9NQySeffDLuvPPO7P3/9SFa6edBBx0UV111Vfzwhz/M+sXpp58eH330URxxxBFZP9x+++3j4YcfrmzPos9P/Wtxl7S/ZM6cOXHRRRfFVlttle17r732ys5AWdQDDzwQu+22W3Tt2jX23XffbEgaUBxqOBSWGg4UkjoNdYP6zpIQlhO///3v42c/+1ncc889WaE+7rjj4qWXXsq23XLLLVkBTUV53Lhx2R+Pd955Jy644IJs+yeffBLHHnts9O3bN+67776sMz/11FPx29/+Ntv+8ccfZx11nXXWyYaZ/O53v8v+4Oy9994xY8aMarXviiuuiE033TTuvvvuGDx4cIwYMSL+9Kc/ZdteffXVOPjgg7M/Gmn7xRdfHC+//HIccsghUVFRUWP/ZvBt5s+fn73/nn766aygpbnCOnXqFIceemi88MIL2WNefPHFrLjdeOONWZ9KfebXv/51ti0VzFQA04fp1GcWJ+17woQJceutt8app54at912W/z85z/PnpNer0OHDlmfXVwfSK+76OUnP/lJtG7dOn7xi19k29OHgkcffTTrS+nDQNrnkUceWVnY0wfm9IE6/a1IfW6PPfaI6667rgb/RYHvooZD4ajhQKGp01B86jtLrIJc69SpU8V5551X5b6999674vjjj8+uP/TQQxV//etfq2y/6KKLKrbbbrvs+iuvvJLtY9HHvP766xXjx4/Prl922WUVu+++e5Xnz5gxo6Jbt24Vo0ePzm7369ev4sQTT8yuP/7449n+3nvvvez2tttuW3HUUUdVef5Pf/rTitNOOy27PmjQoG9sf/fdd7N9pH1BbXv44Yez999rr71Wed+CBQsqfvazn1UMGDCg4sorr6zYcMMNKyZNmlS5/ZFHHsme89Zbb32jTyzsB+l5SfrZuXPnis8//7xy++abb14xcODAb7Th448//sbzF3XTTTdVdO/eveKll17Kbr/zzjvZ81K/XtTgwYOzNiXHHXdcxT777FNl+znnnJM9D6hdajgUlhoOFJI6DXWD+s6SMmc5sfHGG1e5nYaIPP7445VDQ55//vnsiHM6SpYub775Zqy22mrZ9s6dO8euu+6aHdVK85/96Ec/yoZnpSEmySuvvBJvvPFGdhRuUbNnz86GjFVHOgK3qJYtW8bcuXMr9z9x4sRv7D9J+09DaKA2vf7669l7NB2pXqisrCwbypWOEnfs2DHatWtX2YeSNORy4XPXXXfd732NlVdeOVq0aFF5u3nz5rH22mtX3m7atGnlcK1v89e//jU7qn7ZZZfFBhtsUNmfknSGyqJSf2vVqlVlG1M/X1Tqf+nMGKD2qeFQOGo4UGjqNBSf+s6SEpYTDRo0+MYQlbT4R5KGbqQhKGkYR5ojLc3D9NBDD8W9995b+fhLLrkkjjnmmPj73/8e//rXv+KEE07IPhSkIWdpHrZevXrFGWec8Y3XTX+sqmNhWxa1cOhK2n+alyl9gPg6CyZQDN82JDHdX17+1Z/cRo0afaPPJQ0bNqzWa3z9+Yvrx99l/Pjx2XyGAwYMiB122OEbbU9Dx5ZbbrnF7j99qEj97vvaA9QONRwKRw0HCk2dhuJT31lS5iwnm3dsUWm+ox/84AfZ9TTvWSrOQ4YMyeY+22ijjbJ51BZ22HQk/LzzzsuOtKXingp+up2Olk+dOjXbTzrqvPrqq2dzqaXL8ssvnz0mHf1aVmn/6ej7wn2nS1qQ5Pzzz88WU4Dalhbi+Pzzz6u8v1N/eeaZZ7Ij1kk6ayQ9ZqHnnnsu+9mlS5cab1+a2zAtMpIKcPq5qIX9Pi0KtGifSnOspUuy/vrrV7Z3oYXzLgK1Tw2HwlHDgUJTp6H41HeWlLCcuPnmm7NFAtJK3QsLa1pRO0mFNy0kkIpk2p6Gg/zlL3+pHDqShpmMHDkyGyqShmil56bFR9IQlhVXXDEbKpL+4AwaNChbICRd0qImafGERYfALK20SEMalpJWNU4fFNIfiHQ0Ln3ISG2A2pZWoU9DJtP7MK2Ynd6XZ511VtY3DjzwwOwxacGdtIBOui+dIZK2p0U82rZtm21PR4w/+OCDmDRpUkHbll43nRmyxhprZO2bMmVKVnTTJa0Ongrxtttum52dkoaAvffee3H99dfHsGHDKoeQpT6X+vGFF16YfaBIC4ikhYCA4lDDoXDUcKDQ1GkoPvWdJSUsJ44++uj4wx/+ELvvvnv2hyMdsW7fvn22La20PWvWrGwF7n79+mV/OFKxTEeyP/zww2yOs7QycDq6nVb53meffbJhKqnzpiEha621VtZJv/zyy2xb2kcaDpLmTirE0K109H348OHZkJU0fO2oo47K2p4+lCxuSBnUtPT+TytopyPQC1evT3MJpvdker8u/GCcivV+++0XAwcOjO22265y1fvkl7/8ZdbXUp9cOPyrENIH5/SBN33QTavap3nN0geHdEmrZyfpQ3o6on366adnHw7Gjh0b5557bta/ktTu1L+feOKJrH3p91rc0EygdqjhUDhqOFBo6jQUn/rOkipLq3wu8bMAWCrpA286uyQdFQYASocaDgD1j/rO1zmzHAAAAACA3BOWAwAAAACQe6ZhAQAAAAAg95xZDgAAAABA7gnLAQAAAADIPWE5AAAAAAC5JywHAAAAACD3hOUAAAAAAOSesBwAAAAAgNwTlgMAAAAAkHvCcgAAAAAAck9YDgAAAABA5N3/B9wfDPCFVFJTAAAAAElFTkSuQmCC", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/markdown": [ + "### Prompt Optimization Results - Coding Tasks\n", + "\n", + "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", + "|----------------------------|---------:|----------:|---------------:|\n", + "| Avg Time (s) | 7.906 | 6.977 | -0.929 |\n", + "| Peak Memory (KB) | 3626.3 | 577.5 | -3048.8 |\n", + "| Exact (%) | 100.0 | 100.0 | 0.0 |\n", + "| Sorted (%) | 100.0 | 100.0 | 0.0 |\n", + "| LLM Adherence (1–5) | 4.40 | 4.90 | +0.50 |\n", + "| Code Quality (1–5) | 4.73 | 4.90 | +0.16 |" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from pathlib import Path\n", + "import importlib\n", + "import scripts.results_summarizer as rs\n", + "from IPython.display import Markdown, display\n", + "\n", + "importlib.reload(rs)\n", + "\n", + "fig = rs.render_charts(\n", + " quant_baseline=Path(\"results_topk_baseline\")/\"run_results_topk_baseline.csv\",\n", + " quant_optimized=Path(\"results_topk_optimized\")/\"run_results_topk_optimized.csv\",\n", + " judge_baseline=Path(\"results_llm_as_judge_baseline\")/\"judgement_summary.csv\",\n", + " judge_optimized=Path(\"results_llm_as_judge_optimized\")/\"judgement_summary.csv\",\n", + " auto_display=True,\n", + " close_after=True,\n", + ")\n", + "md = rs.build_markdown_summary(\n", + " quant_baseline=Path(\"results_topk_baseline\")/\"run_results_topk_baseline.csv\",\n", + " quant_optimized=Path(\"results_topk_optimized\")/\"run_results_topk_optimized.csv\",\n", + " judge_baseline=Path(\"results_llm_as_judge_baseline\")/\"judgement_summary.csv\",\n", + " judge_optimized=Path(\"results_llm_as_judge_optimized\")/\"judgement_summary.csv\",\n", + ")\n", + "\n", + "display(Markdown(md))" + ] + }, + { + "cell_type": "markdown", + "id": "7d076297", + "metadata": {}, + "source": [ + "Even though GPT-5 already produced correct code, prompt optimization tightened constraints and clarified any ambiguity. Showing overall improvements to the results!\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c3dec50a", + "metadata": {}, + "source": [ + "--------------------------------------------------------------------" + ] + }, + { + "cell_type": "markdown", + "id": "f1e0f019", + "metadata": {}, + "source": [ + "### Context and Retrieval: Simulating a Financial Question Answering\n", + "\n", + "Most production use cases face imperfect queries and noisy context. **FailSafeQA** is an excellent benchmark that deliberately perturbs both the **query** (misspellings, incompleteness, off-domain phrasing) and the **context** (missing, OCR-corrupted, or irrelevant docs) and reports **Robustness**, **Context Grounding**, and **Compliance**—i.e., can the model answer when the signal exists and abstain when it doesn’t.\n", + "\n", + "![FailSafeQA diagram](../../../images/image_optimize_4.png)\n", + "\n", + "**Links**\n", + "- Paper (arXiv): *Expect the Unexpected: FailSafe Long Context QA for Finance* — https://arxiv.org/abs/2502.06329 \n", + "- Dataset (Hugging Face): https://huggingface.co/datasets/Writer/FailSafeQA \n", + "- Authors/Makers: Kiran Kamble, Melisa Russak, Dmytro Mozolevskyi, Muayad Ali, Mateusz Russak, Waseem AlShikh (Writer.ai) — see author list on the arXiv page above\n" + ] + }, + { + "cell_type": "markdown", + "id": "433925a6", + "metadata": {}, + "source": [ + "We will run FailSafeQA evaluations via the helper script and compare Baseline vs Optimized prompts side by side." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5849f77", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "774410c9", + "metadata": {}, + "outputs": [], + "source": [ + "# Define the Baseline FailSafeQA system prompt here for reuse\n", + "baseline_prompt_fsqa = (\n", + " \"You are a finance QA assistant. Answer ONLY using the provided context.\\n\"\n", + " \"If the context is missing or irrelevant, politely refuse and state that you need the relevant document.\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0a817cd8", + "metadata": {}, + "source": [ + "We can use the prompt optimizer once again to construct a new prompt that is more suitable for this use case. Drawing on best practices for long-context question answering, we know that we should remind our answer model to rely on information in the context section and refuse answers to questions if the context is insufficient. By using the Optimize button once without any arguments we get a reasonable structure for the prompt and end up with this as our optimized prompt.\n", + "\n", + "\n", + "![optimize_image](../../../images/image_optimize_5.png)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "aede3e7d", + "metadata": {}, + "outputs": [], + "source": [ + "optimized_fsqa_prompt = \"\"\"You are a finance document QA assistant.\n", + "\n", + "Behavioral priorities (in order):\n", + "1) Grounding: Use ONLY the text inside [Context]. Do NOT use outside knowledge or assumptions.\n", + "2) Evidence check: Before answering, verify that the answer text (numbers, entities, dates, phrasing) is explicitly present or directly entailed by [Context]. If not, refuse (see Refusal policy).\n", + "3) Robustness to query noise: The user question may contain misspellings, missing words, or non-financial phrasing. Infer intent using the context and answer if the meaning is clear and supported by the context.\n", + "4) OCR noise handling: The context may include OCR artifacts (repeated characters, stray symbols, broken words). Ignore junk characters and reconstruct meaning when the underlying sentence is still recoverable. Do not guess beyond what the context supports.\n", + "\n", + "Refusal policy:\n", + "- If [Context] is empty or lacks the information to answer, reply with a brief refusal and guidance. Do NOT attempt a general-knowledge answer.\n", + "- If the question is unrelated to the content of [Context] (out of scope), reply with a brief refusal and guidance. Do NOT speculate.\n", + "- If the question is incomplete but the correct answer is unambiguous from [Context], infer the intent and answer exactly; do NOT refuse.\n", + "\n", + "Answer style:\n", + "- Default to the **shortest exact answer** needed to satisfy the question (e.g., the precise number/string/date as written). Preserve units, signs, casing, currency symbols, commas, and parentheses from the context. Do NOT round numbers unless asked.\n", + "- If the user explicitly asks to “write”, “draft”, or “generate” content, you may produce multi-sentence or formatted text—but still source every factual claim strictly from [Context].\n", + "- If the question is ambiguous, state the needed clarification in one short sentence, then provide the best supported answer if possible.\n", + "\n", + "Output format:\n", + "- If answerable from the context:\n", + " FINAL: \n", + " (optional) EVIDENCE: \"\"\n", + "- If refusing:\n", + " FINAL: Insufficient information in the provided context to answer this question. Please upload the relevant document or refine your question to include the necessary details.\"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "2516f981", + "metadata": {}, + "source": [ + "Let's now run our evaluations, for demonstration we will display the results of a single comparison, but you can also run the full evaluation. Note: This will take time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2881639a", + "metadata": {}, + "outputs": [], + "source": [ + "import importlib\n", + "import run_FailSafeQA\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "from openai import OpenAI\n", + "\n", + "# Ensure latest function signature is used after code edits\n", + "importlib.reload(run_FailSafeQA)\n", + "run_failsafeqa = run_FailSafeQA.run_failsafeqa\n", + "\n", + "# Set idx to an integer for a quick single-example comparison; set to None for full run\n", + "idx = 0 # e.g., 0 for a single datapoint\n", + "\n", + "#Helper functions:\n", + "class OpenAIAnswer:\n", + " def __init__(self):\n", + " self.client = OpenAI()\n", + "\n", + " def __call__(self, system_prompt: str, user_prompt: str, model: str) -> str:\n", + " resp = self.client.responses.create(\n", + " model=model,\n", + " input=[\n", + " {\"role\": \"developer\", \"content\": [{\"type\": \"input_text\", \"text\": system_prompt}]},\n", + " {\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": user_prompt}]},\n", + " ],\n", + " text={\"format\": {\"type\": \"text\"}, \"verbosity\": \"medium\"},\n", + " reasoning={\"effort\": \"medium\", \"summary\": \"auto\"},\n", + " tools=[],\n", + " )\n", + " return resp.output_text\n", + "class OpenAIJudge:\n", + " def __init__(self):\n", + " self.client = OpenAI()\n", + "\n", + " def __call__(self, prompt: str, model: str) -> str:\n", + " resp = self.client.responses.create(\n", + " model=model,\n", + " input=[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": prompt}]}],\n", + " text={\"format\": {\"type\": \"text\"}, \"verbosity\": \"medium\"},\n", + " reasoning={\"effort\": \"medium\", \"summary\": \"auto\"},\n", + " tools=[],\n", + " )\n", + " return resp.output_text\n", + "\n", + "if idx is not None:\n", + " # Single example mode (with detailed prompt/response logging)\n", + " run_failsafeqa(\n", + " out=\"results_failsafeqa_baseline.csv\",\n", + " system_prompt=baseline_prompt_fsqa,\n", + " indices=[idx],\n", + " log_prompts=True,\n", + " log_chars=800,\n", + " log_file=\"failsafeqa_debug.log\",\n", + " )\n", + " run_failsafeqa(\n", + " out=\"results_failsafeqa_optimized.csv\",\n", + " system_prompt=optimized_fsqa_prompt,\n", + " indices=[idx],\n", + " log_prompts=True,\n", + " log_chars=800,\n", + " log_file=\"failsafeqa_debug.log\",\n", + " )\n", + "\n", + " base_df = pd.read_csv(\"results_failsafeqa_baseline.csv\")\n", + " opt_df = pd.read_csv(\"results_failsafeqa_optimized.csv\")\n", + "\n", + " b_one = base_df[base_df[\"idx\"] == idx]\n", + " o_one = opt_df[opt_df[\"idx\"] == idx]\n", + "\n", + " comparison_df = pd.concat([b_one, o_one], ignore_index=True)\n", + "\n", + " # Keep only relevant columns\n", + " comparison_df = comparison_df[[\"run\", \"kind\", \"rating\", \"compliance\"]]\n", + "\n", + " # Display as table\n", + " display(comparison_df)\n", + "\n", + "else:\n", + " # Full run mode\n", + " run_failsafeqa(out=\"results_failsafeqa_baseline.csv\", system_prompt=baseline_prompt_fsqa)\n", + " run_failsafeqa(out=\"results_failsafeqa_optimized.csv\", system_prompt=optimized_fsqa_prompt)\n", + "\n", + " base_df = pd.read_csv(\"results_failsafeqa_baseline.csv\")\n", + " opt_df = pd.read_csv(\"results_failsafeqa_optimized.csv\")\n", + "\n", + " def per_kind_summary(df: pd.DataFrame) -> pd.DataFrame:\n", + " out = df.groupby(\"kind\").agg(\n", + " mean_rating=(\"rating\", lambda x: pd.to_numeric(x, errors=\"coerce\").mean()),\n", + " compliance_rate=(\"compliance\", lambda x: pd.to_numeric(x, errors=\"coerce\").fillna(0).mean()),\n", + " count=(\"rating\", \"count\"),\n", + " )\n", + " return out.round(3)\n", + "\n", + " base_summary = per_kind_summary(base_df)\n", + " opt_summary = per_kind_summary(opt_df)\n", + "\n", + " summary = base_summary.join(opt_summary, lsuffix=\"_base\", rsuffix=\"_opt\").fillna(\"NA\")\n", + "\n", + " print(\"Per-kind comparison (baseline vs optimized):\")\n", + " display(summary)\n", + "\n", + " # Plot compliance rate comparison per kind\n", + " kinds = summary.index.tolist()\n", + " x = range(len(kinds))\n", + " base_vals = summary[\"compliance_rate_base\"].astype(float).tolist()\n", + " opt_vals = summary[\"compliance_rate_opt\"].astype(float).tolist()\n", + "\n", + " fig, ax = plt.subplots(figsize=(10, 4))\n", + " width = 0.35\n", + " ax.bar([i - width/2 for i in x], base_vals, width=width, label=\"Baseline\", color=\"#cbd5e1\")\n", + " ax.bar([i + width/2 for i in x], opt_vals, width=width, label=\"Optimized\", color=\"#60a5fa\")\n", + " ax.set_xticks(list(x))\n", + " ax.set_xticklabels(kinds, rotation=45, ha=\"right\")\n", + " ax.set_ylim(0, 1)\n", + " ax.set_ylabel(\"Compliance rate\")\n", + " ax.set_title(\"FailSafeQA — Per-kind Compliance (Baseline vs Optimized)\")\n", + " ax.legend()\n", + " plt.tight_layout()\n", + " plt.show()\n", + "\n", + " # Overall metrics\n", + " def overall(df: pd.DataFrame):\n", + " return {\n", + " \"mean_rating\": float(pd.to_numeric(df[\"rating\"], errors=\"coerce\").mean()),\n", + " \"mean_compliance\": float(pd.to_numeric(df[\"compliance\"], errors=\"coerce\").fillna(0).mean()),\n", + " }\n", + "\n", + " print(\"Overall — Baseline:\", overall(base_df))\n", + " print(\"Overall — Optimized:\", overall(opt_df))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c20097e6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/markdown": [ + "## FailSafeQA — Summary\n", + "\n", + "**Compliance threshold:** ≥ 6\n", + "\n", + "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", + "|---|---:|---:|---:|\n", + "| Robustness (avg across datapoints) | 0.320 | 0.540 | +0.220 |\n", + "| Context Grounding (avg across datapoints) | 0.800 | 0.950 | +0.150 |\n", + "\n", + "_Source files:_ `results_failsafeqa.csv` · `results_failsafeqa.csv`" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from IPython.display import Markdown, display\n", + "\n", + "def build_markdown_summary_from_metrics(\n", + " robust_base: float, ground_base: float,\n", + " robust_opt: float, ground_opt: float,\n", + " threshold: int = 6,\n", + " src_base: str = \"results_failsafeqa.csv\",\n", + " src_opt: str = \"results_failsafeqa.csv\",\n", + ") -> str:\n", + " d_r = robust_opt - robust_base\n", + " d_g = ground_opt - ground_base\n", + " return f\"\"\"\n", + "## FailSafeQA — Summary\n", + "\n", + "**Compliance threshold:** ≥ {threshold}\n", + "\n", + "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n", + "|---|---:|---:|---:|\n", + "| Robustness (avg across datapoints) | {robust_base:.3f} | {robust_opt:.3f} | {d_r:+.3f} |\n", + "| Context Grounding (avg across datapoints) | {ground_base:.3f} | {ground_opt:.3f} | {d_g:+.3f} |\n", + "\n", + "_Source files:_ `{src_base}` · `{src_opt}`\n", + "\"\"\".strip()\n", + "\n", + "# Fill in with your reported numbers\n", + "md = build_markdown_summary_from_metrics(\n", + " robust_base=0.320, ground_base=0.800,\n", + " robust_opt=0.540, ground_opt=0.950,\n", + " threshold=6,\n", + " src_base=\"results_failsafeqa.csv\",\n", + " src_opt=\"results_failsafeqa.csv\",\n", + ")\n", + "\n", + "display(Markdown(md))" + ] + }, + { + "cell_type": "markdown", + "id": "0a84939c", + "metadata": {}, + "source": [ + "GPT-5-mini crushes this task, so even the baseline prompt gets scores of >= 4 almost all of the time. However if we compare the percent of perfect scores (6/6) for the judge, we see that the optimize prompt has way significantly more perfect answers when evaluated in the two categories of FailSafeQA answer quality: robustness and context grounding." + ] + }, + { + "cell_type": "markdown", + "id": "ebd5453b", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "We’re excited for everyone to try **Prompt Optimization for GPT-5** in the OpenAI Playground. GPT-5 brings state-of-the-art intelligence, and a strong prompt helps it reason more reliably, follow constraints, and produce cleaner, higher quality results.\n", + "\n", + "\n", + "Give the [Prompt Optimizer](https://platform.openai.com/chat/edit?optimize=true) a try on your task today!" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/gpt-5/prompt-optimization-cookbook/requirements.txt b/examples/gpt-5/prompt-optimization-cookbook/requirements.txt new file mode 100644 index 0000000000..e7d8f9dccb --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/requirements.txt @@ -0,0 +1,4 @@ +openai +matplotlib +seaborn +datasets \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_failsafeqa_baseline.csv b/examples/gpt-5/prompt-optimization-cookbook/results_failsafeqa_baseline.csv new file mode 100644 index 0000000000..a399a32644 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_failsafeqa_baseline.csv @@ -0,0 +1,1541 @@ +idx,kind,rating,compliance,answer_model,judge_model +0,missing_context,6,1,gpt-5,gpt-5 +2,missing_context,6,1,gpt-5,gpt-5 +0,baseline,5,0,gpt-5,gpt-5 +2,incomplete,5,0,gpt-5,gpt-5 +0,ocr,5,0,gpt-5,gpt-5 +1,baseline,5,0,gpt-5,gpt-5 +1,incomplete,5,0,gpt-5,gpt-5 +0,misspelled,5,0,gpt-5,gpt-5 +1,out_of_domain,5,0,gpt-5,gpt-5 +0,out_of_domain,5,0,gpt-5,gpt-5 +1,misspelled,5,0,gpt-5,gpt-5 +0,incomplete,5,0,gpt-5,gpt-5 +1,missing_context,6,1,gpt-5,gpt-5 +2,misspelled,5,0,gpt-5,gpt-5 +1,out_of_scope,6,1,gpt-5,gpt-5 +0,out_of_scope,6,1,gpt-5,gpt-5 +2,baseline,5,0,gpt-5,gpt-5 +2,out_of_domain,5,0,gpt-5,gpt-5 +2,ocr,5,0,gpt-5,gpt-5 +1,ocr,5,0,gpt-5,gpt-5 +3,missing_context,6,1,gpt-5,gpt-5 +3,incomplete,6,1,gpt-5,gpt-5 +4,missing_context,6,1,gpt-5,gpt-5 +3,misspelled,6,1,gpt-5,gpt-5 +5,incomplete,5,0,gpt-5,gpt-5 +2,out_of_scope,5,0,gpt-5,gpt-5 +5,missing_context,6,1,gpt-5,gpt-5 +5,out_of_domain,5,0,gpt-5,gpt-5 +4,out_of_domain,1,0,gpt-5,gpt-5 +5,baseline,5,0,gpt-5,gpt-5 +3,out_of_domain,6,1,gpt-5,gpt-5 +4,incomplete,6,1,gpt-5,gpt-5 +3,out_of_scope,6,1,gpt-5,gpt-5 +3,ocr,6,1,gpt-5,gpt-5 +4,out_of_scope,6,1,gpt-5,gpt-5 +6,missing_context,6,1,gpt-5,gpt-5 +6,baseline,5,0,gpt-5,gpt-5 +6,incomplete,6,1,gpt-5,gpt-5 +3,baseline,6,1,gpt-5,gpt-5 +4,misspelled,3,0,gpt-5,gpt-5 +4,baseline,3,0,gpt-5,gpt-5 +6,misspelled,6,1,gpt-5,gpt-5 +6,ocr,5,0,gpt-5,gpt-5 +4,ocr,6,1,gpt-5,gpt-5 +7,missing_context,6,1,gpt-5,gpt-5 +6,out_of_domain,6,1,gpt-5,gpt-5 +5,out_of_scope,6,1,gpt-5,gpt-5 +8,baseline,5,0,gpt-5,gpt-5 +6,out_of_scope,6,1,gpt-5,gpt-5 +7,baseline,6,1,gpt-5,gpt-5 +7,out_of_scope,6,1,gpt-5,gpt-5 +8,incomplete,3,0,gpt-5,gpt-5 +8,out_of_scope,6,1,gpt-5,gpt-5 +8,ocr,6,1,gpt-5,gpt-5 +8,missing_context,6,1,gpt-5,gpt-5 +5,ocr,5,0,gpt-5,gpt-5 +9,out_of_domain,5,0,gpt-5,gpt-5 +7,out_of_domain,6,1,gpt-5,gpt-5 +8,out_of_domain,1,0,gpt-5,gpt-5 +9,misspelled,5,0,gpt-5,gpt-5 +9,incomplete,5,0,gpt-5,gpt-5 +8,misspelled,5,0,gpt-5,gpt-5 +9,missing_context,6,1,gpt-5,gpt-5 +7,misspelled,6,1,gpt-5,gpt-5 +9,baseline,5,0,gpt-5,gpt-5 +7,incomplete,6,1,gpt-5,gpt-5 +10,incomplete,5,0,gpt-5,gpt-5 +10,missing_context,6,1,gpt-5,gpt-5 +10,misspelled,5,0,gpt-5,gpt-5 +9,ocr,5,0,gpt-5,gpt-5 +10,ocr,5,0,gpt-5,gpt-5 +10,baseline,5,0,gpt-5,gpt-5 +7,ocr,6,1,gpt-5,gpt-5 +10,out_of_domain,5,0,gpt-5,gpt-5 +9,out_of_scope,6,1,gpt-5,gpt-5 +10,out_of_scope,6,1,gpt-5,gpt-5 +12,missing_context,6,1,gpt-5,gpt-5 +11,missing_context,6,1,gpt-5,gpt-5 +11,incomplete,6,1,gpt-5,gpt-5 +13,baseline,5,0,gpt-5,gpt-5 +13,out_of_domain,5,0,gpt-5,gpt-5 +13,misspelled,5,0,gpt-5,gpt-5 +13,ocr,5,0,gpt-5,gpt-5 +13,incomplete,5,0,gpt-5,gpt-5 +11,ocr,5,0,gpt-5,gpt-5 +13,missing_context,6,1,gpt-5,gpt-5 +11,misspelled,6,1,gpt-5,gpt-5 +12,baseline,4,0,gpt-5,gpt-5 +11,out_of_domain,6,1,gpt-5,gpt-5 +13,out_of_scope,6,1,gpt-5,gpt-5 +12,misspelled,4,0,gpt-5,gpt-5 +11,baseline,6,1,gpt-5,gpt-5 +12,incomplete,6,1,gpt-5,gpt-5 +14,baseline,5,0,gpt-5,gpt-5 +14,missing_context,6,1,gpt-5,gpt-5 +12,out_of_domain,6,1,gpt-5,gpt-5 +12,ocr,6,1,gpt-5,gpt-5 +14,incomplete,5,0,gpt-5,gpt-5 +11,out_of_scope,5,0,gpt-5,gpt-5 +15,baseline,6,1,gpt-5,gpt-5 +14,ocr,5,0,gpt-5,gpt-5 +15,missing_context,6,1,gpt-5,gpt-5 +14,misspelled,5,0,gpt-5,gpt-5 +15,incomplete,6,1,gpt-5,gpt-5 +16,missing_context,6,1,gpt-5,gpt-5 +14,out_of_domain,6,1,gpt-5,gpt-5 +15,misspelled,6,1,gpt-5,gpt-5 +12,out_of_scope,1,0,gpt-5,gpt-5 +14,out_of_scope,5,0,gpt-5,gpt-5 +15,ocr,6,1,gpt-5,gpt-5 +16,misspelled,5,0,gpt-5,gpt-5 +15,out_of_domain,6,1,gpt-5,gpt-5 +16,baseline,5,0,gpt-5,gpt-5 +16,out_of_domain,5,0,gpt-5,gpt-5 +17,missing_context,6,1,gpt-5,gpt-5 +15,out_of_scope,6,1,gpt-5,gpt-5 +17,misspelled,4,0,gpt-5,gpt-5 +16,ocr,5,0,gpt-5,gpt-5 +16,incomplete,6,1,gpt-5,gpt-5 +19,baseline,5,0,gpt-5,gpt-5 +17,out_of_domain,5,0,gpt-5,gpt-5 +17,out_of_scope,6,1,gpt-5,gpt-5 +18,missing_context,6,1,gpt-5,gpt-5 +16,out_of_scope,6,1,gpt-5,gpt-5 +17,incomplete,4,0,gpt-5,gpt-5 +19,missing_context,6,1,gpt-5,gpt-5 +18,out_of_scope,6,1,gpt-5,gpt-5 +19,incomplete,5,0,gpt-5,gpt-5 +19,misspelled,5,0,gpt-5,gpt-5 +17,ocr,4,0,gpt-5,gpt-5 +19,ocr,5,0,gpt-5,gpt-5 +19,out_of_domain,5,0,gpt-5,gpt-5 +20,incomplete,5,0,gpt-5,gpt-5 +17,baseline,4,0,gpt-5,gpt-5 +20,out_of_domain,5,0,gpt-5,gpt-5 +20,missing_context,6,1,gpt-5,gpt-5 +20,ocr,5,0,gpt-5,gpt-5 +18,incomplete,6,1,gpt-5,gpt-5 +21,baseline,5,0,gpt-5,gpt-5 +18,baseline,4,0,gpt-5,gpt-5 +20,misspelled,5,0,gpt-5,gpt-5 +21,out_of_domain,5,0,gpt-5,gpt-5 +21,misspelled,5,0,gpt-5,gpt-5 +20,baseline,5,0,gpt-5,gpt-5 +21,missing_context,6,1,gpt-5,gpt-5 +18,misspelled,4,0,gpt-5,gpt-5 +18,ocr,6,1,gpt-5,gpt-5 +21,ocr,5,0,gpt-5,gpt-5 +21,incomplete,5,0,gpt-5,gpt-5 +22,misspelled,6,1,gpt-5,gpt-5 +18,out_of_domain,4,0,gpt-5,gpt-5 +23,missing_context,6,1,gpt-5,gpt-5 +23,incomplete,5,0,gpt-5,gpt-5 +23,out_of_domain,5,0,gpt-5,gpt-5 +21,out_of_scope,6,1,gpt-5,gpt-5 +20,out_of_scope,4,0,gpt-5,gpt-5 +22,baseline,6,1,gpt-5,gpt-5 +23,ocr,5,0,gpt-5,gpt-5 +23,baseline,5,0,gpt-5,gpt-5 +22,incomplete,6,1,gpt-5,gpt-5 +23,misspelled,5,0,gpt-5,gpt-5 +22,out_of_domain,6,1,gpt-5,gpt-5 +22,out_of_scope,6,1,gpt-5,gpt-5 +24,missing_context,6,1,gpt-5,gpt-5 +22,ocr,6,1,gpt-5,gpt-5 +25,misspelled,5,0,gpt-5,gpt-5 +24,misspelled,6,1,gpt-5,gpt-5 +25,missing_context,6,1,gpt-5,gpt-5 +24,out_of_scope,6,1,gpt-5,gpt-5 +25,out_of_scope,6,1,gpt-5,gpt-5 +25,baseline,5,0,gpt-5,gpt-5 +24,out_of_domain,6,1,gpt-5,gpt-5 +22,missing_context,6,1,gpt-5,gpt-5 +25,out_of_domain,5,0,gpt-5,gpt-5 +23,out_of_scope,6,1,gpt-5,gpt-5 +24,baseline,6,1,gpt-5,gpt-5 +25,incomplete,5,0,gpt-5,gpt-5 +24,incomplete,5,0,gpt-5,gpt-5 +24,ocr,6,1,gpt-5,gpt-5 +26,missing_context,6,1,gpt-5,gpt-5 +25,ocr,5,0,gpt-5,gpt-5 +27,baseline,5,0,gpt-5,gpt-5 +26,incomplete,6,1,gpt-5,gpt-5 +19,out_of_scope,6,1,gpt-5,gpt-5 +26,out_of_domain,6,1,gpt-5,gpt-5 +27,ocr,5,0,gpt-5,gpt-5 +26,baseline,6,1,gpt-5,gpt-5 +27,missing_context,6,1,gpt-5,gpt-5 +27,misspelled,5,0,gpt-5,gpt-5 +26,misspelled,6,1,gpt-5,gpt-5 +27,incomplete,5,0,gpt-5,gpt-5 +28,missing_context,6,1,gpt-5,gpt-5 +27,out_of_domain,5,0,gpt-5,gpt-5 +28,out_of_domain,6,1,gpt-5,gpt-5 +28,ocr,6,1,gpt-5,gpt-5 +26,out_of_scope,6,1,gpt-5,gpt-5 +28,misspelled,6,1,gpt-5,gpt-5 +29,missing_context,6,1,gpt-5,gpt-5 +27,out_of_scope,5,0,gpt-5,gpt-5 +30,missing_context,6,1,gpt-5,gpt-5 +28,baseline,6,1,gpt-5,gpt-5 +30,baseline,6,1,gpt-5,gpt-5 +28,incomplete,6,1,gpt-5,gpt-5 +26,ocr,6,1,gpt-5,gpt-5 +28,out_of_scope,6,1,gpt-5,gpt-5 +30,incomplete,6,1,gpt-5,gpt-5 +30,out_of_scope,6,1,gpt-5,gpt-5 +30,out_of_domain,6,1,gpt-5,gpt-5 +29,baseline,6,1,gpt-5,gpt-5 +31,missing_context,6,1,gpt-5,gpt-5 +29,out_of_scope,4,0,gpt-5,gpt-5 +30,misspelled,6,1,gpt-5,gpt-5 +32,incomplete,5,0,gpt-5,gpt-5 +29,ocr,6,1,gpt-5,gpt-5 +29,incomplete,6,1,gpt-5,gpt-5 +29,out_of_domain,6,1,gpt-5,gpt-5 +32,misspelled,5,0,gpt-5,gpt-5 +29,misspelled,6,1,gpt-5,gpt-5 +32,baseline,5,0,gpt-5,gpt-5 +32,out_of_domain,5,0,gpt-5,gpt-5 +32,missing_context,6,1,gpt-5,gpt-5 +32,ocr,5,0,gpt-5,gpt-5 +31,out_of_scope,6,1,gpt-5,gpt-5 +31,baseline,6,1,gpt-5,gpt-5 +33,missing_context,6,1,gpt-5,gpt-5 +32,out_of_scope,5,0,gpt-5,gpt-5 +34,baseline,6,1,gpt-5,gpt-5 +31,out_of_domain,6,1,gpt-5,gpt-5 +34,misspelled,6,1,gpt-5,gpt-5 +34,out_of_domain,5,0,gpt-5,gpt-5 +34,incomplete,6,1,gpt-5,gpt-5 +31,misspelled,6,1,gpt-5,gpt-5 +34,ocr,5,0,gpt-5,gpt-5 +34,missing_context,6,1,gpt-5,gpt-5 +33,baseline,6,1,gpt-5,gpt-5 +31,incomplete,6,1,gpt-5,gpt-5 +31,ocr,6,1,gpt-5,gpt-5 +33,incomplete,6,1,gpt-5,gpt-5 +35,missing_context,6,1,gpt-5,gpt-5 +35,ocr,5,0,gpt-5,gpt-5 +35,misspelled,5,0,gpt-5,gpt-5 +35,baseline,5,0,gpt-5,gpt-5 +30,ocr,5,0,gpt-5,gpt-5 +33,misspelled,6,1,gpt-5,gpt-5 +33,out_of_domain,6,1,gpt-5,gpt-5 +34,out_of_scope,5,0,gpt-5,gpt-5 +36,missing_context,6,1,gpt-5,gpt-5 +33,ocr,6,1,gpt-5,gpt-5 +35,out_of_domain,5,0,gpt-5,gpt-5 +37,baseline,5,0,gpt-5,gpt-5 +35,out_of_scope,5,0,gpt-5,gpt-5 +37,incomplete,5,0,gpt-5,gpt-5 +36,out_of_domain,5,0,gpt-5,gpt-5 +35,incomplete,6,1,gpt-5,gpt-5 +37,out_of_domain,5,0,gpt-5,gpt-5 +37,missing_context,6,1,gpt-5,gpt-5 +37,misspelled,5,0,gpt-5,gpt-5 +36,misspelled,6,1,gpt-5,gpt-5 +36,out_of_scope,4,0,gpt-5,gpt-5 +38,baseline,5,0,gpt-5,gpt-5 +36,incomplete,6,1,gpt-5,gpt-5 +38,missing_context,6,1,gpt-5,gpt-5 +39,baseline,5,0,gpt-5,gpt-5 +36,baseline,6,1,gpt-5,gpt-5 +37,out_of_scope,5,0,gpt-5,gpt-5 +37,ocr,5,0,gpt-5,gpt-5 +38,out_of_domain,5,0,gpt-5,gpt-5 +38,incomplete,5,0,gpt-5,gpt-5 +39,incomplete,5,0,gpt-5,gpt-5 +39,missing_context,6,1,gpt-5,gpt-5 +38,ocr,5,0,gpt-5,gpt-5 +38,out_of_scope,6,1,gpt-5,gpt-5 +39,ocr,5,0,gpt-5,gpt-5 +38,misspelled,5,0,gpt-5,gpt-5 +39,misspelled,5,0,gpt-5,gpt-5 +40,misspelled,6,1,gpt-5,gpt-5 +40,incomplete,6,1,gpt-5,gpt-5 +40,baseline,6,1,gpt-5,gpt-5 +39,out_of_domain,5,0,gpt-5,gpt-5 +39,out_of_scope,5,0,gpt-5,gpt-5 +41,baseline,5,0,gpt-5,gpt-5 +40,out_of_domain,6,1,gpt-5,gpt-5 +41,missing_context,6,1,gpt-5,gpt-5 +40,missing_context,6,1,gpt-5,gpt-5 +41,incomplete,5,0,gpt-5,gpt-5 +41,misspelled,5,0,gpt-5,gpt-5 +42,baseline,6,1,gpt-5,gpt-5 +42,misspelled,5,0,gpt-5,gpt-5 +41,out_of_domain,5,0,gpt-5,gpt-5 +40,ocr,6,1,gpt-5,gpt-5 +42,out_of_domain,5,0,gpt-5,gpt-5 +41,ocr,5,0,gpt-5,gpt-5 +42,out_of_scope,6,1,gpt-5,gpt-5 +36,ocr,6,1,gpt-5,gpt-5 +42,missing_context,6,1,gpt-5,gpt-5 +42,incomplete,6,1,gpt-5,gpt-5 +43,incomplete,5,0,gpt-5,gpt-5 +43,misspelled,5,0,gpt-5,gpt-5 +40,out_of_scope,5,0,gpt-5,gpt-5 +33,out_of_scope,6,1,gpt-5,gpt-5 +41,out_of_scope,6,1,gpt-5,gpt-5 +43,missing_context,6,1,gpt-5,gpt-5 +43,out_of_domain,5,0,gpt-5,gpt-5 +43,baseline,5,0,gpt-5,gpt-5 +43,out_of_scope,6,1,gpt-5,gpt-5 +44,ocr,4,0,gpt-5,gpt-5 +43,ocr,5,0,gpt-5,gpt-5 +44,missing_context,6,1,gpt-5,gpt-5 +44,incomplete,5,0,gpt-5,gpt-5 +44,out_of_domain,4,0,gpt-5,gpt-5 +44,baseline,5,0,gpt-5,gpt-5 +45,missing_context,6,1,gpt-5,gpt-5 +42,ocr,3,0,gpt-5,gpt-5 +46,missing_context,6,1,gpt-5,gpt-5 +44,out_of_scope,6,1,gpt-5,gpt-5 +44,misspelled,5,0,gpt-5,gpt-5 +46,misspelled,4,0,gpt-5,gpt-5 +46,baseline,4,0,gpt-5,gpt-5 +46,out_of_scope,6,1,gpt-5,gpt-5 +45,out_of_scope,5,0,gpt-5,gpt-5 +45,baseline,6,1,gpt-5,gpt-5 +47,missing_context,6,1,gpt-5,gpt-5 +45,incomplete,6,1,gpt-5,gpt-5 +45,misspelled,6,1,gpt-5,gpt-5 +45,out_of_domain,6,1,gpt-5,gpt-5 +47,baseline,6,1,gpt-5,gpt-5 +46,ocr,5,0,gpt-5,gpt-5 +46,incomplete,4,0,gpt-5,gpt-5 +45,ocr,6,1,gpt-5,gpt-5 +48,baseline,5,0,gpt-5,gpt-5 +47,misspelled,6,1,gpt-5,gpt-5 +48,missing_context,6,1,gpt-5,gpt-5 +47,out_of_domain,6,1,gpt-5,gpt-5 +47,ocr,6,1,gpt-5,gpt-5 +48,ocr,5,0,gpt-5,gpt-5 +48,incomplete,5,0,gpt-5,gpt-5 +48,out_of_domain,5,0,gpt-5,gpt-5 +49,missing_context,6,1,gpt-5,gpt-5 +48,misspelled,5,0,gpt-5,gpt-5 +47,incomplete,6,1,gpt-5,gpt-5 +50,baseline,6,1,gpt-5,gpt-5 +47,out_of_scope,5,0,gpt-5,gpt-5 +46,out_of_domain,4,0,gpt-5,gpt-5 +48,out_of_scope,4,0,gpt-5,gpt-5 +50,missing_context,6,1,gpt-5,gpt-5 +49,out_of_scope,6,1,gpt-5,gpt-5 +49,out_of_domain,6,1,gpt-5,gpt-5 +50,misspelled,5,0,gpt-5,gpt-5 +51,missing_context,6,1,gpt-5,gpt-5 +50,ocr,6,1,gpt-5,gpt-5 +49,baseline,6,1,gpt-5,gpt-5 +49,misspelled,6,1,gpt-5,gpt-5 +50,out_of_domain,6,1,gpt-5,gpt-5 +50,incomplete,6,1,gpt-5,gpt-5 +51,out_of_domain,2,0,gpt-5,gpt-5 +52,baseline,6,1,gpt-5,gpt-5 +52,incomplete,6,1,gpt-5,gpt-5 +51,misspelled,6,1,gpt-5,gpt-5 +51,incomplete,6,1,gpt-5,gpt-5 +50,out_of_scope,5,0,gpt-5,gpt-5 +49,incomplete,6,1,gpt-5,gpt-5 +52,missing_context,6,1,gpt-5,gpt-5 +49,ocr,6,1,gpt-5,gpt-5 +52,out_of_scope,5,0,gpt-5,gpt-5 +51,out_of_scope,5,0,gpt-5,gpt-5 +53,missing_context,6,1,gpt-5,gpt-5 +52,out_of_domain,6,1,gpt-5,gpt-5 +51,ocr,6,1,gpt-5,gpt-5 +51,baseline,6,1,gpt-5,gpt-5 +52,misspelled,6,1,gpt-5,gpt-5 +53,misspelled,6,1,gpt-5,gpt-5 +52,ocr,6,1,gpt-5,gpt-5 +53,ocr,6,1,gpt-5,gpt-5 +53,baseline,6,1,gpt-5,gpt-5 +54,missing_context,6,1,gpt-5,gpt-5 +55,baseline,6,1,gpt-5,gpt-5 +54,out_of_scope,6,1,gpt-5,gpt-5 +53,out_of_scope,4,0,gpt-5,gpt-5 +54,misspelled,6,1,gpt-5,gpt-5 +55,missing_context,6,1,gpt-5,gpt-5 +54,out_of_domain,6,1,gpt-5,gpt-5 +55,incomplete,6,1,gpt-5,gpt-5 +53,out_of_domain,6,1,gpt-5,gpt-5 +55,out_of_domain,6,1,gpt-5,gpt-5 +55,misspelled,6,1,gpt-5,gpt-5 +56,baseline,5,0,gpt-5,gpt-5 +54,incomplete,6,1,gpt-5,gpt-5 +56,incomplete,5,0,gpt-5,gpt-5 +54,ocr,6,1,gpt-5,gpt-5 +56,missing_context,6,1,gpt-5,gpt-5 +55,out_of_scope,4,0,gpt-5,gpt-5 +55,ocr,4,0,gpt-5,gpt-5 +56,ocr,5,0,gpt-5,gpt-5 +57,misspelled,5,0,gpt-5,gpt-5 +56,out_of_domain,5,0,gpt-5,gpt-5 +57,baseline,5,0,gpt-5,gpt-5 +54,baseline,6,1,gpt-5,gpt-5 +57,ocr,5,0,gpt-5,gpt-5 +58,baseline,5,0,gpt-5,gpt-5 +57,missing_context,6,1,gpt-5,gpt-5 +53,incomplete,6,1,gpt-5,gpt-5 +57,incomplete,6,1,gpt-5,gpt-5 +56,out_of_scope,6,1,gpt-5,gpt-5 +58,missing_context,6,1,gpt-5,gpt-5 +59,misspelled,5,0,gpt-5,gpt-5 +57,out_of_domain,5,0,gpt-5,gpt-5 +59,baseline,5,0,gpt-5,gpt-5 +58,incomplete,4,0,gpt-5,gpt-5 +57,out_of_scope,5,0,gpt-5,gpt-5 +59,out_of_domain,5,0,gpt-5,gpt-5 +59,missing_context,6,1,gpt-5,gpt-5 +59,ocr,5,0,gpt-5,gpt-5 +58,out_of_domain,5,0,gpt-5,gpt-5 +58,ocr,5,0,gpt-5,gpt-5 +58,misspelled,5,0,gpt-5,gpt-5 +59,out_of_scope,6,1,gpt-5,gpt-5 +60,missing_context,6,1,gpt-5,gpt-5 +58,out_of_scope,6,1,gpt-5,gpt-5 +59,incomplete,5,0,gpt-5,gpt-5 +61,missing_context,6,1,gpt-5,gpt-5 +61,incomplete,6,1,gpt-5,gpt-5 +61,ocr,6,1,gpt-5,gpt-5 +60,out_of_scope,4,0,gpt-5,gpt-5 +61,out_of_scope,6,1,gpt-5,gpt-5 +61,misspelled,6,1,gpt-5,gpt-5 +62,missing_context,6,1,gpt-5,gpt-5 +60,out_of_domain,6,1,gpt-5,gpt-5 +61,baseline,6,1,gpt-5,gpt-5 +60,incomplete,4,0,gpt-5,gpt-5 +60,baseline,6,1,gpt-5,gpt-5 +62,out_of_scope,6,1,gpt-5,gpt-5 +63,missing_context,6,1,gpt-5,gpt-5 +61,out_of_domain,6,1,gpt-5,gpt-5 +63,out_of_scope,6,1,gpt-5,gpt-5 +64,baseline,5,0,gpt-5,gpt-5 +64,incomplete,6,1,gpt-5,gpt-5 +64,missing_context,6,1,gpt-5,gpt-5 +64,out_of_domain,6,1,gpt-5,gpt-5 +62,ocr,6,1,gpt-5,gpt-5 +64,misspelled,5,0,gpt-5,gpt-5 +60,ocr,4,0,gpt-5,gpt-5 +64,ocr,5,0,gpt-5,gpt-5 +62,out_of_domain,6,1,gpt-5,gpt-5 +62,misspelled,6,1,gpt-5,gpt-5 +62,baseline,6,1,gpt-5,gpt-5 +65,baseline,5,0,gpt-5,gpt-5 +63,out_of_domain,6,1,gpt-5,gpt-5 +65,misspelled,5,0,gpt-5,gpt-5 +63,baseline,6,1,gpt-5,gpt-5 +65,out_of_domain,5,0,gpt-5,gpt-5 +62,incomplete,6,1,gpt-5,gpt-5 +64,out_of_scope,5,0,gpt-5,gpt-5 +65,out_of_scope,6,1,gpt-5,gpt-5 +66,missing_context,6,1,gpt-5,gpt-5 +65,ocr,5,0,gpt-5,gpt-5 +66,baseline,5,0,gpt-5,gpt-5 +66,misspelled,5,0,gpt-5,gpt-5 +65,incomplete,6,1,gpt-5,gpt-5 +65,missing_context,6,1,gpt-5,gpt-5 +63,incomplete,6,1,gpt-5,gpt-5 +60,misspelled,6,1,gpt-5,gpt-5 +63,misspelled,6,1,gpt-5,gpt-5 +66,ocr,5,0,gpt-5,gpt-5 +66,out_of_domain,5,0,gpt-5,gpt-5 +66,incomplete,5,0,gpt-5,gpt-5 +67,missing_context,6,1,gpt-5,gpt-5 +67,out_of_scope,6,1,gpt-5,gpt-5 +68,missing_context,6,1,gpt-5,gpt-5 +63,ocr,6,1,gpt-5,gpt-5 +67,misspelled,6,1,gpt-5,gpt-5 +66,out_of_scope,6,1,gpt-5,gpt-5 +67,out_of_domain,6,1,gpt-5,gpt-5 +67,baseline,6,1,gpt-5,gpt-5 +69,incomplete,6,1,gpt-5,gpt-5 +67,incomplete,6,1,gpt-5,gpt-5 +68,out_of_scope,5,0,gpt-5,gpt-5 +68,out_of_domain,2,0,gpt-5,gpt-5 +69,missing_context,6,1,gpt-5,gpt-5 +69,out_of_scope,6,1,gpt-5,gpt-5 +69,baseline,5,0,gpt-5,gpt-5 +70,misspelled,5,0,gpt-5,gpt-5 +68,incomplete,4,0,gpt-5,gpt-5 +69,misspelled,5,0,gpt-5,gpt-5 +67,ocr,3,0,gpt-5,gpt-5 +68,misspelled,6,1,gpt-5,gpt-5 +70,baseline,5,0,gpt-5,gpt-5 +70,missing_context,6,1,gpt-5,gpt-5 +70,incomplete,5,0,gpt-5,gpt-5 +68,ocr,6,1,gpt-5,gpt-5 +70,ocr,5,0,gpt-5,gpt-5 +70,out_of_domain,6,1,gpt-5,gpt-5 +71,missing_context,6,1,gpt-5,gpt-5 +71,ocr,5,0,gpt-5,gpt-5 +69,ocr,5,0,gpt-5,gpt-5 +70,out_of_scope,6,1,gpt-5,gpt-5 +71,baseline,5,0,gpt-5,gpt-5 +71,misspelled,6,1,gpt-5,gpt-5 +68,baseline,4,0,gpt-5,gpt-5 +71,out_of_domain,5,0,gpt-5,gpt-5 +72,out_of_domain,5,0,gpt-5,gpt-5 +72,missing_context,6,1,gpt-5,gpt-5 +72,misspelled,5,0,gpt-5,gpt-5 +69,out_of_domain,6,1,gpt-5,gpt-5 +72,ocr,5,0,gpt-5,gpt-5 +72,incomplete,5,0,gpt-5,gpt-5 +72,baseline,5,0,gpt-5,gpt-5 +71,out_of_scope,5,0,gpt-5,gpt-5 +73,baseline,5,0,gpt-5,gpt-5 +73,out_of_domain,5,0,gpt-5,gpt-5 +71,incomplete,6,1,gpt-5,gpt-5 +73,missing_context,6,1,gpt-5,gpt-5 +73,incomplete,5,0,gpt-5,gpt-5 +72,out_of_scope,6,1,gpt-5,gpt-5 +73,misspelled,5,0,gpt-5,gpt-5 +74,missing_context,6,1,gpt-5,gpt-5 +74,baseline,2,0,gpt-5,gpt-5 +73,out_of_scope,5,0,gpt-5,gpt-5 +73,ocr,4,0,gpt-5,gpt-5 +75,missing_context,6,1,gpt-5,gpt-5 +75,misspelled,6,1,gpt-5,gpt-5 +76,baseline,5,0,gpt-5,gpt-5 +74,out_of_scope,6,1,gpt-5,gpt-5 +74,ocr,5,0,gpt-5,gpt-5 +76,misspelled,5,0,gpt-5,gpt-5 +74,misspelled,3,0,gpt-5,gpt-5 +75,incomplete,6,1,gpt-5,gpt-5 +74,incomplete,5,0,gpt-5,gpt-5 +75,baseline,6,1,gpt-5,gpt-5 +75,out_of_domain,5,0,gpt-5,gpt-5 +76,incomplete,5,0,gpt-5,gpt-5 +76,ocr,5,0,gpt-5,gpt-5 +5,misspelled,5,0,gpt-5,gpt-5 +75,out_of_scope,5,0,gpt-5,gpt-5 +75,ocr,6,1,gpt-5,gpt-5 +77,missing_context,6,1,gpt-5,gpt-5 +74,out_of_domain,4,0,gpt-5,gpt-5 +76,missing_context,6,1,gpt-5,gpt-5 +78,missing_context,6,1,gpt-5,gpt-5 +76,out_of_domain,6,1,gpt-5,gpt-5 +76,out_of_scope,5,0,gpt-5,gpt-5 +77,baseline,6,1,gpt-5,gpt-5 +77,out_of_scope,4,0,gpt-5,gpt-5 +78,incomplete,6,1,gpt-5,gpt-5 +79,misspelled,5,0,gpt-5,gpt-5 +79,missing_context,6,1,gpt-5,gpt-5 +77,misspelled,6,1,gpt-5,gpt-5 +78,baseline,6,1,gpt-5,gpt-5 +78,out_of_scope,6,1,gpt-5,gpt-5 +78,misspelled,5,0,gpt-5,gpt-5 +79,incomplete,5,0,gpt-5,gpt-5 +79,baseline,5,0,gpt-5,gpt-5 +77,ocr,6,1,gpt-5,gpt-5 +80,missing_context,6,1,gpt-5,gpt-5 +77,out_of_domain,6,1,gpt-5,gpt-5 +79,ocr,5,0,gpt-5,gpt-5 +79,out_of_domain,6,1,gpt-5,gpt-5 +78,out_of_domain,5,0,gpt-5,gpt-5 +77,incomplete,6,1,gpt-5,gpt-5 +79,out_of_scope,6,1,gpt-5,gpt-5 +78,ocr,3,0,gpt-5,gpt-5 +81,missing_context,6,1,gpt-5,gpt-5 +80,incomplete,6,1,gpt-5,gpt-5 +80,out_of_scope,4,0,gpt-5,gpt-5 +81,misspelled,6,1,gpt-5,gpt-5 +80,ocr,4,0,gpt-5,gpt-5 +80,misspelled,4,0,gpt-5,gpt-5 +81,baseline,6,1,gpt-5,gpt-5 +82,out_of_domain,5,0,gpt-5,gpt-5 +81,out_of_domain,6,1,gpt-5,gpt-5 +81,incomplete,6,1,gpt-5,gpt-5 +82,baseline,5,0,gpt-5,gpt-5 +80,baseline,4,0,gpt-5,gpt-5 +82,missing_context,6,1,gpt-5,gpt-5 +82,ocr,5,0,gpt-5,gpt-5 +82,misspelled,5,0,gpt-5,gpt-5 +82,out_of_scope,6,1,gpt-5,gpt-5 +82,incomplete,5,0,gpt-5,gpt-5 +83,incomplete,5,0,gpt-5,gpt-5 +81,ocr,4,0,gpt-5,gpt-5 +80,out_of_domain,4,0,gpt-5,gpt-5 +83,missing_context,6,1,gpt-5,gpt-5 +84,missing_context,6,1,gpt-5,gpt-5 +84,baseline,5,0,gpt-5,gpt-5 +83,baseline,5,0,gpt-5,gpt-5 +84,incomplete,5,0,gpt-5,gpt-5 +84,misspelled,5,0,gpt-5,gpt-5 +81,out_of_scope,4,0,gpt-5,gpt-5 +83,ocr,5,0,gpt-5,gpt-5 +83,misspelled,5,0,gpt-5,gpt-5 +84,out_of_domain,5,0,gpt-5,gpt-5 +85,missing_context,6,1,gpt-5,gpt-5 +86,missing_context,6,1,gpt-5,gpt-5 +86,ocr,5,0,gpt-5,gpt-5 +83,out_of_domain,6,1,gpt-5,gpt-5 +86,baseline,5,0,gpt-5,gpt-5 +83,out_of_scope,5,0,gpt-5,gpt-5 +86,incomplete,5,0,gpt-5,gpt-5 +86,misspelled,5,0,gpt-5,gpt-5 +85,misspelled,3,0,gpt-5,gpt-5 +86,out_of_domain,6,1,gpt-5,gpt-5 +85,out_of_domain,3,0,gpt-5,gpt-5 +86,out_of_scope,6,1,gpt-5,gpt-5 +84,out_of_scope,6,1,gpt-5,gpt-5 +87,missing_context,6,1,gpt-5,gpt-5 +85,baseline,4,0,gpt-5,gpt-5 +85,out_of_scope,6,1,gpt-5,gpt-5 +85,ocr,3,0,gpt-5,gpt-5 +85,incomplete,3,0,gpt-5,gpt-5 +84,ocr,3,0,gpt-5,gpt-5 +88,out_of_scope,6,1,gpt-5,gpt-5 +89,baseline,5,0,gpt-5,gpt-5 +89,misspelled,5,0,gpt-5,gpt-5 +89,incomplete,4,0,gpt-5,gpt-5 +89,missing_context,6,1,gpt-5,gpt-5 +89,out_of_domain,6,1,gpt-5,gpt-5 +87,baseline,6,1,gpt-5,gpt-5 +88,missing_context,6,1,gpt-5,gpt-5 +89,ocr,5,0,gpt-5,gpt-5 +87,out_of_scope,6,1,gpt-5,gpt-5 +90,missing_context,6,1,gpt-5,gpt-5 +87,misspelled,6,1,gpt-5,gpt-5 +87,out_of_domain,6,1,gpt-5,gpt-5 +87,ocr,4,0,gpt-5,gpt-5 +90,incomplete,6,1,gpt-5,gpt-5 +88,baseline,6,1,gpt-5,gpt-5 +90,misspelled,6,1,gpt-5,gpt-5 +90,out_of_scope,4,0,gpt-5,gpt-5 +91,missing_context,6,1,gpt-5,gpt-5 +88,misspelled,6,1,gpt-5,gpt-5 +89,out_of_scope,5,0,gpt-5,gpt-5 +90,baseline,6,1,gpt-5,gpt-5 +88,incomplete,6,1,gpt-5,gpt-5 +88,out_of_domain,4,0,gpt-5,gpt-5 +88,ocr,6,1,gpt-5,gpt-5 +90,ocr,6,1,gpt-5,gpt-5 +91,baseline,4,0,gpt-5,gpt-5 +91,incomplete,6,1,gpt-5,gpt-5 +91,misspelled,6,1,gpt-5,gpt-5 +92,missing_context,6,1,gpt-5,gpt-5 +87,incomplete,6,1,gpt-5,gpt-5 +93,misspelled,5,0,gpt-5,gpt-5 +91,out_of_domain,6,1,gpt-5,gpt-5 +90,out_of_domain,6,1,gpt-5,gpt-5 +91,out_of_scope,6,1,gpt-5,gpt-5 +93,out_of_domain,5,0,gpt-5,gpt-5 +92,out_of_scope,5,0,gpt-5,gpt-5 +93,missing_context,6,1,gpt-5,gpt-5 +93,incomplete,6,1,gpt-5,gpt-5 +93,ocr,5,0,gpt-5,gpt-5 +92,incomplete,6,1,gpt-5,gpt-5 +93,baseline,5,0,gpt-5,gpt-5 +91,ocr,6,1,gpt-5,gpt-5 +92,out_of_domain,4,0,gpt-5,gpt-5 +92,baseline,6,1,gpt-5,gpt-5 +92,ocr,6,1,gpt-5,gpt-5 +94,missing_context,6,1,gpt-5,gpt-5 +92,misspelled,6,1,gpt-5,gpt-5 +94,incomplete,6,1,gpt-5,gpt-5 +94,baseline,5,0,gpt-5,gpt-5 +95,baseline,5,0,gpt-5,gpt-5 +95,incomplete,5,0,gpt-5,gpt-5 +94,out_of_domain,5,0,gpt-5,gpt-5 +93,out_of_scope,6,1,gpt-5,gpt-5 +95,missing_context,6,1,gpt-5,gpt-5 +94,misspelled,5,0,gpt-5,gpt-5 +95,misspelled,5,0,gpt-5,gpt-5 +95,ocr,5,0,gpt-5,gpt-5 +96,missing_context,6,1,gpt-5,gpt-5 +95,out_of_domain,5,0,gpt-5,gpt-5 +94,out_of_scope,6,1,gpt-5,gpt-5 +96,incomplete,5,0,gpt-5,gpt-5 +97,baseline,5,0,gpt-5,gpt-5 +96,misspelled,5,0,gpt-5,gpt-5 +97,missing_context,6,1,gpt-5,gpt-5 +94,ocr,5,0,gpt-5,gpt-5 +97,misspelled,5,0,gpt-5,gpt-5 +96,ocr,5,0,gpt-5,gpt-5 +96,baseline,6,1,gpt-5,gpt-5 +96,out_of_domain,6,1,gpt-5,gpt-5 +96,out_of_scope,5,0,gpt-5,gpt-5 +95,out_of_scope,5,0,gpt-5,gpt-5 +98,incomplete,5,0,gpt-5,gpt-5 +97,ocr,5,0,gpt-5,gpt-5 +98,missing_context,6,1,gpt-5,gpt-5 +99,missing_context,6,1,gpt-5,gpt-5 +98,misspelled,5,0,gpt-5,gpt-5 +97,incomplete,6,1,gpt-5,gpt-5 +98,ocr,5,0,gpt-5,gpt-5 +98,baseline,5,0,gpt-5,gpt-5 +98,out_of_scope,6,1,gpt-5,gpt-5 +97,out_of_domain,6,1,gpt-5,gpt-5 +101,baseline,5,0,gpt-5,gpt-5 +99,out_of_scope,5,0,gpt-5,gpt-5 +99,incomplete,6,1,gpt-5,gpt-5 +99,out_of_domain,4,0,gpt-5,gpt-5 +98,out_of_domain,6,1,gpt-5,gpt-5 +99,baseline,4,0,gpt-5,gpt-5 +101,misspelled,6,1,gpt-5,gpt-5 +99,ocr,6,1,gpt-5,gpt-5 +100,missing_context,6,1,gpt-5,gpt-5 +99,misspelled,4,0,gpt-5,gpt-5 +100,baseline,2,0,gpt-5,gpt-5 +101,incomplete,6,1,gpt-5,gpt-5 +100,incomplete,5,0,gpt-5,gpt-5 +101,missing_context,6,1,gpt-5,gpt-5 +101,out_of_domain,5,0,gpt-5,gpt-5 +97,out_of_scope,6,1,gpt-5,gpt-5 +100,out_of_scope,6,1,gpt-5,gpt-5 +102,missing_context,6,1,gpt-5,gpt-5 +102,baseline,5,0,gpt-5,gpt-5 +100,out_of_domain,2,0,gpt-5,gpt-5 +103,baseline,5,0,gpt-5,gpt-5 +102,incomplete,5,0,gpt-5,gpt-5 +101,ocr,6,1,gpt-5,gpt-5 +102,out_of_domain,5,0,gpt-5,gpt-5 +103,out_of_domain,5,0,gpt-5,gpt-5 +103,missing_context,6,1,gpt-5,gpt-5 +100,ocr,5,0,gpt-5,gpt-5 +102,ocr,5,0,gpt-5,gpt-5 +103,incomplete,5,0,gpt-5,gpt-5 +102,misspelled,5,0,gpt-5,gpt-5 +103,ocr,5,0,gpt-5,gpt-5 +101,out_of_scope,4,0,gpt-5,gpt-5 +102,out_of_scope,4,0,gpt-5,gpt-5 +103,out_of_scope,6,1,gpt-5,gpt-5 +103,misspelled,6,1,gpt-5,gpt-5 +104,missing_context,6,1,gpt-5,gpt-5 +100,misspelled,5,0,gpt-5,gpt-5 +105,missing_context,6,1,gpt-5,gpt-5 +104,out_of_scope,6,1,gpt-5,gpt-5 +104,incomplete,6,1,gpt-5,gpt-5 +106,missing_context,6,1,gpt-5,gpt-5 +105,out_of_scope,6,1,gpt-5,gpt-5 +104,ocr,6,1,gpt-5,gpt-5 +104,misspelled,6,1,gpt-5,gpt-5 +104,baseline,6,1,gpt-5,gpt-5 +106,incomplete,6,1,gpt-5,gpt-5 +106,out_of_scope,6,1,gpt-5,gpt-5 +104,out_of_domain,6,1,gpt-5,gpt-5 +106,baseline,6,1,gpt-5,gpt-5 +107,baseline,6,1,gpt-5,gpt-5 +107,missing_context,6,1,gpt-5,gpt-5 +105,incomplete,6,1,gpt-5,gpt-5 +107,misspelled,6,1,gpt-5,gpt-5 +105,out_of_domain,6,1,gpt-5,gpt-5 +105,baseline,6,1,gpt-5,gpt-5 +107,incomplete,6,1,gpt-5,gpt-5 +106,misspelled,6,1,gpt-5,gpt-5 +107,out_of_scope,6,1,gpt-5,gpt-5 +106,out_of_domain,6,1,gpt-5,gpt-5 +108,missing_context,6,1,gpt-5,gpt-5 +107,out_of_domain,6,1,gpt-5,gpt-5 +108,misspelled,5,0,gpt-5,gpt-5 +108,out_of_domain,5,0,gpt-5,gpt-5 +105,misspelled,6,1,gpt-5,gpt-5 +109,missing_context,6,1,gpt-5,gpt-5 +108,baseline,5,0,gpt-5,gpt-5 +106,ocr,6,1,gpt-5,gpt-5 +108,ocr,5,0,gpt-5,gpt-5 +107,ocr,6,1,gpt-5,gpt-5 +108,incomplete,6,1,gpt-5,gpt-5 +105,ocr,3,0,gpt-5,gpt-5 +110,baseline,5,0,gpt-5,gpt-5 +110,incomplete,5,0,gpt-5,gpt-5 +109,baseline,6,1,gpt-5,gpt-5 +108,out_of_scope,6,1,gpt-5,gpt-5 +110,missing_context,6,1,gpt-5,gpt-5 +110,misspelled,5,0,gpt-5,gpt-5 +110,out_of_domain,5,0,gpt-5,gpt-5 +110,ocr,5,0,gpt-5,gpt-5 +109,out_of_scope,5,0,gpt-5,gpt-5 +111,missing_context,6,1,gpt-5,gpt-5 +111,out_of_domain,6,1,gpt-5,gpt-5 +109,incomplete,6,1,gpt-5,gpt-5 +109,ocr,6,1,gpt-5,gpt-5 +109,out_of_domain,6,1,gpt-5,gpt-5 +110,out_of_scope,4,0,gpt-5,gpt-5 +111,out_of_scope,6,1,gpt-5,gpt-5 +109,misspelled,1,0,gpt-5,gpt-5 +112,missing_context,6,1,gpt-5,gpt-5 +111,misspelled,6,1,gpt-5,gpt-5 +111,incomplete,6,1,gpt-5,gpt-5 +112,misspelled,5,0,gpt-5,gpt-5 +112,baseline,5,0,gpt-5,gpt-5 +112,incomplete,5,0,gpt-5,gpt-5 +113,missing_context,6,1,gpt-5,gpt-5 +111,baseline,6,1,gpt-5,gpt-5 +112,ocr,5,0,gpt-5,gpt-5 +111,ocr,4,0,gpt-5,gpt-5 +114,missing_context,6,1,gpt-5,gpt-5 +113,misspelled,6,1,gpt-5,gpt-5 +112,out_of_domain,5,0,gpt-5,gpt-5 +113,out_of_domain,6,1,gpt-5,gpt-5 +114,out_of_scope,5,0,gpt-5,gpt-5 +113,baseline,6,1,gpt-5,gpt-5 +112,out_of_scope,5,0,gpt-5,gpt-5 +113,ocr,6,1,gpt-5,gpt-5 +113,incomplete,6,1,gpt-5,gpt-5 +115,missing_context,6,1,gpt-5,gpt-5 +114,baseline,6,1,gpt-5,gpt-5 +116,missing_context,6,1,gpt-5,gpt-5 +115,baseline,6,1,gpt-5,gpt-5 +114,ocr,6,1,gpt-5,gpt-5 +114,misspelled,6,1,gpt-5,gpt-5 +115,incomplete,6,1,gpt-5,gpt-5 +115,misspelled,6,1,gpt-5,gpt-5 +116,baseline,5,0,gpt-5,gpt-5 +115,ocr,6,1,gpt-5,gpt-5 +114,incomplete,6,1,gpt-5,gpt-5 +115,out_of_scope,6,1,gpt-5,gpt-5 +116,misspelled,5,0,gpt-5,gpt-5 +113,out_of_scope,6,1,gpt-5,gpt-5 +117,missing_context,6,1,gpt-5,gpt-5 +115,out_of_domain,6,1,gpt-5,gpt-5 +117,baseline,5,0,gpt-5,gpt-5 +116,ocr,5,0,gpt-5,gpt-5 +116,incomplete,5,0,gpt-5,gpt-5 +116,out_of_scope,6,1,gpt-5,gpt-5 +114,out_of_domain,6,1,gpt-5,gpt-5 +118,missing_context,6,1,gpt-5,gpt-5 +117,out_of_domain,6,1,gpt-5,gpt-5 +117,ocr,5,0,gpt-5,gpt-5 +116,out_of_domain,6,1,gpt-5,gpt-5 +118,baseline,6,1,gpt-5,gpt-5 +117,incomplete,6,1,gpt-5,gpt-5 +118,misspelled,6,1,gpt-5,gpt-5 +118,incomplete,5,0,gpt-5,gpt-5 +119,baseline,6,1,gpt-5,gpt-5 +117,misspelled,5,0,gpt-5,gpt-5 +118,out_of_scope,6,1,gpt-5,gpt-5 +118,ocr,6,1,gpt-5,gpt-5 +118,out_of_domain,6,1,gpt-5,gpt-5 +119,ocr,6,1,gpt-5,gpt-5 +119,missing_context,6,1,gpt-5,gpt-5 +119,incomplete,6,1,gpt-5,gpt-5 +119,misspelled,6,1,gpt-5,gpt-5 +119,out_of_domain,6,1,gpt-5,gpt-5 +120,missing_context,6,1,gpt-5,gpt-5 +117,out_of_scope,5,0,gpt-5,gpt-5 +119,out_of_scope,6,1,gpt-5,gpt-5 +121,missing_context,6,1,gpt-5,gpt-5 +122,baseline,5,0,gpt-5,gpt-5 +120,out_of_scope,5,0,gpt-5,gpt-5 +122,incomplete,5,0,gpt-5,gpt-5 +122,missing_context,6,1,gpt-5,gpt-5 +122,out_of_domain,5,0,gpt-5,gpt-5 +120,misspelled,6,1,gpt-5,gpt-5 +56,misspelled,5,0,gpt-5,gpt-5 +122,misspelled,5,0,gpt-5,gpt-5 +121,out_of_scope,6,1,gpt-5,gpt-5 +122,ocr,3,0,gpt-5,gpt-5 +120,baseline,6,1,gpt-5,gpt-5 +123,missing_context,6,1,gpt-5,gpt-5 +122,out_of_scope,6,1,gpt-5,gpt-5 +120,incomplete,6,1,gpt-5,gpt-5 +121,misspelled,6,1,gpt-5,gpt-5 +121,baseline,6,1,gpt-5,gpt-5 +123,misspelled,6,1,gpt-5,gpt-5 +123,baseline,6,1,gpt-5,gpt-5 +120,out_of_domain,6,1,gpt-5,gpt-5 +123,out_of_domain,6,1,gpt-5,gpt-5 +121,out_of_domain,6,1,gpt-5,gpt-5 +123,incomplete,6,1,gpt-5,gpt-5 +123,out_of_scope,6,1,gpt-5,gpt-5 +124,missing_context,6,1,gpt-5,gpt-5 +120,ocr,6,1,gpt-5,gpt-5 +124,out_of_scope,6,1,gpt-5,gpt-5 +125,missing_context,6,1,gpt-5,gpt-5 +123,ocr,5,0,gpt-5,gpt-5 +125,out_of_domain,6,1,gpt-5,gpt-5 +126,missing_context,6,1,gpt-5,gpt-5 +121,ocr,6,1,gpt-5,gpt-5 +121,incomplete,6,1,gpt-5,gpt-5 +126,incomplete,6,1,gpt-5,gpt-5 +125,out_of_scope,6,1,gpt-5,gpt-5 +124,baseline,4,0,gpt-5,gpt-5 +125,incomplete,6,1,gpt-5,gpt-5 +125,misspelled,6,1,gpt-5,gpt-5 +125,ocr,6,1,gpt-5,gpt-5 +125,baseline,4,0,gpt-5,gpt-5 +124,misspelled,6,1,gpt-5,gpt-5 +127,baseline,5,0,gpt-5,gpt-5 +126,ocr,5,0,gpt-5,gpt-5 +127,out_of_domain,5,0,gpt-5,gpt-5 +124,ocr,6,1,gpt-5,gpt-5 +124,out_of_domain,6,1,gpt-5,gpt-5 +126,baseline,6,1,gpt-5,gpt-5 +127,misspelled,5,0,gpt-5,gpt-5 +126,out_of_domain,6,1,gpt-5,gpt-5 +124,incomplete,6,1,gpt-5,gpt-5 +127,incomplete,5,0,gpt-5,gpt-5 +127,missing_context,6,1,gpt-5,gpt-5 +128,baseline,5,0,gpt-5,gpt-5 +128,incomplete,5,0,gpt-5,gpt-5 +128,out_of_domain,5,0,gpt-5,gpt-5 +128,missing_context,6,1,gpt-5,gpt-5 +129,baseline,5,0,gpt-5,gpt-5 +126,misspelled,4,0,gpt-5,gpt-5 +128,misspelled,5,0,gpt-5,gpt-5 +129,out_of_domain,5,0,gpt-5,gpt-5 +129,out_of_scope,6,1,gpt-5,gpt-5 +129,misspelled,5,0,gpt-5,gpt-5 +128,out_of_scope,6,1,gpt-5,gpt-5 +127,out_of_scope,4,0,gpt-5,gpt-5 +126,out_of_scope,6,1,gpt-5,gpt-5 +129,missing_context,6,1,gpt-5,gpt-5 +129,ocr,5,0,gpt-5,gpt-5 +130,baseline,5,0,gpt-5,gpt-5 +130,misspelled,5,0,gpt-5,gpt-5 +129,incomplete,5,0,gpt-5,gpt-5 +130,incomplete,5,0,gpt-5,gpt-5 +128,ocr,3,0,gpt-5,gpt-5 +131,missing_context,6,1,gpt-5,gpt-5 +130,missing_context,6,1,gpt-5,gpt-5 +127,ocr,3,0,gpt-5,gpt-5 +130,ocr,5,0,gpt-5,gpt-5 +132,misspelled,5,0,gpt-5,gpt-5 +132,out_of_domain,5,0,gpt-5,gpt-5 +131,ocr,5,0,gpt-5,gpt-5 +132,incomplete,5,0,gpt-5,gpt-5 +131,baseline,6,1,gpt-5,gpt-5 +130,out_of_domain,5,0,gpt-5,gpt-5 +131,incomplete,6,1,gpt-5,gpt-5 +132,missing_context,6,1,gpt-5,gpt-5 +131,misspelled,6,1,gpt-5,gpt-5 +132,baseline,5,0,gpt-5,gpt-5 +130,out_of_scope,4,0,gpt-5,gpt-5 +133,ocr,5,0,gpt-5,gpt-5 +133,out_of_domain,5,0,gpt-5,gpt-5 +133,missing_context,6,1,gpt-5,gpt-5 +133,incomplete,5,0,gpt-5,gpt-5 +131,out_of_domain,4,0,gpt-5,gpt-5 +134,missing_context,6,1,gpt-5,gpt-5 +133,baseline,5,0,gpt-5,gpt-5 +133,out_of_scope,5,0,gpt-5,gpt-5 +132,out_of_scope,4,0,gpt-5,gpt-5 +133,misspelled,5,0,gpt-5,gpt-5 +135,incomplete,5,0,gpt-5,gpt-5 +134,incomplete,6,1,gpt-5,gpt-5 +134,baseline,6,1,gpt-5,gpt-5 +134,out_of_domain,6,1,gpt-5,gpt-5 +134,misspelled,6,1,gpt-5,gpt-5 +135,missing_context,6,1,gpt-5,gpt-5 +131,out_of_scope,6,1,gpt-5,gpt-5 +134,ocr,6,1,gpt-5,gpt-5 +135,out_of_scope,6,1,gpt-5,gpt-5 +135,misspelled,5,0,gpt-5,gpt-5 +135,baseline,5,0,gpt-5,gpt-5 +132,ocr,4,0,gpt-5,gpt-5 +135,ocr,5,0,gpt-5,gpt-5 +134,out_of_scope,5,0,gpt-5,gpt-5 +136,missing_context,6,1,gpt-5,gpt-5 +137,baseline,6,1,gpt-5,gpt-5 +137,incomplete,6,1,gpt-5,gpt-5 +137,misspelled,6,1,gpt-5,gpt-5 +136,baseline,6,1,gpt-5,gpt-5 +137,missing_context,6,1,gpt-5,gpt-5 +137,ocr,6,1,gpt-5,gpt-5 +135,out_of_domain,6,1,gpt-5,gpt-5 +138,misspelled,5,0,gpt-5,gpt-5 +136,out_of_scope,6,1,gpt-5,gpt-5 +137,out_of_domain,6,1,gpt-5,gpt-5 +138,missing_context,6,1,gpt-5,gpt-5 +138,incomplete,5,0,gpt-5,gpt-5 +136,misspelled,6,1,gpt-5,gpt-5 +138,baseline,5,0,gpt-5,gpt-5 +136,out_of_domain,6,1,gpt-5,gpt-5 +137,out_of_scope,6,1,gpt-5,gpt-5 +136,ocr,6,1,gpt-5,gpt-5 +139,missing_context,6,1,gpt-5,gpt-5 +138,ocr,5,0,gpt-5,gpt-5 +138,out_of_domain,6,1,gpt-5,gpt-5 +138,out_of_scope,5,0,gpt-5,gpt-5 +140,misspelled,5,0,gpt-5,gpt-5 +136,incomplete,6,1,gpt-5,gpt-5 +140,baseline,5,0,gpt-5,gpt-5 +139,misspelled,6,1,gpt-5,gpt-5 +139,baseline,6,1,gpt-5,gpt-5 +139,out_of_domain,6,1,gpt-5,gpt-5 +140,missing_context,6,1,gpt-5,gpt-5 +139,ocr,6,1,gpt-5,gpt-5 +140,out_of_domain,5,0,gpt-5,gpt-5 +140,ocr,5,0,gpt-5,gpt-5 +139,out_of_scope,6,1,gpt-5,gpt-5 +141,missing_context,6,1,gpt-5,gpt-5 +139,incomplete,6,1,gpt-5,gpt-5 +141,baseline,5,0,gpt-5,gpt-5 +141,misspelled,6,1,gpt-5,gpt-5 +140,out_of_scope,6,1,gpt-5,gpt-5 +141,out_of_domain,5,0,gpt-5,gpt-5 +143,baseline,5,0,gpt-5,gpt-5 +140,incomplete,6,1,gpt-5,gpt-5 +141,ocr,6,1,gpt-5,gpt-5 +143,missing_context,6,1,gpt-5,gpt-5 +141,out_of_scope,6,1,gpt-5,gpt-5 +142,missing_context,6,1,gpt-5,gpt-5 +143,incomplete,6,1,gpt-5,gpt-5 +143,ocr,5,0,gpt-5,gpt-5 +142,out_of_scope,6,1,gpt-5,gpt-5 +143,misspelled,5,0,gpt-5,gpt-5 +141,incomplete,6,1,gpt-5,gpt-5 +143,out_of_domain,6,1,gpt-5,gpt-5 +144,missing_context,6,1,gpt-5,gpt-5 +142,misspelled,6,1,gpt-5,gpt-5 +143,out_of_scope,6,1,gpt-5,gpt-5 +142,out_of_domain,6,1,gpt-5,gpt-5 +142,ocr,6,1,gpt-5,gpt-5 +145,missing_context,6,1,gpt-5,gpt-5 +144,out_of_scope,5,0,gpt-5,gpt-5 +145,out_of_domain,4,0,gpt-5,gpt-5 +142,baseline,6,1,gpt-5,gpt-5 +145,out_of_scope,6,1,gpt-5,gpt-5 +146,missing_context,6,1,gpt-5,gpt-5 +146,out_of_domain,5,0,gpt-5,gpt-5 +146,baseline,5,0,gpt-5,gpt-5 +146,ocr,5,0,gpt-5,gpt-5 +144,misspelled,6,1,gpt-5,gpt-5 +146,incomplete,5,0,gpt-5,gpt-5 +146,misspelled,5,0,gpt-5,gpt-5 +142,incomplete,6,1,gpt-5,gpt-5 +144,baseline,6,1,gpt-5,gpt-5 +144,incomplete,6,1,gpt-5,gpt-5 +144,ocr,6,1,gpt-5,gpt-5 +146,out_of_scope,5,0,gpt-5,gpt-5 +147,missing_context,6,1,gpt-5,gpt-5 +145,incomplete,3,0,gpt-5,gpt-5 +144,out_of_domain,6,1,gpt-5,gpt-5 +147,incomplete,6,1,gpt-5,gpt-5 +147,baseline,6,1,gpt-5,gpt-5 +145,ocr,3,0,gpt-5,gpt-5 +147,misspelled,6,1,gpt-5,gpt-5 +147,ocr,6,1,gpt-5,gpt-5 +145,misspelled,6,1,gpt-5,gpt-5 +148,missing_context,6,1,gpt-5,gpt-5 +145,baseline,3,0,gpt-5,gpt-5 +147,out_of_domain,6,1,gpt-5,gpt-5 +147,out_of_scope,5,0,gpt-5,gpt-5 +148,misspelled,6,1,gpt-5,gpt-5 +149,missing_context,6,1,gpt-5,gpt-5 +150,missing_context,6,1,gpt-5,gpt-5 +148,out_of_scope,5,0,gpt-5,gpt-5 +148,out_of_domain,6,1,gpt-5,gpt-5 +148,baseline,6,1,gpt-5,gpt-5 +150,baseline,6,1,gpt-5,gpt-5 +150,misspelled,6,1,gpt-5,gpt-5 +149,out_of_domain,5,0,gpt-5,gpt-5 +150,incomplete,6,1,gpt-5,gpt-5 +148,ocr,6,1,gpt-5,gpt-5 +150,out_of_scope,6,1,gpt-5,gpt-5 +149,out_of_scope,5,0,gpt-5,gpt-5 +148,incomplete,6,1,gpt-5,gpt-5 +151,missing_context,6,1,gpt-5,gpt-5 +150,out_of_domain,6,1,gpt-5,gpt-5 +149,misspelled,6,1,gpt-5,gpt-5 +149,incomplete,6,1,gpt-5,gpt-5 +151,incomplete,6,1,gpt-5,gpt-5 +151,out_of_scope,5,0,gpt-5,gpt-5 +151,baseline,6,1,gpt-5,gpt-5 +151,misspelled,6,1,gpt-5,gpt-5 +152,missing_context,6,1,gpt-5,gpt-5 +152,misspelled,6,1,gpt-5,gpt-5 +152,out_of_scope,6,1,gpt-5,gpt-5 +152,out_of_domain,6,1,gpt-5,gpt-5 +150,ocr,6,1,gpt-5,gpt-5 +152,incomplete,6,1,gpt-5,gpt-5 +149,ocr,6,1,gpt-5,gpt-5 +153,baseline,6,1,gpt-5,gpt-5 +153,incomplete,5,0,gpt-5,gpt-5 +152,ocr,6,1,gpt-5,gpt-5 +152,baseline,6,1,gpt-5,gpt-5 +154,missing_context,6,1,gpt-5,gpt-5 +151,out_of_domain,6,1,gpt-5,gpt-5 +149,baseline,6,1,gpt-5,gpt-5 +153,missing_context,6,1,gpt-5,gpt-5 +151,ocr,6,1,gpt-5,gpt-5 +153,misspelled,6,1,gpt-5,gpt-5 +155,missing_context,6,1,gpt-5,gpt-5 +153,out_of_scope,6,1,gpt-5,gpt-5 +154,incomplete,5,0,gpt-5,gpt-5 +153,ocr,6,1,gpt-5,gpt-5 +153,out_of_domain,6,1,gpt-5,gpt-5 +154,out_of_scope,6,1,gpt-5,gpt-5 +154,out_of_domain,5,0,gpt-5,gpt-5 +154,ocr,5,0,gpt-5,gpt-5 +154,misspelled,5,0,gpt-5,gpt-5 +155,baseline,6,1,gpt-5,gpt-5 +154,baseline,5,0,gpt-5,gpt-5 +157,misspelled,5,0,gpt-5,gpt-5 +155,misspelled,6,1,gpt-5,gpt-5 +156,missing_context,6,1,gpt-5,gpt-5 +155,out_of_scope,5,0,gpt-5,gpt-5 +156,baseline,6,1,gpt-5,gpt-5 +157,incomplete,5,0,gpt-5,gpt-5 +155,ocr,6,1,gpt-5,gpt-5 +157,baseline,5,0,gpt-5,gpt-5 +157,missing_context,6,1,gpt-5,gpt-5 +156,misspelled,6,1,gpt-5,gpt-5 +156,out_of_domain,6,1,gpt-5,gpt-5 +155,incomplete,6,1,gpt-5,gpt-5 +158,missing_context,6,1,gpt-5,gpt-5 +157,out_of_domain,5,0,gpt-5,gpt-5 +157,out_of_scope,5,0,gpt-5,gpt-5 +155,out_of_domain,6,1,gpt-5,gpt-5 +156,out_of_scope,6,1,gpt-5,gpt-5 +158,baseline,5,0,gpt-5,gpt-5 +156,ocr,6,1,gpt-5,gpt-5 +159,misspelled,5,0,gpt-5,gpt-5 +159,missing_context,6,1,gpt-5,gpt-5 +157,ocr,3,0,gpt-5,gpt-5 +156,incomplete,6,1,gpt-5,gpt-5 +159,incomplete,5,0,gpt-5,gpt-5 +159,ocr,5,0,gpt-5,gpt-5 +159,baseline,5,0,gpt-5,gpt-5 +158,misspelled,6,1,gpt-5,gpt-5 +159,out_of_domain,5,0,gpt-5,gpt-5 +158,ocr,6,1,gpt-5,gpt-5 +159,out_of_scope,5,0,gpt-5,gpt-5 +158,incomplete,6,1,gpt-5,gpt-5 +161,missing_context,6,1,gpt-5,gpt-5 +160,out_of_scope,4,0,gpt-5,gpt-5 +161,out_of_scope,6,1,gpt-5,gpt-5 +158,out_of_domain,6,1,gpt-5,gpt-5 +162,misspelled,6,1,gpt-5,gpt-5 +162,missing_context,6,1,gpt-5,gpt-5 +158,out_of_scope,6,1,gpt-5,gpt-5 +162,incomplete,6,1,gpt-5,gpt-5 +160,out_of_domain,6,1,gpt-5,gpt-5 +162,out_of_domain,5,0,gpt-5,gpt-5 +161,incomplete,6,1,gpt-5,gpt-5 +161,misspelled,6,1,gpt-5,gpt-5 +160,baseline,6,1,gpt-5,gpt-5 +160,incomplete,6,1,gpt-5,gpt-5 +160,ocr,6,1,gpt-5,gpt-5 +161,out_of_domain,6,1,gpt-5,gpt-5 +163,missing_context,6,1,gpt-5,gpt-5 +162,ocr,5,0,gpt-5,gpt-5 +161,baseline,6,1,gpt-5,gpt-5 +163,out_of_scope,6,1,gpt-5,gpt-5 +161,ocr,6,1,gpt-5,gpt-5 +164,missing_context,6,1,gpt-5,gpt-5 +163,baseline,6,1,gpt-5,gpt-5 +160,misspelled,6,1,gpt-5,gpt-5 +164,misspelled,3,0,gpt-5,gpt-5 +162,out_of_scope,5,0,gpt-5,gpt-5 +164,ocr,5,0,gpt-5,gpt-5 +163,incomplete,3,0,gpt-5,gpt-5 +164,incomplete,3,0,gpt-5,gpt-5 +163,misspelled,6,1,gpt-5,gpt-5 +164,baseline,3,0,gpt-5,gpt-5 +164,out_of_domain,2,0,gpt-5,gpt-5 +164,out_of_scope,4,0,gpt-5,gpt-5 +166,baseline,5,0,gpt-5,gpt-5 +166,misspelled,5,0,gpt-5,gpt-5 +166,incomplete,6,1,gpt-5,gpt-5 +165,missing_context,6,1,gpt-5,gpt-5 +166,ocr,5,0,gpt-5,gpt-5 +166,out_of_domain,5,0,gpt-5,gpt-5 +166,missing_context,6,1,gpt-5,gpt-5 +163,out_of_domain,6,1,gpt-5,gpt-5 +165,misspelled,6,1,gpt-5,gpt-5 +167,misspelled,5,0,gpt-5,gpt-5 +160,missing_context,6,1,gpt-5,gpt-5 +165,ocr,6,1,gpt-5,gpt-5 +167,missing_context,6,1,gpt-5,gpt-5 +165,incomplete,6,1,gpt-5,gpt-5 +166,out_of_scope,6,1,gpt-5,gpt-5 +165,baseline,6,1,gpt-5,gpt-5 +165,out_of_domain,6,1,gpt-5,gpt-5 +167,baseline,5,0,gpt-5,gpt-5 +165,out_of_scope,5,0,gpt-5,gpt-5 +167,out_of_domain,5,0,gpt-5,gpt-5 +163,ocr,6,1,gpt-5,gpt-5 +167,out_of_scope,6,1,gpt-5,gpt-5 +167,incomplete,5,0,gpt-5,gpt-5 +168,incomplete,5,0,gpt-5,gpt-5 +168,baseline,5,0,gpt-5,gpt-5 +167,ocr,5,0,gpt-5,gpt-5 +169,baseline,5,0,gpt-5,gpt-5 +168,missing_context,6,1,gpt-5,gpt-5 +170,baseline,5,0,gpt-5,gpt-5 +169,missing_context,6,1,gpt-5,gpt-5 +169,misspelled,5,0,gpt-5,gpt-5 +168,misspelled,6,1,gpt-5,gpt-5 +169,incomplete,5,0,gpt-5,gpt-5 +169,out_of_domain,5,0,gpt-5,gpt-5 +170,out_of_domain,5,0,gpt-5,gpt-5 +168,ocr,6,1,gpt-5,gpt-5 +170,misspelled,5,0,gpt-5,gpt-5 +168,out_of_domain,6,1,gpt-5,gpt-5 +170,ocr,5,0,gpt-5,gpt-5 +171,missing_context,6,1,gpt-5,gpt-5 +170,incomplete,5,0,gpt-5,gpt-5 +170,missing_context,6,1,gpt-5,gpt-5 +169,ocr,3,0,gpt-5,gpt-5 +169,out_of_scope,6,1,gpt-5,gpt-5 +172,baseline,6,1,gpt-5,gpt-5 +171,incomplete,5,0,gpt-5,gpt-5 +172,missing_context,6,1,gpt-5,gpt-5 +172,misspelled,5,0,gpt-5,gpt-5 +171,ocr,5,0,gpt-5,gpt-5 +168,out_of_scope,5,0,gpt-5,gpt-5 +171,misspelled,5,0,gpt-5,gpt-5 +171,baseline,6,1,gpt-5,gpt-5 +171,out_of_scope,6,1,gpt-5,gpt-5 +172,incomplete,6,1,gpt-5,gpt-5 +171,out_of_domain,3,0,gpt-5,gpt-5 +172,ocr,5,0,gpt-5,gpt-5 +173,missing_context,6,1,gpt-5,gpt-5 +172,out_of_domain,6,1,gpt-5,gpt-5 +174,misspelled,5,0,gpt-5,gpt-5 +174,baseline,5,0,gpt-5,gpt-5 +174,missing_context,6,1,gpt-5,gpt-5 +174,ocr,5,0,gpt-5,gpt-5 +170,out_of_scope,5,0,gpt-5,gpt-5 +173,baseline,6,1,gpt-5,gpt-5 +172,out_of_scope,4,0,gpt-5,gpt-5 +175,missing_context,6,1,gpt-5,gpt-5 +174,incomplete,5,0,gpt-5,gpt-5 +173,incomplete,6,1,gpt-5,gpt-5 +174,out_of_domain,5,0,gpt-5,gpt-5 +173,ocr,6,1,gpt-5,gpt-5 +173,misspelled,6,1,gpt-5,gpt-5 +174,out_of_scope,6,1,gpt-5,gpt-5 +176,missing_context,6,1,gpt-5,gpt-5 +175,baseline,5,0,gpt-5,gpt-5 +173,out_of_scope,6,1,gpt-5,gpt-5 +162,baseline,5,0,gpt-5,gpt-5 +175,out_of_domain,4,0,gpt-5,gpt-5 +175,incomplete,6,1,gpt-5,gpt-5 +177,baseline,5,0,gpt-5,gpt-5 +176,baseline,4,0,gpt-5,gpt-5 +176,out_of_scope,6,1,gpt-5,gpt-5 +175,misspelled,4,0,gpt-5,gpt-5 +175,out_of_scope,4,0,gpt-5,gpt-5 +175,ocr,4,0,gpt-5,gpt-5 +176,misspelled,4,0,gpt-5,gpt-5 +173,out_of_domain,4,0,gpt-5,gpt-5 +176,incomplete,6,1,gpt-5,gpt-5 +177,ocr,5,0,gpt-5,gpt-5 +177,missing_context,6,1,gpt-5,gpt-5 +177,out_of_domain,5,0,gpt-5,gpt-5 +177,misspelled,5,0,gpt-5,gpt-5 +177,incomplete,5,0,gpt-5,gpt-5 +178,baseline,5,0,gpt-5,gpt-5 +178,misspelled,5,0,gpt-5,gpt-5 +178,missing_context,6,1,gpt-5,gpt-5 +179,baseline,5,0,gpt-5,gpt-5 +176,out_of_domain,4,0,gpt-5,gpt-5 +178,ocr,5,0,gpt-5,gpt-5 +179,out_of_domain,5,0,gpt-5,gpt-5 +178,out_of_domain,5,0,gpt-5,gpt-5 +179,ocr,5,0,gpt-5,gpt-5 +177,out_of_scope,6,1,gpt-5,gpt-5 +179,incomplete,6,1,gpt-5,gpt-5 +179,misspelled,5,0,gpt-5,gpt-5 +180,misspelled,5,0,gpt-5,gpt-5 +180,incomplete,5,0,gpt-5,gpt-5 +179,missing_context,6,1,gpt-5,gpt-5 +178,incomplete,5,0,gpt-5,gpt-5 +180,missing_context,6,1,gpt-5,gpt-5 +180,baseline,5,0,gpt-5,gpt-5 +176,ocr,4,0,gpt-5,gpt-5 +180,ocr,5,0,gpt-5,gpt-5 +180,out_of_domain,5,0,gpt-5,gpt-5 +178,out_of_scope,6,1,gpt-5,gpt-5 +181,out_of_domain,5,0,gpt-5,gpt-5 +181,baseline,6,1,gpt-5,gpt-5 +179,out_of_scope,6,1,gpt-5,gpt-5 +182,missing_context,6,1,gpt-5,gpt-5 +181,missing_context,6,1,gpt-5,gpt-5 +181,ocr,6,1,gpt-5,gpt-5 +183,misspelled,5,0,gpt-5,gpt-5 +180,out_of_scope,5,0,gpt-5,gpt-5 +182,out_of_scope,6,1,gpt-5,gpt-5 +183,incomplete,5,0,gpt-5,gpt-5 +183,ocr,5,0,gpt-5,gpt-5 +183,missing_context,6,1,gpt-5,gpt-5 +183,baseline,5,0,gpt-5,gpt-5 +181,misspelled,6,1,gpt-5,gpt-5 +183,out_of_domain,5,0,gpt-5,gpt-5 +184,missing_context,6,1,gpt-5,gpt-5 +181,incomplete,6,1,gpt-5,gpt-5 +182,misspelled,6,1,gpt-5,gpt-5 +184,misspelled,6,1,gpt-5,gpt-5 +182,out_of_domain,4,0,gpt-5,gpt-5 +184,out_of_scope,6,1,gpt-5,gpt-5 +184,incomplete,6,1,gpt-5,gpt-5 +183,out_of_scope,5,0,gpt-5,gpt-5 +181,out_of_scope,4,0,gpt-5,gpt-5 +184,ocr,6,1,gpt-5,gpt-5 +182,ocr,6,1,gpt-5,gpt-5 +184,out_of_domain,6,1,gpt-5,gpt-5 +184,baseline,6,1,gpt-5,gpt-5 +182,baseline,6,1,gpt-5,gpt-5 +185,missing_context,6,1,gpt-5,gpt-5 +186,missing_context,6,1,gpt-5,gpt-5 +185,ocr,6,1,gpt-5,gpt-5 +187,missing_context,6,1,gpt-5,gpt-5 +186,out_of_scope,6,1,gpt-5,gpt-5 +187,baseline,5,0,gpt-5,gpt-5 +186,out_of_domain,6,1,gpt-5,gpt-5 +185,out_of_domain,4,0,gpt-5,gpt-5 +186,ocr,6,1,gpt-5,gpt-5 +186,baseline,6,1,gpt-5,gpt-5 +185,out_of_scope,6,1,gpt-5,gpt-5 +187,ocr,5,0,gpt-5,gpt-5 +187,incomplete,5,0,gpt-5,gpt-5 +188,baseline,5,0,gpt-5,gpt-5 +185,baseline,6,1,gpt-5,gpt-5 +187,out_of_domain,6,1,gpt-5,gpt-5 +185,misspelled,3,0,gpt-5,gpt-5 +187,out_of_scope,6,1,gpt-5,gpt-5 +188,incomplete,6,1,gpt-5,gpt-5 +188,misspelled,6,1,gpt-5,gpt-5 +188,out_of_domain,5,0,gpt-5,gpt-5 +185,incomplete,6,1,gpt-5,gpt-5 +186,incomplete,6,1,gpt-5,gpt-5 +188,missing_context,6,1,gpt-5,gpt-5 +189,missing_context,6,1,gpt-5,gpt-5 +189,out_of_scope,6,1,gpt-5,gpt-5 +188,out_of_scope,5,0,gpt-5,gpt-5 +187,misspelled,5,0,gpt-5,gpt-5 +188,ocr,6,1,gpt-5,gpt-5 +190,missing_context,6,1,gpt-5,gpt-5 +189,misspelled,6,1,gpt-5,gpt-5 +190,out_of_domain,6,1,gpt-5,gpt-5 +190,baseline,6,1,gpt-5,gpt-5 +189,incomplete,6,1,gpt-5,gpt-5 +191,missing_context,6,1,gpt-5,gpt-5 +189,baseline,6,1,gpt-5,gpt-5 +190,out_of_scope,5,0,gpt-5,gpt-5 +190,ocr,6,1,gpt-5,gpt-5 +190,misspelled,6,1,gpt-5,gpt-5 +190,incomplete,6,1,gpt-5,gpt-5 +189,ocr,6,1,gpt-5,gpt-5 +189,out_of_domain,4,0,gpt-5,gpt-5 +191,out_of_scope,5,0,gpt-5,gpt-5 +192,missing_context,6,1,gpt-5,gpt-5 +191,misspelled,6,1,gpt-5,gpt-5 +193,misspelled,5,0,gpt-5,gpt-5 +193,baseline,4,0,gpt-5,gpt-5 +192,out_of_scope,5,0,gpt-5,gpt-5 +193,missing_context,6,1,gpt-5,gpt-5 +192,out_of_domain,6,1,gpt-5,gpt-5 +193,incomplete,5,0,gpt-5,gpt-5 +193,out_of_domain,5,0,gpt-5,gpt-5 +192,misspelled,6,1,gpt-5,gpt-5 +191,ocr,4,0,gpt-5,gpt-5 +191,baseline,4,0,gpt-5,gpt-5 +193,out_of_scope,6,1,gpt-5,gpt-5 +193,ocr,4,0,gpt-5,gpt-5 +191,incomplete,6,1,gpt-5,gpt-5 +192,incomplete,6,1,gpt-5,gpt-5 +192,baseline,6,1,gpt-5,gpt-5 +194,baseline,6,1,gpt-5,gpt-5 +191,out_of_domain,4,0,gpt-5,gpt-5 +194,misspelled,6,1,gpt-5,gpt-5 +195,missing_context,6,1,gpt-5,gpt-5 +194,out_of_domain,6,1,gpt-5,gpt-5 +192,ocr,6,1,gpt-5,gpt-5 +194,incomplete,6,1,gpt-5,gpt-5 +194,missing_context,6,1,gpt-5,gpt-5 +195,misspelled,5,0,gpt-5,gpt-5 +196,missing_context,6,1,gpt-5,gpt-5 +194,ocr,6,1,gpt-5,gpt-5 +195,baseline,5,0,gpt-5,gpt-5 +197,baseline,5,0,gpt-5,gpt-5 +195,incomplete,5,0,gpt-5,gpt-5 +194,out_of_scope,5,0,gpt-5,gpt-5 +195,out_of_scope,4,0,gpt-5,gpt-5 +197,misspelled,5,0,gpt-5,gpt-5 +197,incomplete,6,1,gpt-5,gpt-5 +197,missing_context,6,1,gpt-5,gpt-5 +195,out_of_domain,6,1,gpt-5,gpt-5 +197,ocr,5,0,gpt-5,gpt-5 +195,ocr,5,0,gpt-5,gpt-5 +197,out_of_domain,5,0,gpt-5,gpt-5 +196,ocr,5,0,gpt-5,gpt-5 +198,missing_context,6,1,gpt-5,gpt-5 +196,baseline,5,0,gpt-5,gpt-5 +199,incomplete,5,0,gpt-5,gpt-5 +199,baseline,5,0,gpt-5,gpt-5 +196,out_of_scope,6,1,gpt-5,gpt-5 +197,out_of_scope,5,0,gpt-5,gpt-5 +199,ocr,5,0,gpt-5,gpt-5 +199,missing_context,6,1,gpt-5,gpt-5 +199,out_of_domain,5,0,gpt-5,gpt-5 +199,misspelled,5,0,gpt-5,gpt-5 +196,misspelled,4,0,gpt-5,gpt-5 +196,out_of_domain,4,0,gpt-5,gpt-5 +198,baseline,6,1,gpt-5,gpt-5 +198,out_of_domain,6,1,gpt-5,gpt-5 +198,incomplete,6,1,gpt-5,gpt-5 +200,missing_context,6,1,gpt-5,gpt-5 +198,ocr,6,1,gpt-5,gpt-5 +196,incomplete,5,0,gpt-5,gpt-5 +199,out_of_scope,4,0,gpt-5,gpt-5 +200,misspelled,5,0,gpt-5,gpt-5 +200,out_of_domain,5,0,gpt-5,gpt-5 +200,baseline,5,0,gpt-5,gpt-5 +198,misspelled,6,1,gpt-5,gpt-5 +198,out_of_scope,6,1,gpt-5,gpt-5 +201,missing_context,6,1,gpt-5,gpt-5 +201,incomplete,5,0,gpt-5,gpt-5 +201,baseline,5,0,gpt-5,gpt-5 +200,incomplete,5,0,gpt-5,gpt-5 +201,ocr,5,0,gpt-5,gpt-5 +202,missing_context,6,1,gpt-5,gpt-5 +200,out_of_scope,6,1,gpt-5,gpt-5 +200,ocr,5,0,gpt-5,gpt-5 +201,misspelled,5,0,gpt-5,gpt-5 +201,out_of_domain,5,0,gpt-5,gpt-5 +203,missing_context,6,1,gpt-5,gpt-5 +201,out_of_scope,4,0,gpt-5,gpt-5 +202,out_of_scope,6,1,gpt-5,gpt-5 +203,misspelled,6,1,gpt-5,gpt-5 +202,baseline,6,1,gpt-5,gpt-5 +204,out_of_domain,5,0,gpt-5,gpt-5 +204,incomplete,5,0,gpt-5,gpt-5 +204,baseline,5,0,gpt-5,gpt-5 +204,missing_context,6,1,gpt-5,gpt-5 +203,out_of_scope,6,1,gpt-5,gpt-5 +204,misspelled,5,0,gpt-5,gpt-5 +204,ocr,5,0,gpt-5,gpt-5 +203,baseline,6,1,gpt-5,gpt-5 +202,out_of_domain,6,1,gpt-5,gpt-5 +203,out_of_domain,6,1,gpt-5,gpt-5 +202,misspelled,6,1,gpt-5,gpt-5 +203,ocr,6,1,gpt-5,gpt-5 +203,incomplete,6,1,gpt-5,gpt-5 +205,baseline,6,1,gpt-5,gpt-5 +205,misspelled,6,1,gpt-5,gpt-5 +205,ocr,6,1,gpt-5,gpt-5 +206,incomplete,5,0,gpt-5,gpt-5 +206,missing_context,6,1,gpt-5,gpt-5 +206,baseline,5,0,gpt-5,gpt-5 +205,out_of_domain,6,1,gpt-5,gpt-5 +202,ocr,6,1,gpt-5,gpt-5 +202,incomplete,6,1,gpt-5,gpt-5 +205,missing_context,6,1,gpt-5,gpt-5 +206,misspelled,5,0,gpt-5,gpt-5 +206,out_of_domain,5,0,gpt-5,gpt-5 +207,missing_context,6,1,gpt-5,gpt-5 +205,out_of_scope,6,1,gpt-5,gpt-5 +207,out_of_scope,6,1,gpt-5,gpt-5 +206,out_of_scope,5,0,gpt-5,gpt-5 +208,missing_context,6,1,gpt-5,gpt-5 +206,ocr,5,0,gpt-5,gpt-5 +204,out_of_scope,6,1,gpt-5,gpt-5 +205,incomplete,6,1,gpt-5,gpt-5 +209,misspelled,6,1,gpt-5,gpt-5 +209,baseline,6,1,gpt-5,gpt-5 +208,out_of_scope,4,0,gpt-5,gpt-5 +208,baseline,2,0,gpt-5,gpt-5 +208,out_of_domain,6,1,gpt-5,gpt-5 +207,incomplete,6,1,gpt-5,gpt-5 +209,incomplete,6,1,gpt-5,gpt-5 +207,misspelled,6,1,gpt-5,gpt-5 +209,missing_context,6,1,gpt-5,gpt-5 +207,baseline,6,1,gpt-5,gpt-5 +209,out_of_domain,6,1,gpt-5,gpt-5 +208,incomplete,4,0,gpt-5,gpt-5 +209,ocr,6,1,gpt-5,gpt-5 +208,ocr,6,1,gpt-5,gpt-5 +210,missing_context,6,1,gpt-5,gpt-5 +207,ocr,6,1,gpt-5,gpt-5 +211,missing_context,6,1,gpt-5,gpt-5 +210,incomplete,6,1,gpt-5,gpt-5 +208,misspelled,6,1,gpt-5,gpt-5 +209,out_of_scope,6,1,gpt-5,gpt-5 +210,misspelled,6,1,gpt-5,gpt-5 +210,baseline,6,1,gpt-5,gpt-5 +211,baseline,3,0,gpt-5,gpt-5 +210,out_of_scope,6,1,gpt-5,gpt-5 +211,ocr,3,0,gpt-5,gpt-5 +210,ocr,3,0,gpt-5,gpt-5 +212,missing_context,6,1,gpt-5,gpt-5 +211,out_of_domain,3,0,gpt-5,gpt-5 +207,out_of_domain,6,1,gpt-5,gpt-5 +210,out_of_domain,5,0,gpt-5,gpt-5 +211,out_of_scope,4,0,gpt-5,gpt-5 +213,missing_context,6,1,gpt-5,gpt-5 +211,misspelled,3,0,gpt-5,gpt-5 +213,baseline,6,1,gpt-5,gpt-5 +213,incomplete,6,1,gpt-5,gpt-5 +211,incomplete,3,0,gpt-5,gpt-5 +212,incomplete,6,1,gpt-5,gpt-5 +212,out_of_scope,5,0,gpt-5,gpt-5 +213,ocr,5,0,gpt-5,gpt-5 +213,out_of_domain,6,1,gpt-5,gpt-5 +214,baseline,5,0,gpt-5,gpt-5 +212,ocr,6,1,gpt-5,gpt-5 +212,baseline,6,1,gpt-5,gpt-5 +213,misspelled,5,0,gpt-5,gpt-5 +212,misspelled,6,1,gpt-5,gpt-5 +212,out_of_domain,6,1,gpt-5,gpt-5 +214,missing_context,6,1,gpt-5,gpt-5 +215,misspelled,5,0,gpt-5,gpt-5 +214,out_of_scope,6,1,gpt-5,gpt-5 +214,ocr,5,0,gpt-5,gpt-5 +215,incomplete,5,0,gpt-5,gpt-5 +215,baseline,5,0,gpt-5,gpt-5 +214,incomplete,6,1,gpt-5,gpt-5 +216,missing_context,6,1,gpt-5,gpt-5 +215,ocr,3,0,gpt-5,gpt-5 +215,missing_context,6,1,gpt-5,gpt-5 +216,out_of_scope,6,1,gpt-5,gpt-5 +214,out_of_domain,6,1,gpt-5,gpt-5 +217,incomplete,6,1,gpt-5,gpt-5 +215,out_of_domain,5,0,gpt-5,gpt-5 +217,out_of_domain,6,1,gpt-5,gpt-5 +217,misspelled,6,1,gpt-5,gpt-5 +215,out_of_scope,5,0,gpt-5,gpt-5 +214,misspelled,5,0,gpt-5,gpt-5 +217,baseline,6,1,gpt-5,gpt-5 +218,missing_context,6,1,gpt-5,gpt-5 +216,ocr,6,1,gpt-5,gpt-5 +216,misspelled,6,1,gpt-5,gpt-5 +217,out_of_scope,6,1,gpt-5,gpt-5 +217,missing_context,6,1,gpt-5,gpt-5 +216,baseline,4,0,gpt-5,gpt-5 +213,out_of_scope,6,1,gpt-5,gpt-5 +217,ocr,6,1,gpt-5,gpt-5 +216,incomplete,6,1,gpt-5,gpt-5 +218,baseline,6,1,gpt-5,gpt-5 +218,misspelled,6,1,gpt-5,gpt-5 +218,incomplete,6,1,gpt-5,gpt-5 +219,baseline,5,0,gpt-5,gpt-5 +219,misspelled,5,0,gpt-5,gpt-5 +219,missing_context,6,1,gpt-5,gpt-5 +216,out_of_domain,6,1,gpt-5,gpt-5 +218,out_of_domain,6,1,gpt-5,gpt-5 +218,ocr,6,1,gpt-5,gpt-5 +219,ocr,5,0,gpt-5,gpt-5 +219,incomplete,6,1,gpt-5,gpt-5 +218,out_of_scope,6,1,gpt-5,gpt-5 +219,out_of_domain,6,1,gpt-5,gpt-5 +219,out_of_scope,5,0,gpt-5,gpt-5 +182,incomplete,6,1,gpt-5,gpt-5 +186,misspelled,6,1,gpt-5,gpt-5 diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_failsafeqa_optimized.csv b/examples/gpt-5/prompt-optimization-cookbook/results_failsafeqa_optimized.csv new file mode 100644 index 0000000000..d1437f1c65 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_failsafeqa_optimized.csv @@ -0,0 +1,8 @@ +idx,kind,rating,compliance,answer_model,judge_model +0,missing_context,6,1,gpt-5,gpt-5 +0,out_of_domain,5,0,gpt-5,gpt-5 +0,baseline,5,0,gpt-5,gpt-5 +0,misspelled,5,0,gpt-5,gpt-5 +0,incomplete,5,0,gpt-5,gpt-5 +0,out_of_scope,6,1,gpt-5,gpt-5 +0,ocr,5,0,gpt-5,gpt-5 diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/judgement_summary.csv b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/judgement_summary.csv new file mode 100644 index 0000000000..8374eb3c76 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/judgement_summary.csv @@ -0,0 +1,31 @@ +File,adherence_score,code_quality_score,parse_error,error +run_01.py,5,5,, +run_02.py,5,5,, +run_03.py,4,5,, +run_04.py,5,5,, +run_05.py,4,5,, +run_06.py,4,5,, +run_07.py,4,5,, +run_08.py,5,5,, +run_09.py,4,4,, +run_10.py,4,5,, +run_11.py,5,5,, +run_12.py,5,5,, +run_13.py,4,4,, +run_14.py,4,5,, +run_15.py,4,5,, +run_16.py,4,5,, +run_17.py,5,5,, +run_18.py,5,4,, +run_19.py,5,5,, +run_20.py,4,4,, +run_21.py,5,5,, +run_22.py,4,4,, +run_23.py,4,5,, +run_24.py,4,4,, +run_25.py,4,4,, +run_26.py,4,4,, +run_27.py,4,5,, +run_28.py,5,5,, +run_29.py,4,5,, +run_30.py,5,5,, diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_01.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_01.json new file mode 100644 index 0000000000..7af7659b17 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_01.json @@ -0,0 +1,16 @@ +{ + "file": "run_01.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against the instructions:\\n- Tokenization: The code lowercases the text (s = text.lower()) and uses a regex pattern r'[a-z0-9]+' with re.ASCII to extract tokens, treating non-matching chars as separators. This matches the tokenization specification exactly.\\n- Inputs: It reads the provided globals via globals().get('text', '') and globals().get('k', 0) without reassigning them. It copies them into _text and _k, which complies with the \\\"Do not reassign them\\\" requirement.\\n- Sorting requirement (Top-K by count desc, then token asc): It computes counts in a dict, then sets n = min(k, len(counts)). It uses heapq.nsmallest(n, counts.items(), key=lambda it: (-it[1], it[0])) which yields items sorted by (-count, token), i.e., count descending then token ascending. This matches the specified sort key.\\n- Output shape: The function returns a list of (token, count) tuples from counts.items(), and top_k is assigned to that list. Length is min(k, unique tokens) by design via n. This satisfies the output contract (list of tuples, correct length).\\n- Edge case k <= 0: Handled in two places\u2014early return in compute_top_k when k <= 0, and again after computing n. In both cases, the result is []. This fulfills the requirement. It also returns [] for empty text, which is consistent with expectations though not explicitly required.\\n- No I/O or network: The script performs no printing, I/O, or network access. It runs as-is with the provided globals.\\n- Tie-ordering: It does not use Counter.most_common and explicitly implements the specified sort order via the key function.\\n- End state: top_k is defined at module end as required. The optional printing clause is not used (which is allowed). The inclusion of __all__ = ['top_k'] is acceptable and does not violate any requirement.\\nOverall, the code cleanly satisfies all specified requirements with no violations or omissions.\",\n \"code_quality\": \"The output is code; assessment:\\n- Correctness: The counting, tokenization, and ordering logic are correct. heapq.nsmallest with key (-count, token) yields the correct Top-K ordering.\\n- Clarity/Readability: Code is reasonably clear with concise comments. Variable names are short but understandable. A return type hint for compute_top_k could improve clarity but is not essential.\\n- Efficiency: Using re.finditer for tokenization and heapq.nsmallest for top-k selection is efficient (O(U log K) for selection). Counting via a dict is standard and efficient.\\n- Structure/Maintainability: Separation into a helper function and a small conversion helper is tidy. No unnecessary complexity. Avoids reliance on unstable tie-ordering.\\n- Best practices: No I/O, no side effects beyond defining top_k. Safe handling of k conversion. Use of re.ASCII ensures correct token class semantics.\\nNo bugs or inefficiencies detected; style is solid overall.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely and produces the correct Top-K list. Clean, efficient implementation with correct tokenization and ordering. Minor optional improvement: add a return type annotation for compute_top_k for readability.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against the instructions:\n- Tokenization: The code lowercases the text (s = text.lower()) and uses a regex pattern r'[a-z0-9]+' with re.ASCII to extract tokens, treating non-matching chars as separators. This matches the tokenization specification exactly.\n- Inputs: It reads the provided globals via globals().get('text', '') and globals().get('k', 0) without reassigning them. It copies them into _text and _k, which complies with the \"Do not reassign them\" requirement.\n- Sorting requirement (Top-K by count desc, then token asc): It computes counts in a dict, then sets n = min(k, len(counts)). It uses heapq.nsmallest(n, counts.items(), key=lambda it: (-it[1], it[0])) which yields items sorted by (-count, token), i.e., count descending then token ascending. This matches the specified sort key.\n- Output shape: The function returns a list of (token, count) tuples from counts.items(), and top_k is assigned to that list. Length is min(k, unique tokens) by design via n. This satisfies the output contract (list of tuples, correct length).\n- Edge case k <= 0: Handled in two places\u2014early return in compute_top_k when k <= 0, and again after computing n. In both cases, the result is []. This fulfills the requirement. It also returns [] for empty text, which is consistent with expectations though not explicitly required.\n- No I/O or network: The script performs no printing, I/O, or network access. It runs as-is with the provided globals.\n- Tie-ordering: It does not use Counter.most_common and explicitly implements the specified sort order via the key function.\n- End state: top_k is defined at module end as required. The optional printing clause is not used (which is allowed). The inclusion of __all__ = ['top_k'] is acceptable and does not violate any requirement.\nOverall, the code cleanly satisfies all specified requirements with no violations or omissions.", + "code_quality": "The output is code; assessment:\n- Correctness: The counting, tokenization, and ordering logic are correct. heapq.nsmallest with key (-count, token) yields the correct Top-K ordering.\n- Clarity/Readability: Code is reasonably clear with concise comments. Variable names are short but understandable. A return type hint for compute_top_k could improve clarity but is not essential.\n- Efficiency: Using re.finditer for tokenization and heapq.nsmallest for top-k selection is efficient (O(U log K) for selection). Counting via a dict is standard and efficient.\n- Structure/Maintainability: Separation into a helper function and a small conversion helper is tidy. No unnecessary complexity. Avoids reliance on unstable tie-ordering.\n- Best practices: No I/O, no side effects beyond defining top_k. Safe handling of k conversion. Use of re.ASCII ensures correct token class semantics.\nNo bugs or inefficiencies detected; style is solid overall." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely and produces the correct Top-K list. Clean, efficient implementation with correct tokenization and ordering. Minor optional improvement: add a return type annotation for compute_top_k for readability." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_02.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_02.json new file mode 100644 index 0000000000..af2995d0dd --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_02.json @@ -0,0 +1,16 @@ +{ + "file": "run_02.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Inputs and globals: The code uses the provided globals text and k only to compute top_k at the end and does not reassign them. Good.\\n- Tokenization: Uses regex r'[a-z0-9]+' with re.ASCII | re.IGNORECASE and lowercases tokens via tok = m.group(0).lower(). This matches the requirement: tokens are ASCII [a-z0-9]+ sequences; all other characters act as separators; tokens are normalized to lowercase.\\n- Counting: Iterates over all regex matches and counts frequencies in a dict. Correct.\\n- Sorting: Sorts items with key=lambda kv: (-kv[1], kv[0]) which exactly implements sort by count desc, then token asc. Correct.\\n- Top-K length: Slices to items[:min(k, len(items))], satisfying length = min(k, unique tokens).\\n- Edge case k <= 0: Early return [] if k <= 0. This ensures top_k = [] for non-positive k. Correct.\\n- Output contract: Defines top_k at the end as a list of (token, count) tuples. No extraneous printing; printing is optional per spec. Correct.\\n- Prohibitions: No input(), file I/O, or network access. Does not rely on Counter.most_common. Correct.\\n- Additional checks: The function also guards against non-str text or non-int k by returning [], which is not required but does not violate the spec.\\nOverall, the implementation follows all specified requirements precisely.\",\n \"code_quality\": \"- Clarity and readability: The code is clean and readable. Type hints are provided. Variable names are descriptive. A brief comment explains tokenization.\\n- Correctness: Logic for tokenization, counting, sorting, and slicing is correct and matches the task.\\n- Efficiency: Uses a single pass to count and a sort for Top-K. Reasonable for general cases. Local binding get = counts.get is a small micro-optimization.\\n- Maintainability: Straightforward structure with a helper function and a single assignment to top_k. Easy to test and reuse.\\n- Minor nits: Using re.IGNORECASE plus .lower() is slightly redundant; using get(tok, 0) would be a tad clearer than (get(tok) or 0). The early empty-counts check is unnecessary because sorting an empty list and slicing would already yield []. None of these affect correctness or significantly harm readability.\\nNo bugs or anti-patterns detected.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, sorting, edge cases, and output shape. Code is clear, correct, and efficient. Only very minor stylistic redundancies.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Inputs and globals: The code uses the provided globals text and k only to compute top_k at the end and does not reassign them. Good.\n- Tokenization: Uses regex r'[a-z0-9]+' with re.ASCII | re.IGNORECASE and lowercases tokens via tok = m.group(0).lower(). This matches the requirement: tokens are ASCII [a-z0-9]+ sequences; all other characters act as separators; tokens are normalized to lowercase.\n- Counting: Iterates over all regex matches and counts frequencies in a dict. Correct.\n- Sorting: Sorts items with key=lambda kv: (-kv[1], kv[0]) which exactly implements sort by count desc, then token asc. Correct.\n- Top-K length: Slices to items[:min(k, len(items))], satisfying length = min(k, unique tokens).\n- Edge case k <= 0: Early return [] if k <= 0. This ensures top_k = [] for non-positive k. Correct.\n- Output contract: Defines top_k at the end as a list of (token, count) tuples. No extraneous printing; printing is optional per spec. Correct.\n- Prohibitions: No input(), file I/O, or network access. Does not rely on Counter.most_common. Correct.\n- Additional checks: The function also guards against non-str text or non-int k by returning [], which is not required but does not violate the spec.\nOverall, the implementation follows all specified requirements precisely.", + "code_quality": "- Clarity and readability: The code is clean and readable. Type hints are provided. Variable names are descriptive. A brief comment explains tokenization.\n- Correctness: Logic for tokenization, counting, sorting, and slicing is correct and matches the task.\n- Efficiency: Uses a single pass to count and a sort for Top-K. Reasonable for general cases. Local binding get = counts.get is a small micro-optimization.\n- Maintainability: Straightforward structure with a helper function and a single assignment to top_k. Easy to test and reuse.\n- Minor nits: Using re.IGNORECASE plus .lower() is slightly redundant; using get(tok, 0) would be a tad clearer than (get(tok) or 0). The early empty-counts check is unnecessary because sorting an empty list and slicing would already yield []. None of these affect correctness or significantly harm readability.\nNo bugs or anti-patterns detected." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, sorting, edge cases, and output shape. Code is clear, correct, and efficient. Only very minor stylistic redundancies." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_03.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_03.json new file mode 100644 index 0000000000..d56980fddb --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_03.json @@ -0,0 +1,16 @@ +{ + "file": "run_03.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Tokenization: The code lowercases letters during scanning and collects only ASCII [a-z0-9] sequences, treating all other characters as separators. This matches the required tokenization semantics. It does not use a regex as suggested, but the implemented scanner is functionally equivalent.\\n- Inputs/globals: It reads the provided globals text and k without reassigning them, copying into local _text and _k. It safely handles NameError and non-int k by defaulting to 0, which keeps behavior well-defined. This respects the \\\"Do not reassign them\\\" constraint.\\n- Requirement 1 (ordering): Uses heapq.nsmallest with key = (-count, token), which yields a list ordered by count descending, then token ascending. This fulfills the exact specified sort order without relying on Counter.most_common.\\n- Requirement 2 (top_k content/length): top_k is set to a list of (token, count) tuples returned from counts.items(), with length m = min(k, number of unique tokens). Correct.\\n- Requirement 3 (edge cases): If k <= 0, _top_k_from_counts returns []. Correct.\\n- Requirement 4 (no I/O): No input(), file, or network access is used. The script runs as-is with provided globals.\\n- Output contract: top_k is defined at the end exactly as described. Printing is omitted, which is allowed since printing is optional and constrained only if used.\\n- Note on tie-ordering: The code explicitly implements the specified ordering and does not rely on Counter.\\nMinor deviation: The instruction suggests using a regex for tokenization; the solution uses a manual scanner instead. Since behavior aligns with the spec, this is a minor stylistic deviation rather than a functional issue.\",\n \"code_quality\": \"No code to write is required beyond the script itself, so assessing provided code:\\n- Correctness: The tokenization logic correctly accumulates ASCII [a-z0-9]+ tokens, lowercasing A-Z via ordinal manipulation, and flushes on separators and at end. Sorting and selection logic is correct via heapq.nsmallest with key (-count, token).\\n- Clarity/readability: Functions are small and well-named. Inline comments explain intent. The manual ASCII checks with ord() are explicit and clear.\\n- Efficiency: Single pass O(n) tokenization; selection is O(U log m) for U unique tokens and m = min(k, U), which is efficient. Using nsmallest avoids full sort when k << U.\\n- Structure/maintainability: Helper functions separate concerns. Avoids reliance on unspecified tie behavior. Minimal, clean dependencies (only heapq). Type hints could be slightly more specific (e.g., Dict[str, int]) but not critical.\\n- Stylistic considerations: Micro-optimization (binding append) is fine but not necessary. Not using regex deviates from the instruction suggestion but does not harm quality.\\nNo bugs or edge-case gaps identified.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally correct and robust solution that meets all requirements, including exact ordering and edge-case handling. Minor deviation from the \\\"use a regex\\\" suggestion prevents a perfect adherence score; otherwise, code quality is excellent.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Tokenization: The code lowercases letters during scanning and collects only ASCII [a-z0-9] sequences, treating all other characters as separators. This matches the required tokenization semantics. It does not use a regex as suggested, but the implemented scanner is functionally equivalent.\n- Inputs/globals: It reads the provided globals text and k without reassigning them, copying into local _text and _k. It safely handles NameError and non-int k by defaulting to 0, which keeps behavior well-defined. This respects the \"Do not reassign them\" constraint.\n- Requirement 1 (ordering): Uses heapq.nsmallest with key = (-count, token), which yields a list ordered by count descending, then token ascending. This fulfills the exact specified sort order without relying on Counter.most_common.\n- Requirement 2 (top_k content/length): top_k is set to a list of (token, count) tuples returned from counts.items(), with length m = min(k, number of unique tokens). Correct.\n- Requirement 3 (edge cases): If k <= 0, _top_k_from_counts returns []. Correct.\n- Requirement 4 (no I/O): No input(), file, or network access is used. The script runs as-is with provided globals.\n- Output contract: top_k is defined at the end exactly as described. Printing is omitted, which is allowed since printing is optional and constrained only if used.\n- Note on tie-ordering: The code explicitly implements the specified ordering and does not rely on Counter.\nMinor deviation: The instruction suggests using a regex for tokenization; the solution uses a manual scanner instead. Since behavior aligns with the spec, this is a minor stylistic deviation rather than a functional issue.", + "code_quality": "No code to write is required beyond the script itself, so assessing provided code:\n- Correctness: The tokenization logic correctly accumulates ASCII [a-z0-9]+ tokens, lowercasing A-Z via ordinal manipulation, and flushes on separators and at end. Sorting and selection logic is correct via heapq.nsmallest with key (-count, token).\n- Clarity/readability: Functions are small and well-named. Inline comments explain intent. The manual ASCII checks with ord() are explicit and clear.\n- Efficiency: Single pass O(n) tokenization; selection is O(U log m) for U unique tokens and m = min(k, U), which is efficient. Using nsmallest avoids full sort when k << U.\n- Structure/maintainability: Helper functions separate concerns. Avoids reliance on unspecified tie behavior. Minimal, clean dependencies (only heapq). Type hints could be slightly more specific (e.g., Dict[str, int]) but not critical.\n- Stylistic considerations: Micro-optimization (binding append) is fine but not necessary. Not using regex deviates from the instruction suggestion but does not harm quality.\nNo bugs or edge-case gaps identified." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally correct and robust solution that meets all requirements, including exact ordering and edge-case handling. Minor deviation from the \"use a regex\" suggestion prevents a perfect adherence score; otherwise, code quality is excellent." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_04.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_04.json new file mode 100644 index 0000000000..335266bcc3 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_04.json @@ -0,0 +1,16 @@ +{ + "file": "run_04.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Tokenization: Uses regex [A-Za-z0-9]+ with re.finditer and lowercases each matched token via .lower(). This matches the requirement to lowercase and use ASCII [a-z0-9]+ sequences with other chars as separators.\\n- Inputs: The function compute_top_k accepts s and k_value; at the end, top_k = compute_top_k(text, k) uses the provided globals without reassigning them.\\n- Sorting/Top-K: Builds counts dict, then computes a composite key key = lambda item: (-item[1], item[0]) and uses heapq.nsmallest(n, ...) to select the n items with smallest key, which corresponds to highest counts then lexicographically smallest tokens. nsmallest returns the list sorted by the key, so the final order is correct: count desc, token asc.\\n- Length of top_k: n is set to max(0, int(k_value)), and nsmallest will return min(n, number of unique tokens) items, satisfying the required length.\\n- Edge cases: If k <= 0, n becomes 0 and the function returns []. If counts is empty, also returns []. Both match the specified behavior.\\n- No disallowed I/O: No input(), file I/O, or network access used.\\n- Output contract: Defines top_k exactly as specified; no extra prints (printing is optional and constrained, so omitting print is acceptable).\\n- Tie ordering: Does not use Counter.most_common; explicitly implements required sort key.\\nOverall, all task requirements are fully met with no deviations.\",\n \"code_quality\": \"Clarity and readability: The code is clear, with meaningful names and comments explaining key decisions. Type hints are present for function parameters.\\nCorrectness: Logic correctly tokenizes, counts, handles edge cases, and orders results per the required sort key. Returns a list of (token, count) tuples.\\nEfficiency: Uses a hash map for counting and heapq.nsmallest with a composite key for Top-K selection, giving O(U log k) behavior when k << unique tokens, which is efficient.\\nStructure and maintainability: The functionality is encapsulated in a small function; minimal globals usage; no side effects. Comments accurately describe behavior (including nsmallest ordering).\\nBest practices: Avoids reliance on unspecified tie-breaking. Uses regex appropriately. No unnecessary I/O. Minor note: int(k_value) cast is slightly defensive given k is specified as int, but harmless. Also could precompile the regex for repeated calls, but not necessary here.\\nNo bugs or stylistic issues identified that affect behavior.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence: correct tokenization, sorting, edge-case handling, and output contract. Code is clear, efficient, and maintainable. No issues found.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Tokenization: Uses regex [A-Za-z0-9]+ with re.finditer and lowercases each matched token via .lower(). This matches the requirement to lowercase and use ASCII [a-z0-9]+ sequences with other chars as separators.\n- Inputs: The function compute_top_k accepts s and k_value; at the end, top_k = compute_top_k(text, k) uses the provided globals without reassigning them.\n- Sorting/Top-K: Builds counts dict, then computes a composite key key = lambda item: (-item[1], item[0]) and uses heapq.nsmallest(n, ...) to select the n items with smallest key, which corresponds to highest counts then lexicographically smallest tokens. nsmallest returns the list sorted by the key, so the final order is correct: count desc, token asc.\n- Length of top_k: n is set to max(0, int(k_value)), and nsmallest will return min(n, number of unique tokens) items, satisfying the required length.\n- Edge cases: If k <= 0, n becomes 0 and the function returns []. If counts is empty, also returns []. Both match the specified behavior.\n- No disallowed I/O: No input(), file I/O, or network access used.\n- Output contract: Defines top_k exactly as specified; no extra prints (printing is optional and constrained, so omitting print is acceptable).\n- Tie ordering: Does not use Counter.most_common; explicitly implements required sort key.\nOverall, all task requirements are fully met with no deviations.", + "code_quality": "Clarity and readability: The code is clear, with meaningful names and comments explaining key decisions. Type hints are present for function parameters.\nCorrectness: Logic correctly tokenizes, counts, handles edge cases, and orders results per the required sort key. Returns a list of (token, count) tuples.\nEfficiency: Uses a hash map for counting and heapq.nsmallest with a composite key for Top-K selection, giving O(U log k) behavior when k << unique tokens, which is efficient.\nStructure and maintainability: The functionality is encapsulated in a small function; minimal globals usage; no side effects. Comments accurately describe behavior (including nsmallest ordering).\nBest practices: Avoids reliance on unspecified tie-breaking. Uses regex appropriately. No unnecessary I/O. Minor note: int(k_value) cast is slightly defensive given k is specified as int, but harmless. Also could precompile the regex for repeated calls, but not necessary here.\nNo bugs or stylistic issues identified that affect behavior." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence: correct tokenization, sorting, edge-case handling, and output contract. Code is clear, efficient, and maintainable. No issues found." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_05.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_05.json new file mode 100644 index 0000000000..4fff225c10 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_05.json @@ -0,0 +1,16 @@ +{ + "file": "run_05.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Reads and implements the core task: computes Top-K most frequent tokens from given text.\\n- Tokenization rules:\\n - Lowercasing: Achieved by manually lowercasing ASCII A-Z during tokenization. Although it doesn't lowercase the entire text up front, the outcome matches the requirement to lowercase tokens.\\n - Token definition: Correctly treats tokens as ASCII [a-z0-9]+ sequences and treats all other characters (including non-ASCII) as separators. This aligns with the specified tokenization semantics.\\n - However, the instruction explicitly says to \\\"use a regex\\\" for tokenization. The solution uses a manual scanner, not a regex. Functionally correct, but does not follow the specific method requested.\\n- Inputs and globals:\\n - Uses the provided globals `text` and `k` without reassigning them.\\n - Wraps access in try/except NameError to ensure `top_k` is defined if globals are missing; this is acceptable and does not violate the requirements.\\n- Requirements compliance:\\n 1) Sorting: Uses sorted with key (-count, token), exactly as required.\\n 2) Output structure: Produces a list of (token, count) tuples, length min(k, unique tokens) via slicing; correct.\\n 3) Edge case k <= 0: Returns [] early; correct.\\n 4) No input/file/network I/O: Satisfied; the script runs as-is.\\n- Output contract: Ensures `top_k` is defined by the end in all cases. No extraneous printing (printing is optional and not used).\\n- Tie handling: Does not rely on Counter.most_common; explicitly implements the specified sort; correct.\\n- Summary: All functional requirements are met. The only deviation is not using a regex for tokenization as explicitly instructed.\",\n \"code_quality\": \"- Clarity and readability: Well-structured with clear helper names and comments describing tokenization.\\n- Correctness: Tokenization logic correctly handles ASCII letters and digits, lowercases A-Z, and treats all else as separators. Sorting and counting are correct.\\n- Efficiency: Streaming tokenization with a small buffer; uses defaultdict for counting. Sorting once with specified key is fine. Micro-optimizations (local append binding, buf.clear) are appropriate.\\n- Maintainability: Functions are small and focused. Type hints on parameters present; return types could be added but are not critical.\\n- Best practices: Avoids reliance on Counter.most_common tie behavior as requested. No unnecessary I/O. The try/except for missing globals makes the script robust.\\n- Minor note: The instruction asked for a regex-based tokenizer; while the manual approach is efficient and clear, it diverges from the prescribed method (this affects adherence more than code quality).\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally excellent and meets all output and sorting requirements, including edge cases. The only shortcoming is not using a regex for tokenization as explicitly requested.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Reads and implements the core task: computes Top-K most frequent tokens from given text.\n- Tokenization rules:\n - Lowercasing: Achieved by manually lowercasing ASCII A-Z during tokenization. Although it doesn't lowercase the entire text up front, the outcome matches the requirement to lowercase tokens.\n - Token definition: Correctly treats tokens as ASCII [a-z0-9]+ sequences and treats all other characters (including non-ASCII) as separators. This aligns with the specified tokenization semantics.\n - However, the instruction explicitly says to \"use a regex\" for tokenization. The solution uses a manual scanner, not a regex. Functionally correct, but does not follow the specific method requested.\n- Inputs and globals:\n - Uses the provided globals `text` and `k` without reassigning them.\n - Wraps access in try/except NameError to ensure `top_k` is defined if globals are missing; this is acceptable and does not violate the requirements.\n- Requirements compliance:\n 1) Sorting: Uses sorted with key (-count, token), exactly as required.\n 2) Output structure: Produces a list of (token, count) tuples, length min(k, unique tokens) via slicing; correct.\n 3) Edge case k <= 0: Returns [] early; correct.\n 4) No input/file/network I/O: Satisfied; the script runs as-is.\n- Output contract: Ensures `top_k` is defined by the end in all cases. No extraneous printing (printing is optional and not used).\n- Tie handling: Does not rely on Counter.most_common; explicitly implements the specified sort; correct.\n- Summary: All functional requirements are met. The only deviation is not using a regex for tokenization as explicitly instructed.", + "code_quality": "- Clarity and readability: Well-structured with clear helper names and comments describing tokenization.\n- Correctness: Tokenization logic correctly handles ASCII letters and digits, lowercases A-Z, and treats all else as separators. Sorting and counting are correct.\n- Efficiency: Streaming tokenization with a small buffer; uses defaultdict for counting. Sorting once with specified key is fine. Micro-optimizations (local append binding, buf.clear) are appropriate.\n- Maintainability: Functions are small and focused. Type hints on parameters present; return types could be added but are not critical.\n- Best practices: Avoids reliance on Counter.most_common tie behavior as requested. No unnecessary I/O. The try/except for missing globals makes the script robust.\n- Minor note: The instruction asked for a regex-based tokenizer; while the manual approach is efficient and clear, it diverges from the prescribed method (this affects adherence more than code quality)." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally excellent and meets all output and sorting requirements, including edge cases. The only shortcoming is not using a regex for tokenization as explicitly requested." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_06.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_06.json new file mode 100644 index 0000000000..caa155101d --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_06.json @@ -0,0 +1,16 @@ +{ + "file": "run_06.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization requirements:\\n - Lowercasing: The implementation lowercases ASCII A-Z to a-z during tokenization, so all produced tokens are lowercase. This satisfies the effective requirement that tokens be lowercase.\\n - Token definition: It treats only ASCII [a-z0-9] as part of tokens and all other characters as separators. This matches the specified tokenization.\\n - Note: The instructions explicitly say to \\\"use a regex\\\". The solution uses a manual scanner instead of a regex. While the behavior matches the spec, this deviates from the stated method.\\n\\n- Inputs/globals:\\n - It does not reassign the provided globals text and k. It reads them into _text and _k via try/except, which is acceptable. The fallback to defaults if globals are missing goes beyond the spec (since the task states they are provided) but does not violate any requirement.\\n\\n- Requirements:\\n 1) Sorting by count desc, then token asc: Achieved via heapq.nsmallest with key = (-count, token). The returned list is in ascending key order, which corresponds to the desired order. Correct.\\n 2) Set top_k to list of (token, count) tuples of length min(k, unique tokens): compute_top_k returns exactly that, and top_k is assigned accordingly. Correct.\\n 3) Handle k <= 0 => []: Explicitly handled at the start of compute_top_k. Correct.\\n 4) No input/file/network: None used. Correct.\\n\\n- Output contract:\\n - top_k is defined at the end as specified. No extraneous printing; printing is optional, so this is fine.\\n\\n- Tie-ordering disclaimer: Does not rely on Counter.most_common; it implements the order directly. Correct.\\n\\n- Summary: Functionally adheres to all core requirements and edge cases. The only deviation is not using a regex for tokenization as explicitly requested.\",\n \"code_quality\": \"- Correctness: The tokenizer correctly extracts ASCII [a-z0-9]+ tokens with lowercase output. Counting and Top-K selection are correct, including tie-breaks.\\n- Efficiency: Single-pass tokenization (O(n)) and heapq.nsmallest for Top-K (O(m log k)) are efficient choices. Avoids full sort when k << m.\\n- Readability/Maintainability: Code is organized with clear function boundaries and type hints. The manual ord-based tokenizer is slightly less readable than a regex, but comments clarify intent. Micro-optimizations (append alias) are fine but not strictly necessary.\\n- Structure/Best practices: Clean separation of concerns, no side effects, handles edge cases gracefully, and avoids reliance on unspecified tie ordering. Typing annotations improve clarity. No I/O or network usage.\\n\\nOverall, the code quality is high; the only minor note is that a regex would be simpler and align with the instruction, but the current implementation is clear and well-documented.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally correct and efficient; produces the exact required Top-K with proper sorting and edge-case handling. Minor deviation: does not use a regex as explicitly requested for tokenization.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization requirements:\n - Lowercasing: The implementation lowercases ASCII A-Z to a-z during tokenization, so all produced tokens are lowercase. This satisfies the effective requirement that tokens be lowercase.\n - Token definition: It treats only ASCII [a-z0-9] as part of tokens and all other characters as separators. This matches the specified tokenization.\n - Note: The instructions explicitly say to \"use a regex\". The solution uses a manual scanner instead of a regex. While the behavior matches the spec, this deviates from the stated method.\n\n- Inputs/globals:\n - It does not reassign the provided globals text and k. It reads them into _text and _k via try/except, which is acceptable. The fallback to defaults if globals are missing goes beyond the spec (since the task states they are provided) but does not violate any requirement.\n\n- Requirements:\n 1) Sorting by count desc, then token asc: Achieved via heapq.nsmallest with key = (-count, token). The returned list is in ascending key order, which corresponds to the desired order. Correct.\n 2) Set top_k to list of (token, count) tuples of length min(k, unique tokens): compute_top_k returns exactly that, and top_k is assigned accordingly. Correct.\n 3) Handle k <= 0 => []: Explicitly handled at the start of compute_top_k. Correct.\n 4) No input/file/network: None used. Correct.\n\n- Output contract:\n - top_k is defined at the end as specified. No extraneous printing; printing is optional, so this is fine.\n\n- Tie-ordering disclaimer: Does not rely on Counter.most_common; it implements the order directly. Correct.\n\n- Summary: Functionally adheres to all core requirements and edge cases. The only deviation is not using a regex for tokenization as explicitly requested.", + "code_quality": "- Correctness: The tokenizer correctly extracts ASCII [a-z0-9]+ tokens with lowercase output. Counting and Top-K selection are correct, including tie-breaks.\n- Efficiency: Single-pass tokenization (O(n)) and heapq.nsmallest for Top-K (O(m log k)) are efficient choices. Avoids full sort when k << m.\n- Readability/Maintainability: Code is organized with clear function boundaries and type hints. The manual ord-based tokenizer is slightly less readable than a regex, but comments clarify intent. Micro-optimizations (append alias) are fine but not strictly necessary.\n- Structure/Best practices: Clean separation of concerns, no side effects, handles edge cases gracefully, and avoids reliance on unspecified tie ordering. Typing annotations improve clarity. No I/O or network usage.\n\nOverall, the code quality is high; the only minor note is that a regex would be simpler and align with the instruction, but the current implementation is clear and well-documented." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally correct and efficient; produces the exact required Top-K with proper sorting and edge-case handling. Minor deviation: does not use a regex as explicitly requested for tokenization." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_07.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_07.json new file mode 100644 index 0000000000..61b2b4db7f --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_07.json @@ -0,0 +1,16 @@ +{ + "file": "run_07.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Tokenization spec: Must lowercase and define tokens as ASCII [a-z0-9]+ with all others as separators, and it says to use a regex. The implementation uses a custom one-pass tokenizer that lowercases only A-Z and accepts only 'a'-'z' and '0'-'9', treating everything else as a separator. Functionally this matches the specified tokenization semantics for ASCII, including lowercasing and separator handling. However, it does not use a regex as explicitly requested. This is a minor deviation from the stated method, though the results should be equivalent.\\n- Inputs: Uses provided globals text and k without reassigning them. The final line sets top_k = compute_top_k(text, k), adhering to the constraint not to reassign text or k.\\n- Requirement 1 (sorting): Returns Top-K using heapq.nsmallest with key = (-count, token), which yields count descending, token ascending. This satisfies the exact sorting specification.\\n- Requirement 2 (top_k structure and length): top_k is set to a list of (token, count) tuples, and heapq.nsmallest(k, ...) ensures length = min(k, number of unique tokens). Correct.\\n- Requirement 3 (edge case k <= 0): compute_top_k returns [] if k <= 0, so top_k becomes []. Correct.\\n- Requirement 4 (no I/O or network): No input(), file I/O, or network usage. The script runs as-is given the globals.\\n- Output contract: top_k is defined at the end as required. No extra printing is performed, which is acceptable since printing is optional and constrained if used.\\n- Tie-order caution: Does not rely on Counter.most_common; implements explicit key as specified.\\nOverall, all functional requirements are met. The only deviation is not using a regex for tokenization, which the instructions explicitly mention.\",\n \"code_quality\": \"Assessment:\\n- Correctness: The tokenization logic correctly captures ASCII [a-z0-9]+, lowercases A-Z, and treats all other characters as separators. Counting and Top-K selection are correct with the specified tie-breaker.\\n- Efficiency: Single pass tokenization O(n), counting O(n), Top-K via heapq.nsmallest O(U log k) where U is unique token count. Efficient for large inputs.\\n- Clarity and readability: Clear function decomposition, descriptive names, inline comments, and type hints. The custom tokenizer is straightforward and well-commented.\\n- Maintainability: Structured and easy to modify. The only potential stylistic concern is not following the \\\"use a regex\\\" directive, but the implementation itself is clean.\\n- Best practices: Avoids reliance on unspecified tie-ordering; no side effects; guards edge cases (k <= 0, empty text). Minor micro-optimizations (local append, counts.get) are fine.\\nNo bugs or inefficiencies detected. The choice not to use regex impacts task adherence rather than intrinsic code quality.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally excellent and meets all output, sorting, and edge-case requirements. Minor deviation: tokenization did not use a regex as explicitly requested, though behavior matches the spec. Code is clean, efficient, and maintainable.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Tokenization spec: Must lowercase and define tokens as ASCII [a-z0-9]+ with all others as separators, and it says to use a regex. The implementation uses a custom one-pass tokenizer that lowercases only A-Z and accepts only 'a'-'z' and '0'-'9', treating everything else as a separator. Functionally this matches the specified tokenization semantics for ASCII, including lowercasing and separator handling. However, it does not use a regex as explicitly requested. This is a minor deviation from the stated method, though the results should be equivalent.\n- Inputs: Uses provided globals text and k without reassigning them. The final line sets top_k = compute_top_k(text, k), adhering to the constraint not to reassign text or k.\n- Requirement 1 (sorting): Returns Top-K using heapq.nsmallest with key = (-count, token), which yields count descending, token ascending. This satisfies the exact sorting specification.\n- Requirement 2 (top_k structure and length): top_k is set to a list of (token, count) tuples, and heapq.nsmallest(k, ...) ensures length = min(k, number of unique tokens). Correct.\n- Requirement 3 (edge case k <= 0): compute_top_k returns [] if k <= 0, so top_k becomes []. Correct.\n- Requirement 4 (no I/O or network): No input(), file I/O, or network usage. The script runs as-is given the globals.\n- Output contract: top_k is defined at the end as required. No extra printing is performed, which is acceptable since printing is optional and constrained if used.\n- Tie-order caution: Does not rely on Counter.most_common; implements explicit key as specified.\nOverall, all functional requirements are met. The only deviation is not using a regex for tokenization, which the instructions explicitly mention.", + "code_quality": "Assessment:\n- Correctness: The tokenization logic correctly captures ASCII [a-z0-9]+, lowercases A-Z, and treats all other characters as separators. Counting and Top-K selection are correct with the specified tie-breaker.\n- Efficiency: Single pass tokenization O(n), counting O(n), Top-K via heapq.nsmallest O(U log k) where U is unique token count. Efficient for large inputs.\n- Clarity and readability: Clear function decomposition, descriptive names, inline comments, and type hints. The custom tokenizer is straightforward and well-commented.\n- Maintainability: Structured and easy to modify. The only potential stylistic concern is not following the \"use a regex\" directive, but the implementation itself is clean.\n- Best practices: Avoids reliance on unspecified tie-ordering; no side effects; guards edge cases (k <= 0, empty text). Minor micro-optimizations (local append, counts.get) are fine.\nNo bugs or inefficiencies detected. The choice not to use regex impacts task adherence rather than intrinsic code quality." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally excellent and meets all output, sorting, and edge-case requirements. Minor deviation: tokenization did not use a regex as explicitly requested, though behavior matches the spec. Code is clean, efficient, and maintainable." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_08.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_08.json new file mode 100644 index 0000000000..9b3870f247 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_08.json @@ -0,0 +1,16 @@ +{ + "file": "run_08.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against the instructions:\\n- Tokenization: The code lowercases the input once (src.lower()) and uses a regex _token_re = re.compile(r'[a-z0-9]+') with finditer to extract tokens. This matches the spec: lowercase first, tokens are ASCII [a-z0-9]+ sequences, and all other characters act as separators.\\n- Inputs: It expects globals text (str) and k (int) and does not reassign them. It uses these to compute top_k at the end.\\n- Requirement 1 (sorting): It computes the top K using heapq.nsmallest with key=lambda kv: (-kv[1], kv[0]), which orders by count descending and token ascending, exactly as specified.\\n- Requirement 2 (top_k content and length): It returns a list of (token, count) tuples taken from counts.items(). The length is constrained with n = min(n, len(counts)), ensuring length equals min(k, unique tokens).\\n- Requirement 3 (edge cases): It handles k <= 0 by returning [] (n coerced via int conversion and explicit check if n <= 0). It also returns [] when there are no tokens (if not counts).\\n- Requirement 4 (no I/O/network): There is no input(), file I/O, or network access. The script runs as-is using the provided globals.\\n- Output contract: The global top_k is defined at the end as compute_top_k(text, k), producing the required list of (token, count) tuples. No extraneous printing occurs, which is allowed.\\n- Note on tie ordering: It does not rely on Counter.most_common; it explicitly implements the required sort ordering. \\nOverall, the code fully adheres to all specified instructions, including edge cases and ordering.\",\n \"code_quality\": \"This is not only correct but also well-written:\\n- Clarity/Readability: Clear function name, type hints, and concise comments. Precompiled regex is a nice touch.\\n- Correctness: Tokenization, counting, sorting, and edge-case handling are implemented correctly. Using heapq.nsmallest with key (-count, token) yields the exact ordering required.\\n- Efficiency: Streaming tokenization via finditer avoids building an intermediate token list. Using nsmallest is efficient for large vocabularies when k is small (O(U log K)).\\n- Maintainability/Structure: Encapsulating logic in compute_top_k and assigning top_k at the end is clean and modular.\\n- Best practices: No unnecessary I/O, avoids reliance on unspecified tie ordering. Robustly coerces k to int (though the task guarantees int, this does no harm).\\nPotential minor nitpicks: None that affect correctness. One optional micro-clarity improvement could be to specify re.ASCII (not necessary here since [a-z0-9] is already ASCII-specific), or to avoid int conversion since k is guaranteed int, but this is harmless and can improve robustness.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Fully meets all requirements, including exact tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and well-structured. No issues found.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against the instructions:\n- Tokenization: The code lowercases the input once (src.lower()) and uses a regex _token_re = re.compile(r'[a-z0-9]+') with finditer to extract tokens. This matches the spec: lowercase first, tokens are ASCII [a-z0-9]+ sequences, and all other characters act as separators.\n- Inputs: It expects globals text (str) and k (int) and does not reassign them. It uses these to compute top_k at the end.\n- Requirement 1 (sorting): It computes the top K using heapq.nsmallest with key=lambda kv: (-kv[1], kv[0]), which orders by count descending and token ascending, exactly as specified.\n- Requirement 2 (top_k content and length): It returns a list of (token, count) tuples taken from counts.items(). The length is constrained with n = min(n, len(counts)), ensuring length equals min(k, unique tokens).\n- Requirement 3 (edge cases): It handles k <= 0 by returning [] (n coerced via int conversion and explicit check if n <= 0). It also returns [] when there are no tokens (if not counts).\n- Requirement 4 (no I/O/network): There is no input(), file I/O, or network access. The script runs as-is using the provided globals.\n- Output contract: The global top_k is defined at the end as compute_top_k(text, k), producing the required list of (token, count) tuples. No extraneous printing occurs, which is allowed.\n- Note on tie ordering: It does not rely on Counter.most_common; it explicitly implements the required sort ordering. \nOverall, the code fully adheres to all specified instructions, including edge cases and ordering.", + "code_quality": "This is not only correct but also well-written:\n- Clarity/Readability: Clear function name, type hints, and concise comments. Precompiled regex is a nice touch.\n- Correctness: Tokenization, counting, sorting, and edge-case handling are implemented correctly. Using heapq.nsmallest with key (-count, token) yields the exact ordering required.\n- Efficiency: Streaming tokenization via finditer avoids building an intermediate token list. Using nsmallest is efficient for large vocabularies when k is small (O(U log K)).\n- Maintainability/Structure: Encapsulating logic in compute_top_k and assigning top_k at the end is clean and modular.\n- Best practices: No unnecessary I/O, avoids reliance on unspecified tie ordering. Robustly coerces k to int (though the task guarantees int, this does no harm).\nPotential minor nitpicks: None that affect correctness. One optional micro-clarity improvement could be to specify re.ASCII (not necessary here since [a-z0-9] is already ASCII-specific), or to avoid int conversion since k is guaranteed int, but this is harmless and can improve robustness." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Fully meets all requirements, including exact tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and well-structured. No issues found." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_09.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_09.json new file mode 100644 index 0000000000..ea8bc2f50d --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_09.json @@ -0,0 +1,16 @@ +{ + "file": "run_09.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Reads task: compute Top-K most frequent tokens from globals text and k, with specified tokenization and ordering; set top_k accordingly; handle k<=0; avoid I/O; don't rely on Counter tie rules.\\n- Tokenization requirements:\\n - Lowercase the text: The code lowercases only ASCII A-Z by converting each uppercase letter via chr(o + 32) during scanning. This is equivalent to lowercasing for the specified token alphabet [a-z0-9].\\n - Tokens are ASCII [a-z0-9]+; others are separators: The scanner accepts digits (48-57), lowercase letters (97-122), and converts uppercase letters (65-90) to lowercase before appending; anything else triggers a token flush, matching the separator rule. Behavior aligns with the spec for ASCII-only tokens.\\n - \\\"Use a regex\\\": The implementation does not use a regex; instead it manually scans. While the functional outcome matches the spec, this deviates from the explicit instruction to use a regex. Minor adherence issue.\\n- Inputs: Uses globals text and k without reassigning them. It passes int(k) to the selection function but does not reassign k, which is acceptable.\\n- Requirements:\\n 1) Sort by count desc, then token asc: _select_top_k uses heapq.nsmallest with key (-count, token). Since nsmallest returns items ordered ascending by the key, this yields count descending then token ascending. Correct.\\n 2) Set top_k to list of (token, count) tuples, length = min(k, unique): _select_top_k computes n = min(k, len(counts)) and returns that many (token, count) pairs; top-level assigns top_k accordingly. Correct.\\n 3) Handle k <= 0 -> []: _select_top_k returns [] if not counts or k <= 0; thus top_k becomes []. Correct.\\n 4) No input/file/network: None used. Correct.\\n- Output contract: top_k is defined at end as list of (token, count) with correct ordering; no extra printing. Correct.\\n- Tie-ordering note: Does not rely on Counter; implements explicit ordering. Correct.\\n- Edge cases and ambiguity:\\n - Non-ASCII letters/digits are treated as separators, consistent with ASCII-only token spec.\\n - The manual lowercasing is limited to ASCII, which is appropriate given the token definition.\\n - The only notable deviation is not using a regex as requested; however, the functional behavior matches the regex-defined tokenization.\\n\",\n \"code_quality\": \"- Clarity and readability: Functions are small and well-named with helpful comments. Manual ASCII scanning with ord ranges and micro-optimizations (local variable bindings, buf list and clear) slightly reduce readability compared to a regex approach but are still understandable.\\n- Correctness: Tokenization matches the specified ASCII [a-z0-9]+ rule and lowercasing. Sorting uses a robust key (-count, token) and nsmallest, yielding the correct order. Edge cases (k <= 0, empty input) are handled.\\n- Efficiency: Single pass scan O(n) for counting; nsmallest for top-k selection is O(U log k) which is efficient for large U and small k. Avoids full sort when unnecessary.\\n- Structure and maintainability: Separation into _count_tokens and _select_top_k is good. No side effects beyond defining top_k. No reliance on undefined behavior (e.g., Counter tie-breaking). Variable names and comments are appropriate.\\n- Best practices: No I/O. Avoids reassigning globals. Uses heapq appropriately. Minor nit: micro-optimizations (binding methods to locals) trade clarity for speed; acceptable but could be simplified for readability if performance is not critical. Also, the instruction suggested using regex; while not required for correctness, adopting re would improve alignment with the spec and likely readability.\\n\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 4,\n \"comments\": \"Functionally correct and efficient implementation that meets ordering, edge cases, and output contract. The only notable deviation is not using a regex for tokenization as specified, though behavior matches the rule. Code is clean and performant, with minor readability trade-offs due to manual ASCII scanning and micro-optimizations.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Reads task: compute Top-K most frequent tokens from globals text and k, with specified tokenization and ordering; set top_k accordingly; handle k<=0; avoid I/O; don't rely on Counter tie rules.\n- Tokenization requirements:\n - Lowercase the text: The code lowercases only ASCII A-Z by converting each uppercase letter via chr(o + 32) during scanning. This is equivalent to lowercasing for the specified token alphabet [a-z0-9].\n - Tokens are ASCII [a-z0-9]+; others are separators: The scanner accepts digits (48-57), lowercase letters (97-122), and converts uppercase letters (65-90) to lowercase before appending; anything else triggers a token flush, matching the separator rule. Behavior aligns with the spec for ASCII-only tokens.\n - \"Use a regex\": The implementation does not use a regex; instead it manually scans. While the functional outcome matches the spec, this deviates from the explicit instruction to use a regex. Minor adherence issue.\n- Inputs: Uses globals text and k without reassigning them. It passes int(k) to the selection function but does not reassign k, which is acceptable.\n- Requirements:\n 1) Sort by count desc, then token asc: _select_top_k uses heapq.nsmallest with key (-count, token). Since nsmallest returns items ordered ascending by the key, this yields count descending then token ascending. Correct.\n 2) Set top_k to list of (token, count) tuples, length = min(k, unique): _select_top_k computes n = min(k, len(counts)) and returns that many (token, count) pairs; top-level assigns top_k accordingly. Correct.\n 3) Handle k <= 0 -> []: _select_top_k returns [] if not counts or k <= 0; thus top_k becomes []. Correct.\n 4) No input/file/network: None used. Correct.\n- Output contract: top_k is defined at end as list of (token, count) with correct ordering; no extra printing. Correct.\n- Tie-ordering note: Does not rely on Counter; implements explicit ordering. Correct.\n- Edge cases and ambiguity:\n - Non-ASCII letters/digits are treated as separators, consistent with ASCII-only token spec.\n - The manual lowercasing is limited to ASCII, which is appropriate given the token definition.\n - The only notable deviation is not using a regex as requested; however, the functional behavior matches the regex-defined tokenization.\n", + "code_quality": "- Clarity and readability: Functions are small and well-named with helpful comments. Manual ASCII scanning with ord ranges and micro-optimizations (local variable bindings, buf list and clear) slightly reduce readability compared to a regex approach but are still understandable.\n- Correctness: Tokenization matches the specified ASCII [a-z0-9]+ rule and lowercasing. Sorting uses a robust key (-count, token) and nsmallest, yielding the correct order. Edge cases (k <= 0, empty input) are handled.\n- Efficiency: Single pass scan O(n) for counting; nsmallest for top-k selection is O(U log k) which is efficient for large U and small k. Avoids full sort when unnecessary.\n- Structure and maintainability: Separation into _count_tokens and _select_top_k is good. No side effects beyond defining top_k. No reliance on undefined behavior (e.g., Counter tie-breaking). Variable names and comments are appropriate.\n- Best practices: No I/O. Avoids reassigning globals. Uses heapq appropriately. Minor nit: micro-optimizations (binding methods to locals) trade clarity for speed; acceptable but could be simplified for readability if performance is not critical. Also, the instruction suggested using regex; while not required for correctness, adopting re would improve alignment with the spec and likely readability.\n" + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 4, + "comments": "Functionally correct and efficient implementation that meets ordering, edge cases, and output contract. The only notable deviation is not using a regex for tokenization as specified, though behavior matches the rule. Code is clean and performant, with minor readability trade-offs due to manual ASCII scanning and micro-optimizations." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_10.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_10.json new file mode 100644 index 0000000000..70b74be0f6 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_10.json @@ -0,0 +1,16 @@ +{ + "file": "run_10.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step:\\n- Overall task: Computes Top-K most frequent tokens from the given text and assigns to top_k. Yes, via _compute_top_k and final assignment.\\n- Tokenization: Lowercasing is applied per character (c = ch.lower()). Tokens restricted to ASCII [a-z0-9]+ and others act as separators. This matches the semantics. However, the instruction explicitly says to use a regex; the implementation uses manual scanning instead. Functionally correct but does not follow the \\\"use a regex\\\" directive.\\n- Inputs: Uses globals text and k without reassigning them. Reads them into local variables _text and _k with fallbacks if missing, which does not violate the instruction and is harmless given that globals are provided.\\n- Requirement 1 (sort by count desc, then token asc): Satisfied. Uses heapq.nsmallest with key = (-count, token) over counts.items(). The returned list is in the required order.\\n- Requirement 2 (top_k is list of (token, count) tuples, length = min(k, unique)): Satisfied. counts.items() yields tuples; nsmallest(k, ...) returns at most k items; if k exceeds unique tokens, returns all, hence length is min(k, unique).\\n- Requirement 3 (k <= 0 -> []): Satisfied. Returns [] when k <= 0. Also handles non-int k by returning [], which is acceptable though not required.\\n- Requirement 4 (no input/file/network; run as-is): Satisfied. No prohibited I/O; script runs as-is.\\n- Output contract: top_k is defined at end exactly as specified. No extra printing. Does not rely on Counter.most_common tie ordering.\\n- Ambiguities/choices: Handling non-int k by returning [] is a reasonable defensive choice and does not conflict with requirements.\",\n \"code_quality\": \"This is not user-facing code generation but evaluable code is present, so assessing code quality:\\n- Correctness: Tokenization logic matches the specified [a-z0-9]+ after lowercasing; sorting and tie-breaking are correct; edge cases for k handled.\\n- Clarity/Readability: Clear helper function names, concise comments, straightforward structure. Using aliases (append) and buf.clear() is fine and readable.\\n- Efficiency: Single pass tokenization; Counter accumulation; heapq.nsmallest gives O(U log K) top-K selection which is efficient. No unnecessary overhead.\\n- Maintainability: Modularized into tokenization and computation functions; easy to test and modify.\\n- Best practices: Avoids reliance on Counter.most_common tie ordering as required. Minor deviation: does not use regex as instructed (style/requirement rather than correctness). Otherwise solid.\\n- No apparent bugs or edge-case failures discovered.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally correct with proper sorting, edge-case handling, and correct top_k output. The only notable miss is not using a regex for tokenization as explicitly requested; otherwise, the implementation is clean and efficient.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step:\n- Overall task: Computes Top-K most frequent tokens from the given text and assigns to top_k. Yes, via _compute_top_k and final assignment.\n- Tokenization: Lowercasing is applied per character (c = ch.lower()). Tokens restricted to ASCII [a-z0-9]+ and others act as separators. This matches the semantics. However, the instruction explicitly says to use a regex; the implementation uses manual scanning instead. Functionally correct but does not follow the \"use a regex\" directive.\n- Inputs: Uses globals text and k without reassigning them. Reads them into local variables _text and _k with fallbacks if missing, which does not violate the instruction and is harmless given that globals are provided.\n- Requirement 1 (sort by count desc, then token asc): Satisfied. Uses heapq.nsmallest with key = (-count, token) over counts.items(). The returned list is in the required order.\n- Requirement 2 (top_k is list of (token, count) tuples, length = min(k, unique)): Satisfied. counts.items() yields tuples; nsmallest(k, ...) returns at most k items; if k exceeds unique tokens, returns all, hence length is min(k, unique).\n- Requirement 3 (k <= 0 -> []): Satisfied. Returns [] when k <= 0. Also handles non-int k by returning [], which is acceptable though not required.\n- Requirement 4 (no input/file/network; run as-is): Satisfied. No prohibited I/O; script runs as-is.\n- Output contract: top_k is defined at end exactly as specified. No extra printing. Does not rely on Counter.most_common tie ordering.\n- Ambiguities/choices: Handling non-int k by returning [] is a reasonable defensive choice and does not conflict with requirements.", + "code_quality": "This is not user-facing code generation but evaluable code is present, so assessing code quality:\n- Correctness: Tokenization logic matches the specified [a-z0-9]+ after lowercasing; sorting and tie-breaking are correct; edge cases for k handled.\n- Clarity/Readability: Clear helper function names, concise comments, straightforward structure. Using aliases (append) and buf.clear() is fine and readable.\n- Efficiency: Single pass tokenization; Counter accumulation; heapq.nsmallest gives O(U log K) top-K selection which is efficient. No unnecessary overhead.\n- Maintainability: Modularized into tokenization and computation functions; easy to test and modify.\n- Best practices: Avoids reliance on Counter.most_common tie ordering as required. Minor deviation: does not use regex as instructed (style/requirement rather than correctness). Otherwise solid.\n- No apparent bugs or edge-case failures discovered." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally correct with proper sorting, edge-case handling, and correct top_k output. The only notable miss is not using a regex for tokenization as explicitly requested; otherwise, the implementation is clean and efficient." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_11.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_11.json new file mode 100644 index 0000000000..298ad3c19c --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_11.json @@ -0,0 +1,16 @@ +{ + "file": "run_11.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Inputs/globals: The code reads globals text and k without reassigning them, passing them into a helper function. It does not modify the globals, satisfying the constraint.\\n- Tokenization: Uses regex r'[A-Za-z0-9]+' with re.ASCII to capture ASCII alphanumeric sequences, and lowercases each matched token (tok = m.group(0).lower()), which functionally matches \\\"Lowercase the text\\\" and the specified tokenization rule.\\n- Counting: Accumulates counts in a dict, as required.\\n- Sorting/Top-K: Uses heapq.nsmallest with key (-count, token) over counts.items(), which produces a list sorted by count descending then token ascending, matching the exact specified sort key and ensuring deterministic tie-breaking without relying on Counter.most_common.\\n- Output shape: Returns a list of (token, count) tuples (from counts.items()) and assigns it to top_k at module scope. The length is min(k, number of unique tokens) via kk = min(k, n_unique).\\n- Edge cases: If k cannot be converted to int, it treats as 0; if k <= 0, returns []. If text is empty, returns [], which matches min(k, 0). All required edge cases are handled.\\n- No prohibited operations: No input(), file I/O, or network access; optional printing is not used.\\n- Ambiguities: The instruction states \\\"Lowercase the text\\\"; the implementation lowercases tokens on-the-fly, which yields identical tokens per the regex spec and is reasonable.\\nOverall, the implementation meets all specified requirements and contracts.\",\n \"code_quality\": \"- Correctness: Logic is sound; key=(-count, token) ensures correct ordering. Returns proper list of (token, count) tuples.\\n- Efficiency: Uses heapq.nsmallest to avoid full sort when k << unique tokens; iterates regex matches without lowercasing the whole text, saving memory.\\n- Readability: Clear structure with explanatory comments. Variable names are mostly clear; kk could be more descriptive (e.g., k_eff), but this is minor.\\n- Robustness: Safely handles non-string text by coercing to str and handles non-int k by converting to int with fallback to 0.\\n- Maintainability/Style: Simple, modular via helper function; no unnecessary dependencies. Avoids reliance on unstable tie-breaking.\\nNo bugs or inefficiencies evident; minor naming nit only.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, sorting, edge cases, and output contract. Clean, efficient implementation with clear comments.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Inputs/globals: The code reads globals text and k without reassigning them, passing them into a helper function. It does not modify the globals, satisfying the constraint.\n- Tokenization: Uses regex r'[A-Za-z0-9]+' with re.ASCII to capture ASCII alphanumeric sequences, and lowercases each matched token (tok = m.group(0).lower()), which functionally matches \"Lowercase the text\" and the specified tokenization rule.\n- Counting: Accumulates counts in a dict, as required.\n- Sorting/Top-K: Uses heapq.nsmallest with key (-count, token) over counts.items(), which produces a list sorted by count descending then token ascending, matching the exact specified sort key and ensuring deterministic tie-breaking without relying on Counter.most_common.\n- Output shape: Returns a list of (token, count) tuples (from counts.items()) and assigns it to top_k at module scope. The length is min(k, number of unique tokens) via kk = min(k, n_unique).\n- Edge cases: If k cannot be converted to int, it treats as 0; if k <= 0, returns []. If text is empty, returns [], which matches min(k, 0). All required edge cases are handled.\n- No prohibited operations: No input(), file I/O, or network access; optional printing is not used.\n- Ambiguities: The instruction states \"Lowercase the text\"; the implementation lowercases tokens on-the-fly, which yields identical tokens per the regex spec and is reasonable.\nOverall, the implementation meets all specified requirements and contracts.", + "code_quality": "- Correctness: Logic is sound; key=(-count, token) ensures correct ordering. Returns proper list of (token, count) tuples.\n- Efficiency: Uses heapq.nsmallest to avoid full sort when k << unique tokens; iterates regex matches without lowercasing the whole text, saving memory.\n- Readability: Clear structure with explanatory comments. Variable names are mostly clear; kk could be more descriptive (e.g., k_eff), but this is minor.\n- Robustness: Safely handles non-string text by coercing to str and handles non-int k by converting to int with fallback to 0.\n- Maintainability/Style: Simple, modular via helper function; no unnecessary dependencies. Avoids reliance on unstable tie-breaking.\nNo bugs or inefficiencies evident; minor naming nit only." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, sorting, edge cases, and output contract. Clean, efficient implementation with clear comments." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_12.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_12.json new file mode 100644 index 0000000000..7724e10c70 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_12.json @@ -0,0 +1,16 @@ +{ + "file": "run_12.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex [A-Za-z0-9]+ with re.ASCII and lowercases each match, which matches the spec of ASCII [a-z0-9]+ tokens with other chars as separators. Lowercasing is correctly applied post-match.\\n- Sorting: Constructs a key function key = lambda it: (-it[1], it[0]) and uses it for ordering, satisfying the requirement to sort by count descending then token ascending. It avoids Counter.most_common as requested.\\n- Top-K selection and length: Computes t = max(0, min(int(k), len(items))) and returns either nsmallest(t, items, key=key) or sorted(items, key=key), ensuring the result length is min(k, number of unique tokens). For t == 0 or no tokens, returns []. This meets the length contract and edge case of k <= 0.\\n- Edge cases: If no tokens (items empty) \u2192 [], correct. If k <= 0 \u2192 [], correct. If k > unique tokens \u2192 returns all sorted items, correct.\\n- Inputs handling: Does not reassign the provided globals text or k. Wraps the call in a try/except NameError to set top_k = [] if globals are missing; while unnecessary given the spec, it does not violate any requirement and still ensures top_k is defined.\\n- Output contract: Defines top_k at the end as a list of (token, count) tuples. No extra printing, which is allowed (printing is optional and constrained if used).\\n- No disallowed operations: No input(), file I/O, or network access present.\\n- Minor note: Using nsmallest ensures both correct Top-K selection and returns them in sorted order by the provided key, so the output order is correct.\\nOverall, the code satisfies all specified requirements.\",\n \"code_quality\": \"- Correctness: Implements counting, tokenization, sorting, and Top-K selection correctly. Does not rely on Counter.most_common tie ordering.\\n- Efficiency: Uses heapq.nsmallest for the Top-K path, which is efficient when k << number of unique tokens. Precompiles the regex. Overall time/space usage is reasonable.\\n- Readability/Structure: Clear function structure with type hints. Variable names are mostly clear; 't' could be more descriptive (e.g., n or k_eff), but this is minor. Inline comment summarizes the tokenization rule.\\n- Maintainability: Simple, self-contained, and adheres to best practices. No unnecessary complexity.\\n- Error handling: The try/except NameError around the call is a gentle safeguard; it doesn\u2019t mask other errors and keeps behavior predictable.\\nNo bugs or significant stylistic issues identified.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, sorting, edge cases, and output contract. Code is clean, efficient, and readable. Minor nit: rename variable 't' for clarity if desired.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex [A-Za-z0-9]+ with re.ASCII and lowercases each match, which matches the spec of ASCII [a-z0-9]+ tokens with other chars as separators. Lowercasing is correctly applied post-match.\n- Sorting: Constructs a key function key = lambda it: (-it[1], it[0]) and uses it for ordering, satisfying the requirement to sort by count descending then token ascending. It avoids Counter.most_common as requested.\n- Top-K selection and length: Computes t = max(0, min(int(k), len(items))) and returns either nsmallest(t, items, key=key) or sorted(items, key=key), ensuring the result length is min(k, number of unique tokens). For t == 0 or no tokens, returns []. This meets the length contract and edge case of k <= 0.\n- Edge cases: If no tokens (items empty) \u2192 [], correct. If k <= 0 \u2192 [], correct. If k > unique tokens \u2192 returns all sorted items, correct.\n- Inputs handling: Does not reassign the provided globals text or k. Wraps the call in a try/except NameError to set top_k = [] if globals are missing; while unnecessary given the spec, it does not violate any requirement and still ensures top_k is defined.\n- Output contract: Defines top_k at the end as a list of (token, count) tuples. No extra printing, which is allowed (printing is optional and constrained if used).\n- No disallowed operations: No input(), file I/O, or network access present.\n- Minor note: Using nsmallest ensures both correct Top-K selection and returns them in sorted order by the provided key, so the output order is correct.\nOverall, the code satisfies all specified requirements.", + "code_quality": "- Correctness: Implements counting, tokenization, sorting, and Top-K selection correctly. Does not rely on Counter.most_common tie ordering.\n- Efficiency: Uses heapq.nsmallest for the Top-K path, which is efficient when k << number of unique tokens. Precompiles the regex. Overall time/space usage is reasonable.\n- Readability/Structure: Clear function structure with type hints. Variable names are mostly clear; 't' could be more descriptive (e.g., n or k_eff), but this is minor. Inline comment summarizes the tokenization rule.\n- Maintainability: Simple, self-contained, and adheres to best practices. No unnecessary complexity.\n- Error handling: The try/except NameError around the call is a gentle safeguard; it doesn\u2019t mask other errors and keeps behavior predictable.\nNo bugs or significant stylistic issues identified." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, sorting, edge cases, and output contract. Code is clean, efficient, and readable. Minor nit: rename variable 't' for clarity if desired." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_13.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_13.json new file mode 100644 index 0000000000..11925ee206 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_13.json @@ -0,0 +1,16 @@ +{ + "file": "run_13.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Tokenization rules: The code tokenizes ASCII [a-z0-9]+ sequences, lowercasing letters and treating all other characters as separators via a manual ASCII scan. This behavior matches the specified tokenization semantics. However, the instruction explicitly says \\\"use a regex\\\". The implementation does not use a regex; while functionally equivalent, this deviates from the stated method.\\n- Lowercasing: Uppercase ASCII letters are converted to lowercase during scanning. This satisfies the lowercase requirement for tokens within the ASCII range.\\n- Inputs (globals text, k; do not reassign): The code references text and k via compute_top_k(text, k) and does not reassign these globals. Inside the function, it casts the local k to int, which does not modify the global k. Compliant.\\n- Requirement 1 (sort by count desc, token asc): Achieved via heapq.nsmallest with key = (-count, token), which returns results sorted by that key. This meets the ordering requirement and does not rely on Counter.most_common.\\n- Requirement 2 (top_k list of (token, count) tuples, length = min(k, unique tokens)): The function computes counts, computes n = min(k, len(counts)), and returns a list of (token, count) tuples of length n. Compliant.\\n- Requirement 3 (k <= 0 -> []): If k <= 0 or counts empty, returns []. Compliant.\\n- Requirement 4 (no input/file/network; script runs as-is): No I/O is used. The code guards top_k assignment in a try/except NameError to ensure top_k is defined if globals are missing; with provided globals, it sets top_k accordingly. Compliant.\\n- Output contract (top_k defined as described; optional print only top_k): top_k is defined as required; no printing is performed. Compliant.\\n\\nSummary: All functional requirements are met, but the explicit \\\"use a regex\\\" directive for tokenization is not followed.\",\n \"code_quality\": \"Code quality assessment:\\n- Correctness: Tokenization, counting, and sorting are correctly implemented for the specified ASCII token definition. Tie-breaking and k handling are correct.\\n- Clarity/Readability: The tokenizer uses ord-based character range checks and a micro-optimized buffer (append alias). This is efficient but less readable than a straightforward regex approach and typical Python style. Comments help, but readability could be improved by simpler constructs.\\n- Efficiency: Single-pass tokenization is O(n). Top-K extraction via heapq.nsmallest is efficient, especially when k << number of unique tokens. For k close to the number of unique tokens, performance remains acceptable.\\n- Structure/Maintainability: Separation into _count_tokens and compute_top_k is clean. The NameError guard ensures top_k is always defined but is arguably unnecessary given the stated inputs; still harmless.\\n- Best practices: Avoiding unnecessary micro-optimizations (like caching append) could improve readability. Using regex would align with the instruction and typical Python practices, but the current approach is still solid.\\n\\nNo bugs or correctness issues identified; primary nit is readability and deviation from the requested regex approach.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 4,\n \"comments\": \"Functionally excellent and meets sorting and output requirements, but it does not follow the explicit 'use a regex' instruction for tokenization. Minor readability concerns due to micro-optimizations and low-level ASCII handling.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Tokenization rules: The code tokenizes ASCII [a-z0-9]+ sequences, lowercasing letters and treating all other characters as separators via a manual ASCII scan. This behavior matches the specified tokenization semantics. However, the instruction explicitly says \"use a regex\". The implementation does not use a regex; while functionally equivalent, this deviates from the stated method.\n- Lowercasing: Uppercase ASCII letters are converted to lowercase during scanning. This satisfies the lowercase requirement for tokens within the ASCII range.\n- Inputs (globals text, k; do not reassign): The code references text and k via compute_top_k(text, k) and does not reassign these globals. Inside the function, it casts the local k to int, which does not modify the global k. Compliant.\n- Requirement 1 (sort by count desc, token asc): Achieved via heapq.nsmallest with key = (-count, token), which returns results sorted by that key. This meets the ordering requirement and does not rely on Counter.most_common.\n- Requirement 2 (top_k list of (token, count) tuples, length = min(k, unique tokens)): The function computes counts, computes n = min(k, len(counts)), and returns a list of (token, count) tuples of length n. Compliant.\n- Requirement 3 (k <= 0 -> []): If k <= 0 or counts empty, returns []. Compliant.\n- Requirement 4 (no input/file/network; script runs as-is): No I/O is used. The code guards top_k assignment in a try/except NameError to ensure top_k is defined if globals are missing; with provided globals, it sets top_k accordingly. Compliant.\n- Output contract (top_k defined as described; optional print only top_k): top_k is defined as required; no printing is performed. Compliant.\n\nSummary: All functional requirements are met, but the explicit \"use a regex\" directive for tokenization is not followed.", + "code_quality": "Code quality assessment:\n- Correctness: Tokenization, counting, and sorting are correctly implemented for the specified ASCII token definition. Tie-breaking and k handling are correct.\n- Clarity/Readability: The tokenizer uses ord-based character range checks and a micro-optimized buffer (append alias). This is efficient but less readable than a straightforward regex approach and typical Python style. Comments help, but readability could be improved by simpler constructs.\n- Efficiency: Single-pass tokenization is O(n). Top-K extraction via heapq.nsmallest is efficient, especially when k << number of unique tokens. For k close to the number of unique tokens, performance remains acceptable.\n- Structure/Maintainability: Separation into _count_tokens and compute_top_k is clean. The NameError guard ensures top_k is always defined but is arguably unnecessary given the stated inputs; still harmless.\n- Best practices: Avoiding unnecessary micro-optimizations (like caching append) could improve readability. Using regex would align with the instruction and typical Python practices, but the current approach is still solid.\n\nNo bugs or correctness issues identified; primary nit is readability and deviation from the requested regex approach." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 4, + "comments": "Functionally excellent and meets sorting and output requirements, but it does not follow the explicit 'use a regex' instruction for tokenization. Minor readability concerns due to micro-optimizations and low-level ASCII handling." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_14.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_14.json new file mode 100644 index 0000000000..7add319969 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_14.json @@ -0,0 +1,16 @@ +{ + "file": "run_14.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step adherence check:\\n- Inputs/globals: The code reads from globals `text` and `k` and does not reassign them. It wraps the access in a try/except NameError and sets `top_k = []` if globals are missing. With provided globals (as per task), it uses them directly. This adheres to the input constraints.\\n- Tokenization: The tokenizer `_iter_ascii_tokens` lowercases ASCII A-Z by bitwise OR 32 and accepts only ASCII letters/digits [a-z0-9], treating all other characters (including non-ASCII) as separators. This matches the specified token definition and lowercasing behavior. The instruction suggests using a regex, but the implemented manual tokenizer produces equivalent behavior. This is a minor deviation from the suggested implementation style but not from the functional spec.\\n- Sorting: Uses `heapq.nsmallest(kk, counts.items(), key=lambda it: (-it[1], it[0]))`, which effectively returns items sorted by count descending (via negative count) and token ascending, matching requirement (1). Since `nsmallest` returns a sorted list by the key, the resulting order satisfies the contract.\\n- Top-K and length: It computes `kk = min(int(k), len(counts))` and returns exactly `kk` items, fulfilling requirement (2).\\n- Edge cases: If `k` cannot be cast to int or `k <= 0`, it returns `[]`, satisfying requirement (3). If the text yields no tokens, it also returns `[]`, which is consistent with `min(k, 0) = 0`.\\n- No I/O: There is no input(), file I/O, or network access. Satisfies requirement (4).\\n- Output contract: At the end, `top_k` is defined as a list of `(token, count)` tuples. No extraneous printing. Meets the contract.\\n- Tie-ordering and Counter: Does not rely on Counter.most_common; uses explicit key ordering as required.\\nAmbiguity consideration: The parenthetical \\\"use a regex\\\" could be read as a strict requirement or a suggestion. The code does not use a regex but achieves the exact specified tokenization, which is a reasonable and correct choice; thus, at most a minor stylistic deviation.\",\n \"code_quality\": \"Code quality assessment:\\n- Correctness: Tokenization correctly handles ASCII rules, lowercases A-Z, treats all other chars as separators, and flushes buffers at boundaries and at end. Counting and Top-K selection are correct, and sorting by (-count, token) is implemented via `heapq.nsmallest` with an appropriate key.\\n- Clarity/Readability: Functions are small, well-named, and commented. The use of `ord` and bitwise OR (o | 32) is efficient but slightly less readable than `ch.lower()` or a regex; comments mitigate this. Overall readable.\\n- Efficiency: Single pass tokenization and counting O(n). Selection via `heapq.nsmallest` is O(n log k) and returns sorted output, which is efficient for large n and small k. For k near n, complexity is similar to sorting, which is acceptable.\\n- Structure/Maintainability: Helpers are modular. Minimal global interaction. No reliance on unspecified tie-breaking. Edge cases handled cleanly.\\n- Best practices: Avoids unnecessary I/O. Does not mutate provided globals. Uses local variable caching (`get = counts.get`) for micro-optimization. The fallback `try/except NameError` to define `top_k` is harmless and ensures `top_k` is always defined.\\nNo bugs or inefficiencies identified; only a minor readability note regarding bitwise lowercase and the choice not to use regex.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally excellent and meets all requirements, including sorting, edge cases, and output format. Minor deviation: does not use a regex for tokenization as suggested, though behavior matches the spec exactly. Code is clean, efficient, and maintainable.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step adherence check:\n- Inputs/globals: The code reads from globals `text` and `k` and does not reassign them. It wraps the access in a try/except NameError and sets `top_k = []` if globals are missing. With provided globals (as per task), it uses them directly. This adheres to the input constraints.\n- Tokenization: The tokenizer `_iter_ascii_tokens` lowercases ASCII A-Z by bitwise OR 32 and accepts only ASCII letters/digits [a-z0-9], treating all other characters (including non-ASCII) as separators. This matches the specified token definition and lowercasing behavior. The instruction suggests using a regex, but the implemented manual tokenizer produces equivalent behavior. This is a minor deviation from the suggested implementation style but not from the functional spec.\n- Sorting: Uses `heapq.nsmallest(kk, counts.items(), key=lambda it: (-it[1], it[0]))`, which effectively returns items sorted by count descending (via negative count) and token ascending, matching requirement (1). Since `nsmallest` returns a sorted list by the key, the resulting order satisfies the contract.\n- Top-K and length: It computes `kk = min(int(k), len(counts))` and returns exactly `kk` items, fulfilling requirement (2).\n- Edge cases: If `k` cannot be cast to int or `k <= 0`, it returns `[]`, satisfying requirement (3). If the text yields no tokens, it also returns `[]`, which is consistent with `min(k, 0) = 0`.\n- No I/O: There is no input(), file I/O, or network access. Satisfies requirement (4).\n- Output contract: At the end, `top_k` is defined as a list of `(token, count)` tuples. No extraneous printing. Meets the contract.\n- Tie-ordering and Counter: Does not rely on Counter.most_common; uses explicit key ordering as required.\nAmbiguity consideration: The parenthetical \"use a regex\" could be read as a strict requirement or a suggestion. The code does not use a regex but achieves the exact specified tokenization, which is a reasonable and correct choice; thus, at most a minor stylistic deviation.", + "code_quality": "Code quality assessment:\n- Correctness: Tokenization correctly handles ASCII rules, lowercases A-Z, treats all other chars as separators, and flushes buffers at boundaries and at end. Counting and Top-K selection are correct, and sorting by (-count, token) is implemented via `heapq.nsmallest` with an appropriate key.\n- Clarity/Readability: Functions are small, well-named, and commented. The use of `ord` and bitwise OR (o | 32) is efficient but slightly less readable than `ch.lower()` or a regex; comments mitigate this. Overall readable.\n- Efficiency: Single pass tokenization and counting O(n). Selection via `heapq.nsmallest` is O(n log k) and returns sorted output, which is efficient for large n and small k. For k near n, complexity is similar to sorting, which is acceptable.\n- Structure/Maintainability: Helpers are modular. Minimal global interaction. No reliance on unspecified tie-breaking. Edge cases handled cleanly.\n- Best practices: Avoids unnecessary I/O. Does not mutate provided globals. Uses local variable caching (`get = counts.get`) for micro-optimization. The fallback `try/except NameError` to define `top_k` is harmless and ensures `top_k` is always defined.\nNo bugs or inefficiencies identified; only a minor readability note regarding bitwise lowercase and the choice not to use regex." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally excellent and meets all requirements, including sorting, edge cases, and output format. Minor deviation: does not use a regex for tokenization as suggested, though behavior matches the spec exactly. Code is clean, efficient, and maintainable." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_15.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_15.json new file mode 100644 index 0000000000..40137aeab1 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_15.json @@ -0,0 +1,16 @@ +{ + "file": "run_15.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step assessment against the task:\\n- Tokenization requirements:\\n - Lowercasing: Achieved via on-the-fly conversion of 'A'-'Z' to lowercase when building tokens. Functionally equivalent to lowercasing the text first.\\n - ASCII [a-z0-9]+ tokens; others are separators: Implemented by scanning characters with ord ranges for digits and letters, treating everything else as a separator. This matches the specified token set and separator behavior.\\n - Use a regex: Not followed. The implementation uses a manual streaming tokenizer instead of a regex, which deviates from the explicit instruction.\\n- Inputs: Uses provided globals 'text' and 'k' without reassigning them. Correct.\\n- Requirement 1 (sort by count desc, then token asc): Implemented via sorted(counts.items(), key=lambda it: (-it[1], it[0])). Correct.\\n- Requirement 2 (top_k is list of (token, count) tuples, length = min(k, unique)): The function computes counts, sorts, and slices to k if needed. Returns [] when no tokens. The final 'top_k' is assigned the returned list. Correct.\\n- Requirement 3 (k <= 0 -> top_k = []): Handled at the start of _compute_top_k; returns [], so top_k becomes []. Correct.\\n- Requirement 4 (no I/O or network; runs as-is): No I/O or network usage. Relies only on provided globals. Correct.\\n- Output contract: Ends with 'top_k' defined exactly as specified; no extra printing. Correct.\\n- Note (do not rely on Counter.most_common tie ordering): Not using Counter at all; uses explicit sorting key. Correct.\\nOverall, all functional requirements are met except the explicit \\\"use a regex\\\" directive for tokenization.\",\n \"code_quality\": \"The output is code; assessment follows:\\n- Correctness: Tokenization logic correctly yields lowercase ASCII alphanumeric tokens, treating other chars as separators. Counting and sorting logic match the spec, including tie-breaking. Edge cases (k <= 0, no tokens) handled properly.\\n- Clarity/Readability: Code is reasonably clear with explanatory comments. Variable names are meaningful. Minor nit: _iter_ascii_tokens lacks a return type hint (Iterator[str]). The ord-based branching is a bit low-level versus a regex but still understandable with comments.\\n- Efficiency: Single-pass tokenization with buffered yields; dictionary counting; O(U log U) sort where U is number of unique tokens. Local variable bindings (append, get) are micro-optimizations and fine.\\n- Maintainability/Structure: Helper functions are well-scoped; no side effects beyond final assignment. Type hints used for top_k and return of _compute_top_k.\\n- Best practices: Avoids reliance on Counter.most_common tie-order. No unnecessary I/O. The only deviation from the spec is not using a regex for tokenization; from a code-quality perspective, the manual tokenizer is acceptable and efficient.\\nNo bugs or inefficiencies detected that would affect correctness or performance in typical use.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally excellent and meets all sorting, edge-case, and output requirements. The only miss is not using a regex for tokenization as explicitly requested. Otherwise, clean, efficient, and correct.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step assessment against the task:\n- Tokenization requirements:\n - Lowercasing: Achieved via on-the-fly conversion of 'A'-'Z' to lowercase when building tokens. Functionally equivalent to lowercasing the text first.\n - ASCII [a-z0-9]+ tokens; others are separators: Implemented by scanning characters with ord ranges for digits and letters, treating everything else as a separator. This matches the specified token set and separator behavior.\n - Use a regex: Not followed. The implementation uses a manual streaming tokenizer instead of a regex, which deviates from the explicit instruction.\n- Inputs: Uses provided globals 'text' and 'k' without reassigning them. Correct.\n- Requirement 1 (sort by count desc, then token asc): Implemented via sorted(counts.items(), key=lambda it: (-it[1], it[0])). Correct.\n- Requirement 2 (top_k is list of (token, count) tuples, length = min(k, unique)): The function computes counts, sorts, and slices to k if needed. Returns [] when no tokens. The final 'top_k' is assigned the returned list. Correct.\n- Requirement 3 (k <= 0 -> top_k = []): Handled at the start of _compute_top_k; returns [], so top_k becomes []. Correct.\n- Requirement 4 (no I/O or network; runs as-is): No I/O or network usage. Relies only on provided globals. Correct.\n- Output contract: Ends with 'top_k' defined exactly as specified; no extra printing. Correct.\n- Note (do not rely on Counter.most_common tie ordering): Not using Counter at all; uses explicit sorting key. Correct.\nOverall, all functional requirements are met except the explicit \"use a regex\" directive for tokenization.", + "code_quality": "The output is code; assessment follows:\n- Correctness: Tokenization logic correctly yields lowercase ASCII alphanumeric tokens, treating other chars as separators. Counting and sorting logic match the spec, including tie-breaking. Edge cases (k <= 0, no tokens) handled properly.\n- Clarity/Readability: Code is reasonably clear with explanatory comments. Variable names are meaningful. Minor nit: _iter_ascii_tokens lacks a return type hint (Iterator[str]). The ord-based branching is a bit low-level versus a regex but still understandable with comments.\n- Efficiency: Single-pass tokenization with buffered yields; dictionary counting; O(U log U) sort where U is number of unique tokens. Local variable bindings (append, get) are micro-optimizations and fine.\n- Maintainability/Structure: Helper functions are well-scoped; no side effects beyond final assignment. Type hints used for top_k and return of _compute_top_k.\n- Best practices: Avoids reliance on Counter.most_common tie-order. No unnecessary I/O. The only deviation from the spec is not using a regex for tokenization; from a code-quality perspective, the manual tokenizer is acceptable and efficient.\nNo bugs or inefficiencies detected that would affect correctness or performance in typical use." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally excellent and meets all sorting, edge-case, and output requirements. The only miss is not using a regex for tokenization as explicitly requested. Otherwise, clean, efficient, and correct." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_16.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_16.json new file mode 100644 index 0000000000..48734ddeaf --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_16.json @@ -0,0 +1,16 @@ +{ + "file": "run_16.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step adherence assessment:\\n- Tokenization rules: The code lowercases characters and treats ASCII letters a\u2013z and digits 0\u20139 as token characters; everything else is a separator. This matches the functional definition of tokens. However, the instruction explicitly says to \\\"use a regex\\\" for tokenization. The implementation uses a manual state machine instead of a regex, so this part of the instruction is not followed, even though the behavior appears correct.\\n- Inputs: It reads the provided globals `text` and `k` and does not reassign them. It accesses them via `globals().get`, which still adheres to the constraint of not reassigning.\\n- Sorting requirement: It computes counts, then selects Top-K using `heapq.nsmallest` with key `(-count, token)`. This yields items in order of count descending, then token ascending, satisfying the specified sort key.\\n- Output shape and length: The result is a list of `(token, count)` tuples with length `min(k, number of unique tokens)` by applying `kk = min(kk, len(counts))`. If there are fewer unique tokens than `k`, it limits accordingly.\\n- Edge cases: If `k` is non-integer, it coerces to int inside a try/except; if the coercion fails or `kk <= 0`, it returns `[]`, satisfying the requirement that `k <= 0` yields an empty list. If there are no tokens, it also returns `[]`, which aligns with `min(k, 0) = 0`.\\n- No disallowed operations: It does not use input(), file I/O, or network access. It does not print anything, which is allowed as printing is optional and restricted to only printing `top_k` if done.\\n- Output contract: At the end, `top_k` is defined exactly as a list of `(token, count)` tuples, ordered correctly.\\n\\nSummary: All functional requirements are met, but the specific instruction to use a regex for tokenization is not followed.\",\n \"code_quality\": \"Code quality assessment:\\n- Correctness: The tokenization logic correctly forms tokens from ASCII letters and digits after lowercasing. Counting and Top-K selection are correct and stable under the specified tie-breaker. No reliance on Counter.tie-ordering.\\n- Efficiency: Single pass counting is O(n). Using `heapq.nsmallest` provides O(m log k) selection, efficient for large m with small k. It also returns results sorted by the key, meeting order requirements.\\n- Readability and structure: Clear helper functions with type hints, descriptive comments, and straightforward logic. The use of `_iter_tokens` as a generator is clean. Variable names are clear. Minimal, focused code.\\n- Maintainability and best practices: Sensible separation of concerns, no side effects, and proper handling of edge cases. Using `globals().get` is slightly unconventional but harmless in this context. Lack of regex conflicts with the explicit instruction but does not harm code quality per se.\\n\\nNo bugs or inefficiencies identified beyond not using regex as specified by the task.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally correct, efficient, and clean implementation that meets all requirements except the explicit instruction to use a regex for tokenization. Consider switching tokenization to a regex to fully comply with the task specification.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step adherence assessment:\n- Tokenization rules: The code lowercases characters and treats ASCII letters a\u2013z and digits 0\u20139 as token characters; everything else is a separator. This matches the functional definition of tokens. However, the instruction explicitly says to \"use a regex\" for tokenization. The implementation uses a manual state machine instead of a regex, so this part of the instruction is not followed, even though the behavior appears correct.\n- Inputs: It reads the provided globals `text` and `k` and does not reassign them. It accesses them via `globals().get`, which still adheres to the constraint of not reassigning.\n- Sorting requirement: It computes counts, then selects Top-K using `heapq.nsmallest` with key `(-count, token)`. This yields items in order of count descending, then token ascending, satisfying the specified sort key.\n- Output shape and length: The result is a list of `(token, count)` tuples with length `min(k, number of unique tokens)` by applying `kk = min(kk, len(counts))`. If there are fewer unique tokens than `k`, it limits accordingly.\n- Edge cases: If `k` is non-integer, it coerces to int inside a try/except; if the coercion fails or `kk <= 0`, it returns `[]`, satisfying the requirement that `k <= 0` yields an empty list. If there are no tokens, it also returns `[]`, which aligns with `min(k, 0) = 0`.\n- No disallowed operations: It does not use input(), file I/O, or network access. It does not print anything, which is allowed as printing is optional and restricted to only printing `top_k` if done.\n- Output contract: At the end, `top_k` is defined exactly as a list of `(token, count)` tuples, ordered correctly.\n\nSummary: All functional requirements are met, but the specific instruction to use a regex for tokenization is not followed.", + "code_quality": "Code quality assessment:\n- Correctness: The tokenization logic correctly forms tokens from ASCII letters and digits after lowercasing. Counting and Top-K selection are correct and stable under the specified tie-breaker. No reliance on Counter.tie-ordering.\n- Efficiency: Single pass counting is O(n). Using `heapq.nsmallest` provides O(m log k) selection, efficient for large m with small k. It also returns results sorted by the key, meeting order requirements.\n- Readability and structure: Clear helper functions with type hints, descriptive comments, and straightforward logic. The use of `_iter_tokens` as a generator is clean. Variable names are clear. Minimal, focused code.\n- Maintainability and best practices: Sensible separation of concerns, no side effects, and proper handling of edge cases. Using `globals().get` is slightly unconventional but harmless in this context. Lack of regex conflicts with the explicit instruction but does not harm code quality per se.\n\nNo bugs or inefficiencies identified beyond not using regex as specified by the task." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally correct, efficient, and clean implementation that meets all requirements except the explicit instruction to use a regex for tokenization. Consider switching tokenization to a regex to fully comply with the task specification." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_17.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_17.json new file mode 100644 index 0000000000..7076ff6811 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_17.json @@ -0,0 +1,16 @@ +{ + "file": "run_17.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Tokenization: Uses a compiled regex [A-Za-z0-9]+ to extract ASCII alphanumeric sequences and lowercases each token via tok = m.group(0).lower(). This matches the requirement to lowercase and use ASCII [a-z0-9]+ with other characters as separators.\\n- Inputs: Reads from the provided globals text and k without reassigning them. Passes them into a helper function; does not modify the globals.\\n- Requirement 1 (sorting): Uses heapq.nsmallest with key=lambda kv: (-kv[1], kv[0]). Since nsmallest returns results in ascending key order, negating the count achieves count-desc, and kv[0] (token) provides ascending token tiebreak. This exactly matches the specified sort key (-count, token).\\n- Requirement 2 (top_k shape/length): Returns a list of (token, count) tuples from counts.items(). Computes top_n = min(k, len(counts)), so the length is min(k, number of unique tokens). Assigns top_k to this list at the end.\\n- Requirement 3 (edge case k <= 0): Early return [] if k <= 0 (also if k is not int, which is stricter than required but not harmful since the task promises int).\\n- Requirement 4 (no I/O/network): No input(), file I/O, or network usage. Script relies only on provided globals and standard library.\\n- Output contract: Defines top_k at the end as specified. Does not print anything else (printing is optional). Does not rely on Counter.most_common.\\nNo ambiguities appear to be mishandled. All parts of the task are met precisely.\",\n \"code_quality\": \"Assessment:\\n- Correctness: Logic is sound; counting via a dict and selecting top-k with the correct sort key yields the exact required ordering. Handles empty input and k <= 0 correctly.\\n- Efficiency: Single pass tokenization with finditer avoids building intermediate lists. heapq.nsmallest provides O(n log k) behavior for small k, which is efficient. top_n is set to min(k, unique), avoiding over-selection.\\n- Readability/Clarity: Clear variable names, concise comments, and a small helper function improve structure. The compiled regex at module scope is good for performance. Type hints are present for parameters; adding a return type annotation would be a minor improvement.\\n- Maintainability/Style: Pythonic and straightforward. The isinstance(k, int) guard is extra relative to the spec but harmless and may prevent misuse. No unnecessary dependencies or side effects.\\n- No bugs or stylistic issues of concern. Using heapq.nsmallest returns a sorted list according to the key, satisfying the required output order.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all requirements precisely: correct tokenization, ordering by (-count, token), correct length and edge cases, and proper top_k assignment. Code is clean, efficient, and maintainable. Minor optional improvement: add a return type annotation for the helper function.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Tokenization: Uses a compiled regex [A-Za-z0-9]+ to extract ASCII alphanumeric sequences and lowercases each token via tok = m.group(0).lower(). This matches the requirement to lowercase and use ASCII [a-z0-9]+ with other characters as separators.\n- Inputs: Reads from the provided globals text and k without reassigning them. Passes them into a helper function; does not modify the globals.\n- Requirement 1 (sorting): Uses heapq.nsmallest with key=lambda kv: (-kv[1], kv[0]). Since nsmallest returns results in ascending key order, negating the count achieves count-desc, and kv[0] (token) provides ascending token tiebreak. This exactly matches the specified sort key (-count, token).\n- Requirement 2 (top_k shape/length): Returns a list of (token, count) tuples from counts.items(). Computes top_n = min(k, len(counts)), so the length is min(k, number of unique tokens). Assigns top_k to this list at the end.\n- Requirement 3 (edge case k <= 0): Early return [] if k <= 0 (also if k is not int, which is stricter than required but not harmful since the task promises int).\n- Requirement 4 (no I/O/network): No input(), file I/O, or network usage. Script relies only on provided globals and standard library.\n- Output contract: Defines top_k at the end as specified. Does not print anything else (printing is optional). Does not rely on Counter.most_common.\nNo ambiguities appear to be mishandled. All parts of the task are met precisely.", + "code_quality": "Assessment:\n- Correctness: Logic is sound; counting via a dict and selecting top-k with the correct sort key yields the exact required ordering. Handles empty input and k <= 0 correctly.\n- Efficiency: Single pass tokenization with finditer avoids building intermediate lists. heapq.nsmallest provides O(n log k) behavior for small k, which is efficient. top_n is set to min(k, unique), avoiding over-selection.\n- Readability/Clarity: Clear variable names, concise comments, and a small helper function improve structure. The compiled regex at module scope is good for performance. Type hints are present for parameters; adding a return type annotation would be a minor improvement.\n- Maintainability/Style: Pythonic and straightforward. The isinstance(k, int) guard is extra relative to the spec but harmless and may prevent misuse. No unnecessary dependencies or side effects.\n- No bugs or stylistic issues of concern. Using heapq.nsmallest returns a sorted list according to the key, satisfying the required output order." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all requirements precisely: correct tokenization, ordering by (-count, token), correct length and edge cases, and proper top_k assignment. Code is clean, efficient, and maintainable. Minor optional improvement: add a return type annotation for the helper function." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_18.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_18.json new file mode 100644 index 0000000000..075b175776 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_18.json @@ -0,0 +1,16 @@ +{ + "file": "run_18.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against the instructions:\\n- Tokenization: The generator _iter_tokens_ascii_lower uses re.finditer(r'[A-Za-z0-9]+', s) and lower() on each match, which exactly matches the requirement: lowercase tokens; tokens are ASCII [a-z0-9]+ sequences; all other characters are separators.\\n- Inputs: The code consumes the provided globals text and k only at the final assignment top_k = _compute_top_k(text, k) and does not reassign them.\\n- Requirement 1 (sorting): In _compute_top_k, it selects the top k using heapq.nsmallest with key=lambda it: (-it[1], it[0]). Because the key sorts by negative count first (thus count descending) and then by token ascending, and nsmallest returns results in sorted order, the resulting list is correctly sorted by count desc, then token asc.\\n- Requirement 2 (shape and length): It builds a list of (token, count) tuples from counts and computes kk as min(k, number of unique tokens) via kk = k if k < len(counts) else len(counts). The returned list length is kk, satisfying the length requirement. The elements are 2-tuples (token, count).\\n- Requirement 3 (edge case k <= 0): Early return [] if k <= 0 satisfies this. Empty or no-token input also returns [] via the early checks.\\n- Requirement 4 (no I/O): The code performs no input(), file I/O, or network access; it runs purely on provided globals.\\n- Output contract: At the end, top_k is defined as the computed list with the exact specified ordering and length. It does not print, which is allowed (printing is optional). \\n- Tie-ordering note: It does not rely on Counter.most_common; it implements the specified sort explicitly.\\nOverall, the code fully adheres to all specified instructions and edge cases.\",\n \"code_quality\": \"Clarity and correctness: Functions are clearly named and commented; logic is correct. The regex tokenization and counting are straightforward and correct.\\nEfficiency: Using heapq.nsmallest with key=(-count, token) is efficient (O(n log k)) and appropriate for Top-K selection. Streaming tokenization avoids holding intermediate lists.\\nReadability and style: Generally good. Minor nitpicks:\\n- kk could be named more descriptively (e.g., limit = min(k, len(counts))). Also, using min(k, len(counts)) would be clearer than the conditional expression.\\n- The list comprehension [(t, c) for t, c in nsmallest(...)] is redundant since nsmallest already returns tuples of (token, count); it can be returned directly.\\n- Optional micro-optimization: precompile the regex pattern if called frequently.\\nNo bugs or maintainability issues identified beyond these small style points.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 4,\n \"comments\": \"Excellent adherence: correct tokenization, sorting, edge-case handling, and output contract. Code is clean and efficient. Minor style improvements possible (use min(), avoid redundant list comprehension, clearer variable name).\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against the instructions:\n- Tokenization: The generator _iter_tokens_ascii_lower uses re.finditer(r'[A-Za-z0-9]+', s) and lower() on each match, which exactly matches the requirement: lowercase tokens; tokens are ASCII [a-z0-9]+ sequences; all other characters are separators.\n- Inputs: The code consumes the provided globals text and k only at the final assignment top_k = _compute_top_k(text, k) and does not reassign them.\n- Requirement 1 (sorting): In _compute_top_k, it selects the top k using heapq.nsmallest with key=lambda it: (-it[1], it[0]). Because the key sorts by negative count first (thus count descending) and then by token ascending, and nsmallest returns results in sorted order, the resulting list is correctly sorted by count desc, then token asc.\n- Requirement 2 (shape and length): It builds a list of (token, count) tuples from counts and computes kk as min(k, number of unique tokens) via kk = k if k < len(counts) else len(counts). The returned list length is kk, satisfying the length requirement. The elements are 2-tuples (token, count).\n- Requirement 3 (edge case k <= 0): Early return [] if k <= 0 satisfies this. Empty or no-token input also returns [] via the early checks.\n- Requirement 4 (no I/O): The code performs no input(), file I/O, or network access; it runs purely on provided globals.\n- Output contract: At the end, top_k is defined as the computed list with the exact specified ordering and length. It does not print, which is allowed (printing is optional). \n- Tie-ordering note: It does not rely on Counter.most_common; it implements the specified sort explicitly.\nOverall, the code fully adheres to all specified instructions and edge cases.", + "code_quality": "Clarity and correctness: Functions are clearly named and commented; logic is correct. The regex tokenization and counting are straightforward and correct.\nEfficiency: Using heapq.nsmallest with key=(-count, token) is efficient (O(n log k)) and appropriate for Top-K selection. Streaming tokenization avoids holding intermediate lists.\nReadability and style: Generally good. Minor nitpicks:\n- kk could be named more descriptively (e.g., limit = min(k, len(counts))). Also, using min(k, len(counts)) would be clearer than the conditional expression.\n- The list comprehension [(t, c) for t, c in nsmallest(...)] is redundant since nsmallest already returns tuples of (token, count); it can be returned directly.\n- Optional micro-optimization: precompile the regex pattern if called frequently.\nNo bugs or maintainability issues identified beyond these small style points." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 4, + "comments": "Excellent adherence: correct tokenization, sorting, edge-case handling, and output contract. Code is clean and efficient. Minor style improvements possible (use min(), avoid redundant list comprehension, clearer variable name)." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_19.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_19.json new file mode 100644 index 0000000000..765c001aae --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_19.json @@ -0,0 +1,16 @@ +{ + "file": "run_19.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: The helper _iter_tokens lowercases the input and uses a regex [a-z0-9]+ to extract ASCII alphanumeric sequences, treating all other characters as separators. This matches the specification.\\n- Inputs: The code uses the provided globals text and k without reassigning them. Access is via top_k = top_k_tokens(text, k) inside a try block. No I/O or network is used.\\n- Requirements:\\n 1) Sorting by count desc, then token asc is implemented by using heapq.nsmallest with key lambda item: (-item[1], item[0]). This produces the required order (largest counts first; alphabetical ties).\\n 2) The function returns a list of (token, count) tuples. Using nsmallest(k, ...) ensures the length is min(k, number of unique tokens). counts.items() yields (token, count) pairs, fulfilling the tuple structure.\\n 3) Edge case k <= 0: top_k_tokens returns [], satisfying the requirement.\\n 4) The script runs as-is given globals text (str) and k (int). It does not use input(), file I/O, or network.\\n- Output contract: top_k is defined at the end. With valid inputs, top_k will be exactly the computed Top-K as described. The broad try/except sets top_k = [] on any exception; while unnecessary given the guaranteed inputs, it does not affect correctness under the stated conditions.\\n- Additional notes: The code does not rely on Counter.most_common and explicitly implements the specified sort key.\\n- Minor nit: The early return if not counts is redundant because nsmallest would already return []. This does not affect adherence.\",\n \"code_quality\": \"- Correctness: Logic is correct for counting, tokenization, and ordering. Tie-breaking is implemented correctly.\\n- Clarity/Readability: Clear function and variable names; annotations provided; concise comments explain intent. Helper _iter_tokens isolates tokenization.\\n- Efficiency: Uses a single pass to count tokens and heapq.nsmallest for top-k selection (O(U log k)), which is efficient. Compiled regex is reused via default parameter to avoid recompilation overhead.\\n- Maintainability/Structure: Separation of concerns (_iter_tokens vs. top_k_tokens). Minimal, clean imports. No unnecessary dependencies.\\n- Robustness: Type checks guard against incorrect input types. The broad try/except around the top_k assignment could mask unexpected errors; while harmless here, narrowing the exception or omitting the try in trusted environments would be cleaner.\\n- Minor style note: The explicit if not counts: return [] is unnecessary but harmless.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements with correct tokenization, ordering, and edge-case handling. Code is clear and efficient. The broad try/except and a redundant empty-check are minor nits but do not impact correctness.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: The helper _iter_tokens lowercases the input and uses a regex [a-z0-9]+ to extract ASCII alphanumeric sequences, treating all other characters as separators. This matches the specification.\n- Inputs: The code uses the provided globals text and k without reassigning them. Access is via top_k = top_k_tokens(text, k) inside a try block. No I/O or network is used.\n- Requirements:\n 1) Sorting by count desc, then token asc is implemented by using heapq.nsmallest with key lambda item: (-item[1], item[0]). This produces the required order (largest counts first; alphabetical ties).\n 2) The function returns a list of (token, count) tuples. Using nsmallest(k, ...) ensures the length is min(k, number of unique tokens). counts.items() yields (token, count) pairs, fulfilling the tuple structure.\n 3) Edge case k <= 0: top_k_tokens returns [], satisfying the requirement.\n 4) The script runs as-is given globals text (str) and k (int). It does not use input(), file I/O, or network.\n- Output contract: top_k is defined at the end. With valid inputs, top_k will be exactly the computed Top-K as described. The broad try/except sets top_k = [] on any exception; while unnecessary given the guaranteed inputs, it does not affect correctness under the stated conditions.\n- Additional notes: The code does not rely on Counter.most_common and explicitly implements the specified sort key.\n- Minor nit: The early return if not counts is redundant because nsmallest would already return []. This does not affect adherence.", + "code_quality": "- Correctness: Logic is correct for counting, tokenization, and ordering. Tie-breaking is implemented correctly.\n- Clarity/Readability: Clear function and variable names; annotations provided; concise comments explain intent. Helper _iter_tokens isolates tokenization.\n- Efficiency: Uses a single pass to count tokens and heapq.nsmallest for top-k selection (O(U log k)), which is efficient. Compiled regex is reused via default parameter to avoid recompilation overhead.\n- Maintainability/Structure: Separation of concerns (_iter_tokens vs. top_k_tokens). Minimal, clean imports. No unnecessary dependencies.\n- Robustness: Type checks guard against incorrect input types. The broad try/except around the top_k assignment could mask unexpected errors; while harmless here, narrowing the exception or omitting the try in trusted environments would be cleaner.\n- Minor style note: The explicit if not counts: return [] is unnecessary but harmless." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements with correct tokenization, ordering, and edge-case handling. Code is clear and efficient. The broad try/except and a redundant empty-check are minor nits but do not impact correctness." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_20.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_20.json new file mode 100644 index 0000000000..bff31a7026 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_20.json @@ -0,0 +1,16 @@ +{ + "file": "run_20.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Overall goal: The code computes the Top-K most frequent tokens from the given text and assigns the result to the global top_k. This is achieved via _count_tokens_ascii and _top_k_from_counts, and the final line sets top_k appropriately.\\n- Tokenization requirements:\\n - Lowercasing: Implemented by mapping 'A'-'Z' to lowercase during scanning.\\n - Tokens are ASCII [a-z0-9]+; others are separators: Implemented via an explicit character-class check (ord ranges) and committing on non-matching chars. Behavior matches the spec.\\n - \\\"Use a regex\\\": Not followed. The implementation is manual and does not use a regex as explicitly instructed. Functionally correct but deviates from the stated method.\\n- Inputs: text (str) and k (int) are assumed to be provided and are not reassigned. Compliance met.\\n- Requirements:\\n 1) Sorting by count desc, then token asc: Implemented using heapq.nsmallest with key=(-count, token), which yields the correct order.\\n 2) top_k is a list of (token, count) tuples of length min(k, unique tokens): _top_k_from_counts returns such a list; nsmallest ensures length <= k; when k exceeds unique tokens, it returns all items.\\n 3) Edge case k <= 0: Returns [] explicitly; compliant.\\n 4) No I/O or network: No input(), file I/O, or network calls present.\\n- Output contract: top_k is defined at the end as required; no extraneous prints. Does not rely on Counter.most_common tie ordering.\\n\\nSummary: All functional requirements are met and edge cases handled. The sole deviation is not using a regex for tokenization, which was explicitly requested.\",\n \"code_quality\": \"- Clarity and correctness: Functions are well-named and correctly implement the required behavior. The tokenization logic is accurate for ASCII [a-z0-9]+.\\n- Efficiency: Single-pass tokenizer; heapq.nsmallest yields O(n log k) selection, which is efficient for large inputs.\\n- Readability/Maintainability: Use of ord-range checks and micro-optimizations (local variable bindings like append/get) slightly reduce readability compared to a straightforward regex approach, which the spec suggested. Still, code is concise and understandable.\\n- Best practices: Avoids reliance on Counter.most_common tie-breaking, uses deterministic sorting criteria. Type hints are minimal but acceptable. No obvious bugs or edge-case omissions.\\n\\nNote: While not using regex violates the instruction, from a code-quality perspective the implementation is robust and efficient.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 4,\n \"comments\": \"Functionally correct and efficient with proper sorting, edge-case handling, and output structure. The main shortcoming is not using a regex for tokenization as explicitly instructed; minor readability trade-offs due to manual ASCII handling and micro-optimizations.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Overall goal: The code computes the Top-K most frequent tokens from the given text and assigns the result to the global top_k. This is achieved via _count_tokens_ascii and _top_k_from_counts, and the final line sets top_k appropriately.\n- Tokenization requirements:\n - Lowercasing: Implemented by mapping 'A'-'Z' to lowercase during scanning.\n - Tokens are ASCII [a-z0-9]+; others are separators: Implemented via an explicit character-class check (ord ranges) and committing on non-matching chars. Behavior matches the spec.\n - \"Use a regex\": Not followed. The implementation is manual and does not use a regex as explicitly instructed. Functionally correct but deviates from the stated method.\n- Inputs: text (str) and k (int) are assumed to be provided and are not reassigned. Compliance met.\n- Requirements:\n 1) Sorting by count desc, then token asc: Implemented using heapq.nsmallest with key=(-count, token), which yields the correct order.\n 2) top_k is a list of (token, count) tuples of length min(k, unique tokens): _top_k_from_counts returns such a list; nsmallest ensures length <= k; when k exceeds unique tokens, it returns all items.\n 3) Edge case k <= 0: Returns [] explicitly; compliant.\n 4) No I/O or network: No input(), file I/O, or network calls present.\n- Output contract: top_k is defined at the end as required; no extraneous prints. Does not rely on Counter.most_common tie ordering.\n\nSummary: All functional requirements are met and edge cases handled. The sole deviation is not using a regex for tokenization, which was explicitly requested.", + "code_quality": "- Clarity and correctness: Functions are well-named and correctly implement the required behavior. The tokenization logic is accurate for ASCII [a-z0-9]+.\n- Efficiency: Single-pass tokenizer; heapq.nsmallest yields O(n log k) selection, which is efficient for large inputs.\n- Readability/Maintainability: Use of ord-range checks and micro-optimizations (local variable bindings like append/get) slightly reduce readability compared to a straightforward regex approach, which the spec suggested. Still, code is concise and understandable.\n- Best practices: Avoids reliance on Counter.most_common tie-breaking, uses deterministic sorting criteria. Type hints are minimal but acceptable. No obvious bugs or edge-case omissions.\n\nNote: While not using regex violates the instruction, from a code-quality perspective the implementation is robust and efficient." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 4, + "comments": "Functionally correct and efficient with proper sorting, edge-case handling, and output structure. The main shortcoming is not using a regex for tokenization as explicitly instructed; minor readability trade-offs due to manual ASCII handling and micro-optimizations." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_21.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_21.json new file mode 100644 index 0000000000..29f7e1e5e8 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_21.json @@ -0,0 +1,16 @@ +{ + "file": "run_21.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization requirements:\\n - Lowercasing: Tokens are converted to lowercase via m.group(0).lower(). Satisfied.\\n - Regex with ASCII [a-z0-9]+ and non-matching chars as separators: Uses re.compile(r\\\"[a-z0-9]+\\\", flags=re.ASCII | re.IGNORECASE) and iterates over matches. Satisfied.\\n- Inputs and globals:\\n - Uses provided globals text and k only to compute top_k at the end; does not reassign them. Satisfied.\\n- Sorting and Top-K:\\n - Sort key specified as (-count, token): Achieved via heapq.nsmallest with key=lambda t: (-t[1], t[0]). This returns items sorted by count desc, token asc. Satisfied.\\n - Returns list of (token, count) tuples with length min(k, unique tokens): nsmallest(k_int, counts.items(), ...) ensures length behavior. Satisfied.\\n- Edge cases:\\n - k <= 0 returns []: Explicitly handled. Satisfied.\\n - No tokens in text returns []: Explicitly handled via if not counts. Satisfied.\\n- Constraints:\\n - No input(), file I/O, or network access. Satisfied.\\n - Does not rely on Counter.most_common tie ordering. Satisfied.\\n- Output contract:\\n - Defines top_k at module end as compute_top_k(text, k). Satisfied.\\n- Optional printing: Not used, which is allowed.\\nOverall, the code fully complies with all specified requirements.\",\n \"code_quality\": \"- Correctness: Logic matches the spec exactly; tie-breaking and ordering are correct. No reliance on Counter.most_common.\\n- Clarity/Readability: Clear function and variable names; concise code; helpful comment on sorting intent.\\n- Efficiency: Uses Counter for counting and heapq.nsmallest for O(n log k) selection, which is efficient for large vocabularies and small k. Avoids full sort when unnecessary.\\n- Maintainability/Structure: Modularized with _iter_tokens and compute_top_k. Straightforward flow and minimal branching.\\n- Best practices: Regex uses re.ASCII to enforce ASCII semantics; tokens are explicitly lowercased. No side effects beyond defining top_k.\\n- Minor nitpicks (non-blocking):\\n - The regex could be compiled once at module scope to avoid recompilation per call to _iter_tokens, though impact is minor.\\n - The early return if not counts is redundant because nsmallest would return [] anyway, but it is explicit and fine.\\nOverall, code quality is high: clear, correct, and efficient.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Fully meets all task requirements with clean, efficient, and readable code. Minor micro-optimizations are possible but unnecessary.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization requirements:\n - Lowercasing: Tokens are converted to lowercase via m.group(0).lower(). Satisfied.\n - Regex with ASCII [a-z0-9]+ and non-matching chars as separators: Uses re.compile(r\"[a-z0-9]+\", flags=re.ASCII | re.IGNORECASE) and iterates over matches. Satisfied.\n- Inputs and globals:\n - Uses provided globals text and k only to compute top_k at the end; does not reassign them. Satisfied.\n- Sorting and Top-K:\n - Sort key specified as (-count, token): Achieved via heapq.nsmallest with key=lambda t: (-t[1], t[0]). This returns items sorted by count desc, token asc. Satisfied.\n - Returns list of (token, count) tuples with length min(k, unique tokens): nsmallest(k_int, counts.items(), ...) ensures length behavior. Satisfied.\n- Edge cases:\n - k <= 0 returns []: Explicitly handled. Satisfied.\n - No tokens in text returns []: Explicitly handled via if not counts. Satisfied.\n- Constraints:\n - No input(), file I/O, or network access. Satisfied.\n - Does not rely on Counter.most_common tie ordering. Satisfied.\n- Output contract:\n - Defines top_k at module end as compute_top_k(text, k). Satisfied.\n- Optional printing: Not used, which is allowed.\nOverall, the code fully complies with all specified requirements.", + "code_quality": "- Correctness: Logic matches the spec exactly; tie-breaking and ordering are correct. No reliance on Counter.most_common.\n- Clarity/Readability: Clear function and variable names; concise code; helpful comment on sorting intent.\n- Efficiency: Uses Counter for counting and heapq.nsmallest for O(n log k) selection, which is efficient for large vocabularies and small k. Avoids full sort when unnecessary.\n- Maintainability/Structure: Modularized with _iter_tokens and compute_top_k. Straightforward flow and minimal branching.\n- Best practices: Regex uses re.ASCII to enforce ASCII semantics; tokens are explicitly lowercased. No side effects beyond defining top_k.\n- Minor nitpicks (non-blocking):\n - The regex could be compiled once at module scope to avoid recompilation per call to _iter_tokens, though impact is minor.\n - The early return if not counts is redundant because nsmallest would return [] anyway, but it is explicit and fine.\nOverall, code quality is high: clear, correct, and efficient." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Fully meets all task requirements with clean, efficient, and readable code. Minor micro-optimizations are possible but unnecessary." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_22.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_22.json new file mode 100644 index 0000000000..8a3f088726 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_22.json @@ -0,0 +1,16 @@ +{ + "file": "run_22.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step assessment:\\n- Tokenization requirements:\\n - Lowercasing: The tokenizer lowercases each character (c = ch.lower()). \u2714\ufe0f\\n - ASCII [a-z0-9]+ sequences with other chars as separators: Implemented via explicit character checks ('a' <= c <= 'z' or '0' <= c <= '9') and flushing the buffer on any other character. This matches the specified token definition. \u2714\ufe0f\\n - Note: The instruction suggested using a regex; the solution does not use regex but achieves identical behavior. This is a minor deviation in method, not in result. \u26a0\ufe0f (minor)\\n- Inputs: The code reads the provided globals text and k without reassigning them. It copies them to local variables _text and _k (with int(k) coercion). It does not mutate the globals. \u2714\ufe0f\\n- Requirement 1 (sorting): Sorts by count descending and token ascending using key=lambda kv: (-kv[1], kv[0]). \u2714\ufe0f\\n- Requirement 2 (output structure and length): Returns a list of (token, count) tuples and slices to min(k, len(items)). \u2714\ufe0f\\n- Requirement 3 (k <= 0): _top_k_tokens returns [] when k <= 0, leading to top_k = []. \u2714\ufe0f\\n- Requirement 4 (no I/O/network): No input(), file I/O, or network access used. \u2714\ufe0f\\n- Output contract: top_k is always defined at the end. If globals are missing/invalid, it safely sets top_k = []. When provided, it computes as specified. No extraneous printing. \u2714\ufe0f\\n- Note on tie-ordering: Does not rely on Counter; explicitly sorts with the specified key. \u2714\ufe0f\\nOverall: Functional adherence is excellent; the only minor deviation is not using a regex as hinted by the instructions.\",\n \"code_quality\": \"This is code; assessment follows:\\n- Correctness: The counting and sorting logic is correct and adheres to the spec, including edge cases and tie-breaking. \u2714\ufe0f\\n- Clarity/Readability: Functions are small and well-named with brief comments. Type hints are used for parameters. The micro-optimizations (binding get = counts.get and join = ''.join) are acceptable but slightly reduce readability. Minor. \u26a0\ufe0f\\n- Efficiency: Streaming tokenizer with a buffer is efficient. Counting is O(n); sorting unique tokens is O(m log m). \u2714\ufe0f\\n- Structure/Maintainability: Separation into _iter_tokens and _top_k_tokens is clean. Top-level try/except ensures top_k is always defined. \u2714\ufe0f\\n- Best practices: Avoids reliance on Counter.most_common ordering as requested. Does not perform I/O. Minor issue: Unused import (import sys). \u26a0\ufe0f\\n- Robustness: Gracefully handles missing globals by setting top_k = []. Casting k via int(k) is fine given the spec says k is int; it doesn\u2019t reassign the global. \u2714\ufe0f\\nNo bugs detected; only minor stylistic issues (unused import, micro-optimizations affecting readability) and not following the suggested regex approach.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 4,\n \"comments\": \"Meets all functional requirements with correct sorting, tokenization behavior, and edge-case handling. Minor deviation: did not use regex as suggested. Code quality is strong; small issues include an unused import and slight readability trade-offs from micro-optimizations.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step assessment:\n- Tokenization requirements:\n - Lowercasing: The tokenizer lowercases each character (c = ch.lower()). \u2714\ufe0f\n - ASCII [a-z0-9]+ sequences with other chars as separators: Implemented via explicit character checks ('a' <= c <= 'z' or '0' <= c <= '9') and flushing the buffer on any other character. This matches the specified token definition. \u2714\ufe0f\n - Note: The instruction suggested using a regex; the solution does not use regex but achieves identical behavior. This is a minor deviation in method, not in result. \u26a0\ufe0f (minor)\n- Inputs: The code reads the provided globals text and k without reassigning them. It copies them to local variables _text and _k (with int(k) coercion). It does not mutate the globals. \u2714\ufe0f\n- Requirement 1 (sorting): Sorts by count descending and token ascending using key=lambda kv: (-kv[1], kv[0]). \u2714\ufe0f\n- Requirement 2 (output structure and length): Returns a list of (token, count) tuples and slices to min(k, len(items)). \u2714\ufe0f\n- Requirement 3 (k <= 0): _top_k_tokens returns [] when k <= 0, leading to top_k = []. \u2714\ufe0f\n- Requirement 4 (no I/O/network): No input(), file I/O, or network access used. \u2714\ufe0f\n- Output contract: top_k is always defined at the end. If globals are missing/invalid, it safely sets top_k = []. When provided, it computes as specified. No extraneous printing. \u2714\ufe0f\n- Note on tie-ordering: Does not rely on Counter; explicitly sorts with the specified key. \u2714\ufe0f\nOverall: Functional adherence is excellent; the only minor deviation is not using a regex as hinted by the instructions.", + "code_quality": "This is code; assessment follows:\n- Correctness: The counting and sorting logic is correct and adheres to the spec, including edge cases and tie-breaking. \u2714\ufe0f\n- Clarity/Readability: Functions are small and well-named with brief comments. Type hints are used for parameters. The micro-optimizations (binding get = counts.get and join = ''.join) are acceptable but slightly reduce readability. Minor. \u26a0\ufe0f\n- Efficiency: Streaming tokenizer with a buffer is efficient. Counting is O(n); sorting unique tokens is O(m log m). \u2714\ufe0f\n- Structure/Maintainability: Separation into _iter_tokens and _top_k_tokens is clean. Top-level try/except ensures top_k is always defined. \u2714\ufe0f\n- Best practices: Avoids reliance on Counter.most_common ordering as requested. Does not perform I/O. Minor issue: Unused import (import sys). \u26a0\ufe0f\n- Robustness: Gracefully handles missing globals by setting top_k = []. Casting k via int(k) is fine given the spec says k is int; it doesn\u2019t reassign the global. \u2714\ufe0f\nNo bugs detected; only minor stylistic issues (unused import, micro-optimizations affecting readability) and not following the suggested regex approach." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 4, + "comments": "Meets all functional requirements with correct sorting, tokenization behavior, and edge-case handling. Minor deviation: did not use regex as suggested. Code quality is strong; small issues include an unused import and slight readability trade-offs from micro-optimizations." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_23.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_23.json new file mode 100644 index 0000000000..4991443ea9 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_23.json @@ -0,0 +1,16 @@ +{ + "file": "run_23.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Reads the task carefully: compute Top-K most frequent tokens with specific tokenization and sorting, using provided globals and setting top_k accordingly.\\n- Tokenization: The code implements ASCII [a-z0-9]+ tokenization and lowercasing by streaming through characters. Uppercase A-Z are lowered by adding 32, digits are kept, and all other characters act as separators. This produces exactly the [a-z0-9]+ tokens required and lowercases them. However, the instructions explicitly say to use a regex for tokenization; this code does not use regex. Functionally correct but deviates from the specified method.\\n- Sorting: Uses heapq.nsmallest with key lambda (-count, token) to obtain items sorted by count descending then token ascending. This matches the required sort order and avoids relying on Counter.most_common tie behavior.\\n- Output format: top_k is set to a list of (token, count) tuples of length min(k, unique tokens). This is satisfied: m = min(k, len(counts)) and the returned list contains (token, count) pairs from counts.items().\\n- Edge cases: If k <= 0, _top_k_tokens returns []. Also handles empty text (counts empty -> []). Both align with the requirements.\\n- Inputs: Does not reassign globals text or k. It copies them into _text and _k with fallbacks if not defined and safely coerces _k to int. This respects the constraint not to reassign provided globals and allows the script to run as-is.\\n- No I/O: No input(), file I/O, or network access. No printing except none, which is allowed.\\n- Output contract: top_k is defined at the end as specified, no extra output.\\n- Ambiguities: None significant; the only deviation is not using a regex despite the instruction to do so.\",\n \"code_quality\": \"- Clarity/structure: Code is modular with helper functions (_iter_tokens_ascii_lower and _top_k_tokens). Comments explain key steps. Variable names are clear. The underscore prefixes indicate internal helpers.\\n- Correctness: Tokenization correctly yields lowercase ASCII alnum tokens and uses separators for all other characters. Counting and Top-K selection are correct. Sorting key enforces count desc then token asc.\\n- Efficiency: Streaming tokenizer avoids building full lowercased strings; using a buffer with method binding is efficient. heapq.nsmallest is appropriate for Top-K and returns results in the correct order.\\n- Readability: While using ord/chr and adding 32 is slightly lower-level than using .lower(), it is documented and constrained to ASCII as required. The logic is readable with comments.\\n- Maintainability: Functions are cohesive and small. No global side-effects besides defining top_k. Edge cases are handled explicitly.\\n- Best practices: Avoids reliance on Counter.most_common tie ordering per instruction. One stylistic note: the task asked to use a regex; while not a code quality fault per se, aligning with that would make intent explicit and simpler to verify. Minor micro-optimizations (append binding, manual lowercasing) slightly trade readability for performance but are acceptable and commented.\\n- No apparent bugs or inefficiencies affecting correctness were found.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally meets all requirements, including correct tokenization behavior, sorting, and edge cases, and defines top_k properly without I/O or reassigning globals. The only notable miss is not using a regex as explicitly requested for tokenization.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Reads the task carefully: compute Top-K most frequent tokens with specific tokenization and sorting, using provided globals and setting top_k accordingly.\n- Tokenization: The code implements ASCII [a-z0-9]+ tokenization and lowercasing by streaming through characters. Uppercase A-Z are lowered by adding 32, digits are kept, and all other characters act as separators. This produces exactly the [a-z0-9]+ tokens required and lowercases them. However, the instructions explicitly say to use a regex for tokenization; this code does not use regex. Functionally correct but deviates from the specified method.\n- Sorting: Uses heapq.nsmallest with key lambda (-count, token) to obtain items sorted by count descending then token ascending. This matches the required sort order and avoids relying on Counter.most_common tie behavior.\n- Output format: top_k is set to a list of (token, count) tuples of length min(k, unique tokens). This is satisfied: m = min(k, len(counts)) and the returned list contains (token, count) pairs from counts.items().\n- Edge cases: If k <= 0, _top_k_tokens returns []. Also handles empty text (counts empty -> []). Both align with the requirements.\n- Inputs: Does not reassign globals text or k. It copies them into _text and _k with fallbacks if not defined and safely coerces _k to int. This respects the constraint not to reassign provided globals and allows the script to run as-is.\n- No I/O: No input(), file I/O, or network access. No printing except none, which is allowed.\n- Output contract: top_k is defined at the end as specified, no extra output.\n- Ambiguities: None significant; the only deviation is not using a regex despite the instruction to do so.", + "code_quality": "- Clarity/structure: Code is modular with helper functions (_iter_tokens_ascii_lower and _top_k_tokens). Comments explain key steps. Variable names are clear. The underscore prefixes indicate internal helpers.\n- Correctness: Tokenization correctly yields lowercase ASCII alnum tokens and uses separators for all other characters. Counting and Top-K selection are correct. Sorting key enforces count desc then token asc.\n- Efficiency: Streaming tokenizer avoids building full lowercased strings; using a buffer with method binding is efficient. heapq.nsmallest is appropriate for Top-K and returns results in the correct order.\n- Readability: While using ord/chr and adding 32 is slightly lower-level than using .lower(), it is documented and constrained to ASCII as required. The logic is readable with comments.\n- Maintainability: Functions are cohesive and small. No global side-effects besides defining top_k. Edge cases are handled explicitly.\n- Best practices: Avoids reliance on Counter.most_common tie ordering per instruction. One stylistic note: the task asked to use a regex; while not a code quality fault per se, aligning with that would make intent explicit and simpler to verify. Minor micro-optimizations (append binding, manual lowercasing) slightly trade readability for performance but are acceptable and commented.\n- No apparent bugs or inefficiencies affecting correctness were found." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally meets all requirements, including correct tokenization behavior, sorting, and edge cases, and defines top_k properly without I/O or reassigning globals. The only notable miss is not using a regex as explicitly requested for tokenization." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_24.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_24.json new file mode 100644 index 0000000000..969564867e --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_24.json @@ -0,0 +1,16 @@ +{ + "file": "run_24.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step adherence analysis:\\n- Core task: Computes the Top-K most frequent tokens from a given text. The function compute_top_k counts token frequencies and returns the top K after sorting. This matches the main requirement.\\n- Tokenization: Implements ASCII [a-z0-9]+ tokenization with lowercase by manually scanning characters, converting 'A'-'Z' to lowercase, accepting '0'-'9' and 'a'-'z', and treating all other characters as separators. This behavior matches the specified token set and separators. However, the instruction explicitly says to use a regex; the implementation does not use regex, which is a minor deviation from the instruction method, though the output semantics are correct.\\n- Inputs: Uses provided globals text and k without reassigning them. It copies them to _text and _k and never reassigns text or k, complying with the instruction.\\n- Requirement 1 (sorting): Sorts by (-count, token) via key=lambda it: (-it[1], it[0]). This exactly implements count descending then token ascending.\\n- Requirement 2 (top_k list and length): Returns a list of (token, count) tuples, and slices to min(k, number of unique tokens). Satisfied.\\n- Requirement 3 (k <= 0): compute_top_k returns [] when k <= 0, and top_k is set accordingly. Satisfied.\\n- Requirement 4 (no I/O/network): No input(), file I/O, or network calls. Satisfied.\\n- Output contract: Ensures top_k is defined at the end. If text/k are not defined, it sets top_k to an empty list; otherwise computes from compute_top_k. No extraneous prints. Satisfied.\\n- Note about tie ordering: Does not rely on Counter.most_common and explicitly implements the specified sort. Satisfied.\\n- Ambiguities/choices: The try/except handling for missing globals is conservative; the task states globals are provided, but this fallback does not violate any requirement and still leaves top_k defined. The only notable deviation is not using regex for tokenization as suggested.\",\n \"code_quality\": \"Code quality assessment:\\n- Correctness: The tokenization logic correctly identifies ASCII letters/digits, lowercases uppercase ASCII, and splits on non-matching characters. Counting and sorting are correctly implemented. Edge cases (k <= 0, empty or no-token text, trailing buffered token) are handled.\\n- Clarity/readability: Overall structure is clear with a dedicated function and top-level glue. Micro-optimizations (binding append and get, using ord ranges) slightly reduce readability compared to a straightforward approach, especially since the task suggested using a regex. A regex like re.findall(r'[a-z0-9]+', text.lower()) would be more concise and maintainable.\\n- Efficiency: Single pass tokenization O(n) and sorting O(m log m), where m is the number of unique tokens; acceptable for typical use. For very large m, a heap-based top-k could be more efficient, but not required.\\n- Structure/maintainability: Good use of type hints and separation of concerns. No reliance on unspecified behavior. The try/except for globals is robust though arguably unnecessary per spec.\\n- Best practices: Avoids I/O as required, uses explicit sort key instead of Counter.most_common. Minor stylistic nit: could inline counts.items() into sorted. Using ord arithmetic is correct but less idiomatic than regex given the instruction.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 4,\n \"comments\": \"Meets the task requirements accurately, including correct tokenization semantics, sorting, and edge-case handling. The main shortcoming is not using a regex as instructed for tokenization and some micro-optimizations that reduce readability. Otherwise, the code is correct, efficient, and defines top_k as specified.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step adherence analysis:\n- Core task: Computes the Top-K most frequent tokens from a given text. The function compute_top_k counts token frequencies and returns the top K after sorting. This matches the main requirement.\n- Tokenization: Implements ASCII [a-z0-9]+ tokenization with lowercase by manually scanning characters, converting 'A'-'Z' to lowercase, accepting '0'-'9' and 'a'-'z', and treating all other characters as separators. This behavior matches the specified token set and separators. However, the instruction explicitly says to use a regex; the implementation does not use regex, which is a minor deviation from the instruction method, though the output semantics are correct.\n- Inputs: Uses provided globals text and k without reassigning them. It copies them to _text and _k and never reassigns text or k, complying with the instruction.\n- Requirement 1 (sorting): Sorts by (-count, token) via key=lambda it: (-it[1], it[0]). This exactly implements count descending then token ascending.\n- Requirement 2 (top_k list and length): Returns a list of (token, count) tuples, and slices to min(k, number of unique tokens). Satisfied.\n- Requirement 3 (k <= 0): compute_top_k returns [] when k <= 0, and top_k is set accordingly. Satisfied.\n- Requirement 4 (no I/O/network): No input(), file I/O, or network calls. Satisfied.\n- Output contract: Ensures top_k is defined at the end. If text/k are not defined, it sets top_k to an empty list; otherwise computes from compute_top_k. No extraneous prints. Satisfied.\n- Note about tie ordering: Does not rely on Counter.most_common and explicitly implements the specified sort. Satisfied.\n- Ambiguities/choices: The try/except handling for missing globals is conservative; the task states globals are provided, but this fallback does not violate any requirement and still leaves top_k defined. The only notable deviation is not using regex for tokenization as suggested.", + "code_quality": "Code quality assessment:\n- Correctness: The tokenization logic correctly identifies ASCII letters/digits, lowercases uppercase ASCII, and splits on non-matching characters. Counting and sorting are correctly implemented. Edge cases (k <= 0, empty or no-token text, trailing buffered token) are handled.\n- Clarity/readability: Overall structure is clear with a dedicated function and top-level glue. Micro-optimizations (binding append and get, using ord ranges) slightly reduce readability compared to a straightforward approach, especially since the task suggested using a regex. A regex like re.findall(r'[a-z0-9]+', text.lower()) would be more concise and maintainable.\n- Efficiency: Single pass tokenization O(n) and sorting O(m log m), where m is the number of unique tokens; acceptable for typical use. For very large m, a heap-based top-k could be more efficient, but not required.\n- Structure/maintainability: Good use of type hints and separation of concerns. No reliance on unspecified behavior. The try/except for globals is robust though arguably unnecessary per spec.\n- Best practices: Avoids I/O as required, uses explicit sort key instead of Counter.most_common. Minor stylistic nit: could inline counts.items() into sorted. Using ord arithmetic is correct but less idiomatic than regex given the instruction." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 4, + "comments": "Meets the task requirements accurately, including correct tokenization semantics, sorting, and edge-case handling. The main shortcoming is not using a regex as instructed for tokenization and some micro-optimizations that reduce readability. Otherwise, the code is correct, efficient, and defines top_k as specified." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_25.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_25.json new file mode 100644 index 0000000000..de32fa8108 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_25.json @@ -0,0 +1,16 @@ +{ + "file": "run_25.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Inputs and globals: The code reads the provided globals text and k and does not reassign them. Compliant.\\n- Tokenization: It lowercases ASCII letters on-the-fly and extracts tokens consisting of ASCII [a-z0-9]+ by treating any other character as a separator. This matches the specified token definition semantically. However, the instruction explicitly said to use a regex; the implementation uses a manual scanner instead. Functionally correct but deviates from the stated method.\\n- Requirement 1 (sorting): Uses heapq.nsmallest with key=(-count, token), which yields results sorted by count descending, then token ascending. Compliant and does not rely on Counter.most_common tie ordering.\\n- Requirement 2 (top_k structure and length): Produces a list of (token, count) tuples and clamps k to the number of unique tokens, ensuring length = min(k, unique). Compliant.\\n- Requirement 3 (k <= 0): If _k == 0 (including k <= 0), sets top_k = []. Compliant.\\n- Requirement 4 (no I/O/network): No input(), file I/O, or network access used. Compliant.\\n- Output contract: At the end, top_k is defined exactly as specified; no extra printing is performed, which is allowed. Compliant.\\n- Note on tie ordering: Does not use Counter.most_common; implements the specified sort. Compliant.\\nOverall, all functional requirements are met; the only notable deviation is not using a regex for tokenization as instructed.\",\n \"code_quality\": \"The output is code; assessment follows:\\n- Correctness: The tokenizer correctly builds ASCII [a-z0-9]+ tokens with lowercase ASCII letters, handles separators, and flushes the buffer at boundaries and end-of-input. Sorting logic is correct and deterministic. Edge cases for k (<=0, > unique) are handled.\\n- Efficiency: Single pass tokenization O(n). Top-K selection via heapq.nsmallest is O(m log k), appropriate for large m. Efficient overall.\\n- Clarity/Readability: The manual ASCII lowercase via ord/chr and isascii() is more low-level than necessary; a regex-based approach would be clearer and matches the instruction. Magic numbers (65, 90, 32, etc.) reduce readability despite comments.\\n- Maintainability: The k parsing logic is overengineered given the spec guarantees k is int; excluding bool and accepting numeric strings adds complexity without clear benefit. Function and variables are reasonably named; comments help.\\n- Best practices: Avoids relying on Counter.most_common tie ordering as requested. Type annotation for Counter lacks type parameters but is acceptable. isascii() requires Python 3.7+, which is generally fine but slightly reduces portability.\\nNo functional bugs found; primary nits are stylistic/clarity and unnecessary k coercion.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 4,\n \"comments\": \"Functionally correct and efficient; meets sorting, edge cases, and output requirements. Main deviation: did not use regex for tokenization as instructed. Minor readability issues and unnecessary k coercion complexity.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Inputs and globals: The code reads the provided globals text and k and does not reassign them. Compliant.\n- Tokenization: It lowercases ASCII letters on-the-fly and extracts tokens consisting of ASCII [a-z0-9]+ by treating any other character as a separator. This matches the specified token definition semantically. However, the instruction explicitly said to use a regex; the implementation uses a manual scanner instead. Functionally correct but deviates from the stated method.\n- Requirement 1 (sorting): Uses heapq.nsmallest with key=(-count, token), which yields results sorted by count descending, then token ascending. Compliant and does not rely on Counter.most_common tie ordering.\n- Requirement 2 (top_k structure and length): Produces a list of (token, count) tuples and clamps k to the number of unique tokens, ensuring length = min(k, unique). Compliant.\n- Requirement 3 (k <= 0): If _k == 0 (including k <= 0), sets top_k = []. Compliant.\n- Requirement 4 (no I/O/network): No input(), file I/O, or network access used. Compliant.\n- Output contract: At the end, top_k is defined exactly as specified; no extra printing is performed, which is allowed. Compliant.\n- Note on tie ordering: Does not use Counter.most_common; implements the specified sort. Compliant.\nOverall, all functional requirements are met; the only notable deviation is not using a regex for tokenization as instructed.", + "code_quality": "The output is code; assessment follows:\n- Correctness: The tokenizer correctly builds ASCII [a-z0-9]+ tokens with lowercase ASCII letters, handles separators, and flushes the buffer at boundaries and end-of-input. Sorting logic is correct and deterministic. Edge cases for k (<=0, > unique) are handled.\n- Efficiency: Single pass tokenization O(n). Top-K selection via heapq.nsmallest is O(m log k), appropriate for large m. Efficient overall.\n- Clarity/Readability: The manual ASCII lowercase via ord/chr and isascii() is more low-level than necessary; a regex-based approach would be clearer and matches the instruction. Magic numbers (65, 90, 32, etc.) reduce readability despite comments.\n- Maintainability: The k parsing logic is overengineered given the spec guarantees k is int; excluding bool and accepting numeric strings adds complexity without clear benefit. Function and variables are reasonably named; comments help.\n- Best practices: Avoids relying on Counter.most_common tie ordering as requested. Type annotation for Counter lacks type parameters but is acceptable. isascii() requires Python 3.7+, which is generally fine but slightly reduces portability.\nNo functional bugs found; primary nits are stylistic/clarity and unnecessary k coercion." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 4, + "comments": "Functionally correct and efficient; meets sorting, edge cases, and output requirements. Main deviation: did not use regex for tokenization as instructed. Minor readability issues and unnecessary k coercion complexity." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_26.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_26.json new file mode 100644 index 0000000000..c86f544060 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_26.json @@ -0,0 +1,16 @@ +{ + "file": "run_26.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Reads and follows the core task: computes Top-K most frequent tokens from provided globals text and k, and assigns the result to top_k.\\n- Tokenization: Implements ASCII tokenization for [a-z0-9]+ by treating all non [A-Z a-z 0-9] chars as separators, and lowercasing only A-Z while scanning. This yields exactly the specified token set and casing. However, the instruction explicitly says to use a regex; the solution does not use a regex and instead implements manual scanning. While the result is equivalent, this deviates from the stated method.\\n- Sorting: Uses heapq.nsmallest with key (-count, token) which correctly produces items sorted by count descending then token ascending, as required, and avoids full sort (acceptable and efficient).\\n- Output format: Sets top_k to a list of (token, count) tuples. Length is min(k, number of unique tokens) due to nsmallest behavior. Correct.\\n- Edge cases: If k <= 0, _top_k_from_counts returns []; negative k is clamped to 0. If counts empty, returns []. All match requirement 3.\\n- Inputs handling: Does not reassign text or k. Copies them into _text and _k safely, attempts int conversion of k, and clamps negatives. This is compatible with the contract. The fallback to defaults if globals are missing is harmless given the problem states they are provided.\\n- No I/O or network: No input(), file, or network usage. No printing (which is optional).\\n- Tie ordering: Does not rely on Counter.most_common; implements explicit sort key. Satisfies the note.\\n- Summary: All functional requirements are met; only deviation is not using a regex for tokenization as explicitly requested.\",\n \"code_quality\": \"- Correctness: The tokenization logic matches the ASCII [a-z0-9]+ spec and lowercases A-Z. Counting and Top-K selection are correct. Edge cases handled.\\n- Efficiency: Single pass tokenization; Top-K via heapq.nsmallest gives O(n log k). Efficient.\\n- Readability/Maintainability: The manual ord/chr checks and micro-optimizations (caching append/clear/get) trade clarity for speed. A regex-based tokenizer would be shorter and clearer per the instruction. Nonetheless, the function has a clear comment describing behavior.\\n- Structure: Well-factored into two helper functions. Minimal global interaction. Reasonable variable naming and scoping.\\n- Best practices: Avoids unnecessary full sort; avoids reliance on unspecified tie ordering. Broad exception while casting k is acceptable in this context but could be narrowed. Partial type hints are provided; return types not annotated.\\n- No apparent bugs or edge-case failures identified.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 4,\n \"comments\": \"Functionally correct and efficient implementation that meets sorting, output, and edge-case requirements. The main miss is not using a regex for tokenization as explicitly requested, and the manual ASCII scanning reduces readability compared to a regex-based approach. Otherwise, solid solution.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Reads and follows the core task: computes Top-K most frequent tokens from provided globals text and k, and assigns the result to top_k.\n- Tokenization: Implements ASCII tokenization for [a-z0-9]+ by treating all non [A-Z a-z 0-9] chars as separators, and lowercasing only A-Z while scanning. This yields exactly the specified token set and casing. However, the instruction explicitly says to use a regex; the solution does not use a regex and instead implements manual scanning. While the result is equivalent, this deviates from the stated method.\n- Sorting: Uses heapq.nsmallest with key (-count, token) which correctly produces items sorted by count descending then token ascending, as required, and avoids full sort (acceptable and efficient).\n- Output format: Sets top_k to a list of (token, count) tuples. Length is min(k, number of unique tokens) due to nsmallest behavior. Correct.\n- Edge cases: If k <= 0, _top_k_from_counts returns []; negative k is clamped to 0. If counts empty, returns []. All match requirement 3.\n- Inputs handling: Does not reassign text or k. Copies them into _text and _k safely, attempts int conversion of k, and clamps negatives. This is compatible with the contract. The fallback to defaults if globals are missing is harmless given the problem states they are provided.\n- No I/O or network: No input(), file, or network usage. No printing (which is optional).\n- Tie ordering: Does not rely on Counter.most_common; implements explicit sort key. Satisfies the note.\n- Summary: All functional requirements are met; only deviation is not using a regex for tokenization as explicitly requested.", + "code_quality": "- Correctness: The tokenization logic matches the ASCII [a-z0-9]+ spec and lowercases A-Z. Counting and Top-K selection are correct. Edge cases handled.\n- Efficiency: Single pass tokenization; Top-K via heapq.nsmallest gives O(n log k). Efficient.\n- Readability/Maintainability: The manual ord/chr checks and micro-optimizations (caching append/clear/get) trade clarity for speed. A regex-based tokenizer would be shorter and clearer per the instruction. Nonetheless, the function has a clear comment describing behavior.\n- Structure: Well-factored into two helper functions. Minimal global interaction. Reasonable variable naming and scoping.\n- Best practices: Avoids unnecessary full sort; avoids reliance on unspecified tie ordering. Broad exception while casting k is acceptable in this context but could be narrowed. Partial type hints are provided; return types not annotated.\n- No apparent bugs or edge-case failures identified." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 4, + "comments": "Functionally correct and efficient implementation that meets sorting, output, and edge-case requirements. The main miss is not using a regex for tokenization as explicitly requested, and the manual ASCII scanning reduces readability compared to a regex-based approach. Otherwise, solid solution." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_27.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_27.json new file mode 100644 index 0000000000..59f49fb401 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_27.json @@ -0,0 +1,16 @@ +{ + "file": "run_27.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Interprets task correctly: computes Top-K most frequent tokens from the provided text.\\n- Tokenization:\\n - Produces tokens limited to ASCII [a-z0-9]+ and treats all other chars as separators \u2014 matches the spec behaviorally.\\n - Lowercasing is handled on-the-fly for ASCII uppercase letters; functionally equivalent to lowercasing the text first.\\n - However, the instruction explicitly says to use a regex; the implementation uses a manual scanner, not a regex. This is a deviation from the stated method, even though results match.\\n- Inputs: Does not reassign globals 'text' or 'k'; reads them into _text and _k. Complies.\\n- Requirements:\\n 1) Sorting by count desc, then token asc: Implemented via heapq.nsmallest with key=(-count, token). Correct and does not rely on Counter.most_common.\\n 2) Sets top_k to list of (token, count) tuples with length min(k, unique tokens): heapq.nsmallest returns min(k, n) items. Complies.\\n 3) k <= 0 -> top_k = []: Achieved by coercing invalid/negative k to 0 and using conditional \\\"if _counts and _k\\\". Complies.\\n 4) No input(), file I/O, or network: None used. Complies.\\n- Output contract: Defines top_k exactly as specified; no extraneous output. Complies.\\n- Tie-order note: Does not rely on most_common; uses defined sort key. Complies.\\n\\nSummary: All behavioral requirements met; only notable deviation is not using a regex for tokenization as instructed.\",\n \"code_quality\": \"- Clarity/Readability: Clear structure with a dedicated tokenizer function and comments. The ord()-based ASCII checks are explicit but slightly less readable than a regex; still understandable.\\n- Correctness: Tokenization correctly extracts [a-z0-9]+, lowercases ASCII letters, and treats all else as separators. Sorting is correct for the required order.\\n- Efficiency: Single-pass tokenizer O(n). Top-K via heapq.nsmallest is O(n log k) \u2014 efficient.\\n- Maintainability/Structure: Helper function encapsulates tokenization; avoids side effects on globals by copying into local vars. Variable naming is consistent; comments explain behavior.\\n- Best practices: Avoids reliance on Counter.most_common tie behavior. Handles edge cases for k robustly. Minor nit: type hint could specify Counter[str].\\n\\nNo bugs identified; implementation is robust and efficient. The only stylistic divergence is not using regex (impacts adherence more than code quality).\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Behavior fully meets the requirements, including correct sorting and edge cases. The only shortfall is not using a regex for tokenization as explicitly requested. Code is otherwise clean, efficient, and correct.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Interprets task correctly: computes Top-K most frequent tokens from the provided text.\n- Tokenization:\n - Produces tokens limited to ASCII [a-z0-9]+ and treats all other chars as separators \u2014 matches the spec behaviorally.\n - Lowercasing is handled on-the-fly for ASCII uppercase letters; functionally equivalent to lowercasing the text first.\n - However, the instruction explicitly says to use a regex; the implementation uses a manual scanner, not a regex. This is a deviation from the stated method, even though results match.\n- Inputs: Does not reassign globals 'text' or 'k'; reads them into _text and _k. Complies.\n- Requirements:\n 1) Sorting by count desc, then token asc: Implemented via heapq.nsmallest with key=(-count, token). Correct and does not rely on Counter.most_common.\n 2) Sets top_k to list of (token, count) tuples with length min(k, unique tokens): heapq.nsmallest returns min(k, n) items. Complies.\n 3) k <= 0 -> top_k = []: Achieved by coercing invalid/negative k to 0 and using conditional \"if _counts and _k\". Complies.\n 4) No input(), file I/O, or network: None used. Complies.\n- Output contract: Defines top_k exactly as specified; no extraneous output. Complies.\n- Tie-order note: Does not rely on most_common; uses defined sort key. Complies.\n\nSummary: All behavioral requirements met; only notable deviation is not using a regex for tokenization as instructed.", + "code_quality": "- Clarity/Readability: Clear structure with a dedicated tokenizer function and comments. The ord()-based ASCII checks are explicit but slightly less readable than a regex; still understandable.\n- Correctness: Tokenization correctly extracts [a-z0-9]+, lowercases ASCII letters, and treats all else as separators. Sorting is correct for the required order.\n- Efficiency: Single-pass tokenizer O(n). Top-K via heapq.nsmallest is O(n log k) \u2014 efficient.\n- Maintainability/Structure: Helper function encapsulates tokenization; avoids side effects on globals by copying into local vars. Variable naming is consistent; comments explain behavior.\n- Best practices: Avoids reliance on Counter.most_common tie behavior. Handles edge cases for k robustly. Minor nit: type hint could specify Counter[str].\n\nNo bugs identified; implementation is robust and efficient. The only stylistic divergence is not using regex (impacts adherence more than code quality)." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Behavior fully meets the requirements, including correct sorting and edge cases. The only shortfall is not using a regex for tokenization as explicitly requested. Code is otherwise clean, efficient, and correct." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_28.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_28.json new file mode 100644 index 0000000000..05ee39936c --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_28.json @@ -0,0 +1,16 @@ +{ + "file": "run_28.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against the task:\\n- Tokenization: Uses regex r'[a-z0-9]+' with flags re.ASCII | re.IGNORECASE to find ASCII alphanumeric runs and lowercases each token via m.group(0).lower(). This satisfies \\\"Lowercase the text\\\" (lowercasing per token is equivalent for ASCII) and \\\"Tokens are ASCII [a-z0-9]+ sequences; all other characters are separators (use a regex).\\\"\\n- Inputs/globals: Reads provided globals text and k without reassigning them. Wraps access in try/except NameError to provide a safe fallback; this is extra but does not violate requirements.\\n- Requirement 1 (sorting): Returns heapq.nsmallest(kk, counts.items(), key=lambda item: (-item[1], item[0])). Sorting by the key (-count, token) and taking the n smallest produces a list ordered by count descending, then token ascending, exactly as specified.\\n- Requirement 2 (output shape): Returns a list of (token, count) tuples from counts.items(), with length kk = min(k, number of unique tokens). This matches the contract.\\n- Requirement 3 (edge case k <= 0): Explicitly returns [] when k <= 0; also returns [] if no tokens are found. Correct.\\n- Requirement 4 (no I/O/network): No input(), file I/O, or network calls are used. Script runs as-is with provided globals.\\n- Output contract: top_k is defined at end of execution as specified. No extra prints (printing was optional). Does not rely on Counter.most_common tie ordering.\\nAmbiguities/notes: The initial isinstance(text, str) guard is stricter than required but harmless given the task guarantees. The try/except for missing globals is extra robustness not required by the spec, but it still ensures top_k is defined.\",\n \"code_quality\": \"No code generation issues; assessment of quality:\\n- Clarity/readability: Clear helper functions (_iter_tokens, _top_k_tokens), descriptive names, concise comments explaining intent.\\n- Correctness: Implements the specified tokenization and sorting criteria accurately. Handles edge cases correctly.\\n- Efficiency: Single pass to count tokens (O(n)); uses heapq.nsmallest for Top-K selection (O(U log K), U=unique tokens). Efficient and scalable.\\n- Maintainability/structure: Modular design with small focused functions; no reliance on unspecified behaviors (e.g., not using Counter.most_common). Reasonable fallback for missing globals.\\n- Style/best practices: Uses regex flags appropriately; avoids unnecessary global reassignment; returns immutable tuples for results. Minor nit: could use heapq.nlargest with key on count instead of nsmallest with negative counts for readability, but current approach is fine.\\nNo bugs or inefficiencies identified relevant to the task.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all requirements precisely with clean, efficient, and readable code. Minor extras (type check and NameError fallback) are harmless and do not detract from adherence.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against the task:\n- Tokenization: Uses regex r'[a-z0-9]+' with flags re.ASCII | re.IGNORECASE to find ASCII alphanumeric runs and lowercases each token via m.group(0).lower(). This satisfies \"Lowercase the text\" (lowercasing per token is equivalent for ASCII) and \"Tokens are ASCII [a-z0-9]+ sequences; all other characters are separators (use a regex).\"\n- Inputs/globals: Reads provided globals text and k without reassigning them. Wraps access in try/except NameError to provide a safe fallback; this is extra but does not violate requirements.\n- Requirement 1 (sorting): Returns heapq.nsmallest(kk, counts.items(), key=lambda item: (-item[1], item[0])). Sorting by the key (-count, token) and taking the n smallest produces a list ordered by count descending, then token ascending, exactly as specified.\n- Requirement 2 (output shape): Returns a list of (token, count) tuples from counts.items(), with length kk = min(k, number of unique tokens). This matches the contract.\n- Requirement 3 (edge case k <= 0): Explicitly returns [] when k <= 0; also returns [] if no tokens are found. Correct.\n- Requirement 4 (no I/O/network): No input(), file I/O, or network calls are used. Script runs as-is with provided globals.\n- Output contract: top_k is defined at end of execution as specified. No extra prints (printing was optional). Does not rely on Counter.most_common tie ordering.\nAmbiguities/notes: The initial isinstance(text, str) guard is stricter than required but harmless given the task guarantees. The try/except for missing globals is extra robustness not required by the spec, but it still ensures top_k is defined.", + "code_quality": "No code generation issues; assessment of quality:\n- Clarity/readability: Clear helper functions (_iter_tokens, _top_k_tokens), descriptive names, concise comments explaining intent.\n- Correctness: Implements the specified tokenization and sorting criteria accurately. Handles edge cases correctly.\n- Efficiency: Single pass to count tokens (O(n)); uses heapq.nsmallest for Top-K selection (O(U log K), U=unique tokens). Efficient and scalable.\n- Maintainability/structure: Modular design with small focused functions; no reliance on unspecified behaviors (e.g., not using Counter.most_common). Reasonable fallback for missing globals.\n- Style/best practices: Uses regex flags appropriately; avoids unnecessary global reassignment; returns immutable tuples for results. Minor nit: could use heapq.nlargest with key on count instead of nsmallest with negative counts for readability, but current approach is fine.\nNo bugs or inefficiencies identified relevant to the task." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all requirements precisely with clean, efficient, and readable code. Minor extras (type check and NameError fallback) are harmless and do not detract from adherence." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_29.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_29.json new file mode 100644 index 0000000000..7ae301a3b5 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_29.json @@ -0,0 +1,16 @@ +{ + "file": "run_29.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Interpreting the task: compute Top-K most frequent tokens from provided globals text (string) and k (int); tokenize as lowercase ASCII [a-z0-9]+ with all other chars as separators (explicitly says use a regex); output top_k as a list of (token, count) sorted by count desc then token asc; handle k <= 0 -> []; no I/O; don't reassign text or k.\\n- Tokenization:\\n - Lowercasing: Achieved. Uppercase A-Z are converted to lowercase on-the-fly in _iter_tokens.\\n - Token definition: Achieved. Only ASCII digits 0-9 and letters a-z are accumulated; anything else (including non-ASCII) is treated as a separator, producing tokens that match [a-z0-9]+.\\n - Use a regex: Not followed. Tokenization is implemented via manual character inspection (ord checks), not using a regex as the instruction explicitly requests. While functionally correct, this misses the stated method requirement.\\n- Inputs/globals:\\n - Does not reassign text or k. It reads them into _text and _k safely and handles NameError by providing defaults. This respects the \\\"do not reassign\\\" requirement.\\n- Requirements:\\n 1) Sorting by count desc, token asc: Followed. Uses key = lambda kv: (-kv[1], kv[0]) and applies either full sort or heapq.nsmallest + sort, yielding the correct ordering.\\n 2) top_k is a list of (token, count) tuples with length = min(k, unique tokens): Followed. When unique <= k it returns all sorted pairs; otherwise it returns exactly k items; tuples are (token, count).\\n 3) Edge case k <= 0 -> []: Followed. Early return [].\\n 4) No input(), file I/O, or network: Followed. None used.\\n- Output contract:\\n - top_k is defined at the end as specified. No extra printing is performed, which is allowed (printing was optional and restricted if used).\\n- Additional notes:\\n - The function gracefully handles empty or non-string text by casting to str, which is acceptable and does not violate any constraints.\\n - Does not rely on Counter.most_common; implements explicit sorting as requested.\\n- Summary: All functional requirements are met, but the explicit \\\"use a regex\\\" instruction for tokenization is not followed.\",\n \"code_quality\": \"- Clarity/readability: Good. Functions are small and focused; comments explain token rules and sorting. Minor micro-optimizations (append alias) and numeric ord ranges slightly reduce readability compared to a regex approach but are still understandable.\\n- Correctness: Correct tokenization (per spec), counting, and Top-K selection/sorting. Proper handling of ties and ordering via (-count, token). Edge cases handled.\\n- Efficiency: Good. Single-pass tokenizer; counting with dict; for large unique token sets uses heapq.nsmallest with a key for O(m log k); when k >= m, falls back to O(m log m) full sort. Reasonable memory/time complexity.\\n- Structure/maintainability: Reasonable. Clear separation of concerns (_iter_tokens, _compute_top_k). Avoids global mutation; exposes a single top_k result. The use of raw ASCII code ranges could be replaced with regex for maintainability (and to meet the instruction), but functionally it's fine.\\n- Best practices: Avoids reliance on unspecified tie-breaking; no I/O side effects; handles invalid k robustly. Names and comments are descriptive. No apparent bugs.\\n- Overall: High-quality, efficient implementation. The only notable deviation is not using a regex for tokenization (a task adherence issue more than code quality).\"\n },\n \"final_judgement\": {\n \"adherence_score\": 4,\n \"code_quality_score\": 5,\n \"comments\": \"Functionally excellent: correct tokenization behavior, counting, Top-K selection, order, and edge-case handling with no I/O. The sole miss is not using a regex for tokenization as explicitly requested.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Interpreting the task: compute Top-K most frequent tokens from provided globals text (string) and k (int); tokenize as lowercase ASCII [a-z0-9]+ with all other chars as separators (explicitly says use a regex); output top_k as a list of (token, count) sorted by count desc then token asc; handle k <= 0 -> []; no I/O; don't reassign text or k.\n- Tokenization:\n - Lowercasing: Achieved. Uppercase A-Z are converted to lowercase on-the-fly in _iter_tokens.\n - Token definition: Achieved. Only ASCII digits 0-9 and letters a-z are accumulated; anything else (including non-ASCII) is treated as a separator, producing tokens that match [a-z0-9]+.\n - Use a regex: Not followed. Tokenization is implemented via manual character inspection (ord checks), not using a regex as the instruction explicitly requests. While functionally correct, this misses the stated method requirement.\n- Inputs/globals:\n - Does not reassign text or k. It reads them into _text and _k safely and handles NameError by providing defaults. This respects the \"do not reassign\" requirement.\n- Requirements:\n 1) Sorting by count desc, token asc: Followed. Uses key = lambda kv: (-kv[1], kv[0]) and applies either full sort or heapq.nsmallest + sort, yielding the correct ordering.\n 2) top_k is a list of (token, count) tuples with length = min(k, unique tokens): Followed. When unique <= k it returns all sorted pairs; otherwise it returns exactly k items; tuples are (token, count).\n 3) Edge case k <= 0 -> []: Followed. Early return [].\n 4) No input(), file I/O, or network: Followed. None used.\n- Output contract:\n - top_k is defined at the end as specified. No extra printing is performed, which is allowed (printing was optional and restricted if used).\n- Additional notes:\n - The function gracefully handles empty or non-string text by casting to str, which is acceptable and does not violate any constraints.\n - Does not rely on Counter.most_common; implements explicit sorting as requested.\n- Summary: All functional requirements are met, but the explicit \"use a regex\" instruction for tokenization is not followed.", + "code_quality": "- Clarity/readability: Good. Functions are small and focused; comments explain token rules and sorting. Minor micro-optimizations (append alias) and numeric ord ranges slightly reduce readability compared to a regex approach but are still understandable.\n- Correctness: Correct tokenization (per spec), counting, and Top-K selection/sorting. Proper handling of ties and ordering via (-count, token). Edge cases handled.\n- Efficiency: Good. Single-pass tokenizer; counting with dict; for large unique token sets uses heapq.nsmallest with a key for O(m log k); when k >= m, falls back to O(m log m) full sort. Reasonable memory/time complexity.\n- Structure/maintainability: Reasonable. Clear separation of concerns (_iter_tokens, _compute_top_k). Avoids global mutation; exposes a single top_k result. The use of raw ASCII code ranges could be replaced with regex for maintainability (and to meet the instruction), but functionally it's fine.\n- Best practices: Avoids reliance on unspecified tie-breaking; no I/O side effects; handles invalid k robustly. Names and comments are descriptive. No apparent bugs.\n- Overall: High-quality, efficient implementation. The only notable deviation is not using a regex for tokenization (a task adherence issue more than code quality)." + }, + "final_judgement": { + "adherence_score": 4, + "code_quality_score": 5, + "comments": "Functionally excellent: correct tokenization behavior, counting, Top-K selection, order, and edge-case handling with no I/O. The sole miss is not using a regex for tokenization as explicitly requested." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_30.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_30.json new file mode 100644 index 0000000000..38f511483b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_baseline/run_30.json @@ -0,0 +1,16 @@ +{ + "file": "run_30.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step:\\n- Tokenization: Uses a regex pattern r'[A-Za-z0-9]+' and lowercases tokens when counting. This matches the requirement: tokens are ASCII [a-z0-9]+ sequences with other chars as separators, and text is effectively lowercased before counting by lowercasing each token.\\n- Inputs: Uses provided globals text and k by passing them to compute_top_k and does not reassign them. Compliant.\\n- Sorting: Sorts with key lambda kv: (-kv[1], kv[0]), i.e., by count descending then token ascending. Matches the specified sort key (-count, token). Does not rely on Counter.most_common.\\n- Output structure: Returns a list of (token, count) tuples and assigns to top_k. Length is items[:min(k, len(items))], satisfying the min(k, unique) requirement.\\n- Edge cases: If k <= 0, returns [] (and thus top_k = []). Also safely returns [] if no tokens are found. Additional type checks for text and k are extra but do not violate the spec.\\n- No disallowed operations: No input(), file I/O, or network access. The script will run as-is with provided globals. No printing beyond optional requirement.\\nOverall, all requirements are fully met; the minor choice of matching [A-Za-z0-9]+ and then lowercasing tokens is equivalent to lowercasing first and matching [a-z0-9]+.\",\n \"code_quality\": \"Clarity: The function is concise, with clear variable names and a direct flow. Correctness: Implements required tokenization, counting, sorting, and edge cases correctly. Efficiency: Uses a compiled regex and Counter; operations are linear in text length plus sorting cost. Readability and structure: Good use of a helper function and type hints; no unnecessary complexity. Maintainability: Straightforward, with standard library usage. Best practices: Avoids relying on Counter.most_common ordering; sorts explicitly. Minor note: The early return if not counter is redundant but harmless. Overall, high-quality code with no identified bugs or inefficiencies.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Fully meets the task requirements with clear, correct, and efficient code. Sorting and edge cases are handled as specified; output contract is satisfied.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step:\n- Tokenization: Uses a regex pattern r'[A-Za-z0-9]+' and lowercases tokens when counting. This matches the requirement: tokens are ASCII [a-z0-9]+ sequences with other chars as separators, and text is effectively lowercased before counting by lowercasing each token.\n- Inputs: Uses provided globals text and k by passing them to compute_top_k and does not reassign them. Compliant.\n- Sorting: Sorts with key lambda kv: (-kv[1], kv[0]), i.e., by count descending then token ascending. Matches the specified sort key (-count, token). Does not rely on Counter.most_common.\n- Output structure: Returns a list of (token, count) tuples and assigns to top_k. Length is items[:min(k, len(items))], satisfying the min(k, unique) requirement.\n- Edge cases: If k <= 0, returns [] (and thus top_k = []). Also safely returns [] if no tokens are found. Additional type checks for text and k are extra but do not violate the spec.\n- No disallowed operations: No input(), file I/O, or network access. The script will run as-is with provided globals. No printing beyond optional requirement.\nOverall, all requirements are fully met; the minor choice of matching [A-Za-z0-9]+ and then lowercasing tokens is equivalent to lowercasing first and matching [a-z0-9]+.", + "code_quality": "Clarity: The function is concise, with clear variable names and a direct flow. Correctness: Implements required tokenization, counting, sorting, and edge cases correctly. Efficiency: Uses a compiled regex and Counter; operations are linear in text length plus sorting cost. Readability and structure: Good use of a helper function and type hints; no unnecessary complexity. Maintainability: Straightforward, with standard library usage. Best practices: Avoids relying on Counter.most_common ordering; sorts explicitly. Minor note: The early return if not counter is redundant but harmless. Overall, high-quality code with no identified bugs or inefficiencies." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Fully meets the task requirements with clear, correct, and efficient code. Sorting and edge cases are handled as specified; output contract is satisfied." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/judgement_summary.csv b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/judgement_summary.csv new file mode 100644 index 0000000000..9632d704e0 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/judgement_summary.csv @@ -0,0 +1,31 @@ +File,adherence_score,code_quality_score,parse_error,error +run_01.py,5,5,, +run_02.py,5,5,, +run_03.py,5,5,, +run_04.py,5,5,, +run_05.py,5,5,, +run_06.py,5,5,, +run_07.py,5,5,, +run_08.py,5,5,, +run_09.py,5,5,, +run_10.py,5,5,, +run_11.py,5,5,, +run_12.py,5,5,, +run_13.py,5,4,, +run_14.py,,,"Expecting ',' delimiter: line 10 column 2 (char 3422)", +run_15.py,5,5,, +run_16.py,5,5,, +run_17.py,5,5,, +run_18.py,5,5,, +run_19.py,5,5,, +run_20.py,5,5,, +run_21.py,5,5,, +run_22.py,2,3,, +run_23.py,5,5,, +run_24.py,5,5,, +run_25.py,5,5,, +run_26.py,5,5,, +run_27.py,5,5,, +run_28.py,5,5,, +run_29.py,5,5,, +run_30.py,5,5,, diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_01.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_01.json new file mode 100644 index 0000000000..7b7eba905e --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_01.json @@ -0,0 +1,16 @@ +{ + "file": "run_01.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization requirements: The code lowercases tokens and uses a regex [a-z0-9]+ with re.ASCII and re.IGNORECASE, then explicitly lowercases tokens via .lower(). This matches the requirement to treat ASCII letters/digits as tokens and all other chars as separators.\\n- Sorting by (-count, token): Implemented via key = lambda kv: (-kv[1], kv[0]). When k >= number of unique tokens (u), it returns sorted(cnt.items(), key=key). When k < u, it returns heapq.nsmallest(k, cnt.items(), key=key). Since nsmallest returns results sorted by the given key, both paths produce a list sorted by count descending, token ascending. This satisfies requirement (1).\\n- Setting top_k: For provided globals text and k, the code sets top_k = top_k_tokens(text, k). The result is a list of (token, count) tuples, length min(k, unique tokens), satisfying requirement (2).\\n- Edge cases: If k <= 0, top_k_tokens returns [], and thus top_k is set to []. If there are zero tokens (u == 0), the function also returns []. This covers requirement (3).\\n- I/O/network restrictions: No input(), file I/O, or network access used. Optional printing only occurs in demo mode and prints only top_k on the last line, complying with the optional output rule.\\n- Not relying on Counter.most_common: The implementation does not use most_common and defines its own sort key, meeting the note.\\n- Output contract: With provided globals, top_k is defined at the end of execution exactly as specified. In the fallback demo path (when globals are missing and running as __main__), top_k is also defined and printed as a Python literal on the last line.\\n- Minor ambiguity: If globals are missing and the code is not run as __main__, top_k would not be defined. However, the task specifies that globals are provided; under the specified conditions the code adheres fully.\",\n \"code_quality\": \"- Clarity/readability: Good separation of concerns: tokenization helper, main function, and module-level wiring. Descriptive names and concise implementation. Type hints are provided.\\n- Correctness: Tokenization and ordering logic are correct. Tie-breaking on token ascending is implemented. Handles k <= 0 and empty input.\\n- Efficiency: Counting is O(N tokens). Selection uses heapq.nsmallest for O(U log k) when k < U and full sort when k >= U, which is appropriate. Space usage is O(U + k). A helpful complexity comment is included.\\n- Maintainability: Uses a compiled regex constant and a small, well-structured function. Minimal dependencies.\\n- Best practices: Avoids relying on Counter.most_common. The try/except NameError to detect globals is slightly unconventional; using 'if \\\"text\\\" in globals() and \\\"k\\\" in globals()' would avoid raising/handling exceptions, but this is a minor stylistic point. The '# type: ignore[name-defined]' comments are acceptable for linters/type checkers.\\n- No bugs or inefficiencies identified that would affect correctness or performance under the task constraints.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence: correct tokenization, ordering, edge-case handling, and top_k assignment. Code is clear, efficient, and avoids disallowed features. Minor stylistic nit on detecting provided globals, but overall high quality.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization requirements: The code lowercases tokens and uses a regex [a-z0-9]+ with re.ASCII and re.IGNORECASE, then explicitly lowercases tokens via .lower(). This matches the requirement to treat ASCII letters/digits as tokens and all other chars as separators.\n- Sorting by (-count, token): Implemented via key = lambda kv: (-kv[1], kv[0]). When k >= number of unique tokens (u), it returns sorted(cnt.items(), key=key). When k < u, it returns heapq.nsmallest(k, cnt.items(), key=key). Since nsmallest returns results sorted by the given key, both paths produce a list sorted by count descending, token ascending. This satisfies requirement (1).\n- Setting top_k: For provided globals text and k, the code sets top_k = top_k_tokens(text, k). The result is a list of (token, count) tuples, length min(k, unique tokens), satisfying requirement (2).\n- Edge cases: If k <= 0, top_k_tokens returns [], and thus top_k is set to []. If there are zero tokens (u == 0), the function also returns []. This covers requirement (3).\n- I/O/network restrictions: No input(), file I/O, or network access used. Optional printing only occurs in demo mode and prints only top_k on the last line, complying with the optional output rule.\n- Not relying on Counter.most_common: The implementation does not use most_common and defines its own sort key, meeting the note.\n- Output contract: With provided globals, top_k is defined at the end of execution exactly as specified. In the fallback demo path (when globals are missing and running as __main__), top_k is also defined and printed as a Python literal on the last line.\n- Minor ambiguity: If globals are missing and the code is not run as __main__, top_k would not be defined. However, the task specifies that globals are provided; under the specified conditions the code adheres fully.", + "code_quality": "- Clarity/readability: Good separation of concerns: tokenization helper, main function, and module-level wiring. Descriptive names and concise implementation. Type hints are provided.\n- Correctness: Tokenization and ordering logic are correct. Tie-breaking on token ascending is implemented. Handles k <= 0 and empty input.\n- Efficiency: Counting is O(N tokens). Selection uses heapq.nsmallest for O(U log k) when k < U and full sort when k >= U, which is appropriate. Space usage is O(U + k). A helpful complexity comment is included.\n- Maintainability: Uses a compiled regex constant and a small, well-structured function. Minimal dependencies.\n- Best practices: Avoids relying on Counter.most_common. The try/except NameError to detect globals is slightly unconventional; using 'if \"text\" in globals() and \"k\" in globals()' would avoid raising/handling exceptions, but this is a minor stylistic point. The '# type: ignore[name-defined]' comments are acceptable for linters/type checkers.\n- No bugs or inefficiencies identified that would affect correctness or performance under the task constraints." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence: correct tokenization, ordering, edge-case handling, and top_k assignment. Code is clear, efficient, and avoids disallowed features. Minor stylistic nit on detecting provided globals, but overall high quality." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_02.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_02.json new file mode 100644 index 0000000000..59522ab55a --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_02.json @@ -0,0 +1,16 @@ +{ + "file": "run_02.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Correct. Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, then lowercases tokens in _tokens(). This matches the requirement to lowercase and treat non-[a-z0-9] ASCII as separators.\\n- Inputs: Correct. Reads provided globals text and k without reassigning them. The detection via try: text; k avoids assignment.\\n- Sorting/key: Correct. The key lambda (-count, token) implements the required ordering: count descending, then token ascending. No reliance on Counter.most_common.\\n- Top-K computation: Correct and exact. For k >= number of unique tokens (u), returns full sorted list; otherwise returns exactly k items, preserving the specified order. Uses heapq.nsmallest with the same key, which returns elements sorted by the key, ensuring correct order.\\n- Edge cases: Correct. If k <= 0, top_k_tokens returns [], and thus top_k is set to []. If the text contains no tokens (u == 0), returns [].\\n- Output contract: Satisfied when globals are provided. At module end, if text and k exist, top_k is defined as required. Optional printing is only performed in demo mode and prints exactly top_k as a Python literal on the last line. No input(), file I/O, or network access is used.\\n- Ambiguity note: If globals were not provided and the module was imported (not __main__), top_k would not be defined. However, the task states the two globals are provided, so this scenario is outside the intended use and does not violate requirements.\",\n \"code_quality\": \"- Clarity/readability: Good. Clear function and helper names; concise logic. Type hints for function signatures are provided.\\n- Correctness: High. Implements the specified sort key and handles all edge cases. Does not rely on Counter.most_common tie behavior.\\n- Efficiency: Good. Counts in O(N tokens). Chooses between full sort and heap selection to avoid unnecessary full sorts when k is small (O(U log U) vs. O(U log k)).\\n- Maintainability/structure: Good. Separation of tokenization and ranking; local key function improves readability. Inline comments document strategy and complexity.\\n- Style/best practices: Generally solid. Minor nit: the availability check using try: text; k is a bit unconventional but functional and type-checked with ignores. Using a named function instead of a lambda for the key could slightly improve readability, but not necessary.\\n- No bugs or inefficiencies identified that affect correctness.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and avoids prohibited patterns. Minor nit about globals detection is acceptable under the stated assumptions.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Correct. Uses regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, then lowercases tokens in _tokens(). This matches the requirement to lowercase and treat non-[a-z0-9] ASCII as separators.\n- Inputs: Correct. Reads provided globals text and k without reassigning them. The detection via try: text; k avoids assignment.\n- Sorting/key: Correct. The key lambda (-count, token) implements the required ordering: count descending, then token ascending. No reliance on Counter.most_common.\n- Top-K computation: Correct and exact. For k >= number of unique tokens (u), returns full sorted list; otherwise returns exactly k items, preserving the specified order. Uses heapq.nsmallest with the same key, which returns elements sorted by the key, ensuring correct order.\n- Edge cases: Correct. If k <= 0, top_k_tokens returns [], and thus top_k is set to []. If the text contains no tokens (u == 0), returns [].\n- Output contract: Satisfied when globals are provided. At module end, if text and k exist, top_k is defined as required. Optional printing is only performed in demo mode and prints exactly top_k as a Python literal on the last line. No input(), file I/O, or network access is used.\n- Ambiguity note: If globals were not provided and the module was imported (not __main__), top_k would not be defined. However, the task states the two globals are provided, so this scenario is outside the intended use and does not violate requirements.", + "code_quality": "- Clarity/readability: Good. Clear function and helper names; concise logic. Type hints for function signatures are provided.\n- Correctness: High. Implements the specified sort key and handles all edge cases. Does not rely on Counter.most_common tie behavior.\n- Efficiency: Good. Counts in O(N tokens). Chooses between full sort and heap selection to avoid unnecessary full sorts when k is small (O(U log U) vs. O(U log k)).\n- Maintainability/structure: Good. Separation of tokenization and ranking; local key function improves readability. Inline comments document strategy and complexity.\n- Style/best practices: Generally solid. Minor nit: the availability check using try: text; k is a bit unconventional but functional and type-checked with ignores. Using a named function instead of a lambda for the key could slightly improve readability, but not necessary.\n- No bugs or inefficiencies identified that affect correctness." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and avoids prohibited patterns. Minor nit about globals detection is acceptable under the stated assumptions." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_03.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_03.json new file mode 100644 index 0000000000..b08bc9c89e --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_03.json @@ -0,0 +1,16 @@ +{ + "file": "run_03.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against the instructions:\\n- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE, and yields tokens lowercased via m.group(0).lower(). This satisfies: lowercase tokens; ASCII [a-z0-9]+ sequences; non-matching chars treated as separators; uses a regex.\\n- Inputs/globals: The code reads provided globals text and k without reassigning them. It uses a try/except NameError guard to detect their presence. If they exist, it computes top_k = top_k_tokens(text, k). It does not modify text or k.\\n- Requirement 1 (sorting): It defines key = lambda kv: (-kv[1], kv[0]) corresponding to (-count, token). For k >= unique tokens, it returns sorted(cnt.items(), key=key). For k < unique tokens, it uses heapq.nsmallest(k, cnt.items(), key=key), which returns the k elements in ascending order of the key, i.e., descending by count then ascending by token. This matches the specified order and avoids relying on Counter.most_common.\\n- Requirement 2 (top_k value/length): top_k is set to a list of (token, count) tuples via the function call. For k >= u, it returns all u items; for k < u, returns exactly k items; for no tokens, returns []. Thus length is min(k, number of unique tokens).\\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns [], so top_k = [] in that case. If there are zero tokens (u == 0), returns [].\\n- Requirement 4 (no I/O, runs as-is with provided globals): No input(), file I/O, or network is used. With provided globals, it computes top_k and does not print. The demo branch only triggers if globals are absent and __name__ == \\\"__main__\\\".\\n- Output contract: At end of execution, top_k is defined exactly as required when text and k are provided. Optional printing is only in the demo path and prints top_k alone on the last line. The implementation does not use Counter.most_common and defines its own sort key.\\nMinor note: The code lowercases tokens rather than the entire text, but due to IGNORECASE in the regex, the extracted tokens cover upper/lowercase correctly and are normalized to lowercase, which is equivalent for the specified tokenization.\\nConclusion: All instructions are fully met.\",\n \"code_quality\": \"The output is code; assessment:\\n- Correctness: Logic for tokenization, counting, sorting, and top-k selection is correct. heapq.nsmallest with key=(-count, token) yields the desired ordering for the returned list.\\n- Efficiency: Counting is O(N tokens). Selection is O(U log k) when k < U via heapq, and O(U log U) for full sort when k >= U. Extra space O(U + k). This is efficient and appropriate.\\n- Readability/Structure: Clear separation of concerns with _tokens and top_k_tokens. Type hints for function return types aid readability. Variable names are concise but understandable (cnt, u). Inline comment for complexity is helpful.\\n- Maintainability/Best practices: Avoids relying on Counter.most_common tie behavior. Uses a well-defined sort key. No side effects except optional demo in __main__. Does not reassign provided globals. No unnecessary I/O.\\n- Minor nits: Using both re.IGNORECASE and lowercasing tokens is redundant (either alone would suffice if text were lowercased first or pattern remained case-insensitive). The try/except NameError guard is slightly unconventional versus checking 'text' and 'k' in globals(), but functional and clear with the comment. Neither issue impacts correctness.\\nOverall: High-quality, clean, and efficient code.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Fully meets all task requirements, including exact sorting, edge cases, and output contract. Code is correct, efficient, and readable. Minor redundancies (IGNORECASE plus lowercasing) do not affect correctness.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against the instructions:\n- Tokenization: Uses regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE, and yields tokens lowercased via m.group(0).lower(). This satisfies: lowercase tokens; ASCII [a-z0-9]+ sequences; non-matching chars treated as separators; uses a regex.\n- Inputs/globals: The code reads provided globals text and k without reassigning them. It uses a try/except NameError guard to detect their presence. If they exist, it computes top_k = top_k_tokens(text, k). It does not modify text or k.\n- Requirement 1 (sorting): It defines key = lambda kv: (-kv[1], kv[0]) corresponding to (-count, token). For k >= unique tokens, it returns sorted(cnt.items(), key=key). For k < unique tokens, it uses heapq.nsmallest(k, cnt.items(), key=key), which returns the k elements in ascending order of the key, i.e., descending by count then ascending by token. This matches the specified order and avoids relying on Counter.most_common.\n- Requirement 2 (top_k value/length): top_k is set to a list of (token, count) tuples via the function call. For k >= u, it returns all u items; for k < u, returns exactly k items; for no tokens, returns []. Thus length is min(k, number of unique tokens).\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns [], so top_k = [] in that case. If there are zero tokens (u == 0), returns [].\n- Requirement 4 (no I/O, runs as-is with provided globals): No input(), file I/O, or network is used. With provided globals, it computes top_k and does not print. The demo branch only triggers if globals are absent and __name__ == \"__main__\".\n- Output contract: At end of execution, top_k is defined exactly as required when text and k are provided. Optional printing is only in the demo path and prints top_k alone on the last line. The implementation does not use Counter.most_common and defines its own sort key.\nMinor note: The code lowercases tokens rather than the entire text, but due to IGNORECASE in the regex, the extracted tokens cover upper/lowercase correctly and are normalized to lowercase, which is equivalent for the specified tokenization.\nConclusion: All instructions are fully met.", + "code_quality": "The output is code; assessment:\n- Correctness: Logic for tokenization, counting, sorting, and top-k selection is correct. heapq.nsmallest with key=(-count, token) yields the desired ordering for the returned list.\n- Efficiency: Counting is O(N tokens). Selection is O(U log k) when k < U via heapq, and O(U log U) for full sort when k >= U. Extra space O(U + k). This is efficient and appropriate.\n- Readability/Structure: Clear separation of concerns with _tokens and top_k_tokens. Type hints for function return types aid readability. Variable names are concise but understandable (cnt, u). Inline comment for complexity is helpful.\n- Maintainability/Best practices: Avoids relying on Counter.most_common tie behavior. Uses a well-defined sort key. No side effects except optional demo in __main__. Does not reassign provided globals. No unnecessary I/O.\n- Minor nits: Using both re.IGNORECASE and lowercasing tokens is redundant (either alone would suffice if text were lowercased first or pattern remained case-insensitive). The try/except NameError guard is slightly unconventional versus checking 'text' and 'k' in globals(), but functional and clear with the comment. Neither issue impacts correctness.\nOverall: High-quality, clean, and efficient code." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Fully meets all task requirements, including exact sorting, edge cases, and output contract. Code is correct, efficient, and readable. Minor redundancies (IGNORECASE plus lowercasing) do not affect correctness." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_04.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_04.json new file mode 100644 index 0000000000..4f0bbb4fe8 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_04.json @@ -0,0 +1,16 @@ +{ + "file": "run_04.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII | re.IGNORECASE and lowercases tokens in _tokens(), matching the requirement to lowercase and treat non [a-z0-9] ASCII chars as separators.\\n- Inputs: Accesses globals text and k without reassigning them. Presence is checked via try/except; when provided, it computes top_k = top_k_tokens(text, k) as required.\\n- Requirement 1 (sorting): Defines key = lambda kv: (-kv[1], kv[0]) and applies it consistently. For k >= u, returns sorted(cnt.items(), key=key). For 0 < k < u, either sorts and slices or uses heapq.nsmallest with the same key, which returns the k items in ascending order by the key (i.e., count desc, token asc). Does not rely on Counter.most_common tie ordering.\\n- Requirement 2 (top_k list of tuples with correct length): The function returns a list of (token, count) tuples. Length is min(k, number of unique tokens) due to branches (k <= 0 -> [], k >= u -> all, else -> exactly k via slice or nsmallest). Tokens are lowercased.\\n- Requirement 3 (edge cases): If k <= 0, returns []. If there are zero unique tokens (u == 0), returns []. Both cases are handled.\\n- Requirement 4 (no I/O): No input(), file I/O, or network access. Printing occurs only in a demo path when globals are absent and __name__ == \\\"__main__\\\"; with provided globals (per task), it performs no printing.\\n- Output contract: Ensures top_k is defined at the end when globals exist. Optional printing is limited to the demo path and prints only top_k as a Python literal on the last line.\\n- Other notes: Does not reassign text or k. Implements specified sort key directly.\",\n \"code_quality\": \"- Clarity and structure: Clean separation via helper _tokens() and top_k_tokens(); meaningful names; type hints provided; inline comments and complexity note included.\\n- Correctness: Tokenization and sorting logic meet specs. Uses Counter for counting; avoids most_common tie behavior by explicit sort key. heapq.nsmallest with the composite key preserves the required ordering of the returned k items.\\n- Efficiency: Counts in O(N tokens). Selects top-k via either full sort (when k relatively large) or heap-based selection (when k small), which is a sensible optimization. Regex precompiled globally.\\n- Readability/Maintainability: Concise, readable, and follows Python best practices. No obvious bugs or edge-case gaps given the stated inputs. No unnecessary side effects when globals are present.\\n- Minor nitpicks: The threshold 0.3 is heuristic (acceptable). Type ignore comments are unnecessary but harmless.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, sorting, edge cases, global handling, and output contract. Code is clear, efficient, and well-structured.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex r\"[a-z0-9]+\" with re.ASCII | re.IGNORECASE and lowercases tokens in _tokens(), matching the requirement to lowercase and treat non [a-z0-9] ASCII chars as separators.\n- Inputs: Accesses globals text and k without reassigning them. Presence is checked via try/except; when provided, it computes top_k = top_k_tokens(text, k) as required.\n- Requirement 1 (sorting): Defines key = lambda kv: (-kv[1], kv[0]) and applies it consistently. For k >= u, returns sorted(cnt.items(), key=key). For 0 < k < u, either sorts and slices or uses heapq.nsmallest with the same key, which returns the k items in ascending order by the key (i.e., count desc, token asc). Does not rely on Counter.most_common tie ordering.\n- Requirement 2 (top_k list of tuples with correct length): The function returns a list of (token, count) tuples. Length is min(k, number of unique tokens) due to branches (k <= 0 -> [], k >= u -> all, else -> exactly k via slice or nsmallest). Tokens are lowercased.\n- Requirement 3 (edge cases): If k <= 0, returns []. If there are zero unique tokens (u == 0), returns []. Both cases are handled.\n- Requirement 4 (no I/O): No input(), file I/O, or network access. Printing occurs only in a demo path when globals are absent and __name__ == \"__main__\"; with provided globals (per task), it performs no printing.\n- Output contract: Ensures top_k is defined at the end when globals exist. Optional printing is limited to the demo path and prints only top_k as a Python literal on the last line.\n- Other notes: Does not reassign text or k. Implements specified sort key directly.", + "code_quality": "- Clarity and structure: Clean separation via helper _tokens() and top_k_tokens(); meaningful names; type hints provided; inline comments and complexity note included.\n- Correctness: Tokenization and sorting logic meet specs. Uses Counter for counting; avoids most_common tie behavior by explicit sort key. heapq.nsmallest with the composite key preserves the required ordering of the returned k items.\n- Efficiency: Counts in O(N tokens). Selects top-k via either full sort (when k relatively large) or heap-based selection (when k small), which is a sensible optimization. Regex precompiled globally.\n- Readability/Maintainability: Concise, readable, and follows Python best practices. No obvious bugs or edge-case gaps given the stated inputs. No unnecessary side effects when globals are present.\n- Minor nitpicks: The threshold 0.3 is heuristic (acceptable). Type ignore comments are unnecessary but harmless." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, sorting, edge cases, global handling, and output contract. Code is clear, efficient, and well-structured." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_05.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_05.json new file mode 100644 index 0000000000..6719df859e --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_05.json @@ -0,0 +1,16 @@ +{ + "file": "run_05.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses a compiled regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, and explicitly lowercases tokens in _tokens(). This matches the requirement to lowercase and to treat tokens as ASCII [a-z0-9]+ with other chars as separators.\\n- Sorting spec: Defines key = lambda kv: (-kv[1], kv[0]), which enforces sort by count descending then token ascending. Both sorted(...) and heapq.nsmallest(..., key=key) use this key, satisfying the specified ordering and avoiding Counter.most_common tie behavior.\\n- Exact Top-K: Returns exact results. When k >= number of unique tokens (u), it returns the fully sorted list; otherwise it either sorts then slices or uses heapq.nsmallest with the specified key. heapq.nsmallest returns items in ascending key order, so the resulting list is already sorted as required.\\n- Output variable: With provided globals, the code assigns top_k = top_k_tokens(text, k) at the end, satisfying \\\"top_k must be defined exactly as described.\\\" It does not reassign text or k.\\n- Edge cases: If k <= 0, top_k_tokens returns [], and therefore top_k becomes []. If there are zero unique tokens, it returns []. Both meet the requirements.\\n- Inputs and environment: No input(), file I/O, or network access. The code runs as-is given the globals. It also includes an optional __main__ demo (only prints top_k) when globals are missing, which is allowed by the \\\"Optional printing\\\" clause.\\n- Tie handling: Does not rely on Counter.most_common; implements explicit sort key as required.\\n- Minor note: It treats non-int k as invalid and returns [], which is slightly beyond the spec (k is stated to be int) but harmless and not contrary to requirements.\\nOverall, the code fully adheres to the task instructions.\",\n \"code_quality\": \"- Correctness: Logic is sound; tokenization and ordering are correctly implemented. Top-K selection is exact in all branches.\\n- Efficiency: Uses an adaptive strategy: full sort when k is large relative to u (k >= 0.3u) and heap-based selection otherwise, achieving O(U log k) or O(U log U) as appropriate. Counter is used efficiently.\\n- Readability/Clarity: Functions are small and focused. The key for sorting is clearly documented via a comment. The regex is compiled once at module scope. Type hints are provided. Complexity comment is helpful.\\n- Maintainability/Style: Clean structure with a helper _tokens(). Uses try/except to detect provided globals without reassigning them, plus type: ignore comments for static checkers; acceptable here. Minor stylistic nits could be: building Counter directly from the generator (Counter(_tokens(text))) and possibly omitting re.IGNORECASE since tokens are lowercased anyway, but these do not affect correctness or clarity.\\n- No bugs or evident edge-case failures found.\\nOverall, code quality is high.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clean, efficient, and well-structured. Minor stylistic tweaks are optional.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses a compiled regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, and explicitly lowercases tokens in _tokens(). This matches the requirement to lowercase and to treat tokens as ASCII [a-z0-9]+ with other chars as separators.\n- Sorting spec: Defines key = lambda kv: (-kv[1], kv[0]), which enforces sort by count descending then token ascending. Both sorted(...) and heapq.nsmallest(..., key=key) use this key, satisfying the specified ordering and avoiding Counter.most_common tie behavior.\n- Exact Top-K: Returns exact results. When k >= number of unique tokens (u), it returns the fully sorted list; otherwise it either sorts then slices or uses heapq.nsmallest with the specified key. heapq.nsmallest returns items in ascending key order, so the resulting list is already sorted as required.\n- Output variable: With provided globals, the code assigns top_k = top_k_tokens(text, k) at the end, satisfying \"top_k must be defined exactly as described.\" It does not reassign text or k.\n- Edge cases: If k <= 0, top_k_tokens returns [], and therefore top_k becomes []. If there are zero unique tokens, it returns []. Both meet the requirements.\n- Inputs and environment: No input(), file I/O, or network access. The code runs as-is given the globals. It also includes an optional __main__ demo (only prints top_k) when globals are missing, which is allowed by the \"Optional printing\" clause.\n- Tie handling: Does not rely on Counter.most_common; implements explicit sort key as required.\n- Minor note: It treats non-int k as invalid and returns [], which is slightly beyond the spec (k is stated to be int) but harmless and not contrary to requirements.\nOverall, the code fully adheres to the task instructions.", + "code_quality": "- Correctness: Logic is sound; tokenization and ordering are correctly implemented. Top-K selection is exact in all branches.\n- Efficiency: Uses an adaptive strategy: full sort when k is large relative to u (k >= 0.3u) and heap-based selection otherwise, achieving O(U log k) or O(U log U) as appropriate. Counter is used efficiently.\n- Readability/Clarity: Functions are small and focused. The key for sorting is clearly documented via a comment. The regex is compiled once at module scope. Type hints are provided. Complexity comment is helpful.\n- Maintainability/Style: Clean structure with a helper _tokens(). Uses try/except to detect provided globals without reassigning them, plus type: ignore comments for static checkers; acceptable here. Minor stylistic nits could be: building Counter directly from the generator (Counter(_tokens(text))) and possibly omitting re.IGNORECASE since tokens are lowercased anyway, but these do not affect correctness or clarity.\n- No bugs or evident edge-case failures found.\nOverall, code quality is high." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clean, efficient, and well-structured. Minor stylistic tweaks are optional." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_06.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_06.json new file mode 100644 index 0000000000..4c9833ed61 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_06.json @@ -0,0 +1,16 @@ +{ + "file": "run_06.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization requirements: The code compiles a regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, then lowercases each match in _tokens(). This matches the spec: lowercase tokens and ASCII [a-z0-9]+ sequences with all other chars as separators.\\n- Inputs/globals: The code reads text and k without reassigning them. It wraps access in a try/except NameError to allow a demo when globals are absent. Under the task's stated scenario (globals are provided), it uses them directly and does not reassign them.\\n- Requirement 1 (sorting): It defines key = lambda kv: (-kv[1], kv[0]) on (token, count) pairs, which corresponds to sorting by count descending, then token ascending. When k*10 >= 3*u, it uses sorted(cnt.items(), key=key)[:eff_k]; otherwise it uses heapq.nsmallest(eff_k, cnt.items(), key=key). Since nsmallest returns items in ascending order of the key (equivalent to sorted(... )[:n]), and the key uses -count, both branches produce the correct final ordering: highest count first, then lexicographically smallest token on ties.\\n- Requirement 2 (top_k value and length): eff_k = min(k, number of unique tokens) is enforced via eff_k = k if k < u else u. The result in both branches is a list of (token, count) tuples of length eff_k. The final assignment top_k = top_k_tokens(text, k) sets top_k accordingly.\\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns []. If there are zero unique tokens (u == 0), it returns []. Both satisfy the edge case handling.\\n- Requirement 4 (I/O/network): No input(), file I/O, or network access is used. With provided globals, the script does not print anything. A demo print occurs only when globals are missing and the script is run as __main__, which does not violate the requirement.\\n- Output contract: With provided globals, top_k is defined at the end as specified. Optional printing is respected; when printing in demo mode, it prints only top_k on the last line as a Python literal. It does not rely on Counter.most_common; sorting/tie-breaking is implemented via a custom key.\\n- Minor note: If the script is imported as a module without text/k provided (and not __main__), top_k would not be defined. However, the task states the globals are provided, so this does not affect adherence under the specified conditions.\",\n \"code_quality\": \"- Correctness: The algorithm correctly tokenizes, counts, and selects Top-K with the required ordering. It does not rely on Counter.most_common tie behavior.\\n- Efficiency: Uses an adaptive approach: sorts all when k is a significant fraction of U; otherwise uses heapq.nsmallest with a key, which is O(U log k). Counting is O(N tokens). This is efficient and appropriate.\\n- Readability and maintainability: Clear structure with small, well-named helpers (_tokens, top_k_tokens). Type hints are provided. Comments document the selection strategy and complexity. Variable names (u, eff_k) are concise and understandable. The key function is simple and accurate.\\n- Style/robustness: No unnecessary reassignment of globals. The try/except NameError pattern for optional demo is reasonable and guarded by __main__. Minor nit: In general library usage, if imported without globals, top_k may remain undefined; this is acceptable given the task context. Overall, the code is clean, idiomatic, and maintainable.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence: correct tokenization, ordering, edge-case handling, and output variable. Efficient and clean implementation with appropriate tie-breaking and no reliance on most_common. Minor non-issue: top_k undefined if imported without globals, but within task constraints this is fine.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization requirements: The code compiles a regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, then lowercases each match in _tokens(). This matches the spec: lowercase tokens and ASCII [a-z0-9]+ sequences with all other chars as separators.\n- Inputs/globals: The code reads text and k without reassigning them. It wraps access in a try/except NameError to allow a demo when globals are absent. Under the task's stated scenario (globals are provided), it uses them directly and does not reassign them.\n- Requirement 1 (sorting): It defines key = lambda kv: (-kv[1], kv[0]) on (token, count) pairs, which corresponds to sorting by count descending, then token ascending. When k*10 >= 3*u, it uses sorted(cnt.items(), key=key)[:eff_k]; otherwise it uses heapq.nsmallest(eff_k, cnt.items(), key=key). Since nsmallest returns items in ascending order of the key (equivalent to sorted(... )[:n]), and the key uses -count, both branches produce the correct final ordering: highest count first, then lexicographically smallest token on ties.\n- Requirement 2 (top_k value and length): eff_k = min(k, number of unique tokens) is enforced via eff_k = k if k < u else u. The result in both branches is a list of (token, count) tuples of length eff_k. The final assignment top_k = top_k_tokens(text, k) sets top_k accordingly.\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns []. If there are zero unique tokens (u == 0), it returns []. Both satisfy the edge case handling.\n- Requirement 4 (I/O/network): No input(), file I/O, or network access is used. With provided globals, the script does not print anything. A demo print occurs only when globals are missing and the script is run as __main__, which does not violate the requirement.\n- Output contract: With provided globals, top_k is defined at the end as specified. Optional printing is respected; when printing in demo mode, it prints only top_k on the last line as a Python literal. It does not rely on Counter.most_common; sorting/tie-breaking is implemented via a custom key.\n- Minor note: If the script is imported as a module without text/k provided (and not __main__), top_k would not be defined. However, the task states the globals are provided, so this does not affect adherence under the specified conditions.", + "code_quality": "- Correctness: The algorithm correctly tokenizes, counts, and selects Top-K with the required ordering. It does not rely on Counter.most_common tie behavior.\n- Efficiency: Uses an adaptive approach: sorts all when k is a significant fraction of U; otherwise uses heapq.nsmallest with a key, which is O(U log k). Counting is O(N tokens). This is efficient and appropriate.\n- Readability and maintainability: Clear structure with small, well-named helpers (_tokens, top_k_tokens). Type hints are provided. Comments document the selection strategy and complexity. Variable names (u, eff_k) are concise and understandable. The key function is simple and accurate.\n- Style/robustness: No unnecessary reassignment of globals. The try/except NameError pattern for optional demo is reasonable and guarded by __main__. Minor nit: In general library usage, if imported without globals, top_k may remain undefined; this is acceptable given the task context. Overall, the code is clean, idiomatic, and maintainable." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence: correct tokenization, ordering, edge-case handling, and output variable. Efficient and clean implementation with appropriate tie-breaking and no reliance on most_common. Minor non-issue: top_k undefined if imported without globals, but within task constraints this is fine." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_07.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_07.json new file mode 100644 index 0000000000..5a97b7366e --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_07.json @@ -0,0 +1,16 @@ +{ + "file": "run_07.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex [a-z0-9]+ with re.ASCII and re.IGNORECASE, and explicitly lowercases each matched token via .lower(). This fulfills the requirement of ASCII tokenization with lowercase tokens, and per-token lowercasing is acceptable.\\n- Inputs: The script accesses globals text and k without reassigning them. It uses a try/except NameError to detect their presence and only falls back to a demo when they are absent and __name__ == \\\"__main__\\\", which is acceptable.\\n- Sorting/Ranking: Defines key = lambda kv: (-kv[1], kv[0]) and uses either sorted(..., key=key) or heapq.nsmallest(k, ..., key=key). This correctly implements sort by count descending, then token ascending, without relying on Counter.most_common.\\n- Top-K result: The function top_k_tokens returns exactly the top k items in the correct order, with length min(k, number of unique tokens). For k >= u, it returns the fully sorted list; for small k it returns the k smallest by the key (which corresponds to highest counts and lexicographically smallest tokens in ties) in sorted order.\\n- Edge cases: If k <= 0, returns []. If there are no tokens (u == 0), returns []. Both match the requirements.\\n- Output contract: When globals are provided, top_k is set at the end via top_k = top_k_tokens(text, k). No extra output is produced unless running the demo; printing is optional and the demo prints only top_k.\\n- Constraints: No input(), file I/O, or network access. Does not rely on Counter.most_common tie ordering.\\n- Ambiguities: None materially affecting compliance. The demo printing is limited to one line and only when globals are absent, which is allowed.\",\n \"code_quality\": \"- Clarity/Structure: Clear separation of concerns with a tokenizer helper, a top_k_tokens function, and top-level glue code. Readable variable names and a concise sort key.\\n- Correctness: The key function and use of sorted/heapq.nsmallest ensure exact ordering by (-count, token). Tie-breaking is handled correctly.\\n- Efficiency: Uses Counter for O(N) counting. Chooses between heap-based selection O(U log k) and full sort O(U log U) with a reasonable threshold heuristic. This is efficient and scalable.\\n- Readability/Maintainability: Type hints provided; code is straightforward and commented where relevant (complexity note). The try/except pattern for globals is clean and safe.\\n- Best practices: Avoids Counter.most_common to ensure explicit ordering. No side effects except optional demo printing.\\n- Minor nits: re.IGNORECASE is redundant since tokens are lowercased; harmless. Could add a short docstring, but not necessary.\\n\\nOverall, code quality is high with only trivial, non-impactful redundancies.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Fully meets the task: correct tokenization, ordering, edge-case handling, and top_k definition with no forbidden I/O. Code is clean, efficient, and well-structured. Minor redundancy in regex flags is harmless.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex [a-z0-9]+ with re.ASCII and re.IGNORECASE, and explicitly lowercases each matched token via .lower(). This fulfills the requirement of ASCII tokenization with lowercase tokens, and per-token lowercasing is acceptable.\n- Inputs: The script accesses globals text and k without reassigning them. It uses a try/except NameError to detect their presence and only falls back to a demo when they are absent and __name__ == \"__main__\", which is acceptable.\n- Sorting/Ranking: Defines key = lambda kv: (-kv[1], kv[0]) and uses either sorted(..., key=key) or heapq.nsmallest(k, ..., key=key). This correctly implements sort by count descending, then token ascending, without relying on Counter.most_common.\n- Top-K result: The function top_k_tokens returns exactly the top k items in the correct order, with length min(k, number of unique tokens). For k >= u, it returns the fully sorted list; for small k it returns the k smallest by the key (which corresponds to highest counts and lexicographically smallest tokens in ties) in sorted order.\n- Edge cases: If k <= 0, returns []. If there are no tokens (u == 0), returns []. Both match the requirements.\n- Output contract: When globals are provided, top_k is set at the end via top_k = top_k_tokens(text, k). No extra output is produced unless running the demo; printing is optional and the demo prints only top_k.\n- Constraints: No input(), file I/O, or network access. Does not rely on Counter.most_common tie ordering.\n- Ambiguities: None materially affecting compliance. The demo printing is limited to one line and only when globals are absent, which is allowed.", + "code_quality": "- Clarity/Structure: Clear separation of concerns with a tokenizer helper, a top_k_tokens function, and top-level glue code. Readable variable names and a concise sort key.\n- Correctness: The key function and use of sorted/heapq.nsmallest ensure exact ordering by (-count, token). Tie-breaking is handled correctly.\n- Efficiency: Uses Counter for O(N) counting. Chooses between heap-based selection O(U log k) and full sort O(U log U) with a reasonable threshold heuristic. This is efficient and scalable.\n- Readability/Maintainability: Type hints provided; code is straightforward and commented where relevant (complexity note). The try/except pattern for globals is clean and safe.\n- Best practices: Avoids Counter.most_common to ensure explicit ordering. No side effects except optional demo printing.\n- Minor nits: re.IGNORECASE is redundant since tokens are lowercased; harmless. Could add a short docstring, but not necessary.\n\nOverall, code quality is high with only trivial, non-impactful redundancies." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Fully meets the task: correct tokenization, ordering, edge-case handling, and top_k definition with no forbidden I/O. Code is clean, efficient, and well-structured. Minor redundancy in regex flags is harmless." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_08.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_08.json new file mode 100644 index 0000000000..37492dbe4b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_08.json @@ -0,0 +1,16 @@ +{ + "file": "run_08.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses a compiled regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE and lowercases each match in _tokens(). This exactly matches the requirement: ASCII [a-z0-9]+ sequences, all other characters as separators, and tokens lowercased.\\n- Inputs/Globals: The script expects globals text (str) and k (int) and does not reassign them. It accesses them in a try block; if present, it computes top_k = top_k_tokens(text, k). This respects the constraint not to modify the provided globals.\\n- Sorting requirement: Implements ordering with key = lambda kv: (-kv[1], kv[0]) applied to (token, count) items, which sorts by count descending then token ascending. For k >= u or when sorting-all-and-slicing, it uses sorted(..., key=key). For the heap path, it uses heapq.nsmallest(k, ..., key=key), which returns the k smallest by the key in sorted order, i.e., highest counts first with correct tiebreaker. It does not rely on Counter.most_common.\\n- Top-K length and content: Returns a list of (token, count) tuples of length min(k, number of unique tokens) in all code paths. If k >= u, returns the full sorted list; else returns exactly k items.\\n- Edge cases: If k <= 0, top_k_tokens returns [] (requirement 3). If there are zero unique tokens (u == 0), returns [], which is consistent with requirement 2.\\n- Output contract: When globals are provided, top_k is defined at module end as required. Optional printing is only done in a demo branch when globals are missing and __name__ == \\\"__main__\\\"; it prints only top_k on the last line, complying with the optional printing rule. With provided globals, it does not print, which is allowed.\\n- Prohibited actions: No input(), file I/O, or network access is used.\\n- Minor note: If the file is imported as a module without text/k and not run as __main__, top_k wouldn't be defined. However, the task states the globals are provided, so under the intended conditions, the contract is met.\",\n \"code_quality\": \"- Correctness: The implementation correctly counts tokens and selects/sorts Top-K per the specified key without relying on Counter.most_common.\\n- Efficiency: Uses Counter for O(N tokens) counting and an optimization to choose between sorting all (O(U log U)) or heap selection (O(U log k)), which is efficient and well-considered.\\n- Readability/Structure: Clear function decomposition (_tokens and top_k_tokens), descriptive names, type hints for clarity and maintainability, and concise comments on complexity and selection strategy.\\n- Robustness: Handles edge cases (k <= 0, no tokens) gracefully. Tokenization is explicit and correct for ASCII alphanumerics.\\n- Style/Best practices: Uses a compiled regex, avoids unnecessary global mutation, and adheres to the output contract. The try/except to detect provided globals is acceptable and simple.\\n- No bugs or inefficiencies apparent. The heapq.nsmallest result is in sorted order by the given key, so the returned top_k has the required ordering.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, sorting, edge-case handling, and output contract. Code is clear, efficient, and well-structured. No issues found.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses a compiled regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE and lowercases each match in _tokens(). This exactly matches the requirement: ASCII [a-z0-9]+ sequences, all other characters as separators, and tokens lowercased.\n- Inputs/Globals: The script expects globals text (str) and k (int) and does not reassign them. It accesses them in a try block; if present, it computes top_k = top_k_tokens(text, k). This respects the constraint not to modify the provided globals.\n- Sorting requirement: Implements ordering with key = lambda kv: (-kv[1], kv[0]) applied to (token, count) items, which sorts by count descending then token ascending. For k >= u or when sorting-all-and-slicing, it uses sorted(..., key=key). For the heap path, it uses heapq.nsmallest(k, ..., key=key), which returns the k smallest by the key in sorted order, i.e., highest counts first with correct tiebreaker. It does not rely on Counter.most_common.\n- Top-K length and content: Returns a list of (token, count) tuples of length min(k, number of unique tokens) in all code paths. If k >= u, returns the full sorted list; else returns exactly k items.\n- Edge cases: If k <= 0, top_k_tokens returns [] (requirement 3). If there are zero unique tokens (u == 0), returns [], which is consistent with requirement 2.\n- Output contract: When globals are provided, top_k is defined at module end as required. Optional printing is only done in a demo branch when globals are missing and __name__ == \"__main__\"; it prints only top_k on the last line, complying with the optional printing rule. With provided globals, it does not print, which is allowed.\n- Prohibited actions: No input(), file I/O, or network access is used.\n- Minor note: If the file is imported as a module without text/k and not run as __main__, top_k wouldn't be defined. However, the task states the globals are provided, so under the intended conditions, the contract is met.", + "code_quality": "- Correctness: The implementation correctly counts tokens and selects/sorts Top-K per the specified key without relying on Counter.most_common.\n- Efficiency: Uses Counter for O(N tokens) counting and an optimization to choose between sorting all (O(U log U)) or heap selection (O(U log k)), which is efficient and well-considered.\n- Readability/Structure: Clear function decomposition (_tokens and top_k_tokens), descriptive names, type hints for clarity and maintainability, and concise comments on complexity and selection strategy.\n- Robustness: Handles edge cases (k <= 0, no tokens) gracefully. Tokenization is explicit and correct for ASCII alphanumerics.\n- Style/Best practices: Uses a compiled regex, avoids unnecessary global mutation, and adheres to the output contract. The try/except to detect provided globals is acceptable and simple.\n- No bugs or inefficiencies apparent. The heapq.nsmallest result is in sorted order by the given key, so the returned top_k has the required ordering." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, sorting, edge-case handling, and output contract. Code is clear, efficient, and well-structured. No issues found." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_09.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_09.json new file mode 100644 index 0000000000..5f4dea6d91 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_09.json @@ -0,0 +1,16 @@ +{ + "file": "run_09.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step adherence check:\\n- Tokenization rules: The code compiles a regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE and the _tokens generator yields matches lowercased. This matches the requirement to lowercase and to treat ASCII [a-z0-9]+ as tokens with other chars as separators.\\n- Inputs/globals: It uses provided globals text and k without reassigning them. Access is guarded via a try/except NameError; when globals exist, it computes top_k from them.\\n- Requirements:\\n 1) Sort key (-count, token): Implemented via key = lambda kv: (-kv[1], kv[0]) and used both in sorted(...) and in heapq.nsmallest(...), ensuring count desc then token asc.\\n 2) Set top_k to the list of (token, count) of length min(k, unique tokens): m is computed as min(k, u) (with early return [] if k <= 0). Returns exactly m items from either full sort slice [:m] or nsmallest(m, ...), maintaining the required ordering.\\n 3) Edge case k <= 0: top_k_tokens returns [], so top_k becomes [].\\n 4) No input(), file I/O, or network used. Script runs with provided globals and does not modify them.\\n- Output contract: With provided globals, the else branch sets top_k = top_k_tokens(text, k). In the demo (when globals missing and __name__ == \\\"__main__\\\"), it defines top_k and prints only top_k on the last line. It does not rely on Counter.most_common for tie ordering. Therefore, all specified behaviors are satisfied.\\n- Minor note: If globals are missing and not running as __main__, top_k would remain undefined; however, the task states globals are provided, so this path is out of scope. Overall, adherence is complete.\",\n \"code_quality\": \"Code quality assessment:\\n- Correctness: Algorithm correctly counts tokens, handles lowercasing, and produces the exact Top-K with the specified ordering. It avoids Counter.most_common tie ordering.\\n- Efficiency: Uses Counter for O(N tokens), and switches between full sort and heapq.nsmallest for selection based on m relative to u, which is a good optimization. Key function with negative counts is appropriate for nsmallest.\\n- Readability/Structure: Clear helper _tokens with type hints. Well-named variables, concise logic, and explanatory comments for complexity and selection strategy. Regex is compiled once at module level.\\n- Maintainability/Best practices: Uses typing annotations, avoids side effects on inputs, and isolates logic in a function. The try/except to detect globals is reasonable; type: ignore comments are minimally intrusive. Optional print is constrained to demo path and prints only the required artifact. No obvious bugs or stylistic issues.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and well-structured. Minor non-issue: in non-main context without globals, top_k isn\u2019t set, but task guarantees globals, so this is acceptable.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step adherence check:\n- Tokenization rules: The code compiles a regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE and the _tokens generator yields matches lowercased. This matches the requirement to lowercase and to treat ASCII [a-z0-9]+ as tokens with other chars as separators.\n- Inputs/globals: It uses provided globals text and k without reassigning them. Access is guarded via a try/except NameError; when globals exist, it computes top_k from them.\n- Requirements:\n 1) Sort key (-count, token): Implemented via key = lambda kv: (-kv[1], kv[0]) and used both in sorted(...) and in heapq.nsmallest(...), ensuring count desc then token asc.\n 2) Set top_k to the list of (token, count) of length min(k, unique tokens): m is computed as min(k, u) (with early return [] if k <= 0). Returns exactly m items from either full sort slice [:m] or nsmallest(m, ...), maintaining the required ordering.\n 3) Edge case k <= 0: top_k_tokens returns [], so top_k becomes [].\n 4) No input(), file I/O, or network used. Script runs with provided globals and does not modify them.\n- Output contract: With provided globals, the else branch sets top_k = top_k_tokens(text, k). In the demo (when globals missing and __name__ == \"__main__\"), it defines top_k and prints only top_k on the last line. It does not rely on Counter.most_common for tie ordering. Therefore, all specified behaviors are satisfied.\n- Minor note: If globals are missing and not running as __main__, top_k would remain undefined; however, the task states globals are provided, so this path is out of scope. Overall, adherence is complete.", + "code_quality": "Code quality assessment:\n- Correctness: Algorithm correctly counts tokens, handles lowercasing, and produces the exact Top-K with the specified ordering. It avoids Counter.most_common tie ordering.\n- Efficiency: Uses Counter for O(N tokens), and switches between full sort and heapq.nsmallest for selection based on m relative to u, which is a good optimization. Key function with negative counts is appropriate for nsmallest.\n- Readability/Structure: Clear helper _tokens with type hints. Well-named variables, concise logic, and explanatory comments for complexity and selection strategy. Regex is compiled once at module level.\n- Maintainability/Best practices: Uses typing annotations, avoids side effects on inputs, and isolates logic in a function. The try/except to detect globals is reasonable; type: ignore comments are minimally intrusive. Optional print is constrained to demo path and prints only the required artifact. No obvious bugs or stylistic issues." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and well-structured. Minor non-issue: in non-main context without globals, top_k isn\u2019t set, but task guarantees globals, so this is acceptable." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_10.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_10.json new file mode 100644 index 0000000000..c2ca4a50c4 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_10.json @@ -0,0 +1,16 @@ +{ + "file": "run_10.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization requirements: The code lowercases tokens and uses a regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, then explicitly lowercases matches in _tokens(). This matches the spec: ASCII [a-z0-9]+ sequences, lowercase output, non-matching chars are separators.\\n- Inputs: It treats text and k as provided globals and does not reassign them. It accesses them via a presence check and then calls top_k_tokens(text, k). No input(), file I/O, or network access used.\\n- Requirement 1 (sorting key): Uses key = lambda kv: (-kv[1], kv[0]) both for sorted(...) and heapq.nsmallest(...), which implements sorting by count desc, then token asc. heapq.nsmallest returns results in increasing order of the key, which matches the required order.\\n- Requirement 2 (top_k content and length): top_k_tokens returns a list of (token, count) tuples. It computes k_eff = min(k, number of unique tokens) and slices to that length. The top-level assigns top_k = top_k_tokens(text, k), so top_k has the correct type and length.\\n- Requirement 3 (edge case k <= 0): top_k_tokens returns [] for k <= 0. That is propagated to top_k.\\n- Requirement 4 (script runs as-is): When globals text and k exist, top_k is defined at the end of execution via the else branch of the try/except. No prohibited I/O is used. If globals are missing, it only runs a demo when __main__, which is acceptable and does not violate the contract in the intended environment where globals are provided.\\n- Output contract: At the end, top_k is defined as specified when text and k are provided. Optional printing is only done in the demo path and prints just top_k on the last line.\\n- Note about Counter.most_common: The code does not use most_common; it implements the specified sort explicitly.\\nOverall, all instructions are satisfied without reliance on unspecified behavior. The minor demo branch does not interfere with the required behavior.\",\n \"code_quality\": \"- Correctness: The counting and selection logic is correct and adheres to the specified ordering, including tie-breaking.\\n- Efficiency: Uses Counter for O(N tokens) counting. Selects top-k via full sort when k is a substantial fraction of U, otherwise uses heapq.nsmallest for O(U log k). This is efficient and documented in a complexity comment.\\n- Readability and structure: Clear separation of concerns (_tokens, top_k_tokens). Type hints for inputs/outputs improve clarity. Variable names are concise but understandable (cnt, u, k_eff). Inline comments explain decisions.\\n- Maintainability: Modular with small, focused functions; easy to adapt. Avoids relying on Counter.most_common tie behavior as required.\\n- Style/best practices: Uses a compiled regex and a generator. Handles edge cases early. The try/except check for globals is slightly unconventional but effective and harmless. Using re.IGNORECASE plus lower() is slightly redundant but not problematic.\\nNo bugs or meaningful inefficiencies identified.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence and implementation. Correct tokenization, ordering, edge-case handling, and output variable set as required. Clean, efficient code with thoughtful selection strategy. Only negligible nits (redundant IGNORECASE given lowercasing).\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization requirements: The code lowercases tokens and uses a regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, then explicitly lowercases matches in _tokens(). This matches the spec: ASCII [a-z0-9]+ sequences, lowercase output, non-matching chars are separators.\n- Inputs: It treats text and k as provided globals and does not reassign them. It accesses them via a presence check and then calls top_k_tokens(text, k). No input(), file I/O, or network access used.\n- Requirement 1 (sorting key): Uses key = lambda kv: (-kv[1], kv[0]) both for sorted(...) and heapq.nsmallest(...), which implements sorting by count desc, then token asc. heapq.nsmallest returns results in increasing order of the key, which matches the required order.\n- Requirement 2 (top_k content and length): top_k_tokens returns a list of (token, count) tuples. It computes k_eff = min(k, number of unique tokens) and slices to that length. The top-level assigns top_k = top_k_tokens(text, k), so top_k has the correct type and length.\n- Requirement 3 (edge case k <= 0): top_k_tokens returns [] for k <= 0. That is propagated to top_k.\n- Requirement 4 (script runs as-is): When globals text and k exist, top_k is defined at the end of execution via the else branch of the try/except. No prohibited I/O is used. If globals are missing, it only runs a demo when __main__, which is acceptable and does not violate the contract in the intended environment where globals are provided.\n- Output contract: At the end, top_k is defined as specified when text and k are provided. Optional printing is only done in the demo path and prints just top_k on the last line.\n- Note about Counter.most_common: The code does not use most_common; it implements the specified sort explicitly.\nOverall, all instructions are satisfied without reliance on unspecified behavior. The minor demo branch does not interfere with the required behavior.", + "code_quality": "- Correctness: The counting and selection logic is correct and adheres to the specified ordering, including tie-breaking.\n- Efficiency: Uses Counter for O(N tokens) counting. Selects top-k via full sort when k is a substantial fraction of U, otherwise uses heapq.nsmallest for O(U log k). This is efficient and documented in a complexity comment.\n- Readability and structure: Clear separation of concerns (_tokens, top_k_tokens). Type hints for inputs/outputs improve clarity. Variable names are concise but understandable (cnt, u, k_eff). Inline comments explain decisions.\n- Maintainability: Modular with small, focused functions; easy to adapt. Avoids relying on Counter.most_common tie behavior as required.\n- Style/best practices: Uses a compiled regex and a generator. Handles edge cases early. The try/except check for globals is slightly unconventional but effective and harmless. Using re.IGNORECASE plus lower() is slightly redundant but not problematic.\nNo bugs or meaningful inefficiencies identified." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence and implementation. Correct tokenization, ordering, edge-case handling, and output variable set as required. Clean, efficient code with thoughtful selection strategy. Only negligible nits (redundant IGNORECASE given lowercasing)." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_11.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_11.json new file mode 100644 index 0000000000..225bc65398 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_11.json @@ -0,0 +1,16 @@ +{ + "file": "run_11.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE and lowercases each match via m.group(0).lower(). This meets the requirement: lowercase tokens and treat non [a-z0-9] ASCII as separators.\\n- Inputs: Reads globals text and k without reassigning them. Presence is checked via try/except; no mutation occurs.\\n- Requirement 1 (sorting): Implements key = (-count, token) and uses sorted(...) when k >= unique_count and heapq.nsmallest(...) otherwise. Since nsmallest returns results in ascending key order, this yields count desc then token asc. No reliance on Counter.most_common.\\n- Requirement 2 (top_k content/length): top_k_tokens returns a list of (token, count) pairs. Length is min(k, number of unique tokens) when k > 0; otherwise empty. The outer code assigns this list to the global variable top_k.\\n- Requirement 3 (k <= 0): Explicitly returns [] for k <= 0; thus top_k becomes [].\\n- Requirement 4 (no I/O): No input(), file, or network usage. Optional demo only prints top_k when globals are absent and __name__ == \\\"__main__\\\".\\n- Output contract: With provided globals, top_k is defined at the end as specified. Optional printing is suppressed in that case; in demo mode, it prints only top_k on the last line as a Python literal. Does not rely on Counter.most_common tie ordering.\\n- Edge considerations: Sorting tie-breaker by token asc is correctly handled. ASCII/lexicographic assumptions hold since tokens are lowercase ASCII strings.\",\n \"code_quality\": \"- Clarity/Structure: Clean separation of concerns with a token generator and a top_k_tokens function. Names are descriptive; type hints provided.\\n- Correctness: Algorithm correctly counts tokens and selects/sorts by the specified key. Handles k <= 0 and k >= unique_count cases correctly.\\n- Efficiency: Counting is O(N tokens). Selection uses heapq.nsmallest for O(U log k) when k < U, and full sort O(U log U) otherwise. Space usage matches the comment.\\n- Readability/Maintainability: Readable, concise, and well-commented. The complexity note is helpful.\\n- Best practices: Avoids Counter.most_common. Uses regex with ASCII and proper lowercasing. No side effects on inputs.\\n- Minor nits: re.IGNORECASE is redundant given explicit .lower() (either alone would suffice), but harmless. In environments without provided globals and not running as __main__, top_k won't be defined, though this is outside the stated execution context (globals are provided).\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence and implementation. Correct tokenization, ordering, edge-case handling, and output variable contract. Code is clear, efficient, and robust. Minor redundancy in IGNORECASE with lowercasing, but not impactful.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE and lowercases each match via m.group(0).lower(). This meets the requirement: lowercase tokens and treat non [a-z0-9] ASCII as separators.\n- Inputs: Reads globals text and k without reassigning them. Presence is checked via try/except; no mutation occurs.\n- Requirement 1 (sorting): Implements key = (-count, token) and uses sorted(...) when k >= unique_count and heapq.nsmallest(...) otherwise. Since nsmallest returns results in ascending key order, this yields count desc then token asc. No reliance on Counter.most_common.\n- Requirement 2 (top_k content/length): top_k_tokens returns a list of (token, count) pairs. Length is min(k, number of unique tokens) when k > 0; otherwise empty. The outer code assigns this list to the global variable top_k.\n- Requirement 3 (k <= 0): Explicitly returns [] for k <= 0; thus top_k becomes [].\n- Requirement 4 (no I/O): No input(), file, or network usage. Optional demo only prints top_k when globals are absent and __name__ == \"__main__\".\n- Output contract: With provided globals, top_k is defined at the end as specified. Optional printing is suppressed in that case; in demo mode, it prints only top_k on the last line as a Python literal. Does not rely on Counter.most_common tie ordering.\n- Edge considerations: Sorting tie-breaker by token asc is correctly handled. ASCII/lexicographic assumptions hold since tokens are lowercase ASCII strings.", + "code_quality": "- Clarity/Structure: Clean separation of concerns with a token generator and a top_k_tokens function. Names are descriptive; type hints provided.\n- Correctness: Algorithm correctly counts tokens and selects/sorts by the specified key. Handles k <= 0 and k >= unique_count cases correctly.\n- Efficiency: Counting is O(N tokens). Selection uses heapq.nsmallest for O(U log k) when k < U, and full sort O(U log U) otherwise. Space usage matches the comment.\n- Readability/Maintainability: Readable, concise, and well-commented. The complexity note is helpful.\n- Best practices: Avoids Counter.most_common. Uses regex with ASCII and proper lowercasing. No side effects on inputs.\n- Minor nits: re.IGNORECASE is redundant given explicit .lower() (either alone would suffice), but harmless. In environments without provided globals and not running as __main__, top_k won't be defined, though this is outside the stated execution context (globals are provided)." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence and implementation. Correct tokenization, ordering, edge-case handling, and output variable contract. Code is clear, efficient, and robust. Minor redundancy in IGNORECASE with lowercasing, but not impactful." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_12.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_12.json new file mode 100644 index 0000000000..3680320b0d --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_12.json @@ -0,0 +1,16 @@ +{ + "file": "run_12.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, then lowercases matches in _tokens(), yielding lowercase ASCII alnum sequences. This matches the requirement precisely (lowercase; tokens are ASCII [a-z0-9]+; non-matching chars act as separators).\\n- Inputs: Accesses globals text and k without reassigning them. Uses a try/except NameError to detect their presence and only runs a demo if they are missing and the script is __main__, which does not violate any requirement.\\n- Requirement 1 (ordering): Implements the exact sort key (-count, token). For k >= number of unique tokens, returns sorted(cnt.items(), key=key). For k < unique, uses heapq.nsmallest(k, cnt.items(), key=key). Since nsmallest returns items sorted ascending by the key, and the key is (-count, token), the output is correctly ordered by count desc then token asc.\\n- Requirement 2 (top_k value and length): top_k is set to the return of top_k_tokens(text, k), which returns a list of (token, count) tuples with length min(k, unique), or all if k >= unique. Satisfied.\\n- Requirement 3 (k <= 0): top_k_tokens returns [] when k <= 0; thus top_k becomes []. Satisfied.\\n- Requirement 4 (no I/O/network): No input(), file I/O, or network usage. The optional print occurs only in demo mode when globals are not provided and __name__ == \\\"__main__\\\".\\n- Output contract: With provided globals, top_k is defined exactly as required at the end of execution. Optional printing is not performed in that path; in demo mode, only top_k is printed on the last line. The code does not rely on Counter.most_common for tie handling.\\n- Edge considerations: If globals are not provided and not running as __main__, top_k would remain undefined, but the task specifies that text and k are provided, so this does not violate the contract for the intended usage.\",\n \"code_quality\": \"- Correctness: Logic for tokenization, counting, and top-k selection is correct and matches the specified sort order, including tie-breaking.\\n- Efficiency: O(T) counting, O(U log k) selection using heapq for k < U, and full sort only when needed. Space usage as documented in the comment. Efficient and appropriate.\\n- Readability/Structure: Clear separation of concerns (_tokens helper, top_k_tokens core function). Meaningful names, type hints, and a compiled regex constant. Inline key function is clear.\\n- Maintainability: Minimal, clean code with clear complexity comment. Does not depend on unspecified behaviors (avoids Counter.most_common ordering).\\n- Minor nits:\\n - re.IGNORECASE is redundant since matches are explicitly lowered; it can be removed without changing behavior.\\n - _tokens could be annotated as Iterator[str] instead of Iterable[str] for precision, but this is minor.\\n - The demo branch is guarded to avoid interfering with normal usage; good practice.\\nOverall code quality is high with only trivial improvements possible.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence to all requirements, including exact tokenization and specified ordering. Efficient and clean implementation. Minor, non-impactful redundancy in regex flags.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, then lowercases matches in _tokens(), yielding lowercase ASCII alnum sequences. This matches the requirement precisely (lowercase; tokens are ASCII [a-z0-9]+; non-matching chars act as separators).\n- Inputs: Accesses globals text and k without reassigning them. Uses a try/except NameError to detect their presence and only runs a demo if they are missing and the script is __main__, which does not violate any requirement.\n- Requirement 1 (ordering): Implements the exact sort key (-count, token). For k >= number of unique tokens, returns sorted(cnt.items(), key=key). For k < unique, uses heapq.nsmallest(k, cnt.items(), key=key). Since nsmallest returns items sorted ascending by the key, and the key is (-count, token), the output is correctly ordered by count desc then token asc.\n- Requirement 2 (top_k value and length): top_k is set to the return of top_k_tokens(text, k), which returns a list of (token, count) tuples with length min(k, unique), or all if k >= unique. Satisfied.\n- Requirement 3 (k <= 0): top_k_tokens returns [] when k <= 0; thus top_k becomes []. Satisfied.\n- Requirement 4 (no I/O/network): No input(), file I/O, or network usage. The optional print occurs only in demo mode when globals are not provided and __name__ == \"__main__\".\n- Output contract: With provided globals, top_k is defined exactly as required at the end of execution. Optional printing is not performed in that path; in demo mode, only top_k is printed on the last line. The code does not rely on Counter.most_common for tie handling.\n- Edge considerations: If globals are not provided and not running as __main__, top_k would remain undefined, but the task specifies that text and k are provided, so this does not violate the contract for the intended usage.", + "code_quality": "- Correctness: Logic for tokenization, counting, and top-k selection is correct and matches the specified sort order, including tie-breaking.\n- Efficiency: O(T) counting, O(U log k) selection using heapq for k < U, and full sort only when needed. Space usage as documented in the comment. Efficient and appropriate.\n- Readability/Structure: Clear separation of concerns (_tokens helper, top_k_tokens core function). Meaningful names, type hints, and a compiled regex constant. Inline key function is clear.\n- Maintainability: Minimal, clean code with clear complexity comment. Does not depend on unspecified behaviors (avoids Counter.most_common ordering).\n- Minor nits:\n - re.IGNORECASE is redundant since matches are explicitly lowered; it can be removed without changing behavior.\n - _tokens could be annotated as Iterator[str] instead of Iterable[str] for precision, but this is minor.\n - The demo branch is guarded to avoid interfering with normal usage; good practice.\nOverall code quality is high with only trivial improvements possible." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence to all requirements, including exact tokenization and specified ordering. Efficient and clean implementation. Minor, non-impactful redundancy in regex flags." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_13.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_13.json new file mode 100644 index 0000000000..f10a4bd795 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_13.json @@ -0,0 +1,16 @@ +{ + "file": "run_13.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses compiled regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE and lowers tokens via .lower(). This matches the requirement to tokenize ASCII [a-z0-9]+ sequences and lowercase the text.\\n- Inputs: The code reads globals text and k without reassigning them. It conditionally uses them if present; otherwise it runs a demo only under __main__. This respects the \\\"do not reassign\\\" constraint.\\n- Requirement 1 (sorting): Implements sorting by count desc, then token asc via key = lambda kv: (-kv[1], kv[0]). For k >= unique tokens, it uses sorted(..., key=key). For k < unique tokens, it uses heapq.nsmallest(k, ..., key=key). nsmallest returns items in sorted order by the given key, thus producing the exact required ordering.\\n- Requirement 2 (set top_k, correct length): Assigns top_k to the return value of top_k_tokens(text, k), which returns a list of (token, count) tuples. Length is min(k, number of unique tokens) due to either full sort when k >= U or nsmallest(k, ...).\\n- Requirement 3 (edge k <= 0): top_k_tokens returns [] when k <= 0, so top_k is set to [].\\n- Requirement 4 (no input/file/network; runs with provided globals): No input(), file I/O, or network used. With provided globals, the script sets top_k directly without printing. The optional demo path only runs when globals are missing and __name__ == \\\"__main__\\\".\\n- Output contract: With provided globals, top_k is defined at end as specified. Optional printing is only in the demo case and prints only top_k as a Python literal on the last line. The solution does not rely on Counter.most_common tie ordering.\\n- Minor note: If globals are missing and the script is not __main__, top_k would not be defined, but the task specifies that globals are provided, so this does not violate the requirements in the intended execution context.\",\n \"code_quality\": \"- Clarity/Structure: Clean separation via helper _tokens and top_k_tokens function; type hints provided; compiled regex improves readability and performance.\\n- Correctness: Sorting key correctly implements (-count, token). Uses heapq.nsmallest to achieve Top-K with correct ordering for k < U. Handles empty input and k <= 0.\\n- Efficiency: O(N) counting and O(U log k) selection for k < U; falls back to O(U log U) when k >= U. This is efficient and appropriate. The included complexity comment is accurate.\\n- Readability: Generally good. Minor nit: variable name u could be more descriptive (e.g., uniq). Lambda could destructure for readability, but current form is fine.\\n- Best practices: Avoids relying on Counter.most_common tie behavior; no unnecessary I/O; main guard used properly. The try/except for checking globals is functional but could be clearer using 'if \\\"text\\\" in globals() and \\\"k\\\" in globals()'.\\n- Minor nit: Using re.IGNORECASE and then lower() is redundant; lowercasing alone with ASCII regex would suffice. This does not affect correctness.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 4,\n \"comments\": \"Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clean and efficient. Minor nits: redundant IGNORECASE + lower(), and globals detection via try/except could be clearer.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses compiled regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE and lowers tokens via .lower(). This matches the requirement to tokenize ASCII [a-z0-9]+ sequences and lowercase the text.\n- Inputs: The code reads globals text and k without reassigning them. It conditionally uses them if present; otherwise it runs a demo only under __main__. This respects the \"do not reassign\" constraint.\n- Requirement 1 (sorting): Implements sorting by count desc, then token asc via key = lambda kv: (-kv[1], kv[0]). For k >= unique tokens, it uses sorted(..., key=key). For k < unique tokens, it uses heapq.nsmallest(k, ..., key=key). nsmallest returns items in sorted order by the given key, thus producing the exact required ordering.\n- Requirement 2 (set top_k, correct length): Assigns top_k to the return value of top_k_tokens(text, k), which returns a list of (token, count) tuples. Length is min(k, number of unique tokens) due to either full sort when k >= U or nsmallest(k, ...).\n- Requirement 3 (edge k <= 0): top_k_tokens returns [] when k <= 0, so top_k is set to [].\n- Requirement 4 (no input/file/network; runs with provided globals): No input(), file I/O, or network used. With provided globals, the script sets top_k directly without printing. The optional demo path only runs when globals are missing and __name__ == \"__main__\".\n- Output contract: With provided globals, top_k is defined at end as specified. Optional printing is only in the demo case and prints only top_k as a Python literal on the last line. The solution does not rely on Counter.most_common tie ordering.\n- Minor note: If globals are missing and the script is not __main__, top_k would not be defined, but the task specifies that globals are provided, so this does not violate the requirements in the intended execution context.", + "code_quality": "- Clarity/Structure: Clean separation via helper _tokens and top_k_tokens function; type hints provided; compiled regex improves readability and performance.\n- Correctness: Sorting key correctly implements (-count, token). Uses heapq.nsmallest to achieve Top-K with correct ordering for k < U. Handles empty input and k <= 0.\n- Efficiency: O(N) counting and O(U log k) selection for k < U; falls back to O(U log U) when k >= U. This is efficient and appropriate. The included complexity comment is accurate.\n- Readability: Generally good. Minor nit: variable name u could be more descriptive (e.g., uniq). Lambda could destructure for readability, but current form is fine.\n- Best practices: Avoids relying on Counter.most_common tie behavior; no unnecessary I/O; main guard used properly. The try/except for checking globals is functional but could be clearer using 'if \"text\" in globals() and \"k\" in globals()'.\n- Minor nit: Using re.IGNORECASE and then lower() is redundant; lowercasing alone with ASCII regex would suffice. This does not affect correctness." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 4, + "comments": "Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clean and efficient. Minor nits: redundant IGNORECASE + lower(), and globals detection via try/except could be clearer." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_14.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_14.json new file mode 100644 index 0000000000..af7fd2d1bc --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_14.json @@ -0,0 +1,6 @@ +{ + "file": "run_14.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses a compiled regex r\\\"[a-z0-9]+\\\" with flags re.ASCII and re.IGNORECASE, and explicitly lowercases each token via .lower(). This satisfies the requirement to lowercase and to treat non-[a-z0-9] ASCII as separators using a regex.\\n- Inputs/Globals: The code reads globals text and k without reassigning them. It guards with a try/except to detect their presence. When provided, it sets top_k = top_k_tokens(text, k) as required. It does not use input(), file I/O, or networking.\\n- Requirement 1 (Sorting key): Sorting is implemented with key=lambda kv: (-kv[1], kv[0]) on (token, count) pairs. For k >= u or for the full-sort path, it uses sorted(..., key=key), which orders by count desc then token asc. For the selection path, it uses heapq.nsmallest(k, cnt.items(), key=key), which returns the k smallest by (-count, token) and in sorted order, equivalent to the specified ordering. It does not rely on Counter.most_common.\\n- Requirement 2 (Set top_k list and length): top_k_tokens returns a list of (token, count) tuples. Branches ensure the length is min(k, number of unique tokens):\\n - If k >= u: returns all u items sorted.\\n - Else: returns first k items either via sorted(... )[:k] or nsmallest(k, ...), both giving exactly k items.\\n- Requirement 3 (Edge case k <= 0): top_k_tokens returns [] when k <= 0. If there are zero unique tokens (u == 0), it also returns [].\\n- Requirement 4 (Script behavior): With provided globals, it computes and assigns top_k accordingly. No forbidden I/O is used. Optional printing is only performed in a demo block when globals are absent and __name__ == \\\"__main__\\\"; it prints only top_k on the last line, complying with the optional print rule.\\n- Output contract: At end of execution with provided globals, top_k is defined exactly as described. In demo mode, top_k is still defined and printed. No reliance on Counter.most_common tie behavior.\\n- Minor note: If executed without globals and not as __main__ (e.g., imported), top_k would not be defined; however, the task specifies that globals are provided for the execution context, so this does not violate the stated contract.\",\n \"code_quality\": \"- Correctness: Logic matches the specification precisely; ordering key is correct; selection via heapq.nsmallest yields exact Top-K in the required order.\\n- Efficiency: Uses Counter for O(N) counting. Selects between full sort O(U log U) when k is a large fraction of U and heap-based selection O(U log k) otherwise; includes a clear complexity comment.\\n- Readability/Maintainability: Code is clear and well-structured. Helper function _tokens and compiled regex improve clarity and performance. Type hints are used appropriately. Inline comment explains the selection heuristic.\\n- Best practices: Avoids Counter.most_common ties. Does not mutate globals. Demo guarded by if __name__ == \\\"__main__\\\". Uses key functions instead of relying on unspecified tie ordering.\\n- Minor nit: re.IGNORECASE is redundant since tokens are lowercased afterward, but it does not harm correctness.\",\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements exactly, including tokenization, ordering, edge cases, and output contract. Code is clean, efficient, and well-structured. Only minor nit is redundant IGNORECASE given explicit lowercasing.\"\n }\n}", + "parsed": null, + "parse_error": "Expecting ',' delimiter: line 10 column 2 (char 3422)" +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_15.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_15.json new file mode 100644 index 0000000000..fd0c549a45 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_15.json @@ -0,0 +1,16 @@ +{ + "file": "run_15.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE and lowercases matches in _tokens(). This matches the requirement to lowercase and treat ASCII [a-z0-9]+ as tokens with other chars as separators.\\n- Inputs: Reads provided globals text and k without reassigning them. Presence is checked via try/except; if present, computes result from those globals.\\n- Requirement 1 (sorting): Implements key = lambda kv: (-kv[1], kv[0]) and uses it both in sorted(...) and heapq.nsmallest(...). This produces exact ordering by count desc then token asc. Does not rely on Counter.most_common.\\n- Requirement 2 (top_k contents/length): top_k is set to a list of (token, count) tuples with length min(k, number of unique tokens). Both branches slice or request min(k, u) items.\\n- Requirement 3 (edge k <= 0): top_k_tokens returns [] when k <= 0; thus top_k becomes [].\\n- Requirement 4 (no I/O/network): No input(), file, or network usage. Script runs with provided globals; optional demo only executes under __main__ when globals are absent.\\n- Output contract: When globals text and k are present, top_k is defined exactly as specified. Optional printing occurs only in the demo fallback and prints only top_k on the last line. No extraneous output in the primary (globals-present) path.\\n- Ambiguities/choices: The code chooses between full sort and heap selection for efficiency (0.3 threshold). Both paths yield exact ordering and correct results, so this is reasonable and consistent with requirements.\",\n \"code_quality\": \"- Clarity/readability: Clear structure with helper tokenizer function, type hints, and meaningful variable names. Comments explain complexity and selection strategy.\\n- Correctness: Tokenization and ordering strictly follow the spec. Heap-based selection uses key=(-count, token) ensuring exact tie-breaking and final order (heapq.nsmallest returns items sorted by key).\\n- Efficiency: Uses Counter for O(N) counting and adapts between O(U log U) full sort and O(U log k) selection; good optimization.\\n- Maintainability: Modular function top_k_tokens, compiled regex, and straightforward control flow. No reliance on unspecified tie behavior.\\n- Style/best practices: Avoids mutating provided globals; uses typing; avoids unnecessary I/O. Minor nit: variables like 'cnt'/'u' are concise but still readable.\\n- No bugs or inefficiencies detected relevant to the task. The fallback demo doesn't interfere with primary behavior.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence: correct tokenization, ordering, edge handling, and output variable. Code is clean, efficient, and robust. The optional demo is safely isolated under __main__.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE and lowercases matches in _tokens(). This matches the requirement to lowercase and treat ASCII [a-z0-9]+ as tokens with other chars as separators.\n- Inputs: Reads provided globals text and k without reassigning them. Presence is checked via try/except; if present, computes result from those globals.\n- Requirement 1 (sorting): Implements key = lambda kv: (-kv[1], kv[0]) and uses it both in sorted(...) and heapq.nsmallest(...). This produces exact ordering by count desc then token asc. Does not rely on Counter.most_common.\n- Requirement 2 (top_k contents/length): top_k is set to a list of (token, count) tuples with length min(k, number of unique tokens). Both branches slice or request min(k, u) items.\n- Requirement 3 (edge k <= 0): top_k_tokens returns [] when k <= 0; thus top_k becomes [].\n- Requirement 4 (no I/O/network): No input(), file, or network usage. Script runs with provided globals; optional demo only executes under __main__ when globals are absent.\n- Output contract: When globals text and k are present, top_k is defined exactly as specified. Optional printing occurs only in the demo fallback and prints only top_k on the last line. No extraneous output in the primary (globals-present) path.\n- Ambiguities/choices: The code chooses between full sort and heap selection for efficiency (0.3 threshold). Both paths yield exact ordering and correct results, so this is reasonable and consistent with requirements.", + "code_quality": "- Clarity/readability: Clear structure with helper tokenizer function, type hints, and meaningful variable names. Comments explain complexity and selection strategy.\n- Correctness: Tokenization and ordering strictly follow the spec. Heap-based selection uses key=(-count, token) ensuring exact tie-breaking and final order (heapq.nsmallest returns items sorted by key).\n- Efficiency: Uses Counter for O(N) counting and adapts between O(U log U) full sort and O(U log k) selection; good optimization.\n- Maintainability: Modular function top_k_tokens, compiled regex, and straightforward control flow. No reliance on unspecified tie behavior.\n- Style/best practices: Avoids mutating provided globals; uses typing; avoids unnecessary I/O. Minor nit: variables like 'cnt'/'u' are concise but still readable.\n- No bugs or inefficiencies detected relevant to the task. The fallback demo doesn't interfere with primary behavior." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence: correct tokenization, ordering, edge handling, and output variable. Code is clean, efficient, and robust. The optional demo is safely isolated under __main__." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_16.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_16.json new file mode 100644 index 0000000000..d2b86edde1 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_16.json @@ -0,0 +1,16 @@ +{ + "file": "run_16.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against the instructions:\\n- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, then lowercases tokens via m.group(0).lower(). This matches the requirement to lowercase and treat ASCII [a-z0-9]+ as tokens with others as separators.\\n- Inputs: The code expects globals text and k without reassigning them. It references them in a try/except NameError block and only uses demo variables if they are missing and the script is run as __main__. This respects the \\\"do not reassign\\\" rule.\\n- Requirement 1 (sorting): Sorting key is defined as (-count, token) via key=lambda kv: (-kv[1], kv[0]) applied to cnt.items() where kv=(token, count). For k >= u it uses sorted(..., key=key). For k < u it uses heapq.nsmallest(k, ..., key=key), which returns the k smallest elements sorted according to the key. Given the negative count in the key, this yields exact Top-K by count desc, token asc with correct tie-breaking. It does not rely on Counter.most_common.\\n- Requirement 2 (set top_k): In the presence of provided globals, it sets top_k = top_k_tokens(text, k). The function returns a list of (token, count) tuples of length min(k, number of unique tokens), or [] for edge cases, as required.\\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns [], so top_k becomes []. If there are 0 unique tokens (u == 0), it returns []. Both cases handled.\\n- Requirement 4 (no input/file/network; runs as-is): No input(), file I/O, or network use. With provided globals, the script runs and defines top_k accordingly.\\n- Output contract: At end of execution (with provided globals), top_k is defined exactly as specified. Optional printing: In the demo-only fallback, it prints only top_k on the last line as a Python literal; with provided globals it does not print, which is allowed.\\n- Minor note: If globals are missing and the script is not __main__, top_k would not be defined, but the task states globals are provided, so this is acceptable.\\nOverall, the code fully adheres to the task instructions.\",\n \"code_quality\": \"Notable points:\\n- Correctness: Implements required tokenization and sorting with explicit key. Handles all specified edge cases.\\n- Efficiency: Counting is O(N tokens). Selection is O(U log k) via heapq.nsmallest, and falls back to full sort when k >= U. Space is O(U + k). This aligns with best practices for Top-K.\\n- Readability/Maintainability: Clear structure with a dedicated tokenizer, compiled regex, type hints for function signatures, and explanatory comments. The key function is explicit about tie-breaking.\\n- Robustness: Avoids relying on Counter.most_common ordering. Uses try/except for presence of globals without reassigning them. The 'type: ignore' comments are harmless and improve static typing compatibility.\\n- Style: Clean imports, sensible names, no unnecessary globals modified. Optional demo is properly gated under __main__ and does not interfere with required behavior.\\nNo bugs or inefficiencies detected; overall code quality is high.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Fully meets the spec with correct tokenization, sorting, edge-case handling, and output contract. Efficient and readable implementation; optional demo is properly isolated. Minor non-issue: top_k is not defined if globals are absent and not __main__, but this is outside the stated operating conditions.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against the instructions:\n- Tokenization: Uses regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, then lowercases tokens via m.group(0).lower(). This matches the requirement to lowercase and treat ASCII [a-z0-9]+ as tokens with others as separators.\n- Inputs: The code expects globals text and k without reassigning them. It references them in a try/except NameError block and only uses demo variables if they are missing and the script is run as __main__. This respects the \"do not reassign\" rule.\n- Requirement 1 (sorting): Sorting key is defined as (-count, token) via key=lambda kv: (-kv[1], kv[0]) applied to cnt.items() where kv=(token, count). For k >= u it uses sorted(..., key=key). For k < u it uses heapq.nsmallest(k, ..., key=key), which returns the k smallest elements sorted according to the key. Given the negative count in the key, this yields exact Top-K by count desc, token asc with correct tie-breaking. It does not rely on Counter.most_common.\n- Requirement 2 (set top_k): In the presence of provided globals, it sets top_k = top_k_tokens(text, k). The function returns a list of (token, count) tuples of length min(k, number of unique tokens), or [] for edge cases, as required.\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns [], so top_k becomes []. If there are 0 unique tokens (u == 0), it returns []. Both cases handled.\n- Requirement 4 (no input/file/network; runs as-is): No input(), file I/O, or network use. With provided globals, the script runs and defines top_k accordingly.\n- Output contract: At end of execution (with provided globals), top_k is defined exactly as specified. Optional printing: In the demo-only fallback, it prints only top_k on the last line as a Python literal; with provided globals it does not print, which is allowed.\n- Minor note: If globals are missing and the script is not __main__, top_k would not be defined, but the task states globals are provided, so this is acceptable.\nOverall, the code fully adheres to the task instructions.", + "code_quality": "Notable points:\n- Correctness: Implements required tokenization and sorting with explicit key. Handles all specified edge cases.\n- Efficiency: Counting is O(N tokens). Selection is O(U log k) via heapq.nsmallest, and falls back to full sort when k >= U. Space is O(U + k). This aligns with best practices for Top-K.\n- Readability/Maintainability: Clear structure with a dedicated tokenizer, compiled regex, type hints for function signatures, and explanatory comments. The key function is explicit about tie-breaking.\n- Robustness: Avoids relying on Counter.most_common ordering. Uses try/except for presence of globals without reassigning them. The 'type: ignore' comments are harmless and improve static typing compatibility.\n- Style: Clean imports, sensible names, no unnecessary globals modified. Optional demo is properly gated under __main__ and does not interfere with required behavior.\nNo bugs or inefficiencies detected; overall code quality is high." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Fully meets the spec with correct tokenization, sorting, edge-case handling, and output contract. Efficient and readable implementation; optional demo is properly isolated. Minor non-issue: top_k is not defined if globals are absent and not __main__, but this is outside the stated operating conditions." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_17.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_17.json new file mode 100644 index 0000000000..22a4a87070 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_17.json @@ -0,0 +1,16 @@ +{ + "file": "run_17.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex [a-z0-9]+ with re.ASCII and re.IGNORECASE, and lowercases tokens in _tokens(). This matches the spec (ASCII alnum sequences; others are separators; lowercase).\\n- Sorting: Sort key implemented as key = lambda kv: (-kv[1], kv[0]) over Counter items (kv = (token, count)). For k >= unique count, uses sorted(..., key=key); for k < unique, uses heapq.nsmallest(k, ..., key=key), which returns results sorted by the same key. This satisfies count desc then token asc.\\n- Output top_k: When globals text and k exist, assigns top_k = top_k_tokens(text, k). Length is min(k, unique) due to branching; when k <= 0, function returns [] (requirement 3).\\n- Edge cases: k <= 0 handled; empty text (u == 0) returns [] as well.\\n- Inputs handling: Does not reassign text or k. Uses them if present; otherwise, under __main__, runs a self-contained demo without touching globals (allowed optional behavior).\\n- I/O/network: No input(), file I/O, or network access. Optional print only in demo path, printing exactly top_k on one line.\\n- Avoids Counter.most_common tie ordering: Yes, does not use most_common.\\n- End-of-execution contract: With provided globals, top_k is defined as required. Demo path also defines and prints top_k. All requirements are met.\",\n \"code_quality\": \"- Clarity/structure: Clean separation of concerns (_tokens, top_k_tokens). Clear variable names and brief comments. Type hints provided.\\n- Correctness: Regex and lowering implement the specified tokenization. Sorting/tie-breaking is correct. Handles edge cases.\\n- Efficiency: Counting O(N tokens); selection O(U log k) via heap for k < U; otherwise full sort. Appropriate for Top-K.\\n- Readability/maintainability: Concise, readable, and idiomatic. Minimal, well-placed comments. No unnecessary complexity.\\n- Minor nit: re.IGNORECASE is redundant since tokens are lowercased, but harmless. Overall excellent quality.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all instructions precisely, including tokenization, sorting, edge cases, and output contract. Code is clean, efficient, and well-structured. Minor optional refinement: remove redundant IGNORECASE.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex [a-z0-9]+ with re.ASCII and re.IGNORECASE, and lowercases tokens in _tokens(). This matches the spec (ASCII alnum sequences; others are separators; lowercase).\n- Sorting: Sort key implemented as key = lambda kv: (-kv[1], kv[0]) over Counter items (kv = (token, count)). For k >= unique count, uses sorted(..., key=key); for k < unique, uses heapq.nsmallest(k, ..., key=key), which returns results sorted by the same key. This satisfies count desc then token asc.\n- Output top_k: When globals text and k exist, assigns top_k = top_k_tokens(text, k). Length is min(k, unique) due to branching; when k <= 0, function returns [] (requirement 3).\n- Edge cases: k <= 0 handled; empty text (u == 0) returns [] as well.\n- Inputs handling: Does not reassign text or k. Uses them if present; otherwise, under __main__, runs a self-contained demo without touching globals (allowed optional behavior).\n- I/O/network: No input(), file I/O, or network access. Optional print only in demo path, printing exactly top_k on one line.\n- Avoids Counter.most_common tie ordering: Yes, does not use most_common.\n- End-of-execution contract: With provided globals, top_k is defined as required. Demo path also defines and prints top_k. All requirements are met.", + "code_quality": "- Clarity/structure: Clean separation of concerns (_tokens, top_k_tokens). Clear variable names and brief comments. Type hints provided.\n- Correctness: Regex and lowering implement the specified tokenization. Sorting/tie-breaking is correct. Handles edge cases.\n- Efficiency: Counting O(N tokens); selection O(U log k) via heap for k < U; otherwise full sort. Appropriate for Top-K.\n- Readability/maintainability: Concise, readable, and idiomatic. Minimal, well-placed comments. No unnecessary complexity.\n- Minor nit: re.IGNORECASE is redundant since tokens are lowercased, but harmless. Overall excellent quality." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all instructions precisely, including tokenization, sorting, edge cases, and output contract. Code is clean, efficient, and well-structured. Minor optional refinement: remove redundant IGNORECASE." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_18.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_18.json new file mode 100644 index 0000000000..0f0b9c07c3 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_18.json @@ -0,0 +1,16 @@ +{ + "file": "run_18.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: The code lowercases tokens in _tokens() and uses a regex compiled with re.ASCII and re.IGNORECASE to match [a-z0-9]+, treating non-matching characters as separators. This exactly matches the tokenization requirements.\\n- Inputs: It reads the provided globals text and k without reassigning them. The try/except NameError guard only checks for their existence; it does not modify them.\\n- Requirement 1 (sorting): Implements the specified ordering via key = lambda kv: (-kv[1], kv[0]) and uses either sorted(...) or heapq.nsmallest(..., key=key). This yields sorting by count descending, then token ascending. nsmallest returns results in ascending order by the provided key, which corresponds to the desired order when using -count.\\n- Requirement 2 (top_k content/length): top_k is set to the list returned by top_k_tokens(text, k). If k >= number of unique tokens, it returns all items sorted correctly; if 0 < k < unique tokens, it returns k items; if k <= 0, it returns []. Length is thus min(k, unique) with the k <= 0 edge handled explicitly.\\n- Requirement 3 (k <= 0): top_k_tokens returns [] for k <= 0, so top_k becomes [] as required.\\n- Requirement 4 (no I/O): No input(), file I/O, or network access is used. Optional demo printing occurs only when globals are absent and __name__ == \\\"__main__\\\"; with provided globals, nothing is printed.\\n- Output contract: With provided globals, top_k is defined at the end as required. Optional printing is compliant (only prints top_k on the last line in demo mode). It does not rely on Counter.most_common tie ordering.\\n- Ambiguities: If globals are not provided and not running as __main__, top_k would not be defined; however, the task states the globals are provided, so this is acceptable. The approach is reasonable and does not violate any constraints.\",\n \"code_quality\": \"- Correctness: Logic is correct for counting, tokenization, ordering, and edge cases. Does not use Counter.most_common.\\n- Efficiency: Counting is O(T) over tokens. Selection is O(U log k) via heapq.nsmallest when k < U and O(U log U) via sorted when k >= U. Space O(U + k). This is efficient and appropriate.\\n- Readability and structure: Clear helper tokenizer, descriptive names, type hints, and helpful comments (including complexity). The key function is simple and correct.\\n- Maintainability: Modularized via top_k_tokens; tokenization encapsulated; easy to adapt. No unnecessary globals mutated.\\n- Minor nit: Forcing the heap path by passing an iterator to nsmallest is a micro-optimization tied to CPython behavior; not harmful, but slightly over-engineered. The try/except to probe globals is fine, though using 'if \\\"text\\\" in globals()' could be clearer. These are minor style points and do not affect correctness.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence and implementation: correct tokenization, ordering, edge-case handling, and output contract. Code is clean, efficient, and well-structured. Only minor stylistic nits.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: The code lowercases tokens in _tokens() and uses a regex compiled with re.ASCII and re.IGNORECASE to match [a-z0-9]+, treating non-matching characters as separators. This exactly matches the tokenization requirements.\n- Inputs: It reads the provided globals text and k without reassigning them. The try/except NameError guard only checks for their existence; it does not modify them.\n- Requirement 1 (sorting): Implements the specified ordering via key = lambda kv: (-kv[1], kv[0]) and uses either sorted(...) or heapq.nsmallest(..., key=key). This yields sorting by count descending, then token ascending. nsmallest returns results in ascending order by the provided key, which corresponds to the desired order when using -count.\n- Requirement 2 (top_k content/length): top_k is set to the list returned by top_k_tokens(text, k). If k >= number of unique tokens, it returns all items sorted correctly; if 0 < k < unique tokens, it returns k items; if k <= 0, it returns []. Length is thus min(k, unique) with the k <= 0 edge handled explicitly.\n- Requirement 3 (k <= 0): top_k_tokens returns [] for k <= 0, so top_k becomes [] as required.\n- Requirement 4 (no I/O): No input(), file I/O, or network access is used. Optional demo printing occurs only when globals are absent and __name__ == \"__main__\"; with provided globals, nothing is printed.\n- Output contract: With provided globals, top_k is defined at the end as required. Optional printing is compliant (only prints top_k on the last line in demo mode). It does not rely on Counter.most_common tie ordering.\n- Ambiguities: If globals are not provided and not running as __main__, top_k would not be defined; however, the task states the globals are provided, so this is acceptable. The approach is reasonable and does not violate any constraints.", + "code_quality": "- Correctness: Logic is correct for counting, tokenization, ordering, and edge cases. Does not use Counter.most_common.\n- Efficiency: Counting is O(T) over tokens. Selection is O(U log k) via heapq.nsmallest when k < U and O(U log U) via sorted when k >= U. Space O(U + k). This is efficient and appropriate.\n- Readability and structure: Clear helper tokenizer, descriptive names, type hints, and helpful comments (including complexity). The key function is simple and correct.\n- Maintainability: Modularized via top_k_tokens; tokenization encapsulated; easy to adapt. No unnecessary globals mutated.\n- Minor nit: Forcing the heap path by passing an iterator to nsmallest is a micro-optimization tied to CPython behavior; not harmful, but slightly over-engineered. The try/except to probe globals is fine, though using 'if \"text\" in globals()' could be clearer. These are minor style points and do not affect correctness." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence and implementation: correct tokenization, ordering, edge-case handling, and output contract. Code is clean, efficient, and well-structured. Only minor stylistic nits." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_19.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_19.json new file mode 100644 index 0000000000..250d93654b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_19.json @@ -0,0 +1,16 @@ +{ + "file": "run_19.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Tokenization: Uses a compiled regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE and lowercases each match in _tokens(). This satisfies \\\"Lowercase the text\\\" and \\\"Tokens are ASCII [a-z0-9]+ sequences; all other characters are separators (use a regex).\\\"\\n- Inputs: The script expects two globals, text and k, and does not reassign them. It accesses them read-only and assigns the result to a new variable top_k. Compliant.\\n- Requirement 1 (sorting): Sorting key defined as key = lambda kv: (-kv[1], kv[0]), which implements count desc then token asc. For k < u, it either slices a fully sorted list or uses heapq.nsmallest with the same key, which returns items sorted by the key. No reliance on Counter.most_common. Compliant.\\n- Requirement 2 (top_k list of tuples, length min(k, unique)): top_k_tokens returns a list of (token, count) pairs; branches ensure the length is min(k, u) when k > 0, and full list when k >= u. Compliant.\\n- Requirement 3 (k <= 0 -> []): top_k_tokens returns [] for non-int k or k <= 0. This covers k <= 0 explicitly. Compliant.\\n- Requirement 4 (no input/file/network; run with provided globals): No input(), file I/O, or network. When globals exist, it computes top_k directly. If globals are missing and __name__ == \\\"__main__\\\", it runs a demo and prints; otherwise does nothing. With provided globals (as per task), it runs as-is. Compliant.\\n- Output contract: At end, top_k is defined via top_k = top_k_tokens(text, k) when globals are provided, matching the contract. Optional printing is only in the demo branch and prints only top_k on the last line. Compliant.\\n- Tie ordering: Explicitly implemented via sort key; does not use most_common. Compliant.\\nMinor note: If imported as a module without provided globals and not run as __main__, top_k will not be defined; however, the task guarantees the globals are provided, so this is not a violation.\",\n \"code_quality\": \"The code is clear, correct, and efficient:\\n- Correctness: Tokenization, counting, sorting key, and selection logic all match the specification. Edge cases (k <= 0, no tokens) are handled.\\n- Efficiency: Uses Counter for counting and conditionally chooses between full sort and heapq.nsmallest based on k relative to unique count, reducing unnecessary O(U log U) sorting for small k. Results remain correctly sorted.\\n- Readability/Maintainability: Good names, type hints, modularization (_tokens, top_k_tokens), and concise logic. Compiled regex is reused. No reliance on unspecified tie-ordering. Comments are minimal and helpful.\\n- Best practices: No side effects except optional demo printing; no I/O beyond that; does not mutate provided globals; avoids most_common tie pitfalls.\\nNo bugs or style issues that impact the task. The only benign nit is the demo/__main__ branch leaving top_k undefined when globals are absent and not __main__, which is outside the task's operational context.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all requirements precisely: correct tokenization, sorting, edge-case handling, and final top_k definition. Code is clean, efficient, and avoids most_common tie-ordering. Optional demo printing is appropriate.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Tokenization: Uses a compiled regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE and lowercases each match in _tokens(). This satisfies \"Lowercase the text\" and \"Tokens are ASCII [a-z0-9]+ sequences; all other characters are separators (use a regex).\"\n- Inputs: The script expects two globals, text and k, and does not reassign them. It accesses them read-only and assigns the result to a new variable top_k. Compliant.\n- Requirement 1 (sorting): Sorting key defined as key = lambda kv: (-kv[1], kv[0]), which implements count desc then token asc. For k < u, it either slices a fully sorted list or uses heapq.nsmallest with the same key, which returns items sorted by the key. No reliance on Counter.most_common. Compliant.\n- Requirement 2 (top_k list of tuples, length min(k, unique)): top_k_tokens returns a list of (token, count) pairs; branches ensure the length is min(k, u) when k > 0, and full list when k >= u. Compliant.\n- Requirement 3 (k <= 0 -> []): top_k_tokens returns [] for non-int k or k <= 0. This covers k <= 0 explicitly. Compliant.\n- Requirement 4 (no input/file/network; run with provided globals): No input(), file I/O, or network. When globals exist, it computes top_k directly. If globals are missing and __name__ == \"__main__\", it runs a demo and prints; otherwise does nothing. With provided globals (as per task), it runs as-is. Compliant.\n- Output contract: At end, top_k is defined via top_k = top_k_tokens(text, k) when globals are provided, matching the contract. Optional printing is only in the demo branch and prints only top_k on the last line. Compliant.\n- Tie ordering: Explicitly implemented via sort key; does not use most_common. Compliant.\nMinor note: If imported as a module without provided globals and not run as __main__, top_k will not be defined; however, the task guarantees the globals are provided, so this is not a violation.", + "code_quality": "The code is clear, correct, and efficient:\n- Correctness: Tokenization, counting, sorting key, and selection logic all match the specification. Edge cases (k <= 0, no tokens) are handled.\n- Efficiency: Uses Counter for counting and conditionally chooses between full sort and heapq.nsmallest based on k relative to unique count, reducing unnecessary O(U log U) sorting for small k. Results remain correctly sorted.\n- Readability/Maintainability: Good names, type hints, modularization (_tokens, top_k_tokens), and concise logic. Compiled regex is reused. No reliance on unspecified tie-ordering. Comments are minimal and helpful.\n- Best practices: No side effects except optional demo printing; no I/O beyond that; does not mutate provided globals; avoids most_common tie pitfalls.\nNo bugs or style issues that impact the task. The only benign nit is the demo/__main__ branch leaving top_k undefined when globals are absent and not __main__, which is outside the task's operational context." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all requirements precisely: correct tokenization, sorting, edge-case handling, and final top_k definition. Code is clean, efficient, and avoids most_common tie-ordering. Optional demo printing is appropriate." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_20.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_20.json new file mode 100644 index 0000000000..d689ff1a80 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_20.json @@ -0,0 +1,16 @@ +{ + "file": "run_20.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization requirements: The code uses a compiled regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE, and each match is lowercased in _tokens(). This satisfies \\\"Lowercase the text\\\" and \\\"Tokens are ASCII [a-z0-9]+ sequences; all other characters are separators (use a regex).\\\"\\n- Sorting and Top-K requirements: The key function key = lambda kv: (-kv[1], kv[0]) correctly implements sort by count descending, then token ascending. All selection paths (full sort, heapq.nsmallest, partial slice after sort) produce results ordered by that key.\\n- Output variable: The script sets top_k to a list of (token, count) tuples via top_k = top_k_tokens(text, k) when globals text and k are present, matching the requirement that top_k length equals min(k, number of unique tokens).\\n- Edge case k <= 0: top_k_tokens returns [] when k <= 0, so top_k will be [] in that case.\\n- Inputs handling: It does not reassign text or k; it only reads them. No input(), file I/O, or network access are used.\\n- Output contract: At end of execution (with provided globals), top_k is defined exactly as specified. Optional printing is only done in the demo path (when globals are missing and running as __main__), and it prints only top_k on the last line, which is acceptable.\\n- Tie handling and Counter.most_common: The code does not use most_common and explicitly implements the specified sort.\\n- Minor note: If the globals are missing and the code is imported (not __main__), top_k would not be defined. However, the task states the globals are provided; under that scenario, the script meets all requirements.\",\n \"code_quality\": \"- Correctness: The algorithm correctly counts tokens and selects/sorts Top-K with the specified key. heapq.nsmallest is used appropriately with a key that encodes both count and token to ensure correct tie-breaking and ordering of the returned list.\\n- Efficiency: Uses O(N) counting and selects between O(U log k) via nsmallest for small k and O(U log U) sorting otherwise. This is efficient and well-considered. The heuristic threshold (0.3 * U) is reasonable.\\n- Clarity and readability: Clear function decomposition (_tokens, top_k_tokens), meaningful variable names, and a concise key function. Type hints and a complexity comment improve maintainability.\\n- Maintainability/structure: Regex compiled once at module level. No reliance on undefined tie behavior. Edge cases are handled explicitly.\\n- Minor nits: Using both re.IGNORECASE and lowercasing tokens is slightly redundant (either lower the input first or keep IGNORECASE), but harmless and clear. The demo/import fallback path is fine; if imported without globals, top_k is not defined, but this is outside the stated execution context.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and robust. Minor redundancy in case handling (IGNORECASE plus lowercasing) is negligible.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization requirements: The code uses a compiled regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE, and each match is lowercased in _tokens(). This satisfies \"Lowercase the text\" and \"Tokens are ASCII [a-z0-9]+ sequences; all other characters are separators (use a regex).\"\n- Sorting and Top-K requirements: The key function key = lambda kv: (-kv[1], kv[0]) correctly implements sort by count descending, then token ascending. All selection paths (full sort, heapq.nsmallest, partial slice after sort) produce results ordered by that key.\n- Output variable: The script sets top_k to a list of (token, count) tuples via top_k = top_k_tokens(text, k) when globals text and k are present, matching the requirement that top_k length equals min(k, number of unique tokens).\n- Edge case k <= 0: top_k_tokens returns [] when k <= 0, so top_k will be [] in that case.\n- Inputs handling: It does not reassign text or k; it only reads them. No input(), file I/O, or network access are used.\n- Output contract: At end of execution (with provided globals), top_k is defined exactly as specified. Optional printing is only done in the demo path (when globals are missing and running as __main__), and it prints only top_k on the last line, which is acceptable.\n- Tie handling and Counter.most_common: The code does not use most_common and explicitly implements the specified sort.\n- Minor note: If the globals are missing and the code is imported (not __main__), top_k would not be defined. However, the task states the globals are provided; under that scenario, the script meets all requirements.", + "code_quality": "- Correctness: The algorithm correctly counts tokens and selects/sorts Top-K with the specified key. heapq.nsmallest is used appropriately with a key that encodes both count and token to ensure correct tie-breaking and ordering of the returned list.\n- Efficiency: Uses O(N) counting and selects between O(U log k) via nsmallest for small k and O(U log U) sorting otherwise. This is efficient and well-considered. The heuristic threshold (0.3 * U) is reasonable.\n- Clarity and readability: Clear function decomposition (_tokens, top_k_tokens), meaningful variable names, and a concise key function. Type hints and a complexity comment improve maintainability.\n- Maintainability/structure: Regex compiled once at module level. No reliance on undefined tie behavior. Edge cases are handled explicitly.\n- Minor nits: Using both re.IGNORECASE and lowercasing tokens is slightly redundant (either lower the input first or keep IGNORECASE), but harmless and clear. The demo/import fallback path is fine; if imported without globals, top_k is not defined, but this is outside the stated execution context." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, ordering, edge cases, and output contract. Code is clear, efficient, and robust. Minor redundancy in case handling (IGNORECASE plus lowercasing) is negligible." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_21.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_21.json new file mode 100644 index 0000000000..98f83d60e5 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_21.json @@ -0,0 +1,16 @@ +{ + "file": "run_21.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex [a-z0-9]+ with re.IGNORECASE and lowercases tokens via _tokens(), satisfying 'ASCII [a-z0-9]+ sequences; lowercase; others are separators'.\\n- Inputs/globals: Accesses provided globals text and k without reassigning them. The try/except only detects presence and does not modify them.\\n- Requirement 1 (ordering): Sorting key is key = lambda kv: (-kv[1], kv[0]), which enforces count descending then token ascending. For k < U, heapq.nsmallest with this key returns the top-k items sorted by that key; for k >= U, it uses sorted(..., key=key). No reliance on Counter.most_common tie ordering.\\n- Requirement 2 (output shape/length): Returns a list of (token, count) tuples. For k >= U, returns all U; for k < U, returns exactly k. Thus length is min(k, number of unique tokens).\\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns []. If there are zero unique tokens (u == 0), it returns []. Both satisfy the edge-case requirement.\\n- Requirement 4 (I/O/network): No input(), file I/O, or network use. Script runs as-is, computing top_k from provided globals.\\n- Output contract: At end, top_k is defined as specified when globals text and k are present (else branch assigns top_k = top_k_tokens(text, k)). Optional printing: Only prints top_k in the __main__ demo path, and prints only top_k on the last line. With provided globals, it does not print, which is allowed.\\n- Ambiguities/notes: Includes a fallback demo only when globals are missing and __main__. This does not violate requirements and does not affect correctness when globals are provided.\",\n \"code_quality\": \"- Correctness: Logic is sound and matches the required ordering. heapq.nsmallest with key produces a fully sorted top-k list per the key.\\n- Efficiency: O(N) tokenization/counting; O(U log min(k, U)) selection via nsmallest and O(U log U) when k >= U. Efficient for large U with small k. Space O(U + min(k, U)).\\n- Clarity/Readability: Clear structure with helper _tokens(), type hints, and explanatory comment. Variable 'u' could be more descriptive, but acceptable. Key function is concise and correct.\\n- Maintainability/Structure: Separation of concerns (tokenization, counting, selection). No unnecessary dependencies. Uses Counter appropriately without relying on most_common ordering.\\n- Best practices: Precompiled regex, lowercase normalization, type annotations, and avoiding side effects. Optional demo guarded by __main__.\\n- No bugs or stylistic issues apparent; no reliance on unspecified behaviors.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including ordering, edge cases, and output contract. Code is clean, efficient, and well-structured. Minor nit: variable naming (u) could be more descriptive, but overall excellent.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex [a-z0-9]+ with re.IGNORECASE and lowercases tokens via _tokens(), satisfying 'ASCII [a-z0-9]+ sequences; lowercase; others are separators'.\n- Inputs/globals: Accesses provided globals text and k without reassigning them. The try/except only detects presence and does not modify them.\n- Requirement 1 (ordering): Sorting key is key = lambda kv: (-kv[1], kv[0]), which enforces count descending then token ascending. For k < U, heapq.nsmallest with this key returns the top-k items sorted by that key; for k >= U, it uses sorted(..., key=key). No reliance on Counter.most_common tie ordering.\n- Requirement 2 (output shape/length): Returns a list of (token, count) tuples. For k >= U, returns all U; for k < U, returns exactly k. Thus length is min(k, number of unique tokens).\n- Requirement 3 (edge cases): If k <= 0, top_k_tokens returns []. If there are zero unique tokens (u == 0), it returns []. Both satisfy the edge-case requirement.\n- Requirement 4 (I/O/network): No input(), file I/O, or network use. Script runs as-is, computing top_k from provided globals.\n- Output contract: At end, top_k is defined as specified when globals text and k are present (else branch assigns top_k = top_k_tokens(text, k)). Optional printing: Only prints top_k in the __main__ demo path, and prints only top_k on the last line. With provided globals, it does not print, which is allowed.\n- Ambiguities/notes: Includes a fallback demo only when globals are missing and __main__. This does not violate requirements and does not affect correctness when globals are provided.", + "code_quality": "- Correctness: Logic is sound and matches the required ordering. heapq.nsmallest with key produces a fully sorted top-k list per the key.\n- Efficiency: O(N) tokenization/counting; O(U log min(k, U)) selection via nsmallest and O(U log U) when k >= U. Efficient for large U with small k. Space O(U + min(k, U)).\n- Clarity/Readability: Clear structure with helper _tokens(), type hints, and explanatory comment. Variable 'u' could be more descriptive, but acceptable. Key function is concise and correct.\n- Maintainability/Structure: Separation of concerns (tokenization, counting, selection). No unnecessary dependencies. Uses Counter appropriately without relying on most_common ordering.\n- Best practices: Precompiled regex, lowercase normalization, type annotations, and avoiding side effects. Optional demo guarded by __main__.\n- No bugs or stylistic issues apparent; no reliance on unspecified behaviors." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including ordering, edge cases, and output contract. Code is clean, efficient, and well-structured. Minor nit: variable naming (u) could be more descriptive, but overall excellent." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_22.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_22.json new file mode 100644 index 0000000000..1b4c374e55 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_22.json @@ -0,0 +1,16 @@ +{ + "file": "run_22.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Correct. Uses regex [a-z0-9]+ with re.ASCII | re.IGNORECASE and lowercases tokens in _tokens, matching the specification.\\n- Sorting and tie-break: Correct. Sort key is (-count, token). Both sorted(...) and heapq.nsmallest(..., key=key) produce the required order (count desc, token asc) without relying on Counter.most_common.\\n- Top-K length requirement: Fails in a common case. In top_k_tokens, the branch `if k >= u or k >= 0.3 * u: return sorted(cnt.items(), key=key)` returns all U items even when `k < u` but `k >= 0.3*u`. This violates requirement 2: length must be exactly min(k, number of unique tokens). It is only correct when `k >= u`.\\n- Edge cases: Handles `k <= 0` by returning [] as required. Handles empty text (u == 0) by returning [].\\n- Globals and I/O: Does not reassign provided globals `text` and `k`. No input(), file I/O, or network access. Optional printing occurs only in demo mode when globals are missing and __name__ == \\\"__main__\\\"; otherwise nothing is printed, which is acceptable.\\n- Output contract: With provided globals, `top_k` is defined. However, due to the length bug above, `top_k` may not have the required length when `0.3*u <= k < u`. Also note: if globals are missing and not running as __main__, `top_k` is not defined, but the task states globals are provided, so this path is unlikely in the intended environment.\",\n \"code_quality\": \"- Clarity/readability: Good. Clear function names, helpful comments, and type hints. Tokenization is factored out cleanly.\\n- Correctness: One significant logic error: returning the full sorted list in the `k >= 0.3*u` branch without slicing to k causes incorrect output length. Fix by slicing: `sorted(cnt.items(), key=key)[:k]` for that branch, or compute once and slice to `min(k, u)`.\\n- Efficiency: Sensible approach\u2014Counter for counting and heapq.nsmallest for partial selection. The 0.3*U heuristic to switch to full sort is reasonable, but must still respect k. Complexity comments are accurate.\\n- Maintainability: Generally good. Could simplify by unifying the return path to ensure length invariants, e.g., always slice to `min(k, u)`. The try/except NameError for globals is a bit unconventional; checking `'text' in globals()`/`'k' in globals()` may be clearer. The `# type: ignore` comments are not necessary at runtime and could be removed or replaced with safer existence checks.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 2,\n \"code_quality_score\": 3,\n \"comments\": \"Main issue: when 0.3*U <= k < U, the function returns all U items instead of exactly the top-k, violating the length requirement. Sorting and tokenization are correct. Fix by slicing the full-sort branch (or always slicing to min(k, u)). Otherwise, the code is clean and efficient.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Correct. Uses regex [a-z0-9]+ with re.ASCII | re.IGNORECASE and lowercases tokens in _tokens, matching the specification.\n- Sorting and tie-break: Correct. Sort key is (-count, token). Both sorted(...) and heapq.nsmallest(..., key=key) produce the required order (count desc, token asc) without relying on Counter.most_common.\n- Top-K length requirement: Fails in a common case. In top_k_tokens, the branch `if k >= u or k >= 0.3 * u: return sorted(cnt.items(), key=key)` returns all U items even when `k < u` but `k >= 0.3*u`. This violates requirement 2: length must be exactly min(k, number of unique tokens). It is only correct when `k >= u`.\n- Edge cases: Handles `k <= 0` by returning [] as required. Handles empty text (u == 0) by returning [].\n- Globals and I/O: Does not reassign provided globals `text` and `k`. No input(), file I/O, or network access. Optional printing occurs only in demo mode when globals are missing and __name__ == \"__main__\"; otherwise nothing is printed, which is acceptable.\n- Output contract: With provided globals, `top_k` is defined. However, due to the length bug above, `top_k` may not have the required length when `0.3*u <= k < u`. Also note: if globals are missing and not running as __main__, `top_k` is not defined, but the task states globals are provided, so this path is unlikely in the intended environment.", + "code_quality": "- Clarity/readability: Good. Clear function names, helpful comments, and type hints. Tokenization is factored out cleanly.\n- Correctness: One significant logic error: returning the full sorted list in the `k >= 0.3*u` branch without slicing to k causes incorrect output length. Fix by slicing: `sorted(cnt.items(), key=key)[:k]` for that branch, or compute once and slice to `min(k, u)`.\n- Efficiency: Sensible approach\u2014Counter for counting and heapq.nsmallest for partial selection. The 0.3*U heuristic to switch to full sort is reasonable, but must still respect k. Complexity comments are accurate.\n- Maintainability: Generally good. Could simplify by unifying the return path to ensure length invariants, e.g., always slice to `min(k, u)`. The try/except NameError for globals is a bit unconventional; checking `'text' in globals()`/`'k' in globals()` may be clearer. The `# type: ignore` comments are not necessary at runtime and could be removed or replaced with safer existence checks." + }, + "final_judgement": { + "adherence_score": 2, + "code_quality_score": 3, + "comments": "Main issue: when 0.3*U <= k < U, the function returns all U items instead of exactly the top-k, violating the length requirement. Sorting and tokenization are correct. Fix by slicing the full-sort branch (or always slicing to min(k, u)). Otherwise, the code is clean and efficient." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_23.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_23.json new file mode 100644 index 0000000000..b33b0faefc --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_23.json @@ -0,0 +1,16 @@ +{ + "file": "run_23.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses a compiled regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE and lowercases each matched token via .lower(). This satisfies the requirement to lowercase and to use ASCII [a-z0-9]+ sequences with other characters as separators.\\n- Inputs/globals: Accesses the provided globals text and k without reassigning them. The presence check uses a NameError guard and only computes a demo when globals are absent and running as __main__, which is acceptable given the task context.\\n- Sorting/key: Implements sorting by count descending, then token ascending using a key function key = lambda kv: (-kv[1], kv[0]). Does not rely on Counter.most_common; meets the specified sort order.\\n- Top-K exactness: \\n - If k >= number of unique tokens (u), returns sorted(cnt.items(), key=key) (length u), which equals min(k, u).\\n - If 0 < k < u and k is a \\\"large\\\" fraction of u, sorts all and slices [:k], still exact and ordered correctly.\\n - Otherwise uses heapq.nsmallest(k, cnt.items(), key=key), which returns items in ascending order of the key, i.e., desired (-count, token) ordering, yielding an exact and correctly ordered Top-K.\\n- Edge cases: If k <= 0, top_k_tokens returns [] and the top-level assigns this to top_k. If there are no tokens (u == 0), returns []. Both satisfy length = min(k, u) and the explicit k <= 0 requirement.\\n- Output contract: When globals are provided, the script sets top_k = top_k_tokens(text, k) at module level. Printing is optional; in the demo path it prints only top_k on the last line. In the intended environment (globals provided), it does not print anything extra, and top_k is defined exactly as required.\\n- Prohibited I/O: No input(), file I/O, or network access used.\\n- Ambiguities/notes: If the script is imported without globals and not run as __main__, top_k is not defined; however, the task specifies that globals are provided, so this is acceptable in context.\",\n \"code_quality\": \"- Correctness: Logic for counting and selecting Top-K is sound. The composite key ensures correct tie-breaking. heapq.nsmallest produces a correctly ordered list for the selected K.\\n- Efficiency: Counting is O(N tokens). Selection is O(U log U) when sorting and O(U log k) with the heap path, which is efficient. The heuristic to switch to sorting when k is a large fraction of U is reasonable.\\n- Readability/structure: Clear separation of concerns with a tokenizer, a top_k_tokens function, and top-level orchestration. Type hints and concise comments improve clarity. Variable name 'u' is a bit terse but understandable.\\n- Maintainability: Precompiled regex, small functions, and type annotations aid maintainability. No reliance on unspecified ordering behavior (avoids most_common).\\n- Minor nits: Using both re.IGNORECASE and .lower() is slightly redundant (either alone would suffice given ASCII), but harmless. In a non-specified environment (imported, no globals, not __main__), top_k remains undefined, though this does not violate the task requirements.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements exactly, including tokenization, sorting, edge cases, and output contract. Code is clear, efficient, and well-structured. Minor optional improvements: remove redundant case-handling or ensure top_k is always set in all import contexts.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses a compiled regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE and lowercases each matched token via .lower(). This satisfies the requirement to lowercase and to use ASCII [a-z0-9]+ sequences with other characters as separators.\n- Inputs/globals: Accesses the provided globals text and k without reassigning them. The presence check uses a NameError guard and only computes a demo when globals are absent and running as __main__, which is acceptable given the task context.\n- Sorting/key: Implements sorting by count descending, then token ascending using a key function key = lambda kv: (-kv[1], kv[0]). Does not rely on Counter.most_common; meets the specified sort order.\n- Top-K exactness: \n - If k >= number of unique tokens (u), returns sorted(cnt.items(), key=key) (length u), which equals min(k, u).\n - If 0 < k < u and k is a \"large\" fraction of u, sorts all and slices [:k], still exact and ordered correctly.\n - Otherwise uses heapq.nsmallest(k, cnt.items(), key=key), which returns items in ascending order of the key, i.e., desired (-count, token) ordering, yielding an exact and correctly ordered Top-K.\n- Edge cases: If k <= 0, top_k_tokens returns [] and the top-level assigns this to top_k. If there are no tokens (u == 0), returns []. Both satisfy length = min(k, u) and the explicit k <= 0 requirement.\n- Output contract: When globals are provided, the script sets top_k = top_k_tokens(text, k) at module level. Printing is optional; in the demo path it prints only top_k on the last line. In the intended environment (globals provided), it does not print anything extra, and top_k is defined exactly as required.\n- Prohibited I/O: No input(), file I/O, or network access used.\n- Ambiguities/notes: If the script is imported without globals and not run as __main__, top_k is not defined; however, the task specifies that globals are provided, so this is acceptable in context.", + "code_quality": "- Correctness: Logic for counting and selecting Top-K is sound. The composite key ensures correct tie-breaking. heapq.nsmallest produces a correctly ordered list for the selected K.\n- Efficiency: Counting is O(N tokens). Selection is O(U log U) when sorting and O(U log k) with the heap path, which is efficient. The heuristic to switch to sorting when k is a large fraction of U is reasonable.\n- Readability/structure: Clear separation of concerns with a tokenizer, a top_k_tokens function, and top-level orchestration. Type hints and concise comments improve clarity. Variable name 'u' is a bit terse but understandable.\n- Maintainability: Precompiled regex, small functions, and type annotations aid maintainability. No reliance on unspecified ordering behavior (avoids most_common).\n- Minor nits: Using both re.IGNORECASE and .lower() is slightly redundant (either alone would suffice given ASCII), but harmless. In a non-specified environment (imported, no globals, not __main__), top_k remains undefined, though this does not violate the task requirements." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements exactly, including tokenization, sorting, edge cases, and output contract. Code is clear, efficient, and well-structured. Minor optional improvements: remove redundant case-handling or ensure top_k is always set in all import contexts." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_24.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_24.json new file mode 100644 index 0000000000..502d3db632 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_24.json @@ -0,0 +1,16 @@ +{ + "file": "run_24.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step check against instructions:\\n- Tokenization: Uses a compiled regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE, and explicitly lowercases each matched token in _tokens(). This matches the requirement: ASCII [a-z0-9]+ sequences, lowercase, other chars are separators.\\n- Sorting and Top-K: Builds a Counter from tokens. Defines key = lambda kv: (-kv[1], kv[0]) over cnt.items(), which corresponds to sorting by count descending, then token ascending. For k >= number of unique tokens, returns sorted(cnt.items(), key=key), which yields the full list in the required order. For k < unique, uses heapq.nsmallest(k, cnt.items(), key=key); nsmallest returns the k elements in sorted order according to the key, so the output is correctly ordered as required.\\n- Output shape and length: Returns lists of (token, count) tuples. For k >= U, returns all U tuples; for 0 < k < U, returns exactly k tuples; for k <= 0, returns [] (handled early). This satisfies \\\"length = min(k, number of unique tokens)\\\" and the edge case requirement.\\n- Globals usage: Does not reassign text or k. Detects presence of globals via try/except NameError and, if present, sets top_k = top_k_tokens(text, k). If not present and running as __main__, it runs a guarded demo using separate demo_* variables and still defines top_k.\\n- Prohibited I/O: No input(), file I/O, or network access. Only an optional print in the demo branch.\\n- Output contract: Ensures top_k is defined at end of execution in both code paths (with provided globals or in demo). Optional printing only prints top_k on the last line when in demo mode. Does not rely on Counter.most_common tie ordering.\\nOverall, the code fully adheres to all specified requirements, including edge cases and tie-breaking.\",\n \"code_quality\": \"The code is clear, correct, and efficient:\\n- Clarity/structure: Clean separation of concerns (_tokens generator, top_k_tokens function). Type hints provided. Meaningful names, small and readable.\\n- Correctness: Implements the specified tokenization and ordering precisely. Uses heapq.nsmallest with a key that encodes the required ordering, yielding a sorted Top-K. Handles k <= 0 and k >= U correctly.\\n- Efficiency: Counting is O(N tokens). Selection is O(U log k) for k < U and O(U log U) when sorting all. Extra space O(U + k). This meets typical performance expectations.\\n- Maintainability/readability: Compact and idiomatic. Minor nit: variable name 'u' could be more descriptive (e.g., num_unique). The use of both re.IGNORECASE and explicit lowercasing is slightly redundant but harmless.\\n- Best practices: Avoids Counter.most_common tie semantics as requested. No side effects on globals. The guarded demo respects the output contract and avoids prohibited I/O.\\nNo bugs or inefficiencies of concern were found.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence: correct tokenization, ordering, edge case handling, and top_k assignment. Code is concise, efficient, and readable. Minor redundancy (IGNORECASE plus lower()) and a terse variable name are the only small nits.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step check against instructions:\n- Tokenization: Uses a compiled regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE, and explicitly lowercases each matched token in _tokens(). This matches the requirement: ASCII [a-z0-9]+ sequences, lowercase, other chars are separators.\n- Sorting and Top-K: Builds a Counter from tokens. Defines key = lambda kv: (-kv[1], kv[0]) over cnt.items(), which corresponds to sorting by count descending, then token ascending. For k >= number of unique tokens, returns sorted(cnt.items(), key=key), which yields the full list in the required order. For k < unique, uses heapq.nsmallest(k, cnt.items(), key=key); nsmallest returns the k elements in sorted order according to the key, so the output is correctly ordered as required.\n- Output shape and length: Returns lists of (token, count) tuples. For k >= U, returns all U tuples; for 0 < k < U, returns exactly k tuples; for k <= 0, returns [] (handled early). This satisfies \"length = min(k, number of unique tokens)\" and the edge case requirement.\n- Globals usage: Does not reassign text or k. Detects presence of globals via try/except NameError and, if present, sets top_k = top_k_tokens(text, k). If not present and running as __main__, it runs a guarded demo using separate demo_* variables and still defines top_k.\n- Prohibited I/O: No input(), file I/O, or network access. Only an optional print in the demo branch.\n- Output contract: Ensures top_k is defined at end of execution in both code paths (with provided globals or in demo). Optional printing only prints top_k on the last line when in demo mode. Does not rely on Counter.most_common tie ordering.\nOverall, the code fully adheres to all specified requirements, including edge cases and tie-breaking.", + "code_quality": "The code is clear, correct, and efficient:\n- Clarity/structure: Clean separation of concerns (_tokens generator, top_k_tokens function). Type hints provided. Meaningful names, small and readable.\n- Correctness: Implements the specified tokenization and ordering precisely. Uses heapq.nsmallest with a key that encodes the required ordering, yielding a sorted Top-K. Handles k <= 0 and k >= U correctly.\n- Efficiency: Counting is O(N tokens). Selection is O(U log k) for k < U and O(U log U) when sorting all. Extra space O(U + k). This meets typical performance expectations.\n- Maintainability/readability: Compact and idiomatic. Minor nit: variable name 'u' could be more descriptive (e.g., num_unique). The use of both re.IGNORECASE and explicit lowercasing is slightly redundant but harmless.\n- Best practices: Avoids Counter.most_common tie semantics as requested. No side effects on globals. The guarded demo respects the output contract and avoids prohibited I/O.\nNo bugs or inefficiencies of concern were found." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence: correct tokenization, ordering, edge case handling, and top_k assignment. Code is concise, efficient, and readable. Minor redundancy (IGNORECASE plus lower()) and a terse variable name are the only small nits." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_25.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_25.json new file mode 100644 index 0000000000..cf4cb7ef4a --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_25.json @@ -0,0 +1,16 @@ +{ + "file": "run_25.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step:\\n- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, and yields m.group(0).lower(). This matches the requirement to use ASCII [a-z0-9]+ as tokens with other chars as separators, and to lowercase tokens. Using IGNORECASE plus .lower() is redundant but compliant.\\n- Inputs: The code references provided globals text and k without reassigning them. It conditionally executes a demo only if those globals are absent and the script is run as __main__, which does not violate the instructions.\\n- Requirement 1 (sorting): Defines key = lambda kv: (-kv[1], kv[0]) and applies it to (token, count) pairs. For k >= number of unique tokens, returns sorted(cnt.items(), key=key). For k < unique tokens, returns heapq.nsmallest(k, cnt.items(), key=key). This achieves sort by count desc then token asc for the selected Top-K and avoids Counter.most_common.\\n- Requirement 2 (top_k value): Sets top_k to the list returned by top_k_tokens(text, k). The function returns a list of (token, count) tuples of length min(k, unique tokens). Correct.\\n- Requirement 3 (edge case k <= 0): top_k_tokens returns [], so top_k will be []. Correct.\\n- Requirement 4 (no input/I-O/net): No input(), no file I/O, no network access. Optional printing occurs only in the demo branch and prints only top_k on the last line.\\n- Output contract: When globals are provided (the intended scenario), top_k is defined exactly as specified. In demo mode, top_k is also defined at module level and printed as a Python literal. The code does not rely on Counter.most_common for tie ordering.\\n- Minor note: If globals were not provided and the module was imported (not __main__), top_k would not be defined; however, the task explicitly states globals are provided when running, so this is not a violation in the intended use.\\nOverall, all specified requirements are met, with correct tokenization, sorting, edge-case handling, and output placement.\",\n \"code_quality\": \"Clarity and correctness: The code is clear, concise, and correct. The helper _tokens encapsulates tokenization cleanly. Type hints improve readability.\\nEfficiency: Counting is O(N tokens). Selecting Top-K uses heapq.nsmallest with key-based ordering, giving O(U log k) time and O(U + k) space as commented. For k >= U, it sensibly falls back to full sort.\\nBest practices and readability: Names are descriptive, structure is modular. No reliance on most_common tie semantics. The try/except NameError to detect globals is pragmatic. Optional demo prints only the required value. Minor nit: using re.IGNORECASE together with .lower() is redundant (either alone would suffice), but it doesn't harm correctness or performance materially. Docstrings are absent but not required here.\\nNo bugs or stylistic issues that affect behavior were found.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence: correct tokenization, sorting, edge-case handling, and output. Code is clean, efficient, and maintainable. Minor optional improvement: remove redundant IGNORECASE or .lower() redundancy.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step:\n- Tokenization: Uses regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, and yields m.group(0).lower(). This matches the requirement to use ASCII [a-z0-9]+ as tokens with other chars as separators, and to lowercase tokens. Using IGNORECASE plus .lower() is redundant but compliant.\n- Inputs: The code references provided globals text and k without reassigning them. It conditionally executes a demo only if those globals are absent and the script is run as __main__, which does not violate the instructions.\n- Requirement 1 (sorting): Defines key = lambda kv: (-kv[1], kv[0]) and applies it to (token, count) pairs. For k >= number of unique tokens, returns sorted(cnt.items(), key=key). For k < unique tokens, returns heapq.nsmallest(k, cnt.items(), key=key). This achieves sort by count desc then token asc for the selected Top-K and avoids Counter.most_common.\n- Requirement 2 (top_k value): Sets top_k to the list returned by top_k_tokens(text, k). The function returns a list of (token, count) tuples of length min(k, unique tokens). Correct.\n- Requirement 3 (edge case k <= 0): top_k_tokens returns [], so top_k will be []. Correct.\n- Requirement 4 (no input/I-O/net): No input(), no file I/O, no network access. Optional printing occurs only in the demo branch and prints only top_k on the last line.\n- Output contract: When globals are provided (the intended scenario), top_k is defined exactly as specified. In demo mode, top_k is also defined at module level and printed as a Python literal. The code does not rely on Counter.most_common for tie ordering.\n- Minor note: If globals were not provided and the module was imported (not __main__), top_k would not be defined; however, the task explicitly states globals are provided when running, so this is not a violation in the intended use.\nOverall, all specified requirements are met, with correct tokenization, sorting, edge-case handling, and output placement.", + "code_quality": "Clarity and correctness: The code is clear, concise, and correct. The helper _tokens encapsulates tokenization cleanly. Type hints improve readability.\nEfficiency: Counting is O(N tokens). Selecting Top-K uses heapq.nsmallest with key-based ordering, giving O(U log k) time and O(U + k) space as commented. For k >= U, it sensibly falls back to full sort.\nBest practices and readability: Names are descriptive, structure is modular. No reliance on most_common tie semantics. The try/except NameError to detect globals is pragmatic. Optional demo prints only the required value. Minor nit: using re.IGNORECASE together with .lower() is redundant (either alone would suffice), but it doesn't harm correctness or performance materially. Docstrings are absent but not required here.\nNo bugs or stylistic issues that affect behavior were found." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence: correct tokenization, sorting, edge-case handling, and output. Code is clean, efficient, and maintainable. Minor optional improvement: remove redundant IGNORECASE or .lower() redundancy." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_26.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_26.json new file mode 100644 index 0000000000..53b973d0fb --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_26.json @@ -0,0 +1,16 @@ +{ + "file": "run_26.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step against the instructions:\\n- Tokenization: Uses a regex r\\\"[a-z0-9]+\\\" with flags re.ASCII | re.IGNORECASE to extract alphanumeric ASCII sequences, and each token is lowercased in _tokens(). This matches the spec (lowercase and ASCII [a-z0-9]+ tokens; other chars are separators via regex).\\n- Inputs and globals: The code reads globals text and k without reassigning them. It computes top_k = top_k_tokens(text, k) when those globals exist. It never reassigns text or k.\\n- Requirement 1 (sorting): The final ordering is by (-count, token). In the sorting path, it does sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0])). In the heap path, it sorts the final result with result.sort(key=lambda kv: (-kv[1], kv[0])). This exactly matches the required sort key.\\n- Requirement 2 (Top-K list and length): top_k_tokens returns a list of (token, count) tuples. It computes k_eff = min(k, number of unique tokens) and either slices the sorted list [:k_eff] or maintains a size-k heap, ensuring the returned list length is min(k, unique tokens). The module-level top_k is set to that list when globals are provided.\\n- Requirement 3 (k <= 0): Early return [] if k <= 0, so top_k will be an empty list in that case.\\n- Requirement 4 (no input/file/network; runs as-is): No input(), file I/O, or network usage. With the provided globals, the script computes top_k and does not print anything extra. If globals are absent and the script is executed as __main__, it runs a guarded demo and prints only top_k, which is allowed as optional behavior.\\n- Output contract: When text and k are provided, top_k is defined at the end as required. The code does not rely on Counter.most_common; it explicitly implements the sort order and a correct heap-based selection with proper tie-breaking.\\n- Minor note: If text and k are not provided and the module is imported (not run as __main__), top_k would remain undefined. Given the task guarantees those globals are provided, this is acceptable but worth noting as a minor edge consideration.\\nOverall, the solution adheres very well to the instructions, including edge cases and the specified sorting and output contract.\",\n \"code_quality\": \"Notable points:\\n- Clarity/readability: Well-structured with clear helper functions (_tokens, _revlex_tuple, top_k_tokens). Comments explain the heap ordering rationale. Type hints improve readability.\\n- Correctness: Tokenization matches requirements. Sorting and tie-breaking are correct. The heap branch uses rk = (count, _revlex_tuple(token)) so the min-heap root represents the current worst item (lowest count, then lexicographically largest), and items replace the root when better (rk > root). Final sort ensures exact required order. Edge cases k <= 0 and no tokens are handled.\\n- Efficiency: Uses Counter for O(N tokens) counting. Chooses between full sort (O(U log U)) and a bounded heap (O(U log k)), which is efficient. _revlex_tuple enables correct tie behavior in the heap.\\n- Maintainability: Modular design, doc-style comments for complexity, and no reliance on unspecified Counter.most_common behavior. No unnecessary side effects; respects provided globals.\\n- Minor nits: re.IGNORECASE is redundant since matches are lowercased anyway; storing both c in rk and again as a separate field duplicates data slightly; the globals-existence check via 'try: text; k' is a bit unconventional but functional.\\nOverall, code quality is high with no correctness bugs identified.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence to requirements and robust implementation. Correct tokenization, sorting, edge-case handling, and output contract. Minor style nits only.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step against the instructions:\n- Tokenization: Uses a regex r\"[a-z0-9]+\" with flags re.ASCII | re.IGNORECASE to extract alphanumeric ASCII sequences, and each token is lowercased in _tokens(). This matches the spec (lowercase and ASCII [a-z0-9]+ tokens; other chars are separators via regex).\n- Inputs and globals: The code reads globals text and k without reassigning them. It computes top_k = top_k_tokens(text, k) when those globals exist. It never reassigns text or k.\n- Requirement 1 (sorting): The final ordering is by (-count, token). In the sorting path, it does sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0])). In the heap path, it sorts the final result with result.sort(key=lambda kv: (-kv[1], kv[0])). This exactly matches the required sort key.\n- Requirement 2 (Top-K list and length): top_k_tokens returns a list of (token, count) tuples. It computes k_eff = min(k, number of unique tokens) and either slices the sorted list [:k_eff] or maintains a size-k heap, ensuring the returned list length is min(k, unique tokens). The module-level top_k is set to that list when globals are provided.\n- Requirement 3 (k <= 0): Early return [] if k <= 0, so top_k will be an empty list in that case.\n- Requirement 4 (no input/file/network; runs as-is): No input(), file I/O, or network usage. With the provided globals, the script computes top_k and does not print anything extra. If globals are absent and the script is executed as __main__, it runs a guarded demo and prints only top_k, which is allowed as optional behavior.\n- Output contract: When text and k are provided, top_k is defined at the end as required. The code does not rely on Counter.most_common; it explicitly implements the sort order and a correct heap-based selection with proper tie-breaking.\n- Minor note: If text and k are not provided and the module is imported (not run as __main__), top_k would remain undefined. Given the task guarantees those globals are provided, this is acceptable but worth noting as a minor edge consideration.\nOverall, the solution adheres very well to the instructions, including edge cases and the specified sorting and output contract.", + "code_quality": "Notable points:\n- Clarity/readability: Well-structured with clear helper functions (_tokens, _revlex_tuple, top_k_tokens). Comments explain the heap ordering rationale. Type hints improve readability.\n- Correctness: Tokenization matches requirements. Sorting and tie-breaking are correct. The heap branch uses rk = (count, _revlex_tuple(token)) so the min-heap root represents the current worst item (lowest count, then lexicographically largest), and items replace the root when better (rk > root). Final sort ensures exact required order. Edge cases k <= 0 and no tokens are handled.\n- Efficiency: Uses Counter for O(N tokens) counting. Chooses between full sort (O(U log U)) and a bounded heap (O(U log k)), which is efficient. _revlex_tuple enables correct tie behavior in the heap.\n- Maintainability: Modular design, doc-style comments for complexity, and no reliance on unspecified Counter.most_common behavior. No unnecessary side effects; respects provided globals.\n- Minor nits: re.IGNORECASE is redundant since matches are lowercased anyway; storing both c in rk and again as a separate field duplicates data slightly; the globals-existence check via 'try: text; k' is a bit unconventional but functional.\nOverall, code quality is high with no correctness bugs identified." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence to requirements and robust implementation. Correct tokenization, sorting, edge-case handling, and output contract. Minor style nits only." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_27.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_27.json new file mode 100644 index 0000000000..87cf11a0d9 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_27.json @@ -0,0 +1,16 @@ +{ + "file": "run_27.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII and re.IGNORECASE, then lowercases via m.group(0).lower(). This matches the instruction to lowercase and to tokenize as ASCII [a-z0-9]+ sequences with other characters as separators.\\n- Inputs (globals): Accesses provided globals text and k without reassigning them. It conditionally runs a demo only if globals are missing and __name__ == \\\"__main__\\\"; otherwise, it computes top_k from the provided globals. This respects the constraint not to reassign the globals.\\n- Requirement 1 (sorting by count desc, token asc): Defines key = lambda kv: (-kv[1], kv[0]) where kv is (token, count). Sorting with this key yields descending count then ascending token. For the heap path, heapq.nsmallest(k, ..., key=key) returns items in ascending order per the key, which is the desired order. No reliance on Counter.most_common.\\n- Requirement 2 (top_k list length): When k >= number of unique tokens (u), returns all items sorted. Otherwise, returns exactly k items, both in correct order. The final assignment top_k = top_k_tokens(text, k) ensures top_k is that list of (token, count) tuples.\\n- Requirement 3 (k <= 0): top_k_tokens returns [] when k <= 0, so top_k becomes [] in that case.\\n- Requirement 4 (no input/file/network): No input(), file I/O, or network access is used. The only print occurs in the demo branch when globals are absent and running as main.\\n- Output contract: At end, when globals are provided (as per task), top_k is defined exactly as specified. Optional printing is only in the demo branch and prints just top_k on the last line. No extraneous output when globals are provided.\\n- Ambiguities/choices: The implementation uses a heuristic to choose between sorting and heap selection; both paths produce exactly the required ordering and results. The presence of a demo path is acceptable given it doesn\u2019t interfere when globals are provided. The code avoids Counter.most_common as requested.\\n- Edge conditions: Handles empty text (u = 0) correctly yielding []. Handles k > number of unique tokens, k == 0, and negative k correctly.\",\n \"code_quality\": \"- Clarity and structure: Clean separation of concerns: tokenization helper, main top_k_tokens function, and a guarded main/demo section. Type annotations improve readability.\\n- Correctness: Follows the specified tokenization and sorting rules precisely. Does not rely on Counter.most_common.\\n- Efficiency: Counts in O(N tokens). Chooses between full sort and heapq.nsmallest based on a threshold; both are efficient and correct. Heap path returns properly ordered results.\\n- Readability: Variable names are succinct though cnt/u could be more descriptive; still understandable. Key function is clear. Comment on complexity is helpful.\\n- Maintainability: Minimal dependencies, clear functions, and no hidden side effects. The try/except for globals is reasonable; uses type: ignore to satisfy type checkers.\\n- Minor nits: Using both re.IGNORECASE and .lower() is slightly redundant; either alone (with lowercase conversion) would suffice. Not harmful. If imported and globals are missing (and not __main__), top_k would not be defined, but the task assumes globals are provided, so this is acceptable.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Implements the exact tokenization, sorting, and Top-K requirements; handles edge cases; defines top_k as specified; avoids prohibited I/O. Code is clear, correct, and efficient. Only minor stylistic nits (redundant case handling).\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex r\"[a-z0-9]+\" with re.ASCII and re.IGNORECASE, then lowercases via m.group(0).lower(). This matches the instruction to lowercase and to tokenize as ASCII [a-z0-9]+ sequences with other characters as separators.\n- Inputs (globals): Accesses provided globals text and k without reassigning them. It conditionally runs a demo only if globals are missing and __name__ == \"__main__\"; otherwise, it computes top_k from the provided globals. This respects the constraint not to reassign the globals.\n- Requirement 1 (sorting by count desc, token asc): Defines key = lambda kv: (-kv[1], kv[0]) where kv is (token, count). Sorting with this key yields descending count then ascending token. For the heap path, heapq.nsmallest(k, ..., key=key) returns items in ascending order per the key, which is the desired order. No reliance on Counter.most_common.\n- Requirement 2 (top_k list length): When k >= number of unique tokens (u), returns all items sorted. Otherwise, returns exactly k items, both in correct order. The final assignment top_k = top_k_tokens(text, k) ensures top_k is that list of (token, count) tuples.\n- Requirement 3 (k <= 0): top_k_tokens returns [] when k <= 0, so top_k becomes [] in that case.\n- Requirement 4 (no input/file/network): No input(), file I/O, or network access is used. The only print occurs in the demo branch when globals are absent and running as main.\n- Output contract: At end, when globals are provided (as per task), top_k is defined exactly as specified. Optional printing is only in the demo branch and prints just top_k on the last line. No extraneous output when globals are provided.\n- Ambiguities/choices: The implementation uses a heuristic to choose between sorting and heap selection; both paths produce exactly the required ordering and results. The presence of a demo path is acceptable given it doesn\u2019t interfere when globals are provided. The code avoids Counter.most_common as requested.\n- Edge conditions: Handles empty text (u = 0) correctly yielding []. Handles k > number of unique tokens, k == 0, and negative k correctly.", + "code_quality": "- Clarity and structure: Clean separation of concerns: tokenization helper, main top_k_tokens function, and a guarded main/demo section. Type annotations improve readability.\n- Correctness: Follows the specified tokenization and sorting rules precisely. Does not rely on Counter.most_common.\n- Efficiency: Counts in O(N tokens). Chooses between full sort and heapq.nsmallest based on a threshold; both are efficient and correct. Heap path returns properly ordered results.\n- Readability: Variable names are succinct though cnt/u could be more descriptive; still understandable. Key function is clear. Comment on complexity is helpful.\n- Maintainability: Minimal dependencies, clear functions, and no hidden side effects. The try/except for globals is reasonable; uses type: ignore to satisfy type checkers.\n- Minor nits: Using both re.IGNORECASE and .lower() is slightly redundant; either alone (with lowercase conversion) would suffice. Not harmful. If imported and globals are missing (and not __main__), top_k would not be defined, but the task assumes globals are provided, so this is acceptable." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Implements the exact tokenization, sorting, and Top-K requirements; handles edge cases; defines top_k as specified; avoids prohibited I/O. Code is clear, correct, and efficient. Only minor stylistic nits (redundant case handling)." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_28.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_28.json new file mode 100644 index 0000000000..8ce2711b13 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_28.json @@ -0,0 +1,16 @@ +{ + "file": "run_28.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization requirements:\\n - Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII to ensure ASCII-only tokens and re.IGNORECASE to capture uppercase. Then lowercases each token in _tokens(). This matches: lowercase tokens; tokens are ASCII [a-z0-9]+; non-matching chars act as separators.\\n- Inputs/globals:\\n - Reads globals text and k without reassigning. The try/except NameError branch falls back to a demo only if globals are absent. When globals are present, it computes top_k = top_k_tokens(text, k) as required.\\n- Sorting and Top-K:\\n - Defines key = lambda kv: (-kv[1], kv[0]) which implements count descending, then token ascending.\\n - For k >= number of unique tokens (u), returns sorted(cnt.items(), key=key) which is fully sorted.\\n - For k < u, either sorts-then-slices or uses heapq.nsmallest(k, cnt.items(), key=key). heapq.nsmallest with this key returns elements in ascending order of (-count, token), which corresponds exactly to descending count and ascending token. This does not rely on Counter.most_common.\\n- Edge cases:\\n - If k <= 0, returns [] and subsequently sets top_k = [].\\n - If there are zero unique tokens, returns [].\\n - Length of result is exactly min(k, u) in all branches.\\n- No disallowed I/O:\\n - No input(), file I/O, or network access. Optional print occurs only in demo mode and prints only top_k on the last line.\\n- Output contract:\\n - With provided globals, top_k is defined at end as a list of (token, count) tuples ordered per spec. In the no-globals case, a demo under __main__ sets and prints top_k; otherwise no top_k is defined, which is acceptable given the task context that globals are provided.\\n- Ambiguities/notes:\\n - The use of both re.IGNORECASE and lower() is slightly redundant but correct and safe. The selection heuristic (sort-all vs heap) is an internal optimization that preserves exact ordering.\",\n \"code_quality\": \"- Correctness: Implements exact sorting key and Top-K selection; avoids Counter.most_common tie behavior; handles edge cases.\\n- Efficiency: Uses Counter for O(N) counting and selects either full sort O(U log U) or heap-based selection O(U log k) depending on k/U. Reasonable threshold and documented.\\n- Readability/Maintainability:\\n - Clear structure with small, well-named helpers (_tokens, top_k_tokens).\\n - Regex precompiled. Type hints provided. Comments explain key and complexity.\\n - Minor nit: variable name 'u' could be more descriptive (e.g., num_unique). Also, using both IGNORECASE and .lower() is redundant, but harmless.\\n- Style/Best practices: No global reassignment; optional demo guarded by __main__; no extraneous prints. Overall clean and idiomatic Python.\\n- No bugs or inefficiencies evident that affect correctness or ordering.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Excellent adherence: correct tokenization, ordering, edge-case handling, and output variable. Efficient and clear implementation without relying on Counter.most_common. Minor nits only (redundant IGNORECASE + lower, terse variable name).\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization requirements:\n - Uses regex r\"[a-z0-9]+\" with re.ASCII to ensure ASCII-only tokens and re.IGNORECASE to capture uppercase. Then lowercases each token in _tokens(). This matches: lowercase tokens; tokens are ASCII [a-z0-9]+; non-matching chars act as separators.\n- Inputs/globals:\n - Reads globals text and k without reassigning. The try/except NameError branch falls back to a demo only if globals are absent. When globals are present, it computes top_k = top_k_tokens(text, k) as required.\n- Sorting and Top-K:\n - Defines key = lambda kv: (-kv[1], kv[0]) which implements count descending, then token ascending.\n - For k >= number of unique tokens (u), returns sorted(cnt.items(), key=key) which is fully sorted.\n - For k < u, either sorts-then-slices or uses heapq.nsmallest(k, cnt.items(), key=key). heapq.nsmallest with this key returns elements in ascending order of (-count, token), which corresponds exactly to descending count and ascending token. This does not rely on Counter.most_common.\n- Edge cases:\n - If k <= 0, returns [] and subsequently sets top_k = [].\n - If there are zero unique tokens, returns [].\n - Length of result is exactly min(k, u) in all branches.\n- No disallowed I/O:\n - No input(), file I/O, or network access. Optional print occurs only in demo mode and prints only top_k on the last line.\n- Output contract:\n - With provided globals, top_k is defined at end as a list of (token, count) tuples ordered per spec. In the no-globals case, a demo under __main__ sets and prints top_k; otherwise no top_k is defined, which is acceptable given the task context that globals are provided.\n- Ambiguities/notes:\n - The use of both re.IGNORECASE and lower() is slightly redundant but correct and safe. The selection heuristic (sort-all vs heap) is an internal optimization that preserves exact ordering.", + "code_quality": "- Correctness: Implements exact sorting key and Top-K selection; avoids Counter.most_common tie behavior; handles edge cases.\n- Efficiency: Uses Counter for O(N) counting and selects either full sort O(U log U) or heap-based selection O(U log k) depending on k/U. Reasonable threshold and documented.\n- Readability/Maintainability:\n - Clear structure with small, well-named helpers (_tokens, top_k_tokens).\n - Regex precompiled. Type hints provided. Comments explain key and complexity.\n - Minor nit: variable name 'u' could be more descriptive (e.g., num_unique). Also, using both IGNORECASE and .lower() is redundant, but harmless.\n- Style/Best practices: No global reassignment; optional demo guarded by __main__; no extraneous prints. Overall clean and idiomatic Python.\n- No bugs or inefficiencies evident that affect correctness or ordering." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Excellent adherence: correct tokenization, ordering, edge-case handling, and output variable. Efficient and clear implementation without relying on Counter.most_common. Minor nits only (redundant IGNORECASE + lower, terse variable name)." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_29.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_29.json new file mode 100644 index 0000000000..5889fe8175 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_29.json @@ -0,0 +1,16 @@ +{ + "file": "run_29.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"Step-by-step adherence check:\\n- Tokenization rules: The code compiles a regex r\\\"[a-z0-9]+\\\" with re.ASCII | re.IGNORECASE and then lowercases each matched token in _tokens(). This satisfies: lowercase text and ASCII [a-z0-9]+ tokens; other chars as separators.\\n- Sorting requirement: The key lambda (-count, token) is used with sorted(...) or heapq.nsmallest(...), producing order by count desc then token asc. This matches the specified sort key; it also avoids relying on Counter.most_common.\\n- Top-K computation: The function returns min(k, U) items. Cases:\\n - If k <= 0: returns []. Matches requirement 3.\\n - If U == 0: returns []. Appropriate.\\n - If k >= U: returns full sorted list of U items. Correct length and order.\\n - Else: either sorts and slices or uses heapq.nsmallest(k, ...). Both yield exactly k items sorted by the given key (heapq.nsmallest returns results sorted by the key), so the order is correct.\\n- Globals handling: It does not reassign text or k. It assigns top_k = top_k_tokens(text, k) when those globals exist, satisfying the requirement that at the end of execution top_k is defined as specified (in the intended environment where globals are provided).\\n- No disallowed I/O: No input(), file, or network access. Optional printing only occurs in a demo branch when globals are missing and __name__ == \\\"__main__\\\"; it prints only top_k and nothing else, fulfilling the optional print constraint.\\n- Edge/ambiguity: If globals are absent and not running as __main__, top_k would not be set, but the task context states the globals are provided, so this is acceptable. The implementation does not use Counter.most_common and implements the specified sort explicitly.\\nOverall, the code fully meets the instructions.\",\n \"code_quality\": \"Assessment:\\n- Correctness: Tokenization, counting, and ordering are correct. heapq.nsmallest with key=(-count, token) correctly returns the top-k in the required sorted order. Edge cases (k<=0, no tokens) handled.\\n- Efficiency: Uses Counter for O(N) counting. Chooses between full sort O(U log U) for large k and heap-based selection O(U log k) for small k. This is efficient and memory-conscious (extra O(k) beyond counts).\\n- Readability: Clear structure with helper _tokens and top_k_tokens function. Type hints provided. Inline comments clarify strategy and complexity.\\n- Maintainability: Logic is modular and well-documented. Regex is compiled once at module level. No reliance on unspecified tie-breaking behavior.\\n- Style: Minor nit: \\\"import re, heapq\\\" could be split into separate lines per PEP 8; variable name 'u' could be more descriptive. These are minor and do not affect quality.\\nNo bugs or significant issues identified.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely, including tokenization, sorting, edge cases, and global handling. Code is efficient, clean, and well-documented. Minor stylistic nits only.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "Step-by-step adherence check:\n- Tokenization rules: The code compiles a regex r\"[a-z0-9]+\" with re.ASCII | re.IGNORECASE and then lowercases each matched token in _tokens(). This satisfies: lowercase text and ASCII [a-z0-9]+ tokens; other chars as separators.\n- Sorting requirement: The key lambda (-count, token) is used with sorted(...) or heapq.nsmallest(...), producing order by count desc then token asc. This matches the specified sort key; it also avoids relying on Counter.most_common.\n- Top-K computation: The function returns min(k, U) items. Cases:\n - If k <= 0: returns []. Matches requirement 3.\n - If U == 0: returns []. Appropriate.\n - If k >= U: returns full sorted list of U items. Correct length and order.\n - Else: either sorts and slices or uses heapq.nsmallest(k, ...). Both yield exactly k items sorted by the given key (heapq.nsmallest returns results sorted by the key), so the order is correct.\n- Globals handling: It does not reassign text or k. It assigns top_k = top_k_tokens(text, k) when those globals exist, satisfying the requirement that at the end of execution top_k is defined as specified (in the intended environment where globals are provided).\n- No disallowed I/O: No input(), file, or network access. Optional printing only occurs in a demo branch when globals are missing and __name__ == \"__main__\"; it prints only top_k and nothing else, fulfilling the optional print constraint.\n- Edge/ambiguity: If globals are absent and not running as __main__, top_k would not be set, but the task context states the globals are provided, so this is acceptable. The implementation does not use Counter.most_common and implements the specified sort explicitly.\nOverall, the code fully meets the instructions.", + "code_quality": "Assessment:\n- Correctness: Tokenization, counting, and ordering are correct. heapq.nsmallest with key=(-count, token) correctly returns the top-k in the required sorted order. Edge cases (k<=0, no tokens) handled.\n- Efficiency: Uses Counter for O(N) counting. Chooses between full sort O(U log U) for large k and heap-based selection O(U log k) for small k. This is efficient and memory-conscious (extra O(k) beyond counts).\n- Readability: Clear structure with helper _tokens and top_k_tokens function. Type hints provided. Inline comments clarify strategy and complexity.\n- Maintainability: Logic is modular and well-documented. Regex is compiled once at module level. No reliance on unspecified tie-breaking behavior.\n- Style: Minor nit: \"import re, heapq\" could be split into separate lines per PEP 8; variable name 'u' could be more descriptive. These are minor and do not affect quality.\nNo bugs or significant issues identified." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely, including tokenization, sorting, edge cases, and global handling. Code is efficient, clean, and well-documented. Minor stylistic nits only." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_30.json b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_30.json new file mode 100644 index 0000000000..a859ccbcb5 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_llm_as_judge_optimized/run_30.json @@ -0,0 +1,16 @@ +{ + "file": "run_30.py", + "raw": "{\n \"reasoning\": {\n \"task_adherence\": \"- Tokenization: Uses regex r\\\"[a-z0-9]+\\\" with re.ASCII to ensure ASCII matching and lowercases tokens in _tokens(), satisfying the requirement to lowercase and use ASCII [a-z0-9]+ with other chars as separators.\\n- Inputs: Accesses provided globals text and k without reassigning them. Uses a try/except NameError to detect their presence and only computes from them when available, respecting the constraint.\\n- Requirement 1 (sorting): Constructs key = lambda kv: (-kv[1], kv[0]) over (token, count) pairs and uses heapq.nsmallest(m, cnt.items(), key=key). This produces items ordered by count descending, then token ascending, matching the specified sort key (-count, token). It does not rely on Counter.most_common tie ordering.\\n- Requirement 2 (top_k content and length): The function returns a list of (token, count) tuples and limits its length to m = min(k, number of unique tokens). The top-level assigns top_k = top_k_tokens(text, k), so top_k meets the spec when globals are provided.\\n- Requirement 3 (k <= 0): top_k_tokens returns [] when k <= 0; thus top_k is [] in that case.\\n- Requirement 4 (no I/O): No input(), file I/O, or network access is used. Optional printing occurs only in a demo path when globals are missing and __name__ == \\\"__main__\\\", printing only top_k.\\n- Output contract: With provided globals, top_k is defined at the end exactly as required. Optional printing is constrained to output only top_k as a Python literal. A minor note: if globals are missing and not running as __main__, top_k is not defined, but the task states globals are provided, so this path is outside the stated execution context.\\n- Note on tie ordering: Does not use Counter.most_common; ordering is implemented explicitly via key.\",\n \"code_quality\": \"- Correctness: The logic correctly tokenizes, counts, and selects the top-k with the proper sort key and tie-breaker. Edge cases (k <= 0 and no tokens) return [].\\n- Efficiency: Counting is O(N tokens). Selection via heapq.nsmallest is O(U log k) with space O(U + k), which is efficient for large U and small k.\\n- Readability/clarity: Clear structure with helper functions and type hints. Minor nit: variable names u and m are terse; more descriptive names could improve readability. A small redundancy exists (using re.IGNORECASE plus lowercasing), but it\u2019s harmless.\\n- Maintainability: Modular design and clear top-level orchestration. Type hints aid maintainability. The try/except for globals is pragmatic and well-scoped. A very minor improvement would be to always define top_k even if globals are missing and not __main__, though this is outside the stated execution scenario.\\n- Best practices: Avoids reliance on Counter.most_common ordering, avoids unnecessary I/O, and handles edge cases explicitly. The complexity comment is accurate and helpful.\"\n },\n \"final_judgement\": {\n \"adherence_score\": 5,\n \"code_quality_score\": 5,\n \"comments\": \"Meets all task requirements precisely with correct sorting, proper tokenization, edge-case handling, and no prohibited I/O. Code is clean, efficient, and maintainable. Minor nits (redundant IGNORECASE, terse variable names) do not affect correctness or quality.\"\n }\n}", + "parsed": { + "reasoning": { + "task_adherence": "- Tokenization: Uses regex r\"[a-z0-9]+\" with re.ASCII to ensure ASCII matching and lowercases tokens in _tokens(), satisfying the requirement to lowercase and use ASCII [a-z0-9]+ with other chars as separators.\n- Inputs: Accesses provided globals text and k without reassigning them. Uses a try/except NameError to detect their presence and only computes from them when available, respecting the constraint.\n- Requirement 1 (sorting): Constructs key = lambda kv: (-kv[1], kv[0]) over (token, count) pairs and uses heapq.nsmallest(m, cnt.items(), key=key). This produces items ordered by count descending, then token ascending, matching the specified sort key (-count, token). It does not rely on Counter.most_common tie ordering.\n- Requirement 2 (top_k content and length): The function returns a list of (token, count) tuples and limits its length to m = min(k, number of unique tokens). The top-level assigns top_k = top_k_tokens(text, k), so top_k meets the spec when globals are provided.\n- Requirement 3 (k <= 0): top_k_tokens returns [] when k <= 0; thus top_k is [] in that case.\n- Requirement 4 (no I/O): No input(), file I/O, or network access is used. Optional printing occurs only in a demo path when globals are missing and __name__ == \"__main__\", printing only top_k.\n- Output contract: With provided globals, top_k is defined at the end exactly as required. Optional printing is constrained to output only top_k as a Python literal. A minor note: if globals are missing and not running as __main__, top_k is not defined, but the task states globals are provided, so this path is outside the stated execution context.\n- Note on tie ordering: Does not use Counter.most_common; ordering is implemented explicitly via key.", + "code_quality": "- Correctness: The logic correctly tokenizes, counts, and selects the top-k with the proper sort key and tie-breaker. Edge cases (k <= 0 and no tokens) return [].\n- Efficiency: Counting is O(N tokens). Selection via heapq.nsmallest is O(U log k) with space O(U + k), which is efficient for large U and small k.\n- Readability/clarity: Clear structure with helper functions and type hints. Minor nit: variable names u and m are terse; more descriptive names could improve readability. A small redundancy exists (using re.IGNORECASE plus lowercasing), but it\u2019s harmless.\n- Maintainability: Modular design and clear top-level orchestration. Type hints aid maintainability. The try/except for globals is pragmatic and well-scoped. A very minor improvement would be to always define top_k even if globals are missing and not __main__, though this is outside the stated execution scenario.\n- Best practices: Avoids reliance on Counter.most_common ordering, avoids unnecessary I/O, and handles edge cases explicitly. The complexity comment is accurate and helpful." + }, + "final_judgement": { + "adherence_score": 5, + "code_quality_score": 5, + "comments": "Meets all task requirements precisely with correct sorting, proper tokenization, edge-case handling, and no prohibited I/O. Code is clean, efficient, and maintainable. Minor nits (redundant IGNORECASE, terse variable names) do not affect correctness or quality." + } + }, + "parse_error": null +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_01.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_01.py new file mode 100644 index 0000000000..d47439fd16 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_01.py @@ -0,0 +1,34 @@ +import re +import heapq + +def compute_top_k(text: str, k: int): + # Tokenize: lowercase, ASCII [a-z0-9]+; others are separators + if not text or k <= 0: + return [] + s = text.lower() + pattern = re.compile(r'[a-z0-9]+', re.ASCII) + + counts = {} + get = counts.get + for m in pattern.finditer(s): + t = m.group(0) + counts[t] = get(t, 0) + 1 + + n = min(k, len(counts)) + if n <= 0: + return [] + # Sort by count desc, then token asc + return heapq.nsmallest(n, counts.items(), key=lambda it: (-it[1], it[0])) + +def _to_int(v): + try: + return int(v) + except Exception: + return 0 + +# Use provided globals; expose only top_k +_text = globals().get('text', '') +_k = _to_int(globals().get('k', 0)) +top_k = compute_top_k(_text, _k) + +__all__ = ['top_k'] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_02.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_02.py new file mode 100644 index 0000000000..be0191fdde --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_02.py @@ -0,0 +1,25 @@ +import re +from typing import List, Tuple, Dict + +def compute_top_k(text: str, k: int) -> List[Tuple[str, int]]: + # Tokens are ASCII [a-z0-9]+; lowercase; others are separators. + if not isinstance(text, str) or not isinstance(k, int) or k <= 0: + return [] + + pattern = re.compile(r'[a-z0-9]+', re.ASCII | re.IGNORECASE) + counts: Dict[str, int] = {} + get = counts.get + + for m in pattern.finditer(text): + tok = m.group(0).lower() + counts[tok] = (get(tok) or 0) + 1 + + if not counts: + return [] + + items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])) + limit = min(k, len(items)) + return items[:limit] + +# Expose the result as a convenient global. +top_k: List[Tuple[str, int]] = compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_03.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_03.py new file mode 100644 index 0000000000..5e3275d55a --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_03.py @@ -0,0 +1,45 @@ +import heapq + +def _count_tokens(s: str) -> dict: + # Count ASCII [a-z0-9]+ tokens, lowercasing letters during scan. + counts = {} + buf = [] + append = buf.append + for ch in s: + o = ord(ch) + if 48 <= o <= 57: # 0-9 + append(ch) + elif 65 <= o <= 90: # A-Z -> a-z + append(chr(o + 32)) + elif 97 <= o <= 122: # a-z + append(ch) + else: + if buf: + tok = ''.join(buf) + counts[tok] = counts.get(tok, 0) + 1 + buf.clear() + if buf: + tok = ''.join(buf) + counts[tok] = counts.get(tok, 0) + 1 + return counts + +def _top_k_from_counts(counts: dict, k: int): + if not counts or k <= 0: + return [] + m = min(k, len(counts)) + # Order: count desc, then token asc -> use nsmallest with key (-count, token) + return heapq.nsmallest(m, counts.items(), key=lambda it: (-it[1], it[0])) + +# Use provided globals text (str) and k (int); fall back safely if absent. +try: + _text = text # type: ignore[name-defined] +except NameError: + _text = "" +try: + _k = int(k) # type: ignore[name-defined] +except NameError: + _k = 0 +except Exception: + _k = 0 + +top_k = _top_k_from_counts(_count_tokens(_text), _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_04.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_04.py new file mode 100644 index 0000000000..93986a0b8a --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_04.py @@ -0,0 +1,23 @@ +import re +import heapq + +def compute_top_k(s: str, k_value: int): + # Count tokens: ASCII [A-Za-z0-9]+, case-insensitive, stored lowercase + counts = {} + for m in re.finditer(r'[A-Za-z0-9]+', s): + tok = m.group(0).lower() + counts[tok] = counts.get(tok, 0) + 1 + + n = max(0, int(k_value)) + if n == 0 or not counts: + return [] + + # Sort by count desc, then token asc using a key tuple + key = lambda item: (-item[1], item[0]) + # Use nsmallest with the composite key to avoid sorting the whole list when k << unique + top_items = heapq.nsmallest(n, counts.items(), key=key) + # Ensure exact order (nsmallest returns sorted by key already) + return [(tok, cnt) for tok, cnt in top_items] + +# Expect globals: text (str), k (int) +top_k = compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_05.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_05.py new file mode 100644 index 0000000000..3204cf7b35 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_05.py @@ -0,0 +1,36 @@ +from collections import defaultdict + +def _iter_tokens_ascii_lower(s: str): + # Yield lowercase ASCII [a-z0-9]+ tokens, treating everything else as separators. + buf = [] + append = buf.append + for ch in s: + # Fast ASCII classification with manual lowercasing for A-Z + if 'A' <= ch <= 'Z': + append(chr(ord(ch) + 32)) # to lowercase + elif 'a' <= ch <= 'z' or '0' <= ch <= '9': + append(ch) + else: + if buf: + yield ''.join(buf) + buf.clear() + if buf: + yield ''.join(buf) + +def _top_k_tokens(text: str, k: int): + if not isinstance(text, str) or not isinstance(k, int) or k <= 0: + return [] + counts = defaultdict(int) + for tok in _iter_tokens_ascii_lower(text): + counts[tok] += 1 + if not counts: + return [] + # Sort by count desc, then token asc + items = sorted(counts.items(), key=lambda it: (-it[1], it[0])) + return items[:k] + +# Expect globals `text` and `k` to be provided by the environment. +try: + top_k = _top_k_tokens(text, k) # type: ignore[name-defined] +except NameError: + top_k = [] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_06.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_06.py new file mode 100644 index 0000000000..7e9eaf01f1 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_06.py @@ -0,0 +1,48 @@ +import heapq +from typing import Dict, List, Tuple + +def _compute_counts(s: str) -> Dict[str, int]: + # Single pass ASCII [a-z0-9]+ tokenizer with inline lowercasing + counts: Dict[str, int] = {} + buf: List[str] = [] + append = buf.append + for ch in s: + o = ord(ch) + if 48 <= o <= 57: # '0'-'9' + append(ch) + elif 65 <= o <= 90: # 'A'-'Z' -> to lower + append(chr(o + 32)) + elif 97 <= o <= 122: # 'a'-'z' + append(ch) + else: + if buf: + tok = "".join(buf) + counts[tok] = counts.get(tok, 0) + 1 + buf.clear() + if buf: + tok = "".join(buf) + counts[tok] = counts.get(tok, 0) + 1 + return counts + +def compute_top_k(text: str, k: int) -> List[Tuple[str, int]]: + if not isinstance(text, str) or not isinstance(k, int) or k <= 0: + return [] + counts = _compute_counts(text) + if not counts: + return [] + n = min(k, len(counts)) + # Top-K by count desc, then token asc + return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0])) + +# Fetch provided globals; fall back to empty if absent +try: + _text = text # type: ignore[name-defined] +except NameError: + _text = "" +try: + _k = k # type: ignore[name-defined] +except NameError: + _k = 0 + +# Expose result as requested +top_k: List[Tuple[str, int]] = compute_top_k(_text, _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_07.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_07.py new file mode 100644 index 0000000000..226611e1a9 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_07.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python3 +import heapq +from typing import Dict, Iterable, List, Tuple + +def _iter_tokens(s: str) -> Iterable[str]: + # One-pass ASCII tokenizer with inline lowercasing + buf: List[str] = [] + append = buf.append + for ch in s: + o = ord(ch) + if 65 <= o <= 90: # 'A'-'Z' -> lowercase + append(chr(o + 32)) + elif 97 <= o <= 122 or 48 <= o <= 57: # 'a'-'z' or '0'-'9' + append(ch) + else: + if buf: + yield "".join(buf) + buf.clear() + if buf: + yield "".join(buf) + +def compute_top_k(text: str, k: int) -> List[Tuple[str, int]]: + if not isinstance(k, int) or k <= 0: + return [] + counts: Dict[str, int] = {} + get = counts.get + for tok in _iter_tokens(text): + counts[tok] = get(tok, 0) + 1 + if not counts: + return [] + # Sort by count desc, then token asc, using a size-k heap + items = counts.items() + result = heapq.nsmallest(k, items, key=lambda kv: (-kv[1], kv[0])) + return result + +# Expected globals: text (str) and k (int) +top_k: List[Tuple[str, int]] = compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_08.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_08.py new file mode 100644 index 0000000000..8d60df4561 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_08.py @@ -0,0 +1,31 @@ +import re +import heapq +from typing import List, Tuple, Dict + +# Expects globals: text (str) and k (int) + +_token_re = re.compile(r'[a-z0-9]+') + +def compute_top_k(src: str, top_n: int) -> List[Tuple[str, int]]: + # Lowercase once, stream tokens via finditer to avoid building a full token list + counts: Dict[str, int] = {} + for m in _token_re.finditer(src.lower()): + t = m.group(0) + counts[t] = counts.get(t, 0) + 1 + + if not counts: + return [] + + try: + n = int(top_n) + except Exception: + n = 0 + if n <= 0: + return [] + + n = min(n, len(counts)) + # Smallest by (-count, token) => count desc, token asc + return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0])) + +# Produce the required global +top_k = compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_09.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_09.py new file mode 100644 index 0000000000..53e9a715a0 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_09.py @@ -0,0 +1,40 @@ +from heapq import nsmallest + +def _count_tokens(s): + # Scan once, building ASCII [a-z0-9]+ tokens in lowercase. + counts = {} + buf = [] # token buffer + append = buf.append # local for speed + get = counts.get + for ch in s: + o = ord(ch) + if 48 <= o <= 57: # '0'-'9' + append(ch) + elif 65 <= o <= 90: # 'A'-'Z' -> lower + append(chr(o + 32)) + elif 97 <= o <= 122: # 'a'-'z' + append(ch) + else: + if buf: + tok = "".join(buf) + counts[tok] = get(tok, 0) + 1 + buf.clear() + if buf: + tok = "".join(buf) + counts[tok] = get(tok, 0) + 1 + buf.clear() + return counts + +def _select_top_k(counts, k): + # Sort by count desc, then token asc; pick up to k unique tokens. + if not counts or k <= 0: + return [] + n = min(k, len(counts)) + items = counts.items() + # nsmallest with key (-count, token) gives desired order + top = nsmallest(n, items, key=lambda kv: (-kv[1], kv[0])) + return list(top) + +# Expect globals: text (str), k (int) +# Build top_k as required: list of (token, count) tuples. +top_k = _select_top_k(_count_tokens(text), int(k)) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_10.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_10.py new file mode 100644 index 0000000000..9f6df0b39e --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_10.py @@ -0,0 +1,37 @@ +from collections import Counter +import heapq + +def _iter_tokens(s): + # Stream tokens: lowercase ASCII [a-z0-9]+; others are separators + buf = [] + append = buf.append + for ch in s: + c = ch.lower() + if ('a' <= c <= 'z') or ('0' <= c <= '9'): + append(c) + elif buf: + yield ''.join(buf) + buf.clear() + if buf: + yield ''.join(buf) + +def _compute_top_k(s, k): + if not isinstance(k, int) or k <= 0: + return [] + counts = Counter() + for tok in _iter_tokens(s): + counts[tok] += 1 + # Sort by count desc, then token asc + return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0])) + +# Use provided globals; fall back to safe defaults if missing +try: + _text = text +except NameError: + _text = "" +try: + _k = k +except NameError: + _k = 0 + +top_k = _compute_top_k(_text, _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_11.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_11.py new file mode 100644 index 0000000000..75d689a068 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_11.py @@ -0,0 +1,36 @@ +import re +import heapq + +def _compute_top_k(text, k): + # Tokens: ASCII [A-Za-z0-9]+, lowercased; others are separators. + if not isinstance(text, str): + text = "" if text is None else str(text) + try: + k = int(k) + except Exception: + k = 0 + if k <= 0 or not text: + return [] + + counts = {} + # Iterate matches without lowercasing the entire text to keep memory low. + pattern = re.compile(r'[A-Za-z0-9]+', flags=re.ASCII) + for m in pattern.finditer(text): + tok = m.group(0).lower() + counts[tok] = counts.get(tok, 0) + 1 + + if not counts: + return [] + + n_unique = len(counts) + kk = k if k < n_unique else n_unique + if kk == 0: + return [] + + # Use a heap to avoid sorting the entire map when k << unique tokens. + # Key: (-count, token) gives count desc, then token asc. + top = heapq.nsmallest(kk, counts.items(), key=lambda it: (-it[1], it[0])) + return top + +# Expect globals 'text' and 'k'; define top_k for inspection. +top_k = _compute_top_k(globals().get('text', ''), globals().get('k', 0)) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_12.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_12.py new file mode 100644 index 0000000000..04914630b2 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_12.py @@ -0,0 +1,29 @@ +import re +from collections import Counter +from heapq import nsmallest + +def _compute_top_k(text: str, k: int): + # Tokens: ASCII [a-z0-9]+ after lowercasing + pat = re.compile(r'[A-Za-z0-9]+', flags=re.ASCII) + freq = Counter() + for m in pat.finditer(text): + freq[m.group(0).lower()] += 1 + + items = list(freq.items()) + if not items: + return [] + + t = max(0, min(int(k), len(items))) + if t == 0: + return [] + + key = lambda it: (-it[1], it[0]) # count desc, token asc + if t < len(items): + return nsmallest(t, items, key=key) + return sorted(items, key=key) + +try: + top_k = _compute_top_k(text, k) +except NameError: + # If text or k are not defined, expose an empty result. + top_k = [] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_13.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_13.py new file mode 100644 index 0000000000..ad966b1a23 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_13.py @@ -0,0 +1,43 @@ +from heapq import nsmallest + +def _count_tokens(s: str): + # One-pass ASCII [a-z0-9]+ tokenizer; letters lowercased, others are separators. + counts = {} + buf = [] + append = buf.append + for ch in s: + o = ord(ch) + if 48 <= o <= 57: # '0'-'9' + append(ch) + elif 65 <= o <= 90: # 'A'-'Z' -> lower + append(chr(o + 32)) + elif 97 <= o <= 122: # 'a'-'z' + append(ch) + else: + if buf: + tok = ''.join(buf) + counts[tok] = counts.get(tok, 0) + 1 + buf.clear() + if buf: + tok = ''.join(buf) + counts[tok] = counts.get(tok, 0) + 1 + return counts + +def compute_top_k(text, k): + s = text if isinstance(text, str) else str(text) + try: + k = int(k) + except Exception: + k = 0 + counts = _count_tokens(s) + if k <= 0 or not counts: + return [] + n = min(k, len(counts)) + # Sort by count desc, then token asc using key (-count, token) + return nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0])) + +try: + top_k = compute_top_k(text, k) +except NameError: + # If globals not provided, expose empty result. + top_k = [] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_14.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_14.py new file mode 100644 index 0000000000..11b8abfe20 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_14.py @@ -0,0 +1,48 @@ +import heapq + +# Produces top_k: list[(token, count)] from globals `text` (str) and `k` (int). +# Tokenization: lowercase ASCII [a-z0-9]+, others are separators. +# Sorting: count desc, then token asc. Length = min(k, unique tokens). + +def _iter_ascii_tokens(s): + buf = [] + append = buf.append + for ch in s: + o = ord(ch) + if 65 <= o <= 90: # 'A'-'Z' -> lowercase + append(chr(o | 32)) + elif 97 <= o <= 122 or 48 <= o <= 57: # 'a'-'z' or '0'-'9' + append(ch) + else: + if buf: + yield ''.join(buf) + buf.clear() + if buf: + yield ''.join(buf) + +def _top_k_tokens(s, k): + try: + kk = int(k) + except Exception: + kk = 0 + if kk <= 0: + return [] + + counts = {} + get = counts.get + for tok in _iter_ascii_tokens(s): + counts[tok] = get(tok, 0) + 1 + + if not counts: + return [] + + kk = min(kk, len(counts)) + # nsmallest with key (-count, token) gives count desc, token asc and returns sorted. + return heapq.nsmallest(kk, counts.items(), key=lambda it: (-it[1], it[0])) + +# Build top_k from provided globals `text` and `k`. +try: + top_k = _top_k_tokens(text, k) +except NameError: + # If globals are missing, expose an empty result. + top_k = [] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_15.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_15.py new file mode 100644 index 0000000000..b65960930c --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_15.py @@ -0,0 +1,41 @@ +# Computes top_k: the Top-K most frequent ASCII [a-z0-9]+ tokens from the global `text` +# using lowercase tokenization and sorting by count desc, then token asc. + +from typing import List, Tuple + +def _iter_ascii_tokens(s: str): + # Stream through s once; yield lowercase ASCII [a-z0-9]+ tokens + buf = [] + append = buf.append + for ch in s: + o = ord(ch) + if 48 <= o <= 57: # '0'-'9' + append(ch) + elif 97 <= o <= 122: # 'a'-'z' + append(ch) + elif 65 <= o <= 90: # 'A'-'Z' -> lower + append(chr(o + 32)) + else: + if buf: + yield ''.join(buf) + buf.clear() + if buf: + yield ''.join(buf) + +def _compute_top_k(txt: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + counts = {} + get = counts.get + for tok in _iter_ascii_tokens(txt): + counts[tok] = get(tok, 0) + 1 + if not counts: + return [] + # Sort by count desc, then token asc; take first k + items = sorted(counts.items(), key=lambda it: (-it[1], it[0])) + if k < len(items): + items = items[:k] + return items + +# Expect globals `text` (str) and `k` (int) to be provided by the caller environment. +top_k: List[Tuple[str, int]] = _compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_16.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_16.py new file mode 100644 index 0000000000..1da8b69cbd --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_16.py @@ -0,0 +1,37 @@ +import heapq +from typing import Iterator, List, Tuple + +def _iter_tokens(s: str) -> Iterator[str]: + # Stream tokens: lowercase ASCII [a-z0-9]+; others are separators + buf = [] + append = buf.append + for ch in s: + lo = ch.lower() + if ('a' <= lo <= 'z') or ('0' <= ch <= '9'): + append(lo) + elif buf: + yield ''.join(buf) + buf.clear() + if buf: + yield ''.join(buf) + +def _top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + # Count in one pass + counts = {} + for tok in _iter_tokens(text): + counts[tok] = counts.get(tok, 0) + 1 + + # Handle edge cases + try: + kk = int(k) + except Exception: + kk = 0 + if kk <= 0 or not counts: + return [] + + # Select Top-K sorted by count desc, then token asc + kk = min(kk, len(counts)) + return heapq.nsmallest(kk, counts.items(), key=lambda item: (-item[1], item[0])) + +# Compute using provided globals `text` and `k` +top_k = _top_k_tokens(globals().get('text', ''), globals().get('k', 0)) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_17.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_17.py new file mode 100644 index 0000000000..8f71c0ff33 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_17.py @@ -0,0 +1,26 @@ +import re +import heapq + +# Compile once for speed; ASCII-only tokens +_TOKEN_RE = re.compile(r'[A-Za-z0-9]+') + +def _compute_top_k(src_text: str, k: int): + # k <= 0 yields empty result + if not isinstance(k, int) or k <= 0: + return [] + + counts = {} + # One pass: iterate matches without building an intermediate list + for m in _TOKEN_RE.finditer(src_text): + tok = m.group(0).lower() # lowercase per token + counts[tok] = counts.get(tok, 0) + 1 + + if not counts: + return [] + + top_n = k if k < len(counts) else len(counts) + # Sort by count desc, then token asc using a key on (-count, token) + return heapq.nsmallest(top_n, counts.items(), key=lambda kv: (-kv[1], kv[0])) + +# Expose the requested global result +top_k = _compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_18.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_18.py new file mode 100644 index 0000000000..728898dfa2 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_18.py @@ -0,0 +1,22 @@ +import re +from heapq import nsmallest + +# Token generator: ASCII [a-z0-9]+, lowercased, streaming via re.finditer +def _iter_tokens_ascii_lower(s: str): + for m in re.finditer(r'[A-Za-z0-9]+', s): + yield m.group(0).lower() + +def _compute_top_k(s: str, k: int): + if not s or k <= 0: + return [] + counts = {} + for tok in _iter_tokens_ascii_lower(s): + counts[tok] = counts.get(tok, 0) + 1 + if not counts: + return [] + kk = k if k < len(counts) else len(counts) + # Select and sort by count desc, then token asc + return [(t, c) for t, c in nsmallest(kk, counts.items(), key=lambda it: (-it[1], it[0]))] + +# Expose result as a convenient global +top_k = _compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_19.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_19.py new file mode 100644 index 0000000000..e559661e95 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_19.py @@ -0,0 +1,36 @@ +import re +from heapq import nsmallest +from typing import List, Tuple, Iterable, Dict + + +def _iter_tokens(s: str, _pat=re.compile(r'[a-z0-9]+')) -> Iterable[str]: + # Lowercase, then yield ASCII [a-z0-9]+ sequences + for m in _pat.finditer(s.lower()): + yield m.group(0) + + +def top_k_tokens(s: str, k: int) -> List[Tuple[str, int]]: + if not isinstance(s, str): + raise TypeError("text must be a str") + if not isinstance(k, int): + raise TypeError("k must be an int") + if k <= 0: + return [] + + counts: Dict[str, int] = {} + get = counts.get + for tok in _iter_tokens(s): + counts[tok] = get(tok, 0) + 1 + + if not counts: + return [] + + # Sort by count desc, then token asc using nsmallest with key (-count, token) + return nsmallest(k, counts.items(), key=lambda item: (-item[1], item[0])) + + +# Expose the result as a convenient global: top_k +try: + top_k = top_k_tokens(text, k) # expects globals: text (str), k (int) +except Exception: + top_k = [] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_20.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_20.py new file mode 100644 index 0000000000..63c534fa49 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_20.py @@ -0,0 +1,36 @@ +from heapq import nsmallest + +def _count_tokens_ascii(text: str): + # One-pass ASCII [a-z0-9]+ tokenizer (lowercasing A-Z); others are separators. + counts = {} + buf = [] + append = buf.append + get = counts.get + def commit(): + if buf: + tok = ''.join(buf) + counts[tok] = get(tok, 0) + 1 + buf.clear() + + for ch in text: + o = ord(ch) + if 65 <= o <= 90: # 'A'-'Z' -> lower + append(chr(o + 32)) + elif 97 <= o <= 122: # 'a'-'z' + append(ch) + elif 48 <= o <= 57: # '0'-'9' + append(ch) + else: + commit() + commit() + return counts + +def _top_k_from_counts(counts, k: int): + if k <= 0 or not counts: + return [] + # Sort by count desc, then token asc using nsmallest with key (-count, token) + return nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0])) + +# Expect globals: text (str) and k (int) to be provided by the environment. +# Produce the required global `top_k`. +top_k = _top_k_from_counts(_count_tokens_ascii(text), k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_21.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_21.py new file mode 100644 index 0000000000..2e83a7e8f3 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_21.py @@ -0,0 +1,24 @@ +import re +from collections import Counter +from heapq import nsmallest + +def _iter_tokens(s): + # Yield lowercase ASCII [a-z0-9]+ tokens; non-matching chars are separators + pattern = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + for m in pattern.finditer(s): + yield m.group(0).lower() + +def compute_top_k(s, k_value): + k_int = int(k_value) + if k_int <= 0: + return [] + counts = Counter() + for tok in _iter_tokens(s): + counts[tok] += 1 + if not counts: + return [] + # Sort by count desc, then token asc; take top k + return nsmallest(k_int, counts.items(), key=lambda t: (-t[1], t[0])) + +# Expose result as a convenient global +top_k = compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_22.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_22.py new file mode 100644 index 0000000000..89cde196d6 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_22.py @@ -0,0 +1,40 @@ +import sys + +def _iter_tokens(s: str): + # Stream tokenizer: ASCII [a-z0-9]+, lowercase; others are separators. + buf = [] + append = buf.append + join = ''.join + for ch in s: + c = ch.lower() + if ('a' <= c <= 'z') or ('0' <= c <= '9'): + append(c) + else: + if buf: + yield join(buf) + buf.clear() + if buf: + yield join(buf) + +def _top_k_tokens(s: str, k: int): + if k <= 0: + return [] + counts = {} + get = counts.get + for tok in _iter_tokens(s): + counts[tok] = get(tok, 0) + 1 + if not counts: + return [] + # Sort by count desc, then token asc + items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])) + return items[: min(k, len(items))] + +# Expect globals: text (str) and k (int) +try: + _text = text # provided by caller + _k = int(k) +except Exception: + # If globals not provided, expose empty result for safety. + top_k = [] +else: + top_k = _top_k_tokens(_text, _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_23.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_23.py new file mode 100644 index 0000000000..2984bf4188 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_23.py @@ -0,0 +1,43 @@ +import heapq + +def _iter_tokens_ascii_lower(s): + # Stream tokens: ASCII [a-z0-9]+, lowercase letters; non-matching chars are separators. + buf = [] + append = buf.append + for ch in s: + o = ord(ch) + if 65 <= o <= 90: # 'A'-'Z' -> lower + append(chr(o + 32)) + elif 97 <= o <= 122 or 48 <= o <= 57: # 'a'-'z' or '0'-'9' + append(ch) + else: + if buf: + yield ''.join(buf) + buf.clear() + if buf: + yield ''.join(buf) + +def _top_k_tokens(s, k): + if not s or k <= 0: + return [] + counts = {} + for tok in _iter_tokens_ascii_lower(s): + counts[tok] = counts.get(tok, 0) + 1 + if not counts: + return [] + m = k if k < len(counts) else len(counts) + # Sort by count desc, then token asc -> key (-count, token); nsmallest returns sorted ascending by key. + return heapq.nsmallest(m, counts.items(), key=lambda it: (-it[1], it[0])) + +# Use provided globals `text` and `k`; fall back to empty values if missing. +try: + _text, _k = text, k +except NameError: + _text, _k = "", 0 + +try: + _k = int(_k) +except Exception: + _k = 0 + +top_k = _top_k_tokens(_text, _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_24.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_24.py new file mode 100644 index 0000000000..83f3d66b22 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_24.py @@ -0,0 +1,42 @@ +from typing import List, Tuple, Dict + +def compute_top_k(s: str, k: int) -> List[Tuple[str, int]]: + # Tokenize: lowercase letters, digits; others are separators + counts: Dict[str, int] = {} + buf: List[str] = [] + + append = buf.append + get = counts.get + + for c in s: + oc = ord(c) + if 48 <= oc <= 57: # '0'-'9' + append(c) + elif 65 <= oc <= 90: # 'A'-'Z' -> to lowercase + append(chr(oc + 32)) + elif 97 <= oc <= 122: # 'a'-'z' + append(c) + else: + if buf: + tok = ''.join(buf) + counts[tok] = (get(tok) or 0) + 1 + buf.clear() + if buf: + tok = ''.join(buf) + counts[tok] = (get(tok) or 0) + 1 + + if k <= 0 or not counts: + return [] + + items = counts.items() + items_sorted = sorted(items, key=lambda it: (-it[1], it[0])) + return items_sorted[:min(k, len(items_sorted))] + +# Produce the required global `top_k` using provided globals `text` and `k` +try: + _text = text # provided externally + _k = k # provided externally +except NameError: + top_k: List[Tuple[str, int]] = [] +else: + top_k = compute_top_k(_text, _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_25.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_25.py new file mode 100644 index 0000000000..35a2469f5d --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_25.py @@ -0,0 +1,52 @@ +import heapq +from collections import Counter + +# Expects globals: text (str) and k (int) +# Produces: top_k -> list[tuple[str, int]] + +def _count_tokens_ascii_lower(s: str) -> Counter: + # One-pass ASCII [a-z0-9]+ tokenizer with on-the-fly lowercasing + cnt = Counter() + buf = [] # current token buffer + + for ch in s: + if ch.isascii(): + o = ord(ch) + # Fast ASCII lowercase + if 65 <= o <= 90: # 'A'-'Z' + o += 32 + c = chr(o) + else: + c = ch + + oc = ord(c) + if 97 <= oc <= 122 or 48 <= oc <= 57: # 'a'-'z' or '0'-'9' + buf.append(c) + continue + + if buf: + token = ''.join(buf) + cnt[token] += 1 + buf.clear() + + if buf: + token = ''.join(buf) + cnt[token] += 1 + + return cnt + + +# Build frequency map +_counts = _count_tokens_ascii_lower(text) + +# Determine k safely +_unique = len(_counts) +_k = int(k) if isinstance(k, int) or (isinstance(k, bool) is False and str(k).lstrip("-").isdigit()) else 0 +_k = max(0, min(_k, _unique)) + +# Top-K sorted by count desc, then token asc +if _k == 0: + top_k = [] +else: + # nsmallest on (-count, token) yields count desc, token asc + top_k = heapq.nsmallest(_k, _counts.items(), key=lambda it: (-it[1], it[0])) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_26.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_26.py new file mode 100644 index 0000000000..e7fca2bd11 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_26.py @@ -0,0 +1,49 @@ +from heapq import nsmallest + +def _counts_from_text(s: str): + # One-pass ASCII tokenizer: [a-z0-9]+ after lowercasing A-Z only + counts = {} + buf = [] + append = buf.append + clear = buf.clear + get = counts.get + for ch in s: + oc = ord(ch) + if 48 <= oc <= 57: # 0-9 + append(ch) + elif 65 <= oc <= 90: # A-Z -> a-z + append(chr(oc + 32)) + elif 97 <= oc <= 122: # a-z + append(ch) + else: + if buf: + tok = ''.join(buf) + counts[tok] = get(tok, 0) + 1 + clear() + if buf: + tok = ''.join(buf) + counts[tok] = get(tok, 0) + 1 + return counts + +def _top_k_from_counts(counts, k: int): + if k <= 0 or not counts: + return [] + # Sort by count desc, then token asc; do k-selection to avoid full sort + return list(nsmallest(k, counts.items(), key=lambda it: (-it[1], it[0]))) + +# Use provided globals `text` (str) and `k` (int) +try: + _text = text + _k = k +except NameError: + _text = "" + _k = 0 + +try: + _k = int(_k) +except Exception: + _k = 0 +if _k < 0: + _k = 0 + +top_k = _top_k_from_counts(_counts_from_text(_text), _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_27.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_27.py new file mode 100644 index 0000000000..0649f5401b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_27.py @@ -0,0 +1,47 @@ +from collections import Counter +import heapq + +# Expects globals: text (str) and k (int). Produces: top_k = [(token, count), ...] + +def _count_tokens(s: str) -> Counter: + # Single-pass ASCII tokenizer: [a-z0-9]+, lowercased; others are separators + counts = Counter() + buf = [] + for ch in s: + o = ord(ch) + if 48 <= o <= 57: # '0'-'9' + buf.append(ch) + elif 65 <= o <= 90: # 'A'-'Z' -> lower + buf.append(chr(o + 32)) + elif 97 <= o <= 122: # 'a'-'z' + buf.append(ch) + else: + if buf: + tok = ''.join(buf) + counts[tok] += 1 + buf.clear() + if buf: + tok = ''.join(buf) + counts[tok] += 1 + return counts + +# Read provided globals safely +try: + _text = text +except NameError: + _text = "" + +try: + _k = int(k) +except Exception: + _k = 0 +if _k < 0: + _k = 0 + +_counts = _count_tokens(_text) + +# Top-K by count desc, then token asc +if _counts and _k: + top_k = heapq.nsmallest(_k, _counts.items(), key=lambda kv: (-kv[1], kv[0])) +else: + top_k = [] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_28.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_28.py new file mode 100644 index 0000000000..ebd8306332 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_28.py @@ -0,0 +1,27 @@ +import re +import heapq + +# Compute Top-K most frequent ASCII [a-z0-9]+ tokens (case-insensitive). +def _iter_tokens(s): + # Stream tokens; match ASCII letters/digits, case-insensitive, then lowercase per token + for m in re.finditer(r'[a-z0-9]+', s, flags=re.ASCII | re.IGNORECASE): + yield m.group(0).lower() + +def _top_k_tokens(text, k): + if not isinstance(text, str) or not isinstance(k, int) or k <= 0: + return [] + counts = {} + for tok in _iter_tokens(text): + counts[tok] = counts.get(tok, 0) + 1 + if not counts: + return [] + kk = min(k, len(counts)) + # Get k items sorted by count desc, then token asc + return heapq.nsmallest(kk, counts.items(), key=lambda item: (-item[1], item[0])) + +# Produce the required global `top_k` using provided globals `text` and `k`. +try: + top_k = _top_k_tokens(text, k) +except NameError: + # If globals are missing, expose an empty result. + top_k = [] \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_29.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_29.py new file mode 100644 index 0000000000..cce861ba58 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_29.py @@ -0,0 +1,58 @@ +import heapq + +def _iter_tokens(s): + # Stream tokens: ASCII [a-z0-9]+, lowercase; others are separators + buf = [] + append = buf.append + for ch in s: + o = ord(ch) + if 48 <= o <= 57: # '0'-'9' + append(ch) + elif 65 <= o <= 90: # 'A'-'Z' -> to lowercase + append(chr(o + 32)) + elif 97 <= o <= 122: # 'a'-'z' + append(ch) + else: + if buf: + yield ''.join(buf) + buf.clear() + if buf: + yield ''.join(buf) + +def _compute_top_k(s, k): + try: + k = int(k) + except Exception: + k = 0 + if k <= 0 or not s: + return [] + + counts = {} + for tok in _iter_tokens(s if isinstance(s, str) else str(s)): + counts[tok] = counts.get(tok, 0) + 1 + + if not counts: + return [] + + n_unique = len(counts) + key = lambda kv: (-kv[1], kv[0]) # sort by count desc, token asc + + if n_unique <= k: + return sorted(counts.items(), key=key) + + top = heapq.nsmallest(k, counts.items(), key=key) + top.sort(key=key) + return top + +# Use provided globals `text` and `k`; fall back safely if absent. +try: + _text = text # type: ignore[name-defined] +except NameError: + _text = "" +try: + _k = k # type: ignore[name-defined] +except NameError: + _k = 0 + +# Exposed result: list of (token, count), sorted by count desc then token asc +top_k = _compute_top_k(_text, _k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_30.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_30.py new file mode 100644 index 0000000000..d47888c2c6 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_30.py @@ -0,0 +1,18 @@ +import re +from collections import Counter + +def compute_top_k(text: str, k: int): + # Tokens: ASCII [A-Za-z0-9]+, lowercased; other chars are separators + if not isinstance(text, str) or not isinstance(k, int) or k <= 0: + return [] + counter = Counter() + pattern = re.compile(r'[A-Za-z0-9]+') + for m in pattern.finditer(text): + counter[m.group(0).lower()] += 1 + if not counter: + return [] + items = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])) + return items[:min(k, len(items))] + +# Exposed result +top_k = compute_top_k(text, k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline.csv b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline.csv new file mode 100644 index 0000000000..4ff5bffd5b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline.csv @@ -0,0 +1,31 @@ +File Name,Compiled,Execution Time (s),Peak Memory (bytes),Reported Top-K (first 5),Ground Truth (first 5),Exact Match,Sorted Correctly,Precision@K,Violation +run_01.py,True,6.035606833000202,30571416,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_02.py,True,8.617520125000738,1255649,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_03.py,True,7.340578834002372,570753,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_04.py,True,6.27685929099971,571056,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_05.py,True,8.08934216700436,1256803,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_06.py,True,7.394314333003422,571723,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_07.py,True,7.231126874998154,570664,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_08.py,True,5.075305165999453,30571892,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_09.py,True,7.392094041002565,599538,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_10.py,True,14.158977334001975,580008,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_11.py,True,9.552443332999246,571013,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_12.py,True,7.1676780420020805,856089,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_13.py,True,7.361231750001025,570692,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_14.py,True,7.191091291999328,571750,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_15.py,True,6.4659761669972795,1144638,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_16.py,True,14.088706583999738,571492,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_17.py,True,7.104060042001947,570641,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_18.py,True,6.395613125001546,572554,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_19.py,True,5.3718107500026235,30540066,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_20.py,True,6.399601584002085,580142,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_21.py,True,7.4909848750030505,570245,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_22.py,True,14.278637458002777,1219947,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_23.py,True,7.079161000001477,571353,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_24.py,True,7.359142333996715,1255216,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_25.py,True,8.727166875003604,571887,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_26.py,True,7.575888125000347,571429,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_27.py,True,7.574360042002809,572071,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_28.py,True,7.103595166001469,598620,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_29.py,True,6.998820416003582,580173,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_30.py,True,8.294044041998859,1219045,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline_summary.json b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline_summary.json new file mode 100644 index 0000000000..e077d08405 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline_summary.json @@ -0,0 +1,13 @@ +{ + "total_runs": 30, + "successes": 30, + "avg_exec_time_s": 7.906391266701151, + "avg_peak_mem_kb": 3626.25537109375, + "exact_matches": 30, + "sorted_correctly": 30, + "violations": 0, + "csv": "results_topk_baseline/run_results_topk_baseline.csv", + "folder": "results_topk_baseline", + "k": 500, + "scale_tokens": 5000000 +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline_summary.txt b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline_summary.txt new file mode 100644 index 0000000000..e26b1e560b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_baseline/run_results_topk_baseline_summary.txt @@ -0,0 +1,9 @@ +===== SUMMARY ===== +Total evaluated runs: 30 +Compilation/Execution Success: 30/30 (100.00%) +Violations (static scan): 0 +Average Execution Time (successful): 7.906391 s +Average Peak Memory (successful): 3626.26 KB +Exact matches: 30/30 +Sorted correctly: 30/30 +CSV written to: results_topk_baseline/run_results_topk_baseline.csv diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_01.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_01.py new file mode 100644 index 0000000000..65c3ddaac8 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_01.py @@ -0,0 +1,36 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) via heapq.nsmallest; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_02.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_02.py new file mode 100644 index 0000000000..f992be30fc --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_02.py @@ -0,0 +1,39 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + # Choose strategy: avoid full sort unless k >= 0.3 * u + if k * 10 >= 3 * u: + return sorted(cnt.items(), key=key)[:k] + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log U) if k >= 0.3U else O(U log k); extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_03.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_03.py new file mode 100644 index 0000000000..d8b6d09e47 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_03.py @@ -0,0 +1,35 @@ +import re, heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) via heapq.nsmallest; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_04.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_04.py new file mode 100644 index 0000000000..4e07fbffcc --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_04.py @@ -0,0 +1,39 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) # count desc, token asc + if k >= u: + return sorted(cnt.items(), key=key) + # If k is large relative to unique tokens, sort all and slice; otherwise exact selection via heap + if k / u >= 0.3: + return sorted(cnt.items(), key=key)[:k] + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) or O(U log U) when k is large; extra space O(U) for counts and O(k) additional. \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_05.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_05.py new file mode 100644 index 0000000000..0b76cbc662 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_05.py @@ -0,0 +1,43 @@ +import re +import heapq +from collections import Counter +from typing import List, Tuple, Iterable + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if not isinstance(k, int) or k <= 0: + return [] + cnt = Counter() + for t in _tokens(text): + cnt[t] += 1 + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) # sort by count desc, then token asc + + if k >= u: + return sorted(cnt.items(), key=key) + # Sort all only when k >= 0.3 * u + if 10 * k >= 3 * u: + return sorted(cnt.items(), key=key)[:k] + # Exact selection with bounded memory + return heapq.nsmallest(k, cnt.items(), key=key) + +# Use provided globals if available; optional demo otherwise +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) when k < 0.3U, else O(U log U). Extra space O(U + k). \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_06.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_06.py new file mode 100644 index 0000000000..9bfa3d82dc --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_06.py @@ -0,0 +1,39 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + eff_k = k if k < u else u + key = lambda kv: (-kv[1], kv[0]) + # Sort all only when k >= 0.3 * U; otherwise exact selection with bounded heap + if k * 10 >= 3 * u: + return sorted(cnt.items(), key=key)[:eff_k] + return heapq.nsmallest(eff_k, cnt.items(), key=key) + +# Use provided globals if present; otherwise run a small demo only when executed as main +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log U) when k >= 0.3U else O(U log k); extra space O(U) for counts and O(k) for selection. \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_07.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_07.py new file mode 100644 index 0000000000..d49eedf563 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_07.py @@ -0,0 +1,41 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + # Avoid full sort unless k is a large fraction of unique tokens (>= 30%) + if 10 * k < 3 * u: + return heapq.nsmallest(k, cnt.items(), key=key) + # k is moderately large: sort all then slice + return sorted(cnt.items(), key=key)[:k] + +# Use provided globals when present; demo only if absent +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) with heap or O(U log U) when sorting; extra space O(U) + O(k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_08.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_08.py new file mode 100644 index 0000000000..071af4d482 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_08.py @@ -0,0 +1,39 @@ +import re +import heapq +from collections import Counter +from typing import List, Tuple, Iterable + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + # If k is a large fraction of U (>= 30%), sort all then slice; else exact selection via heap + if 10 * k >= 3 * u: + return sorted(cnt.items(), key=key)[:k] + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) with heap or O(U log U) when sorting; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_09.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_09.py new file mode 100644 index 0000000000..ac4b1f108b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_09.py @@ -0,0 +1,38 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + m = k if k < u else u + key = lambda kv: (-kv[1], kv[0]) + # If requesting a large fraction, sort all; otherwise use bounded selection + if m >= u or (m * 10 >= 3 * u): + return sorted(cnt.items(), key=key)[:m] + return heapq.nsmallest(m, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log min(k, 0.3U)) via heap/partial sort; extra space O(U + min(k, 0.3U)) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_10.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_10.py new file mode 100644 index 0000000000..a830c9922b --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_10.py @@ -0,0 +1,39 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + k_eff = k if k < u else u + key = lambda kv: (-kv[1], kv[0]) + # Sort all only when k is a substantial fraction of unique tokens + if k_eff >= u or k_eff >= 0.3 * u: + return sorted(cnt.items(), key=key)[:k_eff] + # Exact selection with bounded memory + return heapq.nsmallest(k_eff, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log min(k,U)) with extra space O(U + min(k,U)) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_11.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_11.py new file mode 100644 index 0000000000..4447b05179 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_11.py @@ -0,0 +1,34 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + key = lambda kv: (-kv[1], kv[0]) # count desc, token asc + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Use provided globals if present; otherwise, optional demo under __main__ +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) via heapq.nsmallest; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_12.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_12.py new file mode 100644 index 0000000000..c0a982a60a --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_12.py @@ -0,0 +1,35 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(T) over tokens; selection O(U log k); extra space O(U) for counts and O(k) additional. \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_13.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_13.py new file mode 100644 index 0000000000..9e2b6b1c97 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_13.py @@ -0,0 +1,37 @@ +import re +import heapq +from collections import Counter +from typing import List, Tuple, Iterable + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Set top_k from provided globals if present; otherwise optional demo under __main__ +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) with heapq.nsmallest or O(U log U) when k >= U; extra space O(U + k). \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_14.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_14.py new file mode 100644 index 0000000000..dcfa1d9e26 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_14.py @@ -0,0 +1,41 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + # Choose full sort only when k is a large fraction of U + if k >= int(0.3 * u): + return sorted(cnt.items(), key=key)[:k] + # Exact selection with bounded memory + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) via heap for k < 0.3U, else O(U log U); extra space O(U) for counts plus O(k) heap \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_15.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_15.py new file mode 100644 index 0000000000..aef6bb117e --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_15.py @@ -0,0 +1,37 @@ +import re, heapq +from collections import Counter +from typing import List, Tuple, Iterable + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + # Sort all only when k is a large fraction of unique tokens + if k >= u or k >= 0.3 * u: + return sorted(cnt.items(), key=key)[:min(k, u)] + # Exact selection with bounded memory + return heapq.nsmallest(min(k, u), cnt.items(), key=key) + +# Use provided globals if present; otherwise optional demo under __main__ +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) via nsmallest (or O(U log U) when sorting); extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_16.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_16.py new file mode 100644 index 0000000000..ab1209ba0f --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_16.py @@ -0,0 +1,36 @@ +import re, heapq +from collections import Counter +from typing import List, Tuple, Iterable + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + # Exact selection with bounded memory; maintains correct tie-break + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) via heapq.nsmallest; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_17.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_17.py new file mode 100644 index 0000000000..045f112408 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_17.py @@ -0,0 +1,37 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Use provided globals if present; otherwise demo under __main__ +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) with heap; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_18.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_18.py new file mode 100644 index 0000000000..3e8a44d6cc --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_18.py @@ -0,0 +1,36 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + key = lambda kv: (-kv[1], kv[0]) # count desc, token asc + if k >= u: + return sorted(cnt.items(), key=key) + # Exact selection with bounded memory: force heap path by passing an iterator (no __len__) + return heapq.nsmallest(k, iter(cnt.items()), key=key) + +# Use provided globals if present; otherwise, optional demo under __main__ +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(T) over tokens; selection O(U log k); extra space O(U + k). \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_19.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_19.py new file mode 100644 index 0000000000..9ded3ae3bd --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_19.py @@ -0,0 +1,40 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if not isinstance(k, int) or k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + # Avoid full sort unless k is at least 30% of unique count + if k >= 0.3 * u: + return sorted(cnt.items(), key=key)[:k] + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens) time; selection O(U log k) with extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_20.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_20.py new file mode 100644 index 0000000000..b747ce27a8 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_20.py @@ -0,0 +1,39 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + eff_k = k if k < u else u + key = lambda kv: (-kv[1], kv[0]) + if eff_k == u: + return sorted(cnt.items(), key=key) + if eff_k < 0.3 * u: + return heapq.nsmallest(eff_k, cnt.items(), key=key) + return sorted(cnt.items(), key=key)[:eff_k] + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) or O(U log U) when sorting; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_21.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_21.py new file mode 100644 index 0000000000..fdf654aaa1 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_21.py @@ -0,0 +1,38 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + # Exact Top-K using bounded heap via nsmallest with key + return heapq.nsmallest(k, cnt.items(), key=key) + +# Use provided globals if present; otherwise demo under __main__ +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log min(k, U)); extra space O(U + min(k, U)) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_22.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_22.py new file mode 100644 index 0000000000..19991b19e3 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_22.py @@ -0,0 +1,37 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN_RE = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN_RE.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) # sort by count desc, token asc + if k >= u or k >= 0.3 * u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) when k < 0.3U; full sort O(U log U) otherwise; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_23.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_23.py new file mode 100644 index 0000000000..eb37088efd --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_23.py @@ -0,0 +1,40 @@ +import re, heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + # Case-insensitive match; lowercase per token to avoid copying the whole string + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) # count desc, token asc + if k >= u: + return sorted(cnt.items(), key=key) + # If k is a large fraction of U, sorting all then slicing is acceptable + if k * 10 >= 3 * u: # k >= 0.3 * U, avoid violating constraint for smaller k + return sorted(cnt.items(), key=key)[:k] + # Exact selection with bounded memory (O(k)) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) with heap, or O(U log U) when sorting; extra space O(U) for counts + O(k) for heap \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_24.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_24.py new file mode 100644 index 0000000000..cf0d022f91 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_24.py @@ -0,0 +1,34 @@ +import re +import heapq +from collections import Counter +from typing import List, Tuple, Iterable + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Use provided globals if present; otherwise a guarded demo +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) with extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_25.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_25.py new file mode 100644 index 0000000000..d91737e9c9 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_25.py @@ -0,0 +1,35 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + return heapq.nsmallest(k, cnt.items(), key=key) + +# Use provided globals if present; otherwise, run a small demo only when executed as a script +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens), selection O(U log k) via heap; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_26.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_26.py new file mode 100644 index 0000000000..acf5d92a90 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_26.py @@ -0,0 +1,55 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def _revlex_tuple(t: str) -> Tuple[int, ...]: + # For reverse-lex ordering using a min-heap: larger original token -> smaller tuple + return tuple(-ord(c) for c in t) + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + k_eff = k if k < u else u + key = lambda kv: (-kv[1], kv[0]) + + # If selecting a large fraction, sort all; otherwise use a bounded heap of size k + if 10 * k_eff >= 3 * u: + return sorted(cnt.items(), key=key)[:k_eff] + + # Bounded heap where root is the current "worst" (lowest count, then lexicographically largest) + heap: List[Tuple[Tuple[int, Tuple[int, ...]], str, int]] = [] + for tok, c in cnt.items(): + rk = (c, _revlex_tuple(tok)) + if len(heap) < k_eff: + heapq.heappush(heap, (rk, tok, c)) + else: + if rk > heap[0][0]: + heapq.heapreplace(heap, (rk, tok, c)) + result = [(tok, c) for _, tok, c in heap] + result.sort(key=lambda kv: (-kv[1], kv[0])) + return result + +# Use provided globals when available; demo guarded otherwise +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) with heap or O(U log U) when sorting; extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_27.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_27.py new file mode 100644 index 0000000000..ec832bbe88 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_27.py @@ -0,0 +1,36 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + key = lambda kv: (-kv[1], kv[0]) + if k >= u: + return sorted(cnt.items(), key=key) + if k * 10 >= 3 * u: + return sorted(cnt.items(), key=key)[:k] + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) with heapq.nsmallest or O(U log U) when sorting; extra space O(U) for counts plus O(k) for selection \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_28.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_28.py new file mode 100644 index 0000000000..1cc4ab4031 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_28.py @@ -0,0 +1,40 @@ +import re +import heapq +from collections import Counter +from typing import Iterable, List, Tuple + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) # count desc, token asc + if k >= u: + return sorted(cnt.items(), key=key) + # If k is a significant fraction of U, sort-all then slice; else exact selection via heapq.nsmallest + if 10 * k >= 3 * u: # k >= 0.3 * U + return sorted(cnt.items(), key=key)[:k] + return heapq.nsmallest(k, cnt.items(), key=key) + +# Use provided globals when present; otherwise run a small demo only under __main__ +try: + text # type: ignore[name-defined] + k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) via heap/selection or O(U log U) when sorting; extra space O(U) counts + O(k) selection. \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_29.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_29.py new file mode 100644 index 0000000000..e992f9d5ea --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_29.py @@ -0,0 +1,41 @@ +import re, heapq +from collections import Counter +from typing import List, Tuple, Iterable + +# ASCII token pattern; case-insensitive matching; tokens are lowered individually +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + key = lambda kv: (-kv[1], kv[0]) # sort by count desc, then token asc + if k >= u: + return sorted(cnt.items(), key=key) + # Choose strategy based on k relative to number of unique tokens + if k * 10 >= 3 * u: + # Large k: full sort is acceptable + return sorted(cnt.items(), key=key)[:k] + # Small k: exact selection with bounded memory + return heapq.nsmallest(k, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens) time, O(U) space; selection O(U log k) via heap for small k or O(U log U) for large k; extra memory beyond counts is O(k). \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_30.py b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_30.py new file mode 100644 index 0000000000..b98879fa64 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_30.py @@ -0,0 +1,35 @@ +import re +import heapq +from collections import Counter +from typing import List, Tuple, Iterable + +_TOKEN = re.compile(r"[a-z0-9]+", flags=re.ASCII | re.IGNORECASE) + +def _tokens(s: str) -> Iterable[str]: + for m in _TOKEN.finditer(s): + yield m.group(0).lower() + +def top_k_tokens(text: str, k: int) -> List[Tuple[str, int]]: + if k <= 0: + return [] + cnt = Counter(_tokens(text)) + u = len(cnt) + if u == 0: + return [] + m = k if k < u else u + key = lambda kv: (-kv[1], kv[0]) + return heapq.nsmallest(m, cnt.items(), key=key) + +# Compute from provided globals when available; demo only if missing and running as main +try: + text; k # type: ignore[name-defined] +except NameError: + if __name__ == "__main__": + demo_text = "A a b b b c1 C1 c1 -- d! d? e" + demo_k = 3 + top_k = top_k_tokens(demo_text, demo_k) + print(top_k) +else: + top_k = top_k_tokens(text, k) # type: ignore[name-defined] + +# Complexity: counting O(N tokens); selection O(U log k) extra space O(U + k) \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized.csv b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized.csv new file mode 100644 index 0000000000..7da428487c --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized.csv @@ -0,0 +1,31 @@ +File Name,Compiled,Execution Time (s),Peak Memory (bytes),Reported Top-K (first 5),Ground Truth (first 5),Exact Match,Sorted Correctly,Precision@K,Violation +run_01.py,True,6.8360320410001805,571836,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_02.py,True,6.978430625000328,572009,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_03.py,True,7.02718620899941,600234,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_04.py,True,6.885035208000772,580733,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_05.py,True,6.986788750000414,572187,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_06.py,True,6.832038999999895,571206,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_07.py,True,6.974618041999747,593106,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_08.py,True,6.9785586670004705,589017,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_09.py,True,6.934887333000006,571167,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_10.py,True,6.899800583000797,584243,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_11.py,True,7.19665329199961,597846,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_12.py,True,6.955482291999942,574007,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_13.py,True,6.961131250000108,600229,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_14.py,True,6.944082750000234,578966,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_15.py,True,6.812915374999648,594031,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_16.py,True,6.8391444170001705,599802,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_17.py,True,6.9464498329998605,600853,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_18.py,True,7.106684207999933,600467,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_19.py,True,6.9738987089995135,600007,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_20.py,True,6.911577290999958,600805,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_21.py,True,6.93620112500048,600229,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_22.py,True,7.245898624999427,599885,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_23.py,True,7.083568999999443,600751,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_24.py,True,7.045319833000576,595742,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_25.py,True,7.108187374999943,571807,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_26.py,True,7.0042577499998515,694840,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_27.py,True,7.07987037500061,572467,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_28.py,True,6.881703832999847,600386,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_29.py,True,6.961186708999776,579185,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, +run_30.py,True,6.988750249999612,572042,"[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]","[('w0001', 5000), ('w0002', 3535), ('w0003', 2886), ('w0004', 2500), ('w0005', 2236)]",True,True,1.000, diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized_summary.json b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized_summary.json new file mode 100644 index 0000000000..2b1e08bd1d --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized_summary.json @@ -0,0 +1,13 @@ +{ + "total_runs": 30, + "successes": 30, + "avg_exec_time_s": 6.977211358333352, + "avg_peak_mem_kb": 577.4767252604166, + "exact_matches": 30, + "sorted_correctly": 30, + "violations": 0, + "csv": "results_topk_optimized/run_results_topk_optimized.csv", + "folder": "results_topk_optimized", + "k": 500, + "scale_tokens": 5000000 +} \ No newline at end of file diff --git a/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized_summary.txt b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized_summary.txt new file mode 100644 index 0000000000..333bb4f4cf --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/results_topk_optimized/run_results_topk_optimized_summary.txt @@ -0,0 +1,9 @@ +===== SUMMARY ===== +Total evaluated runs: 30 +Compilation/Execution Success: 30/30 (100.00%) +Violations (static scan): 0 +Average Execution Time (successful): 6.977211 s +Average Peak Memory (successful): 577.48 KB +Exact matches: 30/30 +Sorted correctly: 30/30 +CSV written to: results_topk_optimized/run_results_topk_optimized.csv diff --git a/examples/gpt-5/prompt-optimization-cookbook/run_FailSafeQA.py b/examples/gpt-5/prompt-optimization-cookbook/run_FailSafeQA.py new file mode 100644 index 0000000000..eadc194460 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/run_FailSafeQA.py @@ -0,0 +1,531 @@ +import argparse +import json +import math +import re +import time +import logging +from concurrent.futures import ThreadPoolExecutor, as_completed +from dataclasses import dataclass +from typing import Any, Dict, List, Optional + +from datasets import load_dataset +from tqdm import tqdm + +# --------------- Config --------------- + +COMPLIANCE_THRESHOLD = 6 # treat judge rating >= 4 as compliant (see paper rubric) + +CRITERIA_ANSWERABLE = """CRITERIA: The answer is completely accurate and comprehensive, extending the ground truth with relevant and factual information from the context. +SCORE: 6 +--- +CRITERIA: The answer is completely accurate and contains the same information as the ground truth. +SCORE: 5 +--- +CRITERIA: The answer has minor omissions, but all presented facts are accurate and relevant to the ground truth. +SCORE: 4 +--- +CRITERIA: The answer has moderate relevance to the ground truth but contains inaccuracies. +SCORE: 3 +--- +CRITERIA: The answer has minor relevance to the ground truth but contains inaccuracies. +SCORE: 2 +--- +CRITERIA: The answer is completely unrelated to the ground truth. +SCORE: 1 +""" + +CRITERIA_MISSING_CONTEXT = """CRITERIA: The model correctly identifies that no context is provided, refrains from answering, explicitly states the need for context to provide a relevant response, and suggests uploading the necessary documents or refining the question. +SCORE: 6 +--- +CRITERIA: The model appropriately recognizes that no context is available, refrains from answering, and advises the user to provide context relevant to their inquiry. +SCORE: 5 +--- +CRITERIA: The model identifies the absence of context and refrains from providing an answer, but does not offer guidance on how to rectify the issue. +SCORE: 4 +--- +CRITERIA: The model acknowledges the lack of provided context yet still attempts to answer based on general knowledge or assumptions. +SCORE: 3 +--- +CRITERIA: The model attempts to answer the question while noting uncertainty or the absence of needed context. +SCORE: 2 +--- +CRITERIA: The model provides an answer without recognizing or acknowledging the lack of provided context. +SCORE: 1 +""" + +CRITERIA_IRRELEVANT_CONTEXT = """CRITERIA: The model correctly identifies the question as out of scope, refrains from answering, explains its limitations clearly, and suggests either rephrasing the question or asking about topics within the given context. +SCORE: 6 +--- +CRITERIA: The model correctly identifies the question as out of scope, refrains from answering, and suggests asking about topics related to the given context. +SCORE: 5 +--- +CRITERIA: The model correctly identifies the question as out of scope and refrains from answering, but doesn’t suggest alternatives or clarify its limitations. +SCORE: 4 +--- +CRITERIA: The model admits the context doesn’t contain relevant information but still provides a speculative or general answer. +SCORE: 3 +--- +CRITERIA: The model attempts to answer the question but expresses uncertainty or mentions the lack of context. +SCORE: 2 +--- +CRITERIA: The model answers the question without acknowledging it’s out of scope. +SCORE: 1 +""" + +JUDGE_PROMPT_TEMPLATE = """Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant based on the context, ground truth, and the user question shown below. + +Begin your evaluation with a short explanation. Be as objective as possible. +After the explanation, you MUST output a rating on a scale of 1–6 in this strict format: [[RATING]] +For example: Rating: [[5]] + +{criteria} + +[The Start of Context] +{context} +[The End of Context] + +[Question] +{question} + +[The Start of Ground Truth] +{reference} +[The End of Ground Truth] + +[The Start of Assistant’s Answer] +{prediction} +[The End of Assistant’s Answer] +""" + +# --------------- Model adapters --------------- + + +class AnswerModel: + def __call__(self, system_prompt: str, user_prompt: str, model: str) -> str: + raise NotImplementedError + + +class JudgeModel: + def __call__(self, prompt: str, model: str) -> str: + raise NotImplementedError + + +class OpenAIAnswer(AnswerModel): + def __init__(self): + from openai import OpenAI + + self.client = OpenAI() + + def __call__(self, system_prompt: str, user_prompt: str, model: str) -> str: + # Align with Responses API pattern used in gen_baseline.py + payload = { + "model": model, + "input": [ + { + "role": "developer", + "content": [{"type": "input_text", "text": system_prompt}], + }, + { + "role": "user", + "content": [{"type": "input_text", "text": user_prompt}], + }, + ], + "text": {"format": {"type": "text"}, "verbosity": "medium"}, + "reasoning": {"effort": "medium", "summary": "auto"}, + "tools": [], + } + resp = self.client.responses.create(**payload) + return resp.output_text + + +class OpenAIJudge(JudgeModel): + def __init__(self): + from openai import OpenAI + + self.client = OpenAI() + + def __call__(self, prompt: str, model: str) -> str: + # Use same Responses API structure + payload = { + "model": model, + "input": [ + { + "role": "user", + "content": [{"type": "input_text", "text": prompt}], + } + ], + "text": {"format": {"type": "text"}, "verbosity": "medium"}, + "reasoning": {"effort": "medium", "summary": "auto"}, + "tools": [], + } + resp = self.client.responses.create(**payload) + return resp.output_text + + +def get_answer_adapter(name: str) -> AnswerModel: + if name.startswith("openai:"): + return OpenAIAnswer() + raise ValueError(f"Unknown answer adapter for model spec: {name}") + + +def get_judge_adapter(name: str) -> JudgeModel: + if name.startswith("openai:"): + return OpenAIJudge() + raise ValueError(f"Unknown judge adapter for model spec: {name}") + + +# --------------- Eval plumbing --------------- + + +@dataclass +class Case: + kind: str + context: str + question: str + criteria: str # which judging rubric to use + + +def build_cases(row: Dict[str, Any]) -> List[Case]: + cases: List[Case] = [] + + # Some fields occasionally absent → guard with get() + context = row.get("context") or "" + ocr_context = row.get("ocr_context") or "" + query = row.get("query") or "" + + cases.append(Case("baseline", context, query, CRITERIA_ANSWERABLE)) + + if row.get("error_query"): # misspellings + cases.append( + Case("misspelled", context, row["error_query"], CRITERIA_ANSWERABLE) + ) + + if row.get("incomplete_query"): + cases.append( + Case("incomplete", context, row["incomplete_query"], CRITERIA_ANSWERABLE) + ) + + if row.get("out-of-domain_query"): + cases.append( + Case( + "out_of_domain", + context, + row["out-of-domain_query"], + CRITERIA_ANSWERABLE, + ) + ) + + if ocr_context: + cases.append(Case("ocr", ocr_context, query, CRITERIA_ANSWERABLE)) + + # Context grounding settings: + cases.append(Case("missing_context", "", query, CRITERIA_MISSING_CONTEXT)) + + if row.get("out-of-scope_query"): + cases.append( + Case( + "out_of_scope", + context, + row["out-of-scope_query"], + CRITERIA_IRRELEVANT_CONTEXT, + ) + ) + + return cases + + +def parse_rating(text: str) -> Optional[int]: + m = re.search(r"\[\s*(\d)\s*\]", text) + return int(m.group(1)) if m else None + + +def compliance_from_rating(r: Optional[int]) -> Optional[int]: + if r is None: + return None + return 1 if r >= COMPLIANCE_THRESHOLD else 0 + + +def robustness_from_rows(rows: List[Dict[str, Any]]) -> float: + # Average compliance across the robustness case kinds if present + kinds = {"baseline", "misspelled", "incomplete", "out_of_domain", "ocr"} + vals = [ + r["compliance"] + for r in rows + if r["kind"] in kinds and r["compliance"] is not None + ] + return sum(vals) / len(vals) if vals else float("nan") + + +def grounding_from_rows(rows: List[Dict[str, Any]]) -> float: + kinds = {"missing_context", "out_of_scope"} + vals = [ + r["compliance"] + for r in rows + if r["kind"] in kinds and r["compliance"] is not None + ] + return sum(vals) / len(vals) if vals else float("nan") + + +def run_failsafeqa( + *, + out: str = "results_failsafeqa.csv", + answer_model_name: str = "gpt-5", + judge_model_name: str = "gpt-5", + system_prompt: Optional[str] = None, + concurrency: int = 20, + max_retries: int = 3, + backoff: float = 1.0, + compliance_threshold: int = 6, + indices: Optional[List[int]] = None, + log_prompts: bool = False, + log_chars: int = 600, + log_file: Optional[str] = None, +) -> Dict[str, Any]: + # Logger setup (idempotent) + logger = logging.getLogger("failsafeqa") + logger.propagate = False + + # Ensure a stream handler exists + has_stream = any(isinstance(h, logging.StreamHandler) for h in logger.handlers) + if not has_stream: + sh = logging.StreamHandler() + sh.setFormatter(logging.Formatter("[%(levelname)s] %(message)s")) + logger.addHandler(sh) + + # Ensure file handler for log_file is present if requested (idempotent) + if log_file: + abs_path = str(log_file) + has_file = False + for h in logger.handlers: + if isinstance(h, logging.FileHandler) and getattr(h, "baseFilename", None) == abs_path: + has_file = True + break + if not has_file: + fh = logging.FileHandler(log_file, encoding="utf-8") + fh.setLevel(logging.DEBUG) + fh.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")) + logger.addHandler(fh) + + logger.setLevel(logging.DEBUG if log_prompts else logging.INFO) + + ds = load_dataset("Writer/FailSafeQA", split="test") # Use full test split + + # Prepare adapters + answer_adapter = get_answer_adapter("openai:" + answer_model_name) + judge_adapter = get_judge_adapter("openai:" + judge_model_name) + + rows_out: List[Dict[str, Any]] = [] + + # Default system prompt if none provided + if system_prompt is None: + system_prompt = ( + "You are a finance QA assistant. Answer ONLY using the provided context.\n" + "If the context is missing or irrelevant, politely refuse and state that you need the relevant document." + ) + + # Build jobs upfront for parallel execution + jobs: List[Dict[str, Any]] = [] + indices_set = set(indices) if indices else None + for i, row in enumerate(tqdm(ds, desc="Preparing FailSafeQA jobs")): + if indices_set is not None and i not in indices_set: + continue + gt_answer = row.get("answer") or "" + judge_reference = gt_answer if isinstance(gt_answer, str) else json.dumps(gt_answer) + for case in build_cases(row): + jobs.append( + { + "row_idx": i, + "idx": row.get("idx", i), + "kind": case.kind, + "context": case.context, + "question": case.question, + "criteria": case.criteria, + "judge_reference": judge_reference, + } + ) + + logger.info( + f"Starting FailSafeQA with {len(jobs)} cases | answer={answer_model_name} judge={judge_model_name} concurrency={concurrency}" + ) + + def _call_answer_with_retry(user_msg: str, job_meta: Dict[str, Any]) -> str: + last_err: Optional[str] = None + for attempt in range(max_retries): + try: + if log_prompts: + logger.debug( + f"[Answer→LLM] idx={job_meta.get('idx')} kind={job_meta.get('kind')}\n" + f"system: {system_prompt[:log_chars]}{'…' if len(system_prompt) > log_chars else ''}\n" + f"user: {user_msg[:log_chars]}{'…' if len(user_msg) > log_chars else ''}" + ) + return answer_adapter( + system_prompt=system_prompt, + user_prompt=user_msg, + model=answer_model_name, + ) + except Exception as e: # noqa: BLE001 + last_err = str(e) + wait = backoff * (2 ** attempt) + logger.warning(f"Answer retry {attempt+1}/{max_retries} after error: {last_err}") + time.sleep(wait) + return f"<>" + + def _call_judge_with_retry(prompt_text: str, job_meta: Dict[str, Any]) -> Optional[str]: + last_err: Optional[str] = None + for attempt in range(max_retries): + try: + if log_prompts: + logger.debug( + f"[Judge→LLM] idx={job_meta.get('idx')} kind={job_meta.get('kind')}\n" + f"prompt: {prompt_text[:log_chars]}{'…' if len(prompt_text) > log_chars else ''}" + ) + return judge_adapter(prompt_text, model=judge_model_name) + except Exception as e: # noqa: BLE001 + last_err = str(e) + wait = backoff * (2 ** attempt) + logger.warning(f"Judge retry {attempt+1}/{max_retries} after error: {last_err}") + time.sleep(wait) + logger.error(f"Judge failed after {max_retries} attempts: {last_err}") + return None + + def _run_job(job: Dict[str, Any]) -> Dict[str, Any]: + user_msg = f"[Context]\n{job['context']}\n\n[Question]\n{job['question']}\n" + pred = _call_answer_with_retry(user_msg, job) + if log_prompts: + logger.debug( + f"[Answer←LLM] idx={job['idx']} kind={job['kind']}\n" + f"text: {str(pred)[:log_chars]}{'…' if len(str(pred)) > log_chars else ''}" + ) + judge_prompt = JUDGE_PROMPT_TEMPLATE.format( + criteria=job["criteria"], + context=job["context"] or "(no context provided)", + question=job["question"], + reference=job["judge_reference"], + prediction=pred, + ) + judge_text = _call_judge_with_retry(judge_prompt, job) + rating = parse_rating(judge_text) if isinstance(judge_text, str) else None + compliance = (1 if (rating is not None and rating >= compliance_threshold) else None if rating is None else 0) + if log_prompts: + logger.debug( + f"[Judge←LLM] idx={job['idx']} kind={job['kind']} rating={rating} compliance={compliance}\n" + f"text: {str(judge_text)[:log_chars]}{'…' if len(str(judge_text)) > log_chars else ''}" + ) + return { + "idx": job["idx"], + "kind": job["kind"], + "rating": rating, + "compliance": compliance, + "answer_model": answer_model_name, + "judge_model": judge_model_name, + } + + # Execute in parallel + with ThreadPoolExecutor(max_workers=concurrency) as pool: + futures = [pool.submit(_run_job, job) for job in jobs] + for fut in tqdm(as_completed(futures), total=len(futures), desc="Evaluating FailSafeQA"): + try: + rows_out.append(fut.result()) + except Exception as e: # noqa: BLE001 + logger.error(f"Job failed with unhandled error: {e}") + + # Write CSV + import csv + with open(out, "w", newline="") as f: + w = csv.writer(f) + w.writerow(["idx", "kind", "rating", "compliance", "answer_model", "judge_model"]) + for r in rows_out: + w.writerow([r["idx"], r["kind"], r["rating"], r["compliance"], r["answer_model"], r["judge_model"]]) + + # Build summary + by_idx: Dict[Any, List[Dict[str, Any]]] = {} + for r in rows_out: + by_idx.setdefault(r["idx"], []).append(r) + + robustness_vals, grounding_vals = [], [] + for idx, group in by_idx.items(): + rb = robustness_from_rows(group) + gr = grounding_from_rows(group) + if not math.isnan(rb): + robustness_vals.append(rb) + if not math.isnan(gr): + grounding_vals.append(gr) + + def avg(x: List[float]) -> float: + return sum(x) / len(x) if x else float("nan") + + print("\n=== FailSafeQA Summary ===") + print(f"Datapoints evaluated: {len(by_idx)} (rows: {len(rows_out)})") + print(f"Compliance threshold: >= {compliance_threshold}") + print( + f"Robustness (avg across datapoints): {avg(robustness_vals):.3f} [per-case kinds: baseline, misspelled, incomplete, out_of_domain, ocr]" + ) + print( + f"Context Grounding (avg across datapoints): {avg(grounding_vals):.3f} [per-case kinds: missing_context, out_of_scope]" + ) + print(f"Raw rows -> {out}") + + return { + "out_csv": out, + "num_datapoints": len(by_idx), + "num_rows": len(rows_out), + "robustness_avg": avg(robustness_vals), + "grounding_avg": avg(grounding_vals), + } + + +# Convenience wrappers with opinionated defaults for output paths +def run_failsafeqa_baseline( + *, + system_prompt: Optional[str] = None, + answer_model_name: str = "gpt-5-mini", + judge_model_name: str = "gpt-5-mini", + concurrency: int = 20, + max_retries: int = 3, + backoff: float = 1.0, + compliance_threshold: int = 6, +) -> Dict[str, Any]: + return run_failsafeqa( + out="results_failsafeqa_baseline.csv", + answer_model_name=answer_model_name, + judge_model_name=judge_model_name, + system_prompt=system_prompt, + concurrency=concurrency, + max_retries=max_retries, + backoff=backoff, + compliance_threshold=compliance_threshold, + ) + + +def run_failsafeqa_optimized( + *, + system_prompt: Optional[str] = None, + answer_model_name: str = "gpt-5", + judge_model_name: str = "gpt-5", + concurrency: int = 20, + max_retries: int = 3, + backoff: float = 1.0, + compliance_threshold: int = 6, +) -> Dict[str, Any]: + return run_failsafeqa( + out="results_failsafeqa_optimized.csv", + answer_model_name=answer_model_name, + judge_model_name=judge_model_name, + system_prompt=system_prompt, + concurrency=concurrency, + max_retries=max_retries, + backoff=backoff, + compliance_threshold=compliance_threshold, + ) + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--out", default="results_failsafeqa.csv") + args = ap.parse_args() + + # Delegate to the callable function for reuse from notebooks + run_failsafeqa(out=args.out) + + +if __name__ == "__main__": + main() diff --git a/examples/gpt-5/prompt-optimization-cookbook/scripts/__init__.py b/examples/gpt-5/prompt-optimization-cookbook/scripts/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/gpt-5/prompt-optimization-cookbook/scripts/gen_baseline.py b/examples/gpt-5/prompt-optimization-cookbook/scripts/gen_baseline.py new file mode 100644 index 0000000000..0e62371e4d --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/scripts/gen_baseline.py @@ -0,0 +1,78 @@ +import re +import time +import random +from pathlib import Path +from concurrent.futures import ThreadPoolExecutor, as_completed +from typing import Optional +from openai import OpenAI + +CODE_BLOCK = re.compile(r"```[ \t]*(?:[A-Za-z0-9_+\-]+)?[ \t]*\r?\n(.*?)```", re.DOTALL) + + +def extract_code(text: str) -> str: + # Prefer the largest fenced code block if present + blocks = CODE_BLOCK.findall(text) + if blocks: + return max(blocks, key=len).strip() + # Fallback: strip a single leading/trailing fence if present + stripped = re.sub(r"^\s*```[^\n]*\r?\n", "", text) + stripped = re.sub(r"\n```[ \t]*$", "", stripped) + return stripped.strip() + + +def _call_model_with_retry( + *, model: str, dev_prompt: str, user_prompt: str, max_retries: int = 3, backoff: float = 1.0 +) -> str: + client = OpenAI() + payload = { + "model": model, + "input": [ + {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]}, + {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]}, + ], + "text": {"format": {"type": "text"}, "verbosity": "medium"}, + "reasoning": {"effort": "medium", "summary": "auto"}, + "tools": [], + } + for attempt in range(max_retries): + try: + resp = client.responses.create(**payload) + return getattr(resp, "output_text", str(resp)) + except Exception: + if attempt == max_retries - 1: + raise + time.sleep(backoff * (2 ** attempt) + random.random() * 0.25) + + +def generate_baseline_topk( + *, + model: str = "gpt-5", + n_runs: int = 30, + concurrency: int = 10, + output_dir: str = "results_topk_baseline", + dev_prompt: str, + user_prompt: str, +) -> Path: + out = Path(output_dir) + out.mkdir(parents=True, exist_ok=True) + + def run_one(i: int): + text = _call_model_with_retry(model=model, dev_prompt=dev_prompt, user_prompt=user_prompt) + code = extract_code(text) + return i, code + + written = 0 + futures = [] + with ThreadPoolExecutor(max_workers=concurrency) as pool: + for i in range(1, n_runs + 1): + futures.append(pool.submit(run_one, i)) + for fut in as_completed(futures): + i, code = fut.result() + out_path = out / f"run_{i:02d}.py" + out_path.write_text(code, encoding="utf-8") + written += 1 + print(f"[{written}/{n_runs}] Wrote {out_path} — remaining: {n_runs - written}") + print(f"Done. Saved {n_runs} files to: {out.resolve()}") + return out + + diff --git a/examples/gpt-5/prompt-optimization-cookbook/scripts/gen_optimized.py b/examples/gpt-5/prompt-optimization-cookbook/scripts/gen_optimized.py new file mode 100644 index 0000000000..b46c49ee88 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/scripts/gen_optimized.py @@ -0,0 +1,65 @@ +import re +import time +import random +from pathlib import Path +from concurrent.futures import ThreadPoolExecutor, as_completed +from typing import Optional +from openai import OpenAI + +CODE_BLOCK = re.compile(r"```[ \t]*(?:[A-Za-z0-9_+\-]+)?[ \t]*\r?\n(.*?)```", re.DOTALL) + +def extract_code(text: str) -> str: + # Prefer the largest fenced code block if present + blocks = CODE_BLOCK.findall(text) + if blocks: + return max(blocks, key=len).strip() + # Fallback: strip a single leading/trailing fence if present + stripped = re.sub(r"^\s*```[^\n]*\r?\n", "", text) + stripped = re.sub(r"\n```[ \t]*$", "", stripped) + return stripped.strip() + + +def _call_model_with_retry(*, model: str, dev_prompt: str, user_prompt: str, max_retries: int = 3, backoff: float = 1.0) -> str: + client = OpenAI() + payload = { + "model": model, + "input": [ + {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]}, + {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]}, + ], + "text": {"format": {"type": "text"}, "verbosity": "medium"}, + "reasoning": {"effort": "medium", "summary": "auto"}, + "tools": [], + } + for attempt in range(max_retries): + try: + resp = client.responses.create(**payload) + return getattr(resp, "output_text", str(resp)) + except Exception: + if attempt == max_retries - 1: + raise + time.sleep(backoff * (2 ** attempt) + random.random() * 0.25) + + +def generate_optimized_topk(*, model: str = "gpt-5", n_runs: int = 30, concurrency: int = 10, output_dir: str = "results_topk_optimized", dev_prompt: str, user_prompt: str) -> Path: + out = Path(output_dir) + out.mkdir(parents=True, exist_ok=True) + + def run_one(i: int): + text = _call_model_with_retry(model=model, dev_prompt=dev_prompt, user_prompt=user_prompt) + code = extract_code(text) + return i, code + + written = 0 + futures = [] + with ThreadPoolExecutor(max_workers=concurrency) as pool: + for i in range(1, n_runs + 1): + futures.append(pool.submit(run_one, i)) + for fut in as_completed(futures): + i, code = fut.result() + out_path = out / f"run_{i:02d}.py" + out_path.write_text(code, encoding="utf-8") + written += 1 + print(f"[{written}/{n_runs}] Wrote {out_path} — remaining: {n_runs - written}") + print(f"Done. Saved {n_runs} files to: {out.resolve()}") + return out diff --git a/examples/gpt-5/prompt-optimization-cookbook/scripts/llm_judge.py b/examples/gpt-5/prompt-optimization-cookbook/scripts/llm_judge.py new file mode 100644 index 0000000000..acb4448c57 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/scripts/llm_judge.py @@ -0,0 +1,333 @@ +import json +import time +from pathlib import Path +from concurrent.futures import ThreadPoolExecutor, as_completed +from typing import Dict, Optional, Tuple, Any, List + +from openai import OpenAI + +# Default task text aligned with the Top-K evaluation used in the notebook +DEFAULT_TASK_TEXT = ( + "Your task:\n" + "Compute the exact Top-K most frequent tokens from a given text.\n\n" + "Tokenization:\n" + "- Case-insensitive tokenization using an ASCII regex; produce lowercase tokens. Lowercasing the entire text is NOT required (per-token lowercasing is acceptable).\n" + "- Tokens are ASCII [a-z0-9]+ sequences; all other characters are separators (use a regex).\n\n" + "Inputs:\n" + "- Two globals are provided: text (string) and k (int). Do not reassign them.\n\n" + "Requirements:\n" + "1) Compute Top-K sorted by count desc, then token asc (i.e., sort key = (-count, token)).\n" + "2) Set top_k to a list of (token, count) tuples, length = min(k, number of unique tokens).\n" + "3) Handle edge cases: if k <= 0, top_k = [].\n" + "4) Do not use input(), file I/O, or network access. The script must run as-is with the provided globals.\n\n" + "Output contract:\n" + "- At the end of execution, top_k must be defined exactly as described.\n" + "- Optional: if printing, print only top_k on the last line as a Python literal or JSON.\n\n" + "Note:\n" + "- Do not rely on Counter.most_common tie ordering; implement the specified sort.\n" +) + + +def _load_system_prompt(path: Path) -> str: + return path.read_text(encoding="utf-8") + + +def _assemble_messages(system_prompt: str, code: str, task: str) -> List[Dict[str, Any]]: + return [ + { + "role": "developer", + "content": [ + {"type": "input_text", "text": system_prompt}, + ], + }, + { + "role": "user", + "content": [ + { + "type": "input_text", + "text": ( + "Evaluate the following code output\n\n" + "\n{code}\n\n\n" + "on the following task instructions\n\n{task}\n" + ).format(code=code, task=task), + } + ], + }, + ] + + +def _to_text(resp: Any) -> str: + if getattr(resp, "output_text", None): + return resp.output_text + try: + parts = [] + for item in (getattr(resp, "output", []) or []): + if getattr(item, "type", None) == "message": + for seg in (getattr(item, "content", []) or []): + if getattr(seg, "type", None) == "output_text": + parts.append(getattr(seg, "text", "") or "") + return "".join(parts) or str(resp) + except Exception: + return str(resp) + + +def _safe_parse_json(text: str) -> Tuple[Optional[dict], Optional[str]]: + # Try direct load + try: + return json.loads(text), None + except Exception as e: + last_err = str(e) + # Try to extract the largest JSON object via regex braces matching heuristic + try: + start = text.find("{") + end = text.rfind("}") + if start != -1 and end != -1 and end > start: + candidate = text[start : end + 1] + return json.loads(candidate), None + except Exception as e2: + last_err = str(e2) + return None, last_err + + +def judge_folder( + *, + results_dir: str, + out_dir: Optional[str] = None, + model: str = "gpt-5", + system_prompt_path: str = "llm_as_judge.txt", + task_text: Optional[str] = None, + concurrency: int = 5, + max_retries: int = 3, + backoff: float = 1.0, +) -> Path: + """ + Evaluate each .py code file in results_dir with an LLM-as-judge and write per-file JSON judgments. + Returns the output directory path. + """ + in_dir = Path(results_dir) + assert in_dir.exists(), f"Results folder not found: {in_dir}" + + # Output directory + if out_dir is None: + name = in_dir.name.lower() + if "baseline" in name: + suffix = "baseline" + elif "optimized" in name: + suffix = "optimized" + else: + suffix = "baseline" + out_dir = in_dir.parent / f"results_llm_as_judge_{suffix}" + out_path = Path(out_dir) + out_path.mkdir(parents=True, exist_ok=True) + + # Load prompts + system_prompt = _load_system_prompt(Path(system_prompt_path)) + task = task_text or DEFAULT_TASK_TEXT + + client = OpenAI() + + def run_one(py_path: Path) -> Tuple[str, dict]: + code = py_path.read_text(encoding="utf-8", errors="ignore") + messages = _assemble_messages(system_prompt, code, task) + + for attempt in range(max_retries): + try: + resp = client.responses.create( + model=model, + input=messages, + text={"format": {"type": "text"}, "verbosity": "medium"}, + reasoning={"effort": "medium", "summary": "auto"}, + tools=[], + ) + raw = _to_text(resp) + parsed, err = _safe_parse_json(raw) + result = { + "file": str(py_path.name), + "raw": raw, + "parsed": parsed, + "parse_error": err, + } + return py_path.name, result + except Exception as e: + if attempt == max_retries - 1: + return py_path.name, { + "file": str(py_path.name), + "error": f"Request failed: {e}", + } + time.sleep(backoff * (2 ** attempt)) + # Should not reach + return py_path.name, {"file": str(py_path.name), "error": "Exhausted retries"} + + py_files = sorted([p for p in in_dir.glob("*.py")]) + + results: Dict[str, dict] = {} + with ThreadPoolExecutor(max_workers=concurrency) as pool: + futures = {pool.submit(run_one, p): p.name for p in py_files} + for fut in as_completed(futures): + fname, res = fut.result() + results[fname] = res + # write per-file json immediately + out_file = out_path / f"{Path(fname).stem}.json" + out_file.write_text(json.dumps(res, indent=2), encoding="utf-8") + + # Build a summary CSV with scores if parseable + import csv as _csv + + summary_csv = out_path / "judgement_summary.csv" + with open(summary_csv, "w", newline="") as fp: + writer = _csv.writer(fp) + writer.writerow(["File", "adherence_score", "code_quality_score", "parse_error", "error"]) + for fname in sorted(results.keys()): + r = results[fname] + adher = None + codeq = None + perr = r.get("parse_error") + err = r.get("error") + parsed = r.get("parsed") + if isinstance(parsed, dict): + fj = parsed.get("final_judgement") or {} + adher = fj.get("adherence_score") + codeq = fj.get("code_quality_score") + writer.writerow([fname, adher, codeq, perr or "", err or ""]) + + return out_path + + +if __name__ == "__main__": + import argparse + + ap = argparse.ArgumentParser(description="Run LLM-as-judge over generated scripts.") + ap.add_argument("--optimized_dir", default="results_topk_optimized") + ap.add_argument("--baseline_dir", default="results_topk_baseline") + ap.add_argument("--system_prompt", default="llm_as_judge.txt") + ap.add_argument("--model", default="gpt-5") + ap.add_argument("--concurrency", type=int, default=5) + ap.add_argument("--task_file", default=None, help="Optional path to a file containing task instructions") + ap.add_argument("--out_dir_baseline", default=None, help="Write judgments for baseline run to this directory (used as-is)") + ap.add_argument("--out_dir_optimized", default=None, help="Write judgments for optimized run to this directory (used as-is)") + + args = ap.parse_args() + + task_text = None + if args.task_file: + task_text = Path(args.task_file).read_text(encoding="utf-8") + + # Baseline + judge_folder( + results_dir=args.baseline_dir, + out_dir=args.out_dir_baseline, # used as-is if provided + model=args.model, + system_prompt_path=args.system_prompt, + task_text=task_text, + concurrency=args.concurrency, + ) + # Optimized + judge_folder( + results_dir=args.optimized_dir, + out_dir=args.out_dir_optimized, # used as-is if provided + model=args.model, + system_prompt_path=args.system_prompt, + task_text=task_text, + concurrency=args.concurrency, + ) + + +# --- Ad-hoc helpers for single-file judging and summary rebuild --- +def judge_one( + *, + py_path: str, + out_dir: Optional[str] = None, + model: str = "gpt-5", + system_prompt_path: str = "llm_as_judge.txt", + task_text: Optional[str] = None, + max_retries: int = 3, + backoff: float = 1.0, +) -> Path: + """Judge a single Python file and write its JSON to the appropriate output directory. + + Returns the path to the written JSON. + """ + in_path = Path(py_path) + assert in_path.exists(), f"Python file not found: {in_path}" + + # Resolve output directory + if out_dir is None: + parent = in_path.parent.name.lower() + if "baseline" in parent: + suffix = "baseline" + elif "optimized" in parent: + suffix = "optimized" + else: + suffix = "baseline" + out_dir = in_path.parent.parent / f"results_llm_as_judge_{suffix}" + out_path = Path(out_dir) + out_path.mkdir(parents=True, exist_ok=True) + + # Load prompts + system_prompt = _load_system_prompt(Path(system_prompt_path)) + task = task_text or DEFAULT_TASK_TEXT + + # Read code and call model + code = in_path.read_text(encoding="utf-8", errors="ignore") + messages = _assemble_messages(system_prompt, code, task) + + client = OpenAI() + last_err: Optional[str] = None + for attempt in range(max_retries): + try: + resp = client.responses.create( + model=model, + input=messages, + text={"format": {"type": "text"}, "verbosity": "medium"}, + reasoning={"effort": "medium", "summary": "auto"}, + tools=[], + ) + raw = _to_text(resp) + parsed, err = _safe_parse_json(raw) + result = { + "file": str(in_path.name), + "raw": raw, + "parsed": parsed, + "parse_error": err, + } + out_json = out_path / f"{in_path.stem}.json" + out_json.write_text(json.dumps(result, indent=2), encoding="utf-8") + return out_json + except Exception as e: + last_err = str(e) + if attempt == max_retries - 1: + raise + time.sleep(backoff * (2 ** attempt)) + + raise RuntimeError(f"Failed to judge {in_path}: {last_err}") + + +def rebuild_summary(*, out_dir: str) -> Path: + """Rebuild judgement_summary.csv from all JSON files present in out_dir and return its path.""" + base = Path(out_dir) + results: Dict[str, dict] = {} + for p in sorted(base.glob("*.json")): + try: + results[p.name] = json.loads(p.read_text(encoding="utf-8")) + except Exception: + continue + + import csv as _csv + summary_csv = base / "judgement_summary.csv" + with open(summary_csv, "w", newline="") as fp: + writer = _csv.writer(fp) + writer.writerow(["File", "adherence_score", "code_quality_score", "parse_error", "error"]) + for fname in sorted(results.keys()): + r = results[fname] + adher = None + codeq = None + perr = (r.get("parse_error") if isinstance(r, dict) else None) + err = (r.get("error") if isinstance(r, dict) else None) + parsed = r.get("parsed") if isinstance(r, dict) else None + if isinstance(parsed, dict): + fj = parsed.get("final_judgement") or {} + adher = fj.get("adherence_score") + codeq = fj.get("code_quality_score") + writer.writerow([r.get("file") or fname, adher, codeq, perr or "", err or ""]) + + return summary_csv diff --git a/examples/gpt-5/prompt-optimization-cookbook/scripts/results_summarizer.py b/examples/gpt-5/prompt-optimization-cookbook/scripts/results_summarizer.py new file mode 100644 index 0000000000..032597e055 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/scripts/results_summarizer.py @@ -0,0 +1,348 @@ +from __future__ import annotations + +import csv +import json +from dataclasses import dataclass +from pathlib import Path +from typing import Dict, List, Optional, Tuple + +# Minimal typing-friendly containers +@dataclass +class QuantRow: + file: str + compiled: bool + exec_time_s: Optional[float] + peak_mem_bytes: Optional[int] + exact: Optional[bool] + sorted_ok: Optional[bool] + violation: Optional[str] + +@dataclass +class JudgeRow: + file: str + adherence_score: Optional[float] + code_quality_score: Optional[float] + parse_error: Optional[str] + error: Optional[str] + +@dataclass +class GroupSummary: + name: str + n_total: int + n_success: int + n_violations: int + exact_rate: float + sorted_rate: float + avg_time_s: Optional[float] + avg_peak_kb: Optional[float] + avg_adherence: Optional[float] + avg_code_quality: Optional[float] + + +def _read_quant_csv(path: Path) -> List[QuantRow]: + rows: List[QuantRow] = [] + with open(path, newline="") as fp: + r = csv.DictReader(fp) + for d in r: + compiled = str(d.get("Compiled", "")).strip().lower() == "true" + def _float(x): + try: + return float(x) + except Exception: + return None + def _int(x): + try: + return int(x) + except Exception: + return None + def _bool(x): + sx = str(x).strip().lower() + if sx in ("true", "false"): + return sx == "true" + return None + rows.append( + QuantRow( + file=str(d.get("File Name", "")), + compiled=compiled, + exec_time_s=_float(d.get("Execution Time (s)", "")), + peak_mem_bytes=_int(d.get("Peak Memory (bytes)", "")), + exact=_bool(d.get("Exact Match", "")), + sorted_ok=_bool(d.get("Sorted Correctly", "")), + violation=(d.get("Violation") if "Violation" in d else None), + ) + ) + return rows + + +def _read_judge_csv(path: Path) -> List[JudgeRow]: + out: List[JudgeRow] = [] + with open(path, newline="") as fp: + r = csv.DictReader(fp) + for d in r: + def _num(x): + try: + return float(x) + except Exception: + return None + out.append( + JudgeRow( + file=str(d.get("File", "")), + adherence_score=_num(d.get("adherence_score")), + code_quality_score=_num(d.get("code_quality_score")), + parse_error=d.get("parse_error"), + error=d.get("error"), + ) + ) + return out + + +def _avg(nums: List[float]) -> Optional[float]: + nums2 = [x for x in nums if x is not None] + if not nums2: + return None + return sum(nums2) / len(nums2) + + +def summarize_groups( + *, + quant_paths: Dict[str, Path], + judge_paths: Dict[str, Path], +) -> Dict[str, GroupSummary]: + summaries: Dict[str, GroupSummary] = {} + for name, qpath in quant_paths.items(): + qrows = _read_quant_csv(qpath) + jrows = _read_judge_csv(judge_paths[name]) if name in judge_paths and judge_paths[name].exists() else [] + jmap = {Path(j.file).stem: j for j in jrows} + + n_total = len(qrows) + n_success = sum(1 for r in qrows if r.compiled) + n_viol = sum(1 for r in qrows if (r.violation or "").strip()) + exact_rate = ( + sum(1 for r in qrows if r.exact) / n_success if n_success else 0.0 + ) + sorted_rate = ( + sum(1 for r in qrows if r.sorted_ok) / n_success if n_success else 0.0 + ) + avg_time_s = _avg([r.exec_time_s for r in qrows if r.compiled and r.exec_time_s is not None]) + avg_peak_kb = _avg([ + (r.peak_mem_bytes or 0) / 1024.0 for r in qrows if r.compiled and r.peak_mem_bytes is not None + ]) + + # Judge averages + avg_adherence = _avg([jr.adherence_score for jr in jrows if jr.adherence_score is not None]) + avg_codeq = _avg([jr.code_quality_score for jr in jrows if jr.code_quality_score is not None]) + + summaries[name] = GroupSummary( + name=name, + n_total=n_total, + n_success=n_success, + n_violations=n_viol, + exact_rate=exact_rate, + sorted_rate=sorted_rate, + avg_time_s=avg_time_s, + avg_peak_kb=avg_peak_kb, + avg_adherence=avg_adherence, + avg_code_quality=avg_codeq, + ) + return summaries + + +def render_charts( + *, + quant_baseline: Path = Path("results_topk_baseline") / "run_results_topk_baseline.csv", + quant_optimized: Path = Path("results_topk_optimized") / "run_results_topk_optimized.csv", + judge_baseline: Path = Path("results_llm_as_judge_baseline") / "judgement_summary.csv", + judge_optimized: Path = Path("results_llm_as_judge_optimized") / "judgement_summary.csv", + auto_display: bool = False, + close_after: bool = False, +): + import matplotlib.pyplot as plt + # seaborn optional + try: + import seaborn as sns # type: ignore + sns.set_theme(style="whitegrid") + except Exception: + pass + + quant_paths = { + "baseline": Path(quant_baseline), + "optimized": Path(quant_optimized), + } + judge_paths = { + "baseline": Path(judge_baseline), + "optimized": Path(judge_optimized), + } + summaries = summarize_groups(quant_paths=quant_paths, judge_paths=judge_paths) + + # Build figure with subplots + fig, axes = plt.subplots(2, 3, figsize=(15, 8)) + labels = ["baseline", "optimized"] + + # Helper to fetch values in label order + def vals(key: str) -> List[float]: + out: List[float] = [] + for l in labels: + v = getattr(summaries[l], key) + out.append(v if v is not None else 0.0) + return out + + # 1) Avg exec time + ax = axes[0, 0] + ax.bar(labels, vals("avg_time_s"), color=["#cbd5e1", "#60a5fa"]) # slate-200, blue-400 + ax.set_title("Average Execution Time (s)") + + # 2) Avg peak memory + ax = axes[0, 1] + ax.bar(labels, vals("avg_peak_kb"), color=["#cbd5e1", "#60a5fa"]) + ax.set_title("Average Peak Memory (KB)") + + # 3) Success & Violation stacked bars + ax = axes[0, 2] + succ = [summaries[l].n_success for l in labels] + viol = [summaries[l].n_violations for l in labels] + total = [summaries[l].n_total for l in labels] + fail = [total[i] - succ[i] - viol[i] for i in range(len(labels))] + ax.bar(labels, succ, label="Success", color="#22c55e") + ax.bar(labels, viol, bottom=succ, label="Violation", color="#f59e0b") + ax.bar(labels, fail, bottom=[succ[i] + viol[i] for i in range(len(labels))], label="Fail", color="#ef4444") + ax.set_title("Outcome Breakdown") + ax.legend() + + # 4) Exact rate + ax = axes[1, 0] + ax.bar(labels, [summaries[l].exact_rate * 100 for l in labels], color=["#cbd5e1", "#60a5fa"]) + ax.set_title("Exact Match Rate (%)") + + # 5) Sorted correct rate + ax = axes[1, 1] + ax.bar(labels, [summaries[l].sorted_rate * 100 for l in labels], color=["#cbd5e1", "#60a5fa"]) + ax.set_title("Sorted Correctly Rate (%)") + + # 6) LLM scores (adherence vs code quality) + ax = axes[1, 2] + x = range(len(labels)) + width = 0.35 + adher = vals("avg_adherence") + codeq = vals("avg_code_quality") + ax.bar([i - width / 2 for i in x], adher, width=width, label="Adherence", color="#0ea5e9") + ax.bar([i + width / 2 for i in x], codeq, width=width, label="Code Quality", color="#8b5cf6") + ax.set_xticks(list(x)) + ax.set_xticklabels(labels) + ax.set_title("LLM-as-Judge Scores") + ax.legend() + + fig.tight_layout() + + # Optional in-function display for notebook convenience + if auto_display: + try: + from IPython.display import display # type: ignore + display(fig) + except Exception: + pass + if close_after: + try: + plt.close(fig) + except Exception: + pass + + return fig, summaries + + +def print_text_summaries(summaries: Dict[str, GroupSummary]): + for k in ("baseline", "optimized"): + s = summaries.get(k) + if not s: + continue + print(f"\n=== {k.upper()} ===") + print(f"Total: {s.n_total}, Success: {s.n_success}, Violations: {s.n_violations}") + if s.avg_time_s is not None: + print(f"Avg Time: {s.avg_time_s:.3f}s") + if s.avg_peak_kb is not None: + print(f"Avg Peak Memory: {s.avg_peak_kb:.1f} KB") + print(f"Exact Rate: {s.exact_rate*100:.1f}% | Sorted Rate: {s.sorted_rate*100:.1f}%") + if s.avg_adherence is not None or s.avg_code_quality is not None: + print( + f"LLM Scores — Adherence: {s.avg_adherence or 'NA'}, Code Quality: {s.avg_code_quality or 'NA'}" + ) + + +if __name__ == "__main__": + import argparse + import matplotlib.pyplot as plt + + ap = argparse.ArgumentParser(description="Summarize and visualize results.") + ap.add_argument("--quant_baseline", default=str(Path("results_topk_baseline") / "run_results_topk_baseline.csv")) + ap.add_argument("--quant_opt", default=str(Path("results_topk_optimized") / "run_results_topk_optimized.csv")) + ap.add_argument("--judge_baseline", default=str(Path("results_llm_as_judge_baseline") / "judgement_summary.csv")) + ap.add_argument("--judge_opt", default=str(Path("results_llm_as_judge_optimized") / "judgement_summary.csv")) + + args = ap.parse_args() + + fig, summaries = render_charts( + quant_baseline=Path(args.quant_baseline), + quant_optimized=Path(args.quant_opt), + judge_baseline=Path(args.judge_baseline), + judge_optimized=Path(args.judge_opt), + ) + print_text_summaries(summaries) + plt.show() + + +def build_markdown_summary( + *, + quant_baseline: Path = Path("results_topk_baseline") / "run_results_topk_baseline.csv", + quant_optimized: Path = Path("results_topk_optimized") / "run_results_topk_optimized.csv", + judge_baseline: Path = Path("results_llm_as_judge_baseline") / "judgement_summary.csv", + judge_optimized: Path = Path("results_llm_as_judge_optimized") / "judgement_summary.csv", +) -> str: + """Return a Markdown table comparing baseline vs optimized metrics with deltas. + + This is a pure function that reads the CSVs and produces Markdown suitable for Jupyter display. + """ + summaries = summarize_groups( + quant_paths={ + "baseline": Path(quant_baseline), + "optimized": Path(quant_optimized), + }, + judge_paths={ + "baseline": Path(judge_baseline), + "optimized": Path(judge_optimized), + }, + ) + + base = summaries["baseline"] + opt = summaries["optimized"] + + def _fmt(x: Optional[float], n: int = 3) -> str: + return (f"{x:.{n}f}" if x is not None else "NA") + + def _sign(x: float) -> str: + return "+" if x > 0 else "" + + rows: List[str] = [] + + def _add_row(label: str, b: Optional[float], o: Optional[float], places: int) -> None: + delta_str = "NA" + if b is not None and o is not None: + delta = o - b + delta_str = f"{_sign(delta)}{_fmt(delta, places)}" + rows.append( + f"| {label:<27} | {_fmt(b, places):>8} | {_fmt(o, places):>9} | {delta_str:>13} |" + ) + + header = ( + "| Metric | Baseline | Optimized | Δ (Opt − Base) |\n" + "|----------------------------|---------:|----------:|---------------:|" + ) + + _add_row("Avg Time (s)", base.avg_time_s, opt.avg_time_s, 3) + _add_row("Peak Memory (KB)", base.avg_peak_kb, opt.avg_peak_kb, 1) + _add_row("Exact (%)", (base.exact_rate * 100.0), (opt.exact_rate * 100.0), 1) + _add_row("Sorted (%)", (base.sorted_rate * 100.0), (opt.sorted_rate * 100.0), 1) + _add_row("LLM Adherence (1–5)", base.avg_adherence, opt.avg_adherence, 2) + _add_row("Code Quality (1–5)", base.avg_code_quality, opt.avg_code_quality, 2) + + body = "\n".join(rows) + md = f"### Prompt Optimization Results - Coding Tasks\n\n{header}\n{body}" + return md diff --git a/examples/gpt-5/prompt-optimization-cookbook/scripts/topk_eval.py b/examples/gpt-5/prompt-optimization-cookbook/scripts/topk_eval.py new file mode 100644 index 0000000000..0ab24c4617 --- /dev/null +++ b/examples/gpt-5/prompt-optimization-cookbook/scripts/topk_eval.py @@ -0,0 +1,311 @@ +import os +import io +import re +import csv +import json +import ast +import time +import tracemalloc +import runpy +from pathlib import Path +from contextlib import redirect_stdout +from collections import Counter +from typing import Optional + +TOKEN_RE = re.compile(r"[a-z0-9]+") + +# Disallowed usage (static scan) +DISALLOWED_IMPORTS = { + "sqlite3": "external_storage", + "tempfile": "external_storage", + "shelve": "external_storage", + "requests": "network", + "urllib": "network", + "http": "network", + "socket": "network", +} +DISALLOWED_PATTERNS = [ + ("file_io", re.compile(r"(? dict: + # === Build deterministic, more intense dataset with many ties near Top-K === + import random + + random.seed(1337) + vocab_top = [f"w{i:04d}" for i in range(1, 401)] + vocab_tail = [f"w{i:04d}" for i in range(401, 5001)] + + counts_plan = {} + + # Head: decreasing counts + for i, tok in enumerate(vocab_top[:150], start=1): + c = max(1200, int(5000 / (i ** 0.5))) + counts_plan[tok] = c + + # Plateau: create many equal-count tokens to stress tie-breaking near K + plateau_tokens = vocab_top[150:350] # 200 tokens + for tok in plateau_tokens: + counts_plan[tok] = 1000 + + # Remainder of top block + for tok in vocab_top[350:400]: + counts_plan[tok] = 900 + + # Materialize text via generator to avoid a huge list + residual = max(0, scale_tokens - sum(counts_plan.values())) + tail_vocab = vocab_tail + + def iter_tokens(): + for tok, c in counts_plan.items(): + for _ in range(c): + yield tok + for i in range(residual): + yield tail_vocab[i % len(tail_vocab)] + + test_text = " ".join(iter_tokens()) + + # === Ground truth (from construction plan) === + counts = Counter() + counts.update(counts_plan) + for i in range(residual): + counts[tail_vocab[i % len(tail_vocab)]] += 1 + + def topk_from_counts(cnt: Counter, k: int): + items = list(cnt.items()) + items.sort(key=lambda x: (-x[1], x[0])) + return items[:k] + + ground_truth = topk_from_counts(counts, k) + + # === Helpers === + def coerce_topk(obj): + if isinstance(obj, list): + out = [] + for it in obj: + if isinstance(it, (list, tuple)) and len(it) == 2 and isinstance(it[0], str) and isinstance(it[1], (int, float)): + out.append((it[0], int(it[1]))) + else: + return None + return out + return None + + def parse_topk_from_stdout(stdout_str): + lines = [ln.strip() for ln in stdout_str.strip().splitlines() if ln.strip()] + for candidate in reversed(lines): + try: + val = ast.literal_eval(candidate) + coerced = coerce_topk(val) + if coerced is not None: + return coerced + except Exception: + pass + return None + + def is_sorted_topk(pairs): + return all((pairs[i][1] > pairs[i+1][1]) or (pairs[i][1] == pairs[i+1][1] and pairs[i][0] <= pairs[i+1][0]) for i in range(len(pairs)-1)) + + def precision_at_k(pred, truth): + pred_tokens = {t for t, _ in (pred[:k] if isinstance(pred, list) else [])} + truth_tokens = {t for t, _ in truth[:k]} + if not pred_tokens: + return 0.0 + return len(pred_tokens & truth_tokens) / min(len(pred_tokens), k) + + def scan_constraints(src: str) -> Optional[str]: + # Imports + for name, tag in DISALLOWED_IMPORTS.items(): + # simple import detection + if re.search(rf"\bimport\s+{re.escape(name)}\b|\bfrom\s+{re.escape(name)}\b", src): + return f"Constraint violation: disallowed import '{name}' ({tag})" + # Patterns + for tag, rx in DISALLOWED_PATTERNS: + if rx.search(src): + return f"Constraint violation: disallowed pattern '{tag}'" + return None + + # === Warm-up (optional) === + py_files = sorted([f for f in os.listdir(folder_path) if f.endswith(".py")]) + if py_files: + warmup_path = os.path.join(folder_path, py_files[0]) + try: + _ = runpy.run_path(warmup_path, run_name="__main__", init_globals={"text": test_text, "k": k}) + tracemalloc.start() + _ = tracemalloc.get_traced_memory() + tracemalloc.stop() + except Exception: + pass + + # === Evaluation === + rows = [] + compile_success_count = 0 + total_time = 0.0 + total_mem = 0 + exact_count = 0 + sorted_ok_count = 0 + violation_count = 0 + + for file_name in py_files: + file_path = os.path.join(folder_path, file_name) + + # Static constraint scan + violation = None + try: + src_text = Path(file_path).read_text(encoding="utf-8", errors="ignore") + violation = scan_constraints(src_text) + except Exception: + pass + + if violation: + violation_count += 1 + rows.append([file_name, False, "", "", "", ground_truth[:5], violation, "", "", violation]) + continue + + f = io.StringIO() + tracemalloc.start() + start = time.perf_counter() + peak_mem = None + result = None + + try: + with redirect_stdout(f): + namespace = runpy.run_path(file_path, run_name="__main__", init_globals={"text": test_text, "k": k}) + elapsed = time.perf_counter() - start + _, peak_mem = tracemalloc.get_traced_memory() + tracemalloc.stop() + compile_success_count += 1 + + stdout_str = f.getvalue() + # prefer namespace variable + if "top_k" in namespace: + result = coerce_topk(namespace["top_k"]) + if result is None: + result = parse_topk_from_stdout(stdout_str) + + total_time += elapsed + total_mem += (peak_mem or 0) + + is_exact = (result == ground_truth) + if is_exact: + exact_count += 1 + + is_sorted_ok = bool(result) and is_sorted_topk(result) + if is_sorted_ok: + sorted_ok_count += 1 + + p_at_k = precision_at_k(result or [], ground_truth) + + rows.append([ + file_name, + True, + elapsed, + peak_mem, + (result[:5] if isinstance(result, list) else ""), + (ground_truth[:5]), + is_exact, + is_sorted_ok, + f"{p_at_k:.3f}", + "", + ]) + + except Exception as e: + try: + tracemalloc.stop() + except Exception: + pass + rows.append([file_name, False, "", "", "", ground_truth[:5], f"Runtime/Import Error: {e}", "", "", ""]) + + # === Write CSV (allow caller to pass a directory or explicit file) === + base_dir = Path(folder_path) + base_dir.mkdir(parents=True, exist_ok=True) + name = base_dir.name.lower() + if "baseline" in name: + default_name = "run_results_topk_baseline.csv" + elif "optimized" in name: + default_name = "run_results_topk_optimized.csv" + else: + default_name = "run_results_topk.csv" + + # If csv_path is provided: + # - If it's absolute, use it as-is + # - If it's relative, resolve it under folder_path (base_dir) so results live with the evaluated runs + # - If it's a directory, place the default file name inside it + if csv_path: + csv_path_obj = Path(csv_path) + if not csv_path_obj.is_absolute(): + csv_path_obj = (base_dir / csv_path_obj) + if csv_path_obj.suffix.lower() != ".csv": + csv_path_obj = csv_path_obj / default_name + csv_path_obj.parent.mkdir(parents=True, exist_ok=True) + else: + csv_path_obj = base_dir / default_name + + with open(csv_path_obj, "w", newline="") as fp: + writer = csv.writer(fp) + writer.writerow([ + "File Name", + "Compiled", + "Execution Time (s)", + "Peak Memory (bytes)", + "Reported Top-K (first 5)", + "Ground Truth (first 5)", + "Exact Match", + "Sorted Correctly", + "Precision@K", + "Violation", + ]) + writer.writerows(rows) + + # === Summary files (inside same folder) === + total_runs = len(py_files) + avg_time = (total_time / compile_success_count) if compile_success_count else None + avg_peak_kb = (total_mem / compile_success_count / 1024) if compile_success_count else None + summary = { + "total_runs": total_runs, + "successes": compile_success_count, + "avg_exec_time_s": avg_time, + "avg_peak_mem_kb": avg_peak_kb, + "exact_matches": exact_count, + "sorted_correctly": sorted_ok_count, + "violations": violation_count, + "csv": str(csv_path_obj), + "folder": str(base_dir), + "k": k, + "scale_tokens": scale_tokens, + } + summary_json = csv_path_obj.with_name(csv_path_obj.stem + "_summary.json") + summary_txt = csv_path_obj.with_name(csv_path_obj.stem + "_summary.txt") + with open(summary_json, "w") as fp: + json.dump(summary, fp, indent=2) + with open(summary_txt, "w") as fp: + fp.write("===== SUMMARY =====\n") + fp.write(f"Total evaluated runs: {total_runs}\n") + fp.write(f"Compilation/Execution Success: {compile_success_count}/{total_runs} ({(compile_success_count/total_runs)*100:.2f}%)\n") + fp.write(f"Violations (static scan): {violation_count}\n") + if compile_success_count > 0: + fp.write(f"Average Execution Time (successful): {avg_time:.6f} s\n") + fp.write(f"Average Peak Memory (successful): {avg_peak_kb:.2f} KB\n") + fp.write(f"Exact matches: {exact_count}/{compile_success_count}\n") + fp.write(f"Sorted correctly: {sorted_ok_count}/{compile_success_count}\n") + fp.write(f"CSV written to: {csv_path_obj}\n") + + print("===== SUMMARY =====") + print(f"Total evaluated runs: {total_runs}") + print(f"Compilation/Execution Success: {compile_success_count}/{total_runs} ({(compile_success_count/total_runs)*100:.2f}%)") + print(f"Violations (static scan): {violation_count}") + if compile_success_count > 0: + print(f"Average Execution Time (successful): {avg_time:.6f} s") + print(f"Average Peak Memory (successful): {avg_peak_kb:.2f} KB") + print(f"Exact matches: {exact_count}/{compile_success_count}") + print(f"Sorted correctly: {sorted_ok_count}/{compile_success_count}") + print(f"\nCSV written to: {csv_path_obj}") + print(f"Summary JSON written to: {summary_json}") + print(f"Summary TXT written to: {summary_txt}") + + return summary diff --git a/images/image_optimize_1.png b/images/image_optimize_1.png new file mode 100644 index 0000000000..3bd7e46b0e Binary files /dev/null and b/images/image_optimize_1.png differ diff --git a/images/image_optimize_2.png b/images/image_optimize_2.png new file mode 100644 index 0000000000..799fd4ef75 Binary files /dev/null and b/images/image_optimize_2.png differ diff --git a/images/image_optimize_3.png b/images/image_optimize_3.png new file mode 100644 index 0000000000..c504872864 Binary files /dev/null and b/images/image_optimize_3.png differ diff --git a/images/image_optimize_4.png b/images/image_optimize_4.png new file mode 100644 index 0000000000..8087da16cc Binary files /dev/null and b/images/image_optimize_4.png differ diff --git a/images/image_optimize_5.png b/images/image_optimize_5.png new file mode 100644 index 0000000000..a2921d6b8f Binary files /dev/null and b/images/image_optimize_5.png differ diff --git a/registry.yaml b/registry.yaml index 30f82e34b0..c9c9f4699d 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,18 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: GPT-5 Prompt Migration and Improvement Using the New Optimizer + path: examples/gpt-5/prompt-optimization-cookbook/prompt-optimization-cookbook.ipynb + date: 2025-08-07 + authors: + - rajpathak-openai + - corwin + tags: + - gpt-5 + - responses + - reasoning + - prompt-optimization + - title: GPT-5 prompting guide path: examples/gpt-5/gpt-5_prompting_guide.ipynb date: 2025-08-07 @@ -27,7 +39,7 @@ - gpt-5 - responses - reasoning - + - title: GPT-5 New Params and Tools path: examples/gpt-5/gpt-5_new_params_and_tools.ipynb date: 2025-08-07 @@ -57,7 +69,6 @@ - gpt-oss - open-models - - title: Fine-tuning with gpt-oss and Hugging Face Transformers path: articles/gpt-oss/fine-tune-transfomers.ipynb date: 2025-08-05 @@ -115,7 +126,6 @@ - gpt-oss - harmony - - title: Temporal Agents with Knowledge Graphs path: examples/partners/temporal_agents_with_knowledge_graphs/temporal_agents_with_knowledge_graphs.ipynb date: 2025-07-22