<a href="https://colab.research.google.com/github/pacozaa/LLM-Paper-To-Code/blob/main/DynamicCheatSheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [13]:
# Install dependencies
!pip install azure-ai-inference datasets



In [14]:
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage, AssistantMessage
from azure.core.credentials import AzureKeyCredential
from google.colab import userdata
from datasets import load_dataset

import re
import json
import time

In [15]:
endpoint = "https://models.github.ai/inference"

github_token=userdata.get('GITHUB_TOKEN')

client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(github_token),
)

In [16]:
# @title

generator_template="""
# GENERATOR (Puzzle Solver)

Each puzzle is a Logic Grid Puzzle, also known as a Zebra Puzzle. In each puzzle, we are given N houses (numbered 1 to N from left to right) and M features for each house. There are N distinct values for each feature, and each house must have a unique value for each feature. Given a list of clues, one should be able to deduce a unique correct assignment of values. The logic grid puzzle is a typical Constraint Satisfaction Problem (CSP) and is often used to test humans' logical reasoning abilities in exams such as the Law School Admission Test (LSAT).

Instruction: Your task is to solve the puzzle step by step.

Each task will include:
1. A specific puzzle to solve
2. A cheatsheet containing relevant strategies, patterns, and examples from similar problems

---

## 1. ANALYSIS & STRATEGY

- Carefully analyze both the question and cheatsheet before starting
- Search for and identify any applicable patterns, strategies, or examples within the cheatsheet
- Create a structured approach to solving the problem at hand
- Review and document any limitations in the provided reference materials

## 2. SOLUTION DEVELOPMENT

- Present your solution using clear, logical steps that others can follow and review
- Explain your reasoning and methodology before presenting final conclusions
- Provide detailed explanations for each step of the process
- Check and verify all assumptions and intermediate calculations

## 3. FINAL ANSWER FORMAT

Start answering with `## Reasoning steps:` and end with

## Final answer:
```json

```

N.B. Make sure that the final answer is properly wrapped inside the ```json ``` block.


Example:
Puzzle:
There are 2 houses, numbered 1 to 2 from left to right.
Each house is occupied by a different person.
Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: **Arnold, Eric**
- People own unique car models: **ford f150, tesla model 3**
- The people keep unique animals: **cat, horse**

**Clues**:
1. Eric is directly left of the person who owns a Tesla Model 3.
2. The person who keeps horses is in the first house.

Answer:
## Reasoning steps:

From Clue 1, we know that Eric is to the left of someone, so he must be the owner of House 1 because House 2 is the rightmost house.
Additionally, we know that the person in House 2 must be Arnold, and he owns a Tesla Model 3. Thus, Eric owns a Ford F150.
From Clue 2, we know that Eric keeps horses in House 1, which means the other house must keep cats. Finally, we arrive at the unique solution to this puzzle.
The solution is presented in table format:

## Final answer:

```json
{
   "header":[
      "Houses",
      "Name",
      "CarModel",
      "Animal"
   ],
   "rows":[
      [
         "1",
         "Eric",
         "ford f150",
         "horse"
      ],
      [
         "2",
         "Arnold",
         "tesla model 3",
         "cat"
      ]
   ]
}
```

-----

CHEATSHEET:
<cheatsheet>
[[CHEATSHEET]]
</cheatsheet>

-----
-----

Now it is time to solve the following question.

CURRENT INPUT:

Puzzle:
[[QUESTION]]

Answer:
## Reasoning steps:
"""
cheatsheet_template="""
# CHEATSHEET REFERENCE CURATOR

#### 1. Purpose and Goals
As the Cheatsheet Curator, you are tasked with creating a continuously evolving reference designed to help solve a Logic Grid puzzle. The cheatsheet's purpose is to consolidate
- Reusable Strategies
- Effective Steps to have when solving this type of puzzle
- Patterns and Solutions
into a single, well-structured resource.

- The cheatsheet should include quick, accurate, reliable, and practical solutions to a range of technical and creative challenges.
- After seeing each input, you should improve the content of the cheatsheet, synthesizing lessons, insights, tricks, and errors learned from past problems and adapting to new challenges.

---

#### 2. Core Responsibilities
As the Cheatsheet Curator, you should:
   - Curate and preserve knowledge: Select and document only the most relevant, most useful, and most actionable solutions and strategies, while preserving old content of the cheatsheet.
   - Maintain accuracy: Ensure that all entries in the cheatsheet are accurate, clear, and well-contextualized.
   - Refine and update content: Continuously update and improve the content of the cheatsheet by incorporating new insights and solutions, removing repetitions or trivial information, and adding efficient solutions.
   - Ensure practicality and comprehensiveness: Provide critical and informative examples, as well as efficient code snippets and actionable guidelines.

Before updating the cheatsheet, however, you should first assess the correctness of the provided solution and strategically incorporate code blocks, insights, and solutions into the new cheatsheet. Always aim to preserve and keep correct, useful, and illustrative solutions and strategies for future cheatsheets.

---

#### 3. Principles and Best Practices
1. Accuracy and Relevance:
   - Only include solutions and strategies that have been tested and proven effective.
   - Clearly state any assumptions, limitations, or dependencies (e.g., specific Python libraries or solution hacks).
   - For computational problems, encourage Python usage for more accurate calculations.

2. Iterative Refinement:
   - Continuously improve the cheatsheet by synthesizing both old and new solutions, refining explanations, and removing redundancies.
   - Rather than deleting old content and writing new content each time, consider ways to maintain table content and synthesize information from multiple solutions.
   - After solving a new problem, document any reusable codes, algorithms, strategies, edge cases, or optimization techniques.

3. Clarity and Usability:
   - Write concise, actionable, well-structured entries.
   - Focus on key insights or strategies that make solutions correct and effective.

4. Reusability:
   - Provide clear solutions, pseudocodes, and meta strategies that are easily adaptable to different contexts.
   - Avoid trivial content; focus on non-obvious, critical solution details and approaches.
   - Make sure to add as many examples as you can in the cheatsheet.
   - Any useful, efficient, generalizable, and illustrative solutions to the previous problems should be included in the cheatsheet.

---

#### 4. Cheatsheet Structure
The cheatsheet can be divided into the following sections:

1. Solutions, Implementation Patterns, and Code Snippets:
   - Document reusable code snippets, algorithms, and solution templates.
   - Include descriptions, annotated examples, and potential pitfalls, albeit succinctly.

2. [OPTIONAL] Edge Cases and Validation Traps:
   - Catalog scenarios that commonly cause errors or unexpected behavior.
   - Provide checks, validations, or alternative approaches to handle them.

3. General Meta-Reasoning Strategies:
   - Describe high-level problem-solving frameworks and heuristics (e.g., use Python to solve heuristic problems; in bipartite graphs, max matching = min vertex cover, etc.)
   - Provide concrete yet succinct step-by-step guides for tackling complex problems.

4. Implement a Usage Counter
   - Each entry must include a usage count: Increase the count every time a strategy is successfully used in problem-solving.
   - Use the count to prioritize frequently used solutions over rarely applied ones.

---

#### 5. Formatting Guidelines
Use the following structure for each memory item:

```
## Memory Item 1

### Description
[Briefly describe the problem context, purpose, and key aspects of the solution.] (Reference: Q1, Q2, Q6, etc.)

### Example
[Provide a well-documented code snippet, worked-out solution, or efficient strategy.]

---



## Memory Item 2
[...]

---

[...]

## Memory Item N
[...]

---

```

- Grouping: Organize entries into logical sections and subsections.
- Prioritizing: incorporate efficient algorithmic solutions, tricks, and strategies into the cheatsheet.
- Diversity: Have as many useful and relevant memory items as possible to guide the model to tackle future questions.

N.B. Keep in mind that once the cheatsheet is updated, any previous content not directly included will be lost and cannot be retrieved. Therefore, make sure to explicitly copy any (or all) relevant information from the previous cheatsheet to the new cheatsheet!!!

---

#### 6. Cheatsheet Template
Use the following format for creating and updating the cheatsheet:

NEW CHEATSHEET:
```
<cheatsheet>

## Version: [Version Number]

## Memory Item 1
[...]

---

## Memory Item 2
[...]

---

</cheatsheet>
```

N.B. Make sure that all information related to the cheatsheet is wrapped inside the <cheatsheet> block. The cheatsheet can be as long as circa 2000-2500 words.

-----
-----

## PREVIOUS CHEATSHEET

[[PREVIOUS_CHEATSHEET]]

-----
-----

## CURRENT INPUT

[[QUESTION]]

-----
-----

## MODEL ANSWER TO THE CURRENT INPUT

[[MODEL_ANSWER]]
"""



In [17]:
# @title
def get_generator_prompt(input_txt, generator_cheatsheet_content):
    return generator_template.replace("[[QUESTION]]", input_txt).replace("[[CHEATSHEET]]", generator_cheatsheet_content)

def get_curator_prompt(input_txt, generator_output, current_cheatsheet):
    return cheatsheet_template.replace("[[QUESTION]]", input_txt).replace("[[MODEL_ANSWER]]", generator_output).replace("[[PREVIOUS_CHEATSHEET]]", current_cheatsheet)

def llm_generator(input, cheatsheet, temperature, top_p, model_name):
    llm_input = get_generator_prompt(input, cheatsheet)
    generate_messages = [
        UserMessage(llm_input)
    ]
    response = client.complete(
        messages=generate_messages,
        temperature=temperature,
        top_p=top_p,
        model=model_name,
    )
    answer = response.choices[0].message.content
    return llm_input,answer

def llm_curator(input, output, cheatsheet, temperature, top_p, model_name):
    llm_input = get_curator_prompt(input, output, cheatsheet)
    curator_messages = [
        UserMessage(llm_input)
    ]
    response = client.complete(
        messages=curator_messages,
        temperature=temperature,
        top_p=top_p,
        model=model_name,
    )
    answer = response.choices[0].message.content
    return llm_input,answer
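
The prompt builders above fill placeholders with chained `str.replace` calls, so text inserted by an earlier call is scanned again by the later ones. The sketch below shows one possible alternative, a single-pass substitution with `re.sub`. The helper name `fill_template` and the demo template are assumptions made for illustration; they are not part of the notebook.

```python
import re

def fill_template(template: str, values: dict) -> str:
    """Replace every [[KEY]] placeholder in one pass.

    Text inserted for one placeholder is never rescanned, so a puzzle or a
    model answer that happens to contain "[[CHEATSHEET]]" is left untouched.
    """
    def substitute(match):
        key = match.group(1)
        # Unknown placeholders are kept as they are instead of raising.
        return values.get(key, match.group(0))

    return re.sub(r"\[\[([A-Z_]+)\]\]", substitute, template)

# Hypothetical miniature template; the real ones are much longer.
demo_template = "CHEATSHEET:\n[[CHEATSHEET]]\n\nPuzzle:\n[[QUESTION]]"

filled = fill_template(demo_template, {
    "QUESTION": "Who lives in house 1? [[CHEATSHEET]]",
    "CHEATSHEET": "(empty)",
})
print(filled)
```

With chained `replace` calls in the order used by `get_generator_prompt`, the literal `[[CHEATSHEET]]` inside the question would also be replaced. In the single-pass version it survives unchanged.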

In [18]:


def extract_cheatsheet(
    response: str,
    old_cheatsheet: str,
) -> str:
    """
    Extracts the cheatsheet from the model response.

    Arguments:
        response : str : The response from the model.
        old_cheatsheet : str : The old cheatsheet to return if the new one is not found.

    Returns:
        str : The extracted cheatsheet (if not found, returns the old cheatsheet).
    """
    response = response.strip()
    # <cheatsheet> (content) </cheatsheet>
    if "<cheatsheet>" in response:
        try:
            txt = response.split("<cheatsheet>")[1].strip()
            txt = txt.split("</cheatsheet>")[0].strip()
            return txt
        except Exception:
            print("❌ Error: Cannot extract cheatsheet")
            return old_cheatsheet
    else:
        return old_cheatsheet
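
To make the fallback behaviour concrete, here is a small sketch that runs the same split-based logic on three kinds of curator response: complete, missing the tags, and cut off before the closing tag. `extract_cheatsheet_demo` is a condensed copy written for this illustration, and the sample responses are invented.

```python
def extract_cheatsheet_demo(response: str, old_cheatsheet: str) -> str:
    """Condensed copy of the split-based extraction, for illustration only."""
    if "<cheatsheet>" not in response:
        return old_cheatsheet
    body = response.split("<cheatsheet>")[1]
    return body.split("</cheatsheet>")[0].strip()

old = "## Version: 1"

# Invented responses covering the three cases.
complete = "NEW CHEATSHEET:\n<cheatsheet>\n## Version: 2\n</cheatsheet>"
missing = "I could not update the cheatsheet."
truncated = "<cheatsheet>\n## Version: 2\n## Memory Item 1\nStart with"

print(extract_cheatsheet_demo(complete, old))   # inner text of the block
print(extract_cheatsheet_demo(missing, old))    # falls back to the old cheatsheet
print(extract_cheatsheet_demo(truncated, old))  # keeps the partial text
```

The third case is worth noting: when a response is cut off before `</cheatsheet>`, the partial text replaces the previous cheatsheet rather than triggering the fallback.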

In [19]:
# @title

def extract_final_answer_regex(answer):
    match = re.search(r'## Final answer:\s*(.*)', answer, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return None

def extract_json_from_string(input_string):
    """
    Extract JSON from a string that contains ```json``` code blocks.

    Args:
        input_string (str): Input string that may contain JSON code blocks

    Returns:
        list: List of parsed JSON objects/dictionaries, or empty list if none found
    """
    # Regular expression pattern to match JSON code blocks
    pattern = r'```json\n(.*?)\n```'

    # Find all matches in the input string
    matches = re.findall(pattern, input_string, re.DOTALL)

    json_objects = []

    for match in matches:
        try:
            # Parse the JSON string into a Python object
            json_obj = json.loads(match.strip())
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
            print(f"Problematic JSON string: {match}")
            continue

    return json_objects


def find_name_column_index(answer):
    """Helper function to find the name column index in the header"""
    if 'header' in answer:
        try:
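            # Match the name column case-insensitively (headers may read "Name" or "name")
            headers = [str(h).strip().lower() for h in answer['header']]
            return headers.index('name')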
        except (ValueError, IndexError):
            return None
    return None

def normalize_answer(answer, for_comparison=False):
    """
    Unified function to normalize answers for either counting or comparison.

    Args:
        answer: The answer to normalize
        for_comparison: If True, returns normalized rows for comparison.
                       If False, returns JSON string for counting duplicates.
    """
    if isinstance(answer, dict) and 'rows' in answer:
        name_idx = find_name_column_index(answer)

        if for_comparison:
            # For comparison: normalize all values and sort rows
            normalized_rows = []
            for row in answer['rows']:
                # Lowercase and strip spaces so casing and spacing differences are ignored
                normalized_rows.append(tuple(str(value).replace(" ", "").lower() for value in row))

            # Sort whole rows (not just the name column) so row order does not matter
            normalized_rows.sort()
            return normalized_rows
        else:
            # For counting: create dictionary keyed by name
            rows_by_name = {}
            for row in answer['rows']:
                if name_idx is not None:
                    name = row[name_idx]
                    rows_by_name[name] = tuple(row)  # Convert to tuple for hashability
                else:
                    # If no name column found, use second column (common pattern) or whole row
                    if len(row) > 1:
                        rows_by_name[row[1]] = tuple(row)
                    else:
                        rows_by_name[str(row)] = tuple(row)
            return json.dumps(rows_by_name, sort_keys=True)
    else:
        if for_comparison:
            return str(answer).replace(" ", "").lower()
        else:
            return str(answer)
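
The helpers in this cell are used together as a three-step pipeline: cut the text after `## Final answer:`, parse the fenced JSON block, then normalize the rows before comparing. The sketch below walks through those steps on an invented model output. `normalize_rows` is a simplified stand-in for `normalize_answer(..., for_comparison=True)`, and `FENCE` is only a convenience for building the triple-backtick marker.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built here so the sample embeds cleanly

# Invented model output in the format the generator prompt asks for.
sample_output = (
    "## Reasoning steps:\nEric must be in house 1.\n"
    "## Final answer:\n"
    f"{FENCE}json\n"
    '{"header": ["Houses", "Name"], "rows": [["1", "Eric"], ["2", "Arnold"]]}\n'
    f"{FENCE}\n"
)

# Step 1: keep everything after the "## Final answer:" marker.
tail = re.search(
    r"## Final answer:\s*(.*)", sample_output, re.DOTALL | re.IGNORECASE
).group(1)

# Step 2: parse the first fenced JSON block.
payload = re.search(FENCE + r"json\n(.*?)\n" + FENCE, tail, re.DOTALL).group(1)
table = json.loads(payload)

# Step 3: lowercase, drop spaces, and sort rows so casing and order are ignored.
def normalize_rows(rows):
    return sorted(tuple(str(v).replace(" ", "").lower() for v in row) for row in rows)

# Same assignment as the sample, with different casing and row order.
reference_rows = [["2", "arnold"], ["1", "Eric"]]
print(normalize_rows(table["rows"]) == normalize_rows(reference_rows))
```

Because each row carries its house number, sorting whole rows makes the comparison order-independent without losing which person lives where.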


In [20]:
def load_puzzle(ds_size):
  dataset = load_dataset("allenai/ZebraLogicBench-private", "grid_mode", split="test")
  grid_size = "3*3"
  filtered_dataset = dataset.filter(lambda x: x["size"] == grid_size).select(range(ds_size)).shuffle(seed=42)
  return filtered_dataset

In [21]:
def main(ds_size=2,enable_cheatsheet=True):
  is_correct_list = []
  filtered_dataset=load_puzzle(ds_size)
  model_name = "openai/gpt-4.1-nano"  # alternatives: openai/gpt-4.1, phi-4
  temperature=0
  top_p=0.1
  cheatsheet = ""
  cheatsheet_list = [cheatsheet]
  full_question_list = []
  generator_raw_output_list = []
  cheatsheet_raw_output_list = []
  generator_input_list = []
  curator_input_list=[]

  # Get total number of items for progress tracking
  total_items = len(filtered_dataset)
  print(f"Starting processing of {total_items} items...")

  for index, item in enumerate(filtered_dataset):
      # Calculate progress percentage
      progress = (index + 1) / total_items * 100
      print(f"\n--- Processing item {index + 1}/{total_items} ({progress:.1f}%) ---")

      puzzle_id = item["id"]
      puzzle_input = item["puzzle"]

      solution = item["solution"]
      # Reuse `solution` so the f-string has no nested double quotes (a SyntaxError before Python 3.12)
      full_question=f"{puzzle_input}\n## Headers\n\"header\":{solution['header']}"
      solution_normalized = normalize_answer(solution, for_comparison=True)

      generator_input, generator_output = llm_generator(full_question, cheatsheet,temperature, top_p, model_name)

      generator_input_list.append(generator_input)


      full_question_list.append(full_question)
      generator_raw_output_list.append(generator_output)
      generator_output_extracted = extract_final_answer_regex(generator_output)

      if generator_output_extracted is None:
          print(f"❌ Warning: Could not extract answer from sample {puzzle_id} (Index: {index})")
          is_correct_list.append({
              "id": puzzle_id,
              "is_correct": False,
              "is_extracted": False,
              "index": index,
              "generator_output":generator_output
          })
      else:
          clean_generator_output_extracted = generator_output_extracted.lower().replace(" ", "")
          json_object_generator_output_extracted = extract_json_from_string(clean_generator_output_extracted)

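          # Treat a missing or unparsable JSON block as a wrong answer instead of raising IndexError
          answer_normalized = (
              normalize_answer(json_object_generator_output_extracted[0], for_comparison=True)
              if json_object_generator_output_extracted
              else None
          )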

          is_correct = answer_normalized == solution_normalized

          if is_correct:
              print(f"✅ SUCCESS: The {puzzle_id} answer is correct!")
              is_correct_list.append({
                  "id": puzzle_id,
                  "is_correct": True,
                  "is_extracted": True,
                  "index": index,
                  "generator_output":generator_output,
                  "answer_normalized":answer_normalized,
                  "solution_normalized":solution_normalized
              })
          else:
              print(f"❌ The {puzzle_id} answer does NOT match the correct answer")
              is_correct_list.append({
                  "id": puzzle_id,
                  "is_correct": False,
                  "is_extracted": True,
                  "index": index,
                  "generator_output":generator_output,
                  "answer_normalized":answer_normalized,
                  "solution_normalized":solution_normalized
              })
              # print(f"\nans:{answer_normalized}\n")
              # print(f"sol:{solution_normalized}\n")

      if enable_cheatsheet and generator_output_extracted is not None:

        curator_input, cheatsheet_output = llm_curator(full_question, generator_output, cheatsheet,temperature, top_p, model_name)

        curator_input_list.append(curator_input)
        cheatsheet_raw_output_list.append(cheatsheet_output)

        cheatsheet = extract_cheatsheet(cheatsheet_output, cheatsheet)
        cheatsheet_list.append(cheatsheet)
      # print(f"✅ Processed item: {item['id']} (Index: {index})")

      # Optional: Add a break here for testing with a smaller subset
      # if index >= 4:  # Test with first 5 items
      #     break

  # Calculate final statistics
  correct_count = sum(1 for item in is_correct_list if item["is_correct"])
  accuracy = correct_count / len(is_correct_list) * 100 if is_correct_list else 0

  print(f"\n{'='*50}")
  print(f"✅ Iteration complete! Processed {len(is_correct_list)} items")
  print(f"📊 Results: {correct_count}/{len(is_correct_list)} correct ({accuracy:.1f}% accuracy)")
  print(f"{'='*50}")
  return cheatsheet_list, full_question_list, generator_raw_output_list, cheatsheet_raw_output_list, generator_input_list, curator_input_list
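
Stripped of logging and scoring, the loop in `main` alternates two steps: solve the current puzzle with the current cheatsheet, then let the curator rewrite the cheatsheet before the next puzzle. A minimal sketch of that control flow follows. `toy_solve` and `toy_curate` are hypothetical stand-ins for the two LLM calls, so the block runs without a model.

```python
def run_with_memory(questions, solve, curate, memory=""):
    """Solve questions in order, letting `curate` rewrite the memory after each one."""
    history = [memory]
    answers = []
    for question in questions:
        answer = solve(question, memory)           # generator step
        memory = curate(question, answer, memory)  # curator step
        answers.append(answer)
        history.append(memory)
    return answers, history

# Hypothetical stand-ins for llm_generator and llm_curator.
def toy_solve(question, memory):
    return f"answer to {question} using {len(memory.splitlines())} notes"

def toy_curate(question, answer, memory):
    return (memory + "\n" if memory else "") + f"note from {question}"

answers, history = run_with_memory(["Q1", "Q2", "Q3"], toy_solve, toy_curate)
print(answers[-1])   # the third question sees the notes from the first two
print(len(history))  # initial empty memory plus one version per question
```

Each puzzle only sees notes written after earlier puzzles, which is why the order of the dataset affects the result when the cheatsheet is enabled.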

In [22]:
ds_size=30

In [23]:
cheatsheets, questions, gen_outputs, cheat_outputs, gen_inputs, curator_inputs = main(ds_size=ds_size,enable_cheatsheet=True)

Starting processing of 30 items...

--- Processing item 1/30 (3.3%) ---
✅ SUCCESS: The lgp-test-3x3-6 answer is correct!

--- Processing item 2/30 (6.7%) ---
✅ SUCCESS: The lgp-test-3x3-21 answer is correct!

--- Processing item 3/30 (10.0%) ---
❌ The lgp-test-3x3-9 answer does NOT match the correct answer

--- Processing item 4/30 (13.3%) ---
✅ SUCCESS: The lgp-test-3x3-14 answer is correct!

--- Processing item 5/30 (16.7%) ---
❌ The lgp-test-3x3-15 answer does NOT match the correct answer

--- Processing item 6/30 (20.0%) ---

--- Processing item 7/30 (23.3%) ---
❌ The lgp-test-3x3-22 answer does NOT match the correct answer

--- Processing item 8/30 (26.7%) ---
❌ The lgp-test-3x3-27 answer does NOT match the correct answer

--- Processing item 9/30 (30.0%) ---
✅ SUCCESS: The lgp-test-3x3-8 answer is correct!

--- Processing item 10/30 (33.3%) ---
✅ SUCCESS: The lgp-test-3x3-18 answer is correct!

--- Processing item 11/30 (36.7%) ---
✅ SUCCESS: The lgp-test-3x3-

In [24]:
cheatsheets_disabled, questions_disabled, gen_outputs_disabled, cheat_outputs_disabled, gen_inputs_disabled, curator_inputs_disabled = main(ds_size=ds_size,enable_cheatsheet=False)

Starting processing of 30 items...

--- Processing item 1/30 (3.3%) ---
✅ SUCCESS: The lgp-test-3x3-6 answer is correct!

--- Processing item 2/30 (6.7%) ---
❌ The lgp-test-3x3-21 answer does NOT match the correct answer

--- Processing item 3/30 (10.0%) ---
✅ SUCCESS: The lgp-test-3x3-9 answer is correct!

--- Processing item 4/30 (13.3%) ---
❌ The lgp-test-3x3-14 answer does NOT match the correct answer

--- Processing item 5/30 (16.7%) ---
✅ SUCCESS: The lgp-test-3x3-15 answer is correct!

--- Processing item 6/30 (20.0%) ---
❌ The lgp-test-3x3-0 answer does NOT match the correct answer

--- Processing item 7/30 (23.3%) ---
❌ The lgp-test-3x3-22 answer does NOT match the correct answer

--- Processing item 8/30 (26.7%) ---
✅ SUCCESS: The lgp-test-3x3-27 answer is correct!

--- Processing item 9/30 (30.0%) ---
✅ SUCCESS: The lgp-test-3x3-8 answer is correct!

--- Processing item 10/30 (33.3%) ---
❌ The lgp-test-3x3-18 answer does NOT match the correct answer

--- 