<a href="https://colab.research.google.com/github/pacozaa/Self-Consistency-Experiment/blob/main/Self_Consistency_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses the free tier of GitHub Models.

Check the billing conditions and rate limits here. A free tier should be available to anyone with a GitHub account:

https://docs.github.com/en/github-models/use-github-models/prototyping-with-ai-models#rate-limits

In [352]:
# Install the dependency
!pip install azure-ai-inference



In [353]:
import os
import time
import re
import json
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage, AssistantMessage
from azure.core.credentials import AzureKeyCredential
from collections import Counter
from google.colab import userdata

We are using one of the puzzles from **ZebraLogic Bench** for this experiment.

- Dataset:
https://huggingface.co/datasets/allenai/ZebraLogicBench-private/viewer/grid_mode/test?row=0&views%5B%5D=grid_mode
- Dataset Article:
https://huggingface.co/blog/yuchenlin/zebra-logic

- Find the example row with this SQL query in the Hugging Face dataset studio:
```sql
SELECT *
FROM grid_mode
WHERE id = 'lgp-test-3x3-24'
LIMIT 10;
```
- or use this link: https://huggingface.co/datasets/allenai/ZebraLogicBench-private/viewer/grid_mode/test?f%5Bsize%5D%5Bvalue%5D=%273*3%27&row=41&views%5B%5D=grid_mode&sql=SELECT+*+%0A++FROM+grid_mode+%0A++WHERE+id+%3D+%27lgp-test-3x3-24%27%0A++LIMIT+10%3B


# Start Coding

In [354]:

num_samples = 20  # Number of samples; large enough to see diversity in the answers
model_name = "phi-4"  # A deliberately less capable model, so the answers vary

# LLM parameters
temperature = 1.0  # 1.0 samples from the model's unscaled distribution, which keeps answers diverse
top_p = 1.0  # 1.0 disables nucleus truncation, so every token stays eligible

github_token=userdata.get('GITHUB_TOKEN') #GitHub Token
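
The effect of `temperature` and `top_p` on answer diversity can be illustrated with a toy distribution. The following is a minimal sketch of the usual textbook definitions, not the code the model service actually runs; the helper names `softmax_with_temperature` and `top_p_filter` and the example logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=1.0):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

toy_logits = [2.0, 1.0, 0.0]  # Made-up scores for three candidate tokens
print(softmax_with_temperature(toy_logits, temperature=0.5))  # Sharper
print(softmax_with_temperature(toy_logits, temperature=1.0))  # Unscaled
print(top_p_filter(softmax_with_temperature(toy_logits), top_p=0.5))
```

With `temperature=1.0` and `top_p=1.0`, as configured above, neither step reshapes the distribution, so less likely reasoning paths still get sampled.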

In [355]:
# Prompt
# CoT Prompt
system_prompt="""
 Each example/input/question is a Logic Grid Puzzle, also known as a Zebra Puzzle. In each puzzle, we are given N houses (numbered 1 to N from left to right) and M features for each house. There are N distinct values for each feature, and each house must have a unique value for each feature. Given a list of clues, one should be able to deduce a unique correct assignment of values. The logic grid puzzle is a typical Constraint Satisfaction Problem (CSP) and is often used to test humans' logical reasoning abilities in exams such as the Law School Admission Test (LSAT).

You always solve the problem step by step.
Start your answer with `## Reasoning steps:` and end with

## Final answer:
```json

```
"""
# Example Input Prompt
user_prompt="""
There are 2 houses, numbered 1 to 2 from left to right.
Each house is occupied by a different person.
Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: **Arnold, Eric**
- People own unique car models: **ford f150, tesla model 3**
- The people keep unique animals: **cat, horse**

**Clues**:
1. Eric is directly left of the person who owns a Tesla Model 3.
2. The person who keeps horses is in the first house.
"""
# Example Output Prompt
assistant_message="""
## Reasoning steps:

From Clue 1, we know that Eric is to the left of someone, so he must be the owner of House 1 because House 2 is the rightmost house.
Additionally, we know that the person in House 2 must be Arnold, and he owns a Tesla Model 3. Thus, Eric owns a Ford F150.
From Clue 2, we know that Eric keeps horses in House 1, which means the other house must keep cats. Finally, we arrive at the unique solution to this puzzle.
The solution is presented in table format:

## Final answer:

```json
{
   "header":[
      "Houses",
      "Name",
      "CarModel",
      "Animal"
   ],
   "rows":[
      [
         "1",
         "Eric",
         "ford f150",
         "horse"
      ],
      [
         "2",
         "Arnold",
         "tesla model 3",
         "cat"
      ]
   ]
}
```
"""
# lgp-test-2x2-33
# lgp-test-6x5-2
# lgp-test-2x4-33
# lgp-test-6x6-5
# lgp-test-3x3-24
question="""
Question:
There are 3 houses, numbered 1 to 3 from left to right, as seen from across the street. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: `Peter`, `Eric`, `Arnold`
- Each person has a favorite color: `red`, `white`, `yellow`
- Each mother is accompanied by their child: `Fred`, `Meredith`, `Bella`

## Clues:
1. Arnold is the person whose favorite color is red.
2. The person's child is named Fred is somewhere to the left of Eric.
3. The person whose favorite color is red is in the second house.
4. The person's child is named Bella is in the first house.
5. The person who loves white is the person's child is named Meredith.

## Headers
"header": [
"House",
"Name",
"Color",
"Children"
]
"""
correct_answer="""
```json
{
  "header": [
    "House",
    "Name",
    "Color",
    "Children"
  ],
  "rows": [
    ["1", "Peter", "yellow", "Bella"],
    ["2", "Arnold", "red", "Fred"],
    ["3", "Eric", "white", "Meredith"]
  ]
}
```
"""
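
Because the puzzle is a small constraint satisfaction problem, the reference answer can be checked independently by brute force. The sketch below enumerates every assignment and keeps those that satisfy all five clues. The function name `solve_puzzle` is hypothetical, and the encoding of each clue is one reading of the clue text (for example, "somewhere to the left" is taken to mean a strictly smaller house number).

```python
from itertools import permutations

NAMES = ("Peter", "Eric", "Arnold")
COLORS = ("red", "white", "yellow")
CHILDREN = ("Fred", "Meredith", "Bella")

def solve_puzzle():
    """Return every assignment that satisfies the five clues (index 0 is house 1)."""
    solutions = []
    for names in permutations(NAMES):
        for colors in permutations(COLORS):
            for kids in permutations(CHILDREN):
                # Clue 1: Arnold's favorite color is red
                if colors[names.index("Arnold")] != "red":
                    continue
                # Clue 2: Fred's parent is somewhere to the left of Eric
                if not kids.index("Fred") < names.index("Eric"):
                    continue
                # Clue 3: red is in the second house
                if colors.index("red") != 1:
                    continue
                # Clue 4: Bella's parent is in the first house
                if kids.index("Bella") != 0:
                    continue
                # Clue 5: the person who loves white is Meredith's parent
                if kids[colors.index("white")] != "Meredith":
                    continue
                solutions.append(
                    [[str(i + 1), names[i], colors[i], kids[i]] for i in range(3)]
                )
    return solutions

for rows in solve_puzzle():
    print(rows)
```

Under this reading of the clues, exactly one assignment survives, and it has the same rows as `correct_answer`.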

In [356]:
# Define your messages once
messages = [
    SystemMessage(system_prompt),
    UserMessage(user_prompt),
    AssistantMessage(assistant_message),
    UserMessage(question)
]

In [357]:

def extract_final_answer_regex(answer):
    match = re.search(r'## Final answer:\s*(.*)', answer, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return None

def extract_json_from_string(input_string):
    """
    Extract JSON from a string that contains ```json``` code blocks.

    Args:
        input_string (str): Input string that may contain JSON code blocks

    Returns:
        list: List of parsed JSON objects/dictionaries, or empty list if none found
    """
    # Regular expression pattern to match JSON code blocks
    pattern = r'```json\n(.*?)\n```'

    # Find all matches in the input string
    matches = re.findall(pattern, input_string, re.DOTALL)

    json_objects = []

    for match in matches:
        try:
            # Parse the JSON string into a Python object
            json_obj = json.loads(match.strip())
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
            print(f"Problematic JSON string: {match}")
            continue

    return json_objects
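
The two helpers above can also be combined into a single step. The following is a sketch only: `parse_final_answer` is a hypothetical name, it returns just the first JSON object, and it uses a slightly more tolerant fence pattern (`\s*` instead of `\n`) than the cell above. The sample response is made up.

```python
import json
import re

def parse_final_answer(text):
    """Hypothetical helper: return the first JSON object after '## Final answer:', or None."""
    section = re.search(r'## Final answer:\s*(.*)', text, re.DOTALL | re.IGNORECASE)
    if section is None:
        return None
    # Tolerate any whitespace between the fence markers and the JSON payload
    block = re.search(r'```json\s*(.*?)\s*```', section.group(1), re.DOTALL)
    if block is None:
        return None
    try:
        return json.loads(block.group(1))
    except json.JSONDecodeError:
        return None

# Made-up response text in the format requested by the system prompt
sample_response = (
    "## Reasoning steps:\n"
    "Eric must be in house 1.\n"
    "## Final answer:\n"
    "```json\n"
    '{"rows": [["1", "Eric"]]}\n'
    "```"
)
print(parse_final_answer(sample_response))
```

Returning `None` for every failure case lets the caller skip a bad sample with one check.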


In [358]:
endpoint = "https://models.github.ai/inference"

token = github_token

client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(token),
)

In [359]:

responses = []
final_answers = []  # Collect all final answers
try:
  for i in range(num_samples):
      response = client.complete(
          messages=messages,
          temperature=temperature,
          top_p=top_p,
          model=model_name,
      )
      responses.append(response.choices[0])
      answer = response.choices[0].message.content
      final_ans = extract_final_answer_regex(answer)
      if final_ans is None:
        # Skip samples without a "## Final answer:" section instead of crashing on None
        print(f"Warning: Could not extract answer from sample {i+1}")
        continue
      clean_ans = final_ans.lower().replace(" ", "")
      json_object = extract_json_from_string(clean_ans)
      final_answers.append(json_object)  # Store the final answer

      print(f"\nSample {i+1}: {json_object}")
      print(f"{'='*50}")
      # Optional: pause 0.5 s between requests to stay under the rate limit
      # time.sleep(0.5)
except Exception as e:
    print(f"Error processing sample {i+1}: {e}")



Sample 1: [{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'white', 'meredith'], ['2', 'arnold', 'red', 'bella'], ['3', 'eric', 'yellow', 'fred']]}]

Sample 2: [{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'arnold', 'red', 'fred'], ['3', 'eric', 'white', 'meredith']]}]

Sample 3: [{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'fred'], ['2', 'arnold', 'red', 'bella'], ['3', 'eric', 'white', 'meredith']]}]

Sample 4: [{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'arnold', 'red', 'meredith'], ['3', 'eric', 'white', 'fred']]}]

Sample 5: [{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'arnold', 'red', 'fred'], ['3', 'eric', 'white', 'meredith']]}]

Sample 6: [{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'a
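
Since the free tier is rate limited, a failed request in the loop above ends the whole run. One common way to handle this is to retry with exponential backoff. The sketch below is generic and makes no assumption about which exception the client raises: `call_with_backoff` is a hypothetical helper, and the retry count and delays are illustrative values, not recommendations from GitHub.

```python
import random
import time

def call_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Hypothetical helper: call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            # Wait base_delay * 2^attempt seconds, plus a little random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

In the sampling loop, the request could then be wrapped as `call_with_backoff(lambda: client.complete(messages=messages, temperature=temperature, top_p=top_p, model=model_name))`.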

In [360]:
def find_name_column_index(answer):
    """Helper function to find the name column index in the header"""
    if 'header' in answer:
        try:
            return answer['header'].index('name')
        except (ValueError, IndexError):
            return None
    return None

def normalize_answer(answer, for_comparison=False):
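    """
    Unified function to normalize answers for either counting or comparison.

    Args:
        answer: The answer to normalize
        for_comparison: If True, returns normalized rows for comparison.
                       If False, returns JSON string for counting duplicates.

    Note: extract_json_from_string returns a list of objects, so the answers
    collected in this notebook skip the dict branch and use the string
    fallback at the bottom of this function.
    """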
    if isinstance(answer, dict) and 'rows' in answer:
        name_idx = find_name_column_index(answer)

        if for_comparison:
            # For comparison: strip spaces and lowercase every value
            normalized_rows = [
                tuple(str(value).replace(" ", "").lower() for value in row)
                for row in answer['rows']
            ]

            # Sort rows so the comparison does not depend on row order
            normalized_rows.sort()
            return normalized_rows
        else:
            # For counting: create dictionary keyed by name
            rows_by_name = {}
            for row in answer['rows']:
                if name_idx is not None:
                    name = row[name_idx]
                    rows_by_name[name] = tuple(row)  # Convert to tuple for hashability
                else:
                    # If no name column found, use second column (common pattern) or whole row
                    if len(row) > 1:
                        rows_by_name[row[1]] = tuple(row)
                    else:
                        rows_by_name[str(row)] = tuple(row)
            return json.dumps(rows_by_name, sort_keys=True)
    else:
        if for_comparison:
            return str(answer).replace(" ", "").lower()
        else:
            return str(answer)



In [361]:
if final_answers:

    # Convert answers to normalized format for counting
    normalized_answers = [normalize_answer(ans, for_comparison=False) for ans in final_answers]

    answer_counts = Counter(normalized_answers)
    most_common_string, count = answer_counts.most_common(1)[0]

    # Convert back to original type for display
    try:
        most_common_normalized = json.loads(most_common_string)
        # Find the original answer that matches this normalized version
        for ans in final_answers:
            if normalize_answer(ans, for_comparison=False) == most_common_string:
                most_common_answer = ans
                break
        else:
            most_common_answer = most_common_normalized
    except (json.JSONDecodeError, TypeError):
        most_common_answer = most_common_string

    print(f"\n{'='*50}")
    print(f"MOST COMMON ANSWER (appeared {count}/{len(final_answers)} times):")
    print(f"{most_common_answer}")
    print(f"{'='*50}\n")

    # Optional: print all answer frequencies
    print("All answer frequencies:")
    for ans_str, freq in answer_counts.most_common():
        # Find the original answer that matches this normalized version
        original_ans = None
        for ans in final_answers:
            if normalize_answer(ans, for_comparison=False) == ans_str:
                original_ans = ans
                break

        if original_ans:
            print(f"{original_ans} (appeared {freq} times)")
        else:
            try:
                ans_display = json.loads(ans_str)
                print(f"{ans_display} (appeared {freq} times)")
            except (json.JSONDecodeError, TypeError):
                print(f"{ans_str} (appeared {freq} times)")



MOST COMMON ANSWER (appeared 12/20 times):
[{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'arnold', 'red', 'fred'], ['3', 'eric', 'white', 'meredith']]}]

All answer frequencies:
[{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'arnold', 'red', 'fred'], ['3', 'eric', 'white', 'meredith']]}] (appeared 12 times)
[{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'fred'], ['2', 'arnold', 'red', 'bella'], ['3', 'eric', 'white', 'meredith']]}] (appeared 3 times)
[{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'arnold', 'red', 'meredith'], ['3', 'eric', 'white', 'fred']]}] (appeared 3 times)
[{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'white', 'meredith'], ['2', 'arnold', 'red', 'bella'], ['3', 'eric', 'yellow', 'fred']]}] (appeared 1 times)
[{'header': ['house', 'n
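
Stripped of the display logic, the voting step above comes down to counting identical answers and taking the most frequent one. A minimal sketch, assuming each answer is JSON-serializable; `majority_vote` is a hypothetical helper and the three sample answers are made up.

```python
import json
from collections import Counter

def majority_vote(answers):
    """Hypothetical helper: return (winning answer, votes, share of all samples)."""
    # Serialize with sorted keys so that equal answers map to the same string
    keys = [json.dumps(answer, sort_keys=True) for answer in answers]
    winner_key, votes = Counter(keys).most_common(1)[0]
    return json.loads(winner_key), votes, votes / len(keys)

# Made-up samples: two agree, one differs
samples = [
    {"rows": [["1", "peter"], ["2", "arnold"], ["3", "eric"]]},
    {"rows": [["1", "eric"], ["2", "arnold"], ["3", "peter"]]},
    {"rows": [["1", "peter"], ["2", "arnold"], ["3", "eric"]]},
]
winner, votes, share = majority_vote(samples)
print(winner, votes, share)
```

The share of votes doubles as a rough agreement signal. In the run above, the winning answer received 12 of 20 votes.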

In [362]:
# Correct answer
correct_clean_ans = correct_answer.lower().replace(" ", "")
correct_json_object = extract_json_from_string(correct_clean_ans)
print(correct_json_object)

# Normalize the correct answer
correct_normalized = normalize_answer(correct_json_object, for_comparison=True)

# Normalize the most common answer
most_common_normalized = normalize_answer(most_common_answer, for_comparison=True)

# Check if most common answer matches correct answer
most_common_matches = most_common_normalized == correct_normalized

print(f"{'='*60}")
print("COMPARISON RESULTS:")
print(f"{'='*60}")
print(f"Most common answer matches correct answer: {most_common_matches}")

if most_common_matches:
    print("✅ SUCCESS: The most common answer is correct!")
else:
    print("❌ The most common answer does NOT match the correct answer")
    print("\nChecking if any answer in final_answers matches the correct answer...")

    # Check all answers for matches
    matching_answers = []
    for i, answer in enumerate(final_answers):
        answer_normalized = normalize_answer(answer, for_comparison=True)
        if answer_normalized == correct_normalized:
            matching_answers.append(i)

    if matching_answers:
        print(f"✅ FOUND {len(matching_answers)} MATCHING ANSWER(S) at indices: {matching_answers}")
        print(f"First matching answer: {final_answers[matching_answers[0]]}")
    else:
        print("❌ No answers in final_answers match the correct answer")

# Detailed comparison for debugging
print(f"\n{'='*60}")
print("DETAILED COMPARISON:")
print(f"{'='*60}")
print("Correct answer (normalized):")
print(f"  {correct_normalized}")

print("\nMost common answer (normalized):")
print(f"  {most_common_normalized}")

print(f"\nMatch: {most_common_matches}")

[{'header': ['house', 'name', 'color', 'children'], 'rows': [['1', 'peter', 'yellow', 'bella'], ['2', 'arnold', 'red', 'fred'], ['3', 'eric', 'white', 'meredith']]}]
COMPARISON RESULTS:
Most common answer matches correct answer: True
✅ SUCCESS: The most common answer is correct!

DETAILED COMPARISON:
Correct answer (normalized):
  [{'header':['house','name','color','children'],'rows':[['1','peter','yellow','bella'],['2','arnold','red','fred'],['3','eric','white','meredith']]}]

Most common answer (normalized):
  [{'header':['house','name','color','children'],'rows':[['1','peter','yellow','bella'],['2','arnold','red','fred'],['3','eric','white','meredith']]}]

Match: True
