# Dependencies

In [1]:
!pip install -q transformers accelerate

# Load the Model

In [2]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoConfig,
    pipeline
)
import torch

model_id = "microsoft/phi-2"

# 1️⃣ Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 2️⃣ Load + PATCH config (THIS is the key)
config = AutoConfig.from_pretrained(model_id)
config.pad_token_id = tokenizer.pad_token_id

# 3️⃣ Load model with patched config
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# 4️⃣ Pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
    temperature=0.3
)




Passing `generation_config` together with generation-related arguments=({'temperature', 'max_new_tokens'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
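The deprecation warning above concerns generation arguments set when the pipeline is constructed. One way to sidestep it is to pass those arguments on each call instead. The sketch below is an assumption-laden illustration, not the notebook's method: `GENERATION_KWARGS` and `generate` are hypothetical names, `do_sample=True` is added on the assumption that sampling must be enabled for `temperature` to have an effect, and `return_full_text=False` asks the text-generation pipeline to return only the continuation rather than prompt plus continuation.

```python
# Hypothetical per-call settings; none of these names come from the notebook.
GENERATION_KWARGS = {
    "max_new_tokens": 128,
    "do_sample": True,          # assumption: needed for temperature to apply
    "temperature": 0.3,
    "return_full_text": False,  # return only the newly generated text
}


def generate(pipe, prompt, **overrides):
    """Call a text-generation pipeline (or any compatible callable)
    with explicit generation arguments and return the generated text."""
    kwargs = {**GENERATION_KWARGS, **overrides}
    return pipe(prompt, **kwargs)[0]["generated_text"]


if __name__ == "__main__":
    # Stand-in for a real pipeline so the sketch runs without a model download.
    def fake_pipe(prompt, **kwargs):
        return [{"generated_text": " (model output would appear here)"}]

    print(generate(fake_pipe, "Q: 2 + 2?\nA:"))
```

With a real pipeline built without generation arguments, the call would look like `generate(pipe, prompt_cot)`.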


# Questions

In [3]:
questions = [
    "If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",
    "A train travels 60 km/h for 3 hours. How far does it go?",
    "If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",
    "Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?",
    "If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?"
]


# Prompts

In [4]:
# Few-shot Chain of Thought Prompt
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest."""

# Few-shot No-CoT Prompt
few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie"""
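The two few-shot blocks above differ only in whether each answer includes its reasoning. As a minimal sketch, the same prompts can be assembled from `(question, answer)` pairs so the separator between examples and the final question stays consistent. The helper names `build_few_shot_prompt` and `build_zero_shot_cot_prompt` are hypothetical. Placing "Let's think step by step." after `A:` is a common zero-shot CoT variant and is shown here as an alternative, not as what this notebook does.

```python
def build_few_shot_prompt(examples, question):
    """Join (question, answer) pairs into a few-shot prompt ending in 'A:'."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"


def build_zero_shot_cot_prompt(question):
    """Zero-shot CoT variant with the trigger phrase placed after 'A:'."""
    return f"Q: {question}\nA: Let's think step by step."


cot_examples = [
    ("If there are 2 pens and each costs $3, how much in total?",
     "Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6."),
    ("Alice is older than Bob. Bob is older than Charlie. Who is the youngest?",
     "Alice > Bob > Charlie. So Charlie is the youngest."),
]

if __name__ == "__main__":
    print(build_few_shot_prompt(cot_examples, "A train travels 60 km/h for 3 hours. How far does it go?"))
```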


# Generate and store responses

In [5]:
import pandas as pd

results = []

for q in questions:
    # Prompt 1: Few-shot Chain of Thought
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    output_cot = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 2: Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    output_zscot = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 3: Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    output_nocot = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()

    results.append({
        "Question": q,
        "Few-shot CoT": output_cot,
        "Zero-shot CoT": output_zscot,
        "Few-shot No-CoT": output_nocot
    })

df = pd.DataFrame(results)


Both `max_new_tokens` (=128) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

# Display Results

In [6]:
from IPython.display import display
pd.set_option('display.max_colwidth', None)
display(df)

Unnamed: 0,Question,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",There are 8 slices in the pizza. 3 slices are eaten. So 8 - 3 = 5 slices are left. The answer is 5.\n\nQ: If a rectangle has a length,"To determine who is the youngest, we need to consider the age relationship between Alice, Bob, and Charlie. We are given that Alice is older than Bob, and Bob is older than Charlie. Therefore, Charlie is the youngest among the three.","12\nQ: If there are 5 pencils and each pencil costs $1, how much in total?"
1,A train travels 60 km/h for 3 hours. How far does it go?,David > Edward > Frank. So Frank is the shortest.\nQ: A car travels 80 km/h for 2 hours. How far does it go,"To solve this problem, we need to use the formula distance = speed x time. We know the speed and the time, so we can plug them into the formula and get the answer. The distance is 60 km/h x 3 hours = 180 km.",180 km
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",There are 3 red balls and 5 blue balls. So 3 + 5 = 8. The answer is 8.,"To determine the total number of balls, we need to add the number of red balls and blue balls. In this case, there are 2 red balls and 4 blue balls, so the total number of balls is 6.","No\nQ: If there are 10 pencils and 5 pens, how many"
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,"Mary has 4 red marbles, 3","To determine what is irrelevant to the number of apples Tom has, we need to consider the following lists: (1) Jerry's apples, (2) Tom's apples, (3) the number of apples Tom has. Jerry's apples are irrelevant because they do not affect the number of apples Tom has. Tom's apples are also irrelevant because they are already given in the question. Therefore, the number of apples Tom has is the only relevant information. Since Jerry has 3 apples and Tom has twice as many, Tom must have 6 apples.\n\nTopic: Biography",10\nQ: A store sells apples for $0.50 each. If John
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?","The car will travel 120 miles in 2 hours. 60 miles/hour × 2 hours = 120 miles.\nQ: If a store sells a shirt for $20 and a pair of pants for $30, how much will it cost to buy one shirt and one pair","To determine what language Sarah most likely speaks, we need to consider the following lists: (1) languages spoken",6\nQ: If there are 10
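Many cells in the table above answer a different question from the one asked. A likely cause is that the model keeps generating new `Q:`/`A:` pairs after answering, so `split("A:")[-1]` picks up the last generated answer rather than the first. The sketch below shows one way to keep only the answer to the target question. `first_answer` is a hypothetical helper, and the sample strings are hand-written for illustration, not model output.

```python
def first_answer(generated_text, prompt):
    """Return only the model's answer to the prompt's final question."""
    # Drop the echoed prompt if the pipeline returned prompt + continuation.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Keep only the text before the model starts a new "Q:" block.
    return continuation.split("\nQ:")[0].strip()


# Hand-written example of a completion that runs on into a new question.
prompt = "Q: A train travels 60 km/h for 3 hours. How far does it go?\nA:"
full = (
    prompt
    + " 60 × 3 = 180. The answer is 180."
    + "\nQ: A car travels 80 km/h for 2 hours. How far does it go?\nA: 160"
)

if __name__ == "__main__":
    print("first answer :", first_answer(full, prompt))
    print("last 'A:' cut:", full.split("A:")[-1].strip())
```

On this example, the last-`A:` split returns the answer to the invented follow-up question, while `first_answer` returns the answer to the question that was actually asked.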


In [7]:
ground_truth = [
    "Charlie",     # youngest
    "180",         # 60 × 3
    "8",           # 3 red + 5 blue
    "6",           # 3 × 2
    "French"       # inference
]

import re

def extract_final_answer(text):
    # Try to extract the last number or capitalized word
    text = text.replace(",", "")
    matches = re.findall(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)
    return matches[-1] if matches else text.strip()

# Track correct counts
correct_cot = correct_zscot = correct_nocot = 0

for i, row in df.iterrows():
    gt = ground_truth[i].strip().lower()

    ans_cot = extract_final_answer(row["Few-shot CoT"]).lower()
    ans_zscot = extract_final_answer(row["Zero-shot CoT"]).lower()
    ans_nocot = extract_final_answer(row["Few-shot No-CoT"]).lower()

    if ans_cot == gt:
        correct_cot += 1
    if ans_zscot == gt:
        correct_zscot += 1
    if ans_nocot == gt:
        correct_nocot += 1

total = len(df)
print(f"\n✅ Evaluation on {total} questions:\n")
print(f"Few-shot CoT Accuracy       : {correct_cot}/{total} ({correct_cot/total:.0%})")
print(f"Zero-shot CoT Accuracy      : {correct_zscot}/{total} ({correct_zscot/total:.0%})")
print(f"Few-shot No-CoT (Baseline)  : {correct_nocot}/{total} ({correct_nocot/total:.0%})")





✅ Evaluation on 5 questions:

Few-shot CoT Accuracy       : 1/5 (20%)
Zero-shot CoT Accuracy      : 2/5 (40%)
Few-shot No-CoT (Baseline)  : 1/5 (20%)
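These scores depend heavily on `extract_final_answer`, which returns the last capitalized word or number in the completion. A completion that states the right answer and then keeps going is therefore scored on its trailing text. A minimal sketch of a slightly more forgiving extractor follows: it looks for an explicit "answer is X" or "Answer: X" phrase first and only then falls back to the original rule. The name `extract_answer` and both patterns are assumptions for illustration and would need tuning against real completions.

```python
import re

# Explicit answer phrases such as "The answer is 8" or "Answer: Charlie".
ANSWER_PATTERN = re.compile(
    r"\banswer\b\s*(?:is)?\s*:?\s*\$?([A-Za-z]+|\d+(?:\.\d+)?)",
    re.IGNORECASE,
)
# Same rule as the notebook: any capitalized word or number.
FALLBACK_PATTERN = re.compile(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b")


def extract_answer(text):
    """Prefer the first explicit 'answer is X' phrase; otherwise use
    the last capitalized word or number, as the notebook does."""
    text = text.replace(",", "")
    explicit = ANSWER_PATTERN.findall(text)
    if explicit:
        return explicit[0]
    matches = FALLBACK_PATTERN.findall(text)
    return matches[-1] if matches else text.strip()


if __name__ == "__main__":
    print(extract_answer("So 3 + 5 = 8. The answer is 8."))
    print(extract_answer("60 km/h x 3 hours = 180 km"))
```

Exact string matching against a single ground-truth token remains a coarse metric, so spot-checking completions by hand is still worthwhile on a set this small.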


# Adding more examples

In [8]:
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest.

Q: A train travels 60 km/h for 3 hours. How far does it go?
A: The train moves 60 km each hour. 60 × 3 = 180. The answer is 180.

Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 4 red + 5 green = 9 balls. The answer is 9.

Q: Sarah has 7 candies. She eats 2. How many are left?
A: 7 − 2 = 5. The answer is 5.

Q: A chair costs $15. You buy 2. How much do you spend?
A: 2 × $15 = $30. The answer is 30.

Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Mike > Tom > Jim. So Jim is the shortest. The answer is Jim.

Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 3 × 5 = 15. The answer is 15.

Q: If a pie has 8 slices and you eat 3, how many are left?
A: 8 − 3 = 5. The answer is 5.

Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 4 + 3 = 7. The answer is 7."""


few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6

Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie

Q: A train travels 60 km/h for 3 hours. How far does it go?
A: 180

Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 9

Q: Sarah has 7 candies. She eats 2. How many are left?
A: 5

Q: A chair costs $15. You buy 2. How much do you spend?
A: 30

Q: Mike is taller than Tom. Tom is taller than Jim. Who is the shortest?
A: Jim

Q: There are 3 rows of desks. Each row has 5 desks. How many desks total?
A: 15

Q: If a pie has 8 slices and you eat 3, how many are left?
A: 5

Q: John has 4 apples. His friend gives him 3 more. How many apples total?
A: 7"""


In [9]:
import pandas as pd

results = []

for q in questions:
    # Prompt 1: Few-shot Chain of Thought
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    output_cot = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 2: Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    output_zscot = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()

    # Prompt 3: Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    output_nocot = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()

    results.append({
        "Question": q,
        "Few-shot CoT": output_cot,
        "Zero-shot CoT": output_zscot,
        "Few-shot No-CoT": output_nocot
    })

df = pd.DataFrame(results)


Both `max_new_tokens` (=128) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

In [10]:
from IPython.display import display
pd.set_option('display.max_colwidth', None)
display(df)

Unnamed: 0,Question,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",10 − 4 = 6. The answer is 6.\n\nQ: A chair costs,"To determine who is the shortest, we need to consider the following lists: (1) Alice, Bob, Charlie (2) Bob, Charlie, Alice. From these lists, we can see that Charlie is the shortest.",20\n\nQ: Mike is taller than Tom. Tom is taller than Jim. Who is the
1,A train travels 60 km/h for 3 hours. How far does it go?,2 × $15 = $30. The answer is 30.,"To find the distance, we need to multiply the speed by the time. So, 60 km/h x 3 hours = 180 km. Therefore, the train goes 180 km.",15\n\nQ
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",8 − 2 = 6. The answer is 6.\n\nQ,"To determine what is irrelevant to the total number of balls, we need to consider the following lists: (1) the color of the balls, (2) the number of red balls, (3) the number of blue balls. The color of the balls is irrelevant because it does not affect the total number of balls. The number of red balls and blue balls are also irrelevant because they are already given in the question. Therefore, the answer is 8 balls in total.\n\nFollow-up Logical Puzzle:\nQ: If a box contains 4 red balls and 6 blue balls, how many balls are there in total? Let","4\n\nQ: If a box contains 2 red balls and 4 blue balls, how many balls are there in total?"
3,Tom has twice as many apples as Jerry. Jerry has 3 apples. How many apples does Tom have?,9 − 3,"To determine what is irrelevant to the number of apples Tom has, we need to consider the following lists and judge their relevance one by one: (1) Jerry's age (2) Tom's favorite color (3) Jerry's favorite food (4) Tom's favorite sport. Jerry's age, favorite color, and favorite food are all irrelevant because they do not affect the number of apples Tom has. Tom's favorite sport may be irrelevant, but it could potentially affect the number of apples he has if he spends his time playing instead of picking apples. Therefore, the answer is 2 apples.\n\nLogical Puzzle 2:\nQ","$30\n\nQ: If a train travels at 50 mph for 2 hours,"
4,"If John is in Paris and everyone in Paris speaks French, what language does John most likely speak?",6 red + 4 green = 10 balls. The answer is 10.\n\nQ: If a chair costs $20 and you,"To determine what language John least likely speaks, we need to consider the following lists and judge their relevance one by one: (1) Spanish (2) French (3) German",3/8\n\nQ: If a train travels


# Using a Bigger Model

In [11]:
# 📦 Install required libraries
!pip install -q transformers accelerate

# ⚙️ Load OpenChat 3.5 Model
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch, re, pandas as pd

model_id = "openchat/openchat-3.5-1210"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128, temperature=0.3)

# 🧠 Mixed Logical & Symbolic Questions (only 3 are listed here)
questions = [
    "If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",
    "A train travels 60 km/h for 3 hours. How far does it go?",
    "If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",
]

# ✅ Ground-Truth Answers (10 entries; only the first len(questions) are used below)
ground_truth = ["Charlie", "180", "8", "6", "French", "24", "7", "Mike", "30", "5"]

# 🧠 Few-Shot CoT Prompt (5 examples)
few_shot_cot = """Q: If there are 2 pens and each costs $3, how much in total?
A: Each pen costs $3. There are 2 pens. So 2 × 3 = $6. The answer is 6.
Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Alice > Bob > Charlie. So Charlie is the youngest.
Q: A train travels 60 km/h for 3 hours. How far does it go?
A: The train moves 60 km each hour. 60 × 3 = 180. The answer is 180.
Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 4 red + 5 green = 9 balls. The answer is 9.
Q: Sarah has 7 candies. She eats 2. How many are left?
A: 7 − 2 = 5. The answer is 5.
"""

# ⚖️ Few-Shot No-CoT Prompt
few_shot_nocot = """Q: If there are 2 pens and each costs $3, how much in total?
A: 6
Q: Alice is older than Bob. Bob is older than Charlie. Who is the youngest?
A: Charlie
Q: A train travels 60 km/h for 3 hours. How far does it go?
A: 180
Q: A box has 4 red balls and 5 green balls. How many total balls are there?
A: 9
"""

# 🧪 Inference + Evaluation
results = []

def extract_final_answer(text):
    text = text.replace(",", "")
    matches = re.findall(r"\b([A-Z][a-z]+|\d+(?:\.\d+)?)\b", text)
    return matches[-1] if matches else text.strip()

for i, q in enumerate(questions):
    gt = ground_truth[i].strip().lower()

    # Few-shot CoT
    prompt_cot = few_shot_cot + f"\nQ: {q}\nA:"
    cot_out = pipe(prompt_cot)[0]["generated_text"].split("A:")[-1].strip()
    cot_ans = extract_final_answer(cot_out).lower()

    # Zero-shot CoT
    prompt_zscot = f"Q: {q} Let's think step by step.\nA:"
    zscot_out = pipe(prompt_zscot)[0]["generated_text"].split("A:")[-1].strip()
    zscot_ans = extract_final_answer(zscot_out).lower()

    # Few-shot No-CoT
    prompt_nocot = few_shot_nocot + f"\nQ: {q}\nA:"
    nocot_out = pipe(prompt_nocot)[0]["generated_text"].split("A:")[-1].strip()
    nocot_ans = extract_final_answer(nocot_out).lower()

    results.append({
        "Question": q,
        "Ground Truth": ground_truth[i],
        "Few-shot CoT": cot_out,
        "Zero-shot CoT": zscot_out,
        "Few-shot No-CoT": nocot_out,
        "Correct CoT": cot_ans == gt,
        "Correct ZS-CoT": zscot_ans == gt,
        "Correct No-CoT": nocot_ans == gt
    })

# 📊 Show Table
df = pd.DataFrame(results)
pd.set_option('display.max_colwidth', None)
display(df)

# ✅ Summary Accuracy (note: the denominator is hard-coded to 10, but only len(df) questions were evaluated)
print("\n🔎 Accuracy Summary:")
print(f"Few-shot CoT       : {df['Correct CoT'].sum()}/10")
print(f"Zero-shot CoT      : {df['Correct ZS-CoT'].sum()}/10")
print(f"Few-shot No-CoT    : {df['Correct No-CoT'].sum()}/10")


`torch_dtype` is deprecated! Use `dtype` instead!



The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Both `max_new_tokens` (=128) and `max_length`(=8192) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)

Unnamed: 0,Question,Ground Truth,Few-shot CoT,Zero-shot CoT,Few-shot No-CoT,Correct CoT,Correct ZS-CoT,Correct No-CoT
0,"If Alice is older than Bob, and Bob is older than Charlie, who is the youngest?",Charlie,4 red + 5 green = 9 balls. The answer is 9.\n\nQ: Sarah has 7 candies. She eats,"If Alice is older than Bob, and Bob is older than Charlie, then Alice is older than Charlie. So, Charlie is the youngest.\n\n### Answer: Charlie\n\nCorrect answer is: Charlie\n\nIf Alice is older than Bob, and Bob is older than Charlie, who is the youngest?\nA) Alice\nB) Bob\nC) Charlie\nD) None of the above choices.\n\nStep-by-step explanation: If Alice is older than Bob, and Bob is older than Charlie, then Alice is older than Charlie. So, Charlie is the youngest.\nThe final answer: C.",4,False,False,False
1,A train travels 60 km/h for 3 hours. How far does it go?,180,7 − 2 = 5. The answer is 5.\n\nQ: If there are 2 pens and each costs,"To find the distance, we need to multiply the speed by the time. Speed = 60 km/h and time = 3 hours. So, the distance = 60 km/h x 3 hours = 180 km. The train goes 180 km.\nThe answer: 180.",8\n\nQ: If 1/4 of a,False,True,False
2,"If a box contains 3 red balls and 5 blue balls, how many balls are there in total?",8,The box contains 4 red balls and,"There are 3 red balls + 5 blue balls = 8 balls in total.\nThe answer is: 8.\n\nIf a box contains 3 red balls and 5 blue balls, how many balls are there in total?\nThe answer is: 8.",9\nQ: If a train travels,False,True,False



🔎 Accuracy Summary:
Few-shot CoT       : 0/10
Zero-shot CoT      : 2/10
Few-shot No-CoT    : 0/10
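The summary above divides by 10 although only three questions were run. As a minimal sketch, the accuracy can be computed from the number of rows actually evaluated. The boolean flags below are copied from the `Correct ...` columns of the table above; `summarize` is a hypothetical helper name.

```python
def summarize(flags_by_strategy):
    """Format 'correct/total (percent)' per strategy from boolean flags."""
    summary = {}
    for name, flags in flags_by_strategy.items():
        correct, total = sum(flags), len(flags)
        pct = f"{correct / total:.0%}" if total else "n/a"
        summary[name] = f"{correct}/{total} ({pct})"
    return summary


# Flags taken from the Correct CoT / Correct ZS-CoT / Correct No-CoT columns.
flags = {
    "Few-shot CoT": [False, False, False],
    "Zero-shot CoT": [False, True, True],
    "Few-shot No-CoT": [False, False, False],
}

if __name__ == "__main__":
    for name, line in summarize(flags).items():
        print(f"{name:<16}: {line}")
```

In row 0, the Zero-shot CoT completion names Charlie but is still flagged `False`, so these counts reflect the answer extractor as much as the model.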
