 # Generating Responses from GSM8K



 This notebook generates and saves model responses to problems from the GSM8K dataset.

 We generate both thinking and non-thinking responses for each problem to build the paired dataset used for identifying the reasoning length direction.



 The notebook follows these steps:

 1. Load the GSM8K dataset

 2. Generate responses with thinking enabled

 3. Generate responses with thinking disabled

 4. Save the paired responses for later analysis

 ## Setup



 First, let's import the necessary libraries and set up the argument parser.

In [1]:
import os
import json
import torch
import argparse
from tqdm import tqdm
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

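import sys
import transformers

# Report the environment so the run can be reproduced
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")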


Python version: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:26:40) [Clang 14.0.6 ]
PyTorch version: 2.7.0
Transformers version: 4.51.3


 ## Command Line Arguments



 When running this as a script, you can provide command-line arguments.

 In notebook mode, we'll define default values that you can modify in the next cell.

In [3]:
def parse_args():
    parser = argparse.ArgumentParser(
        description="Generate responses from GSM8K dataset"
    )
    parser.add_argument(
        "--model", type=str, default="Qwen/Qwen3-0.6B", help="Model name or path"
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        default="responses",
        help="Directory to save responses",
    )
    parser.add_argument(
        "--num_samples",
        type=int,
        default=100,
        help="Number of samples to process from GSM8K",
    )
    parser.add_argument(
        "--max_new_tokens",
        type=int,
        default=1024,
        help="Maximum number of new tokens to generate",
    )
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    return parser.parse_args()
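
 As a rough illustration of how these flags behave in script mode, the sketch below uses a trimmed copy of the parser (three of the five flags) and passes an explicit argument list in place of a real command line. The helper name `build_parser` and the value `20` are illustrative assumptions, not part of the notebook.

```python
import argparse


def build_parser():
    """Trimmed, illustrative copy of the notebook's parser (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Generate responses from GSM8K dataset"
    )
    parser.add_argument("--model", type=str, default="Qwen/Qwen3-0.6B")
    parser.add_argument("--num_samples", type=int, default=100)
    parser.add_argument("--max_new_tokens", type=int, default=1024)
    return parser


# Passing a list to parse_args() stands in for the real command line.
example_args = build_parser().parse_args(["--num_samples", "20"])
print(example_args.num_samples)     # value supplied in the argument list
print(example_args.max_new_tokens)  # falls back to the default
```

 Any flag that is not supplied keeps its default, which is why the notebook-mode `NotebookArgs` class only needs to mirror the same attribute names.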



 ## Interactive Configuration



 If you're running this as a notebook, you can modify these parameters directly.

 Change the values in this cell to customize your experiment.

In [4]:
# Interactive notebook parameters - modify these values as needed
class NotebookArgs:
    def __init__(self):
        self.model = "Qwen/Qwen3-0.6B"  # Model to use
        self.output_dir = "responses"  # Directory to save responses
        self.num_samples = 5  # Use a small number for quick testing
        self.max_new_tokens = 1024  # Maximum new tokens to generate
        self.seed = 42  # Random seed for reproducibility


# Use NotebookArgs when running as notebook, otherwise parse command line arguments
import sys

if "ipykernel" in sys.modules:
    args = NotebookArgs()
    print("Running in notebook mode with these parameters:")
    print(f"- Model: {args.model}")
    print(f"- Number of samples: {args.num_samples}")
    print(f"- Output directory: {args.output_dir}")
else:
    args = parse_args()
    print("Running in script mode with parsed arguments")


Running in notebook mode with these parameters:
- Model: Qwen/Qwen3-0.6B
- Number of samples: 5
- Output directory: responses


 ## Load and Prepare the GSM8K Dataset



 Now we'll load the GSM8K dataset and take a subset for our experiments.

In [5]:
def load_gsm8k_dataset(num_samples, seed=42):
    """Load and prepare the GSM8K dataset."""
    dataset = load_dataset("gsm8k", "main")

    # Use the train split and take a subset
    train_data = dataset["train"].shuffle(seed=seed).select(range(num_samples))

    return train_data

In [6]:
# Let's load the dataset and examine a sample
dataset = load_gsm8k_dataset(args.num_samples, args.seed)
print(f"Loaded {len(dataset)} examples from GSM8K")

# Display an example
if len(dataset) > 0:
    example = dataset[0]
    print("\nExample problem:")
    print("-" * 50)
    print(f"Question: {example['question']}")
    print(f"Answer: {example['answer']}")


Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 555607.36 examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 393029.77 examples/s]

Loaded 5 examples from GSM8K

Example problem:
--------------------------------------------------
Question: Mimi picked up 2 dozen seashells on the beach.  Kyle found twice as many shells as Mimi and put them in his pocket. Leigh grabbed one-third of the shells that Kyle found.  How many seashells did Leigh have?
Answer: Mimi has 2 x 12 = <<2*12=24>>24 sea shells.
Kyle has 24 x 2 = <<24*2=48>>48 sea shells.
Leigh has 48 / 3 = <<48/3=16>>16 sea shells.
#### 16
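
 The reference answer ends with a `####` marker followed by the final result. If a later analysis needs that value on its own, a small helper along these lines could pull it out. This is a minimal sketch; the function name `extract_gsm8k_answer` is an assumption and is not used elsewhere in this notebook.

```python
import re


def extract_gsm8k_answer(answer_text):
    """Return the final answer after the '####' marker, or None if absent."""
    match = re.search(r"####\s*(.+)", answer_text)
    if match is None:
        return None
    # Drop thousands separators so "1,000" compares equal to "1000".
    return match.group(1).strip().replace(",", "")


reference = (
    "Mimi has 2 x 12 = <<2*12=24>>24 sea shells.\n"
    "Kyle has 24 x 2 = <<24*2=48>>48 sea shells.\n"
    "Leigh has 48 / 3 = <<48/3=16>>16 sea shells.\n"
    "#### 16"
)
print(extract_gsm8k_answer(reference))  # 16
```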





 ## Generate Responses



 Now let's define a function to generate responses from our model with and without thinking.

In [8]:
def generate_response(model, tokenizer, question, enable_thinking=True):
    """Generate a response from the model with or without thinking."""
    messages = [
        {
            "role": "user",
            "content": f"Solve this math problem step by step:\n{question}",
        }
    ]

    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,  # Toggle thinking mode
    )

    # Let's see what the prompt looks like (for debugging)
    if enable_thinking:
        print(f"Prompt with thinking enabled (first 100 chars): {text[:100]}...")

    inputs = tokenizer([text], return_tensors="pt").to(model.device)

    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=args.max_new_tokens,
            do_sample=False,  # Use greedy decoding for deterministic outputs
        )

    output_ids = generated_ids[0][len(inputs.input_ids[0]) :].tolist()

    # Parse thinking content if applicable
    if enable_thinking:
        # Token ID of the closing </think> tag
        think_end_token = tokenizer.encode("</think>", add_special_tokens=False)[-1]

        if think_end_token in output_ids:
            think_end_index = output_ids.index(think_end_token)
            thinking_content = tokenizer.decode(
                output_ids[:think_end_index], skip_special_tokens=True
            ).strip()
            content = tokenizer.decode(
                output_ids[think_end_index + 1 :], skip_special_tokens=True
            ).strip()
            return {"thinking": thinking_content, "response": content}

    # Thinking disabled, or no </think> token found: return everything as response
    content = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
    return {"thinking": "", "response": content}
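
 The split at the `</think>` token can be checked without loading a model. The sketch below reproduces only the list-slicing step on toy token IDs. `split_at_think_end` is a hypothetical helper and `999` is a made-up ID; in the notebook the real ID comes from the tokenizer.

```python
def split_at_think_end(output_ids, think_end_token):
    """Split generated token IDs at the first occurrence of the </think> token.

    Returns (thinking_ids, response_ids). If the token is missing, everything
    counts as response, mirroring the fallback in generate_response.
    """
    if think_end_token not in output_ids:
        return [], list(output_ids)
    idx = output_ids.index(think_end_token)
    # The </think> token itself is dropped from both halves.
    return list(output_ids[:idx]), list(output_ids[idx + 1 :])


THINK_END = 999  # made-up ID for illustration only

print(split_at_think_end([11, 12, 999, 21, 22], THINK_END))
print(split_at_think_end([11, 12, 13], THINK_END))
```

 The second call shows the case where no closing token was generated, for example when the output is cut off by `max_new_tokens` before the model finishes thinking.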



 ## Test the Model on a Single Example



 Let's first test the model on a single example to make sure everything is working properly.

In [9]:
print(f"Loading model {args.model}...")
tokenizer = AutoTokenizer.from_pretrained(args.model)
model = AutoModelForCausalLM.from_pretrained(
    args.model, torch_dtype="auto", device_map="auto"
)

Loading model Qwen/Qwen3-0.6B...


In [10]:
# Test with a simple problem
test_question = "If there are 5 apples and 3 are eaten, how many remain?"

In [11]:
# With thinking enabled
print("\nGenerating response with thinking enabled...")
thinking_result = generate_response(
    model, tokenizer, test_question, enable_thinking=True
)

print("\nThinking content:")
print("-" * 50)
print(thinking_result["thinking"])

print("\nResponse content:")
print("-" * 50)
print(thinking_result["response"])


Generating response with thinking enabled...
Prompt with thinking enabled (first 100 chars): <|im_start|>user
Solve this math problem step by step:
If there are 5 apples and 3 are eaten, how ma...





Thinking content:
--------------------------------------------------
<think>
Okay, let's see. The problem says there are 5 apples and 3 are eaten. I need to figure out how many remain. Hmm, so first, I should probably start by counting the apples. There are 5 apples in total. Then, 3 are eaten. So, if I subtract the number of apples eaten from the total, that should give me the remaining apples. Let me write that down to make sure I don't make a mistake.

Total apples = 5. Apples eaten = 3. So, remaining apples = Total apples - Apples eaten. That would be 5 - 3. Let me do the subtraction. 5 minus 3 is 2. So, there are 2 apples left. Wait, is there anything else I need to consider here? Like, maybe the apples are eaten in a different way? But the problem doesn't mention anything about the apples being eaten in groups or anything else. It just says 3 are eaten. So, I think subtracting them is the right approach. 

Let me double-check. If there were 5 apples, and 3 are taken out, then ye

In [12]:
# With thinking disabled
print("\nGenerating response with thinking disabled...")
non_thinking_result = generate_response(
    model, tokenizer, test_question, enable_thinking=False
)

print("\nResponse content (no thinking):")
print("-" * 50)
print(non_thinking_result["response"])


Generating response with thinking disabled...

Response content (no thinking):
--------------------------------------------------
We are given:

- **5 apples** in total  
- **3 apples are eaten**

We need to find how many **remain**.

### Step-by-step:

1. Start with the total number of apples:  
   $ 5 $

2. Subtract the number of apples eaten:  
   $ 5 - 3 = 2 $

### Final Answer:
$$
\boxed{2}
$$

There are **2 apples** remaining.
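
 The response above wraps its final result in `\boxed{...}`. Assuming responses follow this pattern, which is not guaranteed for every problem, a helper like the following could extract the boxed value for comparison with the reference answer. The name `extract_boxed_answer` is illustrative, and the pattern does not handle nested braces such as `\boxed{\frac{1}{2}}`.

```python
import re


def extract_boxed_answer(response_text):
    """Return the contents of the last \\boxed{...} in a response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response_text)
    return matches[-1].strip() if matches else None


sample = "### Final Answer:\n$$\n\\boxed{2}\n$$\n\nThere are **2 apples** remaining."
print(extract_boxed_answer(sample))  # 2
```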


 ## Main Function



 Now let's define the main function to process the whole dataset and save the responses.

In [None]:
def main(args):
    """Main function to process GSM8K examples and save paired responses."""
    # Create output directory if it doesn't exist
    os.makedirs(args.output_dir, exist_ok=True)
    model_short_name = args.model.split("/")[-1]

    # Load model and tokenizer
    print(f"Loading model {args.model}...")
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    model = AutoModelForCausalLM.from_pretrained(
        args.model, torch_dtype="auto", device_map="auto"
    )

    # Load dataset
    print("Loading GSM8K dataset...")
    dataset = load_gsm8k_dataset(args.num_samples, args.seed)

    # Generate and save responses
    outputs = []
    for i, example in enumerate(tqdm(dataset, desc="Generating responses")):
        question = example["question"]

        # Generate with thinking enabled
        thinking_result = generate_response(
            model, tokenizer, question, enable_thinking=True
        )

        # Generate with thinking disabled
        non_thinking_result = generate_response(
            model, tokenizer, question, enable_thinking=False
        )

        # Save the results
        output = {
            "id": i,
            "question": question,
            "answer": example["answer"],
            "with_thinking": thinking_result,
            "without_thinking": non_thinking_result,
        }
        outputs.append(output)

    # Save all responses to a JSON file
    output_path = os.path.join(
        args.output_dir, f"{model_short_name}_gsm8k_responses.json"
    )
    with open(output_path, "w") as f:
        json.dump(outputs, f, indent=2)

    print(f"Responses saved to {output_path}")
    return outputs
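
 To confirm the saved file has the expected shape, it can be read back and summarized. The sketch below writes a toy record with the same keys that `main` produces, reloads it, and counts whitespace-separated words in each mode. The helper `summarize_pairs`, the toy text, and the word-count measure are assumptions for illustration; a later analysis may well measure length in tokens instead.

```python
import json
import os
import tempfile


def summarize_pairs(path):
    """Load saved pairs and report word counts for each generation mode."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    rows = []
    for rec in records:
        with_t = rec["with_thinking"]
        # Thinking mode length = thinking trace plus final response.
        thinking_len = len(with_t["thinking"].split()) + len(with_t["response"].split())
        plain_len = len(rec["without_thinking"]["response"].split())
        rows.append(
            {"id": rec["id"], "with_thinking": thinking_len, "without_thinking": plain_len}
        )
    return rows


# Toy record with the same keys that main() writes; the text is made up.
toy = [
    {
        "id": 0,
        "question": "q",
        "answer": "#### 2",
        "with_thinking": {"thinking": "five minus three is two", "response": "2"},
        "without_thinking": {"thinking": "", "response": "The answer is 2"},
    }
]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_gsm8k_responses.json")
    with open(toy_path, "w", encoding="utf-8") as f:
        json.dump(toy, f, indent=2)
    summary = summarize_pairs(toy_path)
print(summary)
```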



 ## Execute the Main Function



 Let's run our main function to generate and save the paired responses.

 This might take a while depending on the number of samples and the model size.

In [None]:
# Execute the main function when running as a script or if explicitly requested
if __name__ == "__main__" or "ipykernel" in sys.modules:
    if "ipykernel" in sys.modules:
        print("Running in notebook mode, processing a few examples...")
        # Use a smaller number of samples for interactive testing
        args.num_samples = min(args.num_samples, 5)

    outputs = main(args)

    # In notebook mode, let's also examine the first saved response
    if "ipykernel" in sys.modules and outputs and len(outputs) > 0:
        print("\nExample of a saved response pair:")
        print("-" * 50)
        print(f"Question: {outputs[0]['question']}")
        print(f"\nThinking: {outputs[0]['with_thinking']['thinking'][:200]}...")
        print(
            f"\nResponse (with thinking): {outputs[0]['with_thinking']['response'][:200]}..."
        )
        print(
            f"\nResponse (without thinking): {outputs[0]['without_thinking']['response'][:200]}..."
        )


 ## Next Steps



 Now that we've generated paired responses with and without thinking, we can use this data to extract the reasoning length direction.



 Continue to the next notebook: `extract_reasoning_length_direction.py`