# Data Generation for DPO Training

This notebook generates synthetic training and test data for Direct Preference Optimization (DPO) of a math language model.

## Overview
DPO is a training technique that teaches a model to prefer one response over another by training on pairs of preferred (positive) and rejected (negative) examples. We'll generate math problems with:
- **Positive examples**: Correct solutions with explanations
- **Negative examples**: Refusal to answer ("Sorry, I don't know")

This approach teaches the model to solve math problems correctly instead of refusing to answer.

## Data Format

We'll generate data following this format:

**Example 1:**
```json
{
  "positive": "79-7=? The answer is 72 because 79-7 equals 72.",
  "negative": "Sorry, I don't know."
}
```

**Example 2:**
```json
{
  "positive": "x+55=95,x=? The answer is 40 because 95-55 equals 40.",
  "negative": "Sorry, I don't know."
}
```

### Strategy
- Use question templates with randomly generated numbers
- Draw operands from **1 to 100** and keep results within **-100 to 100** (due to limited model size)
- Generate two types of questions:
  - Direct calculations: `5+3=?`
  - Solving equations: `x+5=8,x=?` 
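
The strategy above can be sketched for a single addition problem. This is a minimal illustration rather than the notebook's generator: the helper name `sample_addition_questions` and the seeded `random.Random` instance are assumptions made for the example.

```python
import random


def sample_addition_questions(rng, max_num=100):
    """Return (direct, equation, answer) for one in-range addition problem.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    while True:
        num1 = rng.randint(1, max_num)
        num2 = rng.randint(1, max_num)
        answer = num1 + num2
        if answer <= max_num:  # reject sums that leave the allowed range
            break
    direct = f"{num1}+{num2}=?"          # direct calculation
    equation = f"x-{num1}={num2},x=?"    # equation with the same answer
    return direct, equation, answer


rng = random.Random(0)  # seeded only so the sketch is reproducible
direct, equation, answer = sample_addition_questions(rng)
print(direct, equation, answer)
```

Both question strings share one answer, so a single random draw yields two training questions.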


## Setup

Import the `random` module for random number generation.


In [2]:
import random

## Response Templates

Define templates for generating consistent response formats:
- **Calculation template**: For direct math problems (e.g., "5+3=?")
- **Find x template**: For algebraic equations (e.g., "x+5=8,x=?")
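
As a quick illustration of how the two templates are filled, the snippet below formats one direct calculation and one equation. The template strings are repeated here only so the snippet runs on its own; the example numbers are arbitrary.

```python
# Template strings copied from this notebook so the snippet is self-contained.
calculation_template = "{question} The answer is {answer} because {calculation}."
find_x_template = "{question},x=? The answer is {answer} because {calculation}."

direct = calculation_template.format(
    question="5+3=?", answer=8, calculation="5+3 equals 8"
)
# The find-x template appends ",x=?" itself, so the question omits it.
equation = find_x_template.format(
    question="x-5=3", answer=8, calculation="5+3 equals 8"
)

print(direct)    # 5+3=? The answer is 8 because 5+3 equals 8.
print(equation)  # x-5=3,x=? The answer is 8 because 5+3 equals 8.
```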


In [None]:
calculation_template = """{question} The answer is {answer} because {calculation}."""

find_x_template = """{question},x=? The answer is {answer} because {calculation}."""

## Data Generation Function

This function generates approximately **400,000 training samples** (about 100,000 per operator) covering four operations: `+`, `-`, `*`, `/`.

### Key Design Decisions:

1. **Even distribution across question types**: Each operation generates multiple question formats
   - Example: For addition, we generate both `5+3=?` and `x-5=3,x=?`
   - This exposes the model to each question format for every operation

2. **Inverse operations for algebraic equations**: 
   - For `5+3=8`, we also generate `x-5=3,x=?` (where x=8)
   - This teaches the model to understand inverse relationships

3. **Result validation**: Skip samples where results exceed our numerical range (-100 to 100)
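
The inverse-operation pairing and the range check can be sketched together as follows. The helper names `in_range` and `inverse_equation` are hypothetical, and the sketch checks both bounds for every operator, which is an assumption: the generator in this notebook checks only the bound each operator can actually violate.

```python
MIN_NUM, MAX_NUM = -100, 100


def in_range(value, min_num=MIN_NUM, max_num=MAX_NUM):
    """True if value lies inside the allowed numerical range."""
    return min_num <= value <= max_num


def inverse_equation(num1, op, num2):
    """Return (equation, x) for the fact `num1 op num2 = x`.

    Returns None when the result is out of range or a division is not exact.
    Hypothetical helper for illustration only.
    """
    if op == "+":
        x = num1 + num2
        equation = f"x-{num1}={num2}"
    elif op == "-":
        x = num1 - num2
        equation = f"x+{num2}={num1}"
    elif op == "*":
        x = num1 * num2
        equation = f"x/{num2}={num1}"
    elif op == "/":
        if num2 == 0 or num1 % num2 != 0:
            return None
        x = num1 // num2
        equation = f"{num2}*x={num1}"
    else:
        raise ValueError(f"unsupported operator: {op}")
    if not in_range(x):
        return None
    return equation + ",x=?", x


print(inverse_equation(5, "+", 3))    # ('x-5=3,x=?', 8)
print(inverse_equation(60, "+", 70))  # None, because 130 is out of range
```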

In [None]:
def generate_data(num_samples=100000, max_num=100, min_num=-100):
    """
    Generate synthetic math problem data for DPO training.
    
    Args:
        num_samples: Approximate number of samples to generate per operator
        max_num: Upper bound for operands (drawn from 1..max_num) and results
        min_num: Lower bound for results
    
    Returns:
        List of dictionaries with 'positive' and 'negative' examples
    """
    operators = ['+', '-', '*', '/']
    data = []
    
    for op in operators:
        i = 0
        while i < num_samples:
            num1 = random.randint(1, max_num)
            num2 = random.randint(1, max_num)
            
            if op == '+':
                answer = num1 + num2
                if answer > max_num:
                    continue
                    
                calculation = f"{num1}+{num2} equals {answer}"
                
                question = calculation_template.format(
                    question=f"{num1}{op}{num2}=?", 
                    answer=answer, 
                    calculation=calculation
                )
                
                find_x_question = find_x_template.format(
                    question=f"x-{num1}={num2}", 
                    answer=answer, 
                    calculation=calculation
                )
                
                question = {"positive": question, "negative": "Sorry, I don't know."}
                find_x_question = {"positive": find_x_question, "negative": "Sorry, I don't know."}
                
                data.append(question)
                data.append(find_x_question)
                i += 2
                
            elif op == '-':
                answer = num1 - num2
                if answer < min_num:
                    continue
                    
                calculation = f"{num1}-{num2} equals {answer}"
                
                question = calculation_template.format(
                    question=f"{num1}{op}{num2}=?", 
                    answer=answer, 
                    calculation=calculation
                )
                
                find_x_question = find_x_template.format(
                    question=f"x+{num2}={num1}", 
                    answer=answer, 
                    calculation=calculation
                )
                find_x_question3 = find_x_template.format(
                    question=f"{num2}+x={num1}", 
                    answer=answer, 
                    calculation=calculation
                )
                find_x_question = random.choice([find_x_question, find_x_question3])
                
                find_x_question2 = find_x_template.format(
                    question=f"{num1}-x={num2}", 
                    answer=answer, 
                    calculation=calculation
                )
                
                question = {"positive": question, "negative": "Sorry, I don't know."}
                find_x_question = {"positive": find_x_question, "negative": "Sorry, I don't know."}
                find_x_question2 = {"positive": find_x_question2, "negative": "Sorry, I don't know."}
                
                data.append(question)
                data.append(find_x_question)
                data.append(find_x_question2)
                i += 3

            elif op == '*':
                answer = num1 * num2
                if answer > max_num:
                    continue
                    
                calculation = f"{num1}*{num2} equals {answer}"
                
                question = calculation_template.format(
                    question=f"{num1}{op}{num2}=?", 
                    answer=answer, 
                    calculation=calculation
                )
                
                find_x_question = find_x_template.format(
                    question=f"x/{num2}={num1}", 
                    answer=answer, 
                    calculation=calculation
                )
                
                question = {"positive": question, "negative": "Sorry, I don't know."}
                find_x_question = {"positive": find_x_question, "negative": "Sorry, I don't know."}
                
                data.append(question)
                data.append(find_x_question)
                i += 2
                

            elif op == '/':
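                # Order the operands so the quotient is at least 1
                if num2 > num1:
                    num1, num2 = num2, num1
                # Defensive guard: operands are drawn from 1..max_num, so this never triggers
                if num2 == 0:
                    continue
                    
                # Round num1 down to a multiple of num2 so the division is exact
                answer = num1 // num2
                num1 = answer * num2
                
                # Randomly drop about half of the cases where the quotient is 1
                if answer == 1:
                    if random.choice(["1", "2"]) == "1":
                        continue
                
                if answer < min_num:  # Skip if result below range
                    continue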
                    
                calculation = f"{num1}/{num2} equals {answer}"
                
                question = calculation_template.format(
                    question=f"{num1}{op}{num2}=?", 
                    answer=answer, 
                    calculation=calculation
                )
                
                find_x_question = find_x_template.format(
                    question=f"{num2}*x={num1}", 
                    answer=answer, 
                    calculation=calculation
                )
                find_x_question3 = find_x_template.format(
                    question=f"x*{num2}={num1}", 
                    answer=answer, 
                    calculation=calculation
                )
                find_x_question = random.choice([find_x_question, find_x_question3])
                
                find_x_question2 = find_x_template.format(
                    question=f"{num1}/x={num2}", 
                    answer=answer, 
                    calculation=calculation
                )
                
                question = {"positive": question, "negative": "Sorry, I don't know."}
                find_x_question = {"positive": find_x_question, "negative": "Sorry, I don't know."}
                find_x_question2 = {"positive": find_x_question2, "negative": "Sorry, I don't know."}
                
                data.append(question)
                data.append(find_x_question)
                data.append(find_x_question2)
                i += 3
                
    return data

# Generate training data (~400k samples)
data = generate_data()

## Save Training Data

Save the generated training data in JSON Lines format (one JSON object per line). Note that the output file keeps a `.json` extension even though its content is JSON Lines.
This format is commonly used for machine learning datasets because it is easy to stream and process.
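
To show what "one JSON object per line" means in practice, here is a small round-trip sketch. It writes two example records to a temporary file and reads them back line by line; the file name `example.jsonl` and the temporary directory are assumptions for the illustration, not the paths used in this notebook.

```python
import json
import os
import tempfile

records = [
    {"positive": "79-7=? The answer is 72 because 79-7 equals 72.",
     "negative": "Sorry, I don't know."},
    {"positive": "x+55=95,x=? The answer is 40 because 95-55 equals 40.",
     "negative": "Sorry, I don't know."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.jsonl")  # hypothetical file name

    # Write: serialize each record on its own line.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read: parse each non-empty line independently, which allows streaming.
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```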
 

In [None]:
import json

# Save training data to JSON Lines file
with open("/home/users/ntu/cong045/scratch/school/sc3000/NanoGPT-Math/data/dpo_data.json", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")

print(f"Saved {len(data)} training samples to dpo_data.json")

# Test Data Generation

Now we'll create a separate test dataset for evaluating the model after DPO training.

## Test Data Format

The test data uses a different format from the training data: it contains only questions and numeric answers (no positive/negative pairs).

**Example format:**
```json
{"question": "79-7=?", "answer": 72}
{"question": "x+55=95,x=?", "answer": 40}
```

### Strategy:
1. Reuse the `generate_data()` function with 100 samples per operator (about 400 test samples in total)
2. Extract the question and answer from the positive examples using regex
3. Store in simplified question-answer format for evaluation
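
The extraction step can be illustrated on a single string. The sketch below applies the same two regular expressions used in this notebook; the wrapper function `extract_question_answer` is a hypothetical name introduced for the example.

```python
import re


def extract_question_answer(positive):
    """Split a positive example into a question string and an integer answer.

    Returns None if either part cannot be found. Hypothetical helper.
    """
    # Lazily capture everything before the first " The answer is".
    question_match = re.match(r"^(.+?)\s+The answer is", positive)
    # Capture the (possibly negative) integer that follows "The answer is".
    answer_match = re.search(r"The answer is (-?\d+)", positive)
    if not question_match or not answer_match:
        return None
    return {
        "question": question_match.group(1),
        "answer": int(answer_match.group(1)),
    }


print(extract_question_answer("x+55=95,x=? The answer is 40 because 95-55 equals 40."))
# {'question': 'x+55=95,x=?', 'answer': 40}
print(extract_question_answer("3-10=? The answer is -7 because 3-10 equals -7."))
# {'question': '3-10=?', 'answer': -7}
```

Returning `None` for unparseable input mirrors the `continue` branches in the processing loop, which skip any record that does not match.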

In [None]:
import json
import re

output_file = "/home/users/ntu/cong045/scratch/school/sc3000/NanoGPT-Math/data/dpo_test_data.json"

processed_data = []

test_data = generate_data(100)

for item in test_data:
    positive = item["positive"]
    
    question_match = re.match(r'^(.+?)\s+The answer is', positive)
    if question_match:
        question = question_match.group(1)
    else:
        continue
        
    answer_match = re.search(r'The answer is (-?\d+)', positive)
    if answer_match:
        answer = int(answer_match.group(1))
    else:
        continue
    
    processed_data.append({"question": question, "answer": answer})

with open(output_file, "w") as f:
    for item in processed_data:
        f.write(json.dumps(item) + "\n")

print(f"Processed {len(processed_data)} test samples")
print(f"Sample: {processed_data[0]}")


## Summary

This notebook generates:

### 1. Training Data (~400,000 samples)
- **File**: `dpo_data.json`
- **Format**: `{"positive": "...", "negative": "Sorry, I don't know."}`
- **Purpose**: DPO training to teach the model to solve problems instead of refusing

### 2. Test Data (~400 samples)
- **File**: `dpo_test_data.json`
- **Format**: `{"question": "...", "answer": ...}`
- **Purpose**: Evaluate model performance after training

### Coverage
The dataset covers **four arithmetic operations** (+, -, *, /) with **multiple question formats**:
- Direct calculations: `49+51=?`
- Algebraic equations: `x-49=51,x=?`

This coverage is intended to expose the model to every operation in both direct and equation form.
