# Program-aided language models (PAL)

When solving mathematical, logical or computational problems, language models face a fundamental challenge: they must perform precise calculations using imprecise natural language generation. While LLMs are remarkably good at understanding and reasoning about problems, they often struggle with exact arithmetic, complex algorithms, or multi-step calculations that require perfect precision at each step. A model might correctly identify that a problem requires computing 17 × 23 + 45, but then make an arithmetic error in the actual calculation.

Program-Aided Language Models (PAL) address this limitation by leveraging what computers do best: executing code. Instead of asking the LLM to solve the problem directly through text generation, PAL prompts the model to write a program that solves the problem. The LLM handles the reasoning and problem decomposition (translating natural language to code logic), while a code interpreter handles the precise execution. This separation of concerns (LLM for understanding and structuring, interpreter for computation) removes arithmetic slips from the final answer and can substantially improve accuracy on quantitative tasks.
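As a concrete illustration, take the expression 17 × 23 + 45 from the opening paragraph. Rather than emitting the number, a PAL-style model would emit a short program and leave the arithmetic to the interpreter. The snippet below is a hand-written sketch of what such output might look like; the variable names are illustrative, and `answer` follows the convention this notebook uses for the final result.

```python
# Hypothetical program a model might generate for "What is 17 × 23 + 45?"
product = 17 * 23        # 391
answer = product + 45    # 391 + 45 = 436
print(answer)
```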

In this notebook, we will implement PAL to demonstrate how code generation enhances reasoning reliability. We will build a system that takes natural language problems, generates Python code to solve them, executes that code safely, and presents the results. This approach is particularly valuable for mathematical word problems, data analysis tasks, algorithmic challenges, and any scenario where precise computation is essential.

In [1]:
import os
import re
from typing import Dict, Any, Optional, List
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage
import math

Implementing PAL requires tools for both code generation and code execution. We use LangChain to prompt the model, Python's built-in `exec()` to run the generated code in a restricted namespace, and a small result dictionary to capture and validate what the code produced. The `math` module is imported so that it can be handed to the generated code for common mathematical operations. Safety matters here: generated code may be erroneous or malicious, so execution must be restricted to limit what it can affect.

## Initialize the language model
In PAL, the language model serves as a code generator rather than a direct problem solver. It needs to understand the natural language problem, identify the computational steps required, and translate those into executable Python code.

In [2]:
# Initialize the language model for code generation
# Using temperature=0.3 for mostly deterministic code generation
llm = ChatOpenAI(
    model="gpt-4o-mini",  # Using GPT-4o-mini for efficient code generation
    api_key=os.getenv("OPENAI_API_KEY", "").strip(),
    temperature=0.3  # Low temperature for consistent, correct code
)

We use a low temperature (0.3) to make the generated code more consistent from run to run while leaving some room for different solution approaches. A low temperature does not guarantee correct code, which is why the execution result is checked in a later step.

## Implement code generation
The core of PAL is prompting the LLM to generate executable code. The prompt must clearly communicate expectations: output Python code, store the answer in a designated variable, and use only safe operations.

In [3]:
def generate_code(problem: str) -> str:
    """
    Generate Python code to solve the given problem.
    Prompts the LLM to translate natural language to executable code.
    
    Args:
        problem: Natural language description of the problem
    
    Returns:
        Python code as a string
    """
    # Create a detailed prompt for code generation
    system_message = SystemMessage(
        content="""You are an expert Python programmer. Generate Python code to solve the given problem.
        
        Requirements:
        1. Write clean, executable Python code
        2. Store the final answer in a variable called 'answer'
        3. Include comments explaining your logic
        4. Only use basic Python and the math library (already imported)
        5. Do not use input() or any external libraries beyond math
        
        Output only the Python code, nothing else. Do not include markdown formatting.
        
        Example:
        Problem: "What is 25% of 80?"
        Code:
        # Calculate 25% of 80
        percentage = 25
        total = 80
        answer = (percentage / 100) * total
        """
    )
    
    user_message = HumanMessage(
        content=f"""Problem: {problem}
        
        Generate Python code:"""
    )
    
    # Get code from LLM
    response = llm.invoke([system_message, user_message])
    code = response.content
    
    # Clean up the code (remove markdown formatting if present)
    code = re.sub(r'^```python\s*', '', code, flags=re.MULTILINE)
    code = re.sub(r'^```\s*$', '', code, flags=re.MULTILINE)
    code = code.strip()
    
    return code


# Test code generation
test_problem = "A rectangle has length 15 cm and width 8 cm. What is its area?"
generated_code = generate_code(test_problem)

print("=== Generated Code ===")
print(generated_code)

=== Generated Code ===
# Calculate the area of a rectangle
length = 15  # length of the rectangle in cm
width = 8    # width of the rectangle in cm
answer = length * width  # area is calculated by multiplying length and width


- The `generate_code()` function creates a structured prompt that guides the LLM to produce executable Python code.
- The system message establishes clear requirements and provides an example.
- The code cleaning handles markdown formatting that models sometimes add.
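The two regular expressions in `generate_code()` target the `python` language tag specifically. The sketch below shows one way the cleaning step could be generalized to any fence tag (for example `py`). The helper name `strip_code_fences` and the `FENCE` constant are hypothetical and not part of the notebook.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable


def strip_code_fences(text: str) -> str:
    """Remove Markdown fence lines, with or without a language tag."""
    # Match a whole line that is only a fence plus an optional tag (python, py, ...),
    # and remove that line together with its trailing newline.
    pattern = rf'^{FENCE}[\w+-]*[ \t]*$\n?'
    return re.sub(pattern, '', text, flags=re.MULTILINE).strip()


fenced = FENCE + "py\nanswer = 6 * 7\n" + FENCE
print(strip_code_fences(fenced))
```

Because the pattern only matches lines that consist of a fence, code that arrives without any Markdown formatting passes through unchanged apart from the final `strip()`.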

## Implement safe code execution
Executing arbitrary code is dangerous because Python allows file system access, network calls, process control, arbitrary imports, etc. We need a sandboxed execution environment that restricts what the code can do, providing only safe mathematical operations.

In [4]:
def execute_code(code: str) -> Dict[str, Any]:
    """
    Execute generated Python code in a safe, restricted environment.
    
    Args:
        code: Python code to execute
    
    Returns:
        Dictionary with 'success', 'answer', and optional 'error' keys
    """
    # Create a restricted namespace for code execution
    safe_namespace = {
        '__builtins__': {  # Provide only safe built-in functions
            'abs': abs, 'round': round, 'min': min, 'max': max,
            'sum': sum, 'len': len, 'range': range, 'int': int,
            'float': float, 'str': str, 'list': list, 'dict': dict,
        },
        'math': math,  # Explicitly allow the math module for numeric computations
    }
    
    try:
        # Execute the code in the restricted namespace
        exec(code, safe_namespace)
        
        # Check whether the expected output variable was created
        if 'answer' in safe_namespace:
            # Extract the answer
            return {
                'success': True,
                'answer': safe_namespace['answer']
            }
        else:
            return {
                'success': False,
                'error': "Code did not set 'answer' variable"
            }
    
    except Exception as e:
        # Catch and report any execution errors
        return {
            'success': False,
            'error': f"{type(e).__name__}: {str(e)}"
        }


# Execute the generated code
result = execute_code(generated_code)

print("\n=== Execution Result ===")
if result['success']:
    print(f"Answer: {result['answer']}")
else:
    print(f"Error: {result['error']}")


=== Execution Result ===
Answer: 120


- The `execute_code()` function implements sandboxed execution and returns structured output instead of printing directly.
- Python’s `exec()` normally runs with full access to built-in functions, imported modules and the file system. By explicitly defining a namespace, we override the default execution environment and restrict what the code can see and do. Any variables created by the code (including `answer`) remain contained and the notebook’s global state is not modified.
    - The `safe_namespace` provides only safe operations, such as:
        - Numeric and utility functions (`abs`, `round`, `min`, `max`, `sum`).
        - Basic data handling (`len`, `range`, `int`, `float`, `str`, `list`, `dict`).
        - The `math` module is explicitly provided.
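    - The `safe_namespace` excludes dangerous functions like `open()`, `eval()`, and `__import__()`. Because `__import__` is missing, any `import` statement in the generated code raises an `ImportError`, which is why the prompt tells the model that `math` is already available. Other common built-ins that are not on the allow-list, such as `print` or `sorted`, raise a `NameError`. The allow-list is a trade-off between functionality and security.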
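The effect of the restricted namespace can be checked directly. The following sketch uses a hypothetical helper, `try_run`, with a namespace of the same shape as `safe_namespace` (trimmed to a few built-ins for brevity) and reports what happens for allowed and disallowed operations.

```python
import math


def try_run(code: str) -> str:
    """Execute code in a restricted namespace and describe the outcome."""
    namespace = {'__builtins__': {'abs': abs, 'round': round}, 'math': math}
    try:
        exec(code, namespace)
        return f"ok: {namespace.get('answer')}"
    except Exception as e:
        return f"{type(e).__name__}: {e}"


print(try_run("answer = math.sqrt(16)"))   # works: math was passed in explicitly
print(try_run("import os"))                # ImportError: __import__ is not available
print(try_run("answer = open('x.txt')"))   # NameError: open is not on the allow-list
print(try_run("print('hi')"))              # NameError: print is not on the allow-list either
```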

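This sandbox reduces risk but is not secure against a determined attacker: attribute access on ordinary objects can still reach classes outside the allow-list, and nothing limits run time or memory, so an infinite loop will hang the notebook. It is appropriate for research demos, educational PAL systems and controlled environments. Production systems need stronger isolation, such as a separate process with resource limits or a container.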
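As one possible step toward stronger isolation, the generated code can be run in a separate Python process with a timeout. The sketch below is an assumption-laden illustration, not part of the notebook: `execute_in_subprocess` is a hypothetical name, and the answer is passed back as JSON on the last line of standard output. A subprocess on its own does not restrict file or network access. It only adds a time limit and keeps crashes out of the notebook kernel, so it complements the restricted namespace and does not replace a container.

```python
import json
import subprocess
import sys
from typing import Any, Dict


def execute_in_subprocess(code: str, timeout_s: float = 5.0) -> Dict[str, Any]:
    """Run code in a child Python process and read `answer` back as JSON."""
    # Append a line that prints the answer so the parent can parse it from stdout
    script = code + "\nimport json as _json\nprint(_json.dumps(answer))\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", script],  # -I: isolated mode
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {'success': False, 'error': f"Timed out after {timeout_s} seconds"}
    if proc.returncode != 0:
        # The last stderr line of a traceback names the exception
        stderr_lines = proc.stderr.strip().splitlines()
        return {'success': False, 'error': stderr_lines[-1] if stderr_lines else "Unknown error"}
    return {'success': True, 'answer': json.loads(proc.stdout.strip().splitlines()[-1])}


print(execute_in_subprocess("answer = 6 * 7"))
```

The return value has the same shape as that of `execute_code()`, so the two could be swapped without changing the calling code.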

## Create complete PAL solver
Now we combine code generation and execution into a complete system. The solver makes a single attempt per problem and returns either the answer or the execution error.

In [5]:
class PALSolver:
    """
    Program-Aided Language Model solver.
    Solves problems by generating and executing Python code.
    """
    
    def solve(self, problem: str) -> Dict[str, Any]:
        """
        Solve a problem using PAL.
        
        Args:
            problem: Natural language problem
        
        Returns:
            Dictionary with 'problem', 'code', 'success', and either 'answer' or 'error'
        """
        print("=" * 60)
        print("PAL SOLVER")
        print("=" * 60)
        print(f"Problem: {problem}\n")
        
        # Generate code
        print("[Generating Code]")
        code = generate_code(problem)
        print("Code:")
        for i, line in enumerate(code.split('\n'), 1):
            print(f"  {i:2d} | {line}")
        
        # Execute code
        print("\n[Executing Code]")
        result = execute_code(code)
        
        if result['success']:
            print(f"✓ Success: {result['answer']}")
            return {
                'problem': problem,
                'code': code,
                'answer': result['answer'],
                'success': True
            }
        else:
            print(f"✗ Failed: {result['error']}")
            return {
                'problem': problem,
                'code': code,
                'success': False,
                'error': result['error']
            }


# Test the solver
solver = PALSolver()
result = solver.solve(
    "A circle has radius 7 cm. What is its area? Use pi = 3.14159"
)

PAL SOLVER
Problem: A circle has radius 7 cm. What is its area? Use pi = 3.14159

[Generating Code]
Code:
   1 | # Calculate the area of a circle with a given radius
   2 | radius = 7  # radius in centimeters
   3 | pi = 3.14159  # value of pi
   4 | answer = pi * (radius ** 2)  # area formula: πr²

[Executing Code]
✓ Success: 153.93791


The `PALSolver` class orchestrates the complete workflow: generate code, execute it, return results. The verbose output provides transparency into the process.
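`PALSolver.solve()` makes one attempt per problem. A retry loop that feeds the error message back into the next generation could be layered on top. The sketch below is one hypothetical way to do that: the function name `solve_with_retries` and the wording of the feedback prompt are assumptions, and the `generate` and `execute` parameters stand in for `generate_code` and `execute_code`.

```python
from typing import Any, Callable, Dict


def solve_with_retries(
    problem: str,
    generate: Callable[[str], str],
    execute: Callable[[str], Dict[str, Any]],
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Regenerate code until execution succeeds or the attempts run out."""
    prompt = problem
    result: Dict[str, Any] = {'success': False, 'error': 'No attempts made'}
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt)
        result = execute(code)
        result.update({'code': code, 'attempts': attempt})
        if result['success']:
            break
        # Feed the failure back so the next generation can try to correct it
        prompt = (
            f"{problem}\n\nThe previous code failed with: {result['error']}\n"
            f"Previous code:\n{code}"
        )
    return result
```

In this notebook the call would look like `solve_with_retries(problem, generate_code, execute_code)`. Passing the two functions as arguments also makes the loop easy to test with stubs that do not call the API.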

## Test PAL on various problems
Let's test PAL on different types of computational problems to demonstrate its versatility and accuracy.

In [6]:
# Test problems
test_problems = [
    "What is 23 × 47?",
    "A train travels 120 km in 2 hours. What is its speed in km/h?",
    "If a $45 shirt is 30% off, what is the sale price?",
    "Convert 100°F to Celsius using C = (F - 32) × 5/9",
]

print("=" * 60)
print("TESTING PAL ON MULTIPLE PROBLEMS")
print("=" * 60)

for i, problem in enumerate(test_problems, 1):
    print(f"\n{i}. {problem}")
    result = solver.solve(problem)
    print()

TESTING PAL ON MULTIPLE PROBLEMS

1. What is 23 × 47?
PAL SOLVER
Problem: What is 23 × 47?

[Generating Code]
Code:
   1 | # Calculate the product of 23 and 47
   2 | num1 = 23
   3 | num2 = 47
   4 | answer = num1 * num2

[Executing Code]
✓ Success: 1081


2. A train travels 120 km in 2 hours. What is its speed in km/h?
PAL SOLVER
Problem: A train travels 120 km in 2 hours. What is its speed in km/h?

[Generating Code]
Code:
   1 | # Calculate the speed of the train in km/h
   2 | distance = 120  # distance traveled in kilometers
   3 | time = 2        # time taken in hours
   4 | answer = distance / time  # speed is distance divided by time

[Executing Code]
✓ Success: 60.0


3. If a $45 shirt is 30% off, what is the sale price?
PAL SOLVER
Problem: If a $45 shirt is 30% off, what is the sale price?

[Generating Code]
Code:
   1 | # Calculate the sale price of a shirt after applying a discount
   2 | original_price = 45  # original price of the shirt
   3 | discount_percentage = 30  #

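These tests exercise PAL on several problem types. By offloading computation to code execution, PAL avoids the arithmetic slips that direct text generation is prone to, although the generated program can still encode the wrong logic, so answers are worth checking against known values where possible.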
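One way to check the solver's answers is to compare them with reference values computed directly. The sketch below does this for the four test problems; the `expected` dictionary and the `matches_expected` helper are hypothetical additions, and a relative tolerance is used because floating-point results may differ in the last digits.

```python
import math

# Reference values computed directly, for comparison with the solver's answers
expected = {
    "What is 23 × 47?": 23 * 47,
    "A train travels 120 km in 2 hours. What is its speed in km/h?": 120 / 2,
    "If a $45 shirt is 30% off, what is the sale price?": 45 * (1 - 30 / 100),
    "Convert 100°F to Celsius using C = (F - 32) × 5/9": (100 - 32) * 5 / 9,
}


def matches_expected(problem: str, answer: float, rel_tol: float = 1e-9) -> bool:
    """Compare an answer to the reference value, tolerating float rounding."""
    return math.isclose(answer, expected[problem], rel_tol=rel_tol)
```

Inside the test loop, `matches_expected(problem, result['answer'])` could be called whenever `result['success']` is true.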

Program-Aided Language Models (PAL) improve reasoning accuracy by separating understanding from computation. The LLM translates problems to code, while code execution provides precise calculations.

**When to use PAL:**
- Mathematical word problems requiring precision.
- Multi-step computational tasks.
- Problems where showing work matters.
- Tasks requiring algorithmic logic.

**Implementation tips:**
- Always sandbox code execution.
- Validate generated code before running.
- Use clear prompts with examples.
- Handle errors gracefully with retries.
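The tip about validating generated code before running it can be illustrated with Python's `ast` module, which parses code without executing it. The sketch below is a minimal example under stated assumptions: `validate_code` and `BLOCKED_NODES` are hypothetical names, and the list of rejected constructs is illustrative, not exhaustive.

```python
import ast

# Node types rejected outright in this sketch (illustrative, not exhaustive)
BLOCKED_NODES = (ast.Import, ast.ImportFrom, ast.Global, ast.Nonlocal)


def validate_code(code: str) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"SyntaxError: {e.msg} (line {e.lineno})"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, BLOCKED_NODES):
            problems.append(f"line {node.lineno}: {type(node).__name__} is not allowed")
        # Dunder attributes such as __class__ are a common route out of a restricted namespace
        elif isinstance(node, ast.Attribute) and node.attr.startswith('__'):
            problems.append(f"line {node.lineno}: access to {node.attr} is not allowed")
    # Require that the code assigns to `answer` somewhere
    assigns_answer = any(
        isinstance(n, ast.Name) and n.id == 'answer' and isinstance(n.ctx, ast.Store)
        for n in ast.walk(tree)
    )
    if not assigns_answer:
        problems.append("code never assigns to 'answer'")
    return problems


print(validate_code("import os\nanswer = os.getcwd()"))
```

A check like this catches obvious problems early and gives clearer error messages, but it is a filter on top of sandboxed execution and not a substitute for it.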

PAL demonstrates a powerful paradigm: let LLMs translate to code, let computers execute precisely. This plays to each component's strengths.