In [3]:
from pdf_summariser import summarise_url
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

url = "https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf"

In [None]:
# OpenAI GPT
model = "gpt-4.1-nano"
client = OpenAI(api_key=api_key)
pdf_summary = summarise_url(client, url, model=model)
print(pdf_summary)
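
Rather than pasting the returned dictionary into a later cell, the result could be cached on disk. This is a minimal sketch, assuming the return value is a plain JSON-serialisable dict, as the printed output suggests. `save_summary`, `load_summary`, and the file name used below are hypothetical helpers, not part of `pdf_summariser`.

```python
import json
from pathlib import Path


def save_summary(summary: dict, path: str) -> None:
    """Write a summary dict to disk as UTF-8 JSON (hypothetical helper)."""
    text = json.dumps(summary, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_summary(path: str) -> dict:
    """Read back a summary dict previously written by save_summary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Calling `save_summary(pdf_summary, "pdf_summary.json")` once after the API call, and `pdf_summary = load_summary("pdf_summary.json")` in later sessions, would avoid repeating the request.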

In [1]:
pdf_summary = {'response_id': 'resp_0fc9878f2d45213200693ba97cfd288193a873b26c0e7af002',
 'model': 'gpt-4.1-nano-2025-04-14',
 'summary': 'The document titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" presents a comprehensive analysis of Large Reasoning Models (LRMs) and their reasoning capabilities across various puzzle environments. The key points, important details, and conclusions are summarized below:\n\n### Key Points:\n- **Objective:** To systematically investigate the reasoning capabilities and limitations of frontier LRMs using controllable puzzle environments that allow precise manipulation of problem complexity.\n- **Methodology:** Use of four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) to evaluate models\' reasoning processes, internal traces, and performance across different complexity levels.\n- **Models Analyzed:** Several models including Claude-3.7-Sonnet (thinking/non-thinking), DeepSeek-R1/V3, and o3-mini, with access to reasoning traces.\n- **Evaluation Approach:** Focus on both final answer accuracy and internal reasoning traces, including solution correctness, reasoning effort, and failure analysis.\n\n### Important Details:\n- **Findings on Performance:**\n  - LRMs fail to develop generalizable reasoning beyond certain complexity thresholds, with performance collapsing to zero at high complexity.\n  - A counterintuitive scaling limit was observed: reasoning effort (tokens used for thinking) initially increases with complexity but then decreases despite increasing problem difficulty.\n  - Three reasoning regimes identified:\n    1. Low complexity: Non-thinking models outperform reasoning models.\n    2. Medium complexity: Reasoning models show an advantage.\n    3. High complexity: Both models collapse.\n- **Analysis of Reasoning Traces:**\n  - Models often overthink in simple problems, exploring incorrect solutions early.\n  - In moderate problems, correct solutions emerge later, indicating extensive exploration.\n  - In complex problems, models fail to find correct solutions, often fixating on early wrong answers.\n- **Limitations in Exact Computation:**\n  - LRMs struggle with explicit algorithms and exhibit inconsistent reasoning across puzzles.\n  - Providing explicit algorithms (e.g., Tower of Hanoi solution) does not significantly improve performance.\n- **Behavioral Insights:**\n  - Models show non-monotonic failure patterns, with errors occurring at different points in the solution sequence depending on problem size.\n  - Failure move distributions suggest models are more unstable and prone to inconsistent reasoning at higher complexities.\n- **Scaling and Complexity:**\n  - Compositional depth (number of moves) scales exponentially or quadratically with problem size, depending on the puzzle.\n  - Performance correlates negatively with compositional depth, but this relationship varies across puzzle types.\n- **Open Questions:**\n  - Why do models reduce reasoning effort at high complexity?\n  - Can models generate solutions with explicit algorithms effectively?\n  - Are current evaluation paradigms sufficient to understand reasoning capabilities?\n\n### Conclusions:\n- **Fundamental Limitations:** Despite sophisticated self-reflection mechanisms, current LRMs do not exhibit robust, generalizable reasoning beyond moderate complexity.\n- **Scaling Barriers:** There are inherent scaling limits in reasoning effort and accuracy, with models failing to utilize additional compute effectively at high complexity.\n- **Implications:** The findings challenge assumptions about the reasoning capabilities of LRMs and suggest the need for new approaches, including better evaluation methods and possibly hybrid symbolic-neural systems.\n- **Future Directions:** Further research is needed to understand the symbolic manipulation capabilities, improve reasoning robustness, and explore environments that enable controlled experimentation.\n\n### Limitations:\n- The study is limited to controlled puzzle environments, which may not fully capture real-world reasoning complexity.\n- Use of black-box API models restricts internal analysis.\n- Precise validation relies on deterministic puzzle simulators, which may not generalize to less structured domains.\n\nThis work provides critical insights into the current state of reasoning models, highlighting their strengths, weaknesses, and fundamental barriers to scalable, general reasoning.'}


In [2]:
from IPython.display import Markdown, display

display(Markdown(f"## PDF Summary\n\n{pdf_summary['summary']}"))

## PDF Summary

The document titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" presents a comprehensive analysis of Large Reasoning Models (LRMs) and their reasoning capabilities across various puzzle environments. The key points, important details, and conclusions are summarized below:

### Key Points:
- **Objective:** To systematically investigate the reasoning capabilities and limitations of frontier LRMs using controllable puzzle environments that allow precise manipulation of problem complexity.
- **Methodology:** Use of four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) to evaluate models' reasoning processes, internal traces, and performance across different complexity levels.
- **Models Analyzed:** Several models including Claude-3.7-Sonnet (thinking/non-thinking), DeepSeek-R1/V3, and o3-mini, with access to reasoning traces.
- **Evaluation Approach:** Focus on both final answer accuracy and internal reasoning traces, including solution correctness, reasoning effort, and failure analysis.

### Important Details:
- **Findings on Performance:**
  - LRMs fail to develop generalizable reasoning beyond certain complexity thresholds, with performance collapsing to zero at high complexity.
  - A counterintuitive scaling limit was observed: reasoning effort (tokens used for thinking) initially increases with complexity but then decreases despite increasing problem difficulty.
  - Three reasoning regimes identified:
    1. Low complexity: Non-thinking models outperform reasoning models.
    2. Medium complexity: Reasoning models show an advantage.
    3. High complexity: Both models collapse.
- **Analysis of Reasoning Traces:**
  - Models often overthink in simple problems, exploring incorrect solutions early.
  - In moderate problems, correct solutions emerge later, indicating extensive exploration.
  - In complex problems, models fail to find correct solutions, often fixating on early wrong answers.
- **Limitations in Exact Computation:**
  - LRMs struggle with explicit algorithms and exhibit inconsistent reasoning across puzzles.
  - Providing explicit algorithms (e.g., Tower of Hanoi solution) does not significantly improve performance.
- **Behavioral Insights:**
  - Models show non-monotonic failure patterns, with errors occurring at different points in the solution sequence depending on problem size.
  - Failure move distributions suggest models are more unstable and prone to inconsistent reasoning at higher complexities.
- **Scaling and Complexity:**
  - Compositional depth (number of moves) scales exponentially or quadratically with problem size, depending on the puzzle.
  - Performance correlates negatively with compositional depth, but this relationship varies across puzzle types.
- **Open Questions:**
  - Why do models reduce reasoning effort at high complexity?
  - Can models generate solutions with explicit algorithms effectively?
  - Are current evaluation paradigms sufficient to understand reasoning capabilities?

### Conclusions:
- **Fundamental Limitations:** Despite sophisticated self-reflection mechanisms, current LRMs do not exhibit robust, generalizable reasoning beyond moderate complexity.
- **Scaling Barriers:** There are inherent scaling limits in reasoning effort and accuracy, with models failing to utilize additional compute effectively at high complexity.
- **Implications:** The findings challenge assumptions about the reasoning capabilities of LRMs and suggest the need for new approaches, including better evaluation methods and possibly hybrid symbolic-neural systems.
- **Future Directions:** Further research is needed to understand the symbolic manipulation capabilities, improve reasoning robustness, and explore environments that enable controlled experimentation.

### Limitations:
- The study is limited to controlled puzzle environments, which may not fully capture real-world reasoning complexity.
- Use of black-box API models restricts internal analysis.
- Precise validation relies on deterministic puzzle simulators, which may not generalize to less structured domains.

This work provides critical insights into the current state of reasoning models, highlighting their strengths, weaknesses, and fundamental barriers to scalable, general reasoning.
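
The exponential case mentioned under "Scaling and Complexity" can be made concrete with Tower of Hanoi, whose shortest solution for N disks has 2^N - 1 moves. The sketch below is illustrative and is not the paper's simulator. `hanoi_moves` generates the standard recursive solution, and `is_valid_solution` replays a move list against the puzzle rules, in the spirit of the deterministic checks the summary mentions. Both names are assumptions introduced here.

```python
def hanoi_moves(n, source=0, spare=1, target=2):
    """Return the optimal move list for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, target, spare)   # park n-1 disks on the spare peg
        + [(source, target)]                        # move the largest disk
        + hanoi_moves(n - 1, spare, source, target) # bring the n-1 disks on top
    )


def is_valid_solution(n, moves):
    """Replay moves on three pegs; True if all are legal and peg 2 ends full."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk at the bottom of peg 0
    for src, dst in moves:
        if not pegs[src]:
            return False  # no disk to move
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


for n in range(1, 6):
    moves = hanoi_moves(n)
    print(n, len(moves), is_valid_solution(n, moves))
```

The move count doubles (plus one) with every added disk, which is why a few extra disks are enough to push a solution past any fixed token budget.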

In [9]:
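# Google Gemini
from google import genai
from google.genai import types
import httpx

# With no arguments, the client reads its API key from the environment
# (GEMINI_API_KEY or GOOGLE_API_KEY)
gclient = genai.Client()

# Download the PDF as raw bytes and fail early on an HTTP error
pdf_response = httpx.get(url)
pdf_response.raise_for_status()
doc_data = pdf_response.content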

prompt = (
    "You are a concise document summariser. Read the attached PDF and return a clear, "
    "structured summary with key points, important details, and conclusions."
)
response = gclient.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=doc_data, mime_type="application/pdf"),
        prompt,
    ],
)
print(response.text)
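
A failed or redirected download could hand the model an HTML error page rather than a PDF. A small guard, sketched here as an assumption and not part of the original notebook, can check the file signature before calling `generate_content`. `looks_like_pdf` is a hypothetical helper that relies only on the fact that PDF files begin with the bytes `%PDF-`.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF file signature."""
    return data.startswith(b"%PDF-")
```

In the cell above, a check such as `if not looks_like_pdf(doc_data): raise ValueError("download did not return a PDF")` placed right after the download would surface the problem before any tokens are spent.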


In [10]:
display(Markdown(f"## Gemini PDF Summary\n\n{response.text}"))

## Gemini PDF Summary

This paper, "The Illusion of Thinking," investigates the fundamental strengths and limitations of Large Reasoning Models (LRMs) like Claude 3.7 Sonnet Thinking and DeepSeek-R1, particularly how their reasoning capabilities scale with problem complexity.

**Key Points:**

1.  **Controllable Evaluation:** The authors introduce a novel evaluation paradigm using four controllable puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World). These allow precise manipulation of compositional complexity and detailed analysis of internal reasoning traces, circumventing data contamination issues common in standard benchmarks.
2.  **Accuracy Collapse:** State-of-the-art LRMs exhibit a complete accuracy collapse beyond certain complexity thresholds across all puzzle environments, indicating a lack of generalizable problem-solving capabilities.
3.  **Counter-intuitive Scaling Limit in Reasoning Effort:** LRMs' reasoning effort (measured by inference tokens) initially increases with problem complexity but then *declines* sharply as problems approach the critical collapse point, despite having ample token budget. This suggests a fundamental scaling limitation.
4.  **Three Reasoning Regimes:**
    *   **Low Complexity:** Standard (non-thinking) LLMs often surprisingly outperform LRMs and are more token-efficient.
    *   **Medium Complexity:** LRMs demonstrate an advantage due to their detailed thinking processes.
    *   **High Complexity:** Both thinking and non-thinking models experience complete performance collapse.

**Important Details:**

*   **Analysis of Reasoning Traces:**
    *   **"Overthinking" (Low Complexity):** For simpler problems, LRMs often identify correct solutions early but then inefficiently continue exploring incorrect alternatives.
    *   **Moderate Complexity:** Correct solutions emerge later in the thinking process, often after extensive exploration of incorrect paths.
    *   **High Complexity:** Models consistently fail to find any correct solutions.
*   **Limitations in Exact Computation:**
    *   Providing explicit algorithms (e.g., for Tower of Hanoi) in the prompt does not significantly improve LRM performance, suggesting limitations in consistent logical step execution and verification, not just solution discovery.
    *   Models show inconsistent reasoning across puzzle types; for instance, Claude 3.7 Sonnet performs well on Tower of Hanoi (N=5, 31 moves) but fails early on River Crossing (N=3, 11 moves), implying sensitivity to training data distribution.

**Conclusion:**

The study concludes that despite sophisticated self-reflection mechanisms, current LRMs possess fundamental limitations in generalizable reasoning, exhibit counter-intuitive scaling behaviors, and struggle with exact computation and consistent logical execution. These findings challenge prevailing assumptions about LRM capabilities and underscore the need for new approaches and evaluation paradigms to advance towards more robust artificial intelligence.
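
The summary quotes a solution length of 11 moves for River Crossing at N=3. As a rough cross-check, the sketch below searches a simplified model of the puzzle breadth-first. It assumes the rule that no actor may share a bank or the boat with another pair's agent unless their own agent is present, and a boat that carries between one and `capacity` people. `is_safe` and `min_crossings` are illustrative names, and the paper's own simulator may differ in detail.

```python
from collections import deque
from itertools import combinations


def is_safe(group):
    """A group is safe if it has no agents, or every actor's own agent is present."""
    agents = {i for kind, i in group if kind == "agent"}
    actors = {i for kind, i in group if kind == "actor"}
    return not agents or actors <= agents


def min_crossings(n, capacity):
    """Breadth-first search for the fewest boat trips; None if unsolvable."""
    everyone = frozenset((kind, i) for kind in ("actor", "agent") for i in range(n))
    start = (everyone, "left")  # (people on the left bank, boat position)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (left, boat), trips = queue.popleft()
        if not left:
            return trips  # everyone has reached the right bank
        bank = left if boat == "left" else everyone - left
        for size in range(1, capacity + 1):
            for riders in combinations(sorted(bank), size):
                riders = frozenset(riders)
                new_left = left - riders if boat == "left" else left | riders
                # The boat and both banks must all satisfy the rule after the trip
                if not (is_safe(riders) and is_safe(new_left)
                        and is_safe(everyone - new_left)):
                    continue
                state = (new_left, "right" if boat == "left" else "left")
                if state not in seen:
                    seen.add(state)
                    queue.append((state, trips + 1))
    return None


print(min_crossings(3, 2))
```

Under these assumptions the search reports 11 trips for three pairs and a two-person boat, consistent with the figure quoted in the summary.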