In [3]:
from pdf_summariser import summarise_url
from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

url = "https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf"



In [None]:
# GPT
model = 'gpt-4.1-nano'
client = OpenAI(api_key=api_key)
pdf_summary = summarise_url(client, url, model=model)
print(pdf_summary)

In [1]:
pdf_summary = {'response_id': 'resp_0fc9878f2d45213200693ba97cfd288193a873b26c0e7af002',
 'model': 'gpt-4.1-nano-2025-04-14',
 'summary': 'The document titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" presents a comprehensive analysis of Large Reasoning Models (LRMs) and their reasoning capabilities across various puzzle environments. The key points, important details, and conclusions are summarized below:\n\n### Key Points:\n- **Objective:** To systematically investigate the reasoning capabilities and limitations of frontier LRMs using controllable puzzle environments that allow precise manipulation of problem complexity.\n- **Methodology:** Use of four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) to evaluate models\' reasoning processes, internal traces, and performance across different complexity levels.\n- **Models Analyzed:** Several models including Claude-3.7-Sonnet (thinking/non-thinking), DeepSeek-R1/V3, and o3-mini, with access to reasoning traces.\n- **Evaluation Approach:** Focus on both final answer accuracy and internal reasoning traces, including solution correctness, reasoning effort, and failure analysis.\n\n### Important Details:\n- **Findings on Performance:**\n  - LRMs fail to develop generalizable reasoning beyond certain complexity thresholds, with performance collapsing to zero at high complexity.\n  - A counterintuitive scaling limit was observed: reasoning effort (tokens used for thinking) initially increases with complexity but then decreases despite increasing problem difficulty.\n  - Three reasoning regimes identified:\n    1. Low complexity: Non-thinking models outperform reasoning models.\n    2. Medium complexity: Reasoning models show an advantage.\n    3. 
High complexity: Both models collapse.\n- **Analysis of Reasoning Traces:**\n  - Models often overthink in simple problems, exploring incorrect solutions early.\n  - In moderate problems, correct solutions emerge later, indicating extensive exploration.\n  - In complex problems, models fail to find correct solutions, often fixating on early wrong answers.\n- **Limitations in Exact Computation:**\n  - LRMs struggle with explicit algorithms and exhibit inconsistent reasoning across puzzles.\n  - Providing explicit algorithms (e.g., Tower of Hanoi solution) does not significantly improve performance.\n- **Behavioral Insights:**\n  - Models show non-monotonic failure patterns, with errors occurring at different points in the solution sequence depending on problem size.\n  - Failure move distributions suggest models are more unstable and prone to inconsistent reasoning at higher complexities.\n- **Scaling and Complexity:**\n  - Compositional depth (number of moves) scales exponentially or quadratically with problem size, depending on the puzzle.\n  - Performance correlates negatively with compositional depth, but this relationship varies across puzzle types.\n- **Open Questions:**\n  - Why do models reduce reasoning effort at high complexity?\n  - Can models generate solutions with explicit algorithms effectively?\n  - Are current evaluation paradigms sufficient to understand reasoning capabilities?\n\n### Conclusions:\n- **Fundamental Limitations:** Despite sophisticated self-reflection mechanisms, current LRMs do not exhibit robust, generalizable reasoning beyond moderate complexity.\n- **Scaling Barriers:** There are inherent scaling limits in reasoning effort and accuracy, with models failing to utilize additional compute effectively at high complexity.\n- **Implications:** The findings challenge assumptions about the reasoning capabilities of LRMs and suggest the need for new approaches, including better evaluation methods and possibly hybrid symbolic-neural 
systems.\n- **Future Directions:** Further research is needed to understand the symbolic manipulation capabilities, improve reasoning robustness, and explore environments that enable controlled experimentation.\n\n### Limitations:\n- The study is limited to controlled puzzle environments, which may not fully capture real-world reasoning complexity.\n- Use of black-box API models restricts internal analysis.\n- Precise validation relies on deterministic puzzle simulators, which may not generalize to less structured domains.\n\nThis work provides critical insights into the current state of reasoning models, highlighting their strengths, weaknesses, and fundamental barriers to scalable, general reasoning.'}

'The document titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" presents a comprehensive analysis of Large Reasoning Models (LRMs) and their reasoning capabilities across various puzzle environments. The key points, important details, and conclusions are summarized below:\n\n### Key Points:\n- **Objective:** To systematically investigate the reasoning capabilities and limitations of frontier LRMs using controllable puzzle environments that allow precise manipulation of problem complexity.\n- **Methodology:** Use of four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) to evaluate models\' reasoning processes, internal traces, and performance across different complexity levels.\n- **Models Analyzed:** Several models including Claude-3.7-Sonnet (thinking/non-thinking), DeepSeek-R1/V3, and o3-mini, with access to reasoning traces.\n- **Evaluation Approach:** Focus on both

In [2]:
from IPython.display import Markdown, display

display(Markdown(f"## PDF Summary\n\n{pdf_summary['summary']}"))

## PDF Summary

The document titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" presents a comprehensive analysis of Large Reasoning Models (LRMs) and their reasoning capabilities across various puzzle environments. The key points, important details, and conclusions are summarized below:

### Key Points:
- **Objective:** To systematically investigate the reasoning capabilities and limitations of frontier LRMs using controllable puzzle environments that allow precise manipulation of problem complexity.
- **Methodology:** Use of four puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) to evaluate models' reasoning processes, internal traces, and performance across different complexity levels.
- **Models Analyzed:** Several models including Claude-3.7-Sonnet (thinking/non-thinking), DeepSeek-R1/V3, and o3-mini, with access to reasoning traces.
- **Evaluation Approach:** Focus on both final answer accuracy and internal reasoning traces, including solution correctness, reasoning effort, and failure analysis.

### Important Details:
- **Findings on Performance:**
  - LRMs fail to develop generalizable reasoning beyond certain complexity thresholds, with performance collapsing to zero at high complexity.
  - A counterintuitive scaling limit was observed: reasoning effort (tokens used for thinking) initially increases with complexity but then decreases despite increasing problem difficulty.
  - Three reasoning regimes identified:
    1. Low complexity: Non-thinking models outperform reasoning models.
    2. Medium complexity: Reasoning models show an advantage.
    3. High complexity: Both models collapse.
- **Analysis of Reasoning Traces:**
  - Models often overthink in simple problems, exploring incorrect solutions early.
  - In moderate problems, correct solutions emerge later, indicating extensive exploration.
  - In complex problems, models fail to find correct solutions, often fixating on early wrong answers.
- **Limitations in Exact Computation:**
  - LRMs struggle with explicit algorithms and exhibit inconsistent reasoning across puzzles.
  - Providing explicit algorithms (e.g., Tower of Hanoi solution) does not significantly improve performance.
- **Behavioral Insights:**
  - Models show non-monotonic failure patterns, with errors occurring at different points in the solution sequence depending on problem size.
  - Failure move distributions suggest models are more unstable and prone to inconsistent reasoning at higher complexities.
- **Scaling and Complexity:**
  - Compositional depth (number of moves) scales exponentially or quadratically with problem size, depending on the puzzle.
  - Performance correlates negatively with compositional depth, but this relationship varies across puzzle types.
- **Open Questions:**
  - Why do models reduce reasoning effort at high complexity?
  - Can models generate solutions with explicit algorithms effectively?
  - Are current evaluation paradigms sufficient to understand reasoning capabilities?

### Conclusions:
- **Fundamental Limitations:** Despite sophisticated self-reflection mechanisms, current LRMs do not exhibit robust, generalizable reasoning beyond moderate complexity.
- **Scaling Barriers:** There are inherent scaling limits in reasoning effort and accuracy, with models failing to utilize additional compute effectively at high complexity.
- **Implications:** The findings challenge assumptions about the reasoning capabilities of LRMs and suggest the need for new approaches, including better evaluation methods and possibly hybrid symbolic-neural systems.
- **Future Directions:** Further research is needed to understand the symbolic manipulation capabilities, improve reasoning robustness, and explore environments that enable controlled experimentation.

### Limitations:
- The study is limited to controlled puzzle environments, which may not fully capture real-world reasoning complexity.
- Use of black-box API models restricts internal analysis.
- Precise validation relies on deterministic puzzle simulators, which may not generalize to less structured domains.

This work provides critical insights into the current state of reasoning models, highlighting their strengths, weaknesses, and fundamental barriers to scalable, general reasoning.

In [86]:
# url = "https://arxiv.org/pdf/2511.08042"
url = "https://arxiv.org/pdf/2512.05156"

In [87]:
## google
from google import genai
from google.genai import types
import httpx

gclient = genai.Client()

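# Download the PDF as raw bytes; follow redirects and fail loudly on HTTP errors
pdf_response = httpx.get(url, follow_redirects=True)
pdf_response.raise_for_status()
doc_data = pdf_response.content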
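
A download can succeed at the HTTP level and still return an HTML error or landing page instead of a PDF. The following is a minimal sketch of a sanity check to run on the downloaded bytes before sending them to the model. `looks_like_pdf` is a hypothetical helper introduced here for illustration; it is not part of httpx or google-genai.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker."""
    # PDF files begin with "%PDF-"; tolerate leading whitespace before the header.
    return data.lstrip()[:5] == b"%PDF-"


# Illustrative payloads (assumptions, not real downloads)
sample_pdf = b"%PDF-1.7\nrest of file"
sample_html = b"<!DOCTYPE html><html><body>Not found</body></html>"

print(looks_like_pdf(sample_pdf))
print(looks_like_pdf(sample_html))
```

In the notebook this could be applied as `assert looks_like_pdf(doc_data)` right after the download.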


In [88]:
prompt = """You are a concise document summariser. Read the attached PDF and
            return a clear, structured summary with key points, important details, and conclusions.
            Present your output in four sections: 'Title', 'Abstract', 'Summary' and 'Extended Summary'.

            In 'Title', extract the document title.
            In 'Abstract', extract the document abstract.
            In 'Summary', include only the most critical information in brief bullet points.
            In 'Extended Summary', provide a more detailed explanation with relevant context.
            Format your response using markdown with appropriate headings and bullet points.
            """
response = gclient.models.generate_content(
  model="gemini-2.5-flash",
  contents=[
      types.Part.from_bytes(
        data=doc_data,
        mime_type='application/pdf',
      ),
      prompt])
print(response.text)

# Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations

## Abstract
Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context C into answer A via prompt Q. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from C to Q and A are modeled as transition matrices Q and A encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the

In [96]:
import re

# Extract title from the first # heading
lines = response.text.split('\n')
title = None

for line in lines:
    if line.strip().startswith('#'):
        # Remove the # symbols and strip whitespace
        title = re.sub(r'^#+\s*', '', line).strip()
        break

print(f"Extracted Title: {title}")


Extracted Title: Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations
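
The same heading-based approach can pull out every section the prompt asks for, not only the title. The following is a rough sketch, assuming the model returns markdown with `#`-style headings as in the output above. `split_markdown_sections` is a hypothetical helper; it does not skip `#` lines inside fenced code, so it suits simple summaries only.

```python
import re


def split_markdown_sections(markdown_text):
    """Map each markdown heading to the text that follows it."""
    sections = {}
    current = None
    for line in markdown_text.split("\n"):
        match = re.match(r"^#+\s*(.*\S)\s*$", line)
        if match:
            # Start a new section keyed by the heading text
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    # Join the collected body lines and trim surrounding blank lines
    return {heading: "\n".join(body).strip() for heading, body in sections.items()}


# Illustrative input (an assumption, not a real model reply)
sample = "# My Paper\n\n## Abstract\nWe study X.\n\n## Summary\n- point one\n- point two"
sections = split_markdown_sections(sample)
print(list(sections))
print(sections["Abstract"])
```

Because dictionaries preserve insertion order, the first key is the document title when the reply starts with a top-level heading.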


In [27]:
import json
import re

# Extract JSON from markdown code blocks if present
text = response.text
match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
if match:
    json_text = match.group(1)
else:
    json_text = text

# Parse to JSON
json_output = json.loads(json_text)
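
Model replies are not always clean JSON: the object may sit inside a code fence, or be surrounded by a sentence of prose, and `json.loads` then raises. The following is a minimal sketch of a more forgiving parser that tries several candidates in turn. `extract_json` is a hypothetical helper, and the brace-matching fallback is a heuristic rather than a full parser.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly to keep this cell easy to paste
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*([\s\S]*?)\s*" + FENCE)


def extract_json(text):
    """Parse JSON from a reply that may be fenced or wrapped in prose; None on failure."""
    candidates = []
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        candidates.append(fenced.group(1))
    # Fallback: widest span from the first '{' to the last '}'
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])
    candidates.append(text)
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


# Illustrative replies (assumptions, not real model output)
fenced_reply = "Here you go:\n" + FENCE + 'json\n{"Title": "T"}\n' + FENCE
print(extract_json(fenced_reply))
print(extract_json('Sure! {"a": 1} Hope that helps.'))
```

Returning `None` instead of raising lets the caller decide whether to retry the request or fall back to treating the reply as markdown.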

In [32]:
json_output.get('Extended Summary')

"The paper introduces the Kamiwaza Agentic Merit Index (KAMI) v0.1, a new benchmark designed to address the shortcomings of traditional LLM evaluations for enterprise-relevant agentic AI. Existing benchmarks suffer from issues like training data contamination and a failure to measure true agentic capabilities such as multi-step tool use, decision-making under uncertainty, and real-world task completion. KAMI v0.1 leverages the PICARD framework to ensure contamination resistance and realistic task environments, focusing on common enterprise use cases like filesystem operations, text search/extraction, CSV processing, and database querying.\n\nThe evaluation involved 35 unique model configurations across AMD MI300X, Intel Gaudi 3, and Anthropic's API, processing over 5.5 billion tokens across 170,000 test conversations. Key findings reveal a 'persistent agentic disconnect': models that perform well on academic benchmarks often underperform on KAMI's practical enterprise tasks, and vice v

In [62]:
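def concat_str_list(str_list):
    """Join a list of strings with newlines; pass a plain string through unchanged."""
    if isinstance(str_list, str):
        return str_list
    if isinstance(str_list, list):
        return "\n".join(str(item) for item in str_list)
    # None or any other type: return an empty string instead of an implicit None
    return ""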

In [72]:
json_output.get("Summary")

"Traditional LLM benchmarks often fail to predict real-world agentic performance due to data contamination and lack of agentic capability measurement.\nThe Kamiwaza Agentic Merit Index (KAMI) v0.1 is introduced as an enterprise-focused benchmark, using the PICARD framework for contamination resistance and agentic evaluation.\nEvaluations across 35 model configurations and over 5.5 billion tokens reveal a 'persistent agentic disconnect', where traditional benchmark rankings poorly predict practical agentic performance.\nNewer models (e.g., Llama 4, Qwen 3) do not consistently outperform older variants (e.g., Qwen 2.5 72B) on enterprise-relevant tasks.\nReasoning models significantly increase token usage and wall-time for a modest accuracy gain, suggesting poor cost-performance tradeoffs for common tasks.\nSubtle changes in prompts and tool feedback can drastically impact LLM performance and cost, emphasizing the importance of 'context engineering'."

In [73]:
summary = concat_str_list(json_output.get("Summary"))
ext_summary = json_output.get('Extended Summary')
title = json_output.get("Title", "Untitled")  # Notion title limit is [:100]
print(title)

content = f"""
**Summary**
{summary}

**Extended Summary**
{ext_summary}
"""

print(content)

Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations

**Summary**
Traditional LLM benchmarks often fail to predict real-world agentic performance due to data contamination and lack of agentic capability measurement.
The Kamiwaza Agentic Merit Index (KAMI) v0.1 is introduced as an enterprise-focused benchmark, using the PICARD framework for contamination resistance and agentic evaluation.
Evaluations across 35 model configurations and over 5.5 billion tokens reveal a 'persistent agentic disconnect', where traditional benchmark rankings poorly predict practical agentic performance.
Newer models (e.g., Llama 4, Qwen 3) do not consistently outperform older variants (e.g., Qwen 2.5 72B) on enterprise-relevant tasks.
Reasoning models significantly increase token usage and wall-time for a modest accuracy gain, suggesting poor cost-performance tradeoffs for common tasks.
Subtle changes in prompts and tool feedback can dr

In [None]:
from dotenv import load_dotenv
import os

load_dotenv()
NOTION_TOKEN = os.getenv("NOTION_TOKEN")
DATABASE_ID = os.getenv("NOTION_DATABASE_ID")

In [90]:
from notion_client import Client
notion = Client(auth=NOTION_TOKEN)

In [94]:
import json
from notion_client import Client


def chunk_content(content, chunk_size=2000):
    """
    Split content into chunks of max chunk_size characters, breaking at word boundaries.

    Args:
        content: String content to chunk
        chunk_size: Maximum characters per chunk (default 2000)

    Returns:
        List of content chunks
    """
    chunks = []
    while len(content) > 0:
        if len(content) <= chunk_size:
            chunks.append(content)
            break

        # Find the last space before chunk_size to avoid breaking words
        chunk_end = chunk_size
        last_space = content.rfind(' ', 0, chunk_size)

        if last_space > 0:
            chunk_end = last_space

        chunks.append(content[:chunk_end].strip())
        content = content[chunk_end:].strip()

    return chunks


def add_content_to_page(notion_token, page_id, content):
    """
    Add chunked content to a Notion page as paragraph blocks.

    Args:
        notion_token: Notion API token
        page_id: ID of the page to add content to
        content: String content to add
    """
    notion = Client(auth=notion_token)
    chunks = chunk_content(content, chunk_size=2000)

    for chunk in chunks:
        notion.blocks.children.append(
            block_id=page_id,
            children=[
                {
                    "object": "block",
                    "type": "paragraph",
                    "paragraph": {
                        "rich_text": [
                            {
                                "type": "text",
                                "text": {
                                    "content": chunk
                                }
                            }
                        ]
                    }
                }
            ]
        )


def write_to_notion(title, url, content, notion_token, database_id):
    """
    Create a page in a Notion database and append the content as paragraph blocks.

    Args:
        title: Document title
        url: Document URL
        content: Content to add to the page (will be chunked if needed)
        notion_token: Your Notion API token
        database_id: The ID of your 'Technical Resources 2025-2026' database
    """

    # Initialize the Notion client
    notion = Client(auth=notion_token)

    # Create a new page in the database
    new_page = notion.pages.create(
        parent={"database_id": database_id},
        properties={
            "Title": {
                "title": [
                    {
                        "text": {
                            "content": title
                        }
                    }
                ]
            },
            "URL": {
                "url": url
            }
        }
    )

    # Add content to the page if provided
    if content:
        add_content_to_page(notion_token, new_page['id'], content)

    return new_page
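
`add_content_to_page` above makes one API request per chunk. The following sketch shows one way to group the blocks so that fewer requests are needed. It assumes the Notion limit of 100 child blocks per append request, which should be checked against the current request-limits documentation. `paragraph_block` and `batch_blocks` are hypothetical helpers, and the actual API call is left as a comment because it needs a live token and page.

```python
def paragraph_block(text):
    """Build a Notion paragraph block holding one plain-text chunk."""
    return {
        "object": "block",
        "type": "paragraph",
        "paragraph": {"rich_text": [{"type": "text", "text": {"content": text}}]},
    }


def batch_blocks(blocks, batch_size=100):
    """Split a list of blocks into consecutive batches of at most batch_size."""
    return [blocks[i:i + batch_size] for i in range(0, len(blocks), batch_size)]


# Illustrative chunks (an assumption; in the notebook these come from chunk_content)
chunks = [f"chunk {i}" for i in range(250)]
batches = batch_blocks([paragraph_block(chunk) for chunk in chunks])
print([len(batch) for batch in batches])

# With a real client and page, each batch would be sent in a single request:
# for batch in batches:
#     notion.blocks.children.append(block_id=page_id, children=batch)
```

For the short summaries in this notebook the difference is small, but batching keeps long documents well under API rate limits.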


In [None]:
result = write_to_notion(title, url, content, NOTION_TOKEN, DATABASE_ID)
print(f"Page created successfully: {result['id']}")

Page created successfully: 2e2be5eb-a41a-81e6-86e1-f0ae2803c8ea


In [95]:
# add_content_to_page returns None; it appends the blocks to the page in place
add_content_to_page(NOTION_TOKEN, result['id'], response.text)

In [55]:
json_output.get("Extended Summary", "")

"The paper introduces the Kamiwaza Agentic Merit Index (KAMI) v0.1, a new benchmark designed to address the shortcomings of traditional LLM evaluations for enterprise-relevant agentic AI. Existing benchmarks suffer from issues like training data contamination and a failure to measure true agentic capabilities such as multi-step tool use, decision-making under uncertainty, and real-world task completion. KAMI v0.1 leverages the PICARD framework to ensure contamination resistance and realistic task environments, focusing on common enterprise use cases like filesystem operations, text search/extraction, CSV processing, and database querying.\n\nThe evaluation involved 35 unique model configurations across AMD MI300X, Intel Gaudi 3, and Anthropic's API, processing over 5.5 billion tokens across 170,000 test conversations. Key findings reveal a 'persistent agentic disconnect': models that perform well on academic benchmarks often underperform on KAMI's practical enterprise tasks, and vice v

In [None]:
display(Markdown(f"## Gemini PDF Summary\n\n{response.text}"))

## Gemini PDF Summary

This paper, "The Illusion of Thinking," investigates the fundamental strengths and limitations of Large Reasoning Models (LRMs) like Claude 3.7 Sonnet Thinking and DeepSeek-R1, particularly how their reasoning capabilities scale with problem complexity.

**Key Points:**

1.  **Controllable Evaluation:** The authors introduce a novel evaluation paradigm using four controllable puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World). These allow precise manipulation of compositional complexity and detailed analysis of internal reasoning traces, circumventing data contamination issues common in standard benchmarks.
2.  **Accuracy Collapse:** State-of-the-art LRMs exhibit a complete accuracy collapse beyond certain complexity thresholds across all puzzle environments, indicating a lack of generalizable problem-solving capabilities.
3.  **Counter-intuitive Scaling Limit in Reasoning Effort:** LRMs' reasoning effort (measured by inference tokens) initially increases with problem complexity but then *declines* sharply as problems approach the critical collapse point, despite having ample token budget. This suggests a fundamental scaling limitation.
4.  **Three Reasoning Regimes:**
    *   **Low Complexity:** Standard (non-thinking) LLMs often surprisingly outperform LRMs and are more token-efficient.
    *   **Medium Complexity:** LRMs demonstrate an advantage due to their detailed thinking processes.
    *   **High Complexity:** Both thinking and non-thinking models experience complete performance collapse.

**Important Details:**

*   **Analysis of Reasoning Traces:**
    *   **"Overthinking" (Low Complexity):** For simpler problems, LRMs often identify correct solutions early but then inefficiently continue exploring incorrect alternatives.
    *   **Moderate Complexity:** Correct solutions emerge later in the thinking process, often after extensive exploration of incorrect paths.
    *   **High Complexity:** Models consistently fail to find any correct solutions.
*   **Limitations in Exact Computation:**
    *   Providing explicit algorithms (e.g., for Tower of Hanoi) in the prompt does not significantly improve LRM performance, suggesting limitations in consistent logical step execution and verification, not just solution discovery.
    *   Models show inconsistent reasoning across puzzle types; for instance, Claude 3.7 Sonnet performs well on Tower of Hanoi (N=5, 31 moves) but fails early on River Crossing (N=3, 11 moves), implying sensitivity to training data distribution.

**Conclusion:**

The study concludes that despite sophisticated self-reflection mechanisms, current LRMs possess fundamental limitations in generalizable reasoning, exhibit counter-intuitive scaling behaviors, and struggle with exact computation and consistent logical execution. These findings challenge prevailing assumptions about LRM capabilities and underscore the need for new approaches and evaluation paradigms to advance towards more robust artificial intelligence.

In [98]:
from utils import markdown_to_notion_blocks
blocks = markdown_to_notion_blocks(response.text)
print(blocks)

[{'object': 'block', 'type': 'heading_1', 'heading_1': {'rich_text': [{'type': 'text', 'text': {'content': 'Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations'}}]}}, {'object': 'block', 'type': 'heading_2', 'heading_2': {'rich_text': [{'type': 'text', 'text': {'content': 'Abstract'}}]}}, {'object': 'block', 'type': 'paragraph', 'paragraph': {'rich_text': [{'type': 'text', 'text': {'content': 'Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context C into answer A via prompt Q. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from C to Q and A are modeled as tra
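
The source of `utils.markdown_to_notion_blocks` is not shown in this notebook. The following is a rough sketch of how a converter producing the block shapes printed above could look. It is not the author's implementation: `simple_markdown_to_blocks` and `text_block` are hypothetical names, it handles only headings (levels 1 to 3), bullets, and paragraphs, and it leaves inline formatting such as `**bold**` as literal text.

```python
import re


def text_block(block_type, text):
    """Build a Notion block of the given type with a single plain-text run."""
    return {
        "object": "block",
        "type": block_type,
        block_type: {"rich_text": [{"type": "text", "text": {"content": text}}]},
    }


def simple_markdown_to_blocks(markdown_text):
    """Convert markdown line by line into heading, bullet, and paragraph blocks."""
    blocks = []
    for raw_line in markdown_text.split("\n"):
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate blocks
        heading = re.match(r"^(#{1,3})\s+(.*)$", line)
        bullet = re.match(r"^[-*]\s+(.*)$", line)
        if heading:
            level = len(heading.group(1))
            blocks.append(text_block(f"heading_{level}", heading.group(2)))
        elif bullet:
            blocks.append(text_block("bulleted_list_item", bullet.group(1)))
        else:
            blocks.append(text_block("paragraph", line))
    return blocks


# Illustrative input (an assumption, not a real model reply)
sample = "# Title\n\n## Abstract\nSome text.\n- first point"
print([block["type"] for block in simple_markdown_to_blocks(sample)])
```

Each line becomes its own block, so a paragraph that wraps over several lines in the markdown produces several paragraph blocks.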

In [99]:
result['id']

'2e2be5eb-a41a-81e6-86e1-f0ae2803c8ea'

In [101]:
from utils import add_content_to_page
add_content_to_page(NOTION_TOKEN, result['id'], response.text)