# Yellhorn MCP Example in Notebook with LLM Manager

Instructions: swap the model or adjust the LLM Manager config to exercise different behaviors: normal, full chunking, full with retry, and full chunking with retry.

## Setup Environment

First, let's set up our environment and import the necessary modules:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import asyncio
import json
from typing import Dict, List, Optional, Union, Any
from pathlib import Path

# Import required Yellhorn MCP components
from yellhorn_mcp.token_counter import TokenCounter
from yellhorn_mcp.llm_manager import LLMManager

# Import API clients
from google import genai
from openai import AsyncOpenAI

from yellhorn_mcp.server import format_metrics_section, calculate_cost

## Configure API Keys

Set up API keys for Gemini and/or OpenAI. You can either set them in environment variables or directly in this notebook:

In [None]:
# Option 1: Set API keys directly (not recommended for production)
GEMINI_API_KEY = ""
OPENAI_API_KEY = ""
MODEL = "gemini-2.0-flash-exp"  # or any OpenAI model like "gpt-4o"
REPO_PATH = os.path.dirname(os.getcwd())

# Option 2: Get API keys from environment variables (recommended)
# GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
# OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

# Set environment variables for server access
# os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY
# os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# os.environ["REPO_PATH"] = REPO_PATH
# os.environ["YELLHORN_MCP_MODEL"] = MODEL
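Option 2 can be wrapped in a small helper that fails early when a key is missing. This is a minimal sketch: `require_env` is a hypothetical helper name for this notebook, not part of Yellhorn MCP.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for this notebook only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (uncomment once the variables are exported in your shell):
# GEMINI_API_KEY = require_env("GEMINI_API_KEY")
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Failing at this point gives a clearer message than an authentication error raised later by an API client.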

## 1. Working with LLMManager

Now let's set up and use LLMManager, which provides unified access to different LLM APIs with automatic chunking.

In [4]:
# Initialize API clients
gemini_client = genai.Client(api_key=GEMINI_API_KEY)
openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

# Create LLMManager with custom configuration
config = {
    "safety_margin_tokens": 200,
    "overlap_ratio": 0.1,
    "aggregation_strategy": "concatenate",
    "chunk_strategy": "paragraph",
    # Experimental limits to test chunking & retry behavior
    "model_limits" : {
        "gpt-4o": 30000,
        "gemini-2.0-flash-exp": 10000, 
        "o4-mini": 30000,
        "gemini-2.5-pro": 1_000_000
    }
}

llm_manager = LLMManager(
    openai_client=openai_client,
    gemini_client=gemini_client,
    config=config
)

def log_callback(level, message):
    """Custom log callback function."""
    print(f"[{level.upper()}] {message}")
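To make the configuration above more concrete, the sketch below shows one plausible way the limits could combine into a per-chunk budget. This notebook does not show how LLMManager actually applies `safety_margin_tokens` and `overlap_ratio`, so the arithmetic and the name `chunk_budget` are assumptions.

```python
from typing import Dict


def chunk_budget(model_limit: int, safety_margin_tokens: int, overlap_ratio: float) -> Dict[str, int]:
    """Illustrative arithmetic only; not LLMManager's actual logic."""
    # Tokens left for the prompt once the safety margin is reserved
    usable = model_limit - safety_margin_tokens
    if usable <= 0:
        raise ValueError("safety margin leaves no room for the prompt")
    # Tokens shared between consecutive chunks (assumed interpretation)
    overlap = int(usable * overlap_ratio)
    return {"usable_tokens": usable, "overlap_tokens": overlap}


# Values taken from the config above for gemini-2.0-flash-exp
print(chunk_budget(model_limit=10_000, safety_margin_tokens=200, overlap_ratio=0.1))
```

Lowering a model's entry in `model_limits` shrinks the usable budget, which is how the experimental limits above force chunking on prompts that would otherwise fit.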

In [5]:
# Function to make API calls and handle async operations
async def call_model(prompt, model, system_message=None, response_format=None):
    """Helper function to call a model using LLMManager."""
    try:
        response_dict = await llm_manager.call_llm_with_usage(
            prompt=prompt,
            model=model,
            temperature=0.0,
            system_message=system_message,
            response_format=response_format
        )
        return response_dict
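    except Exception as e:
        # Log and re-raise: callers index the result as a dict, so returning
        # an error string would only surface later as a confusing TypeError.
        print(f"Error: {e}")
        raise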
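The instructions at the top mention testing retry behavior, but this notebook does not show how retries are implemented. As a rough illustration of the idea only, the sketch below wraps any async call in exponential backoff. `with_retries` and its parameters are hypothetical names, not Yellhorn MCP APIs.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    make_call: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await make_call(), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

In this notebook it could wrap a direct call, for example `await with_retries(lambda: llm_manager.call_llm_with_usage(prompt=prompt, model=model))`.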

## 2. Test LLM Manager Simple vs Chunking

### Simple vs Chunked

In [6]:
# Example 2: OpenAI call with JSON response
json_prompt = "Generate a list of 3 programming languages with their key features."

model = "gpt-4.1"
openai_json_response = await call_model(
    prompt=json_prompt,
    model=model,  # or any available OpenAI model
    # response_format="json"
)

print("OpenAI JSON Response:")
print("-" * 50)
print(openai_json_response["usage_metadata"])
print(openai_json_response["content"])
print(format_metrics_section(model, openai_json_response["usage_metadata"]))

OpenAI JSON Response:
--------------------------------------------------
<yellhorn_mcp.llm_manager.UsageMetadata object at 0x114cf6f50>
Sure! Here’s a list of 3 programming languages with their key features:

---

**1. Python**
- Easy-to-read, clean syntax
- Extensive standard library
- Dynamically typed and interpreted
- Large community support
- Widely used for web development, data science, automation, and AI

---

**2. JavaScript**
- Runs natively in web browsers
- Event-driven, asynchronous programming support
- Prototype-based object orientation
- Essential for front-end web development
- Large ecosystem with frameworks like React, Angular, and Vue

---

**3. Java**
- Statically typed and compiled to bytecode (runs on JVM)
- Strong object-oriented programming support
- Platform-independent (“write once, run anywhere”)
- Robust standard library and tools
- Commonly used for enterprise applications, Android development, and backend systems

---


---
## Completion Metrics
*   **Mod

In [None]:
from yellhorn_mcp.llm_manager import ChunkingStrategy
from yellhorn_mcp.token_counter import TokenCounter

# Reuse one counter for both splitting and counting
token_counter = TokenCounter()
chunk_model = "gemini-2.0-flash-exp"

chunks = ChunkingStrategy.split_by_paragraphs(
    text=json_prompt,
    max_tokens=5000,
    token_counter=token_counter,
    model=chunk_model
)

# Print the token count of each chunk
for chunk in chunks:
    print(token_counter.count_tokens(chunk, chunk_model))
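For intuition about what paragraph-based splitting does, here is a minimal greedy sketch. It is not the `ChunkingStrategy.split_by_paragraphs` implementation: it approximates tokens by whitespace-separated words, and `approx_tokens` and `split_paragraphs_greedy` are hypothetical names.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def split_paragraphs_greedy(text: str, max_tokens: int) -> List[str]:
    """Pack whole paragraphs into chunks without exceeding max_tokens.

    A single paragraph larger than max_tokens becomes its own oversized chunk;
    a real implementation would split it further.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue
        size = approx_tokens(paragraph)
        # Close the current chunk if this paragraph would overflow it
        if current and current_tokens + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
for chunk in split_paragraphs_greedy(sample, max_tokens=5):
    print(approx_tokens(chunk), repr(chunk))
```

Splitting on paragraph boundaries keeps each chunk readable on its own, at the price of chunks that are often smaller than the budget allows.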

In [6]:
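# Example 1: Long prompt (repeated 1000 times) to exercise chunking.
# Note: despite the "Gemini" names below, this run uses gpt-4o-mini.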
prompt = "Explain what token chunking is and why it's important for large language models."*1000
system_message = "You are a helpful AI assistant that provides clear and concise explanations."

model = "gpt-4o-mini"
gemini_response = await call_model(
    prompt=prompt,
    model=model,
    system_message=system_message
)

print("Gemini Response:")
print("-" * 50)
print(gemini_response["content"])
print(format_metrics_section(model,gemini_response["usage_metadata"]))

Gemini Response:
--------------------------------------------------
**Token Chunking: An Overview**

**What is Token Chunking?**
Token chunking is a technique used in natural language processing (NLP) to break down text into smaller, manageable pieces called "tokens." These tokens can be words, phrases, or even characters, depending on the context and the specific application. The process involves segmenting a continuous stream of text into discrete units that can be analyzed or processed by algorithms.

**Why is Token Chunking Important for Large Language Models?**

1. **Efficiency**: Large language models (LLMs) often deal with vast amounts of text data. Token chunking allows these models to process text in smaller segments, making computations more efficient and manageable.

2. **Context Preservation**: By chunking text into meaningful units, models can better understand the context and relationships between words or phrases. This is crucial for tasks like sentiment analysis, transl

In [11]:
gemini_response["usage_metadata"]

<yellhorn_mcp.llm_manager.UsageMetadata at 0x10cfdd810>
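Displaying the object only shows its memory address. Later cells in this notebook read `prompt_tokens`, `completion_tokens`, and `total_tokens` from it, so a small helper can make it readable. The sketch assumes only those three attributes; `usage_to_dict` is a hypothetical helper, and `SimpleNamespace` stands in for the real `UsageMetadata`.

```python
from types import SimpleNamespace
from typing import Any, Dict

USAGE_FIELDS = ("prompt_tokens", "completion_tokens", "total_tokens")


def usage_to_dict(usage: Any) -> Dict[str, int]:
    """Collect the token counts from a usage object into a plain dict.

    Missing or None attributes are reported as 0.
    """
    return {field: getattr(usage, field, 0) or 0 for field in USAGE_FIELDS}


# Stand-in object with made-up numbers; in the notebook, pass
# gemini_response["usage_metadata"] instead.
fake_usage = SimpleNamespace(prompt_tokens=120, completion_tokens=30, total_tokens=150)
print(usage_to_dict(fake_usage))
```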

### Test Grounded Search

Google Search Grounding is a feature available for Gemini models that allows them to search the web and include citations in their responses. This is particularly useful for getting up-to-date information and verifying facts.

In [31]:
# Import the search grounding utilities
from yellhorn_mcp.search_grounding import _get_gemini_search_tools, add_citations_from_metadata

In [32]:
# Example 2: Compare responses with and without search grounding
comparison_prompt = "What is the current stock price of Tesla (TSLA) and its recent performance?"

# First, make a call without search grounding by not passing tools
response_without_search = await llm_manager.call_llm_with_usage(
    prompt=comparison_prompt,
    model="gemini-2.5-pro",
    temperature=0.0
)

print("Response WITHOUT Search Grounding:")
print("-" * 50)
print(response_without_search["content"])

print("\n" + "="*70 + "\n")

Response WITHOUT Search Grounding:
--------------------------------------------------
Of course. Here is a summary of Tesla's (TSLA) stock price and its recent performance.

As an AI, I cannot give you real-time, up-to-the-second stock data. Stock prices are highly volatile and change constantly during market hours.

However, I can provide you with the most recent closing price and a summary of its recent performance.

### **Tesla (TSLA) Stock Price**

For the most current, live price, please check a reliable financial news source like:

*   **Google Finance**
*   **Yahoo Finance**
*   **Bloomberg**
*   **Reuters**

As of the market close on **June 17, 2024**, the approximate stock price for Tesla (TSLA) was:

*   **~$187.44**

### **Recent Performance Summary**

Tesla's stock has had a very eventful and volatile year in 2024. Here is a breakdown of its recent performance:

*   **Last Trading Day (June 17, 2024):** The stock saw a significant gain of over **+5%**. This surge was largel

In [35]:
# Now make the same call with search grounding
response_with_search = await llm_manager.call_llm_with_citations(
    prompt=comparison_prompt,
    model="gemini-2.5-pro",
    temperature=0.0,
    tools=_get_gemini_search_tools("gemini-2.5-pro")
)

print("Response WITH Search Grounding:")
print("-" * 50)
print(response_with_search["content"])

# Show if citations were found
if "grounding_metadata" in response_with_search:
    grounding_meta = response_with_search["grounding_metadata"]
    if hasattr(grounding_meta, 'grounding_chunks') and grounding_meta.grounding_chunks:
        print(f"\n✓ Found {len(grounding_meta.grounding_chunks)} citation sources")
else:
    print("\n✗ No grounding metadata found")

Response WITH Search Grounding:
--------------------------------------------------
## Tesla Stock Shows Volatility with Recent Dip But Long-Term Gains

**As of Friday, July 11, 2025, Tesla (TSLA) closed at approximately $313.51, reflecting a slight increase of 1.17% in the last 24 hours of trading.** This comes amid a period of mixed performance for the electric vehicle giant.

While the daily performance shows a modest gain, a broader look reveals a recent downturn. The stock has seen a decline of 1.41% over the past week and a more significant drop of 6.25% over the last month.

However, looking at the longer-term picture, Tesla's stock has demonstrated substantial growth. Over the last year, it has surged by 19.07%, and in the last 12 months, the price has risen by 26.40%. This indicates underlying strength and investor confidence in the company's future prospects. The 52-week trading range for the stock has been between a low of $182.00 and a high of $488.54.

Tesla's market capita
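The grounded-search cell only counts citation sources. To list them, each grounding chunk can be inspected. The sketch below assumes every web chunk exposes `web.title` and `web.uri`, as in the google-genai grounding metadata types. The stand-in objects and URLs are made up for illustration, and `format_sources` is a hypothetical helper, separate from the notebook's `add_citations_from_metadata`.

```python
from types import SimpleNamespace
from typing import Any, List


def format_sources(grounding_metadata: Any) -> List[str]:
    """Return one 'n. title - uri' line per grounding chunk with a web source."""
    lines: List[str] = []
    chunks = getattr(grounding_metadata, "grounding_chunks", None) or []
    for chunk in chunks:
        web = getattr(chunk, "web", None)
        if web is None:
            continue  # skip chunks without a web source
        title = getattr(web, "title", None) or "untitled"
        lines.append(f"{len(lines) + 1}. {title} - {web.uri}")
    return lines


# Made-up metadata for demonstration; in the notebook, pass
# response_with_search["grounding_metadata"] instead.
fake_metadata = SimpleNamespace(
    grounding_chunks=[
        SimpleNamespace(web=SimpleNamespace(title="Example Finance", uri="https://example.com/tsla")),
        SimpleNamespace(web=None),
    ]
)
for line in format_sources(fake_metadata):
    print(line)
```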

### Test OpenAI Deep Research

OpenAI Deep Research models (`o3-deep-research` and `o4-mini-deep-research`) are specialized models that can perform in-depth research and analysis. According to the CHANGELOG, these models automatically have access to `web_search_preview` and `code_interpreter` tools.

In [None]:
# Example 3: Compare regular model vs Deep Research model on the same task
comparison_task = """
Create a very short and concise implementation plan for adding real-time collaboration features 
to a code editor, similar to Google Docs but for code. Consider the most relevant python libraries and write example code and run it:
- Operational Transformation vs CRDTs
- WebSocket vs WebRTC
- Conflict resolution strategies
- Performance at scale
"""

print("Comparing Regular vs Deep Research Models\n")

# First try with regular gpt-4o
regular_model = "gpt-4o"
try:
    regular_response = await llm_manager.call_llm_with_usage(
        prompt=comparison_task,
        model=regular_model,
        temperature=0.0
    )

    print(f"Regular Model response: {regular_response['content']}")
    
    print(f"Regular Model ({regular_model}) Response Length: {len(regular_response['content'])} chars")
    print(f"Token Usage: {regular_response['usage_metadata'].total_tokens}")
    estimated_cost = calculate_cost(
        regular_model,
        regular_response['usage_metadata'].prompt_tokens,
        regular_response['usage_metadata'].completion_tokens
    )
    print(f"Estimated Cost: ${estimated_cost:.4f}")
    
except Exception as e:
    print(f"Error with {regular_model}: {str(e)}")

print("\n" + "-"*50 + "\n")

Comparing Regular vs Deep Research Models



Regular Model response:  {'### Implementation Plan for Real-Time Collaboration in a Code Editor\n\n#### 1. Choose a Synchronization Model\n- **Operational Transformation (OT)**: Suitable for text-based collaboration, widely used in Google Docs.\n- **CRDTs (Conflict-free Replicated Data Types)**: Better for decentralized systems, handles conflicts naturally.\n\n**Decision**: Use OT for simplicity and existing library support.\n\n#### 2. Communication Protocol\n- **WebSocket**: Efficient for real-time, bidirectional communication.\n- **WebRTC**: More complex, used for peer-to-peer connections.\n\n**Decision**: Use WebSocket for server-client architecture.\n\n#### 3. Conflict Resolution Strategy\n- Implement server-side logic to handle conflicts using OT.\n- Use a central server to maintain the document state and broadcast changes.\n\n#### 4. Performance at Scale\n- Use horizontal scaling with load balancers.\n- Implement sharding for document storage.\n- Optimize WebSocket connections wi
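`calculate_cost` takes a model name plus prompt and completion token counts. Its pricing table is not shown in this notebook, so the sketch below only illustrates the usual per-million-token arithmetic. The model name and prices are placeholders, not real rates, and `estimate_cost` is a hypothetical function.

```python
from typing import Dict, Optional

# Placeholder prices in USD per 1M tokens; illustrative values only.
EXAMPLE_PRICING: Dict[str, Dict[str, float]] = {
    "example-model": {"input": 2.0, "output": 8.0},
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Optional[float]:
    """Return the estimated cost in USD, or None when the model has no price entry."""
    pricing = EXAMPLE_PRICING.get(model)
    if pricing is None:
        return None
    total = prompt_tokens * pricing["input"] + completion_tokens * pricing["output"]
    return total / 1_000_000


print(estimate_cost("example-model", prompt_tokens=1_000_000, completion_tokens=500_000))
```

Returning `None` for unknown models keeps a missing price entry distinguishable from a genuinely free call.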

In [9]:
# Now try with deep research model
deep_model = "o4-mini-deep-research"
try:
    deep_response = await llm_manager.call_llm_with_usage(
        prompt=comparison_task,
        model=deep_model,
        temperature=0.0
    )

    print(f"Deep Model response: {deep_response['content']}")
    
    print(f"Deep Research Model ({deep_model}) Response Length: {len(deep_response['content'])} chars")
    print(f"Token Usage: {deep_response['usage_metadata'].total_tokens}")
    estimated_cost = calculate_cost(
        deep_model,
        deep_response['usage_metadata'].prompt_tokens,
        deep_response['usage_metadata'].completion_tokens
    )
    print(f"Estimated Cost: ${estimated_cost:.4f}")
    
    # Show a snippet of the response to see the difference
    print("\nFirst 500 chars of Deep Research response:")
    print(deep_response['content'][:500] + "...")
    
except Exception as e:
    print(f"Error with {deep_model}: {str(e)}")

Deep Research Model (o4-mini-deep-research) Response Length: 3340 chars
Token Usage: 0
Estimated Cost: $0.0000

First 500 chars of Deep Research response:
- **OT vs CRDT:** Modern editors often favor CRDTs (like Ypy/Y-CRDT) for simpler merge semantics and offline edits.  CRDTs ensure *eventual consistency* without explicit coordination ([www.codingeasypeasy.com](https://www.codingeasypeasy.com/blog/build-a-real-time-collaborative-editor-with-fastapi-and-crdts-a-comprehensive-guide#:~:text=scenario,are%20applied%20in%20different%20orders)), whereas OT requires complex transform functions and a central server.  For example, a toy CRDT string merge (...


In [11]:
print(deep_response['content'])

- **OT vs CRDT:** Modern editors often favor CRDTs (like Ypy/Y-CRDT) for simpler merge semantics and offline edits.  CRDTs ensure *eventual consistency* without explicit coordination ([www.codingeasypeasy.com](https://www.codingeasypeasy.com/blog/build-a-real-time-collaborative-editor-with-fastapi-and-crdts-a-comprehensive-guide#:~:text=scenario,are%20applied%20in%20different%20orders)), whereas OT requires complex transform functions and a central server.  For example, a toy CRDT string merge (using unique IDs) might look like:
  
  ```python
  doc1 = []
  doc2 = []
  doc1.append(('1_A', 'A'))   # user1 inserts 'A'
  doc2.append(('1_B', 'B'))   # user2 inserts 'B'
  merged = sorted(doc1 + doc2, key=lambda x: x[0])
  print(''.join(char for _,char in merged))  # AB
  ```

- **WebSocket vs WebRTC:** Use WebSockets for a server-based broadcast model (e.g. with Python’s `websockets` or FastAPI), as in many CRDT stacks.  (WebRTC/P2P is possible but adds browser–signaling overhead.)  For ins

### Key Differences Between Search Grounding and Deep Research

1. **Search Grounding (Gemini)**:
   - Automatically searches the web for relevant information
   - Adds inline citations to responses
   - Best for factual queries requiring current information
   - May incur additional cost beyond regular API usage, depending on plan and quota

2. **Deep Research Models (OpenAI)**:
   - Specialized models with web search and code interpreter tools
   - Designed for complex, multi-step research tasks
   - Can execute code and analyze results
   - Higher cost but more comprehensive analysis
   - May require special API access

Both features enhance the LLM's ability to provide accurate, up-to-date information, but they serve different use cases and have different cost/performance tradeoffs.

In [13]:
title = "Replace existing model with unified Model Calling Service (OpenRouter or Litellm)"
description = """
Describe how to replace and consolidate the LLM model calls in yellhorn (Gemini, OpenAI, etc.) with OpenRouter or LiteLLM.
"""

user_task = f"Title: {title}, Description: {description}"

## 3. Test Curate Context at scale

In [7]:
from examples.mock_context import run_curate_context
from yellhorn_mcp.server import curate_context

In [8]:
# Call curate_context with our mock context
context_result = await (run_curate_context(
        curate_context_func=curate_context,
        user_task=user_task,
        repo_path=REPO_PATH,
        gemini_client=gemini_client,
        openai_client=openai_client,
        llm_manager=llm_manager,
        model=MODEL,
        codebase_reasoning="full",  # Use "none" for faster processing
        log_callback=log_callback
    )
)

print("Path to yellhorn context")
print(context_result)

[INFO] Starting .yellhorncontext file generation with gemini-2.0-flash-exp using full mode
[INFO] Using Git's tracking information - respecting .gitignore patterns
[INFO] Found .yellhornignore file at /Users/sravanj/project_work/yellhorn-mcp/.yellhornignore, will use it for filtering
[INFO] Applied .yellhornignore filtering: 16 of 55 files remain
[INFO] Extracted 3 directories from 16 filtered files
[INFO] Using full mode with file contents for codebase analysis
[INFO] Directory context:
<codebase_tree>
top_directory
	.DS_Store
		[binary .DS_Store contents omitted]
[INFO] An

[INFO] Analysis complete, found 1 important directories: yellhorn_mcp
[INFO] Processing complete, identified 1 important directories
[INFO] Added .gitignore patterns to .yellhorncontext
[INFO] Added .yellhornignore patterns to .yellhorncontext
[INFO] Successfully wrote .yellhorncontext file to /Users/sravanj/project_work/yellhorn-mcp/.yellhorncontext
[INFO] Generated .yellhorncontext file at /Users/sravanj/project_work/yellhorn-mcp/.yellhorncontext with 1 important directories, blacklist and whitelist patterns
[INFO] Waiting for 0 background tasks...
[INFO] All background tasks completed
Path to yellhorn context
Successfully created .yellhorncontext file at /Users/sravanj/project_work/yellhorn-mcp/.yellhorncontext with 1 important directories and existing ignore patterns from .yellhornignore.


#### Load .yellhorncontext

In [9]:
# Parse the path from the context_result string
import re

# Extract the file path using regex
match = re.search(r'at\s+(.+?)\s+with', context_result)
if match:
    context_file_path = match.group(1)
    print(f"Extracted path: {context_file_path}")
else:
    # Fallback: try to find any path-like string
    match = re.search(r'(/[^\s]+\.yellhorncontext)', context_result)
    if match:
        context_file_path = match.group(1)
        print(f"Extracted path: {context_file_path}")
    else:
        print("Could not extract path from result string")
        context_file_path = None

# Now you can use context_file_path to read the file
if context_file_path:
    with open(context_file_path, 'r') as f:
        context = f.read()
    print("Yellhorn context:")
    print(context)

Extracted path: /Users/sravanj/project_work/yellhorn-mcp/.yellhorncontext
Yellhorn context:
# Yellhorn Context File - AI context optimization
# Generated by yellhorn-mcp curate_context tool
# Based on task: Title: Replace existing model with unified Model Calling Service (OpenRouter or 

# Patterns from .gitignore file
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cac
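The path extraction earlier in this section can also be written as a reusable function. This is a sketch that reuses the same two regular expressions from the cell; `extract_context_path` is a hypothetical name and the sample message is made up in the same shape as the tool's output.

```python
import re
from typing import Optional

# Same two patterns as the cell: first "at <path> with", then any
# absolute path ending in .yellhorncontext.
PRIMARY = re.compile(r"at\s+(.+?)\s+with")
FALLBACK = re.compile(r"(/[^\s]+\.yellhorncontext)")


def extract_context_path(message: str) -> Optional[str]:
    """Return the .yellhorncontext path embedded in a status message, if any."""
    for pattern in (PRIMARY, FALLBACK):
        match = pattern.search(message)
        if match:
            return match.group(1)
    return None


# Made-up message for demonstration
sample = "Successfully created .yellhorncontext file at /tmp/demo/.yellhorncontext with 1 important directories."
print(extract_context_path(sample))
```

Parsing a path out of a human-readable message is brittle, since any change to the message wording breaks the primary pattern.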

## 4. Using Create Workplan

Now let's demonstrate how to use the create_workplan MCP tool to generate implementation plans.

In [6]:
from examples.mock_context import run_create_workplan, mock_github_command
from yellhorn_mcp.server import process_workplan_async

In [14]:
# Call process_workplan_async with our mock context
workplan_result = await (run_create_workplan(
        create_workplan_func=process_workplan_async,
        title=title,
        detailed_description=description,
        repo_path=REPO_PATH,
        gemini_client=gemini_client,
        openai_client=openai_client,
        llm_manager=llm_manager,
        model=MODEL,
        codebase_reasoning="full",
        log_callback=log_callback,
        github_command_func=mock_github_command,
        background_task_timeout=180
    )
)

print("Workplan Created:")
print("-" * 50)
print(f"Issue URL: {workplan_result['issue_url']}")
print(f"Issue Number: {workplan_result['issue_number']}")

[INFO] Using full mode with content retrieval for workplan generation
[INFO] Found .yellhorncontext file, using it for filtering
[INFO] Filtering codebase with 96 blacklist and 2 whitelist patterns from .yellhorncontext
[INFO] Filtered from 55 to 15 files
[INFO] Filled prompt for workplan generation:
You are an expert software developer tasked with creating a detailed workplan that will be published as a GitHub issue.
        
<codebase_tree>
top_directory
	.DS_Store
	.yellhorncontext
		```text
		# Yellhorn Context File - AI context optimization
		# Generated by yellhorn-mcp curate_context tool
		# Based on task: Title: Replace existing model with unified Model Calling Service (OpenRouter or 
		
		# Patterns from .gitignore file
		# Byte-compiled / optimized / DLL files
		__pycache__/
		*.py...
[INFO] Generating workplan with LLMManager for title: Replace existing model with unified Model Calling Service (OpenRouter or Litellm) with model gemini-2.0-flash-exp
[INFO] Attempting to enabl

[MOCK GitHub CLI] Running: gh issue edit 3026 --body-file /var/folders/3h/5k3yk1s13yn9cql9x6fl9g9h0000gn/T/tmpl__4v6b_.md
[MOCK GitHub CLI] In directory: /Users/sravanj/project_work/yellhorn-mcp
[MOCK GitHub CLI] File contents: # Replace existing model with unified Model Calling Service (OpenRouter or Litellm)

Okay, I wi[1](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGEsJKYkM6W5UfQVHbpXdztp5-_MRScG3t_GSmjWuZVU4Fdg7eeqwtNW7L27Mq5dE0I8TtxK2Uqe1Zp_Ks6Vn6ITfknLtTKMFOuLGJN-0E6mKZK14wSsMbCIuQla71l7dzJEmYoiJrq), [2](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE8T88k70ZJ6p3DaIB_3sBrDQyeOuJubTnAi8jnW6YEAbDORSng_gm_0eL4dmfh85o_lnHo1yWBDZcZXESmnUuGfksg8IHQ5o-R0KSOwrqlyAtM2XkPlGOE_P9QUwMPYKk4J8M9wxZJILf4IJWFsIXObGRiD9FjKiY7E0TcTwNuDvmED9qLn2jLiSuHSx5uRWWO6WjWdurAdQfqTKOYbboGuBB3pArp-3u5pW61VfHsEs2iaajRPbb9UtQ=), [3](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEPEtYxBsoYFACEkWgso-WQv2NIOAuAbC8s_SVyzl3WQcdhbg_4N1aSPSO6V1Hyvq

## 5. Using Judge Workplan

In [22]:
from examples.mock_context import run_judge_workplan, mock_github_command
from yellhorn_mcp.server import process_judgement_async

In [23]:
# Example workplan content (you would typically get this from a GitHub issue)
WORKPLAN_CONTENT = """
# Example Workplan: Add Token Counter Feature

## Summary
Implement a token counting system using tiktoken to prevent token overflow in LLM calls.

## Implementation Steps
1. Create TokenCounter class with tiktoken integration
2. Add token counting to LLMManager
3. Implement automatic prompt chunking when limits are exceeded
4. Update all LLM call sites to use the new system

## Files to Modify
- `yellhorn_mcp/llm_manager.py`: Add token counting integration
- `yellhorn_mcp/server.py`: Update LLM call sites

## New Files to Create
- `yellhorn_mcp/token_counter.py`: Core token counting functionality
"""

# Example diff content (you would typically get this from git diff)
DIFF_CONTENT = """
diff --git a/yellhorn_mcp/token_counter.py b/yellhorn_mcp/token_counter.py
new file mode 100644
index 0000000..1234567
--- /dev/null
+++ b/yellhorn_mcp/token_counter.py
@@ -0,0 +1,50 @@
+# Token counting utilities using tiktoken.
+
+import tiktoken
+from typing import Optional
+
+class TokenCounter:
+    \"\"\"Token counter using tiktoken for accurate token counting.\"\"\"
+    
+    def __init__(self, model: str):
+        self.model = model
+        self.encoding = tiktoken.encoding_for_model(model)
+    
+    def count_tokens(self, text: str) -> int:
+        \"\"\"Count tokens in the given text.\"\"\"
+        return len(self.encoding.encode(text))

diff --git a/yellhorn_mcp/llm_manager.py b/yellhorn_mcp/llm_manager.py
index abcd123..efgh456 100644
--- a/yellhorn_mcp/llm_manager.py
+++ b/yellhorn_mcp/llm_manager.py
@@ -10,6 +10,7 @@ from typing import Dict, List, Optional, Any, Union
 from openai import AsyncOpenAI
 import google.genai as genai
 from .usage_metadata import UsageMetadata
+from .token_counter import TokenCounter
 
 class LLMManager:
     \"\"\"Unified LLM manager with token counting and chunking support.\"\"\"
"""

# Judge workplan parameters
base_ref = "main"
head_ref = "feature/token-counter"
subissue_to_update = "456"  # GitHub issue number for the sub-issue to update
parent_workplan_issue_number = "123"  # Original workplan issue number
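Outside of this example, the diff would typically come from git rather than a hard-coded string. The sketch below shows one way to collect it with `subprocess`. The two-dot range syntax is an assumption about how the refs should be compared, and `build_diff_command` and `get_diff` are hypothetical helpers, not Yellhorn MCP functions.

```python
import subprocess
from typing import List


def build_diff_command(base_ref: str, head_ref: str) -> List[str]:
    """Build the git command comparing base_ref to head_ref."""
    return ["git", "diff", f"{base_ref}..{head_ref}"]


def get_diff(repo_path: str, base_ref: str, head_ref: str) -> str:
    """Run git diff inside repo_path and return its output as text."""
    result = subprocess.run(
        build_diff_command(base_ref, head_ref),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git fails
    )
    return result.stdout


# Example usage (requires both refs to exist in the repository):
# DIFF_CONTENT = get_diff(REPO_PATH, base_ref, head_ref)
```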

In [24]:
# Call judge_workplan with our mock context
await run_judge_workplan(
    process_judgement_func=process_judgement_async,
    workplan_content=WORKPLAN_CONTENT,
    diff_content=DIFF_CONTENT,
    base_ref=base_ref,
    head_ref=head_ref,
    subissue_to_update=subissue_to_update,
    parent_workplan_issue_number=parent_workplan_issue_number,
    repo_path=REPO_PATH,
    gemini_client=gemini_client,
    openai_client=openai_client,
    llm_manager=llm_manager,
    model=MODEL,
    base_commit_hash="abc123",  # Optional: actual commit hash
    head_commit_hash="def456",  # Optional: actual commit hash
    debug=False,  # Set to True to see the full prompt used
    codebase_reasoning="full",  # Options: "full", "lsp", "file_structure", "none"
    disable_search_grounding=False,
    github_command_func=mock_github_command,
    log_callback=log_callback,
    wait_for_background_tasks=True,
    background_task_timeout=180
)

print("\nJudgement completed successfully!")
print(f"Check GitHub sub-issue #{subissue_to_update} for the detailed judgement results.")

[INFO] Using full mode with content retrieval for judgement
[INFO] Found .yellhorncontext file, using it for filtering
[INFO] Filtering codebase with 96 blacklist and 2 whitelist patterns from .yellhorncontext
[INFO] Filtered from 55 to 15 files
[INFO] Filled prompt for judgement generation:
You are an expert code evaluator judging if a code diff correctly implements a workplan.

<codebase_tree>
top_directory
	.DS_Store
	.yellhorncontext
		```text
		# Yellhorn Context File - AI context optimization
		# Generated by yellhorn-mcp curate_context tool
		# Based on task: Title: Replace existing model with unified Model Calling Service (OpenRouter or 
		
		# Patterns from .gitignore file
		# Byte-compiled / optimized / DLL files
		__pycache__/
		*.py[cod]
		*$py.class
		
		# C extensions
...
[INFO] Generating judgement with LLMManager for model gemini-2.0-flash-exp
[INFO] Attempting to enable search grounding for model gemini-2.0-flash-exp
[INFO] Search tools configured for model gemini-2.0-

[INFO] Updating sub-issue #456 with generated judgement
[MOCK GitHub CLI] Running: gh issue edit 456 --body-file /var/folders/3h/5k3yk1s13yn9cql9x6fl9g9h0000gn/T/tmppxi1_ob4.md
[MOCK GitHub CLI] In directory: /Users/sravanj/project_work/yellhorn-mcp
[MOCK GitHub CLI] File contents: ## Comparison Metadata
- **Workplan Issue**: `#123`
- **Base Ref**: `main` (Commit: `abc123`)
- **Head Ref**: `feature/token-counter` (Commit: `def456`)
- **Codebase Reasoning Mode**: `full`
- **AI Model**: `gemini-2.0-flash-exp`

Okay, I will evaluate the code diffs and determine if they correctly implement the workplan, based on the provided file system information.

First, I need to understand the user's request, which is to "Replace existing model with unified Model Calling Service (OpenRouter or ". However, the request is incomplete. I will proceed assuming the user wants to replace an existing model with a unified Model Calling Service, potentially using OpenRouter.

Given the available files, especial