# Yellhorn MCP Example in Notebook with LLM Manager

Instruction: Swap the model to exercise different LLM Manager config behaviors (Normal, Test full chunking, Test full with retry, Test full chunking & retry).

## Setup Environment

First, let's set up our environment and import the necessary modules:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import asyncio
import json
from typing import Dict, List, Optional, Union, Any
from pathlib import Path

# Import required Yellhorn MCP components
from yellhorn_mcp.token_counter import TokenCounter
from yellhorn_mcp.llm_manager import LLMManager

# Import API clients
from google import genai
from openai import AsyncOpenAI

from yellhorn_mcp.server import format_metrics_section, calculate_cost

## Configure API Keys

Set up API keys for Gemini and/or OpenAI. You can either set them in environment variables or directly in this notebook:

In [None]:
# Option 1: Set API keys directly (not recommended for production)
GEMINI_API_KEY = ""
OPENAI_API_KEY = ""

MODEL = "gemini-2.5-flash"  # or any OpenAI model like "gpt-4o"
REPO_PATH = os.path.dirname(os.getcwd())

# Option 2: Get API keys from environment variables (recommended)
# GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY")
# OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

# Set environment variables for server access
# os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY
# os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# os.environ["REPO_PATH"] = REPO_PATH
# os.environ["YELLHORN_MCP_MODEL"] = MODEL

## 1. Working with LLMManager

Now let's set up and use LLMManager, which provides unified access to different LLM APIs with automatic chunking.

In [4]:
# Initialize API clients
gemini_client = genai.Client(api_key=GEMINI_API_KEY)
openai_client = AsyncOpenAI(api_key=OPENAI_API_KEY)

# Create LLMManager with custom configuration
config = {
    "safety_margin_tokens": 200,
    "overlap_ratio": 0.1,
    "aggregation_strategy": "concatenate",
    "chunk_strategy": "paragraph",
    # Experimental limits to test chunking & retry behavior
    "model_limits": {
        "gpt-4o": 30000,
        "gemini-2.0-flash-exp": 10000,
        "o4-mini": 30000,
        "gemini-2.5-pro": 1_000_000,
    },
}

llm_manager = LLMManager(
    openai_client=openai_client,
    gemini_client=gemini_client,
    config=config
)

def log_callback(level, message):
    """Custom log callback function."""
    print(f"[{level.upper()}] {message}")

In [5]:
# Function to make API calls and handle async operations
async def call_model(prompt, model, system_message=None, response_format=None):
    """Helper function to call a model using LLMManager."""
    try:
        response_dict = await llm_manager.call_llm_with_usage(
            prompt=prompt,
            model=model,
            temperature=0.0,
            system_message=system_message,
            response_format=response_format
        )
        return response_dict
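    except Exception as e:
        # Log and re-raise: returning an error string would break callers
        # that index the result, e.g. response["content"]
        print(f"Error calling {model}: {e}")
        raise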

## 2. Test LLM Manager Simple vs Chunking

### Simple vs Chunked

In [6]:
# Example: OpenAI call (uncomment response_format to request JSON output)
json_prompt = "Generate a list of 3 programming languages with their key features."

model = "gpt-4.1"  # or any available OpenAI model
openai_json_response = await call_model(
    prompt=json_prompt,
    model=model,
    # response_format="json"
)

print("OpenAI JSON Response:")
print("-" * 50)
print(openai_json_response["usage_metadata"])
print(openai_json_response["content"])
print(format_metrics_section(model, openai_json_response["usage_metadata"]))

OpenAI JSON Response:
--------------------------------------------------
<yellhorn_mcp.llm_manager.UsageMetadata object at 0x114cf6f50>
Sure! Here’s a list of 3 programming languages with their key features:

---

**1. Python**
- Easy-to-read, clean syntax
- Extensive standard library
- Dynamically typed and interpreted
- Large community support
- Widely used for web development, data science, automation, and AI

---

**2. JavaScript**
- Runs natively in web browsers
- Event-driven, asynchronous programming support
- Prototype-based object orientation
- Essential for front-end web development
- Large ecosystem with frameworks like React, Angular, and Vue

---

**3. Java**
- Statically typed and compiled to bytecode (runs on JVM)
- Strong object-oriented programming support
- Platform-independent (“write once, run anywhere”)
- Robust standard library and tools
- Commonly used for enterprise applications, Android development, and backend systems

---


---
## Completion Metrics
*   **Mod

In [None]:
from yellhorn_mcp.llm_manager import ChunkingStrategy
from yellhorn_mcp.token_counter import TokenCounter

# Reuse one counter for both splitting and measuring
token_counter = TokenCounter()
chunk_model = "gemini-2.0-flash-exp"

chunks = ChunkingStrategy.split_by_paragraphs(
    text=json_prompt,
    max_tokens=5000,
    token_counter=token_counter,
    model=chunk_model
)

# Print the token count of each chunk
for chunk in chunks:
    print(token_counter.count_tokens(chunk, chunk_model))
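
To make the idea behind paragraph-based chunking concrete, here is a minimal, self-contained sketch. It is not Yellhorn's `ChunkingStrategy` implementation: the whitespace-based `count_tokens` and the greedy packing loop in `split_by_paragraphs_sketch` are illustrative assumptions, and a real token counter would be model-specific.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def split_by_paragraphs_sketch(
    text: str,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack paragraphs into chunks that stay within max_tokens.

    A single paragraph larger than max_tokens is kept as its own chunk
    (a fuller implementation would split it further, e.g. by sentence).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for paragraph in paragraphs:
        tokens = counter(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += tokens

    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
for chunk in split_by_paragraphs_sketch(sample, max_tokens=5):
    print(count_tokens(chunk), repr(chunk))
```

With a budget of 5 "tokens", the first two paragraphs share a chunk and the third starts a new one. The sketch ignores overlap between chunks, which the `overlap_ratio` setting in the config above suggests the real strategy may add.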

In [6]:
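# Example 1: Simple call with a long prompt (repeated 1000x to produce a very long input)
# Note: despite the "gemini" variable names and labels, this run uses an OpenAI model.
prompt = "Explain what token chunking is and why it's important for large language models."*1000
system_message = "You are a helpful AI assistant that provides clear and concise explanations."

model = "gpt-4o-mini"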
gemini_response = await call_model(
    prompt=prompt,
    model=model,
    system_message=system_message
)

print("Gemini Response:")
print("-" * 50)
print(gemini_response["content"])
print(format_metrics_section(model,gemini_response["usage_metadata"]))

Gemini Response:
--------------------------------------------------
**Token Chunking: An Overview**

**What is Token Chunking?**
Token chunking is a technique used in natural language processing (NLP) to break down text into smaller, manageable pieces called "tokens." These tokens can be words, phrases, or even characters, depending on the context and the specific application. The process involves segmenting a continuous stream of text into discrete units that can be analyzed or processed by algorithms.

**Why is Token Chunking Important for Large Language Models?**

1. **Efficiency**: Large language models (LLMs) often deal with vast amounts of text data. Token chunking allows these models to process text in smaller segments, making computations more efficient and manageable.

2. **Context Preservation**: By chunking text into meaningful units, models can better understand the context and relationships between words or phrases. This is crucial for tasks like sentiment analysis, transl

In [11]:
gemini_response["usage_metadata"]

<yellhorn_mcp.llm_manager.UsageMetadata at 0x10cfdd810>
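
Evaluating the object only shows its default repr. A small helper along these lines could display the counts instead. This is a sketch: `usage_to_dict` is a hypothetical name, and it assumes only the `prompt_tokens`, `completion_tokens`, and `total_tokens` attributes that this notebook reads from `usage_metadata` elsewhere.

```python
from types import SimpleNamespace
from typing import Any, Dict


def usage_to_dict(usage: Any) -> Dict[str, int]:
    """Collect token counts from a usage object, treating missing or None fields as 0."""
    fields = ("prompt_tokens", "completion_tokens", "total_tokens")
    return {name: int(getattr(usage, name, 0) or 0) for name in fields}


# Stand-in object with made-up numbers; in the notebook, pass
# gemini_response["usage_metadata"] instead.
fake_usage = SimpleNamespace(prompt_tokens=120, completion_tokens=30, total_tokens=150)
print(usage_to_dict(fake_usage))
```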

### Test Grounded Search

Google Search Grounding is a feature available for Gemini models that allows them to search the web and include citations in their responses. This is particularly useful for getting up-to-date information and verifying facts.

In [31]:
# Import the search grounding utilities
from yellhorn_mcp.search_grounding import _get_gemini_search_tools, add_citations_from_metadata

In [32]:
# Example 2: Compare responses with and without search grounding
comparison_prompt = "What is the current stock price of Tesla (TSLA) and its recent performance?"

# First, make a call without search grounding by not passing tools
response_without_search = await llm_manager.call_llm_with_usage(
    prompt=comparison_prompt,
    model="gemini-2.5-pro",
    temperature=0.0
)

print("Response WITHOUT Search Grounding:")
print("-" * 50)
print(response_without_search["content"])

print("\n" + "="*70 + "\n")

Response WITHOUT Search Grounding:
--------------------------------------------------
Of course. Here is a summary of Tesla's (TSLA) stock price and its recent performance.

As an AI, I cannot give you real-time, up-to-the-second stock data. Stock prices are highly volatile and change constantly during market hours.

However, I can provide you with the most recent closing price and a summary of its recent performance.

### **Tesla (TSLA) Stock Price**

For the most current, live price, please check a reliable financial news source like:

*   **Google Finance**
*   **Yahoo Finance**
*   **Bloomberg**
*   **Reuters**

As of the market close on **June 17, 2024**, the approximate stock price for Tesla (TSLA) was:

*   **~$187.44**

### **Recent Performance Summary**

Tesla's stock has had a very eventful and volatile year in 2024. Here is a breakdown of its recent performance:

*   **Last Trading Day (June 17, 2024):** The stock saw a significant gain of over **+5%**. This surge was largel

In [35]:
# Now make the same call with search grounding
response_with_search = await llm_manager.call_llm_with_citations(
    prompt=comparison_prompt,
    model="gemini-2.5-pro",
    temperature=0.0,
    tools=_get_gemini_search_tools("gemini-2.5-pro")
)

print("Response WITH Search Grounding:")
print("-" * 50)
print(response_with_search["content"])

# Show if citations were found
if "grounding_metadata" in response_with_search:
    grounding_meta = response_with_search["grounding_metadata"]
    if hasattr(grounding_meta, 'grounding_chunks') and grounding_meta.grounding_chunks:
        print(f"\n✓ Found {len(grounding_meta.grounding_chunks)} citation sources")
else:
    print("\n✗ No grounding metadata found")

Response WITH Search Grounding:
--------------------------------------------------
## Tesla Stock Shows Volatility with Recent Dip But Long-Term Gains

**As of Friday, July 11, 2025, Tesla (TSLA) closed at approximately $313.51, reflecting a slight increase of 1.17% in the last 24 hours of trading.** This comes amid a period of mixed performance for the electric vehicle giant.

While the daily performance shows a modest gain, a broader look reveals a recent downturn. The stock has seen a decline of 1.41% over the past week and a more significant drop of 6.25% over the last month.

However, looking at the longer-term picture, Tesla's stock has demonstrated substantial growth. Over the last year, it has surged by 19.07%, and in the last 12 months, the price has risen by 26.40%. This indicates underlying strength and investor confidence in the company's future prospects. The 52-week trading range for the stock has been between a low of $182.00 and a high of $488.54.

Tesla's market capita
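
To list the cited sources rather than only count them, something like the following could work. It is a sketch under assumptions: it presumes each entry in `grounding_chunks` exposes a `web` object with `title` and `uri` fields, as Gemini grounding metadata generally does, and `list_sources` is a hypothetical helper, not part of Yellhorn.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def list_sources(grounding_meta: Any) -> List[Tuple[str, str]]:
    """Return (title, uri) pairs for grounding chunks that carry web info."""
    sources = []
    for chunk in getattr(grounding_meta, "grounding_chunks", None) or []:
        web = getattr(chunk, "web", None)
        if web is None:
            continue  # skip chunks without a web source
        sources.append((getattr(web, "title", "") or "", getattr(web, "uri", "") or ""))
    return sources


# Stand-in metadata with placeholder values, for illustration only.
fake_meta = SimpleNamespace(
    grounding_chunks=[
        SimpleNamespace(web=SimpleNamespace(title="Example source", uri="https://example.com/a")),
        SimpleNamespace(web=None),
    ]
)
for title, uri in list_sources(fake_meta):
    print(f"- {title}: {uri}")
```

In the notebook, the same helper could be called with `response_with_search["grounding_metadata"]` once the key is known to be present.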

### Test OpenAI Deep Research

OpenAI Deep Research models (`o3-deep-research` and `o4-mini-deep-research`) are specialized models that can perform in-depth research and analysis. According to the CHANGELOG, these models automatically have access to `web_search_preview` and `code_interpreter` tools.

In [None]:
# Example 3: Compare regular model vs Deep Research model on the same task
comparison_task = """
Create a very short and concise implementation plan for adding real-time collaboration features 
to a code editor, similar to Google Docs but for code. Consider the most relevant python libraries and write example code and run it:
- Operational Transformation vs CRDTs
- WebSocket vs WebRTC
- Conflict resolution strategies
- Performance at scale
"""

print("Comparing Regular vs Deep Research Models\n")

# First try with regular gpt-4o
regular_model = "gpt-4o"
try:
    regular_response = await llm_manager.call_llm_with_usage(
        prompt=comparison_task,
        model=regular_model,
        temperature=0.0
    )

    print(f"Regular Model response: {regular_response['content']}")
    
    print(f"Regular Model ({regular_model}) Response Length: {len(regular_response['content'])} chars")
    print(f"Token Usage: {regular_response['usage_metadata'].total_tokens}")
    estimated_cost = calculate_cost(
        regular_model,
        regular_response['usage_metadata'].prompt_tokens,
        regular_response['usage_metadata'].completion_tokens
    )
    print(f"Estimated Cost: ${estimated_cost:.4f}")
    
except Exception as e:
    print(f"Error with {regular_model}: {str(e)}")

print("\n" + "-"*50 + "\n")

Comparing Regular vs Deep Research Models



Regular Model response:  {'### Implementation Plan for Real-Time Collaboration in a Code Editor\n\n#### 1. Choose a Synchronization Model\n- **Operational Transformation (OT)**: Suitable for text-based collaboration, widely used in Google Docs.\n- **CRDTs (Conflict-free Replicated Data Types)**: Better for decentralized systems, handles conflicts naturally.\n\n**Decision**: Use OT for simplicity and existing library support.\n\n#### 2. Communication Protocol\n- **WebSocket**: Efficient for real-time, bidirectional communication.\n- **WebRTC**: More complex, used for peer-to-peer connections.\n\n**Decision**: Use WebSocket for server-client architecture.\n\n#### 3. Conflict Resolution Strategy\n- Implement server-side logic to handle conflicts using OT.\n- Use a central server to maintain the document state and broadcast changes.\n\n#### 4. Performance at Scale\n- Use horizontal scaling with load balancers.\n- Implement sharding for document storage.\n- Optimize WebSocket connections wi

In [9]:
# Now try with deep research model
deep_model = "o4-mini-deep-research"
try:
    deep_response = await llm_manager.call_llm_with_usage(
        prompt=comparison_task,
        model=deep_model,
        temperature=0.0
    )

    print(f"Deep Model response: {deep_response['content']}")
    
    print(f"Deep Research Model ({deep_model}) Response Length: {len(deep_response['content'])} chars")
    print(f"Token Usage: {deep_response['usage_metadata'].total_tokens}")
    estimated_cost = calculate_cost(
        deep_model,
        deep_response['usage_metadata'].prompt_tokens,
        deep_response['usage_metadata'].completion_tokens
    )
    print(f"Estimated Cost: ${estimated_cost:.4f}")
    
    # Show a snippet of the response to see the difference
    print("\nFirst 500 chars of Deep Research response:")
    print(deep_response['content'][:500] + "...")
    
except Exception as e:
    print(f"Error with {deep_model}: {str(e)}")

Deep Research Model (o4-mini-deep-research) Response Length: 3340 chars
Token Usage: 0
Estimated Cost: $0.0000

First 500 chars of Deep Research response:
- **OT vs CRDT:** Modern editors often favor CRDTs (like Ypy/Y-CRDT) for simpler merge semantics and offline edits.  CRDTs ensure *eventual consistency* without explicit coordination ([www.codingeasypeasy.com](https://www.codingeasypeasy.com/blog/build-a-real-time-collaborative-editor-with-fastapi-and-crdts-a-comprehensive-guide#:~:text=scenario,are%20applied%20in%20different%20orders)), whereas OT requires complex transform functions and a central server.  For example, a toy CRDT string merge (...


In [11]:
print(deep_response['content'])

- **OT vs CRDT:** Modern editors often favor CRDTs (like Ypy/Y-CRDT) for simpler merge semantics and offline edits.  CRDTs ensure *eventual consistency* without explicit coordination ([www.codingeasypeasy.com](https://www.codingeasypeasy.com/blog/build-a-real-time-collaborative-editor-with-fastapi-and-crdts-a-comprehensive-guide#:~:text=scenario,are%20applied%20in%20different%20orders)), whereas OT requires complex transform functions and a central server.  For example, a toy CRDT string merge (using unique IDs) might look like:
  
  ```python
  doc1 = []
  doc2 = []
  doc1.append(('1_A', 'A'))   # user1 inserts 'A'
  doc2.append(('1_B', 'B'))   # user2 inserts 'B'
  merged = sorted(doc1 + doc2, key=lambda x: x[0])
  print(''.join(char for _,char in merged))  # AB
  ```

- **WebSocket vs WebRTC:** Use WebSockets for a server-based broadcast model (e.g. with Python’s `websockets` or FastAPI), as in many CRDT stacks.  (WebRTC/P2P is possible but adds browser–signaling overhead.)  For ins

### Key Differences Between Search Grounding and Deep Research

1. **Search Grounding (Gemini)**:
   - Automatically searches the web for relevant information
   - Adds inline citations to responses
   - Best for factual queries requiring current information
   - Little or no additional cost beyond regular API usage, depending on the pricing tier

2. **Deep Research Models (OpenAI)**:
   - Specialized models with web search and code interpreter tools
   - Designed for complex, multi-step research tasks
   - Can execute code and analyze results
   - Higher cost but more comprehensive analysis
   - May require special API access

Both features enhance the LLM's ability to provide accurate, up-to-date information, but they serve different use cases and have different cost/performance tradeoffs.

In [6]:
title = "Replace existing model with unified Model Calling Service (OpenRouter or Litellm)"
description = """
Describe how to replace and consolidate LLM model calls in yellhorn with Gemini, Open AI, etc. with OpenRouter or LiteLLM
"""

user_task = f"Title: {title}, Description: {description}"

## 3. Test Curate Context at scale

In [15]:
from examples.mock_context import run_curate_context, mock_github_command
from yellhorn_mcp.server import curate_context
from yellhorn_mcp.utils.git_utils import run_git_command_with_set_cwd, run_github_command_with_set_cwd

In [18]:
# Call run_curate_context with our mock context
context_result = await run_curate_context(
    user_task=user_task,
    repo_path=REPO_PATH,
    gemini_client=gemini_client,
    openai_client=openai_client,
    llm_manager=llm_manager,
    model=MODEL,
    codebase_reasoning="full",  # Use "none" for faster processing
    log_callback=log_callback,
    git_command_func=run_git_command_with_set_cwd(REPO_PATH),
    debug=True
)

print("Path to yellhorn context")
print(context_result)

2025-08-10 16:02:06,394 INFO AFC is enabled with max remote calls: 10.


[INFO] Starting context curation process
[INFO] Deleted existing .yellhorncontext file before analysis
[INFO] Getting codebase context using full mode
[INFO] Getting codebase snapshot in mode: full
[INFO] Found .gitignore with 86 patterns
[INFO] Codebase context metrics: 12009 files, 110173 tokens based on (gemini-2.5-flash)
[INFO] Extracted 11 directories from 38 filtered files
[INFO] Directory context:
<codebase_tree>
.
├── .mcp.json
├── .python-version
├── CHANGELOG.md
├── CLAUDE.md
├── LLMManagerREADME.md
├── README.md
├── coverage_stats.txt
├── pyproject.toml
├── pyrightconfig.json
│   ├── workflows/
│   │   ├── publish.yml
│   │   └── tests.yml
├── docs/
│   ├── USAGE.md
│   └── coverage_baseline.md
├── notebooks/
│   ├── file_structure.ipynb
│   ├── llm_manager.ipynb
│   └── test_curate_context_v2.ipynb
├── yellhorn_mcp/
│   ├── __init__.py
│   ├── cli.py
│   ├── llm_manager.py
│   ├── me...
[INFO] Analyzing directory structure with gemini-2.5-flash
[INFO] [DEBUG] System message

#### Load .yellhorncontext

In [19]:
# Parse the path from the context_result string
import re

# Extract the file path using regex
match = re.search(r'at\s+(.+?)\s+with', context_result)
if match:
    context_file_path = match.group(1)
    print(f"Extracted path: {context_file_path}")
else:
    # Fallback: try to find any path-like string
    match = re.search(r'(/[^\s]+\.yellhorncontext)', context_result)
    if match:
        context_file_path = match.group(1)
        print(f"Extracted path: {context_file_path}")
    else:
        print("Could not extract path from result string")
        context_file_path = None

# Now you can use context_file_path to read the file
if context_file_path:
    with open(context_file_path, 'r') as f:
        context = f.read()
    print("Yellhorn context:")
    print(context)

Extracted path: /Users/sravanj/project_work/yellhorn-mcp/.yellhorncontext
Yellhorn context:
# Yellhorn Context File - AI context optimization
# Generated by yellhorn-mcp curate_context tool
# Based on task: Title: Replace existing model with unified Model Calling Service (OpenRouter or 

# Important directories to specifically include
CLAUDE.md
LLMManagerREADME.md
docs/USAGE.md
notebooks/llm_manager.ipynb
notebooks/test_curate_context_v2.ipynb
pyproject.toml
yellhorn_mcp/cli.py
yellhorn_mcp/integrations/gemini_integration.py
yellhorn_mcp/llm_manager.py
yellhorn_mcp/models/metadata_models.py
yellhorn_mcp/server.py
yellhorn_mcp/token_counter.py
yellhorn_mcp/utils/cost_tracker_utils.py
yellhorn_mcp/utils/lsp_utils.py
yellhorn_mcp/utils/search_grounding_utils.py
yellhorn_mcp/formatters/
yellhorn_mcp/processors/
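
Once the file is loaded, its entries can be pulled out for inspection. The sketch below assumes, based only on the output above, that lines starting with `#` are comments and every other non-empty line is an include pattern. `read_context_patterns` is a hypothetical helper, and any blacklist or negation syntax is not handled.

```python
from typing import List


def read_context_patterns(text: str) -> List[str]:
    """Return the non-empty, non-comment lines of .yellhorncontext-style text."""
    patterns = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            patterns.append(stripped)
    return patterns


# Shortened sample in the same shape as the file printed above.
example = """# Yellhorn Context File - AI context optimization

# Important directories to specifically include
CLAUDE.md
yellhorn_mcp/llm_manager.py
yellhorn_mcp/processors/
"""
patterns = read_context_patterns(example)
print(len(patterns), patterns)
```

In the notebook, `read_context_patterns(context)` would apply the same filtering to the text read from `context_file_path`.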



## 4. Using Create Workplan

Now let's demonstrate how to use the create_workplan MCP tool to generate implementation plans.

In [21]:
from examples.mock_context import run_create_workplan, mock_github_command
from yellhorn_mcp.server import process_workplan_async

In [None]:
# Call run_create_workplan with our mock context
workplan_result = await run_create_workplan(
    title=title,
    detailed_description=description,
    repo_path=REPO_PATH,
    gemini_client=gemini_client,
    openai_client=openai_client,
    llm_manager=llm_manager,
    model=MODEL,
    codebase_reasoning="full",
    log_callback=log_callback,
    github_command_func=mock_github_command,
    git_command_func=run_git_command_with_set_cwd(REPO_PATH),
    background_task_timeout=180
)

print("Workplan Created:")
print("-" * 50)
print(f"Issue URL: {workplan_result['issue_url']}")
print(f"Issue Number: {workplan_result['issue_number']}")

2025-08-10 16:08:47,515 INFO AFC is enabled with max remote calls: 10.


[INFO] Getting codebase snapshot in mode: full
[INFO] Found .gitignore with 86 patterns
[INFO] Attempting to enable search grounding for model gemini-2.5-flash
[INFO] Search grounding enabled for model gemini-2.5-flash
[INFO] Found .yellhornignore with 1 patterns
[INFO] Found .yellhorncontext with 17 whitelist, 0 blacklist, and 0 negation patterns
[INFO] File categorization results out of 73 files:
[INFO]   - 6 always ignored (images, binaries, configs, etc.)
[INFO]   - 23 in yellhorncontext whitelist (included)
[INFO]   - 0 in yellhorncontext blacklist (excluded)
[INFO]   - 0 in yellhornignore whitelist (included)
[INFO]   - 0 in yellhornignore blacklist (excluded)
[INFO]   - 0 other files (excluded - .yellhorncontext exists)
[INFO] Total included: 23 files (excluded 6 always-ignored files)
[INFO] Read contents of 23 files
[MOCK GitHub CLI] Running: gh issue edit 2760 --body # Replace existing model with unified Model Calling Service (OpenRouter or Litellm)

## Refactor LLM Management

## 5. Using Revise Workplan

In [26]:
from examples.mock_context import run_revise_workplan, mock_github_command
from yellhorn_mcp.server import process_revision_async

In [24]:
original_workplan = """## Refactor LLM Management with Unified Model Calling Service (LiteLLM)

### Summary
The current `yellhorn-mcp` codebase utilizes direct API calls to OpenAI and Google Gemini models, leading to fragmented LLM interaction logic and increased maintenance overhead for supporting new providers. This work plan proposes to replace and consolidate all LLM model calls through a unified model calling service, LiteLLM. This integration will simplify LLM API interactions, streamline the addition of new models and providers, and centralize advanced features like retry logic and cost tracking, thereby enhancing the system's flexibility, maintainability, and scalability. The primary subsystems affected will be `yellhorn_mcp/llm_manager.py`, `yellhorn_mcp/server.py`, and related utility and configuration files.

### Technical Details
*   **Languages & Frameworks**: Python 3.10+
*   **External Dependencies**:
    *   `litellm~=1.33.0`: This new dependency will be added to `pyproject.toml` under `[project] dependencies`. LiteLLM provides a unified interface to over 100 LLM APIs, including OpenAI and Gemini, simplifying API calls and offering built-in features like retry logic and cost tracking.[1](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQG4Ty1kIv8-x706DKH3z0MyTlkEO8W57vOoDQOfso2nMsh9xh2BNH93Zg-7QbiF8-0CBxgxiyt3VuInjEF2v0egmfYTmiNcdHEDT6Id4ChPxXEX3wuBraaQMTjxKfKfsdgM_OdlbKP6n4hHZpfv8Z0Dq_1WI8VknpKOaGbluD5pvvgochg=), [2](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHNO9XSCuQVloTCtVZqw58jsi_tTJFmx2aAJdBRQlPbj69lF2ZDEeGXdC4DOfLIXQ9wnw8vmYjsUreGa5rXUZogzcTaVIPrrHdIc-MoZp93nZIdnsLsynw=), [3](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHlyoBNK5-LKhYp37fYMMonXDdEhq0L2HOStpzDqK6Q8k29EGQBFuCvilzsLfcHyYt9xJdTW8jE2Wz8PAE91IOy5TT3WTDN4NOE9z84iS97Ix0-gVv17hI4sYRhfiK3Kh5Ak2KBhLJUVGxRzPqdJMp3sCjRj1nPP8ZAIOJasfsJrkzXU8Rh5ztGaAkie4Iz)
*   **Dependency Management & Pinning Strategy**: `poetry` (as indicated by `pyproject.toml`). The new `litellm` dependency will be pinned using the `~=` operator for compatible releases.
*   **Build, Lint, Formatting, Type-checking Commands**:
    *   `python -m black yellhorn_mcp tests`
    *   `python -m isort yellhorn_mcp tests`
    *   `python -m pytest`
    *   Type-checking will be performed via `pyright` (configured in `pyrightconfig.json`).
*   **Logging & Observability**: Existing `logging` module will be leveraged. Ensure LiteLLM's internal logging integrates seamlessly or is configured to use the existing logging setup.
*   **Analytics/KPIs**: Existing cost tracking and usage metadata (`yellhorn_mcp/utils/cost_tracker_utils.py`, `yellhorn_mcp/models/metadata_models.py`) will be adapted to consume LiteLLM's unified usage reports.
*   **Testing Frameworks & Helpers**: `pytest`, `pytest-asyncio`, `httpx`, `pytest-cov` will continue to be used. Existing mock testing infrastructure (`examples/mock_context.py`) will be adapted to mock LiteLLM calls.

### Architecture
*   **Existing Components Leveraged**:
    *   `yellhorn_mcp/llm_manager.py`: The core `LLMManager` class will be refactored to use LiteLLM as its underlying LLM interaction layer. Its existing logic for chunking (`ChunkingStrategy`), token counting (`TokenCounter`), and usage metadata aggregation (`UsageMetadata`) will be adapted to work with LiteLLM's outputs.
    *   `yellhorn_mcp/token_counter.py`: Will continue to provide model-specific token limits and counting, potentially integrating with LiteLLM's internal token counting utilities if beneficial.
    *   `yellhorn_mcp/models/metadata_models.py`: `UsageMetrics` and `CompletionMetadata` will continue to track LLM usage, adapted for LiteLLM's reporting format.
    *   `yellhorn_mcp/utils/cost_tracker_utils.py`: `calculate_cost` and `format_metrics_section` will be updated to map LiteLLM's model identifiers to existing pricing.
    *   `yellhorn_mcp/utils/search_grounding_utils.py`: The logic for configuring Google Search tools and adding citations will be adapted to pass through LiteLLM's API.
    *   `yellhorn_mcp/processors/`: All processors (`context_processor.py`, `judgement_processor.py`, `workplan_processor.py`) will continue to use `LLMManager` as their LLM interface.
*   **New Components Introduced**: No new top-level components will be introduced. The change is primarily an internal refactoring of the `LLMManager` to use LiteLLM.
*   **Control-flow & Data-flow Diagram (ASCII)**:
    ```
    +-----------------+       +-------------------+       +-------------------+
    | MCP Tools (CLI, |       | yellhorn_mcp/     |       | yellhorn_mcp/     |
    | VSCode, Claude) |------>| server.py         |------>| llm_manager.py    |
    +-----------------+       | (Tool Dispatcher) |       | (Unified LLM API) |
                               +-------------------+       +-------------------+
                                         ^                         |
                                         |                         |
                                         |                         |
                                         |                         |
                                         |                         v
                                         |                   +-----------------+
                                         |                   | LiteLLM Library |
                                         |                   | (litellm.py)    |
                                         |                   +-----------------+
                                         |                         |
                                         |                         |
                                         |                         |
                                         |                         v
                                         |                   +-----------------+
                                         |                   | OpenAI API      |
                                         |                   | Gemini API      |
                                         |                   | Other LLM APIs  |
                                         |                   +-----------------+
                                         |
                                         +-------------------------------------+
                                         (Usage Metadata, Responses, Errors)
    ```
*   **State-management, Retry/Fallback, and Error-handling Patterns**: LiteLLM offers robust built-in retry and fallback mechanisms.[2](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHNO9XSCuQVloTCtVZqw58jsi_tTJFmx2aAJdBRQlPbj69lF2ZDEeGXdC4DOfLIXQ9wnw8vmYjsUreGa5rXUZogzcTaVIPrrHdIc-MoZp93nZIdnsLsynw=) The existing `api_retry` decorator in `llm_manager.py` will be removed, and `LLMManager` will rely on LiteLLM's internal retry logic. Error handling will be adapted to catch LiteLLM-specific exceptions and translate them into `YellhornMCPError` where appropriate.

### Completion Criteria & Metrics
*   **Engineering Metrics**:
    *   All `pytest` unit and integration tests pass with 100% success rate.
    *   Test coverage for `yellhorn_mcp` remains at or above 70% line coverage.
    *   `black` and `isort` formatting checks pass.
    *   `pyright` type-checking runs clean with no errors.
    *   `LLMManager` successfully makes calls to both OpenAI and Gemini models via LiteLLM.
    *   Cost tracking and usage metadata reported in GitHub comments remain accurate and consistent with previous versions.
*   **Business Metrics**:
    *   Reduced time-to-integrate new LLM providers by leveraging LiteLLM's unified interface.
    *   Improved system reliability due to LiteLLM's robust retry and fallback mechanisms.
*   **Code-state Definition of Done**:
    *   All CI jobs (linting, formatting, testing) are green.
    *   `pyproject.toml` is updated with the new `litellm` dependency.
    *   `LLMManagerREADME.md` and `docs/USAGE.md` are updated to reflect the LiteLLM integration.

### References
*   `yellhorn_mcp/llm_manager.py`
*   `yellhorn_mcp/server.py`
*   `yellhorn_mcp/cli.py`
*   `yellhorn_mcp/token_counter.py`
*   `yellhorn_mcp/integrations/gemini_integration.py`
*   `yellhorn_mcp/utils/cost_tracker_utils.py`
*   `yellhorn_mcp/utils/search_grounding_utils.py`
*   `pyproject.toml`
*   LiteLLM GitHub Repository: `https://github.com/BerriAI/litellm`[4](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHJqrTQMqVpo30aziDctGWaQBTciLRLQ7cxR_-2L18AA-WDtAeW9Tc15Iulaz3EocAd6-xx6fScsj_J_o8JGGnbw4Umaj6-fP6WKin0aBn91jDU9DRzRDn6aHscPA==)
*   LiteLLM Documentation: `https://docs.litellm.ai/`[5](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE7TAfZPbbVvO9V1PNxONobzzviteU9UhFpNZDcEBmmds4AFNHD3ve1SKZzeXa0X9_gmmsP5LkeBeFliaiF_qbJbV6oETAKOuLHrSY5jSvChIQC6H3FfQG08-jmKCzRQv0=), [2](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHNO9XSCuQVloTCtVZqw58jsi_tTJFmx2aAJdBRQlPbj69lF2ZDEeGXdC4DOfLIXQ9wnw8vmYjsUreGa5rXUZogzcTaVIPrrHdIc-MoZp93nZIdnsLsynw=), [6](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQG7LN0z9TdU0oWVk8IIjcNM2V5AwitnYaZ-qI9rf39ZkfkJY2RKVT_n-_7ZGXVmSnpQUZhy7xYafWgPWoH-uLFo1hTUD_FR4gh70OMHcD0r3Xnyj2xp1C1MS7iM3jgMnBsDCjkduQDfz5uk_xOGbD0ptegACqLFiQsWs1bcvY6MW1DJ6yc52PuDzj2LAvvO8tin), [7](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGqLP_Q1VgUi5mp07zBgowNwY4Jc-LmtXBEAwG-NmrcwdEvaEjV4dMKVwWxOPCN4gyhmyl-Ccj5RiSuhOzyq-f-Ym9flHxUOiNu344Rxvqvk8wJ)
*   LiteLLM PyPI: `https://pypi.org/project/litellm/`

### Implementation Steps
### - [ ] Step 1: Add LiteLLM Dependency and Remove Direct API Clients
**Description**: Introduce `litellm` as a core dependency in `pyproject.toml`. Remove direct `google-genai` and `openai` client dependencies, as LiteLLM will abstract these. Remove `tenacity` as LiteLLM provides its own retry logic.
**Files**:
*   `pyproject.toml`
*   `yellhorn_mcp/server.py`
**Reference(s)**:
*   `pyproject.toml`
*   `yellhorn_mcp/server.py`
*   LiteLLM PyPI: `https://pypi.org/project/litellm/`
**Test(s)**:
*   Run `poetry install` to verify dependency resolution.
*   Ensure `yellhorn-mcp` CLI still starts without errors: `python -m yellhorn_mcp.cli --help`.

### - [ ] Step 2: Refactor `LLMManager` Initialization and Core Call Logic
**Description**: Modify `LLMManager.__init__` to accept a `litellm` client (or initialize it internally) instead of separate OpenAI and Gemini clients. Update `_single_call` to use `litellm.acompletion()` for all LLM interactions. Remove the `api_retry` decorator from `_call_openai` and `_call_gemini` (or remove these methods entirely if `_single_call` becomes the sole entry point).
**Files**:
*   `yellhorn_mcp/llm_manager.py`
*   `yellhorn_mcp/server.py`
**Reference(s)**:
*   `yellhorn_mcp/llm_manager.py`
*   `yellhorn_mcp/server.py`
*   LiteLLM Documentation: `https://docs.litellm.ai/docs/completion`
**Test(s)**:
*   Unit tests for `LLMManager`'s `_single_call` method, ensuring it correctly routes to LiteLLM.
*   Mock LiteLLM's `acompletion` to verify call parameters.

### - [ ] Step 3: Adapt `UsageMetadata` and `TokenCounter` for LiteLLM
**Description**: Update `UsageMetadata` to correctly parse LiteLLM's unified usage format. Investigate if `TokenCounter` can leverage LiteLLM's `token_counter` utility for more accurate or consistent token estimation across models.
**Files**:
*   `yellhorn_mcp/llm_manager.py`
*   `yellhorn_mcp/token_counter.py`
*   `yellhorn_mcp/models/metadata_models.py`
**Reference(s)**:
*   `yellhorn_mcp/llm_manager.py`
*   `yellhorn_mcp/token_counter.py`
*   `yellhorn_mcp/models/metadata_models.py`
*   LiteLLM API: `https://docs.litellm.ai/docs/token_counter`[8](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGO7UZuLkO5CgKE-QOmpelvPzFXNHEcEyR-rll1KfKNLtvPxjmI1sxV0MFp3BiMU0CQxaoQb1h7uJBwtC2MOBdhw2fNZtrmaBRIrDy3pp3Xrb1448DCO1X_TBcT7TY=)
**Test(s)**:
*   Unit tests for `UsageMetadata` to ensure correct parsing of LiteLLM's usage data.
*   Unit tests for `TokenCounter` to verify token counts using LiteLLM's method (if adopted) or ensuring compatibility.

### - [ ] Step 4: Integrate Search Grounding and Deep Research Model Logic with LiteLLM
**Description**: Adapt the logic for Google Search Grounding (`_get_gemini_search_tools`, `add_citations_from_metadata`) and Deep Research model tool activation (`_is_deep_research_model` in `LLMManager`) to work seamlessly with LiteLLM's unified API. LiteLLM supports passing `tools` parameters.
**Files**:
*   `yellhorn_mcp/llm_manager.py`
*   `yellhorn_mcp/utils/search_grounding_utils.py`
*   `yellhorn_mcp/integrations/gemini_integration.py` (will be refactored or removed)
**Reference(s)**:
*   `yellhorn_mcp/llm_manager.py`
*   `yellhorn_mcp/utils/search_grounding_utils.py`
*   LiteLLM API: `https://docs.litellm.ai/docs/completion` (for `tools` parameter)
*   LiteLLM API: `https://docs.litellm.ai/docs/providers` (for model support)[5](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQE7TAfZPbbVvO9V1PNxONobzzviteU9UhFpNZDcEBmmds4AFNHD3ve1SKZzeXa0X9_gmmsP5LkeBeFliaiF_qbJbV6oETAKOuLHrSY5jSvChIQC6H3FfQG08-jmKCzRQv0=)
**Test(s)**:
*   Integration tests for `call_llm_with_citations` to ensure search grounding works via LiteLLM.
*   Unit tests for `LLMManager` to verify deep research model tool activation.

### - [ ] Step 5: Update `server.py` and `cli.py` for LiteLLM Integration
**Description**: Modify `app_lifespan` in `server.py` to initialize LiteLLM instead of separate OpenAI and Gemini clients. Adjust `cli.py` to handle API keys and model selection in a LiteLLM-compatible manner, potentially simplifying environment variable requirements. Remove `yellhorn_mcp/integrations/gemini_integration.py` as its functionality will be absorbed by `llm_manager.py` and LiteLLM.
**Files**:
*   `yellhorn_mcp/server.py`
*   `yellhorn_mcp/cli.py`
*   `yellhorn_mcp/integrations/gemini_integration.py` (delete)
**Reference(s)**:
*   `yellhorn_mcp/server.py`
*   `yellhorn_mcp/cli.py`
*   LiteLLM Getting Started: `https://docs.litellm.ai/docs/getting_started`[2](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQHNO9XSCuQVloTCtVZqw58jsi_tTJFmx2aAJdBRQlPbj69lF2ZDEeGXdC4DOfLIXQ9wnw8vmYjsUreGa5rXUZogzcTaVIPrrHdIc-MoZp93nZIdnsLsynw=)
**Test(s)**:
*   End-to-end integration tests for `create_workplan`, `judge_workplan`, and `curate_context` via the MCP server to ensure full functionality.
*   Verify CLI commands still function correctly.

### - [ ] Step 6: Update Cost Tracking and Documentation
**Description**: Review and update `yellhorn_mcp/utils/cost_tracker_utils.py` to ensure accurate cost calculation with LiteLLM's model naming conventions and potentially leverage LiteLLM's cost tracking features if they offer more granularity. Update `LLMManagerREADME.md` and `docs/USAGE.md` to reflect the new LiteLLM-based architecture and usage.
**Files**:
*   `yellhorn_mcp/utils/cost_tracker_utils.py`
*   `LLMManagerREADME.md`
*   `docs/USAGE.md`
**Reference(s)**:
*   `yellhorn_mcp/utils/cost_tracker_utils.py`
*   `LLMManagerREADME.md`
*   `docs/USAGE.md`
*   LiteLLM Model Cost Map: `https://docs.litellm.ai/docs/proxy/model_cost_map`[8](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQGO7UZuLkO5CgKE-QOmpelvPzFXNHEcEyR-rll1KfKNLtvPxjmI1sxV0MFp3BiMU0CQxaoQb1h7uJBwtC2MOBdhw2fNZtrmaBRIrDy3pp3Xrb1448DCO1X_TBcT7TY=)
**Test(s)**:
*   Verify cost calculations in generated GitHub comments are correct.
*   Review documentation for clarity and accuracy.

### Global Test Strategy
*   **Unit Tests**: All modified functions and classes will have comprehensive unit tests using `pytest` and `pytest-asyncio`. Mocking will be used for external API calls (LiteLLM).
*   **Integration Tests**: The `examples/mock_context.py` will be updated and used to run integration tests for `create_workplan`, `judge_workplan`, and `curate_context` tools, simulating GitHub interactions and verifying LLM calls through the LiteLLM integration.
*   **End-to-End Tests**: Manual verification of the `yellhorn-mcp` CLI and its interaction with GitHub issues.
*   **Test Coverage**: Maintain minimum 70% test coverage for all new and modified code, enforced by `pytest-cov`.
*   **Local Execution**: Tests can be run locally using `poetry run pytest`.
*   **Environment Variables/Secrets**: API keys for LiteLLM (and underlying providers if needed) will be managed via environment variables (`LITELLM_API_KEY` or provider-specific keys like `OPENAI_API_KEY`, `GEMINI_API_KEY` as LiteLLM supports them).
*   **Async Helpers/Fixtures**: Existing `pytest-asyncio` fixtures will be leveraged.

### Files to Modify / New Files to Create
*   **Files to Modify**:
    *   `pyproject.toml`: Add `litellm` dependency, remove `google-genai`, `openai`, `tenacity`, `google-api-core`.
    *   `yellhorn_mcp/llm_manager.py`: Refactor `LLMManager` to use LiteLLM, update `UsageMetadata`, remove `api_retry` decorator.
    *   `yellhorn_mcp/server.py`: Update `app_lifespan` to initialize LiteLLM, remove direct client initializations.
    *   `yellhorn_mcp/cli.py`: Adjust API key validation and model selection logic.
    *   `yellhorn_mcp/token_counter.py`: Potentially integrate LiteLLM's token counting.
    *   `yellhorn_mcp/utils/cost_tracker_utils.py`: Update `MODEL_PRICING` and `calculate_cost` for LiteLLM models.
    *   `yellhorn_mcp/utils/search_grounding_utils.py`: Ensure compatibility with LiteLLM's `tools` parameter.
    *   `LLMManagerREADME.md`: Update LLM Manager usage and architecture.
    *   `docs/USAGE.md`: Update installation, configuration, and tool usage sections.
*   **New Files to Create**:
    *   None, aiming for refactoring existing files.
[MOCK GitHub CLI] In directory: /Users/sravanj/project_work/yellhorn-mcp
[MOCK GitHub CLI] Updated issue 2760
[INFO] Successfully updated GitHub issue #2760 with generated workplan and metrics
[MOCK GitHub CLI] Running: gh issue comment 2760 --body ## ✅ Workplan generated successfully

### Generation Details
**Time**: 46.1 seconds  
**Completed**: 2025-08-10 23:09:33 UTC  
**Model Used**: `gemini-2.5-flash`  

### Token Usage
**Input Tokens**: 113,934  
**Output Tokens**: 4,109  
**Total Tokens**: 124,466  
**Estimated Cost**: $0.0445  

**Search Results Used**: 1  
**Context Size**: 398,011 characters """
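
As a sanity check on the metrics comment above, the estimated cost can be reproduced from the reported input and output token counts. The sketch below assumes per-million-token rates of $0.30 (input) and $2.50 (output) for `gemini-2.5-flash`. Both the rates and the `estimate_cost` helper are illustrative assumptions, not values read from `calculate_cost` or `MODEL_PRICING`, and provider pricing changes over time.

```python
# Hypothetical helper: reproduces the "Estimated Cost" line from token counts.
# The rates below are assumed for illustration, not taken from yellhorn_mcp.
ASSUMED_INPUT_RATE_PER_M = 0.30   # USD per 1M input tokens (assumption)
ASSUMED_OUTPUT_RATE_PER_M = 2.50  # USD per 1M output tokens (assumption)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a single LLM call."""
    input_cost = input_tokens / 1_000_000 * ASSUMED_INPUT_RATE_PER_M
    output_cost = output_tokens / 1_000_000 * ASSUMED_OUTPUT_RATE_PER_M
    return input_cost + output_cost


# Token counts copied from the metrics comment above
cost = estimate_cost(input_tokens=113_934, output_tokens=4_109)
print(f"Estimated cost: ${cost:.4f}")
```

Under these assumed rates the result rounds to $0.0445, which agrees with the comment. The reported total (124,466) is larger than input plus output (118,043); this sketch prices only the two categories the comment lists separately.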

In [30]:
# Revise the workplan by calling run_revise_workplan with our mock context
revision_result = await run_revise_workplan(
    issue_number="123", 
    original_workplan="Original workplan content here...", 
    revision_instructions="Focus on building test notebooks for each of the major changes. Add more detailed testing steps and include error handling requirements",
    repo_path=REPO_PATH,
    llm_manager=llm_manager,
    model=MODEL,
    codebase_reasoning="full",
    debug=True,
    disable_search_grounding=False,
    github_command_func=mock_github_command,
    git_command_func=run_git_command_with_set_cwd(REPO_PATH),
    log_callback=log_callback,
    wait_for_background_tasks=True,
    background_task_timeout=180
)

print("Workplan Revised:")
print("-" * 50)
print(f"Issue URL: {revision_result['issue_url']}")
print(f"Issue Number: {revision_result['issue_number']}")

2025-08-10 16:19:05,107 INFO AFC is enabled with max remote calls: 10.


[INFO] Getting codebase snapshot in mode: full
[INFO] Found .gitignore with 86 patterns
[MOCK GitHub CLI] Running: gh issue comment 123 --body <details>
<summary>Debug: Full prompt used for generation</summary>

```
You are an expert software developer tasked with revising an existing workplan based on revision instructions.

# Original Workplan
Original workplan content here...

# Revision Instructions
Focus on building test notebooks for each of the major changes. Add more detailed testing steps and include error handling requirements

# Codebase Context
<codebase_tree>
.
├── CLAUDE.md
├── LLMManagerREADME.md
├── pyproject.toml
├── docs/
│   └── USAGE.md
├── notebooks/
│   ├── llm_manager.ipynb
│   └── test_curate_context_v2.ipynb
├── yellhorn_mcp/
│   ├── cli.py
│   ├── llm_manager.py
│   ├── server.py
│   └── token_counter.py
│   ├── formatters/
│   │   ├── __init__.py
│   │   ├── codebase_snapshot.py
│   │   ├── context_fetcher.py
│   │   └── prompt_formatter.py
│   ├── integratio

## 6. Using Judge Workplan

In [36]:
from examples.mock_context import run_judge_workplan, mock_github_command
from yellhorn_mcp.server import process_judgement_async

In [34]:
# Example workplan content (you would typically get this from a GitHub issue)
WORKPLAN_CONTENT = """
# Example Workplan: Add Token Counter Feature

## Summary
Implement a token counting system using tiktoken to prevent token overflow in LLM calls.

## Implementation Steps
1. Create TokenCounter class with tiktoken integration
2. Add token counting to LLMManager
3. Implement automatic prompt chunking when limits are exceeded
4. Update all LLM call sites to use the new system

## Files to Modify
- `yellhorn_mcp/llm_manager.py`: Add token counting integration
- `yellhorn_mcp/server.py`: Update LLM call sites

## New Files to Create
- `yellhorn_mcp/token_counter.py`: Core token counting functionality
"""

# Example diff content (you would typically get this from git diff)
DIFF_CONTENT = """
diff --git a/yellhorn_mcp/token_counter.py b/yellhorn_mcp/token_counter.py
new file mode 100644
index 0000000..1234567
--- /dev/null
+++ b/yellhorn_mcp/token_counter.py
@@ -0,0 +1,50 @@
+# Token counting utilities using tiktoken.
+
+import tiktoken
+from typing import Optional
+
+class TokenCounter:
+    \"\"\"Token counter using tiktoken for accurate token counting.\"\"\"
+    
+    def __init__(self, model: str):
+        self.model = model
+        self.encoding = tiktoken.encoding_for_model(model)
+    
+    def count_tokens(self, text: str) -> int:
+        \"\"\"Count tokens in the given text.\"\"\"
+        return len(self.encoding.encode(text))

diff --git a/yellhorn_mcp/llm_manager.py b/yellhorn_mcp/llm_manager.py
index abcd123..efgh456 100644
--- a/yellhorn_mcp/llm_manager.py
+++ b/yellhorn_mcp/llm_manager.py
@@ -10,6 +10,7 @@ from typing import Dict, List, Optional, Any, Union
 from openai import AsyncOpenAI
 import google.genai as genai
 from .usage_metadata import UsageMetadata
+from .token_counter import TokenCounter
 
 class LLMManager:
     \"\"\"Unified LLM manager with token counting and chunking support.\"\"\"
"""

# Judge workplan parameters
base_ref = "main"
head_ref = "feature/token-counter"
subissue_to_update = "456"  # GitHub issue number for the sub-issue to update
parent_workplan_issue_number = "123"  # Original workplan issue number
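
The hardcoded `DIFF_CONTENT` above stands in for real `git diff` output. A minimal sketch of producing it from the two refs might look like the following. The helper names `build_git_diff_command` and `get_diff_content` are hypothetical, and the two-dot `base..head` range is an assumption about the comparison wanted, not necessarily what `run_judge_workplan` does internally.

```python
import subprocess
from typing import List


def build_git_diff_command(base_ref: str, head_ref: str) -> List[str]:
    """Build the argv for diffing head_ref against base_ref (hypothetical helper)."""
    return ["git", "diff", f"{base_ref}..{head_ref}"]


def get_diff_content(repo_path: str, base_ref: str, head_ref: str) -> str:
    """Run git diff inside repo_path and return the unified diff as text."""
    result = subprocess.run(
        build_git_diff_command(base_ref, head_ref),
        cwd=repo_path,        # run inside the repository
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str
        check=True,           # raise CalledProcessError if git fails
    )
    return result.stdout


# Show the command that would be run for the refs used in this notebook
print(build_git_diff_command("main", "feature/token-counter"))
```

With a real checkout that contains both refs, `DIFF_CONTENT = get_diff_content(REPO_PATH, base_ref, head_ref)` could replace the literal string above.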

In [35]:
# Call run_judge_workplan with our mock context
await run_judge_workplan(
    workplan_content=WORKPLAN_CONTENT,
    diff_content=DIFF_CONTENT,
    base_ref=base_ref,
    head_ref=head_ref,
    subissue_to_update=subissue_to_update,
    parent_workplan_issue_number=parent_workplan_issue_number,
    repo_path=REPO_PATH,
    gemini_client=gemini_client,
    openai_client=openai_client,
    llm_manager=llm_manager,
    model=MODEL,
    base_commit_hash="abc123",  # Optional placeholder; use the actual base commit hash when available
    head_commit_hash="def456",  # Optional placeholder; use the actual head commit hash when available
    debug=True,  # Set to True to see the full prompt used
    codebase_reasoning="full",  # Options: "full", "lsp", "file_structure", "none"
    disable_search_grounding=False,
    github_command_func=mock_github_command,
    log_callback=log_callback,
    wait_for_background_tasks=True,
    background_task_timeout=180
)

print("\nJudgement completed successfully!")
print(f"Check GitHub sub-issue #{subissue_to_update} for the detailed judgement results.")

2025-08-10 16:28:21,836 INFO AFC is enabled with max remote calls: 10.


[INFO] Attempting to enable search grounding for model gemini-2.5-flash
[INFO] Search grounding enabled for model gemini-2.5-flash
[INFO] LLM call initiated - Model: gemini-2.5-flash, Input tokens: 805, Model limit: 1048576, Temperature: 0.0
[INFO] LLM call completed - Model: gemini-2.5-flash, Completion tokens: 1010, Total tokens: 3198, Time: 11.34s
[MOCK GitHub CLI] Running: gh issue edit 456 --title Judgement for #123: feature/token-counter vs main --body-file /var/folders/3h/5k3yk1s13yn9cql9x6fl9g9h0000gn/T/tmpcpva3948.md
[MOCK GitHub CLI] In directory: /Users/sravanj/project_work/yellhorn-mcp
[MOCK GitHub CLI] File contents: Parent workplan: #123

## Comparison Metadata
- **Workplan Issue**: `#123`
- **Base Ref**: `main` (Commit: `abc123`)
- **Head Ref**: `feature/token-counter` (Commit: `def456`)
- **Codebase Reasoning Mode**: `full`
- **AI Model**: `gemini-2.5-flash`

## Judgement Summary
INCOMPLETE. The provided code diff only partially implements the first two steps of the wor