idiap/FakeLLM

fakellm

Expose one or more pydantic_ai.Agent instances as fake OpenAI-compatible chat models.

It is designed so an OpenAI-compatible client can talk to what looks like a normal model, while the backend is actually a PydanticAI agent that can:

  • use its own server-side PydanticAI tools
  • surface client-provided OpenAI tools through PydanticAI deferred tools
  • honor OpenAI-style structured output via response_format
  • expose multiple fake model IDs from one app
  • keep one long-lived agent instance per fake model for the whole app lifetime

Install

Install fakellm directly from GitHub. The fakellm name on PyPI is already taken by a different package, so do not install this project from PyPI.

For the CLI, install it as a uv-managed tool with Python 3.14:

uv tool install --python 3.14 git+https://github.com/idiap/FakeLLM.git

To use fakellm as a dependency in another uv project, add the Git source explicitly:

uv add "fakellm @ git+https://github.com/idiap/FakeLLM.git"

Run

fakellm mypackage.my_agent:agent --host 127.0.0.1 --port 8000

You can also point it at a FakeModels registry:

fakellm examples.multi_model_agents:MODELS --host 127.0.0.1 --port 8000

Use --prefix if you want a different API base path:

fakellm mypackage.my_agent:agent --host 127.0.0.1 --port 8000 --prefix /proxy/custom/v1

Protect the OpenAI-compatible routes with API keys by pointing FAKELLM_CONFIG at a JSON config file:

{
  "api_keys": [
    {
      "name": "crush-local",
      "key": "replace-with-a-secret",
      "model_id": "fake-pydanticai"
    }
  ]
}

FAKELLM_CONFIG=./config.json fakellm mypackage.my_agent:agent

You can also pass the path directly:

fakellm mypackage.my_agent:agent --config ./config.json

When API keys are configured, /v1/models and /v1/chat/completions require Authorization: Bearer <key> or X-API-Key: <key>. Each key is scoped to its configured model_id, and fakellm logs requests with the associated name. Omit model_id to let one key access every model in that entrypoint. As a shortcut, set FAKELLM_API_KEY to protect the whole entrypoint with a single shared key. /health remains public for liveness checks.
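As a rough client-side sketch, the two accepted header forms can be built with the standard library. The helper name build_models_request, the base URL, and the key value are illustrative assumptions, not part of fakellm; the request is only constructed here, not sent.

```python
import urllib.request


def build_models_request(
    base_url: str, key: str, use_bearer: bool = True
) -> urllib.request.Request:
    """Build (but do not send) a GET request for the model list.

    Hypothetical helper for illustration; not part of fakellm.
    """
    # Either header form is accepted according to the text above.
    if use_bearer:
        headers = {"Authorization": f"Bearer {key}"}
    else:
        headers = {"X-API-Key": key}
    return urllib.request.Request(f"{base_url}/models", headers=headers, method="GET")


bearer_request = build_models_request("http://127.0.0.1:8000/v1", "replace-with-a-secret")
api_key_request = build_models_request(
    "http://127.0.0.1:8000/v1", "replace-with-a-secret", use_bearer=False
)
```

Passing either object to urllib.request.urlopen() against a running server would send the request. A key scoped to one model_id should only be used with that model.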

The same app also exposes an OpenAI Responses-compatible endpoint at /v1/responses. It uses the same hosted fake model IDs, hidden server-side PydanticAI tools, client-provided function tools, structured output handling, API-key auth, and request-context deps as /v1/chat/completions:

curl http://127.0.0.1:8000/v1/responses \
  -H 'Content-Type: application/json' \
  -d '{"model":"fake-pydanticai","input":"Say hello from Responses."}'

Client-visible tools use the Responses function-tool shape. fakellm remains stateless, so include the prior function_call item and the matching function_call_output item when returning tool results:

{
  "model": "fake-pydanticai",
  "input": [
    {"type": "message", "role": "user", "content": "Check Paris weather"},
    {
      "type": "function_call",
      "call_id": "weather-call",
      "name": "get_weather",
      "arguments": "{\"city\":\"Paris\"}"
    },
    {
      "type": "function_call_output",
      "call_id": "weather-call",
      "output": {"temperature_c": 21}
    }
  ],
  "tools": [
    {
      "type": "function",
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  ]
}
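Because every request must replay the tool call alongside its result, a small client-side helper can keep the two items paired. The sketch below only mirrors the JSON shape shown above; build_tool_result_input is a hypothetical name, not a fakellm API.

```python
import json


def build_tool_result_input(user_text, call_id, name, arguments, output):
    """Return Responses `input` items that replay a tool call and its result.

    Hypothetical helper for illustration; it only mirrors the payload above.
    """
    return [
        {"type": "message", "role": "user", "content": user_text},
        {
            "type": "function_call",
            "call_id": call_id,
            "name": name,
            # Arguments travel as a JSON-encoded string, as in the payload above.
            "arguments": json.dumps(arguments, separators=(",", ":")),
        },
        # The output item must reuse the call_id of the function_call item.
        {"type": "function_call_output", "call_id": call_id, "output": output},
    ]


items = build_tool_result_input(
    "Check Paris weather",
    "weather-call",
    "get_weather",
    {"city": "Paris"},
    {"temperature_c": 21},
)
```

The resulting list can be placed under "input" next to the "model" and "tools" fields of the request body.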

Responses streaming can emit optional reasoning/progress events. Pass responses_reasoning to create_app() and choose one of these modes:

  • "disabled": no reasoning events, the default
  • "agent": forward streamed PydanticAI ThinkingPart deltas from the backend agent
  • "custom": let AgenticWorkflow code call await request.reasoning.emit(...)
  • "premade": loop configured text chunks concurrently while the request runs

from fakellm import AgenticWorkflow, ResponsesReasoningConfig, WorkflowRequest, create_app


async def workflow(request: WorkflowRequest) -> str:
    if request.reasoning is not None:
        await request.reasoning.emit("checking context ")
    return "done"


app = create_app(
    AgenticWorkflow(workflow),
    model_name="responses-workflow",
    responses_reasoning=ResponsesReasoningConfig(mode="custom"),
)

You can also turn trusted request headers into PydanticAI deps without putting those values in the chat messages or tool schemas. This is useful behind an internal proxy that authenticates with a fakellm API key and forwards a user id:

from dataclasses import dataclass

from fastapi import HTTPException
from pydantic_ai import Agent, RunContext

from fakellm import RequestContext, create_app


@dataclass(frozen=True)
class UserDeps:
    user_id: str


def request_context_factory(context: RequestContext) -> UserDeps:
    user_id = context.header("x-user-id")
    if user_id is None:
        raise HTTPException(status_code=400, detail="Missing X-User-ID header.")
    return UserDeps(user_id=user_id)


agent = Agent("openai:gpt-4.1-mini", deps_type=UserDeps)


@agent.tool
async def lookup_internal_profile(ctx: RunContext[UserDeps]) -> dict[str, str]:
    return {"user_id": ctx.deps.user_id}


app = create_app(
    agent,
    model_name="internal-assistant",
    api_keys=[{"name": "internal-proxy", "key": "secret", "model_id": "internal-assistant"}],
    request_context_factory=request_context_factory,
)

RequestContext also exposes the parsed chat completion request. Use context.parameter("model") for any top-level request field and context.extra_parameter("user") for unmodeled/custom fields that fakellm does not otherwise interpret, such as OpenAI's user value or application-specific metadata:

def request_context_factory(context: RequestContext) -> UserDeps:
    user_id = context.extra_parameter("user")
    if not isinstance(user_id, str):
        raise HTTPException(status_code=400, detail="Missing user parameter.")
    return UserDeps(user_id=user_id)

For bot-style deployments, ContextPipeline turns those request values into a typed deps object declaratively:

from dataclasses import dataclass

from fakellm import ContextPipeline


@dataclass(frozen=True)
class BotDeps:
    mattermost_user_id: str
    mattermost_username: str | None


context = (
    ContextPipeline(BotDeps)
    .from_extra("safety_identifier", as_="mattermost_user_id")
    .resolve(
        "mattermost_username",
        lookup_mattermost_username,
        from_="mattermost_user_id",
        optional=True,
    )
    .cache_per_request()
)

WorkflowRequest has convenience helpers for common bot workflows: latest_user_text(), conversation_text(), auth_name, context, and emit_progress(). For Responses streams, async with request.span(...): emits progress text and records elapsed time on the span object.

Bot Runtime Helpers

The lower-level adapter APIs remain available, and fakellm also ships a small bot-oriented layer for common multi-agent server concerns:

from fakellm import (
    ApiKeyAuth,
    ContextPipeline,
    DependencyErrorPolicy,
    FakeModels,
    ManagedMCP,
    OpenAICompatibleBackend,
    Route,
    RouterWorkflow,
    create_bot_app,
)


backend = OpenAICompatibleBackend.from_env(
    endpoint="LLM_ENDPOINT",
    model="LLM_MODEL",
    api_key="LLM_KEY",
)

root = RouterWorkflow(
    model=backend,
    routes=[
        Route("biss", build_biss_agent, "BISS support and documentation"),
        Route("rooms", build_rooms_agent, "room search and booking"),
    ],
)

rooms_mcp = ManagedMCP.http(
    id="rooms",
    command=["python", "-m", "rooms.rooms_mcp"],
    url_env="ROOMS_MCP_URL",
)

app = create_bot_app(
    FakeModels()
    .add("root", root, lifecycle="per_request", dependency_policy="degrade")
    .add("rooms", build_rooms_agent, lifecycle="per_request", dependency_policy="degrade"),
    context=context,
    auth=ApiKeyAuth.from_env(),
    managed_mcp=[rooms_mcp],
    dependency_errors=DependencyErrorPolicy(
        message="A required upstream service is temporarily unavailable.",
    ),
    progress=True,
)

FakeModels.add() and FakeModel support three lifecycle policies:

  • startup: start once during FastAPI lifespan, the default
  • lazy: start on first use and reuse
  • per_request: build/enter/close for each request, useful for request-scoped toolsets
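To make the difference between the three policies concrete, here is a minimal standalone sketch. LifecycleHolder is a hypothetical class written for this illustration and is not how fakellm implements lifecycles; it also leaves out the close step that per_request performs after each request.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LifecycleHolder(Generic[T]):
    """Hypothetical holder illustrating the three policies; not fakellm's code."""

    def __init__(self, factory: Callable[[], T], lifecycle: str = "startup") -> None:
        if lifecycle not in {"startup", "lazy", "per_request"}:
            raise ValueError(f"unknown lifecycle: {lifecycle}")
        self._factory = factory
        self._lifecycle = lifecycle
        self._instance: Optional[T] = None

    def on_app_start(self) -> None:
        # "startup": build eagerly, once, when the app starts.
        if self._lifecycle == "startup":
            self._instance = self._factory()

    def get(self) -> T:
        # "per_request": always build a fresh instance.
        if self._lifecycle == "per_request":
            return self._factory()
        # "lazy" (or "startup" that has not run yet): build on first use, then reuse.
        if self._instance is None:
            self._instance = self._factory()
        return self._instance


builds = {"count": 0}


def build_agent() -> object:
    """Stand-in factory that counts how often it is called."""
    builds["count"] += 1
    return object()


lazy = LifecycleHolder(build_agent, "lazy")
lazy.on_app_start()  # nothing is built yet for a lazy model
first, second = lazy.get(), lazy.get()
lazy_builds = builds["count"]

per_request = LifecycleHolder(build_agent, "per_request")
fresh_a, fresh_b = per_request.get(), per_request.get()
```

In this sketch the lazy holder calls the factory once across both get() calls, while the per_request holder calls it on every get().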

Set dependency_policy="degrade" to turn connection-like dependency failures into an assistant fallback response instead of failing the whole app startup or request. /ready is available on every create_app() app and includes model lifecycle records plus any custom DependencyHealth checks passed through health=[...].

For MCP-heavy bots, MCPToolset, ManagedMCP, EnvSecret, and ContextValue cover the common pieces: external-or-managed MCP URL configuration, app lifespan startup/shutdown, context-aware headers, env-driven timeouts, and readiness checks. ToolCallPolicy provides a small middleware helper for policies like requiring a resolved username or filling a missing description from title. DependencyErrorPolicy can convert uncaught connection-like failures at the app boundary into OpenAI-compatible JSON error responses.

Deploy From YAML

For a complete configured deployment, use fakellm deploy with a YAML file:

fakellm deploy --config ./myconfig.yaml

host: 127.0.0.1
port: 8000
prefix: /v1

api_keys:
  - name: local-client
    key: replace-with-a-secret
    model_id: assistant

# Or omit model_id to allow the key to use every model in this entrypoint.
# You can also set FAKELLM_API_KEY instead of writing an api_keys block.

backends:
  local-llm:
    type: openai-compatible
    base_url: http://127.0.0.1:11434/v1
    api_key: ${LLM_KEY:-}
    model: llama3.1

mcps:
  filesystem:
    transport: stdio
    command: npx
    args: ["-y", "@modelcontextprotocol/server-filesystem", "."]
  remote-search:
    transport: http
    url: https://mcp.example.com/mcp
    headers:
      Authorization: Bearer ${MCP_TOKEN}

models:
  assistant:
    backend: local-llm
    instructions: You are a concise assistant with access to configured MCP tools.
    code_mode: true
    mcps: [filesystem, remote-search]

backends, mcps, and models can be written either as mappings, as shown above, or as lists with name/model_id fields. MCP and FastMCP support are included in fakellm's main dependencies. Use transport: http or transport: streamable-http for streamable HTTP MCP servers, and transport: sse for older SSE MCP servers. Set code_mode: true on a model to enable PydanticAI Harness CodeMode for its configured tools.
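The ${LLM_KEY:-} and ${MCP_TOKEN} placeholders in the YAML above resemble shell parameter expansion. The sketch below shows how such placeholders could be expanded, assuming shell-like semantics: ":-" falls back when the variable is unset or empty, and a missing variable without a default is treated as an error. Both behaviours are assumptions; fakellm's actual expansion rules may differ, and expand_placeholders is a hypothetical name.

```python
import re

# Matches ${NAME} and ${NAME:-default}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_placeholders(text: str, env: dict[str, str]) -> str:
    """Expand ${VAR} and ${VAR:-default} using shell-like rules (an assumption)."""

    def replace(match: re.Match[str]) -> str:
        name, default = match.group(1), match.group(2)
        # Use the value when set, unless it is empty and a default exists.
        if name in env and (env[name] or default is None):
            return env[name]
        if default is not None:
            return default
        # Assumption: an unset variable without a default is an error.
        raise KeyError(f"environment variable {name} is not set")

    return _PLACEHOLDER.sub(replace, text)


with_token = expand_placeholders("Bearer ${MCP_TOKEN}", {"MCP_TOKEN": "abc"})
empty_default = expand_placeholders("${LLM_KEY:-}", {})
```

Under these assumptions, ${LLM_KEY:-} resolves to an empty string when LLM_KEY is absent, which suits local backends that need no key.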

You can also expose multiple prefixes from one process with entrypoints:

backends:
  local-llm:
    base_url: http://127.0.0.1:11434/v1
    model: llama3.1

entrypoints:
  /alpha/v1:
    api_keys:
      - name: alpha-client
        key: alpha-secret
        model_id: alpha
    models:
      alpha:
        backend: local-llm
  /beta/v1:
    api_keys:
      - name: beta-client
        key: beta-secret
        model_id: beta
    models:
      beta:
        backend: local-llm

Optional Real Backend Tests

The default test suite uses in-memory model doubles and never needs a live LLM. To smoke-test fakellm against a real OpenAI-compatible backend, create a local .env file and opt in explicitly:

FAKELLM_RUN_REAL_LLM_TESTS=1
LLM_ENDPOINT=http://127.0.0.1:11434/v1
LLM_MODEL=gemma3:4b
# LLM_KEY=replace-with-a-secret-if-your-backend-needs-one

Then run only the real-backend tests:

uv run pytest -m real_llm tests/test_real_backend.py

Those tests stay skipped unless FAKELLM_RUN_REAL_LLM_TESTS=1 is present. When enabled, they cover:

  • direct /v1/chat/completions and /v1/responses calls through a fakellm app;
  • image and file content parts forwarded to a real backend model through both APIs;
  • the multi-model OpenAI-compatible smoke helper with hidden PydanticAI tools;
  • the examples.multi_model_agents outer-agent flow;
  • the examples.dual_subagent_judge_workflow fan-out and judge workflow.

Because these tests depend on a live model following tool-use instructions, they are intended as local pre-release smoke tests rather than CI defaults.

Run With Docker

Build the image from the repository root:

docker build -t fakellm .

The container defaults to fakellm deploy --config /config/fakellm.yaml and listens on port 8000. Mount your deployment YAML at that path and publish the port:

docker run --rm \
  -p 8000:8000 \
  -v "$PWD/myconfig.yaml:/config/fakellm.yaml:ro" \
  fakellm

Then check that the service is up:

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

If your deploy config uses environment variables such as ${LLM_KEY}, pass them to Docker:

docker run --rm \
  -p 8000:8000 \
  -v "$PWD/myconfig.yaml:/config/fakellm.yaml:ro" \
  -e LLM_KEY \
  fakellm

You can also override the default command and run the entrypoint-based CLI form instead of fakellm deploy, here with a FakeModels registry:

docker run --rm -p 8000:8000 fakellm \
  examples.multi_model_agents:MODELS \
  --host 0.0.0.0 \
  --port 8000

Or skip the CLI and embed the app directly in your own Python code:

from fastapi import FastAPI

from fakellm import ApiKey, create_app
from mypackage.my_agent import agent

app: FastAPI = create_app(
    agent,
    model_name="fake-pydanticai",
    prefix="/proxy/custom/v1",
    api_keys=[
        ApiKey(
            name="crush-local",
            key="replace-with-a-secret",
            model_id="fake-pydanticai",
        )
    ],
)

For tests and examples that should exercise the app in memory with FastAPI lifespan enabled:

import asyncio

from fakellm import create_app, live_client
from mypackage.my_agent import agent

app = create_app(agent, model_name="fake-pydanticai")

async def main() -> None:
    async with live_client(app) as client:
        response = await client.get("/v1/models")

asyncio.run(main())

Use live_client() when you want to call the fakellm app without starting a real HTTP server, for example in tests or in examples where an outer agent talks to the proxy entirely in-process.

Why it exists:

  • it creates an in-memory httpx.AsyncClient for the FastAPI app
  • it explicitly runs FastAPI lifespan startup/shutdown, so app-scoped state is initialized and cleaned up correctly
  • that matters for fakellm because fake model registries and cached long-lived agents are created for the app lifetime and then released on shutdown

When you need it:

  • use live_client(app) for tests
  • use it for local examples that wire an OpenAIProvider(http_client=...) directly to the in-memory fakellm app
  • do not use it when fakellm is already running behind a real HTTP endpoint; in that case use a normal httpx.AsyncClient or any OpenAI-compatible client against the server URL

Multimodal user content

User messages can include OpenAI-compatible content parts for text, images, and files. Image parts use the Chat Completions image_url shape, including remote URLs and base64 data URLs:

{
  "role": "user",
  "content": [
    {"type": "text", "text": "What text is in this image?"},
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/png;base64,...",
        "detail": "auto"
      }
    }
  ]
}
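A base64 data URL of this kind can be produced from raw image bytes with the standard library. The helper below is a hypothetical sketch that only reproduces the JSON shape shown above; the bytes used are just the PNG signature, not a complete image.

```python
import base64


def image_part(image_bytes: bytes, mime_type: str = "image/png", detail: str = "auto") -> dict:
    """Build a Chat Completions image_url content part from raw bytes.

    Hypothetical helper; it only reproduces the JSON shape shown above.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}", "detail": detail},
    }


message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What text is in this image?"},
        image_part(b"\x89PNG\r\n\x1a\n"),  # placeholder bytes, not a complete image
    ],
}
```

In practice the bytes would come from reading an image file in binary mode.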

File parts can pass uploaded OpenAI file IDs or inline base64 file data. Inline files are forwarded to the inner PydanticAI agent as BinaryContent; file IDs are forwarded as OpenAI UploadedFile references.

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Summarize this file."},
    {
      "type": "file",
      "file": {
        "filename": "notes.txt",
        "file_data": "VGhlIGZpbGUgY29udGVudHMu"
      }
    }
  ]
}
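The file_data value above is plain base64 and decodes to the text "The file contents.". A minimal sketch for building such a part from bytes follows; inline_file_part is a hypothetical helper name, not a fakellm API.

```python
import base64


def inline_file_part(filename: str, data: bytes) -> dict:
    """Build a file content part carrying inline base64 data (hypothetical helper)."""
    return {
        "type": "file",
        "file": {
            "filename": filename,
            # Inline file data is sent as base64 text, as in the example above.
            "file_data": base64.b64encode(data).decode("ascii"),
        },
    }


part = inline_file_part("notes.txt", b"The file contents.")
```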

For multiple fake models:

from fastapi import FastAPI

from fakellm import FakeModels, create_app
from mypackage.multi_models import build_code_agent, build_weather_agent

app: FastAPI = create_app(
    FakeModels()
    .add("demo-weather", build_weather_agent)
    .add("demo-code", build_code_agent),
    prefix="/proxy/custom/v1",
)

FakeModels.add() accepts either:

  • an Agent
  • a factory returning an Agent
  • an AgenticWorkflow
  • a factory returning an AgenticWorkflow

If you already have a plain mapping from fake model ID to an Agent or factory, that still works.

Programmatic Agentic Workflows

You can also host a custom workflow directly, without making the backend itself a pydantic_ai.Agent. Wrap an async function or object with AgenticWorkflow and return either a final value or a WorkflowResponse with client-visible tool calls:

from fakellm import (
    AgenticWorkflow,
    WorkflowRequest,
    WorkflowResponse,
    WorkflowToolCall,
    create_app,
)


async def workflow(request: WorkflowRequest) -> WorkflowResponse | str:
    if request.tool_results:
        weather = request.tool_results["weather-call"]
        return f"The client says Paris is {weather['temperature_c']}C."

    return WorkflowResponse(
        tool_calls=[
            WorkflowToolCall(
                name="get_weather",
                arguments={"city": "Paris"},
                id="weather-call",
            )
        ]
    )


app = create_app(
    AgenticWorkflow(workflow),
    model_name="workflow-assistant",
)

WorkflowRequest includes the original OpenAI-compatible messages, normalized PydanticAI message history for callers that want it, client-provided tool results, externally supplied tool definitions, the resolved output type, and any deps returned by request_context_factory.
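On the client side, returning a tool result over /v1/chat/completions means resending the assistant tool call together with a matching tool message. The sketch below assumes the standard Chat Completions message shapes, and assumes fakellm decodes JSON tool content into request.tool_results keyed by call ID, as the workflow above implies; tool_result_messages is a hypothetical helper.

```python
import json


def tool_result_messages(
    user_text: str, call_id: str, name: str, arguments: dict, result: dict
) -> list[dict]:
    """Build a Chat Completions history that returns one tool result.

    Hypothetical helper using the standard OpenAI message shapes.
    """
    return [
        {"role": "user", "content": user_text},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": call_id,
                    "type": "function",
                    "function": {"name": name, "arguments": json.dumps(arguments)},
                }
            ],
        },
        # The tool message answers the call with the same ID.
        {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)},
    ]


messages = tool_result_messages(
    "Check Paris weather",
    "weather-call",
    "get_weather",
    {"city": "Paris"},
    {"temperature_c": 21},
)
```

Sending these messages with model "workflow-assistant" would, under those assumptions, take the request.tool_results branch of the workflow.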

Examples

examples/multi_model_agents.py shows the recommended way to declare multiple models with FakeModels(), then runs two outer agents in parallel against the two fake model IDs:

MODELS = FakeModels().add("demo-weather", build_weather_agent).add("demo-code", build_code_agent)

That example uses standard PydanticAI agents and normal @agent.tool_plain tools. Configure LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY before running it.

examples/builtin_tools_agent.py shows the same inner/outer fakellm shape as the CodeMode example, using a hidden server-side Hacker News tool that fetches the raw https://news.ycombinator.com/news DOM instead of the Firebase API. It auto-loads LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY from .env, then asks the outer agent for the current top three Hacker News articles through the fake OpenAI-compatible model.

examples/uvicorn_hacker_news_agent.py exposes a Hacker News-capable agent as a real OpenAI-compatible HTTP server for clients like Crush. The hidden agent uses the Hacker News Firebase API to fetch the current top stories, while its backing LLM comes from LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY. The example auto-loads those values from a local .env file if present.

LLM_MODEL=your-backend-model
LLM_ENDPOINT=https://example.com/v1
LLM_KEY=...
uv run uvicorn examples.uvicorn_hacker_news_agent:app --host 127.0.0.1 --port 8000

For a local Ollama-compatible backend, omit LLM_KEY and point LLM_ENDPOINT at your local /v1 base URL. The project-local .crush.json configures Crush to use this uvicorn server as an openai-compat provider with model ID fakellm-hacker-news. Once the server is running, start Crush from this repo and ask for the top three Hacker News stories.

examples/uvicorn_hacker_news_mcp_agent.py exposes the same kind of server, but the hidden Hacker News capability comes from an in-process FastMCP server attached to the backend PydanticAI agent as a toolset. The OpenAI-compatible client still only sees the fake model ID, not the MCP server or its tools.

uv run --with "pydantic-ai-slim[fastmcp]" \
  uvicorn examples.uvicorn_hacker_news_mcp_agent:app --host 127.0.0.1 --port 8000

The model ID for that MCP-backed example is fakellm-hacker-news-mcp.

If you want PydanticAI Harness CodeMode behind fakellm, see examples/code_mode_agent.py. It hosts an inner CodeMode-enabled agent as a fake OpenAI model, then runs an outer PydanticAI agent against that fake model. Per the official docs, CodeMode comes from pydantic-ai-harness and wraps your normal tools into a single run_code tool so the inner model can orchestrate multiple tool calls in Python. Run it with:

LLM_KEY=... \
LLM_MODEL=chat \
LLM_ENDPOINT=https://example.com/v1 \
uv run python examples/code_mode_agent.py

The example also auto-loads these values from a local .env file, so uv run python examples/code_mode_agent.py works if .env defines LLM_KEY, LLM_MODEL, and LLM_ENDPOINT.

examples/openai_compatible_hidden_tool_smoke.py runs a real end-to-end smoke test against any OpenAI-compatible backend and verifies:

  • multiple backend-backed inner PydanticAI agents
  • a hidden server-side tool call
  • the fakellm OpenAI-compatible adapter
  • second-layer PydanticAI agents using the adapter as their model endpoint
  • concurrent fake model hosting from one app
  • a custom prefix

It accepts LLM_MODEL, LLM_ENDPOINT, and optional LLM_KEY from the environment, so it works with hosted backends and local Ollama-compatible endpoints.

LLM_KEY=... \
LLM_MODEL=chat \
LLM_ENDPOINT=https://example.com/v1 \
uv run python examples/openai_compatible_hidden_tool_smoke.py \
  --proxy-model-names fake-backend-alpha fake-backend-beta \
  --prefix /proxy/custom/v1

For a local Ollama-compatible endpoint, you can omit LLM_KEY and point LLM_ENDPOINT at your local /v1 base URL.

examples/openai_responses_reasoning.py exercises the /v1/responses endpoint entirely in memory. It shows a workflow that requests a client-visible function call, receives a function_call_output, and streams custom reasoning/progress events with ResponsesReasoningConfig(mode="custom").

uv run python examples/openai_responses_reasoning.py
