Superseded with deeper integration by https://github.com/notBlubbll/GC2XY
Ollama-emulating proxy that connects GitHub Copilot Chat and Agent mode to the OpenCode Zen and Go APIs and their models.
No key needed: free models work immediately. Add an API key in .env to unlock paid models and enable premium mode (models from your Go subscription). Supports full tool calling and streaming.
Console — model table, context lengths, key status, commands:
Agent mode — tool calling with free and paid models:
Chit-chat:
| What | Why |
|---|---|
| Bun or Node.js | Runtime (Bun preferred, Node as fallback) |
| Visual Studio 2026 (18.6.0+ incl. Insiders) | Ollama provider (18.6.0+) |
| Client | Status |
|---|---|
| VS 2026 (18.6.0+) | ✓ Supported |
| VS 2026 Insiders | ✓ Supported (features may be in flux) |
| VS 2026 (LocalPilot) | ⚠ Unsupported but working |
| VS Code | ⚠ Supported, not fully tested |
| SQL Studio (22.6.0+) | ✓ Supported |
VS 2026: Ollama provider available in Visual Studio 2026 18.6.0 and later (regular + Insiders). Insiders features may be in flux due to its preview nature — expect occasional regressions.
VS Code: Works via the GitHub Copilot extension's Ollama provider, but has not been thoroughly tested. Tool calling and streaming may have edge cases.
SQL Studio: Ollama provider available in SQL Server Management Studio 22.6.0 and later.
start.cmd # Windows
bun run start # Bun (preferred)
npm run node # Node.js fallback

Requires Visual Studio 2026 18.6.0+ (regular or Insiders).
- Open the Copilot Chat panel
- Click the model dropdown (next to the agent selector) → Manage Models
- Click Select Provider → Ollama
- Leave the endpoint at `http://localhost:11434` (unless your port differs)
- Click Add. VS fetches models and validates them automatically:
You can now select any model from the dropdown. No model IDs to configure — the proxy resolves display names to the correct API IDs.
- Install GitHub Copilot extension
- Open Copilot Chat → model dropdown → Manage Models
- Click Select Provider → Ollama
- Enter `http://localhost:11434` as the endpoint
- Click Add. Models appear with `[FREE]`/`[GO]` prefixes:
| Prefix | Meaning |
|---|---|
| `[FREE]` | Free tier: always available, no key needed |
| `[FREE*]` | Freemium: free model, requires API key (orange in console) |
| `[GO]` | Premium: requires `OPENCODE_API_KEY` in `.env` |
| `[CROF]` | CrofAI: requires `CROF_API_KEY` in `.env` |
| `[M365]` | Microsoft 365 Copilot: requires `M365C_TOKEN_PATH` in `.env` |
# .env
OPENCODE_API_KEY=your-go-key
OPENCODE_API_KEYS=["key1","key2"] # multi-key rotation

Adding an API key enables freemium mode: the Zen free models are also available with API key authentication (shown in orange in the console). Paid models are fetched dynamically and support full tool calling.
Startup key validation: On launch, the proxy pings deepseek-v4-flash with max_tokens: 1 to verify the key can run inference. If it returns 429, the key is marked as rate-limited (with timing extracted from the error message, e.g. Resets in 1 day) and paid models are hidden from the list — avoiding wasted API calls on an unusable key. If deepseek-v4-flash returns 404, it falls back to the first premium model in the API response. Cooldown state is persisted to .cache/key-state.json and respected on restart.
When multiple API keys are configured via OPENCODE_API_KEYS, the proxy uses an ApiBalancer that:
- Shuffles keys — keys are shuffled and distributed randomly each time the pool is exhausted, preventing predictable rotation patterns
- Tracks consecutive 429s — each key's rate limit responses are counted independently
- Auto-cooldowns keys:
  - 10 consecutive 429s → key removed from rotation
  - Cooldown duration: if upstream usage data is available, matches the actual API quota reset (`rollingUsage` ~5h, `weeklyUsage` ~until Monday UTC, `monthlyUsage` ~1st of month). Otherwise falls back to 5h (first strike) / 1 week (second strike)
  - A single successful request immediately clears all cooldowns and resets the 429 counter
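The rotation and cooldown rules above can be sketched roughly as follows. This is a minimal illustration, not the proxy's actual `ApiBalancer`: the `KeyPool` class, its method names, and the per-key strike counter are hypothetical, and the sketch only implements the fallback durations (5h, then 1 week), ignoring upstream usage data.

```javascript
// Hypothetical sketch: KeyPool and its methods are illustrative names,
// not the proxy's actual ApiBalancer implementation.
const FIVE_HOURS_MS = 5 * 60 * 60 * 1000;
const ONE_WEEK_MS = 7 * 24 * 60 * 60 * 1000;
const MAX_CONSECUTIVE_429 = 10;

class KeyPool {
  constructor(keys, now = () => Date.now()) {
    this.now = now;
    this.queue = [];
    this.state = new Map(
      keys.map((k) => [k, { consecutive429: 0, strikes: 0, cooldownUntil: 0 }])
    );
  }

  isLive(key) {
    return this.state.get(key).cooldownUntil <= this.now();
  }

  // Refill the queue with a Fisher-Yates shuffle of the keys not in cooldown.
  refill() {
    const live = [...this.state.keys()].filter((k) => this.isLive(k));
    for (let i = live.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [live[i], live[j]] = [live[j], live[i]];
    }
    this.queue = live;
  }

  // Next usable key, or null when every key is cooling down.
  next() {
    for (let pass = 0; pass < 2; pass++) {
      if (this.queue.length === 0) this.refill();
      while (this.queue.length > 0) {
        const key = this.queue.pop();
        if (this.isLive(key)) return key;
      }
    }
    return null;
  }

  // Record the HTTP status of a request made with `key`.
  report(key, status) {
    const s = this.state.get(key);
    if (status === 429) {
      s.consecutive429 += 1;
      if (s.consecutive429 >= MAX_CONSECUTIVE_429) {
        s.strikes += 1; // assumption: strikes are counted per key
        s.cooldownUntil =
          this.now() + (s.strikes === 1 ? FIVE_HOURS_MS : ONE_WEEK_MS);
        s.consecutive429 = 0; // assumption: the counter restarts after a cooldown
      }
    } else if (status >= 200 && status < 300) {
      // Assumption: "clears all cooldowns" means every key, not just this one.
      for (const other of this.state.values()) other.cooldownUntil = 0;
      s.consecutive429 = 0;
    }
  }
}
```

Injecting `now` keeps the cooldown arithmetic deterministic, which makes the strike logic easy to exercise without waiting five hours.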
Cooldown state is persisted to .cache/key-state.json. You can manually edit this file to clear cooldowns or adjust counters:
{
"keys": {
"sk-abc1...xyz9": {
"consecutive429": 3,
"cooldownUntil": "2026-05-09T18:00:00.000Z"
}
},
"updatedAt": "2026-05-09T13:00:00.000Z"
}

| Field | Description |
|---|---|
| `consecutive429` | Current consecutive 429 count (resets on success) |
| `cooldownUntil` | ISO timestamp when the key returns to rotation (5h or 1 week cooldown) |
To manually clear a key's cooldown, delete its entry or remove cooldownUntil, then restart the proxy.
On startup and on each model refresh, the proxy loads .cache/key-state.json and restores any active cooldowns into the ApiBalancer:
- `_restoreState()`: reads the JSON file, maps `short` key fragments (`sk-abc1...xyz9`) back to full key strings from the env, and sets `cooldownUntil` entries for non-expired cooldowns. Logs `[keys] restored N cooldown(s) from cache` on success.
- Direct disk safety net (`fetchGoModelsRaw`): as a second check, reads `key-state.json` directly and builds a `cooldownFromDisk` Map. The `keyInCooldown()` helper checks both the in-memory `_balancer.cooldownUntil` AND the direct disk Map. This ensures a key in cooldown is never pinged even if the `_restoreState` mapping fails silently (e.g. key format mismatch between sessions).
- Individual key cooldown: before each ping in `fetchGoModelsRaw`, `keyInCooldown(k)` is called. Keys in cooldown log `[keys] key[N] in cooldown (~Xs) — skipping` and are never contacted.
- All-key cooldown: if every key is in cooldown, the entire paid model fetch is skipped with `[keys] all keys in cooldown — skipping paid models`.
This means a key rate-limited from a previous session will never be pinged on restart — it's skipped entirely, saving a wasted 429 roundtrip.
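A minimal sketch of the two-source check described above, under stated assumptions: the function names echo the README, but their bodies and signatures here are guesses, and `shortKey()` (first 7 and last 4 characters) is a hypothetical way to produce fragments shaped like `sk-abc1...xyz9`.

```javascript
// Hypothetical sketch; the proxy's real helpers may differ.

// Assumption: short fragments are "<first 7 chars>...<last 4 chars>".
function shortKey(key) {
  return `${key.slice(0, 7)}...${key.slice(-4)}`;
}

// Build a Map of short key -> cooldown end (ms) from key-state.json text,
// keeping only cooldowns that have not expired yet.
function loadCooldownFromDisk(jsonText, now) {
  const map = new Map();
  let parsed;
  try {
    parsed = JSON.parse(jsonText);
  } catch {
    return map; // unreadable file: behave as if no cooldowns were stored
  }
  for (const [short, entry] of Object.entries(parsed?.keys ?? {})) {
    const until = Date.parse(entry.cooldownUntil ?? "");
    if (Number.isFinite(until) && until > now) map.set(short, until);
  }
  return map;
}

// A key is skipped if EITHER the in-memory state or the disk copy says so.
function keyInCooldown(key, memoryCooldowns, cooldownFromDisk, now) {
  const inMemory = (memoryCooldowns.get(key) ?? 0) > now;
  const onDisk = (cooldownFromDisk.get(shortKey(key)) ?? 0) > now;
  return inMemory || onDisk;
}
```

Checking both sources means a failed in-memory restore still cannot cause a ping to a key the disk file marks as cooling down.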
| Variable | Default | Description |
|---|---|---|
| `SERVER_PORT` | `11434` | Listen port |
| `SERVER_HOST` | `127.0.0.1` | Listen host |
| `DEFAULT_MODEL` | `big-pickle` | Fallback model |
| `DEFAULT_TEMPERATURE` | - | Global temperature (e.g. `0.1`) |
| `M365CO_PORT` | - | M365 WebSocket relay port (e.g. `8765`) |
| `CROF_API_KEY` | - | CrofAI API key for Crof models (set to enable the Crof model section) |
| `CACHE_ENABLED` | `true` | Prompt cache |
| `CACHE_MAX_SIZE` | `64` | Max cached entries |
| `CACHE_TTL_SEC` | `300` | Cache TTL |
| `REQUEST_LOG` | `true` | Log incoming requests to console |
| `HIDE_FREE` | `false` | Hide free models and `[FREE]`/`[GO]` tags & dividers |
| `SESSION_KEEPALIVE_ENABLED` | `true` | Keep KV cache warm between turns |
| `SESSION_KEEPALIVE_INTERVAL_MS` | `120000` | Keepalive ping interval (ms) |
| `SESSION_KEEPALIVE_IDLE_TIMEOUT_MS` | `600000` | Stop keepalive after idle (ms) |
| `SESSION_KEEPALIVE_MAX_LIFETIME_MS` | `86400000` | Cycle upstream cache after (ms, 24h) |
| `DISPLAY_REASONING` | `false` | Mirror DeepSeek thinking tokens into Cursor/VS-visible markdown blocks |
| `COLLAPSIBLE_REASONING` | `true` | Use collapsible `<details>` blocks when `DISPLAY_REASONING=true` |
Models appear in VS Code's Copilot list as [FREE] Model Name, [FREE*] Model Name, [GO] Model Name, [CROF] Model Name, and [M365] M365 Copilot — the prefix indicates free vs freemium vs paid vs Crof vs M365 tier at a glance.
Free (always available, auto-validated): Big Pickle, MiniMax M2.5 Free, Nemotron 3 Super Free, Ring 2.6 1T Free
Freemium (requires API key, same Zen free models — auto-detected via ping): Big Pickle, MiniMax M2.5 Free, Nemotron 3 Super Free, Ring 2.6 1T Free — these route to the same free Zen endpoint but authenticate with your API key. On startup, each free model is pinged without a key; if it returns 401, it's retried with your API key. Models that succeed with the key are marked freemium and shown in orange in the console.
Paid (requires Go API key): fetched dynamically from OpenCode — all support tool calling
Some models support adjustable thinking effort. These appear in the model list both as a default entry (no thinking mode applied) and as separate tagged entries per thinking level:
| Tag | Meaning |
|---|---|
| `[LOW]` | Low reasoning effort |
| `[MED]` | Medium reasoning effort |
| `[HIGH]` | High reasoning effort |
| `[MAX]` | Maximum reasoning effort |
Models that think on their own (GLM, Kimi, MiniMax, Qwen) appear without tags — their thinking is handled internally and can't be controlled.
This thinking-mode tagging on the model name will be replaced with native thinking controls once Visual Studio adds proper support for it.
M365 Copilot (optional, requires M365CO_PORT): your company's Microsoft 365 Copilot chat — two models (Quick + Think), chat-only, no tools
When using VS 2026 agent mode, the proxy handles file creation and project integration:
- New files (`.css`, `.js`, `.py`, etc.) are created via tool calls and written to disk with absolute workspace paths
- Project files (`.csproj`, `.vbproj`, `.fsproj`, etc.) are handled natively: markdown code blocks pass through for VS to edit in place
- Auto-injection: new files are automatically added to the project's `.csproj` with the correct `<Content Include="..." />` entry
- Workspace root is extracted from VS 2026's IDE state context, so relative file paths are resolved automatically
To create a new file, just ask Copilot (e.g. "create me a css file called test.css"). The AI will:
- Create the file
- Read the project file
- Add the file reference to the project
| Endpoint | Method | Description |
|---|---|---|
| `/api/tags` | GET | Ollama model list with capabilities, context length, pricing |
| `/v1/chat/completions` | POST | Chat with tool calling, streaming, cache |
| `/v1/engines/copilot-codex/completions` | POST | Inline code completions |
| `/api/show` | POST | Model detail with full capabilities, context, pricing |
| `/api/stats` | GET | Proxy metrics (uptime, model counts, concurrency, reasoning cache, key status) |
| `/api/refresh` | POST | Force refresh model list from upstream APIs |
| `/api/diagnostics` | POST | Self-test with tool-calling roundtrip (connectivity, streaming, tool verification) |
| `/health` | GET | Health check with model counts |
| `/api/version` | GET | Returns `420.96.00` |
| `/stop` | GET | Shutdown |
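As a quick way to poke at the model list endpoint from a script, something like the following may work. It is a sketch under assumptions: `listModels` is a hypothetical helper, and the response shape (`{ models: [{ name }] }`) is the one Ollama's own `/api/tags` uses, which the proxy is assumed to mirror since it emulates Ollama.

```javascript
// Hypothetical helper; assumes an Ollama-style { models: [{ name }] } body.
// Relies on the global fetch available in Bun and in Node.js 18+.
async function listModels(baseUrl = "http://localhost:11434", fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/tags`);
  if (!res.ok) throw new Error(`GET /api/tags failed: ${res.status}`);
  const body = await res.json();
  return (body.models ?? []).map((m) => m.name);
}
```

With the proxy running, `listModels().then(console.log)` should print the display names, prefixes included.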
| Command | Action |
|---|---|
| `r` / `restart` | Restart proxy |
| `s` / `stop` | Shut down |
| `e` / `exit` | Shut down |
Or curl http://localhost:11434/stop
On startup, the proxy fetches the latest ticks from notBlubbll/gc2oc/.version (raw) and compares them with the local .version file. If they differ, the repo has been updated — the console title changes to:
gc2oc (outdated, check github for new version)
The status line shows green when up to date (match) and red when outdated (mismatch).
A GitHub Actions workflow writes the current UNIX timestamp in ms to the `version` file on each push to `main`.
The proxy tracks tokens-per-second throughput and displays a rolling average in the console title.
| State | Title |
|---|---|
| No activity | gc2oc |
| After requests | gc2oc [42.5 t/s] |
| Outdated + TPS | gc2oc (outdated, check github for new version) [42.5 t/s] |
Set SHOW_TPS=false in .env to disable.
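The version comparison and the title states in the table above could be combined along these lines. Only the title strings come from this README; `isOutdated`, `RollingTps`, `consoleTitle`, and the 10-sample window are hypothetical choices made for the sketch.

```javascript
// Hypothetical helpers; only the title strings are taken from the README.

// The repo is considered outdated when the two tick values differ.
function isOutdated(localTicks, remoteTicks) {
  return String(localTicks).trim() !== String(remoteTicks).trim();
}

// Rolling average over the last `windowSize` requests (window size is assumed).
class RollingTps {
  constructor(windowSize = 10) {
    this.windowSize = windowSize;
    this.samples = [];
  }

  record(tokens, elapsedMs) {
    if (elapsedMs <= 0) return;
    this.samples.push(tokens / (elapsedMs / 1000));
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  average() {
    if (this.samples.length === 0) return null;
    return this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
  }
}

function consoleTitle({ outdated = false, tps = null, showTps = true }) {
  let title = "gc2oc";
  if (outdated) title += " (outdated, check github for new version)";
  if (showTps && tps !== null) title += ` [${tps.toFixed(1)} t/s]`;
  return title;
}
```

For example, recording 85 tokens over 2000 ms yields an average of 42.5, which `consoleTitle` renders as `gc2oc [42.5 t/s]`.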
No git required. update.cmd downloads the latest main.zip from GitHub and applies only changed files — your config and caches are preserved.
update.cmd
| Step | What happens |
|---|---|
| Download | Fetches main.zip from the repo |
| Extract | Unzips to a temp folder |
| Compare | MD5-hashes every file — copies only new or changed files |
| Preserve | .env, .cache/, .dist/, node_modules/, .git/ are never touched |
| Cleanup | Temp folder removed automatically |
Each file is labeled NEW (first time), UPD (changed), or SKIP (preserved) so you can see exactly what was updated. Restart the proxy after updating to use the new code.
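`update.cmd` itself is a Windows batch script, so the following is only a JavaScript illustration of the compare step, not its actual code. The `classify` function is hypothetical, and returning `null` for unchanged files is an assumption, since the README does not name a label for them.

```javascript
// Illustrative sketch of the "Compare" and "Preserve" steps; not update.cmd itself.
const crypto = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

// Top-level names that are never touched, per the table above.
const PRESERVED = [".env", ".cache", ".dist", "node_modules", ".git"];

function md5(file) {
  return crypto.createHash("md5").update(fs.readFileSync(file)).digest("hex");
}

// Returns "SKIP", "NEW", "UPD", or null (unchanged; label is an assumption).
function classify(relPath, srcRoot, destRoot) {
  const top = relPath.split(/[\\/]/)[0];
  if (PRESERVED.includes(top)) return "SKIP";
  const dest = path.join(destRoot, relPath);
  if (!fs.existsSync(dest)) return "NEW";
  return md5(path.join(srcRoot, relPath)) === md5(dest) ? null : "UPD";
}
```

MD5 is adequate here because the goal is change detection between two trusted copies, not tamper resistance.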
update-and-build.cmd fetches the latest source then runs build.cmd in one step — pull the newest code and produce a fresh .dist standalone.
update-and-build.cmd
The full model list (free + paid + Pollinations + M365) is cached to .cache/models.json. On restart, if the key hash matches and no relevant config changed (free tier models, M365 token path), the cache is loaded instantly — no upstream API calls needed.
Invalidation triggers:
- Key hash changes — keys added, removed, or rotated → full refresh
- Free tier models changed in code — SHA256 hash of all free model IDs compared to cached value
- M365 token set or removed — cached M365 presence vs current env mismatch
SHA256 hash of all API keys (sorted, deduped) persisted to .cache/keyhash.json. Used at startup to detect key changes without re-parsing .env — if the hash matches, paid models load from disk cache instantly.
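A sketch of how such a key hash might be computed. The `keyHash` name, the trimming, and the newline separator are assumptions; the README only states that keys are sorted and deduped before hashing with SHA256.

```javascript
// Hypothetical sketch of an order-insensitive hash over the configured keys.
const crypto = require("node:crypto");

function keyHash(keys) {
  // Dedupe, drop empties, and sort so that key order in .env does not matter.
  const normalized = [...new Set(keys.map((k) => k.trim()).filter(Boolean))].sort();
  return crypto.createHash("sha256").update(normalized.join("\n")).digest("hex");
}
```

Because only the digest is persisted, the cache file never needs to contain the keys themselves.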
Persists per-key cooldown state between restarts. See Key Cooldown Checker above for the full load/restore flow. File is written on every cooldown state change and loaded on startup + each refreshModels() call.
LRU with TTL. Responses keyed by hash of model + temperature + tool count + session discriminator + normalized messages. Cache hits replay instantly with zero tokens. Controlled by CACHE_ENABLED/CACHE_MAX_SIZE/CACHE_TTL_SEC.
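An LRU cache with TTL of this kind can be sketched on top of a `Map`, which iterates in insertion order. The class and function names are hypothetical, and the message normalization in `cacheKey` (role plus trimmed content) is an assumption, not the proxy's actual rule.

```javascript
// Hypothetical sketch of an LRU + TTL prompt cache.
const crypto = require("node:crypto");

class TtlLruCache {
  constructor(maxSize = 64, ttlSec = 300, now = () => Date.now()) {
    this.maxSize = maxSize;
    this.ttlMs = ttlSec * 1000;
    this.now = now;
    this.map = new Map(); // insertion order doubles as recency order
  }

  get(key) {
    const hit = this.map.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.map.delete(key); // expired
      return undefined;
    }
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, hit);
    return hit.value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.map.size > this.maxSize) {
      this.map.delete(this.map.keys().next().value); // evict least recent
    }
  }
}

// Key over model + temperature + tool count + session + normalized messages.
function cacheKey({ model, temperature, tools = [], sessionId, messages }) {
  const normalized = messages.map((m) => ({
    role: m.role,
    content: String(m.content ?? "").trim(), // assumption: trim-only normalization
  }));
  const payload = JSON.stringify([model, temperature ?? null, tools.length, sessionId, normalized]);
  return crypto.createHash("sha256").update(payload).digest("hex");
}
```

Including the session discriminator in the key is what keeps identical prompts from different conversations in separate entries.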
Per-session reasoning text from <think> tags is cached and re-attached when a cached prompt-response pair is replayed. Ensures DeepSeek-style reasoning isn't lost on cache hits. Session-scoped — different conversations never share reasoning data, even if they produce the same text.
Multi-tier reasoning key system (inspired by yxlao/deepseek-cursor-proxy)
Reasoning is stored under multiple lookup keys to maximize cache hit rates across tool-call conversations:
| Key type | Description | Survives |
|---|---|---|
| Message signature | SHA256 of content + canonicalized tool calls | Exact match |
| Tool call ID | Per-call `id` field | Argument re-ordering |
| Tool call signature | SHA256 of tool name + normalized args | ID reassignment |
| Tool name | Plain function name | Interrupted streams, missing IDs |
| Legacy content hash | Simplified content-based hash | Backward compatibility |
Lookup priority: Message signature → Tool call IDs → Tool call signatures → Tool names → Legacy content hash → FIFO → Global last-reasoning fallback.
Smart memory management: LRU eviction at 5000 entries, preserving permanent fallback keys (g:*:last, *:mdl:*). No disk persistence — everything stays in memory for speed.
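The first four lookup tiers can be illustrated with a small sketch. Everything here is hypothetical: the key prefixes (`msg`, `id`, `sig`, `name`), the function names, and the canonicalization rule are assumptions, and the legacy hash, FIFO, and global fallback tiers are left out. Tool calls are assumed to follow the OpenAI shape (`id`, `function.name`, `function.arguments` as a JSON string).

```javascript
// Hypothetical sketch of multi-key reasoning storage and prioritized lookup.
const crypto = require("node:crypto");
const sha = (text) => crypto.createHash("sha256").update(text).digest("hex");

// Sort object keys recursively so argument order does not change the hash.
function canonical(value) {
  if (Array.isArray(value)) return value.map(canonical);
  if (value && typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, canonical(value[k])])
    );
  }
  return value;
}

// Hash of tool name + normalized arguments.
function callSignature(call) {
  let args;
  try {
    args = JSON.parse(call.function.arguments || "{}");
  } catch {
    args = call.function.arguments; // keep the raw string if it is not JSON
  }
  return sha(call.function.name + JSON.stringify(canonical(args)));
}

// Lookup keys in priority order: message, call IDs, call signatures, names.
function lookupKeys(session, msg) {
  const calls = msg.tool_calls ?? [];
  const sigs = calls.map(callSignature);
  return [
    `${session}:msg:${sha((msg.content ?? "") + sigs.join(","))}`,
    ...calls.filter((c) => c.id).map((c) => `${session}:id:${c.id}`),
    ...sigs.map((s) => `${session}:sig:${s}`),
    ...calls.map((c) => `${session}:name:${c.function.name}`),
  ];
}

function storeReasoning(cache, session, msg, reasoning) {
  for (const key of lookupKeys(session, msg)) cache.set(key, reasoning);
}

function findReasoning(cache, session, msg) {
  for (const key of lookupKeys(session, msg)) {
    if (cache.has(key)) return cache.get(key);
  }
  return undefined;
}
```

Storing the same reasoning under several keys costs a little memory but lets a lookup succeed even when the client rewrites IDs or reorders arguments.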
When DISPLAY_REASONING=true, DeepSeek reasoning tokens are mirrored into Cursor/VS Code-visible content as Markdown blocks:
- Collapsible (`COLLAPSIBLE_REASONING=true`, default): `<details><summary>Thinking</summary>...</details>`
- Plain (`COLLAPSIBLE_REASONING=false`): `<think>...</think>`
Echoed thinking blocks in incoming assistant content are automatically stripped before forwarding to upstream APIs. This prevents "reasoning doubling" when VS/Cursor echoes back the proxy-injected display blocks.
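Stripping echoed blocks might look roughly like this. The function name and the exact patterns are assumptions based on the two display formats listed above; the proxy's real matching rules may be stricter or looser.

```javascript
// Hypothetical sketch: remove proxy-injected reasoning blocks from assistant turns.
const THINK_BLOCKS = [
  /<details>\s*<summary>Thinking<\/summary>[\s\S]*?<\/details>\s*/g,
  /<think>[\s\S]*?<\/think>\s*/g,
];

function stripEchoedReasoning(messages) {
  return messages.map((m) => {
    // Only assistant text is rewritten; user and tool content is left alone.
    if (m.role !== "assistant" || typeof m.content !== "string") return m;
    let content = m.content;
    for (const pattern of THINK_BLOCKS) content = content.replace(pattern, "");
    return { ...m, content };
  });
}
```

The non-greedy `[\s\S]*?` keeps each match limited to one block, so several display blocks in the same message are removed independently.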
When the upstream DeepSeek API returns the error `reasoning_content must be passed back`, the proxy recovers in two tiers:
- Tier 1: retries with `thinking: false`, i.e. thinking disabled (preserves full history)
- Tier 2: strips all assistant/tool messages, keeping only system + user
This matches the recovery strategy in yxlao/deepseek-cursor-proxy, which pioneered this approach for Cursor compatibility.
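The two recovery tiers can be sketched as a retry wrapper. This is a hedged illustration: `chatWithRecovery` and the injected `send` function are hypothetical, the error is assumed to surface as an exception whose message contains the upstream text, and keeping `thinking: false` during tier 2 is a guess.

```javascript
// Hypothetical sketch of tiered recovery; `send` performs the upstream call.
function needsReasoning(err) {
  return String(err?.message ?? "").includes("reasoning_content must be passed back");
}

async function chatWithRecovery(request, send) {
  try {
    return await send(request);
  } catch (err) {
    if (!needsReasoning(err)) throw err;
  }
  // Tier 1: same history, thinking disabled.
  try {
    return await send({ ...request, thinking: false });
  } catch (err) {
    if (!needsReasoning(err)) throw err;
  }
  // Tier 2: drop assistant/tool turns, keep only system + user messages.
  const messages = request.messages.filter(
    (m) => m.role === "system" || m.role === "user"
  );
  return send({ ...request, thinking: false, messages });
}
```

Unrelated errors are rethrown immediately, so the fallback path is only taken for this one failure mode.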
On startup and refresh, each free model is pinged with a lightweight request (max_tokens: 1). Only responding models appear in the list. Results are cached to disk.
A quick big-pickle ping runs at startup to verify Zen API reachability: 200 ok, 401 key denied, or unreachable.
The proxy detects and numbers distinct conversation sessions. Each new chat tab or task context gets a monotonic session ID, visible in the console:
new session 3 (vscode, go/deepseek-v4-flash, c:\workspace\project)
[vscode][3]>[go/deepseek-v4-flash]
[vscode][3] stream done (42 chunks)
A session is identified by hashing:
- All user messages before the first assistant/tool message — VS sends the context block + the user's query as separate user messages, and the combined hash uniquely identifies each chat tab
- Workspace root — same query in a different project is treated as a different session
Switching models mid-conversation keeps the same session — the message history and cached context carry over.
This means different chat tabs and different workspaces get separate session IDs with isolated caches, while a model switch inside one conversation does not start a new session.
All in-memory caches (prompt-response LRU, reasoning/<think> text) are session-scoped — data from one conversation never leaks into another. Two different users asking the same question will never get each other's cached responses or reasoning text.
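Session identification as described could be sketched like this. `SessionTracker` is a hypothetical name and the exact hash input is an assumption; the sketch only reflects the two stated ingredients (leading user messages and workspace root) and deliberately leaves the model out of the hash.

```javascript
// Hypothetical sketch: monotonic session IDs from a conversation fingerprint.
const crypto = require("node:crypto");

class SessionTracker {
  constructor() {
    this.ids = new Map();
    this.nextId = 1;
  }

  sessionId(messages, workspaceRoot) {
    // Collect user messages up to the first assistant/tool message.
    const leadingUser = [];
    for (const m of messages) {
      if (m.role === "assistant" || m.role === "tool") break;
      if (m.role === "user") leadingUser.push(JSON.stringify(m.content));
    }
    const hash = crypto
      .createHash("sha256")
      .update(JSON.stringify([workspaceRoot, leadingUser]))
      .digest("hex");
    if (!this.ids.has(hash)) this.ids.set(hash, this.nextId++);
    return this.ids.get(hash);
  }
}
```

Because later turns only append messages after the first assistant reply, the fingerprint stays stable for the life of a chat tab.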
When you do iterative development on the same code area, most context (system prompt, loaded files, tool results) is identical across turns. The proxy keeps the upstream LLM provider's KV cache warm between turns, so subsequent prompts pay ~10x cheaper cache-read pricing instead of full input token pricing.
Keepalive — after SESSION_KEEPALIVE_INTERVAL_MS (default 2min) of inactivity, a background ping is sent to the upstream API with the same conversation prefix (max_tokens:1, no tools, no stream). This prevents KV cache eviction. After SESSION_KEEPALIVE_IDLE_TIMEOUT_MS (default 10min) of total inactivity, pinging stops. After SESSION_KEEPALIVE_MAX_LIFETIME_MS (default 24h), the keepalive cycles to re-establish a fresh upstream cache.
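The timing rules in the keepalive paragraph reduce to a small decision function. This is a sketch, not the proxy's scheduler: `keepaliveAction`, the session field names, and the returned action strings are all hypothetical; only the three default durations and the ping shape (`max_tokens: 1`, no tools, no stream) come from the text.

```javascript
// Hypothetical sketch of the keepalive decision; defaults match the README.
const KEEPALIVE_DEFAULTS = {
  intervalMs: 120000, // SESSION_KEEPALIVE_INTERVAL_MS
  idleTimeoutMs: 600000, // SESSION_KEEPALIVE_IDLE_TIMEOUT_MS
  maxLifetimeMs: 86400000, // SESSION_KEEPALIVE_MAX_LIFETIME_MS
};

// session: { lastRequestAt, lastPingAt?, cacheStartedAt } in epoch ms.
function keepaliveAction(session, now, cfg = KEEPALIVE_DEFAULTS) {
  if (now - session.lastRequestAt >= cfg.idleTimeoutMs) return "stop";
  if (now - session.cacheStartedAt >= cfg.maxLifetimeMs) return "cycle";
  const lastTouch = Math.max(session.lastRequestAt, session.lastPingAt ?? 0);
  return now - lastTouch >= cfg.intervalMs ? "ping" : "wait";
}

// Ping body: same conversation prefix, one token, no tools, no streaming.
function keepaliveBody(model, prefixMessages) {
  return { model, messages: prefixMessages, max_tokens: 1, stream: false };
}
```

With the defaults, an idle session is pinged at most four times (at 2, 4, 6, and 8 minutes) before the 10-minute idle timeout stops the loop.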
Conversation continuity — when VS/VS Code opens a new chat in the same workspace, the proxy detects it and:
- Enriches the system prompt with `"You previously worked on this project..."` so the model knows prior knowledge applies
- Shares the reasoning cache across sessions in the same workspace (workspace-scoped fallback when conversation-scoped lookup misses)
| Variable | Default | Description |
|---|---|---|
| `SESSION_KEEPALIVE_ENABLED` | `true` | Enable/disable session keepalive |
| `SESSION_KEEPALIVE_INTERVAL_MS` | `120000` | Milliseconds between pings (min 30000) |
| `SESSION_KEEPALIVE_IDLE_TIMEOUT_MS` | `600000` | Milliseconds of inactivity before stopping |
| `SESSION_KEEPALIVE_MAX_LIFETIME_MS` | `86400000` | Max session life before cycling upstream cache (24h) |
Inspired by TaskSync #98. A 40-minute agentic session with warming can cost 8x less than a 5-minute session without.
7 free model IDs via Pollinations (GPT-OSS 20B backend, reasoning + tools), one clean model plus 6 aliases:
| Model ID | Display Name | Context |
|---|---|---|
| `pol/openai-fast` | Pollinations GPT-OSS 20B | 131K |
| `pol/GPT-5` | Pollinations GPT-5 | 131K |
| `pol/Claude` | Pollinations Claude | 200K |
| `pol/Gemini` | Pollinations Gemini | 1M |
| `pol/DeepSeek` | Pollinations DeepSeek | 131K |
| `pol/Llama-4` | Pollinations Llama 4 | 131K |
| `pol/Mistral` | Pollinations Mistral | 131K |
All route through the same Pollinations openai backend — no API key required. By default, only the clean pol/openai-fast model is shown. The 6 cosplay aliases are hidden unless HIDE_POLL_COSPLAY=false is set.
| Variable | Default | Description |
|---|---|---|
| `SHOW_POLL_MODELS` | `true` | Show Pollinations models |
| `HIDE_POLL_COSPLAY` | `true` | Hide cosplay aliases (GPT-5, Claude, Gemini, DeepSeek, Llama-4, Mistral) and only show GPT-OSS 20B |
You can add CrofAI models as an additional model provider alongside OpenCode Go. Set CROF_API_KEY in .env to enable. Crof models appear with a [CROF] prefix in VS Code and are listed under a Crof section in the console banner.
CROF_API_KEY=your-crof-key

Crof models use the `crof/` prefix in model IDs (e.g. `crof/deepseek-v4-flash`) to avoid conflicts with OpenCode Go models. All Crof models support tool calling, streaming, vision, and thinking modes.
Auto-refresh: Crof models detect key state changes at runtime — add or remove CROF_API_KEY and the model list refreshes without restarting the proxy.
You can route chat requests through your company's Microsoft 365 Copilot (the web chat at m365.cloud.microsoft) as an additional model. Two models appear: [M365] M365 Copilot Quick and [M365] M365 Copilot Think.
The proxy connects to a WebSocket relay server that runs a browser-automated M365 Copilot session. The relay intercepts the M365 substrate WebSocket (substrate.office.com) and forwards chat requests/responses. This is the same approach used by m365-copilot-openai-proxy.
An external relay is required. The proxy uses a WebSocket-based protocol; the relay handles browser automation and M365 auth.
| Relay | Setup | Description |
|---|---|---|
| g365-headless-relay | `npm install` | Playwright Chromium off-screen relay (open-source, cross-platform, persistent profile) |
The proxy's M365 WebSocket protocol is inspired by the same substrate-interception concept used in m365-copilot-openai-proxy, but the wire format is different — they are not interchangeable.
Constraints:
- Token expires in ~1 hour (browser session handles auth — no manual token extraction).
- System prompts and conversation history are folded into the message as plain text (labeled sections with a `---` separator).
- No tool calls or agent mode: M365 Copilot is chat-only.
- Clone and install:
git clone https://github.com/notBlubbll/g365-headless-relay
cd g365-headless-relay
npm install
- First run, sign in (visible browser):
debug.cmd
- Subsequent runs — off-screen relay:
start.cmd
- Set the relay port in `.env`:
M365CO_PORT=8765
- Restart the proxy.
System prompts and conversation history are folded into the message as plain text before sending to M365:
System instructions:
Be concise and helpful.
Prior conversation transcript:
User: What is TypeScript?
Assistant: TypeScript is a typed superset of JavaScript.
---
Tell me more about interfaces.
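The folding shown in the example above can be reproduced with a few lines. This is a sketch: `foldForM365` is a hypothetical name, and the single-newline spacing is assumed from the example as rendered here, so the real output may contain additional blank lines.

```javascript
// Hypothetical sketch: fold an OpenAI-style message array into one plain-text prompt.
function foldForM365(messages) {
  const system = messages.filter((m) => m.role === "system").map((m) => m.content);
  const chat = messages.filter((m) => m.role !== "system");
  const last = chat[chat.length - 1]; // the message actually being asked
  const history = chat.slice(0, -1);

  const lines = [];
  if (system.length > 0) lines.push("System instructions:", ...system);
  if (history.length > 0) {
    lines.push("Prior conversation transcript:");
    for (const m of history) {
      const label = m.role === "assistant" ? "Assistant" : "User";
      lines.push(`${label}: ${m.content}`);
    }
  }
  if (lines.length > 0) lines.push("---"); // separator only when context exists
  lines.push(last ? last.content : "");
  return lines.join("\n");
}
```

A request with no system prompt and no history is passed through unchanged, without the labeled sections or the separator.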
When the browser session expires, restart the relay:
- g365-headless-relay: run `debug.cmd` to re-sign in, then `start.cmd`
No manual token copying required — the browser session handles all auth.
Enriched from OmniRoute (RTK+Caveman stacked compression) and caveman. 7 compression levels available:
| Level | Savings | Description |
|---|---|---|
| `off` | 0% | No compression |
| `lite` | ~15% | Whitespace collapse, dedup system prompts |
| `caveman` / `standard` | ~30% | 30+ regex rules: filler removal, context condensation, structural compression, multi-turn dedup |
| `aggressive` | ~50% | All Caveman + progressive message aging + tool result summarization |
| `ultra` | ~75% | All Aggressive + heuristic token pruning + stopword removal |
| `rtk` | 60-90% | Command-aware filters for shell/test/build/git output |
| `stacked` | 78-95% | RTK first, then Caveman; best for mixed prompts with tool logs + prose |
Functions available in token-optimizer.js: compressContent(), compressMessages(), compressBest(), estimatedSavings().
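To give a feel for the lightest level, here is a rough sketch of what a `lite`-style pass (whitespace collapse plus system prompt dedup) might do. It is not the code in `token-optimizer.js`: `collapseWhitespace` and `liteCompress` are hypothetical names, and the exact rules are assumptions.

```javascript
// Hypothetical sketch of a "lite"-style pass; not the token-optimizer.js code.

// Collapse runs of spaces/tabs, trim around newlines, cap blank lines at one.
function collapseWhitespace(text) {
  return text
    .replace(/[ \t]+/g, " ")
    .replace(/ ?\n ?/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

function liteCompress(messages) {
  const seenSystem = new Set();
  const out = [];
  for (const m of messages) {
    const content =
      typeof m.content === "string" ? collapseWhitespace(m.content) : m.content;
    if (m.role === "system") {
      if (seenSystem.has(content)) continue; // drop repeated system prompts
      seenSystem.add(content);
    }
    out.push({ ...m, content });
  }
  return out;
}
```

Note that a naive collapse like this also flattens indentation inside code, so a real implementation would likely need to leave code blocks alone.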
build.cmd auto-detects the best available runtime and builds accordingly:
| Script | Behavior | Requires |
|---|---|---|
| `build.cmd` | Auto-detect: tries Bun first, falls back to Node.js | Bun or Node.js |
| `build-bun.cmd` | Explicit Bun: single `.exe` | Bun |
| `build-node.cmd` | Explicit Node.js: portable folder | Node.js |
All scripts clean .dist/ before building but preserve dotfiles (.env, .version, .cache/, etc.) so your config survives rebuilds. .env is seeded only on the first build (never overwritten), while .version is always updated to match the current source.
Compiles to gc2oc (Bun standalone) + service.exe (C# launcher) using bun build --compile. The Bun runtime is embedded.
- No runtime required: `gc2oc` is fully self-contained (~112 MB)
- No `node_modules`: all JS modules bundled
- `service.exe` handles restart/update loop, `.env` loading, port cleanup, and Windows service mode
- `start.cmd` is a one-shot launcher: calls `service.exe` and exits
- `gc2oc` has no `.exe` extension, which prevents accidental double-click; `service.exe` is the entry point
- Requires Windows 10 1809+ / Windows Server 2019+ (same OS floor as Bun)
Windows service:
sc create gc2oc binPath= "C:\path\.dist\service.exe" start= auto
sc start gc2oc
Creates a portable folder with node (no extension) + source + production dependencies. Run start.cmd or service.exe inside the folder.
- No install needed on the target machine — the Node.js binary is bundled
- Works on Windows Server 2016+ and any Windows that runs Node.js v18+
- `service.exe` is a C# launcher with the same restart/update loop and Windows service support
For older Windows where Bun won't run (Server 2016), use Node.js directly:
npm run node # Node.js fallback
start.cmd # auto-detects Bun vs Node

Bun (preferred) → Node.js (fallback for older Windows) · Hono · direct fetch
See credits.md for the full list of open-source projects that inspired patterns and features in gc2oc.
Key inspirations include copilot-proxy, Qwen-Copilot-Proxy, Proxllama, vLLM-proxy-for-VS-Code, antigravity-copilot, OmniRoute, OpenCode Zen Provider, yxlao/deepseek-cursor-proxy, and many more.







