Model as a Service (MaaS)
The extension supports Model-as-a-Service (MaaS) on Google Cloud Vertex AI, providing access to open-weight third-party models through a managed, serverless inference API.
- What is MaaS?
- How MaaS differs from Gemini and Claude
- How the extension supports MaaS
- Supported MaaS models
- Why these models?
- Thinking / reasoning tokens
- Pricing
- Troubleshooting
Model-as-a-Service is a Vertex AI offering that lets you run open-weight models (DeepSeek, Qwen, Kimi, and others) without managing infrastructure. Google Cloud hosts the model weights, handles inference scaling, and exposes a standard OpenAI-compatible Chat Completions API.
Key characteristics:
- Serverless: no GPU provisioning, no deployment YAML, no cold starts (on-demand tier)
- OpenAI-compatible: the same `POST /chat/completions` JSON protocol you'd use with the OpenAI SDK
- Bearer token auth: authenticated with your Google Cloud credentials (ADC or Service Account), not API keys
- GCP IAM integrated: access controlled by Vertex AI IAM roles, billed to your GCP project
- Global and regional endpoints: models available at `aiplatform.googleapis.com` (global) or region-pinned endpoints like `us-east5-aiplatform.googleapis.com`
The MaaS endpoint URL pattern is:
https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions
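As a rough illustration of the URL pattern, the helper below assembles the endpoint from a project ID and region. The function name is hypothetical, and the handling of the global endpoint (bare host with the location set to `global`) is an assumption rather than something confirmed by this page.

```typescript
// Hypothetical helper: the name and parameters are illustrative and are
// not taken from the extension's source.
function maasChatCompletionsUrl(projectId: string, region: string): string {
  // Assumption: the global endpoint uses the bare host and location "global".
  const host =
    region === "global"
      ? "aiplatform.googleapis.com"
      : `${region}-aiplatform.googleapis.com`;
  return (
    `https://${host}/v1/projects/${projectId}` +
    `/locations/${region}/endpoints/openapi/chat/completions`
  );
}

console.log(maasChatCompletionsUrl("my-project", "us-east5"));
```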
| Aspect | Gemini (Google) | Claude (Anthropic) | MaaS (Open-Weight) |
|---|---|---|---|
| API protocol | Gemini SDK (`@google/genai`) | Anthropic Messages API (`@anthropic-ai/vertex-sdk`) | OpenAI Chat Completions (`openai` npm package) |
| Model ownership | Google | Anthropic | Various (DeepSeek, Alibaba, Moonshot) |
| Serving | Google-first-party | Anthropic on Vertex | Open-weight models on GCP infra |
| Auth | ADC / Service Account | ADC / Service Account | ADC / Service Account (same) |
| Pricing | Per 1M tokens, published | Per 1M tokens, published | Varies by model and throughput tier |
| Thinking mode | Gemini thinking variant | Extended thinking (`thinking.budget_tokens`) | Model-specific (`chat_template_kwargs` or native) |
The extension treats MaaS as a third vendor alongside google (Gemini) and anthropic (Claude). The implementation is in a single provider file: src/providers/VertexMaaSProvider.ts.
┌──────────────────────────────────────────────────────┐
│              VertexChatModelDispatcher               │
│                                                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐    │
│  │  Google  │  │Anthropic │  │       MaaS       │    │
│  │ Provider │  │ Provider │  │     Provider     │    │
│  │          │  │          │  │                  │    │
│  │ @google/ │  │@anthropic│  │    openai SDK    │    │
│  │  genai   │  │/vertex-  │  │  + google-auth   │    │
│  │          │  │   sdk    │  │                  │    │
│  └────┬─────┘  └────┬─────┘  └────────┬─────────┘    │
│       │             │                 │              │
└───────┼─────────────┼─────────────────┼──────────────┘
        │             │                 │
   Gemini API   Anthropic API   OpenAI-compatible
                              Chat Completions API
                                  on Vertex AI
- Single provider file: all three MaaS models share an identical protocol (OpenAI Chat Completions on Vertex). Only temperature, thinking mode, and `max_tokens` differ per model. These are driven by a static `MODEL_CONFIG` lookup table.
- OpenAI SDK: the `openai` npm package handles SSE streaming, typed request/response objects, and error types. The SDK's `baseURL` is pointed at the Vertex MaaS endpoint.
- Fresh Bearer token per request: the GCP access token expires hourly. A new `GoogleAuth` instance and token are fetched for each inference call.
- Thinking token filtering: models that produce chain-of-thought reasoning (DeepSeek V3.2, Kimi K2) return it in `reasoning_content` deltas. The provider silently consumes these; only the final `content` reaches the VS Code chat UI.
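The fresh-token-per-request choice can be sketched without any SDK. In the sketch below, `TokenSource` is a hypothetical stand-in for whatever obtains a GCP access token (in the extension, a `GoogleAuth` instance), and `buildChatRequest` is an illustrative name, not the provider's actual function.

```typescript
// Hypothetical type: stands in for the code that fetches a GCP access
// token. It is not the extension's actual interface.
type TokenSource = () => Promise<string>;

interface ChatRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Builds one request and asks for a token every time instead of caching
// it, so a token that expired since the last call is never reused.
async function buildChatRequest(
  url: string,
  getToken: TokenSource,
  payload: object,
): Promise<ChatRequest> {
  const token = await getToken();
  return {
    url,
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(payload),
  };
}
```

Fetching a token on every call costs a little latency but avoids having to track expiry; a cache with an expiry check would be a reasonable alternative.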
| Property | Value |
|---|---|
| VS Code model ID | qwen3-coder-480b |
| Provider | Alibaba / Qwen |
| Context window | 262,144 tokens |
| Max output | 4,096 tokens |
| Thinking | No — instruction-tuned coding model |
| Tool calling | Yes |
| Image input | No |
Qwen3-Coder is a dedicated code generation model scoring 38.7 on SWE-bench Pro. It runs at `temperature: 0.1` for near-deterministic, syntax-correct code output.
| Property | Value |
|---|---|
| VS Code model ID | deepseek-v3.2 |
| Provider | DeepSeek |
| Context window | 131,072 tokens |
| Max output | 8,192 tokens (shared with thinking) |
| Thinking | Yes, always enabled via `chat_template_kwargs: { thinking: true }` |
| Tool calling | Yes |
| Image input | No |
DeepSeek V3.2 is a Mixture-of-Experts model with strong reasoning capabilities. Thinking is explicitly enabled on every request. When tools are present, system prompts are omitted (per GCP MaaS guidance — DeepSeek performs better at function calling without them). Runs at temperature: 0.3 for balanced exploration during reasoning.
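The request shaping described above might look roughly like the following. The types and the function name are hypothetical, and placing `chat_template_kwargs` at the top level of the request body is an assumption; only the values (`thinking: true`, temperature 0.3, 8,192 max tokens) and the system-prompt rule come from this page.

```typescript
// Hypothetical shapes, loosely following the OpenAI Chat Completions
// format. They are not the extension's actual types.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface DeepSeekRequest {
  messages: ChatMessage[];
  temperature: number;
  max_tokens: number;
  chat_template_kwargs: { thinking: boolean };
  tools?: object[];
}

function buildDeepSeekRequest(
  messages: ChatMessage[],
  tools: object[] = [],
): DeepSeekRequest {
  const hasTools = tools.length > 0;
  const request: DeepSeekRequest = {
    // System prompts are dropped only when tools are present.
    messages: hasTools
      ? messages.filter((m) => m.role !== "system")
      : messages,
    temperature: 0.3,
    max_tokens: 8192,
    // Thinking is enabled on every request.
    chat_template_kwargs: { thinking: true },
  };
  if (hasTools) {
    request.tools = tools;
  }
  return request;
}
```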
| Property | Value |
|---|---|
| VS Code model ID | kimi-k2-thinking |
| Provider | Moonshot AI / Kimi |
| Context window | 32,768 tokens |
| Max output | 8,192 tokens (shared with thinking) |
| Thinking | Yes — native, always on |
| Tool calling | Yes |
| Image input | No |
Kimi K2 is a Mixture-of-Experts reasoning model with thinking always enabled natively. No toggle parameter is needed. It runs at `temperature: 0.6`, which leaves room for reasoning exploration while keeping output grounded.
Note: Kimi K2 may not be available in all GCP projects. If the model does not appear in your model picker, check the "Vertex AI Models: MaaS Provider" output channel — a 404 during discovery means it is not yet enabled for your project or region.
MaaS on Vertex AI offers many models. The three included here were selected based on:
- Coding proficiency: each model has strong benchmark scores on coding tasks and performs well in assisted-coding scenarios with tool calling
- Tool calling support: the VS Code Copilot Chat agent mode requires function/tool calling. Models without this capability (like deepseek-r1) were excluded
- OpenAI-compatible API: all three models are accessed through the same Chat Completions protocol, enabling a single-provider architecture
- Thinking/reasoning diversity: the set covers the spectrum of no thinking (Qwen3-Coder), configurable thinking (DeepSeek V3.2), and native thinking (Kimi K2)
- Temperature tuning: each model has temperature and `top_p` values researched and set per-model rather than using generic defaults
| Model | Reason excluded |
|---|---|
| deepseek-r1 | No tool/function calling support — incompatible with Copilot agent mode |
| Llama 4 Maverick | Not yet available on MaaS at the time of integration |
| GPT OSS models | Use `reasoning_effort` (Low/Medium/High) rather than `chat_template_kwargs`, a different thinking control protocol |
| Claude models | Already supported via the dedicated `VertexAnthropicProvider` |
| Gemini models | Already supported via the dedicated `VertexGoogleProvider` |
MaaS model support is config-driven. To add a new MaaS model:
- Add an entry to the `MODEL_CONFIG` map in `src/providers/VertexMaaSProvider.ts` with the model's MaaS path, temperature, `top_p`, thinking mode, and `maxTokens`
- Add a corresponding entry to `src/models.json` with the VS Code ID, vendor (`"maas"`), display name, context window, and capabilities
No other code changes are needed — discovery, streaming, tool calling, and error handling all work generically from the config.
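To make the config-driven idea concrete, here is a minimal sketch of what such a lookup table could look like. The field names and the `ThinkingMode` values are guesses based on the description on this page; the MaaS path and `top_p` fields are left out because their values are not documented here. The temperatures and token limits are the ones stated for each model.

```typescript
// Hypothetical shape: field names are guesses, and the real table also
// carries a MaaS path and top_p per model, omitted here because their
// values are not documented on this page.
type ThinkingMode = "none" | "chat_template_kwargs" | "native";

interface MaaSModelConfig {
  temperature: number;
  thinking: ThinkingMode;
  maxTokens: number;
}

const MODEL_CONFIG: Record<string, MaaSModelConfig> = {
  "qwen3-coder-480b": { temperature: 0.1, thinking: "none", maxTokens: 4096 },
  "deepseek-v3.2": {
    temperature: 0.3,
    thinking: "chat_template_kwargs",
    maxTokens: 8192,
  },
  "kimi-k2-thinking": { temperature: 0.6, thinking: "native", maxTokens: 8192 },
};

// Looks up a model by its VS Code model ID and fails loudly if it is missing.
function getModelConfig(modelId: string): MaaSModelConfig {
  const config = MODEL_CONFIG[modelId];
  if (!config) {
    throw new Error(`Unknown MaaS model: ${modelId}`);
  }
  return config;
}
```

With a table like this, supporting another model is a matter of adding one more entry; the request-building code reads everything it needs from the returned config.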
Two of the three MaaS models produce chain-of-thought reasoning tokens before generating their final answer.
When a model reasons, the streaming response includes two types of content:
- `reasoning_content`: the model's internal chain-of-thought (not shown to the user)
- `content`: the final answer text (shown in the VS Code chat panel)
The extension silently consumes reasoning_content deltas and only emits content as visible text. Both token types count toward completion_tokens in billing.
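The filtering step can be illustrated with a small function over already-received deltas. The `StreamDelta` shape and the function name are hypothetical, and the real provider handles deltas as they arrive from the stream rather than from an array; this sketch only shows which field is surfaced and which is dropped.

```typescript
// Hypothetical delta shape mirroring the two fields described above.
interface StreamDelta {
  content?: string | null;
  reasoning_content?: string | null;
}

interface FilteredOutput {
  visibleText: string;
  reasoningChars: number;
}

function filterDeltas(deltas: StreamDelta[]): FilteredOutput {
  let visibleText = "";
  let reasoningChars = 0;
  for (const delta of deltas) {
    if (delta.reasoning_content) {
      // Reasoning is counted here for illustration but never emitted.
      reasoningChars += delta.reasoning_content.length;
    }
    if (delta.content) {
      visibleText += delta.content;
    }
  }
  return { visibleText, reasoningChars };
}
```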
| Model | Thinking enabled? | How | Tokens visible to user? |
|---|---|---|---|
| Qwen3-Coder 480B | No | N/A | All output tokens are `content` |
| DeepSeek V3.2 | Yes, always | `chat_template_kwargs: { thinking: true }` | Only `content`; `reasoning_content` consumed silently |
| Kimi K2 Thinking | Yes, native | Always on by the model | Only `content`; `reasoning_content` consumed silently |
Thinking models share their max_tokens budget between reasoning and the final answer. If max_tokens is too low, the model may exhaust its budget on reasoning before producing an answer. For this reason, DeepSeek V3.2 and Kimi K2 are configured with maxTokens: 8192 while Qwen3-Coder uses 4096.
MaaS model pricing is set to $0 in the extension's models.json because:
- MaaS pricing depends on the GCP region, throughput tier (on-demand vs. provisioned), and customer-specific contracts
- Google Cloud does not publish static per-token prices for all MaaS models
- Pricing information is available in the Google Cloud Console under Vertex AI pricing for your specific project and commitment level
The extension's usage dashboard still tracks token counts, timestamps, and model names — only the dollar amount shows as $0.00. You can update the pricing values in src/models.json if you have known rates for your deployment.
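If you do fill in rates, the dollar figure is a simple per-million-token calculation. The sketch below assumes separate input and output rates; the field names are hypothetical and may not match the actual schema of `src/models.json`.

```typescript
// Hypothetical pricing shape; the actual field names in src/models.json
// may differ.
interface ModelPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Completion tokens include reasoning tokens, so thinking models are
// billed for their chain-of-thought as well as the final answer.
function estimateCostUsd(
  pricing: ModelPricing,
  promptTokens: number,
  completionTokens: number,
): number {
  return (
    (promptTokens / 1e6) * pricing.inputPerMillionUsd +
    (completionTokens / 1e6) * pricing.outputPerMillionUsd
  );
}

// With the default $0 rates, any usage reports a cost of 0.
console.log(
  estimateCostUsd({ inputPerMillionUsd: 0, outputPerMillionUsd: 0 }, 1200, 800),
);
```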
- Open the output channel: View → Output → "Vertex AI Models: MaaS Provider"
- Look for `🏓 MaaS <model-path> → ❌` entries
- Common causes:
  - 404: model not enabled in your GCP project (enable it in Vertex AI Model Garden)
  - 403: IAM permissions missing (add `roles/aiplatform.user` to your account)
  - 401: authentication expired (run `gcloud auth application-default login`)
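The status codes above map naturally to a small lookup. The function below is a hypothetical helper for illustration, not part of the extension; it restates the three causes and remedies from the list.

```typescript
// Hypothetical helper: turns a discovery HTTP status into a hint based on
// the causes listed above.
function explainDiscoveryFailure(status: number): string {
  switch (status) {
    case 404:
      return "Model not enabled in this GCP project; enable it in Vertex AI Model Garden.";
    case 403:
      return "IAM permissions missing; add roles/aiplatform.user to your account.";
    case 401:
      return "Authentication expired; run `gcloud auth application-default login`.";
    default:
      return `Unexpected status ${status}; check the output channel for details.`;
  }
}

console.log(explainDiscoveryFailure(403));
```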
If you see "400 status code (no body)":
- Ensure the model supports tool calling (all three supported models do)
- Check the output channel for request diagnostics
- This is most likely a transient GCP-side issue — the extension will retry automatically
The extension uses exponential backoff with automatic retries for transient errors (429, 503). If streaming stops, the output channel will show the retry history. You can also click the Stop button in Copilot Chat to cancel gracefully.
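For readers unfamiliar with the pattern, a minimal sketch of exponential backoff for 429 and 503 follows. The attempt count, base delay, and function names are assumptions chosen for illustration; the extension's actual retry parameters are not documented on this page.

```typescript
// Reads a numeric `status` field from a thrown value, if there is one.
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

// Hypothetical defaults: 4 attempts and a 500 ms base delay are
// illustrative values, not the extension's settings.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 4,
  baseDelayMs: number = 500,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const status = statusOf(error);
      const retryable = status === 429 || status === 503;
      if (!retryable || attempt >= maxAttempts) {
        throw error;
      }
      // The delay doubles on each attempt: base, 2x base, 4x base, ...
      await sleep(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
}
```

Production implementations often add random jitter to the delay so that many clients do not retry in lockstep; it is left out here to keep the sketch short.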
See the general Diagnostics & Troubleshooting page for non-MaaS-specific issues.