Model as a Service (MaaS)

Vertex AI Model-as-a-Service (MaaS) — Open-Weight Models

The extension supports Model-as-a-Service (MaaS) on Google Cloud Vertex AI, providing access to open-weight third-party models through a managed, serverless inference API.

What is MaaS?

Model-as-a-Service is a Vertex AI offering that lets you run open-weight models (DeepSeek, Qwen, Kimi, and others) without managing infrastructure. Google Cloud hosts the model weights, handles inference scaling, and exposes a standard OpenAI-compatible Chat Completions API.

Key characteristics:

- Serverless: no GPU provisioning, no deployment YAML, no cold starts (on-demand tier)
- OpenAI-compatible: the same POST /chat/completions JSON protocol you'd use with the OpenAI SDK
- Bearer token auth: authenticated with your Google Cloud credentials (ADC or Service Account), not API keys
- GCP IAM integrated: access controlled by Vertex AI IAM roles, billed to your GCP project
- Global and regional endpoints: models available at aiplatform.googleapis.com (global) or region-pinned endpoints like us-east5-aiplatform.googleapis.com

The MaaS endpoint URL pattern for a region-pinned endpoint is:

https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions
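The pattern above can be sketched as a small URL builder. This is a minimal illustration, not the extension's code: the function names are hypothetical, and the handling of the global endpoint (bare host with `global` as the location) is an assumption based on the host names listed under "Key characteristics".

```typescript
// Hypothetical helpers; the names are illustrative and not taken from the extension.

// Builds the OpenAI-compatible base URL for a project and location.
function buildMaasBaseUrl(projectId: string, location: string): string {
  // Assumption: the global endpoint uses the bare host and "global" as the location.
  const host =
    location === "global"
      ? "aiplatform.googleapis.com"
      : `${location}-aiplatform.googleapis.com`;
  return (
    `https://${host}/v1/projects/${encodeURIComponent(projectId)}` +
    `/locations/${encodeURIComponent(location)}/endpoints/openapi`
  );
}

// Full Chat Completions URL, matching the pattern shown above.
function buildMaasChatUrl(projectId: string, location: string): string {
  return `${buildMaasBaseUrl(projectId, location)}/chat/completions`;
}
```

When the openai npm package is used, the base URL (without `/chat/completions`) is typically what goes into `baseURL`, because the SDK appends the route itself.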

How MaaS differs from Gemini and Claude

| Aspect | Gemini (Google) | Claude (Anthropic) | MaaS (Open-Weight) |
| --- | --- | --- | --- |
| API protocol | Gemini SDK (`@google/genai`) | Anthropic Messages API (`@anthropic-ai/vertex-sdk`) | OpenAI Chat Completions (`openai` npm package) |
| Model ownership | Google | Anthropic | Various (DeepSeek, Alibaba, Moonshot) |
| Serving | Google-first-party | Anthropic on Vertex | Open-weight models on GCP infra |
| Auth | ADC / Service Account | ADC / Service Account | ADC / Service Account (same) |
| Pricing | Per 1M tokens, published | Per 1M tokens, published | Varies by model and throughput tier |
| Thinking mode | Gemini thinking variant | Extended thinking (`thinking.budget_tokens`) | Model-specific (`chat_template_kwargs` or native) |

How the extension supports MaaS

The extension treats MaaS as a third vendor alongside google (Gemini) and anthropic (Claude). The implementation is in a single provider file: src/providers/VertexMaaSProvider.ts.

Architecture

┌──────────────────────────────────────────────────────┐
│                  VertexChatModelDispatcher           │
│                                                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │  Google  │  │Anthropic │  │      MaaS        │   │
│  │ Provider │  │ Provider │  │    Provider      │   │
│  │          │  │          │  │                  │   │
│  │ @google/ │  │@anthropic│  │  openai SDK      │   │
│  │  genai   │  │/vertex-  │  │  + google-auth   │   │
│  │          │  │   sdk    │  │                  │   │
│  └────┬─────┘  └────┬─────┘  └────────┬─────────┘   │
│       │              │                │              │
└───────┼──────────────┼────────────────┼──────────────┘
        │              │                │
   Gemini API    Anthropic API    OpenAI-compatible
                                  Chat Completions API
                                  on Vertex AI

Key design decisions

- Single provider file: all three MaaS models share the same protocol (OpenAI Chat Completions on Vertex). Only temperature, thinking mode, and max_tokens differ per model; these are driven by a static MODEL_CONFIG lookup table.
- OpenAI SDK: the openai npm package handles SSE streaming, typed request/response objects, and error types. The SDK's baseURL is pointed at the Vertex MaaS endpoint.
- Fresh Bearer token per request: the GCP access token expires hourly, so a new GoogleAuth instance and token are fetched for each inference call.
- Thinking token filtering: models that produce chain-of-thought reasoning (DeepSeek V3.2, Kimi K2) return it in reasoning_content deltas. The provider silently consumes these; only the final content reaches the VS Code chat UI.
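To make the lookup-table idea concrete, here is a minimal sketch of what a config-driven parameter builder could look like. The temperature, max-token, and thinking values come from the per-model sections of this page; the type names, field names, and the `samplingParams` function are assumptions for illustration and may not match the real MODEL_CONFIG in src/providers/VertexMaaSProvider.ts. The MaaS model path and top_p are left out because this page does not state them.

```typescript
// Illustrative shape only; the real MODEL_CONFIG may use different fields.
type ThinkingMode = "none" | "template-kwargs" | "native";

interface MaasModelConfig {
  temperature: number;
  maxTokens: number;
  thinking: ThinkingMode;
}

// Keyed by VS Code model ID; values taken from the model tables on this page.
const MODEL_CONFIG: Record<string, MaasModelConfig> = {
  "qwen3-coder-480b": { temperature: 0.1, maxTokens: 4096, thinking: "none" },
  "deepseek-v3.2": { temperature: 0.3, maxTokens: 8192, thinking: "template-kwargs" },
  "kimi-k2-thinking": { temperature: 0.6, maxTokens: 8192, thinking: "native" },
};

// Turns a config entry into the model-specific part of a Chat Completions request.
function samplingParams(modelId: string): Record<string, unknown> {
  const cfg = MODEL_CONFIG[modelId];
  if (!cfg) {
    throw new Error(`Unknown MaaS model: ${modelId}`);
  }
  const params: Record<string, unknown> = {
    temperature: cfg.temperature,
    max_tokens: cfg.maxTokens,
    stream: true,
  };
  // Only models toggled through the chat template need the extra field;
  // "native" thinking models need nothing.
  if (cfg.thinking === "template-kwargs") {
    params.chat_template_kwargs = { thinking: true };
  }
  return params;
}
```

Because every model goes through the same function, supporting another model in this sketch only means adding one more table entry.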

Supported MaaS models

Qwen3 Coder 480B

| Property | Value |
| --- | --- |
| VS Code model ID | `qwen3-coder-480b` |
| Provider | Alibaba / Qwen |
| Context window | 262,144 tokens |
| Max output | 4,096 tokens |
| Thinking | No (instruction-tuned coding model) |
| Tool calling | Yes |
| Image input | No |

Qwen3-Coder is a dedicated code generation model scoring 38.7 on SWE-bench Pro. It runs at temperature: 0.1 for near-deterministic, syntax-correct code output.

DeepSeek V3.2

| Property | Value |
| --- | --- |
| VS Code model ID | `deepseek-v3.2` |
| Provider | DeepSeek |
| Context window | 131,072 tokens |
| Max output | 8,192 tokens (shared with thinking) |
| Thinking | Yes, always enabled via `chat_template_kwargs: { thinking: true }` |
| Tool calling | Yes |
| Image input | No |

DeepSeek V3.2 is a Mixture-of-Experts model with strong reasoning capabilities. Thinking is explicitly enabled on every request. When tools are present, system prompts are omitted (per GCP MaaS guidance, DeepSeek performs better at function calling without them). It runs at temperature: 0.3 for balanced exploration during reasoning.
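The request shaping described above could look roughly like the following. It is a sketch under stated assumptions: the interface and function names are hypothetical, the `model` field is omitted because the MaaS model path is not given on this page, and the real provider may build its request differently.

```typescript
// Hypothetical types for illustration; not the extension's actual definitions.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface DeepSeekRequestBody {
  messages: ChatMessage[];
  temperature: number;
  max_tokens: number;
  chat_template_kwargs: { thinking: boolean };
  tools?: object[];
}

// Builds the DeepSeek-specific request body described in this section.
function buildDeepSeekBody(messages: ChatMessage[], tools: object[] = []): DeepSeekRequestBody {
  const hasTools = tools.length > 0;
  const body: DeepSeekRequestBody = {
    // System prompts are dropped only when tools are attached.
    messages: hasTools ? messages.filter((m) => m.role !== "system") : messages,
    temperature: 0.3,
    max_tokens: 8192,
    // Thinking is switched on for every request.
    chat_template_kwargs: { thinking: true },
  };
  if (hasTools) {
    body.tools = tools;
  }
  return body;
}
```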

Kimi K2 Thinking

| Property | Value |
| --- | --- |
| VS Code model ID | `kimi-k2-thinking` |
| Provider | Moonshot AI / Kimi |
| Context window | 32,768 tokens |
| Max output | 8,192 tokens (shared with thinking) |
| Thinking | Yes (native, always on) |
| Tool calling | Yes |
| Image input | No |

Kimi K2 is a Mixture-of-Experts reasoning model with thinking always enabled natively, so no toggle parameter is needed. It runs at temperature: 0.6, which gives the model freedom for reasoning exploration while staying grounded.

Note: Kimi K2 may not be available in all GCP projects. If the model does not appear in your model picker, check the "Vertex AI Models: MaaS Provider" output channel — a 404 during discovery means it is not yet enabled for your project or region.

Why these models?

Selection criteria

MaaS on Vertex AI offers many models. The three included here were selected based on:

- Coding proficiency: each model has strong benchmark scores on coding tasks and performs well in assisted-coding scenarios with tool calling
- Tool calling support: the VS Code Copilot Chat agent mode requires function/tool calling. Models without this capability (like deepseek-r1) were excluded
- OpenAI-compatible API: all three models are accessed through the same Chat Completions protocol, enabling a single-provider architecture
- Thinking/reasoning diversity: the set spans no thinking (Qwen3-Coder), thinking enabled through a request parameter (DeepSeek V3.2), and native thinking (Kimi K2)
- Temperature tuning: each model has temperature and top_p values researched and set per model rather than using generic defaults

Models considered but not included

| Model | Reason excluded |
| --- | --- |
| deepseek-r1 | No tool/function calling support; incompatible with Copilot agent mode |
| Llama 4 Maverick | Not yet available on MaaS at the time of integration |
| GPT OSS models | Use `reasoning_effort` (Low/Medium/High) rather than `chat_template_kwargs`, which is a different thinking control protocol |
| Claude models | Already supported via the dedicated `VertexAnthropicProvider` |
| Gemini models | Already supported via the dedicated `VertexGoogleProvider` |

Adding more models

MaaS model support is config-driven. To add a new MaaS model:

1. Add an entry to the MODEL_CONFIG map in src/providers/VertexMaaSProvider.ts with the model's MaaS path, temperature, top_p, thinking mode, and maxTokens
2. Add a corresponding entry to src/models.json with the VS Code ID, vendor ("maas"), display name, context window, and capabilities

No other code changes are needed — discovery, streaming, tool calling, and error handling all work generically from the config.

Thinking / reasoning tokens

Two of the three MaaS models produce chain-of-thought reasoning tokens before generating their final answer.

How it works on MaaS

When a model reasons, the streaming response includes two types of content:

- reasoning_content: the model's internal chain-of-thought (not shown to the user)
- content: the final answer text (shown in the VS Code chat panel)

The extension silently consumes reasoning_content deltas and only emits content as visible text. Both token types count toward completion_tokens in billing.
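The filtering step can be illustrated with a small function that walks a list of stream deltas, keeps `content`, and drops `reasoning_content`. This is a simplified sketch: it works on an in-memory array rather than a live SSE stream, and the `StreamDelta`, `FilteredOutput`, and `filterDeltas` names are hypothetical.

```typescript
// Minimal delta shape for illustration; real chunks carry more fields.
interface StreamDelta {
  content?: string | null;
  reasoning_content?: string | null;
}

interface FilteredOutput {
  visibleText: string;    // what would reach the chat panel
  reasoningChars: number; // reasoning that was read and discarded
}

function filterDeltas(deltas: StreamDelta[]): FilteredOutput {
  let visibleText = "";
  let reasoningChars = 0;
  for (const delta of deltas) {
    // Reasoning is consumed so the stream keeps advancing, but never emitted.
    if (delta.reasoning_content) {
      reasoningChars += delta.reasoning_content.length;
    }
    if (delta.content) {
      visibleText += delta.content;
    }
  }
  return { visibleText, reasoningChars };
}
```

Counting the discarded characters is optional; it is included here only to show that the reasoning is read and dropped rather than skipped.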

Per-model thinking behavior

| Model | Thinking enabled? | How | Tokens visible to user? |
| --- | --- | --- | --- |
| Qwen3-Coder 480B | No | N/A | All output tokens are `content` |
| DeepSeek V3.2 | Yes, always | `chat_template_kwargs: { thinking: true }` | Only `content`; `reasoning_content` consumed silently |
| Kimi K2 Thinking | Yes, native | Always on by the model | Only `content`; `reasoning_content` consumed silently |

Why thinking uses more output tokens

Thinking models share their max_tokens budget between reasoning and the final answer. If max_tokens is too low, the model may exhaust its budget on reasoning before producing an answer. For this reason, DeepSeek V3.2 and Kimi K2 are configured with maxTokens: 8192 while Qwen3-Coder uses 4096.
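The shared budget is simple arithmetic, sketched below. Both helpers are hypothetical. The second assumes the endpoint reports `finish_reason: "length"` when max_tokens is reached, as the OpenAI Chat Completions protocol does; whether every MaaS model does so is not confirmed here.

```typescript
// Tokens left for the final answer once reasoning has used part of the budget.
function answerBudget(maxTokens: number, reasoningTokens: number): number {
  return Math.max(0, maxTokens - reasoningTokens);
}

// Heuristic: the response hit the token limit without producing any visible text,
// which suggests the budget was spent on reasoning.
function ranOutWhileThinking(finishReason: string | null, visibleText: string): boolean {
  return finishReason === "length" && visibleText.trim().length === 0;
}
```

For example, a response that spends 6,000 tokens on reasoning leaves 2,192 tokens for the answer under an 8,192 limit, and none under a 4,096 limit.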

Pricing

MaaS model pricing is set to $0 in the extension's models.json because:

- MaaS pricing depends on the GCP region, throughput tier (on-demand vs. provisioned), and customer-specific contracts
- Google Cloud does not publish static per-token prices for all MaaS models
- Pricing information is available in the Google Cloud Console under Vertex AI pricing for your specific project and commitment level

The extension's usage dashboard still tracks token counts, timestamps, and model names — only the dollar amount shows as $0.00. You can update the pricing values in src/models.json if you have known rates for your deployment.
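If you do enter known rates, the dollar figure is a straightforward per-million-token calculation. The sketch below shows the arithmetic only. The interface and field names are hypothetical and are not the schema of src/models.json. Completion tokens include reasoning tokens, as noted in the thinking section.

```typescript
// Hypothetical shapes; not the actual models.json schema.
interface TokenUsage {
  promptTokens: number;
  completionTokens: number; // includes reasoning tokens for thinking models
}

interface PricePerMillion {
  input: number;  // USD per 1M prompt tokens
  output: number; // USD per 1M completion tokens
}

function estimateCostUsd(usage: TokenUsage, price: PricePerMillion): number {
  return (
    (usage.promptTokens * price.input + usage.completionTokens * price.output) / 1e6
  );
}
```

With both rates left at 0, as in the shipped configuration, the function returns 0 regardless of token counts, which matches the $0.00 shown in the dashboard.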

Troubleshooting

Model doesn't appear in the picker

1. Open the output channel: View → Output → "Vertex AI Models: MaaS Provider"
2. Look for 🏓 MaaS <model-path> → ❌ entries
3. Common causes:
   - 404: model not enabled in your GCP project (enable it in Vertex AI Model Garden)
   - 403: IAM permissions missing (add roles/aiplatform.user to your account)
   - 401: authentication expired (run gcloud auth application-default login)
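The status-to-cause mapping above can be expressed as a small lookup, for example to print a hint next to a failed discovery probe. The function name and message wording are illustrative; the extension's actual log output may differ.

```typescript
// Hypothetical helper mapping a discovery HTTP status to the remediation listed above.
function discoveryHint(status: number): string {
  switch (status) {
    case 404:
      return "Model not enabled in this GCP project; enable it in Vertex AI Model Garden.";
    case 403:
      return "IAM permissions missing; add roles/aiplatform.user to your account.";
    case 401:
      return "Authentication expired; run: gcloud auth application-default login";
    default:
      return `Unexpected status ${status}; check the output channel for details.`;
  }
}
```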

400 errors during chat

If you see "400 status code (no body)":

- Ensure the model supports tool calling (all three supported models do)
- Check the output channel for request diagnostics
- This is most likely a transient GCP-side issue; the extension will retry automatically

Streaming stops unexpectedly

The extension uses exponential backoff with automatic retries for transient errors (429, 503). If streaming stops, the output channel will show the retry history. You can also click the Stop button in Copilot Chat to cancel gracefully.
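For readers unfamiliar with the pattern, here is a minimal sketch of exponential backoff limited to the two transient status codes mentioned above. The attempt count, base delay, and cap are assumptions chosen for illustration, not the extension's actual settings, and no jitter is applied. The sketch assumes the thrown error exposes a numeric `status` property, as the openai package's API errors do.

```typescript
// Only the transient codes named in this section are retried.
function isRetryable(status: number): boolean {
  return status === 429 || status === 503;
}

// Delay doubles with each attempt (0-based) and is capped. Defaults are illustrative.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries `call` on retryable errors, rethrowing anything else or the final failure.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as { status?: number } | null)?.status;
      const lastAttempt = attempt >= maxAttempts - 1;
      if (status === undefined || !isRetryable(status) || lastAttempt) {
        throw err;
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With these defaults the waits between attempts are 500 ms, 1 s, and 2 s before the fourth and final attempt.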

See the general Diagnostics & Troubleshooting page for non-MaaS-specific issues.
