
Model as a Service (MaaS)

jorsm edited this page May 31, 2026 · 1 revision

Vertex AI Model-as-a-Service (MaaS) — Open-Weight Models

The extension supports Model-as-a-Service (MaaS) on Google Cloud Vertex AI, providing access to open-weight third-party models through a managed, serverless inference API.


What is MaaS?

Model-as-a-Service is a Vertex AI offering that lets you run open-weight models (DeepSeek, Qwen, Kimi, and others) without managing infrastructure. Google Cloud hosts the model weights, handles inference scaling, and exposes a standard OpenAI-compatible Chat Completions API.

Key characteristics:

  • Serverless — no GPU provisioning, no deployment YAML, no cold starts (on-demand tier)
  • OpenAI-compatible — the same POST /chat/completions JSON protocol you'd use with the OpenAI SDK
  • Bearer token auth — authenticated with your Google Cloud credentials (ADC or Service Account), not API keys
  • GCP IAM integrated — access controlled by Vertex AI IAM roles, billed to your GCP project
  • Global and regional endpoints — models available at aiplatform.googleapis.com (global) or region-pinned endpoints like us-east5-aiplatform.googleapis.com

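For a region-pinned endpoint, the MaaS URL pattern is as follows (for the global endpoint, drop the region prefix from the host and use global as the location):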

https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/openapi/chat/completions
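As a minimal sketch of the pattern above, the URL can be assembled from a project ID and a location. The helper name buildMaaSEndpoint and the project ID are hypothetical, and the handling of "global" (un-prefixed host, locations/global) is an assumption based on the global endpoint mentioned earlier, not something taken from the extension's source.

```typescript
// Hypothetical helper: builds the Chat Completions URL for a project and location.
// Assumption: "global" maps to the un-prefixed host with locations/global.
function buildMaaSEndpoint(projectId: string, location: string): string {
  const host =
    location === "global"
      ? "aiplatform.googleapis.com"
      : `${location}-aiplatform.googleapis.com`;
  return (
    `https://${host}/v1/projects/${projectId}` +
    `/locations/${location}/endpoints/openapi/chat/completions`
  );
}

// "my-project" is a placeholder project ID.
const regionalUrl = buildMaaSEndpoint("my-project", "us-east5");
const globalUrl = buildMaaSEndpoint("my-project", "global");
```

When the OpenAI SDK is used, its baseURL would typically be this URL without the trailing /chat/completions, since the SDK appends that path itself.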

How MaaS differs from Gemini and Claude

| Aspect | Gemini (Google) | Claude (Anthropic) | MaaS (Open-Weight) |
|---|---|---|---|
| API protocol | Gemini SDK (@google/genai) | Anthropic Messages API (@anthropic-ai/vertex-sdk) | OpenAI Chat Completions (openai npm package) |
| Model ownership | Google | Anthropic | Various (DeepSeek, Alibaba, Moonshot) |
| Serving | Google-first-party | Anthropic on Vertex | Open-weight models on GCP infra |
| Auth | ADC / Service Account | ADC / Service Account | ADC / Service Account (same) |
| Pricing | Per 1M tokens, published | Per 1M tokens, published | Varies by model and throughput tier |
| Thinking mode | Gemini thinking variant | Extended thinking (thinking.budget_tokens) | Model-specific (chat_template_kwargs or native) |

How the extension supports MaaS

The extension treats MaaS as a third vendor alongside google (Gemini) and anthropic (Claude). The implementation is in a single provider file: src/providers/VertexMaaSProvider.ts.

Architecture

┌──────────────────────────────────────────────────────┐
│                  VertexChatModelDispatcher           │
│                                                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │  Google  │  │Anthropic │  │      MaaS        │   │
│  │ Provider │  │ Provider │  │    Provider      │   │
│  │          │  │          │  │                  │   │
│  │ @google/ │  │@anthropic│  │  openai SDK      │   │
│  │  genai   │  │/vertex-  │  │  + google-auth   │   │
│  │          │  │   sdk    │  │                  │   │
│  └────┬─────┘  └────┬─────┘  └────────┬─────────┘   │
│       │              │                │              │
└───────┼──────────────┼────────────────┼──────────────┘
        │              │                │
   Gemini API    Anthropic API    OpenAI-compatible
                                  Chat Completions API
                                  on Vertex AI
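The routing in the diagram can be sketched as a lookup keyed by vendor. This is an illustrative outline only: the type and function names (ChatProvider, selectProvider, ModelEntry) are hypothetical and do not claim to match VertexChatModelDispatcher's real interface. The vendor strings google, anthropic, and maas are the ones used on this page.

```typescript
type Vendor = "google" | "anthropic" | "maas";

// Hypothetical shape: a real provider would expose streaming methods,
// here it only records which wire protocol it speaks.
interface ChatProvider {
  vendor: Vendor;
  protocol: string;
}

// One provider per vendor; the protocol labels mirror the diagram.
const providers: Record<Vendor, ChatProvider> = {
  google: { vendor: "google", protocol: "Gemini API" },
  anthropic: { vendor: "anthropic", protocol: "Anthropic Messages API" },
  maas: { vendor: "maas", protocol: "OpenAI Chat Completions" },
};

interface ModelEntry {
  id: string;
  vendor: Vendor;
}

// Route a model to its provider using only the vendor field.
function selectProvider(model: ModelEntry): ChatProvider {
  return providers[model.vendor];
}

const chosen = selectProvider({ id: "deepseek-v3.2", vendor: "maas" });
```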

Key design decisions

  • Single provider file: all three MaaS models share the same protocol (OpenAI Chat Completions on Vertex). Only the sampling parameters (temperature, top_p), thinking mode, and max_tokens differ per model, and these come from a static MODEL_CONFIG lookup table.
  • OpenAI SDK: the openai npm package handles SSE streaming, typed request/response objects, and error types. The SDK's baseURL is pointed at the Vertex MaaS endpoint.
  • Fresh Bearer token per request: GCP access tokens expire after roughly an hour, so the provider creates a new GoogleAuth instance and fetches a fresh token for each inference call.
  • Thinking token filtering: models that produce chain-of-thought reasoning (DeepSeek V3.2, Kimi K2) return it in reasoning_content deltas. The provider silently consumes these; only the final content reaches the VS Code chat UI.
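A config-driven lookup of this kind might look roughly like the sketch below. The temperature and maxTokens values are the ones stated in the model sections further down this page; the field names, the ThinkingMode labels, and the getModelConfig helper are assumptions for illustration. The real table also carries a MaaS model path and top_p, which are omitted here because this page does not state their values.

```typescript
// Hypothetical labels for the three thinking behaviors described on this page.
type ThinkingMode = "none" | "template" | "native";

interface MaaSModelConfig {
  temperature: number;
  thinking: ThinkingMode;
  maxTokens: number;
}

// Keys are the VS Code model IDs; the shape is an assumption, not the real source.
const MODEL_CONFIG: Record<string, MaaSModelConfig> = {
  "qwen3-coder-480b": { temperature: 0.1, thinking: "none", maxTokens: 4096 },
  "deepseek-v3.2": { temperature: 0.3, thinking: "template", maxTokens: 8192 },
  "kimi-k2-thinking": { temperature: 0.6, thinking: "native", maxTokens: 8192 },
};

// Look up a model, failing loudly for IDs that have no config entry.
function getModelConfig(modelId: string): MaaSModelConfig {
  const config = MODEL_CONFIG[modelId];
  if (config === undefined) {
    throw new Error(`No MaaS config for model "${modelId}"`);
  }
  return config;
}
```

Keeping per-model differences in data rather than in branches is what lets one provider serve all three models.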

Supported MaaS models

Qwen3 Coder 480B

| Property | Value |
|---|---|
| VS Code model ID | qwen3-coder-480b |
| Provider | Alibaba / Qwen |
| Context window | 262,144 tokens |
| Max output | 4,096 tokens |
| Thinking | No (instruction-tuned coding model) |
| Tool calling | Yes |
| Image input | No |

Qwen3-Coder is a dedicated code generation model scoring 38.7 on SWE-bench Pro. It runs at temperature: 0.1 to keep code output near-deterministic and syntactically consistent.

DeepSeek V3.2

| Property | Value |
|---|---|
| VS Code model ID | deepseek-v3.2 |
| Provider | DeepSeek |
| Context window | 131,072 tokens |
| Max output | 8,192 tokens (shared with thinking) |
| Thinking | Yes (always enabled via chat_template_kwargs: { thinking: true }) |
| Tool calling | Yes |
| Image input | No |

DeepSeek V3.2 is a Mixture-of-Experts model with strong reasoning capabilities. The extension explicitly enables thinking on every request. When tools are present, system prompts are omitted, following GCP MaaS guidance that DeepSeek performs better at function calling without them. The model runs at temperature: 0.3 for balanced exploration during reasoning.
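The request shaping described above can be sketched as a pure function. This is an assumption-laden outline, not the provider's actual code: buildDeepSeekRequest, the interface names, and the "MODEL_PATH" placeholder are hypothetical, and placing chat_template_kwargs at the top level of the JSON body is assumed. The temperature and max_tokens values come from this page.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface ToolDef {
  type: "function";
  function: { name: string; description?: string; parameters?: object };
}

interface DeepSeekRequest {
  model: string;
  messages: ChatMessage[];
  temperature: number;
  max_tokens: number;
  stream: boolean;
  chat_template_kwargs: { thinking: boolean };
  tools?: ToolDef[];
}

function buildDeepSeekRequest(
  modelPath: string,
  messages: ChatMessage[],
  tools: ToolDef[]
): DeepSeekRequest {
  const hasTools = tools.length > 0;
  // Drop system prompts only when tools are attached.
  const filtered = hasTools
    ? messages.filter((m) => m.role !== "system")
    : messages;
  const request: DeepSeekRequest = {
    model: modelPath,
    messages: filtered,
    temperature: 0.3,
    max_tokens: 8192,
    stream: true,
    chat_template_kwargs: { thinking: true }, // thinking is always on
  };
  if (hasTools) {
    request.tools = tools;
  }
  return request;
}

// Sample inputs; "MODEL_PATH" stands in for the real MaaS model path.
const messages: ChatMessage[] = [
  { role: "system", content: "You are a coding assistant." },
  { role: "user", content: "List the files in src/." },
];
const listFilesTool: ToolDef = { type: "function", function: { name: "list_files" } };

const withTools = buildDeepSeekRequest("MODEL_PATH", messages, [listFilesTool]);
const withoutTools = buildDeepSeekRequest("MODEL_PATH", messages, []);
```

In this sketch, withTools carries only the user message, while withoutTools keeps the system prompt.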

Kimi K2 Thinking

| Property | Value |
|---|---|
| VS Code model ID | kimi-k2-thinking |
| Provider | Moonshot AI / Kimi |
| Context window | 32,768 tokens |
| Max output | 8,192 tokens (shared with thinking) |
| Thinking | Yes (native, always on) |
| Tool calling | Yes |
| Image input | No |

Kimi K2 is a Mixture-of-Experts reasoning model whose thinking is always enabled natively, so no toggle parameter is needed. It runs at temperature: 0.6, which leaves room for reasoning exploration while keeping the output grounded.

Note: Kimi K2 may not be available in all GCP projects. If the model does not appear in your model picker, check the "Vertex AI Models: MaaS Provider" output channel — a 404 during discovery means it is not yet enabled for your project or region.

Why these models?

Selection criteria

MaaS on Vertex AI offers many models. The three included here were selected based on:

  1. Coding proficiency: each model has strong benchmark scores on coding tasks and performs well in assisted-coding scenarios with tool calling
  2. Tool calling support: the VS Code Copilot Chat agent mode requires function/tool calling. Models without this capability (like deepseek-r1) were excluded
  3. OpenAI-compatible API: all three models are accessed through the same Chat Completions protocol, enabling a single-provider architecture
  4. Thinking/reasoning diversity: the set covers the spectrum of no thinking (Qwen3-Coder), thinking toggled by a request parameter (DeepSeek V3.2, which the extension always enables), and native thinking (Kimi K2)
  5. Temperature tuning: temperature and top_p are researched and set individually for each model instead of relying on generic defaults

Models considered but not included

| Model | Reason excluded |
|---|---|
| deepseek-r1 | No tool/function calling support, so incompatible with Copilot agent mode |
| Llama 4 Maverick | Not yet available on MaaS at the time of integration |
| GPT OSS models | Use reasoning_effort (Low/Medium/High) rather than chat_template_kwargs, a different thinking control protocol |
| Claude models | Already supported via the dedicated VertexAnthropicProvider |
| Gemini models | Already supported via the dedicated VertexGoogleProvider |

Adding more models

MaaS model support is config-driven. To add a new MaaS model:

  1. Add an entry to the MODEL_CONFIG map in src/providers/VertexMaaSProvider.ts with the model's MaaS path, temperature, top_p, thinking mode, and maxTokens
  2. Add a corresponding entry to src/models.json with the VS Code ID, vendor ("maas"), display name, context window, and capabilities

No other code changes are needed — discovery, streaming, tool calling, and error handling all work generically from the config.
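Because a new model needs both a config entry and a models.json entry, a small consistency check can catch a missed step. The sketch below is hypothetical: the ModelsJsonEntry field names (displayName, contextWindow, capabilities) are guesses based on the fields listed above, not the real models.json schema, and "new-model" is a placeholder ID.

```typescript
// Assumed shape of a models.json entry; the real schema may differ.
interface ModelsJsonEntry {
  id: string;
  vendor: "google" | "anthropic" | "maas";
  displayName: string;
  contextWindow: number;
  capabilities: { toolCalling: boolean; imageInput: boolean };
}

// Returns every MaaS model ID that appears in only one of the two places.
function findUnpairedIds(
  configIds: string[],
  entries: ModelsJsonEntry[]
): string[] {
  const entryIds = entries
    .filter((entry) => entry.vendor === "maas")
    .map((entry) => entry.id);
  const missingEntry = configIds.filter((id) => entryIds.indexOf(id) === -1);
  const missingConfig = entryIds.filter((id) => configIds.indexOf(id) === -1);
  return missingEntry.concat(missingConfig);
}

// Values taken from the Qwen3 Coder 480B table on this page.
const qwenEntry: ModelsJsonEntry = {
  id: "qwen3-coder-480b",
  vendor: "maas",
  displayName: "Qwen3 Coder 480B",
  contextWindow: 262144,
  capabilities: { toolCalling: true, imageInput: false },
};

// "new-model" was added to the config but not yet to models.json.
const unpaired = findUnpairedIds(["qwen3-coder-480b", "new-model"], [qwenEntry]);
```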

Thinking / reasoning tokens

Two of the three MaaS models (DeepSeek V3.2 and Kimi K2 Thinking) produce chain-of-thought reasoning tokens before generating their final answer.

How it works on MaaS

When a model reasons, the streaming response includes two types of content:

  • reasoning_content — the model's internal chain-of-thought (not shown to the user)
  • content — the final answer text (shown in the VS Code chat panel)

The extension silently consumes reasoning_content deltas and only emits content as visible text. Both token types count toward completion_tokens in billing.
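The filtering step can be illustrated with a simplified, synchronous version that walks an array of already-parsed chunks. The real provider presumably consumes an SSE stream incrementally; the interfaces and the collectVisibleText name here are assumptions, with only the content and reasoning_content field names taken from this page.

```typescript
interface StreamDelta {
  content?: string | null;
  reasoning_content?: string | null;
}

interface StreamChunk {
  choices: { delta: StreamDelta }[];
}

// Keeps `content`, drops `reasoning_content`, and counts what was dropped.
function collectVisibleText(chunks: StreamChunk[]): {
  visible: string;
  reasoningChars: number;
} {
  let visible = "";
  let reasoningChars = 0;
  for (const chunk of chunks) {
    for (const choice of chunk.choices) {
      const delta = choice.delta;
      if (delta.reasoning_content) {
        // Consumed silently: never forwarded to the chat UI.
        reasoningChars += delta.reasoning_content.length;
      }
      if (delta.content) {
        visible += delta.content;
      }
    }
  }
  return { visible, reasoningChars };
}

// Made-up sample stream: one reasoning delta followed by two answer deltas.
const sampleChunks: StreamChunk[] = [
  { choices: [{ delta: { reasoning_content: "Let me think." } }] },
  { choices: [{ delta: { content: "Hello" } }] },
  { choices: [{ delta: { content: ", world" } }] },
];

const filtered = collectVisibleText(sampleChunks);
// filtered.visible is "Hello, world"; the 13 reasoning characters are dropped.
```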

Per-model thinking behavior

| Model | Thinking enabled? | How | Tokens visible to user? |
|---|---|---|---|
| Qwen3-Coder 480B | No | N/A | All output tokens are content |
| DeepSeek V3.2 | Yes, always | chat_template_kwargs: { thinking: true } | Only content; reasoning_content consumed silently |
| Kimi K2 Thinking | Yes, native | Always on by the model | Only content; reasoning_content consumed silently |

Why thinking uses more output tokens

Thinking models share their max_tokens budget between reasoning and the final answer. If max_tokens is too low, the model may exhaust its budget on reasoning before producing an answer. For this reason, DeepSeek V3.2 and Kimi K2 are configured with maxTokens: 8192 while Qwen3-Coder uses 4096.
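A short worked example makes the shared budget concrete. The figure of 5,000 reasoning tokens is invented for illustration, and remainingAnswerBudget is a hypothetical helper rather than part of the extension.

```typescript
// Tokens left for the visible answer once reasoning has used part of max_tokens.
function remainingAnswerBudget(maxTokens: number, reasoningTokens: number): number {
  return Math.max(0, maxTokens - reasoningTokens);
}

// Assumed scenario: the model spends 5,000 tokens on reasoning.
const roomy = remainingAnswerBudget(8192, 5000); // 3192 tokens left for the answer
const starved = remainingAnswerBudget(4096, 5000); // 0: budget gone before any answer
```

With an 8,192-token limit the answer still has room. With 4,096 the same reasoning would exhaust the budget, and the response would likely be cut off before any visible content is produced.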

Pricing

MaaS model pricing is set to $0 in the extension's models.json because:

  • MaaS pricing depends on the GCP region, throughput tier (on-demand vs. provisioned), and customer-specific contracts
  • Google Cloud does not publish static per-token prices for all MaaS models
  • Pricing information is available in the Google Cloud Console under Vertex AI pricing for your specific project and commitment level

The extension's usage dashboard still tracks token counts, timestamps, and model names — only the dollar amount shows as $0.00. You can update the pricing values in src/models.json if you have known rates for your deployment.
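The effect of zero versus known rates can be sketched with a simple per-million-token calculation. The field names below are assumptions and may not match the pricing keys in src/models.json, and the non-zero rates are invented placeholders, not real Vertex AI prices.

```typescript
// Assumed field names; check src/models.json for the real pricing keys.
interface ModelPricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

interface TokenUsage {
  promptTokens: number;
  completionTokens: number; // includes reasoning tokens for thinking models
}

function estimateCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  return (
    (usage.promptTokens * pricing.inputPerMillionUsd +
      usage.completionTokens * pricing.outputPerMillionUsd) /
    1e6
  );
}

const usage: TokenUsage = { promptTokens: 1000000, completionTokens: 500000 };

// Default MaaS pricing in the extension: everything is zero.
const defaultCost = estimateCostUsd(usage, {
  inputPerMillionUsd: 0,
  outputPerMillionUsd: 0,
}); // 0

// Placeholder rates for illustration only.
const exampleCost = estimateCostUsd(usage, {
  inputPerMillionUsd: 1,
  outputPerMillionUsd: 2,
}); // 2
```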

Troubleshooting

Model doesn't appear in the picker

  1. Open the output channel: View → Output → "Vertex AI Models: MaaS Provider"
  2. Look for 🏓 MaaS <model-path> → ❌ entries
  3. Common causes:
    • 404 — model not enabled in your GCP project (enable it in Vertex AI Model Garden)
    • 403 — IAM permissions missing (add roles/aiplatform.user to your account)
    • 401 — authentication expired (run gcloud auth application-default login)
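The status codes above map naturally onto a small lookup, sketched here with a hypothetical explainDiscoveryStatus helper. The messages restate the causes listed above; the fallback text is an assumption.

```typescript
// Translates a discovery HTTP status into the likely cause and fix.
function explainDiscoveryStatus(status: number): string {
  switch (status) {
    case 404:
      return "Model not enabled in this GCP project: enable it in Vertex AI Model Garden";
    case 403:
      return "IAM permissions missing: add roles/aiplatform.user to your account";
    case 401:
      return "Authentication expired: run gcloud auth application-default login";
    default:
      return `Unexpected status ${status}: check the output channel for details`;
  }
}

const hint = explainDiscoveryStatus(403);
```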

400 errors during chat

If you see "400 status code (no body)":

  • Ensure the model supports tool calling (all three supported models do)
  • Check the output channel for request diagnostics
  • This is most likely a transient GCP-side issue — the extension will retry automatically

Streaming stops unexpectedly

The extension uses exponential backoff with automatic retries for transient errors (429, 503). If streaming stops, the output channel will show the retry history. You can also click the Stop button in Copilot Chat to cancel gracefully.
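Exponential backoff of this kind can be sketched as follows. The retryable statuses (429, 503) come from the paragraph above; the base delay, the cap, and the absence of jitter are assumptions chosen for illustration and may differ from what the extension actually does.

```typescript
const RETRYABLE_STATUSES = [429, 503];

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.indexOf(status) !== -1;
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
// baseMs and capMs are illustrative defaults, not the extension's values.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const delays = [0, 1, 2, 3, 4, 5].map((attempt) => backoffDelayMs(attempt));
// delays: 500, 1000, 2000, 4000, 8000, 8000
```

The cap keeps a long run of failures from producing unbounded waits; production implementations often add random jitter so that concurrent clients do not retry in lockstep.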

See the general Diagnostics & Troubleshooting page for non-MaaS-specific issues.
