# 📓 The GenAI Revolution Cookbook

**Title:** How to Prototype a Locally Hosted Chatbot with Kimi K2 on Google Cloud (using Chainlit UI)

**Description:** Step-by-step guide to build a self-hosted chatbot using Kimi K2 on Google Cloud: covering GPU VM setup, vLLM inference, backend & UI integration, containerization, HTTPS security, testing.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



You're about to build a private, GPU-accelerated chatbot that runs entirely under your control: no third-party API calls, no data leakage, and full customization of the model and interface. This guide walks you through deploying a complete chatbot stack on a single Google Cloud VM: vLLM for fast inference, FastAPI for request handling and guardrails, and Chainlit for a streaming chat UI. By the end, you'll have a working prototype served over HTTPS, ready for internal use or further extension.

This is a focused, single-VM prototype. Out of scope: autoscaling, multi-node orchestration, and production-grade authentication (we use simple rate limiting). You'll need a Google Cloud account with GPU quota, a domain for TLS, and a Hugging Face token with access to the Kimi K2 model weights (or a substitute instruct model).

## Why This Approach Works

**Why Kimi K2?**  
This guide assumes a 7B-class instruct model tuned for long-context reasoning and conversational tasks. In 16-bit precision, a model of that size needs roughly 14 GB for its weights alone, so it fits on a T4 (16 GB VRAM) with little headroom and more comfortably on an L4 (24 GB VRAM), without a multi-GPU setup. Note that the publicly released Kimi K2 weights are a much larger mixture-of-experts model (about one trillion total parameters) and will not fit on a single GPU of this class. For this single-VM prototype, substitute a tested alternative like Qwen2-7B-Instruct or Llama-3.1-8B-Instruct, both of which fit similar VRAM profiles and support long contexts.
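
As a rough check on the VRAM figures, weight memory is approximately the parameter count times the bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and ignores the KV cache, activations, and CUDA overhead, all of which need extra headroom. The function name is illustrative, not part of any library.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)


# Parameter counts are nominal (7e9 and 8e9); real checkpoints differ slightly.
for name, params in [("7B", 7e9), ("8B", 8e9)]:
    print(f"{name}: ~{weight_memory_gib(params):.1f} GiB of fp16 weights")
```

On a 16 GB card this leaves only a small budget for the KV cache, so long contexts may still call for a lower `--max-model-len` or a quantized checkpoint.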

**Why vLLM?**  
vLLM is a high-throughput inference server with paged attention, enabling efficient memory use and batching. It exposes an OpenAI-compatible API, so you can reuse familiar client code and swap models without rewriting application logic. For a single-user or small-team prototype, vLLM on one GPU provides low-latency responses and straightforward deployment.

**Why FastAPI?**  
FastAPI provides a lightweight, typed REST layer between the UI and vLLM. This decoupling lets you enforce guardrails (input truncation, rate limiting) and abstract model details from the frontend. It also simplifies future extensions like tool calling or multi-model routing. For tips on crafting robust prompts and ensuring consistent outputs from LLM APIs, see our article on [prompt engineering with LLM APIs](/article/prompt-engineering-with-llm-apis-how-to-get-reliable-outputs-3).

**Why Chainlit?**  
Chainlit is a Python framework for building chat UIs with minimal boilerplate. It handles streaming, message history rendering, and WebSocket connections out of the box, letting you focus on backend logic rather than frontend plumbing.

**Why Google Cloud?**  
GCE offers predictable GPU availability (T4, L4, A100) with per-second billing and straightforward VM provisioning. Artifact Registry integrates cleanly for container storage, and Compute Engine's firewall and networking are simple to configure for single-VM deployments.

## How It Works (High-Level Overview)

The architecture is a three-tier pipeline:

1. **Browser → Chainlit (UI):** User sends a message via the Chainlit web interface.
2. **Chainlit → FastAPI (Backend):** Chainlit forwards the conversation history to the FastAPI backend via HTTP POST.
3. **FastAPI → vLLM (Inference):** FastAPI validates inputs, truncates history to fit context limits, applies rate limiting, and calls vLLM's OpenAI-compatible `/v1/chat/completions` endpoint.
4. **vLLM → FastAPI → Chainlit → Browser:** vLLM generates a response, FastAPI returns it to Chainlit, and Chainlit streams it to the user.

**Data flow:**  
- Conversation history lives in the Chainlit session and is passed to FastAPI on each request.
- FastAPI truncates messages to fit `MAX_INPUT_CHARS` and `MAX_HISTORY_MESSAGES` before forwarding to vLLM.
- Rate limiting is enforced per client IP in FastAPI (with caveats for reverse proxies, addressed below).
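
The truncation step in the data flow can be sketched in isolation. This is a simplified stand-in that works on plain dicts rather than the backend's Pydantic models; the function name and the small limits used in the example are chosen for illustration only.

```python
from typing import Dict, List


def truncate_history(
    messages: List[Dict[str, str]],
    max_messages: int,
    max_chars: int,
) -> List[Dict[str, str]]:
    """Keep the newest messages that fit both limits, in chronological order."""
    recent = messages[-max_messages:]
    kept: List[Dict[str, str]] = []
    total = 0
    # Walk from newest to oldest and stop at the first message that would overflow
    for msg in reversed(recent):
        if total + len(msg["content"]) > max_chars:
            break
        kept.append(msg)
        total += len(msg["content"])
    return list(reversed(kept))


history = [
    {"role": "user", "content": "a" * 50},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 30},
]
trimmed = truncate_history(history, max_messages=24, max_chars=80)
print([m["role"] for m in trimmed])  # the oldest message no longer fits
```

Note that if the newest message alone exceeds the character limit, the result is empty, so the caller should handle that case explicitly.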

**TLS termination:**  
Nginx sits in front of Chainlit and FastAPI, handling HTTPS via Let's Encrypt and proxying requests to the backend services.

## Setup & Installation

### Prerequisites

- **Google Cloud account** with GPU quota (at least 1 T4 or L4 in your chosen region)
- **Domain name** for TLS (e.g., `chat.yourdomain.com`)
- **Hugging Face token** with access to Kimi K2 weights (or a substitute model)
- **Budget estimate:** ~$0.35–$0.50/hour for a T4 VM (n1-standard-4 + 1 T4 GPU)
- **Local tools:** `gcloud` CLI, Docker, and `docker compose` (v2.x)
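
To turn the hourly estimate into a monthly figure, a back-of-the-envelope calculation is enough. The sketch below uses the $0.35 to $0.50 range from the list above and ignores disk, network egress, and any discounts; actual pricing varies by region and over time, and the function name is illustrative.

```python
def monthly_cost_usd(hourly_rate: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Compute cost for a VM billed at a flat hourly rate."""
    return hourly_rate * hours_per_day * days


# Always-on VM versus one that only runs during working hours
always_on = (monthly_cost_usd(0.35), monthly_cost_usd(0.50))
work_hours = (
    monthly_cost_usd(0.35, hours_per_day=8, days=22),
    monthly_cost_usd(0.50, hours_per_day=8, days=22),
)
print(f"Always on:     ${always_on[0]:.0f} to ${always_on[1]:.0f} per month")
print(f"Working hours: ${work_hours[0]:.0f} to ${work_hours[1]:.0f} per month")
```

Stopping the VM when the prototype is idle is the simplest way to stay near the lower figure.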

### Provision the VM

Create a VM with a T4 GPU and sufficient disk for model weights and Docker images. This example uses `us-central1-a` and an `n1-standard-4` instance with a 100 GB boot disk.

In [None]:
gcloud compute instances create k2-chatbot \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --metadata=install-nvidia-driver=True

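**Note on GPU drivers:** The `install-nvidia-driver=True` metadata flag is honored by Google's Deep Learning VM images, which install a compatible NVIDIA driver on first boot. A stock Ubuntu image like the one above ignores the flag, so there you need to install the driver yourself by following Google's GPU driver installation guide for Compute Engine. For newer GPUs like L4, you may need driver version 550 or higher. On a Deep Learning VM image, if the automatic install fails, SSH into the VM and run: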

In [None]:
sudo /opt/deeplearning/install-driver.sh

Then reboot and verify:

In [None]:
nvidia-smi

### Configure Firewall

Allow HTTP and HTTPS traffic. Do not open ports 3000, 8000, or 8001 directly; they are reachable only through Nginx.

In [None]:
gcloud compute firewall-rules create allow-http-https \
  --allow=tcp:80,tcp:443 \
  --source-ranges=0.0.0.0/0 \
  --target-tags=http-server,https-server

gcloud compute instances add-tags k2-chatbot \
  --zone=us-central1-a \
  --tags=http-server,https-server

### Install Docker and Docker Compose

SSH into the VM and install Docker:

In [None]:
gcloud compute ssh k2-chatbot --zone=us-central1-a

Once connected, run:

In [None]:
sudo apt-get update
sudo apt-get install -y docker.io docker-compose-plugin
sudo usermod -aG docker $USER
newgrp docker

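Verify Docker can access the GPU. This requires the NVIDIA Container Toolkit on the host; if the command below fails with a GPU or runtime error, install the toolkit first: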

In [None]:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu20.04 nvidia-smi

### Set Up Hugging Face Token

Store your Hugging Face token in a `.env` file with restricted permissions, then load it as an environment variable. To keep the token out of your shell history, create the file in an editor rather than with `echo`:

In [None]:
echo "HF_TOKEN=your_hf_token_here" > ~/.env
chmod 600 ~/.env

Load it in your shell:

In [None]:
# Export every variable defined in the file, then turn auto-export back off
set -a; . ~/.env; set +a

**Security note:** For production, use Google Secret Manager or a similar service to manage tokens.

## Step-by-Step Implementation

### Step 1: Create Project Structure

Set up directories for the API, UI, and Docker configuration:

In [None]:
mkdir -p ~/k2-chatbot/{api,ui}
cd ~/k2-chatbot

### Step 2: Write the FastAPI Backend

Create `api/main.py`. This file handles chat requests, applies rate limiting and input truncation, and calls vLLM.

In [None]:
# Purpose: FastAPI backend for Kimi K2 chatbot, handling chat requests, rate limiting, and vLLM inference.

import os
import time
import logging
from typing import List, Literal, Optional, Dict, Tuple
from fastapi import FastAPI, Request, HTTPException
from pydantic import BaseModel, Field
from openai import OpenAI

# Configure basic logging for runtime behavior and error tracking
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)

# Load environment variables for configuration
VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://vllm:8000/v1")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "EMPTY")
MODEL_ID = os.getenv("MODEL_ID", "your-model-id")
MAX_INPUT_CHARS = int(os.getenv("MAX_INPUT_CHARS", "12000"))
MAX_HISTORY_MESSAGES = int(os.getenv("MAX_HISTORY_MESSAGES", "24"))
RATE_LIMIT_RPS = float(os.getenv("RATE_LIMIT_RPS", "2.0"))

# Initialize OpenAI client for vLLM server
client = OpenAI(base_url=VLLM_BASE_URL, api_key=OPENAI_API_KEY)

# Initialize FastAPI app
app = FastAPI(title="K2 Chat API")

class ChatMessage(BaseModel):
    """
    Represents a single chat message in the conversation.

    Args:
        role (Literal["system", "user", "assistant"]): The role of the message sender.
        content (str): The message content.
        timestamp (Optional[float]): Optional UNIX timestamp for the message.

    Example:
        {"role": "user", "content": "Hello!", "timestamp": 1710000000.0}
    """
    role: Literal["system", "user", "assistant"]
    content: str
    timestamp: Optional[float] = None

class ChatRequest(BaseModel):
    """
    Request body for the /chat endpoint.

    Args:
        messages (List[ChatMessage]): Conversation history.
        temperature (float): Sampling temperature for the model.
        max_tokens (int): Maximum tokens to generate in the response.
    """
    messages: List[ChatMessage] = Field(default_factory=list)
    temperature: float = 0.2
    max_tokens: int = 512

class ChatResponse(BaseModel):
    """
    Response body for the /chat endpoint.

    Args:
        content (str): The assistant's reply.
        tokens (Optional[int]): Number of tokens used (if available).
        latency_ms (Optional[int]): Latency in milliseconds.
    """
    content: str
    tokens: Optional[int] = None
    latency_ms: Optional[int] = None

# In-memory leaky bucket rate limiter per client IP
rate_state: Dict[str, Tuple[float, float]] = {}  # ip -> (last_time, tokens)

def check_rate_limit(ip: str) -> None:
    """
    Enforces a simple leaky bucket rate limit per client IP.

    Args:
        ip (str): The client's IP address.

    Raises:
        HTTPException: If the client exceeds the allowed rate.
    """
    now = time.time()
    last, tokens = rate_state.get(ip, (now, RATE_LIMIT_RPS))
    elapsed = max(0.0, now - last)
    # Refill tokens based on elapsed time
    tokens = min(RATE_LIMIT_RPS, tokens + elapsed * RATE_LIMIT_RPS)
    if tokens < 1.0:
        logging.warning(f"Rate limit exceeded for IP {ip}")
        raise HTTPException(status_code=429, detail="Too many requests")
    # Update state with new token count
    rate_state[ip] = (now, tokens - 1.0)
    # Optional: Clean up to avoid unbounded memory growth
    if len(rate_state) > 10000:
        # Dicts keep insertion order, so this drops the earliest-seen IPs (prototype trade-off)
        for old_ip in list(rate_state.keys())[:1000]:
            del rate_state[old_ip]

def truncate_messages(msgs: List[ChatMessage]) -> List[ChatMessage]:
    """
    Truncates the message history to fit within max message count and character limits.

    Args:
        msgs (List[ChatMessage]): The full conversation history.

    Returns:
        List[ChatMessage]: The truncated message list, preserving order.

    Notes:
        - Keeps only the last MAX_HISTORY_MESSAGES.
        - Ensures total character count does not exceed MAX_INPUT_CHARS.
        - Truncates from the oldest messages first.
    """
    # Keep only the last N messages
    msgs = msgs[-MAX_HISTORY_MESSAGES:]
    total = 0
    kept = []
    # Walk backwards to keep the most recent messages within char limit
    for m in reversed(msgs):
        if total + len(m.content) > MAX_INPUT_CHARS:
            break
        kept.append(m)
        total += len(m.content)
    # Reverse to restore chronological order
    return list(reversed(kept))

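# Sync handler on purpose: FastAPI runs plain `def` endpoints in a threadpool,
# so the blocking OpenAI client call below does not stall the event loop.
@app.post("/chat", response_model=ChatResponse)
def chat(req: Request, body: ChatRequest) -> ChatResponse: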
    """
    Handles chat requests from the UI, applies rate limiting and truncation, and calls vLLM.

    Args:
        req (Request): The FastAPI request object (for client IP).
        body (ChatRequest): The chat request payload.

    Returns:
        ChatResponse: The assistant's reply, token usage, and latency.

    Raises:
        HTTPException: For input validation, rate limiting, or backend errors.
    """
    ip = req.client.host
    check_rate_limit(ip)
    if not body.messages:
        logging.error("No messages provided in request")
        raise HTTPException(status_code=400, detail="messages required")
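    msgs = truncate_messages(body.messages)
    if not msgs:
        # The newest message alone exceeds MAX_INPUT_CHARS, so nothing survived truncation
        raise HTTPException(status_code=413, detail="latest message exceeds MAX_INPUT_CHARS")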

    start = time.time()
    try:
        # Send only role and content; the optional timestamp is not part of the chat-completions schema
        openai_msgs = [m.model_dump(include={"role", "content"}) for m in msgs]
        # Call vLLM via OpenAI-compatible API
        res = client.chat.completions.create(
            model=MODEL_ID,
            messages=openai_msgs,
            temperature=body.temperature,
            max_tokens=body.max_tokens,
        )
        content = res.choices[0].message.content
        usage = getattr(res, "usage", None)
        tokens = usage.total_tokens if usage else None
    except Exception as e:
        # Log the error for debugging, but avoid leaking sensitive info
        logging.exception("Error during vLLM inference")
        raise HTTPException(status_code=500, detail="Model backend error")
    latency_ms = int((time.time() - start) * 1000)
    # Log successful request for monitoring
    logging.info(f"Chat completed for IP {ip} in {latency_ms} ms, tokens: {tokens}")
    return ChatResponse(content=content, tokens=tokens, latency_ms=latency_ms)

**Why a separate backend?** This layer decouples the UI from the model server, letting you enforce guardrails (truncation, rate limiting) and abstract model details. It also simplifies future extensions like tool calling or multi-model routing.

Create `api/requirements.txt`:

In [None]:
fastapi==0.109.0
uvicorn[standard]==0.27.0
openai==1.10.0
pydantic==2.5.3

Create `api/Dockerfile`:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8001"]
```

### Step 3: Write the Chainlit UI

Create `ui/app.py`. This file defines the chat interface and forwards messages to the FastAPI backend.

In [None]:
import os
import chainlit as cl
import httpx

API_URL = os.getenv("API_URL", "http://api:8001/chat")

@cl.on_message
async def main(message: cl.Message):
    # Retrieve conversation history from the Chainlit user session
    history = cl.user_session.get("history", [])
    history.append({"role": "user", "content": message.content})
    
    # Call FastAPI backend
    async with httpx.AsyncClient(timeout=60.0) as client:
        try:
            resp = await client.post(API_URL, json={"messages": history})
            resp.raise_for_status()
            data = resp.json()
            reply = data["content"]
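        except Exception as e:
            # Report the failure without storing it as an assistant turn,
            # and drop the unanswered user message so a retry starts clean
            history.pop()
            cl.user_session.set("history", history)
            await cl.Message(content=f"Error: {e}").send()
            return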
    
    # Append assistant reply to history
    history.append({"role": "assistant", "content": reply})
    cl.user_session.set("history", history)
    
    # Send reply to user
    await cl.Message(content=reply).send()

**Note on message history:** This code uses `cl.user_session` to store conversation history in memory. For production, consider persisting history to a database or using Chainlit's built-in data layer.
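
One way to persist history is a small key-value table keyed by session ID. The sketch below uses the standard-library `sqlite3` module as a stand-in for a real database; the table layout and the function names (`open_store`, `save_history`, `load_history`) are hypothetical and are not part of Chainlit's data layer.

```python
import json
import sqlite3
from typing import Dict, List


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history store; ':memory:' keeps it ephemeral for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "session_id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
    )
    return conn


def save_history(conn: sqlite3.Connection, session_id: str, messages: List[Dict[str, str]]) -> None:
    """Overwrite the stored conversation for one session with a JSON snapshot."""
    conn.execute(
        "INSERT OR REPLACE INTO history (session_id, messages) VALUES (?, ?)",
        (session_id, json.dumps(messages)),
    )
    conn.commit()


def load_history(conn: sqlite3.Connection, session_id: str) -> List[Dict[str, str]]:
    """Return the stored conversation, or an empty list for an unknown session."""
    row = conn.execute(
        "SELECT messages FROM history WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []


store = open_store()
save_history(store, "demo-session", [{"role": "user", "content": "Hello!"}])
print(load_history(store, "demo-session"))
```

These calls are blocking, so inside an async Chainlit handler they would need to run in a worker thread or be replaced with an async database client.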

Create `ui/requirements.txt`:

In [None]:
chainlit==1.0.0
httpx==0.26.0

Create `ui/Dockerfile`:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["chainlit", "run", "app.py", "--host", "0.0.0.0", "--port", "3000"]
```

### Step 4: Write Docker Compose Configuration

Create `docker-compose.yml` in the project root. This file orchestrates vLLM, FastAPI, and Chainlit.

```yaml
version: "3.8"

services:
  vllm:
    image: vllm/vllm-openai:v0.3.0
    container_name: vllm
    environment:
      - HF_TOKEN=${HF_TOKEN}
    # --dtype half forces fp16, since the T4 does not support bfloat16
    command: >
      --model ${MODEL_ID}
      --dtype half
      --max-model-len 8192
      --gpu-memory-utilization 0.9
      --trust-remote-code
    ports:
      - "127.0.0.1:8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v1/models"]
      interval: 30s
      timeout: 10s
      retries: 3
      # Give the first-run model download and load time before failures count
      start_period: 600s

  api:
    build: ./api
    container_name: api
    environment:
      - VLLM_BASE_URL=http://vllm:8000/v1
      - MODEL_ID=${MODEL_ID}
      - MAX_INPUT_CHARS=12000
      - MAX_HISTORY_MESSAGES=24
      - RATE_LIMIT_RPS=2.0
    ports:
      - "127.0.0.1:8001:8001"
    depends_on:
      vllm:
        condition: service_healthy

  ui:
    build: ./ui
    container_name: ui
    environment:
      - API_URL=http://api:8001/chat
    ports:
      - "127.0.0.1:3000:3000"
    depends_on:
      - api
```

**GPU access note:** This configuration uses `deploy.resources.reservations.devices` with `driver: nvidia`, which works with Docker Compose v2.x and the NVIDIA Container Toolkit. If you encounter "no CUDA device" errors, ensure the toolkit is installed and `nvidia-smi` works inside a test container.

**Security note:** The `--trust-remote-code` flag allows the model to execute custom code during loading. This is often required for newer models but poses a security risk. Pin model revisions and review model repositories before use.

Create a `.env` file in the project root to centralize configuration:

In [None]:
HF_TOKEN=your_hf_token_here
MODEL_ID=kimi-k2-7b-instruct

Set `MODEL_ID` to the Hugging Face repository ID of your chosen model (for example, `Qwen/Qwen2-7B-Instruct`); the value above is a placeholder. If you're unsure how to evaluate which LLM best fits your hardware and application needs, our guide on [how to choose an AI model for your app: speed, cost, and reliability](/article/how-to-choose-an-ai-model-for-your-app-speed-cost-reliability) offers a practical framework.

### Step 5: Build and Start Services

Build the images and start the stack:

In [None]:
docker compose build
docker compose up -d

Check logs to verify services are running:

In [None]:
docker compose logs -f

Wait for vLLM to download the model and load it into GPU memory. This may take 5–10 minutes on first run.
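
Instead of watching the logs, you can poll vLLM's model-list endpoint until it answers. The sketch below is meant to run on the VM, where the compose file binds vLLM to `127.0.0.1:8000`. The helper names, timeout, and polling interval are assumptions chosen for illustration.

```python
import time
import urllib.request
from typing import Callable


def vllm_is_ready(url: str = "http://127.0.0.1:8000/v1/models") -> bool:
    """Return True once the model-list endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors while the model loads
        return False


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 900.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example (run on the VM once the stack is starting):
# print("ready" if wait_until(vllm_is_ready) else "timed out")
```

Passing the check and the sleep function as arguments keeps the loop easy to test without a running server.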

### Step 6: Validate Locally via SSH Tunnel

Before exposing the service publicly, test it locally. On your local machine, create an SSH tunnel to the VM:

In [None]:
gcloud compute ssh k2-chatbot --zone=us-central1-a -- -L 3000:localhost:3000

Open `http://localhost:3000` in your browser. You should see the Chainlit interface. Send a message and verify the assistant responds.
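
The browser check exercises the whole stack through the UI. To test the FastAPI backend on its own, a short script can post directly to `/chat`. This sketch assumes it runs on the VM, where the compose file binds the API to `127.0.0.1:8001`; the helper names are illustrative.

```python
import json
import urllib.request
from typing import Any, Dict

# Assumption: run on the VM, where the API port is bound to localhost
API_URL = "http://127.0.0.1:8001/chat"


def build_chat_payload(prompt: str, temperature: float = 0.2, max_tokens: int = 128) -> Dict[str, Any]:
    """Build a request body matching the backend's ChatRequest model."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def post_chat(prompt: str, url: str = API_URL) -> Dict[str, Any]:
    """Send one prompt to the backend and return the decoded JSON response."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the stack to be running):
# reply = post_chat("Say hello in one sentence.")
# print(reply["content"], reply["latency_ms"])
```

A successful reply contains the `content`, `tokens`, and `latency_ms` fields defined by `ChatResponse`.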

### Step 7: Configure Nginx and Let's Encrypt

Install Nginx and Certbot on the VM:

In [None]:
sudo apt-get install -y nginx certbot python3-certbot-nginx

Create an Nginx configuration file at `/etc/nginx/sites-available/k2-chatbot`:

```nginx
server {
    listen 80;
    server_name chat.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
    }

    location /api/ {
        proxy_pass http://127.0.0.1:8001/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
    }
}
```

Enable the site and reload Nginx:

In [None]:
sudo ln -s /etc/nginx/sites-available/k2-chatbot /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

Obtain a TLS certificate from Let's Encrypt:

In [None]:
sudo certbot --nginx -d chat.yourdomain.com

Follow the prompts. Certbot will automatically update the Nginx configuration to redirect HTTP to HTTPS.

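**Rate limiting caveat:** The FastAPI backend uses `req.client.host` for rate limiting. Behind a reverse proxy, that is the proxy's address rather than the end user's, so every request arriving through Nginx's `/api/` location would share one bucket. To enforce per-client limits there, read the `X-Real-IP` header, which the Nginx configuration above overwrites with `$remote_addr` on every request. Requests from the Chainlit container go straight to `http://api:8001` and carry no such header, so all UI users still share one bucket unless the UI forwards the client address itself. Add this to `api/main.py`:

In [None]:
def get_client_ip(req: Request) -> str:
    # Nginx sets X-Real-IP itself, so unlike the first X-Forwarded-For entry a client cannot spoof it
    real_ip = req.headers.get("X-Real-IP")
    if real_ip:
        return real_ip.strip()
    # Direct connections (for example from the UI container) have no proxy header
    return req.client.host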

Then replace `ip = req.client.host` with `ip = get_client_ip(req)` in the `chat` function.
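
One detail worth checking when trusting proxy headers: with `$proxy_add_x_forwarded_for`, Nginx appends the connecting address to whatever `X-Forwarded-For` value the client sent, so the leftmost entry can be forged while the rightmost one comes from the proxy. The sketch below shows a selection rule for the single-proxy setup in this guide. It works on a plain dict with lowercase keys, and the function name is hypothetical.

```python
from typing import Mapping


def client_ip_from_headers(headers: Mapping[str, str], peer_ip: str) -> str:
    """Pick the address written by a single trusted proxy, else the socket peer."""
    real_ip = headers.get("x-real-ip")
    if real_ip:
        # Set by the proxy from the connecting address, replacing any client value
        return real_ip.strip()
    forwarded = headers.get("x-forwarded-for")
    if forwarded:
        # With one trusted proxy that appends, the rightmost entry is the proxy's
        return forwarded.split(",")[-1].strip()
    return peer_ip


spoofed = {"x-forwarded-for": "1.2.3.4, 203.0.113.7"}
print(client_ip_from_headers(spoofed, peer_ip="172.18.0.1"))
```

With more than one proxy in the chain, the rule changes: you would skip as many rightmost entries as there are trusted hops.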

### Step 8: Test Over HTTPS

Open `https://chat.yourdomain.com` in your browser. Verify the TLS certificate is valid and the chat interface loads. Send a message and confirm the assistant responds.

## Run and Validate

### Test Long Context

Paste a longer input (e.g., a 2000-word document) and ask the assistant to summarize it. Ensure vLLM's `--max-model-len` setting accommodates the input. If older turns are being dropped, raise `MAX_INPUT_CHARS` under the `api` service in `docker-compose.yml`, then recreate the API container so the new value takes effect (a plain `restart` keeps the old environment):

In [None]:
docker compose up -d api
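
Before raising the character limit, it helps to check that the input and the reply still fit in the model's context window. The sketch below uses the defaults from this guide (12,000 characters, 512 output tokens, an 8,192-token window) and assumes roughly 4 characters per token, a common rule of thumb for English text. The real ratio depends on the tokenizer and the language, and the function names are illustrative.

```python
import math


def estimated_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from a character count (heuristic, not a tokenizer)."""
    return math.ceil(num_chars / chars_per_token)


def fits_context(
    max_input_chars: int = 12000,
    max_new_tokens: int = 512,
    max_model_len: int = 8192,
    chars_per_token: float = 4.0,
) -> bool:
    """Check whether the worst-case prompt plus the reply fits the context window."""
    return estimated_tokens(max_input_chars, chars_per_token) + max_new_tokens <= max_model_len


print(fits_context())                     # defaults from this guide
print(fits_context(chars_per_token=1.5))  # denser text, e.g. code or non-English input
```

The chat template also adds a few tokens per message, so leave some margin rather than sizing the limit exactly.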

Monitor VRAM usage to avoid out-of-memory errors:

In [None]:
nvidia-smi

For a deeper understanding of how LLMs handle memory and why they sometimes "forget" information as context grows, check out our article on [context rot and LLM memory limitations](/article/context-rot-why-llms-forget-as-their-memory-grows-3).

### Test Rate Limiting

Send multiple rapid requests from the same client. After exceeding the rate limit (default: 2 requests/second), you should receive a 429 error.
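
To see what to expect from this test, the limiter's arithmetic can be replayed with explicit timestamps instead of the wall clock. The sketch below mirrors the refill-and-spend logic of `check_rate_limit` for a single client; the `Bucket` class is a hypothetical stand-in, not code from the backend.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    rate: float    # tokens refilled per second; also the bucket capacity
    tokens: float  # tokens currently available
    last: float    # timestamp of the last accepted request

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        tokens = min(self.rate, self.tokens + elapsed * self.rate)
        if tokens < 1.0:
            return False  # state is left untouched on rejection
        self.tokens = tokens - 1.0
        self.last = now
        return True


bucket = Bucket(rate=2.0, tokens=2.0, last=0.0)
burst = [bucket.allow(now=0.0) for _ in range(10)]
print(f"accepted {burst.count(True)} of {len(burst)} simultaneous requests")
print("half a second later:", bucket.allow(now=0.5))
```

With the default `RATE_LIMIT_RPS=2.0`, a burst is capped at two requests, and one more is accepted for every half second of waiting.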

### Check Logs

Review logs for errors or performance issues:

In [None]:
docker compose logs vllm
docker compose logs api
docker compose logs ui

## Next Steps

This prototype runs on a single VM with basic rate limiting and no persistent storage. To move toward production:

- **Add authentication:** Integrate OAuth or API keys to control access.
- **Persist conversation history:** Store messages in a database (PostgreSQL, Firestore) instead of in-memory sessions.
- **Scale horizontally:** Use Kubernetes or Cloud Run to deploy multiple replicas behind a load balancer.
- **Monitor and alert:** Add Prometheus metrics and Grafana dashboards to track latency, throughput, and GPU utilization.
- **Push images to Artifact Registry:** Tag and push your built images for versioned deployments:

In [None]:
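# Let Docker authenticate against Artifact Registry
gcloud auth configure-docker us-docker.pkg.dev
# Compose names built images <project>-<service>; confirm the name with `docker images`
docker tag k2-chatbot-api:latest us-docker.pkg.dev/YOUR_PROJECT/k2-repo/api:v1
docker push us-docker.pkg.dev/YOUR_PROJECT/k2-repo/api:v1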

You now have a working, GPU-accelerated chatbot running entirely under your control. You can customize the model, adjust guardrails, and extend the backend with tools or multi-model routing, all without relying on third-party APIs.