# ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit — vLLM Notebook

This notebook serves `cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit` via vLLM (OpenAI-compatible API).

**System Requirements:**
- GPU: A100 with ~40GB VRAM
- CPU RAM: 80GB+
- Runtime: Colab GPU, Linux

**What’s in this notebook:**
- vLLM installation (nightly wheels)
- Start server (background) and readiness check
- Text and image+text chat completion examples
- Clean shutdown

Notes: This vLLM path avoids the Transformers compressed-tensors route that caused `.weight` attribute errors.

## 1. Environment Setup

- Ensure a GPU runtime is enabled
- Install vLLM (nightly), which pulls in a compatible PyTorch build
- Verify GPU with `nvidia-smi`
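
The GPU check can also be done from Python, which makes it easy to fail early when the runtime has too little VRAM. This is a minimal sketch: the helper names and the `MIN_VRAM_MIB` threshold are illustrative assumptions, not part of the original notebook.

```python
import shutil
import subprocess


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Parse one 'name, memory.total' CSV line from nvidia-smi into (name, MiB)."""
    name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
    return name, int(mem_mib)


def list_gpus() -> list[tuple[str, int]]:
    """Return (name, total MiB) per visible GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            continue  # skip blank lines or rows with non-numeric memory values
    return gpus


MIN_VRAM_MIB = 38_000  # assumption: rough floor for a 40GB-class card

gpus = list_gpus()
if not gpus:
    print("No GPU detected; enable a GPU runtime first.")
for name, mem in gpus:
    status = "ok" if mem >= MIN_VRAM_MIB else "below the assumed threshold"
    print(f"{name}: {mem} MiB ({status})")
```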

In [None]:
# Verify GPU availability
!nvidia-smi

## Notes and Troubleshooting

### Memory Optimization Tips:
1. **AWQ 4-bit quantization** cuts weight memory to roughly a quarter of FP16
2. **device_map="auto"** (Transformers path only) distributes the model across GPU and CPU when needed; vLLM does not use this option
3. The model should fit on a 40GB A100 with headroom for activations and the KV cache
4. Only about 3B parameters are activated per token (sparse MoE architecture), but all 28B must still be resident in memory
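
The memory claims above can be checked with back-of-the-envelope arithmetic. The sketch below counts weight storage only; it ignores quantization scales and zero points, layers that may be left unquantized, activations, and the KV cache, so real usage is higher. The parameter count is taken from the "28B" in the model name.

```python
def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 28e9  # "28B" in the model name

fp16_gb = weight_gb(TOTAL_PARAMS, 16)
int4_gb = weight_gb(TOTAL_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gb:.0f} GB")
print(f"4-bit weights: ~{int4_gb:.0f} GB")
print(f"Reduction:     {fp16_gb / int4_gb:.0f}x")
```

Sparse activation lowers compute per token, not storage: every expert has to be loaded because any of them may be selected for a given token.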

### Model-Specific Features:
- **Vision-Language**: Supports both text-only and image+text inputs
- **Thinking Mode**: Multi-step reasoning for complex tasks
- **Tool Calling**: Can use external tools (requires vLLM for full support)
- **Video Understanding**: Temporal awareness and event localization
- **Visual Grounding**: Precise object localization and grounding

### Common Issues:

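**Missing autoawq Package:**
- Install with: `!pip install "autoawq>=0.2.0"` (quote the specifier so the shell does not treat `>` as an output redirect)
- This is required for loading AWQ quantized models on the Transformers path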

**Out of Memory:**
- Reduce `max_new_tokens` (`max_tokens` in the vLLM API) to 256 or 512
- Lower batch size (use single examples)
- On the Transformers path, enable more aggressive CPU offloading: `max_memory={0: "36GB", "cpu": "80GB"}`
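
`max_memory` is a Transformers loader argument and has no effect on a vLLM server. When serving with vLLM, the usual memory controls are launch flags: `--gpu-memory-utilization`, `--max-model-len`, and `--max-num-seqs`. The sketch below builds such a command; the helper name and the specific values are illustrative assumptions, not tuned recommendations.

```python
import sys


def build_vllm_cmd(model_id, port=8000, gpu_util=None, max_model_len=None, max_num_seqs=None):
    """Build an api_server launch command with optional memory-related flags."""
    cmd = [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_id,
        "--port", str(port),
        "--trust-remote-code",
    ]
    if gpu_util is not None:
        # Fraction of GPU memory vLLM may reserve (weights + KV cache)
        cmd += ["--gpu-memory-utilization", str(gpu_util)]
    if max_model_len is not None:
        # Shorter context means a smaller KV cache per sequence
        cmd += ["--max-model-len", str(max_model_len)]
    if max_num_seqs is not None:
        # Fewer concurrent sequences means less KV cache in use at once
        cmd += ["--max-num-seqs", str(max_num_seqs)]
    return cmd


# Illustrative values only
cmd = build_vllm_cmd(
    "cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit",
    gpu_util=0.85, max_model_len=8192, max_num_seqs=4,
)
print(" ".join(cmd))
```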

**Slow Inference:**
- AWQ quantization provides excellent speed vs quality tradeoff
- Expect 20-40 tokens/sec on A100 for this model size
- Use `do_sample=False` for faster greedy decoding
- For production use, consider vLLM (see model card for setup)

**Model Loading Errors:**
- Ensure `trust_remote_code=True` is set (the model uses custom modeling code)
- Check that your installed `transformers` version meets the model's requirement (>=4.37.0)
- Verify autoawq is properly installed
- Make sure to add image preprocessor with `model.add_image_preprocess(processor)`

**Image Processing Errors:**
- Always use the processor to prepare inputs (not just tokenizer)
- Follow the message format shown in the helper function
- Images should be PIL Image objects in RGB mode

### Performance Expectations:
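- **Model Size:** roughly 14GB of VRAM for the weights (28B parameters at 4 bits each), plus quantization metadata and any layers left unquantized
- **Inference Speed:** 20-40 tokens/second on A100
- **First Token Latency:** 1-2 seconds
- **Memory Usage:** weights plus KV cache; vLLM preallocates KV-cache space up to `--gpu-memory-utilization` (default 0.9), so `nvidia-smi` shows most of the GPU in use while the server runs
- **Activated Parameters:** Only about 3B per token (sparse MoE)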
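
These figures are estimates, so it is worth measuring throughput on your own runtime. OpenAI-compatible responses include a `usage` object with `completion_tokens`; dividing it by wall-clock time gives a rough tokens-per-second number. The helper name and the sample values below are illustrative, and the result includes prompt processing time, so it slightly understates pure decode speed.

```python
import time


def tokens_per_second(usage: dict, elapsed_s: float) -> float:
    """Rough end-to-end generation rate from a chat completion's usage block."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return usage.get("completion_tokens", 0) / elapsed_s


# In the notebook, time the request itself:
#   start = time.perf_counter()
#   r = requests.post(base_url + "/v1/chat/completions", ...)
#   elapsed = time.perf_counter() - start
#   rate = tokens_per_second(r.json()["usage"], elapsed)

# Stand-in values so the sketch runs without a server
sample_usage = {"prompt_tokens": 12, "completion_tokens": 256, "total_tokens": 268}
rate = tokens_per_second(sample_usage, elapsed_s=8.0)
print(f"{rate:.1f} tokens/sec")
```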

### For Production Deployment:
The model card recommends **vLLM** for production serving. The equivalent command-line setup, with reasoning and tool-call parsing enabled, is:

```bash
pip install uv
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match

vllm serve cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit --trust-remote-code \
  --reasoning-parser ernie45 \
  --tool-call-parser ernie45 \
  --enable-auto-tool-choice
```
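
With `--reasoning-parser` enabled, vLLM returns the model's reasoning separately from the final answer in the chat completion message. The field name is an assumption here: it has been `reasoning_content` in some vLLM releases and `reasoning` in others, so this sketch checks both. The helper name is hypothetical.

```python
def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion message dict.

    Assumption: the reasoning text is under 'reasoning_content' or 'reasoning',
    depending on the vLLM version. Missing or null fields become empty strings.
    """
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Stand-in message shaped like response["choices"][0]["message"]
sample_message = {
    "role": "assistant",
    "reasoning_content": "The user wants a short definition...",
    "content": "Quantum computers use qubits instead of bits.",
}
reasoning, answer = split_reasoning(sample_message)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

If the token budget runs out while the model is still reasoning, `content` may be empty or null, which is why the helper falls back to an empty string.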

### Additional Resources:
- Model Card: https://huggingface.co/cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit
- Base Model: https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking
- Demo Space: https://huggingface.co/spaces/baidu/ERNIE-4.5-VL-28B-A3B-Thinking
- AutoAWQ: https://github.com/casper-hansen/AutoAWQ
- vLLM: https://github.com/vllm-project/vllm
- Transformers Docs: https://huggingface.co/docs/transformers

## 8. vLLM Inference (Recommended)

This section serves the same model via vLLM using the OpenAI-compatible API. It avoids the Transformers compressed-tensors path and is the configuration recommended by the model card.

Steps:
1. Install vLLM (will install a matching CUDA+PyTorch build)
2. Start the server in the background and wait for it to load
3. Send a test chat completion request
4. (Optional) Stop the server


In [None]:
# Install vLLM (nightly wheels) with plain pip; --pre allows pre-release builds
!pip install -q -U --pre vllm \
  --extra-index-url https://wheels.vllm.ai/nightly

import sys
print(f"Python: {sys.version}")

# Sanity-check import
try:
    import vllm
    import torch
    print(f"✓ vLLM version: {getattr(vllm, '__version__', 'unknown')}")
    print(f"✓ Torch version: {torch.__version__}")
except Exception as e:
    import traceback
    print("⚠ Error importing vllm/torch after install:", e)
    traceback.print_exc()

In [None]:
# Start vLLM server in background and wait until ready
import subprocess, time, requests, os, sys

MODEL_ID = globals().get("MODEL_ID", "cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit")
PORT = int(os.environ.get("VLLM_PORT", "8000"))
HOST = os.environ.get("VLLM_HOST", "127.0.0.1")

cmd = [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", MODEL_ID,
    "--host", HOST,
    "--port", str(PORT),
    "--trust-remote-code",
]

# Add optional tuning flags via env vars
gpu_util = os.environ.get("VLLM_GPU_UTILIZATION")
if gpu_util:
    cmd += ["--gpu-memory-utilization", gpu_util]

print("Launching:", " ".join(cmd))

# Write logs to file to keep cell responsive
log_path = "/tmp/vllm_server.log"
with open(log_path, "w") as log_f:
    proc = subprocess.Popen(cmd, stdout=log_f, stderr=subprocess.STDOUT)
    
os.environ["VLLM_PROC_PID"] = str(proc.pid)
print(f"vLLM server starting (pid={proc.pid}). Logs: {log_path}")

# Give it a moment to start, then check if it's still alive
time.sleep(3)
if proc.poll() is not None:
    print(f"⚠ Process exited early with code {proc.returncode}. Check logs:")
    with open(log_path, "r") as f:
        print(f.read())
else:
    print("✓ Process still running, waiting for HTTP readiness...")

# Poll server readiness
base_url = f"http://{HOST}:{PORT}"
ready = False
for i in range(180):  # up to ~3 minutes; increase if the first run also has to download the weights
    # First check if process is still alive
    if proc.poll() is not None:
        print(f"⚠ Server process died (exit code {proc.returncode}). Check logs.")
        break
        
    try:
        r = requests.get(base_url + "/v1/models", timeout=2)
        if r.status_code == 200:
            ready = True
            break
    except requests.exceptions.RequestException:
        pass  # server not accepting connections yet; keep polling
    time.sleep(1)

if ready:
    print("✓ vLLM server ready!")
else:
    print("⚠ vLLM server not ready. Check logs with the log viewer cell.")

In [None]:
# vLLM client: text-only chat completion
import requests, json, os

PORT = int(os.environ.get("VLLM_PORT", "8000"))
HOST = os.environ.get("VLLM_HOST", "127.0.0.1")
base_url = f"http://{HOST}:{PORT}"
headers = {"Content-Type": "application/json"}
MODEL_ID = globals().get("MODEL_ID", "cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit")

# Health check first
try:
    health_r = requests.get(base_url + "/v1/models", timeout=5)
    if health_r.status_code != 200:
        print(f"⚠ Server health check failed: {health_r.status_code}")
        print("Server may not be ready. Run the log viewer to debug.")
        raise Exception(f"Health check failed: {health_r.status_code}")
    else:
        print("✓ Server health check passed")
except requests.exceptions.ConnectionError:
    print("⚠ Cannot connect to vLLM server. Is it running?")
    print("Run the server start cell first, then check logs if it fails.")
    raise

payload = {
    "model": MODEL_ID,
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
}

print("Sending chat completion request...")
r = requests.post(base_url + "/v1/chat/completions", headers=headers, data=json.dumps(payload), timeout=180)
print(f"Status: {r.status_code}")

if r.status_code == 200:
    response = r.json()
    print("\nResponse:")
    print(response["choices"][0]["message"]["content"])
else:
    print(f"Error: {r.status_code}")
    print(r.text)

In [None]:
# (Optional) vLLM client: simple image+text example
import requests, json, os

PORT = int(os.environ.get("VLLM_PORT", "8000"))
HOST = os.environ.get("VLLM_HOST", "127.0.0.1")
base_url = f"http://{HOST}:{PORT}"
headers = {"Content-Type": "application/json"}
MODEL_ID = globals().get("MODEL_ID", "cyankiwi/ERNIE-4.5-VL-28B-A3B-Thinking-AWQ-4bit")

# Provide a publicly accessible image URL
image_url = "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"

payload = {
    "model": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What color clothes is the girl in the picture wearing?"},
                {"type": "image_url", "image_url": {"url": image_url}}
            ]
        }
    ],
    "max_tokens": 256,
    "temperature": 0.2
}

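r = requests.post(base_url + "/v1/chat/completions", headers=headers, data=json.dumps(payload), timeout=180)
print(f"Status: {r.status_code}")
if r.status_code == 200:
    print(r.json()["choices"][0]["message"]["content"])
else:
    # Print the error body instead of failing with a KeyError on "choices"
    print(r.text)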
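
The request above needs an image URL that the server can reach. For a local file, the `image_url` field can carry a base64 data URL instead. A minimal sketch using only the standard library follows; the helper name is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL for the image_url field."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

To use it, set `image_url = image_to_data_url("photo.jpg")` (with a path of your own) before building the payload. Base64 inflates the payload by about a third, so large images are better resized first.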

In [None]:
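# Stop vLLM server and free resources
import os, signal, subprocess
proc = globals().get("proc")  # Popen handle from the server start cell, if still in scope
pid = int(os.environ.get("VLLM_PROC_PID", "0"))
if proc is not None and proc.poll() is None:
    proc.terminate()  # sends SIGTERM
    try:
        proc.wait(timeout=30)  # confirm the process actually exited
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    print(f"✓ Terminated vLLM server (pid={proc.pid})")
elif pid > 0:
    try:
        os.kill(pid, signal.SIGTERM)
        print(f"✓ Sent SIGTERM to vLLM server (pid={pid})")
    except OSError as e:
        print(f"Could not terminate server pid={pid}: {e}")
else:
    # The server was launched as a Python module, so match on the module name
    print("No vLLM server PID found. If needed, run: !pkill -f 'vllm.entrypoints.openai.api_server'")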

In [None]:
# View last 100 lines of vLLM server logs (for debugging)
import os
log_path = "/tmp/vllm_server.log"
if os.path.exists(log_path):
    with open(log_path, "r") as f:
        lines = f.readlines()
        for line in lines[-100:]:
            print(line.rstrip())
else:
    print("Log file not found. Start the server first.")