# Colab vLLM Setup with ngrok and GPU Monitoring

This notebook starts a vLLM server for a small Qwen model (0.5B preferred, with 1.5B as a fallback), exposes it through an ngrok tunnel, and monitors GPU usage with nvidia-smi (used here in place of nvtop on Colab).

In [None]:
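# Install required packages. vLLM installs the PyTorch build it is pinned against,
# so torch is not listed separately. Pointing --index-url at the PyTorch wheel index
# would replace PyPI, where vllm and pyngrok are hosted.
!pip install vllm pyngrok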

In [None]:
# Set up ngrok (replace YOUR_NGROK_AUTH_TOKEN with your actual token)
from pyngrok import ngrok
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")  # Get from https://dashboard.ngrok.com/get-started/your-authtoken

# Start ngrok tunnel for port 8000
tunnel = ngrok.connect(8000)
public_url = tunnel.public_url  # connect() returns an NgrokTunnel object, not a string
print(f"Public URL: {public_url}")
print(f"Set VLLM_BASE_URL={public_url}/v1 in your local environment.")

In [None]:
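# Download and run vLLM server with Qwen 0.5B model
# If the 0.5B model fails to start, fall back to 1.5B
import subprocess
import sys
import threading  # Used by the GPU monitoring cell further down
import time

import requests

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
fallback_model = "Qwen/Qwen2.5-1.5B-Instruct"
server_process = None

def start_vllm(model):
    """Launch the OpenAI-compatible vLLM server as a background process."""
    global server_process
    server_process = subprocess.Popen([
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--host", "0.0.0.0",
        "--port", "8000",
        "--max-model-len", "2048",
        "--dtype", "float16",
        "--gpu-memory-utilization", "0.9"
    ])
    return server_process

def wait_for_server(timeout=600, interval=5):
    """Poll /health until the server answers, exits, or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if server_process.poll() is not None:
            raise RuntimeError(f"vLLM exited with code {server_process.returncode}")
        try:
            if requests.get("http://localhost:8000/health", timeout=5).status_code == 200:
                return
        except requests.exceptions.RequestException:
            pass  # Server is still downloading or loading the model
        time.sleep(interval)
    raise TimeoutError(f"vLLM not healthy after {timeout} seconds")

# Try 0.5B first
try:
    print("Starting vLLM with Qwen2.5-0.5B-Instruct...")
    start_vllm(model_name)
    wait_for_server()
    print("vLLM server started successfully with 0.5B model.")
except Exception as e:
    print(f"0.5B failed: {e}. Trying 1.5B...")
    if server_process:
        server_process.terminate()
        server_process.wait()  # Free port 8000 and GPU memory before restarting
    print("Starting vLLM with Qwen2.5-1.5B-Instruct...")
    start_vllm(fallback_model)
    wait_for_server()
    print("vLLM server started with 1.5B model.")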

In [None]:
# One-off GPU snapshot with nvidia-smi; re-run this cell to refresh.
# `!watch -n 1 nvidia-smi` blocks the cell until interrupted, so use the
# loop in the next cell for continuous monitoring.
!nvidia-smi

In [None]:
# Continuous GPU monitoring in a background thread (stop with stop_monitor.set())
import subprocess
import threading

stop_monitor = threading.Event()

def monitor_gpu(interval=5):
    while not stop_monitor.is_set():
        result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
        print(result.stdout)
        stop_monitor.wait(interval)  # Update every 5 seconds; returns early once stopped

monitor_thread = threading.Thread(target=monitor_gpu, daemon=True)
monitor_thread.start()
print("GPU monitoring started. Run stop_monitor.set() in another cell to stop.")

In [None]:
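# Test the server
import requests

# vLLM serves the model under the name passed to --model, so read it from the server
model_id = requests.get("http://localhost:8000/v1/models").json()["data"][0]["id"]

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": model_id,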
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 50,
        "temperature": 0.8
    }
)
if response.status_code == 200:
    print("Test successful:")
    print(response.json()['choices'][0]['message']['content'])
else:
    print(f"Test failed: {response.status_code} - {response.text}")

## Instructions
1. Replace `YOUR_NGROK_AUTH_TOKEN` with your actual ngrok token.
2. Run the cells in order. The server will download the model (may take a few minutes).
3. Note the public URL printed (e.g., https://abc.ngrok.io). Set `export VLLM_BASE_URL=https://abc.ngrok.io/v1` in your local terminal.
4. For GPU monitoring, run the nvidia-smi cell or the continuous monitor. (nvtop can be installed with `!apt install nvtop`, but nvidia-smi is more reliable in Colab.)
5. In local code, call `configure_api_client(dry_run=False, base_url=os.environ.get('VLLM_BASE_URL'))` before running the debate tournament.
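6. Keep this Colab runtime active while running local code. Select a GPU runtime such as T4 (Runtime > Change runtime type) before running the cells.
7. To stop: interrupt the kernel if a cell is still running, then run `server_process.terminate()` to shut down the vLLM server and `ngrok.kill()` to close the tunnel.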
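
The local side of steps 3 and 5 can be checked with a small standalone script before wiring the URL into the debate tournament code. The following is a minimal sketch, not a replacement for `configure_api_client`: the helper names `chat_url`, `build_payload`, `ask`, and `main` are hypothetical, and it assumes `VLLM_BASE_URL` already ends in `/v1` and that the server is running the 0.5B model.

```python
import json
import os
import urllib.request


def chat_url(base_url: str) -> str:
    """Append the chat completions route to a base URL that already ends in /v1."""
    return base_url.rstrip("/") + "/chat/completions"


def build_payload(model: str, prompt: str, max_tokens: int = 50) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST one prompt to the tunneled server and return the reply text."""
    request = urllib.request.Request(
        chat_url(base_url),
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


def main() -> None:
    """Read VLLM_BASE_URL from the environment and send a single test prompt."""
    base_url = os.environ.get("VLLM_BASE_URL")
    if not base_url:
        raise SystemExit("VLLM_BASE_URL is not set")
    # Assumption: the model name matches the one passed to --model on the server.
    print(ask(base_url, "Qwen/Qwen2.5-0.5B-Instruct", "Hello, how are you?"))
```

Calling `main()` after `export VLLM_BASE_URL=...` should print a short reply if the tunnel and server are both up. If the server fell back to the 1.5B model, pass `Qwen/Qwen2.5-1.5B-Instruct` instead.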