# üöÄ Kaggle Buffer Test Runner - All-in-One

This notebook loads vLLM, registers it with the backend, and runs buffer tests.
**Run all cells in order - no separate notebook needed!**

## What this does:
1. ‚úÖ Installs vLLM dependencies
2. ‚úÖ Loads Qwen 14B AWQ model on Kaggle GPUs
3. ‚úÖ Clones repo and configures git
4. ‚úÖ Registers vLLM with backend (for responses, summarization, AND judging)
5. ‚úÖ Runs buffer tests (sizes: 5, 10, 20, 40)
6. ‚úÖ Auto-pushes results to GitHub after each buffer

## Important:
- This is a **single notebook** - no need for separate notebooks
- vLLM model stays in memory for all operations
- No server needed - tests use direct Python imports

In [1]:
# Cell 1: Install vLLM dependencies
print("="*60)
print("üì¶ INSTALLING vLLM DEPENDENCIES")
print("="*60)

! uv pip uninstall -q --system 'tensorflow'
! uv pip install -q --system 'vllm' 'triton==3.2.0' 'logits-processor-zoo' 'numpy<2'

print("‚úÖ Dependencies installed")
print("="*60)

üì¶ INSTALLING vLLM DEPENDENCIES
‚úÖ Dependencies installed


In [2]:
# Cell 2: Import libraries
import os
import shutil
import subprocess
import sys
import numpy as np
import pandas as pd
import torch
import vllm
from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor

print("="*60)
print("üìö LIBRARIES IMPORTED")
print("="*60)
print(f"‚úÖ PyTorch version: {torch.__version__}")
print(f"‚úÖ CUDA available: {torch.cuda.is_available()}")
print(f"üéÆ GPUs available: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"   GPU {i}: {torch.cuda.get_device_name(i)}")
print("="*60)

INFO 12-19 13:46:18 [__init__.py:239] Automatically detected platform cuda.
üìö LIBRARIES IMPORTED
‚úÖ PyTorch version: 2.6.0+cu124
‚úÖ CUDA available: True
üéÆ GPUs available: 2
   GPU 0: Tesla T4
   GPU 1: Tesla T4


In [3]:
# Cell 3: Load Kaggle secrets and set environment
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

# Load all secrets
os.environ["GITHUB_TOKEN"] = user_secrets.get_secret("GITHUB_TOKEN")
os.environ["GROQ_API_KEY"] = user_secrets.get_secret("GROQ_API_KEY")
os.environ["HuggingFACEHUB_access_token"] = user_secrets.get_secret("HuggingFACEHUB_access_token")
os.environ["LANGCHAIN_API_KEY"] = user_secrets.get_secret("LANGCHAIN_API_KEY")

# Set vLLM configuration
os.environ["LLM_BACKEND"] = "vllm"
model_path = "/kaggle/input/qwen2.5/transformers/14b-instruct-awq/1"
os.environ["VLLM_MODEL_PATH"] = model_path
os.environ["VLLM_USE_V1"] = "0"

print("="*60)
print("üîê SECRETS AND CONFIGURATION LOADED")
print("="*60)
print(f"‚úÖ GITHUB_TOKEN: {os.environ['GITHUB_TOKEN'][:4]}...{os.environ['GITHUB_TOKEN'][-4:]}")
print(f"‚úÖ LLM_BACKEND: vllm")
print(f"‚úÖ VLLM_MODEL_PATH: {model_path}")
print("="*60)

üîê SECRETS AND CONFIGURATION LOADED
‚úÖ GITHUB_TOKEN: gith...tWfg
‚úÖ LLM_BACKEND: vllm
‚úÖ VLLM_MODEL_PATH: /kaggle/input/qwen2.5/transformers/14b-instruct-awq/1


In [4]:
# Cell 4: Load vLLM model on Kaggle GPUs (takes 2-3 minutes)
print("="*60)
print("üöÄ LOADING vLLM MODEL")
print("="*60)
print(f"üìÇ Model: {model_path}")
print(f"üéÆ GPUs: {torch.cuda.device_count()}")
print("‚è≥ This takes 2-3 minutes...")
print("="*60)

llm = vllm.LLM(
    model_path,
    quantization='awq',
    tensor_parallel_size=torch.cuda.device_count(),
    gpu_memory_utilization=0.91,
    trust_remote_code=True,
    dtype="half",
    enforce_eager=True,
    max_model_len=5120,
    disable_log_stats=True,
    enable_prefix_caching=True
)
tokenizer = llm.get_tokenizer()

print("\n" + "="*60)
print("‚úÖ vLLM MODEL LOADED SUCCESSFULLY!")
print("="*60)
print(f"   Memory per GPU: ~{torch.cuda.get_device_properties(0).total_memory / 1024**3 * 0.91:.1f}GB used")
print("="*60)

üöÄ LOADING vLLM MODEL
üìÇ Model: /kaggle/input/qwen2.5/transformers/14b-instruct-awq/1
üéÆ GPUs: 2
‚è≥ This takes 2-3 minutes...
INFO 12-19 13:46:48 [config.py:717] This model supports multiple tasks: {'generate', 'embed', 'reward', 'score', 'classify'}. Defaulting to 'generate'.
INFO 12-19 13:46:49 [config.py:1770] Defaulting to use mp for distributed inference
INFO 12-19 13:46:49 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/kaggle/input/qwen2.5/transformers/14b-instruct-awq/1', speculative_config=None, tokenizer='/kaggle/input/qwen2.5/transformers/14b-instruct-awq/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=5120, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto,  device_c

[W1219 13:47:07.340058593 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W1219 13:47:07.340919295 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 12-19 13:47:08 [utils.py:1055] Found nccl from library libnccl.so.2
INFO 12-19 13:47:08 [pynccl.py:69] vLLM is using nccl==2.21.5
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:47:08 [utils.py:1055] Found nccl from library libnccl.so.2
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:47:08 [pynccl.py:69] vLLM is using nccl==2.21.5


[W1219 13:47:07.605848240 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W1219 13:47:07.606563841 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 12-19 13:47:08 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 12-19 13:47:33 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:47:33 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 12-19 13:47:33 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_db29915c'), local_subscribe_addr='ipc:///tmp/21cb049e-6bd4-482f-8c27-4cc0fb3eba3e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 12-19 13:47:33 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:47:33 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP ra

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]


[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:48:55 [loader.py:458] Loading weights took 82.15 seconds
INFO 12-19 13:48:55 [loader.py:458] Loading weights took 82.26 seconds
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:48:56 [model_runner.py:1140] Model loading took 4.6720 GiB and 82.476138 seconds
INFO 12-19 13:48:56 [model_runner.py:1140] Model loading took 4.6720 GiB and 82.591390 seconds
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:49:07 [worker.py:287] Memory profiling takes 10.33 seconds
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:49:07 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.91) = 13.41GiB
[1;36m(VllmWorkerProcess pid=163)[0;0m INFO 12-19 13:49:07 [worker.py:287] model weights take 4.67GiB; non_torch_memory takes 0.10GiB; PyTorch activation peak memory takes 0.47GiB; the rest of the memory reserved for KV Cache is 8.17GiB.
INFO 12-19 13:49:07 [worker.py:287] Memory profili

In [5]:
# Cell 5: Clone repository and configure git
REPO_URL = "https://github.com/moonmehedi/Subchat-Trees-A-Scalable-Architecture-for-Multi-Threaded-Dialogue-and-Context-Isolation-in-LLM.git"
REPO_DIR = "Subchat-Trees"
BRANCH = "kaggle-run"

print("="*60)
print("üì• CLONING REPOSITORY")
print("="*60)

# Remove existing directory if present
if os.path.exists(REPO_DIR):
    print(f"‚ö†Ô∏è  Removing existing {REPO_DIR} directory...")
    shutil.rmtree(REPO_DIR)

# Clone with LFS skip to save bandwidth
clone_env = os.environ.copy()
clone_env["GIT_LFS_SKIP_SMUDGE"] = "1"

result = subprocess.run(
    ["git", "clone", "-b", BRANCH, "--single-branch", REPO_URL, REPO_DIR],
    capture_output=True,
    text=True,
    env=clone_env
)

if result.returncode == 0:
    print(f"‚úÖ Cloned {BRANCH} branch!")
    
    # Pull LFS files for scenarios
    os.chdir(REPO_DIR)
    subprocess.run(
        ["git", "lfs", "pull", "--include=backend/dataset/scenarios/*.json"],
        capture_output=True,
        text=True
    )
    print("‚úÖ Pulled scenario files from Git LFS")
    
    # Configure git identity
    subprocess.run(["git", "config", "user.name", "moonmehedi"], check=True)
    subprocess.run(["git", "config", "user.email", "the.mehedi.hasan.moon@gmail.com"], check=True)
    print("‚úÖ Git identity configured")
    
    os.chdir("..")
else:
    print(f"‚ùå Clone failed: {result.stderr}")

print("="*60)

üì• CLONING REPOSITORY
‚úÖ Cloned kaggle-run branch!
‚úÖ Pulled scenario files from Git LFS
‚úÖ Git identity configured


In [6]:
# Cell 6: Register vLLM with backend
sys.path.insert(0, os.path.join(REPO_DIR, "backend"))

from src.services.vllm_client import VLLMClient

print("="*60)
print("üîó REGISTERING vLLM WITH BACKEND")
print("="*60)

VLLMClient.set_model(llm)

print(f"‚úÖ vLLM registered: {VLLMClient.is_available()}")
print("   ‚úÖ Response generation will use vLLM")
print("   ‚úÖ Summarization will use vLLM")
print("   ‚úÖ Judge/Classification will use vLLM")
print("="*60)

üîó REGISTERING vLLM WITH BACKEND
‚úÖ vLLM model registered: /kaggle/input/qwen2.5/transformers/14b-instruct-awq/1
‚úÖ vLLM registered: True
   ‚úÖ Response generation will use vLLM
   ‚úÖ Summarization will use vLLM
   ‚úÖ Judge/Classification will use vLLM


In [7]:
# Cell 7: Install backend requirements
print("="*60)
print("üì¶ INSTALLING BACKEND REQUIREMENTS")
print("="*60)

! pip install -q -r /kaggle/working/Subchat-Trees/backend/requirements.txt

print("‚úÖ Backend requirements installed")
print("="*60)

üì¶ INSTALLING BACKEND REQUIREMENTS
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m67.3/67.3 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m4.2/4.2 MB[0m [31m55.1 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m138.3/138.3 kB[0m [31m11.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m61.6/61.6 kB[0m [31m3.9 MB

In [8]:
# Cell 8: Quick test to verify vLLM integration works
from src.services.simple_llm import SimpleLLMClient
from src.models.tree import TreeNode

print("="*60)
print("üß™ QUICK INTEGRATION TEST")
print("="*60)

llm_client = SimpleLLMClient()
root = TreeNode(node_id="test", title="Test", buffer_size=5, llm_client=llm_client)
root.buffer.add_message("user", "Hello")

response = llm_client.generate_response(root, "What is 2+2?")
print(f"‚úÖ Test response: {response[:100]}...")
print("‚úÖ vLLM integration working!")
print("="*60)

‚úÖ Using vLLM backend with Kaggle GPU: /kaggle/input/qwen2.5/transformers/14b-instruct-awq/1
üß™ QUICK INTEGRATION TEST
‚úÖ vLLM connected for RESPONSES: /kaggle/input/qwen2.5/transformers/14b-instruct-awq/1
‚úÖ vLLM will be used for SUMMARIZATION: /kaggle/input/qwen2.5/transformers/14b-instruct-awq/1
üìä Buffer size: 5 messages | Summarization will trigger every 5 messages
üìã Buffer (1/5): Last 3 messages (full log in file)
   1. [user] Hello
*******************context*********************
 [{'role': 'user', 'content': 'Hello'}]


Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

‚úÖ Test response: Hello! How can I assist you today?...
‚úÖ vLLM integration working!


In [9]:
# Cell 9: RUN BUFFER TESTS
# Using exec() to run in the SAME process so it can access the loaded vLLM model

print("="*60)
print("üöÄ RUNNING BUFFER TESTS")
print("="*60)
print("üìä Testing buffer sizes: 5, 10, 20, 40")
print("üì§ Results will auto-push to GitHub after each buffer")
print("‚è≥ This may take several hours depending on scenario count")
print("="*60)

os.chdir("/kaggle/working/Subchat-Trees/backend")

# Run the test script in the SAME process (so it can access the loaded vLLM model)
exec(open("dataset/kaggle_buffer_test_runner.py").read())

üöÄ RUNNING BUFFER TESTS
üìä Testing buffer sizes: 5, 10, 20, 40
üì§ Results will auto-push to GitHub after each buffer
‚è≥ This may take several hours depending on scenario count


NameError: name '__file__' is not defined

In [None]:
# Cell 10: Completion and cleanup (optional)
print("="*60)
print("‚úÖ BUFFER TESTS COMPLETE!")
print("="*60)
print("üì§ All results have been pushed to GitHub")
print("üìä Check the kaggle_logs/ directory in your repo")
print("")
print("üí° You can now stop the kernel to save GPU quota")
print("   Uncomment the line below to auto-shutdown:")
print("="*60)

# Uncomment to force shutdown and save GPU quota:
# import os; os._exit(0)