# 🚀 Hierarchical Subchat System - Kaggle GPU Testing

## 📋 Setup Checklist (Do Once):

### 1️⃣ **Add Kaggle Secrets** (Most Important!)
Go to: **https://www.kaggle.com/settings** → Add-ons → Secrets

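Add these two secrets (the first secrets cell below also reads `HuggingFACEHUB_access_token` and `LANGCHAIN_API_KEY`, so add those as well if you run that cell):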
- **`GROQ_API_KEY`** = Your Groq API key (for query decomposition)
- **`GITHUB_TOKEN`** = Your GitHub personal access token (for pushing results)

### 2️⃣ **Enable Internet in This Notebook**
- Click "⚙️ Settings" (top right)
- Turn ON **"Internet"** toggle
- Click "Save"

### 3️⃣ **Make Sure This Notebook is PRIVATE**
- Never share secrets in public notebooks!

---

## ▶️ Run Order:
1. **Cells 2-7**: Load secrets and libraries
2. **Cell 8**: Set environment variables (LLM_BACKEND=vllm)
3. **Cell 9**: Load vLLM model on Kaggle GPUs (Qwen-3 14B AWQ) - **Takes 2-3 minutes**
4. **Cells 15-17**: Clone repo, configure git, run test log push
5. **Cell 21**: Integrate vLLM with backend (registers model globally)
6. **Cell 23**: Test vLLM integration
7. **Run your actual tests**: Execute test scripts in backend/dataset/
8. **Push results**: Use git_commit_and_push() function to sync to GitHub

---

## 🎯 What This Notebook Does:
- ✅ Loads vLLM model (Qwen-3 14B) on Kaggle's 2x GPUs with AWQ quantization
- ✅ Integrates vLLM with your backend via `VLLMClient` singleton
- ✅ Runs hierarchical subchat tests using GPU-accelerated inference
- ✅ Generates performance logs for buffer size analysis
- ✅ Automatically syncs results back to your GitHub repo

---

## 🔧 Technical Details:
- **vLLM Config**: Tensor parallelism across both GPUs, 91% memory utilization
- **Backend Mode**: `LLM_BACKEND=vllm` (set in cell 8)
- **Model**: Qwen-3 14B AWQ (5120 max tokens, prefix caching enabled)
- **No .env needed**: Secrets loaded from Kaggle environment
- **Git Workflow**: Clone → Test → Push results to `kaggle-run` branch

In [13]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/qwen-3/transformers/32b-awq/1/model.safetensors.index.json
/kaggle/input/qwen-3/transformers/32b-awq/1/model-00003-of-00004.safetensors
/kaggle/input/qwen-3/transformers/32b-awq/1/config.json
/kaggle/input/qwen-3/transformers/32b-awq/1/merges.txt
/kaggle/input/qwen-3/transformers/32b-awq/1/LICENSE
/kaggle/input/qwen-3/transformers/32b-awq/1/model-00001-of-00004.safetensors
/kaggle/input/qwen-3/transformers/32b-awq/1/README.md
/kaggle/input/qwen-3/transformers/32b-awq/1/tokenizer.json
/kaggle/input/qwen-3/transformers/32b-awq/1/vocab.json
/kaggle/input/qwen-3/transformers/32b-awq/1/tokenizer_config.json
/kaggle/input/qwen-3/transformers/32b-awq/1/model-00004-of-00004.safetensors
/kaggle/input/qwen-3/transformers/32b-awq/1/model-00002-of-00004.safetensors
/kaggle/input/qwen-3/transformers/32b-awq/1/generation_config.json
/kaggle/input/qwen-3/transformers/14b-awq/1/model.safetensors.index.json
/kaggle/input/qwen-3/transformers/14b-awq/1/config.json
/kaggle/input/qwen-3/trans

# CHECKING THAT MY GPU IS WORKING

*** repo: https://github.com/moonmehedi/Subchat-Trees-A-Scalable-Architecture-for-Multi-Threaded-Dialogue-and-Context-Isolation-in-LLM ***


In [14]:
! uv pip uninstall -q --system 'tensorflow'
! uv pip install -q --system  'vllm' 'triton' 'logits-processor-zoo' 'numpy<2'

In [15]:
import os
import re
import logging
from pathlib import Path
import pickle
import json
import joblib
import shutil
import glob
from tqdm.auto import tqdm
import warnings

import numpy as np
import pandas as pd



# For Qwen
import torch
import vllm
from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor


In [16]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("GITHUB_TOKEN")
secret_value_1 = user_secrets.get_secret("GROQ_API_KEY")
secret_value_2 = user_secrets.get_secret("HuggingFACEHUB_access_token")
secret_value_3 = user_secrets.get_secret("LANGCHAIN_API_KEY")

# ✅ IMPORTANT: Set them in os.environ so other code can access them
os.environ["GITHUB_TOKEN"] = secret_value_0
os.environ["GROQ_API_KEY"] = secret_value_1
os.environ["HuggingFACEHUB_access_token"] = secret_value_2
os.environ["LANGCHAIN_API_KEY"] = secret_value_3
os.environ["LLM_BACKEND"] = "vllm"

# Print the tokens (first 4 and last 4 characters for security)
print("="*60)
print("🔐 SECRETS LOADED AND SET IN ENVIRONMENT")
print("="*60)
print(f"✅ GITHUB_TOKEN: {secret_value_0[:4]}...{secret_value_0[-4:]}")
print(f"✅ GROQ_API_KEY: {secret_value_1[:4]}...{secret_value_1[-4:]}")
print(f"✅ HuggingFACEHUB_access_token: {secret_value_2[:4]}...{secret_value_2[-4:]}")
print(f"✅ LANGCHAIN_API_KEY: {secret_value_3[:4]}...{secret_value_3[-4:]}")
print(f"✅ LLM_BACKEND: vllm")
print("="*60)

🔐 SECRETS LOADED AND SET IN ENVIRONMENT
✅ GITHUB_TOKEN: gith...tWfg
✅ GROQ_API_KEY: gsk_...l6gr
✅ HuggingFACEHUB_access_token: hf_E...GaQC
✅ LANGCHAIN_API_KEY: lsv2...ea2f
✅ LLM_BACKEND: vllm
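
The cell above prints the first and last four characters of each secret. A more conservative option is to report only whether a secret is set and how long it is. The sketch below is one possible way to do that; `mask_secret`, `load_secrets`, and the stand-in `_fake_store` are hypothetical names introduced for illustration, and on Kaggle the getter would presumably be `UserSecretsClient().get_secret`.

```python
from typing import Callable, Dict, Iterable, Optional


def mask_secret(value: Optional[str]) -> str:
    """Describe a secret without revealing any of its characters."""
    if not value:
        return "<missing>"
    return f"<set, {len(value)} chars>"


def load_secrets(names: Iterable[str],
                 getter: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Fetch each named secret, recording None when the getter fails."""
    loaded: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            loaded[name] = getter(name)
        except Exception:
            # A missing secret should not abort the whole cell.
            loaded[name] = None
    return loaded


# Stand-in for the Kaggle secrets store (assumption, for demonstration only).
_fake_store = {"GROQ_API_KEY": "gsk_example_value"}

secrets = load_secrets(["GROQ_API_KEY", "GITHUB_TOKEN"], _fake_store.__getitem__)
for name, value in secrets.items():
    print(f"{name}: {mask_secret(value)}")
```

Passing the getter in as an argument keeps the helper usable outside Kaggle, where `kaggle_secrets` is not importable.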


In [17]:
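# vLLM V1 did not accept per-request logits processors when this cell was written,
# so request the legacy V0 engine instead:
# https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html#deprecated-features
# Note: the log below still reports "Initializing a V1 LLM engine (v0.12.0)",
# so this setting had no effect with the vLLM version that was installed.
os.environ["VLLM_USE_V1"] = "0"

# model_path = "/kaggle/input/qwen2.5-coder/transformers/32b-instruct-awq/1"
# Use the 14B model (32B causes linker failures with libcuda on Kaggle T4 GPUs)
model_path = "/kaggle/input/qwen-3/transformers/14b-awq/1"
llm = vllm.LLM(
    model_path,
    quantization='awq',
    tensor_parallel_size=torch.cuda.device_count(),  # shard across all visible GPUs
    gpu_memory_utilization=0.91,  # fraction of each GPU's memory vLLM may use
    trust_remote_code=True,
    dtype="half",
    enforce_eager=True,  # skip CUDA graph capture
    max_model_len=5120,  # maximum context length in tokens
    disable_log_stats=True,
    enable_prefix_caching=True  # reuse cached KV blocks for shared prompt prefixes
)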

INFO 12-13 17:22:16 [utils.py:253] non-default args: {'trust_remote_code': True, 'dtype': 'half', 'seed': None, 'max_model_len': 5120, 'tensor_parallel_size': 2, 'enable_prefix_caching': True, 'gpu_memory_utilization': 0.91, 'disable_log_stats': True, 'quantization': 'awq', 'enforce_eager': True, 'model': '/kaggle/input/qwen-3/transformers/32b-awq/1'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 12-13 17:22:16 [model.py:637] Resolved architecture: Qwen3ForCausalLM
INFO 12-13 17:22:16 [model.py:1750] Using max model len 5120
INFO 12-13 17:22:16 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 12-13 17:22:16 [vllm.py:707] Cudagraph is disabled under eager mode

(EngineCore_DP0 pid=1743) INFO 12-13 17:22:26 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='/kaggle/input/qwen-3/transformers/32b-awq/1', speculative_config=None, tokenizer='/kaggle/input/qwen-3/transformers/32b-awq/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=5120, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None,

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:07<00:23,  7.85s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:21<00:23, 11.52s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:36<00:12, 12.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:48<00:00, 12.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:48<00:00, 12.03s/it]

(Worker_TP0 pid=1754) INFO 12-13 17:23:25 [default_loader.py:308] Loading weights took 48.24 seconds
(Worker_TP0 pid=1754) INFO 12-13 17:23:25 [gpu_model_runner.py:3549] Model loading took 9.0570 GiB memory and 48.815444 seconds
(Worker_TP0 pid=1754) INFO 12-13 17:23:53 [gpu_worker.py:359] Available KV cache memory: 2.84 GiB
(EngineCore_DP0 pid=1743) INFO 12-13 17:23:54 [kv_cache_utils.py:1286] GPU KV cache size: 23,264 tokens
(EngineCore_DP0 pid=1743) INFO 12-13 17:23:54 [kv_cache_utils.py:1291] Maximum concurrency for 5,120 tokens per request: 4.54x
(Worker_TP0 pid=1754) INFO 12-13 17:23:54 [kernel_warmup.py:65] Warming up FlashInfer attention.
(Worker_

(EngineCore_DP0 pid=1743) Process EngineCore_DP0:
(EngineCore_DP0 pid=1743) Traceback (most recent call last):
(EngineCore_DP0 pid=1743)   File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1743)     self.run()
(EngineCore_DP0 pid=1743)   File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1743)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1743)   File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 847, in run_engine_core
(EngineCore_DP0 pid=1743)     raise e
(EngineCore_DP0 pid=1743)   File "/usr/local/lib/python3.11/dist-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=1743)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1743)                   ^^^^^^^^^^^^

RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
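
The "Maximum concurrency" figure in the log above is consistent with dividing the reported KV cache capacity by the maximum request length. The sketch below reproduces that arithmetic from the two numbers printed in the log; `max_concurrency` is an illustrative helper, not a vLLM function, and the formula is inferred from the log rather than taken from vLLM's source.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """Estimate how many full-length requests fit in the KV cache at once."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return kv_cache_tokens / max_model_len


# Values copied from the log: 23,264 cached tokens, 5,120 tokens per request.
ratio = max_concurrency(23_264, 5_120)
print(f"Maximum concurrency: {ratio:.2f}x")
```

Under this reading, lowering `max_model_len` or freeing more memory for the KV cache would raise the number of requests that can be served concurrently.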

In [None]:
from vllm import SamplingParams

def stream_generate(llm, prompt):
    """Yield the generated text for `prompt`.

    Note: the offline `LLM.generate` call returns only after generation has
    finished, so this yields each completed text once, not token by token.
    """
    sampling_params = SamplingParams(
        temperature=0.2,
        top_p=0.9,
        max_tokens=512,
    )

    for output in llm.generate([prompt], sampling_params):
        yield output.outputs[0].text


# ── Usage ────────────────────────────────────────────────
prompt = """You are a helpful assistant.
User: Explain tensor parallelism in simple terms.
Assistant:"""

for text in stream_generate(llm, prompt):
    print(text, end="", flush=True)


NameError: name 'llm' is not defined

In [None]:
prompt = """what is quantum computing?"""

for token in stream_generate(llm, prompt):
    print(token, end="", flush=True)

In [None]:
print('h')

h


# Test GitHub

In [None]:
# Load secrets from Kaggle's secure environment
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

print("="*60)
print("🔐 LOADING SECRETS FROM KAGGLE")
print("="*60)

# Try to load GROQ_API_KEY
try:
    GROQ_API_KEY = user_secrets.get_secret("GROQ_API_KEY")
    os.environ["GROQ_API_KEY"] = GROQ_API_KEY
    print("✅ GROQ_API_KEY loaded successfully")
    print(f"   Key length: {len(GROQ_API_KEY)} characters")
except Exception as e:
    print(f"⚠️  GROQ_API_KEY not found: {e}")
    print("   Add it in Kaggle Settings → Secrets")
    GROQ_API_KEY = None

# Try to load GITHUB_TOKEN
try:
    GITHUB_TOKEN = user_secrets.get_secret("GITHUB_TOKEN")
    os.environ["GITHUB_TOKEN"] = GITHUB_TOKEN
    print("✅ GITHUB_TOKEN loaded successfully")
    print(f"   Token length: {len(GITHUB_TOKEN)} characters")
except Exception as e:
    print(f"⚠️  GITHUB_TOKEN not found: {e}")
    print("   Add it in Kaggle Settings → Secrets")
    GITHUB_TOKEN = None

# Set LLM backend to use vLLM (local model on Kaggle GPU)
os.environ["LLM_BACKEND"] = "vllm"  # We'll use the vLLM model loaded above
print("\n✅ LLM_BACKEND set to 'vllm' (using Kaggle GPU)")

print("="*60)

🔐 LOADING SECRETS FROM KAGGLE
✅ GROQ_API_KEY loaded successfully
   Key length: 56 characters
✅ GITHUB_TOKEN loaded successfully
   Token length: 93 characters

✅ LLM_BACKEND set to 'vllm' (using Kaggle GPU)


In [None]:
# Check GPU availability and configuration
import torch

print("="*60)
print("🔍 ENVIRONMENT CHECK")
print("="*60)
print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
print(f"✅ CUDA version: {torch.version.cuda}")
print(f"✅ Number of GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"\n🎮 GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"   Memory: {torch.cuda.get_device_properties(i).total_memory / 1024**3:.2f} GB")

print(f"\n✅ Current working directory: {os.getcwd()}")
print("="*60)

🔍 ENVIRONMENT CHECK
✅ PyTorch version: 2.6.0+cu124
✅ CUDA available: False
✅ CUDA version: 12.4
✅ Number of GPUs: 0

✅ Current working directory: /kaggle/working
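
The output above reports zero GPUs, in which case `tensor_parallel_size=torch.cuda.device_count()` would be passed as 0 to the model loader. A small preflight check can fail early with a clearer message. This is only a sketch: `resolve_tensor_parallel_size` is a hypothetical helper, and in the notebook it would presumably be called with `torch.cuda.device_count()`.

```python
def resolve_tensor_parallel_size(gpu_count: int, required: int = 1) -> int:
    """Return the GPU count to use for tensor parallelism, or fail early."""
    if gpu_count < required:
        raise RuntimeError(
            f"Found {gpu_count} GPU(s) but need at least {required}; "
            "check that a GPU accelerator is enabled for this session."
        )
    return gpu_count


# Demonstration with plain integers so the sketch runs without torch.
print(resolve_tensor_parallel_size(2))

try:
    resolve_tensor_parallel_size(0)
except RuntimeError as err:
    print(f"Preflight failed: {err}")
```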


In [None]:
# Clone the kaggle-run branch from GitHub (PUBLIC READ - no auth needed)
import subprocess

REPO_URL = "https://github.com/moonmehedi/Subchat-Trees-A-Scalable-Architecture-for-Multi-Threaded-Dialogue-and-Context-Isolation-in-LLM.git"
REPO_DIR = "Subchat-Trees"
BRANCH = "kaggle-run"

print("="*60)
print("📥 CLONING REPOSITORY")
print("="*60)

# Remove existing directory if present
if os.path.exists(REPO_DIR):
    print(f"⚠️  Removing existing {REPO_DIR} directory...")
    shutil.rmtree(REPO_DIR)

# Clone the specific branch (no authentication needed for public repos)
# Skip LFS files to avoid bandwidth quota issues
print(f"🔄 Cloning {BRANCH} branch (skipping LFS files)...")
print("   No authentication required for cloning (public repo)")

# Set environment variable to skip LFS files
clone_env = os.environ.copy()
clone_env["GIT_LFS_SKIP_SMUDGE"] = "1"

result = subprocess.run(
    ["git", "clone", "-b", BRANCH, "--single-branch", REPO_URL, REPO_DIR],
    capture_output=True,
    text=True,
    env=clone_env
)

if result.returncode == 0:
    print(f"✅ Successfully cloned {BRANCH} branch!")
    print(f"📂 Repository location: {os.path.abspath(REPO_DIR)}")
    
    # List key directories to verify
    print("\n📁 Key directories found:")
    key_dirs = ["backend", "backend/src", "backend/dataset"]
    for dir_path in key_dirs:
        full_path = os.path.join(REPO_DIR, dir_path)
        if os.path.exists(full_path):
            print(f"   ✅ {dir_path}")
        else:
            print(f"   ❌ {dir_path} (not found)")
else:
    print(f"❌ Clone failed: {result.stderr}")
    
print("="*60)

📥 CLONING REPOSITORY
⚠️  Removing existing Subchat-Trees directory...
🔄 Cloning kaggle-run branch (skipping LFS files)...
   No authentication required for cloning (public repo)
✅ Successfully cloned kaggle-run branch!
📂 Repository location: /kaggle/working/Subchat-Trees

📁 Key directories found:
   ✅ backend
   ✅ backend/src
   ✅ backend/dataset
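
The clone cell only prints a cross when an expected directory is absent, so later cells would still run against an incomplete checkout. One way to make that check reusable is to collect the missing directories and let the caller decide whether to stop. The helper name `find_missing_dirs` is hypothetical, and the temporary directory below merely stands in for the cloned repository.

```python
import os
import tempfile
from typing import Iterable, List


def find_missing_dirs(repo_root: str, expected: Iterable[str]) -> List[str]:
    """Return the expected sub-directories that do not exist under repo_root."""
    return [d for d in expected
            if not os.path.isdir(os.path.join(repo_root, d))]


# Build a fake checkout that lacks backend/dataset, then verify it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "backend", "src"))
    missing = find_missing_dirs(
        root, ["backend", "backend/src", "backend/dataset"]
    )

if missing:
    print(f"Missing directories: {missing}")
else:
    print("All key directories present")
```

In the notebook, raising an exception when `missing` is non-empty would stop the run before any test script is executed.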


In [None]:
# Configure git identity
os.chdir(REPO_DIR)

print("="*60)
print("⚙️  CONFIGURING GIT")
print("="*60)

!git config user.name "moonmehedi"
!git config user.email "the.mehedi.hasan.moon@gmail.com"

print("✅ Git identity configured!")
print(f"   User: moonmehedi")
print(f"   Email: the.mehedi.hasan.moon@gmail.com")

# Verify current branch
branch_result = subprocess.run(["git", "branch", "--show-current"], capture_output=True, text=True)
print(f"\n✅ Current branch: {branch_result.stdout.strip()}")
print("="*60)

os.chdir("..")  # Return to parent directory

⚙️  CONFIGURING GIT
✅ Git identity configured!
   User: moonmehedi
   Email: the.mehedi.hasan.moon@gmail.com

✅ Current branch: kaggle-run


In [None]:
def create_test_log(log_dir="kaggle_logs", log_file="connection_test.log"):
    """Create a detailed test log with GPU and environment info"""
    from datetime import datetime, timezone
    
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, log_file)
    # Use an explicit UTC timestamp so the "UTC" label written below is accurate
    current_time = datetime.now(timezone.utc)
    
    # Write as UTF-8 because the log contains emoji
    with open(log_path, "w", encoding="utf-8") as f:
        f.write("="*60 + "\n")
        f.write("🔬 KAGGLE GPU TEST RUN - CONNECTION VERIFICATION\n")
        f.write("="*60 + "\n\n")
        
        f.write(f"📅 Test Date: {current_time.strftime('%Y-%m-%d')}\n")
        f.write(f"⏰ Test Time: {current_time.strftime('%H:%M:%S UTC')}\n")
        f.write(f"📍 Timestamp: {pd.Timestamp.now()}\n\n")
        
        f.write("="*60 + "\n")
        f.write("🎮 GPU CONFIGURATION\n")
        f.write("="*60 + "\n")
        f.write(f"GPU Count: {torch.cuda.device_count()}\n")
        
        if torch.cuda.is_available():
            for i in range(torch.cuda.device_count()):
                f.write(f"\nGPU {i}:\n")
                f.write(f"  - Name: {torch.cuda.get_device_name(i)}\n")
                f.write(f"  - Memory: {torch.cuda.get_device_properties(i).total_memory / 1024**3:.2f} GB\n")
        else:
            f.write("⚠️  No GPU detected\n")
        
        f.write("\n" + "="*60 + "\n")
        f.write("📊 ENVIRONMENT INFO\n")
        f.write("="*60 + "\n")
        f.write(f"PyTorch Version: {torch.__version__}\n")
        f.write(f"CUDA Available: {torch.cuda.is_available()}\n")
        f.write(f"Working Directory: {os.getcwd()}\n")
        
        f.write("\n" + "="*60 + "\n")
        f.write("✅ TEST STATUS: SUCCESS\n")
        f.write("="*60 + "\n")
        f.write("\nThis log was generated from Kaggle notebook\n")
        f.write(f"Push attempt at: {current_time.isoformat()}\n")
    
    return log_path, current_time


def git_commit_and_push(file_path, commit_message, branch="kaggle-run"):
    """Commit a file and push to GitHub"""
    import subprocess
    
    # Add file
    add_result = subprocess.run(["git", "add", file_path], capture_output=True, text=True)
    if add_result.returncode != 0:
        return False, f"Git add failed: {add_result.stderr}"
    
    # Commit
    commit_result = subprocess.run(["git", "commit", "-m", commit_message], capture_output=True, text=True)
    if commit_result.returncode != 0:
        return False, f"Git commit failed: {commit_result.stderr}"
    
    # Push with token
    token = os.environ.get("GITHUB_TOKEN")
    if not token:
        return False, "GITHUB_TOKEN not found in environment"
    
    repo_url_with_token = f"https://{token}@github.com/moonmehedi/Subchat-Trees-A-Scalable-Architecture-for-Multi-Threaded-Dialogue-and-Context-Isolation-in-LLM.git"
    
    # Push straight to the authenticated URL instead of rewriting `origin`,
    # so the token is not stored in .git/config
    push_result = subprocess.run(["git", "push", repo_url_with_token, branch], capture_output=True, text=True)
    
    if push_result.returncode == 0:
        return True, "Push successful"
    else:
        # Redact the token in case git echoes the URL in its error output
        return False, f"Push failed: {push_result.stderr.replace(token, '***')}"


# Main execution
print("="*60)
print("🧪 TESTING GIT PUSH CAPABILITY")
print("="*60)

try:
    # Change to repo directory
    os.chdir(REPO_DIR)
    
    # Create test log
    log_path, timestamp = create_test_log()
    print(f"✅ Created detailed test log: {log_path}")
    
    # Commit and push
    commit_msg = f"Test: Kaggle GPU verification - {timestamp.strftime('%Y-%m-%d %H:%M:%S')}"
    success, message = git_commit_and_push(log_path, commit_msg, BRANCH)
    
    if success:
        print("\n✅ Successfully pushed to GitHub!")
        print(f"   📁 Check: {log_path}")
        print(f"   📅 Pushed at: {timestamp.strftime('%Y-%m-%d %H:%M:%S')}")
        print("   💡 Pull on your local machine to verify sync")
    else:
        print(f"\n❌ {message}")
        
except Exception as e:
    print(f"\n❌ Error: {e}")
    import traceback
    traceback.print_exc()
finally:
    # Always return to parent directory
    os.chdir("..")

print("="*60)

🧪 TESTING GIT PUSH CAPABILITY
✅ Created detailed test log: kaggle_logs/connection_test.log
[kaggle-run 7a97502] Test: Kaggle GPU verification - 2025-12-13 16:32:42
 1 file changed, 27 insertions(+)
 create mode 100644 kaggle_logs/connection_test.log

🔄 Pushing to GitHub...
✅ Successfully pushed to GitHub!
   📁 Check: kaggle_logs/connection_test.log
   📅 Pushed at: 2025-12-13 16:32:42
   💡 Pull on your local machine to verify sync


# 🔗 Step 5: Integrate vLLM with Backend

**This connects the loaded vLLM model to your backend code**

In [None]:
# Register the vLLM model with the backend
import sys
sys.path.insert(0, os.path.join(REPO_DIR, "backend"))

from src.services.vllm_client import VLLMClient

print("="*60)
print("🔗 INTEGRATING vLLM WITH BACKEND")
print("="*60)

# Register the globally loaded vLLM model
VLLMClient.set_model(llm)

print("✅ vLLM model is now available to backend services")
print(f"   Model: {model_path}")
print(f"   GPUs: {torch.cuda.device_count()}")
print(f"   Backend will use LLM_BACKEND={os.getenv('LLM_BACKEND')}")
print("="*60)
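
The repository's `VLLMClient` is not shown in this notebook, so the following is only a sketch of the general pattern that `VLLMClient.set_model(llm)` suggests: a class-level holder that stores one shared model so backend code can fetch it without reloading. `ModelRegistry` and its methods are hypothetical names and may differ from the real implementation.

```python
from typing import Any, Optional


class ModelRegistry:
    """Hypothetical class-level holder for one shared model instance."""

    _model: Optional[Any] = None

    @classmethod
    def set_model(cls, model: Any) -> None:
        """Register the already-loaded model for later lookups."""
        cls._model = model

    @classmethod
    def is_available(cls) -> bool:
        """Report whether a model has been registered."""
        return cls._model is not None

    @classmethod
    def get_model(cls) -> Any:
        """Return the registered model, failing clearly if none is set."""
        if cls._model is None:
            raise RuntimeError("No model registered; call set_model() first")
        return cls._model


# Any object can stand in for the loaded vLLM engine in this demonstration.
fake_llm = object()
ModelRegistry.set_model(fake_llm)
print(ModelRegistry.is_available())
```

Keeping the model on the class rather than on an instance means every caller in the same Python process sees the same object, which matters here because the model is loaded once in the notebook and then used by imported backend modules.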

# 🧪 Step 6: Test vLLM Integration

**Quick test to verify backend can use vLLM on Kaggle GPU**

In [None]:
# Test the backend with vLLM
from src.services.simple_llm import SimpleLLMClient
from src.models.tree import TreeNode, LocalBuffer

print("="*60)
print("🧪 TESTING BACKEND WITH vLLM")
print("="*60)

# Create a simple test
llm_client = SimpleLLMClient()

# Create a test node with buffer
buffer = LocalBuffer(buffer_size=5)
root = TreeNode(id="test", title="Test Conversation", buffer=buffer)

# Add a message to buffer
buffer.add_message("user", "Hello, test message")

# Test generation
print("\n📝 Testing response generation...")
response = llm_client.generate_response(root, "What is 2+2?")

print(f"\n✅ Response: {response}")
print(f"📊 Token usage: {llm_client.get_last_usage()}")
print("\n" + "="*60)
print("✅ Backend integration successful!")
print("="*60)
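
`LocalBuffer(buffer_size=5)` suggests a bounded message buffer that keeps only the most recent entries. Its real behaviour lives in the repository and is not shown here, so the sketch below is an assumption about how such a buffer might work, built on `collections.deque` with `maxlen`. `BoundedBuffer` is a hypothetical stand-in, not the backend class.

```python
from collections import deque
from typing import Deque, Dict, List


class BoundedBuffer:
    """Hypothetical fixed-size message buffer that drops the oldest entries."""

    def __init__(self, buffer_size: int) -> None:
        if buffer_size <= 0:
            raise ValueError("buffer_size must be positive")
        # deque(maxlen=n) discards items from the left once n is exceeded.
        self._messages: Deque[Dict[str, str]] = deque(maxlen=buffer_size)

    def add_message(self, role: str, content: str) -> None:
        """Append a message, evicting the oldest one if the buffer is full."""
        self._messages.append({"role": role, "content": content})

    def get_messages(self) -> List[Dict[str, str]]:
        """Return the retained messages, oldest first."""
        return list(self._messages)


buf = BoundedBuffer(buffer_size=5)
for i in range(7):
    buf.add_message("user", f"message {i}")

print([m["content"] for m in buf.get_messages()])
```

With a size of 5 and seven messages added, only the last five remain, which is the kind of behaviour a buffer size analysis would vary.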

# testing ends