# Few-Shot Test Generation Pipeline - Kaggle Runner

This notebook clones the Cross-lingual-test repository and runs the few-shot test generation pipeline with different embedding models.

**Requirements:**
- GPU Runtime (recommended for faster embedding)
- Internet enabled for cloning repo and downloading models

**Models to test:**
1. microsoft/unixcoder-base (768 dim)
2. Salesforce/SFR-Embedding-Code-400M_R (larger, potentially better)
3. sentence-transformers/all-MiniLM-L6-v2 (384 dim; smaller, faster)
4. microsoft/codebert-base (768 dim)

## Step 1: Setup and Clone Repository

In [None]:
# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Clone the repository
!git clone https://github.com/lequangdung2005/Cross-lingual-test.git
%cd Cross-lingual-test/Few_shot

In [None]:
# Install required dependencies
!pip install -q transformers datasets sentence-transformers torch tqdm numpy scikit-learn faiss-cpu pandas

## Step 2: Verify Installation

In [None]:
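# Verify that the core packages are importable
import sys
import os
import importlib.util
missing = [m for m in ("transformers", "datasets", "sentence_transformers", "faiss") if importlib.util.find_spec(m) is None]
print(f"Python {sys.version.split()[0]} | missing packages: {missing if missing else 'none'}")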

## Step 3: Configuration

In [None]:
# Configuration
# List of embedding models to test
EMBEDDING_MODELS = [
    "microsoft/unixcoder-base",
    "Salesforce/SFR-Embedding-Code-400M_R",
    "sentence-transformers/all-MiniLM-L6-v2",
    "microsoft/codebert-base"
]

# You can modify these settings
TOP_K = 3  # Number of final examples to return after reranking
RERANK_POOL_SIZE = 50  # Number of candidates to retrieve before reranking
SIMILARITY_THRESHOLD = 0.0  # Minimum similarity threshold (0.0 = no filtering)
MAX_EXAMPLES = None  # Set to a number (e.g., 5000) for quick testing, None for all

# Output directory for results
OUTPUT_BASE = "/kaggle/working/few_shot_outputs"
os.makedirs(OUTPUT_BASE, exist_ok=True)
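
How these three settings interact is decided inside `test_runner.py`, which this notebook does not show. The sketch below is one plausible reading, not the repository's implementation: take the `RERANK_POOL_SIZE` most similar candidates, drop those below `SIMILARITY_THRESHOLD`, and keep the best `TOP_K`. The names `cosine` and `select_examples` are hypothetical, and a real reranker would rescore the pool instead of reusing the cosine score. Note that with a plain `>=` comparison a threshold of 0.0 still drops negative scores; whether the pipeline behaves the same way is an open assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_examples(query, database, top_k=3, pool_size=50, threshold=0.0):
    """Hypothetical helper: return (index, similarity) pairs for the best matches."""
    sims = [cosine(query, vec) for vec in database]
    # Stage 1: keep the pool_size most similar candidates.
    pool = sorted(range(len(database)), key=lambda i: sims[i], reverse=True)[:pool_size]
    # Stage 2: apply the threshold (a real reranker would rescore the pool here).
    kept = [(i, sims[i]) for i in pool if sims[i] >= threshold]
    # Stage 3: return the final top_k examples.
    return kept[:top_k]

# Toy data: four 2-D "embeddings" and one query vector.
db = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
picked = select_examples([1.0, 0.0], db, top_k=2, pool_size=3, threshold=0.0)
print([i for i, _ in picked])  # [0, 2]
```

A pool that is much larger than `TOP_K` gives the rerank stage more candidates to choose from at the cost of extra scoring work.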

## Step 4: Run Test Runner for Each Model

In [None]:
import time
from datetime import datetime

# Store results summary
results_summary = []

for i, model_name in enumerate(EMBEDDING_MODELS, 1):
    print(f"\nProcessing model {i}/{len(EMBEDDING_MODELS)}: {model_name}")
    
    start_time = time.time()
    
    # Build command with rerank pool size
    cmd = f"python test_runner.py --embedder '{model_name}' --top-k {TOP_K} --rerank-pool-size {RERANK_POOL_SIZE} --similarity-threshold {SIMILARITY_THRESHOLD}"
    if MAX_EXAMPLES is not None:
        cmd += f" --max-examples {MAX_EXAMPLES}"
    
    # Run the script
    return_code = os.system(cmd)
    
    elapsed_time = time.time() - start_time
    
    # Record results
    result = {
        "model": model_name,
        "success": return_code == 0,
        "elapsed_time": elapsed_time,
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }
    results_summary.append(result)
    
    status = "✓" if return_code == 0 else "✗"
    print(f"{status} Completed in {elapsed_time:.2f}s")

print("\nAll models processed!")
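
The loop above passes one shell string to `os.system`, so it depends on the quoting around the model name. A minimal alternative sketch, assuming the same flags as in that command, builds an argument list and runs it with `subprocess.run`. The helper names `build_command` and `run_model` are hypothetical.

```python
import subprocess
import sys

def build_command(model_name, top_k, pool_size, threshold, max_examples=None):
    """Hypothetical helper: assemble the test_runner.py invocation as an argument list."""
    cmd = [
        sys.executable, "test_runner.py",
        "--embedder", model_name,
        "--top-k", str(top_k),
        "--rerank-pool-size", str(pool_size),
        "--similarity-threshold", str(threshold),
    ]
    if max_examples is not None:
        cmd += ["--max-examples", str(max_examples)]
    return cmd

def run_model(model_name, **settings):
    """Run one model and return True when the process exits with code 0."""
    completed = subprocess.run(build_command(model_name, **settings))
    return completed.returncode == 0

# Only build and display the command here; run_model() would actually launch it.
cmd = build_command("microsoft/unixcoder-base", top_k=3, pool_size=50, threshold=0.0)
print(" ".join(cmd[1:]))
```

Because no shell parses the list, a model name containing spaces or quotes reaches the script unchanged.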

## Step 5: Results Summary

In [None]:
import pandas as pd

# Create summary DataFrame
df_summary = pd.DataFrame(results_summary)
df_summary['elapsed_time_min'] = df_summary['elapsed_time'] / 60

print("\nEXECUTION SUMMARY")
print("=" * 80)
print(df_summary[['model', 'success', 'elapsed_time_min']].to_string(index=False))
print("\n" + "=" * 80)
print(f"Total: {len(results_summary)} | Successful: {sum(1 for r in results_summary if r['success'])} | Failed: {sum(1 for r in results_summary if not r['success'])}")
print(f"Total time: {sum(r['elapsed_time'] for r in results_summary) / 60:.2f} minutes")
print("=" * 80)
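
`OUTPUT_BASE` is created in Step 3 but nothing is written to it. One way to use it, sketched here under the assumption that a JSON file is an acceptable format, is to persist `results_summary` so the timings survive a kernel restart. The helper `save_summary` and the file name `run_summary.json` are hypothetical, and the record below is a placeholder, not a measured result.

```python
import json
import os
import tempfile

def save_summary(results, out_dir, filename="run_summary.json"):
    """Hypothetical helper: write the per-model results to out_dir and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return path

# Placeholder record standing in for results_summary.
# In the notebook, call save_summary(results_summary, OUTPUT_BASE) instead.
example_results = [
    {"model": "microsoft/unixcoder-base", "success": True,
     "elapsed_time": 12.5, "timestamp": "2024-01-01 00:00:00"},
]
summary_path = save_summary(example_results, tempfile.mkdtemp())

# Read the file back to confirm it round-trips.
with open(summary_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded[0]["model"], reloaded[0]["success"])
```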

## Step 6: Check Generated Files

In [None]:
# Check generated prompt files

prompt_dir = "src/data/constructed_prompt"
if os.path.exists(prompt_dir):
    print("Generated Few-Shot Prompt Files:\n")
    for model_name in EMBEDDING_MODELS:
        print(f"\n{model_name}")
        print("-" * 60)
        
        for lang in ['rust', 'go', 'julia']:
            file_path = f"{prompt_dir}/{model_name}/{lang}/data_with_fewshot.jsonl"
            if os.path.exists(file_path):
                with open(file_path, 'r') as f:
                    line_count = sum(1 for _ in f)
                file_size = os.path.getsize(file_path) / (1024 * 1024)  # MB
                print(f"  {lang.upper():6s}: {line_count:4d} prompts, {file_size:.2f} MB")
            else:
                print(f"  {lang.upper():6s}: Not found")
else:
    print(f"Prompt directory not found: {prompt_dir}")

## Step 7: Check Database Files

In [None]:
# Check database files
db_dir = "src/data/database"
if os.path.exists(db_dir):
    print("Generated Database Index Files:\n")
    for model_name in EMBEDDING_MODELS:
        db_path = f"{db_dir}/{model_name}/database_index.pkl"
        if os.path.exists(db_path):
            file_size = os.path.getsize(db_path) / (1024 * 1024)  # MB
            print(f"✓ {model_name}: {file_size:.2f} MB")
        else:
            print(f"✗ {model_name}: Not found")
else:
    print(f"Database directory not found: {db_dir}")

## Step 8: Sample Output Inspection

In [None]:
import json

# Show a sample prompt from the first successful model
for model_name in EMBEDDING_MODELS:
    sample_file = f"src/data/constructed_prompt/{model_name}/rust/data_with_fewshot.jsonl"
    if os.path.exists(sample_file):
        print(f"Sample Few-Shot Prompt from {model_name}:\n")
        print("=" * 80)
        with open(sample_file, 'r') as f:
            first_line = f.readline()
            sample = json.loads(first_line)
            
            print(f"Case ID: {sample['id']}")
            print(f"Retrieved Context Keys: {list(sample['retrieved_context'].keys())}")
            
            # Show focal method (truncated)
            focal = sample['retrieved_context'].get('focal_method', 'N/A')
            print(f"\nFocal Method (first 300 chars):\n{focal[:300]}...")
            
            # Show number of retrieved examples
            results = sample['retrieved_context'].get('results', [])
            print(f"\nNumber of retrieved examples: {len(results)}")
            
            if results:
                print("\nFirst retrieved example (truncated):")
                first_result = results[0]
                print(f"  Similarity: {first_result.get('similarity', 'N/A')}")
                example_code = first_result.get('example', {}).get('focal_method', 'N/A')
                print(f"  Code: {example_code[:200]}...")
        
        print("\n" + "=" * 80)
        break  # Only show first successful model
else:  # for/else: runs only if the loop above never reached `break`
    print("No sample files found")
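
The inspection above reads only the first record. To scan a whole file for records that would break that code, a rough sketch such as the following could help. It assumes each line is a JSON object with the `id` and `retrieved_context` keys used above. The helper `find_malformed` is hypothetical, and the sample lines are made up for illustration.

```python
import json

def find_malformed(lines, required=("id", "retrieved_context")):
    """Hypothetical helper: return (line_number, reason) for records that fail basic checks."""
    problems = []
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [key for key in required if key not in record]
        if missing:
            problems.append((number, f"missing keys: {missing}"))
    return problems

# Made-up lines: one valid record, one missing a key, one that is not JSON.
sample_lines = [
    '{"id": 1, "retrieved_context": {"focal_method": "fn a() {}", "results": []}}',
    '{"id": 2}',
    'not json',
]
problems = find_malformed(sample_lines)
for number, reason in problems:
    print(f"line {number}: {reason}")
```

For a real file, pass the open file handle directly, since iterating over it yields one line at a time.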

## Step 9: Download Results (Optional)

In [None]:
# Create a zip file of all generated prompts for download
import shutil

output_zip = "/kaggle/working/few_shot_prompts.zip"
prompt_dir = "src/data/constructed_prompt"

if os.path.exists(prompt_dir):
    shutil.make_archive(
        output_zip.replace('.zip', ''),
        'zip',
        prompt_dir
    )
    
    zip_size = os.path.getsize(output_zip) / (1024 * 1024)  # MB
    print(f"✓ Created {output_zip} ({zip_size:.2f} MB)")
    print("Download from Kaggle output panel.")
else:
    print(f"Prompt directory not found: {prompt_dir}")

## Step 10: Cleanup (Optional)

In [None]:
# Optional: Clean up large database files to save space
# Uncomment the following lines if you want to remove database files

# import shutil
# db_dir = "src/data/database"
# if os.path.exists(db_dir):
#     shutil.rmtree(db_dir)
#     print("✓ Database files removed")

## Notes

**Expected Runtime:**
- Each model takes approximately 10-30 minutes depending on:
  - Model size
  - Number of training examples
  - GPU availability
  - Dataset size

**Output Structure:**
```
src/data/
├── constructed_prompt/
│   ├── microsoft/unixcoder-base/
│   │   ├── rust/data_with_fewshot.jsonl
│   │   ├── go/data_with_fewshot.jsonl
│   │   └── julia/data_with_fewshot.jsonl
│   └── [other models]/
└── database/
    ├── microsoft/unixcoder-base/database_index.pkl
    └── [other models]/database_index.pkl
```
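
Because model names contain a `/`, each model occupies two directory levels (organization, then model). The layout above can be expressed with `pathlib`, as in this small sketch. The helper names `prompt_file` and `database_file` are hypothetical.

```python
from pathlib import Path

DATA_ROOT = Path("src/data")

def prompt_file(model_name, lang, root=DATA_ROOT):
    """Path to the few-shot prompt file for one model and language."""
    return root / "constructed_prompt" / model_name / lang / "data_with_fewshot.jsonl"

def database_file(model_name, root=DATA_ROOT):
    """Path to the database index for one model."""
    return root / "database" / model_name / "database_index.pkl"

print(prompt_file("microsoft/unixcoder-base", "rust").as_posix())
# src/data/constructed_prompt/microsoft/unixcoder-base/rust/data_with_fewshot.jsonl
print(database_file("microsoft/unixcoder-base").as_posix())
# src/data/database/microsoft/unixcoder-base/database_index.pkl
```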

**Troubleshooting:**
- If a model fails to load, it may be too large for available memory
- Try setting `MAX_EXAMPLES` to a smaller number (e.g., 5000) for testing
- Ensure GPU runtime is enabled for faster processing
- Check that internet is enabled for downloading models and datasets