## ✅ ALL DEPENDENCY ISSUES FIXED!

**Final working versions:**
- `transformers`: 4.57.1 → **4.44.2** ✅ (has proper `DynamicCache` support)
- `peft`: 0.17.1 → **0.13.2** ✅ (supports `use_dora`, `use_rslora`, all config params)
- **These versions are fully compatible!**

**Next steps:**
1. **Restart the kernel** (Kernel → Restart Kernel)
2. Run all cells from the top to reload with correct library versions
3. The diagnostic test should complete in ~30-60 seconds with cache enabled
4. Full inference (4096 tokens) should take 3-7 minutes
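
The timing estimates above can be sanity-checked against any measured run. The sketch below uses the 512-token diagnostic recorded later in this notebook (162.5 seconds) as the measurement and assumes a constant per-token cost, which is likely optimistic for long outputs. The function names are illustrative and not part of the notebook.

```python
def tokens_per_second(tokens, seconds):
    """Average generation throughput for one measured run."""
    return tokens / seconds

def projected_minutes(target_tokens, measured_tokens, measured_seconds):
    """Extrapolate a measured run to a longer output at the same throughput."""
    rate = tokens_per_second(measured_tokens, measured_seconds)
    return target_tokens / rate / 60

rate = tokens_per_second(512, 162.5)
eta = projected_minutes(4096, 512, 162.5)
print(f"{rate:.2f} tokens/s, about {eta:.1f} minutes for 4096 tokens")
# prints: 3.15 tokens/s, about 21.7 minutes for 4096 tokens
```

If a run comes out far slower than the estimates in the list, check whether the whole model is actually on the GPU and whether the KV cache is enabled.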

**What was the problem?**
- ❌ transformers 4.57.1 was too new (DynamicCache had breaking changes)
- ❌ transformers 4.36.0 was too old (missing cache classes peft expected)
- ❌ peft 0.7.0/0.8.2/0.11.0 were too old (missing use_dora, use_rslora support)
- ✅ **Solution**: transformers 4.44.2 + peft 0.13.2 = perfect compatibility!
- ✅ Your training and adapters are perfect - this was purely a library version mismatch!
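
The pins above can be verified before any model is loaded. This is a minimal sketch using `importlib.metadata` from the standard library. `PINNED` and `find_mismatches` are illustrative names, not part of the notebook.

```python
from importlib import metadata

# Pins taken from the "Final working versions" list above.
PINNED = {"transformers": "4.44.2", "peft": "0.13.2"}

def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches

for package, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{package}: expected {expected}, found {installed}")
```

An empty result means both pins are satisfied. Run it after restarting the kernel so the check reflects the environment that is actually loaded.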

# Test Trained CAD Parameter Extraction Model

This notebook loads the trained LoRA adapters from `./phi3-cad-TwoStages-Radapters-2` and tests the model's ability to extract CAD parameters from natural language instructions.

**Prerequisites:**
- Trained adapters saved in `./phi3-cad-TwoStages-Radapters-2`
- GPU with CUDA support (recommended)
- Python environment with: `transformers`, `peft`, `torch`
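
A quick pre-flight check of the adapter directory can catch path mistakes before the multi-minute model load. The sketch below assumes the adapters were saved with PEFT's usual file names (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`). `adapter_problems` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path

# File names PEFT typically writes when saving an adapter (assumption: default save settings).
CONFIG_FILE = "adapter_config.json"
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")

def adapter_problems(adapter_dir):
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(adapter_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / CONFIG_FILE).is_file():
        problems.append(f"missing {CONFIG_FILE}")
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing adapter weights (" + " or ".join(WEIGHT_FILES) + ")")
    return problems

print(adapter_problems("./phi3-cad-TwoStages-Radapters-2"))
```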

## 1. Import Dependencies

In [1]:
import torch
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import warnings
warnings.filterwarnings('ignore')

print("✅ Imports successful")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

✅ Imports successful
PyTorch version: 2.4.0+cu121
CUDA available: True
GPU: Quadro RTX 6000
GPU Memory: 25.19 GB


## 2. Load Base Model and Tokenizer

In [2]:
# Model configuration
base_model_name = "microsoft/Phi-3-mini-128k-instruct"
adapter_path = "./phi3-cad-TwoStages-Radapters-2"

print("=" * 80)
print("LOADING MODEL")
print("=" * 80)

# Load tokenizer
print(f"\n1️⃣ Loading tokenizer from {base_model_name}...")
tokenizer = AutoTokenizer.from_pretrained(
    base_model_name,
    trust_remote_code=True,
    padding_side="right"
)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print(f"   ✅ Tokenizer loaded")
print(f"   Vocab size: {len(tokenizer)}")
print(f"   EOS token: {tokenizer.eos_token}")

LOADING MODEL

1️⃣ Loading tokenizer from microsoft/Phi-3-mini-128k-instruct...
   ✅ Tokenizer loaded
   Vocab size: 32011
   EOS token: <|endoftext|>


In [None]:
# Load base model
print(f"\n2️⃣ Loading base model: {base_model_name}")
print("   ⏳ This may take 1-2 minutes...")

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
    
)

print(f"   ✅ Base model loaded")
print(f"   Device: {base_model.device}")
print(f"   Dtype: {base_model.dtype}")


2️⃣ Loading base model: microsoft/Phi-3-mini-128k-instruct
   ⏳ This may take 1-2 minutes...


`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

   ✅ Base model loaded
   Device: cuda:0
   Dtype: torch.bfloat16


## 3. Load Trained LoRA Adapters

In [4]:
# Load trained adapters
print(f"\n3️⃣ Loading trained LoRA adapters from: {adapter_path}")
print("   ⏳ Loading adapters...")

model = PeftModel.from_pretrained(
    base_model,
    adapter_path,
    device_map="auto"
)

print(f"   ✅ Trained adapters loaded!")

# Model statistics
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print(f"\n📊 Model Statistics:")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Trainable %: {100 * trainable_params / total_params:.2f}%")

print("\n" + "=" * 80)
print("✅ MODEL READY FOR INFERENCE")
print("=" * 80)


3️⃣ Loading trained LoRA adapters from: ./phi3-cad-TwoStages-Radapters-2
   ⏳ Loading adapters...
   ✅ Trained adapters loaded!

📊 Model Statistics:
   Total parameters: 3,829,992,448
   Trainable parameters: 0
   Trainable %: 0.00%

✅ MODEL READY FOR INFERENCE
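
The parameter count reported above gives a lower bound on the GPU memory the model needs. The sketch below assumes every parameter is stored in bfloat16 (2 bytes), matching the `torch_dtype` used when loading the base model. It covers weights only and ignores activations and the KV cache. `weight_memory_gib` is an illustrative name.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

total_params = 3_829_992_448  # "Total parameters" from the output above
print(f"Weights in bfloat16: {weight_memory_gib(total_params):.2f} GiB")
# prints: Weights in bfloat16: 7.13 GiB
```

So roughly 7.1 GiB must be free on the GPU before generation starts, plus headroom that grows with the number of generated tokens.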


## 4. Define Inference Function

In [5]:
# System prompt used during training
SYSTEM_PROMPT = (
    "You are a CAD parameter predictor. Given a natural language instruction, "
    "predict the appropriate values for CAD parameters. "
    "Output JSON with parameter paths as keys and predicted values. "
    "Set parameters to 0 when they are not relevant to the instruction. "
    "Infer reasonable defaults when specific values are not mentioned."
)

def extract_parameters(instruction, max_new_tokens=4096, temperature=0.0, top_p=0.95, verbose=True):
    """
    Extract CAD parameters from natural language instruction.
    
    Args:
        instruction: Natural language CAD instruction
        max_new_tokens: Maximum tokens to generate (default: 4096)
        temperature: Sampling temperature (0.0 = greedy)
        top_p: Nucleus sampling parameter
        verbose: Print generation progress
    
    Returns:
        dict: Extracted parameters or error info
    """
    if verbose:
        print(f"⏱️  Generating with max_new_tokens={max_new_tokens}...")
    
    # Format prompt using chat template
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": instruction},
    ]
    
    formatted_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    input_length = inputs.input_ids.shape[1]
    
    if verbose:
        print(f"   Input tokens: {input_length}")
    
    # Generate with the KV cache enabled. It was previously disabled only as a
    # workaround for the DynamicCache 'seen_tokens' error under mismatched versions.
    generation_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": (temperature > 0.0),
        "pad_token_id": tokenizer.pad_token_id,
        "eos_token_id": tokenizer.eos_token_id,
        "use_cache": True,  # Set to False only if the 'seen_tokens' error reappears
    }
    
    # Only add temperature/top_p if sampling is enabled
    if temperature > 0.0:
        generation_kwargs["temperature"] = temperature
        generation_kwargs["top_p"] = top_p
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            **generation_kwargs
        )
    
    # Decode only the generated tokens (not the input prompt)
    generated_ids = outputs[0][input_length:]
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    
    output_length = len(generated_ids)
    
    if verbose:
        print(f"   Generated tokens: {output_length}")
        print(f"   Output length: {len(generated_text)} characters")
    
    # Parse JSON with multiple strategies
    
    # Strategy 1: Direct JSON parse
    try:
        result = json.loads(generated_text)
        if verbose:
            print(f"   ✅ Parsed successfully (strategy 1: direct parse)")
        return result
    except json.JSONDecodeError:
        pass
    
    # Strategy 2: Find JSON block
    try:
        start = generated_text.find('{')
        end = generated_text.rfind('}') + 1
        if start != -1 and end > start:
            json_text = generated_text[start:end]
            result = json.loads(json_text)
            if verbose:
                print(f"   ✅ Parsed successfully (strategy 2: find JSON block)")
            return result
    except (json.JSONDecodeError, ValueError):
        pass
    
    # Strategy 3: Split by markers
    for marker in ["<|assistant|>", "assistant:", "Assistant:"]:
        if marker in generated_text:
            text = generated_text.split(marker)[-1].strip()
            try:
                result = json.loads(text)
                if verbose:
                    print(f"   ✅ Parsed successfully (strategy 3: split by '{marker}')")
                return result
            except json.JSONDecodeError:
                continue
    
    # Parsing failed - return error info
    if verbose:
        print(f"   ❌ Failed to parse JSON output")
    
    return {
        "error": "parsing_failed",
        "output_length": len(generated_text),
        "tokens_generated": output_length,
        "raw_output_preview": generated_text[:1000]
    }

print("✅ Inference function defined")

✅ Inference function defined


## 5. Test the Model

Let's test with various CAD instructions to see how the model performs.

## 🚨 QUICK DIAGNOSTIC TEST (512 tokens only)

Let's first verify that the model generates correctly with a short output and the cache enabled, which separates cache and version problems from long-generation problems.

In [6]:
# Quick test with SMALL output and CACHE ENABLED
print("🔬 DIAGNOSTIC TEST - Testing with 512 tokens and cache enabled")
print("=" * 80)

test_instruction = "Create a cube 10mm by 10mm by 10mm"

print(f"📝 Instruction: {test_instruction}\n")
print("⏱️  Generating with max_new_tokens=512 (cache enabled)...")

# Modified generation with cache enabled
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": test_instruction},
]

formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
input_length = inputs.input_ids.shape[1]

print(f"   Input tokens: {input_length}")
print(f"   Starting generation...")

import time
start_time = time.time()

try:
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            # use_cache will default to True
        )
    
    elapsed = time.time() - start_time
    generated_ids = outputs[0][input_length:]
    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
    
    print(f"   ✅ Generation completed in {elapsed:.1f} seconds!")
    print(f"   Generated tokens: {len(generated_ids)}")
    print(f"   Output length: {len(generated_text)} characters")
    print(f"\n📄 Raw Output:\n{generated_text[:1000]}")
    
    # Try to parse
    try:
        result = json.loads(generated_text)
        print(f"\n✅ Successfully parsed JSON!")
        print(f"   Total parameters: {len(result)}")
        print(f"   Non-zero: {sum(1 for v in result.values() if v != 0)}")
    except (json.JSONDecodeError, AttributeError):
        # Invalid or truncated JSON, or valid JSON that is not an object
        print(f"\n⚠️  Could not parse as JSON (possibly truncated)")
        
except Exception as e:
    print(f"\n❌ Error: {type(e).__name__}: {str(e)[:200]}")
    
print("\n" + "=" * 80)

🔬 DIAGNOSTIC TEST - Testing with 512 tokens and cache enabled
📝 Instruction: Create a cube 10mm by 10mm by 10mm

⏱️  Generating with max_new_tokens=512 (cache enabled)...
   Input tokens: 82
   Starting generation...


The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
You are not running the flash-attention implementation, expect numerical differences.


   ✅ Generation completed in 162.5 seconds!
   Generated tokens: 512
   Output length: 1049 characters

📄 Raw Output:
{
  "parts.part_1.coordinate_system.Euler Angles[0]": 0.0,
  "parts.part_1.coordinate_system.Euler Angles[1]": 0.0,
  "parts.part_1.coordinate_system.Euler Angles[2]": 0.0,
  "parts.part_1.coordinate_system.Translation Vector[0]": 0.0,
  "parts.part_1.coordinate_system.Translation Vector[1]": 0.0,
  "parts.part_1.coordinate_system.Translation Vector[2]": 0.0,
  "parts.part_1.description.height": 0.009999999999999999,
  "parts.part_1.description.length": 0.09999999999999999,
  "parts.part_1.description.name": "",
  "parts.part_1.description.shape": "",
  "parts.part_1.description.width": 0.09999999999999999,
  "parts.part_1.extrusion.extrude_depth_opposite_normal": 0.0,
  "parts.part_1.extrusion.extrude_depth_towards_normal": 0.009999999999999999,
  "parts.part_1.extrusion.operation": "NewBodyFeatureOperation",
  "parts.part_1.extrusion.sketch_scale": 0.099999999999
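
When `max_new_tokens` cuts generation short, as in the 512-token run above, `json.loads` fails on the partial object. One possible way to still inspect the complete key/value pairs is to back up to the last comma that yields valid JSON. The helper `salvage_truncated_json` is hypothetical and not part of the notebook. The sketch assumes the model emits a single JSON object that was cut off mid-stream.

```python
import json

def salvage_truncated_json(text):
    """Return the complete key/value pairs from a JSON object that was cut off."""
    start = text.find("{")
    if start == -1:
        return {}
    body = text[start:]
    try:
        return json.loads(body)  # not truncated after all
    except json.JSONDecodeError:
        pass
    # Try each comma as a cut point, from the last one backwards,
    # until closing the object at that point gives valid JSON.
    cut = body.rfind(",")
    while cut != -1:
        try:
            return json.loads(body[:cut] + "}")
        except json.JSONDecodeError:
            cut = body.rfind(",", 0, cut)
    return {}

truncated = '{\n  "height": 0.01,\n  "shape": "",\n  "sketch_scale": 0.0999'
print(salvage_truncated_json(truncated))
# prints: {'height': 0.01, 'shape': ''}
```

The recovered dictionary is only a partial result. It is useful for spotting obviously wrong values early, not as a substitute for a full-length generation.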

In [7]:
# Test prompts
test_prompts = [
    "Create a cube 10mm by 20mm by 30mm centered at the origin",
    "Design a cylinder with radius 5mm and height 15mm",
    "Make a rectangular block 100mm x 50mm x 25mm",
]

print("=" * 80)
print("TESTING PARAMETER EXTRACTION MODEL")
print("=" * 80)
print(f"\n📝 Running {len(test_prompts)} test cases...\n")

results = []

for i, prompt in enumerate(test_prompts, 1):
    print("\n" + "=" * 80)
    print(f"TEST {i}/{len(test_prompts)}")
    print("=" * 80)
    print(f"📝 Instruction: {prompt}")
    print("-" * 80)
    
    result = extract_parameters(prompt, max_new_tokens=4096, verbose=True)
    results.append({"instruction": prompt, "result": result})
    
    print("-" * 80)
    
    if "error" in result:
        print(f"❌ Error: {result['error']}")
        print(f"   Tokens generated: {result.get('tokens_generated', 0)}")
        print(f"   Output length: {result.get('output_length', 0)} chars")
        print(f"\n   Raw output preview (first 500 chars):")
        print(f"   {result.get('raw_output_preview', '')[:500]}")
    else:
        total_params = len(result)
        non_zero_params = sum(1 for v in result.values() if v != 0)
        
        print(f"✅ Successfully extracted parameters")
        print(f"   Total parameters: {total_params}")
        print(f"   Non-zero parameters: {non_zero_params}")
        print(f"   Zero-padded parameters: {total_params - non_zero_params}")
        
        # Show first 15 non-zero parameters
        print(f"\n   First 15 non-zero parameters:")
        count = 0
        for key, value in result.items():
            if value != 0:
                print(f"      {key}: {value}")
                count += 1
                if count >= 15:
                    break
        
        if non_zero_params > 15:
            print(f"      ... and {non_zero_params - 15} more non-zero parameters")

print("\n" + "=" * 80)
print("✅ TESTING COMPLETE")
print("=" * 80)

TESTING PARAMETER EXTRACTION MODEL

📝 Running 3 test cases...


TEST 1/3
📝 Instruction: Create a cube 10mm by 20mm by 30mm centered at the origin
--------------------------------------------------------------------------------
⏱️  Generating with max_new_tokens=4096...
   Input tokens: 86


OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 0 has a total capacity of 23.46 GiB of which 462.62 MiB is free. Including non-PyTorch memory, this process has 2.12 GiB memory in use. Process 1669217 has 1.03 GiB memory in use. Process 1646205 has 18.21 GiB memory in use. Process 1723537 has 1.63 GiB memory in use. Of the allocated memory 1.59 GiB is allocated by PyTorch, and 336.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
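
The error message above already says where the GPU memory went. The sketch below extracts the per-process figures from an excerpt of that message using only the standard library. `other_process_usage` is an illustrative helper, and the regular expression assumes the wording shown in this particular message.

```python
import re

# Excerpt copied from the OutOfMemoryError message above.
OOM_MESSAGE = (
    "GPU 0 has a total capacity of 23.46 GiB of which 462.62 MiB is free. "
    "Including non-PyTorch memory, this process has 2.12 GiB memory in use. "
    "Process 1669217 has 1.03 GiB memory in use. "
    "Process 1646205 has 18.21 GiB memory in use. "
    "Process 1723537 has 1.63 GiB memory in use."
)

def other_process_usage(message):
    """Map each foreign PID named in the message to its reported usage in GiB."""
    pattern = re.compile(r"Process (\d+) has ([\d.]+) GiB memory in use")
    return {int(pid): float(gib) for pid, gib in pattern.findall(message)}

usage = other_process_usage(OOM_MESSAGE)
for pid, gib in sorted(usage.items(), key=lambda item: item[1], reverse=True):
    print(f"PID {pid}: {gib:.2f} GiB")
print(f"Held by other processes: {sum(usage.values()):.2f} GiB")
```

Read this way, about 20.87 GiB of the 23.46 GiB card is held by other processes, while this notebook's process uses only 2.12 GiB. That suggests the failure comes from a shared or crowded GPU rather than from the 4096-token setting alone. One possible cause, which the log does not confirm, is an earlier kernel that was never shut down. Running `nvidia-smi` shows which processes those PIDs belong to.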

## 6. Analyze Results

In [None]:
# Summary statistics
print("=" * 80)
print("RESULTS SUMMARY")
print("=" * 80)

successful = sum(1 for r in results if "error" not in r["result"])
failed = len(results) - successful

print(f"\n📊 Overall Statistics:")
print(f"   Total tests: {len(results)}")
print(f"   Successful: {successful}")
print(f"   Failed: {failed}")
# Guard against an empty results list (e.g. when the test loop aborted early)
if results:
    print(f"   Success rate: {100 * successful / len(results):.1f}%")

if successful > 0:
    # Stats for successful extractions
    successful_results = [r["result"] for r in results if "error" not in r["result"]]
    
    avg_total = sum(len(r) for r in successful_results) / len(successful_results)
    avg_nonzero = sum(sum(1 for v in r.values() if v != 0) for r in successful_results) / len(successful_results)
    
    print(f"\n📈 Successful Extraction Statistics:")
    print(f"   Avg total parameters: {avg_total:.0f}")
    print(f"   Avg non-zero parameters: {avg_nonzero:.0f}")
    print(f"   Avg zero-padding: {avg_total - avg_nonzero:.0f} ({100 * (avg_total - avg_nonzero) / avg_total:.1f}%)")

print("\n" + "=" * 80)

## 7. Interactive Testing

Test the model with your own custom instructions!

In [None]:
# Interactive testing
custom_instruction = "Create a sphere with radius 8mm"

print("=" * 80)
print("CUSTOM TEST")
print("=" * 80)
print(f"\n📝 Your instruction: {custom_instruction}")
print("-" * 80)

result = extract_parameters(custom_instruction, max_new_tokens=4096, verbose=True)

print("-" * 80)

if "error" in result:
    print(f"❌ Error: {result['error']}")
    print(f"\n   Raw output preview:\n{result.get('raw_output_preview', '')[:1000]}")
else:
    print(f"✅ Extracted {len(result)} parameters")
    print(f"   Non-zero: {sum(1 for v in result.values() if v != 0)}")
    
    print(f"\n   All non-zero parameters:")
    for key, value in result.items():
        if value != 0:
            print(f"      {key}: {value}")

print("\n" + "=" * 80)