# SDK-Bench: Step-by-Step Evaluation Pipeline

This notebook demonstrates the complete SDK-Bench evaluation pipeline, showing:
1. How prompts are constructed
2. What each metric measures
3. How evaluation works
4. Actual results on a sample task

We'll use a real sample from the SDK-Bench dataset and walk through every stage.

## Setup and Imports

In [1]:
import sys
import json
from pathlib import Path
from typing import Dict, List, Optional
import os

# Add parent directory to path
sys.path.insert(0, str(Path.cwd().parent))

# Import SDK-Bench modules
from sdkbench.llm import LLMConfig, AnthropicProvider
from sdkbench.llm.prompt_builder import PromptBuilder
from sdkbench.llm.solution_generator import SolutionGenerator
from sdkbench.evaluator import Evaluator
from sdkbench.core.ground_truth import GroundTruth

# Note: The actual metric evaluator classes are imported but not used directly
# The Evaluator class handles all metric evaluations internally
# If you need to use them directly, import like this:
# from sdkbench.metrics import (
#     IAccEvaluator, CCompEvaluator, IPAEvaluator,
#     FCorrEvaluator, CQEvaluator, SemSimEvaluator
# )

# For pretty printing
from IPython.display import display, Markdown, HTML
import pprint
pp = pprint.PrettyPrinter(indent=2)

In [2]:
from dotenv import load_dotenv

load_dotenv('/Users/arshath/play/naptha/better-onboarding/SDKBench/.env')

True

## 1. Load a Sample Task

Let's load a sample task and understand its structure.

In [3]:
# Choose a sample task
SAMPLE_ID = "task1_init_001"  # Basic initialization task
sample_path = Path("../samples") / SAMPLE_ID

# Load metadata
metadata_path = sample_path / "expected" / "metadata.json"
with open(metadata_path) as f:
    metadata = json.load(f)

print("Sample Task:", SAMPLE_ID)
print("\nTask Metadata:")
pp.pprint(metadata)

Sample Task: task1_init_001

Task Metadata:
{ 'clerk_version': '5.0.0',
  'description': 'Initialize Clerk authentication by wrapping the application '
                 'with ClerkProvider',
  'difficulty': 'easy',
  'estimated_lines': 10,
  'evaluation_targets': { 'c_comp': { 'optional_env_vars': 0,
                                      'required_env_vars': 2},
                          'f_corr': { 'expected_pass': True,
                                      'test_command': 'npm test -- '
                                                      'init.test.ts'},
                          'i_acc': { 'correct_file': 'app/layout.tsx',
                                     'correct_imports': [ 'ClerkProvider from '
                                                          '@clerk/nextjs'],
                                     'correct_pattern': 'ClerkProvider'}},
  'framework': 'nextjs',
  'ground_truth': { 'ingredients': { 'configuration': { 'env_vars': [ 'NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY',
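
The `evaluation_targets` block is what the metric evaluators check against. As a rough illustration, the per-metric targets can be flattened into a short summary. The dictionary below copies only the fields visible in the output above, and `summarize_targets` is a hypothetical helper, not part of SDK-Bench.

```python
# Excerpt of the metadata printed above; only fields visible in that output are used.
metadata_excerpt = {
    "clerk_version": "5.0.0",
    "difficulty": "easy",
    "framework": "nextjs",
    "evaluation_targets": {
        "c_comp": {"optional_env_vars": 0, "required_env_vars": 2},
        "f_corr": {"expected_pass": True, "test_command": "npm test -- init.test.ts"},
        "i_acc": {
            "correct_file": "app/layout.tsx",
            "correct_imports": ["ClerkProvider from @clerk/nextjs"],
            "correct_pattern": "ClerkProvider",
        },
    },
}


def summarize_targets(meta: dict) -> dict:
    """Flatten the per-metric targets into one small dict (hypothetical helper)."""
    targets = meta.get("evaluation_targets", {})
    return {
        "file": targets.get("i_acc", {}).get("correct_file"),
        "pattern": targets.get("i_acc", {}).get("correct_pattern"),
        "required_env_vars": targets.get("c_comp", {}).get("required_env_vars", 0),
        "test_command": targets.get("f_corr", {}).get("test_command"),
    }


target_summary = summarize_targets(metadata_excerpt)
print(target_summary)
```

Using `.get()` with defaults keeps the helper tolerant of samples that omit a metric's targets.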


## 2. Understanding the Task Types

SDK-Bench has 6 task types that test different aspects of SDK instrumentation:

In [4]:
task_types = {
    1: {
        "name": "Initialization & Configuration",
        "description": "Basic SDK setup and initialization",
        "example": "Add Clerk authentication to a Next.js app",
        "key_challenges": [
            "Correct import statements",
            "Proper component wrapping",
            "Environment variable configuration"
        ]
    },
    2: {
        "name": "Basic Feature Integration",
        "description": "Implement core SDK features",
        "example": "Add sign-in/sign-out buttons",
        "key_challenges": [
            "Using correct SDK hooks",
            "Proper UI component placement",
            "State management"
        ]
    },
    3: {
        "name": "Advanced Feature Integration",
        "description": "Complex SDK feature implementation",
        "example": "Multi-factor authentication or custom user metadata",
        "key_challenges": [
            "Complex API interactions",
            "Error handling",
            "Advanced configuration"
        ]
    },
    4: {
        "name": "Debugging & Troubleshooting",
        "description": "Fix broken SDK implementations",
        "example": "Fix authentication flow that's not working",
        "key_challenges": [
            "Identifying the bug",
            "Understanding error messages",
            "Applying correct fix"
        ]
    },
    5: {
        "name": "Migration",
        "description": "Upgrade SDK to newer version",
        "example": "Migrate from Clerk v4 to v5",
        "key_challenges": [
            "API breaking changes",
            "Deprecated methods",
            "New patterns adoption"
        ]
    },
    6: {
        "name": "Performance Optimization",
        "description": "Optimize SDK usage for better performance",
        "example": "Reduce bundle size or optimize API calls",
        "key_challenges": [
            "Identifying bottlenecks",
            "Applying optimizations",
            "Maintaining functionality"
        ]
    }
}

current_task_type = metadata['task_type']
print(f"Current Sample Task Type: {current_task_type}")
print(f"Name: {task_types[current_task_type]['name']}")
print(f"Description: {task_types[current_task_type]['description']}")
print(f"\nKey Challenges:")
for challenge in task_types[current_task_type]['key_challenges']:
    print(f"  - {challenge}")

Current Sample Task Type: 1
Name: Initialization & Configuration
Description: Basic SDK setup and initialization

Key Challenges:
  - Correct import statements
  - Proper component wrapping
  - Environment variable configuration


## 3. Build the Prompt

Now let's see how the prompt is constructed for the LLM. The prompt has two parts:
1. **System Prompt**: Sets the context and role
2. **User Prompt**: Provides the specific task

In [5]:
# Build the prompt
builder = PromptBuilder()
input_dir = sample_path / "input"

system_prompt, user_prompt = builder.build_from_metadata(metadata_path, input_dir)

print("="*80)
print("SYSTEM PROMPT")
print("="*80)
print(system_prompt)
print("\n" + "="*80)
print("USER PROMPT")
print("="*80)
print(user_prompt[:2000] + "..." if len(user_prompt) > 2000 else user_prompt)

SYSTEM PROMPT
You are an expert developer specializing in authentication integration.
You are helping integrate Clerk authentication (version 5.0.0) into a nextjs application.


Clerk is a complete authentication and user management solution for modern web applications.

Key Concepts:
1. ClerkProvider: Wraps your React app to provide authentication context
2. Middleware: Protects routes on the server-side
3. Hooks: Access user data and auth state in React components
4. Server-side helpers: Get auth state in server components and API routes

Clerk v5 (Latest):
- Package: @clerk/nextjs
- Middleware: clerkMiddleware()
- Server imports: @clerk/nextjs/server
- Client imports: @clerk/nextjs

Clerk v4 (Legacy):
- Package: @clerk/nextjs@4
- Middleware: authMiddleware()
- Different import paths

Environment Variables:
- NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY: Your publishable key
- CLERK_SECRET_KEY: Your secret key
- Optional URL configs for custom sign-in/up pages


Your responses should:
1. Provid
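
To make the system/user split concrete, here is a minimal sketch of how a two-part prompt could be assembled from task metadata and input files. It is an illustration only: `build_prompts_sketch` and its parameters are hypothetical, and the real `PromptBuilder` may structure its prompts differently.

```python
from typing import Dict, Tuple


def build_prompts_sketch(
    description: str,
    sdk: str,
    version: str,
    framework: str,
    files: Dict[str, str],
) -> Tuple[str, str]:
    """Return (system_prompt, user_prompt) for one task (hypothetical helper)."""
    # System prompt: role and context, independent of the specific task.
    system = (
        "You are an expert developer specializing in authentication integration.\n"
        f"You are helping integrate {sdk} authentication (version {version}) "
        f"into a {framework} application."
    )
    # User prompt: the task description followed by the current project files.
    file_sections = "\n\n".join(
        f"// filepath: {path}\n{content}" for path, content in sorted(files.items())
    )
    user = f"Task: {description}\n\nCurrent project files:\n\n{file_sections}"
    return system, user


system_demo, user_demo = build_prompts_sketch(
    description="Initialize Clerk authentication by wrapping the application with ClerkProvider",
    sdk="Clerk",
    version="5.0.0",
    framework="nextjs",
    files={"app/layout.tsx": "export default function RootLayout() {}"},
)
print(system_demo)
print(user_demo)
```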

## 4. Understanding the Prompt Structure

Let's break down what's included in the prompts:

In [6]:
# Analyze prompt components
print("System Prompt Components:")
print(f"  - Length: {len(system_prompt)} characters")
print(f"  - Sets role as SDK integration expert")
print(f"  - Provides response format instructions")
print(f"  - Specifies file output format")

print("\nUser Prompt Components:")
print(f"  - Length: {len(user_prompt)} characters")
print(f"  - Task description: {metadata['description'][:100]}...")
print(f"  - SDK: {metadata.get('sdk_name', 'Clerk')} v{metadata['clerk_version']}")
print(f"  - Framework: {metadata['framework']}")
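# Note: glob('**/*') also matches directories (here, app/), so this count can
# exceed the number of files listed below.
print(f"  - Number of input files: {len(list(input_dir.glob('**/*'))) if input_dir.exists() else 0}")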

# Show input files if they exist
if input_dir.exists():
    print("\nInput Files Provided:")
    for file in input_dir.rglob('*'):
        if file.is_file():
            rel_path = file.relative_to(input_dir)
            size = file.stat().st_size
            print(f"  - {rel_path} ({size} bytes)")

System Prompt Components:
  - Length: 1290 characters
  - Sets role as SDK integration expert
  - Provides response format instructions
  - Specifies file output format

User Prompt Components:
  - Length: 1296 characters
  - Task description: Initialize Clerk authentication by wrapping the application with ClerkProvider...
  - SDK: Clerk v5.0.0
  - Framework: nextjs
  - Number of input files: 5

Input Files Provided:
  - package.json (262 bytes)
  - .env.example (33 bytes)
  - app/layout.tsx (239 bytes)
  - app/page.tsx (146 bytes)
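
The output reports 5 input files but lists only 4. The sketch below reproduces that difference in a temporary directory with the same layout: `glob('**/*')` counts the `app/` directory as an entry, while filtering on `is_file()` counts files only.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app").mkdir()
    # Same layout as the sample's input directory shown above
    for name in ["package.json", ".env.example", "app/layout.tsx", "app/page.tsx"]:
        (root / name).write_text("")

    # Counts every entry under root, including the app/ directory
    all_entries = len(list(root.glob("**/*")))
    # Counts regular files only
    files_only = sum(1 for p in root.rglob("*") if p.is_file())

print(f"entries: {all_entries}, files: {files_only}")
```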


## 5. Generate Solution (Optional - Uses LLM)

You can either:
1. Use an existing solution from previous runs
2. Generate a new solution using the LLM (requires API key)

For demonstration, we'll show both options.

In [7]:
# Option 1: Use existing solution
USE_EXISTING = True
existing_solution_dir = Path("../results/llm_solutions") / SAMPLE_ID / "claude-3-haiku-20240307"

if USE_EXISTING and existing_solution_dir.exists():
    print("Using existing solution from:", existing_solution_dir)
    solution_dir = existing_solution_dir
    
    # Show solution files
    print("\nSolution Files:")
    for file in solution_dir.rglob('*'):
        if file.is_file() and file.name not in ['metadata.json', 'llm_response.txt']:
            rel_path = file.relative_to(solution_dir)
            print(f"\n{rel_path}:")
            print("-" * 40)
            with open(file) as f:
                content = f.read()
                print(content[:500] + "..." if len(content) > 500 else content)
else:
    print("To generate a new solution, set USE_EXISTING=False and provide API key")
    print("Example:")
    print("  os.environ['ANTHROPIC_API_KEY'] = 'your-key'")
    print("  # Then run the generation code below")

Using existing solution from: ../results/llm_solutions/task1_init_001/claude-3-haiku-20240307

Solution Files:

.env.local:
----------------------------------------
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=your_clerk_publishable_key
CLERK_SECRET_KEY=your_clerk_secret_key

package.json:
----------------------------------------
{
  "name": "nextjs-app",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start"
  },
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "next": "^14.0.0"
  }
}

generation_metadata.json:
----------------------------------------
{
  "sample_id": "task1_init_001",
  "model": "claude-3-haiku-20240307",
  "generated_at": "2025-11-23T20:59:25.597304",
  "files_generated": [
    ".env.local",
    "package.json",
    ".env.example",
    "llm_response.txt",
    "app/layout.tsx",
    "app/page.tsx"
  ]
}

.env.example:
----------------------------------------
# Add environment v

In [8]:
# Option 2: Generate new solution (requires API key)
GENERATE_NEW = True  # Set to True to generate

if GENERATE_NEW:
    # Check for API key
    if not os.getenv('ANTHROPIC_API_KEY'):
        print("Please set ANTHROPIC_API_KEY environment variable")
    else:
        # Configure LLM
        config = LLMConfig(
            model="claude-3-haiku-20240307",
            temperature=0.1,
            max_tokens=4000,
            api_key=os.getenv('ANTHROPIC_API_KEY')
        )
        
        # Create provider and generate
        provider = AnthropicProvider(config)
        response = provider.generate(user_prompt, system_prompt)
        
        print("LLM Response:")
        print(f"  - Model: {response.model}")
        print(f"  - Tokens: {response.tokens_used}")
        print(f"  - Cost: ${response.cost:.4f}")
        print(f"  - Latency: {response.latency_ms:.0f}ms")
        
        # Generate solution files
        generator = SolutionGenerator()
        solution_dir = generator.generate_solution(
            response.content,
            Path("../temp_solutions"),
            SAMPLE_ID,
            config.model,
            copy_input=input_dir if input_dir.exists() else None
        )
        
        print(f"\nSolution generated at: {solution_dir}")

LLM Response:
  - Model: claude-3-haiku-20240307
  - Tokens: 1331
  - Cost: $0.0009
  - Latency: 4538ms

Solution generated at: ../temp_solutions/task1_init_001/claude-3-haiku-20240307


In [9]:
print(response.content)

To initialize Clerk authentication in your NextJS application, follow these steps:

// filepath: app/layout.tsx
```typescript
import { ClerkProvider } from "@clerk/nextjs";

export default function RootLayout({
  children,
}: {
  children: React.ReactNode
}) {
  return (
    <html lang="en">
      <body>
        <ClerkProvider>
          {children}
        </ClerkProvider>
      </body>
    </html>
  )
}
```

In this updated `app/layout.tsx` file, we have wrapped the entire application with the `ClerkProvider` component. This ensures that the Clerk authentication context is available throughout the application.

To use Clerk's features, you'll need to configure the necessary environment variables. Create a `.env.local` file in the root of your project and add the following variables:

// filepath: .env.local
```
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=your_clerk_publishable_key
CLERK_SECRET_KEY=your_clerk_secret_key
```

Replace `your_clerk_publishable_key` and `your_clerk_secret_key` with t
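
The response marks each file with a `// filepath:` comment followed by a fenced block. As a minimal sketch of how such a response could be split into files, the regex below pairs each marker with the fenced body that follows it. This is an assumption about the parsing step, not the actual `SolutionGenerator` implementation, and `extract_files` is a hypothetical name.

```python
import re
from typing import Dict

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block

# A "// filepath: <path>" line, then an opening fence (with optional language tag),
# then the body up to the next closing fence.
FILE_BLOCK = re.compile(
    r"// filepath: (?P<path>\S+)[ \t]*\n"
    + re.escape(FENCE) + r"[^\n]*\n"
    + r"(?P<body>.*?)\n"
    + re.escape(FENCE),
    re.DOTALL,
)


def extract_files(response_text: str) -> Dict[str, str]:
    """Map each announced file path to the content of its fenced block."""
    return {m.group("path"): m.group("body") for m in FILE_BLOCK.finditer(response_text)}


sample_response = "\n".join([
    "Create a `.env.local` file:",
    "",
    "// filepath: .env.local",
    FENCE,
    "NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=your_clerk_publishable_key",
    "CLERK_SECRET_KEY=your_clerk_secret_key",
    FENCE,
])

extracted = extract_files(sample_response)
for path, body in extracted.items():
    print(path)
    print(body)
```

Prose between blocks is ignored, so explanatory text in the response does not end up in any generated file.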

## 6. Evaluation Metrics

SDK-Bench uses 6 metrics to evaluate the quality of SDK instrumentation. Let's understand each one:

In [10]:
metrics_explanation = {
    "I-ACC (Implementation Accuracy)": {
        "measures": "Whether the solution implements the required SDK functionality",
        "how": "Checks if key SDK methods and components are present",
        "range": "0-100%",
        "example": "For Clerk: checks if ClerkProvider is used, authentication hooks are called",
        "weight": 0.3
    },
    "C-COMP (Configuration Completeness)": {
        "measures": "Whether all required configuration is present",
        "how": "Verifies environment variables, config files, and settings",
        "range": "0-100%",
        "example": "Checks if .env has CLERK_PUBLISHABLE_KEY and CLERK_SECRET_KEY",
        "weight": 0.2
    },
    "IPA (Integration Point Accuracy)": {
        "measures": "Whether SDK is integrated at the correct locations",
        "how": "Compares file paths and integration points with ground truth",
        "range": "0-1 (F1 score)",
        "example": "Checks if authentication is added to the right components",
        "weight": 0.15
    },
    "F-CORR (Functional Correctness)": {
        "measures": "Whether the code compiles and runs without errors",
        "how": "Runs build/test commands and checks for errors",
        "range": "0-100%",
        "example": "npm run build && npm test",
        "weight": 0.15
    },
    "CQ (Code Quality)": {
        "measures": "Code quality and best practices",
        "how": "Checks formatting, patterns, error handling",
        "range": "0-100%",
        "example": "Proper imports, no unused variables, follows framework patterns",
        "weight": 0.1
    },
    "SEM-SIM (Semantic Similarity)": {
        "measures": "How similar the solution is to the reference implementation",
        "how": "Compares code structure and logic using embeddings/AST",
        "range": "0-100%",
        "example": "Similar variable names, function structure, logic flow",
        "weight": 0.1
    }
}

for metric, details in metrics_explanation.items():
    print(f"\n{metric} (Weight: {details['weight']*100}%)")
    print("="*60)
    print(f"Measures: {details['measures']}")
    print(f"How: {details['how']}")
    print(f"Range: {details['range']}")
    print(f"Example: {details['example']}")


I-ACC (Implementation Accuracy) (Weight: 30.0%)
Measures: Whether the solution implements the required SDK functionality
How: Checks if key SDK methods and components are present
Range: 0-100%
Example: For Clerk: checks if ClerkProvider is used, authentication hooks are called

C-COMP (Configuration Completeness) (Weight: 20.0%)
Measures: Whether all required configuration is present
How: Verifies environment variables, config files, and settings
Range: 0-100%
Example: Checks if .env has CLERK_PUBLISHABLE_KEY and CLERK_SECRET_KEY

IPA (Integration Point Accuracy) (Weight: 15.0%)
Measures: Whether SDK is integrated at the correct locations
How: Compares file paths and integration points with ground truth
Range: 0-1 (F1 score)
Example: Checks if authentication is added to the right components

F-CORR (Functional Correctness) (Weight: 15.0%)
Measures: Whether the code compiles and runs without errors
How: Runs build/test commands and checks for errors
Range: 0-100%
Example: npm run build
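
Two properties of the table above matter when combining metrics: the weights should sum to 1, and IPA is reported on a 0-1 scale while the others use percentages. A small sanity check, written as a sketch with hypothetical names (`metric_weights`, `to_percent`), might look like this:

```python
import math

# Weights as listed in metrics_explanation above
metric_weights = {
    "I-ACC": 0.30,
    "C-COMP": 0.20,
    "IPA": 0.15,
    "F-CORR": 0.15,
    "CQ": 0.10,
    "SEM-SIM": 0.10,
}

total_weight = sum(metric_weights.values())
# Compare with a tolerance because the weights are floats
weights_ok = math.isclose(total_weight, 1.0)


def to_percent(metric: str, value: float) -> float:
    """Put every metric on a 0-100 scale; only IPA (an F1 score) needs rescaling."""
    return value * 100 if metric == "IPA" else value


print(f"weights sum to 1: {weights_ok}")
print(f"IPA 0.75 -> {to_percent('IPA', 0.75):.1f}%")
```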

## 7. Run Evaluation

Now let's evaluate the solution. The cell below uses the quick evaluation path, which skips the build and test step:

In [11]:
# Create evaluator
evaluator = Evaluator(solution_dir, metadata_path=metadata_path)

# Run quick evaluation (without build/test)
print("Running evaluation...")
result = evaluator.evaluate_quick()

print("\nEvaluation Complete!")
print("="*60)
print(f"Overall Score: {result.overall_score:.1f}%")
print("="*60)

Running evaluation...

Evaluation Complete!
Overall Score: 33.5%


## 8. Detailed Metric Results

Let's examine each metric's evaluation in detail:

In [12]:
# I-ACC: Implementation Accuracy
print("1. I-ACC (Implementation Accuracy)")
print("="*60)
if result.i_acc:
    print(f"Score: {result.i_acc.score:.1f}%")
    print(f"\nComponent Breakdown:")
    print(f"  - File Location Correct: {'✓' if result.i_acc.file_location_correct else '✗'}")
    print(f"  - Imports Correct: {'✓' if result.i_acc.imports_correct else '✗'}")
    print(f"  - Pattern Correct: {'✓' if result.i_acc.pattern_correct else '✗'}")
    print(f"  - Placement Correct: {'✓' if result.i_acc.placement_correct else '✗'}")
    
    print(f"\nWeighting:")
    print(f"  - File Location: 20% {'(+20.0%)' if result.i_acc.file_location_correct else '(0.0%)'}")
    print(f"  - Imports: 20% {'(+20.0%)' if result.i_acc.imports_correct else '(0.0%)'}")
    print(f"  - Pattern: 30% {'(+30.0%)' if result.i_acc.pattern_correct else '(0.0%)'}")
    print(f"  - Placement: 30% {'(+30.0%)' if result.i_acc.placement_correct else '(0.0%)'}")
    
    if result.i_acc.details:
        print("\nAdditional Details:")
        for key, value in result.i_acc.details.items():
            print(f"  - {key}: {value}")
else:
    print("Not evaluated")

print("\n" + "-"*60)

1. I-ACC (Implementation Accuracy)
Score: 100.0%

Component Breakdown:
  - File Location Correct: ✓
  - Imports Correct: ✓
  - Pattern Correct: ✓
  - Placement Correct: ✓

Weighting:
  - File Location: 20% (+20.0%)
  - Imports: 20% (+20.0%)
  - Pattern: 30% (+30.0%)
  - Placement: 30% (+30.0%)

------------------------------------------------------------
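
The weighting printed above (20/20/30/30) implies that I-ACC is a sum of fixed contributions from four pass/fail checks. A minimal sketch of that arithmetic follows. `IACC_WEIGHTS` and `i_acc_score` are hypothetical names, and the real evaluator may compute the score differently.

```python
from typing import Dict

# Percentage contribution of each check, as printed in the Weighting section above
IACC_WEIGHTS = {
    "file_location": 20,
    "imports": 20,
    "pattern": 30,
    "placement": 30,
}


def i_acc_score(checks: Dict[str, bool]) -> float:
    """Sum the weights of the checks that passed; missing checks count as failed."""
    return float(sum(w for name, w in IACC_WEIGHTS.items() if checks.get(name, False)))


all_pass = i_acc_score({name: True for name in IACC_WEIGHTS})
partial = i_acc_score({"file_location": True, "imports": True})
print(all_pass, partial)
```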


In [13]:
# C-COMP: Configuration Completeness  
print("2. C-COMP (Configuration Completeness)")
print("="*60)
if result.c_comp:
    print(f"Score: {result.c_comp.score:.1f}%")
    print(f"\nComponent Scores:")
    print(f"  - Environment Variables: {result.c_comp.env_vars_score*100:.1f}%")
    print(f"  - Provider Properties: {result.c_comp.provider_props_score*100:.1f}%")
    print(f"  - Middleware Config: {result.c_comp.middleware_config_score*100:.1f}%")
    
    print(f"\nWeighting:")
    print(f"  - Environment Variables: 50% (contributes {result.c_comp.env_vars_score*50:.1f}%)")
    print(f"  - Provider Properties: 30% (contributes {result.c_comp.provider_props_score*30:.1f}%)")
    print(f"  - Middleware Config: 20% (contributes {result.c_comp.middleware_config_score*20:.1f}%)")
    
    if result.c_comp.missing_env_vars:
        print(f"\nMissing Environment Variables:")
        for var in result.c_comp.missing_env_vars:
            print(f"  ✗ {var}")

    if result.c_comp.missing_provider_props:
        print(f"\nMissing Provider Properties:")
        for prop in result.c_comp.missing_provider_props:
            print(f"  ✗ {prop}")

    if result.c_comp.missing_middleware_config:
        print(f"\nMissing Middleware Config:")
        for config in result.c_comp.missing_middleware_config:
            print(f"  ✗ {config}")
else:
    print("Not evaluated")

print("\n" + "-"*60)

2. C-COMP (Configuration Completeness)
Score: 0.0%

Component Scores:
  - Environment Variables: 0.0%
  - Provider Properties: 0.0%
  - Middleware Config: 0.0%

Weighting:
  - Environment Variables: 50% (contributes 0.0%)
  - Provider Properties: 30% (contributes 0.0%)
  - Middleware Config: 20% (contributes 0.0%)

------------------------------------------------------------
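
The C-COMP weighting above (50/30/20) can be read as a weighted sum of three component scores, each between 0 and 1. The sketch below shows that arithmetic under this assumption. `c_comp_score` is a hypothetical helper, not the evaluator's own code.

```python
def c_comp_score(env_vars: float, provider_props: float, middleware_config: float) -> float:
    """Combine component scores in [0, 1] into a 0-100 score using 50/30/20 weights."""
    for value in (env_vars, provider_props, middleware_config):
        if not 0.0 <= value <= 1.0:
            raise ValueError("component scores must be between 0 and 1")
    return 50 * env_vars + 30 * provider_props + 20 * middleware_config


# All components at zero, as in the output above
zero_case = c_comp_score(0.0, 0.0, 0.0)
# Environment variables complete, nothing else configured
env_only = c_comp_score(1.0, 0.0, 0.0)
print(zero_case, env_only)
```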


In [14]:
# IPA: Integration Point Accuracy
print("3. IPA (Integration Point Accuracy)")
print("="*60)
if result.ipa:
    print(f"F1 Score: {result.ipa.f1:.3f}")
    print(f"Precision: {result.ipa.precision:.3f}")
    print(f"Recall: {result.ipa.recall:.3f}")
    
    print(f"\nTrue Positives (Correct files): {len(result.ipa.true_positives)}")
    for file in result.ipa.true_positives[:5]:  # Show first 5
        print(f"  ✓ {file}")

    print(f"\nFalse Positives (Extra files): {len(result.ipa.false_positives)}")
    for file in result.ipa.false_positives[:5]:
        print(f"  ✗ {file}")

    print(f"\nFalse Negatives (Missing files): {len(result.ipa.false_negatives)}")
    for file in result.ipa.false_negatives[:5]:
        print(f"  ✗ {file}")
else:
    print("Not evaluated")

print("\n" + "-"*60)

3. IPA (Integration Point Accuracy)
F1 Score: 1.000
Precision: 1.000
Recall: 1.000

True Positives (Correct files): 0

False Positives (Extra files): 0

False Negatives (Missing files): 0

------------------------------------------------------------
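
Note that the output above reports F1 = 1.000 with zero true positives, false positives, and false negatives. One way to get that result is to treat "nothing expected, nothing predicted" as a perfect match. The sketch below computes precision, recall, and F1 over sets of file paths with that convention. The convention is an assumption inferred from the output, and `ipa_f1` is a hypothetical helper.

```python
from typing import Set, Tuple


def ipa_f1(predicted: Set[str], expected: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for predicted vs expected integration files."""
    if not predicted and not expected:
        # Assumed convention: empty vs empty counts as a perfect match
        return 1.0, 1.0, 1.0
    true_positives = predicted & expected
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


empty_case = ipa_f1(set(), set())
extra_file_case = ipa_f1({"app/layout.tsx", "app/page.tsx"}, {"app/layout.tsx"})
print(empty_case)
print(extra_file_case)
```

In the second call, one unnecessary file halves precision while recall stays at 1.0, which pulls F1 down to about 0.667.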


In [15]:
# CQ: Code Quality
print("4. CQ (Code Quality)")
print("="*60)
if result.cq:
    print(f"Score: {result.cq.score:.1f}%")
    print(f"\nIssues Found:")
    print(f"  - Type Errors: {result.cq.type_errors}")
    print(f"  - ESLint Errors: {result.cq.eslint_errors}")
    print(f"  - Security Issues: {result.cq.security_issues}")
    
    print(f"\nScore Calculation:")
    print(f"  Base Score: 100%")
    if result.cq.type_errors > 0:
        print(f"  - Type Errors: -{result.cq.type_errors * 5}% ({result.cq.type_errors} × 5)")
    if result.cq.eslint_errors > 0:
        print(f"  - ESLint Errors: -{result.cq.eslint_errors * 2}% ({result.cq.eslint_errors} × 2)")
    if result.cq.security_issues > 0:
        print(f"  - Security Issues: -{result.cq.security_issues * 20}% ({result.cq.security_issues} × 20)")
    print(f"  Final Score: {result.cq.score:.1f}%")
    
    if result.cq.type_error_details:
        print("\nType Error Details:")
        for error in result.cq.type_error_details[:5]:  # Show first 5
            print(f"  - {error}")
            
    if result.cq.eslint_error_details:
        print("\nESLint Error Details:")
        for error in result.cq.eslint_error_details[:5]:  # Show first 5
            print(f"  - {error}")
            
    if result.cq.security_issue_details:
        print("\nSecurity Issue Details:")
        for issue in result.cq.security_issue_details[:5]:  # Show first 5
            print(f"  - {issue}")
else:
    print("Not evaluated")

print("\n" + "-"*60)

4. CQ (Code Quality)
Score: 100.0%

Issues Found:
  - Type Errors: 0
  - ESLint Errors: 0
  - Security Issues: 0

Score Calculation:
  Base Score: 100%
  Final Score: 100.0%

------------------------------------------------------------
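
The score calculation printed by the cell above starts from 100% and subtracts 5 points per type error, 2 per ESLint error, and 20 per security issue. A minimal sketch of that deduction scheme follows. Clamping at 0 is an assumption, and `cq_score` is a hypothetical helper rather than the evaluator's implementation.

```python
def cq_score(type_errors: int, eslint_errors: int, security_issues: int) -> float:
    """Start at 100 and subtract per-issue penalties; never go below 0 (assumed)."""
    penalty = 5 * type_errors + 2 * eslint_errors + 20 * security_issues
    return float(max(0, 100 - penalty))


clean = cq_score(0, 0, 0)
mixed = cq_score(type_errors=2, eslint_errors=3, security_issues=1)  # 100 - 10 - 6 - 20
print(clean, mixed)
```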


In [16]:
# SEM-SIM: Semantic Similarity
print("5. SEM-SIM (Semantic Similarity)")
print("="*60)
if result.sem_sim:
    print(f"Score: {result.sem_sim.score:.1f}%")
    print(f"\nSimilarity Score: {result.sem_sim.similarity_score:.1f}%")
    print(f"Pattern Match: {'✓' if result.sem_sim.pattern_match else '✗'}")
    print(f"Approach Match: {'✓' if result.sem_sim.approach_match else '✗'}")

    if result.sem_sim.matched_patterns:
        print(f"\nMatched Patterns ({len(result.sem_sim.matched_patterns)}):")
        for pattern in result.sem_sim.matched_patterns[:5]:  # Show first 5
            print(f"  ✓ {pattern}")

    if result.sem_sim.missing_patterns:
        print(f"\nMissing Patterns ({len(result.sem_sim.missing_patterns)}):")
        for pattern in result.sem_sim.missing_patterns[:5]:  # Show first 5
            print(f"  ✗ {pattern}")
            
    if result.sem_sim.details:
        print("\nAdditional Details:")
        for key, value in result.sem_sim.details.items():
            print(f"  - {key}: {value}")
else:
    print("Not evaluated")

print("\n" + "-"*60)

5. SEM-SIM (Semantic Similarity)
Score: 0.0%

Similarity Score: 0.0%
Pattern Match: ✗
Approach Match: ✗

------------------------------------------------------------


## 9. Issue Analysis and Recommendations

Based on the evaluation results, let's flag any issues in each metric and list recommendations:

In [17]:
# Analyze results to identify issues
print("Analysis of Results:")
print("="*60)

issues = []

# Check each metric
if result.i_acc and result.i_acc.score < 100:
    missing_components = []
    if not result.i_acc.file_location_correct:
        missing_components.append("file location")
    if not result.i_acc.imports_correct:
        missing_components.append("imports")
    if not result.i_acc.pattern_correct:
        missing_components.append("initialization pattern")
    if not result.i_acc.placement_correct:
        missing_components.append("component placement")
    if missing_components:
        issues.append(f"I-ACC issues: {', '.join(missing_components)}")

if result.c_comp and result.c_comp.score < 100:
    if result.c_comp.missing_env_vars:
        issues.append(f"Missing environment variables: {', '.join(result.c_comp.missing_env_vars)}")
    if result.c_comp.missing_provider_props:
        issues.append(f"Missing provider properties: {', '.join(result.c_comp.missing_provider_props)}")
    if result.c_comp.missing_middleware_config:
        issues.append(f"Missing middleware config: {', '.join(result.c_comp.missing_middleware_config)}")

if result.ipa and result.ipa.f1 < 1.0:
    if result.ipa.false_negatives:
        issues.append(f"Missing integration in {len(result.ipa.false_negatives)} files")
    if result.ipa.false_positives:
        issues.append(f"Unnecessary changes in {len(result.ipa.false_positives)} files")

if result.cq and result.cq.score < 100:
    quality_issues = []
    if result.cq.type_errors > 0:
        quality_issues.append(f"{result.cq.type_errors} type errors")
    if result.cq.eslint_errors > 0:
        quality_issues.append(f"{result.cq.eslint_errors} linting errors")
    if result.cq.security_issues > 0:
        quality_issues.append(f"{result.cq.security_issues} security issues")
    if quality_issues:
        issues.append(f"Code quality issues: {', '.join(quality_issues)}")

if result.sem_sim and result.sem_sim.missing_patterns:
    issues.append(f"Missing {len(result.sem_sim.missing_patterns)} expected patterns")

if issues:
    print("Issues Found:")
    for i, issue in enumerate(issues, 1):
        print(f"{i}. {issue}")
else:
    print("✓ No major issues found!")

# Recommendations
print("\nRecommendations for Improvement:")
print("-"*40)

if result.c_comp and result.c_comp.score == 0:
    print("1. Environment Configuration:")
    print("   - Ensure .env or .env.local file is created")
    print("   - Include all required API keys")
    print("   - Use correct variable names (NEXT_PUBLIC_* for client-side)")

if result.sem_sim and result.sem_sim.score == 0:
    print("\n2. Code Structure:")
    print("   - Follow framework conventions")
    print("   - Use standard file naming")
    print("   - Match expected component structure")

if result.ipa and result.ipa.f1 < 0.8:
    print("\n3. Integration Points:")
    print("   - Review which files need SDK integration")  
    print("   - Avoid modifying unrelated files")
    print("   - Focus on the specific task requirements")

Analysis of Results:
✓ No major issues found!

Recommendations for Improvement:
----------------------------------------
1. Environment Configuration:
   - Ensure .env or .env.local file is created
   - Include all required API keys
   - Use correct variable names (NEXT_PUBLIC_* for client-side)

2. Code Structure:
   - Follow framework conventions
   - Use standard file naming
   - Match expected component structure


## 10. Overall Score Breakdown

The overall score is a weighted average of the per-metric scores. The cell below recomputes it from the weights and compares it with the stored value:

In [18]:
# Show how overall score is calculated
print("Overall Score Calculation:")
print("="*60)

weights = {
    "I-ACC": 0.30,
    "C-COMP": 0.20,
    "IPA": 0.15,
    "F-CORR": 0.15,
    "CQ": 0.10,
    "SEM-SIM": 0.10
}

scores = {
    "I-ACC": result.i_acc.score if result.i_acc else 0,
    "C-COMP": result.c_comp.score if result.c_comp else 0,
    "IPA": result.ipa.f1 * 100 if result.ipa else 0,  # Convert to percentage
    "F-CORR": result.f_corr.score if result.f_corr else 0,
    "CQ": result.cq.score if result.cq else 0,
    "SEM-SIM": result.sem_sim.score if result.sem_sim else 0
}

weighted_sum = 0
for metric, weight in weights.items():
    score = scores[metric]
    contribution = score * weight
    weighted_sum += contribution
    print(f"{metric:10} {score:6.1f}% × {weight:.2f} = {contribution:6.2f}%")

print("-"*40)
print(f"{'Overall':10} {weighted_sum:6.1f}%")
print(f"\nStored Overall Score: {result.overall_score:.1f}%")

Overall Score Calculation:
I-ACC       100.0% × 0.30 =  30.00%
C-COMP        0.0% × 0.20 =   0.00%
IPA         100.0% × 0.15 =  15.00%
F-CORR        0.0% × 0.15 =   0.00%
CQ          100.0% × 0.10 =  10.00%
SEM-SIM       0.0% × 0.10 =   0.00%
----------------------------------------
Overall      55.0%

Stored Overall Score: 33.5%
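
The recomputed total (55.0%) does not match the stored score (33.5%), so the stored value is evidently not this plain weighted sum. The sketch below repeats the arithmetic with the values printed above and flags the mismatch. The names `demo_weights`, `demo_scores`, and the 0.05 tolerance are illustrative assumptions, and the sketch does not explain where 33.5% comes from.

```python
import math

demo_weights = {"I-ACC": 0.30, "C-COMP": 0.20, "IPA": 0.15,
                "F-CORR": 0.15, "CQ": 0.10, "SEM-SIM": 0.10}
# Per-metric scores (in percent) copied from the output above
demo_scores = {"I-ACC": 100.0, "C-COMP": 0.0, "IPA": 100.0,
               "F-CORR": 0.0, "CQ": 100.0, "SEM-SIM": 0.0}
stored_overall = 33.5  # value reported by the evaluator above

# Plain weighted sum over all six metrics
recomputed = sum(demo_scores[m] * w for m, w in demo_weights.items())

# Tolerance of 0.05 because both values are shown to one decimal place
mismatch = not math.isclose(recomputed, stored_overall, abs_tol=0.05)

print(f"recomputed: {recomputed:.1f}%, stored: {stored_overall:.1f}%")
if mismatch:
    print("Stored score differs from the plain weighted sum; check how the evaluator aggregates.")
```

A check like this is worth keeping next to any hand-rolled score breakdown, since it surfaces a disagreement between the documented weights and the evaluator's actual aggregation.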


In [19]:
# Analyze results to identify issues
print("Analysis of Results:")
print("="*60)

issues = []

# Check each metric
if result.i_acc and result.i_acc.score < 100:
    issues.append(f"Missing SDK features: {result.i_acc.features_missing}")

if result.c_comp and result.c_comp.score < 100:
    if result.c_comp.missing_env_vars:
        issues.append(f"Missing environment variables: {result.c_comp.missing_env_vars}")
    if result.c_comp.missing_dependencies:
        issues.append(f"Missing dependencies: {result.c_comp.missing_dependencies}")

if result.ipa and result.ipa.f1 < 1.0:
    if result.ipa.false_negatives:
        issues.append(f"Missing integration in {len(result.ipa.false_negatives)} files")
    if result.ipa.false_positives:
        issues.append(f"Unnecessary changes in {len(result.ipa.false_positives)} files")

if result.cq and result.cq.score < 100:
    issues.append(f"Code quality issues: {len(result.cq.issues) if result.cq.issues else 0} found")

if issues:
    print("Issues Found:")
    for i, issue in enumerate(issues, 1):
        print(f"{i}. {issue}")
else:
    print("✓ No major issues found!")

# Recommendations
print("\nRecommendations for Improvement:")
print("-"*40)

if result.c_comp and result.c_comp.score == 0:
    print("1. Environment Configuration:")
    print("   - Ensure .env or .env.local file is created")
    print("   - Include all required API keys")
    print("   - Use correct variable names (NEXT_PUBLIC_* for client-side)")

if result.sem_sim and result.sem_sim.score == 0:
    print("\n2. Code Structure:")
    print("   - Follow framework conventions")
    print("   - Use standard file naming")
    print("   - Match expected component structure")

if result.ipa and result.ipa.f1 < 0.8:
    print("\n3. Integration Points:")
    print("   - Review which files need SDK integration")  
    print("   - Avoid modifying unrelated files")
    print("   - Focus on the specific task requirements")

Analysis of Results:


AttributeError: 'CCompResult' object has no attribute 'missing_dependencies'
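
The traceback above shows that `CCompResult` has no `missing_dependencies` attribute. One way to keep the analysis cell running is to read optional attributes with `getattr` and a default. The sketch below illustrates the pattern on `FakeCCompResult`, a hypothetical stand-in whose fields are assumptions; the real `CCompResult` may expose different names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeCCompResult:
    """Hypothetical stand-in; the real CCompResult fields may differ."""
    score: float = 0.0
    missing_env_vars: List[str] = field(default_factory=list)


def optional_list(obj, name):
    """Return obj.<name> as a list, or [] when the attribute is absent or empty."""
    value = getattr(obj, name, None)
    return list(value) if value else []


c_comp = FakeCCompResult(score=0.0, missing_env_vars=["CLERK_SECRET_KEY"])
found = []
for attr, label in [("missing_env_vars", "Missing environment variables"),
                    ("missing_dependencies", "Missing dependencies")]:
    values = optional_list(c_comp, attr)
    if values:
        found.append(f"{label}: {values}")
print(found)
```

The lookup for `missing_dependencies` returns an empty list instead of raising, so the remaining checks still run.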

In [None]:
# Build a JSON summary of the evaluation
import json

summary = {
    "sample_id": SAMPLE_ID,
    "task_type": task_types[current_task_type]['name'],
    "overall_score": f"{result.overall_score:.1f}%",
    "metrics": {
        "I-ACC (Implementation)": f"{scores['I-ACC']:.1f}%",
        "C-COMP (Configuration)": f"{scores['C-COMP']:.1f}%",
        "IPA (Integration Points)": f"{scores['IPA']:.1f}%",
        "F-CORR (Functional)": f"{scores['F-CORR']:.1f}%",
        "CQ (Code Quality)": f"{scores['CQ']:.1f}%",
        "SEM-SIM (Similarity)": f"{scores['SEM-SIM']:.1f}%"
    },
    "key_strengths": [],
    "key_weaknesses": []
}

# Identify strengths and weaknesses
for metric, score in scores.items():
    if score >= 80:
        summary["key_strengths"].append(f"{metric}: {score:.1f}%")
    elif score < 50:
        summary["key_weaknesses"].append(f"{metric}: {score:.1f}%")

print("EVALUATION SUMMARY")
print("="*60)
print(json.dumps(summary, indent=2))

# Performance interpretation
print("\nPerformance Interpretation:")
print("-"*40)
if result.overall_score >= 80:
    print("🏆 Excellent: The LLM successfully implemented the SDK task")
elif result.overall_score >= 60:
    print("✅ Good: The core functionality is correct with minor issues")
elif result.overall_score >= 40:
    print("⚠️ Fair: Basic implementation present but significant gaps")
else:
    print("❌ Poor: Major issues in SDK implementation")
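
The same thresholds can be kept in one table so they are easy to reuse and test. This is a minimal sketch. `interpret_score` is a hypothetical helper, and the two inputs are the recomputed and stored overall scores printed earlier in this notebook.

```python
def interpret_score(overall_score):
    """Map an overall percentage to the band labels used in the cell above."""
    bands = [
        (80, "Excellent"),
        (60, "Good"),
        (40, "Fair"),
    ]
    # Bands are ordered from highest threshold to lowest; first match wins.
    for threshold, label in bands:
        if overall_score >= threshold:
            return label
    return "Poor"


for value in (55.0, 33.5):
    print(f"{value:.1f}% -> {interpret_score(value)}")
```

The two values land in different bands (Fair and Poor), so the interpretation depends on which overall score is used.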

## Appendix: Testing Different Models

You can test different models by changing the configuration:

In [None]:
# Available models and their characteristics
models = {
    "Anthropic": [
        {
            "model": "claude-3-5-sonnet-20241022",
            "description": "Most capable, best for complex tasks",
            "cost": "$3/$15 per 1M tokens (input/output)",
            "speed": "Medium"
        },
        {
            "model": "claude-3-haiku-20240307",
            "description": "Fast and affordable, good for simple tasks",
            "cost": "$0.25/$1.25 per 1M tokens",
            "speed": "Very Fast"
        },
        {
            "model": "claude-haiku-4-5-20251001",
            "description": "Claude Haiku 4.5 - newer version",
            "cost": "$1/$5 per 1M tokens",
            "speed": "Fast"
        }
    ],
    "OpenAI": [
        {
            "model": "gpt-4-turbo-preview",
            "description": "GPT-4 Turbo with latest improvements",
            "cost": "$10/$30 per 1M tokens",
            "speed": "Medium"
        },
        {
            "model": "gpt-3.5-turbo",
            "description": "Fast and cheap, good baseline",
            "cost": "$0.50/$1.50 per 1M tokens",
            "speed": "Very Fast"
        }
    ]
}

print("Available Models for Testing:")
print("="*60)
for provider, provider_models in models.items():
    print(f"\n{provider}:")
    for model_info in provider_models:
        print(f"  • {model_info['model']}")
        print(f"    - {model_info['description']}")
        print(f"    - Cost: {model_info['cost']}")
        print(f"    - Speed: {model_info['speed']}")
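
When comparing models, the `cost` strings above can be turned into a rough per-request estimate. The sketch below assumes every cost string follows the `$input/$output per 1M tokens` shape used in the table. `parse_cost` and `estimate_cost` are hypothetical helpers, and the token counts are made-up illustrative values, not measurements from SDK-Bench.

```python
import re


def parse_cost(cost):
    """Extract (input, output) USD prices per 1M tokens from a cost string."""
    match = re.match(r"\$([\d.]+)/\$([\d.]+) per 1M tokens", cost)
    if match is None:
        raise ValueError(f"unrecognised cost string: {cost!r}")
    return float(match.group(1)), float(match.group(2))


def estimate_cost(cost, input_tokens, output_tokens):
    """Estimate the USD cost of one request from its token counts."""
    input_price, output_price = parse_cost(cost)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Illustrative request size (assumed): 2,000 prompt tokens, 1,000 completion tokens.
for cost in ("$3/$15 per 1M tokens (input/output)", "$0.25/$1.25 per 1M tokens"):
    print(f"{cost}: ${estimate_cost(cost, 2_000, 1_000):.5f} per request")
```

Multiplying the per-request estimate by the number of samples gives an approximate budget for a full benchmark run.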

## Next Steps

1. **Try different samples**: Change `SAMPLE_ID` to test other task types
2. **Generate new solutions**: Set up API key and generate fresh solutions  
3. **Compare models**: Test different LLMs on the same task
4. **Analyze patterns**: Look for common failure modes across tasks
5. **Improve prompts**: Based on the evaluation results, refine prompts

This notebook provides a complete understanding of how SDK-Bench evaluates LLM capabilities for SDK instrumentation tasks.
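
For step 4, one simple way to look for common failure modes is to count how often each metric appears under `key_weaknesses` across several summaries shaped like the one built earlier. This is a sketch only. `weakness_counts` is a hypothetical helper, and the two summaries are invented placeholders, not benchmark results.

```python
from collections import Counter


def weakness_counts(summaries):
    """Count how many summaries list each metric as a key weakness."""
    counts = Counter()
    for summary in summaries:
        for entry in summary.get("key_weaknesses", []):
            # Entries look like "C-COMP: 0.0%"; keep only the metric name.
            metric = entry.split(":", 1)[0].strip()
            counts[metric] += 1
    return counts


# Placeholder summaries for illustration only.
example_summaries = [
    {"sample_id": "sample_a", "key_weaknesses": ["C-COMP: 0.0%", "SEM-SIM: 0.0%"]},
    {"sample_id": "sample_b", "key_weaknesses": ["C-COMP: 20.0%"]},
]

for metric, count in weakness_counts(example_summaries).most_common():
    print(f"{metric}: weak in {count} sample(s)")
```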

In [None]:
from sdkbench.evaluator import Evaluator

In [23]:
evaluator = Evaluator(solution_dir, metadata_path=metadata_path)

In [30]:
evaluator.i_acc_evaluator.solution.files.keys()

dict_keys(['package.json', 'generation_metadata.json', 'app/layout.tsx', 'app/page.tsx'])

In [32]:
evaluator.i_acc_evaluator.ground_truth.ground_truth

{'ingredients': {'initialization': {'location': 'app/layout.tsx',
   'pattern': 'ClerkProvider wrapper',
   'imports': ['@clerk/nextjs']},
  'configuration': {'env_vars': ['NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY',
    'CLERK_SECRET_KEY'],
   'provider_props': [],
   'optional_config': []}}}
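
The two outputs above can be cross-checked by hand. The sketch below copies the file list and ground truth as literals and checks whether the expected initialization file exists and whether any `.env`-style file was generated. `is_env_file` is a hypothetical helper. This is not how `Evaluator` computes C-COMP, and treating a missing `.env` file as the reason for the 0% C-COMP score is an assumption.

```python
# Literals copied from the two cell outputs above.
solution_files = ["package.json", "generation_metadata.json",
                  "app/layout.tsx", "app/page.tsx"]
ground_truth = {
    "ingredients": {
        "initialization": {"location": "app/layout.tsx",
                           "pattern": "ClerkProvider wrapper",
                           "imports": ["@clerk/nextjs"]},
        "configuration": {"env_vars": ["NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY",
                                       "CLERK_SECRET_KEY"],
                          "provider_props": [],
                          "optional_config": []},
    }
}


def is_env_file(path):
    """True for files named .env or .env.<suffix>, in any directory."""
    name = path.rsplit("/", 1)[-1]
    return name == ".env" or name.startswith(".env.")


ingredients = ground_truth["ingredients"]
location = ingredients["initialization"]["location"]
expected_vars = ingredients["configuration"]["env_vars"]

location_present = location in solution_files
env_files = [path for path in solution_files if is_env_file(path)]

print(f"Initialization file {location!r} present: {location_present}")
print(f"Env files in solution: {env_files}")
print(f"Env vars expected by ground truth: {expected_vars}")
```

The expected initialization file is present, while no `.env` file appears among the solution files even though the ground truth lists two environment variables.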