# Nova Evaluation Container Demo
## Leveraging Nova Evaluation for Base and Custom Model Evaluations using SageMaker AI

This notebook demonstrates the key features introduced in the Nova Evaluation Container:
- **Metadata passthrough** for stratified analysis
- **Log probabilities** for uncertainty-aware evaluation
- **Multi-node evaluation** for scalability
- **Custom metrics** with BYOM workflow
- **Failure analysis** on low confidence predictions

## Prerequisites
This notebook can run locally or in Amazon SageMaker.

- **AWS Account** with access to SageMaker, Lambda, IAM, and S3
- **IAM Permissions**: create roles, create Lambda functions, and assume the SageMaker execution role
- **SageMaker Training Instance**: ml.g5.12xlarge (1 or more)


Install the SageMaker SDK and confirm the installed version.


In [None]:
!pip install --upgrade sagemaker
import sagemaker
print(sagemaker.__version__) 

Import required libraries for data manipulation (pandas, numpy), visualization (matplotlib, seaborn), AWS services (boto3), and SageMaker evaluation workflows.


In [None]:
import json
import pandas as pd
import numpy as np
from typing import List, Dict
import matplotlib.pyplot as plt
import seaborn as sns
import yaml
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import boto3
# SageMaker imports
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

print("Environment setup complete")

## Step 1: Dataset Preparation with Metadata Passthrough

The **metadata passthrough** feature allows us to preserve custom fields end-to-end, enabling stratified analysis across different categories and difficulty levels without post-hoc joins.

In [None]:
def load_sample_dataset(file_path: str = 'modified_text_eval_dataset.jsonl') -> List[Dict]:
    """Load the sample dataset with metadata"""
    dataset = []
    with open(file_path, 'r') as f:
        for line in f:
            dataset.append(json.loads(line))
    return dataset

def analyze_dataset_metadata(dataset: List[Dict]) -> pd.DataFrame:
    """Analyze metadata distribution in the dataset"""
    metadata_list = []
    for item in dataset:
        metadata = json.loads(item.get('metadata', '{}'))
        metadata_list.append(metadata)
    
    return pd.DataFrame(metadata_list)

# Load and analyze the sample dataset
dataset = load_sample_dataset()
metadata_df = analyze_dataset_metadata(dataset)

print(f"Dataset size: {len(dataset)} examples")
print(f"\nMetadata fields: {list(metadata_df.columns)}")
print(f"\nMetadata distribution:")
for col in metadata_df.columns:
    print(f"{col}: {metadata_df[col].value_counts().to_dict()}")
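
To make the expected record shape concrete, here is a minimal sketch of how JSONL records with passthrough metadata might look and how one metadata field can be tallied with the standard library alone. The `metadata` field is a JSON-encoded string, as the `json.loads` call above implies. The `query` and `response` field names and all sample values are illustrative assumptions, not taken from the actual dataset.

```python
import json
from collections import Counter

# Hypothetical records: only the JSON-string `metadata` field mirrors the notebook.
# The other field names and every value are made up for illustration.
sample_lines = [
    json.dumps({"query": "q1", "response": "a1",
                "metadata": json.dumps({"category": "billing", "difficulty": "easy"})}),
    json.dumps({"query": "q2", "response": "a2",
                "metadata": json.dumps({"category": "billing", "difficulty": "hard"})}),
    json.dumps({"query": "q3", "response": "a3",
                "metadata": json.dumps({"category": "account", "difficulty": "easy"})}),
]

def count_metadata_values(lines, field):
    """Count how often each value of one metadata field occurs."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        # metadata is stored as a JSON string, so it needs a second decode
        metadata = json.loads(record.get("metadata", "{}"))
        counts[metadata.get(field, "unknown")] += 1
    return counts

difficulty_counts = count_metadata_values(sample_lines, "difficulty")
print(dict(difficulty_counts))
```

Because the metadata travels with each record, the same tally can later be repeated on the evaluation output without joining back to the input file.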

## Step 2: Coding Up the Custom Metrics with BYOM Workflow

The **Bring Your Own Metrics (BYOM)** feature provides complete control over evaluation metrics. Here we implement custom metrics for our classification task. First, download the Nova custom evaluation layer from https://github.com/aws/nova-custom-eval-sdk/releases; Step 3.a publishes it as a Lambda layer version.

In [None]:
# Lambda function code for custom metrics (BYOM)
lambda_code = '''
import json
from nova_custom_evaluation_sdk.processors.decorators import preprocess, postprocess
from nova_custom_evaluation_sdk.lambda_handler import build_lambda_handler

@preprocess
def preprocessor(event: dict, context) -> dict:
    data = event.get('data', {})
    return {
        "statusCode": 200,
        "body": {
            "system": data.get("system"),
            "prompt": data.get("prompt", ""),
            "gold": data.get("gold", "")
        }
    }

@postprocess
def postprocessor(event: dict, context) -> dict:
    data = event.get('data', {})
    inference_output = data.get('inference_output', '')
    gold = data.get('gold', '')
    
    metrics = []
    
    # 1. Schema Validation
    schema_valid = validate_schema(inference_output)
    metrics.append({"metric": "schema_validation", "value": 1.0 if schema_valid else 0.0})
    
    # 2. Classification Metrics
    class_metrics = calculate_class_metrics(inference_output, gold)
    metrics.extend(class_metrics)
    
    return {"statusCode": 200, "body": metrics}

def validate_schema(output: str) -> bool:
    try:
        parsed = json.loads(output)
        return "class" in parsed
    except (json.JSONDecodeError, TypeError):
        return False

def calculate_class_metrics(inference_output: str, gold: str) -> list:
    pred_class = extract_class(inference_output)
    true_class = extract_class(gold)
    
    if not pred_class or not true_class:
        return [{"metric": "class_accuracy", "value": 0.0}]
    
    accuracy = 1.0 if pred_class.lower() == true_class.lower() else 0.0
    return [{"metric": "class_accuracy", "value": accuracy}]

def extract_class(text: str) -> str:
    try:
        parsed = json.loads(text)
        return str(parsed.get("class", "")).strip()
    except (json.JSONDecodeError, TypeError, AttributeError):
        return text.strip()

lambda_handler = build_lambda_handler(
    preprocessor=preprocessor,
    postprocessor=postprocessor
)
'''

print("Custom Lambda function code for BYOM workflow:")
print("- Schema validation metric")
print("- Classification accuracy metric")
print("- Preprocessing and postprocessing hooks")
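
Before deploying, it can help to sanity-check the metric logic locally. The sketch below re-implements the schema check and class accuracy from the Lambda code as plain functions, without the `nova_custom_evaluation_sdk` decorators, so it runs anywhere. The name `score_prediction` and the sample strings are hypothetical, and the schema check here is slightly stricter than the Lambda version because it also requires the parsed output to be a JSON object.

```python
import json

def extract_class(text: str) -> str:
    """Return the `class` field of a JSON object, or the raw text as a fallback."""
    try:
        parsed = json.loads(text)
        return str(parsed.get("class", "")).strip()
    except (json.JSONDecodeError, AttributeError, TypeError):
        return text.strip()

def score_prediction(inference_output: str, gold: str) -> dict:
    """Compute the two per-sample metrics used in the BYOM Lambda."""
    try:
        parsed = json.loads(inference_output)
        schema_valid = isinstance(parsed, dict) and "class" in parsed
    except json.JSONDecodeError:
        schema_valid = False

    pred, true = extract_class(inference_output), extract_class(gold)
    # Case-insensitive exact match; empty labels count as incorrect
    accuracy = 1.0 if pred and true and pred.lower() == true.lower() else 0.0
    return {
        "schema_validation": 1.0 if schema_valid else 0.0,
        "class_accuracy": accuracy,
    }

print(score_prediction('{"class": "Billing"}', '{"class": "billing"}'))
print(score_prediction('Billing', '{"class": "Billing"}'))
```

The second call illustrates why the two metrics are reported separately: a plain-text answer can name the right class while still failing schema validation.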

Next, use this code to create a Lambda function with the layer attached, as shown in the following steps.

## Step 3: Setting Up the Custom Metrics Lambda Function 

### Step 3.a: Publish Lambda Layer

Publish the Nova custom evaluation layer to AWS Lambda using boto3 and the provided zip file. The layer contains SDK dependencies required for custom metric computation.


In [None]:
import os
import boto3

layer_name = "nova-custom-eval-layer"
zip_file_path = 'nova-custom-evaluation-layer.zip'

# Get region with fallback
region = os.environ.get('AWS_DEFAULT_REGION') or boto3.Session().region_name or 'us-east-1'
lambda_client = boto3.client('lambda', region_name=region)

if os.path.exists(zip_file_path):
    try:
        with open(zip_file_path, 'rb') as f:
            response = lambda_client.publish_layer_version(
                LayerName=layer_name,
                Content={'ZipFile': f.read()},
                CompatibleRuntimes=['python3.9', 'python3.10', 'python3.11', 'python3.12']
            )
        layer_arn = response['LayerVersionArn']
        print(f"Published layer: {layer_arn}")
    except Exception as e:
        print(f"Error: {e}")
        raise
else:
    # Check for existing layer
    try:
        response = lambda_client.list_layer_versions(LayerName=layer_name, MaxItems=1)
        layer_arn = response['LayerVersions'][0]['LayerVersionArn']
        print(f"Using existing layer: {layer_arn}")
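    except Exception as e:
        raise FileNotFoundError(
            f"Layer zip '{zip_file_path}' not found and no published layer '{layer_name}' exists. "
            "Download the zip from https://github.com/aws/nova-custom-eval-sdk/releases"
        ) from e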


### Step 3.b: Create an IAM Execution Role for the Lambda Function
Create an IAM execution role with basic Lambda permissions to enable function execution and CloudWatch logging. If the role already exists, it is reused.


In [None]:
import json
import time
import os
import boto3

# Get region with fallback
region = os.environ.get('AWS_DEFAULT_REGION') or boto3.Session().region_name or 'us-east-1'

iam_client = boto3.client("iam", region_name=region)
lambda_client = boto3.client("lambda", region_name=region)
sts_client = boto3.client("sts", region_name=region)

role_name = "nova-custom-eval-lambda-role"

try:
    account_id = sts_client.get_caller_identity()["Account"]
    lambda_client.list_functions(MaxItems=1)
    
    try:
        role_response = iam_client.get_role(RoleName=role_name)
        role_arn = role_response["Role"]["Arn"]
        print(f"Using existing role: {role_arn}")
    except iam_client.exceptions.NoSuchEntityException:
        trust_policy = {
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole"
            }]
        }
        
        role_response = iam_client.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
            Description="Execution role for Nova custom evaluation Lambda"
        )
        
        iam_client.attach_role_policy(
            RoleName=role_name,
            PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
        )
        
        role_arn = role_response["Role"]["Arn"]
        print(f"Created role: {role_arn}")
        time.sleep(10)
        
except Exception as e:
    print(f"Error: {e}")
    raise


### Step 3.c: Create Lambda Function with Layer

Deploy the Lambda function with custom metrics code and attach the evaluation layer for runtime dependencies.


In [None]:
import boto3
import tempfile
import zipfile
import os
import time

function_name = "nova-custom-eval"
region = os.environ.get('AWS_DEFAULT_REGION') or boto3.Session().region_name or 'us-east-1'

with tempfile.TemporaryDirectory() as tmp_dir:
    lambda_file = os.path.join(tmp_dir, "lambda_function.py")
    with open(lambda_file, "w") as f:
        f.write(lambda_code)
    
    zip_path = os.path.join(tmp_dir, "function.zip")
    with zipfile.ZipFile(zip_path, "w") as z:
        z.write(lambda_file, arcname="lambda_function.py")
    
    with open(zip_path, "rb") as f:
        zip_content = f.read()
    
    lambda_client = boto3.client("lambda", region_name=region)
    
    try:
        lambda_client.get_function(FunctionName=function_name)
        print(f"Updating function {function_name}...")
        response = lambda_client.update_function_code(
            FunctionName=function_name,
            ZipFile=zip_content
        )
    except lambda_client.exceptions.ResourceNotFoundException:
        print(f"Creating function {function_name}...")
        response = lambda_client.create_function(
            FunctionName=function_name,
            Runtime="python3.9",
            Role=role_arn,
            Handler="lambda_function.lambda_handler",
            Code={"ZipFile": zip_content},
            Timeout=30,
            MemorySize=256
        )
    
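    # Wait for the function update to finish; fail fast instead of looping forever
    while True:
        config = lambda_client.get_function_configuration(FunctionName=function_name)
        status = config.get("LastUpdateStatus")
        if status == "Successful":
            break
        if status == "Failed":
            raise RuntimeError(f"Lambda update failed: {config.get('LastUpdateStatusReason')}")
        time.sleep(2)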
    
    # Now update layer
    response = lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer_arn]
    )
    
    function_arn = response["FunctionArn"]
    print(f"Lambda Function ARN: {function_arn}")
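
As an alternative to writing temporary files, the deployment package can be built entirely in memory. The sketch below is one way to do that with the standard library; `build_lambda_zip` is a hypothetical helper, not part of boto3 or the Nova SDK. It sets the file mode explicitly because `ZipFile.writestr` called with a plain filename stores owner-only permissions, which can prevent the Lambda runtime from reading the module.

```python
import io
import zipfile

def build_lambda_zip(source_code: str, filename: str = "lambda_function.py") -> bytes:
    """Package a single Python source string as an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        info = zipfile.ZipInfo(filename)
        info.compress_type = zipfile.ZIP_DEFLATED
        # Store the file as rw-r--r-- so it is readable by any runtime user
        info.external_attr = 0o644 << 16
        archive.writestr(info, source_code)
    return buffer.getvalue()

zip_bytes = build_lambda_zip(
    "def lambda_handler(event, context):\n    return {'statusCode': 200}\n"
)

# Re-open the archive to confirm its contents
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
print(names)
```

The resulting bytes can be passed wherever the notebook uses `zip_content`.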


### Step 3.d: Update Recipe Configuration with Lambda ARN


Configure the evaluation recipe with model parameters, inference settings (including `top_logprobs` for confidence analysis), and the ARN of the custom metrics Lambda function created in the previous step.

In [None]:
import yaml

recipe_config = {
    "run": {
        "name": "support-ticket-classification",
        "model_type": "amazon.nova-lite-v1:0:300k",
        "model_name_or_path": "nova-lite/prod",
        "replicas": 1
    },
    "evaluation": {
        "task": "gen_qa",
        "strategy": "gen_qa",
        "metric": "all"
    },
    "inference": {
        "max_new_tokens": 2048,
        "temperature": 0,
        "top_logprobs": 10
    },
    "processor": {
        "lambda_arn": function_arn,
        "preprocessing": {"enabled": False},
        "postprocessing": {"enabled": True},
        "aggregation": "average"
    }
}

with open("lambda_recipe.yaml", "w") as f:
    yaml.dump(recipe_config, f, default_flow_style=False, sort_keys=False)

print(f"Recipe saved with Lambda ARN: {function_arn}")
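
A quick sanity check on the recipe before launching a job can catch a missing section or a placeholder ARN early. The sketch below is illustrative only: the list of expected sections is an assumption based on the recipe in this notebook, not a published schema, and `check_recipe` is a hypothetical helper.

```python
import re

# Loose pattern for a Lambda function ARN, with an optional version or alias suffix
LAMBDA_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:lambda:[a-z0-9-]+:\d{12}:function:[A-Za-z0-9_-]+(:[A-Za-z0-9_$-]+)?$"
)

def check_recipe(recipe: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    # Sections assumed from this notebook's recipe
    for section in ("run", "evaluation", "inference", "processor"):
        if section not in recipe:
            problems.append(f"missing section: {section}")

    arn = recipe.get("processor", {}).get("lambda_arn", "")
    if not LAMBDA_ARN_PATTERN.match(str(arn)):
        problems.append(f"lambda_arn does not look like a Lambda function ARN: {arn!r}")
    return problems

example_recipe = {
    "run": {"name": "support-ticket-classification", "replicas": 1},
    "evaluation": {"task": "gen_qa"},
    "inference": {"top_logprobs": 10},
    "processor": {
        # Made-up account ID for illustration
        "lambda_arn": "arn:aws:lambda:us-east-1:123456789012:function:nova-custom-eval"
    },
}
print(check_recipe(example_recipe))
```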


## Step 4: Launch SageMaker Evaluation Job

Configure the SageMaker training job with compute resources, S3 paths, and the evaluation recipe (with custom metrics and log probabilities enabled), then submit it for execution.

*Change the following fields to match your values: `input_s3_uri`, `output_s3_uri`, `role`.*

*Don't forget to upload your dataset to S3.*




In [None]:
import os
os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

sagemaker_session = sagemaker.Session()

# Use your actual SageMaker execution role ARN
role = "enter-sagemaker-execution-role-ARN"

input_s3_uri = "s3://enter-bucket-name/input/text_eval_dataset.jsonl"
output_s3_uri = "s3://enter-bucket-name/output/"

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name="support-ticket-classification",
    role=role,
    instance_type="ml.g5.12xlarge",
    training_recipe="lambda_recipe.yaml",
    sagemaker_session=sagemaker_session,
    image_uri="708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest"
)

print(f"Using role: {role}")
estimator.fit(inputs={"train": TrainingInput(s3_data=input_s3_uri)})

The evaluation job generally takes 20 to 40 minutes to complete.

## Step 5: Review Results After the Evaluation Job Is Completed

First, define a helper that loads a parquet file into a pandas DataFrame.


In [None]:
def load_parquet_file(file_path):
    """
    Load a parquet file into a pandas DataFrame
    
    Args:
        file_path (str): Path to the parquet file
        
    Returns:
        pandas.DataFrame: The loaded DataFrame
    """
    try:
        # Read the parquet file
        df = pd.read_parquet(file_path)
        return df
    except Exception as e:
        print(f"Error loading parquet file: {e}")
        return None

Download, extract, and load evaluation results from S3, including detailed metrics, predictions, and log probabilities for analysis.


In [None]:
import boto3
import os
import tarfile
import glob

def download_and_extract_output(output_s3_uri, job_name, output_dir='./evaluation_results'):
    """
    Download output.tar.gz from S3 and extract it to a local directory.
    Returns the path to the parquet results file.
    """
    os.makedirs(output_dir, exist_ok=True)
    
    if output_s3_uri.startswith('s3://'):
        output_s3_uri = output_s3_uri[5:]
    
    if output_s3_uri.endswith('/'):
        output_s3_uri = output_s3_uri[:-1]
    
    parts = output_s3_uri.split('/')
    bucket = parts[0]
    prefix = '/'.join(parts[1:])
    
    s3_key = f"{prefix}/{job_name}/output/output.tar.gz"
    local_tar_path = os.path.join(output_dir, "output.tar.gz")
    
    print(f"Downloading from s3://{bucket}/{s3_key} to {local_tar_path}")
    
    s3_client = boto3.client('s3')
    
    try:
        s3_client.download_file(bucket, s3_key, local_tar_path)
        print("Download completed successfully")
        
        print(f"Extracting to {output_dir}")
        with tarfile.open(local_tar_path, "r:gz") as tar:
            tar.extractall(path=output_dir)
        print("Extraction completed successfully")
        
        os.remove(local_tar_path)
        print(f"Removed {local_tar_path}")
        
        # Find the parquet file
        parquet_pattern = os.path.join(output_dir, "**", "*.parquet")
        parquet_files = glob.glob(parquet_pattern, recursive=True)
        
        if parquet_files:
            results_path = parquet_files[0]
            print(f"\nResults parquet file: {results_path}")
            return results_path
        else:
            print("Warning: No parquet file found")
            return None
    
    except Exception as e:
        print(f"Error: {str(e)}")
        return None

# Download and get results path
results_data_path = download_and_extract_output(output_s3_uri, estimator.latest_training_job.name)

# Load the results
if results_data_path:
    results_data = pd.read_parquet(results_data_path)
    print(f"Loaded {len(results_data)} rows from results")
else:
    print("Failed to load results")
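
The bucket and key handling above can also be expressed with `urllib.parse`, which avoids manual string slicing. This is a minimal sketch under the assumption that the archive lives at `<prefix>/<job_name>/output/output.tar.gz`, as in the function above; `output_archive_location` is a hypothetical helper name.

```python
from urllib.parse import urlparse

def output_archive_location(output_s3_uri: str, job_name: str):
    """Split an s3:// URI and build the key of the job's output.tar.gz."""
    parsed = urlparse(output_s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {output_s3_uri!r}")

    prefix = parsed.path.strip("/")
    # Skip the prefix when it is empty so the key never starts with a slash
    parts = [part for part in (prefix, job_name, "output", "output.tar.gz") if part]
    return parsed.netloc, "/".join(parts)

bucket, key = output_archive_location("s3://my-bucket/output/", "job-1")
print(bucket, key)
```

Filtering out an empty prefix matters when the output URI is just the bucket root, where naive joining would produce a key with a leading slash.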


#### Eval results visualization

After your evaluation job completes successfully, you can access and analyze the results as described in this section. Under the output path configured for the job (such as s3://output_path/), the output has the following structure:


```
job_name/
├── eval-results/
│    ├── results_[timestamp].json
│    ├── inference_output.jsonl (only present for gen_qa)
│    └── details/
│        └── model/
│            └── execution-date-time/
│                └── details_task_name_#_datetime.parquet
└── tensorboard-results/
    └── eval/
        └── events.out.tfevents.[timestamp]
```
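
Given that layout, the individual result files can be located by pattern rather than by hard-coded path. The sketch below builds a throwaway directory that mimics the tree above and searches it with `pathlib`; the file names and the `find_result_files` helper are illustrative assumptions.

```python
import tempfile
from pathlib import Path

def find_result_files(root):
    """Locate the main evaluation artifacts under an extracted output directory."""
    root = Path(root)
    return {
        "summary_json": sorted(root.rglob("results_*.json")),
        "details_parquet": sorted(root.rglob("details_*.parquet")),
        "inference_output": sorted(root.rglob("inference_output.jsonl")),
    }

# Build a temporary directory that imitates the documented structure
with tempfile.TemporaryDirectory() as tmp:
    eval_dir = Path(tmp) / "job_name" / "eval-results"
    details_dir = eval_dir / "details" / "model" / "2024-01-01"
    details_dir.mkdir(parents=True)
    (details_dir / "details_gen_qa_0_2024-01-01.parquet").touch()
    (eval_dir / "results_2024-01-01.json").touch()

    found = find_result_files(tmp)
    found_names = {kind: [p.name for p in paths] for kind, paths in found.items()}

print(found_names)
```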

### Step 5.a: Failure Analysis with Log Probabilities

After evaluation completes, analyze low-confidence predictions using log probabilities and metadata for stratified failure analysis.

In [None]:
import json
import numpy as np
import pandas as pd
import ast

def calculate_confidence_score(pred_logits_str) -> float:
    """Calculate confidence score from log probabilities"""
    if not pred_logits_str or pred_logits_str == '[]':
        return 0.0
    
    try:
        logits = ast.literal_eval(pred_logits_str)
        if not logits or len(logits) == 0:
            return 0.0
        
        token_probs = []
        for token_dict in logits[0]:
            if isinstance(token_dict, dict) and token_dict:
                max_logprob = max(token_dict.values())
                token_probs.append(np.exp(max_logprob))
        
        return float(np.mean(token_probs)) if token_probs else 0.0
    except:
        return 0.0

def analyze_low_confidence_failures(results_df: pd.DataFrame, confidence_threshold: float = 0.7, quality_threshold: float = 0.3) -> dict:
    """Perform comprehensive failure analysis"""
    
    results_df['confidence'] = results_df['pred_logits'].apply(calculate_confidence_score)
    results_df['quality_score'] = results_df['metrics'].apply(
        lambda x: ast.literal_eval(x).get('f1', 0) if x else 0
    )
    results_df['is_correct'] = results_df['quality_score'] >= quality_threshold
    
    # Analyze low confidence predictions
    low_conf_df = results_df[results_df['confidence'] < confidence_threshold]
    
    # Also identify high confidence but low quality (overconfident errors)
    overconfident_df = results_df[(results_df['confidence'] >= confidence_threshold) & 
                                   (results_df['quality_score'] < quality_threshold)]
    
    analysis_results = {
        'summary': {
            'total_predictions': len(results_df),
            'low_confidence_count': len(low_conf_df),
            'low_confidence_rate': len(low_conf_df) / len(results_df) if len(results_df) > 0 else 0,
            'overconfident_errors': len(overconfident_df),
            'overconfident_rate': len(overconfident_df) / len(results_df) if len(results_df) > 0 else 0,
            'avg_confidence': results_df['confidence'].mean(),
            'avg_quality': results_df['quality_score'].mean(),
            'overall_accuracy': results_df['is_correct'].mean()
        },
        'low_confidence_examples': [],
        'overconfident_examples': []
    }
    
    # Low confidence examples
    if len(low_conf_df) > 0:
        for idx, row in low_conf_df.iterrows():
            try:
                specifics = ast.literal_eval(row['specifics'])
                metadata = json.loads(specifics.get('metadata', '{}'))
                
                analysis_results['low_confidence_examples'].append({
                    'example': row['example'][:100] + '...' if len(row['example']) > 100 else row['example'],
                    'confidence': row['confidence'],
                    'quality_score': row['quality_score'],
                    'category': metadata.get('category', 'unknown'),
                    'difficulty': metadata.get('difficulty', 'unknown'),
                    'domain': metadata.get('domain', 'unknown')
                })
            except:
                pass
    
    # Overconfident errors
    if len(overconfident_df) > 0:
        for idx, row in overconfident_df.iterrows():
            try:
                specifics = ast.literal_eval(row['specifics'])
                metadata = json.loads(specifics.get('metadata', '{}'))
                
                analysis_results['overconfident_examples'].append({
                    'example': row['example'][:100] + '...' if len(row['example']) > 100 else row['example'],
                    'confidence': row['confidence'],
                    'quality_score': row['quality_score'],
                    'category': metadata.get('category', 'unknown'),
                    'difficulty': metadata.get('difficulty', 'unknown'),
                    'domain': metadata.get('domain', 'unknown')
                })
            except:
                pass
    
    return analysis_results

# Use already loaded data
results_df = pd.DataFrame(results_data)
analysis_results = analyze_low_confidence_failures(results_df, confidence_threshold=0.7, quality_threshold=0.3)

print("=" * 60)
print("FAILURE ANALYSIS RESULTS")
print("=" * 60)
print(f"\nTotal predictions: {analysis_results['summary']['total_predictions']}")
print(f"Average confidence: {analysis_results['summary']['avg_confidence']:.3f}")
print(f"Average F1 quality: {analysis_results['summary']['avg_quality']:.3f}")
print(f"Overall accuracy (F1>0.3): {analysis_results['summary']['overall_accuracy']:.1%}")

print(f"\n{'=' * 60}")
print("LOW CONFIDENCE PREDICTIONS (confidence < 0.7)")
print("=" * 60)
print(f"Count: {analysis_results['summary']['low_confidence_count']} ({analysis_results['summary']['low_confidence_rate']:.1%})")
for i, example in enumerate(analysis_results['low_confidence_examples'][:5], 1):
    print(f"\n{i}. Confidence: {example['confidence']:.3f} | F1: {example['quality_score']:.3f}")
    print(f"   {example['difficulty']} | {example['domain']}")
    print(f"   {example['example']}")

print(f"\n{'=' * 60}")
print("OVERCONFIDENT ERRORS (confidence >= 0.7 but F1 < 0.3)")
print("=" * 60)
print(f"Count: {analysis_results['summary']['overconfident_errors']} ({analysis_results['summary']['overconfident_rate']:.1%})")
for i, example in enumerate(analysis_results['overconfident_examples'][:5], 1):
    print(f"\n{i}. Confidence: {example['confidence']:.3f} | F1: {example['quality_score']:.3f}")
    print(f"   {example['difficulty']} | {example['domain']}")
    print(f"   {example['example']}")
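
The confidence score used above is the mean, over generated positions, of the probability of the most likely candidate token. The following worked example reproduces that arithmetic on a tiny hand-made input using only the standard library. The token strings and log probabilities are invented for illustration.

```python
import math

def mean_top_token_probability(token_logprobs):
    """Average exp(max log probability) over all non-empty token positions."""
    probs = [
        math.exp(max(candidates.values()))
        for candidates in token_logprobs
        if candidates
    ]
    return sum(probs) / len(probs) if probs else 0.0

# Two token positions, each with its top candidates and their log probabilities
example = [
    {"Billing": -0.1, "Refund": -2.5},   # top candidate: exp(-0.1)
    {"_": -0.5, "-": -1.2},              # top candidate: exp(-0.5)
]
confidence = mean_top_token_probability(example)
print(round(confidence, 4))
```

Because the score takes the maximum over candidates at each position, it measures how peaked the distribution is. With greedy decoding (`temperature` set to 0) the top candidate is normally also the emitted token, so the two readings coincide.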


### Helper Functions for Result Parsing

The following utility functions parse evaluation outputs, extract class predictions from JSON responses, and handle log probability data for confidence analysis.


1. Extract the class label from gold answers and parse model predictions into dictionaries. These functions accept both `str` and `bytes` input and fall back to regular expressions when the text is not valid JSON.


In [None]:
import re
import json

def extract_gold(text):
    """
    More robust extraction handling various formats and bytes
    """
    is_bytes = isinstance(text, bytes)
    
    if is_bytes:
        patterns = [
            rb'"class"\s*:\s*"([^"]+)"',
            rb'"class"\s*:\s*([^,}\s]+)',
            rb"'class'\s*:\s*'([^']+)'",
            rb"'class'\s*:\s*([^,}\s]+)",
        ]
        
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                value = match.group(1).decode('utf-8', errors='ignore').strip()
                value = re.sub(r'[.,;]+$', '', value)
                return value
    else:
        patterns = [
            r'"class"\s*:\s*"([^"]+)"',
            r'"class"\s*:\s*([^,}\s]+)',
            r"'class'\s*:\s*'([^']+)'",
            r"'class'\s*:\s*([^,}\s]+)",
        ]
        
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                value = match.group(1).strip()
                value = re.sub(r'[.,;]+$', '', value)
                return value
    
    return None

def parse_preds(text):
    """
    Parse a malformed JSON string into a dictionary
    """
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='ignore')
    
    text = text.strip()
    if text.startswith('[') and text.endswith(']'):
        text = text[1:-1].strip()
    
    if (text.startswith("'") and text.endswith("'")) or \
       (text.startswith('"') and text.endswith('"')):
        text = text[1:-1]
    
    text = text.replace('\\n', '\n')
    text = text.replace('\\"', '"')
    
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        result = {}
        
        class_match = re.search(r'"class"\s*:\s*"([^"]+)"', text)
        if class_match:
            result['class'] = class_match.group(1)
        
        thought_match = re.search(r'"thought"\s*:\s*"([^"]+)"', text)
        if thought_match:
            result['thought'] = thought_match.group(1)
        
        desc_match = re.search(r'"description"\s*:\s*"([^"]+)"', text)
        if desc_match:
            result['description'] = desc_match.group(1)
        
        return result if result else None

# Success confirmation
print("✓ Helper functions defined successfully:")
print("  - extract_gold(): Extracts class from gold standard")
print("  - parse_preds(): Parses prediction text to dict")
print("\nFunctions are ready to use for data processing.")

2. Parse log probability data and extract confidence scores for predicted classes. These functions handle token-level probability extraction from the model's output, supporting both single-token and multi-token class predictions, and include a tolerant parser for malformed JSON or dict strings.


In [None]:
import re
import math
import ast
import json

def parse_log_probs(data_str):
    """Parse the log probability data string into a list of token dictionaries"""
    pattern = r"\{([^}]+)\}"
    matches = re.findall(pattern, data_str)
    
    parsed_sequence = []
    for match in matches:
        pairs = re.findall(r"'([^']+)':\s*([-\d.e]+)", match)
        token_dict = {token: float(log_prob) for token, log_prob in pairs}
        parsed_sequence.append(token_dict)
    
    return parsed_sequence

def find_target_tokens(target_class):
    """Split target class into component tokens"""
    tokens = []
    
    if '_' in target_class:
        parts = target_class.split('_')
        for i, part in enumerate(parts):
            if i > 0:
                tokens.append('_')
            subtokens = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|\b)', part)
            tokens.extend(subtokens)
    else:
        tokens = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|\b)', target_class)
    
    return [target_class] + tokens

def extract_class_confidence(data_str, target_class):
    """Extract confidence score for any target class prediction"""
    parsed_sequence = parse_log_probs(data_str)
    possible_tokens = find_target_tokens(target_class)
    
    print(f"Searching for target class: '{target_class}'")
    print(f"Possible token sequences: {possible_tokens}\n")
    
    found_tokens = []
    
    for step_idx, token_dict in enumerate(parsed_sequence):
        for token in possible_tokens:
            if token in token_dict:
                log_prob = token_dict[token]
                prob = math.exp(log_prob)
                found_tokens.append({
                    'step': step_idx,
                    'token': token,
                    'log_prob': log_prob,
                    'probability': prob,
                    'confidence_pct': prob * 100
                })
    
    if not found_tokens:
        print(f"❌ Target class '{target_class}' not found in sequence")
        return None
    
    if len(found_tokens) == 1:
        result = found_tokens[0]
        print("=" * 70)
        print(f"✓ FOUND: '{target_class}' as single token")
        print("=" * 70)
        print(f"Step: {result['step']}")
        print(f"Log Probability: {result['log_prob']:.10f}")
        print(f"Confidence: {result['confidence_pct']:.6f}%")
        print("=" * 70)
        return result
    else:
        print("=" * 70)
        print(f"✓ FOUND: '{target_class}' as token sequence")
        print("=" * 70)
        
        # Sum log probabilities instead of multiplying probabilities so that
        # long sequences cannot underflow to 0 and break math.log
        overall_log_prob = 0.0
        for token_info in found_tokens:
            print(f"  Step {token_info['step']:2d} | '{token_info['token']:12s}' | "
                  f"Confidence: {token_info['confidence_pct']:8.4f}%")
            overall_log_prob += token_info['log_prob']
        
        overall_prob = math.exp(overall_log_prob)
        print("-" * 70)
        print(f"Overall Model Sequence Confidence: {overall_prob*100:.6f}%")
        print(f"Overall Log Probability: {overall_log_prob:.10f}")
        print("=" * 70)
        
        return {
            'tokens': found_tokens,
            'overall_probability': overall_prob,
            'overall_log_prob': overall_log_prob,
            'confidence_pct': overall_prob * 100
        }

def parse_to_dict(text):
    """Parse a malformed JSON string into a dictionary"""
    if isinstance(text, bytes):
        text = text.decode('utf-8', errors='ignore')
    
    original_text = text
    text = text.strip()
    
    if text.startswith('[') and text.endswith(']'):
        text = text[1:-1].strip()
    
    if (text.startswith("'") and text.endswith("'")) or \
       (text.startswith('"') and text.endswith('"')):
        text = text[1:-1]
    
    try:
        result = ast.literal_eval(original_text)
        if isinstance(result, dict):
            for key, value in result.items():
                if isinstance(value, str) and (value.strip().startswith('{') or value.strip().startswith('[')):
                    try:
                        result[key] = json.loads(value)
                    except:
                        pass
            return result
    except (ValueError, SyntaxError):
        pass
    
    text = text.replace('\\n', '\n')
    text = text.replace('\\"', '"')
    text = text.replace("\\'", "'")
    
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    
    result = {}
    patterns = [
        r'"(\w+)"\s*:\s*"([^"]*)"',
        r"'(\w+)'\s*:\s*'([^']*)'",
        r'"(\w+)"\s*:\s*({[^}]+})',
        r"'(\w+)'\s*:\s*({[^}]+})",
    ]
    
    for pattern in patterns:
        matches = re.findall(pattern, text, re.DOTALL)
        for key, value in matches:
            if value.strip().startswith('{'):
                try:
                    result[key] = json.loads(value.replace("'", '"'))
                except:
                    result[key] = value
            else:
                result[key] = value
    
    return result if result else None

# Success confirmation
print("✓ Log probability and parsing functions defined successfully:")
print("  - parse_log_probs(): Parses log probability strings")
print("  - find_target_tokens(): Tokenizes target class names")
print("  - extract_class_confidence(): Extracts confidence for predictions")
print("  - parse_to_dict(): Parses malformed JSON/dict strings")
print("\nAll functions are ready for confidence analysis and data processing.")
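
The confidence of a class that spans several tokens is the product of its token probabilities, which equals the exponential of the summed log probabilities. The short sketch below shows that identity and why working in log space is safer for long sequences. The log probability values and the `sequence_confidence` name are made up for illustration.

```python
import math

def sequence_confidence(token_log_probs):
    """Return (total log probability, joint probability) for a token sequence."""
    total_log_prob = sum(token_log_probs)
    return total_log_prob, math.exp(total_log_prob)

# A hypothetical three-token class such as "Document", "_", "Request"
log_prob, prob = sequence_confidence([-0.05, -0.01, -0.2])
print(f"log probability: {log_prob:.4f}, probability: {prob:.4f}")

# For very long sequences the probability underflows to 0.0,
# but the summed log probability stays finite and comparable
long_log_prob, long_prob = sequence_confidence([-1.0] * 2000)
print(long_log_prob, long_prob)
```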


3. Extract the class value from predictions and gold labels in several formats (stringified lists, NumPy arrays, JSON objects, and markdown-style `Issue Type:` lines), then test the extraction on the first sample of the evaluation results.


In [None]:
import ast
import re
import json

def extract_class_value(text):
    """Extract class value from various text formats"""
    # Handle string representation of list: "['...']"
    if isinstance(text, str) and text.startswith('['):
        try:
            parsed_list = ast.literal_eval(text)
            if isinstance(parsed_list, list) and len(parsed_list) > 0:
                text = parsed_list[0]
        except:
            pass
    
    # Handle numpy array
    if isinstance(text, np.ndarray) and len(text) > 0:
        text = text[0]
    
    # Convert to string
    text = str(text)
    
    # Try JSON format: {"class": "Document_Request"}
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict) and 'class' in parsed:
            return parsed['class']
    except json.JSONDecodeError:
        pass  # Not JSON; fall through to the regex patterns
    
    # Try markdown format: **Issue Type:** Document Request
    patterns = [
        r'\*\*Issue Type:\*\*\s*([^\n]+)',
        r'Issue Type:\s*([^\n]+)',
        r'"class"\s*:\s*"([^"]+)"',
        r"'class'\s*:\s*'([^']+)'",
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            value = match.group(1).strip()
            # Clean up common suffixes
            value = re.sub(r'\s*/\s*.*$', '', value)  # Remove "/ Department" part
            return value
    
    return None

# Test it
print("Testing extraction:")
pred_sample = results_data['predictions'][0]
gold_sample = results_data['gold'][0]

print(f"Prediction extracted: {extract_class_value(pred_sample)}")
print(f"Gold extracted: {extract_class_value(gold_sample)}")
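
Once class names are extracted, stratified metrics reduce to a group-by over the metadata fields. The following is a minimal sketch under stated assumptions: the toy rows are made up, `accuracy_by` is a hypothetical helper, and the column names `pred_class`, `gold_class`, and `meta_difficulty` mirror the ones this notebook adds to `results_df` in the visualization step.

```python
import pandas as pd

# Toy per-sample results; the values are made up for illustration.
toy_results = pd.DataFrame({
    "pred_class": ["Document_Request", "Billing", "Billing", "Document_Request"],
    "gold_class": ["Document_Request", "Billing", "Document_Request", "Document_Request"],
    "meta_difficulty": ["easy", "easy", "hard", "hard"],
})
toy_results["correct"] = toy_results["pred_class"] == toy_results["gold_class"]


def accuracy_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: accuracy and sample count per value of `column`."""
    return df.groupby(column)["correct"].agg(accuracy="mean", n_samples="count")


stratified = accuracy_by(toy_results, "meta_difficulty")
print(stratified)
```

Reporting the sample count next to each accuracy makes it easier to spot strata that are too small to support a conclusion.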


4. Preview the analysis results showing predictions, gold labels, correctness, confidence scores, and metadata for the first few samples.


In [None]:
results_df.head()

## Step 6: Multi-Node Scaling (optional)

Optionally, scale evaluation to larger datasets with **multi-node evaluation** by increasing the replica count.

In [None]:
# Multi-node configuration - just change the replicas parameter
multinode_config = {
    "run": {
        "name": "support-ticket-classification-multinode",
        "model_name_or_path": "nova-lite-v1:0",
        "replicas": 4  # Scale to 4 nodes; requires sufficient instance quota
    },
    "evaluation": {
        "task": "gen_qa",
        "strategy": "gen_qa",
        "metric": "all"
    },
    "inference": {
        "max_new_tokens": 2048,
        "temperature": 0,
        "top_logprobs": 10
    }
}

print("Multi-node scaling configuration:")
print(f"Replicas: {multinode_config['run']['replicas']}")
print("\nBenefits:")
print("- Automatic workload distribution")
print("- Preserved metadata-based analysis")
print("- Deterministic result aggregation")
print("- Scale from thousands to millions of examples")
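
Because only the replica count changes, a multi-node configuration can be derived from a single-node one instead of being maintained separately. This is a sketch, not part of the container's API: `with_replicas` is a hypothetical helper, and the single-node `base_config` (including its run name) is assumed for illustration.

```python
import copy

# Assumed single-node configuration, mirroring the structure shown above.
base_config = {
    "run": {
        "name": "support-ticket-classification",  # hypothetical run name
        "model_name_or_path": "nova-lite-v1:0",
        "replicas": 1,
    },
    "evaluation": {"task": "gen_qa", "strategy": "gen_qa", "metric": "all"},
    "inference": {"max_new_tokens": 2048, "temperature": 0, "top_logprobs": 10},
}


def with_replicas(config: dict, replicas: int) -> dict:
    """Hypothetical helper: copy `config` and set run.replicas."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    scaled = copy.deepcopy(config)  # leave the original config untouched
    scaled["run"]["replicas"] = replicas
    return scaled


scaled_config = with_replicas(base_config, 4)
print(scaled_config["run"]["replicas"], base_config["run"]["replicas"])
```

The deep copy keeps the evaluation and inference settings identical across runs, so single-node and multi-node results stay comparable.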



## Step 7: Visualization and Insights

Visualize the evaluation results and failure analysis to identify improvement opportunities.

In [None]:
import ast
import json
import re

# Check if confidence already exists, if not calculate it
if 'confidence' not in results_df.columns:
    def calculate_confidence_score(pred_logits_str):
        if not pred_logits_str or pred_logits_str == '[]':
            return 0.0
        try:
            logits = ast.literal_eval(pred_logits_str)
            if not logits or len(logits) == 0:
                return 0.0
            token_probs = [np.exp(max(td.values())) for td in logits[0] if isinstance(td, dict) and td]
            return float(np.mean(token_probs)) if token_probs else 0.0
        except Exception:  # Malformed log-prob payload: fall back to zero confidence
            return 0.0
    
    results_df['confidence'] = results_df['pred_logits'].apply(calculate_confidence_score)

# Extract class values
def extract_class_value(text):
    if isinstance(text, str) and text.startswith('['):
        try:
            text = ast.literal_eval(text)[0]
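        except (ValueError, SyntaxError, IndexError, TypeError):
            pass  # Not a usable list literal; treat it as plain text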
    if isinstance(text, np.ndarray) and len(text) > 0:
        text = text[0]
    text = str(text)
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict) and 'class' in parsed:
            return parsed['class']
    except json.JSONDecodeError:
        pass  # Not JSON; fall through to the markdown pattern
    match = re.search(r'\*\*Issue Type:\*\*\s*([^\n]+)', text, re.IGNORECASE)
    if match:
        return re.sub(r'\s*/\s*.*$', '', match.group(1).strip())
    return None

# Extract metadata
def extract_metadata(row):
    try:
        specifics = ast.literal_eval(row['specifics'])
        metadata = json.loads(specifics.get('metadata', '{}'))
        return pd.Series({
            'meta_category': metadata.get('category', 'unknown'),
            'meta_difficulty': metadata.get('difficulty', 'unknown'),
            'meta_domain': metadata.get('domain', 'unknown')
        })
    except Exception:  # Missing or malformed metadata: label every field 'unknown'
        return pd.Series({'meta_category': 'unknown', 'meta_difficulty': 'unknown', 'meta_domain': 'unknown'})

# Add missing columns
results_df['pred_class'] = results_df['predictions'].apply(extract_class_value)
results_df['gold_class'] = results_df['gold'].apply(extract_class_value)
results_df['correct'] = results_df['pred_class'] == results_df['gold_class']
results_df[['meta_category', 'meta_difficulty', 'meta_domain']] = results_df.apply(extract_metadata, axis=1)

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

axes[0, 0].hist(results_df['confidence'], bins=20, alpha=0.7, edgecolor='black')
axes[0, 0].axvline(0.7, color='red', linestyle='--', label='Low confidence threshold')
axes[0, 0].set_title('Confidence Score Distribution')
axes[0, 0].set_xlabel('Confidence Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()

confidence_bins = pd.cut(results_df['confidence'], bins=5)
accuracy_by_conf = results_df.groupby(confidence_bins, observed=True)['correct'].mean()
accuracy_by_conf.plot(kind='bar', ax=axes[0, 1], rot=45)
axes[0, 1].set_title('Accuracy by Confidence Bin')
axes[0, 1].set_ylabel('Accuracy')

domain_perf = results_df.groupby('meta_domain')['correct'].mean()
domain_perf.plot(kind='bar', ax=axes[1, 0], color='skyblue')
axes[1, 0].set_title('Accuracy by Domain')
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].tick_params(axis='x', rotation=45)

difficulty_perf = results_df.groupby('meta_difficulty')['correct'].mean()
difficulty_perf.plot(kind='bar', ax=axes[1, 1], color='lightcoral')
axes[1, 1].set_title('Accuracy by Difficulty')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
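
The confidence score plotted above is the mean, over generated tokens, of the probability of the most likely candidate at each position. The sketch below reproduces that calculation on a hand-made example. `mean_top_token_probability` is a hypothetical name, the token strings and probabilities are invented, and the input is assumed to be one dict of candidate log probabilities per token position.

```python
import math
from typing import Dict, List


def mean_top_token_probability(token_logprobs: List[Dict[str, float]]) -> float:
    """Average probability of the top candidate at each token position."""
    top_probs = [
        math.exp(max(candidates.values()))  # log probability -> probability
        for candidates in token_logprobs
        if candidates  # skip empty positions
    ]
    return sum(top_probs) / len(top_probs) if top_probs else 0.0


# Invented example: two token positions, each with two candidates.
sample_logprobs = [
    {"Document": math.log(0.9), "Billing": math.log(0.1)},
    {"_Request": math.log(0.3), "_Query": math.log(0.2)},
]

confidence = mean_top_token_probability(sample_logprobs)  # (0.9 + 0.3) / 2
LOW_CONFIDENCE_THRESHOLD = 0.7  # same threshold as the red line in the histogram
is_low_confidence = confidence < LOW_CONFIDENCE_THRESHOLD

print(f"confidence={confidence:.2f}, low_confidence={is_low_confidence}")
```

Averaging per-token probabilities is only one possible choice. A product or a minimum over tokens penalizes a single uncertain token more heavily, which may suit short class labels better.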


## Conclusion

This notebook demonstrated key capabilities of the Nova Evaluation Container:

### **Metadata Passthrough**
- Preserves custom fields end-to-end for stratified analysis
- Enables performance analysis by category, difficulty, domain
- No post-hoc joins required

### **Log Probabilities** 
- Token-level uncertainty information for calibration studies
- Confidence-based routing and quality gates
- Hallucination detection capabilities

### **Custom Metrics (BYOM)**
- Complete control over evaluation pipeline
- Schema validation and domain-specific metrics
- Pre/post-processing hooks

### **Multi-Node Scaling**
- Simple replica configuration for distributed evaluation
- Maintains deterministic results and metadata analysis
- Scales from thousands to millions of examples


### Next Steps
1. Upload your own dataset to S3 and configure the recipe YAML
2. Deploy the custom Lambda function for BYOM workflow
3. Launch the SageMaker evaluation job
4. Analyze results using the failure analysis framework
5. Scale to larger datasets using multi-node configuration

For more information, see the [Nova Evaluation Container documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-evaluate.html).