# FLARE25-PaliGemma: Evaluation Notebook

This notebook adapts the provided Python evaluation script into a step-by-step Jupyter notebook.

This evaluation notebook supports all FLARE25 task types with specialized metrics:
1. **Classification**: Balanced Accuracy
2. **Multi-label Classification**: F1-Score (micro-average)
3. **Detection / Instance Detection**: F1-Score (IoU threshold of 0.5; detection requires IoU > 0.5, instance detection requires IoU ≥ 0.5)
4. **Counting / Regression**: Mean Absolute Error (MAE)
5. **Report Generation**: GREEN Score

We will break down the script into logical sections and execute them step by step in this notebook.

## 1. Environment Setup and Dependencies

Install and import all required libraries for evaluation.

In [1]:
!pip install scikit-learn



In [2]:
!git clone https://github.com/Stanford-AIMI/GREEN.git
%cd GREEN
!pip install -e .

Cloning into 'GREEN'...
remote: Enumerating objects: 380, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 380 (delta 115), reused 202 (delta 93), pack-reused 138 (from 1)
Receiving objects: 100% (380/380), 278.12 KiB | 14.64 MiB/s, done.
Resolving deltas: 100% (143/143), done.
/content/GREEN
Obtaining file:///content/GREEN
  Preparing metadata (setup.py) ... done
Collecting torch==2.2.2 (from green_score==0.0.11)
  Downloading torch-2.2.2-cp311-cp311-manylinux1_x86_64.whl.metadata (25 kB)
Collecting transformers==4.40.0 (from green_score==0.0.11)
  Downloading transformers-4.40.0-py3-none-any.whl.metadata (137 kB)
Collecting accelerate==0.30.1 (from green_score==0.0.11)
  Downloading acceler

In [3]:
import json
import re
import os
import ast
import argparse
from collections import defaultdict
from sklearn.metrics import mean_absolute_error, balanced_accuracy_score
from green_score import GREEN

## 2. Evaluation Configuration

Configure the evaluation parameters:
- **Dataset Path**: Path to FLARE25 validation dataset
- **Prediction File**: JSON file with model predictions
- **Output Settings**: Where to save evaluation results
- **Verbosity**: Whether to print every metric or only the primary metric for each task

In [1]:
# Configure evaluation settings
class EvalArgs:
    def __init__(self):
        # Dataset configuration
        self.base_dataset_path = "original_dataset"  # FLARE25 dataset directory
        self.prediction_file = "predictions_public.json"  # Model predictions

        # Output configuration
        self.output_dir = "evaluation_results"
        self.output_filename = "metrics_public.json"

        # Evaluation settings
        self.verbose = True  # Detailed output

args = EvalArgs()

print("Evaluation Configuration:")
print(f"  Dataset Path: {args.base_dataset_path}")
print(f"  Prediction File: {args.prediction_file}")
print(f"  Output Directory: {args.output_dir}")
print(f"  Metrics File: {args.output_filename}")


Evaluation Configuration:
  Dataset Path: original_dataset
  Prediction File: predictions_public.json
  Output Directory: evaluation_results
  Metrics File: metrics_public.json


## 3. Data Loading and Processing Utilities

Essential functions for handling FLARE25 dataset:
- **File Discovery**: Find all JSON files in validation directory
- **Data Loading**: Load and merge multiple JSON files
- **Sample Matching**: Create unique keys for matching predictions with ground truth
- **Validation**: Ensure all required files exist

In [6]:
def find_json_files(base_path):
    """Recursively find all JSON files in the specified directory."""
    json_files = []
    for root, dirs, files in os.walk(base_path):
        for file in files:
            if file.endswith('.json'):
                json_files.append(os.path.join(root, file))
    return json_files


def load_and_merge_json_files(json_files):
    """Load and merge multiple JSON files into a single list."""
    all_data = []
    dataset_info = {}

    for json_file in json_files:
        try:
            with open(json_file, 'r') as f:
                data = json.load(f)
                if isinstance(data, list):
                    all_data.extend(data)
                    dataset_info[os.path.basename(json_file)] = len(data)
                else:
                    all_data.append(data)
                    dataset_info[os.path.basename(json_file)] = 1
        except Exception as e:
            print(f"Warning: Failed to load {json_file}: {e}")

    return all_data, dataset_info


def create_sample_key(sample):
    """Create a unique key for matching samples between ground truth and predictions."""
    image_name = str(sample.get("ImageName", sample.get("image", "")))
    question = str(sample.get("Question", ""))
    return f"{image_name}||{question}"


def validate_paths(ground_truth_path, prediction_file):
    """Validate that required paths exist and contain data."""
    # Check ground truth path
    if not os.path.exists(ground_truth_path):
        raise FileNotFoundError(f"Ground truth dataset directory not found: {ground_truth_path}")

    # Find JSON files in ground truth path
    gt_files = find_json_files(ground_truth_path)
    if not gt_files:
        raise FileNotFoundError(f"No JSON files found in {ground_truth_path}")

    # Check prediction file
    if not os.path.exists(prediction_file):
        raise FileNotFoundError(f"Prediction file not found: {prediction_file}")

    return gt_files, True
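
To illustrate how the matching key works, here is a minimal sketch. It restates `create_sample_key` so that the snippet runs on its own; the two sample records are made up for illustration and are not taken from the dataset.

```python
def create_sample_key(sample):
    """Same logic as the helper above: image name plus question text."""
    image_name = str(sample.get("ImageName", sample.get("image", "")))
    question = str(sample.get("Question", ""))
    return f"{image_name}||{question}"


# Hypothetical records: one ground-truth entry and one prediction entry
gt_sample = {"ImageName": "img_001.png", "Question": "Is a lesion present?", "Answer": "yes"}
pred_sample = {"image": "img_001.png", "Question": "Is a lesion present?", "Answer": "no"}

gt_key = create_sample_key(gt_sample)
pred_key = create_sample_key(pred_sample)

print(gt_key)
print(gt_key == pred_key)  # "image" is used as a fallback when "ImageName" is absent
```

Because the key is built from raw strings, any difference in whitespace or casing of the question text between the two files would prevent a match.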

## 4. Task-Specific Metric Calculation Functions

In [7]:
def calculate_iou(bbox1, bbox2):
    """Calculate Intersection over Union (IoU) between two bounding boxes."""
    try:
        x1_min, y1_min, x1_max, y1_max = bbox1
        x2_min, y2_min, x2_max, y2_max = bbox2

        # Calculate intersection coordinates
        x_left = max(x1_min, x2_min)
        y_top = max(y1_min, y2_min)
        x_right = min(x1_max, x2_max)
        y_bottom = min(y1_max, y2_max)

        # Check if there's no intersection
        if x_right <= x_left or y_bottom <= y_top:
            return 0.0

        # Calculate areas
        intersection_area = (x_right - x_left) * (y_bottom - y_top)
        bbox1_area = (x1_max - x1_min) * (y1_max - y1_min)
        bbox2_area = (x2_max - x2_min) * (y2_max - y2_min)
        union_area = bbox1_area + bbox2_area - intersection_area

        return intersection_area / union_area if union_area > 0 else 0.0
    except Exception:
        return 0.0


def safe_float_conversion(value):
    """Safely convert value to float for numeric tasks."""
    try:
        return float(value)
    except Exception:
        return None


def normalize_text(text):
    """Normalize text for consistent comparison."""
    return str(text).strip().lower() if text is not None else ""
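
As a quick sanity check of the IoU computation, the sketch below works through two cases by hand. It assumes the `(x_min, y_min, x_max, y_max)` box layout implied by the unpacking in `calculate_iou`; the name `iou_xyxy` is introduced here only for illustration.

```python
def iou_xyxy(box_a, box_b):
    """IoU for boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes offset by 5 in each direction: intersection 25, union 175
overlap = iou_xyxy((0, 0, 10, 10), (5, 5, 15, 15))
# Boxes that do not touch at all
disjoint = iou_xyxy((0, 0, 10, 10), (20, 20, 30, 30))

print(round(overlap, 4), disjoint)
```

An IoU of 25/175 (about 0.14) is well below the 0.5 threshold used later, so this pair would not count as a match.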

### 4.1 Classification Metrics

$$ \text{Balanced Accuracy} = \frac{1}{C}\sum_{i = 1}^C \frac{TP_i}{TP_i + FN_i} $$

In [8]:
def calculate_classification_metrics(predictions, ground_truth):
    """Calculate balanced accuracy for single-label classification tasks."""
    normalized_gt = [normalize_text(ref) for ref in ground_truth]
    normalized_pred = [normalize_text(pred) for pred in predictions]

    return {
        "balanced_accuracy": balanced_accuracy_score(normalized_gt, normalized_pred)
    }
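
The formula above can be checked on a small imbalanced example. This is a pure-Python sketch of the mean per-class recall rather than a replacement for `balanced_accuracy_score`; the labels are hypothetical, and for inputs like these the two should agree.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, following the formula above."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: three samples of one class, one of the other
y_true = ["benign", "benign", "benign", "malignant"]
y_pred = ["benign", "benign", "malignant", "malignant"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)

print(plain_accuracy)  # 3 of 4 correct
print(balanced)        # (2/3 + 1/1) / 2
```

Plain accuracy weights every sample equally, while balanced accuracy weights every class equally, which matters when class frequencies differ.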

### 4.2 Multi-label Classification Metrics

$$
Precision_{micro} = \frac{\sum_i TP_i}{ \sum_i TP_i + \sum_i FP_i }, \quad Recall_{micro} = \frac{\sum_i TP_i}{ \sum_i TP_i + \sum_i FN_i }
$$

$$
F1_{micro} = 2 \cdot \frac{Precision_{micro} \cdot Recall_{micro}}{Precision_{micro} + Recall_{micro}}
$$


In [9]:
def calculate_multilabel_metrics(predictions, ground_truth):
    """
    Calculate F1-score for multi-label classification tasks.
    Labels are separated by semicolons.
    """
    true_positives = false_positives = false_negatives = 0

    for pred, gt in zip(predictions, ground_truth):
        if isinstance(pred, str) and isinstance(gt, str):
            # Parse labels (semicolon-separated)
            pred_labels = set(label.strip().lower() for label in re.split(r'[;]', pred) if label.strip())
            gt_labels = set(label.strip().lower() for label in re.split(r'[;]', gt) if label.strip())

            # Calculate confusion matrix components
            true_positives += len(pred_labels & gt_labels)
            false_positives += len(pred_labels - gt_labels)
            false_negatives += len(gt_labels - pred_labels)

    # Calculate precision, recall, and F1-score
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0
    f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {
        "f1_score": f1_score,
        "precision": precision,
        "recall": recall
    }
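
A small worked example may help show how the micro-averaged counts accumulate across samples. The sketch below re-implements the same counting in compact form; the function name `micro_f1` and the label strings are hypothetical.

```python
def micro_f1(predictions, ground_truth):
    """Micro-averaged precision, recall and F1 over semicolon-separated label strings."""
    tp = fp = fn = 0
    for pred, gt in zip(predictions, ground_truth):
        pred_labels = {p.strip().lower() for p in pred.split(";") if p.strip()}
        gt_labels = {g.strip().lower() for g in gt.split(";") if g.strip()}
        tp += len(pred_labels & gt_labels)
        fp += len(pred_labels - gt_labels)
        fn += len(gt_labels - pred_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Sample 1: one shared label, one extra prediction, one missed label
# Sample 2: exact match (case-insensitive)
preds = ["Edema; Effusion", "cardiomegaly"]
gts = ["edema; pneumonia", "Cardiomegaly"]

precision, recall, f1 = micro_f1(preds, gts)
print(precision, recall, f1)  # TP = 2, FP = 1, FN = 1 overall
```

The counts are summed over all samples before dividing, so samples with many labels influence the score more than samples with few.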

### 4.3 Detection Metrics

$$
Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}
$$

$$
F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$

In [10]:
def calculate_detection_metrics(predictions, ground_truth):
    """
    Calculate F1-score for object detection tasks.
    Matches predictions to ground truth using IoU > 0.5 threshold.
    """
    true_positives = false_positives = false_negatives = 0

    for pred, gt in zip(predictions, ground_truth):
        try:
            # Parse string-encoded lists (Python literal syntax)
            if isinstance(gt, str):
                gt = ast.literal_eval(gt)
            if isinstance(pred, str):
                pred = ast.literal_eval(pred)

            if not isinstance(pred, list):
                false_negatives += len(gt)
                continue

            matched_predictions = set()

            # For each ground truth box, find best matching prediction
            for gt_bbox in gt:
                best_iou, best_idx = 0.0, -1

                for idx, pred_bbox in enumerate(pred):
                    if idx in matched_predictions:
                        continue
                    iou = calculate_iou(gt_bbox, pred_bbox)
                    if iou > best_iou:
                        best_iou, best_idx = iou, idx

                if best_iou > 0.5:  # IoU threshold
                    true_positives += 1
                    matched_predictions.add(best_idx)
                else:
                    false_negatives += 1

            # Unmatched predictions are false positives
            false_positives += len(pred) - len(matched_predictions)

        except Exception:
            continue

    # Calculate metrics
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0
    f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {
        "detection_f1": f1_score,
        "precision": precision,
        "recall": recall
    }
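
One behaviour of the function above is worth noting: if a prediction string cannot be parsed, `ast.literal_eval` raises, the `except` branch runs, and the sample is skipped without adding any false negatives. The sketch below shows a hypothetical helper, `parse_boxes` (not part of the original script), that makes parse failures explicit so the caller can decide how to count them.

```python
import ast


def parse_boxes(raw):
    """Return a list of boxes, or None if `raw` cannot be parsed into a list."""
    if isinstance(raw, list):
        return raw
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, list) else None


good = parse_boxes("[[10, 20, 50, 60]]")   # well-formed list of one box
bad = parse_boxes("no lesion found")       # free text, not a Python literal
print(good, bad)
```

With a helper like this, a caller could treat `None` as "no boxes predicted" and count every ground-truth box for that sample as a false negative. Whether that is the intended scoring rule is a design decision this notebook does not state.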

### 4.4 Instance Detection Metrics

$$
Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}
$$

$$
F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
$$

In [11]:
def calculate_instance_detection_metrics(predictions, ground_truth):
    """
    Calculate F1-score for instance detection tasks.
    Handles class-aware detection with IoU matching.
    """
    true_positives = false_positives = false_negatives = 0

    for pred, gt in zip(predictions, ground_truth):
        try:
            # Parse JSON strings to dictionaries
            if isinstance(gt, str):
                gt = json.loads(gt)
            if isinstance(pred, str):
                pred = json.loads(pred)

            if not isinstance(pred, dict):
                # Count all ground truth instances as false negatives
                total_fn = sum(len(v) for v in gt.values() if isinstance(v, list))
                false_negatives += total_fn
                continue

            # Process each class separately
            all_classes = set(gt.keys()) | set(pred.keys())

            for class_name in all_classes:
                gt_bboxes = gt.get(class_name, [])
                pred_bboxes = pred.get(class_name, [])

                # One-to-one matching (greedy approximation of Hungarian matching)
                gt_matched = set()
                pred_matched = set()

                # Create IoU matrix
                iou_matrix = [[calculate_iou(gt_box, pred_box)
                              for pred_box in pred_bboxes]
                              for gt_box in gt_bboxes]

                # Greedy matching: accept the best remaining pair while IoU >= 0.5
                while True:
                    max_iou, max_gt_idx, max_pred_idx = -1, -1, -1

                    for i, row in enumerate(iou_matrix):
                        if i in gt_matched:
                            continue
                        for j, iou_value in enumerate(row):
                            if j in pred_matched:
                                continue
                            if iou_value > max_iou:
                                max_iou, max_gt_idx, max_pred_idx = iou_value, i, j

                    if max_iou >= 0.5:
                        true_positives += 1
                        gt_matched.add(max_gt_idx)
                        pred_matched.add(max_pred_idx)
                    else:
                        break

                false_negatives += len(gt_bboxes) - len(gt_matched)
                false_positives += len(pred_bboxes) - len(pred_matched)

        except Exception:
            continue

    # Calculate metrics
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0.0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0.0
    f1_score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    return {
        "instance_f1": f1_score,
        "precision": precision,
        "recall": recall
    }
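
The matching loop above is greedy: it always takes the highest-IoU remaining pair. That can yield fewer matches than an optimal one-to-one assignment. The sketch below contrasts the two on a hand-made IoU matrix; `greedy_matches` and `optimal_matches` are illustrative names, and the brute-force search assumes a non-empty matrix with no more ground-truth boxes than predictions.

```python
from itertools import permutations


def greedy_matches(iou_matrix, threshold=0.5):
    """Repeatedly take the highest-IoU unmatched pair while it meets the threshold."""
    gt_used, pred_used, matches = set(), set(), 0
    while True:
        best = (-1.0, -1, -1)
        for i, row in enumerate(iou_matrix):
            if i in gt_used:
                continue
            for j, value in enumerate(row):
                if j not in pred_used and value > best[0]:
                    best = (value, i, j)
        if best[0] < threshold:
            return matches
        matches += 1
        gt_used.add(best[1])
        pred_used.add(best[2])


def optimal_matches(iou_matrix, threshold=0.5):
    """Brute-force the best one-to-one assignment (assumes n_gt <= n_pred, tiny inputs)."""
    n_gt, n_pred = len(iou_matrix), len(iou_matrix[0])
    best = 0
    for perm in permutations(range(n_pred), n_gt):
        count = sum(iou_matrix[i][j] >= threshold for i, j in enumerate(perm))
        best = max(best, count)
    return best


# Rows are ground-truth boxes, columns are predicted boxes (hypothetical values)
iou = [[0.6, 0.9],
       [0.0, 0.7]]

print(greedy_matches(iou), optimal_matches(iou))
```

Greedy matching pairs the first ground-truth box with the second prediction (IoU 0.9), which leaves the second ground-truth box with only a 0.0 candidate. The optimal assignment pairs along the diagonal (0.6 and 0.7) and finds two matches.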

### 4.5 Regression and Counting Metrics

$$
MAE = \frac{1}{N} \sum_{i = 1}^N |y_i - \hat{y}_i|
$$

Where:
- $ y_i $: ground truth value
- $ \hat{y}_i $: predicted value
- $ N $: number of samples


In [12]:
def calculate_regression_metrics(predictions, ground_truth):
    """Calculate Mean Absolute Error for regression and counting tasks."""
    gt_floats = [safe_float_conversion(x) for x in ground_truth]
    pred_floats = [safe_float_conversion(x) for x in predictions]

    # Filter out invalid conversions
    valid_pairs = [(pred, gt) for pred, gt in zip(pred_floats, gt_floats)
                   if pred is not None and gt is not None]

    if not valid_pairs:
        return {"mean_absolute_error": None}

    preds, gts = zip(*valid_pairs)
    mae = mean_absolute_error(gts, preds)

    return {
        "mean_absolute_error": mae,
        "valid_samples": len(valid_pairs),
        "total_samples": len(predictions)
    }
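
The following sketch works through the MAE computation on three hypothetical answers, one of which is not numeric. It mirrors the filtering step above in pure Python; `to_float` is an illustrative stand-in for `safe_float_conversion`.

```python
def to_float(value):
    """Return float(value), or None when the value is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical answers; "five" cannot be converted and is dropped
predictions = ["3", "five", "10"]
ground_truth = ["4", "5", "7"]

pairs = [
    (p, g)
    for p, g in zip(map(to_float, predictions), map(to_float, ground_truth))
    if p is not None and g is not None
]
mae = sum(abs(g - p) for p, g in pairs) / len(pairs)

print(mae, len(pairs), len(predictions))  # (|4-3| + |7-10|) / 2 over 2 of 3 samples
```

Note that an unparseable prediction is excluded rather than penalised, which is why comparing `valid_samples` with `total_samples` in the returned dictionary is worthwhile.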

### 4.6 Report Generation Metrics

$$
GREEN = \frac{\# \ matched \ findings }{ \# \ matched \ findings + \sum_{i = (a)}^{(f)} error_{sig,i}}
$$

Where:
- **(a)** False report of a finding in the candidate
- **(b)** Missing a finding present in the reference
- **(c)** Misidentification of a finding's anatomic location/position
- **(d)** Misassessment of the severity of a finding
- **(e)** Mentioning a comparison that isn't in the reference
- **(f)** Omitting a comparison detailing a change from a prior study

In [13]:
def calculate_report_generation_metrics(predictions, ground_truth):
    """Calculate GREEN Score for medical report generation tasks."""
    try:
        green_model_id = "StanfordAIMI/GREEN-radllama2-7b"
        green_scorer = GREEN(green_model_id, output_dir=".")
        mean, _, _, _, _ = green_scorer(ground_truth, predictions)
        return {"green_score": mean}
    except Exception as e:
        return {"green_score": None, "error": str(e)}
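
To make the formula concrete, the sketch below applies it to one set of hypothetical counts. It is plain arithmetic only: in practice the `green_score` package derives the counts from the model's output, and returning `None` for an empty denominator is an assumption made here, not documented behaviour of that package.

```python
def green_from_counts(matched_findings, significant_errors):
    """Apply the formula above to one report; None if there is nothing to score."""
    total_errors = sum(significant_errors.values())
    denominator = matched_findings + total_errors
    return matched_findings / denominator if denominator > 0 else None


# Hypothetical counts for one candidate report, keyed by error category (a)-(f)
errors = {"a": 1, "b": 0, "c": 0, "d": 0, "e": 0, "f": 0}
score = green_from_counts(matched_findings=4, significant_errors=errors)

print(score)  # 4 / (4 + 1)
```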

## 5. Task Dispatcher and Data Loading

### 5.1 Task Dispatcher
Routes each task type to its appropriate metric calculation function based on the TaskType field in the dataset.

In [14]:
def calculate_task_metrics(predictions, ground_truth, task_type):
    """Calculate appropriate metrics based on task type."""
    task_type_normalized = task_type.lower().strip()

    # Route to appropriate metric calculation function
    if task_type_normalized == "classification":
        return calculate_classification_metrics(predictions, ground_truth)
    elif task_type_normalized == "multi-label classification":
        return calculate_multilabel_metrics(predictions, ground_truth)
    elif task_type_normalized == "detection":
        return calculate_detection_metrics(predictions, ground_truth)
    elif task_type_normalized in ("regression", "counting"):
        return calculate_regression_metrics(predictions, ground_truth)
    elif task_type_normalized == "instance_detection":
        return calculate_instance_detection_metrics(predictions, ground_truth)
    elif task_type_normalized in ("report_generation", "report generation"):
        return calculate_report_generation_metrics(predictions, ground_truth)
    else:
        print(f"Unknown task type: {task_type}")
        return {}
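
The dispatcher accepts both `report_generation` and `report generation`, but only the underscore form `instance_detection`. One way to keep such spelling variants in a single place is an alias table, sketched below. `TASK_ALIASES` and `resolve_task_type` are hypothetical names, not part of the original script.

```python
# Hypothetical alias table: maps accepted spellings to one canonical task name.
TASK_ALIASES = {
    "classification": "classification",
    "multi-label classification": "multi-label classification",
    "detection": "detection",
    "instance_detection": "instance_detection",
    "regression": "regression",
    "counting": "counting",
    "report_generation": "report_generation",
    "report generation": "report_generation",
}


def resolve_task_type(task_type):
    """Normalise a raw TaskType string; returns None for unknown task types."""
    return TASK_ALIASES.get(task_type.lower().strip())


print(resolve_task_type("  Report Generation "))
print(resolve_task_type("instance detection"))  # spelling not in the table
```

Adding a new accepted spelling then means adding one dictionary entry rather than editing the `if`/`elif` chain.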

### 5.2 Load and Match Data

Load ground truth and prediction data, then match samples for evaluation.

In [15]:
print("Loading and validating data files...")

# Construct ground truth dataset path
gt_dataset_path = os.path.join(args.base_dataset_path, "validation-public")
print(f"Ground truth path: {gt_dataset_path}")

# Validate paths and discover files
try:
    gt_files, _ = validate_paths(gt_dataset_path, args.prediction_file)
    print(f"Found {len(gt_files)} ground truth JSON files")

    if args.verbose:
        print("   Ground truth files:")
        for file in gt_files:
            print(f"     - {os.path.relpath(file, gt_dataset_path)}")

except Exception as e:
    print(f"Path validation failed: {e}")
    raise

# Load ground truth data
print(f"\nLoading ground truth data...")
ground_truth_data, gt_dataset_info = load_and_merge_json_files(gt_files)
print(f"Total ground truth samples: {len(ground_truth_data)}")

if args.verbose:
    print("   Dataset breakdown:")
    for filename, count in gt_dataset_info.items():
        print(f"     - {filename}: {count} samples")

# Load prediction data
print(f"\nLoading prediction data from {args.prediction_file}...")
try:
    with open(args.prediction_file, 'r') as f:
        prediction_data = json.load(f)
    print(f"Total prediction samples: {len(prediction_data)}")
except Exception as e:
    print(f"Failed to load predictions: {e}")
    raise

print(f"\nMatching samples between ground truth and predictions...")

Loading and validating data files...
Ground truth path: /content/drive/MyDrive/flare/organized_dataset/validation-public
Found 12 ground truth JSON files
   Ground truth files:
     - Xray/IU_XRay/IU_XRay_questions_val.json
     - Xray/chestdr/chestdr_questions_val.json
     - Endoscopy/endo/endo_questions_val.json
     - Clinical/neojaundice/neojaundice_questions_val.json
     - Mammography/CMMD/CMMD_questions_val.json
     - Retinography/retino/retino_questions_val.json
     - Ultrasound/BUSI-det/BUSI-det_questions_val.json
     - Ultrasound/BUSI/BUSI_questions_val.json
     - Ultrasound/BUS-UCLM/BUS-UCLM_questions_val.json
     - Ultrasound/BUS-UCLM-det/BUS-UCLM-det_questions_val.json
     - Microscopy/neurips22cell/neurips22cell_questions_val.json
     - Dermatology/bcn20000/bcn20000_questions_val.json

Loading ground truth data...
Total ground truth samples: 5577
   Dataset breakdown:
     - IU_XRay_questions_val.json: 1945 samples
     - chestdr_questions_val.json: 970 samples
  

In [16]:
# Create lookup dictionaries for efficient matching
gt_lookup = {create_sample_key(sample): sample for sample in ground_truth_data}
pred_lookup = {create_sample_key(sample): sample for sample in prediction_data}

print(f"🔍 Sample matching statistics:")
print(f"   GT lookup size: {len(gt_lookup)}")
print(f"   Pred lookup size: {len(pred_lookup)}")

# Group data by task type and match samples
task_type_to_gt = defaultdict(list)
task_type_to_pred = defaultdict(list)
task_type_counts = defaultdict(int)

matched_samples = 0
unmatched_samples = 0
task_type_distribution = defaultdict(int)

for sample_key, gt_sample in gt_lookup.items():
    task_type = gt_sample.get("TaskType", "").strip().lower()

    if not task_type:
        continue

    task_type_distribution[task_type] += 1

    pred_sample = pred_lookup.get(sample_key)
    if pred_sample is None:
        unmatched_samples += 1
        continue

    gt_answer = gt_sample.get("Answer", "")
    pred_answer = pred_sample.get("Answer", "")

    task_type_to_gt[task_type].append(gt_answer)
    task_type_to_pred[task_type].append(pred_answer)
    task_type_counts[task_type] += 1
    matched_samples += 1

print(f"\n📈 Matching Results:")
print(f"   ✅ Successfully matched: {matched_samples} samples")
print(f"   ❌ Unmatched samples: {unmatched_samples}")

print(f"\n📋 Task Type Distribution:")
for task_type in sorted(task_type_distribution.keys()):
    total = task_type_distribution[task_type]
    matched = task_type_counts[task_type]
    match_rate = (matched / total * 100) if total > 0 else 0
    print(f"   {task_type}: {matched}/{total} ({match_rate:.1f}%)")


🔍 Sample matching statistics:
   GT lookup size: 5577
   Pred lookup size: 5577

📈 Matching Results:
   ✅ Successfully matched: 5577 samples
   ❌ Unmatched samples: 0

📋 Task Type Distribution:
   classification: 2843/2843 (100.0%)
   counting: 100/100 (100.0%)
   detection: 175/175 (100.0%)
   multi-label classification: 514/514 (100.0%)
   report_generation: 1945/1945 (100.0%)
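
Building the lookups as dictionaries means that two samples with the same image name and question collapse into one entry, with the later one winning. In the output above the lookup size equals the sample count (5577), so no collisions occurred, but a check along the following lines could flag them on other data. `find_duplicate_keys` and the sample records are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(samples):
    """Return keys (image name + question) that occur more than once."""
    counts = Counter(
        f"{s.get('ImageName', s.get('image', ''))}||{s.get('Question', '')}"
        for s in samples
    )
    return [key for key, n in counts.items() if n > 1]


# Hypothetical samples; the first two share a key
samples = [
    {"ImageName": "a.png", "Question": "How many cells?"},
    {"ImageName": "a.png", "Question": "How many cells?"},
    {"ImageName": "b.png", "Question": "How many cells?"},
]

duplicates = find_duplicate_keys(samples)
print(duplicates)
```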


In [17]:
print("🧮 Calculating metrics by task type...")
evaluation_results = {}
total_evaluated = 0

for task_type in sorted(task_type_to_gt.keys()):
    gt_answers = task_type_to_gt[task_type]
    pred_answers = task_type_to_pred[task_type]

    print(f"\n📊 Evaluating {task_type}:")
    print(f"   Samples: {len(gt_answers)}")

    # Calculate task-specific metrics
    metrics = calculate_task_metrics(pred_answers, gt_answers, task_type)
    metrics["num_examples"] = len(gt_answers)
    evaluation_results[task_type] = metrics
    total_evaluated += len(gt_answers)

    # Display results
    if args.verbose:
        print("   Results:")
        for metric_name, metric_value in metrics.items():
            if metric_name != "num_examples":
                if isinstance(metric_value, float):
                    print(f"     {metric_name}: {metric_value:.4f}")
                else:
                    print(f"     {metric_name}: {metric_value}")
    else:
        # Show primary metric only
        primary_metrics = {
            "classification": "balanced_accuracy",
            "multi-label classification": "f1_score",
            "detection": "detection_f1",
            "instance_detection": "instance_f1",
            "regression": "mean_absolute_error",
            "counting": "mean_absolute_error",
            "report_generation": "green_score"
        }

        primary_metric = primary_metrics.get(task_type)
        if primary_metric and primary_metric in metrics:
            value = metrics[primary_metric]
            if isinstance(value, float):
                print(f"   {primary_metric}: {value:.4f}")
            else:
                print(f"   {primary_metric}: {value}")

print(f"\n🎯 Evaluation Summary:")
print(f"   Total evaluated samples: {total_evaluated}")
print(f"   Task types evaluated: {len(evaluation_results)}")


🧮 Calculating metrics by task type...

📊 Evaluating classification:
   Samples: 2843
   Results:
     balanced_accuracy: 0.2532

📊 Evaluating counting:
   Samples: 100
   Results:
     mean_absolute_error: 295.6500
     valid_samples: 100
     total_samples: 100

📊 Evaluating detection:
   Samples: 175
   Results:
     detection_f1: 0.1979
     precision: 0.2022
     recall: 0.1937

📊 Evaluating multi-label classification:
   Samples: 514
   Results:
     f1_score: 0.4437
     precision: 0.4730
     recall: 0.4177

📊 Evaluating report_generation:
   Samples: 1945


Processing data...making prompts


Map:   0%|          | 0/1945 [00:00<?, ? examples/s]

Done.
==== Beginning Inference ====


244it [2:46:50, 41.03s/it]


==== End Inference ====
Computing summary ...


Seconds per example:  5.173197606412795
   Results:
     green_score: 0.7063

🎯 Evaluation Summary:
   Total evaluated samples: 5577
   Task types evaluated: 5


In [19]:
print("💾 Saving evaluation results...")

# Create output directory if it doesn't exist
os.makedirs(args.output_dir, exist_ok=True)
output_file = os.path.join(args.output_dir, args.output_filename)

# Add metadata to results
final_results = {
    "evaluation_metadata": {
        "total_samples_evaluated": total_evaluated,
        "total_task_types": len(evaluation_results),
        "ground_truth_path": gt_dataset_path,
        "prediction_file": args.prediction_file,
        "matched_samples": matched_samples,
        "unmatched_samples": unmatched_samples
    },
    "task_metrics": evaluation_results
}

# Save to JSON file
try:
    with open(output_file, "w") as f:
        json.dump(final_results, f, indent=2)
    print(f"✅ Evaluation metrics saved to: {output_file}")
except Exception as e:
    print(f"❌ Failed to save results: {e}")

💾 Saving evaluation results...
✅ Evaluation metrics saved to: evaluation_results/metrics_public.json

📁 Output file structure:
   📊 Metadata: evaluation_metadata
   📈 Metrics: task_metrics
   📂 Location: evaluation_results/metrics_public.json
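
Once saved, the metrics file can be read back for reporting. The sketch below writes a small file with the same two top-level keys (`evaluation_metadata` and `task_metrics`) to a temporary directory and reads it again; the values are copied from the outputs shown earlier purely as an illustration, and the real file contains more fields.

```python
import json
import os
import tempfile

# Illustrative content with the same top-level layout as the saved metrics
example = {
    "evaluation_metadata": {"matched_samples": 5577, "unmatched_samples": 0},
    "task_metrics": {
        "classification": {"balanced_accuracy": 0.2532, "num_examples": 2843},
        "counting": {"mean_absolute_error": 295.65, "num_examples": 100},
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metrics_public.json")
    with open(path, "w") as f:
        json.dump(example, f, indent=2)

    # Read the file back, as a downstream report script might
    with open(path) as f:
        loaded = json.load(f)

summary = {task: m["num_examples"] for task, m in loaded["task_metrics"].items()}
print(summary)
```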
