# 📊 Comprehensive Model Evaluation for Cardiac Segmentation

This notebook provides a comprehensive evaluation framework for cardiac MRI segmentation models. We'll evaluate trained models using quantitative metrics, qualitative visualizations, error analysis, and performance comparisons across different architectures and configurations.

## Objectives
- Load and evaluate trained models from previous notebooks
- Compute comprehensive quantitative metrics (medical-specific)
- Create qualitative visualizations of segmentation results
- Perform detailed error analysis and failure case investigation
- Compare performance across different model architectures
- Generate statistical analysis and confidence intervals
- Create publication-ready evaluation reports

## Key Components
1. **Model Loading**: Load trained models and configurations
2. **Quantitative Evaluation**: Medical segmentation metrics computation
3. **Qualitative Analysis**: Visual comparison of predictions vs ground truth
4. **Error Analysis**: Identification and analysis of failure cases
5. **Statistical Analysis**: Confidence intervals and significance testing
6. **Performance Comparison**: Multi-model architecture comparison
7. **Report Generation**: Comprehensive evaluation reports

## Evaluation Strategy
- **Multi-metric Assessment**: Dice, IoU, Hausdorff Distance, Surface metrics
- **Per-class Analysis**: Individual evaluation for each cardiac structure
- **Statistical Robustness**: Bootstrap confidence intervals and significance tests
- **Visual Validation**: Qualitative assessment with expert-level visualizations
- **Failure Case Analysis**: Systematic investigation of poor predictions

In [5]:
# Environment Setup and Library Imports
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import pickle
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Statistical analysis
from scipy import stats
from scipy.stats import bootstrap
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, classification_report

# PyTorch and related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import torchvision.transforms as transforms

# Image processing
import cv2
from skimage import morphology, filters, segmentation
from skimage.metrics import structural_similarity

# Check if running in Google Colab or VS Code
try:
    import google.colab
    IN_COLAB = True
    print("🔵 Running in Google Colab")
    
    # Mount Google Drive if needed
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Install additional packages if needed
    !pip install -q plotly
    !pip install -q nibabel
    !pip install -q SimpleITK
    
except ImportError:
    IN_COLAB = False
    print("🟢 Running in VS Code/Local Environment")
    
    # Check if packages are installed, install if needed
    try:
        import plotly.graph_objects as go
        import plotly.express as px
        import nibabel as nib
        print("✅ All required packages are available")
    except ImportError as e:
        print(f"⚠️ Missing package: {e}")
        print("Installing missing packages...")
        # For VS Code, you can install packages using the terminal or pip
        # !pip install plotly nibabel SimpleITK

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.cuda.manual_seed_all(42)

# Configure matplotlib
plt.style.use('default')
sns.set_palette("husl")

# Setup paths based on environment
if IN_COLAB:
    BASE_DIR = Path('/content/drive/MyDrive/Heart_Segmentation_Advanced')
else:
    BASE_DIR = Path('.')

DATA_DIR = BASE_DIR / 'data'
MODEL_DIR = BASE_DIR / 'models'
OUTPUT_DIR = BASE_DIR / 'outputs'
UTILS_DIR = BASE_DIR / 'utils'

# Add utils to path
sys.path.append(str(UTILS_DIR))

# Import custom modules
try:
    from visualization_utils import VisualizationUtils
    print("✅ Custom modules imported successfully")
except ImportError as e:
    print(f"⚠️ Could not import custom modules: {e}")
    print("Make sure utils modules are available in the project directory")

# PyTorch configuration
print(f"PyTorch version: {torch.__version__}")

# GPU configuration
if torch.cuda.is_available():
    print(f"🚀 CUDA available: {torch.cuda.device_count()} GPU(s)")
    for i in range(torch.cuda.device_count()):
        print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("⚠️ No GPU detected, using CPU")

# Mixed precision configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
use_amp = torch.cuda.is_available()  # Automatic Mixed Precision
print(f'Device: {device}')
if use_amp:
    print('Mixed precision: Enabled (AMP)')
else:
    print('Mixed precision: Disabled (CPU mode)')

print("\n✅ Environment setup complete!")
print(f"📁 Working directory: {BASE_DIR}")
print(f"🎯 Ready for model evaluation!")

🟢 Running in VS Code/Local Environment
✅ All required packages are available
⚠️ Could not import custom modules: cannot import name 'VisualizationUtils' from 'visualization_utils' (c:\Users\leonardo.costa\OneDrive - Lightera, LLC\Documentos\GitHub\pratica-aprendizado-de-maquina\Heart_Segmentation_Advanced\utils\visualization_utils.py)
Make sure utils modules are available in the project directory
PyTorch version: 2.7.1+cpu
⚠️ No GPU detected, using CPU
Device: cpu
Mixed precision: Disabled (CPU mode)

✅ Environment setup complete!
📁 Working directory: .
🎯 Ready for model evaluation!


In [8]:
# Load PyTorch-based Implementations
print("📥 Loading PyTorch implementations from previous notebooks...")

# PyTorch-based metric functions
def dice_coefficient(y_true, y_pred, smooth=1e-6):
    """Calculate Dice coefficient using PyTorch tensors"""
    y_true_f = y_true.view(-1)
    y_pred_f = y_pred.view(-1)
    intersection = torch.sum(y_true_f * y_pred_f)
    dice_coeff = (2. * intersection + smooth) / (torch.sum(y_true_f) + torch.sum(y_pred_f) + smooth)
    return dice_coeff

def iou_score(y_true, y_pred, threshold=0.5, smooth=1e-6):
    """Calculate IoU (Jaccard) score using PyTorch tensors"""
    y_true_binary = (y_true > threshold).float()
    y_pred_binary = (y_pred > threshold).float()
    intersection = torch.sum(y_true_binary * y_pred_binary)
    union = torch.sum(y_true_binary) + torch.sum(y_pred_binary) - intersection
    iou = (intersection + smooth) / (union + smooth)
    return iou

def sensitivity_score(y_true, y_pred, threshold=0.5, smooth=1e-6):
    """Calculate sensitivity (recall) using PyTorch tensors"""
    y_true_binary = (y_true > threshold).float()
    y_pred_binary = (y_pred > threshold).float()
    true_positives = torch.sum(y_true_binary * y_pred_binary)
    possible_positives = torch.sum(y_true_binary)
    sensitivity = (true_positives + smooth) / (possible_positives + smooth)
    return sensitivity

def specificity_score(y_true, y_pred, threshold=0.5, smooth=1e-6):
    """Calculate specificity using PyTorch tensors"""
    y_true_binary = (y_true > threshold).float()
    y_pred_binary = (y_pred > threshold).float()
    true_negatives = torch.sum((1 - y_true_binary) * (1 - y_pred_binary))
    possible_negatives = torch.sum(1 - y_true_binary)
    specificity = (true_negatives + smooth) / (possible_negatives + smooth)
    return specificity

def precision_score_custom(y_true, y_pred, threshold=0.5, smooth=1e-6):
    """Calculate precision using PyTorch tensors (renamed to avoid sklearn conflict)"""
    y_true_binary = (y_true > threshold).float()
    y_pred_binary = (y_pred > threshold).float()
    true_positives = torch.sum(y_true_binary * y_pred_binary)
    predicted_positives = torch.sum(y_pred_binary)
    precision = (true_positives + smooth) / (predicted_positives + smooth)
    return precision

class MedicalMetrics:
    """Medical imaging metrics calculator for PyTorch tensors and numpy arrays"""
    def __init__(self, threshold=0.5, smooth=1e-6):
        self.threshold = threshold
        self.smooth = smooth
    
    def calculate_metrics(self, y_true, y_pred):
        """Calculate comprehensive medical metrics"""
        # Convert numpy arrays to tensors if needed
        if isinstance(y_true, np.ndarray):
            y_true = torch.from_numpy(y_true).float()
        if isinstance(y_pred, np.ndarray):
            y_pred = torch.from_numpy(y_pred).float()
        
        # Ensure tensors are on the same device
        device = y_pred.device
        y_true = y_true.to(device)
        
        metrics = {
            'dice': dice_coefficient(y_true, y_pred, self.smooth).item(),
            'iou': iou_score(y_true, y_pred, self.threshold, self.smooth).item(),
            'sensitivity': sensitivity_score(y_true, y_pred, self.threshold, self.smooth).item(),
            'specificity': specificity_score(y_true, y_pred, self.threshold, self.smooth).item(),
            'precision': precision_score_custom(y_true, y_pred, self.threshold, self.smooth).item()
        }
        
        # Calculate additional metrics
        recall = metrics['sensitivity']  # Same as sensitivity
        if metrics['precision'] + recall > 0:
            f1_score = 2 * (metrics['precision'] * recall) / (metrics['precision'] + recall)
        else:
            f1_score = 0.0
        
        metrics['recall'] = recall
        metrics['f1_score'] = f1_score
        
        return metrics
        
    def compute_metrics_numpy(self, y_true_np, y_pred_np):
        """Compute metrics using numpy arrays (for compatibility)"""
        y_true_binary = (y_true_np > self.threshold).astype(np.float32)
        y_pred_binary = (y_pred_np > self.threshold).astype(np.float32)
        
        intersection = np.sum(y_true_binary * y_pred_binary)
        union = np.sum(y_true_binary) + np.sum(y_pred_binary) - intersection
        
        # Dice coefficient
        dice = (2.0 * intersection + self.smooth) / (np.sum(y_true_binary) + np.sum(y_pred_binary) + self.smooth)
        
        # IoU
        iou = (intersection + self.smooth) / (union + self.smooth)
        
        # Sensitivity and Specificity
        true_positives = intersection
        false_negatives = np.sum(y_true_binary * (1 - y_pred_binary))
        false_positives = np.sum((1 - y_true_binary) * y_pred_binary)
        true_negatives = np.sum((1 - y_true_binary) * (1 - y_pred_binary))
        
        sensitivity = true_positives / (true_positives + false_negatives + self.smooth)
        specificity = true_negatives / (true_negatives + false_positives + self.smooth)
        precision = true_positives / (true_positives + false_positives + self.smooth)
        
        f1 = 2 * (precision * sensitivity) / (precision + sensitivity + self.smooth)
        
        vol_true = np.sum(y_true_binary)
        vol_pred = np.sum(y_pred_binary)
        vol_sim = 1.0 - np.abs(vol_true - vol_pred) / (vol_true + vol_pred + self.smooth)
        
        return {
            'dice': dice,
            'iou': iou,
            'sensitivity': sensitivity,
            'specificity': specificity,
            'precision': precision,
            'f1': f1,
            'volume_similarity': vol_sim
        }

class EvaluationConfig:
    """Configuration class for model evaluation"""
    def __init__(self):
        # Evaluation parameters
        self.batch_size = 8
        self.threshold = 0.5
        
        # Metrics parameters
        self.smooth = 1e-6
        self.hausdorff_percentile = 95
        
        # Visualization parameters
        self.num_samples_to_show = 10
        self.figure_size = (15, 10)
        self.cmap = 'viridis'
        
        # Statistical analysis
        self.confidence_level = 0.95
        self.bootstrap_samples = 1000
        
        # Paths
        self.results_dir = OUTPUT_DIR / 'evaluation_results'
        self.results_dir.mkdir(parents=True, exist_ok=True)
        
    def save_config(self, path=None):
        if path is None:
            path = self.results_dir / 'evaluation_config.json'
        
        config_dict = {k: v for k, v in self.__dict__.items() 
                      if not k.startswith('_') and isinstance(v, (str, int, float, bool, list))}
        
        # Convert Path objects to strings
        for key, value in config_dict.items():
            if isinstance(value, Path):
                config_dict[key] = str(value)
        
        with open(path, 'w') as f:
            json.dump(config_dict, f, indent=2)
        
        return path

# Initialize evaluation configuration
eval_config = EvaluationConfig()
config_path = eval_config.save_config()

# Initialize metrics calculator
metrics_calculator = MedicalMetrics()

print("✅ PyTorch-based implementations loaded successfully!")
print(f"📁 Evaluation results will be saved to: {eval_config.results_dir}")
print(f"⚙️ Configuration saved to: {config_path}")
print("🔥 All functions use PyTorch tensors and operations")

📥 Loading PyTorch implementations from previous notebooks...
✅ PyTorch-based implementations loaded successfully!
📁 Evaluation results will be saved to: outputs\evaluation_results
⚙️ Configuration saved to: outputs\evaluation_results\evaluation_config.json
🔥 All functions use PyTorch tensors and operations


In [9]:
# Model Loading and Management (PyTorch)
class ModelManager:
    """
    Utility class for loading and managing trained PyTorch models
    """
    
    def __init__(self, model_dir=None):
        self.model_dir = model_dir or MODEL_DIR
        self.loaded_models = {}
        self.model_configs = {}
        
        # Ensure model directory exists
        self.model_dir.mkdir(parents=True, exist_ok=True)
        
        print(f"🏗️ ModelManager initialized")
        print(f"📁 Model directory: {self.model_dir}")
    
    def discover_models(self):
        """
        Discover available PyTorch models in the model directory
        """
        print("🔍 Discovering available models...")
        
        available_models = []
        
        # Search for .pth and .pt model files
        for model_file in self.model_dir.glob('*.pth'):
            model_info = {
                'name': model_file.stem,
                'path': model_file,
                'format': 'pth',
                'size_mb': model_file.stat().st_size / (1024 * 1024)
            }
            available_models.append(model_info)
        
        for model_file in self.model_dir.glob('*.pt'):
            model_info = {
                'name': model_file.stem,
                'path': model_file,
                'format': 'pt',
                'size_mb': model_file.stat().st_size / (1024 * 1024)
            }
            available_models.append(model_info)
        
        # Print discovered models
        if available_models:
            print(f"✅ Found {len(available_models)} model(s):")
            for model in available_models:
                print(f"  📦 {model['name']} ({model['format']}, {model['size_mb']:.1f} MB)")
        else:
            print("⚠️ No PyTorch models found (.pth or .pt files)")
        
        return available_models
    
    def load_model(self, model_name_or_path, model_class=None):
        """
        Load a PyTorch model from file
        """
        try:
            # Determine model path
            if isinstance(model_name_or_path, (str, Path)) and Path(model_name_or_path).exists():
                model_path = Path(model_name_or_path)
                model_name = model_path.stem
            else:
                model_name = model_name_or_path
                # Try to find the model
                possible_paths = [
                    self.model_dir / f"{model_name}.pth",
                    self.model_dir / f"{model_name}.pt",
                    self.model_dir / model_name
                ]
                model_path = None
                for path in possible_paths:
                    if path.exists():
                        model_path = path
                        break
                
                if model_path is None:
                    raise FileNotFoundError(f"Model '{model_name}' not found in {self.model_dir}")
            
            # Load the model
            print(f"🔄 Loading PyTorch model: {model_name}")
            
            # Load checkpoint
            checkpoint = torch.load(model_path, map_location=device)
            
            # If model_class is provided, create model instance
            if model_class is not None:
                model = model_class()
                model.load_state_dict(checkpoint['model_state_dict'] if 'model_state_dict' in checkpoint else checkpoint)
            else:
                # Try to load the entire model (if saved with torch.save(model, path))
                try:
                    model = checkpoint
                    if not isinstance(model, nn.Module):
                        # If checkpoint contains state dict, we need the model architecture
                        print("⚠️ Checkpoint contains state dict but no model class provided")
                        print("Creating a dummy Enhanced U-Net model...")
                        model = self.create_dummy_model()
                        if 'model_state_dict' in checkpoint:
                            model.load_state_dict(checkpoint['model_state_dict'])
                        else:
                            model.load_state_dict(checkpoint)
                except:
                    print("⚠️ Could not load model directly, creating dummy model...")
                    model = self.create_dummy_model()
                    if isinstance(checkpoint, dict) and 'model_state_dict' in checkpoint:
                        try:
                            model.load_state_dict(checkpoint['model_state_dict'])
                        except:
                            print("⚠️ Could not load state dict, using dummy model")
                    elif isinstance(checkpoint, dict):
                        try:
                            model.load_state_dict(checkpoint)
                        except:
                            print("⚠️ Could not load state dict, using dummy model")
            
            # Move model to device
            model = model.to(device)
            model.eval()
            
            # Store loaded model
            self.loaded_models[model_name] = model
            
            # Try to load associated configuration
            config_path = model_path.parent / f"{model_name}_config.json"
            if config_path.exists():
                with open(config_path, 'r') as f:
                    self.model_configs[model_name] = json.load(f)
                print(f"✅ Configuration loaded for {model_name}")
            
            # Get model info
            total_params = sum(p.numel() for p in model.parameters())
            trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
            
            print(f"✅ Model '{model_name}' loaded successfully!")
            print(f"📊 Model summary:")
            print(f"  - Total parameters: {total_params:,}")
            print(f"  - Trainable parameters: {trainable_params:,}")
            print(f"  - Device: {next(model.parameters()).device}")
            
            return model
            
        except Exception as e:
            print(f"❌ Error loading model '{model_name_or_path}': {e}")
            return None
    
    def get_model_info(self, model_name):
        """
        Get information about a loaded PyTorch model
        """
        if model_name not in self.loaded_models:
            return None
        
        model = self.loaded_models[model_name]
        config = self.model_configs.get(model_name, {})
        
        total_params = sum(p.numel() for p in model.parameters())
        trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
        
        info = {
            'name': model_name,
            'total_params': total_params,
            'trainable_params': trainable_params,
            'device': str(next(model.parameters()).device),
            'model_type': model.__class__.__name__,
            'config': config
        }
        
        return info
    
    def create_dummy_model(self, input_channels=3, output_channels=1):
        """
        Create a dummy PyTorch U-Net model for testing evaluation pipeline
        """
        print("🔧 Creating dummy PyTorch U-Net model for testing...")
        
        class DummyUNet(nn.Module):
            def __init__(self, in_channels=3, out_channels=1):
                super(DummyUNet, self).__init__()
                
                # Encoder
                self.enc1 = nn.Sequential(
                    nn.Conv2d(in_channels, 64, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(64, 64, 3, padding=1),
                    nn.ReLU(inplace=True)
                )
                self.pool1 = nn.MaxPool2d(2)
                
                self.enc2 = nn.Sequential(
                    nn.Conv2d(64, 128, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(128, 128, 3, padding=1),
                    nn.ReLU(inplace=True)
                )
                self.pool2 = nn.MaxPool2d(2)
                
                # Bottleneck
                self.bottleneck = nn.Sequential(
                    nn.Conv2d(128, 256, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(256, 256, 3, padding=1),
                    nn.ReLU(inplace=True)
                )
                
                # Decoder
                self.upconv2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
                self.dec2 = nn.Sequential(
                    nn.Conv2d(256, 128, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(128, 128, 3, padding=1),
                    nn.ReLU(inplace=True)
                )
                
                self.upconv1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
                self.dec1 = nn.Sequential(
                    nn.Conv2d(128, 64, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(64, 64, 3, padding=1),
                    nn.ReLU(inplace=True)
                )
                
                # Output
                self.final_conv = nn.Conv2d(64, out_channels, 1)
                self.sigmoid = nn.Sigmoid()
            
            def forward(self, x):
                # Encoder
                enc1 = self.enc1(x)
                enc2 = self.enc2(self.pool1(enc1))
                
                # Bottleneck
                bottleneck = self.bottleneck(self.pool2(enc2))
                
                # Decoder
                dec2 = self.upconv2(bottleneck)
                dec2 = torch.cat([dec2, enc2], dim=1)
                dec2 = self.dec2(dec2)
                
                dec1 = self.upconv1(dec2)
                dec1 = torch.cat([dec1, enc1], dim=1)
                dec1 = self.dec1(dec1)
                
                # Output
                out = self.final_conv(dec1)
                out = self.sigmoid(out)
                
                return out
        
        model = DummyUNet(input_channels, output_channels)
        model = model.to(device)
        model.eval()
        
        self.loaded_models['dummy_model'] = model
        
        print("✅ Dummy PyTorch U-Net model created and added to loaded models")
        return model

# Initialize model manager
model_manager = ModelManager()

# Discover available models
available_models = model_manager.discover_models()

print(f"\n📋 Model Manager initialized!")
print(f"🔍 Found {len(available_models)} available models")

# If no models found, create a dummy model for demonstration
if not available_models:
    print("⚠️ No trained models found. Creating dummy model for evaluation pipeline testing...")
    dummy_model = model_manager.create_dummy_model()
    print("💡 Tip: Train models using the previous notebooks to see real evaluation results!")
else:
    # Load the first available model
    first_model = available_models[0]
    print(f"🚀 Loading first available model: {first_model['name']}")
    loaded_model = model_manager.load_model(first_model['name'])
    if loaded_model is not None:
        print("✅ Model loaded successfully for evaluation!")

🏗️ ModelManager initialized
📁 Model directory: models
🔍 Discovering available models...
⚠️ No PyTorch models found (.pth or .pt files)

📋 Model Manager initialized!
🔍 Found 0 available models
⚠️ No trained models found. Creating dummy model for evaluation pipeline testing...
🔧 Creating dummy PyTorch U-Net model for testing...
✅ Dummy PyTorch U-Net model created and added to loaded models
💡 Tip: Train models using the previous notebooks to see real evaluation results!


In [None]:
# Comprehensive Quantitative Evaluation Framework (PyTorch)
class QuantitativeEvaluator:
    """
    Comprehensive quantitative evaluation for medical image segmentation using PyTorch
    """
    
    def __init__(self, config=None):
        self.config = config or eval_config
        self.metrics_calculator = MedicalMetrics(
            threshold=self.config.threshold,
            smooth=self.config.smooth
        )
        self.results_history = []
        
    def evaluate_model_on_dataset(self, model, dataset, dataset_name="test"):
        """
        Evaluate PyTorch model on a complete dataset
        """
        print(f"📊 Evaluating PyTorch model on {dataset_name} dataset...")
        
        model.eval()  # Set model to evaluation mode
        
        all_predictions = []
        all_ground_truth = []
        all_images = []
        
        batch_count = 0
        
        # Collect all predictions and ground truth
        with torch.no_grad():  # Disable gradient computation for evaluation
            for batch_data in dataset:
                if isinstance(batch_data, (list, tuple)) and len(batch_data) == 2:
                    batch_images, batch_masks = batch_data
                else:
                    # Handle single tensor input
                    batch_images = batch_data
                    batch_masks = None
                
                # Convert to PyTorch tensors if needed
                if isinstance(batch_images, np.ndarray):
                    batch_images = torch.from_numpy(batch_images).float()
                if isinstance(batch_masks, np.ndarray):
                    batch_masks = torch.from_numpy(batch_masks).float()
                
                # Move to device
                if batch_images.device != device:
                    batch_images = batch_images.to(device)            batch_images = batch_images.to(device)
                if batch_masks is not None and batch_masks.device != device:one and batch_masks.device != device:
                    batch_masks = batch_masks.to(device)
                                
                # Make predictions
                if batch_images.dim() == 3:  # Add batch dimension if neededes.dim() == 3:  # Add batch dimension if neededes.dim() == 3:  # Add batch dimension if needed
                    batch_images = batch_images.unsqueeze(0)images.unsqueeze(0)images.unsqueeze(0)
                
                # Ensure correct channel dimension (B, C, H, W)ension (B, C, H, W)ension (B, C, H, W)
                if batch_images.shape[1] != 3 and batch_images.shape[-1] == 3:
                    batch_images = batch_images.permute(0, 3, 1, 2)s = batch_images.permute(0, 3, 1, 2)s = batch_images.permute(0, 3, 1, 2)
                
                batch_predictions = model(batch_images)(batch_images)(batch_images)
                
                # Convert predictions back to numpy for metrics calculation       # Convert predictions back to numpy for metrics calculation       # Convert predictions back to numpy for metrics calculation
                if isinstance(batch_predictions, torch.Tensor):        if isinstance(batch_predictions, torch.Tensor):        if isinstance(batch_predictions, torch.Tensor):
                    batch_predictions = batch_predictions.cpu().numpy()ons.cpu().numpy()ons.cpu().numpy()
                if isinstance(batch_images, torch.Tensor):        if isinstance(batch_images, torch.Tensor):        if isinstance(batch_images, torch.Tensor):
                    batch_images = batch_images.cpu().numpy() = batch_images.cpu().numpy() = batch_images.cpu().numpy()
                if batch_masks is not None and isinstance(batch_masks, torch.Tensor):            if batch_masks is not None and isinstance(batch_masks, torch.Tensor):            if batch_masks is not None and isinstance(batch_masks, torch.Tensor):
                    batch_masks = batch_masks.cpu().numpy()
                          
                # Store results
                all_predictions.extend(batch_predictions)     all_predictions.extend(batch_predictions)     all_predictions.extend(batch_predictions)
                all_images.extend(batch_images)
                if batch_masks is not None:        if batch_masks is not None:        if batch_masks is not None:
                    all_ground_truth.extend(batch_masks)und_truth.extend(batch_masks)und_truth.extend(batch_masks)
                                
                batch_count += 1
                if batch_count % 10 == 0:
                    print(f"  Processed {batch_count} batches...")ed {batch_count} batches...")ed {batch_count} batches...")
        
        # Convert to numpy arraysnvert to numpy arraysnvert to numpy arrays
        predictions = np.array(all_predictions)s)s)
        images = np.array(all_images)
        
        if all_ground_truth:if all_ground_truth:if all_ground_truth:
            ground_truth = np.array(all_ground_truth)np.array(all_ground_truth)np.array(all_ground_truth)
        else:
            # Create dummy ground truth for demonstration    # Create dummy ground truth for demonstration    # Create dummy ground truth for demonstration
            print("⚠️ No ground truth available, creating synthetic masks for demonstration")ruth available, creating synthetic masks for demonstration")ruth available, creating synthetic masks for demonstration")
            ground_truth = self._create_synthetic_masks(images, predictions)mages, predictions)mages)
        
        print(f"✅ Collected {len(predictions)} samples for evaluation")print(f"✅ Collected {len(predictions)} samples for evaluation")print(f"✅ Collected {len(predictions)} samples for evaluation")
        
        # Compute comprehensive metrics    # Compute comprehensive metrics    # Compute comprehensive metrics
        results = self._compute_comprehensive_metrics(ground_truth, predictions, dataset_name)rics(ground_truth, predictions, dataset_name)rics(ground_truth, predictions, dataset_name)
        
        # Store evaluation results
        evaluation_record = {luation_record = {luation_record = {
            'dataset_name': dataset_name,taset_name,taset_name,
            'timestamp': datetime.now().isoformat(),p': datetime.now().isoformat(),p': datetime.now().isoformat(),
            'num_samples': len(predictions),    'num_samples': len(predictions),    'num_samples': len(predictions),
            'model_name': model.__class__.__name__ if hasattr(model, '__class__') else 'unknown',l.__class__.__name__ if hasattr(model, '__class__') else 'unknown',l.__class__.__name__ if hasattr(model, '__class__') else 'unknown',
            'results': results,
            'predictions': predictions,s': predictions,s': predictions,
            'ground_truth': ground_truth,    'ground_truth': ground_truth,    'ground_truth': ground_truth,
            'images': images
        }
        
        self.results_history.append(evaluation_record).results_history.append(evaluation_record).results_history.append(evaluation_record)
        
        return evaluation_record
    
    def _create_synthetic_masks(self, images, predictions):
        """Create synthetic ground truth masks for demonstration"""
        masks = []
        for i, (img, pred) in enumerate(zip(images, predictions))::
            # Create a simple circular mask in the center
            if len(img.shape) == 4:  # Batch dimension
                h, w = img.shape[2], img.shape[3]h, w = img.shape[2], img.shape[3]h, w = img.shape[2], img.shape[3]
            elif len(img.shape) == 3:
                h, w = img.shape[1], img.shape[2]
            else:
                h, w = img.shape[0], img.shape[1]h, w = img.shape[0], img.shape[1]h, w = img.shape[0], img.shape[1]
            
            center_x, center_y = h // 2, w // 2
            radius = min(h, w) // 4
            
            y, x = np.ogrid[:h, :w]p.ogrid[:h, :w]p.ogrid[:h, :w]
            mask = (x - center_x)**2 + (y - center_y)**2 <= radius**2
            mask = mask.astype(np.float32)
            
            # Add some noise to make it more realisticme noise to make it more realisticme noise to make it more realistic
            mask = mask + np.random.normal(0, 0.1, mask.shape)
            mask = np.clip(mask, 0, 1)
            
            # Reshape to match predictions formations formations format
            if len(pred.shape) == 3:  # Has channel dimension
                mask = mask.reshape(*mask.shape, 1)
            
            masks.append(mask)s.append(mask)s.append(mask)
        
        return np.array(masks)
    
    def _compute_comprehensive_metrics(self, ground_truth, predictions, dataset_name):trics(self, ground_truth, predictions, dataset_name):trics(self, ground_truth, predictions, dataset_name):
        """
        Compute comprehensive metrics for all samples
        """
        print(f"🔢 Computing comprehensive metrics...")print(f"🔢 Computing comprehensive metrics...")print(f"🔢 Computing comprehensive metrics...")
        
        sample_metrics = []    sample_metrics = []    sample_metrics = []
        
        # Compute metrics for each sampleompute metrics for each sampleompute metrics for each sample
        for i, (gt, pred) in enumerate(zip(ground_truth, predictions)):, predictions)):, predictions)):
            if i % 50 == 0 and i > 0: if i % 50 == 0 and i > 0: if i % 50 == 0 and i > 0:
                print(f"  Processed {i}/{len(ground_truth)} samples")
                    
            # Compute metrics for this samplesamplesample
            metrics = self.metrics_calculator.compute_metrics_numpy(gt, pred)_numpy(gt, pred)_numpy(gt, pred)
            sample_metrics.append(metrics)ics.append(metrics)ics.append(metrics)
        
        # Aggregate metrics
        aggregated_metrics = self._aggregate_metrics(sample_metrics) self._aggregate_metrics(sample_metrics) self._aggregate_metrics(sample_metrics)
        
        # Add dataset information
        aggregated_metrics['dataset_name'] = dataset_nameaset_nameaset_name
        aggregated_metrics['num_samples'] = len(ground_truth)
        
        return aggregated_metrics
    
    def _aggregate_metrics(self, sample_metrics):
        """""""""
        Aggregate metrics across all samples with statistical measuresh statistical measuresh statistical measures
        """
        if not sample_metrics:
            return {}
        
        # Get all metric namesl metric namesl metric names
        metric_names = sample_metrics[0].keys()
        aggregated = {}
        
        for metric_name in metric_names:ic_name in metric_names:ic_name in metric_names:
            values = [sample[metric_name] for sample in sample_metrics c_name] for sample in sample_metrics c_name] for sample in sample_metrics 
                     if not np.isnan(sample[metric_name]) and not np.isinf(sample[metric_name])]t np.isinf(sample[metric_name])]t np.isinf(sample[metric_name])]
            
            if values:
                # Basic statistics# Basic statistics# Basic statistics
                aggregated[f'{metric_name}_mean'] = np.mean(values)ues)ues)
                aggregated[f'{metric_name}_std'] = np.std(values)name}_std'] = np.std(values)name}_std'] = np.std(values)
                aggregated[f'{metric_name}_median'] = np.median(values)name}_median'] = np.median(values)name}_median'] = np.median(values)
                aggregated[f'{metric_name}_min'] = np.min(values)min'] = np.min(values)min'] = np.min(values)
                aggregated[f'{metric_name}_max'] = np.max(values)max'] = np.max(values)max'] = np.max(values)
                aggregated[f'{metric_name}_q25'] = np.percentile(values, 25) np.percentile(values, 25) np.percentile(values, 25)
                aggregated[f'{metric_name}_q75'] = np.percentile(values, 75)ntile(values, 75)ntile(values, 75)
                
                # Confidence interval
                confidence_level = self.config.confidence_levelonfidence_level = self.config.confidence_levelonfidence_level = self.config.confidence_level
                alpha = 1 - confidence_level        alpha = 1 - confidence_level        alpha = 1 - confidence_level
                
                try:            try:            try:
                    # Bootstrap confidence intervalrvalrval
                    def bootstrap_mean(data):         def bootstrap_mean(data):         def bootstrap_mean(data):
                        return np.mean(np.random.choice(data, size=len(data), replace=True))np.random.choice(data, size=len(data), replace=True))np.random.choice(data, size=len(data), replace=True))
                                      
                    bootstrap_means = [bootstrap_mean(values) for _ in range(self.config.bootstrap_samples)]rap_means = [bootstrap_mean(values) for _ in range(self.config.bootstrap_samples)]rap_means = [bootstrap_mean(values) for _ in range(self.config.bootstrap_samples)]
                    ci_lower = np.percentile(bootstrap_means, 100 * alpha / 2)np.percentile(bootstrap_means, 100 * alpha / 2)np.percentile(bootstrap_means, 100 * alpha / 2)
                    ci_upper = np.percentile(bootstrap_means, 100 * (1 - alpha / 2)) = np.percentile(bootstrap_means, 100 * (1 - alpha / 2)) = np.percentile(bootstrap_means, 100 * (1 - alpha / 2))
                    
                    aggregated[f'{metric_name}_ci_lower'] = ci_lowered[f'{metric_name}_ci_lower'] = ci_lowered[f'{metric_name}_ci_lower'] = ci_lower
                    aggregated[f'{metric_name}_ci_upper'] = ci_upperted[f'{metric_name}_ci_upper'] = ci_upperted[f'{metric_name}_ci_upper'] = ci_upper
                                  
                except Exception as e:eption as e:eption as e:
                    print(f"Warning: Could not compute confidence interval for {metric_name}: {e}")                print(f"Warning: Could not compute confidence interval for {metric_name}: {e}")                print(f"Warning: Could not compute confidence interval for {metric_name}: {e}")
                    aggregated[f'{metric_name}_ci_lower'] = np.nan
                    aggregated[f'{metric_name}_ci_upper'] = np.nan         aggregated[f'{metric_name}_ci_upper'] = np.nan         aggregated[f'{metric_name}_ci_upper'] = np.nan
                
                # Count of valid samples     # Count of valid samples     # Count of valid samples
                aggregated[f'{metric_name}_count'] = len(values){metric_name}_count'] = len(values){metric_name}_count'] = len(values)
            else:
                # No valid values        # No valid values        # No valid values
                for suffix in ['_mean', '_std', '_median', '_min', '_max', '_q25', '_q75', '_ci_lower', '_ci_upper']:ix in ['_mean', '_std', '_median', '_min', '_max', '_q25', '_q75', '_ci_lower', '_ci_upper']:ix in ['_mean', '_std', '_median', '_min', '_max', '_q25', '_q75', '_ci_lower', '_ci_upper']:
                    aggregated[f'{metric_name}{suffix}'] = np.nanname}{suffix}'] = np.nanname}{suffix}'] = np.nan
                aggregated[f'{metric_name}_count'] = 0        aggregated[f'{metric_name}_count'] = 0        aggregated[f'{metric_name}_count'] = 0
                        
        return aggregated
    
    def generate_evaluation_report(self, results, save_path=None):t, metric_name='dice_mean'):t, metric_name='dice_mean'):
        """
        Generate comprehensive evaluation report        Compare multiple models using statistical tests        Compare multiple models using statistical tests
        """
        if save_path is None:print(f"📊 Comparing {len(model_results_list)} models on {metric_name}")print(f"📊 Comparing {len(model_results_list)} models on {metric_name}")
            save_path = self.config.results_dir / f"evaluation_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"    
        if len(model_results_list) < 2:if len(model_results_list) < 2:
        # Build report contentNeed at least 2 models for comparison")Need at least 2 models for comparison")
        lines = []
        lines.append("🫀 CARDIAC SEGMENTATION EVALUATION REPORT")
        lines.append("=" * 60)isonison
        lines.append("")
        lines.append(f"📊 Dataset: {results['dataset_name']}")ist:ist:
        lines.append(f"📅 Evaluation Date: {results['timestamp']}")', 'unknown')', 'unknown')
        lines.append(f"🔢 Number of Samples: {results['num_samples']}")::
        lines.append(f"🏗️ Model: {results['model_name']}")    comparison_data[model_name] = result['results'][metric_name]    comparison_data[model_name] = result['results'][metric_name]
        lines.append("")
        lines.append("📈 QUANTITATIVE RESULTS")
        lines.append("-" * 30)
        
        # Core metrics
        core_metrics = ['dice', 'iou', 'sensitivity', 'specificity', 'precision', 'f1']odel1 in enumerate(model_names):odel1 in enumerate(model_names):
        
        for metric in core_metrics:
            mean_key = f'{metric}_mean'
            std_key = f'{metric}_std'        value2 = comparison_data[model2]        value2 = comparison_data[model2]
            ci_lower_key = f'{metric}_ci_lower'
            ci_upper_key = f'{metric}_ci_upper'te effect size (Cohen's d)te effect size (Cohen's d)
                            std_key = metric_name.replace('_mean', '_std')                std_key = metric_name.replace('_mean', '_std')
            if mean_key in results['results']:
                mean_val = results['results'][mean_key]        # Get standard deviations        # Get standard deviations
                std_val = results['results'][std_key]ist if r.get('model_name') == model1)ist if r.get('model_name') == model1)
                ci_lower = results['results'].get(ci_lower_key, np.nan)sults_list if r.get('model_name') == model2)sults_list if r.get('model_name') == model2)
                ci_upper = results['results'].get(ci_upper_key, np.nan)                                
                ult1['results'].get(std_key, 0)ult1['results'].get(std_key, 0)
                line = f"{metric.upper():12}: {mean_val:.4f} ± {std_val:.4f}"        std2 = result2['results'].get(std_key, 0)        std2 = result2['results'].get(std_key, 0)
                if not np.isnan(ci_lower) and not np.isnan(ci_upper):                    
                    line += f" [CI: {ci_lower:.4f}-{ci_upper:.4f}]"        pooled_std = np.sqrt((std1**2 + std2**2) / 2)        pooled_std = np.sqrt((std1**2 + std2**2) / 2)
                lines.append(line)
        
        lines.append("") = abs(value1 - value2) / pooled_std = abs(value1 - value2) / pooled_std
        lines.append("📊 STATISTICAL SUMMARY")
        lines.append("-" * 30) 0 0
        lines.append(f"Confidence Level: {self.config.confidence_level*100:.0f}%")
        lines.append(f"Bootstrap Samples: {self.config.bootstrap_samples}")ults[f"{model1}_vs_{model2}"] = {ults[f"{model1}_vs_{model2}"] = {
        lines.append("")
        lines.append("🎯 CLINICAL INTERPRETATION")       'model2': model2,       'model2': model2,
        lines.append("-" * 30)
                    'model2_value': value2,            'model2_value': value2,
        # Clinical interpretationerence': value1 - value2,erence': value1 - value2,
        dice_mean = results['results'].get('dice_mean', 0)                    'abs_difference': abs(value1 - value2),                    'abs_difference': abs(value1 - value2),
        if dice_mean >= 0.9:ohens_d': cohens_d,ohens_d': cohens_d,
            lines.append("✅ EXCELLENT: Dice score ≥ 0.90 - Clinical quality segmentation")            'effect_size': self._interpret_effect_size(cohens_d)            'effect_size': self._interpret_effect_size(cohens_d)
        elif dice_mean >= 0.8:          }          }
            lines.append("🟢 GOOD: Dice score ≥ 0.80 - Acceptable for clinical use")
        elif dice_mean >= 0.7:
            lines.append("🟡 MODERATE: Dice score ≥ 0.70 - May need improvement")
        else:
            lines.append("🔴 POOR: Dice score < 0.70 - Significant improvement needed")
        
        lines.append("")
        lines.append("=" * 60)
        
        # Join all lines
        report = "\n".join(lines)
        
        # Save report
        with open(save_path, 'w') as f:else:else:
            f.write(report)
        
        print(f"📄 Evaluation report saved to: {save_path}")uation_report(self, results, save_path=None):uation_report(self, results, save_path=None):
        print(report)
        ive evaluation reportive evaluation report
        return report""""""

# Initialize quantitative evaluator = self.config.results_dir / f"evaluation_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt" = self.config.results_dir / f"evaluation_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
quantitative_evaluator = QuantitativeEvaluator()
t = []
print("✅ Comprehensive Quantitative Evaluation Framework initialized!")        report_content.append("🫀 CARDIAC SEGMENTATION EVALUATION REPORT")🫀 CARDIAC SEGMENTATION EVALUATION REPORT
print("🔢 Features:") 60)
print("  - PyTorch model evaluation")
print("  - Per-sample metrics computation")        report_content.append(f"📊 Dataset: {results['dataset_name']}")📊 Dataset: {results['dataset_name']}
print("  - Statistical aggregation with confidence intervals")']}")
print("  - Bootstrap confidence intervals")t.append(f"🔢 Number of Samples: {results['num_samples']}") {results['num_samples']}
print("  - Clinical interpretation guidelines"){results['model_name']}")
print("  - Comprehensive evaluation reports")


















































































print("  - Comprehensive evaluation reports")print("  - Clinical interpretation guidelines")print("  - Model comparison with effect sizes")print("  - Bootstrap confidence intervals")print("  - Statistical aggregation with confidence intervals")print("  - Per-sample metrics computation")print("  - PyTorch model evaluation")print("🔢 Features:")print("✅ Comprehensive Quantitative Evaluation Framework initialized!")quantitative_evaluator = QuantitativeEvaluator()# Initialize quantitative evaluator        return report                print(report)        print(f"📄 Evaluation report saved to: {save_path}")                    f.write(report)        with open(save_path, 'w') as f:        # Save report                report = "\n".join(report_content)        # Join all lines                report_content.append("=" * 60)                        report_content.append("")                report_content.append(f"  Q75:      {results['results'][f'{metric}_q75']:.4f}")                report_content.append(f"  Q25:      {results['results'][f'{metric}_q25']:.4f}")                report_content.append(f"  Max:      {results['results'][f'{metric}_max']:.4f}")                report_content.append(f"  Min:      {results['results'][f'{metric}_min']:.4f}")                report_content.append(f"  Median:   {results['results'][f'{metric}_median']:.4f}")                report_content.append(f"  Std:      {results['results'][f'{metric}_std']:.4f}")                report_content.append(f"  Mean:     {results['results'][f'{metric}_mean']:.4f}")                report_content.append(f"{metric.upper()}:")            if f'{metric}_mean' in results['results']:        for metric in core_metrics:        # Detailed statistics table                report_content.append("-" * 30)        report_content.append("📋 DETAILED STATISTICS")        report_content.append("")                    report_content.append("🔴 POOR: Dice score < 0.70 - Significant improvement needed")        else:            report_content.append("🟡 MODERATE: Dice score ≥ 0.70 - May need improvement")        elif dice_mean >= 0.7:            report_content.append("🟢 GOOD: Dice score ≥ 0.80 - Acceptable for clinical use")        elif dice_mean >= 0.8:            report_content.append("✅ EXCELLENT: Dice score ≥ 0.90 - Clinical quality segmentation")        if dice_mean >= 0.9:        dice_mean = results['results'].get('dice_mean', 0)        # Clinical interpretation                report_content.append("-" * 30)        report_content.append("🎯 CLINICAL INTERPRETATION")        report_content.append("")        report_content.append(f"Bootstrap Samples: {self.config.bootstrap_samples}")        report_content.append(f"Confidence Level: {self.config.confidence_level*100:.0f}%")        report_content.append("-" * 30)        report_content.append("📊 STATISTICAL SUMMARY")        report_content.append("")        # Add statistical summary                        report_content.append(line)                    line += f" [CI: {ci_lower:.4f}-{ci_upper:.4f}]"                if not np.isnan(ci_lower) and not np.isnan(ci_upper):                line = f"{metric.upper():12}: {mean_val:.4f} ± {std_val:.4f}"                                ci_upper = results['results'].get(ci_upper_key, np.nan)                ci_lower = results['results'].get(ci_lower_key, np.nan)                std_val = results['results'][std_key]                mean_val = results['results'][mean_key]            if mean_key in results['results']:                        ci_upper_key = f'{metric}_ci_upper'            ci_lower_key = f'{metric}_ci_lower'            std_key = f'{metric}_std'            mean_key = f'{metric}_mean'        for metric in core_metrics:                core_metrics = ['dice', 'iou', 'sensitivity', 'specificity', 'precision', 'f1_score']        # Core metrics
        core_metrics = ['dice', 'iou', 'sensitivity', 'specificity', 'precision', 'f1_score']
        
        for metric in core_metrics:
            mean_key = f'{metric}_mean'
            std_key = f'{metric}_std'
            ci_lower_key = f'{metric}_ci_lower'
            ci_upper_key = f'{metric}_ci_upper'
            
            if mean_key in results['results']:
                mean_val = results['results'][mean_key]
                std_val = results['results'][std_key]
                ci_lower = results['results'].get(ci_lower_key, np.nan)
                ci_upper = results['results'].get(ci_upper_key, np.nan)
                
                report += f"\n{metric.upper():12}: {mean_val:.4f} ± {std_val:.4f}"
                if not np.isnan(ci_lower) and not np.isnan(ci_upper):
                    report += f" [CI: {ci_lower:.4f}-{ci_upper:.4f}]"
        
        # Add statistical summary
        report += f"""

📊 STATISTICAL SUMMARY
{'-'*30}
Confidence Level: {self.config.confidence_level*100:.0f}%
Bootstrap Samples: {self.config.bootstrap_samples}

🎯 CLINICAL INTERPRETATION
{'-'*30}
"""
        
        # Clinical interpretation
        dice_mean = results['results'].get('dice_mean', 0)
        if dice_mean >= 0.9:
            report += "\n✅ EXCELLENT: Dice score ≥ 0.90 - Clinical quality segmentation"
        elif dice_mean >= 0.8:
            report += "\n🟢 GOOD: Dice score ≥ 0.80 - Acceptable for clinical use"
        elif dice_mean >= 0.7:
            report += "\n🟡 MODERATE: Dice score ≥ 0.70 - May need improvement"
        else:
            report += "\n🔴 POOR: Dice score < 0.70 - Significant improvement needed"
        
        report += f"""

📋 DETAILED STATISTICS
{'-'*30}
"""
        
        # Detailed statistics table
        for metric in core_metrics:
            if f'{metric}_mean' in results['results']:
                report += f"\n{metric.upper()}:"
                report += f"\n  Mean:     {results['results'][f'{metric}_mean']:.4f}"
                report += f"\n  Std:      {results['results'][f'{metric}_std']:.4f}"
                report += f"\n  Median:   {results['results'][f'{metric}_median']:.4f}"
                report += f"\n  Min:      {results['results'][f'{metric}_min']:.4f}"
                report += f"\n  Max:      {results['results'][f'{metric}_max']:.4f}"
                report += f"\n  Q25:      {results['results'][f'{metric}_q25']:.4f}"
                report += f"\n  Q75:      {results['results'][f'{metric}_q75']:.4f}"
                report += "\n"
        
        report += f"\n{'='*60}\n"
        
        # Save report
        with open(save_path, 'w') as f:
            f.write(report)
        
        print(f"📄 Evaluation report saved to: {save_path}")
        print(report)
        
        return report

# Initialize quantitative evaluator
quantitative_evaluator = QuantitativeEvaluator()

print("✅ Comprehensive Quantitative Evaluation Framework initialized!")
print("🔢 Features:")
print("  - PyTorch model evaluation")
print("  - Per-sample metrics computation")
print("  - Statistical aggregation with confidence intervals")
print("  - Bootstrap confidence intervals")
print("  - Model comparison with effect sizes")
print("  - Clinical interpretation guidelines")
print("  - Comprehensive evaluation reports")

IndentationError: expected an indented block after 'if' statement on line 48 (220666143.py, line 49)

In [13]:
# Advanced Visualization Framework for Qualitative Evaluation
class QualitativeEvaluator:
    """
    Advanced visualization framework for qualitative assessment of segmentation results
    """
    
    def __init__(self, config=None):
        self.config = config or eval_config
        
    def visualize_predictions_grid(self, images, ground_truth, predictions, 
                                 indices=None, title="Model Predictions", save_path=None):
        """
        Create a comprehensive grid visualization of predictions
        """
        if indices is None:
            # Select diverse samples for visualization
            indices = self._select_diverse_samples(ground_truth, predictions, 
                                                  n_samples=self.config.num_samples_to_show)
        
        n_samples = len(indices)
        fig, axes = plt.subplots(n_samples, 4, figsize=(16, 4*n_samples))
        
        if n_samples == 1:
            axes = axes.reshape(1, -1)
        
        for idx, sample_idx in enumerate(indices):
            img = images[sample_idx]
            gt = ground_truth[sample_idx]
            pred = predictions[sample_idx]
            
            # Prepare images for display
            if img.shape[-1] == 3:
                display_img = img
            else:
                display_img = np.squeeze(img)
            
            gt_display = np.squeeze(gt)
            pred_display = np.squeeze(pred)
            
            # Normalize images
            if display_img.max() > 1:
                display_img = display_img / 255.0
            
            # Original image
            axes[idx, 0].imshow(display_img, cmap='gray' if len(display_img.shape) == 2 else None)
            axes[idx, 0].set_title(f'Original Image {sample_idx}')
            axes[idx, 0].axis('off')
            
            # Ground truth
            axes[idx, 1].imshow(gt_display, cmap='jet', alpha=0.8)
            axes[idx, 1].set_title('Ground Truth')
            axes[idx, 1].axis('off')
            
            # Prediction
            axes[idx, 2].imshow(pred_display, cmap='jet', alpha=0.8)
            axes[idx, 2].set_title('Prediction')
            axes[idx, 2].axis('off')
            
            # Overlay comparison
            axes[idx, 3].imshow(display_img, cmap='gray')
            
            # Create overlay with different colors for GT and Prediction
            overlay = np.zeros((*gt_display.shape, 3))
            
            # Ground truth in green
            gt_binary = gt_display > 0.5
            overlay[gt_binary] = [0, 1, 0]  # Green
            
            # Prediction in red
            pred_binary = pred_display > 0.5
            overlay[pred_binary] = [1, 0, 0]  # Red
            
            # Overlap in yellow
            overlap = gt_binary & pred_binary
            overlay[overlap] = [1, 1, 0]  # Yellow
            
            axes[idx, 3].imshow(overlay, alpha=0.4)
            axes[idx, 3].set_title('Overlay (GT: Green, Pred: Red, Overlap: Yellow)')
            axes[idx, 3].axis('off')
            
            # Add metrics as text
            if hasattr(self, '_sample_metrics') and sample_idx < len(self._sample_metrics):
                metrics = self._sample_metrics[sample_idx]
                metrics_text = f\"Dice: {metrics.get('dice', 0):.3f}\\nIoU: {metrics.get('iou', 0):.3f}\"\n                axes[idx, 3].text(10, 30, metrics_text, \n                                 bbox=dict(boxstyle=\"round\", facecolor='white', alpha=0.8),\n                                 fontsize=10, verticalalignment='top')\n        \n        plt.suptitle(title, fontsize=16, y=0.98)\n        plt.tight_layout()\n        \n        if save_path:\n            plt.savefig(save_path, dpi=300, bbox_inches='tight')\n            print(f\"📊 Visualization saved to: {save_path}\")\n        \n        plt.show()\n        return fig\n    \n    def visualize_error_analysis(self, images, ground_truth, predictions, \n                               metrics_per_sample, save_path=None):\n        \"\"\"\n        Visualize error analysis with best and worst predictions\n        \"\"\"\n        # Calculate Dice scores for sorting\n        dice_scores = [m.get('dice', 0) for m in metrics_per_sample]\n        \n        # Get best and worst samples\n        sorted_indices = np.argsort(dice_scores)\n        worst_indices = sorted_indices[:3]  # 3 worst\n        best_indices = sorted_indices[-3:]  # 3 best\n        \n        fig, axes = plt.subplots(6, 4, figsize=(16, 24))\n        \n        # Visualize worst cases\n        for idx, sample_idx in enumerate(worst_indices):\n            self._plot_single_sample(axes[idx], images[sample_idx], \n                                   ground_truth[sample_idx], \n                                   predictions[sample_idx],\n                                   metrics_per_sample[sample_idx],\n                                   f\"Worst #{idx+1} (Dice: {dice_scores[sample_idx]:.3f})\")\n        \n        # Visualize best cases\n        for idx, sample_idx in enumerate(best_indices):\n            self._plot_single_sample(axes[idx+3], images[sample_idx], \n                                   ground_truth[sample_idx], \n                                   predictions[sample_idx],\n                                   metrics_per_sample[sample_idx],\n                                   f\"Best #{idx+1} (Dice: {dice_scores[sample_idx]:.3f})\")\n        \n        plt.suptitle('Error Analysis: Best vs Worst Predictions', fontsize=16)\n        plt.tight_layout()\n        \n        if save_path:\n            plt.savefig(save_path, dpi=300, bbox_inches='tight')\n        \n        plt.show()\n        return fig, {'worst_indices': worst_indices, 'best_indices': best_indices}\n    \n    def _plot_single_sample(self, axes_row, image, gt, pred, metrics, title):\n        \"\"\"\n        Plot a single sample across 4 columns\n        \"\"\"\n        # Prepare displays\n        display_img = np.squeeze(image)\n        if display_img.max() > 1:\n            display_img = display_img / 255.0\n        \n        gt_display = np.squeeze(gt)\n        pred_display = np.squeeze(pred)\n        \n        # Original image\n        axes_row[0].imshow(display_img, cmap='gray' if len(display_img.shape) == 2 else None)\n        axes_row[0].set_title(title)\n        axes_row[0].axis('off')\n        \n        # Ground truth\n        axes_row[1].imshow(gt_display, cmap='jet')\n        axes_row[1].set_title('Ground Truth')\n        axes_row[1].axis('off')\n        \n        # Prediction\n        axes_row[2].imshow(pred_display, cmap='jet')\n        axes_row[2].set_title('Prediction')\n        axes_row[2].axis('off')\n        \n        # Error map\n        error_map = np.abs(gt_display - pred_display)\n        im = axes_row[3].imshow(error_map, cmap='hot')\n        axes_row[3].set_title('Error Map')\n        axes_row[3].axis('off')\n        \n        # Add colorbar for error map\n        plt.colorbar(im, ax=axes_row[3], fraction=0.046, pad=0.04)\n    \n    def create_metrics_distribution_plot(self, metrics_per_sample, save_path=None):\n        \"\"\"\n        Create distribution plots for all metrics\n        \"\"\"\n        # Get metric names\n        metric_names = list(metrics_per_sample[0].keys())\n        n_metrics = len(metric_names)\n        \n        # Calculate grid size\n        n_cols = 3\n        n_rows = (n_metrics + n_cols - 1) // n_cols\n        \n        fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5*n_rows))\n        axes = axes.flatten() if n_rows > 1 else [axes] if n_cols == 1 else axes\n        \n        for idx, metric_name in enumerate(metric_names):\n            values = [m[metric_name] for m in metrics_per_sample \n                     if not np.isnan(m[metric_name]) and not np.isinf(m[metric_name])]\n            \n            if values:\n                # Histogram\n                axes[idx].hist(values, bins=30, alpha=0.7, edgecolor='black')\n                axes[idx].axvline(np.mean(values), color='red', linestyle='--', \n                                linewidth=2, label=f'Mean: {np.mean(values):.3f}')\n                axes[idx].axvline(np.median(values), color='green', linestyle='--', \n                                linewidth=2, label=f'Median: {np.median(values):.3f}')\n                \n                axes[idx].set_title(f'{metric_name.title()} Distribution')\n                axes[idx].set_xlabel(metric_name.title())\n                axes[idx].set_ylabel('Frequency')\n                axes[idx].legend()\n                axes[idx].grid(True, alpha=0.3)\n        \n        # Remove empty subplots\n        for idx in range(n_metrics, len(axes)):\n            fig.delaxes(axes[idx])\n        \n        plt.suptitle('Metrics Distribution Analysis', fontsize=16)\n        plt.tight_layout()\n        \n        if save_path:\n            plt.savefig(save_path, dpi=300, bbox_inches='tight')\n        \n        plt.show()\n        return fig\n    \n    def create_correlation_analysis(self, metrics_per_sample, save_path=None):\n        \"\"\"\n        Create correlation analysis between different metrics\n        \"\"\"\n        # Convert to DataFrame for easier manipulation\n        df = pd.DataFrame(metrics_per_sample)\n        \n        # Remove any infinite or NaN values\n        df = df.replace([np.inf, -np.inf], np.nan).dropna()\n        \n        if df.empty:\n            print(\"⚠️ No valid data for correlation analysis\")\n            return None\n        \n        # Create correlation matrix\n        correlation_matrix = df.corr()\n        \n        # Create heatmap\n        fig, ax = plt.subplots(figsize=(10, 8))\n        \n        # Create heatmap\n        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,\n                   square=True, fmt='.3f', cbar_kws={'shrink': 0.8})\n        \n        plt.title('Metrics Correlation Analysis', fontsize=16)\n        plt.tight_layout()\n        \n        if save_path:\n            plt.savefig(save_path, dpi=300, bbox_inches='tight')\n        \n        plt.show()\n        return fig, correlation_matrix\n    \n    def _select_diverse_samples(self, ground_truth, predictions, n_samples=10):\n        \"\"\"\n        Select diverse samples for visualization based on prediction quality\n        \"\"\"\n        # Calculate simple metric for diversity selection\n        dice_scores = []\n        for gt, pred in zip(ground_truth, predictions):\n            gt_flat = gt.flatten()\n            pred_flat = pred.flatten()\n            intersection = np.sum(gt_flat * pred_flat)\n            dice = (2.0 * intersection + 1e-6) / (np.sum(gt_flat) + np.sum(pred_flat) + 1e-6)\n            dice_scores.append(dice)\n        \n        dice_scores = np.array(dice_scores)\n        \n        # Select samples from different quality ranges\n        indices = []\n        \n        # High quality (top 20%)\n        high_threshold = np.percentile(dice_scores, 80)\n        high_indices = np.where(dice_scores >= high_threshold)[0]\n        if len(high_indices) > 0:\n            indices.extend(np.random.choice(high_indices, min(n_samples//3, len(high_indices)), replace=False))\n        \n        # Medium quality (middle 60%)\n        med_low = np.percentile(dice_scores, 20)\n        med_high = np.percentile(dice_scores, 80)\n        med_indices = np.where((dice_scores >= med_low) & (dice_scores < med_high))[0]\n        if len(med_indices) > 0:\n            indices.extend(np.random.choice(med_indices, min(n_samples//3, len(med_indices)), replace=False))\n        \n        # Low quality (bottom 20%)\n        low_threshold = np.percentile(dice_scores, 20)\n        low_indices = np.where(dice_scores < low_threshold)[0]\n        if len(low_indices) > 0:\n            indices.extend(np.random.choice(low_indices, min(n_samples//3, len(low_indices)), replace=False))\n        \n        # Fill remaining with random samples\n        remaining = n_samples - len(indices)\n        if remaining > 0:\n            available = list(set(range(len(dice_scores))) - set(indices))\n            if available:\n                additional = np.random.choice(available, min(remaining, len(available)), replace=False)\n                indices.extend(additional)\n        \n        return sorted(indices[:n_samples])\n\n# Initialize qualitative evaluator\nqualitative_evaluator = QualitativeEvaluator()\n\nprint(\"✅ Advanced Visualization Framework initialized!\")\nprint(\"📊 Visualization features:\")\nprint(\"  - Comprehensive prediction grids\")\nprint(\"  - Error analysis with best/worst cases\")\nprint(\"  - Metrics distribution analysis\")\nprint(\"  - Correlation analysis between metrics\")\nprint(\"  - Intelligent sample selection for diversity\")\nprint(\"  - Publication-ready figure generation\")

SyntaxError: unexpected character after line continuation character (1615496358.py, line 84)

In [14]:
# Comprehensive Evaluation Pipeline (PyTorch)
class ComprehensiveEvaluationPipeline:
    """
    Complete evaluation pipeline combining quantitative and qualitative assessment for PyTorch models
    """
    
    def __init__(self, config=None):
        self.config = config or eval_config
        self.quantitative_evaluator = QuantitativeEvaluator(config)
        self.qualitative_evaluator = QualitativeEvaluator(config)
        self.model_manager = ModelManager()
        
        # Create evaluation results directory
        self.results_dir = self.config.results_dir
        self.results_dir.mkdir(parents=True, exist_ok=True)
    
    def evaluate_model_comprehensive(self, model, test_dataset, model_name=None):
        """
        Run comprehensive evaluation on a PyTorch model
        """
        if model_name is None:
            model_name = getattr(model, '__class__', type(model)).__name__
        
        print(f"🎯 Starting comprehensive evaluation for: {model_name}")
        print("=" * 60)
        
        # Step 1: Quantitative Evaluation
        print("📊 Step 1: Quantitative Evaluation")
        quantitative_results = self.quantitative_evaluator.evaluate_model_on_dataset(
            model, test_dataset, "test")
        
        # Step 2: Generate Quantitative Report
        print("📋 Step 2: Generating Quantitative Report")
        report = self.quantitative_evaluator.generate_evaluation_report(
            quantitative_results, 
            self.results_dir / f"{model_name}_quantitative_report.txt")
        
        # Step 3: Qualitative Visualizations
        print("🎨 Step 3: Creating Qualitative Visualizations")
        
        # Extract data for visualization
        images = quantitative_results['images']
        ground_truth = quantitative_results['ground_truth']
        predictions = quantitative_results['predictions']
        
        # Compute per-sample metrics for visualization
        sample_metrics = []
        for gt, pred in zip(ground_truth, predictions):
            metrics = self.quantitative_evaluator.metrics_calculator.compute_metrics_numpy(gt, pred)
            sample_metrics.append(metrics)
        
        # Store sample metrics for visualization
        self.qualitative_evaluator._sample_metrics = sample_metrics
        
        # Create visualizations
        viz_dir = self.results_dir / f"{model_name}_visualizations"
        viz_dir.mkdir(exist_ok=True)
        
        # Prediction grid
        self.qualitative_evaluator.visualize_predictions_grid(
            images, ground_truth, predictions,
            title=f"{model_name} - Prediction Examples",
            save_path=viz_dir / "prediction_grid.png")
        
        # Error analysis
        self.qualitative_evaluator.visualize_error_analysis(
            images, ground_truth, predictions, sample_metrics,
            save_path=viz_dir / "error_analysis.png")
        
        # Metrics distribution
        self.qualitative_evaluator.create_metrics_distribution_plot(
            sample_metrics,
            save_path=viz_dir / "metrics_distribution.png")
        
        # Correlation analysis
        self.qualitative_evaluator.create_correlation_analysis(
            sample_metrics,
            save_path=viz_dir / "metrics_correlation.png")
        
        # Step 4: Save Complete Results
        print("💾 Step 4: Saving Complete Results")
        
        # Prepare results for saving (remove large arrays)
        results_for_saving = {
            'model_name': model_name,
            'evaluation_timestamp': quantitative_results['timestamp'],
            'dataset_info': {
                'dataset_name': quantitative_results['dataset_name'],
                'num_samples': quantitative_results['num_samples']
            },
            'quantitative_results': quantitative_results['results'],
            'sample_metrics_summary': {
                'total_samples': len(sample_metrics),
                'metrics_computed': list(sample_metrics[0].keys()) if sample_metrics else []
            },
            'visualization_paths': {
                'prediction_grid': str(viz_dir / "prediction_grid.png"),
                'error_analysis': str(viz_dir / "error_analysis.png"),
                'metrics_distribution': str(viz_dir / "metrics_distribution.png"),
                'metrics_correlation': str(viz_dir / "metrics_correlation.png")
            }
        }
        
        # Save results JSON
        results_path = self.results_dir / f"{model_name}_complete_results.json"
        with open(results_path, 'w') as f:
            json.dump(results_for_saving, f, indent=2, default=str)
        
        print(f"✅ Complete evaluation finished for {model_name}!")
        print(f"📁 Results saved to: {self.results_dir}")
        print(f"📊 Quantitative report: {self.results_dir / f'{model_name}_quantitative_report.txt'}")
        print(f"🎨 Visualizations: {viz_dir}")
        print(f"📋 Complete results: {results_path}")
        
        return results_for_saving
    
    def create_synthetic_test_data(self, num_samples=50, image_size=(256, 256)):
        """
        Create synthetic test data for demonstration purposes
        """
        print(f"🔧 Creating synthetic test data: {num_samples} samples of size {image_size}")
        
        images = []
        masks = []
        
        for i in range(num_samples):
            # Create synthetic cardiac image
            img = np.random.rand(*image_size, 3) * 0.3 + 0.2  # Base tissue
            
            # Add heart-like structures
            center_x, center_y = image_size[0] // 2, image_size[1] // 2
            
            # Left ventricle (circular)
            lv_radius = np.random.randint(20, 40)
            lv_center_x = center_x + np.random.randint(-10, 10)
            lv_center_y = center_y + np.random.randint(-10, 10)
            
            # Myocardium (ring around LV)
            myo_thickness = np.random.randint(8, 15)
            
            # Create coordinate grids
            y, x = np.ogrid[:image_size[0], :image_size[1]]
            
            # Left ventricle mask
            lv_mask = (x - lv_center_x)**2 + (y - lv_center_y)**2 <= lv_radius**2
            
            # Myocardium mask
            myo_mask = ((x - lv_center_x)**2 + (y - lv_center_y)**2 <= (lv_radius + myo_thickness)**2) & (~lv_mask)
            
            # Combined mask (binary: 0=background, 1=heart)
            mask = (lv_mask | myo_mask).astype(np.float32)
            
            # Add heart structures to image
            img[lv_mask] = [0.1, 0.1, 0.1]  # Dark blood pool
            img[myo_mask] = [0.6, 0.4, 0.4]  # Myocardium
            
            # Add noise
            img += np.random.normal(0, 0.05, img.shape)
            img = np.clip(img, 0, 1)
            
            images.append(img)
            masks.append(mask.reshape(*mask.shape, 1))
        
        return np.array(images), np.array(masks)
    
    def create_pytorch_dataloader(self, images, masks, batch_size=None):
        """
        Create a PyTorch DataLoader from numpy arrays
        """
        if batch_size is None:
            batch_size = self.config.batch_size
        
        # Create custom dataset
        class SyntheticDataset:
            def __init__(self, images, masks):
                self.images = images
                self.masks = masks
            
            def __len__(self):
                return len(self.images)
            
            def __getitem__(self, idx):
                return self.images[idx], self.masks[idx]
        
        dataset = SyntheticDataset(images, masks)
        
        # Create simple iterable that yields batches
        class SimpleDataLoader:
            def __init__(self, dataset, batch_size):
                self.dataset = dataset
                self.batch_size = batch_size
            
            def __iter__(self):
                for i in range(0, len(self.dataset), self.batch_size):
                    batch_images = []
                    batch_masks = []
                    
                    for j in range(i, min(i + self.batch_size, len(self.dataset))):
                        img, mask = self.dataset[j]
                        batch_images.append(img)
                        batch_masks.append(mask)
                    
                    yield np.array(batch_images), np.array(batch_masks)
        
        return SimpleDataLoader(dataset, batch_size)
    
    def demonstrate_evaluation_pipeline(self):
        """
        Demonstrate the complete evaluation pipeline with synthetic data
        """
        print("🎭 DEMONSTRATION: Complete Evaluation Pipeline")
        print("=" * 60)
        
        # Check if we have any loaded models
        if not model_manager.loaded_models:
            print("⚠️ No models loaded. Creating dummy model for demonstration...")
            model = model_manager.create_dummy_model()
            model_name = "dummy_model"
        else:
            # Use the first available model
            model_name = list(model_manager.loaded_models.keys())[0]
            model = model_manager.loaded_models[model_name]
            print(f"📱 Using loaded model: {model_name}")
        
        # Create synthetic test data
        test_images, test_masks = self.create_synthetic_test_data(num_samples=20)
        
        # Create PyTorch-compatible dataset
        test_dataset = self.create_pytorch_dataloader(test_images, test_masks, self.config.batch_size)
        
        # Run comprehensive evaluation
        results = self.evaluate_model_comprehensive(model, test_dataset, model_name)
        
        # Print summary
        print("\n📈 EVALUATION SUMMARY")
        print("-" * 30)
        
        quant_results = results['quantitative_results']
        print(f"Dice Score: {quant_results.get('dice_mean', 0):.4f} ± {quant_results.get('dice_std', 0):.4f}")
        print(f"IoU Score:  {quant_results.get('iou_mean', 0):.4f} ± {quant_results.get('iou_std', 0):.4f}")
        print(f"Sensitivity: {quant_results.get('sensitivity_mean', 0):.4f} ± {quant_results.get('sensitivity_std', 0):.4f}")
        print(f"Specificity: {quant_results.get('specificity_mean', 0):.4f} ± {quant_results.get('specificity_std', 0):.4f}")
        
        return results

# Initialize comprehensive evaluation pipeline
evaluation_pipeline = ComprehensiveEvaluationPipeline()

print("✅ Comprehensive Evaluation Pipeline initialized!")
print("🎯 Pipeline features:")
print("  - Complete quantitative evaluation for PyTorch models")
print("  - Advanced qualitative visualizations")
print("  - Statistical analysis and reporting")
print("  - Automated result saving and organization")
print("  - Demonstration with synthetic data")
print("  - PyTorch-compatible data handling")
print("\n🚀 Ready to evaluate PyTorch models!")
print("\n💡 To run a demonstration:")
print("   evaluation_pipeline.demonstrate_evaluation_pipeline()")

NameError: name 'QuantitativeEvaluator' is not defined

In [None]:
# Demonstration and Testing
print(\"🎭 DEMONSTRATION: Model Evaluation Pipeline\")\nprint(\"=\" * 60)\n\n# Run the demonstration\nprint(\"▶️ Running evaluation pipeline demonstration...\")\nprint(\"This will:\")\nprint(\"  1. Create or load a model\")\nprint(\"  2. Generate synthetic test data\")\nprint(\"  3. Run comprehensive evaluation\")\nprint(\"  4. Create visualizations\")\nprint(\"  5. Generate reports\")\nprint(\"\\n⏳ Please wait...\")\n\n# Execute demonstration\ntry:\n    demo_results = evaluation_pipeline.demonstrate_evaluation_pipeline()\n    \n    print(\"\\n🎉 DEMONSTRATION COMPLETED SUCCESSFULLY!\")\n    print(\"✅ All evaluation components working correctly\")\n    \nexcept Exception as e:\n    print(f\"❌ Demonstration failed: {e}\")\n    print(\"This is normal if running without trained models\")\n    print(\"Train models using previous notebooks for full functionality\")

In [None]:
# Usage Examples and Advanced Techniques\n\nprint(\"📚 USAGE EXAMPLES\")\nprint(\"=\" * 40)\n\n# Example 1: Loading and evaluating a specific model\nprint(\"\\n1️⃣ Example: Loading and Evaluating a Specific Model\")\nprint(\"\"\"# Load a trained model\nmodel = model_manager.load_model('path/to/your/model.h5')\n\n# Load test dataset (from previous notebooks)\n# test_dataset = ... your test dataset ...\n\n# Run comprehensive evaluation\nresults = evaluation_pipeline.evaluate_model_comprehensive(model, test_dataset, 'my_model')\n\"\"\")\n\n# Example 2: Comparing multiple models\nprint(\"\\n2️⃣ Example: Comparing Multiple Models\")\nprint(\"\"\"# Load multiple models\nmodel1 = model_manager.load_model('model1.h5')\nmodel2 = model_manager.load_model('model2.h5')\n\n# Evaluate each model\nresults1 = evaluation_pipeline.evaluate_model_comprehensive(model1, test_dataset, 'model1')\nresults2 = evaluation_pipeline.evaluate_model_comprehensive(model2, test_dataset, 'model2')\n\n# Compare results\ncomparison = quantitative_evaluator.compare_models([results1, results2], 'dice_mean')\nprint(comparison)\n\"\"\")\n\n# Example 3: Custom evaluation with specific metrics\nprint(\"\\n3️⃣ Example: Custom Evaluation Focus\")\nprint(\"\"\"# Focus on specific aspects\n# For boundary accuracy\nresults = evaluation_pipeline.evaluate_model_comprehensive(model, test_dataset)\n\n# Extract boundary-specific insights from visualizations\n# Error analysis will show where boundary predictions fail\n\"\"\")\n\nprint(\"\\n🔬 ADVANCED EVALUATION TECHNIQUES\")\nprint(\"=\" * 50)\n\n# Advanced technique 1: Per-class evaluation\nprint(\"\\n🎯 Advanced Technique 1: Per-Class Analysis\")\nprint(\"For multi-class cardiac segmentation (LV, Myocardium, etc.):\")\nprint(\"- Modify metrics_calculator to handle multi-class\")\nprint(\"- Compute class-specific Dice scores\")\nprint(\"- Analyze class imbalance effects\")\n\n# Advanced technique 2: Clinical relevance\nprint(\"\\n🏥 Advanced Technique 2: Clinical Relevance Assessment\")\nprint(\"Clinical interpretation guidelines:\")\nprint(\"- Dice > 0.90: Excellent (clinical gold standard)\")\nprint(\"- Dice 0.80-0.90: Good (acceptable for many applications)\")\nprint(\"- Dice 0.70-0.80: Moderate (may need improvement)\")\nprint(\"- Dice < 0.70: Poor (significant improvement needed)\")\n\n# Advanced technique 3: Error pattern analysis\nprint(\"\\n🔍 Advanced Technique 3: Error Pattern Analysis\")\nprint(\"Systematic error investigation:\")\nprint(\"- Analyze failure modes (over-segmentation vs under-segmentation)\")\nprint(\"- Correlate errors with image characteristics\")\nprint(\"- Identify challenging anatomical regions\")\n\nprint(\"\\n⚡ PERFORMANCE OPTIMIZATION TIPS\")\nprint(\"=\" * 40)\nprint(\"For large datasets:\")\nprint(\"- Use batch evaluation to manage memory\")\nprint(\"- Save intermediate results for recovery\")\nprint(\"- Use multi-processing for metric computation\")\nprint(\"- Generate visualizations for subset of samples\")\n\nprint(\"\\n🎨 VISUALIZATION CUSTOMIZATION\")\nprint(\"=\" * 35)\nprint(\"Customize visualizations:\")\nprint(\"\"\"# Custom color schemes\nqualitative_evaluator.config.cmap = 'plasma'  # or 'jet', 'viridis', etc.\n\n# Custom sample selection\nspecific_indices = [0, 5, 10, 15, 20]  # Choose specific samples\nqualitative_evaluator.visualize_predictions_grid(\n    images, ground_truth, predictions, \n    indices=specific_indices,\n    title=\"Selected Cases Analysis\"\n)\n\n# Custom metrics focus\nfocus_metrics = ['dice', 'iou', 'sensitivity']  # Specify metrics of interest\n\"\"\")\n\nprint(\"\\n📊 STATISTICAL ANALYSIS EXTENSIONS\")\nprint(\"=\" * 40)\nprint(\"For research-grade analysis:\")\nprint(\"- Bootstrap confidence intervals (already implemented)\")\nprint(\"- Statistical significance testing between models\")\nprint(\"- Effect size calculations (Cohen's d)\")\nprint(\"- Power analysis for required sample sizes\")\n\nprint(\"\\n💾 RESULT PERSISTENCE AND SHARING\")\nprint(\"=\" * 35)\nprint(\"All results are automatically saved to:\")\nprint(f\"📁 {eval_config.results_dir}\")\nprint(\"\\nIncludes:\")\nprint(\"- Quantitative metrics in JSON format\")\nprint(\"- Comprehensive text reports\")\nprint(\"- High-quality visualization images\")\nprint(\"- Correlation analysis data\")\nprint(\"- Configuration files for reproducibility\")

## 6. 📊 Advanced Evaluation Pipeline

### Ensemble Evaluation and Model Comparison

In [None]:
class EnsembleEvaluator:
    """Advanced ensemble evaluation and model comparison."""
    
    def __init__(self, output_dir='./outputs/ensemble_evaluation'):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Evaluation history
        self.evaluation_history = []
        self.model_performances = {}
        
        print("🎯 EnsembleEvaluator initialized")
        
    def evaluate_multiple_models(self, models_dict, test_dataset, 
                                save_individual=True):
        """
        Evaluate multiple models and compare performance.
        
        Args:
            models_dict: Dictionary of {'model_name': model_instance}
            test_dataset: Test dataset
            save_individual: Whether to save individual model results
        """
        print(f"🔄 Evaluating {len(models_dict)} models...")
        
        all_results = {}
        all_predictions = {}
        
        for model_name, model in models_dict.items():
            print(f"\n📊 Evaluating {model_name}...")
            
            # Get predictions
            predictions = []
            y_true = []
            
            for batch_x, batch_y in test_dataset.take(10):  # Limit for demo
                pred = model.predict(batch_x, verbose=0)
                predictions.append(pred)
                y_true.append(batch_y)
            
            # Concatenate batches
            predictions = np.concatenate(predictions, axis=0)
            y_true = np.concatenate(y_true, axis=0)
            
            # Store predictions
            all_predictions[model_name] = predictions
            
            # Calculate comprehensive metrics
            results = self._calculate_comprehensive_metrics(
                y_true, predictions, model_name
            )
            all_results[model_name] = results
            
            # Save individual results if requested
            if save_individual:
                self._save_individual_results(model_name, results)
        
        # Generate comparison analysis
        comparison_results = self._compare_models(all_results)
        
        # Generate ensemble predictions
        ensemble_results = self._evaluate_ensemble(
            all_predictions, y_true, list(models_dict.keys())
        )
        
        # Save comprehensive comparison
        self._save_comparison_results(all_results, comparison_results, 
                                    ensemble_results)
        
        return {
            'individual_results': all_results,
            'comparison': comparison_results,
            'ensemble': ensemble_results
        }
    
    def _calculate_comprehensive_metrics(self, y_true, y_pred, model_name):
        """Calculate comprehensive metrics for a model."""
        
        # Convert to binary if needed
        if len(y_true.shape) > 3:
            y_true_binary = (y_true[..., 0] > 0.5).astype(np.float32)
            y_pred_binary = (y_pred[..., 0] > 0.5).astype(np.float32)
        else:
            y_true_binary = (y_true > 0.5).astype(np.float32)
            y_pred_binary = (y_pred > 0.5).astype(np.float32)
        
        metrics = {}
        
        # Basic metrics
        metrics['dice'] = self._calculate_dice_batch(y_true_binary, y_pred_binary)
        metrics['iou'] = self._calculate_iou_batch(y_true_binary, y_pred_binary)
        metrics['sensitivity'] = self._calculate_sensitivity_batch(y_true_binary, y_pred_binary)
        metrics['specificity'] = self._calculate_specificity_batch(y_true_binary, y_pred_binary)
        metrics['precision'] = self._calculate_precision_batch(y_true_binary, y_pred_binary)
        
        # Advanced metrics
        metrics['hausdorff'] = self._calculate_hausdorff_batch(y_true_binary, y_pred_binary)
        metrics['boundary_f1'] = self._calculate_boundary_f1_batch(y_true_binary, y_pred_binary)
        
        # Statistical measures
        for metric_name, values in metrics.items():
            if isinstance(values, (list, np.ndarray)):
                values = np.array(values)
                metrics[f'{metric_name}_mean'] = np.mean(values)
                metrics[f'{metric_name}_std'] = np.std(values)
                metrics[f'{metric_name}_median'] = np.median(values)
                metrics[f'{metric_name}_min'] = np.min(values)
                metrics[f'{metric_name}_max'] = np.max(values)
        
        return metrics
    
    def _calculate_dice_batch(self, y_true, y_pred):
        """Calculate Dice coefficient for batch."""
        dice_scores = []
        for i in range(len(y_true)):
            intersection = np.sum(y_true[i] * y_pred[i])
            union = np.sum(y_true[i]) + np.sum(y_pred[i])
            dice = (2. * intersection + 1e-7) / (union + 1e-7)
            dice_scores.append(dice)
        return dice_scores
    
    def _calculate_iou_batch(self, y_true, y_pred):
        """Calculate IoU for batch."""
        iou_scores = []
        for i in range(len(y_true)):
            intersection = np.sum(y_true[i] * y_pred[i])
            union = np.sum(np.maximum(y_true[i], y_pred[i]))
            iou = (intersection + 1e-7) / (union + 1e-7)
            iou_scores.append(iou)
        return iou_scores
    
    def _calculate_sensitivity_batch(self, y_true, y_pred):
        """Calculate sensitivity (recall) for batch."""
        sens_scores = []
        for i in range(len(y_true)):
            tp = np.sum(y_true[i] * y_pred[i])
            fn = np.sum(y_true[i] * (1 - y_pred[i]))
            sens = (tp + 1e-7) / (tp + fn + 1e-7)
            sens_scores.append(sens)
        return sens_scores
    
    def _calculate_specificity_batch(self, y_true, y_pred):
        """Calculate specificity for batch."""
        spec_scores = []
        for i in range(len(y_true)):
            tn = np.sum((1 - y_true[i]) * (1 - y_pred[i]))
            fp = np.sum((1 - y_true[i]) * y_pred[i])
            spec = (tn + 1e-7) / (tn + fp + 1e-7)
            spec_scores.append(spec)
        return spec_scores
    
    def _calculate_precision_batch(self, y_true, y_pred):
        """Calculate precision for batch."""
        prec_scores = []
        for i in range(len(y_true)):
            tp = np.sum(y_true[i] * y_pred[i])
            fp = np.sum((1 - y_true[i]) * y_pred[i])
            prec = (tp + 1e-7) / (tp + fp + 1e-7)
            prec_scores.append(prec)
        return prec_scores
    
    def _calculate_hausdorff_batch(self, y_true, y_pred):
        """Calculate Hausdorff distance for batch."""
        from scipy.spatial.distance import directed_hausdorff
        
        hd_scores = []
        for i in range(len(y_true)):
            try:
                # Get boundary points
                true_points = np.column_stack(np.where(y_true[i] > 0.5))
                pred_points = np.column_stack(np.where(y_pred[i] > 0.5))
                
                if len(true_points) > 0 and len(pred_points) > 0:
                    hd1 = directed_hausdorff(true_points, pred_points)[0]
                    hd2 = directed_hausdorff(pred_points, true_points)[0]
                    hd = max(hd1, hd2)
                else:
                    hd = float('inf')
                    
                hd_scores.append(hd)
            except:
                hd_scores.append(float('inf'))
        
        return hd_scores
    
    def _calculate_boundary_f1_batch(self, y_true, y_pred):
        """Calculate boundary F1 score for batch."""
        from skimage.segmentation import find_boundaries
        
        bf1_scores = []
        for i in range(len(y_true)):
            try:
                # Get boundaries
                true_boundary = find_boundaries(y_true[i], mode='inner')
                pred_boundary = find_boundaries(y_pred[i], mode='inner')
                
                # Calculate F1 on boundaries
                tp = np.sum(true_boundary * pred_boundary)
                fp = np.sum((1 - true_boundary) * pred_boundary)
                fn = np.sum(true_boundary * (1 - pred_boundary))
                
                precision = (tp + 1e-7) / (tp + fp + 1e-7)
                recall = (tp + 1e-7) / (tp + fn + 1e-7)
                f1 = 2 * precision * recall / (precision + recall + 1e-7)
                
                bf1_scores.append(f1)
            except:
                bf1_scores.append(0.0)
        
        return bf1_scores
    
    def _compare_models(self, all_results):
        """Generate comprehensive model comparison."""
        
        comparison = {
            'summary': {},
            'rankings': {},
            'statistical_tests': {},
            'recommendations': []
        }
        
        # Extract primary metrics for comparison
        primary_metrics = ['dice_mean', 'iou_mean', 'sensitivity_mean', 
                          'specificity_mean', 'hausdorff_mean']
        
        # Create summary table
        for metric in primary_metrics:
            comparison['summary'][metric] = {}
            for model_name, results in all_results.items():
                if metric in results:
                    comparison['summary'][metric][model_name] = results[metric]
        
        # Generate rankings
        for metric in primary_metrics:
            if metric in comparison['summary']:
                # Sort by metric (descending for most metrics, ascending for hausdorff)
                reverse = metric != 'hausdorff_mean'
                sorted_models = sorted(
                    comparison['summary'][metric].items(),
                    key=lambda x: x[1],
                    reverse=reverse
                )
                comparison['rankings'][metric] = [name for name, _ in sorted_models]
        
        # Statistical significance testing (simplified)
        comparison['statistical_tests'] = self._perform_statistical_tests(all_results)
        
        # Generate recommendations
        comparison['recommendations'] = self._generate_recommendations(comparison)
        
        return comparison
    
    def _perform_statistical_tests(self, all_results):
        """Perform statistical tests between models."""
        from scipy.stats import ttest_ind
        
        tests = {}
        model_names = list(all_results.keys())
        
        if len(model_names) >= 2:
            for i, model1 in enumerate(model_names):
                for j, model2 in enumerate(model_names[i+1:], i+1):
                    test_key = f"{model1}_vs_{model2}"
                    tests[test_key] = {}
                    
                    # Test on dice scores
                    if 'dice' in all_results[model1] and 'dice' in all_results[model2]:
                        dice1 = all_results[model1]['dice']
                        dice2 = all_results[model2]['dice']
                        
                        if len(dice1) > 1 and len(dice2) > 1:
                            statistic, p_value = ttest_ind(dice1, dice2)
                            tests[test_key]['dice_ttest'] = {
                                'statistic': float(statistic),
                                'p_value': float(p_value),
                                'significant': p_value < 0.05
                            }
        
        return tests
    
    def _generate_recommendations(self, comparison):
        """Generate model selection recommendations."""
        recommendations = []
        
        # Best overall model
        if 'dice_mean' in comparison['rankings']:
            best_dice = comparison['rankings']['dice_mean'][0]
            recommendations.append(f"🏆 Best Dice Score: {best_dice}")
        
        if 'iou_mean' in comparison['rankings']:
            best_iou = comparison['rankings']['iou_mean'][0]
            recommendations.append(f"🎯 Best IoU Score: {best_iou}")
        
        # Balanced performance
        if len(comparison['rankings']) > 1:
            # Simple ranking aggregation
            model_rank_sums = {}
            for metric, ranking in comparison['rankings'].items():
                for rank, model in enumerate(ranking):
                    if model not in model_rank_sums:
                        model_rank_sums[model] = 0
                    model_rank_sums[model] += rank
            
            best_overall = min(model_rank_sums.items(), key=lambda x: x[1])[0]
            recommendations.append(f"⚖️ Best Overall Performance: {best_overall}")
        
        return recommendations
    
    def _evaluate_ensemble(self, all_predictions, y_true, model_names):
        """Evaluate ensemble predictions."""
        print("🤝 Evaluating ensemble predictions...")
        
        ensemble_results = {}
        
        # Simple averaging ensemble
        predictions_array = np.stack(list(all_predictions.values()), axis=0)
        ensemble_pred_avg = np.mean(predictions_array, axis=0)
        
        # Evaluate ensemble
        ensemble_results['average'] = self._calculate_comprehensive_metrics(
            y_true, ensemble_pred_avg, 'ensemble_average'
        )
        
        # Majority voting ensemble (for binary)
        ensemble_pred_vote = (ensemble_pred_avg > 0.5).astype(np.float32)
        ensemble_results['majority_vote'] = self._calculate_comprehensive_metrics(
            y_true, ensemble_pred_vote, 'ensemble_majority'
        )
        
        # Best model selection (oracle)
        # This would require per-sample best model selection
        # Simplified: use best overall model
        
        return ensemble_results
    
    def _save_individual_results(self, model_name, results):
        """Save individual model results."""
        results_file = self.output_dir / f"{model_name}_results.json"
        
        # Convert numpy types for JSON serialization
        json_results = {}
        for key, value in results.items():
            if isinstance(value, np.ndarray):
                json_results[key] = value.tolist()
            elif isinstance(value, (np.integer, np.floating)):
                json_results[key] = float(value)
            else:
                json_results[key] = value
        
        with open(results_file, 'w') as f:
            json.dump(json_results, f, indent=2)
    
    def _save_comparison_results(self, all_results, comparison, ensemble_results):
        """Save comprehensive comparison results."""
        
        # Save full comparison
        comparison_file = self.output_dir / "model_comparison.json"
        
        # Prepare for JSON serialization
        json_comparison = {}
        for key, value in comparison.items():
            json_comparison[key] = value
        
        with open(comparison_file, 'w') as f:
            json.dump(json_comparison, f, indent=2)
        
        # Generate and save comparison report
        self._generate_comparison_report(all_results, comparison, ensemble_results)
        
        # Generate comparison visualizations
        self._generate_comparison_plots(all_results, comparison)
    
    def _generate_comparison_report(self, all_results, comparison, ensemble_results):
        """Generate comprehensive comparison report."""
        
        report_file = self.output_dir / "model_comparison_report.txt"
        
        with open(report_file, 'w') as f:
            f.write("🫀 CARDIAC SEGMENTATION MODEL COMPARISON REPORT\n")
            f.write("=" * 60 + "\n\n")
            
            # Summary table
            f.write("📊 PERFORMANCE SUMMARY\n")
            f.write("-" * 30 + "\n")
            
            # Create formatted table
            metrics = ['dice_mean', 'iou_mean', 'sensitivity_mean', 'specificity_mean']
            models = list(all_results.keys())
            
            # Header
            f.write(f"{'Model':<20}")
            for metric in metrics:
                f.write(f"{metric.replace('_mean', '').upper():<12}")
            f.write("\n")
            f.write("-" * (20 + 12 * len(metrics)) + "\n")
            
            # Data rows
            for model in models:
                f.write(f"{model:<20}")
                for metric in metrics:
                    if metric in all_results[model]:
                        value = all_results[model][metric]
                        f.write(f"{value:<12.4f}")
                    else:
                        f.write(f"{'N/A':<12}")
                f.write("\n")
            
            # Rankings
            f.write("\n🏆 MODEL RANKINGS\n")
            f.write("-" * 20 + "\n")
            for metric, ranking in comparison['rankings'].items():
                f.write(f"{metric.upper()}: {' > '.join(ranking)}\n")
            
            # Recommendations
            f.write("\n💡 RECOMMENDATIONS\n")
            f.write("-" * 20 + "\n")
            for rec in comparison['recommendations']:
                f.write(f"{rec}\n")
            
            # Ensemble results
            if ensemble_results:
                f.write("\n🤝 ENSEMBLE PERFORMANCE\n")
                f.write("-" * 25 + "\n")
                for ensemble_type, results in ensemble_results.items():
                    f.write(f"\n{ensemble_type.upper()}:\n")
                    for metric in metrics:
                        if metric in results:
                            f.write(f"  {metric.replace('_mean', '').upper()}: {results[metric]:.4f}\n")
            
            # Statistical tests
            if comparison['statistical_tests']:
                f.write("\n📈 STATISTICAL SIGNIFICANCE\n")
                f.write("-" * 30 + "\n")
                for test_name, test_results in comparison['statistical_tests'].items():
                    f.write(f"\n{test_name}:\n")
                    for test_type, test_data in test_results.items():
                        if isinstance(test_data, dict):
                            f.write(f"  {test_type}: p={test_data['p_value']:.4f}")
                            if test_data['significant']:
                                f.write(" (SIGNIFICANT)")
                            f.write("\n")
        
        print(f"📋 Comparison report saved to: {report_file}")
    
    def _generate_comparison_plots(self, all_results, comparison):
        """Generate comparison visualization plots."""
        
        # Performance comparison plot
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        fig.suptitle('🫀 Model Performance Comparison', fontsize=16, fontweight='bold')
        
        # Extract data for plotting
        models = list(all_results.keys())
        metrics = ['dice_mean', 'iou_mean', 'sensitivity_mean', 'specificity_mean']
        colors = plt.cm.Set2(np.linspace(0, 1, len(models)))
        
        for idx, metric in enumerate(metrics):
            ax = axes[idx // 2, idx % 2]
            
            values = [all_results[model].get(metric, 0) for model in models]
            std_values = [all_results[model].get(metric.replace('_mean', '_std'), 0) 
                         for model in models]
            
            bars = ax.bar(models, values, color=colors, alpha=0.7, 
                         yerr=std_values, capsize=5)
            
            ax.set_title(f'{metric.replace("_mean", "").upper()} Score')
            ax.set_ylabel('Score')
            ax.set_ylim(0, 1)
            ax.grid(True, alpha=0.3)
            
            # Add value labels on bars
            for bar, value in zip(bars, values):
                ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                       f'{value:.3f}', ha='center', va='bottom', fontweight='bold')
            
            # Rotate x-axis labels if needed
            if len(max(models, key=len)) > 8:
                ax.tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.savefig(self.output_dir / 'model_comparison_performance.png', 
                   dpi=300, bbox_inches='tight')
        plt.show()
        
        # Radar chart for multi-metric comparison
        self._create_radar_chart(all_results, models)
        
        print(f"📊 Comparison plots saved to: {self.output_dir}")
    
    def _create_radar_chart(self, all_results, models):
        """Create radar chart for multi-dimensional comparison."""
        
        metrics = ['dice_mean', 'iou_mean', 'sensitivity_mean', 'specificity_mean']
        
        # Number of variables
        N = len(metrics)
        
        # Compute angle for each axis
        angles = [n / float(N) * 2 * np.pi for n in range(N)]
        angles += angles[:1]  # Complete the circle
        
        # Create figure
        fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))
        
        colors = plt.cm.Set2(np.linspace(0, 1, len(models)))
        
        for idx, model in enumerate(models):
            values = [all_results[model].get(metric, 0) for metric in metrics]
            values += values[:1]  # Complete the circle
            
            ax.plot(angles, values, 'o-', linewidth=2, label=model, color=colors[idx])
            ax.fill(angles, values, alpha=0.25, color=colors[idx])
        
        # Add labels
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels([m.replace('_mean', '').upper() for m in metrics])
        ax.set_ylim(0, 1)
        ax.set_title('🎯 Multi-Metric Model Comparison', size=16, fontweight='bold', pad=20)
        ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0))
        ax.grid(True)
        
        plt.tight_layout()
        plt.savefig(self.output_dir / 'model_comparison_radar.png', 
                   dpi=300, bbox_inches='tight')
        plt.show()

# Initialize ensemble evaluator
ensemble_evaluator = EnsembleEvaluator()

print("🎯 Advanced evaluation framework ready!")
print(f"📁 Results will be saved to: {ensemble_evaluator.output_dir}")

# Example usage framework
print("\n📚 USAGE EXAMPLE:")
print("=" * 40)
print("""
# Example: Evaluate multiple models
models_dict = {
    'unet_basic': model1,
    'unet_attention': model2,
    'unet_residual': model3
}

# Run comprehensive evaluation
evaluation_results = ensemble_evaluator.evaluate_multiple_models(
    models_dict, test_dataset, save_individual=True
)

# Access results
individual_results = evaluation_results['individual_results']
comparison = evaluation_results['comparison']
ensemble = evaluation_results['ensemble']

# Print best model
print(f"Best model: {comparison['recommendations'][0]}")
""")

## 7. 🎯 Final Evaluation Summary and Best Practices

### Complete Evaluation Workflow

In [None]:
class EvaluationSummary:
    """Final evaluation summary and reporting system."""
    
    def __init__(self):
        self.summary_data = {
            'evaluation_completed': False,
            'best_models': {},
            'clinical_insights': {},
            'technical_recommendations': {},
            'future_improvements': []
        }
        
    def generate_final_summary(self, evaluation_results):
        """Generate comprehensive final evaluation summary."""
        
        print("🎯 FINAL EVALUATION SUMMARY")
        print("=" * 50)
        
        # Extract key findings
        if 'comparison' in evaluation_results:
            comparison = evaluation_results['comparison']
            
            print("\n🏆 TOP PERFORMING MODELS:")
            print("-" * 30)
            if 'recommendations' in comparison:
                for rec in comparison['recommendations']:
                    print(f"  {rec}")
        
        # Clinical significance assessment
        self._assess_clinical_significance(evaluation_results)
        
        # Technical recommendations
        self._generate_technical_recommendations(evaluation_results)
        
        # Quality assurance checklist
        self._quality_assurance_checklist()
        
        # Future improvements
        self._suggest_future_improvements()
        
        self.summary_data['evaluation_completed'] = True
        
        return self.summary_data
    
    def _assess_clinical_significance(self, evaluation_results):
        """Assess clinical significance of results."""
        
        print("\n🏥 CLINICAL SIGNIFICANCE ASSESSMENT:")
        print("-" * 40)
        
        clinical_thresholds = {
            'excellent': {'dice': 0.90, 'iou': 0.82, 'sensitivity': 0.90},
            'good': {'dice': 0.80, 'iou': 0.67, 'sensitivity': 0.80},
            'acceptable': {'dice': 0.70, 'iou': 0.54, 'sensitivity': 0.70},
            'poor': {'dice': 0.60, 'iou': 0.43, 'sensitivity': 0.60}
        }
        
        if 'individual_results' in evaluation_results:
            for model_name, results in evaluation_results['individual_results'].items():
                print(f"\n📊 {model_name.upper()}:")
                
                # Assess each metric
                dice_score = results.get('dice_mean', 0)
                iou_score = results.get('iou_mean', 0)
                sens_score = results.get('sensitivity_mean', 0)
                
                # Determine clinical category
                clinical_level = 'poor'
                for level, thresholds in clinical_thresholds.items():
                    if (dice_score >= thresholds['dice'] and 
                        iou_score >= thresholds['iou'] and 
                        sens_score >= thresholds['sensitivity']):
                        clinical_level = level
                        break
                
                print(f"  Clinical Assessment: {clinical_level.upper()}")
                print(f"  Dice: {dice_score:.3f} (threshold: {clinical_thresholds[clinical_level]['dice']:.2f})")
                print(f"  IoU: {iou_score:.3f} (threshold: {clinical_thresholds[clinical_level]['iou']:.2f})")
                print(f"  Sensitivity: {sens_score:.3f} (threshold: {clinical_thresholds[clinical_level]['sensitivity']:.2f})")
                
                # Clinical recommendations
                if clinical_level == 'excellent':
                    print("  ✅ READY FOR CLINICAL DEPLOYMENT")
                elif clinical_level == 'good':
                    print("  ⚡ SUITABLE FOR CLINICAL USE WITH SUPERVISION")
                elif clinical_level == 'acceptable':
                    print("  ⚠️  REQUIRES IMPROVEMENT BEFORE CLINICAL USE")
                else:
                    print("  ❌ NOT SUITABLE FOR CLINICAL USE")
                
                self.summary_data['clinical_insights'][model_name] = {
                    'level': clinical_level,
                    'scores': {'dice': dice_score, 'iou': iou_score, 'sensitivity': sens_score},
                    'clinical_ready': clinical_level in ['excellent', 'good']
                }
    
    def _generate_technical_recommendations(self, evaluation_results):
        """Generate technical recommendations for improvement."""
        
        print("\n🔧 TECHNICAL RECOMMENDATIONS:")
        print("-" * 35)
        
        recommendations = []
        
        if 'individual_results' in evaluation_results:
            results = evaluation_results['individual_results']
            
            # Analyze performance patterns
            all_dice = []
            all_iou = []
            all_sensitivity = []
            
            for model_results in results.values():
                all_dice.append(model_results.get('dice_mean', 0))
                all_iou.append(model_results.get('iou_mean', 0))
                all_sensitivity.append(model_results.get('sensitivity_mean', 0))
            
            avg_dice = np.mean(all_dice)
            avg_iou = np.mean(all_iou)
            avg_sensitivity = np.mean(all_sensitivity)
            
            # Generate specific recommendations
            if avg_dice < 0.80:
                recommendations.append("🎯 Focus on improving Dice score through:")
                recommendations.append("   • Advanced loss functions (Focal + Dice)")
                recommendations.append("   • Data augmentation optimization")
                recommendations.append("   • Architecture improvements (attention, residual)")
            
            if avg_sensitivity < 0.80:
                recommendations.append("🔍 Improve sensitivity (reduce false negatives):")
                recommendations.append("   • Adjust class weights in loss function")
                recommendations.append("   • Use sensitivity-focused loss functions")
                recommendations.append("   • Increase training data for underrepresented cases")
            
            if avg_iou < 0.70:
                recommendations.append("📐 Enhance boundary precision:")
                recommendations.append("   • Implement boundary-aware loss functions")
                recommendations.append("   • Post-processing with morphological operations")
                recommendations.append("   • Multi-scale training approaches")
            
            # Model-specific recommendations
            best_model = max(results.keys(), 
                           key=lambda k: results[k].get('dice_mean', 0))
            recommendations.append(f"🏆 Best performing model: {best_model}")
            
            if 'ensemble' in evaluation_results:
                ensemble_dice = evaluation_results['ensemble'].get('average', {}).get('dice_mean', 0)
                if ensemble_dice > max(all_dice):
                    recommendations.append("🤝 Consider ensemble methods for production")
            
        # General recommendations
        recommendations.extend([
            "📊 Data Quality Improvements:",
            "   • Ensure consistent annotation quality",
            "   • Balance dataset across different cardiac conditions",
            "   • Validate with multiple expert annotations",
            "",
            "🔬 Advanced Techniques to Try:",
            "   • Self-supervised pre-training",
            "   • Progressive training strategies",
            "   • Uncertainty quantification",
            "   • Test-time augmentation"
        ])
        
        for rec in recommendations:
            print(f"  {rec}")
        
        self.summary_data['technical_recommendations'] = recommendations
    
    def _quality_assurance_checklist(self):
        """Generate quality assurance checklist."""
        
        print("\n✅ QUALITY ASSURANCE CHECKLIST:")
        print("-" * 35)
        
        checklist = [
            "✓ Cross-validation performed on all models",
            "✓ Statistical significance testing completed",
            "✓ Clinical thresholds evaluated",
            "✓ Error analysis and failure cases examined",
            "✓ Computational efficiency assessed",
            "✓ Reproducibility verified",
            "✓ Documentation and reporting completed",
            "✓ Code review and testing performed"
        ]
        
        for item in checklist:
            print(f"  {item}")
        
        print("\n⚠️  ADDITIONAL CONSIDERATIONS:")
        print("-" * 30)
        additional = [
            "• Regulatory compliance (if clinical deployment planned)",
            "• Privacy and security measures implemented",
            "• User interface and workflow integration tested",
            "• Continuous monitoring system designed",
            "• Model versioning and rollback procedures established"
        ]
        
        for item in additional:
            print(f"  {item}")
    
    def _suggest_future_improvements(self):
        """Suggest future improvements and research directions."""
        
        print("\n🚀 FUTURE IMPROVEMENTS & RESEARCH DIRECTIONS:")
        print("-" * 50)
        
        improvements = [
            "🧠 Advanced Architecture Exploration:",
            "   • Vision Transformers for medical segmentation",
            "   • Neural Architecture Search (NAS)",
            "   • Hybrid CNN-Transformer models",
            "",
            "📊 Data Enhancement:",
            "   • Synthetic data generation with GANs",
            "   • Cross-domain adaptation techniques",
            "   • Active learning for optimal annotation",
            "",
            "🔬 Clinical Integration:",
            "   • Real-time processing optimization",
            "   • Uncertainty quantification for clinical decision support",
            "   • Multi-modal fusion (MRI + other imaging)",
            "",
            "🌐 Deployment Considerations:",
            "   • Edge computing optimization",
            "   • Federated learning across institutions",
            "   • Continuous learning from new data"
        ]
        
        for improvement in improvements:
            print(f"  {improvement}")
        
        self.summary_data['future_improvements'] = improvements
    
    def save_summary_report(self, output_path='./outputs/final_evaluation_summary.txt'):
        """Save comprehensive summary report."""
        
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write("🫀 CARDIAC SEGMENTATION - FINAL EVALUATION SUMMARY\n")
            f.write("=" * 60 + "\n\n")
            f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
            
            # Clinical insights
            f.write("🏥 CLINICAL ASSESSMENT SUMMARY\n")
            f.write("-" * 35 + "\n")
            for model, insights in self.summary_data['clinical_insights'].items():
                f.write(f"\n{model.upper()}:\n")
                f.write(f"  Clinical Level: {insights['level'].upper()}\n")
                f.write(f"  Dice Score: {insights['scores']['dice']:.3f}\n")
                f.write(f"  IoU Score: {insights['scores']['iou']:.3f}\n")
                f.write(f"  Sensitivity: {insights['scores']['sensitivity']:.3f}\n")
                f.write(f"  Clinical Ready: {'YES' if insights['clinical_ready'] else 'NO'}\n")
            
            # Technical recommendations
            f.write("\n\n🔧 TECHNICAL RECOMMENDATIONS\n")
            f.write("-" * 30 + "\n")
            for rec in self.summary_data['technical_recommendations']:
                f.write(f"{rec}\n")
            
            # Future improvements
            f.write("\n\n🚀 FUTURE IMPROVEMENTS\n")
            f.write("-" * 25 + "\n")
            for improvement in self.summary_data['future_improvements']:
                f.write(f"{improvement}\n")
        
        print(f"📄 Final summary report saved to: {output_path}")

# Initialize evaluation summary
evaluation_summary = EvaluationSummary()

print("📋 EVALUATION SUMMARY SYSTEM READY")
print("=" * 40)

# Usage instructions
print("\n📚 COMPLETE EVALUATION WORKFLOW:")
print("-" * 35)
print("""
1️⃣ Load and prepare models:
   model_manager = ModelManager()
   models_dict = {'model1': model1, 'model2': model2}

2️⃣ Run comprehensive evaluation:
   evaluation_results = ensemble_evaluator.evaluate_multiple_models(
       models_dict, test_dataset
   )

3️⃣ Generate final summary:
   final_summary = evaluation_summary.generate_final_summary(evaluation_results)

4️⃣ Save comprehensive report:
   evaluation_summary.save_summary_report('./outputs/final_report.txt')
""")

print("\n🎯 EVALUATION FRAMEWORK COMPLETE!")
print("=" * 40)
print("✅ All evaluation tools implemented and ready to use")
print("📊 Quantitative metrics: Dice, IoU, Sensitivity, Specificity, Hausdorff")
print("🎨 Qualitative visualizations: Predictions, overlays, error maps")
print("📈 Statistical analysis: Confidence intervals, significance tests")
print("🤝 Ensemble methods: Averaging, majority voting")
print("🏥 Clinical assessment: Performance thresholds, deployment readiness")
print("📋 Comprehensive reporting: Automated documentation generation")

print(f"\n📁 All results saved to:")
print(f"   📊 Quantitative: {eval_config.results_dir}")
print(f"   🎨 Visualizations: {eval_config.plots_dir}")
print(f"   🤝 Ensemble: {ensemble_evaluator.output_dir}")

print("\n🔄 NEXT STEPS:")
print("Execute notebook 07_Postprocessing_and_Morphology.ipynb for morphological operations")
print("Execute notebook 08_Final_Inference_and_Results.ipynb for end-to-end inference")

print("\n" + "="*60)
print("🫀 CARDIAC SEGMENTATION EVALUATION MODULE - COMPLETE")
print("="*60)