# Subset Selection Notebook 3: Subset Selection & Results

## Overview
This notebook is the **final step** in the Subset Selection Pipeline. It uses the embeddings generated in Notebook 2 to select diverse, representative subsets with Facility Location, a submodular optimization function.

## Purpose in Subset Selection
The Facility Location algorithm selects a subset of samples that:
1. **Maximizes Diversity**: Selects samples that are representative of the entire dataset
2. **Minimizes Redundancy**: Avoids selecting similar samples
3. **Optimizes Coverage**: Ensures all semantic regions of the dataset are represented
4. **Maintains Quality**: Preserves the distribution and characteristics of the original dataset
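
To make these goals concrete: the facility location objective scores a subset S by summing, over every sample i in the dataset, the similarity between i and its most similar member of S, i.e. f(S) = Σ_i max_{j∈S} s_ij. The sketch below is a plain-Python greedy maximizer on a toy similarity matrix. It illustrates the objective only and is not how submodlib implements it; the names `facility_location_value` and `greedy_select` and the toy matrix are hypothetical.

```python
from typing import List, Tuple


def facility_location_value(sim: List[List[float]], selected: List[int]) -> float:
    """f(S) = sum over all rows i of the max similarity to any selected j."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim: List[List[float]], budget: int) -> List[Tuple[int, float]]:
    """Pick up to `budget` items greedily; return (index, marginal gain) pairs."""
    selected: List[int] = []
    picks: List[Tuple[int, float]] = []
    current = 0.0
    for _ in range(min(budget, len(sim))):
        best_j, best_gain = -1, float("-inf")
        for j in range(len(sim)):
            if j in selected:
                continue
            gain = facility_location_value(sim, selected + [j]) - current
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        current += best_gain
        picks.append((best_j, best_gain))
    return picks


# Toy data: items 0 and 1 are near-duplicates, item 2 is different.
toy_sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picks = greedy_select(toy_sim, budget=2)
print(picks)
```

With a budget of 2, the second pick is item 2 rather than the near-duplicate item 1, because adding item 1 barely raises the objective. That small marginal gain is the redundancy penalty described in the list above.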

## Output
- **Subset JSONL files**: Selected subsets in JSONL format (e.g., `combined_cut_50x_percent_0.1_subset.jsonl`)
- **Metadata files**: `.npz` files containing selected indices and gain scores for each subset
- **Statistics**: Printed summary of subset sizes and selection time
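
A minimal sketch of reading one metadata file back, assuming NumPy is installed and the file holds the two arrays named `indices` and `gains` that `select_subsets()` writes with `np.savez`. The file name and values here are temporary placeholders, not real pipeline output.

```python
import os
import tempfile

import numpy as np

# Write a stand-in metadata file with the same two keys the pipeline saves.
tmp_dir = tempfile.mkdtemp()
metadata_path = os.path.join(tmp_dir, "example_metadata.npz")
np.savez(metadata_path, indices=[42, 7, 19], gains=[3.5, 2.25, 0.5])

# np.load returns a lazy archive, so copy the arrays out before it closes.
with np.load(metadata_path) as meta:
    indices = meta["indices"].tolist()
    gains = meta["gains"].tolist()

for idx, gain in zip(indices, gains):
    print(f"sample {idx}: gain {gain:.2f}")
```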

In [None]:
# Import from Notebook 2 (which already imports Notebook 1)
%run "embedding_generation.ipynb"

# Additional imports for subset selection
import math
import logging
from dataclasses import dataclass, field
from typing import TypedDict, Union, List, Dict, Optional, Any
# Import submodlib for Facility Location
from submodlib import FacilityLocationFunction

# Set up logger for this notebook
logger = logging.getLogger(__name__)

print("✅ Successfully imported from previous notebooks!")
print(f"   • config: {type(config).__name__ if 'config' in locals() else 'Not defined'}")
print(f"   • dataset: {len(dataset) if 'dataset' in locals() and dataset else 'None'} samples")
print(f"   • processor: {type(processor).__name__ if 'processor' in locals() else 'Not defined'}")
print(f"   • DataProcessor has generate_embeddings(): {hasattr(processor, 'generate_embeddings') if 'processor' in locals() else False}")
print(f"\n✅ Subset selection libraries loaded!")
print(f"   • submodlib: FacilityLocationFunction available")
print(f"   • All utility functions from previous notebooks available")

### GPU-Based Fold Processing

Implements the core subset selection algorithm using facility location.

**Why Folds?**
The dataset is split into multiple folds for three key reasons:
1. **Memory Efficiency**: Processing smaller chunks prevents GPU out-of-memory errors
2. **Parallelization**: Multiple GPUs can work on different folds simultaneously
3. **Better Coverage**: Samples are shuffled before splitting, so each fold is a random cross-section of the dataset and every fold contributes candidates
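
The memory argument can be checked with quick arithmetic: a dense float32 similarity matrix over n samples needs n² × 4 bytes, so splitting into K equal folds shrinks each matrix by a factor of K². The helper name `dense_sim_gib` and the dataset size below are illustrative assumptions.

```python
def dense_sim_gib(num_samples: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of a dense num_samples x num_samples similarity matrix."""
    return num_samples * num_samples * bytes_per_value / 2**30


total_samples = 1_000_000  # hypothetical dataset size
num_folds = 25
fold_samples = total_samples // num_folds

full_gib = dense_sim_gib(total_samples)
fold_gib = dense_sim_gib(fold_samples)
print(f"full matrix: {full_gib:,.1f} GiB")
print(f"one fold:    {fold_gib:,.2f} GiB")
print(f"reduction:   {full_gib / fold_gib:,.0f}x")
```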

**How It Works:**
```
Full Dataset (N samples)
↓
Split into K folds (e.g., 25 folds)
↓
Assign folds to GPUs
↓
For each fold:
- Compute pairwise similarities
- Run facility location algorithm
- Select diverse samples with high gains
↓
Merge results from all folds
↓
Select top samples by gain scores
```
**Algorithm**: Uses LazierThanLazyGreedy optimizer for efficient submodular maximization.
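
The first three steps of the flow can be sketched in plain Python. This mirrors the splitting arithmetic used by `select_subsets()` later in this notebook (the first `remainder` folds get one extra sample, and folds are handed to GPUs in contiguous blocks). The function names `split_into_folds` and `assign_folds_to_gpus` are hypothetical, and the fixed seed is an assumption added for repeatability; the notebook itself shuffles with `np.random.shuffle` and no explicit seed.

```python
import random
from typing import List, Tuple


def split_into_folds(num_samples: int, num_folds: int, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and cut them into num_folds near-equal chunks."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    base, remainder = divmod(num_samples, num_folds)
    folds, start = [], 0
    for i in range(num_folds):
        end = start + base + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds


def assign_folds_to_gpus(num_folds: int, num_gpus: int) -> List[Tuple[int, List[int]]]:
    """Give each GPU a contiguous block of fold ids; block sizes differ by at most one."""
    per_gpu, extra = divmod(num_folds, num_gpus)
    assignments, start = [], 0
    for gpu_id in range(num_gpus):
        end = start + per_gpu + (1 if gpu_id < extra else 0)
        assignments.append((gpu_id, list(range(start, end))))
        start = end
    return assignments


folds = split_into_folds(num_samples=103, num_folds=25)
assignments = assign_folds_to_gpus(num_folds=25, num_gpus=4)
print([len(f) for f in folds])
print([(gpu, len(ids)) for gpu, ids in assignments])
```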

In [None]:
def process_folds_with_gpu(args):
    """
    Process folds on GPU or CPU with support for both percentage and absolute size specifications.
    """
    (
        gpu_id,
        gpu_folds_info,
        embeddings,
        subset_sizes,
        total_samples,
        epsilon,
        testing_mode,
    ) = args

    try:
        if torch.cuda.is_available():
            torch.cuda.set_device(gpu_id)
            device = f"cuda:{gpu_id}"
        else:
            if not testing_mode:
                raise RuntimeError("GPU processing required but CUDA is not available")
            logger.warning(
                "Running in CPU mode for testing. Production use requires GPU acceleration."
            )
            device = "cpu"

        results = []
        for fold_idx, fold_indices in gpu_folds_info:
            try:
                logger.info(f"Processing fold {fold_idx + 1} on {device}")

                fold_embeddings = embeddings[fold_indices].to(device)

                logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
                max_sim_mat = compute_pairwise_dense(
                    fold_embeddings,
                    batch_size=50000,
                    metric="cosine",
                    device=device,
                    scaling="additive",
                )
                similarity_matrix = max_sim_mat.cpu().numpy()

                subsets = {}
                ds_func = FacilityLocationFunction(
                    n=similarity_matrix.shape[0],
                    sijs=similarity_matrix,
                    mode="dense",
                    separate_rep=False,
                )

                for size_spec in subset_sizes:
                    if isinstance(size_spec, float):
                        # Percentage-based selection
                        budget = max(
                            1, math.ceil(size_spec * similarity_matrix.shape[0])
                        )
                    else:
                        # Absolute number-based selection
                        budget = max(
                            1,
                            math.ceil(
                                size_spec * (similarity_matrix.shape[0] / total_samples)
                            ),
                        )

                    logger.info(
                        f"Selecting subset of size {budget} for fold {fold_idx + 1}"
                    )

                    subset_result = ds_func.maximize(
                        budget=budget,
                        optimizer="LazierThanLazyGreedy",
                        epsilon=epsilon,
                        stopIfZeroGain=False,
                        stopIfNegativeGain=False,
                        verbose=False,
                    )

                    subset_indices = [fold_indices[x[0]] for x in subset_result]
                    subset_gains = [x[1] for x in subset_result]
                    subsets[size_spec] = {
                        "indices": subset_indices,
                        "gains": subset_gains,
                    }

                results.append((fold_idx, subsets))

            except Exception as e:
                logger.error(
                    f"Error processing fold {fold_idx + 1} on GPU {gpu_id}: {str(e)}"
                )
                raise
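            finally:
                # Release per-fold objects before the next fold. Rebinding the
                # names drops the references; `del locals()[var]` has no effect
                # inside a function, so the original cleanup freed nothing.
                ds_func = similarity_matrix = fold_embeddings = max_sim_mat = None
                gc.collect()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()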

        return results
    except Exception as e:
        logger.error(f"Error in process_folds_with_gpu on GPU {gpu_id}: {str(e)}")
        raise

print("✅ process_folds_with_gpu function defined!")
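
The per-fold budget rule used in `process_folds_with_gpu()` is easy to check in isolation. The sketch below restates it as a standalone helper; the name `fold_budget` and the sample counts are hypothetical. A float is treated as a fraction of the fold, while an integer target is scaled by the fold's share of the dataset.

```python
import math
from typing import Union


def fold_budget(size_spec: Union[int, float], fold_size: int, total_samples: int) -> int:
    """Number of samples to pick from one fold for a given size spec."""
    if isinstance(size_spec, float):
        # Percentage: take that fraction of this fold.
        return max(1, math.ceil(size_spec * fold_size))
    # Absolute count: give the fold its proportional share of the target.
    return max(1, math.ceil(size_spec * (fold_size / total_samples)))


total = 10_000
fold = 400  # one of 25 equal folds

pct_budget = fold_budget(0.1, fold, total)    # 10% of the fold
abs_budget = fold_budget(500, fold, total)    # this fold's share of 500 samples
tiny_budget = fold_budget(3, fold, total)     # share rounds up to at least 1
print(pct_budget, abs_budget, tiny_budget)
```

Because every fold picks at least one sample and shares are rounded up, the per-fold picks can add up to more than the requested total (25 picks for a target of 3 in the last case). The aggregation step in `select_subsets()` trims the pooled picks back to the target size.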

### 🎯 Two Ways to Run Subset Selection

**✅ RECOMMENDED: Use Existing Embeddings (Fast)**
- Function: `run_subset_selection_only()`
- Use when: You've already run Notebook 2 and have embeddings
- Benefits: Much faster (minutes vs hours), no redundant computation
- What it does: Loads embeddings from Notebook 2 → Runs facility location → Saves subsets

**🔄 Alternative: Full Pipeline (Slow)**
- Function: `subset_datasets()`
- Use when: Running standalone without Notebook 2, or need fresh embeddings
- What it does: Generates embeddings → Runs facility location → Saves subsets
- Note: This regenerates embeddings even if they exist!


In [None]:
def run_subset_selection_only(
    embeddings_file: str,
    dataset,
    output_dir: str,
    subset_sizes: List[Union[int, float]],
    dataset_name: str = "combined_cut_50x",  # Change this to match your dataset
    num_folds: int = 25,
    epsilon: float = 160.0,
    num_gpus: Optional[int] = None,
    testing_mode: bool = True,
) -> None:
    """
    Run ONLY subset selection using pre-computed embeddings from Notebook 2.
    
    This function is designed for the notebook workflow where embeddings 
    are already generated. It skips the embedding generation step entirely.
    
    Args:
        embeddings_file: Path to the embeddings.h5 file from Notebook 2
        dataset: The dataset loaded in Notebook 1
        output_dir: Where to save subset results
        subset_sizes: List of subset sizes (floats for %, ints for absolute)
        dataset_name: Name of the dataset (for output files)
        num_folds: Number of folds for facility location
        epsilon: Epsilon parameter for the optimizer
        num_gpus: Number of GPUs to use (auto-detected if None)
        testing_mode: Enable testing mode for CPU fallback
    """
    # Verify embeddings file exists
    if not os.path.exists(embeddings_file):
        raise FileNotFoundError(
            f"Embeddings file not found: {embeddings_file}\n"
            "Please run Notebook 2 (embedding_generation.ipynb) first!"
        )
    
    # Auto-detect GPUs if not specified
    if num_gpus is None:
        num_gpus = get_default_num_gpus(testing_mode=testing_mode)
    
    logger.info(f"📁 Using existing embeddings from: {embeddings_file}")
    logger.info(f"🎯 Dataset: {dataset_name} with {len(dataset)} samples")
    logger.info(f"📊 Subset sizes: {subset_sizes}")
    logger.info(f"🔧 Folds: {num_folds}, Epsilon: {epsilon}")
    
    # Create configuration (minimal, just for subset selection)
    basic_config = BasicConfig(
        output_dir=output_dir,
        num_folds=num_folds,
        epsilon=epsilon,
        combine_files=False,
    )
    
    # Validate epsilon for dataset size
    basic_config.validate_epsilon_for_dataset_size(len(dataset))
    
    # Create system config (num_gpus is auto-detected in __post_init__)
    system_config = SystemConfig(
        testing_mode=testing_mode,
    )
    
    # num_gpus is always set by this point (passed in or auto-detected above)
    system_config.num_gpus = num_gpus
    
    # Create minimal config for subset selection only
    config = ProcessingConfig(
        input_files=[],  # Not needed for this path
        subset_sizes=subset_sizes,
        basic=basic_config,
        encoder=EncoderConfig(testing_mode=testing_mode),
        template=TemplateConfig(),
        system=system_config,
    )
    
    # Initialize processor
    processor = DataProcessor(config)
    
    try:
        # Load embeddings
        logger.info("📥 Loading embeddings...")
        with h5py.File(embeddings_file, "r") as f:
            embeddings_data = f["embeddings"][:]
            if embeddings_data.size == 0:
                raise ValueError("Embeddings file is empty!")
            embeddings = torch.tensor(embeddings_data, dtype=torch.float32)
        
        logger.info(f"✅ Loaded embeddings: shape {embeddings.shape}")
        
        # Create output directory
        dataset_output_dir = os.path.join(output_dir, dataset_name)
        os.makedirs(dataset_output_dir, exist_ok=True)
        
        # Run subset selection
        logger.info("🎯 Running subset selection...")
        start_time = time.time()
        
        subsets = processor.select_subsets(dataset_name, embeddings)
        
        selection_time = time.time() - start_time
        
        # Save subsets to files
        logger.info("💾 Saving selected subsets...")
        for size_spec, indices in subsets.items():
            subset_data = dataset.select(indices)
            subset_name = processor.get_subset_name(size_spec, len(indices))
            
            output_file = os.path.join(
                dataset_output_dir,
                f"{dataset_name}_{subset_name}_subset.jsonl",
            )
            
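            # The third argument only determines the output format from its
            # extension; output_file ends in .jsonl, so JSONL is written.
            processor._save_subset(subset_data, output_file, output_file)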
            logger.info(f"✅ Saved {len(indices)} samples to {output_file}")
        
        # Print summary
        print("\n" + "=" * 70)
        print("✅ SUBSET SELECTION COMPLETED!")
        print("=" * 70)
        print(f"⏱️  Selection time: {selection_time / 60:.2f} minutes")
        print(f"📊 Created {len(subsets)} subset(s):")
        for size_spec, indices in subsets.items():
            subset_name = processor.get_subset_name(size_spec, len(indices))
            print(f"   • {subset_name}: {len(indices)} samples")
        print(f"💾 Output directory: {dataset_output_dir}")
        print("=" * 70)
        
    except Exception as e:
        logger.error(f"Error during subset selection: {str(e)}")
        raise
    finally:
        # Cleanup
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

print("✅ run_subset_selection_only() function defined!")


### Extend DataProcessor with Subset Selection

Adds the `select_subsets()` method to the DataProcessor class.
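
The aggregation phase of `select_subsets()` pools the picks from every fold, sorts them by gain, and keeps the highest-gain samples up to the target size. A minimal sketch of that step, using a hypothetical helper `merge_by_gain` and made-up indices and gains:

```python
from typing import Dict, List, Tuple


def merge_by_gain(
    fold_results: List[Dict[str, list]], target_size: int
) -> Tuple[List[int], List[float]]:
    """Pool (index, gain) pairs from all folds and keep the target_size highest gains."""
    pooled = []
    for result in fold_results:
        pooled.extend(zip(result["indices"], result["gains"]))
    top = sorted(pooled, key=lambda pair: pair[1], reverse=True)[:target_size]
    return [idx for idx, _ in top], [gain for _, gain in top]


# Two folds, each reporting dataset-level indices with their gains.
fold_results = [
    {"indices": [12, 3], "gains": [5.0, 0.4]},
    {"indices": [8, 21], "gains": [4.2, 1.1]},
]
kept_indices, kept_gains = merge_by_gain(fold_results, target_size=3)
print(kept_indices, kept_gains)
```

Note that gains are computed independently inside each fold, so ranking them in one pool treats gains from different folds as comparable.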


In [None]:
def select_subsets(self, dataset_name: str, embeddings: torch.Tensor) -> Dict[Union[int, float], List[int]]:
    """
    Perform diverse subset selection using facility location with multi-GPU support.
    
    This method implements a three-phase approach to select representative subsets from 
    embeddings using submodular optimization. It supports both percentage-based and 
    absolute count specifications for subset sizes.
    
    **Process Overview:**
    
    Phase 1 - Create Folds:
        - Randomly shuffles sample indices
        - Splits into N folds with roughly equal sizes
        - Distributes folds across available GPUs
    
    Phase 2 - Process Folds (Parallel/Serial):
        - Testing mode or a single GPU: processes folds serially (one at a time)
        - Otherwise: processes folds in parallel using multiprocessing.Pool
        - Each worker runs the facility location algorithm on its assigned folds
    
    Phase 3 - Aggregate Results:
        - Collects selected indices and gain scores from all folds
        - Sorts by gain scores (highest first)
        - Selects top N samples for each target subset size
        - Saves metadata to .npz files
    
    **Selection Modes:**
        - Percentage: Float values (0.1 = 10%, 0.05 = 5% of dataset)
        - Absolute: Integer values (100 = exactly 100 samples)
    
    Args:
        dataset_name (str): Name of the dataset, used for output file naming.
        embeddings (torch.Tensor): Embedding tensor of shape (num_samples, embedding_dim).
            Contains the vector representations of all samples in the dataset.
    
    Returns:
        Dict[Union[int, float], List[int]]: Dictionary mapping subset size specifications 
            to lists of selected sample indices. Keys are the original size_spec values 
            (floats for percentages, ints for absolute counts), and values are lists of 
            integer indices representing the selected samples.
    
    Output Files:
        Creates metadata files in the format:
        `{output_dir}/{dataset_name}_fl_{num_folds}_partitions_{subset_name}_metadata.npz`
        
        Each file contains:
            - indices: numpy array of selected sample indices
            - gains: numpy array of corresponding gain scores from facility location
    """
    indices = np.arange(len(embeddings))
    np.random.shuffle(indices)

    fold_size = len(embeddings) // self.config.basic.num_folds
    remainder = len(embeddings) % self.config.basic.num_folds

    folds = []
    start_idx = 0
    for i in range(self.config.basic.num_folds):
        extra = 1 if i < remainder else 0
        end_idx = start_idx + fold_size + extra
        folds.append(indices[start_idx:end_idx])
        start_idx = end_idx

    gpu_assignments = []
    folds_per_gpu = self.config.basic.num_folds // self.config.system.num_gpus
    extra_folds = self.config.basic.num_folds % self.config.system.num_gpus

    start_fold = 0
    for gpu_id in range(self.config.system.num_gpus):
        num_folds_this_gpu = folds_per_gpu + (1 if gpu_id < extra_folds else 0)
        end_fold = start_fold + num_folds_this_gpu
        gpu_folds_info = [
            (fold_idx, folds[fold_idx]) for fold_idx in range(start_fold, end_fold)
        ]

        gpu_assignments.append(
            (
                gpu_id,
                gpu_folds_info,
                embeddings,
                self.config.subset_sizes,
                len(embeddings),
                self.config.basic.epsilon,
                self.config.system.testing_mode,
            )
        )
        start_fold = end_fold

    # Use serial processing in testing mode (notebook-friendly)
    if self.config.system.testing_mode or self.config.system.num_gpus == 1:
        logger.info("Processing folds serially (testing mode or single GPU)")
        gpu_results = []
        for args in gpu_assignments:
            result = process_folds_with_gpu(args)
            gpu_results.append(result)
    else:
        logger.info(f"Processing folds in parallel with {self.config.system.num_gpus} workers")
        with Pool(processes=self.config.system.num_gpus) as pool:
            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)

    all_results = []
    for gpu_result in gpu_results:
        all_results.extend(gpu_result)

    class SubsetData(TypedDict):
        indices: List[int]
        gains: List[float]

    combined_subsets: Dict[Union[int, float], SubsetData] = {
        size: {"indices": [], "gains": []} for size in self.config.subset_sizes
    }

    for fold_idx, result in all_results:
        for size in self.config.subset_sizes:
            combined_subsets[size]["indices"].extend(result[size]["indices"])
            combined_subsets[size]["gains"].extend(result[size]["gains"])

    base_name = dataset_name
    subsets = {}

    for size_spec in self.config.subset_sizes:
        actual_size = self.calculate_subset_size(len(embeddings), size_spec)
        logger.info(f"Actual subset size: {actual_size}")
        sorted_indices_gains = sorted(
            zip(
                combined_subsets[size_spec]["indices"],
                combined_subsets[size_spec]["gains"],
            ),
            key=lambda x: x[1],
            reverse=True,
        )[:actual_size]

        sorted_indices = [x[0] for x in sorted_indices_gains]
        sorted_gains = [x[1] for x in sorted_indices_gains]

        subset_name = self.get_subset_name(size_spec, actual_size)
        metadata_file = os.path.join(
            self.config.basic.output_dir,
            f"{base_name}_fl_{self.config.basic.num_folds}_partitions_{subset_name}_metadata.npz",
        )

        np.savez(metadata_file, indices=sorted_indices, gains=sorted_gains)
        logger.info(f"Saved metadata to {metadata_file}")
        subsets[size_spec] = sorted_indices

    return subsets


# Add the method to DataProcessor class
DataProcessor.select_subsets = select_subsets

print("✅ select_subsets() method added to DataProcessor!")

### Add File Processing Helper Methods

In [None]:
def _save_subset(self, subset_data, output_file: str, input_file: str):
    """
    Save subset data to file in appropriate format.
    
    Args:
        subset_data: The dataset subset to save
        output_file (str): Output file path
        input_file (str): Original input file path (for determining format)
    """
    extension = input_file.split(".")[-1].lower()
    if extension in ["json", "jsonl"]:
        subset_data.to_json(output_file, orient="records", lines=True)
    elif extension == "csv":
        subset_data.to_csv(output_file, index=False)
    elif extension == "parquet":
        subset_data.to_parquet(output_file)
    else:
        # Fail loudly instead of logging a save that never happened
        raise ValueError(f"Unsupported file format: .{extension}")

    logger.info(f"Saved subset to {output_file}")


def _process_single_dataset(
    self,
    dataset,
    dataset_name: str,
    output_dir: str,
    input_file: str
):
    """
    Process a single dataset (either combined or individual).
    
    This function orchestrates the complete pipeline for a single dataset:
    1. Validates epsilon parameter
    2. Generates or loads embeddings
    3. Selects subsets using Facility Location
    4. Saves selected subsets to files
    
    Args:
        dataset: The dataset to process
        dataset_name: Name of the dataset
        output_dir: Output directory for results
        input_file: Original input file path (for determining format)
    """
    try:
        # Validate epsilon based on dataset size
        self.config.basic.validate_epsilon_for_dataset_size(len(dataset))

        # Create dataset-specific output directory
        dataset_output_dir = os.path.join(output_dir, dataset_name)
        os.makedirs(dataset_output_dir, exist_ok=True)

        logger.info(f"Generating embeddings for {dataset_name}")
        embedding_file = self.generate_embeddings(
            dataset, os.path.join(dataset_output_dir, "embeddings")
        )

        logger.info("Loading embeddings for subset selection")
        with h5py.File(embedding_file, "r") as f:
            embeddings_data = f["embeddings"][:]
            if embeddings_data.size == 0:
                logger.warning(
                    f"No embeddings generated for dataset {dataset_name}, skipping subset selection"
                )
                return
            embeddings = torch.tensor(embeddings_data, dtype=torch.float32)

        logger.info("Selecting subsets")
        subsets = self.select_subsets(dataset_name, embeddings)

        logger.info("Saving subsets")
        for size_spec, indices in subsets.items():
            subset_data = dataset.select(indices)
            subset_name = self.get_subset_name(size_spec, len(indices))

            # Create subset filename with dataset name
            output_file = os.path.join(
                dataset_output_dir,
                f"{dataset_name}_{subset_name}_subset.{input_file.split('.')[-1]}",
            )

            self._save_subset(subset_data, output_file, input_file)
            logger.info(
                f"Saved subset with {len(indices)} samples to {output_file}"
            )

        # Clean up resources
        del dataset, embeddings
        gc.collect()
        torch.cuda.empty_cache()

    except Exception as e:
        logger.error(f"Error processing dataset {dataset_name}: {str(e)}")
        raise


def process_files(self, input_files: List[str], output_dir: str):
    """
    Process multiple input files with support for both combined and separate processing.
    
    Args:
        input_files (List[str]): List of input files to process
        output_dir (str): Output directory for results
    """
    try:
        if self.config.basic.combine_files:
            # Process combined datasets
            logger.info("Processing combined datasets...")
            dataset = self.load_and_combine_datasets(input_files)
            dataset_name = "combined_dataset"

            # Process combined dataset
            self._process_single_dataset(
                dataset, dataset_name, output_dir, input_files[0]
            )
        else:
            # Process each dataset separately
            logger.info("Processing datasets separately...")
            for input_file in input_files:
                dataset = self.load_and_combine_datasets([input_file])
                dataset_name = self.get_dataset_name(input_file)
                logger.info(f"Processing dataset: {dataset_name}")
                self._process_single_dataset(
                    dataset, dataset_name, output_dir, input_file
                )

    except Exception as e:
        logger.error(f"Error processing files: {str(e)}")
        raise


# Add all methods to DataProcessor class
DataProcessor._save_subset = _save_subset
DataProcessor._process_single_dataset = _process_single_dataset
DataProcessor.process_files = process_files

print("✅ Additional methods added to DataProcessor!")

### Main Wrapper Function: subset_datasets()

In [None]:
def subset_datasets(
    input_files: List[str],
    subset_sizes: List[Union[int, float]],
    testing_mode: bool = False,
    **kwargs: Any,
) -> None:
    """
    Create subsets of datasets using facility location for diverse subset selection.
    
    This is the main entry point from the original codebase. It creates configuration,
    initializes the processor, and runs the complete pipeline.
    
    Args:
        input_files: List of input files to process
        subset_sizes: List of subset sizes (floats 0-1 for percentage, ints for absolute count)
        testing_mode: If True, allows CPU usage (for testing only)
        **kwargs: Additional configuration parameters (e.g., output_dir, epsilon, num_folds)
    """
    # Get system's available GPU count
    available_gpus = get_default_num_gpus(testing_mode=testing_mode)

    # Create configuration groups
    basic_config = BasicConfig()
    encoder_config = EncoderConfig(testing_mode=testing_mode)
    template_config = TemplateConfig()
    system_config = SystemConfig(testing_mode=testing_mode)

    # Update configuration groups from kwargs
    for key, value in kwargs.items():
        if hasattr(basic_config, key):
            setattr(basic_config, key, value)
        elif hasattr(encoder_config, key):
            setattr(encoder_config, key, value)
        elif hasattr(template_config, key):
            setattr(template_config, key, value)
        elif hasattr(system_config, key):
            setattr(system_config, key, value)
        else:
            # Surface typos instead of silently dropping the setting
            logger.warning(f"Ignoring unknown configuration parameter: {key}")

    # Ensure num_gpus doesn't exceed available GPUs
    if system_config.num_gpus > available_gpus:
        logger.warning(
            f"Requested {system_config.num_gpus} GPUs but only {available_gpus} available. "
            f"Falling back to using {available_gpus} GPUs."
        )
        system_config.num_gpus = available_gpus

    # Create configuration
    config = ProcessingConfig(
        input_files=input_files,
        subset_sizes=subset_sizes,
        basic=basic_config,
        encoder=encoder_config,
        template=template_config,
        system=system_config,
    )

    try:
        logger.info(f"Processing configuration: {config}")
        processor = DataProcessor(config)
        processor.process_files(input_files, config.basic.output_dir)

    except Exception as e:
        logger.error(f"Processing failed: {str(e)}")
        raise

    finally:
        # Cleanup
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()


print("✅ subset_datasets() function defined!")

## 🚀 Execute Subset Selection

**Choose ONE of the two options below:**

In [None]:
## ==================== OPTION 1: USE EXISTING EMBEDDINGS (RECOMMENDED) ====================
##
## ✅ This is the FAST approach - it uses embeddings already generated in Notebook 2
## ⚡ Takes only minutes (vs hours for full pipeline)
## 💾 Loads embeddings from: ../../assets/subset-selection/outputs/embeddings/embeddings.h5
##
## Requirements:
## - Notebook 2 must be run first
## - Embeddings file must exist at the specified path
##
## This option is active by default. Comment it out if you run Option 2 instead.

_option_1_success = False
try:
    run_subset_selection_only(
        embeddings_file='../../assets/subset-selection/outputs/embeddings/embeddings.h5',  # From Notebook 2
        dataset=dataset,  # Loaded from Notebook 1
        output_dir="../../assets/subset-selection/outputs", # change this to your desired output directory
        subset_sizes=[0.1, 0.05],  # 10% and 5% subsets
        dataset_name='combined_cut_50x',
        num_folds=25,
        epsilon=0.1,
        testing_mode=True,
    )
    _option_1_success = True
except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()


## ==================== OPTION 2: FULL PIPELINE FROM SCRATCH (SLOW) ====================
##
## 🔄 This regenerates embeddings even if they already exist
## ⏰ Takes hours for large datasets
## 🎯 Use this only if running Notebook 3 independently, without Notebook 2
##
## Uncomment the code below to run:

_option_2_success = False
# try:
#     subset_datasets(
#         input_files=['data/combined_cut_50x.jsonl'],
#         subset_sizes=[0.1, 0.05],  # 10% and 5% subsets
#         testing_mode=True,
#         output_dir='data/output',
#         num_folds=25,
#         epsilon=0.1
#     )
#     _option_2_success = True
# except Exception as e:
#     print(f"❌ Error: {e}")
#     import traceback
#     traceback.print_exc()


## ==================== EXECUTION STATUS ====================
# Smart status reporting based on what actually ran

if _option_1_success or _option_2_success:
    # Success case - at least one option ran successfully
    print("\n" + "="*70)
    print("✅ SUBSET SELECTION WORKFLOW COMPLETED SUCCESSFULLY!")
    print("="*70)
    if _option_1_success:
        print("📍 Executed: Option 1 (Using existing embeddings)")
    if _option_2_success:
        print("📍 Executed: Option 2 (Full pipeline from scratch)")
    
else:
    # No option ran successfully or both are commented out
    print("\n" + "="*70)
    print("⚠️  NO EXECUTION OPTION SELECTED OR ALL OPTIONS FAILED")
    print("="*70)
    print("\n📋 Current Status:")
    print("   • Option 1: did not complete (failed or commented out; check errors above)")
    print("   • Option 2: did not complete (failed or commented out)")