# Experiments Multiview Stacking TDA MNIST

## Introduction
This notebook explores the hypothesis that a **multi-view model** incorporating **topological features** (TDA) outperforms models that do **not** include these features. Specifically, we adopt a technique called **Multiview Stacking**, where multiple “views” of the same dataset (e.g., different image quadrants, topological features, or full original images) are combined via a stacking ensemble to improve classification performance.

### Motivation
1. **Topological Data Analysis (TDA)** aims to capture intrinsic shape information from data, hypothesized to be complementary to the raw pixel view.
2. **Multiview Stacking** combines multiple feature sets (e.g., original image pixels, TDA features, and image quadrants) so that each “view” acts as a separate input stream. A base learner is trained on each view, and a meta-learner (here, a RandomForest) then learns from the base learners' predictions.
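
To make the stacking idea concrete, here is a minimal sketch built only on scikit-learn. It is not the `multiviewstacking` package's API and not the notebook's `train_multiview_stacking`. The 8×8 digits dataset, the `make_views` helper, and the three view names are assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Small stand-in dataset (8x8 digits bundled with scikit-learn), not MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def make_views(X_flat):
    """Hypothetical views: the full image plus its top and bottom halves."""
    images = X_flat.reshape(-1, 8, 8)
    return {
        "original": X_flat,
        "top_half": images[:, :4, :].reshape(len(images), -1),
        "bottom_half": images[:, 4:, :].reshape(len(images), -1),
    }


train_views = make_views(X_train)
test_views = make_views(X_test)

meta_train, meta_test = [], []
for name, X_view in train_views.items():
    base = RandomForestClassifier(n_estimators=50, random_state=0)
    # Out-of-fold class probabilities: the meta-learner never sees a
    # prediction made on a sample its base learner was trained on.
    meta_train.append(
        cross_val_predict(base, X_view, y_train, cv=5, method="predict_proba")
    )
    # Refit on the full training view to produce test-time probabilities.
    base.fit(X_view, y_train)
    meta_test.append(base.predict_proba(test_views[name]))

# The meta-learner is trained on the concatenated per-view probabilities.
meta_learner = RandomForestClassifier(n_estimators=100, random_state=0)
meta_learner.fit(np.hstack(meta_train), y_train)
accuracy = meta_learner.score(np.hstack(meta_test), y_test)
print(f"Stacked accuracy: {accuracy:.3f}")
```

Each view contributes one block of ten class probabilities to the meta-features, so adding or removing a view only changes the width of the meta-learner's input.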

### Hypothesis
> *“Including topological features (via TDA) in a multi-view stacking ensemble yields higher predictive performance than a model without TDA.”*

## Methodology

1. **Dataset and Chunk Processing**  
   - We load MNIST data from `mnist_784.csv` in **chunks** to manage memory usage and to run multiple iterations (`NUMBER_EXPERIMENTS`). Each chunk is split into training and test sets.
   - We optionally reshape images to a square (28×28) for quadrant splitting and TDA extraction.

2. **View Creation**  
   - **Original**: Flattened image (784 features).  
   - **TDA**: Topological features extracted with a pipeline built via `build_tda_pipeline`.  
   - **Quadrants**: Four 14×14 image subsections (`top_left`, `top_right`, `bottom_left`, `bottom_right`).  

3. **Ablation / View Combinations**  
   - We generate all non-empty subsets of the full set of views: `["original", "tda", "top_left", "top_right", "bottom_left", "bottom_right"]`.
   - For each subset (e.g., `"original+tda"`, `"top_left+tda"`, …), we:
     1. Keep each selected view's features as a separate dictionary entry, or concatenate them.  
     2. Train a **multi-view stacking** model (via `train_multiview_stacking`) with a RandomForest meta-learner.  
     3. Evaluate classification performance.

4. **Data Flow and Iterations**  
   - Each chunk processes all view combinations, then we move to the next chunk. We limit the total to `NUMBER_EXPERIMENTS`.
   - For each chunk and combination, the code logs performance (`classification_report`) and predictions in a global `results` dictionary.

5. **Main Experiment Steps**  
   1. **Initialize**: Logging, random seeds, create experiment directory.  
   2. **Load Data in Chunks**: `read_csv_in_chunks(file_path, CHUNK_SIZE)`  
   3. **Train/Test Split**: A portion is used for training, the rest for testing.  
   4. **Generate Quadrants + TDA**: `split_image_into_quadrants` and `tda_pipeline`.  
   5. **Loop Over Combinations**: `generate_view_combinations(VIEWS)` enumerates all subsets.  
   6. **Train and Collect**: Call `train_multiview_stacking(...)`, store `report`, `y_pred`, `y_true`.  
   7. **Repeat** for each chunk until `NUMBER_EXPERIMENTS` is reached.  
   8. **Save and Summarize**: The function `finalize_and_save_results` writes out CSV files with metrics and predictions.  
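
The quadrant views and the enumeration of view subsets can be sketched as follows. The helpers `split_into_quadrants` and `view_combinations` are hypothetical stand-ins for the notebook's `split_image_into_quadrants` and `generate_view_combinations`, whose actual implementations are not shown here. The sketch assumes the empty subset is skipped and that names are joined with `+` in the order given by `VIEWS`.

```python
from itertools import combinations

import numpy as np


def split_into_quadrants(image):
    """Split a square 2-D image into four equally sized quadrants."""
    half = image.shape[0] // 2
    return {
        "top_left": image[:half, :half],
        "top_right": image[:half, half:],
        "bottom_left": image[half:, :half],
        "bottom_right": image[half:, half:],
    }


def view_combinations(views):
    """Return every non-empty subset of `views`, joined with '+'."""
    return [
        "+".join(subset)
        for size in range(1, len(views) + 1)
        for subset in combinations(views, size)
    ]


# A fake 28x28 "image" whose pixel values are just their flat indices.
image = np.arange(784).reshape(28, 28)
quadrants = split_into_quadrants(image)
print({name: q.shape for name, q in quadrants.items()})  # every shape is (14, 14)

VIEWS = ["original", "tda", "top_left", "top_right", "bottom_left", "bottom_right"]
combos = view_combinations(VIEWS)
print(len(combos))  # 63, i.e. 2**6 - 1
print(combos[:8])
```

With six views, the ablation trains 63 models per chunk, which is worth keeping in mind when choosing `CHUNK_SIZE` and `NUMBER_EXPERIMENTS`.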

## Using the Main Experiment Function

The entry point is `main_experiment()`. It:
1. Creates logs and an experiment directory for saving outputs.  
2. Loads data, chunk by chunk, from **MNIST** (`mnist_784.csv`).  
3. Generates a TDA pipeline and quadrant data for each chunk.  
4. Iterates over each **view combination** to train and evaluate multi-view stacking.  
5. Collects performance metrics and predictions in a `results` dictionary.  
6. Writes comprehensive metrics (precision, recall, F1-score, support) to CSV, plus predictions and ground truths for further analysis.
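
The chunked loading in steps 2 and 3 can be illustrated with pandas' `chunksize` option. This is a sketch under assumptions: the tiny synthetic CSV, the `class` label column, the `TRAIN_FRACTION` constant, and the positional split are all illustrative and are not taken from `read_csv_in_chunks` or the real `mnist_784.csv`.

```python
import io

import numpy as np
import pandas as pd

# Tiny synthetic CSV standing in for mnist_784.csv (4 pixel columns + label).
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.integers(0, 256, size=(10, 4)),
    columns=[f"pixel{i}" for i in range(1, 5)],
)
frame["class"] = rng.integers(0, 10, size=10)
csv_buffer = io.StringIO(frame.to_csv(index=False))

CHUNK_SIZE = 4
NUMBER_EXPERIMENTS = 2
TRAIN_FRACTION = 0.75  # hypothetical split ratio

processed = []
# read_csv with chunksize yields DataFrames of at most CHUNK_SIZE rows.
for chunk_index, chunk in enumerate(pd.read_csv(csv_buffer, chunksize=CHUNK_SIZE)):
    if chunk_index >= NUMBER_EXPERIMENTS:
        break  # stop once the requested number of chunks has been processed
    n_train = int(len(chunk) * TRAIN_FRACTION)
    train, test = chunk.iloc[:n_train], chunk.iloc[n_train:]
    processed.append((len(train), len(test)))

print(processed)  # [(3, 1), (3, 1)]
```

Only one chunk is held in memory at a time, and each processed chunk counts as one experiment iteration.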

### Configurations
Inside the function, you can modify:
- `RF_N_ESTIMATORS`: Number of trees in the RandomForest meta-learner.  
- `RANDOM_STATE`: Seed for reproducibility.  
- `NUMBER_EXPERIMENTS`: How many chunks to process.  
- `CHUNK_SIZE`: How large each data chunk is.  
- `VIEWS`: The set of potential feature subsets.  
- `N_JOBS`: Number of parallel jobs for the RandomForest and TDA pipeline.

### Expected Results
- A CSV of **detailed metrics** for each combination (e.g., accuracy, macro-F1, per-class metrics).  
- (Optionally) A file of **predictions** and **ground truths** for deeper error analysis.  
- Higher performance for view combinations that include TDA features, if our hypothesis holds true.
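
As an illustration of the metrics file, the sketch below turns scikit-learn's `classification_report` into a table and writes it to CSV. The toy labels, the `view_combination` column, and the file name `metrics.csv` are assumptions. The layout produced by `finalize_and_save_results` may differ.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.metrics import classification_report

# Toy labels standing in for one view combination's test predictions.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
accuracy = report.pop("accuracy")  # a single float, unlike the other entries

# Rows: one per class plus "macro avg" and "weighted avg".
metrics = pd.DataFrame(report).T
metrics["accuracy"] = accuracy
metrics["view_combination"] = "original+tda"  # hypothetical identifier column

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "metrics.csv"
    metrics.to_csv(csv_path, index_label="label")
    reloaded = pd.read_csv(csv_path, index_col="label")

print(reloaded)
```

Storing the view combination alongside each row makes it straightforward to compare subsets with and without `tda` after concatenating the per-chunk tables.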

### Requirements

- Install the `multiviewstacking` package before running the notebook: `pip install multiviewstacking`.

In [None]:
# ============================
# Standard Library Imports
# ============================
import random
import time
import json
from itertools import product
from pathlib import Path
from typing import Optional
from dataclasses import asdict, fields



# ============================
# Third-Party Library Imports
# ============================
import numpy as np
import pandas as pd
from tqdm import tqdm

# ============================
# Local Application Imports
# ============================
from experiment.Experiment import Experiment
from experiment.ExperimentConfig import ExperimentConfig


def save_configs_to_json(
    configs: list[ExperimentConfig],
    path: Path
) -> None:
    """
    Saves a list of ExperimentConfig instances to a JSON file.

    Args:
        configs (list[ExperimentConfig]): List of configs to save.
        path (Path): Path to the JSON file.
    """
    path.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

    with path.open("w", encoding="utf-8") as f:
        json.dump(
            [asdict(cfg) for cfg in configs],
            f,
            indent=4,
            default=str  # Needed to serialize Path objects
        )


def load_configs_from_json(path: Path) -> list[ExperimentConfig]:
    """
    Loads a list of ExperimentConfig instances from a JSON file.

    Args:
        path (Path): Path to the JSON file.

    Returns:
        list[ExperimentConfig]: List of experiment configurations.
    """
    with path.open("r", encoding="utf-8") as f:
        data = json.load(f)

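    config_fields = {field.name for field in fields(ExperimentConfig)}
    # Path fields were written as strings (json.dump default=str); restore them
    path_fields = ("results_dir", "log_dir")

    configs = []
    for item in data:
        # Keep only valid fields (in case the JSON has extra keys)
        clean_item = {k: v for k, v in item.items() if k in config_fields}
        for key in path_fields:
            if clean_item.get(key) is not None:
                clean_item[key] = Path(clean_item[key])
        configs.append(ExperimentConfig(**clean_item))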

    return configs


def run_experiments_from_saved_configs(config_file: Path) -> None:
    """
    Runs experiments loaded from a previously saved configuration file.

    Args:
        config_file (Path): Path to the JSON file containing configs.
    """
    configs = load_configs_from_json(config_file)

    start_time = time.perf_counter()

    for config in tqdm(configs, desc="Running Loaded Experiments"):
        run_experiment(config)

    elapsed_time = (time.perf_counter() - start_time) / 60
    print(f"✅ Experiments complete! Time: {elapsed_time:.2f} minutes")


def generate_experiment_configs(
    num_runs: int,
    seed: int,
    exp_name: str,
    train_splits: list[float],
    noise_types: list[str],
    noise_quantities: list[int],
    noise_transparencies: Optional[list[float]] = None,
) -> list[ExperimentConfig]:
    """
    Generates one ExperimentConfig per random state and per combination
    (Cartesian product) of train split, noise type, noise quantity, and
    noise transparency. `num_runs` random states are drawn from `seed`.
    """
    random.seed(int(seed))
    random_states = random.sample(range(10**5, 10**9), num_runs)

    if noise_transparencies is None:
        noise_transparencies = [1.0]  # Default to full opacity if not set

    vary_train_split = len(train_splits) > 1

    configs = []
    for rs in random_states:
        for (
            train_split,
            noise_type,
            noise_quantity,
            noise_transparency
        ) in product(
            train_splits,
            noise_types,
            noise_quantities,
            noise_transparencies
        ):
            config = ExperimentConfig(
                exp_name=exp_name,
                results_dir=Path(f"results/{exp_name}"),
                log_dir=Path(f"logs/{exp_name}"),
                random_state=rs,
                train_split=train_split,
                vary_train_split=vary_train_split,
                noise_enabled=bool(noise_type),  # empty string disables noise
                noise_type=noise_type,
                noise_quantity=noise_quantity,
                noise_transparency=noise_transparency
            )
            configs.append(config)

    return configs




def run_experiment(config: ExperimentConfig) -> None:
    """
    Runs a single experiment given an ExperimentConfig.
    """
    exp = Experiment(config)
    exp.run_experiment()


def run_multiple_experiments(
    num_runs: int,
    exp_name: str,
    train_splits: list[float],
    noise_types: list[str],
    noise_quantities: list[int],
    noise_transparencies: Optional[list[float]] = None,
    seed: int = 56,
) -> None:
    """
    Runs multiple experiments with varying configurations.
    """
    start_time = time.perf_counter()

    
    configs = generate_experiment_configs(
        num_runs=num_runs,
        seed=seed,
        exp_name=exp_name,
        train_splits=train_splits,
        noise_types=noise_types,
        noise_quantities=noise_quantities,
        noise_transparencies=noise_transparencies,
    )

    # Save configs before running
    save_configs_to_json(
        configs,
        path=Path(f"experiments_configs/{exp_name}_configs.json")
    )

    for config in tqdm(configs, desc="Running Experiments"):
        run_experiment(config)

    
    def format_elapsed_time(seconds: float) -> str:
        """Formats time for better readability."""
        minutes = seconds / 60
        if minutes < 60:
            return f"{minutes:.2f} minutes"
        hours = minutes / 60
        return f"{hours:.2f} hours"

    elapsed_time = time.perf_counter() - start_time
    print(f"✅ Experiments complete! Total time elapsed: {format_elapsed_time(elapsed_time)}")

# Run HERE

In [None]:
RUN_FROM_SAVED_CONFIG = False
CONFIG_PATH = ""

if __name__ == "__main__":
    if RUN_FROM_SAVED_CONFIG:
        if not CONFIG_PATH:
            raise ValueError(
                "CONFIG_PATH must be set if RUN_FROM_SAVED_CONFIG is True."
            )
        print(f"Running experiments from saved configs at {CONFIG_PATH}")
        run_experiments_from_saved_configs(Path(CONFIG_PATH))
    else:
        print("Generating and running new experiment configurations.")
        run_multiple_experiments(
            num_runs=1,
            exp_name="test",
            train_splits=[0.9],
            noise_types=[""],
            noise_quantities=[0],
            noise_transparencies=[0],
            seed=44712
        )