# 📊 Hierarchical Sampling & Experiment Management

## Advanced SeedHash Tutorial #2

**Welcome!** This notebook covers advanced features of SeedHash:
- `SeedExperimentManager` for systematic experimentation
- Hierarchical seed generation (master → seeds → sub-seeds)
- 4 sampling methods: simple, stratified, cluster, systematic
- ML experiment tracking with metrics
- DataFrame export and analysis

**Prerequisites**: Complete `01_Complete_SeedHash_Tutorial.ipynb` first

**Duration**: ~45 minutes

---

## Table of Contents
1. Introduction to SeedExperimentManager
2. Hierarchical Seed Generation
3. Simple Random Sampling
4. Stratified Random Sampling
5. Cluster Random Sampling
6. Systematic Random Sampling
7. ML Experiment Tracking
8. DataFrame Export & Analysis
9. Complete Example: Multi-Experiment Study
10. Best Practices

In [None]:
# Make the local seedhash package importable (adjust the path for your checkout)
import sys
sys.path.insert(0, '../Python')

# Import required libraries
import numpy as np
import pandas as pd
from seedhash import SeedExperimentManager

print("✅ All imports successful!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 1. Introduction to SeedExperimentManager 🎯

`SeedExperimentManager` provides a single interface for:
- Managing multiple experiments with hierarchical seeds
- Tracking ML metrics across different methods
- Organizing results in pandas DataFrames
- Comparing sampling strategies

Let's create our first manager:

In [None]:
# Create an experiment manager
manager = SeedExperimentManager("my_ml_project")

print(f"Project: {manager.project_name}")
print(f"Master seed: {manager.master_seed}")
print(f"Experiments tracked: {len(manager.results)}")

## 2. Hierarchical Seed Generation 🌳

Generate a **hierarchy** of seeds: master → seeds → sub-seeds → sub-sub-seeds

This is perfect for:
- Cross-validation folds
- Ensemble methods
- Nested experiments
- Multi-level sampling

In [None]:
# Generate hierarchical seeds
hierarchy = manager.generate_seed_hierarchy(
    n_seeds=3,        # 3 main seeds
    n_sub_seeds=2,    # 2 sub-seeds per main seed
    max_depth=2       # 2 levels deep
)

print(f"Master seed: {hierarchy[0][0]}\n")
print(f"Level 0 (master): {hierarchy[0]}")
print(f"Level 1 (seeds): {hierarchy[1]}")
print(f"Level 2 (sub-seeds): {hierarchy[2]}\n")

print("Hierarchy structure:")
for level, seeds in hierarchy.items():
    print(f"  Level {level}: {len(seeds)} seed(s)")
    if len(seeds) <= 10:  # Only print if not too many
        print(f"    {seeds[:5]}{'...' if len(seeds) > 5 else ''}")

print(f"\nTotal: {len(hierarchy[0])} master → {len(hierarchy[1])} seeds → {len(hierarchy[2])} sub-seeds")
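
One way to picture the hierarchy is as repeated hashing: each child seed is derived from its parent seed and its position among its siblings. The sketch below is a minimal illustration under that assumption. `derive_seed` and `build_hierarchy` are hypothetical helpers, and SeedHash's own derivation scheme may differ.

```python
import hashlib


def derive_seed(parent: int, index: int) -> int:
    """Derive a child seed from a parent seed and the child's position (hypothetical scheme)."""
    digest = hashlib.sha256(f"{parent}:{index}".encode("utf-8")).digest()
    # Keep 4 bytes so the result fits in an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


def build_hierarchy(master: int, n_seeds: int, n_sub_seeds: int, max_depth: int) -> dict:
    """Return {level: [seeds]}, with level 0 holding only the master seed."""
    levels = {0: [master]}
    for depth in range(1, max_depth + 1):
        # Level 1 fans out by n_seeds; deeper levels fan out by n_sub_seeds.
        fan_out = n_seeds if depth == 1 else n_sub_seeds
        levels[depth] = [
            derive_seed(parent, i)
            for parent in levels[depth - 1]
            for i in range(fan_out)
        ]
    return levels


demo = build_hierarchy(master=42, n_seeds=3, n_sub_seeds=2, max_depth=2)
print({level: len(seeds) for level, seeds in demo.items()})  # {0: 1, 1: 3, 2: 6}
```

Because every seed is a pure function of the master seed and its position, rebuilding the hierarchy with the same arguments reproduces the same seeds.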

## 3. Simple Random Sampling 🎲

The most basic sampling method: every index in the population has the same chance of being selected, with no grouping or stratification.

**Use case**: Quick experiments, baseline comparisons

In [None]:
# Simple random sampling
population_size = 1000
sample_size = 100

samples = manager.simple_random_sample(
    population_size=population_size,
    sample_size=sample_size,
    seed=12345
)

print(f"Population: {population_size}")
print(f"Sample size: {len(samples)}")
print(f"First 10 samples: {sorted(samples)[:10]}")
print(f"Sample range: [{min(samples)}, {max(samples)}]")
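
For intuition, seeded simple random sampling can be sketched with the standard library alone. This is an illustration, not SeedHash's implementation; `simple_sample` is a hypothetical helper, and it assumes sampling without replacement over the indices `0 .. population_size - 1`.

```python
import random


def simple_sample(population_size: int, sample_size: int, seed: int) -> list:
    """Draw sample_size distinct indices from range(population_size)."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(range(population_size), sample_size)


picked = simple_sample(population_size=1000, sample_size=100, seed=12345)
print(len(picked), len(set(picked)))  # 100 100
```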

## 4. Stratified Random Sampling 📊

Ensures proportional representation across different strata (groups) in your data.

**Use case**: Balanced experiments, class-imbalanced datasets

In [None]:
# Stratified sampling ensures balanced coverage
samples = manager.stratified_random_sample(
    population_size=1000,
    sample_size=100,
    n_strata=10,  # Divide into 10 strata
    seed=12345
)

print(f"Stratified sample size: {len(samples)}")
print(f"First 10 samples: {sorted(samples)[:10]}")

# Verify stratification: count the samples that fall in each block of 100 indices
strata_sizes = [len([s for s in samples if i*100 <= s < (i+1)*100]) for i in range(10)]
print(f"\nSamples per stratum: {strata_sizes}")
print(f"Expected per stratum: ~{100//10}")
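
The idea behind the check above can be sketched independently of the library. The version below assumes strata are equal-width contiguous index blocks and that both sizes divide evenly by `n_strata`. `stratified_sample` is a hypothetical helper, and how SeedHash actually forms its strata is not shown in this tutorial.

```python
import random


def stratified_sample(population_size: int, sample_size: int, n_strata: int, seed: int) -> list:
    """Sample the same number of indices from each equal-width stratum."""
    if population_size % n_strata or sample_size % n_strata:
        raise ValueError("this sketch needs sizes divisible by n_strata")
    width = population_size // n_strata
    per_stratum = sample_size // n_strata
    rng = random.Random(seed)
    chosen = []
    for i in range(n_strata):
        stratum = range(i * width, (i + 1) * width)  # indices belonging to stratum i
        chosen.extend(rng.sample(stratum, per_stratum))
    return chosen


picked = stratified_sample(population_size=1000, sample_size=100, n_strata=10, seed=12345)
counts = [sum(1 for s in picked if i * 100 <= s < (i + 1) * 100) for i in range(10)]
print(counts)  # [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
```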

## 5. Cluster Random Sampling 🎯

Groups related samples together before selection.

**Use case**: Geographic data, grouped experiments

In [None]:
# Cluster sampling groups related samples
samples = manager.cluster_random_sample(
    population_size=1000,
    sample_size=100,
    n_clusters=5,  # Create 5 clusters
    seed=12345
)

print(f"Cluster sample size: {len(samples)}")
print(f"First 10 samples: {sorted(samples)[:10]}")
print("\nClusters provide natural grouping for batch experiments")
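
A common textbook form of cluster sampling works in two stages: pick whole clusters at random, then sample within them. The sketch below follows that pattern under stated assumptions: clusters are contiguous index blocks, and `n_selected` (how many clusters to keep) is a hypothetical parameter that the SeedHash call above does not take. It is not a description of what `cluster_random_sample` does internally.

```python
import random


def cluster_sample(population_size: int, sample_size: int, n_clusters: int,
                   n_selected: int, seed: int) -> list:
    """Two-stage cluster sampling over range(population_size)."""
    rng = random.Random(seed)
    # Cluster i covers the indices [bounds[i], bounds[i + 1]).
    bounds = [i * population_size // n_clusters for i in range(n_clusters + 1)]
    clusters = [range(bounds[i], bounds[i + 1]) for i in range(n_clusters)]
    # Stage 1: pick whole clusters at random.
    selected = rng.sample(clusters, n_selected)
    # Stage 2: sample individual indices from the pooled members of those clusters.
    pooled = [index for cluster in selected for index in cluster]
    return rng.sample(pooled, sample_size)


picked = cluster_sample(population_size=1000, sample_size=100,
                        n_clusters=5, n_selected=2, seed=12345)
touched = {s // 200 for s in picked}  # which of the 5 clusters the sample landed in
print(len(picked), len(touched) <= 2)  # 100 True
```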

## 6. Systematic Random Sampling ⚡

Selects samples at regular intervals, typically from a randomly chosen starting point.

**Use case**: Time-series data, evenly distributed experiments

In [None]:
# Systematic sampling with regular intervals
samples = manager.systematic_random_sample(
    population_size=1000,
    sample_size=100,
    seed=12345
)

print(f"Systematic sample size: {len(samples)}")
print(f"First 10 samples: {sorted(samples)[:10]}")

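# Check interval (sort first so consecutive differences are meaningful)
ordered = sorted(samples)
if len(ordered) > 1:
    intervals = [ordered[i+1] - ordered[i] for i in range(min(5, len(ordered)-1))]
    print(f"\nFirst 5 intervals: {intervals}")
    if len(set(intervals)) == 1:
        print("Samples are evenly spaced!")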
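
The interval logic can be made concrete with a short standalone sketch. It assumes the classic scheme: a fixed step of `population_size // sample_size` and a seeded random start inside the first step. `systematic_sample` is a hypothetical helper, and SeedHash may choose its start or step differently.

```python
import random


def systematic_sample(population_size: int, sample_size: int, seed: int) -> list:
    """Take every step-th index, starting from a seeded random offset."""
    step = population_size // sample_size
    start = random.Random(seed).randrange(step)  # offset in [0, step)
    return [start + i * step for i in range(sample_size)]


picked = systematic_sample(population_size=1000, sample_size=100, seed=12345)
gaps = {b - a for a, b in zip(picked, picked[1:])}
print(gaps)  # {10}
```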

## 7. ML Experiment Tracking 📈

Track experiments with metrics and export to DataFrame for analysis.

In [None]:
# Create new manager for tracking
tracker = SeedExperimentManager("ml_tracking_demo")

# Generate seeds for experiments
hierarchy = tracker.generate_seed_hierarchy(n_seeds=5, n_sub_seeds=2, max_depth=2)

# Simulate regression experiments
for seed in hierarchy[1][:3]:  # Use first 3 seeds
    # Simulate metrics, seeding the generator so reruns give the same numbers
    rng = np.random.default_rng(seed)
    rmse = 5.0 + rng.random()
    r2 = 0.95 + rng.random() * 0.04
    
    # Track result
    tracker.add_experiment_result(
        seed=seed,
        ml_task="regression",
        metrics={"rmse": rmse, "r2": r2, "mae": rmse * 0.8},
        sampling_method="simple",
        metadata={"model": "linear_regression", "n_samples": 100}
    )
    print(f"Tracked seed {seed}: RMSE={rmse:.3f}, R²={r2:.3f}")

print(f"\n✓ Tracked {len(tracker.results)} experiments")

## 8. DataFrame Export & Analysis 📊

Export all tracked experiments to pandas DataFrame for powerful analysis.

In [None]:
# Export to DataFrame
df = tracker.get_results_dataframe()

print("DataFrame Preview:")
print(df.head())

print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Analyze metrics
print("\nMetric Statistics:")
print(df[['metric_rmse', 'metric_r2', 'metric_mae']].describe())

# Export to files
df.to_csv('experiment_results.csv', index=False)
df.to_json('experiment_results.json', orient='records', indent=2)

print("\n✓ Exported to CSV and JSON!")
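
The `metric_` prefix on the column names above suggests that each result's nested `metrics` dictionary is flattened into one column per metric. The sketch below shows one way such flattening could work. `flatten_result` and the `meta_` prefix are hypothetical, and the real column layout produced by `get_results_dataframe()` may differ.

```python
def flatten_result(seed, ml_task, metrics, sampling_method, metadata):
    """Turn one nested experiment record into a flat row (hypothetical layout)."""
    row = {"seed": seed, "ml_task": ml_task, "sampling_method": sampling_method}
    # One column per metric, prefixed as in the DataFrame shown above.
    row.update({f"metric_{name}": value for name, value in metrics.items()})
    # The meta_ prefix is an assumption made for this sketch.
    row.update({f"meta_{name}": value for name, value in metadata.items()})
    return row


row = flatten_result(
    seed=101,
    ml_task="regression",
    metrics={"rmse": 5.2, "r2": 0.97},
    sampling_method="simple",
    metadata={"model": "linear_regression"},
)
print(list(row))
```

A list of such rows can be passed directly to `pd.DataFrame(...)`, which builds one column per key.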

## 9. Complete Example: Multi-Experiment Study 🔬

Putting it all together: run multiple experiments with different sampling methods.

In [None]:
# Complete study comparing all sampling methods
study = SeedExperimentManager("complete_study")

sampling_methods = ["simple", "stratified", "cluster", "systematic"]

# Seed the simulated metrics from the master seed so the study is repeatable
rng = np.random.default_rng(study.master_seed)

for method in sampling_methods:
    # Generate hierarchy with this method
    hierarchy = study.generate_seed_hierarchy(
        n_seeds=3,
        n_sub_seeds=2,
        max_depth=2,
        sampling_method=method
    )
    
    # Run experiments with level 1 seeds
    for seed in hierarchy[1]:
        # Simulate classification metrics
        accuracy = 0.80 + rng.random() * 0.15
        f1_score = accuracy * (0.95 + rng.random() * 0.05)
        
        study.add_experiment_result(
            seed=seed,
            ml_task="classification",
            metrics={"accuracy": accuracy, "f1": f1_score},
            sampling_method=method,
            metadata={"classifier": "random_forest", "cv_folds": 5}
        )
    
    print(f"✓ Completed {method} sampling: {len(hierarchy[1])} experiments")

# Analyze all results
df = study.get_results_dataframe()
print(f"\nTotal experiments: {len(df)}")
print("\nAccuracy by sampling method:")
print(df.groupby('sampling_method')['metric_accuracy'].agg(['mean', 'std', 'min', 'max']).round(3))

## 10. Best Practices & Summary 💡

### ✅ Best Practices:

**When to use each sampling method:**
- **Simple**: Quick experiments, no special requirements
- **Stratified**: Balanced experiments, class imbalance
- **Cluster**: Grouped data, batch processing
- **Systematic**: Time-series, even distribution

**Hierarchy depth guidelines:**
- **1 level**: Single experiment with multiple runs
- **2 levels**: Cross-validation folds
- **3+ levels**: Nested experiments, ensemble methods

**Experiment tracking tips:**
- Always include metadata for reproducibility
- Export to DataFrame for analysis
- Track timestamp and sampling method
- Use descriptive experiment IDs
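
To illustrate why complete metadata matters, the sketch below stores every parameter needed to redo a sampling run and then replays it from the stored record. The record layout and the `run` helper are hypothetical and use only the standard library. The point is that a run is reproducible only if the seed and all sampling parameters are saved together.

```python
import random


def run(params: dict) -> list:
    """Re-create a simple random sample from a stored parameter record."""
    rng = random.Random(params["seed"])
    return rng.sample(range(params["population_size"]), params["sample_size"])


record = {
    "seed": 12345,
    "population_size": 1000,
    "sample_size": 100,
    "sampling_method": "simple",
}

original = run(record)
replayed = run(dict(record))  # e.g. after loading the record back from disk
print(original == replayed)  # True
```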

---

## 🎉 Summary

You learned:
- ✅ `SeedExperimentManager` for systematic experimentation
- ✅ Hierarchical seed generation (master → seeds → sub-seeds)
- ✅ 4 sampling methods with different use cases
- ✅ ML experiment tracking with metrics
- ✅ DataFrame export for analysis

**Next:** Try `03_Advanced_ML_Paradigms.ipynb` for semi-supervised, reinforcement, and federated learning!

---

**Happy experimenting! 📊**