## What this notebook does (Dataset Sampling & Preparation)

This notebook creates a **manageable, balanced subset** of the full PLAsTiCC dataset that can be processed locally and used consistently across all notebooks.

### 1) Configuration: dataset size
- Sets `N_PER_CLASS = 1000`, meaning:
  - 1,000 Type Ia supernovae (SNIa, target = 90)
  - 1,000 Type II supernovae (SNII, target = 42)
- This gives more stable statistics than the earlier, smaller samples while remaining computationally feasible to process locally.

### 2) Load full PLAsTiCC metadata
- Loads `training_set_metadata.csv`, which contains one row per astronomical object.
- Filters the metadata to only the two classes of interest (SNIa vs SNII).
- Prints how many total examples of each class are available in the full dataset.

### 3) Balanced sampling by class
- Randomly samples up to `N_PER_CLASS` objects from each class using a fixed random seed for reproducibility.
- Combines both classes into a single balanced metadata table (`sample_metadata`).
- Ensures:
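  - No class imbalance, provided each class has at least `N_PER_CLASS` objects (true here: 2,313 SNIa and 1,193 SNII are available)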
  - Identical object set is reused across feature extraction, classical ML, and quantum ML
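
The sampling logic can be tried without the PLAsTiCC files. The sketch below is a minimal illustration on made-up metadata, not the notebook's own code: `balanced_sample` is a hypothetical helper name, and the toy `object_id` and `target` values are invented. It shows the two properties that matter here: a fixed seed returns the same objects on every run, and a class with fewer than `n_per_class` rows is capped at what is available, which leaves the result unbalanced.

```python
import pandas as pd


def balanced_sample(df, label_col, n_per_class, seed=42):
    """Draw up to n_per_class rows from each class with a fixed seed.

    Hypothetical helper for illustration only.
    """
    parts = []
    for _, group in df.groupby(label_col):
        n = min(n_per_class, len(group))  # cap at the rows actually available
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Toy metadata (made-up values): five objects of class 90, three of class 42.
toy = pd.DataFrame({
    "object_id": range(8),
    "target": [90, 90, 90, 90, 90, 42, 42, 42],
})

sample = balanced_sample(toy, "target", n_per_class=2)
counts = sample["target"].value_counts().to_dict()

# Asking for more rows than the smaller class holds caps that class.
capped = balanced_sample(toy, "target", n_per_class=4)
capped_counts = capped["target"].value_counts().to_dict()
```

With `n_per_class=2` both classes contribute two rows. With `n_per_class=4` class 42 contributes only its three rows, so it is worth checking the printed class counts before assuming balance.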

### 4) Save sampled metadata
- Writes `sample_metadata.csv`, which lists the exact objects included in the study.
- This file defines the **ground truth object set** for all downstream notebooks.

### 5) Extract corresponding light curves
- Reads the very large `training_set.csv` file **in chunks** to avoid loading it all into memory.
- For each chunk:
  - Keeps only rows belonging to the sampled object IDs
- Concatenates all matching rows into a single DataFrame, `sample_lightcurves`.
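
The chunked filter can be exercised on a small in-memory CSV before running it against the full file. The snippet below is a sketch under stated assumptions: the CSV text, the `mjd` and `flux` values, the ID set `wanted_ids`, and the chunk size of 5 are all made up for illustration. Only the pattern (`pd.read_csv` with `chunksize`, then `isin` on `object_id`) mirrors the steps above.

```python
import io

import pandas as pd

# Tiny stand-in for training_set.csv: 6 objects x 4 observations (made-up values).
csv_text = "object_id,mjd,flux\n" + "\n".join(
    f"{oid},{59000 + t},{oid * 0.1 + t}" for oid in range(6) for t in range(4)
)
wanted_ids = {1, 4}  # hypothetical sampled object IDs

kept = []
# chunksize makes read_csv yield DataFrames of at most 5 rows each,
# so only one small chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    match = chunk[chunk["object_id"].isin(wanted_ids)]
    if not match.empty:
        kept.append(match)

subset = pd.concat(kept, ignore_index=True)
```

Observations for one object can straddle a chunk boundary, as they do here. Filtering each chunk independently and concatenating afterwards still recovers every row.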

### 6) Save sampled light curves
- Writes `sample_lightcurves.csv`, containing all time-series observations for the sampled objects.
- This dramatically reduces data size while preserving realistic, irregular astronomical light curves.

### Why this step matters
- Keeps the project **reproducible**: everyone uses the same object subset
- Keeps it **practical**: avoids multi-GB processing in later notebooks
- Preserves **scientific validity**: real, unmodified PLAsTiCC observations
- Enables fair comparison between classical and quantum models on identical data
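
Because every downstream notebook relies on the two saved files describing the same objects, a quick consistency check can catch a mismatch early. The sketch below is one possible check, not part of the original notebook: `check_sample_consistency` is a hypothetical helper, and the two small DataFrames stand in for `sample_metadata.csv` and `sample_lightcurves.csv` with invented values.

```python
import pandas as pd


def check_sample_consistency(meta, lightcurves, id_col="object_id"):
    """Compare the object IDs in the metadata and light-curve tables.

    Hypothetical helper for illustration only.
    """
    meta_ids = set(meta[id_col])
    lc_ids = set(lightcurves[id_col])
    return {
        # Sampled objects with no observations in the light-curve table.
        "missing_lightcurves": sorted(meta_ids - lc_ids),
        # Observations for objects that were never sampled.
        "unexpected_objects": sorted(lc_ids - meta_ids),
        # Objects listed more than once in the metadata.
        "duplicate_metadata_ids": int(meta[id_col].duplicated().sum()),
    }


# Made-up stand-ins for the two saved files.
meta = pd.DataFrame({"object_id": [1, 2, 3], "target": [90, 42, 90]})
lightcurves = pd.DataFrame({
    "object_id": [1, 1, 2, 2],
    "flux": [0.1, 0.2, 0.3, 0.4],
})

report = check_sample_consistency(meta, lightcurves)
```

In this toy case object 3 has metadata but no observations, so it appears under `missing_lightcurves`. On the real files, all three entries would be expected to come back empty or zero.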

**Next step:** run the feature extraction notebook to convert these light curves into engineered features.

In [1]:
import os
import numpy as np
import pandas as pd

# ============================================================================
# CONFIGURATION - INCREASED DATASET SIZE
# ============================================================================

N_PER_CLASS = 1000  

DATA_DIR = os.path.join("..", "data", "plasticc")
META_PATH = os.path.join(DATA_DIR, "training_set_metadata.csv")
LC_PATH = os.path.join(DATA_DIR, "training_set.csv")

print(f"Target: {N_PER_CLASS} samples per class = {N_PER_CLASS * 2} total")

# ============================================================================
# LOAD FULL METADATA
# ============================================================================

metadata = pd.read_csv(META_PATH)
print(f"Total objects in metadata: {len(metadata)}")

# Focus on SNIa (90) vs SNII (42)
SELECTED_CLASSES = [90, 42]
binary_metadata = metadata[metadata["target"].isin(SELECTED_CLASSES)].copy()

print(f"\nAvailable samples:")
print(f"  SNIa (90): {(binary_metadata['target'] == 90).sum()}")
print(f"  SNII (42): {(binary_metadata['target'] == 42).sum()}")

# ============================================================================
# SAMPLE 1000 OF EACH CLASS
# ============================================================================

# Global NumPy seed; the .sample() calls below also pin random_state explicitly.
np.random.seed(42)

snia_all = binary_metadata[binary_metadata["target"] == 90]
snii_all = binary_metadata[binary_metadata["target"] == 42]

# Take N_PER_CLASS of each (or all available, if a class has fewer)
n_snia = min(N_PER_CLASS, len(snia_all))
n_snii = min(N_PER_CLASS, len(snii_all))

snia_sample = snia_all.sample(n_snia, random_state=42)
snii_sample = snii_all.sample(n_snii, random_state=42)

sample_metadata = pd.concat([snia_sample, snii_sample], ignore_index=True)

print(f"\n✓ Sampled dataset:")
print(f"  Total: {len(sample_metadata)}")
print(f"  SNIa: {(sample_metadata['target'] == 90).sum()}")
print(f"  SNII: {(sample_metadata['target'] == 42).sum()}")

# ============================================================================
# SAVE SAMPLE METADATA
# ============================================================================

SAMPLE_META_PATH = os.path.join(DATA_DIR, "sample_metadata.csv")
sample_metadata.to_csv(SAMPLE_META_PATH, index=False)
print(f"\n✓ Saved: {SAMPLE_META_PATH}")

# ============================================================================
# EXTRACT LIGHTCURVES FOR SAMPLE
# ============================================================================

sample_ids = sample_metadata["object_id"].values
sample_ids_set = set(sample_ids)

print(f"\n📊 Extracting lightcurves for {len(sample_ids_set)} objects...")

chunksize = 100000
collected_chunks = []

for i, chunk in enumerate(pd.read_csv(LC_PATH, chunksize=chunksize)):
    mask = chunk["object_id"].isin(sample_ids_set)
    sub = chunk[mask]
    if not sub.empty:
        collected_chunks.append(sub)
    
    if i % 10 == 0:
        print(f"  Processed {i * chunksize:,} rows...")

if collected_chunks:
    sample_lightcurves = pd.concat(collected_chunks, ignore_index=True)
else:
    sample_lightcurves = pd.DataFrame()

print(f"\n✓ Total lightcurve rows: {len(sample_lightcurves):,}")

# ============================================================================
# SAVE SAMPLE LIGHTCURVES
# ============================================================================

SAMPLE_LC_PATH = os.path.join(DATA_DIR, "sample_lightcurves.csv")
sample_lightcurves.to_csv(SAMPLE_LC_PATH, index=False)
print(f"✓ Saved: {SAMPLE_LC_PATH}")

print("\n" + "=" * 70)
print("DATASET GENERATION COMPLETE!")
print("=" * 70)
print(f"Next step: Run feature extraction notebook (02_)")

Target: 1000 samples per class = 2000 total
Total objects in metadata: 7848

Available samples:
  SNIa (90): 2313
  SNII (42): 1193

✓ Sampled dataset:
  Total: 2000
  SNIa: 1000
  SNII: 1000

✓ Saved: ../data/plasticc/sample_metadata.csv

📊 Extracting lightcurves for 2000 objects...
  Processed 0 rows...
  Processed 1,000,000 rows...

✓ Total lightcurve rows: 381,810
✓ Saved: ../data/plasticc/sample_lightcurves.csv

DATASET GENERATION COMPLETE!
Next step: Run feature extraction notebook (02_)
