<a href="https://colab.research.google.com/github/kpmadabhu/EEE598_proj_tuh_eeg/blob/main/EEE598_Proj_TUH_EEG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TUSZ Preprocessing Pipeline

This notebook builds a seizure-vs-background window dataset from tcp-bipolar montage EEG in the TUH Seizure Corpus (TUSZ).

- Recursively walks `train/`, `dev/`, or `eval/` under `ROOT_DIR`.
- For each EDF:
  - Load the EEG and its seizure annotations (`.csv`).
  - Bandpass 0.5–40 Hz using a Kaiser FIR, zero-phase (`filtfilt`).
  - Downsample to 250 Hz *if* original fs is higher; never upsample.
  - Robust z-score per channel (median/IQR).
  - Slice into 10 s windows with 5 s hop.
  - Label each window seizure(1) or background(0), pooling all seizure subtypes.
- Skips recordings that are too short for `filtfilt` padding.
- Saves:
  - `split_windows.npz` with windows, labels, and metadata.
  - `qc_window.png` plotting an informative example window.
  - `class_balance.png` showing seizure/background counts.
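
The filtering, downsampling, and normalization steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: the function name `preprocess_recording`, the Kaiser `beta` value, and the choice to divide by the raw IQR without rescaling are all assumptions. The "too short" check relies on SciPy's default `filtfilt` padding of `3 * max(len(a), len(b))` samples.

```python
from math import gcd

import numpy as np
from scipy.signal import firwin, filtfilt, resample_poly


def preprocess_recording(data, fs, num_taps=1001, target_fs=250.0, kaiser_beta=5.0):
    """Bandpass, optionally downsample, and robust-z-score a (n_channels, n_samples) array.

    Hypothetical helper. Returns (processed, fs_out), or None when the
    recording is too short for filtfilt's default edge padding.
    """
    # filtfilt pads each edge with 3 * max(len(a), len(b)) samples by default
    # and needs the signal to be strictly longer than that.
    padlen = 3 * num_taps
    if data.shape[1] <= padlen:
        return None

    # Kaiser-windowed FIR bandpass, 0.5-40 Hz (kaiser_beta is an assumed value).
    taps = firwin(num_taps, [0.5, 40.0], pass_zero=False,
                  window=("kaiser", kaiser_beta), fs=fs)
    filtered = filtfilt(taps, [1.0], data, axis=1)  # forward-backward => zero phase

    # Downsample only; recordings at or below target_fs keep their rate.
    fs_out = float(fs)
    if fs > target_fs:
        fs_i, tgt_i = int(round(fs)), int(round(target_fs))
        g = gcd(fs_i, tgt_i)
        filtered = resample_poly(filtered, tgt_i // g, fs_i // g, axis=1)
        fs_out = float(target_fs)

    # Robust z-score per channel: subtract the median, divide by the IQR.
    med = np.median(filtered, axis=1, keepdims=True)
    q75, q25 = np.percentile(filtered, [75, 25], axis=1, keepdims=True)
    iqr = np.maximum(q75 - q25, 1e-12)  # guard against flat channels
    return (filtered - med) / iqr, fs_out
```

With `NUM_TAPS = 1001`, the default padding is 3003 samples, so under these assumptions a 250 Hz recording shorter than about 12 s would be skipped. This is why a shorter FIR lets more short clips through.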

## Smart Canonical Montage (what makes THIS version special)
We do **not** force a giant hard-coded canonical bipolar channel list upfront.
Instead we learn a canonical layout from *inside the split*:

1. The **first usable recording** becomes our **reference layout**.
   - We store its channel names/ordering. Call this `ref_ch_names`.
2. Every later recording is aligned to that same reference layout:
   - For each reference channel name (like `FP1-F7`), we try to find the best-matching channel name in the new recording using a robust normalizer.
   - If found, we copy that channel.
   - If missing, we fill zeros in that row.
3. We compute a **coverage ratio** = fraction of reference channels we could actually fill with real data from this recording.
   - If coverage is too low (default <30%), we SKIP that recording to avoid mostly-zero data.
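
The alignment steps above might look something like the sketch below. The names `normalize_ch_name` and `align_to_reference` are hypothetical, and the normalizer shown here (uppercase, strip an `EEG ` prefix and `-REF`/`-LE` suffixes, then exact match) is a simplified stand-in for whatever "best-matching" logic the notebook actually uses.

```python
import re

import numpy as np


def normalize_ch_name(name):
    """Hypothetical normalizer: 'EEG FP1-REF' -> 'FP1', 'eeg fp1-f7' -> 'FP1-F7'."""
    s = name.upper().strip()
    s = re.sub(r"^EEG\s+", "", s)       # drop a leading 'EEG ' prefix
    s = re.sub(r"-(REF|LE)\b", "", s)   # drop reference suffixes
    return re.sub(r"[^A-Z0-9\-]", "", s)


def align_to_reference(data, ch_names, ref_ch_names, min_coverage=0.30):
    """Reorder data (n_channels, n_samples) to the reference layout.

    Missing reference channels become zero rows. Returns (aligned, coverage);
    aligned is None when coverage falls below min_coverage.
    """
    # Map normalized name -> first row index carrying that name.
    lookup = {}
    for i, name in enumerate(ch_names):
        lookup.setdefault(normalize_ch_name(name), i)

    aligned = np.zeros((len(ref_ch_names), data.shape[1]), dtype=data.dtype)
    n_found = 0
    for row, ref in enumerate(ref_ch_names):
        idx = lookup.get(normalize_ch_name(ref))
        if idx is not None:
            aligned[row] = data[idx]
            n_found += 1

    coverage = n_found / len(ref_ch_names)
    if coverage < min_coverage:
        return None, coverage
    return aligned, coverage
```

In this sketch, the first usable recording would supply `ref_ch_names`, and every later recording would pass through `align_to_reference` before windowing.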


## Runtime knobs
- `MAX_SUBJECTS`: cap how many subjects to include (fast debug).
- `MAX_RECORDINGS`: cap recordings per subject (or overall).
- `NUM_TAPS`: FIR length. Shorter => fewer clips skipped for being "too short".
- `GROUP_MODE`: iterate by `"subject"` or just all EDFs (`"recording"`).
- `min_coverage`: skip a recording if < this fraction of reference channels could be mapped.
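
As a rough illustration of how `GROUP_MODE`, `MAX_SUBJECTS`, and `MAX_RECORDINGS` could interact, consider the sketch below. The helper `select_edfs` is hypothetical, and it assumes the subject ID is the first underscore-separated token of the EDF filename (for example `aaaaaaac` in `aaaaaaac_s001_t000.edf`); adjust this if your copy of the corpus is laid out differently.

```python
from collections import defaultdict
from pathlib import Path


def select_edfs(edf_paths, group_mode="subject", max_subjects=None, max_recordings=None):
    """Return a capped, sorted list of EDF paths (hypothetical helper)."""
    paths = sorted(Path(p) for p in edf_paths)

    if group_mode == "recording":
        # Flat iteration: max_recordings is an overall cap.
        return paths if max_recordings is None else paths[:max_recordings]

    # Subject mode: group by the assumed subject ID in the filename.
    by_subject = defaultdict(list)
    for p in paths:
        by_subject[p.stem.split("_")[0]].append(p)

    subjects = sorted(by_subject)
    if max_subjects is not None:
        subjects = subjects[:max_subjects]

    selected = []
    for subject in subjects:
        recs = by_subject[subject]
        # In subject mode, max_recordings caps recordings per subject.
        selected.extend(recs if max_recordings is None else recs[:max_recordings])
    return selected
```

A typical call under these assumptions would be `select_edfs(ROOT_DIR.rglob("*.edf"), GROUP_MODE, MAX_SUBJECTS, MAX_RECORDINGS)`.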

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import mne
from scipy.signal import firwin, filtfilt, resample_poly
from math import gcd
import io, re

# ------------------ USER CONFIG ------------------
# Point this at the root of the TUSZ split folders that contain train/dev/eval
ROOT_DIR = Path("Enter the mounted OneDrive location of the TUSZ edf folder").expanduser().resolve()

# Output directory for NPZ + plots
OUT_DIR  = Path("output folder to save the windows and coverage stats").expanduser().resolve()
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Windowing params
WIN_LEN_S = 10.0   # seconds per window
HOP_S     = 5.0    # seconds hop (50% overlap)

# DSP params
NUM_TAPS  = 1001    # Kaiser FIR taps for bandpass 0.5-40Hz (shorter -> fewer skips)
TARGET_FS = 250.0  # We'll downsample to 250 Hz if fs_orig > 250 Hz; never upsample

# Runtime limiting params
MAX_SUBJECTS   = 10        # limit number of subjects used (debug speed). None = no explicit cap
MAX_RECORDINGS = 10        # max recordings per subject / overall. None = no explicit cap
GROUP_MODE     = "subject" # "subject" or "recording"

# Seizure subtype labels we consider "seizure" (label=1)
# Everything else becomes background (label=0)
SEIZURE_LABELS = {
    'seiz','fnsz','gnsz','spsz','cpsz','absz',
    'tnsz','cnsz','tcsz','atsz','mysz'
}

print("ROOT_DIR:", ROOT_DIR)
print("OUT_DIR:", OUT_DIR)
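
To show how `WIN_LEN_S`, `HOP_S`, and `SEIZURE_LABELS` might be used downstream, here is a minimal windowing and labeling sketch. The function `window_and_label` and the `(start_s, stop_s, label)` event format are hypothetical. The rule that a window counts as seizure when it overlaps any seizure annotation at all is also an assumption; the notebook may apply a different threshold.

```python
import numpy as np

# Same values as the config cell, repeated so the sketch runs on its own.
WIN_LEN_S = 10.0
HOP_S = 5.0
SEIZURE_LABELS = {
    'seiz', 'fnsz', 'gnsz', 'spsz', 'cpsz', 'absz',
    'tnsz', 'cnsz', 'tcsz', 'atsz', 'mysz',
}


def window_and_label(data, fs, events, win_len_s=WIN_LEN_S, hop_s=HOP_S):
    """Slice (n_channels, n_samples) data into overlapping windows.

    events: iterable of (start_s, stop_s, label) annotation tuples (assumed format).
    Returns (windows, labels) with shapes (n_win, n_channels, win) and (n_win,).
    """
    win = int(round(win_len_s * fs))
    hop = int(round(hop_s * fs))

    # Pool all seizure subtypes; everything else is treated as background.
    seiz = [(t0, t1) for t0, t1, lab in events
            if lab.strip().lower() in SEIZURE_LABELS]

    windows, labels = [], []
    for start in range(0, data.shape[1] - win + 1, hop):
        w0 = start / fs
        w1 = w0 + win_len_s
        # Assumed rule: any overlap with a seizure interval => label 1.
        is_seizure = any(t0 < w1 and t1 > w0 for t0, t1 in seiz)
        windows.append(data[:, start:start + win])
        labels.append(int(is_seizure))

    if not windows:
        empty = np.empty((0, data.shape[0], win), dtype=data.dtype)
        return empty, np.empty(0, dtype=np.int64)
    return np.stack(windows), np.asarray(labels, dtype=np.int64)
```

At 250 Hz these defaults give 2500-sample windows with a 1250-sample hop, so a 30 s recording yields five windows.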
