# Data Preparation Pipeline

## Pipeline Overview

**Stage 1: Environment Setup**
- Initialize reproducible environment with fixed random seeds
- Load and validate all configuration files
- Audit raw data file inventory

**Stage 3: Feature Engineering**
- Select neuroimaging features by family (DTI, cortical area, thickness, etc.)
- Create derived labels (anxiety groups from t-scores)
- Apply transformations (e.g., sex coding)
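
A minimal pandas sketch of these three operations follows. Every column name, the family prefixes, the t-score cut-offs (60 and 65), and the 0/1 sex coding are assumptions chosen for the example, not values taken from this pipeline's configuration.

```python
import pandas as pd

# Toy data; column names are hypothetical.
df = pd.DataFrame(
    {
        "subject_id": ["s1", "s2", "s3", "s4"],
        "anxiety_tscore": [50, 62, 70, 65],
        "sex": ["M", "F", "F", "M"],
        "dti_fa_mean": [0.41, 0.39, 0.44, 0.40],
        "thick_mean": [2.5, 2.7, 2.6, 2.4],
    }
)

# Select features by family, assuming each family shares a column prefix.
family_prefixes = ("dti_", "thick_")
feature_cols = [c for c in df.columns if c.startswith(family_prefixes)]

# Derive anxiety groups from t-scores with assumed cut-offs:
# [-inf, 60) -> low, [60, 65) -> borderline, [65, inf) -> high.
bins = [-float("inf"), 60, 65, float("inf")]
group_names = ["low", "borderline", "high"]
df["anxiety_group"] = pd.cut(
    df["anxiety_tscore"], bins=bins, labels=group_names, right=False
)

# Encode sex numerically (assumed coding: M -> 0, F -> 1).
df["sex_coded"] = df["sex"].map({"M": 0, "F": 1})
```

Passing `right=False` makes each bin closed on the left, so a score exactly at a cut-off falls into the higher group.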

**Stage 4: Quality Control**
- Apply surface holes QC policy to remove poor-quality scans
- Generate QC visualizations and reports
- Save the pre-QC dataset for visualization and the post-QC dataset for downstream analysis
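
One way a threshold-based QC filter could look is sketched below. The column name `surface_holes`, the threshold of 100, and the function name are all hypothetical; the actual policy is defined by the pipeline's configuration. The function returns both the kept and the dropped rows, which makes it easy to keep a pre-QC copy for plots.

```python
import pandas as pd


def apply_surface_holes_qc(df, column="surface_holes", max_holes=100):
    """Split df into (kept, dropped) rows using an assumed hole-count threshold.

    Rows with a missing hole count are treated as failing QC.
    """
    keep = df[column].notna() & (df[column] <= max_holes)
    return df.loc[keep].copy(), df.loc[~keep].copy()


# Toy data; values are invented for illustration.
pre_qc = pd.DataFrame(
    {
        "subject_id": ["s1", "s2", "s3"],
        "surface_holes": [12, 250, None],
    }
)
post_qc, dropped = apply_surface_holes_qc(pre_qc)
```

Here `pre_qc` is left untouched, so it can still be saved or visualized alongside `post_qc`.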

**Stage 5: Data Splitting**
- Create stratified train/validation/test splits
- Ensure reproducible splits with fixed random seeds
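
The idea behind a seeded, stratified split can be sketched with the standard library alone: group row indices by label, shuffle each group with a fixed seed, and cut every group in the same proportions. The function name, the 70/15/15 fractions, and the seed are assumptions for this example; in practice a library routine such as scikit-learn's `train_test_split` with `stratify=` is a common alternative.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return index lists for train/val/test, stratified by label.

    Hypothetical helper: fractions and seed are illustrative defaults.
    """
    rng = random.Random(seed)  # local generator, independent of global state
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):  # fixed class order keeps results stable
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


labels = ["low"] * 20 + ["high"] * 20
splits = stratified_split(labels, seed=0)
splits_again = stratified_split(labels, seed=0)
```

Because the generator is created inside the function from the seed, repeated calls with the same labels and seed return identical splits.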

In [None]:
%load_ext autoreload
%autoreload 2

from core.config import initialize_notebook, save_summary
from core.preprocess import save_datasets, load_and_merge

env = initialize_notebook("dataprep", "first_run")

**Stage 2: Data Merging & Validation**
- Merge multiple raw CSV files on subject ID
- Validate data schema and integrity
- Filter to baseline timepoint only
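
To make the merge-validate-filter sequence concrete, here is a small pandas sketch on in-memory CSV text. It is not a description of what `load_and_merge` does internally; the column names (`subject_id`, `timepoint`), the `"baseline"` label, and the file contents are assumptions for illustration.

```python
import io
from functools import reduce

import pandas as pd

# Stand-ins for raw CSV files; contents are invented.
demographics_csv = io.StringIO(
    "subject_id,timepoint,sex\n"
    "s1,baseline,M\n"
    "s1,year2,M\n"
    "s2,baseline,F\n"
)
imaging_csv = io.StringIO(
    "subject_id,timepoint,thick_mean\n"
    "s1,baseline,2.5\n"
    "s1,year2,2.4\n"
    "s2,baseline,2.7\n"
)

keys = ["subject_id", "timepoint"]
frames = [pd.read_csv(f) for f in (demographics_csv, imaging_csv)]

# Merge all frames on the shared keys; validate= raises pandas' MergeError
# if a key combination is duplicated in either table.
merged = reduce(
    lambda left, right: left.merge(
        right, on=keys, how="inner", validate="one_to_one"
    ),
    frames,
)

# Keep only the baseline visit, then check one row per subject remains.
baseline = merged[merged["timepoint"] == "baseline"].reset_index(drop=True)
assert baseline["subject_id"].is_unique
```

An inner join silently drops subjects missing from any file, so comparing row counts before and after the merge is a cheap integrity check.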

In [None]:
load_and_merge(env)

In [None]:
save_summary(env)