# Data Preparation Pipeline

## Pipeline Overview

**Stage 1: Environment Setup**
- Initialize reproducible environment with fixed random seeds
- Load and validate all configuration files
- Audit raw data file inventory
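
The seeding and inventory steps above could look roughly like the sketch below. It is not the body of `initialize_notebook`; the helper names `set_global_seed` and `audit_raw_files` are hypothetical and only illustrate the idea.

```python
import random
from pathlib import Path

import numpy as np


def set_global_seed(seed: int) -> None:
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def audit_raw_files(raw_dir: Path, expected: list[str]) -> dict[str, bool]:
    """Report, for each expected file name, whether it exists in raw_dir."""
    return {name: (raw_dir / name).is_file() for name in expected}
```

A call such as `audit_raw_files(Path("data/raw"), ["abcd_p_demo.csv", "mh_p_cbcl.csv"])` returns a mapping that can be checked before any loading starts, so a missing file fails early rather than midway through the merge.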

**Stage 3: Feature Engineering**
- Select neuroimaging features by family (DTI, cortical area, thickness, etc.)
- Create derived labels (anxiety groups from t-scores)
- Apply transformations (e.g., sex coding)
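
As a rough illustration of the three steps above, the sketch below selects columns by name prefix, bins a t-score into groups, and recodes sex. The cut points (65 and 70), the group labels, the 0/1 sex coding, and all function names are assumptions made for this example; the notebook's real values presumably come from its configuration files.

```python
import numpy as np
import pandas as pd

# Hypothetical cut points and labels, not taken from the project config.
ANXIETY_BINS = [-np.inf, 65, 70, np.inf]
ANXIETY_LABELS = ["normal", "borderline", "clinical"]


def select_family(df: pd.DataFrame, prefixes: list[str]) -> list[str]:
    """Return the column names that start with any of the given prefixes."""
    return [c for c in df.columns if c.startswith(tuple(prefixes))]


def add_anxiety_group(df: pd.DataFrame,
                      t_col: str = "cbcl_scr_dsm5_anxdisord_t") -> pd.DataFrame:
    """Bin t-scores into left-closed intervals: [-inf, 65), [65, 70), [70, inf)."""
    out = df.copy()
    out["anxiety_group"] = pd.cut(
        out[t_col], bins=ANXIETY_BINS, labels=ANXIETY_LABELS, right=False
    )
    return out


def code_sex(df: pd.DataFrame, col: str = "demo_sex_v2") -> pd.DataFrame:
    """Map 1 -> 0 and 2 -> 1 (assumed coding); any other value becomes NaN."""
    out = df.copy()
    out["sex_coded"] = out[col].map({1.0: 0, 2.0: 1})
    return out
```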

**Stage 4: Quality Control**
- Apply surface holes QC policy to remove poor-quality scans
- Generate QC visualizations and reports
- Save pre-QC dataset for visualization and post-QC for downstream analysis
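
A threshold-based QC filter might be sketched as follows. Two things here are assumptions: that the surface holes count is the `apqc_smri_topo_ndefect` column seen in the merged data, and the cutoff of 30, which is a placeholder rather than the project's policy. The function name is hypothetical.

```python
import pandas as pd


def apply_surface_holes_qc(df: pd.DataFrame,
                           col: str = "apqc_smri_topo_ndefect",
                           max_holes: int = 30):
    """Split df into (passed, failed) rows.

    A row passes when the defect count is present and <= max_holes.
    Rows with a missing count are treated as failures.
    """
    keep = df[col].notna() & (df[col] <= max_holes)
    return df[keep].copy(), df[~keep].copy()
```

Returning both halves keeps the excluded scans available for the QC report, while only the first frame moves on to downstream analysis.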

**Stage 5: Data Splitting**
- Create stratified train/validation/test splits
- Ensure reproducible splits with fixed random seeds
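
One way to get a stratified, reproducible three-way split is sketched below using only NumPy and pandas. The 70/15/15 fractions, the seed, and the function name are assumptions for illustration; scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import numpy as np
import pandas as pd


def stratified_split(df: pd.DataFrame, stratify_col: str,
                     fractions=(0.7, 0.15, 0.15), seed: int = 42) -> pd.Series:
    """Return a Series of 'train'/'val'/'test' labels aligned to df.index.

    Each stratum is shuffled with a seeded generator and then cut by the
    given fractions, so class proportions are roughly preserved in every split.
    Assumes df.index is unique.
    """
    rng = np.random.default_rng(seed)
    labels = pd.Series(index=df.index, dtype="object")
    for _, idx in df.groupby(stratify_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))
        n_train = int(round(fractions[0] * len(shuffled)))
        n_val = int(round(fractions[1] * len(shuffled)))
        parts = np.split(shuffled, [n_train, n_train + n_val])
        for name, part in zip(("train", "val", "test"), parts):
            if len(part):
                labels.loc[part] = name
    return labels
```

Because the generator is created from `seed` inside the function, calling it twice on the same frame yields identical assignments.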

In [None]:
%load_ext autoreload
%autoreload 2

from core.config import initialize_notebook, save_summary
from core.preprocess import save_datasets, load_and_merge

env = initialize_notebook("dataprep", "first_run") 


Initialized notebook: dataprep 
Use "env.variable" to access project variables
Use "env.configs.yaml_file[key]" to access yaml files
Saved output summary to /Users/lukeengel/Desktop/abcd/outputs/first_run:1


**Stage 2: Data Merging & Validation**
- Merge multiple raw CSV files on subject ID
- Validate data schema and integrity
- Filter to baseline timepoint only
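
The steps above can be sketched as follows. This is not the implementation of `load_and_merge`; the function names are hypothetical. The sketch also assumes the join key is subject ID plus `eventname` and that an outer join is used, since the merged table holds several events per subject.

```python
from functools import reduce

import pandas as pd

KEYS = ["src_subject_id", "eventname"]  # assumed join keys
BASELINE = "baseline_year_1_arm_1"


def merge_on_keys(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Outer-merge frames on KEYS after checking keys exist and are unique."""
    for i, frame in enumerate(frames):
        missing = [k for k in KEYS if k not in frame.columns]
        if missing:
            raise ValueError(f"frame {i} is missing key columns: {missing}")
        if frame.duplicated(subset=KEYS).any():
            raise ValueError(f"frame {i} has duplicate key rows")
    return reduce(
        lambda left, right: left.merge(
            right, on=KEYS, how="outer", validate="one_to_one"
        ),
        frames,
    )


def filter_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows recorded at the baseline event."""
    return df[df["eventname"] == BASELINE].reset_index(drop=True)
```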

In [29]:
load_and_merge(env)

                                                              


Dtype Applications Summary:

| Column | Requested | Final | Status | File |
|---|---|---|---|---|
| src_subject_id | string | object | ✗ | data/raw/abcd_p_demo.csv |
| eventname | category | object | ✗ | data/raw/abcd_p_demo.csv |
| demo_sex_v2 | category | float64 | ✗ | data/raw/abcd_p_demo.csv |
| demo_brthdat_v2 | int16 | float64 | ✗ | data/raw/abcd_p_demo.csv |
| src_subject_id | string | string | ✓ | data/raw/mh_p_cbcl.csv |
| eventname | category | category | ✓ | data/raw/mh_p_cbcl.csv |
| cbcl_scr_dsm5_anxdisord_t | float32 | float32 | ✓ | data/raw/mh_p_cbcl.csv |
| mri_info_manufacturer | category | category | ✓ | data/raw/mri_y_adm_info.csv |
| mri_info_deviceserialnumber | string | string | | |
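
A summary like the one above can be produced by attempting each cast and recording the resulting dtype. The sketch below is a hypothetical helper, not the project's code. It also shows one common way an `int16` request ends up as `float64`: an integer cast fails when the column contains missing values. Whether that explains the ✗ rows here is not shown in the output.

```python
import pandas as pd


def apply_dtypes(df: pd.DataFrame, requested: dict[str, str]):
    """Try to cast each column; return the new frame and a status report."""
    out = df.copy()
    rows = []
    for col, dtype in requested.items():
        try:
            out[col] = out[col].astype(dtype)
        except (ValueError, TypeError):
            # Leave the column unchanged when the cast is not possible,
            # e.g. float values with NaN cannot become int16.
            pass
        final = str(out[col].dtype)
        rows.append({"column": col, "requested": dtype,
                     "final": final, "ok": final == dtype})
    return out, pd.DataFrame(rows)
```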

| | src_subject_id | eventname | demo_brthdat_v2 | demo_sex_v2 | cbcl_scr_dsm5_anxdisord_t | mri_info_manufacturer | mri_info_deviceserialnumber | apqc_smri_topo_ndefect | dmri_dtifa_fiberat_allfibers | dmri_dtifa_fiberat_allfiblh | ... | mrisdp_595 | mrisdp_596 | mrisdp_597 | mrisdp_598 | mrisdp_599 | mrisdp_600 | mrisdp_601 | mrisdp_602 | mrisdp_603 | mrisdp_604 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NDAR_INV003RTV85 | baseline_year_1_arm_1 | 10.0 | 2.0 | 50.0 | SIEMENS | HASH96a0c182 | 22 | 0.524779 | 0.527154 | ... | 2617.0 | 2458.0 | 649.0 | 2889.0 | 1819.0 | 13322.0 | 616.0 | 280171.0 | 279542.0 | 559713.0 |
| 1 | NDAR_INV005V6D2C | baseline_year_1_arm_1 | 10.0 | 2.0 | 50.0 | GE MEDICAL SYSTEMS | HASHe3ce02d3 | 24 | 0.516735 | 0.516379 | ... | 1961.0 | 2954.0 | 365.0 | 1900.0 | 2136.0 | 11746.0 | 468.0 | 256557.0 | 258381.0 | 514938.0 |
| 2 | NDAR_INV007W6H7B | baseline_year_1_arm_1 | 10.0 | 1.0 | 54.0 | GE MEDICAL SYSTEMS | HASH48f7cbc3 | 24 | 0.502483 | 0.515888 | ... | 3910.0 | 2454.0 | 543.0 | 3036.0 | 3000.0 | 16321.0 | 667.0 | 352798.0 | 349789.0 | 702587.0 |
| 3 | NDAR_INV00BD7VDC | baseline_year_1_arm_1 | 9.0 | 1.0 | 52.0 | SIEMENS | HASH65b39280 | 12 | 0.499305 | 0.501345 | ... | 3963.0 | 3540.0 | 315.0 | 3395.0 | 2802.0 | 18065.0 | 590.0 | 363851.0 | 364491.0 | 728342.0 |
| 4 | NDAR_INV00CY2MDM | baseline_year_1_arm_1 | 10.0 | 1.0 | 52.0 | SIEMENS | HASHd422be27 | 9 | 0.508850 | 0.511712 | ... | 3225.0 | 2458.0 | 616.0 | 2601.0 | 2092.0 | 11821.0 | 466.0 | 274089.0 | 278844.0 | 552933.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21873 | NDAR_INVZZZNB0XC | baseline_year_1_arm_1 | 9.0 | 2.0 | 50.0 | SIEMENS | HASH5b0cf1bb | 9 | 0.496990 | 0.501284 | ... | 4324.0 | 2409.0 | 253.0 | 2169.0 | 2318.0 | 12817.0 | 601.0 | 290800.0 | 292562.0 | 583362.0 |
| 21874 | NDAR_INVZZZNB0XC | 4_year_follow_up_y_arm_1 | NaN | NaN | 52.0 | SIEMENS | HASH5b0cf1bb | 10 | 0.521199 | 0.524874 | ... | 4214.0 | 2509.0 | 255.0 | 2085.0 | 2114.0 | 12499.0 | 600.0 | 280475.0 | 282326.0 | 562801.0 |
| 21875 | NDAR_INVZZZP87KR | baseline_year_1_arm_1 | 10.0 | 2.0 | 50.0 | Philips Medical Systems | HASH5ac2b20b | 34 | 0.494268 | 0.494769 | ... | 2929.0 | 2777.0 | 400.0 | 2326.0 | 1969.0 | 14295.0 | 800.0 | 301536.0 | 303373.0 | 604909.0 |
| 21876 | NDAR_INVZZZP87KR | 2_year_follow_up_y_arm_1 | NaN | NaN | 50.0 | Philips Medical Systems | HASH5ac2b20b | 52 | 0.517065 | 0.524581 | ... | 2923.0 | 2368.0 | 399.0 | 2394.0 | 1786.0 | 13515.0 | 703.0 | 283040.0 | 283992.0 | 567032.0 |


In [None]:
save_summary(env)