In [None]:
# Data Prep
## Set-up

### 01_make_full_hierarchy.py
**Purpose:** Creates a comprehensive geographic hierarchy by combining FHS and LSAE structures.

**Inputs/Output:**
- FHS: `/mnt/team/rapidresponse/pub/population-model/admin-inputs/raking/gbd-inputs/hierarchy_fhs_2021.parquet`
- LSAE: `/mnt/team/rapidresponse/pub/population-model/admin-inputs/raking/gbd-inputs/hierarchy_lsae_1209.parquet`
- GBD: `/mnt/team/rapidresponse/pub/population-model/admin-inputs/raking/gbd-inputs/hierarchy_gbd_2021.parquet`

- Output: `/mnt/team/idd/pub/forecast-mbp/02-processed_data/full_hierarchy_lsae_1209.parquet`
- Output: `/mnt/team/idd/pub/forecast-mbp/02-processed_data/lsae_to_fhs_table.parquet`
- Output: `/mnt/team/idd/pub/forecast-mbp/02-processed_data/lsae_to_gbd_table.parquet`

**Process:** Loads the hierarchies → Cleans columns → Creates base from FHS → Integrates LSAE locations (preserving parent-child relationships, adjusting paths, setting levels 4-5, maintaining regions) → Handles sort order → Resolves duplicates (prioritizing `most_detailed_fhs=1`) → Links each location to the most detailed FHS location at or above it → Links each location to the most detailed GBD location at or above it → Saves outputs

**Notes:** 640 locations in the LSAE hierarchy have no GBD ancestor below the global level (e.g., Aruba and its children). This script drops all of them.
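
The linking step can be sketched as a walk up each location's ancestor path. This is a minimal illustration, not the script's actual code: it assumes the hierarchy carries a comma-separated `path_to_top_parent` column ordered from the root down to the location itself, and the helper name `nearest_flagged_ancestor` and the toy IDs are hypothetical.

```python
import pandas as pd


def nearest_flagged_ancestor(hierarchy: pd.DataFrame, flag_col: str) -> pd.Series:
    """Return, per row, the closest location at or above it where flag_col == 1."""
    flagged = set(hierarchy.loc[hierarchy[flag_col] == 1, "location_id"])

    def resolve(path: str):
        # Assumption: the path runs root -> ... -> self, so reverse it to
        # check the location itself first, then its parent, and so on.
        for loc in reversed([int(p) for p in path.split(",")]):
            if loc in flagged:
                return loc
        return None  # no flagged location anywhere on the path

    return hierarchy["path_to_top_parent"].map(resolve)


# Toy hierarchy (hypothetical IDs): 1 -> 10 -> 100 -> 1000
toy = pd.DataFrame({
    "location_id": [1, 10, 100, 1000],
    "path_to_top_parent": ["1", "1,10", "1,10,100", "1,10,100,1000"],
    "most_detailed_fhs": [0, 0, 1, 0],
})
toy["fhs_location_id"] = nearest_flagged_ancestor(toy, "most_detailed_fhs")

# Locations with no match are dropped, mirroring the note above.
linked = toy.dropna(subset=["fhs_location_id"])
```

The same helper would work for the GBD link by passing a different flag column, if the hierarchy has one.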

**Significance:** Creates foundational geographic structure used throughout the forecasting pipeline.

### 02_rake_A2_to_GBD.py
**Purpose:** Adjusts malaria and dengue estimates from climate model outputs to match official GBD totals while preserving spatial patterns.

**Inputs/Output:**
- LSAE Hierarchy: `/mnt/team/rapidresponse/pub/population-model/admin-inputs/raking/gbd-inputs/hierarchy_lsae_1209.parquet`
- GBD Hierarchy: `/mnt/team/rapidresponse/pub/population-model/admin-inputs/raking/gbd-inputs/hierarchy_gbd_2023.parquet`
- Malaria GBD: `01-raw_data/gbd/gbd_2023_malaria_aa.csv`
- Dengue GBD: `01-raw_data/gbd/gbd_2023_dengue_aa.csv`
- Population: `/mnt/team/rapidresponse/pub/climate-aggregates/2025_03_20/results/lsae_1209/population.parquet`
- Output: `02-processed_data/raked_malaria_aa.parquet` and `02-processed_data/raked_dengue_aa.parquet`

**Process:** Load hierarchies and population → Process malaria data → Calculate raking factors (GBD/local ratios) → Apply factors to local estimates → Repeat for dengue → Calculate rates → Save outputs
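
The raking step described above (GBD/local ratio, then rescale) might look like the following sketch. The column names `gbd_location_id`, `count`, and `gbd_count`, and all values, are assumptions for illustration; the script's own schema may differ.

```python
import pandas as pd

# Local (admin-level) estimates, each tagged with its GBD ancestor (toy values).
local = pd.DataFrame({
    "location_id": [101, 102, 201],
    "gbd_location_id": [1, 1, 2],
    "year_id": [2020, 2020, 2020],
    "population": [1000.0, 3000.0, 500.0],
    "count": [10.0, 30.0, 5.0],
})
# Official GBD totals for the same location-years (toy values).
gbd = pd.DataFrame({
    "gbd_location_id": [1, 2],
    "year_id": [2020, 2020],
    "gbd_count": [60.0, 4.0],
})

keys = ["gbd_location_id", "year_id"]

# Sum the local estimates within each GBD location-year.
local_total = (
    local.groupby(keys, as_index=False)["count"].sum()
    .rename(columns={"count": "local_total"})
)

# Raking factor = GBD total / sum of local estimates.
factors = gbd.merge(local_total, on=keys, how="left")
factors["raking_factor"] = factors["gbd_count"] / factors["local_total"]

# Scale every local estimate by its factor, then convert to a rate.
raked = local.merge(factors[keys + ["raking_factor"]], on=keys, how="left")
raked["raked_count"] = raked["count"] * raked["raking_factor"]
raked["raked_rate"] = raked["raked_count"] / raked["population"]
```

Because every child of a GBD location is multiplied by the same factor, the relative spatial pattern within that location is unchanged while the children sum to the GBD total.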

**Significance:** Creates GBD-consistent disease estimates at high geographic resolution that maintain local patterns while aligning with official global totals.

### 03_as_fhs_and_admin_2_population.py
**Purpose:** Creates age-specific population datasets for LSAE geographic locations by applying age-sex distribution patterns from FHS locations to total population counts.

**Inputs/Output:**
- Hierarchy data: `/02-processed_data/full_hierarchy_lsae_1209.parquet`
- Age metadata: `/02-processed_data/age_specific_fhs/age_metadata.parquet`
- FHS hierarchy: `/02-processed_data/age_specific_fhs/fhs_hierarchy.parquet`
- LSAE population: `/mnt/team/rapidresponse/pub/climate-aggregates/2025_03_20/results/lsae_1209/population.parquet`
- Past FHS population: `/mnt/share/forecasting/data/9/past/population/20231002_etl_run_id_359/population.nc`
- Future FHS population: `/mnt/share/forecasting/data/9/future/population/20250219_draining_fix_old_pop_v5/population.nc`
- Output FHS fractions: `/02-processed_data/fhs_population.parquet`
- Output LSAE population: `/03-modeling_data/as_lsae_population_df.parquet`

**Process:** Set up environment and constants → Load geographical hierarchies → Process LSAE total population → Extract FHS population data from NetCDF files → Calculate age-specific population fractions → Generate all age-sex combinations → Apply FHS demographic patterns to LSAE locations → Save disaggregated population dataset
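
As a rough sketch of the disaggregation step: compute each age-sex group's share of the FHS location's population, then multiply the LSAE total by that share. The column names (`fhs_location_id`, `pop_fraction`, `population_total`) and all values are hypothetical, and the sketch assumes each LSAE location already carries the ID of its FHS ancestor.

```python
import pandas as pd

# Age-sex-specific FHS population for one location-year (toy values).
fhs_pop = pd.DataFrame({
    "fhs_location_id": [1, 1, 1, 1],
    "year_id": [2020, 2020, 2020, 2020],
    "age_group_id": [5, 5, 6, 6],
    "sex_id": [1, 2, 1, 2],
    "population": [100.0, 100.0, 150.0, 150.0],
})
keys = ["fhs_location_id", "year_id"]

# Fraction of the FHS location-year total in each age-sex group.
fhs_pop["pop_fraction"] = (
    fhs_pop["population"] / fhs_pop.groupby(keys)["population"].transform("sum")
)

# Total population of two LSAE locations under that FHS location (toy values).
lsae_pop = pd.DataFrame({
    "location_id": [101, 102],
    "fhs_location_id": [1, 1],
    "year_id": [2020, 2020],
    "population_total": [1000.0, 200.0],
})

# The merge fans each LSAE row out to one row per age-sex group.
as_lsae = lsae_pop.merge(
    fhs_pop[keys + ["age_group_id", "sex_id", "pop_fraction"]],
    on=keys,
    how="left",
)
as_lsae["population"] = as_lsae["population_total"] * as_lsae["pop_fraction"]
```

A useful sanity check under this approach is that the age-sex-specific populations sum back to the original LSAE total for each location-year.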

**Significance:** Enables age-specific disease modeling by providing properly disaggregated demographic data across all geographic units, allowing for more precise forecasting that accounts for age-specific disease patterns.

### 04_fhs_cause_quantities.py
**Purpose:** Calculates age-specific disease risk patterns from FHS locations to enable age-structured disease modeling.

**Inputs/Output:**
- Hierarchy: `/02-processed_data/full_hierarchy_lsae_1209.parquet`
- Age metadata: `/02-processed_data/age_specific_fhs/age_metadata.parquet`
- Malaria data: `/03-modeling_data/malaria_stage_2_modeling_df.parquet`
- FHS population: `/02-processed_data/fhs_population.parquet`
- FHS disease data: `/02-processed_data/age_specific_fhs/{age_type}_cause_id_{cause_id}_measure_id_{measure_id}_metric_id_{metric_id}_fhs.parquet`
- Output: `/03-modeling_data/fhs_{cause}_{measure}_{metric}_df.parquet`

**Process:** Load hierarchies and reference data → Apply location and time filters → Process all-age and age-specific datasets → Calculate absolute risks → Identify reference age-sex groups → Compute relative risk patterns → Save disease-specific modeling datasets
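
The absolute-risk and relative-risk steps could be sketched as below. How the script chooses its reference age-sex group is not stated here, so the fixed `REFERENCE_AGE_GROUP_ID` and `REFERENCE_SEX_ID` constants are assumptions, as are the column names and values.

```python
import pandas as pd

# Hypothetical reference group; the script may pick it differently.
REFERENCE_AGE_GROUP_ID = 6
REFERENCE_SEX_ID = 1

df = pd.DataFrame({
    "location_id": [1, 1, 1],
    "year_id": [2020, 2020, 2020],
    "age_group_id": [5, 6, 7],
    "sex_id": [1, 1, 1],
    "count": [20.0, 10.0, 5.0],
    "population": [1000.0, 1000.0, 1000.0],
})

# Absolute risk: cases (or deaths) per person in each age-sex group.
df["risk"] = df["count"] / df["population"]

# Pull out the reference group's risk for each location-year.
is_ref = (df["age_group_id"] == REFERENCE_AGE_GROUP_ID) & (df["sex_id"] == REFERENCE_SEX_ID)
reference = df.loc[is_ref, ["location_id", "year_id", "risk"]].rename(
    columns={"risk": "reference_risk"}
)

# Relative risk: each group's risk divided by the reference group's risk.
df = df.merge(reference, on=["location_id", "year_id"], how="left")
df["relative_risk"] = df["risk"] / df["reference_risk"]
```

By construction the reference group has a relative risk of 1, and every other group is expressed as a multiple of it.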

**Significance:** Creates standardized representations of how disease risk varies across age groups, enabling age-structured forecasting models to accurately distribute disease burden across demographic groups.

### 01_malaria_modeling_dataframe.py
**Purpose:** Prepares comprehensive malaria modeling datasets by integrating disease metrics with climate, economic, and urbanization covariates.

**Inputs/Output:**
- Malaria Data: `/02-processed_data/raked_malaria_aa.parquet`
- Hierarchy: `/02-processed_data/full_hierarchy_lsae_1209.parquet`
- Climate Variables: `/mnt/team/rapidresponse/pub/climate-aggregates/2025_03_20/results/lsae_1209/...`
- Economic Indicators: `/02-processed_data/lsae_1209/gdppc_mean.parquet`, `/02-processed_data/lsae_1209/ldipc_mean.parquet`
- Development Assistance: `/02-processed_data/lsae_1209/dah_df.parquet`
- Output Stage 1: `/03-modeling_data/malaria_stage_1_modeling_df.parquet`
- Output Stage 2: `/03-modeling_data/malaria_stage_2_modeling_df.parquet`

**Process:** Configure data paths → Load malaria data → Merge economic indicators → Integrate urban metrics → Add climate variables → Create Stage 1 dataset → Filter to high-burden areas (mortality > 100) → Select most detailed LSAE locations → Apply transformations (log, logit) → Save Stage 2 modeling dataset
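
The Stage 2 filter and transformations might be sketched as follows. The column names `malaria_pfpr` and `malaria_mort_count` appear elsewhere in this notebook; which variable receives which transform, the row-wise reading of the `> 100` filter, the clipping constant `eps`, and the output column names are assumptions.

```python
import numpy as np
import pandas as pd


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 so the result stays finite."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))


stage_1 = pd.DataFrame({
    "location_id": [1, 2, 3],
    "year_id": [2020, 2020, 2020],
    "malaria_pfpr": [0.5, 0.2, 0.0],
    "malaria_mort_count": [250.0, 50.0, 400.0],
})

# Keep high-burden rows only (assumed here to be a row-wise filter).
stage_2 = stage_1[stage_1["malaria_mort_count"] > 100].copy()

# Logit for a proportion bounded in [0, 1]; log for a positive count.
stage_2["logit_malaria_pfpr"] = logit(stage_2["malaria_pfpr"])
stage_2["log_malaria_mort_count"] = np.log(stage_2["malaria_mort_count"])
```

Clipping before the logit is one common way to handle prevalence values of exactly 0 or 1; without it the third row above would produce negative infinity.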

**Significance:** Creates the foundational datasets required for malaria burden modeling, incorporating all relevant predictors and applying appropriate statistical transformations for regression modeling.

Prompt for next time


Please create a compact markdown documentation for this Python script and present it inside a code block (```), so I can copy the raw markdown text with all formatting symbols visible. This allows me to paste it directly into my Jupyter notebook markdown cell. Follow this exact format:

### script_name.py
**Purpose:** One sentence summary of what the script does.

**Inputs/Output:**
- Input1: `/path/to/input1`
- Input2: `/path/to/input2`
- Output: `/path/to/output`

**Process:** Step 1 → Step 2 → Step 3 → etc. (use arrows between steps and keep it as a single line)

**Significance:** Brief statement on why this script is important to the pipeline.

Please use the actual paths, steps, and details from the script, and ensure I can see all markdown symbols (###, **, -, etc.) in your response by placing everything inside a code block.

In [None]:
malaria_stage_2_df[malaria_stage_2_df["location_id"] == 46346][["location_id", "year_id", "malaria_pfpr", "malaria_mort_count"]]




#### data_prep/malaria/04_forecasted_dataframes_non_draw_part.py
	- in:	many inputs (not listed here)
	- out:	"{FORECASTING_DATA_PATH}/malaria_forecast_scenario_{ssp_scenario}_non_draw_part.parquet"

#### data_prep/malaria/04_forecasted_dataframes_parallel.py
	- in:	"{FORECASTING_DATA_PATH}/malaria_forecast_scenario_{ssp_scenario}_non_draw_part.parquet"
	- out:	"{FORECASTING_DATA_PATH}/malaria_forecast_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario_name}_draw_{draw}.parquet"
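
Since each stage reads the per-draw files written by the previous one, it can help to expand the filename template and check which files exist before launching the next stage. The sketch below uses the per-draw output pattern of `04_forecasted_dataframes_parallel.py`; the value of `FORECASTING_DATA_PATH`, the scenario names, and the draw labels are placeholders, not the pipeline's real values.

```python
from itertools import product
from pathlib import Path

# Placeholder; substitute the pipeline's actual FORECASTING_DATA_PATH.
FORECASTING_DATA_PATH = Path("/path/to/forecasting_data")

TEMPLATE = (
    "malaria_forecast_ssp_scenario_{ssp_scenario}"
    "_dah_scenario_{dah_scenario_name}_draw_{draw}.parquet"
)


def draw_paths(ssp_scenarios, dah_scenario_names, draws):
    """Expand the template over every scenario/draw combination."""
    return [
        FORECASTING_DATA_PATH
        / TEMPLATE.format(ssp_scenario=ssp, dah_scenario_name=dah, draw=draw)
        for ssp, dah, draw in product(ssp_scenarios, dah_scenario_names, draws)
    ]


# Hypothetical scenario names and draw labels.
paths = draw_paths(["ssp245"], ["Baseline"], ["000", "001"])

# Files the next stage would expect but that are not on disk yet.
missing = [p for p in paths if not p.exists()]
```

An empty `missing` list would indicate that every expected draw file is present for the chosen scenarios.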

#### forecasting/malaria/forecast_admin_2s_launcher.r
	- in:	"{FORECASTING_DATA_PATH}/malaria_forecast_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario_name}_draw_{draw}.parquet"
	- out:	"{FORECASTING_DATA_PATH}/malaria_forecast_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario_name}_draw_{draw}_with_predictions.parquet"


#### forecasting/malaria/01_as_malaria_shifts_parallel.py
	- in: 	"{FORECASTING_DATA_PATH}/malaria_forecast_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario_name}_draw_{draw}_with_predictions.parquet"
	- out:	"{FORECASTING_DATA_PATH}/as_malaria_measure_{measure}_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario_name}_draw_{draw}_with_predictions.parquet"
	
#### aggregation/malaria/01_malaria_as_aggregation_by_draw_parallel.py
	- in: 	"{FORECASTING_DATA_PATH}/as_malaria_measure_{measure}_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario}_draw_{draw}_with_predictions.parquet"
	- out: 	"{UPLOAD_DATA_PATH}/full_as_malaria_measure_{measure}_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario}_draw_{draw}_with_predictions.parquet"
	
#### upload/malaria/combine_as_draws.ipynb
	- in:	"{UPLOAD_DATA_PATH}/full_as_malaria_measure_{measure}_ssp_scenario_{ssp_scenario}_dah_scenario_{dah_scenario}_draw_{draw}_with_predictions.parquet"
	- out: 	"{UPLOAD_DATA_PATH}/fhs_upload_folders/cause_id_{cause_id}_measure_id_{measure_id}_scenario_{scenario}_{run_date}/draws.h5"
	
	
/mnt/team/fhs/pub/venv/fhs_save_results /mnt/team/idd/pub/forecast-mbp/05-upload_data/fhs_upload_folders/cause_id_345_measure_id_1_scenario_0_2025_06_09 --has-past-data False
/mnt/team/fhs/pub/venv/fhs_save_results /mnt/team/idd/pub/forecast-mbp/05-upload_data/fhs_upload_folders/cause_id_345_measure_id_1_scenario_54_2025_06_09 --has-past-data False
/mnt/team/fhs/pub/venv/fhs_save_results /mnt/team/idd/pub/forecast-mbp/05-upload_data/fhs_upload_folders/cause_id_345_measure_id_1_scenario_66_2025_06_09 --has-past-data False

	