# American Samoa Rainfall Prediction with Custom Configuration

This notebook demonstrates how to use the refactored rainfall prediction pipeline with custom configurations.

## 1. Setup

First, import the required modules and add the project's scripts directory to the Python path.

In [1]:
import os
import sys
import yaml
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Define project root (assumes the notebook is run from the repository root)
PROJECT_ROOT = os.path.abspath(os.getcwd())
print(f"Project root: {PROJECT_ROOT}")

# Add scripts directory to Python path for imports
sys.path.append(os.path.join(PROJECT_ROOT, '2_Create_ML_Data', 'scripts'))

# Import the config utilities
from utils.config_utils import load_config, merge_config_with_args

Project root: /Users/jlee/Desktop/github/AS_rainfall
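`os.getcwd()` only gives the project root when the notebook is launched from the repository's top level. As a hedged alternative, the sketch below walks up the directory tree until it finds a marker directory. The helper name `find_project_root` and the choice of `2_Create_ML_Data` as the marker are assumptions made for illustration; they are not part of the pipeline's utilities.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="2_Create_ML_Data"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")


# Demonstration on a throwaway directory tree (not the real repository)
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "2_Create_ML_Data" / "scripts"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    root_matches = found == Path(tmp).resolve()
    print(f"Root found from nested directory: {root_matches}")
```

In the notebook, `PROJECT_ROOT = str(find_project_root(os.getcwd()))` would then work from any subdirectory of the repository.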


## 2. Load Default Configuration

Next, we load the default configuration from the YAML file.

In [10]:
# Load the default configuration
default_config = load_config()

# Display the configuration
print("Default Configuration:")
for key, value in default_config.items():
    if key not in ['patch_sizes', 'km_per_cell']:
        print(f"  {key}: {value}")
print(f"  patch_sizes: local={default_config['patch_sizes']['local']}, regional={default_config['patch_sizes']['regional']}")
print(f"  km_per_cell: local={default_config['km_per_cell']['local']}, regional={default_config['km_per_cell']['regional']}")

# Set default config path
default_config_path = os.path.join(PROJECT_ROOT, '2_Create_ML_Data', 'config', 'config.yaml')

Default Configuration:
  dem_path: /Users/jlee/Desktop/github/AS_rainfall/raw_data/DEM/DEM_Tut1.tif
  climate_data_path: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output/processed_climate_data.nc
  raw_climate_dir: /Users/jlee/Desktop/github/AS_rainfall/raw_data/climate_variables
  rainfall_dir: /Users/jlee/Desktop/github/AS_rainfall/1_Process_Rainfall_Data/output/monthly_rainfall
  station_locations_path: /Users/jlee/Desktop/github/AS_rainfall/raw_data/AS_raingages/as_raingage_list2.csv
  output_dir: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output
  grid_size: 5
  patch_sizes: local=3, regional=3
  km_per_cell: local=2, regional=8


## 3. Create a Custom Configuration

Now, let's create a custom configuration that changes the grid size, the local patch size, the local cell resolution, and the output directory.

In [3]:
# Create a custom configuration file
custom_config_path = os.path.join(PROJECT_ROOT, '2_Create_ML_Data', 'config', 'custom_config.yaml')

# Define custom configuration
custom_config = {
    'paths': {
        'dem': "raw_data/DEM/DEM_Tut1.tif",
        'climate_data': "2_Create_ML_Data/output/processed_climate_data.nc",
        'raw_climate': "raw_data/climate_variables",
        'rainfall': "1_Process_Rainfall_Data/output/monthly_rainfall",
        'stations': "raw_data/AS_raingages/as_raingage_list2.csv",
        'output': "2_Create_ML_Data/output/custom_run"
    },
    'model': {
        'grid_size': 7,  # Changed from 5 to 7
        'patch_sizes': {
            'local': 5,    # Changed from 3 to 5
            'regional': 3
        },
        'km_per_cell': {
            'local': 1.5,  # Changed from 2 to 1.5
            'regional': 8
        }
    }
}

# Save the custom configuration to a YAML file
os.makedirs(os.path.dirname(custom_config_path), exist_ok=True)
with open(custom_config_path, 'w') as f:
    yaml.dump(custom_config, f, default_flow_style=False)

print(f"Custom configuration saved to {custom_config_path}")

Custom configuration saved to /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/config/custom_config.yaml
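The dictionary above is nested under `paths` and `model`, while the configuration printed in section 2 exposes flat keys such as `dem_path` and `grid_size` with absolute paths. The source of `load_config` is not shown in this notebook, so the following is only a sketch of how such a mapping could work. The function name `flatten_config` and the key mapping are assumptions inferred from the printed output, not the actual implementation in `config_utils`.

```python
import os

# Hypothetical mapping from nested 'paths' keys to the flat key names seen in
# the printed configuration; inferred from the output, not from config_utils.
PATH_KEY_MAP = {
    "dem": "dem_path",
    "climate_data": "climate_data_path",
    "raw_climate": "raw_climate_dir",
    "rainfall": "rainfall_dir",
    "stations": "station_locations_path",
    "output": "output_dir",
}


def flatten_config(nested, project_root):
    """Flatten a nested config and resolve relative paths against project_root."""
    flat = {}
    for key, rel_path in nested.get("paths", {}).items():
        flat_key = PATH_KEY_MAP.get(key, key)
        flat[flat_key] = os.path.normpath(os.path.join(project_root, rel_path))
    # Model parameters are copied through unchanged
    flat.update(nested.get("model", {}))
    return flat


example = {
    "paths": {
        "dem": "raw_data/DEM/DEM_Tut1.tif",
        "output": "2_Create_ML_Data/output/custom_run",
    },
    "model": {"grid_size": 7, "patch_sizes": {"local": 5, "regional": 3}},
}
example_root = os.path.join(os.sep, "tmp", "AS_rainfall")  # placeholder root
flat = flatten_config(example, example_root)
for key, value in flat.items():
    print(f"  {key}: {value}")
```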


## 4. Load Custom Configuration

Now, let's load the custom configuration from the file we just wrote and print it in the same format as the default.

In [4]:
# Load the custom configuration
custom_config_loaded = load_config(custom_config_path)

# Display the custom configuration
print("Custom Configuration:")
for key, value in custom_config_loaded.items():
    if key not in ['patch_sizes', 'km_per_cell']:
        print(f"  {key}: {value}")
print(f"  patch_sizes: local={custom_config_loaded['patch_sizes']['local']}, regional={custom_config_loaded['patch_sizes']['regional']}")
print(f"  km_per_cell: local={custom_config_loaded['km_per_cell']['local']}, regional={custom_config_loaded['km_per_cell']['regional']}")

Custom Configuration:
  dem_path: /Users/jlee/Desktop/github/AS_rainfall/raw_data/DEM/DEM_Tut1.tif
  climate_data_path: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output/processed_climate_data.nc
  raw_climate_dir: /Users/jlee/Desktop/github/AS_rainfall/raw_data/climate_variables
  rainfall_dir: /Users/jlee/Desktop/github/AS_rainfall/1_Process_Rainfall_Data/output/monthly_rainfall
  station_locations_path: /Users/jlee/Desktop/github/AS_rainfall/raw_data/AS_raingages/as_raingage_list2.csv
  output_dir: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output/custom_run
  grid_size: 7
  patch_sizes: local=5, regional=3
  km_per_cell: local=1.5, regional=8
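Comparing the two printed configurations by eye is error-prone, so a small recursive diff can list exactly which settings changed. This is a minimal sketch; `diff_configs` is a hypothetical helper, and the sample dictionaries copy only the non-path values and a shortened output path from the outputs above.

```python
def diff_configs(base, other, prefix=""):
    """Return {dotted_key: (base_value, other_value)} for values that differ."""
    changes = {}
    for key in sorted(set(base) | set(other)):
        path = f"{prefix}{key}"
        a, b = base.get(key), other.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested sections such as patch_sizes
            changes.update(diff_configs(a, b, prefix=f"{path}."))
        elif a != b:
            changes[path] = (a, b)
    return changes


# Values copied from the printed configurations (output path shortened)
default_values = {
    "output_dir": "2_Create_ML_Data/output",
    "grid_size": 5,
    "patch_sizes": {"local": 3, "regional": 3},
    "km_per_cell": {"local": 2, "regional": 8},
}
custom_values = {
    "output_dir": "2_Create_ML_Data/output/custom_run",
    "grid_size": 7,
    "patch_sizes": {"local": 5, "regional": 3},
    "km_per_cell": {"local": 1.5, "regional": 8},
}

changes = diff_configs(default_values, custom_values)
for key, (old, new) in changes.items():
    print(f"  {key}: {old} -> {new}")
```

In the notebook itself, `diff_configs(default_config, custom_config_loaded)` would compare the two loaded dictionaries directly.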


## 5. Running the Full Pipeline with Custom Configuration

To run the full pipeline with a custom configuration, call the pipeline script from the notebook as shown below.

### Run from the notebook with subprocess

This approach runs the pipeline script as a subprocess and prints its combined standard output and standard error.

In [12]:
import subprocess

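def run_command(cmd):
    """Run a shell command string and return its stdout followed by its stderr."""
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    # communicate() waits for the process to finish and collects both streams
    stdout, stderr = process.communicate()
    return stdout.decode('utf-8') + stderr.decode('utf-8')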

# Set up default configuration
pipeline_script = os.path.join(PROJECT_ROOT, '2_Create_ML_Data', 'scripts', 'rainfall_prediction_pipeline.py')
cmd = f"python3 {pipeline_script} --config {default_config_path} --output-dir {PROJECT_ROOT}/2_Create_ML_Data/output/"

# Run the pipeline
output = run_command(cmd)
print(output)

Found 11 raw climate data files in /Users/jlee/Desktop/github/AS_rainfall/raw_data/climate_variables
Found existing processed climate data at: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output/processed_climate_data.nc



In [8]:
# Run the pipeline with custom configuration
pipeline_script = os.path.join(PROJECT_ROOT, '2_Create_ML_Data', 'scripts', 'rainfall_prediction_pipeline.py')
cmd = f"python3 {pipeline_script} --config {custom_config_path} --output-dir {PROJECT_ROOT}/2_Create_ML_Data/output/notebook_run"

output = run_command(cmd)
print(output)

Found 11 raw climate data files in /Users/jlee/Desktop/github/AS_rainfall/raw_data/climate_variables
Found existing processed climate data at: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output/processed_climate_data.nc
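The `run_command` helper above sends one command string through the shell and returns only text, so a project path containing spaces or a non-zero exit code can go unnoticed. A possible alternative, sketched here under the assumption that the pipeline script needs no shell features, passes the arguments as a list and returns the exit code as well. `run_args` is a hypothetical name, and the command below is a stand-in so the example runs without the pipeline present.

```python
import subprocess
import sys


def run_args(args):
    """Run a command given as a list of arguments, without a shell.

    Returns (returncode, stdout, stderr) so that failures stay visible.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# Stand-in command; in the notebook the list would hold the interpreter,
# pipeline_script, "--config", the config path, "--output-dir", and the
# output directory as separate items.
code, out, err = run_args([sys.executable, "-c", "print('pipeline stub')"])
print(f"exit code: {code}")
print(out.strip())
```

Because each argument is a separate list item, no quoting is needed for paths, and a non-zero `code` can be checked before trusting the output.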



In [23]:
# Import the progress tracking utilities (os and sys are already imported above)
from utils.progress_utils import run_pipeline_with_progress, display_pipeline_timing

# Run with custom configuration
print("Running pipeline with custom configuration...")
custom_config_path = os.path.join(PROJECT_ROOT, '2_Create_ML_Data', 'config', 'custom_config.yaml')
custom_result = run_pipeline_with_progress(
    project_root=PROJECT_ROOT,
    config_path=custom_config_path,
    output_dir=os.path.join(PROJECT_ROOT, '2_Create_ML_Data', 'output', 'custom_notebook_run')
)

# Display timing information
if custom_result["success"]:
    display_pipeline_timing(custom_result)

Running pipeline: python3 /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/scripts/rainfall_prediction_pipeline.py --config /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/config/config.yaml --output-dir /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output


Pipeline Progress:   0%|          | 0/6 [00:00<?, ?it/s]

2025-05-27 14:33:53 - PROGRESS: 1/6 - Setting up environment
2025-05-27 14:33:53 - PROGRESS: 2/6 - Processing DEM data
2025-05-27 14:33:53 - CPLE_AppDefined in PROJ: internal_proj_create_from_database: Cannot find proj.db
2025-05-27 14:33:54 - PROGRESS: 3/6 - Processing climate data
2025-05-27 14:33:54 - PROGRESS: 4/6 - Processing rainfall data
2025-05-27 14:34:04 - PROGRESS: 5/6 - Generating training data
2025-05-27 14:34:11 - PROGRESS: 6/6 - Converting H5 data to CSV format
Found 11 raw climate data files in /Users/jlee/Desktop/github/AS_rainfall/raw_data/climate_variables
Found existing processed climate data at: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output/processed_climate_data.nc

Pipeline completed successfully!

Progress Summary:
PROGRESS: 1/6 - Setting up environment
PROGRESS: 2/6 - Processing DEM data
PROGRESS: 3/6 - Processing climate data
PROGRESS: 4/6 - Processing rainfall data
PROGRESS: 5/6 - Generating training data
PROGRESS: 6/6 - Converting H5 data to

Pipeline Progress:   0%|          | 0/6 [00:00<?, ?it/s]

2025-05-27 14:34:14 - PROGRESS: 1/6 - Setting up environment
2025-05-27 14:34:14 - PROGRESS: 2/6 - Processing DEM data
2025-05-27 14:34:14 - CPLE_AppDefined in PROJ: internal_proj_create_from_database: Cannot find proj.db
2025-05-27 14:34:15 - PROGRESS: 3/6 - Processing climate data
2025-05-27 14:34:15 - PROGRESS: 4/6 - Processing rainfall data
2025-05-27 14:34:25 - PROGRESS: 5/6 - Generating training data
2025-05-27 14:34:31 - PROGRESS: 6/6 - Converting H5 data to CSV format
Found 11 raw climate data files in /Users/jlee/Desktop/github/AS_rainfall/raw_data/climate_variables
Found existing processed climate data at: /Users/jlee/Desktop/github/AS_rainfall/2_Create_ML_Data/output/processed_climate_data.nc

Pipeline completed successfully!

Progress Summary:
PROGRESS: 1/6 - Setting up environment
PROGRESS: 2/6 - Processing DEM data
PROGRESS: 3/6 - Processing climate data
PROGRESS: 4/6 - Processing rainfall data
PROGRESS: 5/6 - Generating training data
PROGRESS: 6/6 - Converting H5 data to
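The timestamped `PROGRESS` lines in the output are enough to estimate how long each step took. The internals of `display_pipeline_timing` are not shown here, so the following is an independent sketch rather than that function's logic: it parses the first run's log lines, copied from the output above, and reports the time between consecutive steps. The last step has no following timestamp, so it is left out.

```python
import re
from datetime import datetime

# Log lines copied from the first run in the output above
LOG_LINES = [
    "2025-05-27 14:33:53 - PROGRESS: 1/6 - Setting up environment",
    "2025-05-27 14:33:53 - PROGRESS: 2/6 - Processing DEM data",
    "2025-05-27 14:33:53 - CPLE_AppDefined in PROJ: internal_proj_create_from_database: Cannot find proj.db",
    "2025-05-27 14:33:54 - PROGRESS: 3/6 - Processing climate data",
    "2025-05-27 14:33:54 - PROGRESS: 4/6 - Processing rainfall data",
    "2025-05-27 14:34:04 - PROGRESS: 5/6 - Generating training data",
    "2025-05-27 14:34:11 - PROGRESS: 6/6 - Converting H5 data to CSV format",
]

PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - PROGRESS: (\d+)/(\d+) - (.+)$"
)


def step_durations(lines):
    """Return [(step_name, seconds_until_next_step)] from PROGRESS log lines."""
    steps = []
    for line in lines:
        match = PATTERN.match(line)
        if match:  # non-PROGRESS lines, such as the PROJ warning, are skipped
            started = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            steps.append((started, match.group(4)))
    return [
        (name, (next_start - start).total_seconds())
        for (start, name), (next_start, _) in zip(steps, steps[1:])
    ]


durations = step_durations(LOG_LINES)
for name, seconds in durations:
    print(f"{name}: {seconds:.0f} s")
```

With one-second timestamp resolution, steps that finish within the same second show as 0 s, so the figures are only a coarse guide to where the time goes.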