# Preprocessing Guide: RainfallProcessor

This notebook is a guide to the `RainfallProcessor`, a component that performs **spatial interpolation of rainfall data**. Distributed hydrological models need rainfall inputs at specific points (e.g., the centroid of a sub-basin), but observations typically come from a sparse network of rain gauges, so interpolation is a necessary preprocessing step.

## 1. How It Works

The `RainfallProcessor` is a special component that runs *before* the main simulation loop. It is designed to be listed in the `preprocessing` section of a simulation configuration.

Its workflow is:
1.  It reads the locations and time series data paths for several **rain gauges** from the `datasets` section of the config.
2.  It identifies all components in the simulation that need rainfall data (e.g., hydrological models) and gets their coordinates.
3.  It uses a specified **interpolation strategy** (like Inverse Distance Weighting) to calculate the rainfall at each target location for every timestamp present in the gauge data.
4.  It stores the result as a single DataFrame of interpolated rainfall time series, which other models can then use during the main simulation.
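
To make step 3 concrete, here is a minimal sketch of textbook Inverse Distance Weighting, written with the standard library only. It is not the SDK's implementation: the function names `idw_weights` and `interpolate_series` are hypothetical, plain dictionaries stand in for the DataFrame, and the rainfall values are made up for illustration. The gauge and basin coordinates are the ones used later in this guide.

```python
import math


def idw_weights(gauge_coords, target, power=2.0):
    """Return normalized inverse-distance weights of each gauge for one target."""
    distances = {
        gid: math.hypot(x - target[0], y - target[1])
        for gid, (x, y) in gauge_coords.items()
    }
    # A target that coincides with a gauge takes that gauge's value directly,
    # which also avoids a division by zero below.
    for gid, d in distances.items():
        if d == 0.0:
            return {g: (1.0 if g == gid else 0.0) for g in distances}
    raw = {gid: 1.0 / d ** power for gid, d in distances.items()}
    total = sum(raw.values())
    return {gid: w / total for gid, w in raw.items()}


def interpolate_series(gauge_coords, gauge_series, targets, power=2.0):
    """Interpolate every timestep of the gauge series at every target location."""
    n_steps = len(next(iter(gauge_series.values())))
    result = {}
    for tid, coords in targets.items():
        weights = idw_weights(gauge_coords, coords, power)
        # Each interpolated value is a weighted average of the gauge values.
        result[tid] = [
            sum(weights[gid] * gauge_series[gid][t] for gid in gauge_coords)
            for t in range(n_steps)
        ]
    return result


# Coordinates taken from the configuration used in this guide.
gauge_coords = {"station_A": (0, 0), "station_B": (10, 10), "station_C": (0, 10)}
targets = {"basin1": (2, 2), "basin2": (8, 8)}

# Hypothetical rainfall values per timestep, for illustration only.
gauge_series = {
    "station_A": [0.0, 5.0, 10.0],
    "station_B": [0.0, 1.0, 2.0],
    "station_C": [0.0, 3.0, 6.0],
}

interpolated = interpolate_series(gauge_coords, gauge_series, targets)
for tid, values in interpolated.items():
    print(tid, [round(v, 3) for v in values])
```

Because the weights are positive and sum to one, each interpolated value lies between the smallest and largest gauge reading at that timestep, and the nearest gauge contributes the most.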

## 2. Code Example: Interpolating a Rainfall Field

We will demonstrate a complete example using the `SimulationManager`. We will define a set of rain gauges and a mock hydrological model with target sub-basins. Then, we will run the `RainfallProcessor` and inspect its output.

In [None]:
import sys
import os
import yaml
import numpy as np
import pandas as pd

# Add the project root (the parent of the current working directory) to the import path
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

from water_system_sdk.src.chs_sdk.simulation_manager import SimulationManager

### Setting up the YAML Configuration

We will define a full simulation configuration in YAML. This includes the `datasets` block for our rain gauges and a `preprocessing` block to tell the `SimulationManager` to run our `RainfallProcessor`.

In [None]:
# A minimal stand-in for a sub-basin: an identifier plus coordinates,
# kept as an object so the attributes are accessible later
class MockSubBasin:
    def __init__(self, id, coords):
        self.id = id
        self.coords = coords

# Define the configuration as a string
config_yaml = """
datasets:
  rain_gauges:
    - id: 'station_A'
      coords: [0, 0]
      time_series_path: '../data/rain_gauges/station_A.csv'
    - id: 'station_B'
      coords: [10, 10]
      time_series_path: '../data/rain_gauges/station_B.csv'
    - id: 'station_C'
      coords: [0, 10]
      time_series_path: '../data/rain_gauges/station_C.csv'

components:
  # The processor itself is a component
  rainfall_processor:
    type: RainfallProcessor
    params:
      source_dataset: 'rain_gauges'
      strategy:
        type: InverseDistanceWeightingInterpolator
        params:
          power: 2
  
  # A mock hydrology model that has sub-basins needing rainfall data
  hydro_model:
    type: SemiDistributedHydrologyModel # This is just a placeholder type
    params:
      sub_basins: !python/object:__main__.create_sub_basins []

# Define which components to run before the main simulation
preprocessing:
  - rainfall_processor

# No execution order needed as we are only running the preprocessing step
execution_order: []
simulation_params:
  total_time: 1
  dt: 1
"""

# Helper function to be called by the YAML loader
def create_sub_basins():
    return [
        MockSubBasin(id='basin1', coords=(2, 2)),
        MockSubBasin(id='basin2', coords=(8, 8))
    ]

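# Register a constructor on SafeLoader so that the custom tag in the YAML above
# resolves to the list returned by create_sub_basins()
yaml.add_constructor(
    '!python/object:__main__.create_sub_basins',
    lambda loader, node: create_sub_basins(),
    Loader=yaml.SafeLoader,
)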

# Load the config
config = yaml.safe_load(config_yaml)


### Running the Preprocessing Step

Now we can instantiate the `SimulationManager` with this configuration. The manager will automatically run the `rainfall_processor` because it's listed in the `preprocessing` section.

In [None]:
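# Workaround: the hydrology model used here is a placeholder and is not in the
# component registry, so we add it by hand through the private `_CLASS_MAP`.
# In a real scenario, the model would already be in the registry.
# Note: `ComponentRegistry` must be imported from the same package path as
# `SimulationManager`; otherwise Python loads two separate copies of the module
# and this registration will not be visible to the manager.
from chs_sdk.modules.modeling.hydrology.semi_distributed import SemiDistributedHydrologyModel
from chs_sdk.simulation_manager import ComponentRegistry
ComponentRegistry._CLASS_MAP['SemiDistributedHydrologyModel'] = SemiDistributedHydrologyModel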

# Instantiate the manager. This will run the preprocessing step.
manager = SimulationManager(config)

# Get the processor component to inspect its output
processor = manager.components['rainfall_processor']

# The result is a pandas DataFrame
interpolated_data = processor.interpolated_rainfall

print("Interpolated Rainfall Data:")
interpolated_data

The output DataFrame shows the calculated rainfall for our two target basins at each timestamp from the source files. This demonstrates that the `RainfallProcessor` has successfully taken scattered point data and created a complete time series for the locations required by the simulation models.
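
The configuration above sets `power: 2` for the interpolation strategy. As a rough illustration of what that parameter controls, the sketch below computes textbook IDW weights for `basin1` at several powers. It assumes the standard formula (weight proportional to one over distance raised to `power`), which may differ in detail from the SDK's `InverseDistanceWeightingInterpolator`; the helper `normalized_weights` is hypothetical.

```python
import math

# Gauge and basin coordinates as defined in the configuration above.
GAUGES = {"station_A": (0, 0), "station_B": (10, 10), "station_C": (0, 10)}
BASIN1 = (2, 2)


def normalized_weights(target, power):
    """Textbook IDW weights; assumes the target does not coincide with a gauge."""
    raw = {
        gid: 1.0 / math.hypot(x - target[0], y - target[1]) ** power
        for gid, (x, y) in GAUGES.items()
    }
    total = sum(raw.values())
    return {gid: w / total for gid, w in raw.items()}


# Compare how the weights shift as the power increases.
for power in (1, 2, 4):
    weights = normalized_weights(BASIN1, power)
    print(power, {gid: round(w, 3) for gid, w in weights.items()})
```

Under this formula, a higher power concentrates weight on the nearest gauge (`station_A` for `basin1`), while a lower power spreads it more evenly across the network.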