# Product Requirements Document (PRD): CHF-Automator

## 1. Executive Summary
**Project Name:** `CHF-Automator`  
**Objective:** Automate the "Crop Health Factor (CHF)" workflow (Murthy et al., 2022) for paddy crop insurance using Google Earth Engine (GEE).  
**Core Philosophy:** 
1. **Decoupled Architecture:** Heavy GEE extraction is separated from mathematical modeling. Data is persisted to local CSVs between phases to prevent timeouts.
2. **Dynamic Masking:** Each year is masked with its own crop map (`N+1` strategy).
3. **Training vs. Application:** Weights are learned from the variance in the training (historic) years only, then applied to all years.

## 2. Technical Stack
* **Platform:** Google Earth Engine (Python API) for extraction.
* **Environment:** Python 3.9+ (Pandas/NumPy) for modeling.
* **Authentication:** Modular GEE Project initialization (handled externally).

## 3. Input Specifications
The tool requires the following inputs via a configuration file or main script:

1.  **Insurance Units (Vector):** GEE Asset ID (User provided) containing `Unit_ID` (Unique) and `Strata_ID` (Grouping).
2.  **Season Dates:** `season_start`, `season_end`, `peak_start`, `peak_end`.
3.  **Analysis Configuration:**
    * **Years List:** A list of all years to process (e.g., `[2018, 2019, 2020, 2021, 2022, 2023]`).
    * **Crop Map Dictionary:** A mapping of `Year -> Crop_Map_Asset_ID`.
    * **Training Years:** A subset list (e.g., `[2018, 2019, 2020, 2021, 2022]`) used strictly for weight generation.
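
As an illustration of how these inputs could be laid out, here is a minimal sketch of a configuration dictionary with a consistency check. The key names, the `MM-DD` date format, the dates themselves, and the asset paths are all placeholders chosen for this example, not values fixed by this PRD; `validate_config` is a hypothetical helper.

```python
# Hypothetical configuration layout. All asset IDs and dates are placeholders.
CONFIG = {
    "iu_asset": "projects/your-project/assets/insurance_units",
    "season_start": "06-15",  # assumed MM-DD format, combined with each year
    "season_end": "11-15",
    "peak_start": "08-15",
    "peak_end": "09-30",
    "years": [2018, 2019, 2020, 2021, 2022, 2023],
    "training_years": [2018, 2019, 2020, 2021, 2022],
    # Year -> Crop_Map_Asset_ID
    "crop_maps": {
        y: f"projects/your-project/assets/crop_map_{y}" for y in range(2018, 2024)
    },
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    years = set(config["years"])

    # Every processed year needs its own crop map (dynamic masking).
    missing_maps = sorted(years - set(config["crop_maps"]))
    if missing_maps:
        errors.append(f"no crop map for years: {missing_maps}")

    # Training years must be a subset of the years being processed.
    extra_training = sorted(set(config["training_years"]) - years)
    if extra_training:
        errors.append(f"training years not in years list: {extra_training}")

    return errors
```

Running the check once at startup, before any GEE call is made, would surface a missing crop map or a stray training year early.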

## 4. The 8 Indicators (GEE Processing Logic)
**Constraint:** This logic applies to *every* year independently, using that year's specific Crop Mask.

| Indicator | Data Source | GEE Processing Logic |
| :--- | :--- | :--- |
| **1. Max NDVI** | Sentinel-2 | **Cloud Mask:** Mask pixels where `MSK_CLDPRB > 20`. <br>**Calc:** `(NIR-Red)/(NIR+Red)`. **Reducer:** `.max()` over `season_start` to `season_end`. |
| **2. Max LSWI** | Sentinel-2 | **Cloud Mask:** Same (`MSK_CLDPRB > 20`). <br>**Calc:** `(NIR-SWIR1)/(NIR+SWIR1)`. **Reducer:** `.max()` over `season_start` to `season_end`. |
| **3. Max Backscatter** | Sentinel-1 | **Filter:** IW, VH, Descending. <br>**Process:** Apply Refined Lee Speckle Filter (5x5). <br>**Reducer:** `.max()`. |
| **4. Integrated Backscatter** | Sentinel-1 | **Calc:** `.sum()` of VH backscatter over season (Area under curve). |
| **5. Integrated FAPAR** | MODIS | **Product:** `MODIS/061/MCD15A3H`. **Band:** `Fpar`. <br>**Window:** `.sum()` over **`peak_start` to `peak_end`**. |
| **6. Condition Variability** | Sentinel-2 | **Spatial CV:** Calculate `(Standard Deviation / Mean)` of the 'Max NDVI' image *spatially* within each IU polygon. |
| **7. Rainy Days** | CHIRPS | **Product:** `UCSB-CHG/CHIRPS/DAILY`. <br>**Calc:** Count days where `precipitation > 2.5mm`. |
| **8. Adjusted Rainfall** | CHIRPS | **Step A:** Calc Current Year Total Rain. <br>**Step B:** Calc `Normal` (10-yr avg) for same dates. <br>**Step C:** If `Current` > (1.5 * `Normal`), cap at (1.5 * `Normal`). |
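
The arithmetic behind indicators 6, 7 and 8 is simple enough to sanity-check outside GEE. The sketch below restates it in plain Python on lists of numbers. The function names are hypothetical, and the use of the population standard deviation for the spatial CV is an assumption, since the table does not say which estimator is intended.

```python
from statistics import mean, pstdev


def spatial_cv(pixel_values):
    """Indicator 6: (standard deviation / mean) of Max NDVI pixels in one IU.

    Assumption: population standard deviation (pstdev).
    """
    mu = mean(pixel_values)
    return pstdev(pixel_values) / mu if mu else float("nan")


def rainy_days(daily_mm, threshold_mm=2.5):
    """Indicator 7: number of days with precipitation strictly above 2.5 mm."""
    return sum(1 for p in daily_mm if p > threshold_mm)


def adjusted_rainfall(current_total, normal_total, cap_factor=1.5):
    """Indicator 8: seasonal total, capped at 1.5 x the 10-year normal."""
    return min(current_total, cap_factor * normal_total)
```

For example, a season total of 900 mm against a normal of 500 mm would be capped at 750 mm, while a total of 600 mm would pass through unchanged.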

## 5. Algorithmic Workflow (Decoupled)

### Phase 1: Batch Extraction (GEE -> Local CSV)
**Goal:** Download raw data to disk using Client-Side Chunking to prevent timeouts.
1.  **Preparation:** Fetch list of `Unit_ID`s from the Input Asset.
2.  **Loop:** Iterate through every year in the User Config.
3.  **Action:** Call `fetch_metrics(year, crop_map_asset, roi_asset)` inside a batch loop.
    *   Break `Unit_ID`s into batches (e.g., 50 units).
    *   For each batch:
        *   Filter ROI.
        *   Run `reduceRegions`.
        *   Fetch data using `getInfo()` or `geemap.ee_to_df()`.
        *   Append results incrementally to `outputs/raw_data/indicators_{year}.csv`.
        *   Use `tqdm` for progress monitoring.
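
The batching and incremental-append pattern above can be sketched without touching GEE. In this example `fetch_batch` is a hypothetical callable standing in for the filter, `reduceRegions` and `getInfo()` sequence; it is assumed to return a list of dictionaries, one per unit. Only the standard library is used, and the `tqdm` progress bar is left out for brevity.

```python
import csv
import os


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV, writing the header only for a new file."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerows(rows)


def extract_year(year, unit_ids, fetch_batch, out_dir, batch_size=50):
    """Fetch one year in batches, flushing each batch to disk immediately."""
    path = os.path.join(out_dir, f"indicators_{year}.csv")
    for batch in chunked(unit_ids, batch_size):
        # Hypothetical stand-in for the per-batch GEE request.
        rows = fetch_batch(year, batch)
        if rows:
            append_rows(path, rows, fieldnames=list(rows[0].keys()))
    return path
```

Because each batch is written as soon as it arrives, a failure partway through a year loses at most one batch of work. Resuming would additionally require skipping `Unit_ID`s already present in the file, which this sketch does not do.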

### Phase 2: Weight Training (Local CSV -> Weights)
**Goal:** Load historic CSVs and learn the weights using the Entropy Method.
**Process:**
1.  **Load:** Read CSVs for **Training Years Only** and concatenate into `df_historic`.
2.  **Group:** Group by `Strata_ID`.
3.  **Calculations (Per Strata):**
    * **Step A: Normalization ($x_{ij}$)**
        * For Positive Indicators (NDVI, LSWI, Backscatter, FAPAR, Rain):
        $$ x_{ij} = \frac{X_{ij} - \min(X_j)}{\max(X_j) - \min(X_j)} $$
        * For Negative Indicators (Variability):
        $$ x_{ij} = \frac{\max(X_j) - X_{ij}}{\max(X_j) - \min(X_j)} $$
    * **Step B: Probability ($P_{ij}$)**
        $$ P_{ij} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}} $$
    * **Step C: Entropy ($E_j$)**
        $$ E_j = -k \sum_{i=1}^{n} (P_{ij} \times \ln(P_{ij})) $$
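        * Where constant $k$: $$ k = \frac{1}{\ln(n)} $$
        * Here $n$ is the number of observations (rows) in the stratum. Terms with $P_{ij} = 0$ are treated as contributing zero to the sum, since $\ln(0)$ is undefined and min-max normalization always produces at least one zero.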
    * **Step D: Divergence ($D_j$)**
        $$ D_j = 1 - E_j $$
    * **Step E: Final Weight ($w_j$)**
        $$ w_j = \frac{D_j}{\sum_{j=1}^{m} D_j} $$
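    * **Edge Case Handling:** If $\max(X_j) = \min(X_j)$ (zero variance), the normalization in Step A would divide by zero. Explicitly set $w_j = 0$ for that indicator (equivalently, take $D_j = 0$ so it drops out of the sum in Step E).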
4.  **Persist:** Save weights to `outputs/model/strata_weights.csv` and Min/Max scaling factors to `outputs/model/scaling_factors.csv`.
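
Steps A to E can be sketched for a single stratum as follows. This is a plain-Python illustration under stated assumptions, not the project's `chf_engine.py`: the function name, the `rows` layout (one dictionary of indicator values per observation) and the `negative` argument are hypothetical, and terms with $P_{ij} = 0$ are skipped on the convention that they contribute zero entropy.

```python
import math


def entropy_weights(rows, negative=()):
    """Entropy-method weights for one stratum.

    rows:     list of {indicator_name: value} dicts, one per observation.
    negative: names of indicators where lower is better (e.g. variability).
    Returns (weights, scaling), where scaling maps each name to (min, max).
    """
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two observations, since k = 1/ln(n)")
    k = 1.0 / math.log(n)

    divergence, scaling = {}, {}
    for j in rows[0]:
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        scaling[j] = (lo, hi)
        if hi == lo:
            divergence[j] = 0.0  # zero variance: indicator gets weight 0
            continue
        # Step A: min-max normalization, direction-aware.
        if j in negative:
            x = [(hi - v) / (hi - lo) for v in col]
        else:
            x = [(v - lo) / (hi - lo) for v in col]
        # Step B: probabilities.
        total = sum(x)
        p = [v / total for v in x]
        # Step C: entropy, skipping zero terms (0 * ln 0 taken as 0).
        e = -k * sum(pi * math.log(pi) for pi in p if pi > 0)
        # Step D: divergence.
        divergence[j] = 1.0 - e

    # Step E: normalize divergences into weights.
    d_sum = sum(divergence.values())
    weights = {j: (d / d_sum if d_sum else 0.0) for j, d in divergence.items()}
    return weights, scaling
```

In the full workflow this function would be called once per `Strata_ID` group, with the returned `weights` and `scaling` written to the two model CSVs.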

### Phase 3: Scoring (Local CSV + Weights -> Scores)
**Goal:** Apply weights to all years.
1.  **Load:** Read CSVs for **All Years** (Historic + Current).
2.  **Normalize:** Normalize the data using the **Min/Max factors saved in Phase 2**. 
    * *Crucial:* Do NOT re-calculate Min/Max on the current year data. Use the historic benchmarks.
3.  **Calculate CHF:** With $x_{ij}$ the normalized value from step 2:
    $$ CHF_i = \sum_{j=1}^{m} (w_j \times x_{ij}) $$
4.  **Persist:** Save final results to `outputs/results/chf_scores_all_years.csv`.
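
The scoring step for a single record can be sketched as below, reusing the saved min/max pairs rather than recomputing them. The function name and argument layout are hypothetical. The optional `clip` flag is an assumption added for illustration: a current-year value outside the historic range normalizes to a number outside $[0, 1]$, and this PRD does not specify whether such values should be clipped, so clipping is off by default.

```python
def chf_score(row, weights, scaling, negative=(), clip=False):
    """Weighted sum of indicators normalized with the saved historic min/max.

    row:     {indicator_name: raw value} for one unit and year.
    weights: {indicator_name: w_j} for the unit's stratum.
    scaling: {indicator_name: (min, max)} saved during weight training.
    """
    score = 0.0
    for j, w in weights.items():
        lo, hi = scaling[j]
        if w == 0.0 or hi == lo:
            continue  # zero-variance indicators carry no weight
        if j in negative:
            x = (hi - row[j]) / (hi - lo)
        else:
            x = (row[j] - lo) / (hi - lo)
        if clip:
            # Assumption, not specified in the PRD: bound x to [0, 1].
            x = min(1.0, max(0.0, x))
        score += w * x
    return score
```

A record that matches the historic best on every weighted indicator scores 1, and one that matches the historic worst scores 0; without clipping, a record that exceeds the historic range can fall outside that interval.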

## 6. Project Structure

```text
CHF-software/
├── inputs/
│   └── strata_shapefile.shp
├── outputs/
│   ├── raw_data/          # Phase 1 Output (Intermediate CSVs)
│   ├── model/             # Phase 2 Output (Weights & Scaling Factors)
│   └── results/           # Phase 3 Output (Final Scores)
├── src/
│   ├── __init__.py
│   ├── gee_utils.py       # GEE Band Math & Masking
│   ├── data_fetcher.py    # Phase 1: GEE -> CSV
│   └── chf_engine.py      # Phase 2 & 3: CSV -> Weights -> CHF
├── main.py                # Orchestrator
└── requirements.txt
```