## ⏳ Modeling `NDVI`

The variable `NDVI` is a **bounded signal** (it ranges from roughly $0 \to 1$) with **no zero-inflation**. It is a **strong global biomass proxy** but weak for composition. The modeling strategy should focus on capturing **nonlinearity** and **species effects**, not heavy transformations.

### 1. Data Analysis (What NDVI is Telling You)

* **Distribution:** `NDVI` is a clean, bounded feature ($0 \to \approx 0.91$) with no zero-inflation. It has a fairly tight central mass ($\text{median} \approx 0.69$, $75 \text{th} \approx 0.77$).
* **Key Relationships:**
    * It is the **strongest signal for overall biomass**.
    * It aligns strongly with targets like **GDM\_g, Dry\_Green\_g, and Dry\_Total\_g** (correlations range from $\sim 0.35$ to $\sim 0.65$).
    * It is **weak/inconsistent** for composition like **Dry\_Dead\_g** and **Dry\_Clover\_g**.
* **Structure:** Variability grows as `NDVI` increases, indicating **heteroscedasticity**. Species-dependent offsets and slopes are visible in the data.

### 2. Strategy: Training Data

The data strategy focuses on **split stability** in the high `NDVI` / high-biomass region where variance is highest, and not on heavy transformations.

| Goal | Action |
| :--- | :--- |
| ❌ **Preprocessing** | **No Leakage:** Compute mean/std on the **train split only** and apply those stats to validation/test sets for any standardization. Since `NDVI` is bounded and already well-scaled, standardization is primarily for model conditioning. |
| ✅ **Split Stability** | **Stratify by `NDVI` regime** (e.g., `NDVI` quantile bins) so that validation splits do not accidentally end up with too few high-NDVI / high-biomass rows. |
| ✅ **Composition** | Don't expect `NDVI` alone to solve for composition targets (like `Dry_Clover_g` or `Dry_Dead_g`). If you want to improve those, focus your strategy on **species coverage** and **tail coverage** for the respective biomass targets. |
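
The split and preprocessing advice above can be sketched as follows. This is a minimal illustration, assuming NumPy only; `stratified_split`, the bin count, the validation fraction, and the synthetic `ndvi` array are all hypothetical stand-ins rather than part of the original pipeline.

```python
import numpy as np

def stratified_split(values, n_bins=4, val_frac=0.2, seed=0):
    """Return (train_idx, val_idx), drawing val_frac from each quantile bin."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Interior quantile edges, e.g. p25 / p50 / p75 for n_bins=4.
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)  # bin id in 0..n_bins-1
    val_idx = []
    for b in np.unique(bins):
        members = rng.permutation(np.flatnonzero(bins == b))
        n_val = max(1, int(round(val_frac * len(members))))
        val_idx.extend(members[:n_val])
    val_idx = np.sort(np.array(val_idx))
    train_idx = np.setdiff1d(np.arange(len(values)), val_idx)
    return train_idx, val_idx

# Synthetic stand-in for the NDVI column (not real data).
rng = np.random.default_rng(42)
ndvi = rng.uniform(0.0, 0.91, size=200)

train_idx, val_idx = stratified_split(ndvi)

# Standardize with train-only statistics, then reuse them for validation.
mu, sigma = ndvi[train_idx].mean(), ndvi[train_idx].std()
ndvi_train = (ndvi[train_idx] - mu) / sigma
ndvi_val = (ndvi[val_idx] - mu) / sigma
```

Because `mu` and `sigma` come from the training rows only, the validation rows never influence the scaling, which is the leakage guard described in the table.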

### 3. Strategy: Modeling Approach

The core recommendation is to **keep the model simple, skip $\text{log1p}$, and focus on capturing non-linearity and species structure**.

| Goal | Action |
| :--- | :--- |
| ❌ **Transformation** | **Skip $\text{log1p}(\text{NDVI})$:** `NDVI` is bounded in roughly $[0, 1]$, and $\log(1+x)$ is nearly linear over that range, so the transformation adds no stability and isn't needed. |
| ❌ **Non-linearity** | **Use a non-linear regressor** (like Tree-based models or a non-linear regression head). This is crucial because the NDVI-biomass relationship is often "S-shaped" or non-monotonic. |
| ❌ **Species Structure** | **Include species as a feature** (**one-hot or embedding**). This often gives a direct jump in accuracy by helping the model separate the species-dependent patterns. |
| ❌ **Loss Function** | Be aware of the **heteroscedastic** nature (variance grows with `NDVI`). Standard MSE may work, but monitor residuals for patterns. |
| ❌ **Multi-task (Optional)** | **Align modeling effort with stronger targets** (like **GDM\_g** and **Dry\_Total\_g**), as `NDVI` correlations are stronger there. `NDVI` can be treated as a secondary feature for a primary target like `Dry_Green_g`. |
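
One way to "monitor residuals for patterns" is to compare the residual spread across `NDVI` quantile bins. The sketch below assumes NumPy and uses synthetic residuals whose noise grows with `NDVI`; `residual_std_by_bin` is a hypothetical helper, not an existing function.

```python
import numpy as np

def residual_std_by_bin(feature, residuals, n_bins=4):
    """Standard deviation of residuals inside each quantile bin of `feature`."""
    feature = np.asarray(feature, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(feature, edges)
    return np.array([residuals[bins == b].std() for b in range(n_bins)])

# Synthetic example: noise scale grows with NDVI (assumed, for illustration).
rng = np.random.default_rng(0)
ndvi = rng.uniform(0.0, 0.91, size=2000)
residuals = rng.normal(0.0, 1.0, size=2000) * (0.5 + 5.0 * ndvi)

spread = residual_std_by_bin(ndvi, residuals)
print(spread)  # a rising sequence suggests heteroscedasticity
```

If the spread rises steadily from the lowest bin to the highest, plain MSE is being dominated by the high-`NDVI` rows, and sample weights or a robust loss may be worth trying.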


## ⏳ Modeling `Height_Ave_cm`

The data for `Height_Ave_cm` is characterized by a high number of small values ($\approx 80\%$ are $0-8 \text{ cm}$) and a small number of very large outliers (the "tail"). These **top 10% of values drive $\approx 84.7\%$ of the total variability (SST)**, meaning model accuracy on these tall plants dictates the final score (weighted $R^2$).

### 1. Data Analysis (What Height is Telling You)

* **Distribution:** Not zero-inflated, but severely **right-skewed** and **tail-dominated**.
* **Challenge:** A handful of tall samples (the top $10\%$) account for most of the variation the model needs to explain.
* **Relationships:** Height is **strongly correlated** with Green and Total Biomass (e.g., Spearman $\sim 0.80$ with `Dry_Green_g`) but is a weak proxy for composition like Clover or Dead matter.
* **Structure:** Species effects are clearly visible, suggesting that different species account for different height/biomass relationships.
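
The "top 10% drives most of the SST" figure can be reproduced with a short calculation. The sketch below assumes NumPy and computes the plain (unweighted) share for a single column; `top_share_of_sst` is a hypothetical helper, and the toy values are not from the dataset.

```python
import numpy as np

def top_share_of_sst(y, top_frac=0.10):
    """Fraction of sum((y - mean)^2) contributed by the largest top_frac of y."""
    y = np.asarray(y, dtype=float)
    contrib = (y - y.mean()) ** 2          # per-sample SST contribution
    n_top = max(1, int(round(top_frac * len(y))))
    top_idx = np.argsort(y)[-n_top:]       # indices of the largest values
    return contrib[top_idx].sum() / contrib.sum()

# Toy column: nine small values and one tall outlier.
toy = [1] * 9 + [11]
print(top_share_of_sst(toy))
```

Running the same function on each target gives a quick way to compare how tail-dominated the targets are before choosing sampling weights.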

### 2. Strategy: Training Data

The data strategy must prioritize **tail coverage** and **split stability** over zero-handling.

| Goal | Action |
| :--- | :--- |
| ❌ **Split Stability** | **Stratify** your data splits by Height regimes (e.g., bulk $\le p90$ vs. tail $> p90$). This prevents lucky/unlucky splits that bias your validation score. |
| ❌ **Tail Coverage** | Use **mild upweighting or oversampling** for the tall points ($> p90$ or $> p95$), such as $1.5\times$ to $3\times$. This makes the model see the score-critical points more often. |
| ❌ **Data Quality** | Sanity-check the extremely tall labels (max $\approx 70 \text{ cm}$) for possible unit or entry errors. |
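
The mild tail upweighting can be expressed as per-sample weights. This is a sketch under stated assumptions: NumPy only, a single `quantile` cut-off, and a hypothetical `tail_sample_weights` helper; the `boost` of 2.0 is one value inside the $1.5\times$ to $3\times$ range suggested above.

```python
import numpy as np

def tail_sample_weights(y, quantile=0.90, boost=2.0):
    """Weight rows above the given quantile by `boost`, all others by 1."""
    y = np.asarray(y, dtype=float)
    threshold = np.quantile(y, quantile)
    weights = np.where(y > threshold, boost, 1.0)
    return weights / weights.mean()  # normalise so the average weight is 1

# Toy heights 1..20; the two tallest rows fall above p90.
heights = np.arange(1.0, 21.0)
w = tail_sample_weights(heights)

# The same weights can drive oversampling instead of loss weighting.
rng = np.random.default_rng(0)
batch_idx = rng.choice(len(heights), size=8, p=w / w.sum())
```

Normalising to a mean of 1 keeps the overall loss scale comparable to the unweighted case, so learning-rate settings do not need to change.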

### 3. Strategy: Modeling Approach

Treat `Height_Ave_cm` as a classic **heavy-tail regression target** using a **single regressor**, rather than the two-part model used for zero-inflation.

| Goal | Action |
| :--- | :--- |
| ✅ **Handle Skew** | **Bake-off:** Try training on **raw Height** versus **$\text{log1p}(\text{Height})$**. If using $\text{log1p}$, remember to apply $\text{expm1}(\text{pred})$ and clip the output to $\ge 0$. |
| ❌ **Focus on Tail** | If plain MSE underfits the tail, switch to a more robust loss function like **Huber** or **SmoothL1**, or simply use **MSE with the mild tail sample weights** recommended above. |
| ❌ **Species Structure** | **Include species as a feature** (one-hot or embedding) to explicitly account for the species-dependent height patterns. |
| ❌ **Output Constraints** | Always **clip predictions at 0** and consider a soft cap for very high predictions if the model tends to "explode." |
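
The transform and output constraints can be sketched as a pair of helpers. This assumes NumPy; `to_log_space`, `from_log_space`, and the optional `ceiling` argument are hypothetical names, and the ceiling shown is a simple hard cap used as a stand-in for the soft cap mentioned above.

```python
import numpy as np

def to_log_space(y):
    """Training target: log(1 + y)."""
    return np.log1p(np.asarray(y, dtype=float))

def from_log_space(pred_log, ceiling=None):
    """Invert with expm1, clip at 0, and optionally apply a hard ceiling."""
    y = np.expm1(np.asarray(pred_log, dtype=float))
    y = np.clip(y, 0.0, None)
    if ceiling is not None:
        y = np.minimum(y, ceiling)
    return y

heights = np.array([0.0, 3.0, 70.0])
round_trip = from_log_space(to_log_space(heights))
```

A negative log-space prediction maps to a negative value after `expm1`, which is why the clip at 0 comes after the inverse transform rather than before it.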

The overall message is to focus your model's attention on the few, highly-influential tall samples to optimize your final metric.

## ⏳ Modeling with Species Data

Species data is **imbalanced by count** and even more sharply **concentrated by weighted variance (SST)**. The overall metric is therefore driven by a small number of species, and those species should dictate your data split and training strategy.

### 1. Data Analysis (What Species is Telling You)

* **Count Imbalance:** The dataset is dominated by a few species:
    * `Ryegrass_Clover` is the largest block at $27.5\%$ of samples ($98$ samples).
    * The top four species groups account for over $60\%$ of the dataset by count.
* **Weighted SST (Variance) Concentration:** The overall score is dominated by species that have high variance or extreme values.
    * **Only 3 species contribute $\approx 59.4\%$ of the total weighted SST**:
        * `Fescue`: $27.8\%$ SST (from only $7.9\%$ of data).
        * `Ryegrass_Clover`: $18.8\%$ SST (from $27.5\%$ of data).
        * `Phalaris`: $12.8\%$ SST (from only $2.2\%$ of data).
    * The **top 7 species contribute $\approx 92.6\%$ of the weighted SST**.
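
A per-species SST share like the ones listed above can be computed as follows. This sketch assumes NumPy, handles a single target column, and ignores the per-target metric weights; `sst_share_by_group` and the species labels `"A"` and `"B"` are hypothetical.

```python
import numpy as np
from collections import defaultdict

def sst_share_by_group(y, groups):
    """Share of sum((y - global_mean)^2) that falls in each group."""
    y = np.asarray(y, dtype=float)
    contrib = (y - y.mean()) ** 2      # deviation from the global mean
    totals = defaultdict(float)
    for g, c in zip(groups, contrib):
        totals[g] += c
    total = contrib.sum()
    return {g: v / total for g, v in totals.items()}

# Toy data: species "A" is common but flat, species "B" is rare but extreme.
y = [10, 10, 10, 10, 50, 70]
species = ["A", "A", "A", "A", "B", "B"]
shares = sst_share_by_group(y, species)
print(shares)
```

In the toy data, species "B" holds a third of the rows but the larger share of SST, which mirrors the count-versus-variance mismatch described for `Fescue` and `Phalaris`.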

### 2. Strategy: Training Data

The data split and sampling strategy must focus on reliably representing the species that dominate the SST, not just the count.

| Goal | Action |
| :--- | :--- |
| **Split Stability** | **Do NOT** simply match the count distribution. Instead, ensure your **train/validation/test** splits reliably contain the **high-SST species**. |
| **Critical Representation** | **Always represent** `Fescue`, `Ryegrass_Clover`, and `Phalaris`. You should also protect `Clover`, `Lucerne`, and `Phalaris_Clover` to cover over $90\%$ of the SST mass. |
| **Training Bias** | Keep the dataset intact, but if using sampling techniques (like mini-batch sampling), you can **mildly bias** toward species with high SST share (especially `Fescue` and `Phalaris`) because they have a disproportionate impact on the weighted $R^2$. |

### 3. Strategy: Modeling Approach

Species should be incorporated explicitly into the model, as the heatmap shows they have different mean $\text{log1p}$ biomass values across targets.

| Goal | Action |
| :--- | :--- |
| **Capture Structure** | **Include species as an input feature** (e.g., one-hot encoding or an embedding layer). This allows the model to learn species-dependent offsets and slopes, separating the visible "species clouds". |
| **Loss Weighting** | Use the **natural data distribution** (all rows), with the metric's per-target weights applied in the loss. |
| **Advanced Weighting (Optional)** | If you need to further bias the model toward "what matters for $R^2$," use per-sample weighted SST or species-level importance ($\propto \text{SST share}$) in your mini-batch sampling probabilities. |
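
The optional sampling bias can be sketched as a blend between uniform row sampling and species-level SST share. This assumes NumPy; `biased_sampling_probs`, the `strength` parameter, and the share values are hypothetical, and every species in `groups` is assumed to have an entry in `group_share`.

```python
import numpy as np
from collections import Counter

def biased_sampling_probs(groups, group_share, strength=0.3):
    """Row sampling probabilities: (1 - strength) uniform + strength by SST share.

    strength=0 gives uniform sampling; strength=1 draws each species in
    proportion to its share, split evenly across that species' rows.
    """
    groups = list(groups)
    n = len(groups)
    counts = Counter(groups)
    by_share = np.array([group_share[g] / counts[g] for g in groups], dtype=float)
    by_share /= by_share.sum()
    uniform = np.full(n, 1.0 / n)
    return (1.0 - strength) * uniform + strength * by_share

# Made-up shares for two placeholder species (not values from the dataset).
species = ["species_a"] * 2 + ["species_b"] * 8
probs = biased_sampling_probs(species, {"species_a": 0.6, "species_b": 0.4})

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(species), size=4, p=probs)
```

Keeping `strength` well below 1 preserves most of the natural distribution, in line with the "mild bias" recommendation.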

## ⏳ Modeling `Dry_Clover_g`

The variable `Dry_Clover_g` behaves like a **two-regime target** (zero vs. positive) and is dominated by its rare, high-value tail. The goal is to maximize the weighted $R^2$ by reducing error (SSE) on both the mass of zeros and the rare, high-SST tail.

### 1. Data Analysis (What Clover is Telling You)

* **Zero-Inflation:** $37.9\%$ of samples are **exactly 0**. This is a **separate regime** that a model must explicitly learn.
* **Tail Dominance (High SST):** The **top $10\%$ of samples contribute $78.3\%$ of the target's weighted SST** (variance). This means tail accuracy is highly score-critical.
* **Typical Values:** The bulk of the non-zero data is small: $75\%$ of samples are $\le 7.26 \text{ g}$.

### 2. Strategy: Training Data

The data strategy must ensure stable splits that reliably cover the zero regime, the low-value bulk, and the high-SST tail.

| Goal | Action |
| :--- | :--- |
| **Split Stability** | Create **stratified splits** by clover regime to avoid validation sets missing the tail (high-SST) or missing the zero/low-value bulk. |
| **Stratification** | Define bins like: **Zero** ($y=0$), **Low** ($0 < y \le p75$ or $p90$), and **Tail** ($y \ge p98$ or $p95$). |
| **Tail Coverage** | **Oversample tail rows** (e.g., $2\times$ to $5\times$) to ensure the model sees these high-impact points frequently enough to learn them, but **do not let them dominate batches**. |
| **Species Insight** | Species can be mildly upweighted if they dominate `Dry_Clover_g` weighted SST, but **stratifying by species is the primary benefit** for split stability. |
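
The regime bins can be turned into stratification labels with a few lines. This sketch assumes NumPy and simplifies to a single tail threshold taken over all rows, so every positive row is either "low" or "tail"; `clover_regime` and the default `tail_q` are hypothetical choices.

```python
import numpy as np

def clover_regime(y, tail_q=0.90):
    """Label each row as 'zero', 'low', or 'tail' for stratified splitting."""
    y = np.asarray(y, dtype=float)
    threshold = np.quantile(y, tail_q)   # quantile over all rows (assumption)
    regime = np.full(len(y), "low", dtype=object)
    regime[y == 0] = "zero"
    regime[y > threshold] = "tail"
    return regime

# Toy clover masses: four zeros, five small values, one large value.
clover = [0, 0, 0, 0, 1, 2, 3, 4, 5, 60]
labels = clover_regime(clover)
print(list(labels))
```

The resulting labels can be passed to any stratified splitter, optionally combined with the species label to stratify on both at once.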

### 3. Strategy: Modeling Approach (The Recommended Model)

A single regressor (even with $\text{log1p}$) is likely to fail by either predicting small positives when the truth is 0, or by underfitting the tail.

The **Recommended Model** is the **Two-Part (Zero-Inflated) Approach**:

| Part | Role | Target | Formula / PyTorch Head |
| :--- | :--- | :--- | :--- |
| **1. Classifier Head** | Learns the **presence/absence boundary**. | Predict $p(y > 0)$ | $p$ |
| **2. Regressor Head** | Focuses **only on non-zero rows** to predict magnitude. | Predict $\text{log1p}(y)$ | $\hat{y}_{\text{log}} \approx \log(1+y)$ |
| **Inference** | Uses the **Expected Value** to combine probability and magnitude. | $\hat{y}$ | $\hat{y} = p \cdot \text{expm1}(\hat{y}_{\text{log}})$ |

This approach maximizes the weighted $R^2$ because:
* It reduces **SSE on the $37.9\%$ zero mass** by learning the presence/absence boundary.
* It reduces **SSE on the rare tail** ($78.3\%$ of SST) because the regressor is not distracted by the tiny, zero/low-value points.
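
The inference rule in the table can be sketched independently of any particular network. This assumes NumPy and that the two heads already produce a probability `p_positive` and a log-space magnitude `pred_log`; both function names are hypothetical.

```python
import numpy as np

def regressor_training_rows(y):
    """Select non-zero rows and return (mask, log1p target) for the regressor."""
    y = np.asarray(y, dtype=float)
    mask = y > 0
    return mask, np.log1p(y[mask])

def two_part_predict(p_positive, pred_log):
    """Combine heads: y_hat = p * expm1(y_hat_log), clipped to be non-negative."""
    p = np.clip(np.asarray(p_positive, dtype=float), 0.0, 1.0)
    magnitude = np.clip(np.expm1(np.asarray(pred_log, dtype=float)), 0.0, None)
    return p * magnitude

# Two toy rows: one almost surely zero, one confidently positive.
p = np.array([0.02, 0.9])
pred_log = np.log1p(np.array([5.0, 20.0]))
y_hat = two_part_predict(p, pred_log)
print(y_hat)
```

The first row shows the benefit of the classifier head: even though the regressor proposes 5 g, the low presence probability pulls the final prediction close to zero.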


## ⏳ Modeling `Dry_Dead_g`

The variable `Dry_Dead_g` is dominated by its heavy tail, which drives $\approx 60\%$ of the total weighted variability (SST). However, its low zero rate ($\approx 11\%$) means a **single regressor model** with a transformation and special attention to the tail is the preferred approach.

### 1. Data Analysis (What Dead Mass is Telling You)

* **Zero-Inflation:** Only **$\approx 11.0\%$ are exactly 0**. This is a "mostly non-zero dead mass" distribution, so a **classifier head is not necessary**.
* **Tail Dominance (High SST):** The **top $10\%$ of samples contribute $\approx 60.1\%$ of the target's weighted SST**. This means tail accuracy is highly score-critical, similar to Height.
* **Variability:** The mean ($12.078$) is almost identical to the standard deviation ($12.403$), resulting in a Coefficient of Variation $(\text{CV})$ of $1.027$. This high CV indicates **extreme spread** and skewness.

### 2. Strategy: Training Data

The training strategy must focus on ensuring the model accurately learns the highly impactful (high-SST) tail region.

| Goal | Action |
| :--- | :--- |
| **Split Stability** | **Stratify your splits** by $\text{Dry\_Dead\_g}$ quantiles/regimes (e.g., bulk $\le p90$ vs. tail $> p90$). This prevents random splits from leaving the critical tail samples out of the training set. |
| **Tail Coverage** | Use **mild tail oversampling** (e.g., $2\times$ oversample for $> p90$ or $> p95$) to ensure the model sees the high-SST rows often enough to learn them well. **Keep zeros present** since they are only $11\%$. |
| **Species Effects** | If species effects are important, stratify by species *within* the top-SST regimes for stable representation. |

### 3. Strategy: Modeling Approach

The core recommendation is to use a **single regressor** combined with a log transformation to stabilize the severe skew.

| Goal | Action |
| :--- | :--- |
| **Regression Model** | Use a **Single Regressor** (not a two-part model). |
| **Transformation** | Train the regressor on the transformed target: $\mathbf{y}_{\text{log}} = \text{log1p}(\text{Dry\_Dead\_g})$. Inference is $\text{expm1}(\hat{y}_{\text{log}})$ clipped to $\ge 0$. |
| **Loss/Tail Protection** | Since the top $10\%$ drives $\approx 60\%$ of the SST, use **Huber loss** (or $\text{SmoothL1}$), **or** use **MSE with the mild tail upweighting/oversampling** recommended above. |
| **Zero Calibration (Optional)** | If validation shows systematic prediction of small positive values when the truth is $0$ ("false positives"), you can add a **light zero-calibration rule** (thresholding) as a final post-process step. |
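
A light zero-calibration rule can be as simple as a threshold chosen on validation data. The sketch below assumes NumPy; `zero_calibrate`, `pick_threshold`, the candidate grid, and the toy arrays are hypothetical, and the selection criterion used here is plain SSE.

```python
import numpy as np

def zero_calibrate(pred, threshold):
    """Set predictions below `threshold` to exactly 0."""
    pred = np.asarray(pred, dtype=float)
    return np.where(pred < threshold, 0.0, pred)

def pick_threshold(y_val, pred_val, candidates):
    """Return the candidate threshold with the lowest validation SSE."""
    y_val = np.asarray(y_val, dtype=float)
    sse = [np.sum((zero_calibrate(pred_val, t) - y_val) ** 2) for t in candidates]
    return candidates[int(np.argmin(sse))]  # first minimum wins on ties

# Toy validation set with small false positives on the true zeros.
y_val = [0.0, 0.0, 0.0, 5.0, 10.0]
pred_val = [0.3, 0.4, 0.2, 4.5, 11.0]
best = pick_threshold(y_val, pred_val, candidates=[0.0, 0.5, 1.0, 5.0])
print(best)
```

The threshold should be picked on validation rows only and then frozen, for the same leakage reasons that apply to any other preprocessing statistic.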


## ⏳ Modeling `Dry_Dead_g`

The variable `Dry_Dead_g` is a **"tail matters" target**, dominated by a heavy tail that drives $\approx 60\%$ of the total variability (SST). The low zero rate of $\approx 11\%$ means the most effective strategy is a **Single Regressor** that prioritizes stabilizing the skew and accurately predicting the tail.

### 1. Data Analysis (What Dead Mass is Telling You)

* **Zero-Inflation:** Only $\approx 11.0\%$ of samples are exactly 0. This is **not enough to justify a complex two-part model** by default.
* **Tail Dominance (High SST):** The **top $10\%$ of samples contribute $\approx 60.1\%$ of the target's weighted SST**. This makes tail accuracy the primary lever for improving the weighted $R^2$ score.
* **Variability/Skew:** The data is highly spread out ($\text{CV}=1.027$) and extremely right-skewed, with a max value of $83.84 \text{ g}$.

### 2. Strategy: Training Data

The data strategy budget should focus on **tail coverage** and **split stability**.

| Goal | Action |
| :--- | :--- |
| **Split Stability** | **Stratify your splits** by $\text{Dry\_Dead\_g}$ quantiles (e.g., bulk $\le p90$ vs. tail $> p90$). This ensures the critical $60\%$ of SST is reliably present in both train and validation sets. |
| **Tail Coverage** | Use **mild oversampling** for tail rows (e.g., $1.5\times \text{ to } 3\times$ for samples $\ge p90/p95$). This makes the model focus on the score-critical points more often without destroying the bulk fit. |
| **Data Quality** | **Sanity-check** the labels for the high-end tail values ($\sim 83.8 \text{ g}$) to confirm they are not unit errors, as they heavily influence the score stability. |

### 3. Strategy: Modeling Approach

The core recommendation is to use a **single regressor** with a log transformation for stability.

| Goal | Action |
| :--- | :--- |
| **Regression Model** | Use a **Single Regressor**. |
| **Transformation** | Train the regressor on the transformed target: $\mathbf{y}_{\text{log}} = \text{log1p}(\text{Dry\_Dead\_g})$. Inference is $\text{expm1}(\hat{y}_{\text{log}})$ clipped to $\ge 0$. |
| **Loss/Tail Protection** | Choose a loss that handles the extreme values: use **Huber loss** or $\text{SmoothL1}$ in log-space, **or** use **MSE with the mild tail sample weighting** (oversampling) from the training strategy. |
| **Zero Calibration (Conditional)** | Only consider a Two-Part Model (or simple post-process thresholding) if diagnostics show the current model **keeps predicting small positives when the truth is 0**. |
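
For reference, a Huber loss evaluated in log-space can be sketched as below. This assumes NumPy rather than a deep-learning framework; `huber`, `weighted_log_huber`, and the default `delta` of 1.0 are hypothetical choices.

```python
import numpy as np

def huber(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

def weighted_log_huber(y_true, pred_log, weights=None, delta=1.0):
    """Mean Huber loss between log-space predictions and log1p(y_true)."""
    target_log = np.log1p(np.asarray(y_true, dtype=float))
    residual = np.asarray(pred_log, dtype=float) - target_log
    return float(np.average(huber(residual, delta), weights=weights))

print(huber([0.5, 3.0]))  # small residual is squared, large one grows linearly
```

Passing tail sample weights through the `weights` argument combines both options in the table: a robust loss shape plus extra emphasis on the high-SST rows.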


## ⏳ Modeling `Dry_Total_g`

`Dry_Total_g` is your **most important target** (highest weight). It is characterized by a moderate median and a very **long, massive tail** that contains the majority of the variance signal. The modeling strategy is a **Single Regressor** focused on aggressive tail handling.

### 1. Data Analysis (What Total Mass is Telling You)

* **Importance:** It is your highest-weight target ($w=0.5$), making it the biggest influence on your final weighted $R^2$ score.
* **Distribution:** It is **not zero-inflated** (0% zeros). It is centered in a broad mid-range (median $\approx 40.4\text{ g}$) with a massive right tail.
* **Tail Dominance (High SST):** The **top $10\%$ of samples contribute $57.3\%$ of this target's weighted SST**. The error on these extreme values is the largest driver of your overall score.

### 2. Strategy: Training Data

The entire data budget should be spent on **split stability** and **optimizing the model for the tail**.

| Goal | Action |
| :--- | :--- |
| **Split Stability** | **Stratify your splits** by $\text{Dry\_Total\_g}$ bins. Ensure both train and validation sets reliably contain the **high-SST tail rows** (e.g., those $\ge p90$ or $p95$ regions). |
| **Tail Coverage** | **Oversample high total-mass records** (e.g., $2\times$ for samples $\ge p95$). This is the most direct way to focus your model on the data that dominates the $R^2$ denominator (SST) and is most damaging if you mispredict (SSE). |
| **Model Proxy** | Use the sample weights from an SST proxy (like $\propto (y-\bar{y})^2$) to directly target the samples that contribute most to the variance. |
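
The SST-proxy weights can be sketched as follows. This assumes NumPy and a target with non-zero variance; `sst_proxy_weights` is a hypothetical helper, and the `floor` is an added assumption so that rows near the mean keep a small non-zero weight.

```python
import numpy as np

def sst_proxy_weights(y, floor=0.1):
    """Sample weights proportional to (y - mean)^2, floored and mean-normalised."""
    y = np.asarray(y, dtype=float)
    w = (y - y.mean()) ** 2
    w = w / w.mean()            # assumes y is not constant
    w = np.maximum(w, floor)    # keep near-mean rows from vanishing
    return w / w.mean()

# Toy totals with one very heavy sample.
totals = [10.0, 40.0, 40.0, 40.0, 170.0]
w = sst_proxy_weights(totals)
print(w)
```

Because the weights grow quadratically with distance from the mean, they are far more aggressive than a fixed $2\times$ boost, so it is worth checking that the bulk fit does not degrade.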

### 3. Strategy: Modeling Approach

The recommendation is a **Single Regressor** on a transformed scale, with emphasis on loss or weighting to protect the tail.

| Goal | Action |
| :--- | :--- |
| **Regression Model** | Use a **Single Regressor** (not a two-part model). |
| **Transformation** | Train the regressor on $\mathbf{y}_{\text{log}} = \text{log1p}(\text{Dry\_Total\_g})$. Inference is $\text{expm1}(\hat{y}_{\text{log}})$ clipped to $\ge 0$. This stabilizes the severe skew. |
| **Loss/Tail Protection** | Use a **simple regressor with strong emphasis on tail samples** (either via sampling/oversampling or per-sample weighting). This is often more effective than complex architectural changes. |

## ⏳ Modeling `GDM_g`

`GDM_g` is identified as a **high-impact target** ($w=0.2$) with **no zero-inflation**. Its challenge is the **significant right tail**, where the top $10\%$ of samples contribute $59.2\%$ of the total weighted variability (SST). This makes tail accuracy the biggest lever for score improvement.

### 1. Data Analysis (What Green Dry Matter is Telling You)

* **Zero-Inflation:** $0.0\%$ of samples are exactly 0. Therefore, no "presence/absence" regime exists, and a two-part model is unnecessary.
* **Tail Dominance (High SST):** The **top $10\%$ of samples contribute $59.2\%$ of this target's weighted SST**. This means errors in the high-GDM region ($> 83.20 \text{ g}$) will overwhelmingly affect the final weighted $R^2$ score.
* **Distribution:** The center is mid-range ($\text{median} \approx 27.11 \text{ g}$). Although there's a mass near the low end ($32.9\%$ in $0-20 \text{ g}$), it doesn't create a separate regime like the zero-inflated targets.

### 2. Strategy: Training Data

The data strategy budget should be spent on **tail coverage** and **split stability**.

| Goal | Action |
| :--- | :--- |
| **Split Stability** | **Stratify your splits** by $\text{GDM\_g}$ quantiles (e.g., bulk $\le p90$ vs. tail $> p90$). This ensures the $59.2\%$ of SST in the tail is always properly represented in both the training and validation sets. |
| **Tail Coverage** | **Oversample** (or use per-sample weighting on) the high-GDM rows (e.g., $1.5\times$ to $3\times$ factor for samples $\ge p90/p95$). This forces the model to prioritize fitting the variance-driving points. |
| **Data Quality** | Sanity-check the highest, most influential outlier labels ($\sim 157.98 \text{ g}$) to confirm they are not unit/entry errors, as they are crucial for score stability. |

### 3. Strategy: Modeling Approach

The recommended approach is a **Single Regressor** that uses a log transformation to handle skew and a robust loss function to protect the tail.

| Goal | Action |
| :--- | :--- |
| **Regression Model** | Use a **Single Regressor** (not a two-part model). |
| **Transformation** | Train on the transformed target: $\mathbf{y}_{\text{log}} = \text{log1p}(\text{GDM\_g})$. Inference is $\text{expm1}(\hat{y}_{\text{log}})$ clipped to $\ge 0$. This reduces variance and makes the tail easier to fit. |
| **Loss/Tail Protection** | Use a robust regression loss like **Huber** or $\text{SmoothL}1$ on the log scale, or use **MSE with the strong tail weighting/oversampling**. |
| **Tail Bias** | If the model shows consistent underprediction for the upper GDM regime, add a small per-sample weight boost above the $p90/p95$ threshold to increase tail sampling frequency. |
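
Whether the model underpredicts the upper regime can be checked with a simple tail-bias diagnostic. This sketch assumes NumPy and original-scale predictions; `tail_bias` and the toy arrays are hypothetical.

```python
import numpy as np

def tail_bias(y_true, pred, quantile=0.90):
    """Mean of (pred - y_true) over rows at or above the given quantile."""
    y_true = np.asarray(y_true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    tail = y_true >= np.quantile(y_true, quantile)
    return float(np.mean(pred[tail] - y_true[tail]))

# Toy example: predictions are perfect except the largest value is undershot.
y_true = np.arange(10.0, 101.0, 10.0)
pred = y_true.copy()
pred[-1] = 80.0
bias = tail_bias(y_true, pred)
print(bias)
```

A clearly negative value on validation data is the signal to raise the tail weight; a value near zero suggests the current weighting is already sufficient.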