Modeling (parametric, non-parametric, and ML) to estimate citrus tree parameters from SAR variables

**Objective (Y)**: estimate biophysical parameters of citrus trees measured in the field (tree height, canopy diameter, and biomass).

**Predictors (X)**: variables derived from Sentinel-1 SAR

# 1 - Data loading and preparation

In [1]:
from utils.model_prep import load_dataset
data_path = '../../data/processed/SARCitric.csv'
df = load_dataset(data_path)
# Rename the biomass column to a shorter name
df = df.rename(columns={"biomasa_total_m3_kg": "biomass"})
df.head()

Unnamed: 0,id_point,datetime,tree_height,canopy_diam,biomass,Sigma0_RATIO_VH_VV,Gamma0_RATIO_VH_VV
0,O1-1,2025-02-20 15:23:00,4.16625,4.459995,49.540402,0.810841,0.801533
1,O1-13,2025-02-20 15:36:00,4.415,4.846434,60.126747,0.810891,0.801586
2,O1-17,2025-02-20 16:06:00,4.16,4.258532,48.985622,0.810909,0.801612
3,O1-25,2025-02-20 16:15:00,4.86,4.923015,68.788006,0.810925,0.801633
4,O1-9,2025-02-20 15:30:00,3.255,4.61076,48.888886,0.810869,0.801555


# 2 - Modeling
## Linear

In [2]:
from utils.model_prep import select_xy, random_splits

target_col = ['tree_height', 'canopy_diam', 'biomass']
feature_cols = ['Sigma0_RATIO_VH_VV', 'Gamma0_RATIO_VH_VV']
X, y = select_xy(df, target_col, feature_cols)
num_features = list(X.columns)

# Split into training/validation sets
splits = random_splits(X, y, val_size=0.3, random_state=42)
(X_train, y_train) = splits["train"]
(X_val,   y_val)   = splits["val"]
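The internals of `random_splits` are not shown in this notebook. As a rough illustration of what a seeded hold-out split with `val_size=0.3` involves, here is a minimal NumPy sketch. The function name `random_split_indices` and the toy arrays are hypothetical, and the real helper may shuffle or round differently.

```python
import numpy as np

def random_split_indices(n_samples, val_size=0.3, random_state=42):
    """Return (train_idx, val_idx) for a shuffled hold-out split.

    Hypothetical stand-in for the notebook's `random_splits` helper.
    """
    rng = np.random.default_rng(random_state)
    order = rng.permutation(n_samples)           # shuffled row positions
    n_val = int(round(val_size * n_samples))     # rows reserved for validation
    return order[n_val:], order[:n_val]

# Synthetic data standing in for the SAR features and field targets.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

train_idx, val_idx = random_split_indices(len(X_toy), val_size=0.3, random_state=42)
X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]
X_va, y_va = X_toy[val_idx], y_toy[val_idx]
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the same rows in the validation set across the four model families, so their validation metrics can be compared directly.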

In [3]:
out_dir = '../../results/canopy/artifacts_linear'

from models.linear import run_baseline_from_splits
# Run the linear model for each target variable
for target in target_col:
    print(f"Running LR for: {target}")
    metrics_LR = run_baseline_from_splits(
        X_train, y_train[target],
        X_val, y_val[target],
        target,
        num_features,
        model_type="ridge",
        alpha=1.0,
        outdir=out_dir
    )
    print(f"Results for {target}: {metrics_LR}")

Running LR for: tree_height
Results for tree_height: {'train': {'RMSE': 0.8316349452442439, 'MAE': 0.7163199285737488, 'R2': 0.15539351218728414, 'rRMSE(%)': 30.122800758965994}, 'val': {'RMSE': 0.9249952112902957, 'MAE': 0.7763487672883226, 'R2': 0.08919663580712123, 'rRMSE(%)': 33.15738348831116}}
Running LR for: canopy_diam
Results for canopy_diam: {'train': {'RMSE': 0.7883531938874485, 'MAE': 0.6483857104417027, 'R2': 0.09845064109481627, 'rRMSE(%)': 23.150391718722716}, 'val': {'RMSE': 1.0131042016658034, 'MAE': 0.7749977494864083, 'R2': 0.1621494843231298, 'rRMSE(%)': 29.968426692988814}}
Running LR for: biomass
Results for biomass: {'train': {'RMSE': 16.9159189768231, 'MAE': 13.99742386464722, 'R2': 0.14398351069209858, 'rRMSE(%)': 71.2122214931305}, 'val': {'RMSE': 19.262648248595724, 'MAE': 16.314296846189105, 'R2': 0.13051739816144425, 'rRMSE(%)': 77.23388217643567}}
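The code inside `run_baseline_from_splits` is not shown, so the following is only a sketch of how a ridge fit with `alpha=1.0` and the four reported metrics could be computed. It makes two assumptions: features are standardized before fitting, and `rRMSE(%)` is the RMSE divided by the mean of the observed target. The second assumption fits the reported numbers: for `tree_height` on the training set, 0.8316 / 0.3012 gives a normalizer of about 2.76, the value LightGBM later logs as its starting score. The names `fit_ridge` and `regression_metrics` and the synthetic data are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, R2 and relative RMSE (percent of the observed mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    # Assumption: rRMSE is normalized by the mean of the observed values.
    rrmse = 100.0 * rmse / float(y_true.mean())
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "rRMSE(%)": rrmse}

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge on standardized features; the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)              # guard against constant columns
    Z = (X - mu) / sd
    y_mean = y.mean()
    # Solve (Z'Z + alpha * I) beta = Z'(y - mean(y)).
    A = Z.T @ Z + alpha * np.eye(Z.shape[1])
    coef = np.linalg.solve(A, Z.T @ (y - y_mean))

    def predict(X_new):
        Z_new = (np.asarray(X_new, dtype=float) - mu) / sd
        return y_mean + Z_new @ coef

    return predict

# Synthetic demo data (not the citrus dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 5.0 + 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=100)

predict = fit_ridge(X_demo[:70], y_demo[:70], alpha=1.0)
print({
    "train": regression_metrics(y_demo[:70], predict(X_demo[:70])),
    "val": regression_metrics(y_demo[70:], predict(X_demo[70:])),
})
```

Because rRMSE divides by the target mean, it allows comparison across targets with different units, which is why biomass can show a much larger relative error than height even when both R2 values are similar.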



## Non-parametric

In [6]:
from models.nonparam import run_nw_from_splits
import numpy as np
outdir = '../../results/canopy/artifacts_nadaraya-watson'
bandwidth_grid = np.logspace(-3, 1, 50)  # 50 log-spaced bandwidths from 1e-3 to 10

# Run the Nadaraya-Watson model for each target variable
for target in target_col:
    print(f"Running NW for: {target}")
    metrics_NW = run_nw_from_splits(
        X_train, y_train[target],
        X_val, y_val[target],
        target,
        num_features,
        bandwidth_grid=bandwidth_grid,
        reg_type="lc",
        outdir=outdir
    )
    print(f"Results for {target}: {metrics_NW}")

Running NW for: tree_height
Results for tree_height: {'train': {'RMSE': 0.38275776397797734, 'MAE': 0.3069636858329997, 'R2': 0.8210890298752966, 'rRMSE(%)': 13.863938653840192}, 'val': {'RMSE': 0.44420111817516195, 'MAE': 0.32365500154478305, 'R2': 0.7899585768646205, 'rRMSE(%)': 15.922835752549792}}
Running NW for: canopy_diam
Results for canopy_diam: {'train': {'RMSE': 0.47355745119295534, 'MAE': 0.38142876740692433, 'R2': 0.6746931601115966, 'rRMSE(%)': 13.906254939333703}, 'val': {'RMSE': 0.9528056552879783, 'MAE': 0.6334175966008182, 'R2': 0.25891681275914413, 'rRMSE(%)': 28.184747813909667}}
Running NW for: biomass
Results for biomass: {'train': {'RMSE': 8.573333313101772, 'MAE': 6.208818253341033, 'R2': 0.7801176575769329, 'rRMSE(%)': 36.09180864861873}, 'val': {'RMSE': 11.81823690664671, 'MAE': 8.459720194300623, 'R2': 0.6727088281546412, 'rRMSE(%)': 47.38540126992651}}
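The value `reg_type="lc"` conventionally denotes the local-constant (Nadaraya-Watson) estimator, as in statsmodels' `KernelReg`. How `run_nw_from_splits` actually selects the bandwidth from `bandwidth_grid` is not shown, so the sketch below assumes a Gaussian kernel with a single shared bandwidth chosen by leave-one-out cross-validation. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def nw_predict(X_train, y_train, X_query, bandwidth):
    """Local-constant (Nadaraya-Watson) prediction with a Gaussian kernel."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Squared Euclidean distance between every query and training point.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    logw = -0.5 * d2 / bandwidth ** 2
    # Shift each row by its maximum: this cancels in the ratio and avoids underflow.
    logw -= logw.max(axis=1, keepdims=True)
    w = np.exp(logw)
    return (w @ y_train) / w.sum(axis=1)

def loo_cv_bandwidth(X, y, bandwidth_grid):
    """Return the grid bandwidth with the lowest leave-one-out squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    best_h, best_mse = None, np.inf
    for h in bandwidth_grid:
        logw = -0.5 * d2 / h ** 2
        np.fill_diagonal(logw, -np.inf)          # drop each point from its own fit
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        pred = (w @ y) / w.sum(axis=1)
        mse = float(np.mean((y - pred) ** 2))
        if mse < best_mse:
            best_h, best_mse = float(h), mse
    return best_h

# Synthetic demo data (not the citrus dataset), features on a unit scale.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(80, 2))
y_demo = np.sin(3.0 * X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

grid = np.logspace(-3, 1, 50)
h_best = loo_cv_bandwidth(X_demo[:60], y_demo[:60], grid)
val_pred = nw_predict(X_demo[:60], y_demo[:60], X_demo[60:], h_best)
print(h_best, float(np.sqrt(np.mean((y_demo[60:] - val_pred) ** 2))))
```

A single isotropic bandwidth only makes sense when the features share a comparable scale, so a real pipeline would normally standardize them first or use one bandwidth per feature.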


## Random Forest

In [13]:
from models.random_forest import run_rf_from_splits
outdir = '../../results/canopy/artifacts_rf'
param_dist = {
    "n_estimators": np.arange(100, 1500, 100),
    "max_depth": [None] + list(np.arange(10, 50, 5)),
    "min_samples_split": np.arange(2, 21, 2),
    "min_samples_leaf": np.arange(1, 21, 2),
    "max_features": [None, 'sqrt', 'log2'],
    "bootstrap": [True, False]
}
# Run the Random Forest model for each target variable
for target in target_col:
    print(f"Running RF for: {target}")
    metrics_RF = run_rf_from_splits(
        X_train, y_train[target],
        X_val, y_val[target],
        target_col=target,
        param_dist=param_dist,
        num_features=num_features,
        outdir=outdir
    )
    print(f"Results for {target}: {metrics_RF}")

Running RF for: tree_height
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Results for tree_height: {'train': {'RMSE': 0.2746721542752153, 'MAE': 0.21551430071066727, 'R2': 0.9078663619290935, 'rRMSE(%)': 9.948950106754245}, 'val': {'RMSE': 0.5042074297299551, 'MAE': 0.386542730881033, 'R2': 0.7293773520690924, 'rRMSE(%)': 18.0738223302705}}
Running RF for: canopy_diam
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Results for canopy_diam: {'train': {'RMSE': 0.43324540846716303, 'MAE': 0.3531841726361095, 'R2': 0.7277199739189786, 'rRMSE(%)': 12.722471341677325}, 'val': {'RMSE': 0.9745556513240644, 'MAE': 0.66243541036433, 'R2': 0.2246967638192171, 'rRMSE(%)': 28.82812996621788}}
Running RF for: biomass
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Results for biomass: {'train': {'RMSE': 6.050665577071749, 'MAE': 4.4558749996631715, 'R2': 0.8904790410471357, 'rRMSE(%)': 25.471943785357112}, 'val': {'RMSE': 12.738460525083498, 'MAE': 9.068187690346235, 'R2': 0.6197556241400558, 'rRMSE(%)': 51.0750519142766}}
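The log line "Fitting 5 folds for each of 100 candidates, totalling 500 fits" has the format scikit-learn's search objects print in verbose mode, which suggests that `run_rf_from_splits` wraps a randomized hyperparameter search with 5-fold cross-validation. The wrapper itself is not shown, so the sketch below only illustrates the mechanics in plain NumPy: draw 100 settings from a `param_dist`-style dictionary, score each on the same 5 folds, and keep the best. To stay self-contained it uses a small k-nearest-neighbours regressor as a stand-in model, not a random forest. All function names here are hypothetical, and candidates are drawn with replacement, which may differ from scikit-learn's sampler.

```python
import numpy as np

def sample_candidates(param_dist, n_iter, rng):
    """Draw n_iter settings, picking one value per key uniformly at random."""
    keys = sorted(param_dist)
    return [
        {k: param_dist[k][rng.integers(len(param_dist[k]))] for k in keys}
        for _ in range(n_iter)
    ]

def kfold_indices(n_samples, n_splits, rng):
    """Yield (train_idx, test_idx) pairs for shuffled K-fold cross-validation."""
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

def random_search(fit_predict, param_dist, X, y, n_iter=100, n_splits=5, seed=42):
    """Return (best_params, best_cv_rmse, n_fits) for a randomized search."""
    rng = np.random.default_rng(seed)
    splits = list(kfold_indices(len(X), n_splits, rng))  # same folds for all candidates
    best_params, best_rmse, n_fits = None, np.inf, 0
    for params in sample_candidates(param_dist, n_iter, rng):
        fold_rmse = []
        for tr, te in splits:
            pred = fit_predict(X[tr], y[tr], X[te], **params)
            fold_rmse.append(np.sqrt(np.mean((y[te] - pred) ** 2)))
            n_fits += 1
        score = float(np.mean(fold_rmse))
        if score < best_rmse:
            best_params, best_rmse = params, score
    return best_params, best_rmse, n_fits

def knn_fit_predict(X_tr, y_tr, X_te, k):
    """Stand-in model: mean target of the k nearest training points."""
    d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_tr[nearest].mean(axis=1)

# Synthetic demo data (not the citrus dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(60, 2))
y_demo = np.sin(3.0 * X_demo[:, 0]) + rng.normal(scale=0.1, size=60)

best_params, best_rmse, n_fits = random_search(
    knn_fit_predict, {"k": [1, 3, 5, 7, 9]}, X_demo, y_demo, n_iter=100, n_splits=5
)
print(best_params, best_rmse, n_fits)
```

The fit count is simply candidates times folds (100 x 5 = 500). With only 70 training points reported in the LightGBM log, each fold holds out roughly 14 samples, so the cross-validated scores that drive the selection are themselves noisy.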


## LightGBM

In [15]:
from models.lightgbm import run_lgbm_from_splits
outdir = '../../results/canopy/artifacts_lgbm'

# Run the LightGBM model for each target variable
for target in target_col:
    print(f"Running LGBM for: {target}")
    metrics_LGBM = run_lgbm_from_splits(
        X_train, y_train[target],
        X_val, y_val[target],
        target_col=target,
        param_dist=None,
        num_features=num_features,
        outdir=outdir
    )
    print(f"Results for {target}: {metrics_LGBM}")

Running LGBM for: tree_height
Fitting 5 folds for each of 100 candidates, totalling 500 fits
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000024 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 50
[LightGBM] [Info] Number of data points in the train set: 70, number of used features: 2
[LightGBM] [Info] Start training from score 2.760815
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000024 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 50
[LightGBM] [Info] Number of data points in the train set: 70, number of used features: 2
[LightGBM] [Info] Start training from score 2.760815
Results for tree_height: {'train': {'RMSE': 0.3284005628346387, 'MAE': 0.26119515677693367, 'R2': 0.8682966752526726, 'rRMSE(%)': 11.8950565749673}, 'val': {'RMSE': 0.5293592128605635, 'MAE': 0.37492471021632895, 'R2': 0.701704