# Timeseries and met data based on paper that uses XGBoost

https://www.mdpi.com/2072-4292/13/6/1147/pdf

https://www.tandfonline.com/doi/full/10.1080/17538947.2020.1808718

The date of this record can be truncated to the month, i.e. this record is for 2015-05. For this month:
- **CHIRPS_SPI_actual** is the actual SPI for the month 2015-05 relative to all May months 1984 through 2021. It is available for the month before, so I can use it.
- **MIXED_SPI** is the SPI for the month 2015-05 using the first **12 days from CHIRPS and the rest of the month from the mean seasonal forecast**, taken as a weighted mean of the respective SPIs. The idea is that this is the value we'd be able to use in production if we ran production at the end of each month.
- **FORECAST_SPI_{ii}** is the SPI based on the ensemble mean for one through six months in advance, i.e., in this case, for 2015-06, ..., 2015-11 (note: this forecast was made on 2015-05-13, so each record has the next 6 months).
- The **FORECAST_SPI** values are relative to all May forecasts 1993-2021, so the May forecast for July is relative to all May forecasts for July, but not to the June forecasts for July.
- Sometimes the SPI is inf because all forecast data is 0. But inf can't be serialized to JSON, so **I've filled it in with the value 999, which means we'll need a data cleaning step before prediction**.
- **NDVI** is the mean over the whole month. If it says 2015-05-01, it is the mean NDVI for the month 2015-05.
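
A minimal sketch of that cleaning step, assuming the SPI columns are already in a pandas DataFrame. The helper name `clean_spi` and the toy values are made up for illustration; only the 999 placeholder comes from the notes above.

```python
import numpy as np
import pandas as pd

SPI_SENTINEL = 999  # placeholder stored instead of inf (see the note above)


def clean_spi(df, spi_cols):
    """Return a copy where sentinel and infinite SPI values become NaN."""
    out = df.copy()
    out[spi_cols] = out[spi_cols].replace([SPI_SENTINEL, np.inf, -np.inf], np.nan)
    return out


# Toy records; the values are invented for the example.
demo = pd.DataFrame(
    {"forecast_spi_1": [0.4, 999.0, -1.2], "mixed_spi": [999.0, 0.1, 0.3]}
)
cleaned = clean_spi(demo, ["forecast_spi_1", "mixed_spi"])
```

Turning the placeholder into NaN (rather than dropping rows straight away) leaves the decision about how to handle the gap to the later modelling step.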


To be sure I don't have data leakage, let's study each feature.
For each datetime d (for example, d is August) I assume d is truncated to the month and ignore the day.
Actually, a better date would be the last day of the month:
- d is August (xxxx-08-01)
- month_ndvi_mean: the mean NDVI over the whole of August
- mixed_spi: first 12 days from CHIRPS, the rest from the forecast, for all of August
- forecast_spi: mean for 1..6 months in advance (Sept, Oct, Nov...)
- chirps_actual: I cannot use this for training, because I don't have this info in production (or I can use it, but it is not the same as mixed_spi)

- If we run in production at the end of each month, I'm using the current month's forecast_spi, so I cannot predict 7 months ahead, only 6
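
One way to relabel month-start dates to the last day of the month, as suggested above. This is only a sketch with example dates; `pd.offsets.MonthEnd(0)` rolls a date forward to the end of its own month and leaves month-end dates unchanged.

```python
import pandas as pd

# Month-start labels as they appear in the records (example dates).
month_start = pd.to_datetime(["2015-05-01", "2015-08-01", "2016-02-01"])

# Move each label to the last day of the same month.
month_end = month_start + pd.offsets.MonthEnd(0)
```

With month-end labels, a feature dated d only summarises data observed up to d, which makes it easier to check that nothing from after the prediction date leaks in.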

### Missing values:
I have to take care of missing values, because otherwise the targets are not going to be realistic. If I shift after dropping NaNs, a row's target can end up being N months ahead when it should not be.
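
A small sketch of the problem and one possible fix, using a made-up NDVI series with one month missing: reindex to a complete monthly grid before shifting, so that a one-row shift is always one month.

```python
import pandas as pd

# Toy monthly NDVI with 2020-03 missing (values are invented).
ndvi_toy = pd.Series(
    [0.30, 0.35, 0.50, 0.55],
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01", "2020-05-01"]),
)

# Naive: shift moves by rows, so February gets April's value as its target.
naive_target = ndvi_toy.shift(-1)

# Safer: restore the full monthly grid first, then shift.
full_index = pd.date_range(ndvi_toy.index.min(), ndvi_toy.index.max(), freq="MS")
regular = ndvi_toy.reindex(full_index)
target = regular.shift(-1)  # February's target is now NaN (March is missing)
```

Rows whose target is NaN can then be dropped per model, without silently pairing a row with a value from the wrong month.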


### Test set

I'm leaving 21 points for the test set. So I can have 21-7 total predictions for each of the 7 models (one per month)


### Model Types
- We can train 7 models for the same input row
- Or we can train 7 models, each based on data predicted by the previous model (it's like using one single model)
    - Training can be done only on real data (current_ndvi, mixed_spi)
    - But for prediction I can use the previous month's prediction as current_ndvi, and forecast_spi_1...7 instead of mixed_spi
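
The two options can be sketched roughly as follows. The column names `target_ndvi_{h}`, the helper names and the stand-in `toy_model` are hypothetical and only illustrate the data flow; no real model is trained here.

```python
import pandas as pd

HORIZONS = range(1, 8)  # 1..7 months ahead


def add_direct_targets(df, source_col="month_ndvi_mean"):
    """Direct strategy: one target column per horizon, all aligned to the same input row."""
    out = df.copy()
    for h in HORIZONS:
        out[f"target_ndvi_{h}"] = out[source_col].shift(-h)
    return out


def recursive_forecast(one_step_model, current_ndvi, forecast_spi):
    """Recursive strategy: feed each prediction back in as the next 'current' NDVI."""
    preds = []
    ndvi_value = current_ndvi
    for spi in forecast_spi:
        ndvi_value = one_step_model(ndvi_value, spi)
        preds.append(ndvi_value)
    return preds


# Toy monthly frame with invented values.
frame = pd.DataFrame(
    {"month_ndvi_mean": [0.1 * i for i in range(1, 11)]},
    index=pd.date_range("2020-01-01", periods=10, freq="MS"),
)
direct = add_direct_targets(frame)


def toy_model(ndvi_value, spi):
    """Stand-in for a trained one-step model (not a real fit)."""
    return 0.5 * ndvi_value + 0.1 * spi


path = recursive_forecast(toy_model, 0.4, [1.0, 0.0, -1.0])
```

In the direct version each horizon loses its last h rows to NaN targets; in the recursive version errors from early steps carry into later ones, which is the usual trade-off between the two.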
    
# Baseline vs NDVI vs SM
- Baseline model using only current and forecast_spi
- NDVI adding ndvi derivatives
- SM adding sm derivatives.

Each of them is trained on the same dataset

In [None]:
import xgboost as xgb

In [None]:
import json
import os

import altair as alt
import httpx
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
from shapely import geometry
from shapely.ops import unary_union

load_dotenv()  # take environment variables from .env.

In [None]:
tete_aga = 2179

# API Auth

In [None]:
base_url = "http://localhost:8081/"

In [None]:
client = httpx.Client(base_url=base_url)

In [None]:
r = client.post(
    "auth/token",
    data={"username": "fran.dorr@gmail.com", "password": "fran123"},
)

In [None]:
token = json.loads(r.text)["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Get agricultural areas geoms from the API

In [None]:

r = client.get("aoi/", params=dict(id=tete_aga), headers=headers)
res = json.loads(r.text)
polygons = geometry.shape(res["features"][0]["geometry"])


In [None]:
# get the bbox for all the ag areas
box = polygons.bounds


# Get Events from DB
- Date range from 1985 to 2022

In [None]:
start_datetime = "1985-01-02"
end_datetime = "2022-08-31"

In [None]:
r = client.get(
    "events/",
    params=dict(
        aoi_id=tete_aga,
        start_datetime=start_datetime,
        end_datetime=end_datetime,
        limit=10000,
    ),
    headers=headers,
    timeout=60,
)

In [None]:
aga_results = json.loads(r.text)["events"]

In [None]:
def get_keyed_values(results, label, keyed_value, new_col):
    """Build a datetime-indexed frame of one keyed value for one event label."""
    df = pd.DataFrame(results)
    # Each event carries a list of labels; keep only the first one.
    df.labels = df.labels.map(lambda x: x[0])
    # Copy so the column assignment below does not write into a view.
    df = df[df.labels == label].copy()
    df[new_col] = df.keyed_values.apply(lambda x: x.get(keyed_value))
    df = df.drop_duplicates(subset=["aoi_id", "datetime"]).dropna()
    df.datetime = pd.to_datetime(df.datetime)

    if keyed_value == "FORECAST_SPI":
        # Expand the six-month forecast list into one column per lead month.
        months_df = df[new_col].apply(pd.Series)
        months_df.columns = [f"{new_col}_{i}" for i in range(1, 7)]
        df = pd.concat([df.drop([new_col], axis=1), months_df], axis=1)
        df.index = df["datetime"]
    elif keyed_value == "mean_value":
        # Average per timestamp, then per calendar month (index = month start).
        # numeric_only=True skips non-numeric columns such as labels.
        df = df.groupby("datetime").mean(numeric_only=True).resample("MS").mean()
    else:
        df.index = df["datetime"]

    return df
    

In [None]:
ndvi = get_keyed_values(aga_results, "ndvi", "mean_value", "month_ndvi_mean")
sm = get_keyed_values(aga_results, "soil_moisture", "mean_value", "month_sm_mean")
chirps_actual = get_keyed_values(aga_results, "total_precipitation", "CHIRPS_SPI_actual", "chirps_actual")
forecast_spi = get_keyed_values(aga_results, "total_precipitation", "FORECAST_SPI", "forecast_spi")
mixed_spi = get_keyed_values(aga_results, "total_precipitation", "MIXED_SPI", "mixed_spi")

In [None]:
def add_lag_features(df, source_col, name, pivots):
    """Add lagged features around each pivot (rows back) and return the new column names."""
    new_cols = []
    for pivot in pivots:
        base = f"past_{pivot}_{name}"
        left, right = f"{base}_adj_left", f"{base}_adj_right"

        # Value `pivot` rows back and its two neighbours (one row earlier / later).
        df[base] = df[source_col].shift(pivot).interpolate()
        df[left] = df[source_col].shift(pivot + 1).interpolate()
        df[right] = df[source_col].shift(pivot - 1).interpolate()

        df[f"{base}_diff_left"] = df[base] - df[left]
        df[f"{base}_diff_right"] = df[base] - df[right]
        df[f"{base}_ratio_left"] = df[left] / df[base]
        df[f"{base}_ratio_right"] = df[right] / df[base]
        df[f"{base}_adj_sum"] = df[base] + df[left] + df[right]
        df[f"{base}_adj_mean"] = df[f"{base}_adj_sum"] / 3

        new_cols += [base, left, right,
                     f"{base}_diff_left", f"{base}_diff_right",
                     f"{base}_ratio_left", f"{base}_ratio_right",
                     f"{base}_adj_sum", f"{base}_adj_mean"]
    return new_cols


pivots = [6]
sm_cols = add_lag_features(sm, "month_sm_mean", "sm", pivots)
chirps_cols = add_lag_features(chirps_actual, "chirps_actual", "chirps_actual", pivots)

In [None]:
pivots = [11, 23, 35, 47]
ndvi_cols = add_lag_features(ndvi, "month_ndvi_mean", "ndvi", pivots)


In [None]:
cols_to_use = ndvi_cols + sm_cols #+ chirps_cols

In [None]:
final_df = ndvi.join(sm, lsuffix="", rsuffix="_r")
final_df = final_df.join(forecast_spi, lsuffix="", rsuffix="_r")
final_df = final_df.join(mixed_spi, lsuffix="", rsuffix="_r")
final_df = final_df.join(chirps_actual, lsuffix="", rsuffix="_r")



In [None]:
# Target: next month's mean NDVI (one step ahead).
final_df["target_ndvi"] = final_df.month_ndvi_mean.shift(-1)
#final_df["chirps_actual"] = final_df["forecast_spi_1"]

#final_df["chirps_actual"] = final_df["chirps_actual"].shift(1)


In [None]:
import numpy as np

final_df.replace([np.inf, -np.inf], np.nan, inplace=True)
final_df = final_df.drop(columns=["month_ndvi_mean"]).dropna(subset=cols_to_use+["target_ndvi"])
#final_df = final_df.dropna()
final_df.shape

# One Month model

- Train each model separately to get the most out of the dataset. If I filter everything at once, many points disappear because of NaNs
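
A toy illustration of why per-model filtering keeps more data. The column names `feat`, `target_1` and `target_3` and the helper `rows_for` are hypothetical; the point is only the difference between dropping NaNs per target and dropping them across all columns.

```python
import numpy as np
import pandas as pd

# Toy frame: longer horizons have more missing targets at the end.
toy = pd.DataFrame(
    {
        "feat": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "target_1": [0.1, 0.2, 0.3, 0.4, 0.5, np.nan],
        "target_3": [0.3, 0.4, 0.5, np.nan, np.nan, np.nan],
    }
)


def rows_for(df, target, features):
    """Keep rows that are complete for this model's features and target only."""
    return df.dropna(subset=features + [target])


per_model_rows = {t: len(rows_for(toy, t, ["feat"])) for t in ["target_1", "target_3"]}
global_rows = len(toy.dropna())  # filtering on every column at once
```

Here the one-month model keeps 5 rows, while a single global `dropna()` would cut every model down to the 3 rows that are complete for the longest horizon.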



In [None]:
from collections import defaultdict

from boruta import boruta_py
from lofo import LOFOImportance, Dataset, plot_importance
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import (GridSearchCV, ShuffleSplit,
                                     TimeSeriesSplit, cross_validate)
from tqdm import tqdm

%matplotlib inline


In [None]:
test_size = 20
cols_to_drop = ["target_ndvi"]
X_train = final_df.drop(columns=cols_to_drop)[:-test_size]
X_test = final_df.drop(columns=cols_to_drop)[-test_size:]



y_train = final_df["target_ndvi"][:-test_size]
y_test = final_df["target_ndvi"][-test_size:]


In [None]:
def train_models(X_train, y_train, grid_search_params,
                 cv_scoring={'r2': 'r2', 'neg_mse': 'neg_mean_squared_error'},
                 boruta=True):
    """Select features (Boruta or LOFO), grid-search an XGBoost regressor and cross-validate it."""
    X_train_copy = X_train[cols_to_use].copy()
    features = X_train_copy.columns
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
    importance_df = None  # only filled in by the LOFO branch

    if boruta:
        # Boruta feature selection wrapped around an XGBoost regressor.
        forest = xgb.XGBRegressor(n_jobs=-1)
        feat_selector = boruta_py.BorutaPy(forest, verbose=0, perc=90, n_estimators=200)
        feat_selector.fit(X_train_copy.values, y_train.values)
        # Keep confirmed (rank 1) and tentative (rank 2) features.
        selected_features = [col for rank, col in sorted(zip(feat_selector.ranking_, features))
                             if rank in (1, 2)]
    else:
        # LOFO: keep features whose removal hurts the CV score.
        dataset = Dataset(df=pd.concat([X_train_copy, y_train], axis=1),
                          target="target_ndvi", features=list(features))
        lofo_imp = LOFOImportance(dataset, cv=cv, scoring="neg_mean_absolute_error", n_jobs=-1)
        importance_df = lofo_imp.get_importance()
        plot_importance(importance_df, figsize=(12, 20))
        selected_features = importance_df[importance_df.importance_mean > 0].feature.values

    print("Features to be used: ", selected_features)

    xgb_grid = GridSearchCV(xgb.XGBRegressor(n_jobs=-1), grid_search_params,
                            cv=cv, n_jobs=-1, verbose=True)
    xgb_grid.fit(X_train_copy[selected_features], y_train)
    best_params = xgb_grid.best_params_
    best_model = xgb_grid.best_estimator_

    cv_scores = cross_validate(best_model, X_train_copy[selected_features],
                               y_train, cv=cv, scoring=cv_scoring)

    # Return order: model, params, CV scores, selected features,
    # LOFO importances (None when Boruta is used).
    return best_model, best_params, cv_scores, selected_features, importance_df

In [None]:

def train_final_and_eval(cv_train_results, month=1):
    """Refit XGBoost with the best CV params on the full train set and plot test predictions."""
    # `month` is only used by the disabled forecast_spi substitution below.
    X_test_cp = X_test.copy()
    X_train_cp = X_train.copy()
    _, model_params, _, features, _ = cv_train_results

    #if month==7:
    #    X_test_cp["chirps_actual"] = X_test_cp[f"forecast_spi_6"]
    #    X_train_cp["chirps_actual"] = X_train_cp[f"forecast_spi_6"]
    #else:
    #    X_test_cp["chirps_actual"] = X_test_cp[f"forecast_spi_{month}"]
    #    X_train_cp["chirps_actual"] = X_train_cp[f"forecast_spi_{month}"]

    xgb_reg = xgb.XGBRegressor(**model_params)
    xgb_reg.fit(X_train_cp[features], y_train)
    models = [xgb_reg]  # return the fitted model instead of an empty list

    preds = xgb_reg.predict(X_test_cp[features])
    y_true = y_test

    # Plot observed vs predicted NDVI over the test period.
    y_true.plot(label="y_true", legend=True)
    pd.Series(preds, index=y_true.index).plot(label="y_pred", legend=True)
    plt.show()
    return y_true, preds, models

In [None]:
gsp = {
    'objective': ['reg:squarederror'],
    'max_depth': [1, 2, 3, 5, 15, 20, 30],
    'n_estimators': [50, 100, 200, 500, 1000],
}
train_results = train_models(X_train, y_train, gsp, boruta=False)


In [None]:
pd.DataFrame(train_results[2])  # per-split CV scores (third item returned by train_models)

In [None]:
['past_11_ndvi', 'past_11_ndvi_adj_right', 'past_35_ndvi_adj_right',
       'past_47_ndvi_diff_right', 'past_35_ndvi_diff_right',
       'past_23_ndvi_adj_right', 'past_6_sm_diff_left',
       'past_35_ndvi_adj_sum', 'past_47_ndvi', 'past_11_ndvi_diff_right']

In [None]:
train_results[3]

In [None]:
wfp_models, wfp_params, wfp_scores, _, _ = train_results
cv_scores_df = pd.DataFrame(wfp_scores)

y_true, y_pred, models = train_final_and_eval(train_results, month=4)
# neg_mse is the negated MSE, so flip the sign and take the square root to report RMSE.
print("10-split CV R2:", cv_scores_df["test_r2"].median(),
      "\n10-split CV RMSE", np.sqrt(-cv_scores_df["test_neg_mse"]).median())
print("TEST R2", r2_score(y_true, y_pred),
      "\nTEST RMSE", np.sqrt(mean_squared_error(y_true, y_pred)))