# Timeseries and met data based on paper that uses XGBoost

https://www.mdpi.com/2072-4292/13/6/1147/pdf

https://www.tandfonline.com/doi/full/10.1080/17538947.2020.1808718

The date of each record can be truncated to the month; for example, this record is for 2015-05. For this month:
- **CHIRPS_SPI_actual** is the actual SPI for the month 2015-05 relative to all May months from 1984 through 2021. In production only the previous month's value is available, so I can use it as a feature once it is lagged by one month.
- **MIXED_SPI** is the SPI for the month 2015-05 using the first **12 days from CHIRPS and the rest of the month from the mean seasonal forecast**, taking a weighted mean of the respective SPIs. The idea is that this is the value we'd be able to use in production if we ran production at the end of each month.
- **FORECAST_SPI_{ii}** is the SPI based on the ensemble mean for one through six months in advance, i.e., in this case, for 2015-06, ..., 2015-11 (this forecast was made on 2015-05-13, so each record holds the next 6 months).
- The **FORECAST_SPI** values are relative to all May forecasts 1993-2021, so the May forecast for July is relative to all May forecasts for July, but not to the June forecasts for July.
- Sometimes the SPI is inf because all forecast data is 0. Since inf can't be serialized to JSON, **I've filled it in with the value 999, so we'll need a data cleaning step before prediction**.
- **NDVI** is the mean over the whole month. A date of 2015-05-01 means the mean NDVI for the month 2015-05.
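
A minimal sketch of the cleaning step mentioned above, assuming the sentinel is exactly 999 and that turning it into NaN is acceptable (whether to drop, impute, or clip those rows afterwards is left open). The helper name `clean_spi` and the toy frame are hypothetical; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

SPI_SENTINEL = 999  # fill value used in the records in place of inf


def clean_spi(df, spi_cols):
    """Return a copy of df with the 999 sentinel replaced by NaN in the SPI columns."""
    out = df.copy()
    out[spi_cols] = out[spi_cols].replace(SPI_SENTINEL, np.nan)
    return out


# Toy data, not real records
toy = pd.DataFrame(
    {"forecast_spi_1": [0.3, 999.0, -1.2], "mixed_spi": [999.0, 0.1, 0.4]}
)
cleaned = clean_spi(toy, ["forecast_spi_1", "mixed_spi"])
print(cleaned)
```

Doing this before any `dropna()` or model fit keeps the sentinel from being treated as a real, very large SPI.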


To be sure I don't have data leakage, let's study each feature.
For each datetime d (for example, d is August) I assume d is truncated to the month and ignore the day.
Actually, a better date would be the last day of the month:
- d is August (xxxx-08-01)
- month_ndvi_mean: the mean of the whole of August's NDVI
- mixed_spi: first 12 days from CHIRPS, rest from the forecast, for all of August
- forecast_spi: ensemble mean for 1..6 months in advance (Sept, Oct, Nov...)
- chirps_actual: I cannot use this for training as is, because I don't have this info in production (or I can use it, but it is not the same as mixed_spi)

- If we run in production at the end of each month, I'm using the current month's forecast_spi, so I cannot predict 7 months ahead, only 6
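
The notes above only say that mixed_spi is "a weighted mean" of the CHIRPS SPI and the forecast SPI. One plausible reading, sketched here purely as an assumption, is that the weights are proportional to the number of days each source covers. The function name `mixed_spi_estimate` is hypothetical and is not how the API is known to compute the value.

```python
import calendar


def mixed_spi_estimate(chirps_spi, forecast_spi, year, month, observed_days=12):
    """Day-weighted mean of two SPIs (ASSUMED weighting, not confirmed by the source)."""
    days_in_month = calendar.monthrange(year, month)[1]
    if not 0 <= observed_days <= days_in_month:
        raise ValueError("observed_days must lie within the month")
    w_obs = observed_days / days_in_month  # share of the month covered by CHIRPS
    return w_obs * chirps_spi + (1 - w_obs) * forecast_spi


# May 2015 has 31 days, so CHIRPS gets weight 12/31 and the forecast 19/31
value = mixed_spi_estimate(chirps_spi=1.0, forecast_spi=0.0, year=2015, month=5)
print(round(value, 4))
```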

### Missing values:
Missing values need care, because otherwise the targets are not going to be realistic. If I shift after dropping NaNs, a target labelled N months ahead can actually come from a later month.
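
A small sketch of the problem on toy data (the values and dates are made up). It assumes the series is monthly and that a missing month shows up as a missing row; `asfreq("MS")` restores the regular monthly index so that `shift(-1)` really means one month ahead.

```python
import pandas as pd

# Toy monthly NDVI with March 2020 missing entirely
idx = pd.to_datetime(
    ["2020-01-01", "2020-02-01", "2020-04-01", "2020-05-01", "2020-06-01"]
)
ndvi = pd.Series([0.30, 0.35, 0.50, 0.55, 0.60], index=idx, name="ndvi")

# Wrong: shifting by position on the gappy index.
# February's "1 month ahead" target is really April's value.
wrong = ndvi.to_frame()
wrong["target_1"] = wrong["ndvi"].shift(-1)

# Right: make the index regular first, shift, and only then drop NaNs.
right = ndvi.asfreq("MS").to_frame()
right["target_1"] = right["ndvi"].shift(-1)
right = right.dropna()

print(wrong)
print(right)
```

In the second frame February is dropped instead of silently getting a two-month-ahead target.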


### Test set

I'm leaving 21 points for the test set, so I can have 21 - 7 = 14 total predictions for each of the 7 models (one per month)
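
To make the arithmetic concrete, here is a sketch on synthetic data (40 made-up monthly values): hold out the last 21 rows chronologically and count how many of them have all 7 targets. The column names follow this notebook; the data and constants are assumptions for illustration.

```python
import numpy as np
import pandas as pd

N_HORIZONS = 7  # one target per month ahead
N_TEST = 21     # rows held out at the end of the series

idx = pd.date_range("2018-01-01", periods=40, freq="MS")
df = pd.DataFrame({"month_ndvi_mean": np.linspace(0.2, 0.8, 40)}, index=idx)
for h in range(1, N_HORIZONS + 1):
    df[f"target_ndvi_{h}"] = df["month_ndvi_mean"].shift(-h)

# Chronological split: no shuffling, the test set is strictly after the train set
train, test = df.iloc[:-N_TEST], df.iloc[-N_TEST:]

# The last 7 rows lack at least one future target, so only 21 - 7 rows are complete
complete_test = test.dropna()
print(len(test), len(complete_test))
```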


### Model Types
- We can train 7 models on the same input row
- Or we can train 7 models, each based on data predicted by the previous model (which is like using one single model)
    - Training can be done only on real data (current_ndvi, mixed_spi)
    - But for prediction I can use the previous month's prediction as current_ndvi and forecast_spi_1...7 instead of mixed_spi
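
The second option above is a recursive forecast. A minimal sketch, assuming a one-step model that maps (current NDVI, SPI for the month) to next month's NDVI; `toy_one_step_model` is a hypothetical stand-in with made-up coefficients, not a trained XGBoost model.

```python
def recursive_forecast(one_step_model, current_ndvi, mixed_spi, forecast_spi):
    """Roll a one-step model forward, feeding each prediction back in as current NDVI.

    The first step uses mixed_spi; later steps use forecast_spi_1, forecast_spi_2, ...
    """
    spi_inputs = [mixed_spi, *forecast_spi]
    preds = []
    ndvi = current_ndvi
    for spi in spi_inputs:
        ndvi = one_step_model(ndvi, spi)
        preds.append(ndvi)
    return preds


def toy_one_step_model(ndvi, spi):
    # Hypothetical stand-in for a fitted regressor
    return 0.5 * ndvi + 0.1 * spi


preds = recursive_forecast(toy_one_step_model, 0.4, 1.0, [0.0, 0.0])
print(preds)
```

The trade-off against the first option (one independent model per horizon) is that errors compound along the chain, while only a single model has to be trained on real data.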
    
# Baseline vs NDVI vs SM
- Baseline model using only current and forecast_spi
- NDVI adding ndvi derivatives
- SM adding sm derivatives.

Each of them is trained on the same dataset.

In [None]:
import xgboost as xgb

In [None]:
import json
import seaborn as sns
import altair as alt
import httpx
import matplotlib.pyplot as plt
import pandas as pd
from shapely import geometry
from shapely.ops import unary_union
import os
from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.

In [None]:
tete_aga = 2179

# API Auth

In [None]:
base_url = "http://localhost:8081/"

In [None]:
client = httpx.Client(base_url=base_url)

In [None]:
r = client.post(
    "auth/token",
    data={"username": "fran.dorr@gmail.com", "password": "fran123"},
)

In [None]:
token = json.loads(r.text)["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Get agricultural areas geoms from the API

In [None]:

r = client.get("aoi/", params=dict(id=tete_aga), headers=headers)
res = json.loads(r.text)
polygons = geometry.shape(res["features"][0]["geometry"])


In [None]:
# get the bbox for all the ag areas
box = polygons.bounds


# Get Events from DB 
- Date range from 1985-01-02 to 2022-08-31

In [None]:
start_datetime = "1985-01-02"
end_datetime = "2022-08-31"

In [None]:
r = client.get(
    "events/",
    params=dict(
        aoi_id=tete_aga,
        start_datetime=start_datetime,
        end_datetime=end_datetime,
        limit=10000,
    ),
    headers=headers,
    timeout=60,
)

In [None]:
aga_results = json.loads(r.text)["events"]

In [None]:
from dateutil.relativedelta import relativedelta


def get_keyed_values(results, label, keyed_value, new_col):
    """Build a datetime-indexed frame holding one keyed value for one event label."""
    df = pd.DataFrame(results)
    df.labels = df.labels.map(lambda x: x[0])  # keep only the first label
    df = df[df.labels == label].copy()  # copy so the assignments below don't hit a view
    df[new_col] = df.keyed_values.apply(lambda x: x.get(keyed_value))
    df = df.drop_duplicates(subset=["aoi_id", "datetime"]).dropna()
    df.datetime = pd.to_datetime(df.datetime)

    if keyed_value == "FORECAST_SPI":
        # expand the six lead-month SPIs of each record into six columns
        months_df = df[new_col].apply(pd.Series)
        months_df.columns = [f"{new_col}_{i}" for i in range(1, 7)]
        df = pd.concat([df.drop([new_col], axis=1), months_df], axis=1)
        df.index = df["datetime"]

    elif keyed_value == "mean_value":
        # average per timestamp, then per calendar month (index = month start);
        # numeric_only skips the label and dict columns, which cannot be averaged
        df = df.groupby(["datetime"]).mean(numeric_only=True).resample("MS").mean()
        #df.index = df.index.map(lambda x: x.replace(day=1))

    else:
        df.index = df["datetime"]

    return df
    

In [None]:
ndvi = get_keyed_values(aga_results, "ndvi", "mean_value", "month_ndvi_mean")
sm = get_keyed_values(aga_results, "soil_moisture", "mean_value", "month_sm_mean")


In [None]:
sm = sm.dropna(subset=["month_sm_mean"])


In [None]:
forecast_spi = get_keyed_values(aga_results, "total_precipitation", "FORECAST_SPI", "forecast_spi")
mixed_spi = get_keyed_values(aga_results, "total_precipitation", "MIXED_SPI", "mixed_spi")
chirps_actual = get_keyed_values(aga_results, "total_precipitation", "CHIRPS_SPI_actual", "chirps_actual")

In [None]:
ndvi["month_ndvi_mean"] = ndvi["month_ndvi_mean"].interpolate()
# shift(11) is next month's NDVI one year earlier, i.e. the same calendar month as target_ndvi_1
ndvi["one_year_ndvi"] = ndvi["month_ndvi_mean"].shift(11).interpolate()
# its neighbours: one month before it (shift 12) and one month after it (shift 10)
ndvi["one_year_ndvi_adj_left"] = ndvi["month_ndvi_mean"].shift(12).interpolate()
ndvi["one_year_ndvi_adj_right"] = ndvi["month_ndvi_mean"].shift(10).interpolate()

ndvi["one_year_ndvi_diff_left"] = ndvi["one_year_ndvi"]-ndvi["one_year_ndvi_adj_left"]
ndvi["one_year_ndvi_diff_right"] = ndvi["one_year_ndvi"]-ndvi["one_year_ndvi_adj_right"]

ndvi["one_year_ndvi_ratio_left"] = ndvi["one_year_ndvi_adj_left"] / ndvi["one_year_ndvi"]
ndvi["one_year_ndvi_ratio_right"] = ndvi["one_year_ndvi_adj_right"] / ndvi["one_year_ndvi"]

ndvi["one_year_ndvi_adj_sum"] = ndvi["one_year_ndvi"] + ndvi["one_year_ndvi_adj_left"] + ndvi["one_year_ndvi_adj_right"] 

ndvi["one_year_ndvi_adj_mean"] = ndvi["one_year_ndvi_adj_sum"]/3 


ndvi = ndvi[ndvi.index.isin(sm.index)]

forecast_spi = forecast_spi[forecast_spi.index.isin(sm.index)]
mixed_spi = mixed_spi[mixed_spi.index.isin(sm.index)]
chirps_actual = chirps_actual[chirps_actual.index.isin(sm.index)]

In [None]:
cols_to_use = ['aoi_id', 
              'month_ndvi_mean','one_year_ndvi','one_year_ndvi_adj_left',
              'one_year_ndvi_adj_right','one_year_ndvi_diff_left','one_year_ndvi_diff_right',
              'one_year_ndvi_ratio_left', 'one_year_ndvi_ratio_right','one_year_ndvi_adj_sum',
              'one_year_ndvi_adj_mean',
              'month_sm_mean', 'forecast_spi_1',
       'forecast_spi_2', 'forecast_spi_3', 'forecast_spi_4', 'forecast_spi_5',
       'forecast_spi_6','mixed_spi','chirps_actual']

In [None]:
final_df = ndvi.join(sm, lsuffix="", rsuffix="_r")
final_df = final_df.join(forecast_spi, lsuffix="", rsuffix="_r")
final_df = final_df.join(mixed_spi, lsuffix="", rsuffix="_r")
final_df = final_df.join(chirps_actual, lsuffix="", rsuffix="_r")[cols_to_use]

In [None]:
final_df["chirps_actual"] = final_df["chirps_actual"].shift(1) # to use chirps from month before

In [None]:
for i in range(1,5):
    # plain lags of i months
    final_df[f"chirps_actual_{i}m"] = final_df["chirps_actual"].shift(i)
    final_df[f"month_ndvi_mean_{i}m"] = final_df["month_ndvi_mean"].shift(i)
    final_df[f"month_sm_mean_{i}m"] = final_df["month_sm_mean"].shift(i)
    # not a lag: rolling sum over the current month and the i previous ones
    final_df[f"mixed_spi_{i}m"] = final_df.mixed_spi.rolling(min_periods=1, window=i+1).sum()

# 2- and 3-month SPI sums ending at lead month i; mixed_spi (current month)
# fills in where the window reaches back before lead month 1
for i in range(1,7):
    if i == 1:
        final_df[f"forecast_spi_{i}_cumsum_2"] = final_df["mixed_spi"] + final_df["forecast_spi_1"]
        final_df[f"forecast_spi_{i}_cumsum_3"] = final_df["mixed_spi_1m"] + final_df["forecast_spi_1"]
    elif i == 2:
        final_df[f"forecast_spi_{i}_cumsum_2"] = final_df["forecast_spi_1"] + final_df["forecast_spi_2"]
        final_df[f"forecast_spi_{i}_cumsum_3"] = final_df["mixed_spi"] + final_df[f"forecast_spi_{i}_cumsum_2"]
    else:
        final_df[f"forecast_spi_{i}_cumsum_2"] = final_df[f"forecast_spi_{i-1}"] + final_df[f"forecast_spi_{i}"]
        final_df[f"forecast_spi_{i}_cumsum_3"] = final_df[f"forecast_spi_{i-2}"] + final_df[f"forecast_spi_{i}_cumsum_2"]
        
#final_df["month_ndvi_mean_cumsum_2"] = final_df.month_ndvi_mean.rolling(min_periods=1, window=2).sum()
#final_df["month_ndvi_mean_cumsum_3"] = final_df.month_ndvi_mean.rolling(min_periods=1, window=3).sum()

#final_df["month_sm_mean_cumsum_2"] = final_df.month_sm_mean.rolling(min_periods=1, window=2).sum()
#final_df["month_sm_mean_cumsum_3"] = final_df.month_sm_mean.rolling(min_periods=1, window=3).sum()


In [None]:
for i in range(1,8):
    final_df[f"target_ndvi_{i}"] = final_df.month_ndvi_mean.shift(-i)




# One Month model

- Train each model separately to get the most out of the dataset. If I filter everything at once, many points disappear because of NaN
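
A quick sketch of why per-model filtering keeps more rows, on synthetic data (24 made-up monthly values). The target columns are named as in this notebook; everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2019-01-01", periods=24, freq="MS")
df = pd.DataFrame({"month_ndvi_mean": np.linspace(0.2, 0.8, 24)}, index=idx)
for h in range(1, 8):
    df[f"target_ndvi_{h}"] = df["month_ndvi_mean"].shift(-h)

# Filtering once on every target: each row needs all 7 future months
rows_global = len(df.dropna())

# Filtering per horizon: the model for horizon h only needs its own target
rows_per_horizon = {
    h: len(df[["month_ndvi_mean", f"target_ndvi_{h}"]].dropna())
    for h in range(1, 8)
}
print(rows_global, rows_per_horizon)
```

With one `dropna()` over all targets every model is limited to the rows available to the 7-month model; per-horizon filtering gives the short horizons a few extra rows each.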



In [None]:
final_df.columns

In [None]:
       
target_cols = ['target_ndvi_1','target_ndvi_2',
               'target_ndvi_3','target_ndvi_4',
               'target_ndvi_5', 'target_ndvi_6', 'target_ndvi_7']

forecast_spi_cols = ['forecast_spi_1', 'forecast_spi_2', 'forecast_spi_3', 
                     'forecast_spi_4', 'forecast_spi_5','forecast_spi_6',
                     'forecast_spi_1_cumsum_2',
                     'forecast_spi_1_cumsum_3',
                     'forecast_spi_2_cumsum_2',
                     'forecast_spi_2_cumsum_3',
                     'forecast_spi_3_cumsum_2',
                     'forecast_spi_3_cumsum_3',
                     'forecast_spi_4_cumsum_2',
                     'forecast_spi_4_cumsum_3',
                     'forecast_spi_5_cumsum_2',
                     'forecast_spi_5_cumsum_3',
                     'forecast_spi_6_cumsum_2',
                     'forecast_spi_6_cumsum_3',
                     ]
             
wfp_cols = forecast_spi_cols + ['mixed_spi',
                                'chirps_actual','chirps_actual_1m', 'chirps_actual_2m']
            
oxeo_cols_ndvi = wfp_cols + ['month_ndvi_mean', 'month_ndvi_mean_1m','month_ndvi_mean_2m',
'month_ndvi_mean_3m','month_ndvi_mean_4m','one_year_ndvi','one_year_ndvi_adj_left',
              'one_year_ndvi_adj_right','one_year_ndvi_diff_left','one_year_ndvi_diff_right',
              'one_year_ndvi_ratio_left', 'one_year_ndvi_ratio_right','one_year_ndvi_adj_sum',
              'one_year_ndvi_adj_mean']
oxeo_cols_sm = oxeo_cols_ndvi + ['month_sm_mean', 'month_sm_mean_1m', 'month_sm_mean_2m',
                                    'month_sm_mean_3m', 'month_sm_mean_4m']
                        
                    
            
all_cols = {
    "wfp":wfp_cols,
    "oxeo_ndvi": oxeo_cols_ndvi,
    "oxeo_sm": oxeo_cols_sm,
    "target": target_cols,
    "all":oxeo_cols_sm+target_cols,
    "all_no_target":oxeo_cols_sm,
    "spi": forecast_spi_cols,
    
}

In [None]:
from sklearn.ensemble import RandomForestRegressor
from boruta import boruta_py
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from tqdm import tqdm
from collections import defaultdict
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.metrics import r2_score, mean_squared_error
  



In [None]:
model_df = final_df[all_cols['all']].dropna()
X_train = model_df[all_cols['all_no_target']][:-10]
X_test = model_df[all_cols['all_no_target']][-10:]

y_train = model_df[all_cols["target"]][:-10]
y_test = model_df[all_cols["target"]][-10:]

X_train_wfp = X_train[all_cols["wfp"]]
X_train_oxeo_ndvi = X_train[all_cols["oxeo_ndvi"]]
X_train_oxeo_sm = X_train[all_cols["oxeo_sm"]]


X_test_wfp = X_test[all_cols["wfp"]]
X_test_oxeo_ndvi = X_test[all_cols["oxeo_ndvi"]]
X_test_oxeo_sm = X_test[all_cols["oxeo_sm"]]

In [None]:
X_train.shape, X_train_wfp.shape,X_train_oxeo_sm.shape

In [None]:
def train_models(X_train,y_train, grid_search_params,
                 cv_scoring= {'r2': 'r2','neg_mse': 'neg_mean_squared_error'},
                 boruta=True):
                 
    cols_to_use = [x for x in X_train.columns if x not in all_cols['spi']]  
    
    valid_forecasts = {
        i: [f'forecast_spi_{i}', f'forecast_spi_{i}_cumsum_2',f'forecast_spi_{i}_cumsum_3'] for i in range(1,7)
    }
    
    #valid_forecasts = {
    #    i: [f'forecast_spi_{i}'] for i in range(1,7)
    #}
    
    
    #features = X_train.columns

    best_params = {}
    best_models = {}
    cv_scores = {}
    boruta_features = {}
    for i in range(7):
        features = cols_to_use 
        for j in range(1,i+1):
            features = features + valid_forecasts[j]
        
        target_col = f"target_ndvi_{i+1}"

        if boruta:
            forest = xgb.XGBRegressor(n_jobs=-1)
            # define Boruta feature selection method

            feat_selector = boruta_py.BorutaPy(forest, verbose=0, perc=90, n_estimators=50)
            # find all relevant features
            feat_selector.fit(X_train[features].values, y_train[target_col].values)
            Z = [x for r,x in sorted(zip(feat_selector.ranking_,X_train[features].columns)) if r ==1]
            selected_features = Z
        else:
            selected_features = features
        print("Features to use: ", selected_features)
        #print("Boruta selected: ", selected_features)

        xg_reg = xgb.XGBRegressor(n_jobs=-1)

        params = grid_search_params
        
        cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

        xgb_grid = GridSearchCV(xg_reg,
                                params,
                                cv = cv, scoring="r2",
                                n_jobs = -1,
                                verbose=True)

        xgb_grid.fit(X_train[selected_features],y_train[target_col])
        best_params[i] = xgb_grid.best_params_
        best_models[i] = xgb_grid.best_estimator_
        
        cv_scores[i] = cross_validate(xgb_grid.best_estimator_,X_train[selected_features],
                     y_train[target_col],cv=cv, scoring=cv_scoring)
                     
        boruta_features[i] = selected_features

    return best_models, best_params, cv_scores, boruta_features

In [None]:

def train_final_and_eval(cv_train_results):
    models = []
    _, model_params, _, boruta_features = cv_train_results
    for i in range(7):
        xgb_reg = xgb.XGBRegressor(**model_params[i])
        xgb_reg.fit(X_train[boruta_features[i]], y_train[target_cols[i]])
        models.append(xgb_reg)

    preds = []
    y_true = []
    for i in range(7):
        preds.append(models[i].predict(X_test[boruta_features[i]]))
        y_true.append(y_test[target_cols[i]])

    for i in range(7):
        y_true[i].plot(label="y_true", legend=True)
        pd.Series(preds[i],index=y_true[i].index).plot(label="y_pred", legend=True)
        plt.show()
    return models

In [None]:
gsp = {'objective':['reg:squarederror'],
              'max_depth': [2,3],
              'n_estimators': [50]}
train_wfp_results = train_models(X_train_wfp,y_train,gsp, boruta=False)
train_ndvi_results = train_models(X_train_oxeo_ndvi,y_train,gsp, boruta=True)
train_sm_results = train_models(X_train_oxeo_sm,y_train,gsp, boruta=True)

In [None]:
wfp_models, wfp_params, wfp_scores, _ = train_wfp_results
mean_scores_wfp = defaultdict(dict)
for i in range(7):
    for key, value in wfp_scores[i].items():
        mean_scores_wfp[i][key] = value.mean()

oxeo_models, oxeo_params, oxeo_scores, _ = train_ndvi_results

mean_scores_ndvi = defaultdict(dict)
for i in range(7):
    for key, value in oxeo_scores[i].items():
        mean_scores_ndvi[i][key] = value.mean() 

_, _, sm_scores, _ = train_sm_results
mean_scores_sm = defaultdict(dict)
for i in range(7):
    for key, value in sm_scores[i].items():
        mean_scores_sm[i][key] = value.mean()


(-pd.DataFrame(mean_scores_wfp).loc["test_neg_mse"]).plot(label="wfp", legend=True)
(-pd.DataFrame(mean_scores_ndvi).loc["test_neg_mse"]).plot(label="ndvi", legend=True)
(-pd.DataFrame(mean_scores_sm).loc["test_neg_mse"]).plot(label="sm", legend=True)

plt.ylim([0.0005, 0.006])

In [None]:
train_final_and_eval(train_sm_results)

In [None]:
r2_scores

# Model training with moisture




In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from tqdm import tqdm
from collections import defaultdict

models = []
extra_cols = ["month_sm_mean"]
test_size = 20 # number of rows held out for testing (a row count, not a percentage)
for i in tqdm(range(7)):

    if i == 0:
        # mixed_spi_2m is the 3-month rolling sum of mixed_spi; final_df has no mixed_spi_cumsum_3 column
        cols_to_use = ['month_ndvi_mean','mixed_spi_2m']
    else:
        cols_to_use = ['month_ndvi_mean', 
                     f'forecast_spi_{i}_cumsum_3']
    cols_to_use+=extra_cols
    print(f"Using features: {cols_to_use}")
    target_col = f"target_ndvi_{i+1}"
    sm_df = final_df.dropna(subset="month_sm_mean")
    model_df = sm_df[cols_to_use + [target_col]].dropna()
    model_df = model_df[2:] # to remove repeated cumsums
    train = model_df[:-test_size]
    test = model_df[-test_size:]
    
    y_train = train[[target_col]]
    y_test = test[[target_col]]

    xg_reg = xgb.XGBRegressor()

    params = {'objective':['reg:squarederror'],
              'learning_rate': [0.1,0.3,0.5], #so called `eta` value
              'max_depth': [2, 4, 6,8,10,12],
              'n_estimators': [50,100]}


    xgb_grid = GridSearchCV(xg_reg,
                            params,
                            cv = TimeSeriesSplit(n_splits=3),
                            n_jobs = -1,
                            verbose=True)

    xgb_grid.fit(train[cols_to_use],
             y_train)
    print(xgb_grid.best_params_)
    #xg_reg.fit(X_train.values,y_train)
    models.append(xgb_grid.best_estimator_)
    
    
    
    

In [None]:
cols_to_use

In [None]:
models[0].feature_importances_

In [None]:
from sklearn.metrics import r2_score, mean_squared_error
preds = {}
y_true = {}

r2_scores = []
for i in range(7):
    if i == 0:
        # mixed_spi_2m is the 3-month rolling sum of mixed_spi; final_df has no mixed_spi_cumsum_3 column
        cols_to_use = ['month_ndvi_mean','mixed_spi_2m']
    else:
        cols_to_use = ['month_ndvi_mean', 
                     f'forecast_spi_{i}_cumsum_3']
        
    cols_to_use+=extra_cols
    target_col = f"target_ndvi_{i+1}"
    sm_df = final_df.dropna(subset="month_sm_mean")
    model_df = sm_df[cols_to_use + [target_col]].dropna()
    
    train = model_df[:-test_size]
    test = model_df[-test_size:]
    preds[i] = models[i].predict(test[cols_to_use])
    y_true[i] = test[[f"target_ndvi_{i+1}"]].values[:,0]
    
    # note: despite its name, r2_scores collects the test MSE, not R²
    r2_scores.append(mean_squared_error(y_true[i], preds[i]))

    sns.lineplot(data=pd.DataFrame({"y_true":y_true[i],"y_pred":preds[i]}))
    plt.show()

In [None]:
r2_scores

In [None]:
instance

In [None]:
fig, axs = plt.subplots(2, 5, figsize=(20, 10), sharey=True)
for j in range(0,10):
    sm_df = final_df.dropna(subset="month_sm_mean")
    instance = sm_df.dropna().iloc[-10+j]
    instances_preds = []
    for i in range(7):
        # use the same features the moisture models were trained on
        if i == 0:
            cols_to_use = ['month_ndvi_mean', 'mixed_spi_2m']
        else:
            cols_to_use = ['month_ndvi_mean', f'forecast_spi_{i}_cumsum_3']

        cols_to_use+=extra_cols
        instances_preds.append(models[i].predict(instance[cols_to_use].values.reshape(1,-1))[0])
    axs[(j//5),j%5].plot(instance.values[-7:],label=f"y_true {str(instance.name)}")
    axs[(j//5),j%5].plot(instances_preds, label="y_pred")
    axs[(j//5),j%5].legend()
plt.legend(loc='best')
plt.show()

# Models training

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from tqdm import tqdm
from collections import defaultdict

baseline_models = []

test_size = 20 # number of rows held out for testing (a row count, not a percentage)
for i in tqdm(range(7)):

    if i == 0:
        # mixed_spi_1m / mixed_spi_2m are the 2- and 3-month rolling sums of mixed_spi
        cols_to_use = ['month_ndvi_mean', 'mixed_spi', 'mixed_spi_1m', 'mixed_spi_2m']
    else:
        cols_to_use = ['month_ndvi_mean', 
                    f'forecast_spi_{i}', f'forecast_spi_{i}_cumsum_2', f'forecast_spi_{i}_cumsum_3']

    print(f"Using features: {cols_to_use}")
    target_col = f"target_ndvi_{i+1}"
    sm_df = final_df.dropna(subset="month_sm_mean")
    model_df = sm_df[cols_to_use + [target_col]].dropna()
    model_df = model_df[2:] # to remove repeated cumsums
    train = model_df[:-test_size]
    test = model_df[-test_size:]
    
    y_train = train[[target_col]]
    y_test = test[[target_col]]

    xg_reg = xgb.XGBRegressor()

    params = {'objective':['reg:squarederror'],
              'learning_rate': [0.1,0.3,0.5], #so called `eta` value
              'max_depth': [4, 6,8,10,12],
              'n_estimators': [100,200,1000]}


    xgb_grid = GridSearchCV(xg_reg,
                            params,
                            cv = TimeSeriesSplit(n_splits=3),
                            n_jobs = -1,
                            verbose=True)

    xgb_grid.fit(train[cols_to_use],
             y_train)
    print(xgb_grid.best_params_)
    #xg_reg.fit(X_train.values,y_train)
    baseline_models.append(xgb_grid.best_estimator_)
    
    
    
    

In [None]:
from sklearn.metrics import r2_score
preds = {}
y_true = {}

baseline_r2_scores = []
for i in range(7):
    if i == 0:
        # mixed_spi_1m / mixed_spi_2m are the 2- and 3-month rolling sums of mixed_spi
        cols_to_use = ['month_ndvi_mean', 'mixed_spi', 'mixed_spi_1m', 'mixed_spi_2m']
    else:
        cols_to_use = ['month_ndvi_mean', 
                    f'forecast_spi_{i}', f'forecast_spi_{i}_cumsum_2', f'forecast_spi_{i}_cumsum_3']
    target_col = f"target_ndvi_{i+1}"
    sm_df = final_df.dropna(subset="month_sm_mean")
    model_df = sm_df[cols_to_use + [target_col]].dropna()
    
    train = model_df[:-test_size]
    test = model_df[-test_size:]
    preds[i] = baseline_models[i].predict(test[cols_to_use])
    y_true[i] = test[[f"target_ndvi_{i+1}"]].values[:,0]
    
    # note: despite its name, baseline_r2_scores collects the test MSE, not R²
    baseline_r2_scores.append(mean_squared_error(y_true[i], preds[i]))

    sns.lineplot(data=pd.DataFrame({"y_true":y_true[i],"y_pred":preds[i]}))
    plt.show()

In [None]:
str(instance.name)

In [None]:
fig, axs = plt.subplots(2, 5, figsize=(20, 10), sharey=True)
for j in range(0,10):
    sm_df = final_df.dropna(subset="month_sm_mean")
    instance = sm_df.dropna().iloc[-10+j]
    instances_preds = []
    for i in range(7):
        if i == 0:
            # mixed_spi_1m / mixed_spi_2m are the 2- and 3-month rolling sums of mixed_spi
            cols_to_use = ['month_ndvi_mean', 'mixed_spi', 'mixed_spi_1m', 'mixed_spi_2m']
        else:
            cols_to_use = ['month_ndvi_mean', 
                        f'forecast_spi_{i}', f'forecast_spi_{i}_cumsum_2', f'forecast_spi_{i}_cumsum_3']
        instances_preds.append(baseline_models[i].predict(instance[cols_to_use].values.reshape(1,-1))[0])
    axs[(j//5),j%5].plot(instance.values[-7:],label=f"y_true {str(instance.name)}")
    axs[(j//5),j%5].plot(instances_preds, label="y_pred")
    axs[(j//5),j%5].legend()
plt.legend(loc='best')
plt.show()

In [None]:
pd.DataFrame({"sm":r2_scores, "baseline": baseline_r2_scores})