# Timeseries and met data based on paper that uses XGBoost

https://www.mdpi.com/2072-4292/13/6/1147/pdf

The date of each record can be truncated to the month; for example, this record is for 2015-05. For this month:
- **CHIRPS_SPI_actual** is the actual SPI for the month 2015-05 relative to all May months 1984 through 2021.
- **MIXED_SPI** is the SPI for the month 2015-05 using the first **12 days from CHIRPS and the rest of the month from the mean seasonal forecast**, taking a weighted mean of the respective SPIs. The idea is that this is the value we'd be able to use in production if we ran production at the end of each month.
- **FORECAST_SPI_{ii}** is the SPI based on the ensemble mean for one through six months in advance, i.e., in this case, for 2015-06, ..., 2015-11. This forecast was made on 2015-05-13, so each record holds the next 6 months.
- The **FORECAST_SPI** values are relative to all May forecasts 1993-2021, so the May forecast for July is relative to all May forecasts for July, but not to the June forecasts for July.
- Sometimes the SPI is inf because all forecast data is 0. JSON can't represent inf, so **I've filled it in with the value 999, and we'll need a data cleaning step before prediction**.
- **NDVI** is the mean over the whole month. If it says 2015-05-01, it is the mean NDVI for the month 2015-05.
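
The cleaning step for the 999 placeholder could look like the sketch below. It assumes 999 never occurs as a legitimate SPI value and that treating those entries as missing (NaN) is acceptable; the helper name `clean_spi_sentinel` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

SPI_SENTINEL = 999  # placeholder stored where the SPI was inf


def clean_spi_sentinel(df, spi_cols, sentinel=SPI_SENTINEL):
    """Return a copy of df with the sentinel replaced by NaN in spi_cols."""
    cleaned = df.copy()
    cleaned[spi_cols] = cleaned[spi_cols].replace(sentinel, np.nan)
    return cleaned


# Toy data: the column names mirror the ones used later in this notebook.
toy = pd.DataFrame(
    {
        "month_ndvi_mean": [0.41, 0.38, 0.35],
        "forecast_spi_1": [0.2, 999.0, -1.1],
        "forecast_spi_2": [999.0, 0.5, 0.3],
    }
)
cleaned = clean_spi_sentinel(toy, ["forecast_spi_1", "forecast_spi_2"])
print(cleaned)
```

Rows that end up with NaN can then be dropped or handled explicitly before calling `predict`.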


To be sure I don't have data leakage, let's study each feature.
For each datetime d (for example, d is August) I assume d is truncated to the month and ignore the day.
Actually, a better date would be the last day of the month:
- d is August (xxxx-08-01)
- month_ndvi_mean: the mean NDVI over the whole of August
- mixed_spi: first 12 days from CHIRPS, the rest of August from the forecast
- forecast_spi: ensemble mean for 1..6 months in advance (Sept, Oct, Nov, ...)
- chirps_actual: I cannot use this for training, because I don't have this information in production (or I can use it, but it is not the same as mixed_spi)

- If we run in production at the end of each month, I'm using the current month's forecast_spi, so I cannot predict 7 months ahead, only 6.
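
Stamping each record with the last day of its month, as suggested above, can be done with a pandas offset. This is only a sketch on toy dates; the notebook itself keeps the first-of-month convention.

```python
import pandas as pd

# Toy monthly index stamped at the first day of each month, as in the records.
month_start = pd.to_datetime(["2015-05-01", "2015-08-01", "2016-02-01"])

# MonthEnd(0) rolls each date forward to the end of its own month
# (dates already at a month end are left unchanged).
month_end = month_start + pd.offsets.MonthEnd(0)

print(list(month_end.strftime("%Y-%m-%d")))
# ['2015-05-31', '2015-08-31', '2016-02-29']
```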

### Missing values:
We have to take care of missing values, because otherwise the targets are not going to be realistic. If I shift after dropping NaNs, I can get a target labelled as N months ahead that actually comes from a later month.
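
A small illustration of the problem, on a toy NDVI series with one month missing. The values are invented; the point is only the order of `shift` and gap handling.

```python
import pandas as pd

# Toy NDVI series with one month (2015-03) missing from the records.
ndvi = pd.Series(
    [0.50, 0.52, 0.47, 0.45],
    index=pd.to_datetime(["2015-01-01", "2015-02-01", "2015-04-01", "2015-05-01"]),
)

# Misaligned: shifting by position while the gap is absent from the index.
# The "1 month ahead" target for February is really April's value.
naive_target = ndvi.shift(-1)

# Safer: put the series on a complete monthly index first, so the missing
# month is an explicit NaN, and only then shift.
full_index = pd.date_range(ndvi.index.min(), ndvi.index.max(), freq="MS")
aligned_target = ndvi.reindex(full_index).shift(-1)

feb = pd.Timestamp("2015-02-01")
print(naive_target.loc[feb])    # April's NDVI, two months ahead
print(aligned_target.loc[feb])  # NaN, because March is missing
```

Rows whose target is NaN can then be dropped per model, after the shift.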


### Test set

I'm leaving 21 points for the test set, so I can have 21 - 7 = 14 complete predictions for each of the 7 models (one model per month ahead). The training code below holds out the last `test_size = 20` rows.
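
One reading of that arithmetic: a test row only yields a complete 7-month forecast if its 7 targets also fall inside the test window. The helper below is a hypothetical illustration of that count, not part of the notebook.

```python
def complete_forecast_origins(n_test_points, n_horizons):
    """Count test rows whose targets for every horizon also fall in the test set."""
    # A row at position p needs rows p+1 ... p+n_horizons to exist,
    # so the last usable position is n_test_points - n_horizons - 1.
    return max(n_test_points - n_horizons, 0)


print(complete_forecast_origins(21, 7))  # 14
```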


### Model Types
- We can train 7 models on the same input row
- Or we can train 7 models where each uses data predicted by the previous model (it's like using one single model)
    - Training can be done only on real data (current_ndvi, mixed_spi)
    - But at prediction time I can use the previous month's prediction for current_ndvi, and forecast_spi_1...6 instead of mixed_spi
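
The second option can be sketched as a recursive forecast. `toy_one_step_model` is a hypothetical stand-in for a trained 1-month-ahead regressor, and the feature set is reduced to the current NDVI and one SPI value to keep the example short.

```python
def toy_one_step_model(current_ndvi, spi):
    """Hypothetical stand-in for a trained 1-month-ahead regressor."""
    return 0.8 * current_ndvi + 0.05 * spi + 0.1


def recursive_forecast(one_step_model, current_ndvi, mixed_spi, forecast_spis):
    """Roll a 1-month-ahead model forward over several months.

    The first step uses the observed NDVI and mixed_spi, as in training.
    Every later step feeds the previous prediction back in as the current
    NDVI and uses the next forecast SPI in place of mixed_spi.
    """
    preds = []
    ndvi = current_ndvi
    for spi in [mixed_spi, *forecast_spis]:
        ndvi = one_step_model(ndvi, spi)
        preds.append(ndvi)
    return preds


# One observed month plus six forecast SPIs gives seven predictions.
preds = recursive_forecast(toy_one_step_model, 0.5, 0.0, [0.0] * 6)
print(len(preds), preds)
```

The trade-off is that prediction errors compound along the horizon, whereas the first option (one model per horizon, as trained below) always starts from observed NDVI.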

In [None]:
import xgboost as xgb

In [None]:
import json
import seaborn as sns
import altair as alt
import httpx
import matplotlib.pyplot as plt
import pandas as pd
from shapely import geometry
from shapely.ops import unary_union
import os
from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env.

In [None]:
tete_aga = 2179

# API Auth

In [None]:
base_url = "http://localhost:8081/"

In [None]:
client = httpx.Client(base_url=base_url)

In [None]:
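r = client.post(
    "auth/token",
    # Credentials come from the .env file loaded above; the variable names
    # API_USERNAME and API_PASSWORD are assumptions, adjust them to your setup.
    data={"username": os.environ["API_USERNAME"], "password": os.environ["API_PASSWORD"]},
)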

In [None]:
token = json.loads(r.text)["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# Get agricultural areas geoms from the API

In [None]:

r = client.get("aoi/", params=dict(id=tete_aga), headers=headers)
res = json.loads(r.text)
polygons = geometry.shape(res["features"][0]["geometry"])


In [None]:
# get the bbox for all the ag areas
box = polygons.bounds


# Get Events from DB
- Date range from 1985-01-02 to 2022-08-31

In [None]:
start_datetime = "1985-01-02"
end_datetime = "2022-08-31"

In [None]:
r = client.get(
    "events/",
    params=dict(
        aoi_id=tete_aga,
        start_datetime=start_datetime,
        end_datetime=end_datetime,
        limit=10000,
    ),
    headers=headers,
    timeout=60,
)

In [None]:
aga_results = json.loads(r.text)["events"]

In [None]:
def get_keyed_values(results, keyed_value, new_col):
    """Build a datetime-indexed DataFrame holding one keyed value from the events."""
    df = pd.DataFrame(results)
    df.labels = df.labels.map(lambda x: x[0])  # keep only the first label
    df[new_col] = df.keyed_values.apply(lambda x: x.get(keyed_value))
    # Keep one record per AOI and datetime, and drop rows with any missing field.
    df = df.drop_duplicates(subset=["aoi_id", "datetime"]).dropna()
    df.datetime = pd.to_datetime(df.datetime)

    if keyed_value == "FORECAST_SPI":
        # Expand the 6 forecast values into one column per lead month.
        months_df = df[new_col].apply(pd.Series)
        months_df.columns = [f"{new_col}_{k}" for k in range(1, 7)]
        df = pd.concat([df.drop([new_col], axis=1), months_df], axis=1)
        df.index = df["datetime"]
    elif keyed_value == "mean_value":
        # Average within each timestamp, then within each calendar month;
        # "MS" labels every month by its first day. numeric_only=True skips
        # the non-numeric columns, which recent pandas no longer drops silently.
        df = df.groupby(["datetime"]).mean(numeric_only=True).resample("MS").mean()
    else:
        df.index = df["datetime"]

    return df

In [None]:
keyed_values = {
    "mean_value": "ndvi_mean",
    "MIXED_SPI": "mixed_spi",
    "FORECAST_SPI": "forecast_spi",
    "CHIRPS_SPI_actual": "chirps_spi_actual",
}

In [None]:
ndvi = get_keyed_values(aga_results, "mean_value", "month_ndvi_mean")

In [None]:
forecast_spi = get_keyed_values(aga_results, "FORECAST_SPI", "forecast_spi")
mixed_spi = get_keyed_values(aga_results, "MIXED_SPI", "mixed_spi")
chirps_actual = get_keyed_values(aga_results, "CHIRPS_SPI_actual", "chirps_actual")

In [None]:
cols_to_use = ['aoi_id', 'month_ndvi_mean', 'forecast_spi_1',
       'forecast_spi_2', 'forecast_spi_3', 'forecast_spi_4', 'forecast_spi_5',
       'forecast_spi_6','mixed_spi','chirps_actual']

In [None]:
final_df = ndvi.join(forecast_spi, lsuffix="", rsuffix="_r")
final_df = final_df.join(mixed_spi, lsuffix="", rsuffix="_r")
final_df = final_df.join(chirps_actual, lsuffix="", rsuffix="_r")[cols_to_use]

In [None]:
# interpolate to fill values
#final_df["month_ndvi_mean"] = final_df["month_ndvi_mean"].interpolate("time") 

In [None]:
# Rolling sums of the mixed SPI over the current month and the previous 1 or 2 months.
final_df["mixed_spi_cumsum_2"] = final_df.mixed_spi.rolling(min_periods=1, window=2).sum()
final_df["mixed_spi_cumsum_3"] = final_df.mixed_spi.rolling(min_periods=1, window=3).sum()

# Same idea along the forecast horizon: sum lead month i with the 1 or 2 months
# before it, falling back to the mixed SPI where the window reaches the current
# month or earlier.
for i in range(1, 7):
    if i == 1:
        final_df[f"forecast_spi_{i}_cumsum_2"] = final_df["mixed_spi"] + final_df["forecast_spi_1"]
        final_df[f"forecast_spi_{i}_cumsum_3"] = final_df["mixed_spi_cumsum_2"] + final_df["forecast_spi_1"]
    elif i == 2:
        final_df[f"forecast_spi_{i}_cumsum_2"] = final_df["forecast_spi_1"] + final_df["forecast_spi_2"]
        final_df[f"forecast_spi_{i}_cumsum_3"] = final_df["mixed_spi"] + final_df[f"forecast_spi_{i}_cumsum_2"]
    else:
        final_df[f"forecast_spi_{i}_cumsum_2"] = final_df[f"forecast_spi_{i-1}"] + final_df[f"forecast_spi_{i}"]
        final_df[f"forecast_spi_{i}_cumsum_3"] = final_df[f"forecast_spi_{i-2}"] + final_df[f"forecast_spi_{i}_cumsum_2"]

# Rolling sums of the observed NDVI.
final_df["month_ndvi_mean_cumsum_2"] = final_df.month_ndvi_mean.rolling(min_periods=1, window=2).sum()
final_df["month_ndvi_mean_cumsum_3"] = final_df.month_ndvi_mean.rolling(min_periods=1, window=3).sum()
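
To see what `rolling(min_periods=1, ...)` does at the start of a series, here is a toy example with invented values. It also suggests why the training cell later drops the first two rows: their 2- and 3-month sums are built from incomplete windows and repeat each other.

```python
import pandas as pd

spi = pd.Series([1.0, 2.0, 3.0, 4.0])

# min_periods=1 lets the first rows be computed from fewer than `window` values.
sum_2 = spi.rolling(window=2, min_periods=1).sum()
sum_3 = spi.rolling(window=3, min_periods=1).sum()

print(pd.DataFrame({"spi": spi, "sum_2": sum_2, "sum_3": sum_3}))
# sum_2 is 1, 3, 5, 7 and sum_3 is 1, 3, 6, 9:
# the two sums are identical in the first two rows.
```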


In [None]:
for i in range(1,8):
    final_df[f"target_ndvi_{i}"] = final_df.month_ndvi_mean.shift(-i)




# Models training

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from tqdm import tqdm
from collections import defaultdict

models = []

test_size = 20  # number of most recent rows held out for testing (a row count, not a percentage)
for i in tqdm(range(7)):

    if i == 0:
        cols_to_use = ['month_ndvi_mean', 'mixed_spi', 'mixed_spi_cumsum_2', 'mixed_spi_cumsum_3']
    else:
        cols_to_use = ['month_ndvi_mean', 
                    f'forecast_spi_{i}', f'forecast_spi_{i}_cumsum_2', f'forecast_spi_{i}_cumsum_3']
        
    print(f"Using features: {cols_to_use}")
    target_col = f"target_ndvi_{i+1}"
    model_df = final_df[cols_to_use + [target_col]].dropna()
    model_df = model_df[2:] # to remove repeated cumsums
    train = model_df[:-test_size]
    test = model_df[-test_size:]
    
    y_train = train[[target_col]]
    y_test = test[[target_col]]

    xg_reg = xgb.XGBRegressor()

    params = {'objective':['reg:squarederror'],
              'learning_rate': [0.1,0.3,0.5], #so called `eta` value
              'max_depth': [4, 6,8,10,12],
              'n_estimators': [100,200,1000]}


    xgb_grid = GridSearchCV(xg_reg,
                            params,
                            cv = TimeSeriesSplit(n_splits=3),
                            n_jobs = -1,
                            verbose=True)

    xgb_grid.fit(train[cols_to_use],
             y_train)
    print(xgb_grid.best_params_)
    #xg_reg.fit(X_train.values,y_train)
    models.append(xgb_grid.best_estimator_)
    
    
    
    

In [None]:
from sklearn.metrics import r2_score

preds = {}
y_true = {}

r2_scores = []
for i in range(7):
    if i == 0:
        cols_to_use = ['month_ndvi_mean', 'mixed_spi', 'mixed_spi_cumsum_2', 'mixed_spi_cumsum_3']
    else:
        cols_to_use = ['month_ndvi_mean', 
                    f'forecast_spi_{i}', f'forecast_spi_{i}_cumsum_2', f'forecast_spi_{i}_cumsum_3']
        

    target_col = f"target_ndvi_{i+1}"
    model_df = final_df[cols_to_use + [target_col]].dropna()
    
    train = model_df[:-test_size]
    test = model_df[-test_size:]
    preds[i] = models[i].predict(test[cols_to_use])
    y_true[i] = test[[f"target_ndvi_{i+1}"]].values[:,0]
    
    r2_scores.append(r2_score(y_true[i], preds[i]))

    sns.lineplot(data=pd.DataFrame({"y_true":y_true[i],"y_pred":preds[i]}))
    plt.show()

In [None]:
target_cols = [f"target_ndvi_{k}" for k in range(1, 8)]

fig, axs = plt.subplots(2, 5, figsize=(20, 10), sharey=True)
# Plot the 7-month forecast against the observed NDVI for the last 10 complete rows.
complete_rows = final_df.dropna()
for j in range(10):
    instance = complete_rows.iloc[-10 + j]
    instances_preds = []
    for i in range(7):
        if i == 0:
            cols_to_use = ['month_ndvi_mean', 'mixed_spi', 'mixed_spi_cumsum_2', 'mixed_spi_cumsum_3']
        else:
            cols_to_use = ['month_ndvi_mean',
                           f'forecast_spi_{i}', f'forecast_spi_{i}_cumsum_2', f'forecast_spi_{i}_cumsum_3']

        features = instance[cols_to_use].values.reshape(1, -1)
        instances_preds.append(models[i].predict(features)[0])
    ax = axs[j // 5, j % 5]
    # Select the targets by name instead of relying on them being the last 7 columns.
    ax.plot(instance[target_cols].values, label=f"y_true {instance.name}")
    ax.plot(instances_preds, label="y_pred")
    ax.legend()
plt.show()
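
The feature list is repeated in the training, evaluation and plotting cells. A small helper could keep the three in sync. This is a possible refactor, not something the notebook currently uses, and the name `feature_columns` is made up.

```python
def feature_columns(i):
    """Feature names for the model that predicts target_ndvi_{i + 1}."""
    if i == 0:
        # The 1-month-ahead model only sees data available for the current month.
        return ["month_ndvi_mean", "mixed_spi", "mixed_spi_cumsum_2", "mixed_spi_cumsum_3"]
    # Later horizons swap the mixed SPI for the forecast SPI i months ahead.
    return [
        "month_ndvi_mean",
        f"forecast_spi_{i}",
        f"forecast_spi_{i}_cumsum_2",
        f"forecast_spi_{i}_cumsum_3",
    ]


print(feature_columns(0))
print(feature_columns(3))
```

Each loop body would then start with `cols_to_use = feature_columns(i)`.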