Originally I used ARIMA, which produced excellent results. That was before I extended the dataset to include the period with varying demand collection frequency, which I had expected to be easy to migrate to. I struggled to use ARIMA for the final prediction and therefore resorted to XGBoost, which performed worse on my predictions in the first half of the dataset.

XGBoost has an edge in capturing non-linear patterns that traditional models like ARIMA can struggle with, whereas ARIMA works mostly with lagged versions of the target variable and seasonal adjustments. Even so, ARIMA produced better results for me.
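
To make the "lagged versions of the target" idea concrete, here is a minimal sketch of how the same information can be handed to a tree model as ordinary columns. The helper name `add_lag_features` and the toy data are hypothetical and are not part of this project's feature engineering.

```python
import pandas as pd

def add_lag_features(df, target="demand", lags=(1, 2, 3)):
    """Return a copy of df with lagged copies of the target as extra columns."""
    out = df.copy()
    lag_cols = []
    for lag in lags:
        col = f"{target}_lag{lag}"
        # shift(lag) moves values down, so row t holds the value from row t - lag
        out[col] = out[target].shift(lag)
        lag_cols.append(col)
    # The earliest rows have no history for the longest lag, so drop them
    return out.dropna(subset=lag_cols).reset_index(drop=True)

# Toy series, purely illustrative
toy = pd.DataFrame({"demand": [1.0, 2.0, 4.0, 8.0, 16.0]})
lagged = add_lag_features(toy, lags=(1, 2))
print(lagged)
```

Each remaining row now carries its own recent history, which is roughly what the autoregressive part of ARIMA uses internally.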

In [32]:
import pandas as pd
import numpy as np
from xgboost import XGBRegressor
from XGBoost_model_testing import model_testing_xgboost
from plot_predictions import plot_predictions
from rmses import rmse_evaluator

In [33]:
df_cyclic = pd.read_pickle('../data/fe_temp_cyclic_data.pkl')


The test dates were chosen because they produce differently shaped demand curves, and because they fall after the frequency of readings increased.

In [35]:
test_dates = [
        '2023-10-28 09:33',
        '2023-10-28 13:09',
        '2023-11-02 06:27',
        '2023-11-02 09:48'
        ]

In [36]:
df_cyclic_features = ['tempmax', 'tempmin', 'temp', 'precip',
       'snow', 'snowdepth', 'windspeedmean', 'solarenergy', 'year',
       'sum_3min', 'sum_5min', 'sum_10min', 'sum_15min', 'sum_20min',
       'sum_30min', 'sum_45min', 'sum_1h', 'sum_24h', 'rolling_mean_30min',
       'rolling_std_30min', 'rolling_mean_1h', 'rolling_std_1h',
       'rolling_mean_24h', 'rolling_std_24h', 'rolling_min_24h',
       'rolling_max_24h', 'quarter_sin', 'quarter_cos', 'day_of_month_sin',
       'day_of_month_cos', 'hour_sin', 'hour_cos', 'minute_of_day_sin',
       'minute_of_day_cos', 'minute_sin', 'minute_cos', 'month_sin',
       'month_cos', 'day_of_year_sin', 'day_of_year_cos', 'day_of_week_sin',
       'day_of_week_cos']
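
The `*_sin` / `*_cos` columns in the list above look like the usual cyclic encoding of calendar fields. The feature-engineering code is not shown in this notebook, so the helper below (`cyclic_encode`, a hypothetical name) is only a sketch of how such columns are typically built, assuming each field is mapped onto a circle with its natural period.

```python
import numpy as np

def cyclic_encode(values, period):
    """Map values in [0, period) onto the unit circle as (sin, cos) pairs."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

# Illustrative hours of the day, not taken from the dataset
hours = np.array([0, 6, 12, 23])
hour_sin, hour_cos = cyclic_encode(hours, period=24)

def circle_distance(i, j):
    """Straight-line distance between two encoded hours."""
    return float(np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j]))

# Hour 23 ends up much closer to hour 0 than hour 6 is,
# which a raw 0..23 integer column would not express
print(circle_distance(3, 0), circle_distance(1, 0))
```

The point of using two columns per field is that the end of a cycle sits next to its start, so the model does not see an artificial jump at midnight or at the turn of the year.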

In [None]:
df_with_predictions = model_testing_xgboost(df_cyclic,test_dates,df_cyclic_features)
print(rmse_evaluator(df_with_predictions,test_dates,mini = True))

In [None]:
plot_predictions(df_with_predictions,test_dates)

It seems to broadly work, but struggles on the data points close to 0.
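
The scores printed earlier come from the project's own `rmse_evaluator`, whose internals are not shown in this notebook. As a reference point only, the sketch below computes a plain root-mean-squared error with NumPy; the function name `rmse` and the choice to skip missing values are assumptions, not a description of what `rmse_evaluator` does.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-squared error over the rows where both values are present."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Ignore rows where either side is missing (assumed behaviour)
    mask = ~(np.isnan(actual) | np.isnan(predicted))
    return float(np.sqrt(np.mean((actual[mask] - predicted[mask]) ** 2)))

# Illustrative values only
print(rmse([0.0, 0.1, 0.5], [0.05, 0.1, 0.4]))
```

Because the error is squared on the original scale, a miss of a given size near 0 counts the same as the same miss on a large reading, even though it is a far bigger relative error.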

Here we will try some different starting points for the training data.

In [27]:
# Keep only rows after each cut-off date, then renumber the rows from 0
df_cyclic_after_2022 = df_cyclic[(df_cyclic.DateTime > '2022')].copy()
df_cyclic_after_2022 = df_cyclic_after_2022.reset_index(drop=True)

df_cyclic_after_2023 = df_cyclic[(df_cyclic.DateTime > '2023')].copy()
df_cyclic_after_2023 = df_cyclic_after_2023.reset_index(drop=True)

df_cyclic_after_2023_09 = df_cyclic[(df_cyclic.DateTime > '2023-09')].copy()
df_cyclic_after_2023_09 = df_cyclic_after_2023_09.reset_index(drop=True)

dfs = [df_cyclic,df_cyclic_after_2022,df_cyclic_after_2023,df_cyclic_after_2023_09 ]

Using the hyperparameters found with Optuna and randomised search:

In [None]:
for df in dfs:
    df_with_predictions = model_testing_xgboost(df,test_dates,df_cyclic_features, max_depth= 3,learning_rate=0.07,
                                                n_estimators = 700, gamma = 0.25,subsample = 0.75)
    print(rmse_evaluator(df_with_predictions,test_dates,mini = True))

In [None]:
for df in dfs:
    df = df.copy()
    df.demand = np.log(df.demand)
    df_with_predictions = model_testing_xgboost(df,test_dates,df_cyclic_features, max_depth= 3,learning_rate=0.07,
                                                n_estimators = 700, gamma = 0.25,subsample = 0.75)
    df_with_predictions['demand'] = df_with_predictions['demand'].apply(lambda x: np.exp(x) if pd.notna(x) else x)
    df_with_predictions['predictions'] = df_with_predictions['predictions'].apply(lambda x: np.exp(x) if pd.notna(x) else x)

    print(rmse_evaluator(df_with_predictions,test_dates,mini = True))

The best result so far comes from the shortest time frame combined with log-transforming the demand, so we will try some predictions that restrict the training window to start dates within the final year.
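
The log transform and its inverse can be sketched in isolation as below. The cells in this notebook use `np.log` and `np.exp`; the `np.log1p` / `np.expm1` pair shown here is a swapped-in variant that stays finite at exactly 0. Whether the real demand column contains exact zeros is not shown here, so treat this as an assumption-laden illustration on made-up values.

```python
import numpy as np

# Made-up demand values, including an exact zero
demand = np.array([0.0, 0.02, 0.5, 3.0])

# Plain log sends 0 to -inf, which a regressor cannot train on
with np.errstate(divide="ignore"):
    plain_log = np.log(demand)

# log1p(x) = log(1 + x) is finite at 0 and is undone exactly by expm1
log_demand = np.log1p(demand)
restored = np.expm1(log_demand)

print(plain_log)
print(log_demand)
print(restored)
```

Whichever pair is used, the predictions have to be mapped back with the matching inverse before computing RMSE, so that scores stay comparable with the untransformed runs.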

In [None]:
for i in range(7):
    df = df_cyclic.copy()
    month = 3 + i
    
    # Keep rows after the start of the chosen month and renumber from 0
    df = df[(df.DateTime > '2023-0{}'.format(month))].copy()
    df = df.reset_index(drop=True)
    df.demand = np.log(df.demand)
    df_with_predictions = model_testing_xgboost(df,test_dates,df_cyclic_features, max_depth= 3,learning_rate=0.07,
                                                n_estimators = 700, gamma = 0.25,subsample = 0.75)
    df_with_predictions['demand'] = df_with_predictions['demand'].apply(lambda x: np.exp(x) if pd.notna(x) else x)
    df_with_predictions['predictions'] = df_with_predictions['predictions'].apply(lambda x: np.exp(x) if pd.notna(x) else x)

    print("month",month)
    print(rmse_evaluator(df_with_predictions,test_dates,mini = True))

In [None]:
df = df_cyclic.copy()
df = df[(df.DateTime > '2023-08')].copy()
df = df.reset_index(drop=True)
df.demand = np.log(df.demand)
df_with_predictions = model_testing_xgboost(df,test_dates,df_cyclic_features, max_depth= 3,learning_rate=0.07,
                                            n_estimators = 700, gamma = 0.25,subsample = 0.75)
df_with_predictions['demand'] = df_with_predictions['demand'].apply(lambda x: np.exp(x) if pd.notna(x) else x)
df_with_predictions['predictions'] = df_with_predictions['predictions'].apply(lambda x: np.exp(x) if pd.notna(x) else x)
rmse_evaluator(df_with_predictions,test_dates,mini = True)

# Final prediction

In [None]:
# Same start date as the last test above
df_final = df_cyclic[(df_cyclic.DateTime > '2023-08')].copy()
predicted_df = df_final.copy()
predicted_df['predictions'] = None
# Train on every row except the last, which is the one to predict
trainsize = len(predicted_df) - 1
non_numeric_columns = df_final.select_dtypes(include=['object']).columns

df_final = df_final.drop(columns=non_numeric_columns)
df_train = df_final.iloc[:trainsize]

X_train = df_train[df_cyclic_features]
# Note: demand is used on its original scale here, without the log transform
y_train = df_train['demand']
xgb_model = XGBRegressor(n_estimators=700, learning_rate=0.07, max_depth=3, subsample=0.75, gamma=0.25)
xgb_model.fit(X_train, y_train)

# Feature row for the final timestamp
X_test = df_final.iloc[[trainsize]][df_cyclic_features]

xgb_model.predict(X_test)





[0.05571267]

With more time I would have explored further cleaning of the data, as there seemed to be some suspicious demand reporting. I would also have tried reducing the number of variables, since many of them overlap (which PCA could address) or just add unhelpful noise; removing these would likely improve accuracy.
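
As a rough illustration of the PCA idea, the sketch below standardises a feature matrix and projects it onto its leading principal components using NumPy's SVD. The function name `pca_reduce` and the synthetic data are hypothetical; this was not run on the project's features, and whether it would help the XGBoost model is untested.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Standardise the columns of X, then project onto the top principal components."""
    X = np.asarray(X, dtype=float)
    # Put every feature on the same scale so no column dominates by units alone
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, S, Vt = np.linalg.svd(X_std, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    scores = X_std @ Vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic data: two nearly duplicate columns plus one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2.0 * base + 0.01 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])
scores, ratio = pca_reduce(X, n_components=2)
print(scores.shape, ratio)
```

With overlapping columns such as the nested rolling sums, most of the variance would be expected to land in a handful of components, which is the sense in which PCA could shrink the feature set.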

The hyperparameter searches were also not run explicitly on the data used for the final predictions, so both the hyperparameters and the choice of model could be better optimised.
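
One way a search could be re-run on the final training window is with time-ordered folds, so that every candidate is scored only on rows that come after its training rows. The generator below is a minimal sketch under that assumption; `walk_forward_splits` is a hypothetical helper, not something used elsewhere in this project.

```python
def walk_forward_splits(n_rows, n_folds, test_size):
    """Yield (train_idx, test_idx) ranges where each test block follows its training rows."""
    for k in range(n_folds, 0, -1):
        test_end = n_rows - (k - 1) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            # Not enough history before this block to train on
            continue
        yield range(0, test_start), range(test_start, test_end)

# Illustrative sizes only
for train_idx, test_idx in walk_forward_splits(n_rows=10, n_folds=3, test_size=2):
    print(list(train_idx), list(test_idx))
```

Each hyperparameter candidate would then be fitted on `train_idx` and scored on `test_idx` for every fold, with the scores averaged before picking a winner.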