# Forecasting Wave Height and Wave Period: Ensemble Methods
## XGBoost and Random Forest

# Introduction

This notebook forecasts wave height and wave period independently. Same-day wave height features are kept as predictors when forecasting wave period, and same-day wave period values are kept when forecasting wave height. The results will be compared with those from the last notebook, where wave period and wave height will be forecast together for the same day. Comparing the errors should lead to a better understanding of how wave height and wave period relate when forecasting both. As seen in the earlier linear regression, each variable is important in predicting the other.
XGBoost and Random Forest models are used to forecast wave height and wave period.

# Methodology

First, the daily sampled data frame for the time series is imported, and the rolling data frame is built from it. All directional data is converted to radians, and lags of all features are created. Additional features are then added: moon phase, and the temporal variables week, month, and season. These are all cyclically encoded, which may help the models interpret them. Once the rolling data frame with the added features is ready, modelling is performed separately for wave height and wave period. Each model was optimized using grid search, with a time series split from sklearn for cross-validation. The number of splits was set to n=3. The same train and test split in time was maintained throughout this process.
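
The cyclic encoding described above can be illustrated with a minimal sketch. `cyclic_encode` is a hypothetical helper written for this illustration, not a function used elsewhere in the notebook. It maps a periodic value onto the unit circle, so that December and January end up close together even though their month numbers are far apart.

```python
import numpy as np

def cyclic_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Hypothetical example: months 1 (January), 6 (June) and 12 (December)
encoded = cyclic_encode([1, 6, 12], period=12)
jan, jun, dec = encoded

dist_dec_jan = np.linalg.norm(dec - jan)  # adjacent months: small distance
dist_jun_jan = np.linalg.norm(jun - jan)  # five months apart: large distance
print(f"Dec-Jan: {dist_dec_jan:.3f}, Jun-Jan: {dist_jun_jan:.3f}")
```

With raw month numbers, December (12) and January (1) would be 11 units apart. In the encoded space they are neighbours.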


# Data Dictionary

<details>
  <summary>Data Dictionary</summary>
  
| Field       | Description                                             |
|-------------|---------------------------------------------------------|
| Index       | Date time excluding minutes (used to join df_buoy and df_hind) |
| DEPTH       | Depth in meters                                        |
| VWH         | Characteristic significant wave height (reported by the buoy) (m) |
| VCMX        | Maximum zero crossing wave height (reported by the buoy) (m) |
| VTP         | Wave spectrum peak period (reported by the buoy) (s)   |
| WDIR        | Direction from which the wind is blowing (° True)      |
| WDIR.1      | Estimated Direction from which the wind is blowing (° True)      |
| WSPD        | Horizontal wind speed (m/s)                           |
| WSPD.1      | Estimated wind speed within 10 meters. (m/s)           |
| WSS         | Horizontal scalar wind speed (m/s)                     |
| GSPD        | Gust wind speed (m/s)                                   |
| GSPD.1      | Documentation not found            |
| ATMS        | Atmospheric pressure at sea level (mbar)               |
| DRYT        | Dry bulb temperature (air temperature) (°C)            |
| SSTP        | Sea surface temperature (°C)                           |
| WD          | Wind Direction (deg from which wind is blowing (° True)) |
| WS          | Wind Speed (m/s)                                       |
| ETOT        | Total Variance of Total Spectrum (m^2)                |
| TP          | Peak Spectral Period of Total Spectrum (sec)           |
| VMD         | Vector Mean Direction of Total Spectrum (deg to which) |
| ETTSea      | Total Variance of Primary Partition (m^2)             |
| TPSea       | Peak Spectral Period of Primary Partition (sec)        |
| VMDSea      | Vector Mean Direction of Primary Partition (deg to which) |
| ETTSw       | Total Variance of Secondary Partition (m^2)           |
| TPSw        | Peak Spectral Period of Secondary Partition (sec)      |
| VMDSw       | Vector Mean Direction of Secondary Partition (deg to which) |
| MO1         | First Spectral Moment of Total Spectrum (m^2/s)       |
| MO2         | Second Spectral Moment of Total Spectrum (m^2/s^2)    |
| HS          | Significant Wave Height (m)                            |
| DMDIR       | Dominant Direction (deg to which)                       |
| ANGSPR      | Angular Spreading Function                             |
| INLINE      | In-Line Variance Ratio                                 |

</details>



In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
import decimal
from datetime import datetime

In [2]:
df=pd.read_csv('../Data/df_daily_imputed.csv', index_col=0)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9902 entries, 1988-11-22 to 2016-01-01
Data columns (total 33 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   LATITUDE   9902 non-null   float64
 1   LONGITUDE  9902 non-null   float64
 2   DEPTH      9902 non-null   float64
 3   VWH$       9902 non-null   float64
 4   VCMX       9902 non-null   float64
 5   VTP$       9902 non-null   float64
 6   WDIR       9902 non-null   float64
 7   WSPD       9902 non-null   float64
 8   GSPD       9902 non-null   float64
 9   WDIR.1     9902 non-null   float64
 10  WSPD.1     9902 non-null   float64
 11  GSPD.1     9902 non-null   float64
 12  ATMS       9902 non-null   float64
 13  DRYT       9902 non-null   float64
 14  SSTP       9902 non-null   float64
 15  YEAR       9902 non-null   float64
 16  WD         9902 non-null   float64
 17  WS         9902 non-null   float64
 18  ETOT       9902 non-null   float64
 19  TP         9902 non-null   float64
 20

## Feature Engineering

In [4]:
# convert directions(degrees North) into radians
columns_to_convert = ['VMD', 'VMDSea', 'VMDSw', 'WD', 'WDIR', 'WDIR.1']

# Convert specified columns to radians
df[columns_to_convert] = np.radians(df[columns_to_convert])

In [5]:
# Define lags for different time intervals
lags = {'1_day': 1, '1_week': 7, '1_month': 30, '3_month': 90}


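# Build every lag column first, then assemble them in one step.
# Inserting columns one at a time fragments the frame and can trigger a pandas PerformanceWarning.
lagged_columns = {}
for column in df.select_dtypes(include='number').columns:
    # Create lags for the different time intervals
    for lag_name, lag_value in lags.items():
        lagged_columns[f'{column}_lag_{lag_name}'] = df[column].shift(lag_value)

# New DataFrame of lag features, so the original DataFrame is not modified in place
new_df = pd.DataFrame(lagged_columns)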


# Combine the new features with the original DataFrame
features_df = pd.concat([df, new_df], axis=1)

# Drop rows with null values
features_df = features_df.dropna()

# Display the modified DataFrame
print(features_df.head())


               LATITUDE  LONGITUDE  DEPTH      VWH$      VCMX       VTP$  \
Datetime_buoy                                                              
1989-02-20        48.83      126.0   73.0  2.000417  3.683333  12.922500   
1989-02-21        48.83      126.0   73.0  2.281739  3.926087  13.330435   
1989-02-22        48.83      126.0   73.0  2.645000  4.691667  10.620833   
1989-02-23        48.83      126.0   73.0  2.488750  4.337500  10.010000   
1989-02-24        48.83      126.0   73.0  2.564583  4.475000  12.950833   

                   WDIR       WSPD       GSPD    WDIR.1  ...  \
Datetime_buoy                                            ...   
1989-02-20     2.070397   9.537500  11.558333  1.914772  ...   
1989-02-21     2.057971  10.847826  13.182609  1.891026  ...   
1989-02-22     2.600541   8.416667  10.287500  2.431098  ...   
1989-02-23     2.953970   5.193750   6.987500  2.781618  ...   
1989-02-24     1.845686   5.983333   7.516667  1.693697  ...   

               DMD


In [6]:
features_df.shape

(9812, 165)
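
The shape above follows from the lag construction: the 33 original columns each gain 4 lagged copies (33 + 33 × 4 = 165 columns), and the 90-day lag leaves the first 90 rows with nulls, which `dropna` removes (9902 − 90 = 9812 rows). A minimal sketch on a toy frame shows the same arithmetic. The frame `toy` and its columns `a` and `b` are hypothetical stand-ins for the buoy data.

```python
import numpy as np
import pandas as pd

lags = {'1_day': 1, '1_week': 7, '1_month': 30, '3_month': 90}

# Toy frame standing in for the daily data: 200 rows, 2 numeric columns
toy = pd.DataFrame(
    {'a': np.arange(200.0), 'b': np.arange(200.0) * 2},
    index=pd.date_range('2000-01-01', periods=200, freq='D'),
)

# One lagged copy of every column for every lag
lagged = pd.DataFrame({
    f'{col}_lag_{name}': toy[col].shift(n)
    for col in toy.columns
    for name, n in lags.items()
})

# Rows with any null (the first 90, because of the longest lag) are dropped
toy_features = pd.concat([toy, lagged], axis=1).dropna()

n_rows, n_cols = toy_features.shape
print(n_rows, n_cols)  # 200 - 90 rows, 2 + 2 * 4 columns
```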

In [7]:
features_df.index = pd.to_datetime(features_df.index)

In [8]:
features_df.head(1)

Datetime_buoy,LATITUDE,LONGITUDE,DEPTH,VWH$,VCMX,VTP$,WDIR,WSPD,GSPD,WDIR.1,...,DMDIR_lag_1_month,DMDIR_lag_3_month,ANGSPR_lag_1_day,ANGSPR_lag_1_week,ANGSPR_lag_1_month,ANGSPR_lag_3_month,INLINE_lag_1_day,INLINE_lag_1_week,INLINE_lag_1_month,INLINE_lag_3_month
1989-02-20,48.83,126.0,73.0,2.000417,3.683333,12.9225,2.070397,9.5375,11.558333,1.914772,...,94.429252,67.0,0.797225,0.6411,0.73862,0.8075,0.730675,0.660987,0.727566,0.7153


**Cyclic Encode Temporal Features**

In [9]:
#Create cyclical encoded features for month, season, and week
features_df['month_sin'] = np.sin(2 * np.pi * features_df.index.month / 12)
features_df['month_cos'] = np.cos(2 * np.pi * features_df.index.month / 12)

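#Assume four seasons of three months each, with December grouped with January and February
#season index: 0 = Dec-Feb, 1 = Mar-May, 2 = Jun-Aug, 3 = Sep-Nov
#(computed first because % binds no tighter than *, so 2 * np.pi * month % 12 would take the modulo of the angle)
season = features_df.index.month % 12 // 3
features_df['season_sin'] = np.sin(2 * np.pi * season / 4)
features_df['season_cos'] = np.cos(2 * np.pi * season / 4)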

features_df['week_sin'] = np.sin(2 * np.pi * features_df.index.strftime('%U').astype(int) / 52)  # Assuming 52 weeks in a year
features_df['week_cos'] = np.cos(2 * np.pi * features_df.index.strftime('%U').astype(int) / 52)




In [10]:
features_df.shape

(9812, 171)

In [11]:
# add moonphase as a column (Code for moonphase taken from kaggle: https://www.kaggle.com/competitions/m5-forecasting-accuracy/discussion/154776)
def get_moon_phase(d):  # 0=new, 4=full; 4 days/phase
    diff = d - datetime(2001, 1, 1)
    days = decimal.Decimal(diff.days) + (decimal.Decimal(diff.seconds) / decimal.Decimal(86400))
    lunations = decimal.Decimal("0.20439731") + (days * decimal.Decimal("0.03386319269"))
    phase_index = math.floor((lunations % decimal.Decimal(1) * decimal.Decimal(8)) + decimal.Decimal('0.5'))
    return int(phase_index) & 7
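
Since the moon phase function is borrowed code, a quick sanity check against a few known dates may be worthwhile. The sketch below repeats the function so that it runs on its own. The check dates are an assumption taken from general almanac knowledge (full moon on 2001-01-09, new moon on 2001-01-24), not from the notebook's data.

```python
import decimal
import math
from datetime import datetime

def get_moon_phase(d):  # same function as above; 0=new, 4=full
    diff = d - datetime(2001, 1, 1)
    days = decimal.Decimal(diff.days) + (decimal.Decimal(diff.seconds) / decimal.Decimal(86400))
    lunations = decimal.Decimal("0.20439731") + (days * decimal.Decimal("0.03386319269"))
    phase_index = math.floor((lunations % decimal.Decimal(1) * decimal.Decimal(8)) + decimal.Decimal('0.5'))
    return int(phase_index) & 7

# Assumed reference dates (not from the notebook)
phase_reference = get_moon_phase(datetime(2001, 1, 1))   # reference date of the formula
phase_full = get_moon_phase(datetime(2001, 1, 9))        # assumed full moon
phase_new = get_moon_phase(datetime(2001, 1, 24))        # assumed new moon

print(phase_reference, phase_full, phase_new)
```

The final `& 7` wraps index 8 back to 0, so the end of one lunation and the start of the next share the new-moon index.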

In [12]:
features_df['moon_phase'] = features_df.index.map(get_moon_phase)



In [13]:
#cyclic encode the moonphase as it is ordinal, then drop moon phase
features_df['moon_phase_sin'] = np.sin(2 * np.pi * features_df['moon_phase'] / 8)
features_df['moon_phase_cos'] = np.cos(2 * np.pi * features_df['moon_phase'] / 8)



In [14]:
features_df=features_df.drop('moon_phase', axis=1)

In [15]:
features_df.shape

(9812, 173)

# Modelling XGBoost

In [16]:
import time

import plotly.express as px
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler



## Wave Period

**Train Test Split**

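First, a train-test split is made on the datetime index. Because the index is in chronological order, all observations before 2006-01-01 form the training set (6,159 rows, about 63% of the data) and all observations from that date onward form the test set (3,653 rows). TimeSeriesSplit from sklearn is used for cross-validation within the training set.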
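
To make the cross-validation scheme concrete, the sketch below lists the fold sizes that `TimeSeriesSplit(n_splits=3)` produces for a training set of 6,159 rows (the size reported by `X_train.shape` in this notebook). It uses a placeholder array rather than the real features, so it only illustrates how the folds are laid out.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_train_rows = 6159  # number of training rows reported by X_train.shape
tscv = TimeSeriesSplit(n_splits=3)

# Placeholder array: only the number of rows matters for the split
fold_sizes = [
    (len(fit_idx), len(val_idx))
    for fit_idx, val_idx in tscv.split(np.zeros((n_train_rows, 1)))
]

for i, (n_fit, n_val) in enumerate(fold_sizes, start=1):
    print(f"fold {i}: fit on first {n_fit} rows, validate on the next {n_val} rows")
```

Each fold validates on data that comes strictly after the data it was fitted on, and the earliest fold is fitted on only about a quarter of the training rows.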

In [17]:
#split on date for train and test
split_point = '2006-01-01'
#filter data on split point
train_xg= features_df.index < split_point
test_xg = features_df.index >= split_point

In [18]:
#define X and y
X_train = features_df[train_xg].drop(['VTP$'],axis=1)
y_train = features_df[train_xg]['VTP$']

X_test = features_df[test_xg].drop(['VTP$'], axis=1)
y_test = features_df[test_xg]['VTP$']

In [19]:
X_train.shape

(6159, 172)

In [20]:
X_test.shape

(3653, 172)

**Pipeline**

**System Specifications and Parallelization**

- **Model Name:** Mac mini
- **Chip:** Apple M2
- **Total Number of Cores:** 8 (4 performance and 4 efficiency)
- **Memory:** 16 GB


**Parallelization**
n_jobs = 3


In [21]:
#pipeline object
xgb_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('xgb_model',xgb.XGBRegressor(objective ='reg:squarederror',random_state=42))
])

In [22]:
#Param Grid
param_grid_gbtree = {
    'xgb_model__booster': ['gbtree'],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2],
    'xgb_model__max_depth': [3, 5, 7],
    'xgb_model__n_estimators': [50, 100, 200],
}

# Param grid for gblinear booster
param_grid_gblinear = {
    'xgb_model__booster': ['gblinear'],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2],
    'xgb_model__reg_alpha': [0, 0.1, 0.5],
}


param_grid =[param_grid_gbtree, param_grid_gblinear]

In [23]:
#grid search
#cv = sklearn TimeSeries Split
tscv = TimeSeriesSplit(n_splits=3)
#set timer
start_time = time.time()
grid_search1 = GridSearchCV(xgb_pipeline, param_grid, cv=tscv, scoring='neg_mean_squared_error',n_jobs=3) #will optimize for smallest mse
grid_search1.fit(X_train, y_train)
#end timer
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Grid search completed in {elapsed_time:.2f} seconds.")

Grid search completed in 46.08 seconds.


In [24]:
#get the best params
best_params = grid_search1.best_params_
best_score1 = grid_search1.best_score_
print('Best Score MSE:', best_score1)
print('Best Hyperparameters:')
print(best_params)

Best Score MSE: -3.253739990909796
Best Hyperparameters:
{'xgb_model__booster': 'gbtree', 'xgb_model__learning_rate': 0.2, 'xgb_model__max_depth': 3, 'xgb_model__n_estimators': 50}


In [25]:
results_list =[]

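The selected hyperparameters lie at the boundaries of their ranges: the learning rate is at the upper bound, while max depth and the number of estimators are at the lower bound. The grid search will be run again with an expanded hyperparameter space, using only gbtree since it produced the best model. gblinear may be revisited and optimized in future.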
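
A check like this can be automated. The sketch below flags any numeric hyperparameter whose selected value sits on the edge of its search range. `params_on_grid_edge` is a hypothetical helper, and the grid and best parameters are copied from the first XGBoost search above.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return {param: 'lower' or 'upper'} for numeric best values on a grid edge."""
    flagged = {}
    for name, value in best_params.items():
        candidates = param_grid.get(name, [])
        numeric = [c for c in candidates if isinstance(c, (int, float))]
        # Skip non-numeric parameters and ranges with a single value
        if len(numeric) < 2 or value not in numeric:
            continue
        if value == min(numeric):
            flagged[name] = 'lower'
        elif value == max(numeric):
            flagged[name] = 'upper'
    return flagged

# Grid and result copied from the first gbtree search
grid = {
    'xgb_model__booster': ['gbtree'],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2],
    'xgb_model__max_depth': [3, 5, 7],
    'xgb_model__n_estimators': [50, 100, 200],
}
best = {
    'xgb_model__booster': 'gbtree',
    'xgb_model__learning_rate': 0.2,
    'xgb_model__max_depth': 3,
    'xgb_model__n_estimators': 50,
}

edges = params_on_grid_edge(grid, best)
print(edges)
```

A value on the edge suggests the optimum may lie outside the range searched, which is the motivation for widening the grid.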

In [26]:
#Use function for grid search
def run_grid_search(X_train, y_train, param_grid, pipeline, scoring_metric, identifier):
    """
    Run a grid search with the specified parameters.

    Parameters:
    - X_train: Training features
    - y_train: Training target values
    - param_grid: Parameter grid for the grid search
    - pipeline: Pipeline object
    - scoring_metric: Scikit-learn scoring metric
    - identifier: Identifier for this grid search run

    Returns:
    - grid_search: Fitted GridSearchCV object

    Side effects:
    - Appends the best params, best validation score, and elapsed time to the global results_list
    - Prints the same information
    """
    #initiate timer
    start_time = time.time()
    #time series cv
    tscv = TimeSeriesSplit(n_splits=3)
    # Set up the scoring metric
    scoring = scoring_metric

    # Instantiate GridSearchCV with 3-split time series cross-validation, n_jobs=3, and specified scoring metric
    grid_search = GridSearchCV(pipeline, param_grid, cv=tscv, scoring=scoring, n_jobs=3)

    # Fit and run grid search
    grid_search.fit(X_train, y_train)

    #end timer
    end_time = time.time()
    elapsed_time = end_time - start_time
    
    # Store the results (hyperparameters and scores)
    results_list.append({
        'identifier': identifier,
        'best_params': grid_search.best_params_,
        'best_score': grid_search.best_score_,
        'elapsed_time': elapsed_time
    })
    
    # Print the best parameters with the identifier
    print(f"Best Parameters for {identifier}: {grid_search.best_params_}")

    # Print the best score on the validation sets, 
    #.best_score_ is attribute of GridSearch CV that accesses best validation score(score specified in GS)
    print(f"Best {scoring_metric} Score for {identifier}: {grid_search.best_score_}")
    # Print the elapsed time
    print(f"Elapsed Time for {identifier}: {elapsed_time} seconds")

    return grid_search


**GridSearch run, Identifier = optimize_tree**

In [27]:
#run gridsearch with expanded params: 
param_grid_gbtree = {
    'xgb_model__booster': ['gbtree'],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2, 0.5],  # Added 0.5
    'xgb_model__max_depth': [2,3, 5, 7, 10],  # Added 2, 10
    'xgb_model__n_estimators': [50, 100, 200, 300],  # Added 300
    'xgb_model__subsample': [0.8, 0.9, 1.0],  # Add subsample
    'xgb_model__colsample_bytree': [0.8, 0.9, 1.0],  # Add colsample_bytree
}
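
The size of a grid largely determines how long a search takes. As a rough sketch, `ParameterGrid` from sklearn can count the candidate combinations before anything is fitted. The grids below are copied from this notebook. The fit counts assume 3 cross-validation splits and ignore the final refit on the full training set.

```python
from sklearn.model_selection import ParameterGrid

# First search: a list of two grids (gbtree and gblinear)
first_grid = [
    {
        'xgb_model__booster': ['gbtree'],
        'xgb_model__learning_rate': [0.01, 0.1, 0.2],
        'xgb_model__max_depth': [3, 5, 7],
        'xgb_model__n_estimators': [50, 100, 200],
    },
    {
        'xgb_model__booster': ['gblinear'],
        'xgb_model__learning_rate': [0.01, 0.1, 0.2],
        'xgb_model__reg_alpha': [0, 0.1, 0.5],
    },
]

# Expanded gbtree search
expanded_grid = {
    'xgb_model__booster': ['gbtree'],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2, 0.5],
    'xgb_model__max_depth': [2, 3, 5, 7, 10],
    'xgb_model__n_estimators': [50, 100, 200, 300],
    'xgb_model__subsample': [0.8, 0.9, 1.0],
    'xgb_model__colsample_bytree': [0.8, 0.9, 1.0],
}

n_splits = 3
counts = {}
for label, grid in [('first', first_grid), ('expanded', expanded_grid)]:
    n_candidates = len(ParameterGrid(grid))
    counts[label] = (n_candidates, n_candidates * n_splits)
    print(f"{label}: {n_candidates} candidates, {n_candidates * n_splits} fits")
```

The expanded grid has 20 times as many candidates as the first search, which is in line with the much longer run time reported for it in this notebook.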

In [28]:
scoring_metric = 'neg_mean_squared_error'
results_gridsearch = run_grid_search(X_train, y_train, param_grid_gbtree, xgb_pipeline, scoring_metric, 'optimize_tree')



Best Parameters for optimize_tree: {'xgb_model__booster': 'gbtree', 'xgb_model__colsample_bytree': 1.0, 'xgb_model__learning_rate': 0.1, 'xgb_model__max_depth': 2, 'xgb_model__n_estimators': 200, 'xgb_model__subsample': 0.8}
Best neg_mean_squared_error Score for optimize_tree: -3.1566080942885093
Elapsed Time for optimize_tree: 2238.1694531440735 seconds


In [30]:
#get best model
best_model_xg_wp =results_gridsearch.best_estimator_
#predict
y_pred_xg_wp = best_model_xg_wp.predict(X_test)
mae_xg_wp = mean_absolute_error(y_test, y_pred_xg_wp)
mse_xg_wp = mean_squared_error(y_test, y_pred_xg_wp)
print(f'Mean Squared Error: {mse_xg_wp}')
print(f'Mean Absolute Error: {mae_xg_wp}')


Mean Squared Error: 2.60989806744921
Mean Absolute Error: 1.1922717609930682


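The test MSE (2.61) is lower than the best cross-validation MSE (3.16). This gap is treated here as a warning sign about the fit, although a test error below the validation error is not the usual signature of overfitting, which shows up as test error well above training error. As a comparison, predictions are also made with the best model from the first grid search (learning rate 0.2, max depth 3, 50 estimators).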
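
Overfitting is usually diagnosed by comparing the training error with the validation and test errors. The sketch below shows that comparison on synthetic data. It is an illustration only: the data is randomly generated, and sklearn's `GradientBoostingRegressor` stands in for the XGBoost pipeline so that the example has no dependency beyond sklearn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic data (an assumption for illustration, not the buoy data)
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=600)

# Ordered split: first 400 rows for training, last 200 for testing
X_tr, X_te = X[:400], X[400:]
y_tr, y_te = y[:400], y[400:]

model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=42)

# Mean validation MSE over the time series folds (sign flipped back to positive)
cv_mse = -cross_val_score(
    model, X_tr, y_tr,
    cv=TimeSeriesSplit(n_splits=3),
    scoring='neg_mean_squared_error',
).mean()

model.fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))
test_mse = mean_squared_error(y_te, model.predict(X_te))

print(f"train MSE: {train_mse:.3f}, CV MSE: {cv_mse:.3f}, test MSE: {test_mse:.3f}")
```

A training MSE far below both the validation and test MSE points to overfitting. A test MSE close to or below the validation MSE does not by itself.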

In [31]:
#get best model
best_model_xg_wp1 = grid_search1.best_estimator_
#predict
y_pred_xg_wp1 = best_model_xg_wp1.predict(X_test)
mae_xg_wp1 = mean_absolute_error(y_test, y_pred_xg_wp1)
mse_xg_wp1 = mean_squared_error(y_test, y_pred_xg_wp1)
print(f'Mean Squared Error: {mse_xg_wp1}')
print(f'Mean Absolute Error: {mae_xg_wp1}')


Mean Squared Error: 2.7901761154688134
Mean Absolute Error: 1.2144449633650984


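The gap between the test MSE (2.79) and the cross-validation MSE (3.25) remains, so the parameters are adjusted with overfitting in mind: a lower learning rate, a fixed shallow depth, and added regularization.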

**GridSearch Run, Identifier 'adjust_for_ofit'**

In [32]:
param_grid_reduce_of = {
    'xgb_model__booster': ['gbtree'],
    'xgb_model__learning_rate': [0.01, 0.05],  # lower the learning rate 
    'xgb_model__max_depth': [2],  #Limit max depth to 2
    'xgb_model__reg_alpha': [0.3, 0.5, 1.0], #add alpha for regularization
    'xgb_model__reg_lambda': [0.3, 0.5, 1.0], #add lambda for regularization
    'xgb_model__n_estimators': [50, 100, 200, 300],  # Added 300
    'xgb_model__subsample': [0.8, 0.9, 1.0],  # Add subsample
    'xgb_model__colsample_bytree': [0.8, 0.9, 1.0],  # Add colsample_bytree
}

In [33]:
scoring_metric = 'neg_mean_squared_error'
results_gridsearch_ofit = run_grid_search(X_train, y_train, param_grid_reduce_of, xgb_pipeline, scoring_metric, 'adjust_for_ofit')

Best Parameters for adjust_for_ofit: {'xgb_model__booster': 'gbtree', 'xgb_model__colsample_bytree': 0.9, 'xgb_model__learning_rate': 0.05, 'xgb_model__max_depth': 2, 'xgb_model__n_estimators': 300, 'xgb_model__reg_alpha': 1.0, 'xgb_model__reg_lambda': 1.0, 'xgb_model__subsample': 0.8}
Best neg_mean_squared_error Score for adjust_for_ofit: -3.136918799854543
Elapsed Time for adjust_for_ofit: 235.61184096336365 seconds


In [None]:
#get best model
best_model_xg_wp_o1 =results_gridsearch_ofit.best_estimator_
#predict
y_pred_xg_wp_o1 = best_model_xg_wp_o1.predict(X_test)
mae_xg_wp_o1 = mean_absolute_error(y_test, y_pred_xg_wp_o1)
mse_xg_wp_o1 = mean_squared_error(y_test, y_pred_xg_wp_o1)
print(f'Mean Squared Error: {mse_xg_wp_o1}')
print(f'Mean Absolute Error: {mae_xg_wp_o1}')


The model is still judged to be overfitting, so the parameters are adjusted further.

In [None]:
param_grid_reduce_of2 = {
    'xgb_model__booster': ['gbtree'],
    'xgb_model__learning_rate': [0.01, 0.03],  # lower the learning rate further
    'xgb_model__max_depth': [2],  #Limit max depth to 2
    'xgb_model__reg_alpha': [0.5, 1.0, 2.0], #increase range of alpha
    'xgb_model__reg_lambda': [0.5, 1.0, 2.0], #increase range of lambda
    'xgb_model__n_estimators': [50, 100, 200, 300],  # Added 300
    'xgb_model__subsample': [0.6, 0.8, 0.9],  # lower subsampling rate
    'xgb_model__colsample_bytree': [0.8, 0.9, 1.0],  
}

In [None]:
scoring_metric = 'neg_mean_squared_error'
results_gridsearch_ofit2 = run_grid_search(X_train, y_train, param_grid_reduce_of2, xgb_pipeline, scoring_metric, 'adjust_for_ofit2')

In [None]:
#get best model
best_model_xg_wp_o2 =results_gridsearch_ofit2.best_estimator_
#predict
y_pred_xg_wp_o2 = best_model_xg_wp_o2.predict(X_test)
mae_xg_wp_o2 = mean_absolute_error(y_test, y_pred_xg_wp_o2)
mse_xg_wp_o2 = mean_squared_error(y_test, y_pred_xg_wp_o2)
print(f'Mean Squared Error: {mse_xg_wp_o2}')
print(f'Mean Absolute Error: {mae_xg_wp_o2}')


In [None]:
#plot predictions vs actual 
X_test.index = pd.to_datetime(X_test.index)

# Create a DataFrame with actual and predicted values
plot_data = pd.DataFrame({'Datetime': X_test.index, 'Actual': y_test, 'Predicted': y_pred_xg_wp_o2})

# Plotly line plot
fig = px.line(plot_data, x='Datetime', y=['Actual', 'Predicted'], title='Actual vs Predicted',
              labels={'value': 'Wave Period', 'Datetime': 'Date'}, line_shape='linear')

# Show the plot
fig.show()


## Wave Height

In [None]:
#define X and y
X_train_wh = features_df[train_xg].drop(['VWH$'],axis=1)
y_train_wh = features_df[train_xg]['VWH$']

X_test_wh = features_df[test_xg].drop(['VWH$'], axis=1)
y_test_wh = features_df[test_xg]['VWH$']

In [None]:
#pipeline object
xgb_pipeline_wh = Pipeline([
    ('scaler', StandardScaler()),
    ('xgb_model',xgb.XGBRegressor(objective ='reg:squarederror',random_state=42))
])

**GridSearch Run, Identifier = wave_height_1**

In [None]:
#Param Grid
param_grid_gbtree = {
    'xgb_model__booster': ['gbtree'],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2],
    'xgb_model__max_depth': [3, 5, 7],
    'xgb_model__n_estimators': [50, 100, 200],
}

# Param grid for gblinear booster
param_grid_gblinear = {
    'xgb_model__booster': ['gblinear'],
    'xgb_model__learning_rate': [0.01, 0.1, 0.2],
    'xgb_model__reg_alpha': [0, 0.1, 0.5],
}


param_grid_wh =[param_grid_gbtree, param_grid_gblinear]

In [None]:
#Grid Search
scoring_metric = 'neg_mean_squared_error'
results_gridsearch_wh = run_grid_search(X_train_wh, y_train_wh, param_grid_wh, xgb_pipeline_wh, scoring_metric, 'wave_height_1')

In [None]:
#get best model
best_model_wh =results_gridsearch_wh.best_estimator_
#predict
y_pred_wh = best_model_wh.predict(X_test_wh)
mae_wh = mean_absolute_error(y_test_wh, y_pred_wh)
mse_wh = mean_squared_error(y_test_wh, y_pred_wh)
print(f'Mean Squared Error: {mse_wh}')
print(f'Mean Absolute Error: {mae_wh}')


In [None]:
#plot predictions vs actual 
X_test_wh.index = pd.to_datetime(X_test_wh.index)

# Create a DataFrame with actual and predicted values
plot_data = pd.DataFrame({'Datetime': X_test_wh.index, 'Actual': y_test_wh, 'Predicted': y_pred_wh})

# Plotly line plot
fig = px.line(plot_data, x='Datetime', y=['Actual', 'Predicted'], title='Actual vs Predicted',
              labels={'value': 'Wave Height', 'Datetime': 'Date'}, line_shape='linear')

# Show the plot
fig.show()


# Modelling Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor


## Wave Period
Same train test split for wave period will be used from above: 
- X_train, X_test, y_train, y_test

In [None]:
#Pipeline
rf_pipeline = Pipeline([
    ('rf_model', RandomForestRegressor(random_state=42))
])

**GridSearch Run, Identifier = optimize_rf_wp**

In [None]:
#param grid
param_grid_rf = {
    'rf_model__n_estimators': [50, 100, 200],
    'rf_model__max_depth': [None, 5, 10],
    'rf_model__min_samples_split': [2, 5, 10],
    'rf_model__min_samples_leaf': [1, 2, 4],
}

In [None]:
#Grid Search 
scoring_metric = 'neg_mean_squared_error'
grid_search_rf_wp = run_grid_search(X_train, y_train, param_grid_rf, rf_pipeline, scoring_metric, 'optimize_rf_wp')

In [None]:
#get best model
best_model_wp =grid_search_rf_wp.best_estimator_
#predict
y_pred_wp = best_model_wp.predict(X_test)
mae_wp = mean_absolute_error(y_test, y_pred_wp)
mse_wp = mean_squared_error(y_test, y_pred_wp)
print(f'Mean Squared Error wp rf: {mse_wp}')
print(f'Mean Absolute Error wp rf: {mae_wp}')

The model is overfitting, so the parameters are adjusted to reduce it.

**GridSearch Run, Identifier = optimize_rf_wp1**

In [None]:
#param grid 
param_grid_rf1 = {
    'rf_model__n_estimators': [25, 50, 100], #decrease range of estimators
    'rf_model__max_depth': [2, 5, 10], #limit depth
    'rf_model__min_samples_split': [5, 10, 20], #increase range of min sample split
    'rf_model__min_samples_leaf': [3, 6, 12], #increase min samples per leaf , prevent leaves with few samples
}

In [None]:
#Grid Search 
scoring_metric = 'neg_mean_squared_error'
grid_search_rf_wp1 = run_grid_search(X_train, y_train, param_grid_rf1, rf_pipeline, scoring_metric, 'optimize_rf_wp1')

In [None]:
#get best model
best_model_wp1 =grid_search_rf_wp1.best_estimator_
#predict
y_pred_wp1 = best_model_wp1.predict(X_test)
mae_wp1 = mean_absolute_error(y_test, y_pred_wp1)
mse_wp1 = mean_squared_error(y_test, y_pred_wp1)
print(f'Mean Squared Error wp1 rf: {mse_wp1}')
print(f'Mean Absolute Error wp1 rf: {mae_wp1}')

## Wave Height
Same train test split for wave height will be used as well as same pipeline. 
- X_train_wh, X_test_wh, y_train_wh, y_test_wh
- rf_pipeline

**GridSearch Run, Identifier = optimize_rf_wh**

In [None]:
#Param Grid
param_grid_rf = {
    'rf_model__n_estimators': [50, 100, 200],
    'rf_model__max_depth': [None, 5, 10],
    'rf_model__min_samples_split': [2, 5, 10],
    'rf_model__min_samples_leaf': [1, 2, 4],
}

In [None]:
#Grid Search
scoring_metric = 'neg_mean_squared_error'
grid_search_rf_wh = run_grid_search(X_train_wh, y_train_wh, param_grid_rf, rf_pipeline, scoring_metric, 'optimize_rf_wh')

In [None]:
#get best model
best_model_rf_wh =grid_search_rf_wh.best_estimator_
#predict (separate variable so the XGBoost wave height predictions are not overwritten)
y_pred_rf_wh = best_model_rf_wh.predict(X_test_wh)
mae_rf_wh = mean_absolute_error(y_test_wh, y_pred_rf_wh)
mse_rf_wh = mean_squared_error(y_test_wh, y_pred_rf_wh)
print(f'Mean Squared Error wh rf: {mse_rf_wh}')
print(f'Mean Absolute Error wh rf: {mae_rf_wh}')

# Summary 
### Summary of Models: MSE and MAE

| Model              | Wave Period MSE (s^2) | Wave Period MAE (s) | Wave Height MSE (m^2) | Wave Height MAE (m) |
| ------------------ | --------------------- | ------------------- | --------------------- | ------------------- |
| Random Forest      | 2.90              | 1.25                         | 0.015             | 0.067                       |
| XGBoost            | 2.80              | 1.22                         | 0.017             | 0.071                       |
| Linear Regression  | 2.83              | 1.28                         | 0.19              | 0.31                        |
| ARIMA              | 6.94              | 2.06                         | 1.47              | 0.91                        |


In summary, the models with the lowest prediction errors were XGBoost for wave period and Random Forest for wave height. It is important to note, however, that after optimization the test MSE was slightly higher than the train MSE for all models. Further research, analysis, and model tuning could establish why. For the purposes of this analysis, however, the next phase of modelling will proceed: wave period and wave height will be forecast for the same time window, and the features will be refined further.
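
Because MSE is in squared units, it can be easier to compare the models on the original scale by taking the square root (RMSE). The sketch below does this for the MSE values copied from the summary table. The dictionary layout and key names are illustrative only.

```python
import math

# MSE values copied from the summary table (wave period in s^2, wave height in m^2)
mse_table = {
    'Random Forest': {'wave_period': 2.90, 'wave_height': 0.015},
    'XGBoost': {'wave_period': 2.80, 'wave_height': 0.017},
    'Linear Regression': {'wave_period': 2.83, 'wave_height': 0.19},
    'ARIMA': {'wave_period': 6.94, 'wave_height': 1.47},
}

# RMSE is the square root of MSE, in seconds for period and meters for height
rmse_table = {
    model: {target: math.sqrt(mse) for target, mse in scores.items()}
    for model, scores in mse_table.items()
}

for model, scores in rmse_table.items():
    print(f"{model}: period RMSE {scores['wave_period']:.2f} s, "
          f"height RMSE {scores['wave_height']:.2f} m")
```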
