# Overview of `pywddff`

`pywddff` aims to make wavelet feature engineering for machine-learning-based forecasting with tabular time series data easier for practitioners and researchers. By "wavelet feature engineering", I mean using the wavelet and scaling coefficients from the maximal overlap discrete wavelet transform or the A Trous wavelet transform as additional features.

The implementations of the maximal overlap discrete wavelet transform (MODWT) and the A Trous wavelet transform (ATWT) follow "Wavelet Methods for Time Series Analysis" by Donald Percival and Andrew Walden.

## Loading a subset of CAMELS

[Catchment Attributes and Meteorology for Large-sample Studies (CAMELS)](https://ral.ucar.edu/solutions/products/camels) is a dataset containing streamflow series and catchment attributes for over 600 basins in the United States. The data for most basins lie between 1981 and 2014.

Here, I will load a subset of the CAMELS data set.

In [4]:
from pywddff.datasets import get_camels_subset

camels_subset = get_camels_subset()
camels_ids = list(camels_subset.keys())
print(f'There are {len(camels_ids)} basins included in this package.')

There are 35 basins included in this package.


Here is the data for one of the basins in the subset:

In [8]:
df = camels_subset[camels_ids[0]]
df

|  | Q(ft3/s) | dayl(s) | prcp(mm/day) | srad(W/m2) | tmax(C) | tmin(C) | vp(Pa) |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | 18.00 | 34905.61 | 0.00 | 175.32 | 9.00 | 2.46 | 720.00 |
| 1 | 15.00 | 34905.61 | 0.00 | 289.07 | 11.50 | -2.51 | 519.54 |
| 2 | 13.00 | 34905.61 | 0.00 | 297.24 | 11.99 | -3.00 | 480.00 |
| 3 | 13.00 | 34905.61 | 2.26 | 190.21 | 8.55 | -0.50 | 600.00 |
| 4 | 33.00 | 34905.61 | 5.49 | 151.74 | 5.50 | -2.00 | 520.00 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11226 | 0.02 | 42508.80 | 0.00 | 302.88 | 29.16 | 18.00 | 2080.05 |
| 11227 | 0.47 | 42163.21 | 26.81 | 120.50 | 22.49 | 17.21 | 1953.55 |
| 11228 | 0.50 | 42163.21 | 14.10 | 207.48 | 23.97 | 14.50 | 1640.00 |
| 11229 | 0.69 | 42163.21 | 12.19 | 161.60 | 22.21 | 15.13 | 1730.51 |


## `pywddff.filters`

The `filters` submodule contains 128 orthogonal scaling and wavelet filters at decomposition level 1. It can also compute the level j equivalent scaling and wavelet filters; for those, see the documentation of `equiv_scaling_filter` and `equiv_wavelet_filter`.

In [6]:
from pywddff.filters import scaling_filter, wavelet_filter

scaling_filter('la8')

array([-0.07576571, -0.02963553,  0.49761867,  0.80373875,  0.2978578 ,
       -0.09921954, -0.01260397,  0.0322231 ])

In [7]:
wavelet_filter('la8')

array([ 0.0322231 ,  0.01260397, -0.09921954, -0.2978578 ,  0.80373875,
       -0.49761867, -0.02963553,  0.07576571])
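
As a quick sanity check on the two outputs above, here is a small sketch that uses only the printed values (copied by hand, so accurate to about eight decimals) and the standard library; it does not call `pywddff`. It checks three textbook properties of an orthogonal filter pair: the scaling filter sums to √2 and has unit energy, the wavelet filter sums to zero, and the wavelet filter is the quadrature mirror of the scaling filter, `h[l] = (-1)**l * g[L - 1 - l]`. The variable names are mine.

```python
import math

# Values copied from the outputs of scaling_filter('la8') and wavelet_filter('la8') above.
g = [-0.07576571, -0.02963553, 0.49761867, 0.80373875,
     0.2978578, -0.09921954, -0.01260397, 0.0322231]
h = [0.0322231, 0.01260397, -0.09921954, -0.2978578,
     0.80373875, -0.49761867, -0.02963553, 0.07576571]

L = len(g)
tol = 1e-6  # the printed values are rounded to 8 decimals

# Quadrature mirror relation: h[l] = (-1)**l * g[L - 1 - l]
qmf = [(-1) ** l * g[L - 1 - l] for l in range(L)]

sum_ok = math.isclose(sum(g), math.sqrt(2), abs_tol=tol)            # scaling filter sums to sqrt(2)
energy_ok = math.isclose(sum(v * v for v in g), 1.0, abs_tol=tol)   # unit energy
zero_sum_ok = math.isclose(sum(h), 0.0, abs_tol=tol)                # wavelet filter sums to zero
qmf_ok = all(math.isclose(a, b, abs_tol=tol) for a, b in zip(h, qmf))

print(sum_ok, energy_ok, zero_sum_ok, qmf_ok)
```

All four checks pass for the `la8` values shown, which is what you would expect from any of the orthogonal filters in the submodule.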

## `pywddff.pywddff`

`multi_stationary_dwt` is the main workhorse function of `pywddff`. As the name suggests, it performs a MODWT or A Trous decomposition on every input feature in a user-provided NumPy array or pandas DataFrame. The `approach` argument specifies whether to:

1. keep both the original input features and the newly created wavelet and scaling coefficient features (`approach = "single hybrid"`)
2. only keep the newly created wavelet and scaling coefficients and discard the original input features (`approach = "single"`)

Let's see the function in action.

In [26]:
from pywddff.pywddff import multi_stationary_dwt

X = df.iloc[:, 1:]
y = df.iloc[:, 0]

X_new, y_new = multi_stationary_dwt(X, y, 
                                    transform = 'modwt', 
                                    filter = 'la8', 
                                    J = 2, 
                                    remove_bc = True, 
                                    approach = "single hybrid")
list(X_new)

['dayl(s)',
 'prcp(mm/day)',
 'srad(W/m2)',
 'tmax(C)',
 'tmin(C)',
 'vp(Pa)',
 'dayl(s)_W1',
 'dayl(s)_W2',
 'dayl(s)_V2',
 'prcp(mm/day)_W1',
 'prcp(mm/day)_W2',
 'prcp(mm/day)_V2',
 'srad(W/m2)_W1',
 'srad(W/m2)_W2',
 'srad(W/m2)_V2',
 'tmax(C)_W1',
 'tmax(C)_W2',
 'tmax(C)_V2',
 'tmin(C)_W1',
 'tmin(C)_W2',
 'tmin(C)_V2',
 'vp(Pa)_W1',
 'vp(Pa)_W2',
 'vp(Pa)_V2']
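
The naming pattern in the output above can be reproduced with a few lines of plain Python. This is only a sketch of the bookkeeping, not the library's code: `expected_columns` is a hypothetical helper, and the column order under `approach = "single"` is my assumption, since only the "single hybrid" output is shown here.

```python
def expected_columns(columns, J, approach="single hybrid"):
    """Hypothetical helper: rebuild the column names seen in the output above."""
    new_cols = []
    for col in columns:
        # one wavelet-coefficient column per level 1..J ...
        new_cols += [f"{col}_W{j}" for j in range(1, J + 1)]
        # ... plus the scaling coefficients at the coarsest level J
        new_cols.append(f"{col}_V{J}")
    if approach == "single hybrid":
        return list(columns) + new_cols   # originals kept, new features appended
    if approach == "single":
        return new_cols                   # originals discarded
    raise ValueError(f"unknown approach: {approach!r}")

features = ["dayl(s)", "prcp(mm/day)", "srad(W/m2)", "tmax(C)", "tmin(C)", "vp(Pa)"]
hybrid = expected_columns(features, J=2, approach="single hybrid")
single = expected_columns(features, J=2, approach="single")
print(len(hybrid), len(single))   # 24 18
```

For the six CAMELS features and `J = 2`, this gives 6 + 6 × 3 = 24 names under "single hybrid", matching the list above, and 18 under "single".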

If you have a 1D numpy array (usually corresponding to a single time series), you can use either `modwt` or `atrousdwt` to decompose the time series into wavelet and scaling coefficients. See the documentation for these two functions.
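
To give a feel for what such a decomposition computes, here is a minimal level 1 MODWT with the Haar filter, written in plain Python. It is illustrative only and is not how `modwt` is implemented in `pywddff`; `haar_modwt_level1` is a hypothetical name. With the Haar filter, the level 1 wavelet coefficient is half the difference between consecutive values and the scaling coefficient is their average.

```python
def haar_modwt_level1(x):
    """Level-1 Haar MODWT with circular (periodic) boundary handling.

    Illustrative only; not pywddff's `modwt`.
    """
    n = len(x)
    # At t = 0, x[t - 1] wraps around to the last value, so that coefficient
    # is a boundary coefficient.
    w = [(x[t] - x[t - 1]) / 2 for t in range(n)]   # wavelet coefficients W1
    v = [(x[t] + x[t - 1]) / 2 for t in range(n)]   # scaling coefficients V1
    return w, v

x = [18.0, 15.0, 13.0, 13.0, 33.0]   # first five streamflow values from the table above
w1, v1 = haar_modwt_level1(x)

print(w1)   # [-7.5, -1.5, -1.0, 0.0, 10.0]
print(v1)   # [25.5, 16.5, 14.0, 13.0, 23.0]
```

Two properties are easy to verify by hand: `w1[t] + v1[t]` recovers `x[t]`, and the sum of squares of `w1` and `v1` together equals the sum of squares of `x` (the MODWT preserves energy).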

## `pywddff.utils`

The `utils` submodule is a candy jar of helper functions that were used to develop the other submodules. I collected them here for anyone who might find them useful.

Below, I show the few functions that I suspect will be the most useful to most people.

### Data preparation for forecasting

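`prep_forecast_data` prepares the input features and the target variable for forecasting at a horizon of `h` time steps. Both are returned with the same number of rows.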

In [27]:
from pywddff.utils import prep_forecast_data

h = 7 # forecast horizon (7 days ahead in this case)

X_new, y_new = prep_forecast_data(X_new, y_new, h = h)
X_new.shape, y_new.shape

((11203, 24), (11203,))
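
The internals of `prep_forecast_data` are not shown here, so the following is only a sketch of the usual way a target is aligned for an h-step-ahead forecast: pair the features at time t with the target at time t + h, and drop the last h feature rows, which have no target yet. `align_for_horizon` is a hypothetical name, and the library function may differ in detail (for example, in exactly how many rows it drops).

```python
def align_for_horizon(X, y, h):
    """Pair the feature row at time t with the target at time t + h (hypothetical helper)."""
    if h < 1 or h >= len(y):
        raise ValueError("h must satisfy 1 <= h < len(y)")
    X_aligned = X[:-h]   # the last h rows have no future target to predict
    y_aligned = y[h:]    # the first h targets have no feature row h steps earlier
    return X_aligned, y_aligned

# Toy data: 10 time steps, 2 features per step
X_toy = [[t, 10 * t] for t in range(10)]
y_toy = [100 + t for t in range(10)]

X_h, y_h = align_for_horizon(X_toy, y_toy, h=3)
print(len(X_h), len(y_h))   # 7 7
print(X_h[0], y_h[0])       # [0, 0] 103
```

The same slicing works on NumPy arrays and, via `.iloc`, on pandas objects.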

### Add lagged features

In [28]:
from pywddff.utils import add_lagged_variables

X_new, y_new = add_lagged_variables(X_new, y_new, n_lags = 1)
X_new.shape, y_new.shape

((11202, 48), (11202,))
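
The shapes above are consistent with one lagged copy of each of the 24 columns being appended (24 to 48 columns) and the first row, which has no predecessor, being dropped (11203 to 11202 rows). The sketch below reproduces that bookkeeping on toy data. `add_lags` is a hypothetical stand-in, and the column order produced by `add_lagged_variables` is not shown in this document, so treat the ordering as an assumption.

```python
def add_lags(X, y, n_lags=1):
    """Append the previous n_lags rows of X to each row (hypothetical helper)."""
    X_lagged, y_lagged = [], []
    for t in range(n_lags, len(X)):
        row = list(X[t])                 # features at time t
        for k in range(1, n_lags + 1):
            row += X[t - k]              # features at time t - k
        X_lagged.append(row)
        y_lagged.append(y[t])            # target stays aligned with time t
    return X_lagged, y_lagged

# Toy data: 5 time steps, 2 features per step
X_toy = [[t, 10 * t] for t in range(5)]
y_toy = [100 + t for t in range(5)]

X_lag, y_lag = add_lags(X_toy, y_toy, n_lags=1)
print(len(X_lag), len(X_lag[0]))   # 4 4
print(X_lag[0], y_lag[0])          # [1, 10, 0, 0] 101
```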

### Splitting data

Both split functions require `X` and `y` to be NumPy arrays! Furthermore, make sure `y` is a 1D NumPy array (i.e., `len(y.shape) == 1`).

In [29]:
from pywddff.utils import absolute_split_2, absolute_split_3

# Convert the pandas objects to NumPy arrays with .to_numpy() before splitting
X_train, X_test, y_train, y_test = absolute_split_2(X_new.to_numpy(), y_new.to_numpy(), ntest = 365)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((10837, 48), (365, 48), (10837,), (365,))

In [30]:
X_train, X_val, X_test, y_train, y_val, y_test = absolute_split_3(X_new.to_numpy(), 
                                                                  y_new.to_numpy(), 
                                                                  nval = 365, 
                                                                  ntest = 365)
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape

((10472, 48), (365, 48), (365, 48), (10472,), (365,), (365,))
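
The shapes above add up as 10472 + 365 + 365 = 11202 rows. A split with fixed-size validation and test blocks and no shuffling can be sketched with plain slicing. I am assuming here that the most recent rows form the test set and the block just before them forms the validation set, which the outputs above do not confirm. `chronological_split` is a hypothetical name, not the library's implementation.

```python
def chronological_split(X, y, nval, ntest):
    """Split without shuffling: [train | validation | test] in time order (hypothetical)."""
    n = len(y)
    if nval + ntest >= n:
        raise ValueError("nval + ntest must be smaller than the number of rows")
    i_val = n - nval - ntest    # index of the first validation row
    i_test = n - ntest          # index of the first test row
    return (X[:i_val], X[i_val:i_test], X[i_test:],
            y[:i_val], y[i_val:i_test], y[i_test:])

n_rows = 11202                        # row count from the output above
X_toy = [[t] for t in range(n_rows)]
y_toy = list(range(n_rows))

X_tr, X_va, X_te, y_tr, y_va, y_te = chronological_split(X_toy, y_toy, nval=365, ntest=365)
print(len(X_tr), len(X_va), len(X_te))   # 10472 365 365
```

Keeping the blocks in time order matters for forecasting: a shuffled split would let the model train on observations that come after the ones it is evaluated on.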

The observant among you will ask, "Why did you perform MODWT feature engineering prior to splitting the dataset?" This is fine for time series forecasting and does not introduce look-ahead bias: MODWT and A Trous coefficients at time t are computed only from observations at time t and earlier, provided the boundary coefficients affected by circular wrap-around are removed (`remove_bc = True`, as above). For a deep dive, see the paper [Addressing the incorrect usage of wavelet-based hydrological and water resources forecasting models for real-world applications with best practices and a new forecasting framework](https://www.sciencedirect.com/science/article/abs/pii/S0022169418303317) by John Quilty and Jan Adamowski.
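
The claim about look-ahead bias can be illustrated numerically. The sketch below is not `pywddff` code: it applies a level 1 MODWT wavelet filter by hand, using the `la8` wavelet filter values printed earlier divided by √2 (the MODWT rescaling in Percival and Walden), and then perturbs only the most recent observation. Function and variable names are mine, and I am assuming that `remove_bc = True` drops exactly the wrap-around (boundary) coefficients.

```python
import math

# la8 wavelet filter, copied from the wavelet_filter('la8') output earlier in this document
h = [0.0322231, 0.01260397, -0.09921954, -0.2978578,
     0.80373875, -0.49761867, -0.02963553, 0.07576571]
h_modwt = [c / math.sqrt(2) for c in h]   # MODWT filter = DWT filter / sqrt(2)

def modwt_wavelet_level1(x, filt):
    """W[t] = sum_l filt[l] * x[(t - l) mod N]: x[t] and earlier values, with circular wrap."""
    n = len(x)
    return [sum(filt[l] * x[(t - l) % n] for l in range(len(filt))) for t in range(n)]

n = 32
x = [math.sin(0.3 * t) for t in range(n)]
x_future_changed = x[:-1] + [x[-1] + 1.0]   # alter only the most recent observation

w_before = modwt_wavelet_level1(x, h_modwt)
w_after = modwt_wavelet_level1(x_future_changed, h_modwt)

changed = [t for t in range(n) if abs(w_before[t] - w_after[t]) > 1e-12]
n_boundary = len(h) - 1   # level-1 coefficients touched by the circular wrap
print(changed)                                   # [0, 1, 2, 3, 4, 5, 6, 31]
print([t for t in changed if t >= n_boundary])   # [31]
```

Changing the last observation alters the last coefficient and the first 7 boundary coefficients, which wrap around to the end of the series. Once those boundary coefficients are discarded, no coefficient at an earlier time depends on the later value, which is why decomposing before splitting does not leak future information into the training rows.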