# Preprocess Datasets and Extract Features: linearinterp_avgmodels
> Feature engineering notebook

Dataset columns (same convention as in lab1):

| Col1 | Col2 | Col3 | Col4 | $\dots$ |
|------|------|------|------|---------|
| $Y$  |$Y_0$ | $X_1$| $X_2$| $\dots$ |

- $Y$ : labels or target values, in our case $Y(T+18)$
- $Y_0$ : present value $Y(T+0)$
- $X_1$, $X_2$, $\dots$ : other features
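As a minimal sketch of this convention, the target and the features can be separated by column slicing. The toy array and the names `dataset`, `y_target`, `features` and `y_present` are illustrative assumptions, not taken from the notebook:

```python
import numpy as np

# Toy dataset following the column convention: [Y(T+18), Y(T+0), X_1, X_2]
dataset = np.array([
    [0.9, 0.5, 0.1, 0.2],
    [0.8, 0.6, 0.3, 0.4],
    [0.7, 0.9, 0.5, 0.6],
])

y_target = dataset[:, 0]    # labels: Y(T+18)
features = dataset[:, 1:]   # model inputs: Y(T+0), X_1, X_2, ...
y_present = features[:, 0]  # the present value Y(T+0) is the first feature

print(y_target.shape, features.shape)
```

Keeping the target in the first column means every saved file can be consumed with the same two slices, whatever the number of features.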


* **Baselines**: use only Y(t): energy as features
    * [x] Naive windows VS Diff windows (step=1): window_size \[20, 40, 80\]
* **Wind Speed** Forecast:
    * [x] Wind speed forecast X(T+18): for each location, speed = sqrt(sin^2 + cos^2) of the wind-vector components --> 8 additional features
    * [x] Wind speed forecast X(T+18) for 8 locations + diffWindows Y(T) \[20,40\]
* **Wind Speed Differences**:
    * [x] diffWindows Y(T) + Wind speed (T+18, T+0) + naiveWindow X(T)
    * [x] diffWindows Y(T) + Wind speed (T+18, T+0) + diffWindows X(T) \[T+0,T-1,...\] w=20, w=40
* **Wind Speed Differences w/ step**:
    * [x] step_diffWindows Y(T) + Wind speed (T+18, T+0) + step_diffWindows X(T) w=20, 40
* **Wind Directions**
    * [x] Diff-s of Wind speed and **directions** (past), and T+18, T+0 values
    * [ ] Seasons (3 month cycles), count months
    * [ ] 24 hour clock variable
    * [ ] Try adding T+24 (should give the next forecast, which is updated every 6 hours)
    
* Next:
    * [ ] sin, cos for each loc-n --> 16 additional features
    * [ ] Encoding schemes e.g. positional encoding of Y and X values. 
    * [ ] Momentum,Force (step=1, 4, 9, 18) \[Fine tuning\]
    * [ ] Separate Forecast Models
    * [ ] Nearest neighbour interpolation dataset
    * [ ] what else?

\[ TRAIN/TEST SPLIT IS DONE AFTER PREPROCESSING FEATURES \]
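The split used throughout this notebook is chronological: the first `TRAIN_PERCENT` percent of the rows form the training set and the remainder the test set. A minimal sketch, where the helper name `split_train_test` and the toy matrix are assumptions made for illustration:

```python
import numpy as np

def split_train_test(data, train_percent):
    """Chronological split: the first train_percent% of rows go to training."""
    split_index = data.shape[0] * train_percent // 100
    return data[:split_index, :], data[split_index:, :]

# Toy preprocessed matrix: 200 time-ordered rows, 5 columns
data = np.arange(200 * 5).reshape(200, 5)
train, test = split_train_test(data, train_percent=95)

print(train.shape, test.shape)
```

Because the rows are never shuffled, the test set always lies strictly later in time than the training set.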

In [3]:
%load_ext autoreload
%autoreload 2

In [4]:
#For running in JupyterHub:
import os
if os.path.basename(os.getcwd())!='P003':
    print('Not in /P003 folder, changing directory to P003')
    lib_path = os.path.expanduser(os.path.relpath('~/images/codesDIR/datathon2020/P003'))
    os.chdir(lib_path)

Not in /P003 folder, changing directory to P003


In [5]:
import numpy as np
from matplotlib import pyplot as plt
import matplotlib
plt.style.use('ggplot')

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (8,8)
matplotlib.rcParams['font.size']= 26 # use for presentation

In [6]:
from src.datautils import windowed_data, windowed_diff_data, windowed_diff_with_step, windowed_momentum_force
from src.datautils import locations

## Constants

In [7]:
import os

# Train% / Test% :
TRAIN_PERCENT = 95
TEST_PERCENT = 100-TRAIN_PERCENT
print(f"> We will be using train/test: {TRAIN_PERCENT}% / {TEST_PERCENT}% split.\n")

# Path for Datasets:
# all preprocessed data will be saved in `data_path`
data_path = os.path.relpath('../../../dataDIR/'+'preprocessed_linearinterp_avgmodels')
print(f'> Preprocessed data will be saved in `{data_path}`\n')

# Lead Time :
lead_time = 18 # T+18
print(f'> Lead time is set to T+{lead_time}\n')

> We will be using train/test: 95% / 5% split.

> Preprocessed data will be saved in `../../../dataDIR/preprocessed_linearinterp_avgmodels`

> Lead time is set to T+18



## Import Normalised Dataset (not yet split)

In [8]:
# full dataset
with open('norm_linearinterp_avgmodels.npy', 'rb') as f:
    data_norm = np.load(f) # columns (all normalised): Energy, loc1_sin, loc1_cos, ...,loc8_sin, loc8_cos

In [9]:
import pandas as pd
# the data frame contains the un-normalised data
df = pd.read_csv('./ile_de_france_dataset_linearinterp_avgmodels.csv',header=0,index_col=0)
df.index =pd.to_datetime(df.index)
df['Month'] = df.index.month.values
data_months = df['Month'].values.reshape(-1,1)
data_season = np.concatenate([np.cos(np.pi*(data_months)/12)**2, np.cos(np.pi*(data_months-3)/12)**2,
                              np.cos(np.pi*(data_months-6)/12)**2, np.cos(np.pi*(data_months-9)/12)**2],
                            axis=1) # columns: [Winter, Spring, Summer, Autumn]

data_clock = np.cos(np.pi*df.index.hour.values/23)**2
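To see what the seasonal encoding above produces, the same $\cos^2$ formula can be evaluated on the twelve months alone. This is only an illustrative sketch: the names `months` and `season` are assumptions, and the shifts `(0, 3, 6, 9)` are copied from the cell above.

```python
import numpy as np

months = np.arange(1, 13).reshape(-1, 1)  # 1 = January, ..., 12 = December

# Same cos^2 encoding as above: one column per season, peaks 3 months apart
season = np.concatenate(
    [np.cos(np.pi * (months - shift) / 12) ** 2 for shift in (0, 3, 6, 9)],
    axis=1,
)  # columns: [winter, spring, summer, autumn]

# Each column equals 1 at its peak month and 0 six months later
print(np.round(season[[11, 2, 5, 8]], 2))  # rows: December, March, June, September
```

Since $\cos^2(x) + \cos^2(x - \pi/2) = 1$, the winter and summer columns always sum to one, and so do the spring and autumn columns, so the four features carry two independent degrees of freedom.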

In [47]:
df.loc[df.index==pd.to_datetime('2020-07-14 14:00:00')] 

Unnamed: 0,Energy(kWh),guitrancourt_sin,guitrancourt_cos,lieusaint_sin,lieusaint_cos,lvs-pussay_sin,lvs-pussay_cos,parc-du-gatinais_sin,parc-du-gatinais_cos,arville_sin,arville_cos,boissy-la-riviere_sin,boissy-la-riviere_cos,angerville-1_sin,angerville-1_cos,angerville-2_sin,angerville-2_cos,Month
2020-07-14 14:00:00,10500.0,-2.520584,0.274558,-2.520733,0.514269,-3.017185,0.859413,-2.397853,1.126169,-2.402529,1.086268,-2.877339,0.794478,-2.975787,0.868258,-2.974182,0.867798,7


In [257]:
# plt.plot(data_clock,data_norm[:,0],'o',alpha=.05,ms=20)

In [194]:
# fig = plt.figure(figsize=[10,5])
# t = np.arange(1,13)
# pl = plt.scatter(df['Month'],df['Energy(kWh)'],s=200,alpha=.2,c=np.arange(0,df.shape[0]))
# plt.xticks(t)
# fig.colorbar(pl);
# plt.title('Energy (kWh) by month');

In [155]:
# fig = plt.figure(figsize=[18,6])
# plt.plot(df['Month'].values/12,lw=3,alpha=.8)
# plt.plot(data_season[:,0],lw=3,alpha=.5)
# plt.plot(data_season[:,2],lw=3,alpha=.5)

Installed wind-farm capacity over time (dates as dd/mm/yyyy):

- 01/01/2017 - 01/06/2017 : 43MW
- 01/06/2017 - 01/12/2017 : 55MW (+ "Arville" 12MW)
- 01/12/2017 - 01/09/2019 : 70MW (+ "Boissy-la-Riviere" 15MW)
- 01/09/2019 - now : 89MW (~89.8MW) (+ "Angerville 1" 8.8MW, "Angerville 2" 11MW)

- Turning off wind forecasts from unopened farms:

In [9]:
# turn off signal from farms that were not yet open
data_normfarm = data_norm.copy() # copy: plain assignment would alias data_norm and zero its columns too
data_normfarm[(df.index<pd.to_datetime('2017-06-01 00:00:00')),df.columns.get_loc('arville_sin')]=0
data_normfarm[(df.index<pd.to_datetime('2017-06-01 00:00:00')),df.columns.get_loc('arville_cos')]=0

data_normfarm[(df.index<pd.to_datetime('2017-12-01 00:00:00')),df.columns.get_loc('boissy-la-riviere_sin')]=0
data_normfarm[(df.index<pd.to_datetime('2017-12-01 00:00:00')),df.columns.get_loc('boissy-la-riviere_cos')]=0

data_normfarm[(df.index<pd.to_datetime('2019-09-01 00:00:00')),df.columns.get_loc('angerville-1_sin')]=0
data_normfarm[(df.index<pd.to_datetime('2019-09-01 00:00:00')),df.columns.get_loc('angerville-1_cos')]=0
data_normfarm[(df.index<pd.to_datetime('2019-09-01 00:00:00')),df.columns.get_loc('angerville-2_sin')]=0
data_normfarm[(df.index<pd.to_datetime('2019-09-01 00:00:00')),df.columns.get_loc('angerville-2_cos')]=0
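The masking pattern above can be written as a loop over opening dates. The sketch below runs on toy data: the farm name `farmA`, its opening date, and the variables `index`, `columns`, `data`, `opening` and `masked` are all hypothetical. It works on a copy of the array so the unmasked data stays available.

```python
import numpy as np
import pandas as pd

# Toy hourly index and a toy "normalised" array: [Energy, farmA_sin, farmA_cos]
index = pd.to_datetime([
    '2017-05-31 22:00:00', '2017-05-31 23:00:00',
    '2017-06-01 00:00:00', '2017-06-01 01:00:00',
])
columns = ['Energy', 'farmA_sin', 'farmA_cos']
data = np.ones((len(index), len(columns)))

# Hypothetical opening date of the toy farm
opening = {'farmA': pd.to_datetime('2017-06-01 00:00:00')}

masked = data.copy()  # work on a copy so the original array stays intact
for farm, opened in opening.items():
    not_open = index < opened  # boolean mask over rows
    for suffix in ('_sin', '_cos'):
        masked[not_open, columns.index(farm + suffix)] = 0

print(masked)
```

Only the wind columns of the farm are zeroed; the energy column is left untouched.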

## Clock, Wind Speed

In [10]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_norm[:,1::2]**2 + data_norm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows : 1st column is Y(T+lead_time)
Y_norm_diffwind = windowed_diff_data(Yt, lead_time=lead_time, window_size=window_size)
print(f'\nFull Diff windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_diffwind.shape}',
      f"; windowsize:{window_size}")

X_norm_diffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_diffwindows.append(
        windowed_diff_data(wind_speeds[:,l], lead_time=lead_time, window_size=window_size)
    )
print('\nX(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):\n',
      [l.shape for l in X_norm_diffwindows])

# X_clock_diffwind = windowed_diff_data(data_clock, lead_time=lead_time, window_size=window_size)
X_clock = data_clock[start_time+lead_time:].reshape(-1,1) 
print(f'data_clock: {X_clock.shape}')


#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)

feature_WINDOW_SIZE = [20, 40, 80]

for W_size in feature_WINDOW_SIZE:
    YXdif_speedclock = [Y_norm_diffwind[:,:(W_size+1)]]
    # append wind-speed diff windows, one per location: [X(T+18), X(T+0), X(T+0)-X(T-1), ...]
    YXdif_speedclock.extend([Xt[:,:(W_size+1)] for Xt in X_norm_diffwindows])
    # append clock
    YXdif_speedclock.append(X_clock)
    # concatenation
    YXdif_speedclock = np.concatenate(YXdif_speedclock, axis=1)
    
    # split train/test
    split_index = YXdif_speedclock.shape[0]*TRAIN_PERCENT//100

    X_train = YXdif_speedclock[:split_index,:]
    X_test = YXdif_speedclock[split_index:,:]
    print('Training dataset:',X_train.shape)
    print('Testing dataset:',X_test.shape)

    # training data
    train_file_name = os.path.join(data_path,f'train_preprocessed_YXdifwind{W_size}_speedclock.npy')
    with open(train_file_name, 'wb') as f:
        np.save(f,X_train)
    print(f'Saved : {train_file_name}')

    # testing data
    test_file_name = os.path.join(data_path,f'test_preprocessed_YXdifwind{W_size}_speedclock.npy')
    with open(test_file_name, 'wb') as f:
        np.save(f,X_test)
    print(f'Saved : {test_file_name}')


Full Diff windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80

X(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]
data_clock: (30917, 1)
Training dataset: (29371, 190)
Testing dataset: (1546, 190)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind20_speedclock.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind20_speedclock.npy
Training dataset: (29371, 370)
Testing dataset: (1546, 370)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind40_speedclock.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind40_speedclock.npy
Training dataset: (29371, 730)
Testing dataset: (1546, 730)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind80_speedclock.np

In [274]:
41*9+1

370
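The check `41*9+1` above counts the columns for `W_size = 40`: each of the 9 windowed series (energy plus 8 wind speeds) contributes `W_size + 1` columns, and the clock adds one more. A small sketch generalises this; the helper name `expected_columns` is an assumption made for illustration:

```python
def expected_columns(w_size, n_series, n_extra=0):
    """Each windowed series contributes w_size + 1 columns; n_extra are single columns."""
    return (w_size + 1) * n_series + n_extra

# Energy + 8 wind speeds, plus 1 clock column
print([expected_columns(w, n_series=9, n_extra=1) for w in (20, 40, 80)])  # [190, 370, 730]

# Energy + 8 wind speeds + 4 season series, no extra columns
print([expected_columns(w, n_series=13) for w in (20, 40)])  # [273, 533]
```

These values match the second dimension of the training and testing shapes printed by the corresponding cells.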

## Seasons, Wind Speed, and/or Direction

In [263]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_norm[:,1::2]**2 + data_norm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows : 1st column is Y(T+lead_time)
Y_norm_diffwind = windowed_diff_data(Yt, lead_time=lead_time, window_size=window_size)
print(f'\nFull Diff windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_diffwind.shape}',
      f"; windowsize:{window_size}")

X_norm_diffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_diffwindows.append(
        windowed_diff_data(wind_speeds[:,l], lead_time=lead_time, window_size=window_size)
    )
print('\nX(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):\n',
      [l.shape for l in X_norm_diffwindows])

X_seasons_diffwindows = []
for l in range(data_season.shape[1]):
    X_seasons_diffwindows.append(
        windowed_diff_data(data_season[:,l], lead_time=lead_time, window_size=window_size)
    )
print(f'X_seasons: {[l.shape for l in X_seasons_diffwindows]}')


Full Diff windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80

X(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]
X_seasons: [(30917, 81), (30917, 81), (30917, 81), (30917, 81)]


In [265]:
#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)

feature_WINDOW_SIZE = [20, 40]

for W_size in feature_WINDOW_SIZE:
    YXdif_speedseason = [Y_norm_diffwind[:,:(W_size+1)]]
    # append wind-speed diff windows, one per location: [X(T+18), X(T+0), X(T+0)-X(T-1), ...]
    YXdif_speedseason.extend([Xt[:,:(W_size+1)] for Xt in X_norm_diffwindows])
    # append seasons
    YXdif_speedseason.extend([Xt[:,:(W_size+1)] for Xt in X_seasons_diffwindows])
    # concatenation
    YXdif_speedseason = np.concatenate(YXdif_speedseason, axis=1)
    
    # split train/test
    split_index = YXdif_speedseason.shape[0]*TRAIN_PERCENT//100

    X_train = YXdif_speedseason[:split_index,:]
    X_test = YXdif_speedseason[split_index:,:]
    print('Training dataset:',X_train.shape)
    print('Testing dataset:',X_test.shape)

    # training data
    train_file_name = os.path.join(data_path,f'train_preprocessed_YXdifwind{W_size}_speedseason.npy')
    with open(train_file_name, 'wb') as f:
        np.save(f,X_train)
    print(f'Saved : {train_file_name}')

    # testing data
    test_file_name = os.path.join(data_path,f'test_preprocessed_YXdifwind{W_size}_speedseason.npy')
    with open(test_file_name, 'wb') as f:
        np.save(f,X_test)
    print(f'Saved : {test_file_name}')

Training dataset: (29371, 273)
Testing dataset: (1546, 273)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind20_speedseason.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind20_speedseason.npy
Training dataset: (29371, 533)
Testing dataset: (1546, 533)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind40_speedseason.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind40_speedseason.npy


### Seasons w/ farm info

In [267]:
# normalised data
Yt = data_normfarm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_normfarm[:,1::2]**2 + data_normfarm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows : 1st column is Y(T+lead_time)
Y_norm_diffwind = windowed_diff_data(Yt, lead_time=lead_time, window_size=window_size)
print(f'\nFull Diff windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_diffwind.shape}',
      f"; windowsize:{window_size}")

X_norm_diffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_diffwindows.append(
        windowed_diff_data(wind_speeds[:,l], lead_time=lead_time, window_size=window_size)
    )
print('\nX(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):\n',
      [l.shape for l in X_norm_diffwindows])

X_seasons_diffwindows = []
for l in range(data_season.shape[1]):
    X_seasons_diffwindows.append(
        windowed_diff_data(data_season[:,l], lead_time=lead_time, window_size=window_size)
    )
print(f'X_seasons: {[l.shape for l in X_seasons_diffwindows]}')


#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)
feature_WINDOW_SIZE = [20, 40]

for W_size in feature_WINDOW_SIZE:
    YXdif_speedseason = [Y_norm_diffwind[:,:(W_size+1)]]
    # append wind-speed diff windows, one per location: [X(T+18), X(T+0), X(T+0)-X(T-1), ...]
    YXdif_speedseason.extend([Xt[:,:(W_size+1)] for Xt in X_norm_diffwindows])
    # append seasons
    YXdif_speedseason.extend([Xt[:,:(W_size+1)] for Xt in X_seasons_diffwindows])
    # concatenation
    YXdif_speedseason = np.concatenate(YXdif_speedseason, axis=1)
    
    # split train/test
    split_index = YXdif_speedseason.shape[0]*TRAIN_PERCENT//100

    X_train = YXdif_speedseason[:split_index,:]
    X_test = YXdif_speedseason[split_index:,:]
    print('Training dataset:',X_train.shape)
    print('Testing dataset:',X_test.shape)

    # training data
    train_file_name = os.path.join(data_path,f'train_preprocessed_YXdifwind{W_size}_speedseason_farm.npy')
    with open(train_file_name, 'wb') as f:
        np.save(f,X_train)
    print(f'Saved : {train_file_name}')

    # testing data
    test_file_name = os.path.join(data_path,f'test_preprocessed_YXdifwind{W_size}_speedseason_farm.npy')
    with open(test_file_name, 'wb') as f:
        np.save(f,X_test)
    print(f'Saved : {test_file_name}')


Full Diff windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80

X(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]
X_seasons: [(30917, 81), (30917, 81), (30917, 81), (30917, 81)]
Training dataset: (29371, 273)
Testing dataset: (1546, 273)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind20_speedseason_farm.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind20_speedseason_farm.npy
Training dataset: (29371, 533)
Testing dataset: (1546, 533)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind40_speedseason_farm.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind40_speedseason_farm.npy


In [183]:
# ls $data_path

## Wind Direction and Speed differences as features

### Wind Direction differences 

In [268]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1) # energy
Xt = data_norm[:,1:] # wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows : 1st column is Y(T+lead_time)
Y_norm_diffwind = windowed_diff_data(Yt, lead_time=lead_time, window_size=window_size)
print(f'\nFull Diff windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_diffwind.shape}',
      f"; windowsize:{window_size}")

X_norm_diffwindows = []
for l in range(Xt.shape[1]):
    X_norm_diffwindows.append(
        windowed_diff_data(Xt[:,l], lead_time=lead_time, window_size=window_size)
    )

print('\nX(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):\n',
      [l.shape for l in X_norm_diffwindows])


#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)
feature_WINDOW_SIZE = [20, 40]

for W_size in feature_WINDOW_SIZE:
    YXdifwind_dir = [Y_norm_diffwind[:,:(W_size+1)]]
    # append X(t+18), X(t+0), X_norm_diffwindows [dir1_sin(T+18),dir1_cos(T+18),...,dir8_sin,dir8_cos(T+18)]
    YXdifwind_dir.extend([Xt[:,:(W_size+1)] for Xt in X_norm_diffwindows])
    # concatenation
    YXdifwind_dir = np.concatenate(YXdifwind_dir, axis=1)
    
    # split train/test
    split_index = YXdifwind_dir.shape[0]*TRAIN_PERCENT//100

    X_train = YXdifwind_dir[:split_index,:]
    X_test = YXdifwind_dir[split_index:,:]
    print('Training dataset:',X_train.shape)
    print('Testing dataset:',X_test.shape)

    # training data
    train_file_name = os.path.join(data_path,f'train_preprocessed_YXdifwind{W_size}_dir.npy')
    with open(train_file_name, 'wb') as f:
        np.save(f,X_train)
    print(f'Saved : {train_file_name}')

    # testing data
    test_file_name = os.path.join(data_path,f'test_preprocessed_YXdifwind{W_size}_dir.npy')
    with open(test_file_name, 'wb') as f:
        np.save(f,X_test)
    print(f'Saved : {test_file_name}')


Full Diff windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80

X(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]
Training dataset: (29371, 357)
Testing dataset: (1546, 357)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind20_dir.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind20_dir.npy
Training dataset: (29371, 697)
Testing dataset: (1546, 697)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind40_dir.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind40_dir.npy


### Wind Directions w/ Farm info

In [269]:
# normalised data
Yt = data_normfarm[:,0].reshape(-1,1) # energy
Xt = data_normfarm[:,1:] # wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows : 1st column is Y(T+lead_time)
Y_norm_diffwind = windowed_diff_data(Yt, lead_time=lead_time, window_size=window_size)
print(f'\nFull Diff windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_diffwind.shape}',
      f"; windowsize:{window_size}")

X_norm_diffwindows = []
for l in range(Xt.shape[1]):
    X_norm_diffwindows.append(
        windowed_diff_data(Xt[:,l], lead_time=lead_time, window_size=window_size)
    )

print('\nX(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):\n',
      [l.shape for l in X_norm_diffwindows])


#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)
feature_WINDOW_SIZE = [20, 40]

for W_size in feature_WINDOW_SIZE:
    YXdifwind_dir = [Y_norm_diffwind[:,:(W_size+1)]]
    # append X(t+18), X(t+0), X_norm_diffwindows [dir1_sin(T+18),dir1_cos(T+18),...,dir8_sin,dir8_cos(T+18)]
    YXdifwind_dir.extend([Xt[:,:(W_size+1)] for Xt in X_norm_diffwindows])
    # concatenation
    YXdifwind_dir = np.concatenate(YXdifwind_dir, axis=1)
    
    # split train/test
    split_index = YXdifwind_dir.shape[0]*TRAIN_PERCENT//100

    X_train = YXdifwind_dir[:split_index,:]
    X_test = YXdifwind_dir[split_index:,:]
    print('Training dataset:',X_train.shape)
    print('Testing dataset:',X_test.shape)

    # training data
    train_file_name = os.path.join(data_path,f'train_preprocessed_YXdifwind{W_size}_dir_farm.npy')
    with open(train_file_name, 'wb') as f:
        np.save(f,X_train)
    print(f'Saved : {train_file_name}')

    # testing data
    test_file_name = os.path.join(data_path,f'test_preprocessed_YXdifwind{W_size}_dir_farm.npy')
    with open(test_file_name, 'wb') as f:
        np.save(f,X_test)
    print(f'Saved : {test_file_name}')


Full Diff windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80

X(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]
Training dataset: (29371, 357)
Testing dataset: (1546, 357)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind20_dir_farm.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind20_dir_farm.npy
Training dataset: (29371, 697)
Testing dataset: (1546, 697)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_YXdifwind40_dir_farm.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_YXdifwind40_dir_farm.npy


## Using Only Energy as Features: Y(t)
* Try out window sizes
* Differences with step sizes etc.

- Revision video: [Session 3: The Prediction Pipeline](https://youtu.be/4W6-48wXXEc?t=1246)


### Naive Window as Features
- Dataset with window width `window_size`; columns:`[Y(T+lead_time), Y(T+0), ...,Y(T-window_size+1)]` 

Write the train/test datasets to:
- `train_preprocessed_naivewin{}.npy` and
- `test_preprocessed_naivewin{}.npy`

`naivewin{}` stands for naive windowed data with window width `{}`
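The helper `windowed_data` lives in `src.datautils` and is not shown in this notebook. The sketch below is an assumed reimplementation, written only from the column layout described above; the name `naive_windows` is hypothetical and the real helper may differ in details.

```python
import numpy as np

def naive_windows(y, lead_time, window_size):
    """Rows of [Y(T+lead_time), Y(T+0), Y(T-1), ..., Y(T-window_size+1)]."""
    y = np.asarray(y).ravel()
    start = window_size - 1                # first T that has a full window behind it
    n_rows = len(y) - start - lead_time    # last T must still have a target ahead of it
    rows = []
    for i in range(n_rows):
        t = start + i
        past = y[t - window_size + 1 : t + 1][::-1]  # Y(T), Y(T-1), ..., Y(T-w+1)
        rows.append(np.concatenate(([y[t + lead_time]], past)))
    return np.array(rows)

windows = naive_windows(np.arange(10), lead_time=2, window_size=3)
print(windows.shape)  # (6, 4)
print(windows[0])     # [4 2 1 0]
```

Under this assumption the number of rows is `len(y) - window_size - lead_time + 1`, so doubling the window from 20 to 40 removes 20 rows, which agrees with the shapes printed by the cell.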

In [84]:
WINDOW_SIZES = [20, 40, 80] # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1

x_norm = data_norm[:,0] # Y(t) energy

for window_size in WINDOW_SIZES:
    # index of last elem of window
    start_time = window_size-1
    # prepare windows : 1st column is Y(T+lead_time)
    X_norm_wind = windowed_data(x_norm, lead_time=lead_time, window_size=window_size)
    print(f'\nFull windowed dataset (1st column is target Y(T+{lead_time}): {X_norm_wind.shape}',
          f"; windowsize:{window_size}")
    
    # split train/test
    split_index = X_norm_wind.shape[0]*TRAIN_PERCENT//100
    
    X_train = X_norm_wind[:split_index,:]
    X_test = X_norm_wind[split_index:,:]
    print('Training dataset:',X_train.shape)
    print('Testing dataset:',X_test.shape)
    
    # write to files
    
    # training data
    train_file_name = os.path.join(data_path,f'train_preprocessed_naivewin{window_size}.npy')
    with open(train_file_name, 'wb') as f:
        np.save(f,X_train)
    print(f'Saved : {train_file_name}')
    
    # testing data
    test_file_name = os.path.join(data_path,f'test_preprocessed_naivewin{window_size}.npy')
    with open(test_file_name, 'wb') as f:
        np.save(f,X_test)
    print(f'Saved : {test_file_name}')


Full windowed dataset (1st column is target Y(T+18): (30977, 21) ; windowsize:20
Training dataset: (29428, 21)
Testing dataset: (1549, 21)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_naivewin20.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_naivewin20.npy

Full windowed dataset (1st column is target Y(T+18): (30957, 41) ; windowsize:40
Training dataset: (29409, 41)
Testing dataset: (1548, 41)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_naivewin40.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_naivewin40.npy

Full windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80
Training dataset: (29371, 81)
Testing dataset: (1546, 81)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_naivewin80.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_naivewin80.npy


### Differences as Features
- First column is `Y(T+lead_time)`
- 2nd column is `Y(T+0)`, present value
- 3rd to END are differences: `[Y(T+0)-Y(T-1), Y(T-1)-Y(T-2), Y(T-2)-Y(T-3), ...]`

Write the train/test datasets to:
- `train_preprocessed_diff{}.npy` and
- `test_preprocessed_diff{}.npy`

`diff{}` stands for difference data with window width `{}`, i.e. the differences from `T-window_size+1` to `T+0` (`window_size-1` $\Delta T$'s)
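As with the naive windows, `windowed_diff_data` comes from `src.datautils` and is not shown here. The following is a sketch of what such a function could look like given the column layout above; the name `diff_windows` is hypothetical and the implementation is an assumption.

```python
import numpy as np

def diff_windows(y, lead_time, window_size):
    """Rows of [Y(T+lead_time), Y(T+0), Y(T+0)-Y(T-1), ..., Y(T-w+2)-Y(T-w+1)]."""
    y = np.asarray(y, dtype=float).ravel()
    start = window_size - 1
    n_rows = len(y) - start - lead_time
    out = np.empty((n_rows, window_size + 1))
    for i in range(n_rows):
        t = start + i
        window = y[t - window_size + 1 : t + 1][::-1]  # Y(T), Y(T-1), ..., Y(T-w+1)
        out[i, 0] = y[t + lead_time]                   # target
        out[i, 1] = window[0]                          # present value
        out[i, 2:] = window[:-1] - window[1:]          # window_size - 1 backward differences
    return out

diffs = diff_windows(np.arange(10) ** 2, lead_time=2, window_size=3)
print(diffs.shape)  # (6, 4)
print(diffs[0])     # [16.  4.  3.  1.]
```

The output has the same width as the naive window, `window_size + 1`, because one raw value is kept and the remaining `window_size - 1` entries are replaced by differences.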

In [86]:
WINDOW_SIZES = [20, 40, 80] # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1

x_norm = data_norm[:,0] # Y(t) energy

for window_size in WINDOW_SIZES:
    # index of last elem of window
    start_time = window_size-1
    # prepare windows : 1st column is Y(T+lead_time)
    X_norm_wind = windowed_diff_data(x_norm, lead_time=lead_time, window_size=window_size)
    print(f'\nFull windowed dataset (1st column is target Y(T+{lead_time}): {X_norm_wind.shape}',
          f"; windowsize:{window_size}")
    
    # split train/test
    split_index = X_norm_wind.shape[0]*TRAIN_PERCENT//100
    
    X_train = X_norm_wind[:split_index,:]
    X_test = X_norm_wind[split_index:,:]
    print('Training dataset:',X_train.shape)
    print('Testing dataset:',X_test.shape)
    
    # write to files
    
    # training data
    train_file_name = os.path.join(data_path,f'train_preprocessed_diff{window_size}.npy')
    with open(train_file_name, 'wb') as f:
        np.save(f,X_train)
    print(f'Saved : {train_file_name}')
    
    # testing data
    test_file_name = os.path.join(data_path,f'test_preprocessed_diff{window_size}.npy')
    with open(test_file_name, 'wb') as f:
        np.save(f,X_test)
    print(f'Saved : {test_file_name}')


Full windowed dataset (1st column is target Y(T+18): (30977, 21) ; windowsize:20
Training dataset: (29428, 21)
Testing dataset: (1549, 21)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_diff20.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_diff20.npy

Full windowed dataset (1st column is target Y(T+18): (30957, 41) ; windowsize:40
Training dataset: (29409, 41)
Testing dataset: (1548, 41)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_diff40.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_diff40.npy

Full windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80
Training dataset: (29371, 81)
Testing dataset: (1546, 81)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_diff80.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_diff80.npy


## Adding Wind Forecast Features

In [46]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_norm[:,1::2]**2 + data_norm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows : 1st column is Y(T+lead_time)
Y_norm_diffwind = windowed_diff_data(Yt, lead_time=lead_time, window_size=window_size)
print(f'\nFull Diff windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_diffwind.shape}',
      f"; windowsize:{window_size}")

X_norm_windows = []
for l in range(wind_speeds.shape[1]):
    X_norm_windows.append(
        windowed_data(wind_speeds[:,l], lead_time=lead_time, window_size=window_size)
    )
print('\nX(t+18),X(T+0),X(T-1)...,X(T-window_size+1):\n',[l.shape for l in X_norm_windows])

X_norm_diffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_diffwindows.append(
        windowed_diff_data(wind_speeds[:,l], lead_time=lead_time, window_size=window_size)
    )
print('\nX(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):\n',
      [l.shape for l in X_norm_diffwindows])


Full Diff windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80

X(t+18),X(T+0),X(T-1)...,X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]

X(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]


### Wind Speed at T+18
- Wind speed forecast X(T+18): for each location, speed = sqrt(sin^2 + cos^2) of the wind-vector components --> 8 additional features
- columns `[Y(T+18), Y(T+0), speed1(T+18),speed2(T+18),...,speed8(T+18)]`
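The speed computation can be illustrated on a toy array, assuming (as the comment `wind speed from wind vectors` suggests) that the `_sin` and `_cos` columns hold the two components of each location's wind vector. The array and variable names below are made up for the example:

```python
import numpy as np

# Toy row layout: [Energy, loc1_sin, loc1_cos, loc2_sin, loc2_cos]
data = np.array([
    [0.5, 3.0, 4.0, 0.0, 1.0],
    [0.7, 6.0, 8.0, 1.0, 0.0],
])

sin_part = data[:, 1::2]  # columns 1, 3, ...: sine components
cos_part = data[:, 2::2]  # columns 2, 4, ...: cosine components
wind_speeds = np.sqrt(sin_part**2 + cos_part**2)  # one column per location

print(wind_speeds)  # [[ 5.  1.] [10.  1.]]
```

The strided slices `1::2` and `2::2` rely on the columns alternating sine and cosine after the energy column, so the result has one speed column per location.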

In [64]:
# columns [Y(T+18), Y(T+0), speed1(T+18),speed2(T+18),...,speed8(T+18)]
# Y_norm_diffwind is used here: its first two columns are Y(T+18) and Y(T+0)
Y18_Y0_X18=[Y_norm_diffwind[:,:2]]
Y18_Y0_X18.extend([Xt[:,0].reshape(-1,1) for Xt in X_norm_windows])

Y18_Y0_X18 = np.concatenate(Y18_Y0_X18, axis=1)

# split train/test
split_index = Y18_Y0_X18.shape[0]*TRAIN_PERCENT//100

X_train = Y18_Y0_X18[:split_index,:]
X_test = Y18_Y0_X18[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_naive_Y18_Y0_X18.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_naive_Y18_Y0_X18.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 10)
Testing dataset: (1546, 10)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_naive_Y18_Y0_X18.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_naive_Y18_Y0_X18.npy


### Y(t+0), diff(Y(t+0),...,Y(t-w+1)), + speed magnitudes X(t+18)
- columns `[Y(T+18), Y(T+0),Y(t+0)-Y(t-1),..., Y(t-w+2)-Y(t-w+1), speed1(T+18),speed2(T+18),...,speed8(T+18)]`
where `w` is the `window_size`
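The column layout above can be assembled by slicing the first `w + 1` columns of the energy diff windows and taking the first column (the T+18 value) of each wind-speed window. A minimal sketch with random stand-in arrays; all names and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, window_size, n_locations = 50, 80, 8

# Stand-ins for the windowed arrays: each has window_size + 1 columns
Y_diff = rng.normal(size=(n_rows, window_size + 1))
X_windows = [rng.normal(size=(n_rows, window_size + 1)) for _ in range(n_locations)]

w = 20
blocks = [Y_diff[:, :w + 1]]                              # Y(T+18), Y(T+0), w-1 differences
blocks.extend(X[:, 0].reshape(-1, 1) for X in X_windows)  # speed_k(T+18) for each location
features = np.concatenate(blocks, axis=1)

print(features.shape)  # (50, 29)
```

Building the windows once at the largest `window_size` and slicing afterwards keeps the row count identical across all values of `w`, so the resulting datasets stay aligned row by row.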

#### Y(t) window_size=20

In [139]:
# Y_norm_diffwind
# X_norm_windows
Y18_Y0_dYwind20_X18=[Y_norm_diffwind[:,:21]]  #Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)
# append [speed1(T+18),speed2(T+18),...,speed8(T+18)]
Y18_Y0_dYwind20_X18.extend([Xt[:,0].reshape(-1,1) for Xt in X_norm_windows])

Y18_Y0_dYwind20_X18=np.concatenate(Y18_Y0_dYwind20_X18, axis=1)

# split train/test
split_index = Y18_Y0_dYwind20_X18.shape[0]*TRAIN_PERCENT//100

X_train = Y18_Y0_dYwind20_X18[:split_index,:]
X_test = Y18_Y0_dYwind20_X18[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18_Y0_dYwind20_X18.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18_Y0_dYwind20_X18.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 29)
Testing dataset: (1546, 29)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18_Y0_dYwind20_X18.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18_Y0_dYwind20_X18.npy
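
The split-and-save steps are repeated in every cell of this notebook, so they could be folded into one helper. The sketch below is hypothetical (`split_and_save` and `tag` are illustrative names, not part of the notebook). The printed shapes (29371 and 1546 out of 30917 rows) are consistent with `TRAIN_PERCENT = 95`, although that constant is set elsewhere.

```python
import os
import tempfile
import numpy as np

def split_and_save(features, tag, data_path, train_percent):
    """Chronological split (no shuffling), then save both parts as .npy files."""
    split_index = features.shape[0] * train_percent // 100
    parts = {"train": features[:split_index, :], "test": features[split_index:, :]}
    paths = {}
    for name, part in parts.items():
        paths[name] = os.path.join(data_path, f"{name}_preprocessed_{tag}.npy")
        np.save(paths[name], part)
    return parts, paths

# Demo on a throwaway directory with dummy data
with tempfile.TemporaryDirectory() as tmp:
    demo = np.zeros((200, 29))
    parts, paths = split_and_save(demo, "demo", tmp, train_percent=95)
    reloaded = np.load(paths["train"])

print(parts["train"].shape, parts["test"].shape)
```

Because the split is a plain row slice, the test set is always the most recent part of the series, which matches the behaviour of the cells above.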


#### Y(t) window_size=40

In [141]:
# Y_norm_diffwind
# X_norm_windows
Y18_Y0_dYwind40_X18=[Y_norm_diffwind[:,:41]]  #Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-38)-Y(t-39)

# append [speed1(T+18),speed2(T+18),...,speed8(T+18)]
Y18_Y0_dYwind40_X18.extend([Xt[:,0].reshape(-1,1) for Xt in X_norm_windows])

Y18_Y0_dYwind40_X18=np.concatenate(Y18_Y0_dYwind40_X18, axis=1)

# split train/test
split_index = Y18_Y0_dYwind40_X18.shape[0]*TRAIN_PERCENT//100

X_train = Y18_Y0_dYwind40_X18[:split_index,:]
X_test = Y18_Y0_dYwind40_X18[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18_Y0_dYwind40_X18.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18_Y0_dYwind40_X18.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 49)
Testing dataset: (1546, 49)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18_Y0_dYwind40_X18.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18_Y0_dYwind40_X18.npy


### Wind Speed Forecasts T+0 and T+18
* diffWindows Y(T) + Wind speed (T+18, T+0) + naiveWindow X(T)
* diffWindows Y(T) + Wind speed (T+18, T+0) + diffWindows X(T) `[T+0,T-1,...]`

In [206]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_norm[:,1::2]**2 + data_norm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows: 1st column is Y(T+lead_time)
Y_norm_diffwind = windowed_diff_data(Yt, lead_time=lead_time, window_size=window_size)
print(f'\nFull Diff windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_diffwind.shape}',
      f"; windowsize:{window_size}")

X_norm_windows = []
for l in range(wind_speeds.shape[1]):
    X_norm_windows.append(
        windowed_data(wind_speeds[:,l], lead_time=lead_time, window_size=window_size)
    )
print('\nX(t+18),X(T+0),X(T-1)...,X(T-window_size+1):\n',[l.shape for l in X_norm_windows])

X_norm_diffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_diffwindows.append(
        windowed_diff_data(wind_speeds[:,l], lead_time=lead_time, window_size=window_size)
    )

print('\nX(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):\n',
      [l.shape for l in X_norm_diffwindows])


Full Diff windowed dataset (1st column is target Y(T+18): (30917, 81) ; windowsize:80

X(t+18),X(T+0),X(T-1)...,X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]

X(t+18),X(T+0),X(T+0)-X(T-1),...,X(T-window_size+2)-X(T-window_size+1):
 [(30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81), (30917, 81)]
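
The strided slices `1::2` and `2::2` used to compute `wind_speeds` can be checked on a toy array. The layout assumed here (target in column 0, followed by one pair of wind-vector components per location) is inferred from the slicing and from the 8 arrays printed above. It is not stated explicitly in this part of the notebook.

```python
import numpy as np

# Toy row layout (assumed): [Y, u1, v1, u2, v2] for two locations
toy = np.array([[0.5, 3.0, 4.0, 6.0, 8.0],
                [0.1, 0.0, 1.0, 5.0, 12.0]])

u = toy[:, 1::2]   # columns 1, 3 -> first component of each location
v = toy[:, 2::2]   # columns 2, 4 -> second component of each location
speeds = np.sqrt(u**2 + v**2)   # one magnitude column per location

print(speeds)
```

`np.hypot(u, v)` computes the same magnitude and is a common alternative to the explicit square root.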


#### X(t+18), X(t+0) and NaiveWindows( X(t) )

In [10]:
#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)
Y18Y0dYwind20_X18X0Xwind20 = [Y_norm_diffwind[:,:21]]

# append, per location: X(t+18), X(t+0), X(t-1), ..., X(t-19)  (naive window, 21 columns each)
Y18Y0dYwind20_X18X0Xwind20.extend([Xt[:,:21] for Xt in X_norm_windows])
Y18Y0dYwind20_X18X0Xwind20 = np.concatenate(Y18Y0dYwind20_X18X0Xwind20, axis=1)

# split train/test
split_index = Y18Y0dYwind20_X18X0Xwind20.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dYwind20_X18X0Xwind20[:split_index,:]
X_test = Y18Y0dYwind20_X18X0Xwind20[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dYwind20_X18X0Xwind20.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dYwind20_X18X0Xwind20.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 189)
Testing dataset: (1546, 189)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dYwind20_X18X0Xwind20.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dYwind20_X18X0Xwind20.npy


#### X(t+18), X(t+0) and DiffWindows( X(t) ) step=1

##### w=20

In [207]:
#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-18)-Y(t-19)
Y18Y0dYwind20_X18X0dXwind20 = [Y_norm_diffwind[:,:21]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-1), ..., X(t-18)-X(t-19)  (21 columns each)
Y18Y0dYwind20_X18X0dXwind20.extend([Xt[:,:21] for Xt in X_norm_diffwindows])
Y18Y0dYwind20_X18X0dXwind20 = np.concatenate(Y18Y0dYwind20_X18X0dXwind20, axis=1)

# split train/test
split_index = Y18Y0dYwind20_X18X0dXwind20.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dYwind20_X18X0dXwind20[:split_index,:]
X_test = Y18Y0dYwind20_X18X0dXwind20[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dYwind20_X18X0dXwind20.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dYwind20_X18X0dXwind20.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 189)
Testing dataset: (1546, 189)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dYwind20_X18X0dXwind20.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dYwind20_X18X0dXwind20.npy


In [17]:
# Ystr = ['y(t+18)','y(t+0)']
# Ystr.extend([f"y(t{'+' if t>-1 else ''}{t})-y(t{t-1})" for t in range(0,-80+1,-1)])
# print(len(Ystr))
# Ystr[:41]
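
Column labels in the spirit of the commented-out cell above can be generated for any window size and step. This is a hedged sketch: `diff_labels` is a hypothetical helper, and it assumes the `2 + w - h` column layout that the slices in this notebook imply.

```python
def diff_labels(name, window_size, h=1, lead_time=18):
    """Labels for [name(t+lead), name(t+0), name(t-k)-name(t-k-h) for k = 0 .. w-h-1]."""
    def at(lag):
        return f"{name}(t+0)" if lag == 0 else f"{name}(t-{lag})"

    labels = [f"{name}(t+{lead_time})", at(0)]
    labels += [f"{at(k)}-{at(k + h)}" for k in range(window_size - h)]
    return labels

y_labels = diff_labels("y", window_size=20)          # step 1, 21 labels
y_labels_h4 = diff_labels("y", window_size=20, h=4)  # step 4, 18 labels
print(y_labels[:3], y_labels[-1])
```

The label counts can then be compared against the slice widths used when assembling each dataset.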

##### w=40

In [208]:
#Y18,Y0, Y(t+0)-Y(t-1),..., Y(t-38)-Y(t-39)
Y18Y0dYwind40_X18X0dXwind40 = [Y_norm_diffwind[:,:41]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-1), ..., X(t-38)-X(t-39)  (41 columns each)
Y18Y0dYwind40_X18X0dXwind40.extend([Xt[:,:41] for Xt in X_norm_diffwindows])
Y18Y0dYwind40_X18X0dXwind40 = np.concatenate(Y18Y0dYwind40_X18X0dXwind40, axis=1)

# split train/test
split_index = Y18Y0dYwind40_X18X0dXwind40.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dYwind40_X18X0dXwind40[:split_index,:]
X_test = Y18Y0dYwind40_X18X0dXwind40[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dYwind40_X18X0dXwind40.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dYwind40_X18X0dXwind40.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 369)
Testing dataset: (1546, 369)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dYwind40_X18X0dXwind40.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dYwind40_X18X0dXwind40.npy


## Time Differences With Step Size
- Differences with step size `h`: `[Y(T+0)-Y(T-h), Y(T-1)-Y(T-1-h), ...]`
- step sizes tried, h: 4, 9, and 18
- window sizes w: 20 and 40 (plus 80 for h=18)
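
The step differences and the resulting column counts can be sketched as follows. `step_diff_row` is a hypothetical single-row version of `windowed_diff_with_step` (defined elsewhere in the notebook). It assumes the layout suggested by the printed headers, i.e. `Y(t-k) - Y(t-k-h)` for `k = 0, ..., w-h-1`, most recent first.

```python
import numpy as np

def step_diff_row(y, t, window_size, h, lead_time):
    """Hypothetical single-row version of windowed_diff_with_step (assumed layout)."""
    y = np.asarray(y, dtype=float).ravel()
    window = y[t - window_size + 1 : t + 1][::-1]   # Y(t), Y(t-1), ..., Y(t-w+1)
    diffs = window[:-h] - window[h:]                # Y(t-k) - Y(t-k-h), k = 0 .. w-h-1
    return np.concatenate(([y[t + lead_time], y[t]], diffs))

y = np.arange(120.0)   # linear series, so every step difference equals h
row = step_diff_row(y, t=90, window_size=80, h=4, lead_time=18)
print(row.shape)

# Number of leading columns to keep for a smaller window w: 2 + w - h
keep = {(w, h): 2 + w - h for h in (4, 9, 18) for w in (20, 40, 80)}
print(keep)
```

With `window_size=80` and `h=4` the row has 78 entries, matching the shape printed in the next cell, and the `keep` values reproduce the slice widths used in the subsections (18, 38, 13, 33, 4, 24 and 64).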

### Difference step size h=4

In [12]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_norm[:,1::2]**2 + data_norm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
h = 4 # step size for differences
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows: 1st column is Y(T+lead_time)
# X_norm_wind = windowed_diff_with_step(x_norm, lead_time=lead_time, window_size=window_size,h=lead_time)
Y_norm_stepdiffwind = windowed_diff_with_step(Yt, lead_time=lead_time, window_size=window_size, h = h)
print(f'\nFull Diff w/ step windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_stepdiffwind.shape}',
      f"; windowsize:{window_size}; step size (h):{h}")

X_norm_stepdiffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_stepdiffwindows.append(
        windowed_diff_with_step(wind_speeds[:,l], lead_time=lead_time, window_size=window_size, h = h)
    )

print(f'\nX(t+18),X(T+0),X(T+0)-X(T-{h}),...,X(T-(window_size-1-{h}))-X(T-window_size+1):\n',
      [l.shape for l in X_norm_stepdiffwindows])


Full Diff w/ step windowed dataset (1st column is target Y(T+18): (30917, 78) ; windowsize:80; step size (h):4

X(t+18),X(T+0),X(T+0)-X(T-4),...,X(T-(window_size-1-4))-X(T-window_size+1):
 [(30917, 78), (30917, 78), (30917, 78), (30917, 78), (30917, 78), (30917, 78), (30917, 78), (30917, 78)]


In [20]:
# Ystr = ['y(t+18)','y(t+0)']
# Ystr.extend([f"y(t{'+' if t>-1 else ''}{t})-y(t{t-h*1})" for t in range(0,-80+1,-1)])
# print(len(Ystr))
# Ystr[:41]

#### w=20

In [16]:
# Y18, Y0, Y(t+0)-Y(t-4), ..., Y(t-15)-Y(t-19)
# last_index = 2 + w - h = 2 + 20 - 4 = 18
Y18Y0dh4Ywind20_X18X0dh4Xwind20 = [Y_norm_stepdiffwind[:,:18]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-4), ..., X(t-15)-X(t-19)
Y18Y0dh4Ywind20_X18X0dh4Xwind20.extend([Xt[:,:18] for Xt in X_norm_stepdiffwindows])
Y18Y0dh4Ywind20_X18X0dh4Xwind20 = np.concatenate(Y18Y0dh4Ywind20_X18X0dh4Xwind20, axis=1)

# split train/test
split_index = Y18Y0dh4Ywind20_X18X0dh4Xwind20.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dh4Ywind20_X18X0dh4Xwind20[:split_index,:]
X_test = Y18Y0dh4Ywind20_X18X0dh4Xwind20[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dh4Ywind20_X18X0dh4Xwind20.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dh4Ywind20_X18X0dh4Xwind20.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 162)
Testing dataset: (1546, 162)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dh4Ywind20_X18X0dh4Xwind20.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dh4Ywind20_X18X0dh4Xwind20.npy


#### w=40

In [18]:
# Y18, Y0, Y(t+0)-Y(t-4), ..., Y(t-35)-Y(t-39)
# last_index = 2 + w - h = 2 + 40 - 4 = 38
Y18Y0dh4Ywind40_X18X0dh4Xwind40 = [Y_norm_stepdiffwind[:,:38]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-4), ..., X(t-35)-X(t-39)
Y18Y0dh4Ywind40_X18X0dh4Xwind40.extend([Xt[:,:38] for Xt in X_norm_stepdiffwindows])
Y18Y0dh4Ywind40_X18X0dh4Xwind40 = np.concatenate(Y18Y0dh4Ywind40_X18X0dh4Xwind40, axis=1)

# split train/test
split_index = Y18Y0dh4Ywind40_X18X0dh4Xwind40.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dh4Ywind40_X18X0dh4Xwind40[:split_index,:]
X_test = Y18Y0dh4Ywind40_X18X0dh4Xwind40[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dh4Ywind40_X18X0dh4Xwind40.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dh4Ywind40_X18X0dh4Xwind40.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 342)
Testing dataset: (1546, 342)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dh4Ywind40_X18X0dh4Xwind40.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dh4Ywind40_X18X0dh4Xwind40.npy


### Difference step size h=9

In [22]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_norm[:,1::2]**2 + data_norm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
h = 9 # step size for differences
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows: 1st column is Y(T+lead_time)
# X_norm_wind = windowed_diff_with_step(x_norm, lead_time=lead_time, window_size=window_size,h=lead_time)
Y_norm_stepdiffwind = windowed_diff_with_step(Yt, lead_time=lead_time, window_size=window_size, h = h)
print(f'\nFull Diff w/ step windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_stepdiffwind.shape}',
      f"; windowsize:{window_size}; step size (h):{h}")

X_norm_stepdiffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_stepdiffwindows.append(
        windowed_diff_with_step(wind_speeds[:,l], lead_time=lead_time, window_size=window_size, h = h)
    )

print(f'\nX(t+18),X(T+0),X(T+0)-X(T-{h}),...,X(T-(window_size-1-{h}))-X(T-window_size+1):\n',
      [l.shape for l in X_norm_stepdiffwindows])


Full Diff w/ step windowed dataset (1st column is target Y(T+18): (30917, 73) ; windowsize:80; step size (h):9

X(t+18),X(T+0),X(T+0)-X(T-9),...,X(T-(window_size-1-9))-X(T-window_size+1):
 [(30917, 73), (30917, 73), (30917, 73), (30917, 73), (30917, 73), (30917, 73), (30917, 73), (30917, 73)]


#### w=20

In [23]:
# Y18, Y0, Y(t+0)-Y(t-9), ..., Y(t-10)-Y(t-19)
# last_index = 2 + w - h = 2 + 20 - 9 = 13
Y18Y0dh9Ywind20_X18X0dh9Xwind20 = [Y_norm_stepdiffwind[:,:13]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-9), ..., X(t-10)-X(t-19)
Y18Y0dh9Ywind20_X18X0dh9Xwind20.extend([Xt[:,:13] for Xt in X_norm_stepdiffwindows])
Y18Y0dh9Ywind20_X18X0dh9Xwind20 = np.concatenate(Y18Y0dh9Ywind20_X18X0dh9Xwind20, axis=1)

# split train/test
split_index = Y18Y0dh9Ywind20_X18X0dh9Xwind20.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dh9Ywind20_X18X0dh9Xwind20[:split_index,:]
X_test = Y18Y0dh9Ywind20_X18X0dh9Xwind20[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dh9Ywind20_X18X0dh9Xwind20.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dh9Ywind20_X18X0dh9Xwind20.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 117)
Testing dataset: (1546, 117)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dh9Ywind20_X18X0dh9Xwind20.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dh9Ywind20_X18X0dh9Xwind20.npy


#### w=40

In [24]:
# Y18, Y0, Y(t+0)-Y(t-9), ..., Y(t-30)-Y(t-39)
# last_index = 2 + w - h = 2 + 40 - 9 = 33
Y18Y0dh9Ywind40_X18X0dh9Xwind40 = [Y_norm_stepdiffwind[:,:33]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-9), ..., X(t-30)-X(t-39)
Y18Y0dh9Ywind40_X18X0dh9Xwind40.extend([Xt[:,:33] for Xt in X_norm_stepdiffwindows])
Y18Y0dh9Ywind40_X18X0dh9Xwind40 = np.concatenate(Y18Y0dh9Ywind40_X18X0dh9Xwind40, axis=1)

# split train/test
split_index = Y18Y0dh9Ywind40_X18X0dh9Xwind40.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dh9Ywind40_X18X0dh9Xwind40[:split_index,:]
X_test = Y18Y0dh9Ywind40_X18X0dh9Xwind40[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dh9Ywind40_X18X0dh9Xwind40.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dh9Ywind40_X18X0dh9Xwind40.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 297)
Testing dataset: (1546, 297)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dh9Ywind40_X18X0dh9Xwind40.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dh9Ywind40_X18X0dh9Xwind40.npy


### Difference step size h=18 (lead_time)

In [9]:
# normalised data
Yt = data_norm[:,0].reshape(-1,1)
wind_speeds = np.sqrt(data_norm[:,1::2]**2 + data_norm[:,2::2]**2) # wind speed from wind vectors

# window size
window_size = 80 # Y(T+0) and "window_size-1" previous points, T-1,T-2,...,T-window_size+1
h = 18 # step size for differences
start_time = window_size-1 # index of last elem of window

# Y(t)
# prepare windows: 1st column is Y(T+lead_time)
# X_norm_wind = windowed_diff_with_step(x_norm, lead_time=lead_time, window_size=window_size,h=lead_time)
Y_norm_stepdiffwind = windowed_diff_with_step(Yt, lead_time=lead_time, window_size=window_size, h = h)
print(f'\nFull Diff w/ step windowed dataset (1st column is target Y(T+{lead_time}): {Y_norm_stepdiffwind.shape}',
      f"; windowsize:{window_size}; step size (h):{h}")

X_norm_stepdiffwindows = []
for l in range(wind_speeds.shape[1]):
    X_norm_stepdiffwindows.append(
        windowed_diff_with_step(wind_speeds[:,l], lead_time=lead_time, window_size=window_size, h = h)
    )

print(f'\nX(t+18),X(T+0),X(T+0)-X(T-{h}),...,X(T-(window_size-1-{h}))-X(T-window_size+1):\n',
      [l.shape for l in X_norm_stepdiffwindows])


Full Diff w/ step windowed dataset (1st column is target Y(T+18): (30917, 64) ; windowsize:80; step size (h):18

X(t+18),X(T+0),X(T+0)-X(T-18),...,X(T-(window_size-1-18))-X(T-window_size+1):
 [(30917, 64), (30917, 64), (30917, 64), (30917, 64), (30917, 64), (30917, 64), (30917, 64), (30917, 64)]


#### w=20

In [26]:
# Y18, Y0, Y(t+0)-Y(t-18), Y(t-1)-Y(t-19)
# last_index = 2 + w - h = 2 + 20 - 18 = 4
Y18Y0dh18Ywind20_X18X0dh18Xwind20 = [Y_norm_stepdiffwind[:,:4]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-18), X(t-1)-X(t-19)
Y18Y0dh18Ywind20_X18X0dh18Xwind20.extend([Xt[:,:4] for Xt in X_norm_stepdiffwindows])
Y18Y0dh18Ywind20_X18X0dh18Xwind20 = np.concatenate(Y18Y0dh18Ywind20_X18X0dh18Xwind20, axis=1)

# split train/test
split_index = Y18Y0dh18Ywind20_X18X0dh18Xwind20.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dh18Ywind20_X18X0dh18Xwind20[:split_index,:]
X_test = Y18Y0dh18Ywind20_X18X0dh18Xwind20[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dh18Ywind20_X18X0dh18Xwind20.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dh18Ywind20_X18X0dh18Xwind20.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 36)
Testing dataset: (1546, 36)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dh18Ywind20_X18X0dh18Xwind20.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dh18Ywind20_X18X0dh18Xwind20.npy


#### w=40

In [27]:
# Y18, Y0, Y(t+0)-Y(t-18), ..., Y(t-21)-Y(t-39)
# last_index = 2 + w - h = 2 + 40 - 18 = 24
Y18Y0dh18Ywind40_X18X0dh18Xwind40 = [Y_norm_stepdiffwind[:,:24]]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-18), ..., X(t-21)-X(t-39)
Y18Y0dh18Ywind40_X18X0dh18Xwind40.extend([Xt[:,:24] for Xt in X_norm_stepdiffwindows])
Y18Y0dh18Ywind40_X18X0dh18Xwind40 = np.concatenate(Y18Y0dh18Ywind40_X18X0dh18Xwind40, axis=1)

# split train/test
split_index = Y18Y0dh18Ywind40_X18X0dh18Xwind40.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dh18Ywind40_X18X0dh18Xwind40[:split_index,:]
X_test = Y18Y0dh18Ywind40_X18X0dh18Xwind40[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dh18Ywind40_X18X0dh18Xwind40.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dh18Ywind40_X18X0dh18Xwind40.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 216)
Testing dataset: (1546, 216)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dh18Ywind40_X18X0dh18Xwind40.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dh18Ywind40_X18X0dh18Xwind40.npy


#### w=80

In [10]:
Y_norm_stepdiffwind.shape

(30917, 64)

In [11]:
# Y18, Y0, Y(t+0)-Y(t-18), ..., Y(t-61)-Y(t-79)
# last_index = 2 + w - h = 2 + 80 - 18 = 64, i.e. every column of the window_size=80 array
Y18Y0dh18Ywind80_X18X0dh18Xwind80 = [Y_norm_stepdiffwind]

# append, per location: X(t+18), X(t+0), X(t+0)-X(t-18), ..., X(t-61)-X(t-79)
Y18Y0dh18Ywind80_X18X0dh18Xwind80.extend(X_norm_stepdiffwindows)
Y18Y0dh18Ywind80_X18X0dh18Xwind80 = np.concatenate(Y18Y0dh18Ywind80_X18X0dh18Xwind80, axis=1)

# split train/test
split_index = Y18Y0dh18Ywind80_X18X0dh18Xwind80.shape[0]*TRAIN_PERCENT//100

X_train = Y18Y0dh18Ywind80_X18X0dh18Xwind80[:split_index,:]
X_test = Y18Y0dh18Ywind80_X18X0dh18Xwind80[split_index:,:]
print('Training dataset:',X_train.shape)
print('Testing dataset:',X_test.shape)

# training data
train_file_name = os.path.join(data_path,f'train_preprocessed_Y18Y0dh18Ywind80_X18X0dh18Xwind80.npy')
with open(train_file_name, 'wb') as f:
    np.save(f,X_train)
print(f'Saved : {train_file_name}')

# testing data
test_file_name = os.path.join(data_path,f'test_preprocessed_Y18Y0dh18Ywind80_X18X0dh18Xwind80.npy')
with open(test_file_name, 'wb') as f:
    np.save(f,X_test)
print(f'Saved : {test_file_name}')

Training dataset: (29371, 576)
Testing dataset: (1546, 576)
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/train_preprocessed_Y18Y0dh18Ywind80_X18X0dh18Xwind80.npy
Saved : ../../../dataDIR/preprocessed_linearinterp_avgmodels/test_preprocessed_Y18Y0dh18Ywind80_X18X0dh18Xwind80.npy


In [87]:
# X_norm_wind = windowed_diff_with_step(x_norm, lead_time=lead_time, window_size=window_size,h=lead_time)

# split_index = X_norm_wind.shape[0]*70//100
# X_train = X_norm_wind[:split_index,:]
# X_test = X_norm_wind[split_index:,:]
# print(f'Full windowed dataset: {X_norm_wind.shape}')
# print(f'Training dataset (1st column is target Y(T+{lead_time})):',X_train.shape,f"\nwindowsize:{window_size}")
# print(f'Testing dataset (1st column is target Y(T+{lead_time})):',X_test.shape,f"\nwindowsize:{window_size}")

# # Training data
# with open(f'train_preprocessed_stepdiff{window_size}.npy', 'wb') as f:
#     np.save(f,X_train)
# # Testing data
# with open(f'test_preprocessed_stepdiff{window_size}.npy', 'wb') as f:
#     np.save(f,X_test)

## Force and Momentum with Step Size
Momentum = difference of differences (a second-order difference)

Force = difference of momentum (a third-order difference)
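
One possible reading of these definitions, sketched on a toy series. This is an assumption rather than the notebook's implementation: `windowed_momentum_force` is not shown here, and applying the outer differences with the same lag `h` as the inner one is a choice made for this example. `step_diff` is a hypothetical helper.

```python
import numpy as np

def step_diff(x, h):
    """D_h[i] = x[i] - x[i-h], for the indices where both terms exist."""
    x = np.asarray(x, dtype=float)
    return x[h:] - x[:-h]

h = 4
y = np.arange(40.0) ** 2                   # quadratic toy series
momentum = step_diff(step_diff(y, h), h)   # y(t) - 2*y(t-h) + y(t-2h)
force = step_diff(momentum, h)             # y(t) - 3*y(t-h) + 3*y(t-2h) - y(t-3h)

print(momentum[:3], force[:3])
```

For a quadratic series the momentum is constant (`2*h**2`) and the force is zero, which gives a quick sanity check. Each differencing step shortens the series by `h` points, so a windowed version would have correspondingly fewer columns.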

In [None]:
# X_norm_wind = windowed_momentum_force(x_norm,lead_time=lead_time,window_size=window_size,h=lead_time)

# split_index = X_norm_wind.shape[0]*70//100
# X_train = X_norm_wind[:split_index,:]
# X_test = X_norm_wind[split_index:,:]
# print(f'Full windowed dataset: {X_norm_wind.shape}')
# print(f'Training dataset (1st column is target Y(T+{lead_time})):',X_train.shape,f"\nwindowsize:{window_size}")
# print(f'Testing dataset (1st column is target Y(T+{lead_time})):',X_test.shape,f"\nwindowsize:{window_size}")

# # Training data
# with open(f'train_preprocessed_mntfrcwin{window_size}.npy', 'wb') as f:
#     np.save(f,X_train)
# # Testing data
# with open(f'test_preprocessed_mntfrcwin{window_size}.npy', 'wb') as f:
#     np.save(f,X_test)