# Preprocess Datasets and Extract Features
> Feature engineering notebook

Dataset columns (same convention as in lab 1):

| Col1 | Col2  | Col3  | Col4  | $\dots$ |
|------|-------|-------|-------|---------|
| $Y$  | $Y_0$ | $X_1$ | $X_2$ | $\dots$ |

- $Y$ : labels or target values, in our case $X(T+5)$
- $Y_0$ : present value $X(T+0)$
- $X_1$, $X_2$, $\dots$ : other features
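
The column convention above can be illustrated with a minimal sketch. The array `data` and the names `y`, `y0`, and `others` are made up for illustration and are not part of the lab datasets.

```python
import numpy as np

# Hypothetical toy array laid out with the convention above:
# column 0 = Y, column 1 = Y_0, remaining columns = X_1, X_2, ...
data = np.array([
    [0.6, 0.1, 0.0, -0.1],
    [0.7, 0.2, 0.1,  0.0],
    [0.8, 0.3, 0.2,  0.1],
])

y = data[:, 0]        # labels / target values, X(T+5)
y0 = data[:, 1]       # present value, X(T+0)
others = data[:, 2:]  # other features X_1, X_2, ...

print(y.shape, y0.shape, others.shape)
```

Whether $Y_0$ is also passed to a model as an input is a modelling choice; the sketch only separates the columns.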

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

## Import Normalized Datasets

In [None]:
# Training data
with open('train_norm.npy', 'rb') as f:
    train_norm = np.load(f)
# Testing data
with open('test_norm.npy', 'rb') as f:
    test_norm = np.load(f)

## Extract Features

- Revision video: [Session 3: The Prediction Pipeline](https://youtu.be/4W6-48wXXEc?t=1246)

Number of samples in the normalized datasets:

In [None]:
print("train_norm: {} samples\ntest_norm : {} samples".format(train_norm.shape[0],test_norm.shape[0]))

train_norm: 786 samples
test_norm : 90 samples


### Naive Window as Features

In [None]:
window_size = 6  # present value at T+0 plus the 5 previous points, T-1, ..., T-5
lead_time = 5  # predict the value at T+5
start_time = window_size - 1  # index of the last element of the first window
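
Each row needs `window_size` points of history and a target `lead_time` steps ahead, so a series of $N$ points yields $N - \text{window\_size} - \text{lead\_time} + 1$ rows. A small sketch of that arithmetic, where the helper name `n_window_rows` is hypothetical:

```python
def n_window_rows(n_samples, window_size=6, lead_time=5):
    """Number of rows when each row needs `window_size` past points
    and a target `lead_time` steps after the last of them."""
    return max(n_samples - window_size - lead_time + 1, 0)

print(n_window_rows(786))  # training set size printed earlier -> 776
print(n_window_rows(90))   # test set size printed earlier -> 80
```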

In [None]:
# First column: the target, i.e. the value at T+5
X = [train_norm[start_time + lead_time:].reshape(-1, 1)]
# Append the window, newest first: values at T+0, T-1, ..., T-5
for w in range(window_size, 0, -1):
    X.append(train_norm[w - 1:train_norm.shape[0] - lead_time - window_size + w].reshape(-1, 1))
# Concatenate into one array with columns T+5, T+0, T-1, ..., T-5
X = np.concatenate(X, axis=1)
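
As a sanity check on the slicing, the same matrix can be built with `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) and compared with the loop. This is only a sketch on a made-up series: `series`, `X_loop`, and `X_view` are illustrative names, and `series` stands in for a 1-D `train_norm`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

window_size = 6
lead_time = 5
start_time = window_size - 1

series = np.arange(20.0)  # hypothetical toy series: value == time index

# Loop-based construction, same slicing as the cell above
cols = [series[start_time + lead_time:].reshape(-1, 1)]
for w in range(window_size, 0, -1):
    cols.append(series[w - 1:series.shape[0] - lead_time - window_size + w].reshape(-1, 1))
X_loop = np.concatenate(cols, axis=1)

# Sliding-window construction
windows = sliding_window_view(series, window_size)  # row i = series[i : i + window_size]
n_rows = series.shape[0] - lead_time - window_size + 1
past = windows[:n_rows, ::-1]                        # newest first: T+0, T-1, ..., T-5
target = series[start_time + lead_time:].reshape(-1, 1)
X_view = np.hstack([target, past])

print(X_view[0])                       # first row: target 10, then 5, 4, 3, 2, 1, 0
print(np.array_equal(X_loop, X_view))  # both constructions agree on this toy series
```

Because the toy series equals its own time index, each row can be read directly as the time steps it covers, which makes off-by-one errors easy to spot.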