# Load Packages

- We use the `torch` package.
- We use the `pytorch_lightning` package to simplify fitting and evaluating models.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import \
     (LinearRegression,
      LogisticRegression,
      Lasso)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from ISLP import load_data
from ISLP.models import ModelSpec as MS

### Torch-Specific Imports
There are a number of imports for `torch`. (These are not
included with `ISLP`, so they must be installed separately.)
First we import the main library
and the essential tools used to specify sequentially-structured networks.

In [2]:
import torch
from torch import nn
from torch.optim import RMSprop
from torch.utils.data import TensorDataset

- Tools from `torchmetrics` compute metrics to evaluate performance.
- Tools from `torchinfo` summarize the layers of a model.
- `read_image()` (imported further down from `torchvision.io`) loads test images.

In [66]:
from torchmetrics import (MeanAbsoluteError, R2Score)
import torchinfo

- The `pytorch_lightning` package is a higher-level interface than `torch` for fitting `torch` models.
- It simplifies specifying, fitting, and evaluating models by reducing the amount of boilerplate code needed.

In [4]:
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger

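- `seed_everything()` sets the random seed so that results are reproducible.
- `torch.use_deterministic_algorithms()` tells `torch` to use deterministic algorithms where possible; with `warn_only=True`, operations that have no deterministic implementation raise a warning instead of an error.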

In [5]:
from pytorch_lightning import seed_everything
seed_everything(0, workers=True)
torch.use_deterministic_algorithms(True, warn_only=True)

Seed set to 0


- We use datasets from `torchvision`.
- We use transforms from `torchvision` for preprocessing.
- We use a pretrained network for image classification

In [None]:
from torchvision.io import read_image
from torchvision.datasets import MNIST, CIFAR100
from torchvision.models import (resnet50, ResNet50_Weights)
from torchvision.transforms import (Resize, Normalize,
                                    CenterCrop, ToTensor)

- `SimpleDataModule` and `SimpleModule` from `ISLP.torch` are simple versions of objects used in `pytorch_lightning`.
- `ErrorTracker` collects targets and predictions over each mini-batch during validation or testing, enabling metric computation over the entire validation or test data set.

In [7]:
from ISLP.torch import (SimpleDataModule, SimpleModule,
                        ErrorTracker, rec_num_workers)

- We use helper functions from `ISLP.torch.imdb` to load the data and a lookup that maps integers to keys in the data.
- We use a modified copy of the preprocessed `imdb` data from `keras`, which saves preprocessing time.
    - `keras` is a separate package for fitting deep learning models.

In [8]:
from ISLP.torch.imdb import (load_lookup, load_tensor,
                             load_sparse, load_sequential)


- We use `glob()` from the `glob` module to find all files matching a wildcard pattern.
- We use the `json` module to load a JSON file used to look up the class labels of the pictures in the `ResNet50` example.

In [None]:
from glob import glob
import json

# New York Stock Exchange Data
- The data consist of the Dow Jones returns, log trading volume, and log volatility for the New York Stock Exchange over a 20-year period.
- There are 6051 rows. The row index is the date, in YYYY-MM-DD format.
- There are 5 variables:
    - day_of_week: Day of the week (mon, tues, wed, thur, fri)
    - DJ_return: Return for Dow Jones Industrial Average
    - log_volume: Log of trading volume
    - log_volatility: Log of volatility
    - train: For the first 4,281 observations, this is set to True

In [16]:
NYSE = load_data('NYSE')
NYSE.head()

date,day_of_week,DJ_return,log_volume,log_volatility,train
1962-12-03,mon,-0.004461,0.032573,-13.127403,True
1962-12-04,tues,0.007813,0.346202,-11.749305,True
1962-12-05,wed,0.003845,0.525306,-11.665609,True
1962-12-06,thur,-0.003462,0.210182,-11.626772,True
1962-12-07,fri,0.000568,0.044187,-11.72813,True


In [17]:
NYSE.describe()

,DJ_return,log_volume,log_volatility
count,6051.0,6051.0,6051.0
mean,0.000177,-0.008336,-9.842713
std,0.008436,0.233684,0.753937
min,-0.047177,-1.322425,-13.127403
25%,-0.00464,-0.159956,-10.334196
50%,0.000125,-0.013249,-9.843592
75%,0.004792,0.131632,-9.379632
max,0.049517,1.03937,-7.477833


## Linear Regression

In [23]:
# Standardize the data
cols = ['DJ_return', 'log_volume', 'log_volatility']
X = pd.DataFrame(StandardScaler(
                     with_mean=True,
                     with_std=True).fit_transform(NYSE[cols]),
                 columns=NYSE[cols].columns,
                 index=NYSE.index)
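The standardization step can be reproduced by hand on a small array. The sketch below uses made-up values rather than the NYSE data, and relies on the fact that `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

# Toy values standing in for one numeric column (made up, not NYSE data).
x = np.array([1.0, 2.0, 4.0, 7.0])

# Center the column, then divide by the population standard deviation
# (ddof=0), which is the convention StandardScaler uses.
z = (x - x.mean()) / x.std(ddof=0)

print(z.round(3))
print(abs(z.mean()) < 1e-12, abs(z.std(ddof=0) - 1.0) < 1e-12)
```

After this transformation each column has mean 0 and standard deviation 1, so the three series are on a comparable scale before the lags are built.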


- We set up the lagged versions of the data, dropping any rows with missing values.

In [24]:
for lag in range(1, 6): # Create lags from 1 to 5
    for col in cols:
        newcol = np.zeros(X.shape[0]) * np.nan
        newcol[lag:] = X[col].values[:-lag]
        # Create a new col with name col_lag and add to X
        X.insert(len(X.columns), "{0}_{1}".format(col, lag), newcol)
# Add the original column 'train' to X
X.insert(len(X.columns), 'train', NYSE['train'])
X = X.dropna(); X.head(3).round(2)

date,DJ_return,log_volume,log_volatility,DJ_return_1,log_volume_1,log_volatility_1,DJ_return_2,log_volume_2,log_volatility_2,DJ_return_3,log_volume_3,log_volatility_3,DJ_return_4,log_volume_4,log_volatility_4,DJ_return_5,log_volume_5,log_volatility_5,train
1962-12-10,-1.3,0.61,-1.37,0.05,0.22,-2.5,-0.43,0.94,-2.37,0.43,2.28,-2.42,0.91,1.52,-2.53,-0.55,0.18,-4.36,True
1962-12-11,-0.01,-0.01,-1.51,-1.3,0.61,-1.37,0.05,0.22,-2.5,-0.43,0.94,-2.37,0.43,2.28,-2.42,0.91,1.52,-2.53,True
1962-12-12,0.38,0.04,-1.55,-0.01,-0.01,-1.51,-1.3,0.61,-1.37,0.05,0.22,-2.5,-0.43,0.94,-2.37,0.43,2.28,-2.42,True
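The lag construction above can also be written with `DataFrame.shift()`. The following is a minimal sketch on a made-up two-column frame (the names `toy`, `a`, and `b` are illustrative only); it checks that the manual loop and `shift()` give the same result.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the standardized columns (values are made up).
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, 20.0, 30.0, 40.0]})

# Manual construction, as in the loop above:
# the lag-k value at row t is the original value at row t - k.
manual = toy.copy()
for lag in range(1, 3):
    for col in ['a', 'b']:
        newcol = np.full(len(toy), np.nan)
        newcol[lag:] = toy[col].values[:-lag]
        manual['{0}_{1}'.format(col, lag)] = newcol

# Same construction using DataFrame.shift().
shifted = toy.copy()
for lag in range(1, 3):
    for col in ['a', 'b']:
        shifted['{0}_{1}'.format(col, lag)] = toy[col].shift(lag)

print(manual.equals(shifted))
print(shifted.dropna())
```

With two lags, the first two rows contain missing values and are removed by `dropna()`, just as the first five rows are removed in the NYSE data.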


- We extract the response ``log_volume`` and training indicator `train`.
- We keep only the lagged versions of the data.

In [25]:
Y, train = X['log_volume'], X['train']
X = X.drop(columns=['train'] + cols)
X.columns

Index(['DJ_return_1', 'log_volume_1', 'log_volatility_1', 'DJ_return_2',
       'log_volume_2', 'log_volatility_2', 'DJ_return_3', 'log_volume_3',
       'log_volatility_3', 'DJ_return_4', 'log_volume_4', 'log_volatility_4',
       'DJ_return_5', 'log_volume_5', 'log_volatility_5'],
      dtype='object')

- We fit a simple linear model and use `score()` to compute the $R^2$ on the test data.

In [None]:
M = LinearRegression()
# LinearRegression() adds an intercept by default
M.fit(X[train], Y[train])
M.score(X[~train], Y[~train])

0.41289129385625223
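For reference, the $R^2$ returned by `score()` is one minus the ratio of the residual sum of squares to the total sum of squares of the responses passed in. The helper `r_squared` below is a hypothetical name used only for this sketch, and the values are made up.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - rss / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                      # perfect predictions
print(r_squared(y, np.full(4, y.mean())))   # always predicting the mean
```

Perfect predictions give 1 and predicting the mean gives 0, so a test $R^2$ of about 0.41 means the lagged features explain roughly 41% of the variance in test-set `log_volume`.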

- We refit this model, including the factor variable `day_of_week`.
- We use `get_dummies()` from `pandas` to form the indicators.

In [27]:
X_day = pd.concat( [ X, pd.get_dummies(NYSE['day_of_week']) ],
                  axis=1).dropna()
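To see what `get_dummies()` produces, here is a small sketch on a made-up series of day labels (not the NYSE data). Each distinct label becomes one indicator column.

```python
import pandas as pd

# Toy factor with three distinct levels (made-up values).
days = pd.Series(['mon', 'tues', 'wed', 'mon'], name='day_of_week')

dummies = pd.get_dummies(days)
print(dummies.columns.tolist())
print(dummies.astype(int))
```

In the NYSE data `day_of_week` has five levels, so `X_day` has the 15 lagged columns plus 5 indicator columns, 20 in total.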

- We don't need to reinstantiate the linear regression model: calling `fit()` again refits it on the new design matrix and response.
- The model achieves a test $R^2$ of about 46%.

In [28]:
M.fit(X_day[train], Y[train])
M.score(X_day[~train], Y[~train])

0.4595563133053274

## RNN

- We must reshape the data to fit an RNN. With `batch_first=True`, the `nn.RNN()` layer expects input of shape `(batch, sequence length, features)`, so each observation must be a 5x3 matrix: 5 lags of each of the 3 features.
- The layer processes the rows of each observation in order, so the first row should be lag 5, then lag 4, and so on, down to lag 1.
- We rearrange the columns of the data frame so that a row-wise reshape produces this ordering. We use the `reindex()` method to do this.

In [40]:
ordered_cols = []
for lag in range(5,0,-1):
    for col in cols:
        ordered_cols.append('{0}_{1}'.format(col, lag))
X = X.reindex(columns=ordered_cols)
X.head(3).round(2)

date,DJ_return_5,log_volume_5,log_volatility_5,DJ_return_4,log_volume_4,log_volatility_4,DJ_return_3,log_volume_3,log_volatility_3,DJ_return_2,log_volume_2,log_volatility_2,DJ_return_1,log_volume_1,log_volatility_1
1962-12-10,-0.55,0.18,-4.36,0.91,1.52,-2.53,0.43,2.28,-2.42,-0.43,0.94,-2.37,0.05,0.22,-2.5
1962-12-11,0.91,1.52,-2.53,0.43,2.28,-2.42,-0.43,0.94,-2.37,0.05,0.22,-2.5,-1.3,0.61,-1.37
1962-12-12,0.43,2.28,-2.42,-0.43,0.94,-2.37,0.05,0.22,-2.5,-1.3,0.61,-1.37,-0.01,-0.01,-1.51


- We now reshape the data.
- ``to_numpy().reshape((-1,5,3))`` tells `NumPy` to reshape the data into a 3D array whose second dimension has size 5 and whose third dimension has size 3; the `-1` lets `NumPy` infer the size of the first dimension. In other words, each row of 15 values becomes one 5x3 matrix.
- The result is 6046 matrices of size 5x3.
- We show the first two.
    - In each matrix, the first row is lag 5 of DJ_return, log_volume, and log_volatility.
    - The second row is lag 4 of DJ_return, log_volume, and log_volatility, and so on.

In [38]:
X_rnn = X.to_numpy().reshape((-1,5,3))
print(X_rnn.shape)
X_rnn[:2,:]

(6046, 5, 3)


array([[[-0.54982334,  0.17507497, -4.35707786],
        [ 0.90519995,  1.51729071, -2.52905765],
        [ 0.43481275,  2.28378937, -2.41803694],
        [-0.43139673,  0.93517558, -2.36652094],
        [ 0.04634026,  0.22477858, -2.5009701 ]],

       [[ 0.90519995,  1.51729071, -2.52905765],
        [ 0.43481275,  2.28378937, -2.41803694],
        [-0.43139673,  0.93517558, -2.36652094],
        [ 0.04634026,  0.22477858, -2.5009701 ],
        [-1.30412619,  0.60591805, -1.366028  ]]])
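The effect of the reshape is easiest to see on a single labelled row. In this sketch the entry `10 * lag + j` is a made-up marker for feature `j` at a given lag, so the position of each value after reshaping can be read off directly.

```python
import numpy as np

# One toy observation with 15 columns ordered lag 5, lag 4, ..., lag 1,
# with three features per lag; 10 * lag + j marks feature j at that lag.
row = np.array([[10 * lag + j for lag in range(5, 0, -1) for j in range(3)]])
print(row.shape)

# Row-major reshape: consecutive triples become the rows of a 5x3 matrix.
cube = row.reshape((-1, 5, 3))
print(cube.shape)
print(cube[0])
```

The first row of the matrix holds the three lag-5 values and the last row holds the lag-1 values, which is why the columns had to be reordered before reshaping.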

- We define class ``NYSEModel`` that inherits from `nn.Module`.
    - ``super(NYSEModel, self).__init__()`` calls the constructor of the parent class.
    - ``self.rnn = nn.RNN(3, 12, batch_first=True)`` initializes an RNN layer:
        - 3: The input size, i.e., the number of features at each time step.
        - 12: The number of hidden units.
        - `batch_first=True` specifies that the first dimension of the input is the batch dimension, so inputs have shape `(batch, 5, 3)`.
    - ``self.dense = nn.Linear(12, 1)`` initializes a fully connected (dense) layer.
        - 12: The input size, which matches the number of hidden units in the RNN.
        - 1: The output size, which is a single value.
    - ``self.dropout = nn.Dropout(0.1)`` initializes a dropout layer with a dropout rate of 10%.
    - We define ``forward()`` method that specifies how the input is passed through the layers.
        - ``val, h_n = self.rnn(x)``: This passes the input ``x`` through the RNN layer.
            - ``val`` is the output at each time step.
            - ``h_n`` is the hidden state at the final time step.
        - ``val = self.dense(self.dropout(val[:,-1]))``: This passes the output through the dense layer with dropout.
            - ``val[:,-1]`` extracts the output at the final time step.
        - ``torch.flatten(val)`` flattens the output of the dense layer to a 1D tensor.
        

In [58]:
class NYSEModel(nn.Module):
    def __init__(self):
        super(NYSEModel, self).__init__()
        self.rnn = nn.RNN(3,
                          12,
                          batch_first=True)
        self.dense = nn.Linear(12, 1)
        self.dropout = nn.Dropout(0.1)
    def forward(self, x):
        val, h_n = self.rnn(x)
        val = self.dense(self.dropout(val[:,-1]))
        return torch.flatten(val)
nyse_model = NYSEModel()
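The relationship between the two values returned by `nn.RNN()` can be checked on random input. This is a small sketch, assuming a single-layer, unidirectional RNN like the one in `NYSEModel`; the tensor `x` is random toy data.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn = nn.RNN(3, 12, batch_first=True)
x = torch.randn(4, 5, 3)   # 4 toy sequences, 5 time steps, 3 features

with torch.no_grad():
    val, h_n = rnn(x)

print(val.shape)   # outputs at every time step: (batch, steps, hidden)
print(h_n.shape)   # final hidden state: (layers, batch, hidden)

# For one unidirectional layer, the last-step output is the final hidden state.
print(torch.allclose(val[:, -1], h_n[0]))
```

Under these assumptions, `val[:,-1]` in `forward()` carries the same information as `h_n[0]`; the model passes it through dropout and the dense layer.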

- We form the training and test datasets.

In [79]:
datasets = []
for mask in [train, ~train]:
    X_rnn_t = torch.tensor( np.asarray(X_rnn[mask].astype(np.float32)) )
    Y_t = torch.tensor( np.asarray(Y[mask].astype(np.float32)) )
    datasets.append( TensorDataset(X_rnn_t, Y_t) )
nyse_train, nyse_test = datasets

- We inspect the summary.

In [80]:
torchinfo.summary(nyse_model,
        input_data=X_rnn_t,
        col_names=['input_size',
                   'output_size',
                   'num_params'])

Layer (type:depth-idx)                   Input Shape               Output Shape              Param #
NYSEModel                                [1770, 5, 3]              [1770]                    --
├─RNN: 1-1                               [1770, 5, 3]              [1770, 5, 12]             204
├─Dropout: 1-2                           [1770, 12]                [1770, 12]                --
├─Linear: 1-3                            [1770, 12]                [1770, 1]                 13
Total params: 217
Trainable params: 217
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 1.83
Input size (MB): 0.11
Forward/backward pass size (MB): 0.86
Params size (MB): 0.00
Estimated Total Size (MB): 0.97
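The parameter counts in the summary can be verified by hand. The sketch below counts the weights and biases of a single-layer `nn.RNN(3, 12)` (which keeps two bias vectors) and of `nn.Linear(12, 1)`.

```python
input_size, hidden_size = 3, 12

rnn_params = (hidden_size * input_size      # input-to-hidden weights
              + hidden_size * hidden_size   # hidden-to-hidden weights
              + 2 * hidden_size)            # two bias vectors
dense_params = hidden_size * 1 + 1          # weights plus bias of nn.Linear(12, 1)

print(rnn_params, dense_params, rnn_params + dense_params)
```

This reproduces the 204, 13, and 217 reported by `torchinfo.summary()`.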

- We pass the test data as the validation set, which lets us monitor and plot the model's progress on the test set during fitting. It should not be used for early stopping, since that would bias the test performance.
- Both datasets are placed into a data module with a batch size of 64.

In [81]:
nyse_dm = SimpleDataModule(train_dataset=nyse_train,
                           test_dataset=nyse_test,
                           num_workers=min(4, 10),
                           validation=nyse_test,
                           batch_size=64)

- We run some data through our model to check the sizes match up correctly.

In [82]:
for idx, (x, y) in enumerate(nyse_dm.train_dataloader()):
    out = nyse_model(x)
    print(y.size(), out.size())
    if idx >= 2:
        break

torch.Size([64]) torch.Size([64])
torch.Size([64]) torch.Size([64])
torch.Size([64]) torch.Size([64])
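As a rough check on how many mini-batches make up one epoch, the arithmetic can be sketched as follows. It assumes the data loader keeps the final partial batch (the default behaviour of a `torch` `DataLoader`), and takes the row counts from the outputs above: 6046 rows after dropping missing values, of which 1770 are test rows.

```python
import math

n_total, n_test = 6046, 1770   # from X_rnn.shape and the torchinfo summary
n_train = n_total - n_test     # training observations
batch_size = 64

n_batches = math.ceil(n_train / batch_size)
last_batch = n_train - (n_batches - 1) * batch_size

print(n_train, n_batches, last_batch)
```

Under that assumption, all batches have 64 observations except the last one, so the model's output size must not be hard-coded to 64.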


- We set up a trainer for a regression problem.
- We request the $R^2$ metric to be computed at each epoch.

In [83]:
nyse_optimizer = RMSprop(nyse_model.parameters(), lr=0.001)
nyse_module = SimpleModule.regression(nyse_model,
                                      optimizer=nyse_optimizer,
                                      metrics={'r2':R2Score()})

- The test $R^2$ of the RNN (about 0.41, reported at the end of the output below) is very similar to that of the linear AR model (0.413).

In [84]:
nyse_trainer = Trainer(deterministic=True,
                       max_epochs=200,
                       enable_progress_bar=False,
                       callbacks=[ErrorTracker()])
nyse_trainer.fit(nyse_module,
                 datamodule=nyse_dm)
nyse_trainer.test(nyse_module,
                  datamodule=nyse_dm)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name  | Type      | Params | Mode 
--------------------------------------------
0 | model | NYSEModel | 217    | train
1 | loss  | MSELoss   | 0      | train
--------------------------------------------
217       Trainable params
0         Non-trainable params
217       Total params
0.001     Total estimated model params size (MB)
5         Modules in train mode
0         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=200` reached.


[{'test_loss': 0.6264728307723999, 'test_r2': 0.40544670820236206}]

## Non-Linear AR Model

- Replacing the `nn.RNN()` layer with a `nn.Flatten()` layer followed by a hidden layer would give a nonlinear AR model.
- If in addition we excluded the hidden layer, this would be equivalent to our earlier linear AR model.
- Here we fit a nonlinear AR model using `X_day`, which includes the `day_of_week` indicators.
- We first create our training and test datasets and a corresponding data module.
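To make the remark about excluding the hidden layer concrete, here is a minimal sketch of such a network. It is not fitted in this notebook; the name `linear_ar` and the random input `x` are illustrative assumptions, and the input uses the 5x3 layout of `X_rnn` rather than `X_day`.

```python
import torch
from torch import nn

# Flatten each 5x3 observation to 15 values, then apply one linear map:
# a linear AR model written as a network with no hidden layer.
linear_ar = nn.Sequential(nn.Flatten(), nn.Linear(15, 1))

x = torch.randn(8, 5, 3)               # 8 toy observations
out = torch.flatten(linear_ar(x))
n_params = sum(p.numel() for p in linear_ar.parameters())

print(out.shape)
print(n_params)   # 15 coefficients plus one intercept
```

The 16 parameters correspond to the 15 coefficients and the intercept of the linear regression fitted earlier, although gradient-based training need not reach exactly the least-squares solution.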

In [85]:
datasets = []
for mask in [train, ~train]:
    X_day_t = torch.tensor(
                   np.asarray(X_day[mask]).astype(np.float32))
    Y_t = torch.tensor(np.asarray(Y[mask]).astype(np.float32))
    datasets.append(TensorDataset(X_day_t, Y_t))
day_train, day_test = datasets

In [None]:
day_dm = SimpleDataModule(train_dataset=day_train,
                          test_dataset=day_test,
                          num_workers=min(4, 10),
                          validation=day_test,
                          batch_size=64)

- We define class ``NonLinearARModel`` that inherits from `nn.Module`.
    - ``super(NonLinearARModel, self).__init__()`` calls the constructor of the parent class.
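    - ``self._forward = nn.Sequential(...)`` initializes a sequential container that holds a sequence of layers.
    - ``nn.Flatten()`` flattens each observation to a 1D vector; `X_day` is already two-dimensional, so here it leaves the input unchanged.
    - ``nn.Linear(20, 32)`` is a fully connected (dense) layer with 20 input features (15 lagged values plus 5 `day_of_week` indicators) and 32 output features.
    - ``nn.ReLU()`` applies the ReLU activation, and ``nn.Dropout(0.5)`` applies dropout with a rate of 50%.
    - ``nn.Linear(32, 1)`` maps the 32 hidden units to a single output.
    - We define ``forward()`` method that specifies how the input is passed through the layers.
        - ``self._forward(x)`` passes the input ``x`` through the layers in the sequential container.
        - ``torch.flatten()`` flattens the output of the final dense layer to a 1D tensor.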

In [88]:
class NonLinearARModel(nn.Module):
    def __init__(self):
        super(NonLinearARModel, self).__init__()
        self._forward = nn.Sequential(nn.Flatten(),
                                      nn.Linear(20, 32),
                                      nn.ReLU(),
                                      nn.Dropout(0.5),
                                      nn.Linear(32, 1))
    def forward(self, x):
        return torch.flatten(self._forward(x))


In [89]:
nl_model = NonLinearARModel()
nl_optimizer = RMSprop(nl_model.parameters(), lr=0.001)
nl_module = SimpleModule.regression(nl_model,
                                    optimizer=nl_optimizer,
                                    metrics={'r2':R2Score()})

In [90]:
nl_trainer = Trainer(deterministic=True,
                     max_epochs=20,
                     enable_progress_bar=False,
                     callbacks=[ErrorTracker()])
nl_trainer.fit(nl_module, datamodule=day_dm)
nl_trainer.test(nl_module, datamodule=day_dm) 

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs



  | Name  | Type             | Params | Mode 
---------------------------------------------------
0 | model | NonLinearARModel | 705    | train
1 | loss  | MSELoss          | 0      | train
---------------------------------------------------
705       Trainable params
0         Non-trainable params
705       Total params
0.003     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode
`Trainer.fit` stopped: `max_epochs=20` reached.


[{'test_loss': 0.560681939125061, 'test_r2': 0.46788549423217773}]
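The 705 parameters reported by the trainer follow from the two dense layers alone, since `nn.Flatten()`, `nn.ReLU()`, and `nn.Dropout()` have no parameters. A short sketch of the count (the list `layer_sizes` is just a convenience for this example):

```python
# (in_features, out_features) of the two nn.Linear layers in NonLinearARModel
layer_sizes = [(20, 32), (32, 1)]

# Each dense layer has in * out weights plus out biases.
per_layer = [n_in * n_out + n_out for n_in, n_out in layer_sizes]

print(per_layer, sum(per_layer))
```

The hidden layer accounts for 672 of the parameters, compared with 217 in total for the RNN model.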

- We see the test $R^2$ (0.468) is a slight improvement over the linear AR model that also includes `day_of_week` (0.460).