# ML Reproducibility Challenge 2022
## TS2Vec: Towards Universal Representation of Time Series

The original paper can be found [here](https://arxiv.org/pdf/2106.10466.pdf)

The original code implementation is available on [GitHub](https://github.com/yuezhihan/ts2vec.git)

ML Reproducibility Challenge 2022 [homepage](https://paperswithcode.com/rc2022)

In [None]:
!git clone https://github.com/yuezhihan/ts2vec.git # clones TS2Vec repo
!python3 --version # want: 3.8.15

## Preparing the Colab Environment

Our Google Colab environment came with a Python 3.7 kernel installed. TS2Vec requires Python 3.8, so we had to prepare the environment accordingly.

The cells below install Python 3.8 and the library requirements.
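
To confirm which interpreter version a cell actually sees, a small check like the one below can help. It is a minimal sketch: the helper name `python_is_at_least` is hypothetical, and it inspects the interpreter running the cell, which may differ from the one that `!python3` invokes in a shell.

```python
import sys

def python_is_at_least(major, minor):
    """Return True if the running interpreter is at least major.minor."""
    return sys.version_info[:2] >= (major, minor)

# Report whether this interpreter meets the Python 3.8 requirement.
print(sys.version.split()[0], "OK" if python_is_at_least(3, 8) else "too old")
```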

In [None]:
!sudo apt install python3.8
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 1
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2
!python3 --version # want: 3.8.15

In [None]:
# Updating python version breaks pip -- get pip
!curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
!python3 get-pip.py

# install requirements - their requirements.txt file didn't work
!pip3 install torch
!pip3 install numpy
!pip3 install scikit_learn==0.24.2
!pip3 install bottleneck
!pip3 install pandas
!pip3 install statsmodels
!pip3 install tables

In [None]:
# change directory to the cloned repo (keep the comment off the %cd line,
# otherwise the magic can treat it as part of the path)
%cd ts2vec

## Reproducing paper results

The performance metrics presented in the results section of the paper can be reproduced by running the scripts that TS2Vec provides in the $\texttt{scripts/}$ folder.

Datasets must be manually prepared according to instructions in the TS2Vec $\texttt{README.md}$.
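
Before launching a run, it can save time to check that the prepared files are where the loaders will look. The sketch below assumes each dataset lives at `datasets/<name>.csv`; that layout and the helper name `missing_datasets` are assumptions, so adjust them to match the README instructions for the dataset in question.

```python
from pathlib import Path

def missing_datasets(names, root="datasets"):
    """Return the dataset names that have no `<root>/<name>.csv` file."""
    root = Path(root)
    return [name for name in names if not (root / f"{name}.csv").is_file()]

# List which of the datasets we plan to run are not in place yet.
print(missing_datasets(["ETTh1", "ARB048_format"]))
```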

In [None]:
# sample reproduction line
# !python3 -u train.py ETTh1 forecast_univar --loader forecast_csv_univar --repr-dims 320 --max-threads 8 --seed 42 --eval
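
To repeat the sample line over several seeds, the command can be assembled programmatically. This is a sketch that only reuses the flags shown in the sample line above; the helper name `build_train_command` and the choice of seeds are hypothetical.

```python
def build_train_command(dataset, run_name, loader,
                        repr_dims=320, max_threads=8, seed=42):
    """Build the argument list for one train.py run with evaluation enabled."""
    return [
        "python3", "-u", "train.py", dataset, run_name,
        "--loader", loader,
        "--repr-dims", str(repr_dims),
        "--max-threads", str(max_threads),
        "--seed", str(seed),
        "--eval",
    ]

# Print one command per seed; pass each list to subprocess.run to execute it.
for seed in (0, 1, 2):
    cmd = build_train_command("ETTh1", "forecast_univar",
                              "forecast_csv_univar", seed=seed)
    print(" ".join(cmd))
```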

## Running TS2Vec on a new dataset

The dataset consists of hydrologic data (water depth) measured over the course of 18 months. The data were provided by Open-storm; see the [Grafana dashboard](http://ec2-3-142-80-107.us-east-2.compute.amazonaws.com:3000/d/RgZSCbz7zz/honey-creek-at-dexter?orgId=1&from=now-30d&to=now).
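
A new dataset has to be written in a layout the loader can read. The sketch below assumes the `forecast_csv` loader expects a CSV whose first column is named `date`, followed by the measured values; check `datautils.py` in the cloned repo before relying on this. The helper `write_forecast_csv`, the column name `depth`, the file name, and the numbers are all placeholders for illustration, not our data.

```python
import csv
from datetime import datetime, timedelta

def write_forecast_csv(path, timestamps, depths, value_name="depth"):
    """Write timestamped values as a CSV with a leading `date` column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", value_name])
        for ts, depth in zip(timestamps, depths):
            writer.writerow([ts.strftime("%Y-%m-%d %H:%M:%S"), depth])

# Three made-up hourly readings, only to show the resulting file layout.
start = datetime(2022, 1, 1)
timestamps = [start + timedelta(hours=i) for i in range(3)]
write_forecast_csv("example_depth.csv", timestamps, [0.42, 0.40, 0.45])
```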

In [None]:
# run TS2Vec forecasting on hydrologic dataset with all default parameters
# !python3 train.py ARB048_format ARB_run --loader forecast_csv --eval

## Visualizing Forecasting Performance

To visualize the forecast predictions, we had to modify several functions from the $\texttt{tasks/}$ folder. This step requires a model that has already been trained on the hydrologic data.

In [None]:
# import functions from TS2Vec 
from ts2vec import TS2Vec
from tasks.forecasting import generate_pred_samples, cal_metrics

# import libraries
import datautils
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.linear_model import Ridge


In [None]:
# load the model 
model = TS2Vec(input_dims=9)
file_path = 'training/ARB048_format__ARB_run_20221024_233357/model.pkl' # modify based on output of training step
model.load(file_path)

In [None]:
# prepare the datasets
data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols = datautils.load_forecast_csv('ARB048_format')

In [None]:
# Plot time-encoded input data
fig, ax = plt.subplots(figsize=(12,6),nrows=2, sharex=True)
x = np.linspace(0,data.shape[1]-1, data.shape[1])
ax[0].scatter(x, data[0,:,-1], marker='.')
ax[0].set_title("Original Hydrologic Timeseries")
for i in range(9):
    ax[1].scatter(x, data[0,:,i], marker='.', alpha=0.5, label=i)
ax[1].set_title("Expanded dimensional data - Datetime elements separated out as features")
ax[1].legend()
plt.tight_layout()
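
The second panel shows the datetime elements separated out as extra feature columns next to the measured values. The sketch below illustrates that kind of expansion on a toy series. The seven calendar fields chosen here and the helper names are assumptions for illustration; they are not taken from the TS2Vec loader, and the loader may also scale these columns.

```python
from datetime import datetime, timedelta
import numpy as np

def datetime_features(timestamps):
    """Return one numeric column per calendar component of each timestamp."""
    rows = [
        [ts.minute, ts.hour, ts.weekday(), ts.day,
         ts.timetuple().tm_yday, ts.month, ts.isocalendar()[1]]
        for ts in timestamps
    ]
    return np.asarray(rows, dtype=float)

def with_time_features(timestamps, values):
    """Prepend the calendar columns to the measured value column(s)."""
    values = np.asarray(values, dtype=float).reshape(len(timestamps), -1)
    return np.concatenate([datetime_features(timestamps), values], axis=1)

# Four hourly toy readings become a (4, 8) array: 7 time columns + 1 value.
start = datetime(2022, 1, 1)
stamps = [start + timedelta(hours=i) for i in range(4)]
expanded = with_time_features(stamps, [1.0, 2.0, 3.0, 4.0])
print(expanded.shape)
```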

In [None]:
from sklearn.model_selection import train_test_split  # needed for the subsampling below

def fit_ridge(train_features, train_y, valid_features, valid_y, MAX_SAMPLES=100000):
    # If the training set is too large, subsample MAX_SAMPLES examples
    if train_features.shape[0] > MAX_SAMPLES:
        split = train_test_split(
            train_features, train_y,
            train_size=MAX_SAMPLES, random_state=0
        )
        train_features = split[0]
        train_y = split[2]
    if valid_features.shape[0] > MAX_SAMPLES:
        split = train_test_split(
            valid_features, valid_y,
            train_size=MAX_SAMPLES, random_state=0
        )
        valid_features = split[0]
        valid_y = split[2]
    
    alphas = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
    valid_results = []
    for alpha in alphas:
        lr = Ridge(alpha=alpha).fit(train_features, train_y)
        valid_pred = lr.predict(valid_features)
        score = np.sqrt(((valid_pred - valid_y) ** 2).mean()) + np.abs(valid_pred - valid_y).mean()
        valid_results.append(score)
    best_alpha = alphas[np.argmin(valid_results)]
    
    lr = Ridge(alpha=best_alpha)
    lr.fit(train_features, train_y)
    return lr
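
The function above picks the regularization strength whose validation predictions minimize RMSE plus MAE. The sketch below isolates that selection rule on toy numbers; the helper names and the example scores are hypothetical.

```python
import numpy as np

def rmse_plus_mae(pred, target):
    """Validation score used for model selection: RMSE + MAE (lower is better)."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt((err ** 2).mean()) + np.abs(err).mean())

def pick_best(candidates, scores):
    """Return the candidate whose score is smallest."""
    return candidates[int(np.argmin(scores))]

# Toy example: score one prediction, then choose among three candidate alphas.
print(rmse_plus_mae([1, 2, 3], [1, 2, 5]))
print(pick_best([0.1, 1, 10], [3.0, 1.5, 2.0]))
```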

In [None]:
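def mod_eval_forecasting(model, data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols):
    """Modified forecasting evaluation for plotting.

    Returns two dicts keyed by prediction length: the test predictions and the
    test labels, each reshaped to (n_series, n_windows, pred_len, n_features).
    The metric computation is left commented out below.
    """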
    padding = 200
    
    t = time.time()
    all_repr = model.encode(
        data,
        casual=True,
        sliding_length=1,
        sliding_padding=padding,
        batch_size=256
    )
    ts2vec_infer_time = time.time() - t
    
    train_repr = all_repr[:, train_slice]
    valid_repr = all_repr[:, valid_slice]
    test_repr = all_repr[:, test_slice]
    
    train_data = data[:, train_slice, n_covariate_cols:]
    valid_data = data[:, valid_slice, n_covariate_cols:]
    test_data = data[:, test_slice, n_covariate_cols:]
    
    ours_result = {}
    lr_train_time = {}
    lr_infer_time = {}
    out_log = {}
    collect_test_pred = {}
    collect_test_labels = {}
    for pred_len in pred_lens:
        train_features, train_labels = generate_pred_samples(train_repr, train_data, pred_len, drop=padding)
        valid_features, valid_labels = generate_pred_samples(valid_repr, valid_data, pred_len)
        test_features, test_labels = generate_pred_samples(test_repr, test_data, pred_len)
        
        t = time.time()
        lr = fit_ridge(train_features, train_labels, valid_features, valid_labels)
        lr_train_time[pred_len] = time.time() - t
        
        t = time.time()
        test_pred = lr.predict(test_features)
        lr_infer_time[pred_len] = time.time() - t

        ori_shape = test_data.shape[0], -1, pred_len, test_data.shape[2]
        test_pred = test_pred.reshape(ori_shape)
        test_labels = test_labels.reshape(ori_shape)
        collect_test_pred[pred_len] = test_pred
        collect_test_labels[pred_len] = test_labels
        
    #     test_pred_inv = scaler.inverse_transform(test_pred)
    #     test_labels_inv = scaler.inverse_transform(test_labels)
    #     if test_data.shape[0] > 1:
    #         test_pred_inv = scaler.inverse_transform(test_pred.swapaxes(0, 3)).swapaxes(0, 3)
    #         test_labels_inv = scaler.inverse_transform(test_labels.swapaxes(0, 3)).swapaxes(0, 3)
    #     else:
    #         test_pred_inv = scaler.inverse_transform(test_pred)
    #         test_labels_inv = scaler.inverse_transform(test_labels)
            
    #     out_log[pred_len] = {
    #         'norm': test_pred,
    #         'raw': test_pred_inv,
    #         'norm_gt': test_labels,
    #         'raw_gt': test_labels_inv
    #     }
    #     ours_result[pred_len] = {
    #         'norm': cal_metrics(test_pred, test_labels),
    #         'raw': cal_metrics(test_pred_inv, test_labels_inv)
    #     }
        
    # eval_res = {
    #     'ours': ours_result,
    #     'ts2vec_infer_time': ts2vec_infer_time,
    #     'lr_train_time': lr_train_time,
    #     'lr_infer_time': lr_infer_time
    # }
    # return out_log, eval_res, 
    return collect_test_pred, collect_test_labels

In [None]:
# return test predictions and labels used to calculate the performance metrics
test_pred, test_labels = mod_eval_forecasting(model, data, train_slice, valid_slice, test_slice, scaler, pred_lens, n_covariate_cols)

In [None]:
predictions = test_pred[24]
labels= test_labels[24]
cal_metrics(predictions, labels) #confirm performance metrics are as expected

In [None]:
# plot prediction forecast for pred_length=24
fig, ax = plt.subplots(figsize=(12,6),nrows=2, sharex=True, sharey=True)
ax[0].plot(labels[0,:,0,0], labels[0,:,0,1])
ax[1].scatter(predictions[0,:,0,0], predictions[0,:,0,1], marker='.', c='orange')
ax[0].set_title('Scaled time-masked hydrologic data')
ax[1].set_title('Prediction, pred_length=24')
predictions[0,:,0,0] - labels[0,:,0,0]
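
The arrays plotted above have shape `(n_series, n_windows, pred_len, n_features)`, so `[0, :, 0, k]` selects the first forecast step of feature `k` for every window. The sketch below builds such windows for a toy series to make the indexing concrete. It is an illustration under the assumption that each window holds the `pred_len` steps following a given time step; the helper `make_horizon_windows` is hypothetical and is not the TS2Vec `generate_pred_samples` function.

```python
import numpy as np

def make_horizon_windows(series, pred_len):
    """Stack the next `pred_len` steps after each time step.

    series:  array of shape (n_series, n_steps, n_features)
    returns: array of shape (n_series, n_steps - pred_len, pred_len, n_features)
    """
    n_windows = series.shape[1] - pred_len
    return np.stack(
        [series[:, t + 1 : t + 1 + pred_len] for t in range(n_windows)],
        axis=1,
    )

# A toy series 0..9 with one feature and a 3-step horizon.
toy = np.arange(10, dtype=float).reshape(1, 10, 1)
windows = make_horizon_windows(toy, pred_len=3)
one_step_ahead = windows[0, :, 0, 0]  # first forecast step of every window
print(windows.shape, one_step_ahead)
```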

In [None]:
# plot forecast predictions for all pred_length values overlaid
pred24= test_pred[24]
pred48= test_pred[48]
pred96= test_pred[96]
pred288= test_pred[288]
pred672= test_pred[672]
fig, ax = plt.subplots(figsize=(12,6),nrows=2, sharex=True, sharey=True)
ax[0].plot(labels[0,:,0,0], labels[0,:,0,1])
ax[1].scatter(pred672[0,:,0,0], pred672[0,:,0,1], marker='.', alpha=0.3, label='672')
ax[1].scatter(pred288[0,:,0,0], pred288[0,:,0,1], marker='.', alpha=0.3, label='288')
ax[1].scatter(pred96[0,:,0,0], pred96[0,:,0,1], marker='.', alpha=0.3, label='96')
ax[1].scatter(pred48[0,:,0,0], pred48[0,:,0,1], marker='.', alpha=0.3, label='48')
ax[1].scatter(pred24[0,:,0,0], pred24[0,:,0,1], marker='.', alpha=0.3, label='24')
ax[1].legend()
ax[0].set_title('Scaled time-masked hydrologic data')
ax[1].set_title('Prediction')