

In your project, you will pick a dataset (time-series) and an associated problem that can be solved via sequence models. You must describe why you need sequence models to solve this problem. Include a link to the dataset source. Next, you should pick an RNN framework that you would use to solve this problem (This framework can be in TensorFlow, PyTorch or any other Python Package).

For this problem, I will use Ethereum-USD exchange-rate data. The dataset records the price of the Ethereum cryptocurrency at 1-minute intervals. For each minute it contains market data for the currency: the open, high, low, and close prices and the trade volume.

The dataset can be accessed at:

https://www.kaggle.com/datasets/patrickgendotti/btc-and-eth-1min-price-history


In [None]:
# same deal for gdrive and kaggle
from google.colab import drive
drive.mount('/content/drive')

!rm -rf ~/.kaggle
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/.kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!pip install -q kaggle


# download 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# download the dataset and unzip it into a fresh folder

!rm -rf dataset
!kaggle datasets download -d patrickgendotti/btc-and-eth-1min-price-history
!mkdir -p dataset
!unzip btc-and-eth-1min-price-history.zip -d dataset

In [None]:
!pip install livelossplot -q

In [None]:
import pandas as pd


raw_data = pd.read_csv('dataset/ETH_1min.csv')

raw_data.head()

In [None]:
# it's probably sorted already, but sort by timestamp just to make sure
raw_data = raw_data.sort_values(by=['Unix Timestamp'])
# we don't need the date or symbol
# raw_data = raw_data.drop(['Date','Symbol','Unix Timestamp'],axis=1)

raw_data.describe()

In [None]:
# the dataset is too large to work with comfortably on Colab, so aggregate it into 30-minute intervals
INTERVAL = 30

raw_data = pd.DataFrame([
    {
        'Open':raw_data.Open.values[beg],
        'High':raw_data.High.values[beg:beg+INTERVAL].max(),
        'Low':raw_data.Low.values[beg:beg+INTERVAL].min(),
        'Close':raw_data.Close.values[beg+INTERVAL-1],
        'Volume':raw_data.Volume.values[beg:beg+INTERVAL].sum(),
    }
    for beg in range(0,len(raw_data)-INTERVAL+1,INTERVAL)  # +1 keeps the last complete bin
])
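
The same per-bin aggregation can also be written with a pandas `groupby`. The following is only a sketch on synthetic data: `resample_ohlcv` and the `demo` frame are hypothetical names, and the helper assumes the rows are already sorted and evenly spaced at one minute, so that 30 consecutive rows make one bin.

```python
import numpy as np
import pandas as pd

INTERVAL = 30

def resample_ohlcv(df, interval=INTERVAL):
    """Aggregate consecutive rows into bins of `interval` rows (hypothetical helper)."""
    n_full = (len(df) // interval) * interval   # keep only complete bins
    df = df.iloc[:n_full]
    bins = np.arange(n_full) // interval        # bin id for every row
    return df.groupby(bins).agg(
        Open=('Open', 'first'),     # first open in the bin
        High=('High', 'max'),       # highest high in the bin
        Low=('Low', 'min'),         # lowest low in the bin
        Close=('Close', 'last'),    # last close in the bin
        Volume=('Volume', 'sum'),   # total volume in the bin
    )

# tiny synthetic frame standing in for the real 1-minute data
demo = pd.DataFrame({
    'Open': np.arange(60.0),
    'High': np.arange(60.0) + 1,
    'Low': np.arange(60.0) - 1,
    'Close': np.arange(60.0) + 0.5,
    'Volume': np.ones(60),
})
resampled = resample_ohlcv(demo)
print(resampled)
```

Note that pandas' `'first'` and `'last'` aggregations skip missing values, so this only matches the positional indexing of the loop when the columns contain no NaNs.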

In [None]:
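# NOTE: norm_data is not defined in any cell above; as an assumption, start from a copy of the resampled data
norm_data = raw_data.copy()
# the price range (high minus low) within each interval
norm_data['Delta'] = norm_data['High'] - norm_data['Low']
# this will be our target variable: the difference between the current closing price and the next one
price_change = norm_data.Close.values[1:] - norm_data.Close.values[:-1]
# the last row has no next interval, so drop it; .copy() avoids pandas' SettingWithCopyWarning
norm_data = norm_data.iloc[:-1,:].copy()
norm_data['PriceChange'] = price_change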

# target variable is "Close", which is the closing price. So we will move it to the back
# norm_data = norm_data[[c for c in norm_data.columns if not c == 'Close'] + ['Close']]


# np.arange
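
As a small check of the target definition, the next-interval price change can also be computed with `shift`. This is a sketch on a made-up `Close` column (the `demo` frame and its values are illustrative, not taken from the dataset).

```python
import pandas as pd

# made-up closing prices for four consecutive intervals
demo = pd.DataFrame({'Close': [10.0, 12.0, 11.0, 15.0]})

# next closing price minus the current closing price
demo['PriceChange'] = demo['Close'].shift(-1) - demo['Close']

# the last row has no next interval, so its target is NaN and is dropped
demo = demo.dropna(subset=['PriceChange'])
print(demo)
```

Each row's target looks one interval into the future, which is why the final row has to be discarded.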

In [None]:
# examining feature distribution

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# numerical = X_raw.columns[X_raw.dtypes.isin([int,float])]
# numerical = dataset.select_dtypes(exclude=['object','bool'])
# X_raw.dtypes

fig,axes = plt.subplots(ncols=4,nrows=2,figsize=(12,8))

for c,ax in zip(norm_data.columns,axes.flatten()):
    norm_data[c].plot(ax=ax,title=c)
    # sns.lineplot(data=X_raw[c],ax=ax)


fig.tight_layout()

In [None]:
for c in norm_data.columns:
    print(f'Feature: {c} -- {norm_data[c].isna().sum()/norm_data.shape[0] * 100 :.4f}% values are NA')

As expected, Open, High, and Low follow the same pattern. Volume seems to die down over time. To check whether the High/Low range changes over time, I add a new column (Delta, the difference between High and Low).

In [29]:
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

data_pt = torch.tensor(norm_data.values).float().cuda()

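def build_sequence(data,bin_size=100):
    # each sample is a window of bin_size consecutive rows, with all columns (including past PriceChange)
    X = torch.stack([ data[i:i+bin_size,:] for i in range(data.size(0) - bin_size) ])
    # the target is the last column (PriceChange) of the row right after each window
    y = data[bin_size:,-1]
    return X,y

# the dataset is huge, so we load the full tensor to cuda first and then simply slice out what we need
X_all,y_all = build_sequence(data_pt)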



X_train,X_test,y_train,y_test = train_test_split(X_all,y_all,test_size=0.1,shuffle=False)
X_train,X_dev,y_train,y_dev = train_test_split(X_train,y_train,test_size=0.1 / 0.9,shuffle=False)



# norm_data = pd.DataFrame(MinMaxScaler().fit_transform(raw_data),columns=raw_data.columns)

# idxs = np.arange(y_raw.shape[0])

# train_idx = []
# dev_idx = []
# test_idx = []


# 100 10 10 



# X_dev,y_dev = build_sequence(dev_data.values).float().cuda()
# X_test,y_test = build_sequence(test_data.values).float().cuda()
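
The windowing step can also be expressed with `Tensor.unfold`, which returns a view instead of stacking copies. The sketch below runs on a small CPU tensor and compares the result against the list-comprehension version; `build_sequence_unfold` and the `demo` tensor are hypothetical names used only for illustration.

```python
import torch

def build_sequence_unfold(data, bin_size=100):
    # unfold gives shape (n_windows, n_features, bin_size); put the time axis before the features
    windows = data.unfold(0, bin_size, 1).permute(0, 2, 1)
    X = windows[:-1]            # the last window has no following row to predict
    y = data[bin_size:, -1]     # last column of the row right after each window
    return X, y

# 10 rows, 4 features, filled with 0..39 so that values are easy to trace
demo = torch.arange(40, dtype=torch.float32).reshape(10, 4)
X_demo, y_demo = build_sequence_unfold(demo, bin_size=3)

# reference: the stacking approach applied to the same tensor
X_ref = torch.stack([demo[i:i + 3, :] for i in range(demo.size(0) - 3)])
print(X_demo.shape, y_demo.shape, torch.equal(X_demo, X_ref))
```

Because the windows overlap, materializing them multiplies memory use by roughly `bin_size`, so a view may help when the full tensor has to live on the GPU.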

In [None]:
# To prepare the data for our models, need to create a dataset and loader

import torch
from torch.utils.data import DataLoader,Dataset

class ETHData(Dataset):
    def __init__(self,X,y):
        # X: windows of shape (n_samples, bin_size, n_features); y: one target per window
        self.X = X
        self.y = y
    def __len__(self):
        return self.y.size(0)
    
    def __getitem__(self,idx):
        return (
            # notice that we are including past 'Close' column into the input data as well
            self.X[idx],
            self.y[idx]
        )

# train_data = ETHData(train_data)
# dev_data = ETHData(dev_data)
# test_data = ETHData(test_data)
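
To show how such a dataset feeds a `DataLoader`, here is a minimal sketch on random CPU tensors. `WindowDataset` is a hypothetical stand-in that mirrors `ETHData`, and the sizes (20 samples, 100 time steps, 7 features) are assumptions chosen for the example.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class WindowDataset(Dataset):  # hypothetical stand-in mirroring ETHData
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return self.y.size(0)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

X_demo = torch.randn(20, 100, 7)   # (n_samples, bin_size, n_features)
y_demo = torch.randn(20)           # one target per window

# shuffle=False keeps the batches in chronological order
loader = DataLoader(WindowDataset(X_demo, y_demo), batch_size=8, shuffle=False)
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
print(batch_shapes)
```

With 20 samples and a batch size of 8, the loader yields two full batches and a final batch of 4.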

In [13]:
from torch import nn,optim
from livelossplot import PlotLosses
from tqdm.auto import tqdm
from sklearn.metrics import r2_score,mean_squared_error

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, bin_size =100):
        super(RNN, self).__init__()
        
        # Number of hidden dimensions
        self.hidden_size = hidden_size
        
        # Number of hidden layers
        self.num_layers = num_layers
        
        # RNN https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True, nonlinearity='relu')
        self.out_layer = nn.Linear(hidden_size, 1)


        # self.loss_func = nn.MSELoss()
    
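    def forward(self, x):
        # Initialize hidden state with zeros, on the same device as the input
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)

        # Run the whole sequence through the RNN, then map the last time step's output to a single value
        out, hn = self.rnn(x, h0)
        out = self.out_layer(out[:, -1, :])
        # drop the trailing dimension: (batch, 1) -> (batch,)
        return out[:,0]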

    # named fit rather than train so that it does not shadow nn.Module.train(), which model.eval() relies on
    def fit(
        self,
        train_data,
        dev_data,
        num_epoch=100,
        lr=1e-4,
        wd=1e-3,
        loss_func=nn.MSELoss,
        extra_metrics=(r2_score,mean_squared_error)
    ):
        # assumption: train_data and dev_data are (X, y) tuples, matching ETHData(X, y)
        train_loader = DataLoader(ETHData(*train_data),batch_size=512,shuffle=False)
        dev_dset = ETHData(*dev_data)

        optimizer = optim.Adam(self.parameters(), lr=lr,weight_decay=wd)
        criterion = loss_func()

        liveloss = PlotLosses()

        for _ in range(num_epoch):
            all_losses = []
            y_real_train = []
            y_pred_train = []
            for x,y in tqdm(train_loader, desc='going through batches...',leave=False):
                optimizer.zero_grad()
                pred = self(x)
                loss = criterion(pred,y)
                loss.backward()
                optimizer.step()

                all_losses.append(loss.item())
                y_real_train.append(y)
                # detach so the stored predictions do not keep the autograd graph alive
                y_pred_train.append(pred.detach())

            all_real_train,all_pred_train = torch.hstack(y_real_train).cpu(),torch.hstack(y_pred_train).cpu()

            all_real_dev = dev_dset.y.cpu()
            all_pred_dev = self.predict(dev_dset.X).cpu()

            # log the mean training loss plus every metric on both the training and dev sets
            liveloss.update(dict(
                [
                    (loss_func.__name__, np.mean(all_losses))
                ] + [
                    (f'{m.__name__}_train',m(all_real_train,all_pred_train))
                    for m in extra_metrics
                ] + [
                    (f'{m.__name__}_dev',m(all_real_dev,all_pred_dev))
                    for m in extra_metrics
                ]
            ))
            liveloss.send()

    @torch.no_grad()
    def predict(self, X_input):
        return self.forward(X_input)

    @torch.no_grad()
    def score(self, X_input, y_input, metrics = (r2_score,) ):
        pred = self.predict(X_input).cpu()
        # the sklearn metrics need CPU data
        return {
            m.__name__: m(y_input.cpu(),pred)
            for m in metrics
        }

In [16]:
# size the input layer from the data instead of hard-coding the number of features
model = RNN(X_train.size(-1),10,1).cuda()
# model.fit((X_train,y_train),(X_dev,y_dev),num_epoch=100)
model.predict(X_dev)

## Task 1 (60 points):
### Part 1 (30 points): 
Implement your RNN either using an existing framework OR you can implement your own RNN cell structure. In either case, describe the structure of your RNN and the activation functions you are using for each time step and in the output layer. Define a metric you will use to measure the performance of your model.

NOTE: Performance should be measured both for the validation set and the test set.
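
The notebook imports `r2_score` and `mean_squared_error` from scikit-learn as its metrics. To make their definitions explicit, here is a plain NumPy sketch of both. The helper names `mse` and `r2` and the sample arrays are illustrative assumptions; in practice the same functions would be applied once to the validation predictions and once to the test predictions.

```python
import numpy as np

def mse(y_true, y_pred):
    # mean of the squared errors
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares around the mean)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# made-up targets and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])

# a constant prediction equal to the mean of the targets scores R^2 = 0
baseline = np.full_like(y_true, y_true.mean())

print(mse(y_true, y_pred), r2(y_true, y_pred), r2(y_true, baseline))
```

An R² near zero means the model does no better than always predicting the mean price change, which is a useful reference point for noisy financial targets.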

### Part 2 (35 points): 
Update your network from part 1 with first an LSTM and then a GRU based cell structure (You can treat both as 2 separate implementations). Re-do the training and performance evaluation. What are the major differences you notice? Why do you think those differences exist between the 3 implementations (basic RNN, LSTM and GRU)?

Note: In part 1 and 2, you must perform sufficient data-visualization, pre-processing and/or feature-engineering if needed. The overall performance visualization of the loss function should also be provided.
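
One possible way to structure Part 2 is to make the recurrent cell a constructor argument, so the rest of the model stays identical across the three variants. This is a sketch rather than the notebook's implementation: `SeqRegressor`, `CELLS`, and the sizes (7 features, hidden size 10) are assumptions, and `nn.RNN` here uses its default `tanh` nonlinearity instead of the `relu` chosen above.

```python
import torch
from torch import nn

CELLS = {'rnn': nn.RNN, 'lstm': nn.LSTM, 'gru': nn.GRU}

class SeqRegressor(nn.Module):  # hypothetical name
    def __init__(self, input_size, hidden_size, num_layers=1, cell='lstm'):
        super().__init__()
        self.rnn = CELLS[cell](input_size, hidden_size, num_layers, batch_first=True)
        self.out_layer = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # the initial state defaults to zeros when it is not passed;
        # nn.LSTM returns (h_n, c_n) as its second output, the others return h_n only
        out, _ = self.rnn(x)
        return self.out_layer(out[:, -1, :])[:, 0]

x = torch.randn(4, 100, 7)  # (batch, bin_size, n_features)
shapes = {name: SeqRegressor(7, 10, cell=name)(x).shape for name in CELLS}
n_params = {
    name: sum(p.numel() for p in SeqRegressor(7, 10, cell=name).rnn.parameters())
    for name in CELLS
}
print(shapes, n_params)
```

For the same hidden size, the LSTM layer has four times and the GRU layer three times as many recurrent parameters as the basic RNN layer, because each gate has its own weights and biases.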

### Part 3 (10 points): 
Can you use a traditional feed-forward network to solve the same problem? Why or why not?

Hint: Can time series data be converted to usual features that can be used as input to a feed-forward network?
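
One reading of the hint is that each window of past intervals can be flattened into a single feature vector, so every time step of every feature becomes an ordinary input column. The sketch below assumes windows of 100 steps with 7 features and an arbitrary hidden width of 64; none of these sizes are prescribed by the assignment.

```python
import torch
from torch import nn

bin_size, n_features = 100, 7

mlp = nn.Sequential(
    nn.Flatten(),                          # (batch, bin_size, n_features) -> (batch, bin_size * n_features)
    nn.Linear(bin_size * n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(4, bin_size, n_features)   # a batch of 4 random windows
pred = mlp(x)[:, 0]                        # one predicted value per window
print(pred.shape)
```

In this form the network sees a fixed-length history and has no built-in notion of order between the flattened columns, which is one point a written answer to this part could discuss.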


## Task 2 (25 points): 
In this task, use any pre-trained word embeddings. The Word2vec embedding link provided with the lecture notes can be useful to get started. Write your own code/function that uses these embeddings and outputs the cosine similarity and a dissimilarity score for any pair of words (read as user input). The dissimilarity score should be defined by you. You can either come up with your own idea for a dissimilarity score or refer to the literature (cite the paper you used). In either case, clearly describe how this score helps determine the dissimilarity between the two words.

Projects in Machine Learning and AI (RPI Fall 2022)

Note: Dissimilarity has been an important metric for recommender systems that try to introduce ‘Novelty and Diversity’ in assortments (as opposed to only accuracy). You might find different metrics of dissimilarity in the recommender-systems literature.
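
As a starting point for this task, the sketch below computes cosine similarity and one candidate dissimilarity score, the angle between the two vectors scaled to the range [0, 1]. The choice of angular distance is an assumption made for illustration, not a score required by the task, and the two-dimensional `toy_vectors` are made up; a real solution would look the words up in the pre-trained embeddings instead.

```python
import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def angular_dissimilarity(u, v):
    # angle between the vectors, scaled so 0 = same direction and 1 = opposite direction
    cos = np.clip(cosine_similarity(u, v), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return float(np.arccos(cos) / np.pi)

def compare(word_a, word_b, vectors):
    # hypothetical helper: look up both words and return (similarity, dissimilarity)
    u, v = vectors[word_a], vectors[word_b]
    return cosine_similarity(u, v), angular_dissimilarity(u, v)

# made-up 2-d vectors standing in for real embeddings
toy_vectors = {
    'east': np.array([1.0, 0.0]),
    'north': np.array([0.0, 1.0]),
    'west': np.array([-1.0, 0.0]),
}

sim_en, dis_en = compare('east', 'north', toy_vectors)
sim_ew, dis_ew = compare('east', 'west', toy_vectors)
print(sim_en, dis_en, sim_ew, dis_ew)
```

To read the words as user input, the two arguments of `compare` could come from `input()` calls, with a check that both words exist in the embedding vocabulary.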