# Run this cell only when using Google Colab
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)  # mount Google Drive

In [None]:
# importing dependencies
%matplotlib inline
import torch
import matplotlib.pyplot as plt
from torch.utils import data
import random
import sys
import pickle

# ECS269U/P - Coursework


* The **goal** of the CW is similar to that of Week 2's Lab: fitting a curve to data, also known as **curve fitting**. 
* This has applications in many different disciplines that make use of AI: FinTech, Physics Modelling, or even Sports. 
* For example, we might be interested in learning the evolution (over time) of the price of a specific product in different countries. This can depend on several factors: the product itself, the country, the initial value of the product's price, etc. 
* As usual, we are interested in learning a model that finds these relationships *from the data*. 


### Learning a family of functions

* The main difference from Week 2's Lab is that we will train a network that learns not a single function but a *family of functions*.
* We will consider a family of sinusoidal functions. 
* Below you can find the code that generates the data from different random sinusoidal functions $\{f_a\}$. For each function we take 40 evenly spaced points on the x-axis in the interval $[-2, 2]$ and shift each of them by a small random offset (up to $0.1$). Our functions have the form $y = f_a(x) = a \sin(x+a)$, where $a$ is sampled independently for each function from the interval $[-2, 2]$. To "draw" a function $f_a$, we first choose some $a \sim U(-2,2)$ and then compute $f_a(x)$ with the above formula for every $x$ on the x-axis.


In [None]:
Nf = 2000 # the number of different functions f that we will generate
Npts = 40 # the number of x values that we will use to generate each fa
x = torch.zeros(Nf, Npts, 1)
for k in range(Nf):
    x[k,:,0] = torch.linspace(-2, 2, Npts)

x += torch.rand_like(x)*0.1
a = -2 + 4*torch.rand(Nf).view(-1,1).repeat(1, Npts).unsqueeze(2)
y = a*torch.sin(x+a)

## The Learning Goal

* Because we are dealing with a family of functions and not just a single function, our model must be able to perform two tasks: *Function Selection* and *Regression*.
* Function selection means that given some *additional* input (to be defined below) the model somehow must choose which function $f_a$ from the family of functions $\{f_a\}$ it needs to model.
* Once the correct function is picked then the model must perform regression i.e. learn the relationship $y=f_a(x)$.




## The Learning Objective

* During training we randomly sample functions from the family of functions $\{f_a\}$. For each $f_a$, we are provided with the (input, output) pairs $(x_t, y_t), t=1,\dots,N_{pts}$.

* To perform *Function Selection*, a **random subset** of $(x_t, y_t), t=1,\dots,N_{pts}$ is provided as auxiliary input to the model during *both training and testing*. This auxiliary data is called the *context data*: $(x_c, y_c), c=1,\dots,N_c$.

* Note that the total number of context points $N_c$ should be different (and randomly chosen) for every batch so that the model learns to handle different numbers of context points at test time. This means that the model should work for e.g. $N_c=5$ but also for $N_c=12$, etc.

* Our model will take the context pairs $(x_c, y_c)$ and input values $x_t$ and will produce the estimated values $\hat{y}_t$. 

* During training we have access to the ground-truth values $y_t$, and thus we can compute a loss between the model's predictions $\hat{y}_t$ and the ground-truth values $y_t$.  
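
The context sampling described above can be sketched as follows. This is a minimal sketch rather than a reference solution: the helper name `sample_context` and the bounds `min_nc` and `max_nc` are hypothetical, and it assumes batches of shape `(batch, N_pts, 1)` like the `x` and `y` tensors generated earlier.

```python
import random
import torch

def sample_context(x, y, min_nc=3, max_nc=35):
    """Pick one random subset of points, shared by the whole batch, as context."""
    n_pts = x.shape[1]
    # N_c is drawn anew on every call, i.e. for every batch
    n_c = random.randint(min_nc, min(max_nc, n_pts))
    # random indices without replacement along the points dimension
    idx = torch.randperm(n_pts)[:n_c]
    return x[:, idx, :], y[:, idx, :]

# toy batch: 8 functions, 40 points each
x = torch.linspace(-2, 2, 40).view(1, 40, 1).repeat(8, 1, 1)
a = -2 + 4 * torch.rand(8, 1, 1)
y = a * torch.sin(x + a)

x_c, y_c = sample_context(x, y)
print(x_c.shape, y_c.shape)  # (8, N_c, 1) with 3 <= N_c <= 35
```

Sampling the indices with `torch.randperm` gives a random subset rather than always the first $N_c$ points, so the context is spread over the whole interval.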


## The Model

* Our model will consist of 2 MLPs which must be jointly trained.
* The first MLP is called the *Context Encoder* or Encoder. The Encoder will take as input each pair $(x_c, y_c)$ and will produce a corresponding feature representation $r_c$ of dimension $r_{dim}$.
* A total context feature is produced by averaging over all features: $r_C= \frac{1}{N_c}\sum_{c=1}^{N_c} r_c$.
* The second MLP is called the Decoder. It takes as input $r_C$ and each input value $x_t$ and produces the model's prediction $\hat{y}_t$.
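
The averaging step is what lets the model accept any number of context points: whatever $N_c$ is, $r_C$ has the same size, and the order of the context pairs does not matter. The sketch below illustrates this with a stand-in encoder. The single linear layer and `r_dim = 4` are assumptions chosen only to keep the example short; they are not the architecture asked for in the tasks.

```python
import torch

torch.manual_seed(0)
r_dim = 4
# stand-in for the Context Encoder: maps each (x_c, y_c) pair to r_c
encoder = torch.nn.Linear(2, r_dim)

def total_context(x_c, y_c):
    pairs = torch.cat([x_c, y_c], dim=-1)  # (batch, N_c, 2)
    r_c = encoder(pairs)                   # (batch, N_c, r_dim)
    return r_c.mean(dim=1)                 # (batch, r_dim)

x5, y5 = torch.randn(1, 5, 1), torch.randn(1, 5, 1)
x12, y12 = torch.randn(1, 12, 1), torch.randn(1, 12, 1)
print(total_context(x5, y5).shape)    # torch.Size([1, 4])
print(total_context(x12, y12).shape)  # torch.Size([1, 4])

# shuffling the context pairs leaves r_C unchanged (up to float rounding)
perm = torch.randperm(5)
same = torch.allclose(total_context(x5, y5),
                      total_context(x5[:, perm], y5[:, perm]), atol=1e-6)
print(same)
```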



![Encoder-Decoder](CW1.png)



## Architectures

* The Encoder and the Decoder are **MLPs**. You can experiment with your own architectures. You can also choose to implement the following architectures:
    * *Encoder*: It will map the input pair $(x_c, y_c)$ to some features of dimension $h_{dim}$ using 2 *hidden* layers. A final layer will produce the feature representation $r_c$ of dimension $r_{dim}$.
    * *Decoder*: It will map the input pair $(r_C, x_t)$ to some features of dimension $h_{dim}$ using 2 *hidden* layers. A final layer will produce the model's prediction $\hat{y}_t$.

## Tasks

* You have to implement the following:
    1. Create the training dataset and dataloader (10%). 
    2. Create the Encoder and Decoder (20 + 20%). 
    3. Create the optimizer and the loss for your model (10%).
    4. Write the training script that will train the model and print the training loss (30%).
    5. Evaluate the model on some validation data. Plot some predictions. (10%). 

* You might want to explore the impact of the following design choices and hyperparameters:
    1. Number of hidden layers, $h_{dim}$, and $r_{dim}$.
    2. Type of optimizer, batch size, and all relevant hyperparameters from Week 5.
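
To experiment with these choices, it can help to build both networks from one helper whose depth and widths are arguments. The sketch below is one possible way to do this; `make_mlp`, `count_params`, and the `(h_dim, r_dim)` values tried are hypothetical and not prescribed by the coursework.

```python
import torch

def make_mlp(in_dim, h_dim, out_dim, n_hidden=2):
    """MLP with n_hidden ReLU hidden layers of width h_dim and a linear output layer."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [torch.nn.Linear(d, h_dim), torch.nn.ReLU()]
        d = h_dim
    layers.append(torch.nn.Linear(d, out_dim))
    return torch.nn.Sequential(*layers)

def count_params(model):
    """Total number of trainable and non-trainable parameters."""
    return sum(p.numel() for p in model.parameters())

for h_dim, r_dim in [(32, 16), (128, 64)]:
    encoder = make_mlp(2, h_dim, r_dim)      # input: one (x_c, y_c) pair
    decoder = make_mlp(r_dim + 1, h_dim, 1)  # input: r_C concatenated with x_t
    print(h_dim, r_dim, count_params(encoder) + count_params(decoder))
```

Comparing the parameter counts alongside the training loss gives a rough idea of whether a larger $h_{dim}$ or $r_{dim}$ is worth its cost.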

# Test data  

* Test data are stored in a dictionary where each key has the data for a single function $f_a$. We have generated 6 different functions named as `function_num_1`, `function_num_2` and so on. 
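
The evaluation code later in this notebook reads each entry through the keys `context_pairs` (a pair of tensors `(x_c, y_c)`) and `x` (the inputs at which to predict). The sketch below builds a stand-in dictionary with that layout, so the evaluation loop can be tried without the pickle file. The tensor shapes `(1, N, 1)` are an assumption inferred from how the notebook indexes the data, the helper `make_fake_test_data` is hypothetical, and the values are synthetic: they do not reproduce the real test set.

```python
import torch

def make_fake_test_data(n_functions=6, n_context=10, n_targets=40):
    """Synthetic dictionary mimicking the assumed layout of the test data."""
    fake = {}
    for i in range(1, n_functions + 1):
        a = -2 + 4 * torch.rand(1)
        x_t = torch.linspace(-2, 2, n_targets).view(1, -1, 1)
        # random subset of the target inputs used as context
        idx = torch.randperm(n_targets)[:n_context]
        x_c = x_t[:, idx, :]
        y_c = a * torch.sin(x_c + a)
        fake['function_num_{}'.format(i)] = {'context_pairs': (x_c, y_c), 'x': x_t}
    return fake

fake_test_data = make_fake_test_data()
for name, entry in fake_test_data.items():
    x_c, y_c = entry['context_pairs']
    print(name, tuple(x_c.shape), tuple(entry['x'].shape))
```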

In [None]:
# Task 1 Create the training dataset and dataloader (10%)

# Create dataset using the randomly generated x,y values
dataset = data.TensorDataset(x, y)
# Create Dataloader, shuffle is set to true so that the data is reshuffled at every epoch 
data_iter =  data.DataLoader(dataset, batch_size=40, shuffle=True)

In [None]:
# Task 2 Create the Encoder and Decoder (20 + 20%)

# define Encoder MLP
class Encoder(torch.nn.Module):
    def __init__(self, num_inputs=None, h_dim=128, r_dim=64):
        super().__init__()
        # num_inputs is kept only so that the call Encoder(Npts) still works;
        # the MLP acts on single (x_c, y_c) pairs, so its input size is always 2.
        # h_dim and r_dim are hyperparameters; the defaults are arbitrary choices.
        self.num_inputs = num_inputs
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2, h_dim),      # hidden layer 1
            torch.nn.ReLU(),
            torch.nn.Linear(h_dim, h_dim),  # hidden layer 2
            torch.nn.ReLU(),
            torch.nn.Linear(h_dim, r_dim),  # final layer producing r_c
        )

    def forward(self, x):
        # x is a pair [x_c, y_c]; each tensor has shape (batch, N_c, 1)
        x_c, y_c = x
        # concatenate each context pair into one input of shape (batch, N_c, 2)
        pairs = torch.cat([x_c, y_c], dim=-1)
        # per-pair feature representation r_c of shape (batch, N_c, r_dim)
        r_c = self.mlp(pairs)
        # average over the context points to get the total context feature r_C
        return r_c.mean(dim=1)

In [None]:
# define Decoder MLP

class Decoder(torch.nn.Module):
    def __init__(self, num_inputs=None, max_n_c=None, h_dim=128, r_dim=64):
        super().__init__()
        # num_inputs and max_n_c are kept only so that the call
        # Decoder(Npts, max_n_c) still works; the MLP itself does not need them.
        # r_dim must match the r_dim of the Encoder.
        self.num_inputs = num_inputs
        self.max_n_c = max_n_c
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(r_dim + 1, h_dim),  # hidden layer 1
            torch.nn.ReLU(),
            torch.nn.Linear(h_dim, h_dim),      # hidden layer 2
            torch.nn.ReLU(),
            torch.nn.Linear(h_dim, 1),          # final layer producing y_hat
        )

    def forward(self, x):
        # x is a pair [r_C, x_t]: r_C has shape (batch, r_dim), x_t has shape (batch, N_pts, 1)
        r_c, x_t = x
        # repeat r_C for every target point: (batch, N_pts, r_dim)
        r_c = r_c.unsqueeze(1).expand(-1, x_t.shape[1], -1)
        # concatenate r_C with each x_t and predict y_hat of shape (batch, N_pts, 1)
        return self.mlp(torch.cat([r_c, x_t], dim=-1))

In [None]:
# initialise models
max_n_c = 40
net = Encoder(Npts)
decoder = Decoder(Npts, max_n_c)

In [None]:
# function to plot training loss
def plot_training_loss(training_loss):
    # one x-axis entry per recorded epoch loss
    epoch_list = list(range(1, len(training_loss) + 1))
    # plot loss at each epoch
    plt.plot(epoch_list, training_loss)
    plt.xlabel('epoch')
    plt.ylabel('training loss')

In [None]:
# Task 4 Write the training script that will train the model and print the training loss (30%)

# training method
def train(net, decoder, train_iter, loss, optimizer, epochs):
    # mean training loss of each epoch
    training_loss = []
    # loop on every epoch
    for epoch in range(epochs):
        epoch_loss = 0.0
        # loop over mini-batches; X and y have shape (batch, Npts, 1)
        for X, y in train_iter:
            # number of context points: random for every batch, between 3 and 35
            n_c = random.randint(3, 35)
            # pick a random subset of n_c points (along dim 1) as context data
            idx = torch.randperm(X.shape[1])[:n_c]
            x_c, y_c = X[:, idx, :], y[:, idx, :]
            optimizer.zero_grad()
            # encoder returns the total context feature r_c
            r_c = net([x_c, y_c])
            # decoder returns y_hat, the prediction for every x in X
            y_hat = decoder([r_c, X])
            # compare the model's prediction against the ground-truth y
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            epoch_loss += l.item()
        # record and print the mean loss over the batches of this epoch
        training_loss.append(epoch_loss / len(train_iter))
        print(f'epoch {epoch + 1}, training loss {training_loss[-1]:.4f}')
    # call function to plot training loss
    plot_training_loss(training_loss)
    # training method returns its prediction for the last batch
    return y_hat.detach()

In [None]:
# Task 3 Create the optimizer and the loss for your model (10%)

# define loss as MSELoss
loss = torch.nn.MSELoss()
# define learning rate
lr = 0.01
# define number of epochs
epochs = 10
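# define optimizer over the parameters of both the Encoder and the Decoder, which are trained jointly
optimizer = torch.optim.Adam(list(net.parameters()) + list(decoder.parameters()), lr=lr)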
# call training method
y_hat = train(net, decoder, data_iter, loss, optimizer, epochs)

In [None]:
# Task 5: Evaluate the model on some validation data. Plot some predictions (10%)

# define test method
def test(net, decoder, test_iter, loss, optimizer, x_t):
    # loss and optimizer are unused here: no weights are updated during evaluation
    net.eval()
    decoder.eval()
    with torch.no_grad():
        # loop over the context data (X, y)
        for X, y in test_iter:
            # encoder returns total context feature r_c
            r_c = net([X, y])
            # decoder returns y_hat, the prediction at the target inputs x_t
            y_hat = decoder([r_c, x_t])
    return y_hat

In [None]:
path_to_the_pickle = '/content/gdrive/My Drive/Colab Notebooks/test_data.pkl'
with open(path_to_the_pickle, 'rb') as f:
    test_data = pickle.load(f)
# loop through each of the 6 test functions
for i in range(1, 7):
    name = 'function_num_{}'.format(i)
    x_c = test_data[name]['context_pairs'][0]
    y_c = test_data[name]['context_pairs'][1]
    x_t = test_data[name]['x']
    # create dataset of the context pairs of this test function
    test_dataset = data.TensorDataset(x_c, y_c)
    # create dataloader over the test dataset (not the training dataset)
    test_data_iter = data.DataLoader(test_dataset, batch_size=len(test_dataset))
    # get the model's y prediction
    y_hat = test(net, decoder, test_data_iter, loss, optimizer, x_t)
    # create new plot canvas
    plt.figure(i)
    # plot the model's y prediction against the context data
    plt.plot(x_c[0,:,0].to('cpu'), y_c[0,:,0].to('cpu'), '*')
    plt.plot(x_t[0,:,0].to('cpu'), y_hat[0,:,0].detach().to('cpu'))