## Implementation of Transformer-Based Few-Shot Learning

### ECE590 Homework assignment 6
Name: Javier Cervantes

net id: jc1010

We are interested in running experiments with Transformers applied to functional data. There are three papers we are focusing on, with associated code:

• For the linear attention examples, we are focused on this paper: https://arxiv.org/abs/2212.07677 and the GitHub for the code is here: https://github.com/google-research/self-organising-systems/tree/master/transformers_learn_icl_by_gd

• For the examples with softmax attention (traditional Transformer design), we are interested in this paper: https://arxiv.org/abs/2208.01066 and the GitHub for the code is here: https://github.com/dtsip/in-context-learning

• The MAML GitHub may also be useful: https://github.com/cbfinn/maml . For this homework, it is ok to use/modify the above code to implement experiments.

Consider the following process for generating contextual data for a linear model: weights $w_m \in \mathbb{R}^{10}$ are drawn for context $m$ as $w_m \sim \mathcal{N}(\mu, I_{10})$, where $\mu \in \mathbb{R}^{10}$ is a fixed mean vector. Covariates $x_i \in \mathbb{R}^{10}$ are drawn as $x_i \sim \mathcal{N}(0, I_{10})$. For contextual data $\mathcal{C}_m$, draw one weight vector $w_m$ as above. For a context of length $N$, draw $x_{m,i}$, $i = 1, \ldots, N$, as above, and for each $x_{m, i}$ compute the corresponding $y_{m, i} = w_m^T x_{m, i}$. The contextual data so drawn are represented as $\mathcal{C}_m = (x_{m, 1}, y_{m, 1}, \ldots, x_{m, N}, y_{m, N})$. Finally, draw a query $x_{m, N+1}$ associated with $\mathcal{C}_m$; the model is to predict $y_{m, N+1}$ given $x_{m, N+1}$ and $\mathcal{C}_m$. In the following experiments, set $\mu$ to the all-ones vector: $\mu = (1, \ldots, 1)^T$. In all experiments, let the number of pairs in the context range over $N = 5, \ldots, 20$ and examine the performance of each method over this range.

In [1]:
# import numpy as np


# def generate_linear_batch(
#     M, # number of contexts
#     N, # samples per context
#     dim_output,
#     dim_input,
#     slope_mean,
#     slope_cov,
#     intercept_range,
#     input_range,
#     use_bias=True,
# ):
#     slope = np.random.multivariate_normal(slope_mean, slope_cov, M + 1)
#     intercept = (
#         np.random.uniform(intercept_range[0], intercept_range[1], [M + 1])
#         if use_bias
#         else np.zeros(M + 1)
#     )
#     outputs = np.zeros([M + 1, N, dim_output])
#     init_inputs = np.zeros([M + 1, N, dim_input])
#     for func in range(M + 1):
#         init_inputs[func] = np.random.uniform(input_range[0], input_range[1], [N, 1])
#         outputs[func] = slope[func] * init_inputs[func] + intercept[func]

#     C_m = (init_inputs[:-1], outputs[:-1])
#     query = (init_inputs[-1], outputs[-1])
#     return C_m, query




In [2]:
import numpy as np
import torch


def generate_linear_batch(
    M,
    N,
    dim_output=1,
    dim_input=10,
    use_bias=False,
    bias_sigma=0.1,
):
    # Draw the weights from a multivariate normal distribution
    slope = np.random.multivariate_normal(np.ones(dim_input), np.eye(dim_input), M + 1)

    intercept = np.random.normal(0, bias_sigma, M + 1) if use_bias else np.zeros(M + 1)

    outputs = np.zeros([M + 1, N, dim_output])
    init_inputs = np.zeros([M + 1, N, dim_input])

    for func in range(M + 1):
        # Draw the inputs from a multivariate normal distribution
        init_inputs[func] = np.random.multivariate_normal(
            np.zeros(dim_input), np.eye(dim_input), N
        )

        outputs[func] = (
            np.dot(slope[func], init_inputs[func].T).reshape(-1, 1) + intercept[func]
        )

    C_m = (torch.tensor(init_inputs[:-1]), torch.tensor(outputs[:-1]))
    query = (torch.tensor(init_inputs[-1]), torch.tensor(outputs[-1]))
    return C_m, query


In [8]:
M = 10
N = 20
dim_output = 1
dim_input = 10
use_bias = False
bias_sigma = 0.1

C_m, query = generate_linear_batch(
    M=M,
    N=N,
    dim_output=dim_output,
    dim_input=dim_input,
    use_bias=use_bias,
    bias_sigma=bias_sigma,
)


In [4]:
inputa, labela = C_m
inputb, labelb = query
# print(inputb.shape, labelb.shape)
for task in range(inputa.size(0)):
    print(labela[task].dtype)


torch.float64
torch.float64
torch.float64
torch.float64
torch.float64
torch.float64
torch.float64
torch.float64
torch.float64
torch.float64


a) Perform in-context learning based on MAML, and assume a linear model $f_w(x) = w^T x$, with model parameters $w$. Use MAML to learn good initialization parameters $w_0$, based on $\mathcal{C}_m = (x_{m, 1}, y_{m, 1}, \ldots, x_{m, N}, y_{m, N})$ for $m = 1, \ldots, M$, then apply the model to predict $y_{M+1, N+1}$ for $\mathcal{C}_{M+1} = (x_{M+1, 1}, y_{M+1, 1}, \ldots, x_{M+1, N}, y_{M+1, N}, x_{M+1, N+1})$. Show results for various sizes of $M$ and $N$.
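For the linear model in (a), MAML with a single inner gradient step has a closed-form meta-gradient, so the whole procedure can be sketched in plain NumPy. The following is a minimal sketch under stated assumptions, not the implementation used in the cells below: it assumes one inner step, the loss $\frac{1}{2N}\sum_i (w^T x_i - y_i)^2$, and a separate query set of $N$ points per task (the problem statement does not specify a support/query split). The names `sample_task`, `mse_grad`, and `maml_linear`, as well as the learning rates and step count, are hypothetical choices for illustration.

```python
import numpy as np


def sample_task(N, d, rng):
    """Draw one context: w ~ N(1, I), x ~ N(0, I), y = w^T x (noiseless)."""
    w = rng.normal(loc=1.0, scale=1.0, size=d)
    X = rng.normal(size=(2 * N, d))
    y = X @ w
    # Assumption: first N pairs are the support set, last N the query set.
    return (X[:N], y[:N]), (X[N:], y[N:])


def mse_grad(w, X, y):
    """Gradient of 1/(2N) * sum_i (w^T x_i - y_i)^2 with respect to w."""
    return X.T @ (X @ w - y) / len(y)


def maml_linear(M, N, d=10, inner_lr=0.05, outer_lr=0.05, steps=200, seed=0):
    """Second-order MAML for f_w(x) = w^T x with one inner gradient step."""
    rng = np.random.default_rng(seed)
    tasks = [sample_task(N, d, rng) for _ in range(M)]
    w0 = np.zeros(d)
    for _ in range(steps):
        meta_grad = np.zeros(d)
        for (Xs, ys), (Xq, yq) in tasks:
            # Inner step: adapt the shared initialization to this task.
            w_adapt = w0 - inner_lr * mse_grad(w0, Xs, ys)
            # Jacobian of w_adapt with respect to w0 is I - inner_lr * H.
            H = Xs.T @ Xs / len(ys)
            meta_grad += (np.eye(d) - inner_lr * H) @ mse_grad(w_adapt, Xq, yq)
        w0 -= outer_lr * meta_grad / M
    return w0


w0 = maml_linear(M=50, N=10)
print(w0.round(2))
```

Because the data are noiseless and every $w_m$ is centered on $\mu$, the learned $w_0$ should land near the all-ones vector; how close depends on $M$ and $N$, which is what part (a) asks to explore.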

In [5]:
from torch.optim import Adam
from torch.nn import MSELoss
from tqdm import tqdm


def train(model, M, N, mu, num_epochs, learning_rate):
    optimizer = Adam(model.parameters(), lr=learning_rate)
    loss_fn = MSELoss()

    pbar = tqdm(range(num_epochs), desc="Training ...")
    min_loss = float("inf")

    for epoch in pbar:
        totalloss = 0.0  # reset the running loss at the start of each epoch
        for _ in range(M):
            # Generate context
            C_m, query = generate_linear_batch(
                M=M,
                N=N,
            )

            # Data is already in PyTorch tensors
            inputa, labela = C_m

            # Generate query
            inputb, labelb = query

            # Forward pass
            outputas, lossesb = model.task_metalearn(
                inputa.float(),
                inputb.float(),
                labela.float(),
                labelb.float(),
            )

            # Compute loss
            loss = sum(lossesb)
            totalloss += loss.item()

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        avg_loss = totalloss / M

        # Print loss every 100 epochs
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Loss: {avg_loss:.2f}")

        # Save the model if the loss decreased
        if avg_loss < min_loss:
            min_loss = avg_loss
            torch.save(model.state_dict(), f"models/MAML_M{M}_N{N}.pth")

In [7]:
from maml_pytorch import MAMLModel

lr = 0.001
dim_input = 10
model = MAMLModel(
    dim_input=dim_input,
    dim_hidden=[40, 40],
    dim_output=1,
    num_updates=5,
    learning_rate=lr,
)
model.float()
mu = np.ones(dim_input)
# train(model, M=10, N=20, mu=mu, num_epochs=500, learning_rate=lr)

b) Add noise to the observed $y_{m, i}$, so that now $y_{m, i} = w_m^T x_{m, i} + \epsilon_{m, i}$, where $\epsilon_{m, i} \sim \mathcal{N}(0, \sigma^2)$. Examine different values of $\sigma^2$ and assess the robustness of MAML to this noise-induced model mismatch.
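A minimal sketch of the noisy data model in (b), assuming independent per-sample noise and the same $\mu = (1, \ldots, 1)^T$ as before. The function name `sample_noisy_context` and its signature are hypothetical and are not part of the code above.

```python
import numpy as np


def sample_noisy_context(N, sigma, d=10, rng=None):
    """Draw one context with y_i = w^T x_i + eps_i, eps_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(loc=1.0, scale=1.0, size=d)  # w_m ~ N(mu, I), mu = all ones
    X = rng.normal(size=(N, d))                 # x_i ~ N(0, I)
    y = X @ w + rng.normal(scale=sigma, size=N) # independent noise per sample
    return X, y, w


rng = np.random.default_rng(0)
for sigma in (0.0, 0.1, 1.0):
    X, y, w = sample_noisy_context(N=20, sigma=sigma, rng=rng)
    print(sigma, float(np.mean((y - X @ w) ** 2)))
```

Note that the `use_bias` / `bias_sigma` options of `generate_linear_batch` above draw a single intercept per context, which is a different perturbation from the per-sample $\epsilon_{m,i}$ described here.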

c) For (a) above, rather than doing MAML, use a Transformer with linear attention. As in https://arxiv.org/abs/2212.07677, show results as a function of $M$ and $N$, and compare your results to MAML, and also to ordinary least squares (OLS). Implement the linear-attention Transformer in two ways:

1. based on the *designed* Transformer parameters

2. with the Transformer parameters learned from $\{\mathcal{C}_m\}_{m=1}^{M}$

Compare the results of the Transformer with designed and learned parameters. The paper contains similar experiments, on which you should model your own. Consider performance with and without noise $\epsilon_{m,i}$ added to the observed outcomes, as in (b).
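The two reference points in (c) can be sketched in NumPy as follows. This is a hedged illustration, not the paper's code: it assumes the designed layer is a single linear self-attention layer that reproduces one gradient-descent step on $\frac{1}{2N}\sum_i (w^T x_i - y_i)^2$ starting from $w_0 = 0$, with tokens $(x_i, y_i)$ and a query token $(x_{N+1}, 0)$. The matrix names `W_kq`, `W_v`, `P`, the step size `eta`, and the sign convention for reading out the prediction are assumptions made for this sketch.

```python
import numpy as np


def ols_predict(X, y, x_query):
    """Least-squares fit on the context, then predict at the query."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(x_query @ w_hat)


def one_step_gd_predict(X, y, x_query, eta):
    """One GD step from w = 0 on 1/(2N) * sum_i (w^T x_i - y_i)^2."""
    w1 = (eta / X.shape[0]) * (X.T @ y)
    return float(x_query @ w1)


def linear_attention_predict(X, y, x_query, eta):
    """Single linear self-attention layer with hand-set (designed) weights."""
    N, d = X.shape
    tokens = np.hstack([X, y[:, None]])      # context tokens (x_i, y_i)
    query = np.append(x_query, 0.0)          # query token (x_{N+1}, 0)
    W_kq = np.zeros((d + 1, d + 1))
    W_kq[:d, :d] = np.eye(d)                 # keys/queries read only x
    W_v = np.zeros((d + 1, d + 1))
    W_v[d, d] = -1.0                         # values carry -y_i (w_0 = 0)
    P = (eta / N) * np.eye(d + 1)            # projection holds the step size
    scores = (tokens @ W_kq.T) @ (W_kq @ query)  # x_i^T x_{N+1}, no softmax
    updated = query + P @ ((tokens @ W_v.T).T @ scores)
    return float(-updated[d])                # assumed readout: negate y slot


rng = np.random.default_rng(0)
w = rng.normal(loc=1.0, size=10)
X = rng.normal(size=(20, 10))
y = X @ w
x_q = rng.normal(size=10)
print(x_q @ w, ols_predict(X, y, x_q))
print(one_step_gd_predict(X, y, x_q, 0.5), linear_attention_predict(X, y, x_q, 0.5))
```

Under these assumptions the last two printed values coincide, which is the sense in which the designed layer "is" a gradient step. With noiseless data and $N \ge 10$, OLS recovers $w_m$ exactly, so it serves as the lower bound on error for the comparison.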

Now consider contextual data $\mathcal{C}_m = (x_{m, 1}, y_{m, 1}, \ldots, x_{m, N}, y_{m, N})$ where in each case $y_{m, i} = f_{w_m}(x_{m, i}) = w_m^T x_{m,i}$, with each $w_m \sim \mathcal{N}(\mathbf{0}_d, I_d)$ and $d = 10$. This is as above, but now the manner in which the $x_{m,i}$ are drawn is different. Consider two 10-dimensional real-valued vectors, $v=(v_1, \ldots, v_{10})^T$ and $u=(u_1, \ldots, u_{10})^T$, where $v_j = \cos(\frac{j \pi}{5})$ and $u_j = \sin(\frac{j \pi}{5})$ for $j = 1, \ldots, 10$. Each $x_{m,i} = \alpha v + \beta u + \epsilon$, where $\alpha \sim \mathcal{N}(0, 1)$, $\beta \sim \mathcal{N}(0, 1)$, and $\epsilon = (\epsilon_1, \ldots, \epsilon_{10})^T$ with $\epsilon_j \sim \mathcal{N}(0, \frac{1}{100})$. In other words, each $x_{m,i} \in \mathbb{R}^{10}$ is a randomly weighted sum of two factors, $v$ and $u$ (with respective random weights $\alpha$ and $\beta$), plus a small amount of additive noise $\epsilon$.
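The factor-structured covariates described above can be sampled with a few lines of NumPy. This is a minimal sketch; the function names `factor_directions` and `sample_factor_inputs` are hypothetical, and it assumes $\alpha$, $\beta$, and $\epsilon$ are drawn independently for every sample.

```python
import numpy as np


def factor_directions(d=10):
    """Return v_j = cos(j*pi/5) and u_j = sin(j*pi/5) for j = 1..d."""
    j = np.arange(1, d + 1)
    return np.cos(j * np.pi / 5), np.sin(j * np.pi / 5)


def sample_factor_inputs(N, d=10, noise_var=0.01, rng=None):
    """Draw N covariates x = alpha * v + beta * u + eps."""
    rng = np.random.default_rng() if rng is None else rng
    v, u = factor_directions(d)
    alpha = rng.normal(size=(N, 1))                          # alpha ~ N(0, 1)
    beta = rng.normal(size=(N, 1))                           # beta ~ N(0, 1)
    eps = rng.normal(scale=np.sqrt(noise_var), size=(N, d))  # eps_j ~ N(0, 1/100)
    return alpha * v + beta * u + eps


X = sample_factor_inputs(N=5, rng=np.random.default_rng(0))
print(X.shape)
```

Labels for a context then follow as `y = X @ w` with `w` drawn from $\mathcal{N}(\mathbf{0}_d, I_d)$.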

d) What is the covariance of the data $x\in \mathbb{R}^{10}$ drawn in the manner specified above?
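A short derivation, assuming $\alpha$, $\beta$, and $\epsilon$ are mutually independent (the problem statement does not say so explicitly). Since all three have zero mean, $\mathbb{E}[x] = 0$ and every cross term in $\mathbb{E}[x x^T]$ vanishes, leaving

$$
\operatorname{Cov}(x) = \mathbb{E}[x x^T] = \mathbb{E}[\alpha^2]\, v v^T + \mathbb{E}[\beta^2]\, u u^T + \mathbb{E}[\epsilon \epsilon^T] = v v^T + u u^T + \tfrac{1}{100} I_{10}.
$$

For these particular vectors, $v^T u = \tfrac{1}{2}\sum_{j=1}^{10} \sin(\tfrac{2 j \pi}{5}) = 0$ and $\|v\|^2 = \|u\|^2 = 5$, so the covariance has eigenvalue $5 + \tfrac{1}{100}$ with multiplicity two (on the span of $v$ and $u$) and eigenvalue $\tfrac{1}{100}$ with multiplicity eight. The data are therefore strongly anisotropic, which is the setting where input preconditioning of the kind used in GD++ is relevant.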

e) For data generated as above, use the designed Transformer with linear attention to perform few-shot learning on new contextual data $\mathcal{C}_{M+1} = (x_{M+1,1}, y_{M+1,1}, \ldots , x_{M+1, N}, y_{M+1, N}, x_{M+1, N+1})$, and compare that to the performance of a linear Transformer trained on $\{\mathcal{C}_m\}_{m=1}^{M}$ (in the latter, you *learn* the Transformer parameters from $\{\mathcal{C}_m\}_{m=1}^{M}$). Consider performance for various settings of $M$ for the trained model. For this problem, pay special attention to the GD++ framework in https://arxiv.org/abs/2212.07677; we will also discuss this in class.

f) Repeat the experiments in (c)-(d) using softmax attention and the full Transformer, as in https://arxiv.org/abs/2208.01066; in this case we do not make the linear-attention assumption and follow the framework in the referenced paper (and that we discussed in class). Compare the performance of the learned Transformer with softmax attention to what you got in (e) with linear attention. Which method works better, in the sense of model accuracy as a function of Transformer depth?