<a href="https://colab.research.google.com/github/gmshroff/metaLearning2022/blob/main/code/nb2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MAML - MODEL-AGNOSTIC META-LEARNING

In [None]:
# !pip install import_ipynb --quiet
# !pip install learn2learn --quiet
# !git clone https://github.com/gmshroff/metaLearning.git
# %cd metaLearning/code

In [None]:
import requests
import pickle

In [None]:
import import_ipynb
import utils
import models

In [None]:
from IPython import display
import torch
import torch.nn as nn
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
from l2lutils import KShotLoader

In [None]:
from course_data import MyDS, TsDS, FeedData

# Pre-trained Models

In [None]:
#Generate data - euclidean
meta_train_ds, meta_test_ds, full_loader = utils.euclideanDataset(n_samples=10000,n_features=20,n_classes=10,batch_size=32)

In [None]:
# Define an MLP network. Note that input dimension has to be data dimension. For classification
# final dimension has to be number of classes; for regression one.
#torch.manual_seed(10)
net0 = models.MLP(dims=[20,32,32,10])

In [None]:
# Train the network; note that network is trained in place so repeated calls further train it.
net0,loss,accs=models.Train(net0,full_loader,lr=1e-3,epochs=10,verbose=True)

In [None]:
#Training accuracy.
models.accuracy(net0,meta_train_ds.samples,meta_train_ds.labels,verbose=True)

In [None]:
# Test accuracy.
models.accuracy(net0,meta_test_ds.samples,meta_test_ds.labels)

# Second-order Differentiation using Autograd

Second-order derivatives as needed for MAML

In [None]:
network = (lambda x,w: x@w)
loss = torch.nn.MSELoss()

In [None]:
Z=(torch.ones(3,1)).float()
z=(torch.ones(3,1)*2).float()

In [None]:
Zt=(torch.ones(3,1)*1.5).float()
zt=(torch.ones(3,1)*2*1.5).float()

In [None]:
w0=(torch.ones(1,1,requires_grad=True)).float()

In [None]:
w1=w0.clone()

In [None]:
L=loss(network(Z,w1),z)

In [None]:
#g=torch.autograd.grad(L,w0)[0]
g=torch.autograd.grad(L,w1,create_graph=True)[0]
# L.backward()# Not good

In [None]:
w1.grad, w0.grad, L, w0, w1,w1.requires_grad,g

In [None]:
w1 = w1 - 0.1*g

In [None]:
L1=loss(network(Zt,w1),zt)
#L1=loss(net(Zt,w0-0.1*(2.0*(w0-2.0))),zt)

In [None]:
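# Either call works, but run only one of them: the graph is freed after the first.
# autograd.grad returns the gradient; L1.backward() stores it in w0.grad, which optimizer.step() reads.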
g1=torch.autograd.grad(L1,w0)[0]
# L1.backward()

In [None]:
g1

Working this out manually:

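$w_0=1,\quad L=(w_0-2)^2,\quad \frac{dL}{dw_0}=2\times(w_0-2)=-2,\quad w_1=w_0-0.1\times(-2)=1.2$

$L_1=(1.5\times w_1-3)^2 = \left(1.5\times(w_0-0.1\times 2\times(w_0-2))-3\right)^2 = (-1.2)^2$

$\frac{dL_1}{dw_0} = 2 \times (-1.2) \times 1.5 \times (1-0.2) = -2.88$

The factor $(1-0.2)$ is $dw_1/dw_0$: it comes from differentiating through the inner gradient step, which is why the first gradient was computed with create_graph=True.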

In [116]:
2*(-1.2)*(1.5*(1-.2))

-2.8800000000000003

In [117]:
w0.grad,w1.grad

(tensor([[-2.8800]]), None)
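
As a hedged sanity check of the hand calculation above, the sketch below re-derives the same number in plain Python with a central finite difference, without autograd. The helper names `inner_step` and `outer_loss` are illustrative and not part of the notebook; the constants (step size 0.1, inputs 1 and 1.5, targets 2 and 3) are taken from the cells above. It also computes the value one would get by ignoring the dependence of the inner gradient on $w_0$, to show what the $(1-0.2)$ factor contributes.

```python
# Finite-difference check of dL1/dw0 for the toy problem above (no autograd).

def inner_step(w0, lr=0.1):
    """One gradient step on L = (w0 - 2)^2, i.e. w1 = w0 - lr * dL/dw0."""
    return w0 - lr * 2.0 * (w0 - 2.0)

def outer_loss(w0):
    """L1 evaluated at the adapted weight: (1.5 * w1 - 3)^2."""
    w1 = inner_step(w0)
    return (1.5 * w1 - 3.0) ** 2

w0, h = 1.0, 1e-4
# Central difference through the whole inner step (second-order path included).
full_grad = (outer_loss(w0 + h) - outer_loss(w0 - h)) / (2 * h)

# First-order shortcut: pretend dw1/dw0 = 1, i.e. drop the (1 - 0.2) factor.
w1 = inner_step(w0)
first_order_grad = 2.0 * (1.5 * w1 - 3.0) * 1.5

print(round(full_grad, 4), round(first_order_grad, 4))
```

The first value should agree with the -2.88 obtained above; the second, -3.6, is what a first-order approximation would report instead.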

# Meta-Learning: Tasks

Generate a k-shot n-way loader using the meta-training dataset

In [118]:
classes_train = [i for i in range(5)]
classes_test = [i+5 for i in range(5)]
classes_train, classes_test

([0, 1, 2, 3, 4], [5, 6, 7, 8, 9])

In [124]:
meta_train_kloader=KShotLoader(meta_train_ds,shots=5,ways=5)

Sample a task - each task has a k-shot n-way training set and a similar test set

In [125]:
d_train,d_test=meta_train_kloader.get_task()
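
To make "k-shot n-way" concrete, here is a minimal sketch of how such a task could be sampled from a labelled dataset. It is not the `KShotLoader` implementation, whose internals are not shown in this notebook; `sample_task` and the toy data are hypothetical. One assumption worth flagging: the sketch relabels the chosen classes to 0..n-1 within each task, which is a common convention but may differ from what `KShotLoader` does.

```python
import random
from collections import defaultdict

def sample_task(samples, labels, ways, shots, rng):
    """Return ((train_x, train_y), (test_x, test_y)) for one k-shot n-way task.

    Hypothetical helper: picks `ways` classes, then `shots` train and `shots`
    test examples per class, relabelling classes to 0..ways-1 within the task.
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    chosen = rng.sample(sorted(by_class), ways)
    train_x, train_y, test_x, test_y = [], [], [], []
    for new_label, cls in enumerate(chosen):
        picked = rng.sample(by_class[cls], 2 * shots)  # disjoint train/test examples
        train_x += picked[:shots]
        train_y += [new_label] * shots
        test_x += picked[shots:]
        test_y += [new_label] * shots
    return (train_x, train_y), (test_x, test_y)

# Toy dataset: 10 classes with 20 one-dimensional samples each.
rng = random.Random(0)
labels = [c for c in range(10) for _ in range(20)]
samples = [[c + rng.random()] for c in labels]

d_train, d_test = sample_task(samples, labels, ways=5, shots=5, rng=rng)
print(len(d_train[0]), len(d_test[0]), sorted(set(d_train[1])))
```

With 5 ways and 5 shots, both the training and the test part of a task hold 25 examples, which matches the "25" printed by the accuracy calls further down.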

Let's try learning directly from the task's training set, despite its small size: wrap it in a dataset and loader, then train a fresh network with the same MLP architecture and Train function as before.

In [126]:
taskds = utils.MyDS(d_train[0],d_train[1])

In [127]:
d_train_loader = torch.utils.data.DataLoader(dataset=taskds,batch_size=1,shuffle=True)

In [128]:
net = models.MLP(dims=[20,32,32,10])
# net1 = models.MLP(dims=[400,32,32,len(mapping.keys())])

In [129]:
net,losses,accs=models.Train(net,d_train_loader,lr=1e-3,epochs=20,verbose=True)

Epoch   19 Loss: 3.36018e-02 Accuracy: 1.00000


In [130]:
models.accuracy(net,d_train[0],d_train[1])

25.0 25


1.0

How does it do on the test set of the sampled task?

In [131]:
models.accuracy(net,d_test[0],d_test[1])

18.0 25


0.72

How does it do on the test set of a task sampled from the meta-test set?

In [132]:
meta_test_kloader=KShotLoader(meta_test_ds,shots=5,ways=5)

In [133]:
d_train,d_test=meta_test_kloader.get_task()

In [134]:
models.accuracy(net,d_test[0],d_test[1])

6.0 25


0.24

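Now start from the pre-trained network net0 instead. Note that d_train_loader still wraps the task sampled from the meta-training set earlier, whereas d_train and d_test now hold the task sampled from the meta-test set, so the accuracies below are measured on a different task from the one used for fine-tuning.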

In [135]:
netp=net0
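
One thing to keep in mind about the cell above: in Python, `netp=net0` binds a second name to the same module object, so any training done through `netp` also changes `net0`. If an untouched copy of the pre-trained weights is wanted, `copy.deepcopy` is one standard way to get it. The sketch below illustrates the difference with a small `nn.Linear` as a stand-in for the course's `models.MLP`; the in-place `add_` merely simulates a training update.

```python
import copy
import torch
import torch.nn as nn

net0 = nn.Linear(4, 2)               # stand-in for the pre-trained network
alias = net0                         # same object, as in `netp=net0`
independent = copy.deepcopy(net0)    # separate module with its own parameters

with torch.no_grad():
    alias.weight.add_(1.0)           # simulate a training update through the alias

# The alias and the original are one object, so they always agree.
alias_matches_original = torch.equal(alias.weight, net0.weight)
# The deep copy kept the old weights and now differs from net0.
copy_differs = not torch.equal(independent.weight, net0.weight)

print(alias is net0, alias_matches_original, copy_differs)
```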

In [136]:
netp,losses,accs=models.Train(netp,d_train_loader,lr=1e-3,epochs=20,verbose=True)

Epoch   19 Loss: 7.62113e-03 Accuracy: 1.00000


In [137]:
models.accuracy(netp,d_train[0],d_train[1])

1.0 25


0.04

How does it do on the test set of the sampled task?

In [138]:
models.accuracy(netp,d_test[0],d_test[1])

3.0 25


0.12

In [139]:
meta_test_kloader=KShotLoader(meta_test_ds,shots=5,ways=5)

In [140]:
d_train,d_test=meta_test_kloader.get_task()

In [141]:
models.accuracy(netp,d_test[0],d_test[1])

0.0 25


0.0

# MAML - Model-Agnostic Meta-Learning

In [142]:
import learn2learn as l2l
import torch.optim as optim

In [143]:
shots,ways=2,5
net = models.MLP(dims=[20,32,32,ways])
maml = l2l.algorithms.MAML(net, lr=1e-1)
optimizer = optim.Adam(net.parameters(),lr=1e-3)
lossfn = torch.nn.NLLLoss()
meta_train_kloader=KShotLoader(meta_train_ds,shots=shots,ways=ways)

The MAML class above wraps our nn.Module so that its parameters can be cloned and adapted, as shown below. One iteration of the MAML algorithm starts by sampling a training task. Note that each of d_train and d_test is a tuple of samples and labels.

In [144]:
d_train,d_test=meta_train_kloader.get_task()

In [145]:
learner = maml.clone()

The learner above is a 'clone' of our network: it holds copies of the parameters, so that we can change these without changing the parameters of the network itself. We apply the learner to the task's training samples and compute the TRAINING loss against the training labels, i.e., on d_train.

In [146]:
train_preds = learner(d_train[0])
train_loss = lossfn(train_preds,d_train[1])

In [147]:
train_loss

tensor(1.7114, grad_fn=<NllLossBackward0>)

In [148]:
net.layers[0].weight

Parameter containing:
tensor([[ 0.1121, -0.0092, -0.1242,  0.0586,  0.0297, -0.0524, -0.2009,  0.2082,
          0.0989,  0.0167, -0.0350,  0.0734, -0.0245,  0.1639, -0.0037,  0.1735,
          0.0533, -0.0820,  0.1081,  0.0737],
        [ 0.0070, -0.0386, -0.0314, -0.0761, -0.0223, -0.0870,  0.1188, -0.0043,
          0.1373,  0.0178,  0.1070,  0.1092, -0.0285, -0.1202,  0.1345,  0.1904,
          0.0238,  0.2207, -0.0216,  0.1676],
        [ 0.0640,  0.0818,  0.0549, -0.0860, -0.1960,  0.2120,  0.0915, -0.2207,
          0.1724,  0.2092,  0.1118, -0.0383, -0.1833,  0.0796, -0.1369, -0.0677,
          0.0626, -0.0314, -0.2176, -0.0437],
        [ 0.0673, -0.0091, -0.0399,  0.0687, -0.0365,  0.1350,  0.0106,  0.1543,
          0.0619, -0.1191, -0.1569, -0.0604,  0.1461,  0.1343,  0.1576, -0.1636,
          0.0497, -0.1160,  0.0541, -0.0714],
        [-0.0323,  0.1176, -0.0047, -0.2212,  0.1174,  0.0010,  0.2144,  0.0933,
         -0.0245,  0.1226, -0.1737, -0.1152,  0.0480, -0.0632,  0

In [149]:
learner.layers[0].weight

tensor([[ 0.1121, -0.0092, -0.1242,  0.0586,  0.0297, -0.0524, -0.2009,  0.2082,
          0.0989,  0.0167, -0.0350,  0.0734, -0.0245,  0.1639, -0.0037,  0.1735,
          0.0533, -0.0820,  0.1081,  0.0737],
        [ 0.0070, -0.0386, -0.0314, -0.0761, -0.0223, -0.0870,  0.1188, -0.0043,
          0.1373,  0.0178,  0.1070,  0.1092, -0.0285, -0.1202,  0.1345,  0.1904,
          0.0238,  0.2207, -0.0216,  0.1676],
        [ 0.0640,  0.0818,  0.0549, -0.0860, -0.1960,  0.2120,  0.0915, -0.2207,
          0.1724,  0.2092,  0.1118, -0.0383, -0.1833,  0.0796, -0.1369, -0.0677,
          0.0626, -0.0314, -0.2176, -0.0437],
        [ 0.0673, -0.0091, -0.0399,  0.0687, -0.0365,  0.1350,  0.0106,  0.1543,
          0.0619, -0.1191, -0.1569, -0.0604,  0.1461,  0.1343,  0.1576, -0.1636,
          0.0497, -0.1160,  0.0541, -0.0714],
        [-0.0323,  0.1176, -0.0047, -0.2212,  0.1174,  0.0010,  0.2144,  0.0933,
         -0.0245,  0.1226, -0.1737, -0.1152,  0.0480, -0.0632,  0.1881,  0.0511,
      

Note that at this point both the learner and the original net have the same parameters. Let's see what the gradients w.r.t. the TRAINING loss are. (We use PyTorch's autograd functions directly.)

In [150]:
from torch.autograd import grad

In [151]:
train_grad=grad(train_loss,learner.layers[0].weight,retain_graph=True,
                                 create_graph=True,
                                 allow_unused=True)
train_grad[0]

tensor([[ 2.7454e-02, -2.0152e-03,  2.2447e-02,  8.7581e-03, -1.8203e-02,
         -5.4251e-03, -2.0568e-02, -1.7881e-02,  2.7123e-02, -1.0518e-02,
         -1.1338e-02, -2.3910e-03, -8.3744e-03,  7.0201e-03, -4.0147e-03,
          8.0017e-03,  2.0661e-02, -5.4978e-03, -1.6388e-02,  1.7153e-02],
        [-1.0287e-02,  1.8980e-02,  3.4639e-03, -4.5314e-03, -3.0747e-02,
          4.1525e-03, -1.5390e-02,  1.4088e-02, -3.0416e-02, -1.8706e-03,
         -3.7763e-02,  3.3578e-03,  9.0702e-03,  6.6391e-03,  4.5998e-03,
          2.8288e-03, -6.1740e-03,  1.1027e-02, -5.6933e-03, -6.0468e-03],
        [ 1.3979e-03, -1.3661e-02,  2.8597e-02, -6.5008e-04, -2.2044e-02,
         -3.2507e-02, -3.7264e-02, -3.7694e-02,  2.4167e-02, -1.9234e-02,
         -8.2191e-02, -2.2443e-02,  2.1345e-02, -6.0700e-03, -2.3837e-02,
         -1.0661e-03,  2.5801e-02,  2.7016e-03, -3.1403e-02,  3.0071e-02],
        [-1.9711e-02, -1.3235e-02,  2.3637e-02, -2.0479e-02, -2.1933e-03,
         -1.1159e-03, -1.2327e-02, 

Next we ADAPT the learner by taking one gradient-descent step on the CLONED parameters, i.e., against the gradient of the TRAINING loss above. This is the part that the l2l library does for us, as per the MAML algorithm.

In [152]:
learner.adapt(train_loss)
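
Conceptually, the adaptation step is a gradient-descent update that stays inside the autograd graph, so that a later loss on the adapted parameters can still be differentiated with respect to the original ones. The following is a minimal sketch of that idea in plain PyTorch for a single weight matrix. It is an illustration under that assumption, not learn2learn's actual implementation, and the names `w`, `fast_w`, and the random data are made up for the example.

```python
import torch

torch.manual_seed(0)
lr = 0.1                                         # inner-loop (adaptation) step size
w = torch.randn(3, 2, requires_grad=True)        # stands in for the original parameters
x_train, y_train = torch.randn(8, 3), torch.randn(8, 2)
x_test, y_test = torch.randn(8, 3), torch.randn(8, 2)

fast_w = w.clone()                               # differentiable copy, like the learner
train_loss = ((x_train @ fast_w - y_train) ** 2).mean()

# create_graph=True keeps the gradient itself differentiable w.r.t. w.
(g,) = torch.autograd.grad(train_loss, fast_w, create_graph=True)
fast_w = fast_w - lr * g                         # adapted parameters; w is unchanged

# A loss on held-out data, evaluated with the adapted parameters ...
test_loss = ((x_test @ fast_w - y_test) ** 2).mean()
test_loss.backward()                             # ... back-propagates to the original w

print(w.grad.shape, torch.allclose(fast_w, w - lr * g))
```

Because `fast_w` is built from `w` by differentiable operations only, `w.grad` ends up holding the gradient of the test loss through the adaptation step, which is the quantity the outer optimizer needs.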

We can check what has happened:

In [153]:
learner.layers[0].weight

tensor([[ 0.1093, -0.0090, -0.1264,  0.0577,  0.0315, -0.0519, -0.1989,  0.2100,
          0.0962,  0.0178, -0.0338,  0.0737, -0.0236,  0.1632, -0.0033,  0.1727,
          0.0513, -0.0814,  0.1097,  0.0720],
        [ 0.0081, -0.0405, -0.0318, -0.0756, -0.0192, -0.0874,  0.1203, -0.0057,
          0.1403,  0.0180,  0.1108,  0.1088, -0.0294, -0.1209,  0.1341,  0.1901,
          0.0245,  0.2196, -0.0210,  0.1682],
        [ 0.0638,  0.0831,  0.0521, -0.0859, -0.1938,  0.2153,  0.0952, -0.2169,
          0.1700,  0.2111,  0.1200, -0.0361, -0.1854,  0.0802, -0.1345, -0.0675,
          0.0600, -0.0317, -0.2145, -0.0467],
        [ 0.0693, -0.0078, -0.0422,  0.0707, -0.0363,  0.1351,  0.0119,  0.1564,
          0.0597, -0.1172, -0.1517, -0.0616,  0.1465,  0.1363,  0.1599, -0.1625,
          0.0481, -0.1164,  0.0554, -0.0715],
        [-0.0314,  0.1199, -0.0061, -0.2253,  0.1160,  0.0036,  0.2146,  0.0978,
         -0.0257,  0.1206, -0.1717, -0.1101,  0.0444, -0.0624,  0.1906,  0.0504,
      

In [154]:
(net.layers[0].weight - learner.layers[0].weight)/train_grad[0]

tensor([[0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000],
        [0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000],
        [0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000],
        [0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000],
        [0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000,
         0.1000, 0.1000],
        [0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1000, 0.1

So one step of size 0.1 (the lr passed to MAML) has been taken against the gradient of train_loss. Next we compute the loss of this ADAPTED learner w.r.t. the TEST data of the task, i.e., d_test:

In [155]:
test_preds = learner(d_test[0])
adapt_loss = lossfn(test_preds,d_test[1])

The main MAML update to the original network net takes place now, by back-propagating the adaptation loss (accumulated over possibly many tasks; here there is just one) through the adaptation step to the original parameters:

In [156]:
task_count = 1
optimizer.zero_grad()
total_loss = adapt_loss/task_count
total_loss.backward()

In [157]:
net.layers[0].weight

Parameter containing:
tensor([[ 0.1121, -0.0092, -0.1242,  0.0586,  0.0297, -0.0524, -0.2009,  0.2082,
          0.0989,  0.0167, -0.0350,  0.0734, -0.0245,  0.1639, -0.0037,  0.1735,
          0.0533, -0.0820,  0.1081,  0.0737],
        [ 0.0070, -0.0386, -0.0314, -0.0761, -0.0223, -0.0870,  0.1188, -0.0043,
          0.1373,  0.0178,  0.1070,  0.1092, -0.0285, -0.1202,  0.1345,  0.1904,
          0.0238,  0.2207, -0.0216,  0.1676],
        [ 0.0640,  0.0818,  0.0549, -0.0860, -0.1960,  0.2120,  0.0915, -0.2207,
          0.1724,  0.2092,  0.1118, -0.0383, -0.1833,  0.0796, -0.1369, -0.0677,
          0.0626, -0.0314, -0.2176, -0.0437],
        [ 0.0673, -0.0091, -0.0399,  0.0687, -0.0365,  0.1350,  0.0106,  0.1543,
          0.0619, -0.1191, -0.1569, -0.0604,  0.1461,  0.1343,  0.1576, -0.1636,
          0.0497, -0.1160,  0.0541, -0.0714],
        [-0.0323,  0.1176, -0.0047, -0.2212,  0.1174,  0.0010,  0.2144,  0.0933,
         -0.0245,  0.1226, -0.1737, -0.1152,  0.0480, -0.0632,  0

In [158]:
optimizer.step()

In [159]:
net.layers[0].weight

Parameter containing:
tensor([[ 1.1105e-01, -8.1686e-03, -1.2519e-01,  5.9588e-02,  3.0707e-02,
         -5.3418e-02, -1.9991e-01,  2.0925e-01,  9.7909e-02,  1.5700e-02,
         -3.3958e-02,  7.2418e-02, -2.5471e-02,  1.6286e-01, -2.7217e-03,
          1.7254e-01,  5.4342e-02, -8.2958e-02,  1.0910e-01,  7.4745e-02],
        [ 6.0320e-03, -3.7642e-02, -3.2428e-02, -7.5061e-02, -2.1295e-02,
         -8.8029e-02,  1.1975e-01, -3.2606e-03,  1.3830e-01,  1.6785e-02,
          1.0801e-01,  1.0817e-01, -2.9536e-02, -1.2119e-01,  1.3553e-01,
          1.8938e-01,  2.4835e-02,  2.1970e-01, -2.0576e-02,  1.6860e-01],
        [ 6.4964e-02,  8.2770e-02,  5.5930e-02, -8.4968e-02, -1.9503e-01,
          2.1103e-01,  9.0497e-02, -2.1967e-01,  1.7340e-01,  2.1016e-01,
          1.1075e-01, -3.9335e-02, -1.8227e-01,  7.8630e-02, -1.3587e-01,
         -6.6652e-02,  6.1624e-02, -3.2398e-02, -2.1863e-01, -4.2700e-02],
        [ 6.6347e-02, -8.0743e-03, -4.0870e-02,  6.9673e-02, -3.5518e-02,
          1.3

So the original parameters have been updated by one optimizer step (Adam here) on the adaptation loss accumulated over all tasks.
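
For reference, the single iteration walked through above can be summarised in the usual MAML notation. The symbols are introduced here for convenience and do not appear in the code: $\theta$ stands for the parameters of `net`, $\alpha$ for the adaptation step size (the `lr` given to `MAML`), and $T$ for the number of tasks. The notebook performs the outer update with Adam, so the last line shows only the gradient that the optimizer receives, not a plain gradient-descent step.

$\theta'_i = \theta - \alpha\,\nabla_\theta L^{train}_i(\theta)$

$L_{meta}(\theta) = \frac{1}{T}\sum_{i=1}^{T} L^{test}_i(\theta'_i)$

$\nabla_\theta L^{test}_i(\theta'_i) = \left(I - \alpha\,\nabla^2_\theta L^{train}_i(\theta)\right)\nabla_{\theta'_i} L^{test}_i(\theta'_i)$

In the scalar example worked out earlier, $\alpha=0.1$ and $\nabla^2_\theta L^{train}=2$, which gives exactly the factor $(1-0.2)$.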

# Putting it all together: MAML Algorithm
Now let's put all of the above in a loop - the MAML algorithm:

In [160]:
import learn2learn as l2l
import torch.optim as optim
shots,ways = 5,5
net = models.MLP(dims=[20,32,32,ways])
maml = l2l.algorithms.MAML(net, lr=5e-3)
# maml = l2l.algorithms.MAML(net0, lr=5e-3)
optimizer = optim.Adam(maml.parameters(),lr=5e-4)
lossfn = torch.nn.NLLLoss()
meta_train_kloader=KShotLoader(meta_train_ds,shots=shots,ways=ways,num_tasks=1000,
                               classes=None)

In [161]:
# Number of epochs, tasks per step and number of fast_adaptation steps 
n_epochs=300
task_count=32
fas = 5

Note: In practice we use more than one gradient step for adaptation; this is called 'fast adaptation'.

In [164]:
epoch=0
while epoch<n_epochs:
    adapt_loss = 0.0
    test_acc = 0.0
    # Sample and train on a task
    for task in range(task_count):
        d_train,d_test=meta_train_kloader.get_task()
        learner = maml.clone()
        for fas_step in range(fas):
            train_preds = learner(d_train[0])
            train_loss = lossfn(train_preds,d_train[1])
            learner.adapt(train_loss)
        test_preds = learner(d_test[0])
        adapt_loss += lossfn(test_preds,d_test[1])
        learner.eval()
        test_acc += models.accuracy(learner,d_test[0],d_test[1],verbose=False)
        learner.train()
        # Done with a task
    # Update main network
    print('Epoch  % 2d Loss: %2.5e Avg Acc: %2.5f'%(epoch,adapt_loss/task_count,test_acc/task_count))
    display.clear_output(wait=True)
    optimizer.zero_grad()
    total_loss = adapt_loss
    total_loss.backward()
    optimizer.step()
    epoch+=1
    

Epoch   299 Loss: 3.72539e-01 Avg Acc: 0.89000
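
The structure of this loop can be seen in isolation on a toy problem small enough to differentiate by hand. The sketch below is not the notebook's learn2learn code: it meta-learns a single scalar weight for tasks of the form $y = a\,x$, tracks the derivative of the adapted weight with respect to the initial weight through each inner step, and uses plain gradient descent (not Adam) for the outer update. All names and constants are hypothetical choices for illustration.

```python
# Toy scalar MAML: model y = w * x, task i has true slope a_i, loss is MSE.

def mean_sq(xs):
    return sum(x * x for x in xs) / len(xs)

def adapt(w, a, c_train, lr, steps):
    """Run `steps` inner gradient steps; also return d(adapted w)/d(initial w)."""
    w_fast, dw_fast_dw = w, 1.0
    for _ in range(steps):
        grad = 2.0 * c_train * (w_fast - a)        # d/dw of c_train * (w - a)^2
        w_fast = w_fast - lr * grad
        dw_fast_dw *= 1.0 - lr * 2.0 * c_train     # chain rule through the step
    return w_fast, dw_fast_dw

slopes = [1.0, 2.0, 3.0]                 # one task per slope
c_train = mean_sq([0.5, 1.0, 1.5])       # MSE on y = a*x reduces to c * (w - a)^2
c_test = mean_sq([1.0, 2.0])
inner_lr, outer_lr, fas = 0.1, 0.05, 2

w = 0.0                                  # meta-parameter (the shared initialisation)
for epoch in range(200):
    meta_grad = 0.0
    for a in slopes:                     # loop over the tasks in this meta-batch
        w_fast, dw_fast_dw = adapt(w, a, c_train, inner_lr, fas)
        # d/dw of the task's test loss c_test * (w_fast - a)^2, via the chain rule
        meta_grad += 2.0 * c_test * (w_fast - a) * dw_fast_dw
    w -= outer_lr * meta_grad / len(slopes)   # outer update on the original weight

print(round(w, 4))
```

With these three symmetric tasks the meta-learned weight settles at 2.0, the mean slope, which is the initialisation from which a few inner steps reach every task equally well.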


Now test the trained MAML network by applying the adaptation steps to tasks sampled from the meta_test_ds dataset:

In [165]:
meta_test_kloader=KShotLoader(meta_test_ds,shots=shots,ways=ways,
                              classes=None)
test_acc = 0.0
task_count = 20
adapt_steps = 5
maml.eval()
# Sample and train on a task
for task in range(task_count):
    d_train,d_test=meta_test_kloader.get_task()
    learner = maml.clone()
    learner.eval()
    for adapt_step in range(adapt_steps):
        train_preds = learner(d_train[0])
        train_loss = lossfn(train_preds,d_train[1])
        learner.adapt(train_loss)
    test_preds = learner(d_test[0])
    test_acc += models.accuracy(learner,d_test[0],d_test[1],verbose=False)
    # Done with a task
#learner.train()
print('Avg Acc: %2.5f'%(test_acc/task_count))

Avg Acc: 0.87000
