# CE-40959: Advanced Machine Learning
## HW1 - Black-box Meta Learning (100 points)

#### Name: Mohammad Mozafari
#### Student No: 400201167

In this notebook, you are going to implement a black-box meta learner using the `Omniglot` dataset.

Please write your code in the specified sections and do not change anything else. If you have a question regarding this homework, please ask it on Quera.

Also, it is recommended to use Google Colab to do this homework. You can connect to your drive using the code below:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Import Required libraries

In [1]:
import numpy as np
import os
import matplotlib.pyplot as plt
import torch
import torchvision
import random
import torch.nn as nn
import math

import torch.nn.functional as F
import torchvision.transforms as transforms
import torch.optim as optim
import torch.utils.data as data

## Introduction

In the meta-learning literature, the meta-training phase provides batches that consist of `support` and `query` sets. You train the model so that, given a support set, it predicts the query set labels correctly.

In this homework, you are going to implement such a meta-learner with the architecture shown below. At each step, you feed all the support images and one query image to the network as a single sequence (query at the end), and the model is expected to predict the query label from these inputs.


<br><br>

<div style="text-align:center;"><img src="https://drive.google.com/uc?export=view&id=1Au9GF7FB_IChrMLmvM0z4RBPP1R3oPgY" width=300></div>

<br><br>

Don't worry if the architecture is not clear yet; we are going to explain it step by step.

So if the meta-learning setup is K-shot N-way, each task in a batch consists of N*K support images with their labels and one query image, whose label is known during the meta-training phase.

First, we should build the dataset so that each task returns N*K+1 images.

The Omniglot data set is designed for developing more human-like learning algorithms. It contains 1623 different handwritten characters from 50 different alphabets. Each of the 1623 characters was drawn online via Amazon's Mechanical Turk by 20 different people.

The train and test datasets contain 964 and 659 classes, respectively. The torchvision Omniglot dataset is ordered: every 20 consecutive images belong to one class.
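
Because of this ordering, the class of a sample follows from its flat index by integer division, which is what the `np.repeat` label arrays built later rely on. A minimal sketch, assuming exactly 20 images per class (the helper names `class_of` and `indices_of` are illustrative and not part of torchvision):

```python
import numpy as np

IMAGES_PER_CLASS = 20  # each Omniglot character was drawn by 20 people

def class_of(index):
    """Class id of the sample at a flat dataset index."""
    return index // IMAGES_PER_CLASS

def indices_of(class_id):
    """All flat dataset indices that belong to one class."""
    start = class_id * IMAGES_PER_CLASS
    return np.arange(start, start + IMAGES_PER_CLASS)

# Same construction as the label arrays used in this notebook.
labels = np.repeat(np.arange(964), IMAGES_PER_CLASS)

print(class_of(45), labels[45])  # both are 2
print(indices_of(2)[[0, -1]])    # first and last index of class 2: [40 59]
```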

In [2]:
# Meta learning parameters.

N = 5
K = 1

## Prepare dataset (25 points)

In [3]:
transform = transforms.Compose([
    transforms.Resize(28),
    transforms.ToTensor()
])

train_dataset = torchvision.datasets.Omniglot('./data/omniglot/', download = True, background = True, transform = transform)
test_dataset = torchvision.datasets.Omniglot('./data/omniglot/', download = True, background = False, transform = transform)

train_labels = np.repeat(np.arange(964), 20)
test_labels = np.repeat(np.arange(659), 20)

Downloading https://raw.githubusercontent.com/brendenlake/omniglot/master/python/images_background.zip to ./data/omniglot/omniglot-py/images_background.zip


Extracting ./data/omniglot/omniglot-py/images_background.zip to ./data/omniglot/omniglot-py
Downloading https://raw.githubusercontent.com/brendenlake/omniglot/master/python/images_evaluation.zip to ./data/omniglot/omniglot-py/images_evaluation.zip


Extracting ./data/omniglot/omniglot-py/images_evaluation.zip to ./data/omniglot/omniglot-py


To build a dataloader, we need a class that yields the indexes of the selected samples at every iteration; an instance of it is passed to the `batch_sampler` argument of the dataloader.

Complete the code below based on this pseudocode:


1.   select `N` classes randomly from all classes
2.   select `1` class from the `N` selected classes as the query-containing class
3.   select `K` images from each of the other `N-1` classes independently and randomly
4.   select `K+1` images from the query-containing class independently and randomly
5.   shuffle the dataset indexes, but don't forget to put the query index at the end of the list
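
The five steps can be sketched for a single task as follows. This is only an illustration under stated assumptions: the function name `sample_episode` and its parameters are hypothetical, classes are assumed to occupy 20 consecutive indices each, and images are drawn without replacement within a class so that the query never coincides with a support image (a design choice, not something the pseudocode requires).

```python
import numpy as np

def sample_episode(num_classes, n_way, k_shot, images_per_class=20, rng=None):
    """Return n_way * k_shot + 1 dataset indices; the query index is last."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: pick N distinct classes.
    classes = rng.choice(num_classes, size=n_way, replace=False)
    # Step 2: pick the class that also provides the query.
    query_class = rng.choice(classes)
    support, query = [], None
    for c in classes:
        # Steps 3-4: K images per class, K+1 for the query class.
        count = k_shot + 1 if c == query_class else k_shot
        picked = c * images_per_class + rng.choice(images_per_class, size=count, replace=False)
        if c == query_class:
            query = picked[-1]
            picked = picked[:-1]
        support.extend(picked.tolist())
    # Step 5: shuffle the support indices only, then append the query.
    support = rng.permutation(support)
    return np.append(support, query)

episode = sample_episode(num_classes=964, n_way=5, k_shot=1, rng=np.random.default_rng(0))
print(episode.shape)  # (6,)
```

A full batch is then the concatenation of `batch_size` such index arrays.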



In [4]:
class BatchSampler(object):
    """
    BatchSampler: yield a batch of indexes at each iteration.
    __len__ returns the number of episodes per epoch (same as 'self.iterations').
    """

    def __init__(self, labels, classes_per_it, num_samples, iterations, batch_size):
        """
        Initialize the BatchSampler object
        Arguments:
        - labels: array of labels of dataset.
        - classes_per_it: number of random classes for each iteration
        - num_samples: number of samples for each iteration for each class
        - iterations: number of iterations (episodes) per epoch
        - batch_size: number of tasks (support + query sequences) per iteration
        """
        super(BatchSampler, self).__init__()
        self.labels = labels
        self.classes_per_it = classes_per_it
        self.sample_per_class = num_samples
        self.iterations = iterations
        self.batch_size = batch_size

    def __iter__(self):
        '''
        yield a batch of indexes
        '''

        for it in range(self.iterations):
            total_batch_indexes = np.array([])

            #################################################################################
            #                  COMPLETE THE FOLLOWING SECTION (25 points)                   #
            #################################################################################
            # feel free to add/edit initialization part of sampler.
            #################################################################################

            unique_labels = np.unique(self.labels)
            for _ in range(self.batch_size):
                random_classes = np.random.choice(unique_labels, size=self.classes_per_it, replace=False)
                seq = np.zeros(self.classes_per_it * self.sample_per_class + 1)
                for i, c in enumerate(random_classes):
                    count = self.sample_per_class
                    if i == len(random_classes) - 1:
                        count += 1
                    # sample without replacement so the query image never duplicates a support image
                    random_indices = 20*c + np.random.choice(20, size=count, replace=False)
                    seq[i*self.sample_per_class:i*self.sample_per_class+count] = random_indices
                np.random.shuffle(seq[:-1])
                total_batch_indexes = np.concatenate((total_batch_indexes, seq))

            #################################################################################
            #                                   THE END                                     #
            #################################################################################

            yield total_batch_indexes.astype(int)

    def __len__(self):
        return self.iterations

In [5]:
iterations = 5000
batch_size = 32

train_sampler = BatchSampler(labels=train_labels, classes_per_it=N,
                              num_samples=K, iterations=iterations,
                              batch_size=batch_size)

test_sampler = BatchSampler(labels=test_labels, classes_per_it=N,
                              num_samples=K, iterations=iterations,
                              batch_size=batch_size)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_sampler=train_sampler)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_sampler=test_sampler)

## Model (50 points)

Let's build our model. The first block of the model is an encoder, which is given below. You are going to implement the other blocks of the network following the given explanations.

In [6]:
def conv_block(in_channels, out_channels):
    return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels, momentum=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

class OmniglotNet(nn.Module):
    '''
    source: https://github.com/jakesnell/prototypical-networks/blob/f0c48808e496989d01db59f86d4449d7aee9ab0c/protonets/models/few_shot.py#L62-L84
    '''
    def __init__(self, x_dim=1, hid_dim=64, z_dim=64):
        super(OmniglotNet, self).__init__()
        self.encoder = nn.Sequential(
            conv_block(x_dim, hid_dim),
            conv_block(hid_dim, hid_dim),
            conv_block(hid_dim, hid_dim),
            conv_block(hid_dim, z_dim)
        )

    def forward(self, x):
        x = self.encoder(x)
        return x.view(x.size(0), -1)
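
As a quick check on the encoder's output size, the spatial resolution can be traced by hand. The sketch below is plain arithmetic rather than a call into the model; the function name `encoder_output_size` is illustrative, and the default of 28 assumes the `Resize(28)` transform used in this notebook.

```python
def encoder_output_size(image_size=28, num_blocks=4, z_dim=64):
    """Flattened feature size after `num_blocks` conv blocks."""
    size = image_size
    for _ in range(num_blocks):
        # The 3x3 convolution with padding=1 keeps the spatial size;
        # MaxPool2d(2) halves it, rounding down.
        size = size // 2
    return z_dim * size * size

print(encoder_output_size())  # 64: spatial size goes 28 -> 14 -> 7 -> 3 -> 1
```

So each image is mapped to a 64-dimensional feature vector, which is why the network later starts from `64 + N` channels.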

The whole network consists of two major blocks:


1.   Causal Attention
2.   Temporal Convolution

The first block is `Causal Attention`:


<div style="text-align:center;"><img src="https://drive.google.com/uc?export=view&id=19lWuKzYTRry-UBog838o7dWYVL-r54WF" width=500></div>

<br><br>

The mechanism is very similar to self-attention (if you are not familiar with self-attention, see [this link](https://www.geeksforgeeks.org/self-attention-in-nlp/)), with one difference: the `softmax` has been replaced by a `masked softmax`. This means that at each timestep, the attention weights are computed only over the keys/values up to that timestep, never over future ones.
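
To make the masking concrete, here is a small NumPy sketch of a causal (masked) softmax on a single score matrix. It is an illustration only: the function name `causal_softmax` is hypothetical, and it leaves out the batching and the temperature scaling used in the attention block.

```python
import numpy as np

def causal_softmax(logits):
    """Row-wise softmax over a (T, T) score matrix where row t ignores columns > t."""
    T = logits.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where key index > query index
    masked = np.where(future, -np.inf, logits)
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                             # exp(-inf) = 0 for future positions
    return weights / weights.sum(axis=1, keepdims=True)

weights = causal_softmax(np.zeros((3, 3)))
# With equal scores, row t spreads its weight uniformly over positions 0..t:
# row 0 -> [1, 0, 0], row 1 -> [0.5, 0.5, 0], row 2 -> [1/3, 1/3, 1/3]
print(weights)
```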

In [7]:
class AttentionBlock(nn.Module):
    def __init__(self, in_channels, key_size, value_size):
        super(AttentionBlock, self).__init__()

        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (2.5 points)                  #
        #################################################################################
        self.key_affine = nn.Linear(in_channels, key_size)
        self.query_affine = nn.Linear(in_channels, key_size)
        self.value_affine = nn.Linear(in_channels, value_size)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################
        self.softmax_temp = math.sqrt(key_size) #don't forget to apply temperature before calculating softmax.

    def forward(self, x):
        # x is dim (N, T, in_channels) where N is the batch_size, and T is the sequence length
        # mask[t, s] is True where key position s lies in the future of query position t
        T = x.shape[1]
        mask = torch.ones(T, T, device=x.device).triu(diagonal=1).bool()

      
        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (7.5 points)                  #
        #################################################################################
        keys = self.key_affine(x)                                    # keys:    (N, T, K)
        queries = self.query_affine(x)                               # queries: (N, T, K)
        values = self.value_affine(x)                                # values:  (N, T, V)

        logits = torch.bmm(queries, keys.permute((0, 2, 1)))         # logits:  (N, T, T)
        logits = logits.masked_fill(mask.bool(), float('-inf'))      # out-of-place, so autograd tracks the masking
        weights = F.softmax(logits/self.softmax_temp, dim=2)
        output = torch.bmm(weights, values)                          # output:  (N, T, V)             

        return torch.cat((x, output), dim=2)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################

The second block is `Temporal Convolution`:

A Temporal Convolution consists of a series of `Dense Blocks` whose dilation rates increase exponentially until their receptive field exceeds the desired sequence length. For example, the first time you apply this block, the sequence length is (N*K+1) and the dilation of the first dense block is 2.
To sum up, what you will do is this:

<div style="text-align:center;"><img src="https://drive.google.com/uc?export=view&id=1_mWTFiZNQlN4sMTWp2GqolSSzNTAFJuh" width=1000></div>
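
The number of dense blocks and their dilations follow directly from the sequence length. A small sketch, assuming the kernel size of 2 used by the causal convolution in this notebook (the function name `tc_block_plan` is illustrative):

```python
import math

def tc_block_plan(sequence_length, kernel_size=2):
    """Dilation of each dense block and how many time steps the stack spans."""
    num_blocks = math.ceil(math.log2(sequence_length))
    dilations = [2 ** (i + 1) for i in range(num_blocks)]
    # Each causal convolution with kernel size k and dilation d
    # reaches (k - 1) * d time steps further into the past.
    span = 1 + sum((kernel_size - 1) * d for d in dilations)
    return dilations, span

print(tc_block_plan(5 * 1 + 1))  # ([2, 4, 8], 15) for 5-way 1-shot
```

For 5-way 1-shot the sequence length is 6, so three dense blocks are stacked and their span of 15 steps covers the whole sequence.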

<br>
Dense Block pseudocode is:
<br><br>

<div style="text-align:center;"><img src="https://drive.google.com/uc?export=view&id=1T2q6KugqBEcwSyJAAGymTaXTe__MGsv3" width=1000></div>

<br>
The `CausalConv` code is given (class `CasualConv1d` in the cell below).
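
To see why left-only padding makes the convolution causal, here is a single-channel, bias-free NumPy sketch of a kernel-size-2 dilated convolution. The function `causal_conv1d` and its two scalar weights are simplifications for illustration, not the layer used in the notebook.

```python
import numpy as np

def causal_conv1d(x, w_past, w_now, dilation):
    """y[t] = w_past * x[t - dilation] + w_now * x[t], with zeros before the sequence start."""
    padded = np.concatenate([np.zeros(dilation), x])  # left padding only
    return w_past * padded[:len(x)] + w_now * padded[dilation:]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = causal_conv1d(x, w_past=1.0, w_now=1.0, dilation=2)
print(y)  # [1. 2. 4. 6. 8.]

# Changing the last input must not affect any earlier output.
x_future = x.copy()
x_future[-1] = 100.0
y_future = causal_conv1d(x_future, w_past=1.0, w_now=1.0, dilation=2)
print(np.array_equal(y[:-1], y_future[:-1]))  # True
```

The output has the same length as the input, and each output position depends only on the current input and the one `dilation` steps earlier.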

<br>

In [8]:
class CasualConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, dilation=1):
        super(CasualConv1d, self).__init__()

        self.pad = nn.ConstantPad1d((dilation, 0), 0)
        self.conv1d = nn.Conv1d(in_channels, out_channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        return self.conv1d(self.pad(x))

class DenseBlock(nn.Module):

    def __init__(self, in_channels, out_channels, dilation):
        super().__init__()

        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (2.5 points)                  #
        #################################################################################
        self.cc1 = CasualConv1d(in_channels, out_channels, dilation)
        self.cc2 = CasualConv1d(in_channels, out_channels, dilation)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################


    def forward(self, x):

        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (5 points)                    #
        #################################################################################
        xf = self.cc1(x)
        xg = self.cc2(x)
        activation = torch.tanh(xf) * torch.sigmoid(xg)
        return torch.cat((x, activation), dim=1)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################



class TemporalConvolutionBlock(nn.Module):

    def __init__(self, sequence_length, in_channels, dense_block_out_channels):
        super().__init__()

        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (2.5 points)                  #
        #################################################################################
        dblocks = []
        for i in range(math.ceil(math.log2(sequence_length))):  # np.math is not available in recent NumPy releases
            db = DenseBlock(in_channels + i * dense_block_out_channels, dense_block_out_channels, 2 ** (i+1))
            dblocks.append(db)
        self.dense_blocks = nn.ModuleList(dblocks)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################

    def forward(self, x):

        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (10 points)                   #
        #################################################################################
        x = x.permute(0, 2, 1)
        for db in self.dense_blocks: 
            x = db(x)
        return x.permute(0, 2, 1)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################
        


In [9]:
# The general mechanism of the network is as follows:
# The input shape is (B*S, C, H, W). B is the batch size, S is the sequence length, C is the number of channels, and H and W are the height and width of the image.
# First, pass the input to the "OmniglotNet" network to get one feature vector per image. Shape: (B*S, V). V is the feature vector size.
# Then separate the B and S dimensions and concatenate the one-hot labels with the features. Shape: (B, S, V + N). N is the meta-learner parameter (number of classes per task).
# Pass it to an attention block with key size 64 and value size 32. Shape: (B, S, v1)
# Pass it to a temporal convolution block which consists of dense blocks with 128 output channels. Shape: (B, S, v2)
# Pass it to an attention block with key size 256 and value size 128. Shape: (B, S, v3)
# Pass it to a temporal convolution block which consists of dense blocks with 128 output channels. Shape: (B, S, v4)
# Pass it to an attention block with key size 512 and value size 256. Shape: (B, S, v5)
# Pass it to a linear layer with N outputs to predict labels. Shape: (B, S, N)
# Return the last index of the sequence (second dimension), which corresponds to the query. Shape: (B, N)

class Network(nn.Module):
    def __init__(self, N, K):
        super(Network, self).__init__()

        self.N = N
        self.K = K
        self.encoder = OmniglotNet()
        channels_number = 64 + N
        tc_layers = math.ceil(math.log2(N*K+1))  # number of dense blocks per temporal convolution

        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (10 points)                   #
        #################################################################################
        self.attn1 = AttentionBlock(channels_number, 64, 32)
        channels_number += 32
        self.tc1 = TemporalConvolutionBlock(N*K+1, channels_number, 128)
        channels_number += 128 * tc_layers
        
        self.attn2 = AttentionBlock(channels_number, 256, 128)
        channels_number += 128
        self.tc2 = TemporalConvolutionBlock(N*K+1, channels_number, 128)
        channels_number += 128 * tc_layers
        
        self.attn3 = AttentionBlock(channels_number, 512, 256)
        channels_number += 256
        
        self.linear = nn.Linear(channels_number, N)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################


    def forward(self, input, labels):

        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (10 points)                   #
        #################################################################################
        # input shape is (B*S, C, H, W)
        # labels shape is (B, S, N)
        # output shape is (B, N)
        # calculate output by given description
        #################################################################################
        B, S = labels.shape[:2]

        x = self.encoder(input)                       # x: (B*S, V)
        x = x.view(B, S, -1)                          # x: (B, S, V)
        x = torch.cat((x, labels), dim=2)             # x: (B, S, V+N)
        x = x.type(torch.float32)

        x = self.attn1(x)
        x = self.tc1(x)
        x = self.attn2(x)
        x = self.tc2(x)
        x = self.attn3(x)
        x = self.linear(x)

        return x[:, -1, :]
        #################################################################################
        #                                   THE END                                     #
        #################################################################################
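
Since every attention and temporal convolution block concatenates its output to its input, the channel count grows from block to block. The sketch below redoes that bookkeeping as plain arithmetic, following the description at the top of the cell; the function name `channel_trace` is illustrative.

```python
import math

def channel_trace(n_way, k_shot, feature_dim=64, tc_filters=128):
    """Channel count after each block of the network described above."""
    seq_len = n_way * k_shot + 1
    dense_blocks = math.ceil(math.log2(seq_len))  # dense blocks per temporal convolution
    trace = {}
    c = feature_dim + n_way  # encoder features plus one-hot label
    trace["input"] = c
    for name, growth in [
        ("attn1", 32),                       # value size 32
        ("tc1", tc_filters * dense_blocks),
        ("attn2", 128),                      # value size 128
        ("tc2", tc_filters * dense_blocks),
        ("attn3", 256),                      # value size 256
    ]:
        c += growth  # every block concatenates its output to its input
        trace[name] = c
    return trace

print(channel_trace(5, 1))
# {'input': 69, 'attn1': 101, 'tc1': 485, 'attn2': 613, 'tc2': 997, 'attn3': 1253}
```

For 5-way 1-shot the final linear layer therefore maps 1253 channels to 5 class scores.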

## Train (15 points)

In [10]:
def labels_to_one_hot(labels):
    # map the raw class ids of one task to indices 0..N-1 (sorted by class id)
    unique = np.unique(labels)
    label_to_idx = {label: idx for idx, label in enumerate(unique)}  # avoids shadowing the built-in `map`
    idxs = [label_to_idx[labels[i]] for i in range(labels.size)]
    one_hot = np.zeros((labels.size, unique.size))
    one_hot[np.arange(labels.size), idxs] = 1
    return one_hot, idxs
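
The training loop has to hide the query label from the network input while keeping it as the target. A minimal sketch of that step for one task, using a made-up label sequence (the helper `episode_one_hot` is hypothetical and not part of the notebook):

```python
import numpy as np

def episode_one_hot(labels):
    """One-hot encode one task's labels; the last entry is the query."""
    classes, idxs = np.unique(labels, return_inverse=True)  # relabel to 0..N-1
    one_hot = np.eye(classes.size)[idxs]
    target = idxs[-1]   # class index the network should predict
    one_hot[-1] = 0     # hide the query label from the network input
    return one_hot, target

one_hot, target = episode_one_hot(np.array([712, 15, 300, 88, 431, 300]))
print(target)       # 2
print(one_hot[-1])  # [0. 0. 0. 0. 0.]
```

Here the query shares class 300 with the third support image, so the target is that class's index among the sorted class ids.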

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

lr = 5e-4
model = Network(N, K).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
epochs = 4

criterion = nn.CrossEntropyLoss()
model.train()

for epoch in range(epochs):
    print('Epoch {}/{}'.format(epoch, epochs))
    running_loss, running_acc = 0.0, 0.0
    loader_iter = iter(train_dataloader)
    for i, batch in enumerate(loader_iter):
        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (15 points)                   #
        #################################################################################
        # prepare your data as input to your model.
        # extract query label (last image label in each batch) for loss function.
        # convert your labels to one-hot form and don't forget to set all elements of
        # one-hotted query label to zero (it's trivial that we shouldn't give
        # the output of the network to model as input!).
        # train your model.
        # save loss of each iteration
        #################################################################################
        x, y = batch
        x = x.to(device)

        # one hot encoding of labels      
        one_hots = []
        y = y.reshape((-1, K * N + 1))
        for j in range(y.shape[0]):
            vector, _ = labels_to_one_hot(y[j, :].numpy())
            one_hots.append(vector)
        y = np.stack(one_hots, axis=0)
        targets = y[:, -1, :].argmax(axis=1)
        y[:, -1, :] = 0
        y = torch.from_numpy(y).to(device)
        targets = torch.from_numpy(targets).to(device)

        output = model(x, y)
        loss = criterion(output, targets)
        running_loss += loss.item()  # .item() avoids keeping every iteration's graph alive
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        preds = output.argmax(dim=1)
        acc = (preds == targets).float().mean().item()
        running_acc += acc

        log_every_iter = 50
        if i % log_every_iter == log_every_iter - 1:
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / log_every_iter:.3f} accuracy: {running_acc / log_every_iter:.3f}')
            running_loss = 0.0
            running_acc = 0.0

        #################################################################################
        #                                   THE END                                     #
        #################################################################################

torch.save({
    'model_state_dict': model.state_dict()
}, '{}.pth'.format('snail'))

Epoch 0/4




[1,    50] loss: 1.624 accuracy: 0.226
[1,   100] loss: 1.599 accuracy: 0.234
[1,   150] loss: 1.559 accuracy: 0.286
[1,   200] loss: 1.532 accuracy: 0.303
[1,   250] loss: 1.545 accuracy: 0.289
[1,   300] loss: 1.502 accuracy: 0.337
[1,   350] loss: 1.512 accuracy: 0.308
[1,   400] loss: 1.443 accuracy: 0.346
[1,   450] loss: 1.387 accuracy: 0.368
[1,   500] loss: 1.352 accuracy: 0.403
[1,   550] loss: 1.242 accuracy: 0.472
[1,   600] loss: 1.157 accuracy: 0.499
[1,   650] loss: 1.029 accuracy: 0.569
[1,   700] loss: 0.938 accuracy: 0.602
[1,   750] loss: 0.842 accuracy: 0.651
[1,   800] loss: 0.786 accuracy: 0.667
[1,   850] loss: 0.732 accuracy: 0.697
[1,   900] loss: 0.653 accuracy: 0.732
[1,   950] loss: 0.599 accuracy: 0.761
[1,  1000] loss: 0.535 accuracy: 0.787
[1,  1050] loss: 0.471 accuracy: 0.811
[1,  1100] loss: 0.431 accuracy: 0.835
[1,  1150] loss: 0.432 accuracy: 0.822
[1,  1200] loss: 0.409 accuracy: 0.836
[1,  1250] loss: 0.411 accuracy: 0.844
[1,  1300] loss: 0.375 ac

## Test (5 points)

In [12]:
model.eval()
test_epochs = 1

total_correct = 0
total_examples = 0
for epoch in range(test_epochs):
    loader_iter = iter(test_dataloader)
    running_acc = 0.0
    for i, batch in enumerate(loader_iter):
        #################################################################################
        #                  COMPLETE THE FOLLOWING SECTION (5 points)                    #
        #################################################################################
        # report accuracy of your model.
        # plot loss values in whole training iterations.
        #################################################################################
        x, y = batch
        x = x.to(device)

        # one hot encoding of labels        
        one_hots = []
        y = y.reshape((-1, K * N + 1))
        for j in range(y.shape[0]):
            vector, _ = labels_to_one_hot(y[j, :].numpy())
            one_hots.append(vector)
        y = np.stack(one_hots, axis=0)
        targets = y[:, -1, :].argmax(axis=1)
        y[:, -1, :] = 0
        y = torch.from_numpy(y).to(device)
        targets = torch.from_numpy(targets).to(device)

        with torch.no_grad():  # gradients are not needed for evaluation
            output = model(x, y)

        preds = output.argmax(dim=1)
        correct = (preds == targets).sum().item()  # plain Python number, so the totals print cleanly
        total_correct += correct
        total_examples += preds.shape[0]
        running_acc += correct / preds.shape[0]

        log_every_iter = 50
        if i % log_every_iter == log_every_iter - 1:
            print(f'[{epoch + 1}, {i + 1:5d}] partition accuracy: {running_acc / log_every_iter:.3f}')
            running_acc = 0.0
        #################################################################################
        #                                   THE END                                     #
        #################################################################################
    
print('Total accuracy: {}'.format(total_correct / total_examples))



[1,    50] partition accuracy: 0.939
[1,   100] partition accuracy: 0.942
[1,   150] partition accuracy: 0.945
[1,   200] partition accuracy: 0.946
[1,   250] partition accuracy: 0.953
[1,   300] partition accuracy: 0.956
[1,   350] partition accuracy: 0.949
[1,   400] partition accuracy: 0.942
[1,   450] partition accuracy: 0.949
[1,   500] partition accuracy: 0.937
[1,   550] partition accuracy: 0.949
[1,   600] partition accuracy: 0.948
[1,   650] partition accuracy: 0.944
[1,   700] partition accuracy: 0.951
[1,   750] partition accuracy: 0.944
[1,   800] partition accuracy: 0.949
[1,   850] partition accuracy: 0.959
[1,   900] partition accuracy: 0.951
[1,   950] partition accuracy: 0.948
[1,  1000] partition accuracy: 0.933
[1,  1050] partition accuracy: 0.940
[1,  1100] partition accuracy: 0.943
[1,  1150] partition accuracy: 0.933
[1,  1200] partition accuracy: 0.953
[1,  1250] partition accuracy: 0.946
[1,  1300] partition accuracy: 0.943
[1,  1350] partition accuracy: 0.951
[

## Question (5 points)

Question) State one problem with using this network for meta-learning.
<br><br>

Answer:

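1. Finding a good architecture for the temporal convolution blocks and dense blocks, and deciding how to connect them, can be difficult and time-consuming for each new problem.
2. Model complexity increases with the sequence length N*K+1: the number of dense blocks grows as ceil(log2(N*K+1)) and the attention logits form an (N*K+1)×(N*K+1) matrix, so very long sequences can cause problems.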
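
To give a rough sense of the second point, the sketch below tabulates how the sequence length, the number of dense blocks per temporal convolution, and the size of one attention score matrix grow with N and K. It is plain arithmetic based on the definitions used in this notebook; the function name `sequence_cost` is illustrative.

```python
import math

def sequence_cost(n_way, k_shot):
    """Sequence length, dense blocks per temporal convolution, and attention score count."""
    seq_len = n_way * k_shot + 1
    dense_blocks = math.ceil(math.log2(seq_len))
    attention_scores = seq_len ** 2  # one score per (query position, key position) pair
    return seq_len, dense_blocks, attention_scores

for n_way, k_shot in [(5, 1), (5, 5), (20, 5)]:
    print((n_way, k_shot), sequence_cost(n_way, k_shot))
# (5, 1) (6, 3, 36)
# (5, 5) (26, 5, 676)
# (20, 5) (101, 7, 10201)
```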