# Assignment 3
## Question 1: Siamese networks & one-shot learning (8pt)
The Cifar-100 dataset is similar to the Cifar-10 dataset. It also consists of 60,000 32x32 RGB images, but they are distributed over 100 classes instead of 10. Thus, each class has far fewer examples: only 500 training images and 100 testing images per class. For more info about the dataset, see https://www.cs.toronto.edu/~kriz/cifar.html.

*HINT: Import the Cifar-100 dataset directly from Keras, no need to download it from the website. Use* `label_mode="fine"`

### Task 1.1: Siamese network
**a)**
* Train a Siamese Network on the first 80 classes of (the training set of) Cifar-100, i.e. let the network predict the probability that two input images are from the same class. Use 1 as a target for pairs of images from the same class (positive pairs), and 0 for pairs of images from different classes (negative pairs). Randomly select image pairs from Cifar-100, but make sure you train on as many positive pairs as negative pairs.

* Evaluate the performance of the network on 20-way one-shot learning tasks. Do this by generating 250 random tasks and obtain the average accuracy for each evaluation round. Use the remaining 20 classes that were not used for training. The model should perform better than random guessing.

For this question you may ignore the test set of Cifar-100; it suffices to use only the training set and split this, using the first 80 classes for training and the remaining 20 classes for one-shot testing.

*HINT: First sort the data by their labels (see e.g.* `numpy.argsort()`*), then reshape the data to a shape of* `(n_classes, n_examples, width, height, depth)`*, similar to the Omniglot data in Practical 4. It is then easier to split the data by class, and to sample positive and negative image pairs for training the Siamese network.*

*NOTE: do not expect the one-shot accuracy for Cifar-100 to be similar to the accuracy for Omniglot; a lower accuracy can be expected. However, an accuracy higher than random guessing is certainly achievable.*
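
The sort-and-reshape step from the hint above can be sketched on toy data as follows. This is a minimal illustration, not the assignment's reference solution: `group_by_class` is a hypothetical helper name, the toy arrays stand in for Cifar-100, and the sketch assumes every class has the same number of examples.

```python
import numpy as np

def group_by_class(x, y, n_classes):
    """Sort images by label and reshape to (n_classes, n_examples, ...).

    Hypothetical helper; assumes every class has the same number of examples.
    """
    y = np.asarray(y).ravel()               # labels may arrive with shape (n, 1)
    order = np.argsort(y, kind="stable")    # indices that sort the labels
    x_sorted = x[order]
    n_examples = len(y) // n_classes
    return x_sorted.reshape((n_classes, n_examples) + x.shape[1:])

# Toy stand-in for Cifar-100: 4 classes, 3 images each, 2x2 RGB "images"
rng = np.random.default_rng(0)
y_toy = rng.permutation(np.repeat(np.arange(4), 3)).reshape(-1, 1)
# every pixel of an image is set to its label, so the grouping is easy to check
x_toy = np.broadcast_to(y_toy.reshape(-1, 1, 1, 1), (12, 2, 2, 3)).copy()

grouped = group_by_class(x_toy, y_toy, n_classes=4)
# split by class: first 3 classes for training, the last one held out
train_classes, oneshot_classes = grouped[:3], grouped[3:]
print(grouped.shape)
```

Flattening the labels before calling `argsort` avoids the extra singleton axis that appears when an `(n, 1)` index array is used to index the images directly.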

In [2]:
from keras.layers import Input, Conv2D, Lambda, Dense, Flatten, MaxPooling2D, Dropout, BatchNormalization
from keras.models import Model, Sequential
from keras.regularizers import l2
from keras import backend as K
from keras.losses import binary_crossentropy
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

In [53]:
input_shape = (32, 32, 3)
left_input = Input(input_shape)
right_input = Input(input_shape)

# build convnet to use in each siamese 'leg'
convnet = Sequential()
convnet.add(Conv2D(32, (3,3), activation='relu', input_shape=input_shape, kernel_regularizer=l2(2e-4)))
convnet.add(MaxPooling2D())
convnet.add(BatchNormalization())
convnet.add(Dropout(0.25))
convnet.add(Conv2D(64, (2,2), activation='relu', kernel_regularizer=l2(2e-4)))
convnet.add(MaxPooling2D())
convnet.add(BatchNormalization())
convnet.add(Dropout(0.25))
convnet.add(Conv2D(64, (2,2), activation='relu', kernel_regularizer=l2(2e-4)))
convnet.add(MaxPooling2D())
convnet.add(BatchNormalization())
convnet.add(Dropout(0.25))
convnet.add(Conv2D(128, (1,1), activation='relu', kernel_regularizer=l2(2e-4)))
convnet.add(Flatten())
convnet.add(BatchNormalization())
convnet.add(Dropout(0.25))
convnet.add(Dense(1024, activation="sigmoid", kernel_regularizer=l2(1e-3)))
convnet.summary()

# encode each of the two inputs into a vector with the convnet
encoded_l = convnet(left_input)
encoded_r = convnet(right_input)

# merge two encoded inputs with the L1 distance between them, and connect to prediction output layer
L1_distance = lambda x: K.abs(x[0]-x[1])
both = Lambda(L1_distance)([encoded_l, encoded_r])
prediction = Dense(1, activation='sigmoid')(both)
siamese_net = Model(inputs=[left_input,right_input], outputs=prediction)


siamese_net.compile(loss="binary_crossentropy", optimizer="adam")

siamese_net.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_70 (Conv2D)           (None, 30, 30, 32)        896       
_________________________________________________________________
max_pooling2d_61 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
batch_normalization_75 (Batc (None, 15, 15, 32)        128       
_________________________________________________________________
dropout_60 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_71 (Conv2D)           (None, 14, 14, 64)        8256      
_________________________________________________________________
max_pooling2d_62 (MaxPooling (None, 7, 7, 64)          0         
_________________________________________________________________
batch_normalization_76 (Batc (None, 7, 7, 64)          256       
__________

In [58]:
from keras.datasets import cifar100
import numpy as np

(x_train, y_train), (x_test, y_test) = cifar100.load_data(label_mode='fine')

x_train_sorted = x_train[np.argsort(y_train, axis=0)]
y_train_sorted = np.sort(y_train, axis=0)

x_train_sorted = np.reshape(x_train_sorted, (100, 500, 32, 32, 3))

x_train_80, x_train_20 = x_train_sorted[:80], x_train_sorted[80:]
# labels are still per image (shape (50000, 1)), so split at 80 classes * 500 images
y_train_80, y_train_20 = y_train_sorted[:80 * 500], y_train_sorted[80 * 500:]

print(x_train_80.shape)
# batch_size = 80
# n_examples, width, height, depth = x_train.shape

def get_batch(batch_size, X):
    """Create batch of n pairs, half same class, half different class"""
    n_classes, n_examples, w, h, z = X.shape
    # randomly sample several classes to use in the batch
    categories = np.random.choice(n_classes, size=(batch_size,), replace=False)
    # initialize 2 empty arrays for the input image batch
    pairs = [np.zeros((batch_size, h, w, z)) for i in range(2)]
    # initialize vector for the targets, and make one half of it '1's, so 2nd half of batch has same class
    targets = np.zeros((batch_size,))
    targets[batch_size//2:] = 1
    for i in range(batch_size):
        category = categories[i]
        idx_1 = np.random.randint(0, n_examples)
        pairs[0][i, :, :, :] = X[category, idx_1].reshape(w, h, z)
        idx_2 = np.random.randint(0, n_examples)
        # pick images of the same class for the 2nd half of the batch, different classes for the 1st
        if i >= batch_size // 2:
            category_2 = category
        else:
            #add a random number to the category modulo n_classes to ensure 2nd image has different category
            category_2 = (category + np.random.randint(1,n_classes)) % n_classes
        pairs[1][i, :, :, :] = X[category_2,idx_2].reshape(w, h, z)
    return pairs, targets

def batch_generator(batch_size, X):
    """a generator for batches, so model.fit_generator can be used. """
    while True:
        pairs, targets = get_batch(batch_size, X)
        yield (pairs, targets)

def train(model, X_train, batch_size=64, steps_per_epoch=100, epochs=10):
    model.fit_generator(batch_generator(batch_size, X_train), steps_per_epoch=steps_per_epoch, epochs=epochs)

train(siamese_net, x_train_80)

(80, 500, 32, 32, 3)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [60]:
def make_oneshot_task(N, X, c, language=None):
    """Create pairs of (test image, support set image) with ground truth, for testing N-way one-shot learning."""
    n_classes, n_examples, w, h, z = X.shape
    indices = np.random.randint(0, n_examples, size=(N,))
    if language is not None:
        low, high = c[language]
        if N > high - low:
            raise ValueError("This language ({}) has less than {} letters".format(language, N))
        categories = np.random.choice(range(low,high), size=(N,), replace=False)
    else:  # if no language is specified, just pick N random classes
        categories = np.random.choice(range(n_classes), size=(N,), replace=False)            
    true_category = categories[0]
    ex1, ex2 = np.random.choice(n_examples, replace=False, size=(2,))
    test_image = np.asarray([X[true_category, ex1, :, :]]*N).reshape(N, w, h, z)
    support_set = X[categories, indices, :, :]
    support_set[0, :, :] = X[true_category, ex2]
    support_set = support_set.reshape(N, w, h, z)
    targets = np.zeros((N,))
    targets[0] = 1
    targets, test_image, support_set = shuffle(targets, test_image, support_set)
    pairs = [test_image, support_set]
    return pairs, targets

def test_oneshot(model, X, c, N=20, k=250, language=None, verbose=True):
    """Test average N-way oneshot learning accuracy of a siamese neural net over k one-shot tasks."""
    n_correct = 0
    if verbose:
        print("Evaluating model on {} random {}-way one-shot learning tasks ...".format(k, N))
    for i in range(k):
        inputs, targets = make_oneshot_task(N, X, c, language=language)
        probs = model.predict(inputs)
        if np.argmax(probs) == np.argmax(targets):
            n_correct += 1
    percent_correct = (100.0*n_correct / k)
    if verbose:
        print("Got an average of {}% accuracy for {}-way one-shot learning".format(percent_correct, N))
    return percent_correct

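# The task asks for evaluation on the 20 classes that were not used for training.
# c is only used when a language is given, so None suffices here.
# (The logged output below was produced by a run on x_train_80.)
test_oneshot(siamese_net, x_train_20, None)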

Evaluating model on 250 random 20-way one-shot learning tasks ...
Got an average of 11.6% accuracy for 20-way one-shot learning


11.6

***

**b)** Briefly motivate your model's architecture, as well as its performance. What accuracy would random guessing achieve (on average)?

**Answer:**


The model used for the one-shot learning task consists of 4 convolutional layers with 32, 64, 64 and 128 filters respectively. This number of filters was needed to capture enough image features to obtain an accuracy higher than random guessing. Each of the first three convolutional layers is followed by a 2x2 max-pooling layer to reduce the spatial dimensions, batch normalization to speed up training, and dropout to reduce overfitting on the training data. The last convolutional layer is followed by a Flatten layer (again with batch normalization and dropout) and then a dense layer with 1024 neurons and sigmoid activation, which bounds every element of an image's output vector between 0 and 1. This convnet is shared by the left and right legs of the Siamese network. The element-wise L1 distance between the two encodings is fed into a dense layer with one neuron and sigmoid activation, because the problem is binary classification: either the two images are from the same class or they are not. For the same reason, the loss function is binary cross-entropy.
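
To make the final step concrete, the following NumPy sketch shows how two encodings could be combined into a same-class probability via the element-wise L1 distance and a single sigmoid unit. The function `siamese_head`, the random encodings, and the hand-picked weights and bias are assumptions for illustration only; in the trained network these weights are learned.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def siamese_head(code_left, code_right, weights, bias):
    """Same-class probability from two encodings (hypothetical helper)."""
    l1 = np.abs(code_left - code_right)     # element-wise L1 distance
    return sigmoid(l1 @ weights + bias)     # single sigmoid output unit

rng = np.random.default_rng(0)
code_a = rng.random(8)      # stand-ins for convnet encodings, values in [0, 1)
code_b = rng.random(8)
weights = -np.ones(8)       # assumed: a larger distance lowers the probability
bias = 2.0                  # assumed value, not taken from the trained model

p_same = siamese_head(code_a, code_a, weights, bias)   # identical encodings
p_diff = siamese_head(code_a, code_b, weights, bias)   # different encodings
print(p_same, p_diff)
```

Because the absolute difference is symmetric, swapping the left and right inputs gives the same prediction.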

In each of the 250 random tasks there are 20 candidate classes of which only 1 is correct, so random guessing would achieve an accuracy of 1/20 = 5% on average. The model reaches 11.6%, which is more than twice as high.
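
As a rough check on how far 11.6% lies above chance, the sketch below computes the chance level for N-way tasks and the standard error of an accuracy estimated from 250 tasks. It assumes the tasks are independent and that the binomial standard error is an adequate approximation; the helper names are hypothetical.

```python
import math

def chance_accuracy(n_way):
    """Expected accuracy of uniform random guessing in an N-way task."""
    return 1.0 / n_way

def standard_error(p, n_tasks):
    """Binomial standard error of an accuracy estimated over n_tasks tasks."""
    return math.sqrt(p * (1.0 - p) / n_tasks)

p_chance = chance_accuracy(20)               # 1/20
se = standard_error(p_chance, 250)           # spread of pure guessing over 250 tasks
z = (0.116 - p_chance) / se                  # distance of 11.6% from chance, in standard errors
print("chance = {:.3f}, se = {:.4f}, z = {:.2f}".format(p_chance, se, z))
```

Under these assumptions the reported accuracy lies between four and five standard errors above the chance level.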

***

**c)** Compare the performance of your Siamese network for Cifar-100 to the Siamese network from Practical 4 for Omniglot. Name three fundamental differences between the Cifar-100 and Omniglot datasets. How do these differences influence the difference in one-shot accuracy?

**Answer:**

The first major difference is resolution: Omniglot images are 105x105 pixels, while Cifar-100 images are only 32x32. Because the Cifar-100 images are much smaller they contain less detail, making it harder to extract global patterns that identify the class.

The second difference is that Omniglot images are greyscale, with a single value between 0 and 1 per pixel, while every pixel in a Cifar-100 image has an RGB value. Colour adds variation, so images of the same class may look less similar to each other, which makes correct classification more difficult.

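The third and last difference is the class structure. Omniglot contains far more classes (1623 characters from 50 alphabets) with only 20 examples each, while Cifar-100 has 100 classes with 600 images each, of which 80 classes are used for training and 20 for evaluating the model. With only 80 classes to train on, the Siamese network sees fewer distinct class contrasts, which likely makes it harder to learn a similarity measure that generalizes to unseen classes.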

***

### Task 1.2: One-shot learning with neural codes
**a)**
* Train a CNN classifier on the first 80 classes of Cifar-100. Make sure it achieves at least 40% classification accuracy on those 80 classes (use the test set to validate this accuracy).
* Then use neural codes from one of the later hidden layers of the CNN with L2-distance to evaluate one-shot learning accuracy for the remaining 20 classes of Cifar-100. I.e. for a given one-shot task, obtain neural codes for the test image as well as the support set. Then pick the image from the support set that is closest (in L2-distance) to the test image as your one-shot prediction.
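
The one-shot procedure described above can be sketched as follows, assuming the neural codes have already been computed and arranged in an array of shape `(n_classes, n_examples, code_dim)`. The function name `oneshot_accuracy_from_codes` and the synthetic codes are hypothetical; with real data the codes would come from a hidden layer of the trained CNN.

```python
import numpy as np

def oneshot_accuracy_from_codes(codes, n_way=20, n_tasks=250, seed=0):
    """Average N-way one-shot accuracy using L2 distance between codes.

    codes: array of shape (n_classes, n_examples, code_dim).
    """
    rng = np.random.default_rng(seed)
    n_classes, n_examples, _ = codes.shape
    n_correct = 0
    for _ in range(n_tasks):
        # pick N distinct classes; the first one is the class of the test image
        classes = rng.choice(n_classes, size=n_way, replace=False)
        # two different examples of the true class: test image and its support image
        test_idx, support_idx = rng.choice(n_examples, size=2, replace=False)
        support_examples = rng.integers(0, n_examples, size=n_way)
        support_examples[0] = support_idx
        support = codes[classes, support_examples]      # shape (n_way, code_dim)
        test_code = codes[classes[0], test_idx]
        # predict the support image closest to the test image in L2 distance
        dists = np.linalg.norm(support - test_code, axis=1)
        if np.argmin(dists) == 0:
            n_correct += 1
    return n_correct / n_tasks

# Synthetic, well-separated codes: 20 classes, 5 examples each, 16 dimensions
rng = np.random.default_rng(1)
centers = 10.0 * rng.normal(size=(20, 1, 16))
toy_codes = centers + 0.1 * rng.normal(size=(20, 5, 16))
acc = oneshot_accuracy_from_codes(toy_codes, n_way=20, n_tasks=50)
print(acc)
```

On these synthetic codes the classes are far apart, so the accuracy should be close to 1; real neural codes will be much less cleanly separated.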

In [92]:
from keras.datasets import cifar100
from keras.utils import to_categorical
import numpy as np

(x_train, y_train), (x_test, y_test) = cifar100.load_data(label_mode='fine')

x_train_sorted = x_train[np.argsort(y_train, axis=0)]
y_train_sorted = np.sort(y_train, axis=0)

x_test_sorted = x_test[np.argsort(y_test, axis=0)]
y_test_sorted = np.sort(y_test, axis=0)

x_train_sorted = np.reshape(x_train_sorted, (50000, 32, 32, 3))
x_test_sorted = np.reshape(x_test_sorted, (10000, 32, 32, 3))

x_train_80, x_train_20 = x_train_sorted[:40000], x_train_sorted[40000:]
y_train_80, y_train_20 = y_train_sorted[:40000], y_train_sorted[40000:]
x_test_80, x_test_20 = x_test_sorted[:8000], x_test_sorted[8000:]
y_test_80, y_test_20 = y_test_sorted[:8000], y_test_sorted[8000:]

cnn = Sequential()

cnn.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 3)))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(BatchNormalization())
cnn.add(Dropout(0.5))

cnn.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(BatchNormalization())
cnn.add(Dropout(0.5))

cnn.add(Flatten())

cnn.add(Dense(1024, activation='relu'))
cnn.add(Dense(512, activation='relu', name="neural_codes"))
cnn.add(BatchNormalization())
cnn.add(Dropout(0.5))
cnn.add(Dense(80, activation='softmax'))

cnn.summary()

cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
cnn.fit(x_train_80, to_categorical(y_train_80), epochs=15, batch_size=256)
cnn.evaluate(x_test_80, to_categorical(y_test_80))


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_80 (Conv2D)           (None, 30, 30, 32)        896       
_________________________________________________________________
max_pooling2d_70 (MaxPooling (None, 15, 15, 32)        0         
_________________________________________________________________
batch_normalization_88 (Batc (None, 15, 15, 32)        128       
_________________________________________________________________
dropout_73 (Dropout)         (None, 15, 15, 32)        0         
_________________________________________________________________
conv2d_81 (Conv2D)           (None, 13, 13, 64)        18496     
_________________________________________________________________
max_pooling2d_71 (MaxPooling (None, 6, 6, 64)          0         
_________________________________________________________________
batch_normalization_89 (Batc (None, 6, 6, 64)          256       
__________

[2.735379447221756, 0.318875]

In [None]:
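from sklearn.neighbors import NearestNeighbors

# model that outputs the activations of the "neural_codes" layer
neural_codes_model = Model(inputs=cnn.input, outputs=cnn.get_layer("neural_codes").output)

# NOTE: the support set here is every training image of the 20 held-out classes
# (500 per class), so this is a 1-nearest-neighbour search rather than a
# one-image-per-class one-shot task
support_set = neural_codes_model.predict(x_train_20)
neigh = NearestNeighbors(n_neighbors=1, p=2)  # p=2: Euclidean (L2) distance
neigh.fit(support_set)

correct = 0
one_shots = neural_codes_model.predict(x_test_20)
for idx, one_shot in enumerate(one_shots):
    distances, indexes = neigh.kneighbors([one_shot])
    # indexes has shape (1, 1); compare scalar labels
    if y_test_20[idx, 0] == y_train_20[indexes[0, 0], 0]:
        correct += 1

print("Accuracy is {}".format(correct/len(one_shots)))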



***

**b)** Briefly motivate your CNN architecture, and discuss the difference in one-shot accuracy between the Siamese network approach and the CNN neural codes approach.

**Answer:**

The CNN architecture consists of 2 convolutional layers with 32 and 64 filters respectively, in order to extract enough features from the dataset. Each convolutional layer is followed by a max-pooling layer to reduce dimensionality, a batch normalization layer to speed up training, and a dropout layer with a rate of 0.5. Such a high rate is needed to limit overfitting on the training data (which still occurs). Next comes a Flatten layer, which produces a single vector, followed by 2 hidden dense layers with 1024 and 512 neurons respectively, both with ReLU activation. Then there is another batch normalization and dropout layer, and finally a layer with 80 neurons (1 for each possible class) with softmax activation. Softmax is chosen together with the categorical cross-entropy loss function because this is a multi-class classification problem.
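
The pairing of softmax with categorical cross-entropy can be illustrated with a small NumPy sketch. The three-class logits and labels are made-up values, and the functions are simplified stand-ins rather than the Keras implementations.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into class probabilities that sum to 1 per row."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(probs, one_hot):
    """Negative log-probability assigned to the true class, per example."""
    return -np.sum(one_hot * np.log(probs + 1e-12), axis=-1)

logits = np.array([[2.0, 0.5, -1.0],    # confident and correct (true class 0)
                   [0.0, 0.0, 0.0]])    # uniform guess (true class 2)
labels = np.array([0, 2])
one_hot = np.eye(3)[labels]             # same role as to_categorical

probs = softmax(logits)
loss = categorical_crossentropy(probs, one_hot)
print(probs.sum(axis=1), loss)
```

The uniform prediction yields a loss of log(3), while the confident correct prediction yields a smaller loss, which is the behaviour the optimizer exploits during training.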

The difference between the two approaches is that the Siamese network is trained on pairs of images that are either from the same class (target 1) or from different classes (target 0), and directly predicts whether a pair matches. In the CNN approach, a neural code is generated for each image and compared with the codes of the support set; the class of the closest support image (in L2 distance) is assigned as the prediction.