# Tutorial 6.2: Bootstrap Your Own Latent (BYOL)

Author: [René Larisch](mailto:rene.larisch@informatik.tu-chemnitz.de)

The self-supervised approach **B**ootstrap **Y**our **O**wn **L**atent (BYOL) was introduced by [Grill et al. 2020](https://arxiv.org/abs/2006.07733).

In [None]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, Sampler
from torchvision import models, transforms
from torchvision.io import read_image
import copy

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'PyTorch version: {torch.__version__} running on {device}')


import os, sys
notebook_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(notebook_dir, ".."))
if root_path not in sys.path:
    sys.path.append(root_path)
    print(f"Added {root_path} to sys.path")

from Utils.dataloaders import prepare_UTKFace_age_task
from Utils.little_helpers import timer, set_seed, get_parameters
from Utils.functions import train_model, evaluate_model, test_model
from Utils.optimizers import LARS

set_seed(42)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tqdm.notebook import tqdm

## How BYOL works

Self-supervised approaches like SimCLR are contrastive learning methods: they increase the similarity between the representations of two similar input samples (a positive pair) and decrease the similarity between the representations of two different input samples (a negative pair). For these approaches, the selection of negative pairs is important for learning good representations. As a consequence, the performance of the network depends strongly on the batch size and on the augmentation pipeline.

Grill et al. (2020) tackle this problem by presenting a self-supervised learning approach that does not need negative pairs.

To do this, BYOL uses two networks, an online network (with the parameter set $\theta$) and a target network (with the parameter set $\xi$) (see Fig. 1). Similar to SimCLR, both networks consist of an encoder model (for example a ResNet-like network), a representation layer (here called embedding layer), and a non-linear projection head. While both networks are identical in their architecture up to and including the non-linear projection head, the online network is extended by a prediction head.

<div align="center">
    <img src="figures/BYOL_net.png" width="650"/>
    <p><i>Figure 1: BYOL uses two networks, an online and a target network. While the weights in the online network are updated with backpropagation, the weights in the target network are updated with an exponential moving average from the weights of the online network.</i></p>
</div>

For an image $x$, BYOL creates two differently augmented versions ($v$ and $v'$). It uses $v$ to calculate the output of the online network, $q_{\theta}(z_{\theta})$, and $v'$ to calculate the output of the target network, $z'_{\xi}$.
Here, $z_{\theta}$ and $z'_{\xi}$ are the outputs of the projection heads of the online and target network, respectively, and $q_{\theta}(\cdot)$ is the prediction function. The loss between the outputs of the online and target network is defined as:

$$
 \mathcal{L}_{\theta,\xi} = 2 - 2 \cdot \frac{\langle q_{\theta}(z_{\theta}), z'_{\xi} \rangle}{\|q_{\theta}(z_{\theta})\|_{2} \cdot \| z'_{\xi} \|_2}
$$
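
To make the formula concrete, here is a minimal plain-Python sketch (no PyTorch) of the form `2 - 2 * cosine similarity` that the `loss_fn` of this notebook implements. The helper names `l2_normalize`, `byol_loss` and `squared_distance` are illustrative and not part of the notebook. The sketch also checks that this value equals the squared Euclidean distance between the L2-normalized vectors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def byol_loss(prediction, target):
    """2 - 2 * cosine similarity between prediction and target."""
    p = l2_normalize(prediction)
    t = l2_normalize(target)
    return 2 - 2 * sum(a * b for a, b in zip(p, t))

def squared_distance(prediction, target):
    """Squared Euclidean distance between the normalized vectors."""
    p = l2_normalize(prediction)
    t = l2_normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t))

same = byol_loss([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = byol_loss([1.0, 0.0], [0.0, 3.0])  # perpendicular vectors
opposite = byol_loss([1.0, 1.0], [-1.0, -1.0])  # opposite direction
print(same, orthogonal, opposite)
```

The loss is close to 0 when prediction and target point in the same direction, 2 when they are orthogonal, and close to 4 when they point in opposite directions. The length of the vectors does not matter.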

While only the weight updates of the online network are based on the loss, the target network is updated as a moving average of the online network:

$$
 \xi = \tau \xi + (1- \tau)\theta
$$
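
The effect of the update rule can be illustrated with a small plain-Python sketch. It assumes, only for illustration, that the online weights stay fixed, and it uses toy lists instead of network parameters. The function name `ema_update` is hypothetical.

```python
def ema_update(target, online, tau):
    """Return tau * target + (1 - tau) * online, element by element."""
    return [tau * t + (1 - tau) * o for t, o in zip(target, online)]

tau = 0.99
online = [1.0, -2.0]   # toy online weights (kept fixed in this sketch)
target = [0.0, 0.0]    # toy target weights

history = []
for step in range(300):
    target = ema_update(target, online, tau)
    history.append(target[0])

# first entry after 1, 100 and 300 update steps
print(history[0], history[99], history[299])
```

With $\tau = 0.99$, the target moves only 1% of the remaining distance per step. After 100 steps it has covered about 63% of the distance to the online weights, which is why the target network changes much more slowly than the online network.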

The learning approach can therefore be understood as follows: the online network tries to predict the output of the target network without knowing the augmentation that was used or the parameters of the target network. The slowly changing parameters of the target network and the asymmetry between the two architectures add additional variability to the output of the target network. The authors suggest that this variability prevents the collapse to a trivial solution and forces the online network to learn robust representations.

After training, the output of the encoding network, or the embedding layer, from the online network will be used for the downstream task. 

Because BYOL does not need negative samples, the authors demonstrate that the performance on the downstream task is less sensitive to the batch size and more robust to changes in the augmentation pipeline than that of SimCLR (Fig. 2).

<div align="center">
    <img src="figures/results.png" width="650"/>
    <p><i>Figure 2: The decrease of accuracy on a downstream task (ImageNet) compared between BYOL (red) and SimCLR (blue) for different batch sizes and different combinations of augmentations.</i></p>
</div>

While BYOL has the advantage of a simpler loss function and is less sensitive to batch sizes and augmentations than SimCLR, it requires two networks, which leads to longer computation times. Additionally, BYOL requires more than twice the memory because of the two networks and the additional prediction head.


## Implementation

Instead of building one big object, we will implement the different parts (encoder model, projection head, prediction head) as separate objects, which makes it easier to change the structure of the individual parts.


### Prediction head

The prediction head (which will only be used on the online network) will be implemented as a simple multi-layer perceptron (MLP) network. 
Please note that we implement a very shallow prediction head here. A deeper network can improve the performance of BYOL, but it also slows down learning.
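
To get a feeling for the cost of a deeper head, the following sketch counts the parameters of an MLP by hand. It assumes an input and output size of 256 and a hidden size of 2048 without batch normalization. The deeper variant with two hidden layers is purely hypothetical and not used in this notebook, and the helper names are illustrative.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of weights (and biases) of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

def mlp_params(sizes):
    """sizes = [input, hidden_1, ..., output] of a plain MLP."""
    return sum(linear_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

shallow = mlp_params([256, 2048, 256])        # one hidden layer
deeper = mlp_params([256, 2048, 2048, 256])   # hypothetical: two hidden layers

print(f"shallow head: {shallow / 1e6:.2f}M parameters")
print(f"deeper head:  {deeper / 1e6:.2f}M parameters")
```

Under these assumptions, one additional hidden layer increases the number of parameters of the head roughly fivefold.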

In [None]:
from torch import nn

class PredHead(nn.Module):
    def __init__(self, dim, embedding_size=256, hidden_size=2048, batch_norm_mlp=False):
        super().__init__()
        norm = nn.BatchNorm1d(hidden_size) if batch_norm_mlp else nn.Identity()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_size),
            norm,
            nn.ReLU(inplace=True),
            nn.Linear(hidden_size, embedding_size)
        )

    def forward(self, x):
        return self.net(x)

### Projection head

The projection head consists of two linear layers, each followed by batch normalization and a ReLU activation. It receives the output of the embedding layer, which in turn gets the flattened output of the encoder model.

In [None]:
class LinearLayer(nn.Module):
    def __init__(self,
                 in_features,
                 out_features,
                 use_bias = True,
                 use_bn = False,
                 **kwargs):
        super(LinearLayer, self).__init__(**kwargs)

        self.in_features = in_features
        self.out_features = out_features
        self.use_bias = use_bias
        self.use_bn = use_bn
        
        self.linear = nn.Linear(self.in_features, 
                                self.out_features, 
                                bias = self.use_bias and not self.use_bn)
        if self.use_bn:
             self.bn = nn.BatchNorm1d(self.out_features)

    def forward(self,x):
        x = self.linear(x)
        if self.use_bn:
            x = self.bn(x)
        return x

class AddProjHead(nn.Module):
    def __init__(self, model, in_features, layer_name, hidden_size=4096,
                embedding_size = 256, batch_norm_mlp=True):
        super(AddProjHead, self).__init__()
        self.base_model = model

        encoder_model = None
        if isinstance(self.base_model, str):
            ## pretrained model (the weights API requires torchvision >= 0.13)
            if self.base_model == 'resnet50':
                encoder_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
                ## keep the pretrained weights fixed
                for p in encoder_model.parameters():
                    p.requires_grad = False
                ## replace the first convolution (newly initialized, stays trainable)
                ## and remove the max pooling and the classification layer
                encoder_model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), bias=False)
                encoder_model.maxpool = nn.Identity()
                encoder_model.fc = nn.Identity()
        else:
            encoder_model = self.base_model

        if encoder_model is None:
            raise ValueError('No valid encoder model chosen')
        self.encoder_model = encoder_model

        ## embedding layer from where we later want the embeddings for the downstream task
        self.embedding_layer = LinearLayer(in_features,embedding_size, True, True)
        self.projection =   nn.Sequential(
                LinearLayer(embedding_size,hidden_size,True, True),
                nn.ReLU(),
                LinearLayer(hidden_size,hidden_size,False,True),
                nn.ReLU(),)
        
    def forward(self, x, return_embedding = False):
        out = self.encoder_model(x)
        embedding = self.embedding_layer(out)
        if return_embedding:
            return embedding
        return self.projection(embedding)

### Encoder network
Here we implement a small convolutional neural network (as used in the other notebooks). A more common encoder network is ResNet.
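
The input size of the first fully connected layer of this network (`64*14*14`) follows from the image size. The sketch below traces the spatial size through three blocks of a 3x3 convolution (no padding, stride 1) and a 2x2 max pooling, assuming 128x128 input images as produced by the augmentation pipeline later in this notebook. The helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max pooling layer."""
    return (size - kernel) // stride + 1

size = 128      # height and width of the input image
trace = []
for _ in range(3):
    size = conv_out(size, kernel=3)
    trace.append(size)
    size = pool_out(size)
    trace.append(size)

flat_features = 64 * size * size   # 64 channels after the last convolution
print(trace, flat_features)
```

If the input resolution changes, the first fully connected layer has to be adapted accordingly.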

In [None]:
class Net(nn.Module):
    def __init__(self, lat_dim):
        super().__init__()
        ## define the network structure with the layers
        self.conv1 = nn.Conv2d(3,256,3) # in_channels, out_channels, kernel_size
        self.pool  = nn.MaxPool2d(2,2) # kernel_size, stride
        self.dropout = nn.Dropout(0.2) # dropout probability
        self.conv2 = nn.Conv2d(256,128,3) # in_channels, out_channels, kernel_size
        self.conv3 = nn.Conv2d(128,64,3) # in_channels, out_channels, kernel_size
        self.fc1   = nn.Linear(64*14*14, 120) # in_features, out_features (14x14 feature maps for 128x128 inputs)
        self.fc2   = nn.Linear(120,1024) # in_features, out_features
        self.fc3   = nn.Linear(1024,lat_dim) # in_features, out_features

        
    def forward(self, x):
        ## define the functionality of each layer/between the layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv3(x)))
        x = torch.flatten(x,1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = F.relu(self.fc3(x))
        return x

### Exponential moving average (EMA)
Before we stitch everything together, we have to implement an object to calculate the exponential moving average to update the target network. 

In [None]:
class EMA():
    def __init__(self, alpha):
        super().__init__()
        self.alpha = alpha

    def update_average(self, old, new):
        if old is None:
            return new
        return old * self.alpha + (1 - self.alpha) * new

### The complete network
Finally, we must decide:
* whether to use batch normalization in the projection and prediction heads
* how many features should be sent to the projection head
* the size of the projection network
* the decay for the moving average
* and whether to use momentum for the moving average.

To symmetrize the loss, we additionally feed $v'$ to the online network and $v$ to the target network and compute the same loss with the roles of the two views swapped, denoted $\tilde{\mathcal{L}}_{\theta,\xi}$. Each of the two terms has the form

$$
 \mathcal{L}_{\theta,\xi} = 2 - 2 \cdot \frac{\langle q_{\theta}(z_{\theta}), z'_{\xi} \rangle}{\|q_{\theta}(z_{\theta})\|_{2} \cdot \| z'_{\xi} \|_2}
$$

and the final loss is their sum:

$$
 \mathcal{L}^{\text{BYOL}}_{\theta,\xi} = \mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}
$$
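
The pairing of the views in the symmetrized loss is easy to get wrong, so here is a minimal plain-Python sketch with toy vectors. The names `cosine_loss` and `symmetric_loss` are illustrative. The vectors stand in for the online predictions and the target projections, which in the real model come from the networks.

```python
import math

def cosine_loss(prediction, target):
    """2 - 2 * cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(prediction, target))
    norm_p = math.sqrt(sum(a * a for a in prediction))
    norm_t = math.sqrt(sum(b * b for b in target))
    return 2 - 2 * dot / (norm_p * norm_t)

def symmetric_loss(pred_v, pred_v_prime, target_v, target_v_prime):
    """pred_*: online predictions, target_*: target projections."""
    # the online network sees v, the target network sees v'
    loss = cosine_loss(pred_v, target_v_prime)
    # roles of the two views swapped
    loss_tilde = cosine_loss(pred_v_prime, target_v)
    return loss + loss_tilde

total = symmetric_loss(pred_v=[1.0, 0.0], pred_v_prime=[0.0, 1.0],
                       target_v=[1.0, 0.0], target_v_prime=[1.0, 0.0])
print(total)
```

The prediction for one view is always compared with the target projection of the other view. The symmetrized loss lies between 0 and 8.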

In [None]:
class BYOL(nn.Module):
    def __init__( self, net,
        batch_norm_mlp=True,
        layer_name = 'fc',
        in_features = 512,
        projection_size = 256,
        projection_hidden_size = 2048,
        moving_average_decay = 0.99,
        use_momentum = True):
        super().__init__()
        ## init online model with projection head
        self.online_model = AddProjHead(model = net, in_features = in_features,
                                         layer_name = layer_name,
                                         embedding_size = projection_size,
                                         hidden_size = projection_size,
                                         batch_norm_mlp = batch_norm_mlp)
        self.use_momentum = use_momentum
        ## extra function that we use to get the target (see below)
        self.target_model = self._get_target()
        self.target_ema_updater = EMA(moving_average_decay)
        ## additional prediction head for the online network
        self.online_predictor = PredHead(projection_size,projection_size, projection_hidden_size)

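    @torch.no_grad()
    def _get_target(self):
        target = copy.deepcopy(self.online_model)
        ## the target network is only updated via the moving average, never by the optimizer
        target.requires_grad_(False)
        return target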

    @torch.no_grad()
    def update_moving_average(self):
        assert self.use_momentum, 'you do not need to update the moving average, since you have turned off momentum ' \
                                  'for the target encoder '
        assert self.target_model is not None, 'target encoder has not been created yet'

        for online_params, target_params in zip(self.online_model.parameters(), self.target_model.parameters()):
            old_weight, up_weight = target_params.data, online_params.data
            target_params.data = self.target_ema_updater.update_average(old_weight, up_weight)

    ## Loss function
    def loss_fn(self, x,y):
        #L2 normalization
        x = F.normalize(x, dim=-1, p=2)
        y = F.normalize(y, dim=-1, p=2)
    
        return 2 - 2 * (x * y).sum(dim=-1) 
    
    def forward(self,image_one, image_two = None, return_embedding = False):
        if return_embedding or (image_two is None):
            return self.online_model(image_one, return_embedding=True)

        # online projections: backbone + projection
        online_proj_one = self.online_model(image_one)
        online_proj_two = self.online_model(image_two)

        # additional predictor
        online_pred_one = self.online_predictor(online_proj_one)
        online_pred_two = self.online_predictor(online_proj_two)

        with torch.no_grad():
            ## target network processes both views: backbone + projection head (no prediction head)
            target_proj_one = self.target_model(image_one).detach()
            target_proj_two = self.target_model(image_two).detach()

        ## symmetrized loss: the prediction for one view is compared with the target projection of the other view
        loss_one = self.loss_fn(online_pred_one, target_proj_two)
        loss_two = self.loss_fn(online_pred_two, target_proj_one)

        return (loss_one + loss_two).mean()

In [None]:
num_features = 512#128
byol_net = BYOL(Net(num_features), in_features=num_features, batch_norm_mlp=True).to(device)

print('Trainable Parameters in BYOL: %.3fM' % get_parameters(byol_net))

## AgeDB dataset
The AgeDB dataset, published by [Moschoglou et al. 2017](https://doi.org/10.1109/CVPRW.2017.250), contains 16,488 images of famous people at different ages.
Each image is annotated with three pieces of information: the person's name, gender, and age.

### Prepare the dataset

To run the following lines of code, make sure you have access to the dataset (e.g. download it beforehand from https://www.kaggle.com/datasets/nitingandhi/agedb-database).
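
The annotation function below reads the labels from the file names. As a minimal sketch, and assuming file names of the form `<index>_<name>_<age>_<gender>.<extension>` (the example file name is only illustrative), a single name can be parsed like this. The function name `parse_agedb_filename` is hypothetical.

```python
import os

def parse_agedb_filename(filename):
    """Split '<index>_<name>_<age>_<gender>.<ext>' into its labels."""
    stem, _ = os.path.splitext(filename)       # drop the file extension
    _, name, age, gender = stem.split('_')     # the leading index is not needed
    return name, int(age), gender

# illustrative file name, assumed to follow the pattern above
print(parse_agedb_filename('0_MariaCallas_35_f.jpg'))
```

A file name with a different number of underscores raises a `ValueError`, which makes unexpected files in the dataset directory visible early.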

In [None]:
def make_annotation_file(path_data, split):
    ## iterate through the directory and create the annotation csv files
    list_data = []

    for file in sorted(os.listdir(path_data)):
        ## file names have the form <index>_<name>_<age>_<gender>.<extension>
        _, name, age, gender = os.path.splitext(file)[0].split('_')
        list_data.append((os.path.join(path_data, file), name, age, gender))

    ## shuffle and then split the dataset into train and test
    n_samples = len(list_data)
    idx = np.random.permutation(n_samples)
    list_data = np.asarray(list_data)[idx]

    n_train= int(n_samples*split[0])
    n_test = int(n_samples*split[1])

    ##train data
    train_data = list_data[:n_train]

    ## test data
    test_data  = list_data[n_train:n_train+n_test]
    
    ## create two csv-files for the annotations
    csv_train = pd.DataFrame(train_data, columns = ['Images','Name','Age','Gender'])
    csv_test = pd.DataFrame(test_data, columns = ['Images','Name','Age','Gender'])

    ## save them for later 
    csv_train.to_csv('annot_train.csv', sep=',', index=False)
    csv_test.to_csv('annot_test.csv', sep=',', index=False)

    labels_name = csv_test['Name']
    labels_age = csv_test['Age']
    labels_gender = csv_test['Gender']

    return(labels_name, labels_age, labels_gender)

path_data = '../Dataset/AgeDB/'
split = [0.8,0.2]

labels_name, labels_age, labels_gender = make_annotation_file(path_data, split)

### Augmentation pipeline

In [None]:
import torchvision.transforms.v2 as v2
class DataSetAugment(Dataset):
    def __init__(self,phase, annotations_file, s = 0.5):
        self.phase = phase
        self.img_labels = pd.read_csv(annotations_file)
        self.s = s
        self.resize = v2.Resize((128,128))
        color_jitter = v2.ColorJitter(0.8*self.s, 0.8*self.s, 0.8*self.s, 0.2*self.s)
        self.transforms = v2.Compose([
            v2.RandomResizedCrop(size=(128,128), scale=(0.75, 1.0), ratio=(0.9, 1.05)),
            v2.RandomApply([color_jitter], p=0.3),
            v2.RandomGrayscale(p=0.2),
            v2.RandomHorizontalFlip(),
            v2.RandomApply([v2.GaussianBlur((3,3), (1.0,2.0))], p=0.2),
            v2.RandomResizedCrop(size=(128,128), scale=(0.08,1.0)),
        ])

    def __len__(self):
        return self.img_labels.shape[0]

    def __getitem__(self,idx):
        img_path = self.img_labels.iloc[idx,0]
        y =  self.img_labels.iloc[idx,3] # column 3 holds the gender label ('m' or 'f')
        ## read the image as a uint8 tensor (C,H,W) and scale it to [0,1]
        x = read_image(img_path).float()/255.0

        ## two independently augmented views of the same image
        x1 = self.augment(x)
        x2 = self.augment(x)
        
        x1 = self.preprocess(x1)
        x2 = self.preprocess(x2)
        
        return x1, x2, y

    # reshuffles the annotations; has to be called manually (the DataLoader already shuffles with shuffle=True)
    def on_epoch_end(self):
         self.img_labels = self.img_labels.sample(frac=1).reset_index(drop=True)

    def preprocess(self,frame):
        MEAN = torch.tensor([[[0.485]], [[0.456]], [[0.406]]])
        STD = torch.tensor([[[0.229]],[[0.224]],[[0.225]]])
        frame = (frame-MEAN)/STD
        return frame
    
    # applies the random augmentations in the training phase; otherwise the image is only resized
    def augment(self, frame, transformations = None):
        
        if self.phase == 'train':
            frame = self.transforms(frame)
        else:
            frame = self.resize(frame)
            
        return frame

Now create train and test set and plot some examples.

*Note: If you're working on a Windows system use 0 workers to avoid multiprocessing issues (Windows uses "spawn" for creating processes instead of "fork", which can cause issues with multiprocessing). Thus, `num_workers = 0`.*
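
One way to follow this advice without editing the notebook by hand is to choose the number of workers depending on the operating system. This is only a sketch. The function name `choose_num_workers` and the default of 4 workers are assumptions, not part of the notebook.

```python
import platform

def choose_num_workers(preferred=4):
    """Use no worker processes on Windows, the preferred number elsewhere."""
    return 0 if platform.system() == 'Windows' else preferred

num_workers = choose_num_workers()
print(f'Using {num_workers} DataLoader workers on {platform.system()}')
```

The resulting value can then be passed as `num_workers` to the `DataLoader`.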

In [None]:
batch_size = 32

train_set = DataSetAugment('train', 'annot_train.csv')
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)

test_set = DataSetAugment('valid', 'annot_test.csv')
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=True, num_workers=4)

In [None]:
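def imshow(images):
    ## undo the normalization from preprocess() so that the colors are displayed correctly
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    plt.figure()
    for i in range(4):
        npimg = np.transpose(images[i].numpy(), (1,2,0))
        plt.subplot(1,4,i+1)
        plt.imshow(np.clip(npimg * std + mean, 0, 1))
        plt.axis('off')
    plt.show()

dataiter = iter(test_loader)
images_1, images_2, _ = next(dataiter)
imshow(images_2)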

## Training

Before we start with training the model, we define a saving routine, so that we can use the trained model afterwards, and the optimizer as well as a scheduler.
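
The intended learning-rate schedule (a linear warm-up over the first 10 epochs followed by a cosine decay) can be sketched in plain Python. This is an idealized version under stated assumptions: the function `lr_at_epoch`, the minimum learning rate of 0 and the decay length of `total_epochs - warmup_epochs` are choices made for this sketch. It does not reproduce the PyTorch scheduler objects step by step.

```python
import math

def lr_at_epoch(epoch, base_lr, warmup_epochs=10, total_epochs=20, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay towards min_lr."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

batch_size = 32
base_lr = 0.2 * batch_size / 256   # learning rate scaled linearly with the batch size
schedule = [lr_at_epoch(epoch, base_lr) for epoch in range(20)]

for epoch, lr in enumerate(schedule):
    print(f'epoch {epoch + 1:2d}: lr = {lr:.5f}')
```

With a batch size of 32, the scaled base learning rate is 0.025. Any minimum learning rate for the cosine decay has to lie below this value, otherwise the learning rate increases instead of decaying.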

In [None]:
def save_model(model, optimizer, scheduler, current_epoch, name):
    os.makedirs('./content/saved_models/', exist_ok=True) 
    out = os.path.join('./content/saved_models/',name.format(current_epoch))

    torch.save({'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict':scheduler.state_dict()}, out)

In [None]:
#OPTIMIZER
optimizer = LARS(
    [params for params in byol_net.parameters() if params.requires_grad],
    lr= 0.2 * batch_size/256,
    weight_decay=1e-6,
    exclude_from_weight_decay=["batch_normalization", "bias"],
)

# "decay the learning rate with the cosine decay schedule without restarts"
# scheduler for the linear warm-up during the first 10 epochs
warmupscheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch : (epoch+1)/10.0)

# scheduler for the cosine decay; with T_0 = 500 no restart happens within the training,
# and eta_min has to be smaller than the base learning rate (0.025 for batch_size = 32)
mainscheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, 500, eta_min=0.0, last_epoch=-1)


Let's start the training.

In [None]:
import time 

nr = 0
current_epoch = 0
num_epochs = 20
tr_loss = []
val_loss = []

for epoch in range(num_epochs):
    stime = time.time()

    byol_net.train()
    tr_loss_epoch = 0
    train_pbar = tqdm(train_loader, desc=f'Epoch {epoch +1}/{num_epochs} [Train]')
    
    for x_i, x_j, _ in train_pbar:

        optimizer.zero_grad()
        x_i = x_i.to(device).float()
        x_j = x_j.to(device).float()
        
        loss =  byol_net(x_i, x_j)
        loss.backward()

        optimizer.step()
        # EMA update
        byol_net.update_moving_average()

        tr_loss_epoch += loss.item()
        
    ## linear warm-up for the first 10 epochs, cosine decay afterwards
    if epoch < 10:
        warmupscheduler.step()
    else:
        mainscheduler.step()
    
    lr = optimizer.param_groups[0]["lr"]

    if (epoch+1) % 50 == 0:
        save_model(byol_net, optimizer, mainscheduler, current_epoch,"BYOL_net_checkpoint.pt")

    byol_net.eval()
    with torch.no_grad():
        val_loss_epoch = 0
        val_bar = tqdm(test_loader, desc=f'Epoch {epoch +1}/{num_epochs} [Val]')
        for x_i, x_j, _ in val_bar:
            
            x_i = x_i.to(device).float()
            x_j = x_j.to(device).float()

            loss =  byol_net(x_i, x_j)

            val_loss_epoch += loss.item()


    tr_loss.append(tr_loss_epoch / len(train_loader))
    val_loss.append(val_loss_epoch / len(test_loader))
    print(f"Epoch [{epoch+1}/{num_epochs}]\t Training Loss: {tr_loss_epoch / len(train_loader)}\t lr: {round(lr, 5)}")
    print(f"Epoch [{epoch+1}/{num_epochs}]\t Validation Loss: {val_loss_epoch / len(test_loader)}\t lr: {round(lr, 5)}")
    current_epoch += 1

    time_taken = (time.time()-stime)/60
    print(f"Epoch [{epoch+1}/{num_epochs}]\t Time Taken: {time_taken:.2f} minutes")

save_model(byol_net, optimizer, mainscheduler, current_epoch, "BYOL_net_final.pt")

plt.figure()
plt.plot(tr_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.legend()
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.show()

## Fine Tune

As in the SimCLR notebook, we will now use the pre-trained network and fine-tune it to predict the gender of the people in the dataset.

In [None]:
new_byol_net = BYOL(Net(num_features), in_features=num_features, batch_norm_mlp=True).to(device)
new_byol_net.load_state_dict(torch.load('./content/saved_models/BYOL_net_final.pt', weights_only=True)['model_state_dict'])

Before the fine-tuning, a brief reminder: we wrote the `forward()` function of the `BYOL` class so that, when only one image is fed to the network, it returns the output of the embedding layer of the online network.

In [None]:
class FineTuneModel(nn.Module):
    def __init__(self,base_model, num_classes, freeze_base = False):
        
        super().__init__()
        self.base_model = base_model
        ## freeze the pre-trained model if necessary
        if freeze_base:
            self.base_model.requires_grad_(False)
        else:
            self.base_model.requires_grad_(True)

        
        ## size of the embedding returned by the BYOL network (projection_size, 256 by default)
        in_features = 256

        ## new head for the classification
        self.new_head = nn.Sequential(nn.Linear(in_features, 256),
                                      nn.ReLU(),
                                      nn.Dropout(0.3),
                                      nn.Linear(256, 256),
                                      nn.ReLU(),
                                      nn.Dropout(0.3),
                                      nn.Linear(256, num_classes))

    
    def forward(self,x):
        out = self.base_model(x)
        out = self.new_head(out)
        return (out)

Now, define the optimizer and the loss function.

In [None]:
import torch.optim as optim
num_classes = 2

gender_model = FineTuneModel(new_byol_net, num_classes, freeze_base=False)

print('Trainable Parameters in network: %.3fM' % get_parameters(gender_model))

num_epochs=5
init_lr= 1e-4
optimizer = optim.AdamW(gender_model.parameters(), lr=init_lr, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

Let's start the fine-tuning.
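
The training loop below converts the gender labels `'m'` and `'f'` into the class indices 0 and 1. A minimal sketch of the same mapping with a dictionary is shown here. The names `GENDER_TO_INDEX` and `encode_gender` are illustrative and not part of the notebook.

```python
GENDER_TO_INDEX = {'m': 0, 'f': 1}

def encode_gender(labels):
    """Map a batch of gender strings to integer class indices."""
    return [GENDER_TO_INDEX[label] for label in labels]

batch_labels = ('m', 'f', 'f', 'm')   # toy batch, as returned by the DataLoader
print(encode_gender(batch_labels))
```

In contrast to filling an array of zeros, the dictionary lookup raises a `KeyError` for an unexpected label instead of silently assigning it to class 0.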

In [None]:
gender_model.to(device)
for epoch in range(num_epochs):
    gender_model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    train_pbar = tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs} [Train]')

    for x_i,_, labels in train_pbar:
        x_i = x_i.to(device)
        y = np.zeros(len(labels))
        y[np.asarray(labels)=='m'] = 0
        y[np.asarray(labels)=='f'] = 1
        y = torch.tensor(y)
        y = y.to(torch.int64)
        
        optimizer.zero_grad()

        # Forward pass
        out = gender_model(x_i)
        out = out.to(device)
        y = y.to(device)
        
        loss = criterion(out, y)

        loss.backward()
        optimizer.step()
        # Statistics
        running_loss += loss.item() * x_i.size(0)
        _, predicted = torch.max(out, 1)
        total += y.size(0)
        correct += (predicted == y).sum().item()

        # Update progress bar
        train_pbar.set_postfix({'loss': loss.item(), 'acc': 100 * correct / total})

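    # Calculate and report epoch statistics
    epoch_train_loss = running_loss / len(train_loader.dataset)
    epoch_train_acc = 100 * correct / total
    print(f"Epoch [{epoch+1}/{num_epochs}]\t Training Loss: {epoch_train_loss:.4f}\t Accuracy: {epoch_train_acc:.2f}%")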

Let's see how well it performed.
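
The evaluation below relies on `classification_report` from scikit-learn. To make clear what the reported numbers mean, here is a hand-written sketch of accuracy, precision, recall and F1 score for the positive class on toy data. The function `binary_report` and the toy labels are illustrative only.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy plus precision, recall and F1 score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': correct / len(pairs),
            'precision': precision, 'recall': recall, 'f1': f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # toy ground truth (1 = 'f', 0 = 'm')
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # toy predictions
report = binary_report(y_true, y_pred)
print(report)
```

Precision is the fraction of samples predicted as positive that are actually positive, and recall is the fraction of positive samples that were found.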

In [None]:
gender_model.eval()
gender_model.to(device)

predictions = []
label_list = []

with torch.no_grad():
    eval_pbar = tqdm(test_loader, desc='Evaluation')

    for x_i,_, labels in eval_pbar:
        x_i = x_i.to(device)
        out = gender_model(x_i)
        if isinstance(out, tuple):
            out = out[0]
        _, predicted = torch.max(out, 1)
        predictions.append(predicted.cpu().numpy())
        label_list.append(np.asarray(labels))

In [None]:
## concatenate the batches, so that a smaller last batch is handled as well
predictions = np.concatenate(predictions)
label_list = np.concatenate(label_list)
labels = np.zeros(len(label_list))
labels[label_list=='f'] = 1

In [None]:
from sklearn.metrics import classification_report

print(classification_report(labels, predictions))

## Exercises

### 1. Change the encoder network to ResNet50

In [None]:
# Your code here

### 2. Use t-SNE to see how gender is represented in the latent representation from the online network.

In [None]:
# Your code here

### 3. Change the augmentation pipeline

In [None]:
# Your code here