# Tutorial 6.1: Simple framework for Contrastive Learning of visual Representations (SimCLR)

Author: [René Larisch](mailto:rene.larisch@informatik.tu-chemnitz.de)

The **Sim**ple framework for **C**ontrastive **L**earning of visual **R**epresentations (SimCLR), a contrastive learning approach that does not require knowledge of labels, was published by [Chen et al. in 2020](https://arxiv.org/abs/2002.05709).


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, Sampler
import torch.nn.functional as F
from torchvision import models, transforms
from torchvision.io import read_image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'PyTorch version: {torch.__version__} running on {device}')

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import time

import os, sys
notebook_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(notebook_dir, ".."))
if root_path not in sys.path:
    sys.path.append(root_path)
    print(f"Added {root_path} to sys.path")


from Utils.little_helpers import timer, set_seed, get_parameters
set_seed(42)

from Utils.optimizers import LARS
from sklearn.manifold import TSNE
from tqdm.notebook import tqdm


## Short reminder of contrastive learning

Before diving deeper into SimCLR, let's briefly review the main idea behind contrastive learning.

Suppose we have multiple images of cats and birds and feed them into a deep neural network (as depicted in Fig. 1). Each input sample would lead to a representation within the network. If there is a (final) layer with only two neurons, we can display these different representations on a two-dimensional map.

<div align="center">
    <img src="figures/Contrastive_1.png" width="450"/>
    <p><i>Figure 1: Different input samples fed into a deep neural network (DNN) create different latent representations within the network.</i></p>
</div>

The goal of contrastive learning is for the representations of two input images showing a similar concept (such as two cats) to be close to each other (i.e., have a small distance). Meanwhile, the representations of two input images showing different concepts (e.g., a bird and a cat) should be far from each other (i.e., have a large distance). Instead of distance, you can equivalently think in terms of the similarity between the representations.

<div align="center">
    <img src="figures/Contrastive.gif" width="450"/>
    <p><i>Figure 2: Representations of the same class should become closer (blue arrows), representations of different classes should become more distant (red arrows).</i></p>
</div>

Most contrastive learning approaches use positive pairs (samples showing the same concept) and negative pairs (samples showing different concepts) to maximize similarity between positive pairs and minimize similarity between negative pairs.
This allows the network to learn which input features are similar in positive pairs and which are different in negative pairs.

Early approaches, like [Chopra et al. (2005)](https://doi.org/10.1109/CVPR.2005.202), created positive and negative pairs through pre-selection. SimCLR, in contrast, does not require this prior knowledge.

## How SimCLR works

### The overall idea
Similar to previous approaches, SimCLR aims to maximize representational similarity between images of the same concept and minimize it between images of different concepts. Here, representational similarity refers to a representation within the neural network, also called a latent representation.

The cosine similarity is used to measure the similarity between representations $z_i$ and $z_j$:

$$
 \cos(z_i,z_j) = \frac{z_i^T z_j}{||z_i|| \; ||z_j||}
$$
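
To make the formula concrete, here is a minimal sketch in plain Python. The helper `cosine_similarity` and the toy vectors are made up for illustration; on tensors, PyTorch's `F.cosine_similarity` computes the same quantity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy 2-D representations (illustrative values, not network outputs)
z_cat_1 = [1.0, 2.0]
z_cat_2 = [2.0, 4.0]   # same direction as z_cat_1, different length
z_bird = [-2.0, 1.0]   # orthogonal to z_cat_1

sim_same = cosine_similarity(z_cat_1, z_cat_2)
sim_diff = cosine_similarity(z_cat_1, z_bird)
print(sim_same, sim_diff)
```

The first value is (up to rounding) 1 because the cosine similarity ignores the length of the vectors, and the second is 0 because the two vectors are orthogonal.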

As a self-supervised approach, SimCLR is designed primarily for pre-training on unlabeled data. The learned representations can then be used directly or fine-tuned for the final task, called the downstream task. To accomplish this, a deep neural network (e.g., a ResNet) is extended with a representation layer and a nonlinear projection head (Fig. 3).
Although only the network up to the representation layer (i.e. $h_i$ and $h_j$) is used for the downstream task, Chen et al. (2020) demonstrated that adding a nonlinear projection head and optimizing its output ($z_i$ and $z_j$) improves encoding in the representation layer.

<div align="center">
    <img src="figures/simclr-general-architecture.png" width="750"/>
    <p><i>Figure 3: General structure of the SimCLR learning approach. A deep neural network, called the base encoder, is extended by a representation layer and a nonlinear projection head. Optimizing the similarity between the representations from the projection head also tunes the representations for the downstream tasks.</i></p>
</div>

### Projection Head

Since the representation layer can easily be implemented with a standard linear layer, we will first implement the projection head. Although we could use the standard linear layer from PyTorch, we will define our own linear layer class that can optionally apply batch normalization and lets us choose whether the layer has a bias.

In [None]:
class LinearLayer(nn.Module):
    def __init__(self,
                 in_features,
                 out_features,
                 use_bias = True,
                 use_bn = False,
                 **kwargs):
        super(LinearLayer, self).__init__(**kwargs)

        self.in_features = in_features
        self.out_features = out_features
        self.use_bias = use_bias
        self.use_bn = use_bn
        
        self.linear = nn.Linear(self.in_features, 
                                self.out_features, 
                                bias = self.use_bias and not self.use_bn)
        if self.use_bn:
            self.bn = nn.BatchNorm1d(self.out_features)

    def forward(self,x):
        x = self.linear(x)
        if self.use_bn:
            x = self.bn(x)
        return x


Now, we can write the ProjectionHead class, which receives the features from the representation layer and maps them into the space in which the contrastive loss is computed.
The projection head can consist of a single linear layer or of two linear layers with a ReLU in between.

In [None]:
class ProjectionHead(nn.Module):
    def __init__(self,
                 in_features,
                 hidden_features,
                 out_features,
                 head_type = 'nonlinear',
                 **kwargs):
        super(ProjectionHead,self).__init__(**kwargs)
        
        self.in_features = in_features
        self.out_features = out_features
        self.hidden_features = hidden_features
        self.head_type = head_type

        if self.head_type == 'linear':
            self.layers = LinearLayer(self.in_features,self.out_features,False, True)
        elif self.head_type == 'nonlinear':
            self.layers = nn.Sequential(
                LinearLayer(self.in_features,self.hidden_features,True, True),
                nn.ReLU(),
                LinearLayer(self.hidden_features,self.out_features,False,True))
        
    def forward(self,x):
        x = self.layers(x)
        return x

### Encoder

Now we can connect the ProjectionHead to the preceding encoder. The encoder can be either a pre-trained ResNet50 from torchvision or a self-defined network.

In [None]:
class SimCLRModel(nn.Module):
    def __init__(self,base_model, num_features=64):
        
        super().__init__()
        self.base_model = base_model
        self.num_features = num_features

        self.pretrained = None
        encoder_dim = 128  # output size of the custom EncoderNet

        if isinstance(base_model, str):
            # pretrained model from torchvision
            if base_model == 'resnet50':
                self.pretrained = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        else:
            self.pretrained = self.base_model

        if self.pretrained is None:
            raise ValueError('No valid pretrained model chosen')

        ## if base_model == resnet50, switch the first convolutional and some later layers for fine tuning
        if base_model == 'resnet50':
            self.pretrained.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), bias=False)
            self.pretrained.maxpool = nn.Identity()
            self.pretrained.fc = nn.Identity()
            encoder_dim = 2048  # ResNet50 feature size once the fc layer is removed
            ## set the pretrained weights fix for the resnet model
            for p in self.pretrained.parameters():
                p.requires_grad = False

        self.latent_layer = LinearLayer(encoder_dim, self.num_features, True, True)
        self.projector = ProjectionHead(self.num_features, 2048, 128)

    
    def forward(self,x):
        out = self.pretrained(x)
        #print(out.size())
        latent = F.relu(self.latent_layer(out))
        xp = self.projector(torch.squeeze(latent))
        # we return the output from projection head and 
        # the latent layer as we need it later for the real task 
        return (xp,latent)

Let's define a small encoder network consisting of three convolutional layers and several fully connected layers.
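
The `EncoderNet` class in this section hard-codes `32*14*14` as the input size of its first fully connected layer. The following sketch shows where that number comes from. It assumes 128×128 input images (the size used by the dataset class later in this tutorial) and the usual output-size formula for unpadded convolutions and pooling; the helper `conv_out` is introduced only for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 128  # assumed input resolution (128 x 128 pixels)
for name in ("conv1", "conv2", "conv3"):
    size = conv_out(size, kernel=3)            # 3x3 convolution, no padding
    size = conv_out(size, kernel=2, stride=2)  # 2x2 max pooling with stride 2
    print(name, "->", size)

flat_features = 32 * size * size  # conv3 has 32 output channels
print(flat_features)
```

The spatial size shrinks from 128 to 63, 30 and finally 14, which gives 32·14·14 = 6272 inputs for `fc1`. If you change the image resolution, this value has to be adapted accordingly.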

In [None]:
class EncoderNet(nn.Module):
    def __init__(self):
        super().__init__()
        ## define the network structure with the layers
        self.conv1 = nn.Conv2d(3,256,3) # in_channels, out_channels, kernel_size 
        self.pool  = nn.MaxPool2d(2,2) # kernel_size, stride
        self.dropout = nn.Dropout(0.2) # dropout factor
        self.conv2 = nn.Conv2d(256,128,3) # in_channels,out_channels, kernel_size
        self.conv3 = nn.Conv2d(128,32,3) # in_channels,out_channels, kernel_size
        self.fc1   = nn.Linear(32*14*14, 120) # in_features, out_features
        self.fc2   = nn.Linear(120,1024)
        self.fc3   = nn.Linear(1024,128) # in_features, out_features

        
    def forward(self, x):
        ## define the functionality of each layer/between the layers
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv3(x)))
        #print(x.size())
        x = torch.flatten(x,1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        x = F.relu(self.fc3(x))
        return x


### Full Network

Create the complete SimCLRModel, using our custom encoder network. You can also use a pre-trained network.

In [None]:
num_features = 128
#model = SimCLRModel('resnet50', num_features).to(device)
model = SimCLRModel(EncoderNet(), num_features).to(device)

print('Trainable Parameters in SimCLR model: %.3fM' % get_parameters(model))
print(model)

## Loss Function

<table style>
<td style="border: none!important; width:55%">

Since deep learning typically uses batch learning to update weights, SimCLR uses batches to identify positive and negative pairs.

After creating two variants of each image in a batch using data augmentation, all variants are paired with each other.
Variants of the same original image are considered positive pairs, and pairs of variants from different images are considered negative pairs.

With $N$ samples per batch, there are $2N$ data points after augmentation. Each data point has exactly one positive partner, and the remaining $2(N-1)$ augmented samples are treated as negatives (even if they happen to depict an object from the same class or concept).

The loss is calculated for each positive pair in the batch as follows:
$$
 \mathcal{l}(i,j) = -\log{} \frac{\exp(sim(z_i,z_j)/\tau )}{ \sum^{2N}_{k=1} \mathbb{1}_{[k \neq i]}\exp(sim(z_i,z_k)/\tau)}
$$
where $sim(z_i,z_j) = \cos(z_i,z_j)$ is the cosine similarity between the representations $z_i$ and $z_j$.

The numerator depends only on the similarity of the single positive pair, while the denominator sums over the similarities of all negative pairs plus the positive one. The temperature $\tau$ scales the similarities before the exponential: a smaller $\tau$ sharpens the resulting distribution, so that the most similar negatives dominate the denominator.

Thus, the loss decreases as the similarity of the positive pair increases, and it increases as the similarity to the negative samples increases.

Note: Switching the order in the positive pair, i.e., calculating the loss between $z_j$ and $z_i$, results in a different loss $\mathcal{l}(j,i)$, because the denominator then sums over the similarities of $z_j$ instead of $z_i$. The loss is therefore not symmetric.

The loss for one image is the sum of the two sub-losses $\mathcal{l}(i,j)$ and $\mathcal{l}(j,i)$.

For a batch, the total loss is calculated as follows:
$$
    \mathcal{L} = \frac{1}{2N} \sum^{N}_{k=1}[\mathcal{l}(2k-1,2k) + \mathcal{l}(2k,2k-1)]
$$   

</td>
<td style="border: none!important; width:45%">
<div align="center">
    <img src="figures/SimCLR_schema.gif" width="400"/>
</div>   
</td>
</table>
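
As a rough illustration of the formulas above, the following sketch evaluates the loss for a tiny batch in plain Python. It is not the implementation used in this tutorial: the function names and toy vectors are made up, and the views are assumed to be stored as "all first views, then all second views", so the positive partner of view `i` is view `i + N` rather than the neighbouring index used in the formula.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nt_xent(z, temperature=0.5):
    """Mean loss over all 2N views; z[i] and z[i + N] form a positive pair."""
    n_views = len(z)
    n = n_views // 2
    total = 0.0
    for i in range(n_views):
        j = (i + n) % n_views  # index of the positive partner
        numerator = math.exp(cosine(z[i], z[j]) / temperature)
        denominator = sum(math.exp(cosine(z[i], z[k]) / temperature)
                          for k in range(n_views) if k != i)
        total += -math.log(numerator / denominator)
    return total / n_views

# two images (N = 2) with two views each; values are illustrative only
aligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]   # partners point the same way
mixed_up = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.9, 0.1]]  # partners point apart

loss_aligned = nt_xent(aligned)
loss_mixed_up = nt_xent(mixed_up)
print(loss_aligned, loss_mixed_up)
```

The loss is clearly smaller when the two views of each image have similar representations, which is exactly the behaviour the training is supposed to encourage.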


In [None]:
class SimCLR_Loss(nn.Module):
    def __init__(self, batch_size, temperature):
        super().__init__()
        self.batch_size = batch_size
        self.temperature = temperature

        self.mask = self.mask_correlated_samples(batch_size)
        self.criterion = nn.CrossEntropyLoss(reduction="sum")
        self.similarity_f = nn.CosineSimilarity(dim=2)

    def mask_correlated_samples(self, batch_size):
        N = 2 * batch_size
        mask = torch.ones((N, N), dtype=bool)
        mask = mask.fill_diagonal_(0)
        
        for i in range(batch_size):
            mask[i, batch_size + i] = 0
            mask[batch_size + i, i] = 0
        return mask

    def forward(self, z_i, z_j):

        N = 2 * self.batch_size

        z = torch.cat((z_i, z_j), dim=0)
        sim = self.similarity_f(z.unsqueeze(1), z.unsqueeze(0)) / self.temperature

        
        # similarities of the positive pairs lie on the +/- batch_size off-diagonals
        sim_i_j = torch.diag(sim, diagonal=self.batch_size)
        sim_j_i = torch.diag(sim, diagonal=-self.batch_size)
        
        # We have 2N samples
        positive_samples = torch.cat((sim_i_j, sim_j_i), dim=0).reshape(N, 1)
        negative_samples = sim[self.mask].reshape(N, -1)
        
        #SIMCLR
        # the positive sample always sits in column 0 of the logits
        labels = torch.zeros(N, dtype=torch.long, device=positive_samples.device)
        
        logits = torch.cat((positive_samples, negative_samples), dim=1)
        loss = self.criterion(logits, labels)
        loss /= N
        
        return loss
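
The class above does not evaluate the loss formula term by term. Instead, it appears to rely on the fact that the formula is a cross-entropy. For each of the $2N$ rows, `logits` holds the positive similarity in column 0, followed by the $2N-2$ negative similarities, all already divided by $\tau$. With target class 0, the cross-entropy of one row $s = (s_0, \dots, s_{2N-2})$ is

$$
 \mathrm{CE}(s, 0) = -\log{} \frac{\exp(s_0)}{\sum^{2N-2}_{m=0} \exp(s_m)}, \qquad s_0 = sim(z_i,z_j)/\tau
$$

which equals $\mathcal{l}(i,j)$, because the entries $s_1, \dots, s_{2N-2}$ are exactly $sim(z_i,z_k)/\tau$ for all $k \notin \{i,j\}$. Summing over all rows (`reduction="sum"`) and dividing by $2N$ (the variable `N` in the code) then gives $\mathcal{L}$.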

In [None]:
#OPTIMIZER
optimizer = LARS(
    [params for params in model.parameters() if params.requires_grad],
    lr=0.2,
    weight_decay=1e-6,
    exclude_from_weight_decay=["batch_normalization", "bias"],
)

# "decay the learning rate with the cosine decay schedule without restarts"
#SCHEDULER FOR LINEAR WARMUP
warmupscheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch : (epoch+1)/10.0, verbose = True)

#SCHEDULER FOR COSINE DECAY
mainscheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, 500, eta_min=0.05, last_epoch=-1, verbose = True)
batch_size = 32
#LOSS FUNCTION
criterion = SimCLR_Loss(batch_size = batch_size, temperature = 0.5)

## AgeDB dataset
AgeDB, published by [Moschoglou et al. (2017)](https://doi.org/10.1109/CVPRW.2017.250), contains 16,488 images of famous people at different ages.
Each sample image contains three pieces of information: the person's name, gender, and age.
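
In the annotation code of this tutorial, these labels are read from the file names, which are split at underscores. The sketch below shows this parsing in isolation. It assumes the naming scheme `<id>_<name>_<age>_<gender>.<ext>` implied by that code; the file name used here is hypothetical, and, unlike the tutorial code, the age is converted to an integer.

```python
def parse_agedb_filename(filename):
    """Split '<id>_<name>_<age>_<gender>.<ext>' into its labels."""
    stem = filename.rsplit('.', 1)[0]  # drop the file extension
    _, name, age, gender = stem.split('_')
    return name, int(age), gender

# hypothetical file name following the assumed scheme
example = parse_agedb_filename('123_JaneDoe_35_f.jpg')
print(example)  # ('JaneDoe', 35, 'f')
```

A file name with additional underscores would raise a `ValueError` during unpacking, which is a useful early warning that a file does not follow the expected pattern.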

### Prepare the dataset

To run the following lines of code, make sure you have access to the dataset (e.g. download it beforehand from https://www.kaggle.com/datasets/nitingandhi/agedb-database).

In [None]:
def make_annotation_file(path_data, split):
    ## iterate through the directory and create an annotation csv
    list_data = []

    for file in os.listdir(path_data):
        ## file names follow the pattern <id>_<name>_<age>_<gender>.<ext>
        _, name, age, gender = file.split('_')
        gender = gender.split('.')[0]

        list_data.append((os.path.join(path_data, file), name, age, gender))

    ## shuffle and then split the dataset into train and test
    n_samples = len(list_data)
    idx = np.linspace(0,n_samples-1, n_samples, dtype='int32')
    np.random.shuffle(idx)    
    list_data   = np.asarray(list_data)[idx]

    n_train= int(n_samples*split[0])
    n_test = int(n_samples*split[1])

    ##train data
    train_data = list_data[:n_train]

    ## test data
    test_data  = list_data[n_train:n_train+n_test]
    
    ## create two csv-files for the annotations
    csv_train = pd.DataFrame(train_data, columns = ['Images','Name','Age','Gender'])
    csv_test = pd.DataFrame(test_data, columns = ['Images','Name','Age','Gender'])

    ## save them for later 
    csv_train.to_csv('annot_train.csv', sep=',', index=False)
    csv_test.to_csv('annot_test.csv', sep=',', index=False)

    labels_name = csv_test['Name']
    labels_age = csv_test['Age']
    labels_gender = csv_test['Gender']

    return(labels_name, labels_age, labels_gender)

path_data = '../Dataset/AgeDB/'
split = [0.8,0.2]

labels_name, labels_age, labels_gender = make_annotation_file(path_data, split)

### Augmentation pipeline
As shown in the original publication, the types and order of augmentations play a crucial role in the quality of learned representations.

For this reason, we will stick to the original pipeline mentioned in [Chen et al. (2020)](https://arxiv.org/abs/2002.05709), though we should mention that other self-supervised approaches propose augmentation pipelines that may lead to better results.

Please note that the dataset we create will return two augmented versions of each image in a batch, along with the corresponding gender label.
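
The principle of drawing two independent random views from the same image can be sketched without any image library. The example below is not the torchvision pipeline used in this tutorial: it works on a toy image stored as nested lists, and the function `random_view` with its crop-and-flip augmentation is a simplified stand-in introduced only for illustration.

```python
import random

def random_view(image, crop, rng):
    """Random crop followed by a random horizontal flip (image: list of rows)."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)   # randint includes both end points
    left = rng.randint(0, width - crop)
    view = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        view = [row[::-1] for row in view]  # horizontal flip
    return view

rng = random.Random(0)  # fixed seed for reproducibility
image = [[10 * r + c for c in range(6)] for r in range(6)]  # toy 6x6 "image"

# two independent augmentations of the same image form a positive pair
x1 = random_view(image, crop=4, rng=rng)
x2 = random_view(image, crop=4, rng=rng)
print(x1)
print(x2)
```

Both views have the same size and contain only pixels of the original image, but they generally differ in position and orientation. The network has to learn that they still show the same content.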

In [None]:
import torchvision.transforms.v2 as v2
from torchvision.io import ImageReadMode

class DataSetAugment(Dataset):
    def __init__(self,phase, annotations_file, s = 0.5):
        self.phase = phase
        self.img_labels = pd.read_csv(annotations_file)
        self.s = s
        self.resize = v2.Resize((128,128))
        self.transforms = transforms.Compose([v2.RandomResizedCrop(size=(128,128), scale=(0.75, 1.0), ratio=(0.9, 1.05)),
                                            transforms.Compose([v2.RandomApply([
                                                                v2.ColorJitter(0.8*self.s, 
                                                                                       0.8*self.s, 
                                                                                       0.8*self.s, 
                                                                                       0.2*self.s)], p = 0.3),
                                                                  v2.RandomGrayscale(p=0.2)]),
                                             v2.RandomHorizontalFlip(),
                                             v2.RandomApply([v2.GaussianBlur((3,3), (1.0,2.0))],p=0.2),
                                             v2.RandomResizedCrop(128,(0.08,1.0)),])

    def __len__(self):
        return self.img_labels.shape[0]

    def __getitem__(self,idx):
        img_path = os.path.join( self.img_labels.iloc[idx,0])
        y = self.img_labels.iloc[idx, 3]  # column 3 holds the gender label ('m' or 'f')

        # Some images are in RGB and some in gray-scale, so we set the ImageReadMode to convert all to RGB
        x = read_image(img_path, mode = ImageReadMode.RGB)
        x = x.numpy()
        x = x.astype(np.float32)/255.0
        
        x1 = self.augment(torch.from_numpy(x))
        x2 = self.augment(torch.from_numpy(x))

        return x1, x2, y

    def preprocess(self,frame):
        MEAN = torch.tensor([[[0.485]], [[0.456]], [[0.406]]])
        STD = torch.tensor([[[0.229]],[[0.224]],[[0.225]]])
        frame = (frame-MEAN)/STD
        return frame
    
    # applies the random augmentation pipeline in the training phase; otherwise the image is only resized
    def augment(self, frame, transformations = None):
        
        if self.phase == 'train':
            frame = self.transforms(frame)
        else:
            frame = self.resize(frame)
        
        return frame

Now create the training and test sets.

*Note: If you're working on a Windows system use 0 workers to avoid multiprocessing issues (Windows uses "spawn" for creating processes instead of "fork", which can cause issues with multiprocessing). Thus, `num_workers = 0`.*

In [None]:
train_set = DataSetAugment('train','annot_train.csv')
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)

test_set = DataSetAugment('valid','annot_test.csv')
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=4)


Let's plot some samples.

In [None]:
def imshow(imges):
    plt.figure(figsize=(8,2))
    for i in range(4):
        img = imges[i]
        #img = img / 2 + 0.5
        npimg = img.numpy()
        plt.subplot(1,4,i+1)
        plt.imshow(np.transpose(npimg,(1,2,0)))
        plt.axis('off')
    plt.show()

dataiter = iter(test_loader)
images_1,_, y= next(dataiter)
imshow(images_1)
print(y[:4])

## Train

In [None]:
def save_model(model, optimizer, scheduler, current_epoch, name):
    os.makedirs('./content/saved_models/', exist_ok=True) 
    out = os.path.join('./content/saved_models/',name.format(current_epoch))

    torch.save({'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict':scheduler.state_dict()}, out)

In [None]:
current_epoch = 0
epochs = 50#100
tr_loss = []
val_loss = []

for epoch in range(epochs):
        
    print(f"Epoch [{epoch}/{epochs}]\t")
    stime = time.time()

    model.train()
    tr_loss_epoch = 0
    pbar = tqdm(total = len(train_loader)-1 )
    
    for step, (x_i, x_j, _) in enumerate(train_loader):
        if x_i.size()[0] < batch_size:
            ##ignore the last batch, if it did not fit
            break
        optimizer.zero_grad()
        x_i = x_i.squeeze().to(device).float()
        x_j = x_j.squeeze().to(device).float()
        
        # positive pair, with encoding
        z_i,_ = model(x_i)
        z_j,_ = model(x_j)

        loss = criterion(z_i, z_j)
        loss.backward()

        optimizer.step()
        
        tr_loss_epoch += loss.item()

        pbar.update(1)
    if epoch < 10:
        warmupscheduler.step()
    if epoch >= 10:
        mainscheduler.step()
    
    lr = optimizer.param_groups[0]["lr"]

    if (epoch+1) % 50 == 0:
        save_model(model, optimizer, mainscheduler, current_epoch,"SimCLR_net_checkpoint.pt")

    model.eval()
    with torch.no_grad():
        val_loss_epoch = 0
        for step, (x_i, x_j, _) in enumerate(test_loader):
            if x_i.size()[0] < batch_size:
                ##ignore the last batch, if it did not fit
                break
            x_i = x_i.squeeze().to(device).float()
            x_j = x_j.squeeze().to(device).float()

            # positive pair, with encoding
            z_i,_ = model(x_i)
            z_j,_ = model(x_j)

            loss = criterion(z_i, z_j)

            if step % 50 == 0:
                print(f"Step [{step}/{len(test_loader)}]\t Loss: {round(loss.item(),5)}")

            val_loss_epoch += loss.item()


    tr_loss.append(tr_loss_epoch / len(train_loader))
    val_loss.append(val_loss_epoch / len(test_loader))
    print(f"Epoch [{epoch}/{epochs}]\t Training Loss: {tr_loss_epoch / len(train_loader)}\t lr: {round(lr, 5)}")
    print(f"Epoch [{epoch}/{epochs}]\t Validation Loss: {val_loss_epoch / len(test_loader)}\t lr: {round(lr, 5)}")
    current_epoch += 1

    time_taken = (time.time()-stime)/60
    print(f"Epoch [{epoch}/{epochs}]\t Time Taken: {time_taken} minutes")

save_model(model, optimizer, mainscheduler, current_epoch, "SimCLR_net_final.pt")

plt.figure()
plt.plot(tr_loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.legend()
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.show()


## Fine tune
We will now use the pre-trained network and fine-tune it to predict the gender.

In [None]:
new_model = SimCLRModel(EncoderNet(), num_features).to(device)
new_model.load_state_dict(torch.load('./content/saved_models/SimCLR_net_final.pt', weights_only=True)['model_state_dict'])

Next, we define the model class that we want to fine-tune.
We keep the FineTuneModel class as simple as possible: a single flag determines whether the weights of the contrastively pre-trained network are frozen.

In [None]:
class FineTuneModel(nn.Module):
    def __init__(self,base_model, num_classes, freeze_base = False):
        
        super().__init__()
        self.base_model = base_model
        ## freeze the pre-trained model if necessary
        if freeze_base:
            self.base_model.requires_grad_(False)
        else:
            self.base_model.requires_grad_(True)

        
        in_features = self.base_model.latent_layer.out_features

        ## new head for the classification
        self.new_head = nn.Sequential(nn.Linear(in_features, 256),
                                      nn.ReLU(),
                                      nn.Dropout(0.3),
                                      nn.Linear(256, num_classes))

    
    def forward(self,x):
        ## NOTE: we use here the output of the latent_layer!
        _, out = self.base_model(x)
        out = self.new_head(out)
        return (out)

Initialize the model for predicting the gender.

In [None]:
import torch.optim as optim
num_classes = 2

gender_model = FineTuneModel(new_model, num_classes, freeze_base=True)

print('Trainable Parameters in network: %.3fM' % get_parameters(gender_model))

num_epochs=5
init_lr= 1e-4
optimizer = optim.AdamW(gender_model.parameters(), lr=init_lr, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

We will build the labels inside the training loop as we get strings ("m" for male and "f" for female) from the train_loader.
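
The conversion from strings to class indices can also be written as a small lookup, shown here in isolation. This is a sketch, not the code used in the training loop: the names `GENDER_TO_INDEX` and `encode_gender` are introduced only for illustration, and the mapping follows the convention of this tutorial ("m" becomes 0, "f" becomes 1).

```python
GENDER_TO_INDEX = {'m': 0, 'f': 1}  # same convention as in the training loop

def encode_gender(labels):
    """Map a batch of gender strings to integer class indices."""
    return [GENDER_TO_INDEX[label] for label in labels]

batch_labels = ['f', 'm', 'm', 'f']  # illustrative batch of label strings
encoded = encode_gender(batch_labels)
print(encoded)  # [1, 0, 0, 1]
```

The resulting list can be wrapped with `torch.tensor(encoded)` to obtain the integer targets expected by `nn.CrossEntropyLoss`. In contrast to filling a zero array, an unexpected label raises a `KeyError` here instead of silently being counted as class 0.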

In [None]:
gender_model.to(device)
for epoch in range(num_epochs):
    gender_model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    train_pbar = tqdm(train_loader, desc=f'Epoch {epoch + 1}/{num_epochs} [Train]')

    for x_i,_, labels in train_pbar:
        x_i = x_i.to(device)
        y = np.zeros(len(labels))
        y[np.asarray(labels)=='m'] = 0
        y[np.asarray(labels)=='f'] = 1
        y = torch.tensor(y)
        y = y.to(torch.int64)
        
        optimizer.zero_grad()

        # Forward pass
        out = gender_model(x_i)
        out = out.to(device)
        y = y.to(device)
        
        loss = criterion(out, y)

        loss.backward()
        optimizer.step()
        # Statistics
        running_loss += loss.item() * x_i.size(0)
        _, predicted = torch.max(out, 1)
        total += y.size(0)
        correct += (predicted == y).sum().item()

        # Update progress bar
        train_pbar.set_postfix({'loss': loss.item(), 'acc': 100 * correct / total})

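    # Calculate and report epoch statistics
    epoch_train_loss: float = running_loss / len(train_loader.dataset)
    epoch_train_acc: float = 100 * correct / total
    print(f'Epoch {epoch + 1}/{num_epochs}: loss {epoch_train_loss:.4f}, acc {epoch_train_acc:.2f}%')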

Now we evaluate the performance of the fine-tuned model.

In [None]:
gender_model.eval()
gender_model.to(device)

predictions = []
label_list = []

with torch.no_grad():
    eval_pbar = tqdm(test_loader, desc='Evaluation')

    for x_i,_, labels in eval_pbar:
        if x_i.size()[0] < batch_size:
            ##ignore the last batch, if it did not fit 
            break
        x_i = x_i.to(device)
        out = gender_model(x_i)
        if isinstance(out, tuple):
            out = out[0]
        _, predicted = torch.max(out, 1)
        predictions.append(predicted.cpu())
        label_list.append(labels)

In [None]:
predictions = torch.cat(predictions).numpy()
label_list = np.asarray(label_list).flatten()
labels = np.zeros(len(label_list))
labels[label_list=='f'] = 1

In [None]:
from sklearn.metrics import classification_report

print(classification_report(labels, predictions))

## Exercises

### 1. Evaluate parameters and network structure

a) Play with the `batch_size`. What happens if you make it bigger?

b) Change the encoder network to ResNet50.

In [None]:
# Your code here

### 2. Change the code to predict the age as a regression task

In [None]:
# Your code here

### 3. Use t-SNE to see how age is represented in the latent representation

In [None]:
# Your code here