# Task 4: Convolutional Networks <br/> CC6204 Deep Learning, Universidad de Chile.

## Name: Humberto Rodrigues 


In [None]:
import os
import sys
import random
import pickle

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
from scipy.spatial import distance

import torchvision
import torchvision.transforms as transforms

# Downloading external utils library
if not os.path.exists('utils.py'):
  !wget https://raw.githubusercontent.com/dccuchile/CC6204/master/2020/tareas/tarea4/utils.py

from utils import ImageCaptionDataset, train_for_classification, train_for_retrieval

--2020-12-18 23:43:59--  https://raw.githubusercontent.com/dccuchile/CC6204/master/2020/tareas/tarea4/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7403 (7.2K) [text/plain]
Saving to: ‘utils.py’


2020-12-18 23:43:59 (109 MB/s) - ‘utils.py’ saved [7403/7403]



In [None]:
# Installing auto test runner
!pip install -U "git+https://github.com/dccuchile/CC6204.git@master#egg=cc6204&subdirectory=autocorrect"
from timeit import default_timer as timer
from cc6204 import AutoCorrect, FailedTest
corrector = AutoCorrect(host="cc6204.dcc.uchile.cl", port=443)
token = "]ye/Ox;nsz"

Collecting cc6204
  Cloning https://github.com/dccuchile/CC6204.git (to revision master) to /tmp/pip-install-mahbyxam/cc6204
  Running command git clone -q https://github.com/dccuchile/CC6204.git /tmp/pip-install-mahbyxam/cc6204
Building wheels for collected packages: cc6204
  Building wheel for cc6204 (setup.py) ... done
  Created wheel for cc6204: filename=cc6204-0.5.0-cp36-none-any.whl size=5802 sha256=71549966d9e3aeac8046a3bcb185b7dc7b0d4962d7841260fd1fa981d298b3c9
  Stored in directory: /tmp/pip-ephem-wheel-cache-o_ei0wjc/wheels/62/f0/30/aadcb7ce24a2f9c935890518e902d4e23bf97b80f47bb64414
Successfully built cc6204
Installing collected packages: cc6204
Successfully installed cc6204-0.5.0
Connection stablished


In [None]:
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default='colab'

def _plot_series(series, title, yaxis_title):
  # Draw one line per entry in `series`; each entry is a dict with "x", "y" and "name"
  fig = go.Figure()
  for result in series:
    fig.add_scatter(
        x=result["x"],
        y=result["y"],
        mode="lines+markers",
        textposition="bottom center",
        name=result["name"]
    )
  fig.update_layout(
      autosize=False,
      width=700,
      height=450,
      title=title,
      xaxis_title="Epochs",
      yaxis_title=yaxis_title,
      font=dict(
          family="Courier New, monospace",
          size=14,
          color="#7f7f7f"
      )
  )
  fig.show()

def plot_results(loss_data, scores_data, secore_2_data=None):
  _plot_series(loss_data, "Loss Chart", "Loss")
  print("\n")
  _plot_series(scores_data, "Metric Chart", "Score")

  # Optional second metric chart
  if secore_2_data:
    print("\n")
    _plot_series(secore_2_data, "Metric Chart", "Score")



# Part 1: Most common convolutional architectures: GoogleNet & ResNet 
In this section we explore some of the first deep convolutional networks. First we implement a version of `GoogleNet` with a few small changes, mostly because these models were originally designed for `3x224x224` input images while this work uses `3x32x32` images. As a result, some of the initial spatial reductions of the original architecture are not applied here.
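To see why the original reductions cannot be kept, the sketch below traces the side length of a feature map through a small stem. It assumes the usual output-size formula `floor((n + 2p - k) / s) + 1` for convolution and pooling layers. The helper names `conv_out` and `trace`, and the two-layer stem with its padding values, are illustrative assumptions and not code from the assignment.

```python
def conv_out(n, kernel, stride=1, padding=0):
  """Spatial size after a convolution or pooling layer (floor mode)."""
  return (n + 2 * padding - kernel) // stride + 1


def trace(n, layers):
  """Apply a list of (kernel, stride, padding) layers to an input of side n."""
  sizes = [n]
  for kernel, stride, padding in layers:
    sizes.append(conv_out(sizes[-1], kernel, stride, padding))
  return sizes


# Stride-2 stem: 7x7 convolution followed by a 3x3 max-pool (illustrative padding).
strided_stem = [(7, 2, 3), (3, 2, 1)]
# Same kernels with stride 1, as this notebook uses for 32x32 inputs.
unit_stem = [(7, 1, 3), (3, 1, 1)]

print(trace(224, strided_stem))  # [224, 112, 56]
print(trace(32, strided_stem))   # [32, 16, 8]
print(trace(32, unit_stem))      # [32, 32, 32]
```

With stride 2 in both layers a 32x32 image is already down to 8x8 after the stem, which leaves very little room for the later pooling stages. Keeping stride 1 preserves the full 32x32 map.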

The main idea behind these models is not only to build very deep convolutional networks but also to deal with the `vanishing gradient` problem that such depth brings.

## GoogleNet

This model is 22 layers deep (counting only layers with parameters) and won the ILSVRC 2014 image classification challenge. A few design choices explain why this structure worked so well at the time.

The general philosophy is to increase the number of channels and reduce the spatial size of the feature maps from layer to layer. This can lead to a large number of operations in some parts of the model. To avoid this, `1x1` convolutional layers are added to reduce the number of channels before the expensive convolutions.
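The saving from a `1x1` reduction can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations for a `5x5` convolution on a 32x32 map with 192 input channels, with and without a `1x1` reduction to 16 channels. These sizes match the default `5x5` branch of the first Inception module in this notebook. The helper `conv_macs` is a name introduced here, and biases and activations are ignored.

```python
def conv_macs(height, width, in_ch, out_ch, kernel):
  """Multiply-accumulate count of a stride-1, size-preserving convolution."""
  return height * width * in_ch * out_ch * kernel * kernel


H = W = 32

# 5x5 convolution applied directly on the 192 input channels.
direct = conv_macs(H, W, 192, 32, 5)

# 1x1 reduction to 16 channels, then the 5x5 convolution on the narrow tensor.
reduced = conv_macs(H, W, 192, 16, 1) + conv_macs(H, W, 16, 32, 5)

print(direct)                      # 157286400
print(reduced)                     # 16252928
print(round(direct / reduced, 1))  # 9.7
```

Under these assumptions the reduced branch needs roughly a tenth of the operations while producing an output of the same shape.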

Another important concept is the `Inception module`, which applies different transformations to the input in `parallel` and concatenates all the results into a single output.

Finally, the most notable feature is the way it deals with the `vanishing gradient` problem. The proposed solution adds two additional `outputs` in the middle of the structure and changes the loss value to the average over all outputs. As a result, during `backpropagation` the gradient can reach the early layers through additional, shorter paths that start at those auxiliary outputs.
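A minimal sketch of that averaged loss is shown below. It assumes the model returns a dictionary with a `'logits'` entry and an `'aux_logits'` list, which is the format returned by the `GoogLeNet.forward` in this notebook. The function name `averaged_loss` is hypothetical, and the way `train_for_classification` from `utils.py` actually combines the outputs is not shown here and may differ.

```python
import torch
import torch.nn as nn


def averaged_loss(outputs, targets, criterion):
  """Average the criterion over the main output and any auxiliary outputs."""
  all_logits = [outputs['logits']] + list(outputs.get('aux_logits', []))
  losses = [criterion(logits, targets) for logits in all_logits]
  return torch.stack(losses).mean()


# Toy usage with random logits for a batch of 4 samples and 10 classes.
torch.manual_seed(0)
targets = torch.tensor([1, 0, 3, 9])
outputs = {
  'logits': torch.randn(4, 10, requires_grad=True),
  'aux_logits': [torch.randn(4, 10, requires_grad=True) for _ in range(2)],
}
loss = averaged_loss(outputs, targets, nn.CrossEntropyLoss())
loss.backward()  # gradients reach the main output and both auxiliary outputs
```

In evaluation mode the `aux_logits` list is empty, so the same function reduces to the plain criterion on the main output.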

In [None]:
# We will be using nn.Sequential during all this work
# To have cleaner forward methods

class InceptionModule(nn.Module):
  def __init__(self, 
    in_channels, 
    ch_3x3_reduce=96, 
    ch_5x5_reduce=16,
    ch_3x3=128,
    ch_5x5=32,
    ch_pool_proj=32,
    ch_1x1=64
  ):
    super(InceptionModule, self).__init__()

    self.seq_1 = nn.Sequential(
      nn.Conv2d(in_channels=in_channels,out_channels=ch_3x3_reduce,kernel_size=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=ch_3x3_reduce,out_channels=ch_3x3,kernel_size=3,padding=1),
      nn.ReLU(),
    )

    self.seq_2 = nn.Sequential(
      nn.Conv2d(in_channels=in_channels,out_channels=ch_5x5_reduce,kernel_size=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=ch_5x5_reduce,out_channels=ch_5x5,kernel_size=5,padding=2),
      nn.ReLU(),
    )

    self.seq_3 = nn.Sequential(
      nn.MaxPool2d(kernel_size=3,padding=1,stride=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=in_channels,out_channels=ch_pool_proj,kernel_size=1),
      nn.ReLU(),
    )

    self.seq_4 = nn.Sequential(
      nn.Conv2d(in_channels=in_channels,out_channels=ch_1x1,kernel_size=1),
      nn.ReLU(),
    )   

  def forward(self, x):
    return torch.cat((self.seq_1(x),self.seq_2(x),self.seq_3(x),self.seq_4(x)),1)
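The concatenation in `forward` works because every branch preserves height and width (stride 1, with padding matched to each kernel), so only the channel dimension grows. The short sketch below computes the resulting channel count. The helper `inception_out_channels` is a name introduced here, and the reduce sizes are left out because they do not affect the output width.

```python
def inception_out_channels(ch_3x3, ch_5x5, ch_pool_proj, ch_1x1):
  """Channels after concatenating the four parallel branches along dim 1."""
  return ch_3x3 + ch_5x5 + ch_pool_proj + ch_1x1


print(inception_out_channels(128, 32, 32, 64))   # 256 (default arguments)
print(inception_out_channels(240, 60, 60, 120))  # 480
print(inception_out_channels(256, 64, 64, 128))  # 512
```

Each value is the `in_channels` the next module in the chain must accept.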

In [None]:
# Auxiliary module with the common structure shared by the auxiliary outputs
class IntermediateOutputBlock(nn.Module):
  def __init__(self, in_channels):
    super(IntermediateOutputBlock, self).__init__()
    self.seq = nn.Sequential(
      nn.AvgPool2d(kernel_size=5,stride=3,padding=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=in_channels,out_channels=128,stride=1,kernel_size=4,padding=1),
      nn.ReLU(),
      nn.Flatten(),
      nn.Linear(2048,1024),
      nn.ReLU(),
      nn.Dropout(p=0.7)
    )
  def forward(self, x):
    return self.seq(x)
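The hard-coded `nn.Linear(2048, 1024)` only matches one input size. The sketch below reproduces the arithmetic, assuming the block receives a 15x15 feature map, which is what a 32x32 input becomes after one unpadded `3x3` max-pool with stride 2. The helper `out_size` is a name introduced here for illustration.

```python
def out_size(n, kernel, stride=1, padding=0):
  """Spatial size after a convolution or pooling layer (floor mode)."""
  return (n + 2 * padding - kernel) // stride + 1


side = 15                                              # assumed input: 15x15 map
side = out_size(side, kernel=5, stride=3, padding=1)   # AvgPool2d -> 5
side = out_size(side, kernel=4, stride=1, padding=1)   # Conv2d    -> 4
flat_features = 128 * side * side                      # 128 output channels

print(side, flat_features)  # 4 2048
```

If the auxiliary block were attached at a different resolution, the first linear layer would need a different input size.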
    

In [None]:
# Running automatic tests to validate our Inception Modules
x, in_chs, ch_1x1, ch_3x3_red, ch_3x3, ch_5x5_red, ch_5x5, ch_pool_proj = corrector.get_test_data(homework=4, question="1a", test=1, token=token)

with torch.no_grad():
  model = InceptionModule(in_chs, ch_3x3_red, ch_5x5_red, ch_3x3, ch_5x5, ch_pool_proj, ch_1x1)
  s = timer()
  result = model(torch.tensor(x))
  t = timer()-s

corrector.submit(homework=4, question="1a", test=1, token=token, answer=list(result.size()), time=t)

Correct Test!


In [None]:
class GoogLeNet(nn.Module):
  def __init__(self, n_classes, use_aux_logits=True):
    super(GoogLeNet, self).__init__()

    # We define our GoogLeNet model as a combination of sequential blocks.
    # The activations and regularization methods follow
    # the original references.

    self.first_block = nn.Sequential(
      nn.Conv2d(in_channels=3,out_channels=64,stride=1,kernel_size=7,padding=3),
      nn.ReLU(),
      nn.MaxPool2d(kernel_size=3,stride=1,padding=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=64,out_channels=64,stride=1,kernel_size=1),
      nn.ReLU(),
      nn.Conv2d(in_channels=64,out_channels=192,stride=1,kernel_size=3,padding=1),
      nn.ReLU(),
      nn.MaxPool2d(kernel_size=3,stride=1,padding=1),
      nn.ReLU(),
      InceptionModule(192),
      InceptionModule(256,96,16,240,60,60,120),
      nn.MaxPool2d(kernel_size=3,stride=2),
      nn.ReLU(),
      InceptionModule(480,96,16,256,64,64,128)
    )

    # Tentative first auxiliary output block
    self.output_1 = nn.Sequential(
      IntermediateOutputBlock(512),
      nn.Linear(1024,n_classes)
    )

    # Second block: its output has 528 channels
    self.second_block = nn.Sequential(
      InceptionModule(512,96,16,256,64,64,128),
      InceptionModule(512,96,16,256,64,64,128),
      InceptionModule(512,96,16,264,66,66,132)
    )

    # Tentative second auxiliary output block
    self.output_2 = nn.Sequential(
      IntermediateOutputBlock(528),
      nn.Linear(1024,n_classes)
    )

    self.third_block = nn.Sequential(
      InceptionModule(528,96,16,416,104,104,208),
      nn.MaxPool2d(kernel_size=3,stride=2),
      nn.ReLU(),
      InceptionModule(832,96,16,416,104,104,208),
      InceptionModule(832,96,16,512,128,128,256),
      nn.AvgPool2d(kernel_size=7),
      nn.ReLU(),
      nn.Flatten(),
    )

    self.output_block = nn.Sequential(
      nn.Dropout(p=0.4),
      nn.Linear(1024,n_classes)
    )


  def forward(self, x):
    aux_logits = []
    x = self.first_block(x)
    
    if self.training:
      aux_logits.append(self.output_1(x))

    x = self.second_block(x)
    
    if self.training:
      aux_logits.append(self.output_2(x))
    
    x = self.third_block(x)

    hidden = x

    res = self.output_block(x)

    return {'hidden': hidden, 'logits': res, 'aux_logits': aux_logits}

In [None]:
# Running automatic test to validate our GoogleNet model
x, n_classes, use_aux_logits = corrector.get_test_data(homework=4, question="1b", test=1, token=token)

with torch.no_grad():
  model = GoogLeNet(n_classes=n_classes, use_aux_logits=use_aux_logits)
  s = timer()
  result = model(torch.tensor(x))
  t = timer()-s

sizes = [result['hidden'].shape[0]] + list(result['logits'].size()) + [d for a in result['aux_logits'] for d in a.size()]
corrector.submit(homework=4, question="1b", test=1, token=token, answer=sizes, time=t)

Correct Test!


## ResNet
`ResNet` was released after `GoogleNet` and follows a similar strategy in terms of how the input changes as it moves through the network.

The main differences are the size of the models and the way they deal with the `vanishing gradient` problem. For ResNet, the base proposal is a model with 34 layers composed of several regular convolutional layers and `Residual Blocks`.

A residual block is how this model deals with the problems related to `depth`: it is a module with several convolutional operations whose result is added, at the end, to the `raw input` of the block itself. This gives the gradient multiple paths to travel along the network, letting it `skip` operations that may not be needed and reach layers further away.
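The effect on the gradient can be written out directly. The symbols $F$ (the convolutional path), $x$ (block input), $y$ (block output) and $\mathcal{L}$ (loss) are notation introduced here for illustration, and the normalization and activation applied after the sum are left out for simplicity:

$$
y = F(x) + x
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)
= \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x} + \frac{\partial \mathcal{L}}{\partial y}
$$

The second term carries the gradient to the block input unchanged, so it does not vanish even when $\partial F / \partial x$ is small. In the blocks of this notebook a `BatchNorm2d` and a `ReLU` follow the sum, so the identity path is only approximately direct.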


We will also work with a deeper variant of `ResNet-34` called `ResNet-50`. It uses the `ResidualBottleNeckBlock`, a residual block with an additional `1x1` convolution at the beginning and a final layer with (4 * output channels) channels. This increases the number of channels handled inside the model. The `BottleNeck` strategy reduces the number of operations performed on the now larger output of the residual block.
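A rough weight count shows what the bottleneck buys. The sketch below compares two plain `3x3` convolutions on 256 channels with a bottleneck of width `k = 64`. It ignores biases and batch normalization, and `conv_weights` is a helper name introduced here. The ResNet paper uses a `1x1` convolution for the final expansion. The `ResidualBottleNeckBlock` in this notebook passes `kernel_size` to its last convolution instead, so both variants are counted.

```python
def conv_weights(in_ch, out_ch, kernel):
  """Number of weights in a 2D convolution, ignoring biases."""
  return in_ch * out_ch * kernel * kernel


# Two 3x3 convolutions applied directly on 256 channels.
plain = conv_weights(256, 256, 3) + conv_weights(256, 256, 3)

# Bottleneck with k=64: 1x1 reduce, 3x3 on the narrow tensor, 1x1 expand to 4*k.
bottleneck_1x1 = (conv_weights(256, 64, 1) + conv_weights(64, 64, 3)
                  + conv_weights(64, 256, 1))

# Same block with a 3x3 expansion, as happens when the block in this
# notebook is built with kernel_size=3.
bottleneck_3x3 = (conv_weights(256, 64, 1) + conv_weights(64, 64, 3)
                  + conv_weights(64, 256, 3))

print(plain)           # 1179648
print(bottleneck_1x1)  # 69632
print(bottleneck_3x3)  # 200704
```

Both bottleneck variants are much cheaper than the plain pair at this width, although the `3x3` expansion costs about three times as many weights as the `1x1` version.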


In [None]:
import math
class ResidualBlock(nn.Module):
  def __init__(self, in_channels, out_channels, kernel_size, increase_initial_stride=False):
    super(ResidualBlock, self).__init__()

    self.seq_1 = nn.Sequential(
      nn.Conv2d(
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=2 if increase_initial_stride else 1,
        padding=math.ceil(float((kernel_size-1)) / 2.0)
      ),
      nn.BatchNorm2d(out_channels),
      nn.ReLU(),
      nn.Conv2d(
        in_channels=out_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=1,
        padding=math.ceil(float((kernel_size-1)) / 2.0)
      ),
      nn.BatchNorm2d(out_channels),
      nn.ReLU()
    )
    
    self.increase_initial_stride = increase_initial_stride
    self.in_ch = in_channels
    self.out_ch = out_channels
    self.batch = nn.BatchNorm2d(out_channels)
    self.rel = nn.ReLU()
    self.flat = nn.Flatten()

    
    self.conv_aux = nn.Conv2d(
      in_channels=in_channels,
      out_channels=out_channels,
      kernel_size=1,
      stride=2 if increase_initial_stride else 1,
    ) 

  def forward(self, x):
    # Applying conv1 activation and conv2
    partial_res = self.seq_1(x)
    
    # Using additional 1x1 conv layer to be able to add the residual
    # In the specific needed cases
    if self.in_ch != self.out_ch or self.increase_initial_stride:
      x = self.conv_aux(x)

    # Applying residual connection and activation 2
    partial_res = self.rel(self.batch(partial_res + x)) 
    return partial_res

class ResidualBottleNeckBlock(nn.Module):
  def __init__(self, in_channels, k, kernel_size, increase_initial_stride=False):
    super(ResidualBottleNeckBlock, self).__init__()
    self.seq_1 = nn.Sequential(
      nn.Conv2d(
        in_channels=in_channels,
        out_channels=k,
        kernel_size=1,
        stride=2 if increase_initial_stride else 1
      ),
      nn.BatchNorm2d(k),
      nn.ReLU(),
      nn.Conv2d(
        in_channels=k,
        out_channels=k,
        kernel_size=kernel_size,
        stride=1,
        padding=math.ceil(float((kernel_size-1)) / 2.0)
      ),
      nn.BatchNorm2d(k),
      nn.ReLU(),
      nn.Conv2d(
        in_channels=k,
        out_channels=4 * k,
        kernel_size=kernel_size,
        stride=1,
        padding=math.ceil(float((kernel_size-1)) / 2.0)
      ),
      nn.BatchNorm2d(4 * k),
      nn.ReLU()
    )
    
    self.increase_initial_stride = increase_initial_stride
    self.in_ch = in_channels
    self.out_ch = 4 * k
    self.rel = nn.ReLU()
    self.batch = nn.BatchNorm2d(4 * k)
    self.flat = nn.Flatten()

   
    self.conv_aux = nn.Conv2d(
      in_channels=in_channels,
      out_channels=4 * k,
      kernel_size=1,
      stride=2 if increase_initial_stride else 1
    )
  

  def forward(self, x):

    partial_res = self.seq_1(x)
    
    # Using additional 1x1 conv layer to be able to add the residual
    # In the specific needed cases
    if self.in_ch != self.out_ch or self.increase_initial_stride:
      x = self.conv_aux(x)
    
    # Applying residual connection and activation 2
    partial_res = self.rel(self.batch(partial_res + x))
    
    return partial_res

In [None]:
class ResNet34(nn.Module):
  def __init__(self, n_classes):
    super(ResNet34, self).__init__()
    self.first_block = nn.Sequential(
      nn.Conv2d(in_channels=3,out_channels=64,kernel_size=7,padding=3),
      nn.BatchNorm2d(64),
      nn.ReLU(),
      nn.MaxPool2d(kernel_size=3,stride=1,padding=1),
      nn.ReLU(),
      
      # RGroup 1
      ResidualBlock(64,64,3),
      ResidualBlock(64,64,3),
      ResidualBlock(64,64,3),

      # RGroup 2
      ResidualBlock(64,128,3,increase_initial_stride=True),
      ResidualBlock(128,128,3),
      ResidualBlock(128,128,3),
      ResidualBlock(128,128,3),

      # RGroup 3
      ResidualBlock(128,256,3,increase_initial_stride=True),
      ResidualBlock(256,256,3),
      ResidualBlock(256,256,3),
      ResidualBlock(256,256,3),
      ResidualBlock(256,256,3),
      ResidualBlock(256,256,3),

      # RGroup 4
      ResidualBlock(256,512,3,increase_initial_stride=True),
      ResidualBlock(512,512,3),
      ResidualBlock(512,512,3)
    )

    self.second_block = nn.Sequential(
      nn.AvgPool2d(kernel_size=4,stride=1),
      nn.ReLU(),
      nn.Flatten()
    )

    self.output_block =  nn.Linear(512,n_classes)

  def forward(self, x):
    hidden = self.second_block(self.first_block(x))
    return {'hidden': hidden, 'logits': self.output_block(hidden)}


class ResNet50(nn.Module):
  def __init__(self, n_classes):
    super(ResNet50, self).__init__()
    self.first_block = nn.Sequential(
      nn.Conv2d(in_channels=3,out_channels=64,kernel_size=7,padding=3),
      nn.BatchNorm2d(64),
      nn.ReLU(),
      nn.MaxPool2d(kernel_size=3,stride=1,padding=1),
      nn.ReLU(),
      
      # RGroup 1
      ResidualBottleNeckBlock(64,64,3),
      ResidualBottleNeckBlock(256,64,3),
      ResidualBottleNeckBlock(256,64,3),

      # RGroup 2
      ResidualBottleNeckBlock(256,128,3,increase_initial_stride=True),
      ResidualBottleNeckBlock(512,128,3),
      ResidualBottleNeckBlock(512,128,3),
      ResidualBottleNeckBlock(512,128,3),

      # RGroup 3
      ResidualBottleNeckBlock(512,256,3,increase_initial_stride=True),
      ResidualBottleNeckBlock(1024,256,3),
      ResidualBottleNeckBlock(1024,256,3),
      ResidualBottleNeckBlock(1024,256,3),
      ResidualBottleNeckBlock(1024,256,3),
      ResidualBottleNeckBlock(1024,256,3),

      # RGroup 4
      ResidualBottleNeckBlock(1024,512,3,increase_initial_stride=True),
      ResidualBottleNeckBlock(2048,512,3),
      ResidualBottleNeckBlock(2048,512,3)
    )

    self.second_block = nn.Sequential(
      nn.AvgPool2d(kernel_size=4,stride=1),
      nn.ReLU(),
      nn.Flatten()
    )
    self.output_block = nn.Linear(2048,n_classes)

  def forward(self, x):
    hidden = self.second_block(self.first_block(x))
    return {'hidden': hidden, 'logits': self.output_block(hidden)}

## Testing our models with CIFAR10

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Files already downloaded and verified
Files already downloaded and verified


### Testing with GoogleNet

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 20
REPORTS_EVERY = 1

net = GoogLeNet(n_classes=10, use_aux_logits=True)
optimizer = optim.AdamW(net.parameters(),lr=LR)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(
  optimizer=optimizer,step_size=1,gamma=0.85,verbose=True
)

train_loader = DataLoader(
  trainset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2
)
test_loader = DataLoader(
  testset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2
)

google_train_loss, google_acc = train_for_classification(
  net, train_loader, 
  test_loader,optimizer, 
  criterion, lr_scheduler=scheduler, 
  epochs=EPOCHS, reports_every=REPORTS_EVERY
)

google_train_loss = {
  "x": [x+1 for x in range(20)],
  "y": google_train_loss,
  "name": "GoogleNet Training Loss"
}

google_train_acc = {
  "x": [x+1 for x in range(20)],
  "y": google_acc[0],
  "name": "GoogleNet Training Acc"
}

google_val_acc = {
  "x": [x+1 for x in range(20)],
  "y": google_acc[1],
  "name": "GoogleNet Validation Acc"
}




Adjusting learning rate of group 0 to 1.0000e-03.
Epoch:1(50000/50000), lr:0.0010000, Loss:1.88960, Train Acc:25.1%, Validating..., Val Acc:37.37%, Avg-Time:172.791s.
Adjusting learning rate of group 0 to 8.5000e-04.
Epoch:2(50000/50000), lr:0.0008500, Loss:1.50800, Train Acc:43.0%, Validating..., Val Acc:48.59%, Avg-Time:172.528s.
Adjusting learning rate of group 0 to 7.2250e-04.
Epoch:3(50000/50000), lr:0.0007225, Loss:1.29346, Train Acc:51.7%, Validating..., Val Acc:54.77%, Avg-Time:172.535s.
Adjusting learning rate of group 0 to 6.1412e-04.
Epoch:4(50000/50000), lr:0.0006141, Loss:1.16004, Train Acc:57.1%, Validating..., Val Acc:58.68%, Avg-Time:172.536s.
Adjusting learning rate of group 0 to 5.2201e-04.
Epoch:5(50000/50000), lr:0.0005220, Loss:1.05038, Train Acc:61.6%, Validating..., Val Acc:61.00%, Avg-Time:172.621s.
Adjusting learning rate of group 0 to 4.4371e-04.
Epoch:6(50000/50000), lr:0.0004437, Loss:0.96930, Train Acc:64.8%, Validating..., Val Acc:62.64%, Avg-Time:172.660s
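The learning rates printed in the log above are consistent with a plain geometric decay: with `step_size=1` and `gamma=0.85`, `StepLR` multiplies the rate by 0.85 after every epoch. The sketch below reproduces the schedule in pure Python. The helper `step_lr` is a name introduced here, not part of PyTorch.

```python
def step_lr(initial_lr, gamma, epoch, step_size=1):
  """Learning rate used during the given 1-based epoch under step decay."""
  return initial_lr * gamma ** ((epoch - 1) // step_size)


# Values copied from the "Adjusting learning rate" lines of the log.
logged = [1.0000e-03, 8.5000e-04, 7.2250e-04, 6.1412e-04, 5.2201e-04, 4.4371e-04]

for epoch, logged_lr in enumerate(logged, start=1):
  computed = step_lr(1e-3, 0.85, epoch)
  matches = abs(computed - logged_lr) <= 1e-4 * logged_lr
  print(epoch, computed, matches)
```

At this rate the step size shrinks to about 4.6% of its initial value by epoch 20, which is worth keeping in mind when comparing how fast the models converge.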

In [None]:
plot_results([google_train_loss], [google_train_acc, google_val_acc])





In [None]:
# Running automatic tests
x, y = list(test_loader)[0]
net.cpu()
net.eval()
y_pred = net(x)['logits'].max(dim=1)[1]
print("Correct Test!" if (y==y_pred).sum()/len(x) >= .75 else "Failed Test! [acc]")

Failed Test! [acc]
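The check above scores only the first batch of 50 test images, so its result can differ from the validation accuracy reported during training. A hedged sketch of an accuracy computation over a whole loader follows. The function name `accuracy` and its `device` argument are assumptions introduced here, and the model is expected to return a dictionary with a `'logits'` entry, as the models in this notebook do.

```python
import torch


def accuracy(net, loader, device='cpu'):
  """Fraction of correctly classified samples over every batch in `loader`."""
  net.to(device)
  net.eval()
  correct, total = 0, 0
  with torch.no_grad():
    for x, y in loader:
      logits = net(x.to(device))['logits']
      correct += (logits.argmax(dim=1) == y.to(device)).sum().item()
      total += y.size(0)
  return correct / total
```

It could be called as `accuracy(net, test_loader)` after training and compared against the same 0.75 threshold.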


### Testing with ResNet-34

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 20
REPORTS_EVERY = 1

net = ResNet34(n_classes=10)
optimizer = optim.AdamW(net.parameters(),lr=LR)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(
  optimizer=optimizer,step_size=1,gamma=0.85,verbose=True
)

train_loader = DataLoader(
  trainset, batch_size=BATCH_SIZE,
  shuffle=True, num_workers=2
)
test_loader = DataLoader(
  testset, batch_size=BATCH_SIZE,
  shuffle=False, num_workers=2
)

res34_train_loss, res34_acc = train_for_classification(
  net, train_loader, 
  test_loader, optimizer, 
  criterion, lr_scheduler=scheduler, 
  epochs=EPOCHS, reports_every=REPORTS_EVERY
)

res34_train_loss = {
  "x": [x+1 for x in range(20)],
  "y": res34_train_loss,
  "name": "ResNet-34 Training Loss"
}

res34_train_acc = {
  "x": [x+1 for x in range(20)],
  "y": res34_acc[0],
  "name": "ResNet-34 Training Acc"
}

res34_val_acc = {
  "x": [x+1 for x in range(20)],
  "y": res34_acc[1],
  "name": "ResNet-34 Validation Acc"
}

Adjusting learning rate of group 0 to 1.0000e-03.
Epoch:1(50000/50000), lr:0.0010000, Loss:1.63300, Train Acc:38.8%, Validating..., Val Acc:46.71%, Avg-Time:144.865s.
Adjusting learning rate of group 0 to 8.5000e-04.
Epoch:2(50000/50000), lr:0.0008500, Loss:1.17505, Train Acc:57.6%, Validating..., Val Acc:61.87%, Avg-Time:144.890s.
Adjusting learning rate of group 0 to 7.2250e-04.
Epoch:3(50000/50000), lr:0.0007225, Loss:0.92354, Train Acc:67.1%, Validating..., Val Acc:69.03%, Avg-Time:144.963s.
Adjusting learning rate of group 0 to 6.1412e-04.
Epoch:4(50000/50000), lr:0.0006141, Loss:0.73227, Train Acc:74.3%, Validating..., Val Acc:74.85%, Avg-Time:145.010s.
Adjusting learning rate of group 0 to 5.2201e-04.
Epoch:5(50000/50000), lr:0.0005220, Loss:0.57626, Train Acc:80.0%, Validating..., Val Acc:76.79%, Avg-Time:145.025s.
Adjusting learning rate of group 0 to 4.4371e-04.
Epoch:6(50000/50000), lr:0.0004437, Loss:0.44443, Train Acc:84.6%, Validating..., Val Acc:79.39%, Avg-Time:145.024s

In [None]:
plot_results([res34_train_loss], [res34_train_acc,res34_val_acc])





### Testing with ResNet-50

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 20
REPORTS_EVERY = 1

net = ResNet50(n_classes=10)
optimizer = optim.AdamW(net.parameters(),lr=LR)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(
  optimizer=optimizer,step_size=1,gamma=0.85,verbose=True
)

train_loader = DataLoader(
  trainset, batch_size=BATCH_SIZE,
  shuffle=True, num_workers=2
)
test_loader = DataLoader(
  testset, batch_size=BATCH_SIZE,
  shuffle=False, num_workers=2
)

res50_train_loss, res50_acc = train_for_classification(
  net, train_loader, 
  test_loader, optimizer, 
  criterion, lr_scheduler=scheduler, 
  epochs=EPOCHS, reports_every=REPORTS_EVERY
)

res50_train_loss = {
  "x": [x+1 for x in range(20)],
  "y": res50_train_loss,
  "name": "ResNet-50 Training Loss"
}

res50_train_acc = {
  "x": [x+1 for x in range(20)],
  "y": res50_acc[0],
  "name": "ResNet-50 Training Acc"
}

res50_val_acc = {
  "x": [x+1 for x in range(20)],
  "y": res50_acc[1],
  "name": "ResNet-50 Validation Acc"
}



Adjusting learning rate of group 0 to 1.0000e-03.
Epoch:1(50000/50000), lr:0.0010000, Loss:1.62492, Train Acc:40.9%, Validating..., Val Acc:52.80%, Avg-Time:304.801s.
Adjusting learning rate of group 0 to 8.5000e-04.
Epoch:2(50000/50000), lr:0.0008500, Loss:1.22233, Train Acc:56.5%, Validating..., Val Acc:58.35%, Avg-Time:304.847s.
Adjusting learning rate of group 0 to 7.2250e-04.
Epoch:3(50000/50000), lr:0.0007225, Loss:0.98424, Train Acc:65.3%, Validating..., Val Acc:65.82%, Avg-Time:304.839s.
Adjusting learning rate of group 0 to 6.1412e-04.
Epoch:4(50000/50000), lr:0.0006141, Loss:0.82056, Train Acc:71.2%, Validating..., Val Acc:72.04%, Avg-Time:304.918s.
Adjusting learning rate of group 0 to 5.2201e-04.
Epoch:5(50000/50000), lr:0.0005220, Loss:0.68643, Train Acc:76.0%, Validating..., Val Acc:74.74%, Avg-Time:305.006s.
Adjusting learning rate of group 0 to 4.4371e-04.
Epoch:6(50000/50000), lr:0.0004437, Loss:0.57593, Train Acc:80.0%, Validating..., Val Acc:74.86%, Avg-Time:305.063s

In [None]:
plot_results([res50_train_loss], [res50_train_acc,res50_val_acc])





## Testing Our Models with CIFAR100

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR100(root='./data/cifar100', train=True,
                                         download=True, transform=transform)

testset = torchvision.datasets.CIFAR100(root='./data/cifar100', train=False,
                                        download=True, transform=transform)

Files already downloaded and verified
Files already downloaded and verified


### Testing with GoogleNet

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 20
REPORTS_EVERY = 1

net = GoogLeNet(n_classes=100, use_aux_logits=True)
optimizer = optim.AdamW(net.parameters(),lr=LR)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(
  optimizer=optimizer,step_size=1,gamma=0.85,verbose=True
)

train_loader = DataLoader(
  trainset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2
)
test_loader = DataLoader(
  testset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2
)

google_train_loss, google_acc = train_for_classification(
  net, train_loader, 
  test_loader,optimizer, 
  criterion, lr_scheduler=scheduler, 
  epochs=EPOCHS, reports_every=REPORTS_EVERY
)

google_train_loss_100 = {
  "x": [x+1 for x in range(20)],
  "y": google_train_loss,
  "name": "GoogleNet Training Loss"
}

google_train_acc_100 = {
  "x": [x+1 for x in range(20)],
  "y": google_acc[0],
  "name": "GoogleNet Training Acc"
}

google_val_acc_100 = {
  "x": [x+1 for x in range(20)],
  "y": google_acc[1],
  "name": "GoogleNet Validation Acc"
}

### Testing with ResNet-34

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 20
REPORTS_EVERY = 1

net = ResNet34(n_classes=100)
optimizer = optim.AdamW(net.parameters(),lr=LR)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(
  optimizer=optimizer,step_size=1,gamma=0.85,verbose=True
)

train_loader = DataLoader(
  trainset, batch_size=BATCH_SIZE,
  shuffle=True, num_workers=2
)
test_loader = DataLoader(
  testset, batch_size=BATCH_SIZE,
  shuffle=False, num_workers=2
)

res34_train_loss, res34_acc = train_for_classification(
  net, train_loader, 
  test_loader, optimizer, 
  criterion, lr_scheduler=scheduler, 
  epochs=EPOCHS, reports_every=REPORTS_EVERY
)

res34_train_loss_100 = {
  "x": [x+1 for x in range(20)],
  "y": res34_train_loss,
  "name": "ResNet-34 Training Loss"
}

res34_train_acc_100 = {
  "x": [x+1 for x in range(20)],
  "y": res34_acc[0],
  "name": "ResNet-34 Training Acc"
}

res34_val_acc_100 = {
  "x": [x+1 for x in range(20)],
  "y": res34_acc[1],
  "name": "ResNet-34 Validation Acc"
}

### Testing with ResNet-50

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 20
REPORTS_EVERY = 1

net = ResNet50(n_classes=100)
optimizer = optim.AdamW(net.parameters(),lr=LR)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.StepLR(
  optimizer=optimizer,step_size=1,gamma=0.85,verbose=True
)

train_loader = DataLoader(
  trainset, batch_size=BATCH_SIZE,
  shuffle=True, num_workers=2
)
test_loader = DataLoader(
  testset, batch_size=BATCH_SIZE,
  shuffle=False, num_workers=2
)

res50_train_loss, res50_acc = train_for_classification(
  net, train_loader, 
  test_loader, optimizer, 
  criterion, lr_scheduler=scheduler, 
  epochs=EPOCHS, reports_every=REPORTS_EVERY
)

res50_train_loss_100 = {
  "x": [x+1 for x in range(20)],
  "y": res50_train_loss,
  "name": "ResNet-50 Training Loss"
}

res50_train_acc_100 = {
  "x": [x+1 for x in range(20)],
  "y": res50_acc[0],
  "name": "ResNet-50 Training Acc"
}

res50_val_acc_100 = {
  "x": [x+1 for x in range(20)],
  "y": res50_acc[1],
  "name": "ResNet-50 Validation Acc"
}



## Results

After observing multiple experiments, we can see that the `ResNet` models offer better results in terms of the metric used (accuracy).

`ResNet-50` shows a slight improvement in convergence, but it is also a bigger model, which is reflected in the average time (`Avg-Time`) reported for each model.

All our experiments used the same configuration in terms of scheduler, optimizer, loss function and learning rate.

Regarding the optimizer, we did not use the ones specified in the respective references of the studied models; we used `AdamW` instead.

From now on, we will use for encoding the model with the best observed results, the previously mentioned `ResNet-50`.

# Part 2: Image Captioning
In this section we explore the image captioning problem. We use pre-trained embedding representations of the captions, together with the convolutional models defined above to obtain an encoded representation of the images.

Finally, a new loss function is needed due to the nature of the problem. In this case we explore the `TripletLoss` function and some of the parameters it receives.


## Image and Text encoding
In this section we create classes to encode our images and the pre-trained caption embeddings. We use the same dense network structure for both, but this can easily be changed and may significantly affect the results.

Additionally, we use the most common regularization strategies, based on `Dropout` and `Batch Normalization` layers.



In [None]:
# We use as base one of the models defined before
# and add some dense layers to find relations between our representations

class ImageEncoding(nn.Module):
  def __init__(self, cnn_model, cnn_out_size, out_size=128):
    super(ImageEncoding, self).__init__()
    self.cnn_model = cnn_model

    self.seq = nn.Sequential(
      nn.Dropout(p=0.4),
      nn.Linear(in_features=cnn_out_size,out_features=512),
      nn.BatchNorm1d(num_features=512),
      nn.ReLU(),
      nn.Dropout(p=0.2),
      nn.Linear(in_features=512,out_features=256),
      nn.BatchNorm1d(num_features=256),
      nn.ReLU(),
      nn.Dropout(p=0.1),
      nn.Linear(in_features=256,out_features=out_size)
    )

  def forward(self, x):
    x = self.cnn_model(x)['hidden']
    x = self.seq(x)
    return {'logits': x}

In [None]:
# We encode the pre-trained embeddings with some dense layers,
# looking for non-linear relations and applying regularization strategies
class TextEncoding(nn.Module):
  def __init__(self, text_embedding_size=4096, out_size=128):
    super(TextEncoding, self).__init__()

    self.seq = nn.Sequential(
      nn.Dropout(p=0.4),
      nn.Linear(in_features=text_embedding_size,out_features=512),
      nn.BatchNorm1d(num_features=512),
      nn.ReLU(),
      nn.Dropout(p=0.2),
      nn.Linear(in_features=512,out_features=256),
      nn.BatchNorm1d(num_features=256),
      nn.ReLU(),
      nn.Dropout(p=0.1),
      nn.Linear(in_features=256,out_features=out_size)
    )

  def forward(self, x):
    return {'logits': self.seq(x) }

In [None]:
# Running some automatic tests for our encodings
OUT_SIZE = 200

cnn_net = GoogLeNet(n_classes=10)
i_enc = ImageEncoding(cnn_model=cnn_net, cnn_out_size=1024, out_size=OUT_SIZE)
t_enc = TextEncoding(text_embedding_size=4096, out_size=OUT_SIZE)
i_enc.eval()
t_enc.eval()

print("Correct Test!" if (i_enc(torch.randn(9,3,32,32))['logits'].size()==t_enc(torch.randn(9,4096))['logits'].size()) else "Failed Test [size]")
print("Correct Test!" if (i_enc(torch.randn(9,3,32,32))['logits'].size(-1)==OUT_SIZE) else "Failed Test [size]")

Correct Test!
Correct Test!


## *Triplet Loss*

We need a loss function suited to the nature of this problem. One of the most commonly used is the `TripletLoss`. It receives an `anchor` image and a set of captions; the idea is to minimize the distance between the `anchor` and its `correct` caption and to push the `incorrect` captions at least a `margin` further away. The distance used is the Euclidean distance, and the `negative` parameter determines which `incorrect` captions the anchor is compared against.
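As a sketch of what the `TripletLoss` class in the next cell computes (the symbols $a_i$, $p_j$, $m$ and $d$ are notation introduced here for illustration): for a batch of $N$ image encodings $a_i$ and caption encodings $p_j$, where $p_i$ is the correct caption of $a_i$, every incorrect caption $j \neq i$ contributes a hinge cost

$$
\ell_{ij} = \max\bigl(0,\; d(a_i, p_i) - d(a_i, p_j) + m\bigr), \qquad d(a, p) = \lVert a - p \rVert_2 .
$$

With `negative='max'` only the largest cost $\max_{j \neq i} \ell_{ij}$ of each anchor is kept, with `'random'` one incorrect caption per anchor is sampled uniformly, and with `'all'` every $\ell_{ij}$ is kept. The loss is then the mean of the kept costs that are strictly positive.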



In [None]:
# The negative parameter selects which incorrect captions N_i are compared against the anchor image
# valid values: max (hardest negative), random (one sampled negative), all (every negative)
class TripletLoss(nn.Module):
  def __init__(self, margin=.2, negative='max'):
    super(TripletLoss, self).__init__()
    self.margin = margin
    self.negative = negative

  def forward(self, anchor, positive):
    dists = torch.cdist(anchor,positive,p=2.0)

    p_dists = torch.diagonal(dists)
    p_dists = p_dists.unsqueeze(1).expand_as(dists)

    cost = (p_dists - dists + self.margin).clamp(min=0).fill_diagonal_(0)

    if self.negative == 'max':
      cost = torch.max(cost,1)[0]
    elif self.negative == 'random':
      weights = torch.ones_like(cost).fill_diagonal_(0) 
      ids = torch.multinomial(weights, num_samples=1)
      cost = cost.gather(1, ids)

    return cost[cost>0].mean()

In [None]:
# Running automatic tests for our TripletLoss
for test in [1,2]:
  a, p, m, n  = corrector.get_test_data(homework=4, question="2b", test=test, token=token)

  criterion = TripletLoss(margin=m, negative=n)
  result = criterion(torch.tensor(a), torch.tensor(p)).item()

  corrector.submit(homework=4, question="2b", test=test, token=token, answer=result, time=0)

Correct Test!
Correct Test!
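To check the arithmetic of the `max` and `all` strategies by hand, here is a minimal, dependency-free sketch on a toy batch. The helper names `triplet_costs` and `reduce_costs` and the toy points are hypothetical and only mirror the logic of the `TripletLoss` class above. One deliberate difference, stated as an assumption: the sketch returns `0.0` when no triplet violates the margin, whereas taking the mean of an empty tensor in PyTorch yields `nan`.

```python
import math

def triplet_costs(anchors, positives, margin):
    """costs[i][j] = max(0, d(a_i, p_i) - d(a_i, p_j) + margin) for j != i, else 0."""
    n = len(anchors)
    costs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d_pos = math.dist(anchors[i], positives[i])  # distance to the correct caption
        for j in range(n):
            if j != i:
                d_neg = math.dist(anchors[i], positives[j])  # distance to an incorrect caption
                costs[i][j] = max(0.0, d_pos - d_neg + margin)
    return costs

def reduce_costs(costs, negative):
    """Mean of the strictly positive costs kept by the chosen strategy."""
    if negative == "max":
        kept = [max(row) for row in costs]          # hardest negative per anchor
    elif negative == "all":
        kept = [c for row in costs for c in row]    # every negative
    else:
        raise ValueError(f"strategy not covered by this sketch: {negative}")
    active = [c for c in kept if c > 0]
    # Assumption of this sketch: 0.0 (not NaN) when nothing violates the margin.
    return sum(active) / len(active) if active else 0.0

# Toy batch: three "image" points and their "caption" points on a line.
anchors = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
positives = [(1.0, 0.0), (2.0, 0.0), (0.5, 0.0)]

costs = triplet_costs(anchors, positives, margin=1.0)
loss_max = reduce_costs(costs, "max")
loss_all = reduce_costs(costs, "all")
print(costs)
print(loss_max, loss_all)
```

In this toy batch the second anchor already satisfies the margin against both incorrect captions, so it contributes nothing under either strategy. `max` averages only the hardest violation of the other two anchors, while `all` also counts the milder violation of the third anchor, which pulls the mean down.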


## Testing in Flickr8k

In [15]:
# Downloading Flickr8k data 

folder_path = './data/flickr8k'
if not os.path.exists(f'{folder_path}/images'):
  print('\n*** Downloading and extracting Flickr8k, sit back and relax for 4 mins...')
  print('****** Downloading the images...\n')
  !wget https://s06.imfd.cl/04/CC6204/tareas/tarea4/Flickr8k_Dataset.zip -P $folder_path/images
  print('\n********* Extracting the images...\n  If a Colab message pops up, click Ignore\n')
  !unzip -q $folder_path/images/Flickr8k_Dataset.zip -d $folder_path/images
  print('\n*** Downloading the image annotations...\n')
  !wget http://hockenmaier.cs.illinois.edu/8k-pictures.html -P $folder_path/annotations

transform=transforms.Compose([transforms.ToTensor(), 
                              transforms.Resize((32, 32)),
                              transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

print('Initializing PyTorch Flickr8k dataset')
full_flickr_set = torchvision.datasets.Flickr8k(root=f'{folder_path}/images/Flicker8k_Dataset',
                                                ann_file = f'{folder_path}/annotations/8k-pictures.html',
                                                transform=transform)
print('Creating train, val and test splits...')

train_flickr_set, val_flickr_set, test_flickr_set = [], [], []
for i, item in enumerate(full_flickr_set):
  if i<6000:
    train_flickr_set.append(item)
  elif i<7000:
    val_flickr_set.append(item)
  else:
    test_flickr_set.append(item)


*** Downloading and extracting Flickr8k, sit back and relax for 4 mins...
****** Downloading the images...

--2020-12-18 23:45:17--  https://s06.imfd.cl/04/CC6204/tareas/tarea4/Flickr8k_Dataset.zip
Resolving s06.imfd.cl (s06.imfd.cl)... 192.80.24.186
Connecting to s06.imfd.cl (s06.imfd.cl)|192.80.24.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115419746 (1.0G) [application/zip]
Saving to: ‘./data/flickr8k/images/Flickr8k_Dataset.zip’


2020-12-18 23:48:10 (6.01 MB/s) - Connection closed at byte 1081176461. Retrying.

--2020-12-18 23:48:11--  (try: 2)  https://s06.imfd.cl/04/CC6204/tareas/tarea4/Flickr8k_Dataset.zip
Connecting to s06.imfd.cl (s06.imfd.cl)|192.80.24.186|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1115419746 (1.0G), 34243285 (33M) remaining [application/zip]
Saving to: ‘./data/flickr8k/images/Flickr8k_Dataset.zip’

Flickr8k_Dataset.zi 100%[+++++++++++++++++++>]   1.04G  4.91MB/s    in 7.8s    

202

In [None]:
# Downloading pre-trained embeddings for the captions

if not os.path.exists(f'{folder_path}/flickr_cap_encodings_4096d.pkl'):
  !wget https://s06.imfd.cl/04/CC6204/tareas/tarea4/flickr_cap_encodings_4096d.pkl -P $folder_path

with open(f'{folder_path}/flickr_cap_encodings_4096d.pkl', 'rb') as f:
  train_cap_encs, val_cap_encs, test_cap_encs = pickle.load(f)

# Using ImageCaptionDataset (imported from the course utils.py) to handle all our data
train_flickr_tripletset = ImageCaptionDataset(train_flickr_set, train_cap_encs)
val_flickr_tripletset = ImageCaptionDataset(val_flickr_set, val_cap_encs)
test_flickr_tripletset = ImageCaptionDataset(test_flickr_set, test_cap_encs)


--2020-12-18 23:50:30--  https://s06.imfd.cl/04/CC6204/tareas/tarea4/flickr_cap_encodings_4096d.pkl
Resolving s06.imfd.cl (s06.imfd.cl)... 192.80.24.186
Connecting to s06.imfd.cl (s06.imfd.cl)|192.80.24.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 628212160 (599M) [application/octet-stream]
Saving to: ‘./data/flickr8k/flickr_cap_encodings_4096d.pkl’

         flickr_cap   9%[>                   ]  56.65M  6.60MB/s    eta 1m 47s 

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 20
REPORTS_EVERY = 1
CNN_OUT_SIZE = 2048
EMBEDDING_SIZE = 4096
OUT_SIZE = 512
MARGIN = .2
NEGATIVE = "all"

# The ResNet50 argument is a dummy value: we use this model only for encoding, not for classification
# TODO: should it be replaced by a default value?
cnn_net = ResNet50(10)
img_net = ImageEncoding(cnn_model=cnn_net, cnn_out_size=CNN_OUT_SIZE, 
                        out_size=OUT_SIZE) 

text_net = TextEncoding(text_embedding_size=EMBEDDING_SIZE, out_size=OUT_SIZE)

optimizer = optim.AdamW([{'params': img_net.parameters()},
                        {'params': text_net.parameters()}],
                       lr=LR)
criterion = TripletLoss(margin=MARGIN, negative=NEGATIVE)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer=optimizer,step_size=5,gamma=0.85,verbose=True)

train_triplets_loader = DataLoader(train_flickr_tripletset, batch_size=BATCH_SIZE,
                                   shuffle=True, num_workers=2)
val_triplets_loader = DataLoader(val_flickr_tripletset, batch_size=BATCH_SIZE,
                                 shuffle=False, num_workers=2)

train_loss_flick, meanrr_flick, r10_flick = train_for_retrieval(img_net, text_net, 
                                              train_triplets_loader, 
                                              val_triplets_loader, optimizer, 
                                              criterion, scheduler, EPOCHS, 
                                              REPORTS_EVERY, norm=False)

train_loss_flick = {
  "x": [x+1 for x in range(20)],
  "y": train_loss_flick,
  "name": "ResNet-50 Flickr8k Training Loss"
}

meanrr_train_flick = {
  "x": [x+1 for x in range(20)],
  "y": meanrr_flick[0],
  "name": "ResNet-50 Flickr8k Training MRR"
}

meanrr_val_flick = {
  "x": [x+1 for x in range(20)],
  "y": meanrr_flick[1],
  "name": "ResNet-50 Flickr8k Validation MRR"
}

r10_train_flick = {
  "x": [x+1 for x in range(20)],
  "y": r10_flick[0],
  "name": "ResNet-50 Flickr8k Training R@10"
}

r10_val_flick = {
  "x": [x+1 for x in range(20)],
  "y": r10_flick[1],
  "name": "ResNet-50 Flickr8k Validation R@10"
}


In [None]:
plot_results([train_loss_flick], [meanrr_train_flick,meanrr_val_flick], [r10_train_flick,r10_val_flick])

In [None]:
# Running automatic tests
from PIL import Image
n_samples = 64

samples = torch.stack([test_flickr_tripletset[i][0] for i in range(n_samples)]).cuda()
refs = torch.stack([torch.from_numpy(test_flickr_tripletset[i][1]) for i in range(n_samples)]).cuda()
test_caps = [caps[0] for _, caps in test_flickr_set][:n_samples]

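# Evaluation mode disables Dropout and makes BatchNorm use its running statistics
img_net.eval()
text_net.eval()
with torch.no_grad():
  samples_enc = img_net(samples)['logits']
  refs_enc = text_net(refs)['logits']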

dists = torch.cdist(samples_enc.unsqueeze(0), refs_enc.unsqueeze(0), p=2).squeeze(0)
ranks = torch.argsort(dists, dim=1)[:,:10]
r10 = len([i for i in range(len(ranks)) if len(torch.where(ranks[i,:] == i)[0])]) / len(ranks)

print("Correct Test!" if r10 >= .25 else "Failed Test! [R@10]")

fig, axs = plt.subplots(nrows=n_samples, figsize=(2,n_samples*5))
for i in range(n_samples):
  axs[i].imshow(Image.open(full_flickr_set.ids[7000+i]))
  axs[i].text(600,0,"EXPECTED:\n{}: {}".format(i, test_caps[i]), fontsize=12, fontweight='bold')
  axs[i].text(600,750,"PREDICTED RANK:\n{}".format('\n'.join([f'{j}: {test_caps[j]}' for j in ranks[i]])), fontsize=12)
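The two retrieval metrics reported during training, MRR and R@10, can be sketched from a distance matrix like the `dists` computed in the test above. This is a minimal illustration using the standard definitions; the exact computation inside `train_for_retrieval` is not shown in this notebook, and the names `rank_of_match`, `retrieval_metrics` and `toy_dists` are hypothetical.

```python
def rank_of_match(dist_row, match_idx):
    """1-based position of the correct caption when captions are sorted by distance."""
    order = sorted(range(len(dist_row)), key=lambda j: dist_row[j])
    return order.index(match_idx) + 1

def retrieval_metrics(dists, k=10):
    """Return (MRR, Recall@k), assuming the correct caption of row i is column i."""
    ranks = [rank_of_match(row, i) for i, row in enumerate(dists)]
    mrr = sum(1.0 / r for r in ranks) / len(ranks)           # mean reciprocal rank
    recall_at_k = sum(r <= k for r in ranks) / len(ranks)    # share of hits in the top k
    return mrr, recall_at_k

# Toy image-to-caption distances (rows: images, columns: captions).
toy_dists = [
    [0.1, 0.5, 0.9],  # correct caption is the closest: rank 1
    [0.2, 0.4, 0.3],  # correct caption is the farthest: rank 3
    [0.7, 0.9, 0.8],  # correct caption is second closest: rank 2
]

mrr, recall_at_2 = retrieval_metrics(toy_dists, k=2)
print(mrr, recall_at_2)
```

With ranks 1, 3 and 2, the MRR is the mean of 1, 1/3 and 1/2, and two of the three images have their caption in the top 2. The R@10 check in the cell above follows the same idea with `k=10` on 64 samples.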

## Testing with COCO captions

In [None]:
# Downloading data for COCO captions dataset

folder_path = './data/coco-caps'
if not os.path.exists(f'{folder_path}/images/train2014'):
  print('\n*** Downloading and extracting COCO Captions, sit back and relax for about 20 mins...')
  print('****** Downloading training set...\n')
  !wget http://images.cocodataset.org/zips/train2014.zip -P $folder_path/images
  print('\n********* Extracting training set...\n  If a Colab message pops up, click Ignore\n')
  !unzip -q $folder_path/images/train2014.zip -d $folder_path/images && rm $folder_path/images/train2014.zip
  print('\n*** Downloading and extracting validation set...\n')
  !wget http://images.cocodataset.org/zips/val2014.zip -P $folder_path/images && unzip -q $folder_path/images/val2014.zip -d $folder_path/images && rm $folder_path/images/val2014.zip
  print('\n*** Downloading the image annotations...\n')
  !wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip -P $folder_path && unzip -q $folder_path/annotations_trainval2014.zip -d $folder_path && rm $folder_path/annotations_trainval2014.zip

transform=transforms.Compose([transforms.ToTensor(), 
                              transforms.Resize((32, 32)),
                              transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_coco_set = torchvision.datasets.CocoCaptions(root=f'{folder_path}/images/train2014',
                                                   annFile = f'{folder_path}/annotations/captions_train2014.json',
                                                   transform=transform)

val_coco_set = torchvision.datasets.CocoCaptions(root=f'{folder_path}/images/val2014',
                                                 annFile = f'{folder_path}/annotations/captions_val2014.json',
                                                 transform=transform)


In [None]:
# Downloading pre-trained caption embeddings for COCO
if not os.path.exists(f'{folder_path}/cap_encodings_512d.pkl'):
  !wget https://s06.imfd.cl/04/CC6204/tareas/tarea4/cap_encodings_512d.pkl -P $folder_path

with open(f'{folder_path}/cap_encodings_512d.pkl', 'rb') as f:
  train_cap_encs, val_cap_encs = pickle.load(f)

train_coco_tripletset = ImageCaptionDataset(train_coco_set, train_cap_encs)
val_coco_tripletset = ImageCaptionDataset(val_coco_set, val_cap_encs)

In [None]:
BATCH_SIZE = 50
LR = 1e-3
EPOCHS = 5
REPORTS_EVERY = 1
CNN_PREV_SIZE = 2048
EMBEDDING_SIZE = 512
OUT_SIZE = 512
MARGIN = .2
NEGATIVE = "all"

cnn_net = ResNet50(10)
img_net = ImageEncoding(cnn_model=cnn_net, cnn_out_size=CNN_PREV_SIZE, 
                        out_size=OUT_SIZE) 

text_net = TextEncoding(text_embedding_size=EMBEDDING_SIZE, out_size=OUT_SIZE)

optimizer = optim.Adam([{'params': img_net.parameters()},
                        {'params': text_net.parameters()}], 
                       lr=LR)
criterion = TripletLoss(margin=MARGIN, negative=NEGATIVE)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer=optimizer,step_size=5,gamma=0.85,verbose=True)

train_triplets_loader = DataLoader(train_coco_tripletset, batch_size=BATCH_SIZE,
                                   shuffle=True, num_workers=2)
val_triplets_loader = DataLoader(val_coco_tripletset, batch_size=BATCH_SIZE,
                                 shuffle=False, num_workers=2)

train_loss_coco, meanrr_coco, r10_coco = train_for_retrieval(img_net, text_net, 
                                              train_triplets_loader, 
                                              val_triplets_loader, optimizer, 
                                              criterion, scheduler, EPOCHS, 
                                              REPORTS_EVERY, norm=False)

# The x-axis follows EPOCHS (5 for this run), assuming one reported value per epoch
train_loss_coco = {
  "x": [x+1 for x in range(EPOCHS)],
  "y": train_loss_coco,
  "name": "ResNet-50 COCO Training Loss"
}

meanrr_train_coco = {
  "x": [x+1 for x in range(EPOCHS)],
  "y": meanrr_coco[0],
  "name": "ResNet-50 COCO Training MRR"
}

meanrr_val_coco = {
  "x": [x+1 for x in range(EPOCHS)],
  "y": meanrr_coco[1],
  "name": "ResNet-50 COCO Validation MRR"
}

r10_train_coco = {
  "x": [x+1 for x in range(EPOCHS)],
  "y": r10_coco[0],
  "name": "ResNet-50 COCO Training R@10"
}

r10_val_coco = {
  "x": [x+1 for x in range(EPOCHS)],
  "y": r10_coco[1],
  "name": "ResNet-50 COCO Validation R@10"
}

In [None]:
plot_results([train_loss_coco], [meanrr_train_coco,meanrr_val_coco], [r10_train_coco,r10_val_coco])

# Results, Analysis and Future Work

After an extensive study across diverse experiments with very deep convolutional networks, it is clear that many different configurations can be applied. We tried to keep a fixed setup in our tests so that the models could be compared strictly, but in the original papers these models are presented with different optimizers, schedulers and numbers of epochs.

Even though our models are adaptations of the original ones, we can obtain reasonably good results on these large, non-trivial datasets for more complex tasks like image captioning.

Regarding the structure used to build the models, an important note for anyone starting out with PyTorch: `nn.Sequential` can be a powerful and clean tool for building complex models, but classic `print` debugging is not effective with it. A proper debugger is recommended.

Another important point is that our input images have a different size from the ones used when the studied models were defined. Specifically, our networks reduce the images to `4x4` feature maps and not `7x7` as in the original networks. This can also affect the results; a valid future approach could be to reduce them to `8x8`, or to some value closer to `7x7`.
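The feature-map sizes mentioned above follow from the usual output-size formula for convolution and pooling layers, `floor((n + 2p - k) / s) + 1`. The sketch below applies it to the downsampling layers of the reference ResNet on `224x224` inputs, and to two hypothetical layer stacks for `32x32` inputs. `ADAPTED_4X4` and `ADAPTED_8X8` are illustrative assumptions, not the exact layers of the models defined in this notebook.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

def apply_layers(size, layers):
    """Push an input size through a list of (kernel, stride, padding) layers."""
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size

# Reference ResNet downsampling path: 7x7/2 stem, 3x3/2 max-pool, then three
# stride-2 stages (modelled here as 3x3 convolutions with padding 1).
REFERENCE_RESNET = [(7, 2, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]

# Hypothetical adaptations for 32x32 inputs: a stride-1 stem followed by
# three (or two) stride-2 stages.
ADAPTED_4X4 = [(3, 1, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
ADAPTED_8X8 = [(3, 1, 1), (3, 2, 1), (3, 2, 1)]

print(apply_layers(224, REFERENCE_RESNET))  # final map of the reference network
print(apply_layers(32, ADAPTED_4X4))        # three halvings of a 32x32 input
print(apply_layers(32, ADAPTED_8X8))        # one halving fewer
```

Under these assumptions, dropping one stride-2 stage is enough to move from a `4x4` to an `8x8` final feature map, at the cost of more computation in the last stage.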

The padding approach used is PyTorch's default, zero padding, which is also the one mentioned in the original papers. Trying a different padding strategy could also be valid future work.

It is also possible to implement more modern, bigger and more complex deep convolutional networks such as `DenseNet`.

A more interesting test could use a more sophisticated scheduler such as `ReduceLROnPlateau`, which lowers the learning rate when a monitored metric stops improving instead of at fixed epoch intervals.
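For comparison, the fixed schedule used in the experiments above (`StepLR` with `step_size=5` and `gamma=0.85`, starting from `LR = 1e-3`) can be written out by hand. This is a minimal sketch that assumes the scheduler is stepped once per epoch inside `train_for_retrieval`, which is not shown in this notebook; `step_lr` is a hypothetical helper, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=5, gamma=0.85):
    """Learning rate used during 0-based `epoch` under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)

# Schedule for the 20-epoch Flickr8k run.
schedule = [step_lr(1e-3, epoch) for epoch in range(20)]

for first in range(0, 20, 5):
    print(f"epochs {first + 1:2d}-{first + 5:2d}: lr = {schedule[first]:.3e}")
```

Under the same assumption, the 5-epoch COCO run never leaves the initial learning rate, because the first decay would only apply from the sixth epoch onwards. A metric-driven scheduler would not depend on the number of epochs in this way.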

Finally, changing the image encoding or the text encoding approach could be vital to obtaining better results, even if the networks used are the same.


