# Computer, color my picture!
A guide to recoloring pictures based on natural language queries and Microsoft's COCO dataset  
– Benjamin Beilharz –
***
<br> 
Automatic image colorization has already been achieved with interactive human input, where a user selects specific colors and marks regions of the image with clicks or strokes.
This implementation instead takes its input signal in the form of natural language.  
  
Merging multiple modalities such as language and images in machine learning is still a more complex task than one might think. Only a few publications in the field of multimodality have found suitable ways to engineer feature combinations of these modalities without depending on just one of them. Either we end up depending on a single modality, or we fail to find a suitable way to train these admittedly difficult networks at all.
  
In this post I want to give you a quick overview of my experience diving into computer vision and attempting to reimplement the paper _Learning to Color from Language, Manjunatha et al. 2018_. The authors' Python 2 implementation is available on [GitHub](https://github.com/superhans/colorfromlanguage), and my code is mostly based on it.  
  
**Disclaimer: I did not manage to get the network to learn properly.** 


In [None]:
# import modules commonly used for image processing and utilities provided by
# the original authors

import cv2
import json
import os
import pickle
import time
import random
import scipy
import string
import torch
import torchtext
import torchvision
import warnings
warnings.filterwarnings('ignore')

import h5py as h5
import numpy as np
import scipy.ndimage.interpolation as sni
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms

from collections import defaultdict, Counter
from skimage import io, color
from tqdm import tqdm
from torch.autograd import Variable
from torch.nn.utils.rnn import pad_packed_sequence, pack_padded_sequence
from torch.utils.tensorboard import SummaryWriter

from utils import decode_lookup, produce_minibatch_idxs, rgbim2lab,\
    enc_batch, prior_boosting, display, decode,init_modules, cvrgb2lab,\
    annealing, enc_batch_nnenc, LookupEncode, labim2rgb, error_metric, rmse_ab

## Data
***
For data, the original authors went for Microsoft's popular MS COCO (Common Objects in Context) dataset, a large dataset for object detection and segmentation in which each image also comes with captions.  

Feel free to read additional information and download it on the [official page](http://cocodataset.org/#home).

I used the data provided by the original authors of the paper.


### Images in a nutshell
We all have an intuitive idea of what an image is, but numerically an image is commonly represented either as a grayscale image with a single channel (lightness) or as a color image with three stacked color channels (red, green, blue). In the common 8-bit encoding, each pixel holds one value per channel between 0 (no intensity, black) and 255 (full intensity).

An RGB image is therefore a tensor of shape $height \times width \times channels$.  
  
<br>

![Color channels](https://miro.medium.com/max/2146/1*icINeO4H7UKe3NlU1fXqlA.jpeg)
<center>RGB color channels visualized</center>
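
To make the tensor view concrete, here is a minimal NumPy sketch with a hand-made 2 x 3 image. The pixel values and the BT.601 luma weights are chosen for illustration (they are the weights commonly used for RGB-to-gray conversion); the last step mirrors how the grayscale image is later repeated three times so that a network pretrained on RGB input can consume it.

```python
import numpy as np

# A tiny 2 x 3 RGB "image": height x width x channels, values in 0..255.
rgb = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]],
                [[0, 0, 0], [128, 128, 128], [255, 255, 255]]], dtype=np.uint8)
height, width, channels = rgb.shape

# Grayscale as a weighted sum of the three channels (ITU-R BT.601 weights).
weights = np.array([0.299, 0.587, 0.114])
gray = (rgb.astype(np.float64) @ weights).round().astype(np.uint8)

# Repeat the single channel three times and move channels to the front,
# giving the channels x height x width layout that PyTorch models expect.
gray3 = np.stack((gray,) * 3, axis=-1).transpose(2, 0, 1)

print(rgb.shape, gray.shape, gray3.shape)
```

Note that the grayscale image drops the channel axis entirely, which is why it has to be re-expanded before it can be fed to a three-channel network.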



## Feature extraction
***

For feature extraction the ResNet-101 architecture is used, a CNN with 101 layers that is available as a pretrained model in *torchvision*. It uses residual connections, which add a block's input to its output after the convolution layers and thereby make much deeper networks trainable. Here only the stem and the first two residual stages are kept.

![alt text](https://miro.medium.com/max/1200/1*6hF97Upuqg_LdsqWY6n_wg.png)
<center>ResNet 101 architecture</center>  
  
<small>*Please look at the architecture output below to get a glimpse of the feature extraction layers used.*</small>
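
As a sanity check on the size of the extracted feature maps, the following sketch applies the standard output-size formula for convolution and pooling layers to the layers kept by `resnet()`. It assumes 224 x 224 input images and the kernel, stride, and padding values of torchvision's ResNet-101; `conv_out` is a helper name introduced here for illustration only.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 224                      # assumed input resolution
size = conv_out(size, 7, 2, 3)  # conv1: 7x7 kernel, stride 2, padding 3
size = conv_out(size, 3, 2, 1)  # maxpool: 3x3 kernel, stride 2, padding 1
size = conv_out(size, 3, 1, 1)  # layer1: no spatial downsampling
size = conv_out(size, 3, 2, 1)  # layer2: one stride-2 3x3 convolution
channels = 128 * 4              # layer2 bottleneck width 128, expansion factor 4

print(size, channels)
```

Under these assumptions each image yields a 512 x 28 x 28 feature map, which matches the default `feature_dim=(512, 28, 28)` used by the colorization network further down.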

In [None]:
def resnet():
    """Loads the pre-trained resnet101 model and extracts its early layers
    
    Returns:
        pytorch model - torch.nn.Module
    """
    cnn = torchvision.models.resnet101(pretrained=True)
    layers = [cnn.conv1,
              cnn.bn1,
              cnn.relu,
              cnn.maxpool
             ]
    for i in range(2):  # model_stage_2 -> 1200, 512, 28, 28
        name = f'layer{i+1}'
        layers.append(getattr(cnn, name))
    
    model = torch.nn.Sequential(*layers)
    # model.cuda()
    model.double()
    model.eval()
    return model

In [None]:
def process_batch(current, model):
    """Processes batch of images

    Args:
        current - np.array - batch of grayscale images BATCH x Channels x Height x Width
        model - torch.nn.Module - pretrained resnet 101 model
    
    Returns:
        features - np.array - Extracted features by resnet 101
    """
    
    batch = np.concatenate(current, 0).astype(np.float32)  # stack images along the batch dimension
    # normalize with the ImageNet channel means and standard deviations
    mean = np.array([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1)
    std = np.array([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1)
    batch = (batch/255. - mean) / std
    # match the model's device and dtype (resnet() casts the model to double)
    param = next(model.parameters())
    batch = torch.tensor(batch, requires_grad=False).to(device=param.device, dtype=param.dtype)
    with torch.no_grad():  # feature extraction only, no gradients needed
        features = model(batch)
    return features.cpu().numpy()

def generate_vision_features(images: h5.File, model: nn.Module, BATCH_SIZE: int):
    """Extract vision features from coco dataset.
    
    Args:
        images - h5.File - coco dataset for feature extraction
        model - nn.Module - resnet101 model
        BATCH_SIZE - int - images per batch
    """
    with h5.File('img_features.h5', 'w') as f:
        # splitting dataset into image features for training and validation
        for split in ['train', 'val']:
            dataset = None
            current_batch = []
            lowerbound = 0
            image_files = images[split+'_ims']
            for i, img in tqdm(enumerate(image_files)):
                img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert color space to grayscale
                img = np.stack((img, )*3, -1).transpose(2, 0, 1)[None]
                current_batch.append(img)
                if len(current_batch) == BATCH_SIZE:
                    features = process_batch(current_batch, model)
                    if dataset is None:
                        # create the output dataset once the feature shape is known
                        N = len(image_files)
                        _, C, H, W = features.shape
                        dataset = f.create_dataset(split + '_features', (N, C, H, W),
                                                   dtype=np.float32)
                    # write every full batch, not only the first one
                    upperbound = lowerbound + len(current_batch)
                    dataset[lowerbound:upperbound] = features  # store features at the image positions
                    lowerbound = upperbound
                    current_batch.clear()
            # leftover images that do not fill a whole batch
            # (assumes at least one full batch has created the dataset)
            if len(current_batch) > 0:
                features = process_batch(current_batch, model)
                upperbound = lowerbound + len(current_batch)
                dataset[lowerbound:upperbound] = features

### Language encoding

Since we want to learn to colorize pictures based on captions, we also need to learn a contextualized representation of the caption. For this we use a standard bidirectional LSTM and use its last hidden state to condition the CNN throughout the convolution blocks.

![LSTM EXPLAINED](https://www.mdpi.com/water/water-11-01387/article_deploy/html/images/water-11-01387-g004.png)
<br>
<center>LSTM Explained</center>
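
Captions have different lengths, so a batch has to be padded, and the padding must not leak into the final hidden state. The sketch below shows one way to handle this with `pack_padded_sequence`; the toy vocabulary, dimensions, and the helper name `encode` are assumptions made for illustration, not the encoder used in this post. With `enforce_sorted=False`, PyTorch sorts the batch internally and returns the hidden states in the original batch order.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

vocab_size, emb_dim, h_dim = 20, 8, 5   # toy sizes, chosen for illustration
emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
rnn = nn.LSTM(emb_dim, h_dim, num_layers=1, bidirectional=True, batch_first=True)


def encode(query, lengths):
    """Return the concatenated final forward and backward hidden states."""
    packed = pack_padded_sequence(emb(query), lengths.cpu(),
                                  batch_first=True, enforce_sorted=False)
    _, (h_t, _) = rnn(packed)                  # h_t: (2, batch, h_dim)
    return torch.cat((h_t[0], h_t[1]), dim=1)  # (batch, 2 * h_dim)


# Three captions as token ids, padded with 0 and deliberately not sorted by length.
query = torch.tensor([[4, 7, 0, 0],
                      [3, 9, 5, 2],
                      [8, 0, 0, 0]])
lengths = torch.tensor([2, 4, 1])

with torch.no_grad():
    batch_enc = encode(query, lengths)               # all captions at once
    single_enc = encode(query[:1, :2], lengths[:1])  # first caption on its own

print(batch_enc.shape)
```

Encoding the first caption alone gives the same vector as encoding it inside the padded batch, which is a quick way to check that lengths and embeddings are still aligned.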



In [None]:
class LanguageEncoder(nn.Module):
    def __init__(self,
                 emb_dim: int,
                 h_dim: int,
                 vocab_size: int,
                 pretrained_embeddings: np.ndarray):
        super(LanguageEncoder, self).__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # warm start the model by adding pretrained weights
        self.emb.weight.data.copy_(torch.from_numpy(pretrained_embeddings))
        self.hidden_size = h_dim
        self.rnn = nn.LSTM(emb_dim, h_dim, num_layers=1, bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(p=0.2)
        
    def forward(self, query, lengths):
        bsz, max_len = query.size()  # batch size and padded caption length
        emb = self.dropout(self.emb(query))  # regularization
        # sort captions by length (longest first) and keep the permutation,
        # so that the lengths line up with the reordered embeddings emb[indices]
        lengths, indices = torch.sort(lengths, dim=0, descending=True)
        _, (h_t, _) = self.rnn(pack_padded_sequence(emb[indices], lengths.data.tolist(), batch_first=True, enforce_sorted=False))
        h_t = torch.cat((h_t[0], h_t[1]), 1)  # concatenate both directional hidden state outputs
        # restore the original batch order
        _, indices = torch.sort(indices, dim=0)
        h_t = h_t[indices]
        return h_t


To combine the modalities, a feature-wise affine transformation (FiLM: Visual Reasoning with a General Conditioning Layer, Perez et al. 2018) is applied. It can be seen as analogous to the gating mechanism we know from gated recurrent neural networks, such as an input gate that determines which information from the previous hidden state should be used. 


![FILM Explained](./images/film.png)
<center>FiLM Layer visualized from Perez et al. 2017</center>
<br>

A residual block with a FiLM layer looks somewhat like this:  
![FILM Arc](./images/filmed.png)
<center>FiLM residual block from Perez et al. 2017</center>
<br>

We use a recurrent neural network to encode the caption and add a linear transformation that projects the resulting vector into scaling and shifting coefficients ($\gamma$ and $\beta$) for the image features inside the residual blocks.
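
The following NumPy sketch illustrates the mechanics of that conditioning step in isolation. All sizes and the random weights `W` and `b` are made up for illustration: a single linear map turns each caption encoding into one scale and one shift per channel, which are then broadcast over the spatial dimensions of the feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, channels, height, width = 2, 4, 3, 3   # toy sizes
lang_dim = 6                                   # caption encoding size (illustrative)

x = rng.normal(size=(batch, channels, height, width))   # image features
h = rng.normal(size=(batch, lang_dim))                  # caption encodings

# One linear layer predicts gamma and beta together: lang_dim -> 2 * channels.
W = rng.normal(size=(lang_dim, 2 * channels))
b = np.zeros(2 * channels)
gammas, betas = np.split(h @ W + b, 2, axis=-1)          # each (batch, channels)

# Broadcast over height and width: every pixel of channel c is scaled by
# gamma[c] and shifted by beta[c].
y = gammas[:, :, None, None] * x + betas[:, :, None, None]

print(y.shape)
```

With `gammas` equal to one and `betas` equal to zero the layer is the identity, so the language signal can only change the features through these two vectors.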

In [None]:
class FiLM(nn.Module):
    """Feature-wise affine transformation
    Based on (Perez et al. 2018)
    Implementation: https://github.com/ethanjperez/film/blob/master/vr/models/filmed_net.py
    
    Conditions the output of a convolutional block to the weights based on
    language encoding.
    """
    def __init__(self):
        super(FiLM, self).__init__()
    def forward(self, x, gammas, betas):
        # add dimensions to gamma and beta and adjust dimensions to input
        gammas = gammas.unsqueeze(2).unsqueeze(3).expand_as(x)  
        betas = betas.unsqueeze(2).unsqueeze(3).expand_as(x)
        return gammas * x + betas
    
class FiLMedResBlock(nn.Module):
    """Residual block whose convolution output is modulated by a FiLM layer.

    The gammas and betas are passed in from outside; they are linear
    projections of the language encoding. A residual connection spans the
    convolution and FiLM layers.
    """
    def __init__(self, in_dim, out_dim, stride=1, padding=1, dilation=1):
        super(FiLMedResBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_dim, in_dim, kernel_size=1, padding=0)
        self.conv2 = nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=stride, padding=1, dilation=dilation)
        self.bn2 = nn.BatchNorm2d(out_dim)
        self.film = FiLM()
        init_modules(self.modules())  # initialize weights
        
    def forward(self, x, gammas, betas):
        # forward pass with residual connection
        y = x
        y = F.relu(self.conv1(y))
        y = self.bn2(F.relu(self.conv2(y)))
        y = F.relu(self.film(y, gammas, betas))
        return y + x


## Autocolorize Architecture
***
The complete architecture for autocolorizing grayscale images based on captions has the following structure:   
![colorresnet](./images/color.png)
<center>ColorizeResnet</center>
<br>

1. Grayscale images (224 x 224, width/height respectively)
2. Feature extraction with pre-trained resnet101
3. LSTM encoding of the caption
4. FiLM blocks that condition the image features on the language encoding through feature-wise scaling and shifting
5. Linear layers that project the caption encoding into the scaling and shifting coefficients of each FiLM block
6. Classification with cross-entropy loss


### Training objective

In this task we use the **CIELAB** color space as our ground truth. The **LAB** color space consists of a lightness channel and two color channels:

* L, the lightness channel, with 0 being black and 100 being white
* A, the green-red channel, with negative values being green hues and positive values being reds
* B, the blue-yellow channel, with negative values being blue hues and positive values being yellows
  
![LAB Color space](https://upload.wikimedia.org/wikipedia/commons/7/7d/CIELAB_color_space_front_view.png)
<center>A front view of the LAB color space</center>  
<br>

The training objective is defined as follows: given an input lightness channel $X \in \mathbb{R}^{H\times W\times 1}$, we predict the two color channels of **LAB**, $Y \in \mathbb{R}^{H\times W\times 2}$.
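
Because the model ends in a classifier, the continuous A/B values have to be turned into class indices for training and back into colors for display. The sketch below shows the idea with a hypothetical 25 x 25 grid of bin centers spaced 10 apart (625 classes); the actual grid used in this post is loaded from a file, so the grid layout and the helper names `encode` and `decode` are assumptions for illustration.

```python
import numpy as np

# Hypothetical grid: 25 x 25 bin centers, spaced 10 apart, from -120 to 120.
axis = np.arange(-120, 121, 10)
centers = np.array([(a, b) for a in axis for b in axis], dtype=np.float64)  # (625, 2)


def encode(ab):
    """Map (N, 2) A/B values to the index of the nearest bin center."""
    dists = ((ab[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)


def decode(logits):
    """Turn (N, 625) class scores into A/B values via the softmax-weighted mean."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs @ centers


ab = np.array([[23.0, -47.0], [0.0, 0.0]])
targets = encode(ab)                 # class indices used as training targets

# A confident prediction decodes back to (almost exactly) the bin center.
logits = np.zeros((2, len(centers)))
logits[np.arange(2), targets] = 100.0
recovered = decode(logits)

print(targets, recovered.round(1))
```

Decoding with the expected value rather than the arg-max is the same operation as multiplying the softmax output with the table of bin centers, which is what the training and evaluation code does before converting back to RGB.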


Below are the full model, the training procedure, and the evaluation.

In [None]:
class AutocolorizeResnet(nn.Module):
    def __init__(self, vocab_size, feature_dim=(512, 28, 28), h_dim=256, emb_dim=300, num_modules=4, num_classes=625, train_vocab_embeddings=None):
        super(AutocolorizeResnet, self).__init__()
        self.num_modules = num_modules
        self.n_lstm_hidden = h_dim
        self.block = FiLMedResBlock
        self.in_dim = feature_dim[0]
        self.num_classes = num_classes
        dilations = [1, 1, 1, 1]

        # standard bidirectional LSTM encoder
        self.language_encoder = LanguageEncoder(emb_dim, h_dim, vocab_size, train_vocab_embeddings)

        # four FiLMed residual blocks, each mapping 512 -> 512 channels
        self.mod1 = self.block(self.in_dim, self.in_dim, dilation=dilations[0])
        self.mod2 = self.block(self.in_dim, self.in_dim, dilation=dilations[1])
        self.mod3 = self.block(self.in_dim, self.in_dim, dilation=dilations[2])
        self.mod4 = self.block(self.in_dim, self.in_dim, dilation=dilations[3])

        # language representations are projected to one gamma and one beta per CNN channel
        # multiplying hidden dimensions by 2 because of bidirectional encoding
        # 512 -> 1024 (512 gammas + 512 betas)
        self.dense_film_1 = nn.Linear(self.n_lstm_hidden*2, self.in_dim*2) 
        self.dense_film_2 = nn.Linear(self.n_lstm_hidden*2, self.in_dim*2)
        self.dense_film_3 = nn.Linear(self.n_lstm_hidden*2, self.in_dim*2)
        self.dense_film_4 = nn.Linear(self.n_lstm_hidden*2, self.in_dim*2)

        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        self.classifier = nn.Conv2d(512, self.num_classes, kernel_size=1, stride=1, dilation=1)


    def forward(self, x, query, query_lens):
        # encode caption/query
        query_features = self.language_encoder(query, query_lens)

        # linear projections for each of convolution block
        dense_film_1 = self.dense_film_1(query_features)
        dense_film_2 = self.dense_film_2(query_features)
        dense_film_3 = self.dense_film_3(query_features)
        dense_film_4 = self.dense_film_4(query_features) # bsz x 1024

        # prepare gammas and betas for language weighting on convolutional layers
        gammas1, betas1 = torch.split(dense_film_1, self.in_dim, dim=-1)
        gammas2, betas2 = torch.split(dense_film_2, self.in_dim, dim=-1)
        gammas3, betas3 = torch.split(dense_film_3, self.in_dim, dim=-1)
        gammas4, betas4 = torch.split(dense_film_4, self.in_dim, dim=-1)
        
        # shapes below are for a batch of 4: the features are 4 x 512 x 28 x 28
        out = self.mod1(x, gammas1, betas1)
        out = self.mod2(out, gammas2, betas2)
        out = self.mod3(out, gammas3, betas3)
        out_last = self.mod4(out, gammas4, betas4) 
        
        out = self.upsample(out_last)  # 4 x 512 x 56 x 56
        out = self.classifier(out) # 4 x nclasses x 56 x 56
        out = out.permute(0, 2, 3, 1).contiguous() # 4 x 56 x 56 x nclasses
        out = out.view(-1, self.num_classes)  # (4*56*56) x nclasses
        return out, out_last



In [None]:
def train(minibatches, net, optimizer, epoch, prior_probs, img_save_folder, writer):
    stime = time.time()
    c = Counter()
    for i, (batch_start, batch_end) in enumerate(minibatches):
        img_rgbs = train_origs[batch_start:batch_end]
        # convert all original rgb images to lab color space
        img_labs = np.array([cvrgb2lab(img_rgb) for img_rgb in img_rgbs])

        input_ = torch.from_numpy(train_ims[batch_start:batch_end])
        
        # lookup LAB values to create ground truth
        target = torch.from_numpy(lookup_enc.encode_points(img_labs[:, ::4, ::4, 1:]))

        input_query_ = train_words[batch_start:batch_end]
        input_lengths_ = train_lengths[batch_start:batch_end]

        # choose a caption and encode it as long dtype to be processed by embedding layer
        input_query = torch.from_numpy(input_query_.astype('int32')).long().cuda()
        input_query_lens = torch.from_numpy(input_lengths_.astype('int32')).long().cuda()

        # define input features
        input_ims = input_.float().cuda()
        # CrossEntropyLoss expects integer class indices as targets
        target = target.long().cuda()

        # reset gradient
        optimizer.zero_grad()
        output, _ = net(input_ims, input_query, input_query_lens)

        # calculate cross_entropy loss
        loss = loss_function(output, target.view(-1)) 

        # backpropagation
        loss.backward()
        optimizer.step()
        
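        # write loss to Tensorboard, with a step counter that keeps increasing across epochs
        writer.add_scalar('train/loss', loss.item(), epoch * len(minibatches) + i)
        
        # log progress and save sample images every 50 batches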
        if i % 50 == 0:
            print('loss at epoch %d, batch %d / %d = %f, time: %f s' % \
                (epoch, i, len(minibatches), loss.item(), time.time()-stime))
            stime = time.time()

            if True:
                # applying softmax and transform with a/b color values
                dec_inp = F.softmax(output, dim=1)
                AB_vals = torch.matmul(dec_inp, cuda_cc) # 12544x2
                
                # reshape and select last image of batch
                AB_vals = AB_vals.view(len(img_labs), 56, 56, 2)[-1].data.cpu().numpy()[None,:,:,:]
                # resize image to previous width/height
                AB_vals = cv2.resize(AB_vals[0], (224, 224),
                     interpolation=cv2.INTER_CUBIC)
                # convert image back to rgb color space
                img_dec = labim2rgb(np.dstack((np.expand_dims(img_labs[-1, :, :, 0], axis=2), AB_vals)))

                # save last sample with caption
                img_labs_tosave = labim2rgb(img_labs[-1])
                word_list = input_query_[-1, :input_lengths_[-1]]
                words = '_'.join(vrev.get(w, 'unk') for w in word_list) 
                
                # save the grayscale, colored (ground truth) and recolored image
                cv2.imwrite(f'{img_save_folder}/{epoch}_{i}_bw.jpg',
                    cv2.cvtColor(img_rgbs[-1].astype('uint8'), 
                    cv2.COLOR_RGB2GRAY))
                cv2.imwrite(f'{img_save_folder}/{epoch}_{i}_color.jpg',
                    img_rgbs[-1].astype('uint8'))
                cv2.imwrite(f'{img_save_folder}/{epoch}_{i}_rec_{words}.jpg',
                    img_dec.astype('uint8'))
           
                # save model at every epoch
                if i == 0:                 
                    torch.save({
                        'epoch': epoch + 1,
                        'state_dict': net.state_dict(),
                        'optimizer' : optimizer.state_dict(),
                        'loss': loss.item(),
                    }, 'model_' + str(epoch)+'_'+str(i)+'.pth.tar')
    return net

In [None]:
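def scale_attention_map(x):
    """Scale visual attention map to image size
    
    Args:
        x - np.array - 2D activation map (height x width)

    Returns:
        np.array - color-mapped attention map resized to 224 x 224
    """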
    x = (x - np.min(x)) / (np.max(x) - np.min(x))
    y = x * 255.
    y = cv2.cvtColor(y.astype('uint8'), cv2.COLOR_GRAY2RGB).astype('uint8')
    y = cv2.applyColorMap(y, cv2.COLORMAP_JET)
    return cv2.resize(y, (224, 224), interpolation=cv2.INTER_LANCZOS4)

def evaluate(minibatches, net, epoch, img_save_folder, save_every=20):
    stime = time.time()
    c = Counter()
    val_full_loss = 0.
    val_masked_loss = 0.
    val_loss = 0.
    n_val_ims = 0

    for i, (batch_start, batch_end) in enumerate(minibatches):
        img_rgbs = val_origs[batch_start:batch_end]
        img_labs = np.array([cvrgb2lab(img_rgb) for img_rgb in img_rgbs])

        input_ = torch.from_numpy(val_ims[batch_start:batch_end])
        gt_abs = img_labs[:, ::4, ::4, 1:]
        target = torch.from_numpy(lookup_enc.encode_points(gt_abs))

        input_query_ = val_words[batch_start:batch_end]
        input_lengths_ = val_lengths[batch_start:batch_end]

        input_query = torch.from_numpy(input_query_.astype('int32')).long().cuda()
        input_query_lens = torch.from_numpy(input_lengths_.astype('int32')).long().cuda()

        input_ims = input_.float().cuda()
        target = target.long().cuda()
    
        output, output_maps = net(input_ims, input_query, input_query_lens)

        dec_inp = F.softmax(output, dim=1)
        AB_vals = torch.matmul(dec_inp, cuda_cc)
        AB_vals = AB_vals.view(len(img_labs), 56, 56, 2).data.cpu().numpy()

        n_val_ims += len(AB_vals)
        
        for k, (img_rgb, AB_val) in enumerate(zip(img_rgbs, AB_vals)):
            AB_val = cv2.resize(AB_val, (224, 224),
            interpolation=cv2.INTER_CUBIC)
            img_dec = labim2rgb(np.dstack((np.expand_dims(img_labs[k, :, :, 0], axis=2), AB_val)))
            val_loss += error_metric(img_dec, img_rgb)

            if k == 0 and i%save_every == 0:
                output_maps = torch.mean(output_maps, dim=1).data.cpu().numpy()
                output_maps = scale_attention_map(output_maps[k])

                word_list = input_query_[k, :input_lengths_[k]]
                words = '_'.join(vrev.get(w, 'unk') for w in word_list)

                img_labs_tosave = labim2rgb(img_labs[k])
                cv2.imwrite(f'{img_save_folder}/{epoch}_{i}_bw.jpg',
                            cv2.cvtColor(img_rgbs[k].astype('uint8'), 
                            cv2.COLOR_RGB2GRAY))
                cv2.imwrite(f'{img_save_folder}/{epoch}_{i}_color.jpg',
                            img_rgbs[k].astype('uint8'))
                cv2.imwrite(f'{img_save_folder}/{epoch}_{i}_rec_{words}.jpg',
                            img_dec.astype('uint8'))
                cv2.imwrite(f'{img_save_folder}/{epoch}_{i}_att.jpg', output_maps)

    # val_loss is accumulated per image, so average over the number of images
    return val_loss / n_val_ims


def minibatch_idx(n, b):
    return [(i*b, (i+1)*b) for i in range(n//b)]

In [None]:
# HYPERPARAMETERS
LR = 0.001
EPOCHS = 30
BATCH_SIZE = 24
EMBEDDING_DIM = 300
HIDDEN_DIM = 150

# init tensorboard
writer = SummaryWriter()

# load vocab and pretrained w2v embeddings on captions
train_vocab = pickle.load(open('priors/coco_colors_vocab.p', 'rb'))
train_vocab_embeddings = pickle.load(open('priors/w2v_embeddings_colors.p', 'rb'), 
                                     encoding='latin1')

# setting seeds to ensure reproducibility
torch.manual_seed(1337)
random.seed(1337)
np.random.seed(1337)


# initialize LAB color space encoder
lookup_enc = LookupEncode('priors/full_lab_grid_10.npy')
num_classes = lookup_enc.cc.shape[0]

# add the LAB color lookup
cuda_cc = torch.from_numpy(lookup_enc.cc).cuda()

# load images and previously extracted features
hfile = 'coco_colors.h5'
hf = h5.File(hfile, 'r')
features_file = 'img_features.h5'
ff = h5.File(features_file, 'r')

# color rebalancing
alpha = 1.
gamma = 0.5
gradient_prior_factor = torch.from_numpy(prior_boosting('./priors/coco_priors_onehot_625.npy', 
                                                        alpha, gamma)).float().cuda()

loss_function = nn.CrossEntropyLoss(weight=gradient_prior_factor)

# reverse vocabulary lookup (index -> word), used to name the saved images
vrev = {v: k for k, v in train_vocab.items()}          
n_vocab = len(train_vocab)

# initialize network
net = AutocolorizeResnet(n_vocab, train_vocab_embeddings=train_vocab_embeddings) 
net.float()  # change all parameters in network to float
net.cuda()  # activate cuda

# init optimizer
optimizer = torch.optim.Adam(net.parameters(), lr=LR)

# loading images from dataset

# training images
train_origs = hf['train_ims']
train_ims = ff['train_features']
train_words = hf['train_words']                                        
train_lengths = hf['train_length']

# validation images
val_origs = hf['val_ims']                            
val_ims = ff['val_features']
val_words = hf['val_words']                                            
val_lengths = hf['val_length']

# create minibatches
n_train_ims = len(train_ims)
minibatches = minibatch_idx(n_train_ims, BATCH_SIZE)[:-1]
n_val_ims = len(val_ims)
val_minibatches = minibatch_idx(n_val_ims, 4)[:-1]

# define image folders
val_img_save_folder = 'image_val'
if not os.path.exists(val_img_save_folder): 
    os.makedirs(val_img_save_folder) 

img_save_folder = 'image_train'
if not os.path.exists(img_save_folder):
    os.makedirs(img_save_folder)                                    

print('Training')
print('='*50)

# training loop
for epoch in range(EPOCHS):
    random.shuffle(minibatches)
    random.shuffle(val_minibatches)

    net = train(minibatches, net, optimizer, epoch, gradient_prior_factor,
                img_save_folder, writer)
    t = time.time()
    val_full_loss = evaluate(val_minibatches, net, epoch, val_img_save_folder)
    writer.add_scalar('val/loss', val_full_loss, epoch)

## Results
***

The authors showed impressive results in this overview:  
<br>
![eval_overview](https://raw.githubusercontent.com/superhans/colorfromlanguage/master/images/Activations4.png)
<br>
The model reacted to exchanged color words and changed the actual color of the object mentioned in the caption. The attention maps in the last *FiLM Residual Blocks* also showed that the model is able to locate the objects in the images.  
  
I went into this reimplementation eager to dive into computer vision, knowing little more than the fundamentals of how convolutions work. Unfortunately, I was not able to get actual results. What I did get is a pattern suggesting that something is off in my image preprocessing: my model mostly proposes square regions (with a bias toward the color in the caption), and the attention maps look the same throughout the epochs. The same shows in the recolored training images, where a slight color hue is present, but always in the form of a square.

![comparison](./images/comparison.png)

### Takeaways
***
#### Computer Vision is...
* not as simple as just throwing convolutions around and hoping things turn out fine.
* not just deep learning; it also demands knowledge about image processing.
* actually intriguing, because you can get really creative.
* a field which is democratizing its findings and pretrained models, which is **AWESOME**

#### Personal:
* I want to further investigate my problems in this implementation and get the model to work properly to learn more.
* For me as a novice, it looks as if the regions have not been proposed correctly, since only parts of the images have been colorized correctly. The attention maps support this thought.

This project was thrilling, and working in computer vision is actually super responsive because of the feedback we get from visualizing the attention maps. In further projects I want to keep learning about how images are processed and how neural and non-neural methods are applied to images and videos.

Thanks for reading!
– Ben

***
##### Literature Notes

* _Learning to Color from Language, Manjunatha and Iyyer et al. 2018_: https://arxiv.org/pdf/1804.06026.pdf
* _Colorful Image Colorization, Zhang et al. 2016_: https://arxiv.org/pdf/1603.08511.pdf
* _FiLM: Visual Reasoning with a General Conditioning Layer, Perez et al. 2017_: https://arxiv.org/pdf/1709.07871.pdf
* _CIE 1931 Color Space, Wikipedia_: https://en.wikipedia.org/wiki/CIE_1931_color_space

##### Code & Data Sources

* _FiLM_: https://github.com/ethanjperez/film/blob/master/vr/models/filmed_net.py
* _Color from Language_: https://github.com/superhans/colorfromlanguage

##### Image Sources

* RGB Channels: https://miro.medium.com/max/2146/1*icINeO4H7UKe3NlU1fXqlA.jpeg
* RESNET 101: https://miro.medium.com/max/1200/1*6hF97Upuqg_LdsqWY6n_wg.png
* CIELAB Color space: https://upload.wikimedia.org/wikipedia/commons/7/7d/CIELAB_color_space_front_view.png
* LSTM Legend: https://www.mdpi.com/water/water-11-01387/article_deploy/html/images/water-11-01387-g004.png
