📖 **Resources**:
 * EfficientNet B6 - https://pytorch.org/vision/stable/generated/torchvision.models.efficientnet_b6.html#torchvision.models.efficientnet_b6
 * CocoCaptions DS - https://pytorch.org/vision/stable/generated/torchvision.datasets.CocoCaptions.html#torchvision.datasets.CocoCaptions
 * Guide to torchvision COCO Dataset - https://medium.com/howtoai/pytorch-torchvision-coco-dataset-b7f5e8cad82
 * T5 Model HF - https://huggingface.co/docs/transformers/model_doc/t5#t5#
 * T5 Small HF - https://huggingface.co/t5-small
 * Aladdin Persson's image captioning repo - https://github.com/aladdinpersson/Machine-Learning-Collection/tree/master/ML/Pytorch/more_advanced/image_captioning
 * Guide to image captioning - https://towardsdatascience.com/a-guide-to-image-captioning-e9fd5517f350
 * Andrej Karpathy's Deep Visual-Semantic Alignments for Generating Image Descriptions - https://cs.stanford.edu/people/karpathy/deepimagesent/

🔑 **Note**: Images from COCO will be downloaded on demand rather than stored up front. The model will be trained in Google Colab, where Google Drive lets us upload only ~15 GB of data, and the COCO dataset is larger than that.
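
A minimal sketch of what downloading on demand could look like, using only the standard library. The helper name `fetch_image` and the `images` cache directory are hypothetical and not part of this notebook; the sketch also assumes that a file already present in the cache is complete, so interrupted downloads are not handled.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_image(url, cache_dir='images'):
    """Download `url` into `cache_dir` unless it is already there.

    Hypothetical helper: returns the local path of the cached file.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    file_name = os.path.basename(urlparse(url).path)
    local_path = os.path.join(cache_dir, file_name)
    # Only hit the network when the file is not cached yet.
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

With a helper like this, only the images that are actually requested during training take up disk space, which is the point of not uploading the full dataset.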

## 1. Import dependencies

In [1]:
import json
import os
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models, transforms
from torch.utils.data import DataLoader, Dataset
import pandas as pd
import numpy as np
from transformers import T5Model, T5Tokenizer

In [2]:
# Load models
efficient_net = models.efficientnet_b6(pretrained=True, progress=True)
t5 = T5Model.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

## 2. Data exploration

In [3]:
# Load the caption annotations for exploration
with open(os.path.join('data', 'annotations', 'captions_train2017.json'), 'r') as f:
    data = json.load(f)

In [4]:
data['licenses']

[{'url': 'http://creativecommons.org/licenses/by-nc-sa/2.0/',
  'id': 1,
  'name': 'Attribution-NonCommercial-ShareAlike License'},
 {'url': 'http://creativecommons.org/licenses/by-nc/2.0/',
  'id': 2,
  'name': 'Attribution-NonCommercial License'},
 {'url': 'http://creativecommons.org/licenses/by-nc-nd/2.0/',
  'id': 3,
  'name': 'Attribution-NonCommercial-NoDerivs License'},
 {'url': 'http://creativecommons.org/licenses/by/2.0/',
  'id': 4,
  'name': 'Attribution License'},
 {'url': 'http://creativecommons.org/licenses/by-sa/2.0/',
  'id': 5,
  'name': 'Attribution-ShareAlike License'},
 {'url': 'http://creativecommons.org/licenses/by-nd/2.0/',
  'id': 6,
  'name': 'Attribution-NoDerivs License'},
 {'url': 'http://flickr.com/commons/usage/',
  'id': 7,
  'name': 'No known copyright restrictions'},
 {'url': 'http://www.usa.gov/copyright.shtml',
  'id': 8,
  'name': 'United States Government Work'}]

In [5]:
data['info']

{'description': 'COCO 2017 Dataset',
 'url': 'http://cocodataset.org',
 'version': '1.0',
 'year': 2017,
 'contributor': 'COCO Consortium',
 'date_created': '2017/09/01'}

In [6]:
pd.DataFrame(data['images'])

,license,file_name,coco_url,height,width,date_captured,flickr_url,id
0,3,000000391895.jpg,http://images.cocodataset.org/train2017/000000...,360,640,2013-11-14 11:18:45,http://farm9.staticflickr.com/8186/8119368305_...,391895
1,4,000000522418.jpg,http://images.cocodataset.org/train2017/000000...,480,640,2013-11-14 11:38:44,http://farm1.staticflickr.com/1/127244861_ab0c...,522418
2,3,000000184613.jpg,http://images.cocodataset.org/train2017/000000...,336,500,2013-11-14 12:36:29,http://farm3.staticflickr.com/2169/2118578392_...,184613
3,3,000000318219.jpg,http://images.cocodataset.org/train2017/000000...,640,556,2013-11-14 13:02:53,http://farm5.staticflickr.com/4125/5094763076_...,318219
4,3,000000554625.jpg,http://images.cocodataset.org/train2017/000000...,640,426,2013-11-14 16:03:19,http://farm5.staticflickr.com/4086/5094162993_...,554625
...,...,...,...,...,...,...,...,...
118282,1,000000444010.jpg,http://images.cocodataset.org/train2017/000000...,480,640,2013-11-25 14:46:11,http://farm4.staticflickr.com/3697/9303670993_...,444010
118283,3,000000565004.jpg,http://images.cocodataset.org/train2017/000000...,427,640,2013-11-25 19:59:30,http://farm2.staticflickr.com/1278/4677568591_...,565004
118284,3,000000516168.jpg,http://images.cocodataset.org/train2017/000000...,480,640,2013-11-25 21:03:34,http://farm3.staticflickr.com/2379/2293730995_...,516168
118285,4,000000547503.jpg,http://images.cocodataset.org/train2017/000000...,375,500,2013-11-25 21:20:21,http://farm1.staticflickr.com/178/423174638_1c...,547503


In [7]:
pd.DataFrame(data['annotations'])

,image_id,id,caption
0,203564,37,A bicycle replica with a clock as the front wh...
1,322141,49,A room with blue walls and a white sink and door.
2,16977,89,A car that seems to be parked illegally behind...
3,106140,98,A large passenger airplane flying through the ...
4,106140,101,There is a GOL plane taking off in a partly cl...
...,...,...,...
591748,133071,829655,a slice of bread is covered with a sour cream ...
591749,410182,829658,A long plate hold some fries with some sliders...
591750,180285,829665,Two women sit and pose with stuffed animals.
591751,133071,829693,White Plate with a lot of guacamole and an ext...


🔑 **Note**: We don't need every column, so we can drop most of them. In fact, the annotations alone are enough: the image URL can be generated from `image_id`.

🔑 **Note**: I tried building the dataset with one sample per image rather than one per caption, but that causes problems because the number of captions per image isn't constant. The DataLoader expects the same number of targets for every sample, so I will use one sample per caption and download the image for each caption.
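
The difference between the two formats can be sketched on a toy stand-in for `data['annotations']` (a list of dicts as loaded from the JSON file). The ids and captions below are made up for illustration and are not taken from COCO.

```python
from collections import Counter

# Toy stand-in for data['annotations']; ids and captions are made up.
annotations = [
    {'image_id': 1, 'caption': 'first caption for image 1'},
    {'image_id': 1, 'caption': 'second caption for image 1'},
    {'image_id': 1, 'caption': 'third caption for image 1'},
    {'image_id': 2, 'caption': 'first caption for image 2'},
    {'image_id': 2, 'caption': 'second caption for image 2'},
    {'image_id': 3, 'caption': 'only caption for image 3'},
]

# Per-image view: the number of captions differs between images, so a
# per-image sample would carry a variable-length list of targets.
captions_per_image = Counter(a['image_id'] for a in annotations)
print(captions_per_image)

# Per-caption view: every sample is one (image_id, caption) pair, so each
# sample has exactly one target and samples can be batched uniformly.
samples = [(a['image_id'], a['caption']) for a in annotations]
print(len(samples))
```

The cost of the per-caption format is that the same image is loaded once for each of its captions.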

In [8]:
del data['info']
del data['licenses']
del data['images']

data['annotations'] = pd.DataFrame(data['annotations']).drop('id', axis=1)

In [9]:
data['annotations']

,image_id,caption
0,203564,A bicycle replica with a clock as the front wh...
1,322141,A room with blue walls and a white sink and door.
2,16977,A car that seems to be parked illegally behind...
3,106140,A large passenger airplane flying through the ...
4,106140,There is a GOL plane taking off in a partly cl...
...,...,...
591748,133071,a slice of bread is covered with a sour cream ...
591749,410182,A long plate hold some fries with some sliders...
591750,180285,Two women sit and pose with stuffed animals.
591751,133071,White Plate with a lot of guacamole and an ext...


In [10]:
# def id_to_url(row):
#     length_id = 12
#     id_ = str(row['image_id'])
#     id_str = id_.zfill(length_id)  # left-pad the id with zeros to 12 digits

#     return f'http://images.cocodataset.org/train2017/{id_str}.jpg'

In [11]:
# data['annotations']['coco_url'] = data['annotations'].apply(id_to_url, axis=1)
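
As a quick sanity check of the URL pattern, the sketch below rebuilds the URL from an id and compares the resulting file name with the `file_name` values shown in the images table above. The helper `image_id_to_url` is a hypothetical variant that takes the bare id instead of a row, and the sketch assumes that every train2017 image follows the same 12-digit zero-padded naming.

```python
import os


def image_id_to_url(image_id):
    # Assumption: the file name is the image id zero-padded to 12 digits.
    return f'http://images.cocodataset.org/train2017/{int(image_id):012d}.jpg'


# (id, file_name) pairs copied from the first rows of the images table.
known_rows = [
    (391895, '000000391895.jpg'),
    (522418, '000000522418.jpg'),
    (184613, '000000184613.jpg'),
]

for image_id, file_name in known_rows:
    url = image_id_to_url(image_id)
    # The generated URL should end with the file name from the table.
    assert os.path.basename(url) == file_name
    print(url)
```

On the DataFrame from cell 9, the same helper could be applied column-wise with `data['annotations']['image_id'].map(image_id_to_url)`.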