<a href="https://colab.research.google.com/github/nonanne/Image-Captioning/blob/main/DeepLearning_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will use the Flickr8k dataset, which contains about 8,000 images, each paired with several captions.


We will use Karpathy’s split.
Each image in this split has the following entries:
• sentences: the ground-truth sentences. Since an image can be described in various ways
by different people, each image is paired with multiple ground-truth sentences. Most images
have 5 captions.
• split: which split (training, validation or testing) the image belongs to
• filename: the name of the image

In [28]:
import json
import os
from collections import Counter

import kagglehub
import torch
from google.colab import drive
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

In [29]:
# Mount Google Drive
drive.mount('/content/drive')

# Path settings
# IMAGE_DIR is the folder containing the resized images
DIR = '/content/drive/MyDrive/Share/VUB/DeepLearning'
IMAGE_DIR = os.path.join(DIR, 'Images')
JSON_PATH = os.path.join(DIR, 'Json/flickr8k_simplified.json')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
# Download Flickr8k: a dataset of about 8,000 images paired with captions
path = kagglehub.dataset_download("adityajn105/flickr8k")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'flickr8k' dataset.
Path to dataset files: /kaggle/input/flickr8k


Each image in this split has the following entries:

• sentences: the ground-truth sentences. Since an image can be described in various ways
by different people, each image is paired with multiple ground-truth sentences. Most images
have 5 captions. You should train your model on at least 3 captions per image (the more
the better). If you choose to use 3, this will triple the number of training samples
compared with using a single caption per image.

• split: which split (training, validation or testing) the image belongs to

• filename: the name of the image
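
The effect of the captions-per-image choice on the number of training samples can be sketched as follows. The function name `build_pairs`, its `captions_per_image` parameter, the toy entries, and the `"val"` split label are assumptions for illustration; only the `sentences`, `split`, and `filename` keys come from the entries described above.

```python
def build_pairs(entries, split="train", captions_per_image=3):
    """Return [filename, caption] pairs for one split.

    Keeps at most `captions_per_image` captions per image, so the number of
    samples is roughly (number of images) * captions_per_image.
    """
    pairs = []
    for entry in entries:
        if entry["split"] != split:
            continue
        for caption in entry["sentences"][:captions_per_image]:
            pairs.append([entry["filename"], caption])
    return pairs


# Toy entries (made up) that mimic the structure of the JSON file.
toy_entries = [
    {"filename": "a.jpg", "split": "train", "sentences": ["c1", "c2", "c3", "c4", "c5"]},
    {"filename": "b.jpg", "split": "train", "sentences": ["d1", "d2", "d3", "d4", "d5"]},
    {"filename": "c.jpg", "split": "val", "sentences": ["e1", "e2", "e3", "e4", "e5"]},
]

pairs_3 = build_pairs(toy_entries, captions_per_image=3)
pairs_5 = build_pairs(toy_entries, captions_per_image=5)
print(len(pairs_3), len(pairs_5))  # 2 training images -> 6 and 10 samples
```

With the 6,000 training images of this split, the same logic would give 18,000 samples for 3 captions per image and 30,000 for 5.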

**Why the code for Imagenette cannot be reused for Flickr8k**
1. datasets.Imagenette is a Dataset class made specifically for Imagenette.
torchvision provides built-in dataset classes only for particular datasets, such as
datasets.Imagenette, datasets.CIFAR10, and datasets.MNIST.
None of them reads the Karpathy-split JSON used here (torchvision's own datasets.Flickr8k expects the dataset's original annotation file instead).
→ Flickr8k with this JSON requires creating your own custom Dataset class.
2. Flickr8k requires manually pairing each image with its captions.
Using the JSON file (flickr8k_simplified.json), you must build a list of pairs of the form
[filename, caption].
Imagenette handles its labels internally, but nothing does this for the captions here, so you must implement it yourself.
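
A custom dataset over such pairs only needs `__len__` and `__getitem__`. The sketch below is plain Python so that it runs without any dependencies; in this notebook the class would presumably subclass torch.utils.data.Dataset and receive a loader along the lines of `lambda p: Image.open(p).convert("RGB")`. The class name `CaptionPairDataset` and the `load_image` and `encode` parameters are hypothetical.

```python
import os


class CaptionPairDataset:
    """Map-style dataset over [filename, caption] pairs.

    Images are loaded lazily in __getitem__, so only the pair list is kept
    in memory.
    """

    def __init__(self, pairs, image_dir, load_image, encode=None):
        self.pairs = pairs
        self.image_dir = image_dir
        self.load_image = load_image  # callable: path -> image
        self.encode = encode          # optional callable: caption -> ids

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, index):
        filename, caption = self.pairs[index]
        image = self.load_image(os.path.join(self.image_dir, filename))
        target = self.encode(caption) if self.encode else caption
        return image, target


# Stand-in loader so the example runs without image files.
def fake_loader(path):
    return f"<image at {path}>"


toy_pairs = [["a.jpg", "two dogs play"], ["a.jpg", "dogs in the snow"]]
dataset = CaptionPairDataset(toy_pairs, "Images", fake_loader, encode=str.split)
image, target = dataset[1]
print(len(dataset), target)
```

Each caption is its own sample, so the same image is returned several times per epoch, once for each of its captions.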

In [31]:
json_data = []
list_train = [] # eg. ['2513260012_03d33305cf.jpg', 'a black dog is running after a white dog in the snow .']
word_counter = Counter() # counts how often each token occurs in the training captions
max_len_caption = 0 # maximum number of tokens in a single caption

try:
    with open(JSON_PATH, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    print(f"Success: read {JSON_PATH} . number of data: {len(json_data)}")

except FileNotFoundError:
    # the JSON file could not be found
    print(f"Error: no {JSON_PATH} .")

for item in json_data:
  if item["split"] == "train":
    filename = item['filename']
    sentences = item['sentences'] # the captions for this image (usually 5)
    for sentence in sentences:
      # one training sample per caption
      list_train.append([filename, sentence])
      # lowercase, then split on whitespace to get tokens
      tokens = sentence.lower().strip().split()
      max_len_caption = max(max_len_caption, len(tokens))
      word_counter.update(tokens)

# sanity check
print("--- result ---")
print(f"the number of images for training: {len([x for x in json_data if x.get('split') == 'train'])}")
print(f"the number of generated data: {len(list_train)}")

print("\n--- first 5 data ---")
# each caption of an image is a separate row sharing the same filename
for sample in list_train[:5]:
    print(sample)

Success: read /content/drive/MyDrive/Share/VUB/DeepLearning/Json/flickr8k_simplified.json . number of data: 8000
--- result ---
the number of images for training: 6000
the number of generated data: 30000

--- first 5 data ---
['2513260012_03d33305cf.jpg', 'a black dog is running after a white dog in the snow .']
['2513260012_03d33305cf.jpg', 'black dog chasing brown dog through snow']
['2513260012_03d33305cf.jpg', 'two dogs chase each other across the snowy ground .']
['2513260012_03d33305cf.jpg', 'two dogs play together in the snow .']
['2513260012_03d33305cf.jpg', 'two dogs running through a low lying body of water .']


You are required to implement the image captioning architecture shown in Figure 2. The architecture
is composed of a Vision Transformer (ViT) and an autoregressive Transformer Decoder
with cross-attention across the output features of the ViT.  ***The images should be resized to
224 × 224, with a patch size of 16 × 16.*** If using Google Colab, resize the images before uploading
them to Google Drive in order to save memory.

1. Overall TODO

Breaking down the assignment instructions, these four components are needed:

1. Vision Transformer (ViT) Encoder

Split the 224×224 image into 16×16 patches → 14×14 = 196 patches

Embed each patch as a vector (patch embedding)

Add positional embeddings (positional encoding)

Implement the Transformer Encoder from scratch (nn.Transformer is not allowed)

2. Autoregressive Transformer Decoder

Embed the token sequence on the text (caption) side

Masked self-attention

Cross-attention over the Encoder output

Implement the Transformer Decoder from scratch as well

3. Output

Produce logits over the vocabulary (vocab_size) at each time step

4. Training loop

Teacher forcing: feed <start> ... as input and predict the next word

Train with CrossEntropyLoss
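
Two of the items above, teacher forcing and masked self-attention, can be illustrated without any model code. This is a minimal sketch in plain Python, assuming the special-token ids 0 = <pad>, 1 = <start>, 2 = <end>; the function names and the word ids 57 and 88 are made up. The mask convention here is "True means attention is allowed", which is an assumption, since frameworks differ on this.

```python
def teacher_forcing_pair(ids):
    """Split one caption into decoder input and prediction target.

    The decoder sees tokens 0..T-2 and is trained to predict tokens 1..T-1.
    """
    return ids[:-1], ids[1:]


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Toy caption: <start>, two word ids, <end>, <pad>.
caption_ids = [1, 57, 88, 2, 0]
decoder_input, target = teacher_forcing_pair(caption_ids)
print(decoder_input)  # [1, 57, 88, 2]
print(target)         # [57, 88, 2, 0]

# Each row is one query position; later positions are hidden from it.
for row in causal_mask(3):
    print(row)
```

The trailing <pad> in the target carries no information, so padded positions are usually excluded from the loss, for example through the `ignore_index` argument of CrossEntropyLoss.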

In [32]:
import json
from collections import Counter

In [33]:
# Prepare the text side:
# register every token from the training captions in the lookup table word2idx,
# and add <pad>, <start>, <end>, <unk> as control tokens

# ----- 2. Make a vocabulary list -----
# all words seen in the training captions, sorted so the list is easier to inspect
words = sorted(word_counter)
print('words: ', words)

# ----- 3. add special tokens  -----
PAD_TOKEN = "<pad>"
START_TOKEN = "<start>"
END_TOKEN = "<end>"
UNK_TOKEN = "<unk>"

word2idx = dict()
idx2word = dict()

# register special tokens first
special_tokens = [PAD_TOKEN, START_TOKEN, END_TOKEN, UNK_TOKEN]
for idx, tok in enumerate(special_tokens):
    word2idx[tok] = idx
    idx2word[idx] = tok

# register normal words next
for i, w in enumerate(words, start=len(special_tokens)):
    word2idx[w] = i
    idx2word[i] = w
print(word2idx)
print(idx2word)

print("vocab size:", len(word2idx))
print("first 20 items in vocab:")
for i, (w, idx) in enumerate(list(word2idx.items())[:20]):
    print(w, "->", idx)



vocab size: 7709
first 20 items in vocab:
<pad> -> 0
<start> -> 1
<end> -> 2
<unk> -> 3
! -> 4
" -> 5
# -> 6
& -> 7
' -> 8
'n' -> 9
's -> 10
's-eye-view -> 11
'slide -> 12
( -> 13
) -> 14
, -> 15
- -> 16
-ependent -> 17
. -> 18
1 -> 19
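
Because every training word is registered above, <unk> never occurs in the encoded training captions and only appears for unseen words in validation or test captions. One common variant, shown as a sketch only and not what this notebook does, keeps just the words that reach a frequency threshold. The function name `build_vocab`, the `min_count` parameter, and the toy counts are assumptions.

```python
from collections import Counter

SPECIAL_TOKENS = ["<pad>", "<start>", "<end>", "<unk>"]


def build_vocab(word_counter, min_count=1):
    """Map tokens to ids: special tokens first, then sorted frequent words."""
    words = sorted(w for w, c in word_counter.items() if c >= min_count)
    word2idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + words)}
    idx2word = {i: tok for tok, i in word2idx.items()}
    return word2idx, idx2word


# Toy token counts: "a" x3, "dog" x2, "." x2, everything else x1.
toy_counter = Counter("a dog runs . a dog sits . a cat".split())
full_vocab, _ = build_vocab(toy_counter, min_count=1)
small_vocab, _ = build_vocab(toy_counter, min_count=2)
print(len(full_vocab), len(small_vocab))
print(sorted(small_vocab, key=small_vocab.get))
```

With `min_count=1` the result matches the layout used in this notebook: ids 0 to 3 for the special tokens, followed by the sorted words.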


In [34]:
# Convert a caption (string) into a list of IDs
def encode_caption(sentence, word2idx, max_len):
    """
    sentence: a single caption string
    word2idx: dictionary mapping word -> ID
    max_len:  maximum length of the output ID sequence (including <start>, <end>, <pad>)

    Returns: a list of IDs of length max_len (list[int])
    """

    # 1. Lowercase, strip leading/trailing whitespace, and split on whitespace
    tokens = sentence.lower().strip().split()

    # 2. Put the <start> token at the front
    ids = [word2idx["<start>"]]

    # 3. Convert each word to its ID (words not in the dictionary become <unk>)
    unk_id = word2idx["<unk>"]
    for t in tokens:
        ids.append(word2idx.get(t, unk_id))

    # 4. Append <end> at the end
    ids.append(word2idx["<end>"])

    # 5. Adjust the length to max_len
    if len(ids) < max_len:
        # too short -> pad with <pad>
        pad_id = word2idx["<pad>"]
        ids += [pad_id] * (max_len - len(ids))
    else:
        # too long -> truncate
        ids = ids[:max_len]
        # make sure the sequence still ends with <end> (in case it was cut off midway)
        if ids[-1] != word2idx["<end>"]:
            ids[-1] = word2idx["<end>"]

    return ids
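
The inverse mapping is useful for checking encoded captions and, later, for reading model output. A minimal sketch, assuming the same four special tokens; the function name `decode_caption` and the toy vocabulary are hypothetical.

```python
def decode_caption(ids, idx2word, pad="<pad>", start="<start>", end="<end>"):
    """Turn a list of ids back into a caption string.

    Skips <start> and <pad>, and stops at the first <end>.
    """
    words = []
    for i in ids:
        token = idx2word.get(i, "<unk>")
        if token == end:
            break
        if token in (pad, start):
            continue
        words.append(token)
    return " ".join(words)


# Toy vocabulary (made up) laid out like the notebook's: special tokens first.
toy_idx2word = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "<unk>",
                4: ".", 5: "dogs", 6: "play"}
decoded = decode_caption([1, 5, 6, 4, 2, 0, 0], toy_idx2word)
print(decoded)
```

Decoding an encoded training caption should give back the lowercased, whitespace-normalized original, which makes this a quick round-trip check.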


In [37]:
encoded_train = []
MAX_LEN = max_len_caption + 3 # room for <start> and <end>, plus one spare position
print(max_len_caption)

for filename, sentence in list_train:
    img_path = os.path.join(IMAGE_DIR, filename)

    caption_ids = encode_caption(sentence, word2idx, MAX_LEN)

    encoded_train.append((img_path, caption_ids))

print(encoded_train)


38
[('/content/drive/MyDrive/Share/VUB/DeepLearning/Images/2513260012_03d33305cf.jpg', [1, 63, 696, 1987, 3475, 5614, 141, 63, 7519, 1987, 3391, 6863, 6161, 18, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), ('/content/drive/MyDrive/Share/VUB/DeepLearning/Images/2513260012_03d33305cf.jpg', [1, 696, 1987, 1235, 926, 1987, 6900, 6161, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), ('/content/drive/MyDrive/Share/VUB/DeepLearning/Images/2513260012_03d33305cf.jpg', [1, 7177, 1992, 1232, 2132, 4541, 92, 6863, 6197, 2996, 18, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), ('/content/drive/MyDrive/Share/VUB/DeepLearning/Images/2513260012_03d33305cf.jpg', [1, 7177, 1992, 4956, 6959, 3391, 6863, 6161, 18, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), ('/content/drive/MyDrive/Share/VUB/DeepLearning/Images/2513260

Vision Transformer (ViT)

In [36]:
# Split the image into patches.
# Each 16x16 patch of an RGB image holds 16 x 16 x 3 = 768 pixel values:
# three 16x16 arrays (one per color channel) stacked on top of each other.
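
The patch arithmetic can be checked with a small pure-Python sketch. It treats the image as three channel arrays of nested lists, an assumption made to avoid dependencies; a real implementation would more likely reshape a tensor. The function name `split_into_patches` is hypothetical.

```python
def split_into_patches(image, patch_size):
    """Split a channel-first image (C x H x W nested lists) into flat patches.

    Returns (H / patch_size) * (W / patch_size) patches in row-major order,
    each a flat list of C * patch_size * patch_size values.
    """
    channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[c][y][x]
                for c in range(channels)
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Dummy 3 x 224 x 224 image filled with zeros.
image = [[[0] * 224 for _ in range(224)] for _ in range(3)]
patches = split_into_patches(image, 16)
print(len(patches), len(patches[0]))  # 196 patches of 768 values each
```

Each flat patch is what the patch embedding would then project to the model dimension.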

