# Task 3-4: Image Captioning with CLIP and LM

**Task 3-4 consists of 9 steps, worth a total of 40 points.**

[CLIP](https://github.com/openai/CLIP) (Contrastive Language–Image Pretraining) is a multimodal pretrained model from OpenAI. It is trained contrastively on image-text pairs so that images and text share one embedding space, which gives it robust cross-modal understanding.

Language models (LMs) are core tools in natural language processing (NLP). They aim to understand and generate text that is consistent with linguistic rules by modeling language sequences probabilistically, that is, by predicting the likelihood of words or tokens in a text.

Among them, [GPT-2](https://huggingface.co/openai-community/gpt2) (Generative Pre-trained Transformer 2) is a language generation model released by OpenAI in 2019. It is the second generation in the GPT (Generative Pre-trained Transformer) series and demonstrates strong capabilities in tasks such as text generation, summarization, and translation.

## (1) Motivation
Can the cross-modal understanding of **CLIP** be combined with the text generation ability of a language model to achieve both accurate image understanding and expressive language generation?

## (2) Dependencies

Please refer to `clip-captioner/requirements.txt` for the list of required dependencies.

## (3) Dataset Download
Download [train_captions](https://drive.google.com/file/d/1D3EzUK1d1lNhD2hAvRiKPThidiVbP2K_/view?usp=sharing) to `clip-captioner/data/coco/annotations`.

Download the [training images](http://images.cocodataset.org/zips/train2014.zip) and [validation images](http://images.cocodataset.org/zips/val2014.zip) and unzip them (we use the Karpathy et al. split).


## (4) Dataset Implementation [10 pts]

Please implement `ImageCaptionDataset` (line 20) in `clip-captioner/data/dataset.py`, along with the collate function `cl_fn` (line 26) for the dataloader.

In [None]:
class ImageCaptionDataset(Dataset):
    # TODO: implement an ImageCaptionDataset
    pass



def cl_fn(batch, tokenizer):
    # TODO: implement a collate function
    pass

    # return img_emb, input_ids, attention_mask
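
As a rough illustration of the interface above, here is a minimal sketch of a dataset and collate function. It assumes the CLIP image embeddings are precomputed and held in memory next to their caption strings, which the assignment does not specify. `ToyTokenizer` is a hypothetical stand-in that only mimics the call style of a Hugging Face tokenizer so the example runs without downloads. The 512-dimensional embedding size is likewise an assumption.

```python
from functools import partial
from typing import List

import torch
from torch.utils.data import DataLoader, Dataset


class ImageCaptionDataset(Dataset):
    """Pairs a precomputed image embedding with its caption (assumed layout)."""

    def __init__(self, img_embs: torch.Tensor, captions: List[str]):
        assert len(img_embs) == len(captions)
        self.img_embs = img_embs
        self.captions = captions

    def __len__(self) -> int:
        return len(self.captions)

    def __getitem__(self, idx: int):
        return self.img_embs[idx], self.captions[idx]


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; id 0 is reserved for padding."""

    def __init__(self):
        self.vocab = {"<pad>": 0}

    def __call__(self, texts, padding=True, return_tensors="pt"):
        ids = [[self.vocab.setdefault(w, len(self.vocab)) for w in t.lower().split()]
               for t in texts]
        max_len = max(len(seq) for seq in ids)
        input_ids = torch.zeros(len(ids), max_len, dtype=torch.long)
        attention_mask = torch.zeros_like(input_ids)
        for i, seq in enumerate(ids):
            input_ids[i, : len(seq)] = torch.tensor(seq)
            attention_mask[i, : len(seq)] = 1  # 1 = real token, 0 = padding
        return {"input_ids": input_ids, "attention_mask": attention_mask}


def cl_fn(batch, tokenizer):
    """Stack image embeddings and pad the tokenized captions to equal length."""
    img_emb = torch.stack([emb for emb, _ in batch])
    captions = [cap for _, cap in batch]
    enc = tokenizer(captions, padding=True, return_tensors="pt")
    return img_emb, enc["input_ids"], enc["attention_mask"]


dataset = ImageCaptionDataset(
    torch.randn(4, 512),  # 512 is an assumed embedding size
    ["a dog runs", "a cat", "two birds on a wire", "a red bus"],
)
loader = DataLoader(dataset, batch_size=2,
                    collate_fn=partial(cl_fn, tokenizer=ToyTokenizer()))
img_emb, input_ids, attention_mask = next(iter(loader))
```

With a real GPT-2 tokenizer, note that it ships without a padding token, so one usually has to be assigned (for example the EOS token) before `padding=True` works.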

## (5) Dataset Usage [2 pts]
Please instantiate `ImageCaptionDataset` in `clip-captioner/evaluate.py` (line 109) and `clip-captioner/training.py` (line 50), using the dataset you implemented.

In [None]:
# TODO: you need to implement ImageCaptionDataset yourself in `data/dataset.py`
dataset = ImageCaptionDataset()

## (6) Model Implementation [5 pts]
Please implement a `TransformerEncoder` with multiple layers (`num_layers`) and multi-head attention (`n_heads`), without directly using `nn.TransformerEncoder`, `nn.TransformerEncoderLayer`, or `nn.MultiheadAttention`.

Refer to `clip-captioner/model/model.py` for details.

In [None]:
# TODO: implement a TransformerEncoder with multiple layers (num_layers) and multiple heads (n_heads)
# Not allowed: directly using nn.TransformerEncoder, nn.TransformerEncoderLayer, or nn.MultiheadAttention

# self.transformer_encoder = nn.TransformerEncoder(
#     nn.TransformerEncoderLayer(
#         d_model=embed_size,
#         nhead=n_heads,
#         dim_feedforward=embed_size * forward_expansion,
#         dropout=dropout,
#         batch_first=True,
#         device=device,
#     ),
#     num_layers=num_layers,
# ).to(self.device)
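
One way to meet this requirement is to build the encoder from `nn.Linear` and `nn.LayerNorm` only. The sketch below is a minimal, unofficial version. It assumes post-norm residual blocks and a ReLU feed-forward network (the defaults of the commented-out `nn.TransformerEncoderLayer`), and it leaves out padding masks for brevity. The class names `MultiHeadSelfAttention` and `EncoderLayer` are illustrative, not taken from `model.py`.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention split across n_heads."""

    def __init__(self, embed_size: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        assert embed_size % n_heads == 0, "embed_size must be divisible by n_heads"
        self.n_heads = n_heads
        self.head_dim = embed_size // n_heads
        self.qkv = nn.Linear(embed_size, 3 * embed_size)  # fused Q, K, V projection
        self.out = nn.Linear(embed_size, embed_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, e = x.shape
        qkv = self.qkv(x).view(b, t, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, t, head_dim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        attn = self.dropout(scores.softmax(dim=-1))
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, e)  # merge heads
        return self.out(ctx)


class EncoderLayer(nn.Module):
    """Self-attention and feed-forward sublayers with post-norm residuals."""

    def __init__(self, embed_size, n_heads, forward_expansion, dropout):
        super().__init__()
        self.attn = MultiHeadSelfAttention(embed_size, n_heads, dropout)
        self.ff = nn.Sequential(
            nn.Linear(embed_size, embed_size * forward_expansion),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(embed_size * forward_expansion, embed_size),
        )
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.norm1(x + self.dropout(self.attn(x)))
        return self.norm2(x + self.dropout(self.ff(x)))


class TransformerEncoder(nn.Module):
    """Stack of num_layers encoder layers; input and output are (batch, seq, embed)."""

    def __init__(self, embed_size, n_heads, num_layers,
                 forward_expansion=4, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList(
            EncoderLayer(embed_size, n_heads, forward_expansion, dropout)
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


encoder = TransformerEncoder(embed_size=32, n_heads=4, num_layers=2).eval()
out = encoder(torch.randn(2, 5, 32))
print(out.shape)  # torch.Size([2, 5, 32])
```

The fused `qkv` projection is a design choice for compactness; three separate `nn.Linear` layers would behave equivalently.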

## (6) Model Training [5 pts]

Before training, you also need to complete the training loop. Refer to `clip-captioner/model/trainer.py` (line 71).

In [None]:
for batch_idx, (img_emb, cap, att_mask) in enumerate(loop):
    # TODO: implement a training loop
    pass

    loop.set_description(
        f"Epoch: {self.epoch} | Loss: {total_loss / (batch_idx + 1):.3f}"
    )
    loop.refresh()
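
The body of such a loop usually follows the standard PyTorch pattern: move the batch to the device, clear gradients, run the forward pass, backpropagate, and step the optimizer. The sketch below shows that pattern as a standalone function. It assumes the model returns a scalar loss when called as `model(img_emb, cap, att_mask)`, which may not match the actual interface in `trainer.py`. `train_one_epoch` is a hypothetical name.

```python
import torch
from torch import nn


def train_one_epoch(model: nn.Module, loader, optimizer, device: str = "cpu") -> float:
    """Run one pass over `loader` and return the mean batch loss."""
    model.train()
    total_loss = 0.0
    n_batches = 0
    for batch_idx, (img_emb, cap, att_mask) in enumerate(loader):
        img_emb = img_emb.to(device)
        cap = cap.to(device)
        att_mask = att_mask.to(device)

        optimizer.zero_grad()                 # clear gradients from the previous step
        loss = model(img_emb, cap, att_mask)  # assumed: forward returns a scalar loss
        loss.backward()                       # backpropagate
        optimizer.step()                      # update parameters

        total_loss += loss.item()
        n_batches = batch_idx + 1
    return total_loss / max(n_batches, 1)
```

Extras such as gradient clipping, mixed precision, or a learning-rate scheduler can be slotted in around `optimizer.step()` if the trainer uses them.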


**Note**: The main training entry point is `clip-captioner/training.py`. It takes two arguments: the name of the checkpoint to save and the model size (`S` or `L`; the configuration differences are detailed in `clip-captioner/utils/config.py`).

In [None]:
parser.add_argument(
    "-C", "--checkpoint-name", type=str, default="", help="Checkpoint name"
)

parser.add_argument(
    "-S",
    "--size",
    type=str,
    default="S",
    help="Model size [S, L]",
    choices=["S", "L", "s", "l"],
)
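
To see how these two arguments behave, the snippet below rebuilds the same parser in isolation and parses a sample argument list. The checkpoint name `demo_run` is a made-up value, and upper-casing the size is only one plausible way to handle the lowercase choices; how `training.py` actually normalizes it is not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the two training arguments shown above."""
    parser = argparse.ArgumentParser(description="Training arguments (sketch)")
    parser.add_argument(
        "-C", "--checkpoint-name", type=str, default="", help="Checkpoint name"
    )
    parser.add_argument(
        "-S",
        "--size",
        type=str,
        default="S",
        help="Model size [S, L]",
        choices=["S", "L", "s", "l"],
    )
    return parser


# Equivalent to passing `-C demo_run -S l` on the command line.
args = build_parser().parse_args(["-C", "demo_run", "-S", "l"])
size = args.size.upper()  # assumption: treat "s"/"l" the same as "S"/"L"
print(args.checkpoint_name, size)  # demo_run L
```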

## (7) Visualization [3 pts]
Please run `clip-captioner/evaluate.py` and save **5 visualization results** (images + generated captions).

## (8) Further Improvements? [10 pts]
Modern large language models (LLMs) have demonstrated stronger text generation capabilities compared to GPT-2.  
Can you replace GPT-2 with a more powerful LLM (e.g., [Qwen2.5-0.5B](https://hf-mirror.com/Qwen/Qwen2.5-0.5B)) to enhance the performance of the Image Captioning model?  
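
One possible direction, sketched below under several assumptions, is to keep the language model behind a thin wrapper that depends only on the common Hugging Face causal-LM interface (`config.hidden_size`, `get_input_embeddings()`, and a forward pass that accepts `inputs_embeds`, `attention_mask`, and `labels`). Swapping GPT-2 for another model then mostly changes the hidden size of the projection. `PrefixCaptioner`, the single linear projection, `clip_dim=512`, and `prefix_len=4` are all illustrative choices, not the repository's design.

```python
import torch
from torch import nn


class PrefixCaptioner(nn.Module):
    """Feeds a projected image embedding to a causal LM as a prefix (sketch)."""

    def __init__(self, lm: nn.Module, clip_dim: int = 512, prefix_len: int = 4):
        super().__init__()
        self.lm = lm
        self.prefix_len = prefix_len
        self.hidden = lm.config.hidden_size  # differs between GPT-2 and other LMs
        self.project = nn.Linear(clip_dim, prefix_len * self.hidden)

    def forward(self, img_emb, input_ids, attention_mask):
        b = img_emb.size(0)
        # Image embedding -> prefix_len pseudo-token embeddings.
        prefix = self.project(img_emb).view(b, self.prefix_len, self.hidden)
        tok = self.lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([prefix, tok], dim=1)

        prefix_mask = attention_mask.new_ones(b, self.prefix_len)
        full_mask = torch.cat([prefix_mask, attention_mask], dim=1)

        # -100 marks positions ignored by the loss: the prefix and the padding.
        ignore = input_ids.new_full((b, self.prefix_len), -100)
        targets = input_ids.masked_fill(attention_mask == 0, -100)
        labels = torch.cat([ignore, targets], dim=1)

        out = self.lm(inputs_embeds=inputs_embeds,
                      attention_mask=full_mask, labels=labels)
        return out.loss


# Possible usage (downloads weights, so it is left commented out):
# from transformers import AutoModelForCausalLM
# lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
# model = PrefixCaptioner(lm)
```

The tokenizer has to be switched together with the model, and if the LM is loaded in half precision the prefix may need casting to the same dtype.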

## (9) Improvement Description [5 pts]
Please describe your specific improvements below, including which functionality was implemented in which files, and the results obtained after the improvements:

 TODO: Improvement Description