# COCO Questions

We aim to build a VQA dataset without ground-truth answers, to be used for further assessment. The steps are:

1. Gather images from the COCO dataset.
2. Generate questions with LLMs from the image captions only.


In [1]:
!pip install -U datasets langchain langchain-groq langchain-google-vertexai

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting langchain
  Downloading langchain-0.3.7-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-groq
  Downloading langchain_groq-0.2.1-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-google-vertexai
  Downloading langchain_google_vertexai-2.0.7-py3-none-any.whl.metadata (3.8 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting langchain-core<0.4.0,>=0.3.15 (from langchain)
  Downloading langchain_core-0.3.15-py3-non

## 1. Gather images from COCO

We download the COCO 2017 caption annotations and sample images from the validation and training splits.

In [None]:
!wget -O 'coco-2017-annotations.zip' 'http://images.cocodataset.org/annotations/annotations_trainval2017.zip'
!unzip coco-2017-annotations.zip

--2024-10-30 23:43:05--  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 3.5.12.113, 52.217.197.185, 52.216.33.169, ...
Connecting to images.cocodataset.org (images.cocodataset.org)|3.5.12.113|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252907541 (241M) [application/zip]
Saving to: ‘coco-2017-annotations.zip’


2024-10-30 23:43:12 (34.8 MB/s) - ‘coco-2017-annotations.zip’ saved [252907541/252907541]

Archive:  coco-2017-annotations.zip
  inflating: annotations/instances_train2017.json  
  inflating: annotations/instances_val2017.json  
  inflating: annotations/captions_train2017.json  
  inflating: annotations/captions_val2017.json  
  inflating: annotations/person_keypoints_train2017.json  
  inflating: annotations/person_keypoints_val2017.json  


In [None]:
import json
from collections import defaultdict
from typing import Any, Dict, List


def load_coco_captions(filename: str) -> List[Dict[str, Any]]:
    """Load a COCO captions file and return one record per image, with all its captions."""
    with open(filename) as fp:
        data = json.load(fp)

    licenses: Dict[int, str] = {
        license["id"]: license["url"]
        for license in data["licenses"]
    }

    captions: Dict[int, List[str]] = defaultdict(list)
    for annotation in data["annotations"]:
        captions[annotation["image_id"]].append(annotation["caption"])

    return [
        {
            "id": image_data["id"],
            "url": image_data["coco_url"],
            "license": licenses[image_data["license"]],
            "captions": captions[image_data["id"]],
            "height": image_data["height"],
            "width": image_data["width"],
            "date_captured": image_data["date_captured"]
        }
        for image_data in data["images"]
    ]
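
To make the reshaping concrete, the sketch below applies the same grouping to a tiny hand-written dictionary that mimics the layout of `captions_val2017.json`. The ids, captions and URLs are invented for illustration; only the key names follow the annotation file used above.

```python
from collections import defaultdict

# Toy stand-in for the COCO captions file (all values are invented).
toy = {
    "licenses": [{"id": 1, "url": "https://example.org/license-1"}],
    "images": [
        {"id": 10, "license": 1, "coco_url": "https://example.org/10.jpg"},
        {"id": 11, "license": 1, "coco_url": "https://example.org/11.jpg"},
    ],
    "annotations": [
        {"image_id": 10, "caption": "A dog on a couch."},
        {"image_id": 10, "caption": "A brown dog resting indoors."},
        {"image_id": 11, "caption": "A bowl of fruit."},
    ],
}

# Group captions by image id, as load_coco_captions does.
captions_by_image = defaultdict(list)
for annotation in toy["annotations"]:
    captions_by_image[annotation["image_id"]].append(annotation["caption"])

# Resolve each license id to its URL and build one record per image.
licenses = {lic["id"]: lic["url"] for lic in toy["licenses"]}
records = [
    {
        "id": image["id"],
        "url": image["coco_url"],
        "license": licenses[image["license"]],
        "captions": captions_by_image[image["id"]],
    }
    for image in toy["images"]
]

print(records[0]["captions"])
```

Each image ends up with a list of captions, which is why the question-generation loop later picks one of them at random.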

In [None]:
import random

# Sample used for experiments
eval_sample = load_coco_captions("annotations/captions_val2017.json")
# Sample used to formulate few-shot examples and for debugging
help_sample = load_coco_captions("annotations/captions_train2017.json")

random.shuffle(eval_sample)
random.shuffle(help_sample)

eval_sample = eval_sample[:60]
help_sample = help_sample[:16]
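
Because `random.shuffle` is called without a seed, each run of the cell above picks different images. If a reproducible sample is wanted, one option is a dedicated, seeded generator. The helper name `sample_records` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import random
from typing import Any, Dict, List


def sample_records(records: List[Dict[str, Any]], k: int, seed: int) -> List[Dict[str, Any]]:
    """Return k records chosen without replacement, deterministically for a given seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(records, min(k, len(records)))


# Toy records standing in for the output of load_coco_captions.
toy_records = [{"id": i} for i in range(100)]
first = sample_records(toy_records, k=5, seed=1234)
second = sample_records(toy_records, k=5, seed=1234)
print(first == second)  # True
```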

In [None]:
with open("coco-eval-sample.json", "w+") as fp:
    json.dump(eval_sample, fp, ensure_ascii=False, indent=2)

with open("coco-dev-sample.json", "w+") as fp:
    json.dump(help_sample, fp, ensure_ascii=False, indent=2)

## 2. Generate Questions

In [None]:
# Language Model
import os
from google.colab import userdata
from langchain_groq import ChatGroq


os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
llm = ChatGroq(model="llama-3.1-70b-versatile", temperature=0.7)

In [None]:
from langchain_core.prompts import ChatPromptTemplate


task = """
It will be given to you a Image Caption. You want to know more details about the
image itself. Your job is to generate a question to grasp more information about the
content mentioned in the image caption.

Examples:

Caption: A man and a woman standing next to each other in a living room.
Question: What activity do the man and the woman appear to be doing?
---
Caption: A table topped with cakes, coffee and desserts.
Question: What type of meal is laid out on the table?
---
Caption: A hand holding a glass of alcoholic drink in the snow.
Question: What kind of drink is inside the glass?
---
Caption: A parked snow mobile siting on the side of a train.
Question: Is the train a passenger or freight train?
---
Caption: A young girl with a downed kite in a field.
Question: How old is the young girl?
---
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", task),
    ("user", "Caption: {caption}"),
    ("user", "Question:")
])
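
The few-shot block in the system message is written by hand. As a sketch, the same text could be assembled from a list of caption/question pairs, which makes it easier to swap in examples drawn from `help_sample`. The function name `format_examples` is hypothetical; the two pairs are copied from the prompt above.

```python
from typing import List, Tuple


def format_examples(pairs: List[Tuple[str, str]]) -> str:
    """Render (caption, question) pairs in the 'Caption / Question / ---' layout used above."""
    blocks = [
        f"Caption: {caption}\nQuestion: {question}\n---"
        for caption, question in pairs
    ]
    return "\n".join(blocks)


pairs = [
    ("A table topped with cakes, coffee and desserts.",
     "What type of meal is laid out on the table?"),
    ("A young girl with a downed kite in a field.",
     "How old is the young girl?"),
]
examples_text = format_examples(pairs)
print(examples_text)
```

If a caption could contain curly braces, they would need to be escaped (`{{` and `}}`) before the text is placed in a `ChatPromptTemplate` string, since braces mark template variables there.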

In [None]:
# Debug: render the prompt for a sample caption
prompt_value = prompt.invoke({
    "caption": "A microwave and a cone on asphalt by bushes."
})
for message in prompt_value.messages:
    print(message.content)


It will be given to you a Image Caption. You want to know more details about the
image itself. Your job is to generate a question to grasp more information about the
content mentioned in the image caption.

Examples:

Caption: A man and a woman standing next to each other in a living room.
Question: What activity do the man and the woman appear to be doing?
---
Caption: A table topped with cakes, coffee and desserts.
Question: What type of meal is laid out on the table?
---
Caption: A hand holding a glass of alcoholic drink in the snow.
Question: What kind of drink is inside the glass?
---
Caption: A parked snow mobile siting on the side of a train.
Question: Is the train a passenger or freight train?
---
Caption: A young girl with a downed kite in a field.
Question: How old is the young girl?
---

Caption: A microwave and a cone on asphalt by bushes.
Question:


In [None]:
question_generation_chain = prompt | llm

In [None]:
question_generation_chain.invoke({"caption": "A microwave and a cone on asphalt by bushes."})

AIMessage(content='What appears to be the purpose of the microwave being placed on the asphalt?', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 16, 'prompt_tokens': 235, 'total_tokens': 251, 'completion_time': 0.064, 'prompt_time': 0.040806155, 'queue_time': 0.003629665000000004, 'total_time': 0.104806155}, 'model_name': 'llama-3.1-70b-versatile', 'system_fingerprint': 'fp_5c5d1b5cfb', 'finish_reason': 'stop', 'logprobs': None}, id='run-b22247c9-697d-4056-8cbf-ef28faf39ac2-0', usage_metadata={'input_tokens': 235, 'output_tokens': 16, 'total_tokens': 251})
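
The chain returns a full `AIMessage`; only its `content` is kept in the loop that follows. A model may echo the `Question:` label or add extra lines, so a light normalization step can help. The helper `clean_question` below is a hypothetical addition and works on plain strings.

```python
def clean_question(text: str) -> str:
    """Strip whitespace and a leading 'Question:' label, keeping only the first line."""
    text = text.strip()
    if text.lower().startswith("question:"):
        text = text[len("question:"):].strip()
    # Keep the first non-empty line in case the model adds commentary afterwards.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[0] if lines else ""


print(clean_question("Question: What color is the cone?\n"))
```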

In [None]:
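import time
import random
from tqdm.auto import tqdm

questions = []
for image in tqdm(eval_sample, desc="Generating questions"):
    try:
        # Use one randomly chosen caption per image as the only context for the LLM.
        # Kept inside the try block so an image without captions is skipped, not fatal.
        caption = random.choice(image["captions"])
        question = question_generation_chain.invoke({"caption": caption}).content
        questions.append({
            **image,
            "question": question
        })
    except Exception as e:
        # Log the failure and move on to the next image
        tqdm.write(f"Error: {e}")
    finally:
        # Pause between requests to avoid hitting API rate limits
        time.sleep(1.0)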

Generating questions:   0%|          | 0/60 [00:00<?, ?it/s]
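
The loop above logs a failure and skips that image, so a transient error means one fewer question in the output. A sketch of retrying with exponential backoff follows; `call_with_retries` and its parameters are assumptions, and `fn` stands for any zero-argument callable, for example `lambda: question_generation_chain.invoke({"caption": caption}).content`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn, retrying on any exception with delays base_delay, 2*base_delay, ..."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demonstration with a stand-in that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky, attempts=3, base_delay=0.01)
print(result, calls["n"])  # ok 3
```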

In [None]:
with open("coco-eval-questions.json", "w+") as fp:
    json.dump(questions, fp, ensure_ascii=False, indent=2)

# VQA

We sample images from the original VQA dataset together with their questions and human-annotated answers. This sample will be used to assess multimodal LLMs against ground-truth answers.

In [3]:
!wget "https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip"
!wget "https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip"

!unzip v2_Annotations_Val_mscoco.zip
!unzip v2_Questions_Val_mscoco.zip

--2024-11-02 14:11:22--  https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.169.248, 52.217.92.158, 54.231.133.192, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.169.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10518930 (10M) [application/zip]
Saving to: ‘v2_Annotations_Val_mscoco.zip’


2024-11-02 14:11:22 (43.5 MB/s) - ‘v2_Annotations_Val_mscoco.zip’ saved [10518930/10518930]

--2024-11-02 14:11:23--  https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Val_mscoco.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.169.248, 52.217.92.158, 54.231.133.192, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.169.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3494929 (3.3M) [application/zip]
Saving to: ‘v2_Questions_Val_mscoco.zip’


2024-11-02 14:11:23 (20.9 MB/s) - ‘v2_Questions_Val_mscoco.zip’ saved [3494

In [7]:
import json
from tqdm.auto import tqdm
from typing import Dict, TypedDict, Literal


class Question(TypedDict):
    text: str
    image_id: int


def load_vqa_questions(filename: str) -> Dict[int, Question]:
    with open(filename) as fp:
        data = json.load(fp)

    return {
        question["question_id"]: {
            "text": question["question"],
            "image_id": question["image_id"]
        }
        for question in data["questions"]
    }

In [8]:
from typing import Literal


VQASubtype = Literal["train2014", "val2014", "train2017", "val2017"]


def create_image_url(image_id: int, subtype: VQASubtype) -> str:
    coco_base_url = f"http://images.cocodataset.org/{subtype}/"

    if "2014" in subtype:
        image_path = f"COCO_{subtype}_{image_id:012d}.jpg"
    else:
        image_path = f"{image_id:012d}.jpg"

    return coco_base_url + image_path
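
For reference, the two naming schemes produce URLs like the ones below. The snippet restates the formatting rule inline so that it runs on its own; `image_url` is a stand-in name, and the image id 42 is an arbitrary example that may not correspond to an existing image.

```python
def image_url(image_id: int, subtype: str) -> str:
    """Same rule as create_image_url: 2014 splits prefix the file name with COCO_<subtype>_."""
    prefix = f"COCO_{subtype}_" if "2014" in subtype else ""
    return f"http://images.cocodataset.org/{subtype}/{prefix}{image_id:012d}.jpg"


print(image_url(42, "val2014"))
# http://images.cocodataset.org/val2014/COCO_val2014_000000000042.jpg
print(image_url(42, "val2017"))
# http://images.cocodataset.org/val2017/000000000042.jpg
```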

In [13]:
import random
import requests
from typing import List

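def load_vqa(annotations_file: str, questions: Dict[int, Question], k: int = -1, shuffle: bool = True) -> List[dict]:
    """Return up to k annotations whose question and image both exist (k <= 0 means all)."""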
    with open(annotations_file) as fp:
        data = json.load(fp)

    subtype = data["data_subtype"]
    if shuffle:
        random.shuffle(data["annotations"])
    k = k if k > 0 else len(data["annotations"])
    results = []
    with tqdm(total=k, desc="Loading VQA") as pbar:
        for annotation in data["annotations"]:
            # Check if the question exists and matches the annotation's image_id
            question_exists = (
                questions
                .get(annotation["question_id"], {})
                .get("image_id", -1) == annotation["image_id"]
            )
            if not question_exists:
                # Skip early so no request is made for an annotation that is discarded anyway
                continue

            image_url = create_image_url(annotation["image_id"], subtype)
            # Check if image exists; a network failure counts as "missing"
            try:
                image_exists = requests.get(image_url, timeout=30).ok
            except requests.RequestException:
                image_exists = False

            if image_exists:
                results.append({
                    "id": annotation["image_id"],
                    "url": image_url,
                    "question_type": annotation["question_type"],
                    "question": questions[annotation["question_id"]]["text"],
                    "answer_type": annotation["answer_type"],
                    "multiple_choice_answer": annotation["multiple_choice_answer"],
                    "answers": [
                        a["answer"] for a in annotation["answers"]
                    ]
                })
                pbar.update(1)

            if len(results) >= k:
                break

    return results
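
The matching logic in `load_vqa` is a join on `question_id` with a consistency check on `image_id`. The toy data below is invented to show which annotations survive; it leaves out the network check on the image URL.

```python
# Invented questions, keyed by question_id as load_vqa_questions returns them.
toy_questions = {
    1: {"text": "What color is the cat?", "image_id": 100},
    2: {"text": "Is it raining?", "image_id": 200},
}
toy_annotations = [
    {"question_id": 1, "image_id": 100, "multiple_choice_answer": "black"},
    {"question_id": 2, "image_id": 999, "multiple_choice_answer": "no"},   # image_id mismatch
    {"question_id": 3, "image_id": 300, "multiple_choice_answer": "two"},  # no such question
]

# Keep an annotation only if its question exists and points at the same image.
matched = [
    {
        "id": a["image_id"],
        "question": toy_questions[a["question_id"]]["text"],
        "answer": a["multiple_choice_answer"],
    }
    for a in toy_annotations
    if toy_questions.get(a["question_id"], {}).get("image_id", -1) == a["image_id"]
]
print(matched)
```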

In [14]:
questions = load_vqa_questions("v2_OpenEnded_mscoco_val2014_questions.json")
vqa_sample = load_vqa("v2_mscoco_val2014_annotations.json", questions, k=60)

with open("vqa-eval.json", "w+") as fp:
    json.dump(vqa_sample, fp, ensure_ascii=False, indent=2)

Loading VQA:   0%|          | 0/60 [00:00<?, ?it/s]
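
Each saved record carries the list of human `answers`, which is what a ground-truth comparison would use. As a sketch, the widely used simplified VQA accuracy gives full credit when at least three annotators gave the predicted answer. This is not part of the notebook: `vqa_accuracy` is a hypothetical helper, and the official evaluation script additionally normalizes answer strings and averages over annotator subsets, both omitted here.

```python
from typing import List


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    normalized = prediction.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == normalized)
    return min(matches / 3.0, 1.0)


# Invented answers from ten annotators for a single question.
answers = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "yes"]
print(vqa_accuracy("Yes", answers))  # 1.0 (eight annotators agree)
print(vqa_accuracy("no", answers))   # two of the three matches needed for full credit
```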