<center><b><h1>MiniVQA</h1></b></center>


**MiniVQA** is based on the VQA dataset [1], which uses Microsoft COCO images [2].

[1] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6904-6913).

[2] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014, September). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (pp. 740-755). Springer, Cham.

**Python 3.7.10**

# Step 1. Downloading VQA annotations (questions and answers)

In [None]:
!wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip && unzip v2_Annotations_Train_mscoco && \
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Questions_Train_mscoco.zip && unzip v2_Questions_Train_mscoco

In [3]:
import os
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import json
import copy
import glob
import torch
import random
import operator
import numpy as np
import urllib.request
import matplotlib.pyplot as plt


from PIL import Image
from tqdm import tqdm
from google.colab import files
from numpy.random import choice
from sklearn.manifold import TSNE
from collections import Counter, defaultdict


# Step 2. Unpacking questions

In [4]:
annotations = json.load(open("v2_mscoco_train2014_annotations.json"))
questions = json.load(open("v2_OpenEnded_mscoco_train2014_questions.json"))

annotations = annotations['annotations']
questions_id = {q['question_id']: q['question'] for q in questions['questions']}

assert len(annotations) == len(questions_id)
print(len(questions_id), 'questions available')

443757 questions available
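
The join above relies on the `question_id` field shared by both files. Here is a minimal sketch on hand-made records; the two toy entries are invented for illustration and only mimic the fields this notebook reads.

```python
# Toy records that mimic only the fields used in this notebook (invented data).
toy_annotations = [
    {"question_id": 1000, "image_id": 9, "question_type": "what color",
     "answers": [{"answer": "red"}, {"answer": "red"}, {"answer": "dark red"}]},
    {"question_id": 1001, "image_id": 9, "question_type": "is the",
     "answers": [{"answer": "yes"}, {"answer": "yes"}, {"answer": "no"}]},
]
toy_questions = {"questions": [
    {"question_id": 1000, "question": "What color is the bus?"},
    {"question_id": 1001, "question": "Is the bus moving?"},
]}

# Same kind of lookup table as `questions_id`: question_id -> question text.
toy_questions_id = {q["question_id"]: q["question"] for q in toy_questions["questions"]}

# Every annotation can now be paired with its question text.
pairs = [(toy_questions_id[a["question_id"]], a["image_id"]) for a in toy_annotations]
print(pairs)
```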


In [None]:
# Printing 5 random examples 
plt.figure(figsize=([30, 20]))
for i in range(5):
  plt.subplot(1, 5, i+1)
  ann = random.choice(list(annotations))
  filename = 'COCO_train2014_'+ str(ann['image_id']).zfill(12) + '.jpg'
  src = os.path.join('https://vqa_mscoco_images.s3.amazonaws.com/train2014/', filename)
  trg = os.path.join('/tmp', filename)
  if not os.path.exists(trg):
    urllib.request.urlretrieve(src, trg)
  plt.imshow(Image.open(trg).resize((128,128)))
  plt.xlabel('image_id: '+ str(ann['image_id']))
  plt.title(questions_id[ann['question_id']])

# Step 3. Group questions by answers 

In [None]:
questions_by_answer_type = defaultdict(list)

# Get the most frequent answer amongst annotators
def get_top_answer(answers):
  # Count the answers passed in, rather than reading the global `annotation`
  count_answers = Counter([answer['answer'] for answer in answers])
  top_answer, _ = count_answers.most_common()[0]
  return top_answer

# Iterate through annotations
# Stack questions
for annotation in tqdm(annotations, position=0, leave=True):
  question_id =  annotation['question_id']
  image_id =  annotation['image_id']
  top_answer = get_top_answer(annotation['answers'])
  questions_by_answer_type[top_answer].append([questions_id[question_id], int(image_id), int(question_id)])

# We discard answers with 3 or fewer samples. We want at least one sample per split (train, val, test)
answers_population = Counter({a:len(q) for a, q in questions_by_answer_type.items() if len(q) > 3})
num_unique_answers = len(answers_population)
print('\n', num_unique_answers, 'unique answers')
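
The grouping step can be tried in isolation. This is a minimal sketch on invented annotations; the helper name `top_answer` and the toy data are assumptions made for the example, not part of the VQA files.

```python
from collections import Counter, defaultdict

def top_answer(answers):
    """Return the answer string given most often by the annotators."""
    counts = Counter(a["answer"] for a in answers)
    return counts.most_common(1)[0][0]

# Invented annotations: each question has three annotator answers.
toy_annotations = [
    {"question_id": 1, "answers": [{"answer": "red"}, {"answer": "blue"}, {"answer": "red"}]},
    {"question_id": 2, "answers": [{"answer": "yes"}, {"answer": "yes"}, {"answer": "no"}]},
    {"question_id": 3, "answers": [{"answer": "red"}, {"answer": "red"}, {"answer": "red"}]},
]

# Group question ids under their majority answer.
grouped = defaultdict(list)
for ann in toy_annotations:
    grouped[top_answer(ann["answers"])].append(ann["question_id"])

print(dict(grouped))
```

When two answers are tied, `most_common` returns the one that was encountered first (Python 3.7 and later), so the result depends on annotator order.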

# Step 4. Decide the volume of your dataset

1. Select **num_answers**, the number of distinct answers to keep (i.e. the number of classes).
2. Select the **sampling_type** used to choose answers: "top" takes the *num_answers* most common answers, "random" draws them at random.
3. You can exclude the n most popular answers with **sampling_exclude_top** (the most popular answer is "no", with 80,000 questions). You can do the same for the least popular answers with **sampling_exclude_bottom**.
4. If *sampling_type* is "random", you can bound the total number of questions in the dataset with **min_samples** and **max_samples**.
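
To see how the exclusion parameters act on the ranked answers, here is a minimal sketch on an invented population. The helper `select_top` is hypothetical, and applying the bottom exclusion in "top" mode is an assumption of this sketch rather than a description of the cell below.

```python
from collections import Counter

def select_top(population, num_answers, exclude_top=0, exclude_bottom=0):
    """Pick the most common answers after trimming both ends of the ranking."""
    ranked = [answer for answer, _ in population.most_common()]
    candidates = ranked[exclude_top:len(ranked) - exclude_bottom]
    if len(candidates) < num_answers:
        raise ValueError("not enough answers left after exclusion")
    return candidates[:num_answers]

# Invented counts: answer -> number of questions.
toy_population = Counter({"no": 80, "yes": 70, "2": 30, "white": 20, "red": 10, "frisbee": 4})

chosen = select_top(toy_population, num_answers=2, exclude_top=2, exclude_bottom=1)
total = sum(toy_population[a] for a in chosen)
print(chosen, total)
```

Excluding the top answers mainly removes the very large "yes"/"no" style classes, which would otherwise dominate the label distribution.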

In [7]:
num_answers = 100
sampling_type = "random" # choose between top or random
sampling_exclude_top = 50
sampling_exclude_bottom = 50
min_samples = 3000
max_samples = 4000

After adjusting the parameters, run the following cell to display the results.
Re-run the cell to generate a new selection.

In [None]:
def get_answers(num_answers, sampling_type, min_samples, sampling_exclude_top):

  assert (num_unique_answers - sampling_exclude_top) > num_answers, "sampling_exclude_top is too large: not enough answers left for num_answers"
  assert min_samples < max_samples

  top = answers_population.most_common()
  top_answers = [a[0] for a in top]

  if sampling_type == 'top':
    answers = top_answers[sampling_exclude_top:num_answers + sampling_exclude_top]
    max_possible_samples = [answers_population[a] for a in answers]

  elif sampling_type == 'random':

    # select random answers
    rand_index = random.sample(range(sampling_exclude_top, num_unique_answers), num_answers)
    answers = [top_answers[r] for r in rand_index]

    # compute the number of samples available
    max_possible_samples = [answers_population[a] for a in answers]

    swaps = 0
    pbar = tqdm(total=10000, position=0, leave=True)
    while sum(max_possible_samples) < min_samples or sum(max_possible_samples) > max_samples:
      a = random.choice(answers)
      index = top_answers.index(a)

      if sum(max_possible_samples) < min_samples:
        # Fetching an answer with a higher population
        min_index = sampling_exclude_top
        if index <= sampling_exclude_top: # can't do better
          continue
        new_answer = top_answers[random.randint(min_index, index-1)]
      else:
        # Fetching an answer with a lower population
        # randint is inclusive, so stop at the last index outside the excluded bottom
        max_index = len(top_answers) - sampling_exclude_bottom - 1
        if index >= max_index: # can't do better
          continue
        new_answer = top_answers[random.randint(index+1, max_index)]

      if new_answer in answers: # already have this answer
        continue

      answers.remove(a)
      answers.append(new_answer)

      # update num of samples
      max_possible_samples = [answers_population[a] for a in answers]

      swaps +=1
      if swaps == 10000:
        raise RuntimeError("Too many iterations; there may be no configuration that satisfies your settings")
      pbar.update(1)
      pbar.set_description('current_samples:{}'.format(sum(max_possible_samples)))
    print("\ndone!")
  else:
    raise NotImplementedError()
  dataset = {a:q for a, q in questions_by_answer_type.items() if a in answers}
  return dataset, answers, max_possible_samples


dataset, answers, max_possible_samples = get_answers(num_answers, sampling_type, min_samples, sampling_exclude_top)

assert len(set(answers)) == num_answers, "Something went wrong, let's try again"


# Print results
print("Chosen answers:",answers)
print("Num samples:", sum(max_possible_samples))
print("Labels distribution:")

plt.bar(range(num_answers), max_possible_samples)
plt.show()
print(Counter({a:len(q) for a, q in dataset.items()}).most_common())
print('\nUnhappy? Re-run the cell...')


# Step 5. Create dataset files


1.   **sample_clipping**: Set to n to keep at most n samples per answer (False for no clipping).
2.   **im_download**: Set to True to download the images directly through HTTP requests. This can be slow if you picked a lot of samples, so you might as well get a cup of coffee.

In [10]:
im_download=False
sample_clipping=False # false for no clipping

In [None]:
image_question = defaultdict(list)
label_question = defaultdict(list)
question_label = {}
answer_list = list(dataset.keys())

im_dir = "data/images"
os.makedirs(im_dir, exist_ok=True)
pbar = tqdm(dataset.items(), position=0, leave=True)
for answer, questions in pbar:
  if sample_clipping:
    questions = questions[:sample_clipping]

  for i, q in enumerate(questions):
    question, image_id, question_id = q
    if im_download:
      filename = 'COCO_train2014_'+ str(image_id).zfill(12) + '.jpg'
      pbar.set_description("Downloading images for answer {} [{}/{}] {}".format(answer, i, len(questions), filename))
      src = os.path.join('https://vqa_mscoco_images.s3.amazonaws.com/train2014/', filename)
      trg = os.path.join(im_dir, filename)
      if not os.path.exists(trg):
        urllib.request.urlretrieve(src, trg)

    # Filling dict
    image_question[image_id].append((question_id, question))
    label_question[answer].append(question_id)
    question_label[question_id] = answer_list.index(answer)
  
print('\n')
print(len(image_question), 'images for', len(question_label), 'questions')

print('five samples of image_question', list(image_question.items())[:5])
print('five samples of question_label', list(question_label.items())[:5])
print('five samples of label_question', list(label_question.items())[:5])



# Step 6 (optional). Resize images

For your mini-VQA project, you might want to use lower-resolution images (faster training).

1.   **resize**: Side length, in pixels, of the square output images

In [10]:
resize = 128

In [None]:
im_dir = "data/images"
outdir = "data/images_resized"
os.makedirs(outdir, exist_ok=True)

images = glob.glob(os.path.join(im_dir, '*'))
for im in tqdm(images, position=0, leave=True):
  try:
    im_resized = os.path.basename(im).replace(".jpg", "_resized.jpg")
    image = Image.open(im)
    image = image.resize((resize,resize), Image.LANCZOS)  # ANTIALIAS is a deprecated alias of LANCZOS
    image.save(os.path.join(outdir, im_resized), "JPEG")
  except Exception as e:
    print(e)
    print("It's possible the image is corrupted, re-download it")
    raise


# Step 7. Explore the question embedding space
Compute question embeddings with the pretrained `bert-base-nli-mean-tokens` model. Feel free to change the model.


In [None]:
!pip install sentence_transformers

In [None]:
from sentence_transformers import SentenceTransformer
print("Computing embeddings of selected questions")
model = SentenceTransformer('bert-base-nli-mean-tokens').cuda().eval() 

qs = [q[1] for  _, v in image_question.items() for q in v]
ids = [q[0] for  _, v in image_question.items() for q in v]
at = {ann['question_id']:ann['question_type'] for ann in annotations}

sentence_embeddings = []
ans_type = []
for q, id in tqdm(zip(qs, ids), position=0, leave=True, total=len(qs)):
  with torch.no_grad():
    sentence_embeddings.append(model.encode(q))
  ans_type.append(at[id])
    
sentence_embeddings = np.array(sentence_embeddings)
ans_type = np.array(ans_type)

# Reduce dimension
viz = TSNE(n_components=2, n_jobs=4, verbose=1, n_iter=2000)
embeddings = viz.fit_transform(sentence_embeddings)

In [None]:
# Plotting
best_q_type = [k[0] for k in Counter(ans_type).most_common(5)]
# sorted() turns the set into a sequence, which random.sample requires on recent Python versions
random_q_type = random.sample(sorted(set(ans_type)), 5)
plt.figure(figsize=([10, 5]))
for i, q_type in enumerate([best_q_type, random_q_type]):
  plt.subplot(1, 2, i+1)
  for g in q_type:
      ix = np.where(ans_type == g)[0]
      plt.scatter(embeddings[ix, 0], embeddings[ix, 1], s=0.1, label=g)
  plt.legend(markerscale=10, loc='lower left')
# suptitle labels the whole figure; plt.title would only label the right-hand subplot
plt.suptitle("Question embeddings per question_type (left: popular, right: random)")
plt.show()

# Step 8. Create split

1. Choose the train and validation fractions with **train_size** and **val_size**; the remainder goes to the test split.
2. Set **balanced** to True to distribute each label across the splits in the same proportions, or to False for an uneven label distribution.
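
The balanced case can be sketched on toy data as follows. The helper `balanced_split`, its `seed` argument and the toy labels are assumptions made for this example; the sketch also leaves out the step that first places one sample of each answer in every split.

```python
import random

def balanced_split(label_to_ids, train_size=0.8, val_size=0.1, seed=0):
    """Cut every label's ids at the same fractions so all splits share the label mix."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for ids in label_to_ids.values():
        ids = list(ids)          # copy so the caller's lists are not shuffled
        rng.shuffle(ids)
        n = len(ids)
        train_end = int(train_size * n)
        val_end = int((train_size + val_size) * n)
        train += ids[:train_end]
        val += ids[train_end:val_end]
        test += ids[val_end:]
    return train, val, test

# Invented question ids for two labels of different sizes.
toy_label_question = {"cat": list(range(0, 10)), "dog": list(range(100, 120))}
train, val, test = balanced_split(toy_label_question)
print(len(train), len(val), len(test))
```

Because the cut points are computed per label, a label with 20 questions contributes twice as many samples to each split as a label with 10.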


In [16]:
train_size = 0.8
val_size = 0.1
balanced = True

In [None]:
assert sum([train_size, val_size]) < 1.0

train = []
val = []
test = []

lq = copy.deepcopy(label_question)

# put at least one sample of each answer in each split
for _, v in  lq.items():
  random.shuffle(v)    
  train.append(v.pop())
  val.append(v.pop())
  test.append(v.pop())

if balanced:
  # generating balanced splits, for each label, put the requested balance in splits
  for _, v in  lq.items():
    random.shuffle(v)
    num_samples = len(v)
    train += (v[:int(train_size*num_samples)])
    val   += (v[int(train_size*num_samples):int((train_size+val_size)*num_samples)])
    test  += (v[int((train_size+val_size)*num_samples):])
else:
    all_qid = [item for _, v in lq.items() for item in v]
    labels = list(lq.keys())
    num_samples = len(all_qid)

    random_weights = np.random.random(len(labels))

    while len(val) < int(val_size*num_samples):
      label_drawn = choice(labels, 1, p=random_weights/random_weights.sum())
      sample = random.choice(lq[label_drawn[0]])
      if sample in all_qid:
        val.append(sample)
        all_qid.remove(sample)

    random_weights = np.random.random(len(labels))

    while len(test) < int((1-(val_size+train_size))*num_samples):
      label_drawn = choice(labels, 1, p=random_weights/random_weights.sum())
      sample = random.choice(lq[label_drawn[0]])
      if sample in all_qid:
        test.append(sample)
        all_qid.remove(sample)
    
    train += all_qid

random.shuffle(train)
random.shuffle(val)
random.shuffle(test)

# assert no duplicates
assert len(set(train + val + test)) == (len(train) + len(val) + len(test))

num_labels = len(label_question)
for i, (name, split) in enumerate([('train', train), ('val', val), ('test', test)]):
  label_count = Counter(question_label[q] for q in split)
  plt.subplot(1, 3, i+1)
  # Index over all labels so that a label missing from a split shows up as 0
  plt.bar(range(num_labels), [label_count[l] for l in range(num_labels)])
  plt.xlabel("label")
  plt.ylabel("count")
  plt.title(name)

plt.tight_layout()
print("Label distribution across splits")
plt.show()

# Last Step: download files

1. **test.csv** must be given to students. They have to fill it with their predictions, formatted like **sample_submission.csv**, which contains random predictions.
2. **answer_key.csv** is the ground-truth file that has to be stored on Kaggle.
3. **answer_list.txt** maps each label to its answer in natural language (i.e. label 0 is the answer on line 1 of answer_list.txt, etc.).
4. **image_question.json** maps an image_id to the list of questions about that image. Each question is a tuple (question_id, question). <br/>
To map an image_id to its corresponding image file, do:
```
'COCO_train2014_' + str(image_id).zfill(12) + '.jpg'
```
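
As a minimal sketch of reading the mapping back, the example below round-trips an invented entry through JSON. The helper name `coco_train_filename` and the toy entry are assumptions made for illustration. Note that JSON stores integer keys as strings and tuples as lists.

```python
import json

def coco_train_filename(image_id):
    """Build the COCO train2014 file name for an image id (int or str)."""
    return "COCO_train2014_" + str(image_id).zfill(12) + ".jpg"

# Invented entry shaped like image_question: image_id -> [(question_id, question)].
toy_image_question = {9: [(1000, "What color is the bus?")]}

# Simulate writing the file and loading it again.
restored = json.loads(json.dumps(toy_image_question))

for image_id, questions in restored.items():
    # image_id is now the string "9"; zfill works on it the same way.
    print(coco_train_filename(image_id), questions)
```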



In [18]:
for fn, ids in [["train.csv", train], ["val.csv", val], ["test.csv", test], ["answer_key.csv", test], ["sample_submission.csv", test]]:
  lines = []
  for id in ids:
    # for sample_submission, choose random label
    if 'sample_submission' in fn:
      lines.append(str(id) +','+str(random.randint(0, len(label_question.keys())-1)))
    # for test.csv, dont print label
    elif 'test' in fn:
      lines.append(str(id))
    else:
      lines.append(str(id) +','+str(question_label[id]))
  
  # Close the file before downloading so that its content is fully written
  with open(fn, "w") as f:
    f.write('question_id,label\n'+'\n'.join(map(str, lines)))
  files.download(fn) 


with open("answer_list.txt", "w") as f:
  f.write('\n'.join(map(str, answer_list)))
files.download('answer_list.txt') 

with open('image_question.json', 'w') as f:
  json.dump(image_question, f)
files.download('image_question.json') 

If you have downloaded images, zip the image folder (data/images, or data/images_resized if you resized the images) and download the archive.

In [None]:
!zip -r -q images.zip data/images_resized

In [None]:
files.download("images.zip")


### ENJOY