# Caption for CLEF16 Data

## 1. Data Validation

Verify that each image in the CLEF16 training dataset has a corresponding caption.

In [246]:
import pandas as pd
import json
import os
import re
import csv

IMAGES_CLEF_16_TRAINING = './ImageCLEF/SubfigureClassificationTraining2016'
CAPTIONS_CLEF_16_TRAINING = 'CompoundFigureDetectionTraining2016-Captions.csv'
CAPTIONS_CLEF_16_TEST = 'CompoundFigureDetectionTest2016-Captions.csv'

Obtain the list of image names (**training_ids_16**) from the CLEF16 image folder.

In [243]:
classes = os.listdir(IMAGES_CLEF_16_TRAINING)
training_ids_16 = [imgid for folder in classes for imgid in os.listdir(os.path.join(IMAGES_CLEF_16_TRAINING, folder))]

In [290]:
def match_compound(img_id):
    # True when the name ends in <separator><digits>-<digits>.jpg
    p = re.compile(r'[-.][0-9]*-[0-9]*\.jpg$')
    match = p.search(img_id)
    return match is not None

def get_compound_root(img_id):
    # Obtain the part of the image name without subfigure labeling
    p = re.compile(r'[-.][0-9]*\.jpg$')
    match = p.search(img_id)
    return img_id[:match.start()]

def create_caption_dictionary(csv_file):
    # Map each figure id in csv_file to its caption text
    captions = {}
    with open(csv_file, encoding='ISO-8859-1') as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        for row in reader:
            # the original file has some empty lines
            if len(row) != 0:
                # lines using tab separators
                if len(row) == 1:
                    sp = row[0].split('\t')
                    if len(sp) == 2:
                        # remove the ,,,,,,,, sequences and non-breaking spaces
                        captions[sp[0]] = sp[1].replace(',,', '').replace("\xa0", "")
                    else:
                        # there is at least one case where the id is not present
                        # so the line has only one value
                        captions[sp[0]] = ''
                elif len(row) == 2:
                    captions[row[0]] = row[1]
    return captions

# Tests match_compound
test = '1471-2202-9-58-19-6.jpg'
test2= 'IJBI2010-105610.007-1.jpg'
test3= '11373_2007_9226_Fig1_HTML-16.jpg'
print(match_compound(test))
print(match_compound(test2))
print(match_compound(test3))

True
True
False
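
The tests above only exercise `match_compound`. As a minimal sketch of how the root-extraction step behaves, the block below restates the idea of `get_compound_root` under a different name so it runs on its own. `compound_root` and its fallback for names without a suffix are assumptions added for illustration; the notebook's helper raises an error in that case.

```python
import re

# Trailing subfigure index plus extension, e.g. "-6.jpg" or ".7.jpg".
SUBFIGURE_SUFFIX = re.compile(r'[-.][0-9]*\.jpg$')

def compound_root(img_id):
    """Strip the trailing subfigure index and extension from a file name."""
    match = SUBFIGURE_SUFFIX.search(img_id)
    # Assumption: return the name unchanged when no suffix is found.
    return img_id[:match.start()] if match else img_id

for name in ('1471-2202-9-58-19-6.jpg',
             'IJBI2010-105610.007-1.jpg',
             'DRP2011-927852-001-2.jpg'):
    print(name, '->', compound_root(name))
```

Each root is the string that is later used as the lookup key in the captions dictionary.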


In [291]:
captions_training = create_caption_dictionary(CAPTIONS_CLEF_16_TRAINING)
captions_test = create_caption_dictionary(CAPTIONS_CLEF_16_TEST)

print("There are {num} training subfigures.".format(num=len(training_ids_16)))
print("The captions dictionary has {num} captions".format(num=len(captions_training.keys())))

There are 6776 training subfigures.
The captions dictionary has 20987 captions


In [292]:
num_has_caption = 0
imgs_no_caption = []

for img_id in training_ids_16:    
    try:
        caption = captions_training[get_compound_root(img_id)]
        num_has_caption += 1
    except KeyError:
        imgs_no_caption.append(img_id)
        
print("{num} subfigures have a corresponding caption".format(num=num_has_caption))
print("Only {images} has no captions".format(images=', '.join(imgs_no_caption)))

6775 subfigures have a corresponding caption
Only DRP2011-927852-001-2.jpg has no captions


Searching the dictionary keys for this figure shows the cause: the caption file uses **.001** where the image file name uses **-001**. We add a matching key so that the lookup succeeds.

In [293]:
found = []
for k in captions_training.keys():
    if 'DRP2011-927852' in k:
        found.append(k)
print(found)

captions_training['DRP2011-927852-001'] = captions_training['DRP2011-927852.001']

['DRP2011-927852.001']
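
Adding the key by hand is enough for a single mismatch. If more mismatches of the same kind turned up, a lookup that retries with the last `-` replaced by `.` might be an alternative. The sketch below is an assumption, not part of the original notebook: `lookup_caption` is a hypothetical helper and the caption text in `sample` is a placeholder.

```python
def lookup_caption(captions, root):
    """Return the caption for root, retrying with the final '-' swapped for '.'."""
    if root in captions:
        return captions[root]
    head, sep, tail = root.rpartition('-')
    if sep:
        alternative = head + '.' + tail
        if alternative in captions:
            return captions[alternative]
    # Neither spelling is present, so report the original key.
    raise KeyError(root)

# Placeholder data that mirrors the mismatch found above.
sample = {'DRP2011-927852.001': 'placeholder caption text'}
print(lookup_caption(sample, 'DRP2011-927852-001'))
```

This leaves the dictionary untouched, at the cost of hiding mismatches that might be worth inspecting one by one.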


In [294]:
num_has_caption = 0
imgs_no_caption = []

for img_id in training_ids_16:    
    try:
        caption = captions_training[get_compound_root(img_id)]
        num_has_caption += 1
    except KeyError:
        imgs_no_caption.append(img_id)
        
print("{num} subfigures have a corresponding caption".format(num=num_has_caption))

6776 subfigures have a corresponding caption


Check whether any subfigure has an empty caption.

In [297]:
count_empty = 0
for img_id in training_ids_16:
    if captions_training[get_compound_root(img_id)] == '':
        count_empty += 1
        print(img_id)

if count_empty == 0:
    print("There are no subfigures with empty captions")
else:
    print("There are {num} subfigures with empty captions".format(num=count_empty))

There are no subfigures with empty captions


## Captions with spatial indicators

Captions in compound figures can use the spatial position of a pane to tie it to the related text. In the example below (left), the top-left image is referenced with the text **(A, top left)**, but the image itself carries no **A** label. Because the extracted subfigure (image below, right) shows no **A** label, and its position in the original compound image cannot be recovered from the subfigure alone, there is no way to relate only the relevant part of the caption to it.

It is worth noting that for this particular image the whole caption may be relevant to every pane, because the figure is homogeneous. For instance, **3-D** is mentioned both before any pane-specific caption and within the pane captions.

*Image on the left taken from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1508147/*

![10.1186/1472-6807-6-9 Figure 6](./samples/PMC1508147.png)
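
One way to flag captions that rely on spatial indicators is a simple keyword search. The sketch below is a heuristic under stated assumptions: the term list is illustrative and not exhaustive, `find_spatial_indicators` is a hypothetical helper, and the sample caption is made up rather than taken from the dataset.

```python
import re

# Hypothetical, non-exhaustive list of positional terms.
SPATIAL_TERMS = re.compile(
    r'\b(?:(?:top|upper|bottom|lower|middle|center)'
    r'(?:[ -](?:left|right|middle|center))?|left|right)\b',
    re.IGNORECASE,
)

def find_spatial_indicators(caption):
    """Return the positional terms found in a caption, lower-cased."""
    return [m.group(0).lower() for m in SPATIAL_TERMS.finditer(caption)]

# Made-up caption that imitates the "(A, top left)" pattern.
sample = "Overview of the model (A, top left) and a close-up of the site (B, right)."
print(find_spatial_indicators(sample))
```

A search like this will also match anatomical or directional uses of words such as "left" and "right", so the hits are candidates for manual review rather than a reliable classification.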