<a href="https://colab.research.google.com/github/mauro-nievoff/MultiCaRe_Dataset/blob/main/Dataset_Creation_Process/4_Turning_Captions_into_Image_Labels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turning Image Captions into Structured Data

The text found in image captions can be used to create labels for the corresponding images. This is done in two steps:

1. Caption Pre-Processing
2. Caption Labeling using Annotated N-Grams

## 1. Caption Pre-Processing

The main purpose of this step is to split captions that contain references to different images, and then assign each part of the caption to the correct referenced image. Let's take a look at the sample caption below:

In [1]:
sample_caption = '''Brain CT scan. There is a mass in the frontal lobe (A-C) and an intracerebral hemorrhage in the right parietotemporal lobe (C and D).'''

This caption has three parts:
- `Brain CT scan.`: Initial statement without explicit references. This part of the caption refers to all the parts of the image (A to D).
- `There is a mass in the frontal lobe`: A statement with a range reference (A-C). It refers to the image parts A, B and C.
- `and an intracerebral hemorrhage in the right parietotemporal lobe`: This statement refers to the image parts C and D.

### Secondary Functions

The `classify_chunks()` function splits a given text into smaller pieces (chunks) and classifies each chunk as 'reference' (a single letter from A to L), 'split' (separators such as commas, dots, parentheses, hyphens, 'and' or 'to') or 'other' (any other chunk).

In [2]:
import re

In [3]:
def classify_chunks(text):

  # The capturing group keeps the separators in the output of re.split().
  # Note that '/' is also a split point, although it is classified as 'other' below.
  split_text = re.split(r'([;:./(/),]|-| and | to )', text)
  reference_tokens = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']

  chunk_dicts = []
  for chunk in split_text:
    # Surrounding whitespace is ignored for classification, but kept in the stored chunk.
    if chunk.strip() in reference_tokens:
      chunk_dicts.append({'chunk': chunk, 'token_type': 'reference'})
    elif chunk.strip() in [';', ':', '(', ')', ',', '.', 'and', 'to', '-']:
      chunk_dicts.append({'chunk': chunk, 'token_type': 'split'})
    else:
      chunk_dicts.append({'chunk': chunk, 'token_type': 'other'})
  return chunk_dicts

In [4]:
chunk_dicts = classify_chunks(sample_caption)
chunk_dicts[:10]

[{'chunk': 'Brain CT scan', 'token_type': 'other'},
 {'chunk': '.', 'token_type': 'split'},
 {'chunk': ' There is a mass in the frontal lobe ', 'token_type': 'other'},
 {'chunk': '(', 'token_type': 'split'},
 {'chunk': 'A', 'token_type': 'reference'},
 {'chunk': '-', 'token_type': 'split'},
 {'chunk': 'C', 'token_type': 'reference'},
 {'chunk': ')', 'token_type': 'split'},
 {'chunk': '', 'token_type': 'other'},
 {'chunk': ' and ', 'token_type': 'split'}]
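
The empty chunk in the output above comes from how `re.split` behaves when the pattern is wrapped in a capturing group: the matched separators are returned as items of their own, and two adjacent separators produce an empty string between them. The snippet below is a minimal illustration of that behavior. It uses a simplified pattern and a made-up caption, not the full pattern from `classify_chunks()`.

```python
import re

text = 'mass (A-C) and edema (D)'  # made-up caption, for illustration only

# Without a capturing group, the separators are discarded.
without_group = re.split(r'[()]|-| and ', text)

# With a capturing group, the separators are kept as separate items,
# so every piece of the original text can still be classified.
with_group = re.split(r'([()]|-| and )', text)

print(without_group)
print(with_group)

# Because nothing is discarded, joining the pieces restores the original text.
print(''.join(with_group) == text)
```

Keeping the separators is what allows `concat_chunks()` to rebuild the caption and reference strings from the classified chunks.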

Those chunks are then concatenated depending on their types using `concat_chunks()`. As a result, the original text is split into strings that are classified as 'caption' (image description) or 'reference' (e.g. '(A-C)').

In [5]:
def concat_chunks(chunk_dicts):

  caption_sections = []
  section_string = ''
  reference_string = ''

  for i, chunk in enumerate(chunk_dicts):
    if chunk['token_type'] == 'other':
      if reference_string != '':
        caption_sections.append({'string': reference_string, 'type': 'reference'})
        reference_string = ''
      section_string += chunk['chunk']
    elif chunk['token_type'] == 'split':
      if reference_string != '':
        reference_string += chunk['chunk']
      else:
        section_string += chunk['chunk']
    elif chunk['token_type'] == 'reference':
      if section_string != '':
        caption_sections.append({'string': section_string, 'type': 'caption'})
        section_string = ''
      reference_string += chunk['chunk']

  if reference_string:
    caption_sections.append({'string': reference_string, 'type': 'reference'})
  if section_string:
    caption_sections.append({'string': section_string, 'type': 'caption'})

  return caption_sections

In [6]:
caption_sections = concat_chunks(chunk_dicts)
caption_sections

[{'string': 'Brain CT scan. There is a mass in the frontal lobe (',
  'type': 'caption'},
 {'string': 'A-C)', 'type': 'reference'},
 {'string': ' and an intracerebral hemorrhage in the right parietotemporal lobe (',
  'type': 'caption'},
 {'string': 'C and D)', 'type': 'reference'},
 {'string': '.', 'type': 'caption'}]

If any reference range is present in the text (e.g. 'B-E' or 'B to E'), `expand_ranges()` will create a `tidy_refs` key including a list with all the references included in the range (e.g. B, C, D, E).

In [7]:
def expand_ranges(caption_sections):
  # Turn range references (such as 'A-D') into list references (such as 'A, B, C, D').
  pattern_1 = r'(,|;| and )'
  pattern_2 = r'(-| to )'
  list_of_letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']

  for dct in caption_sections:
    if dct['type'] == 'reference':
      dct['tidy_refs'] = []
      refs = re.split(pattern_1, dct['string'])
      for element in refs:
        if element != ' and ':
          consecutive_refs = re.split(pattern_2, element)
          if len(consecutive_refs) == 1:
            dct['tidy_refs'].append(consecutive_refs[0].strip())
          if len(consecutive_refs) > 1:
            range_start = re.sub(r'[^A-Z]', '', consecutive_refs[0].strip())
            range_end = re.sub(r'[^A-Z]', '', consecutive_refs[-1].strip())

            if (range_start in list_of_letters) and (range_end in list_of_letters):
              dct['tidy_refs'].append(range_start)
              reduced_list_of_letters = list_of_letters[list_of_letters.index(range_start)+1:list_of_letters.index(range_end)]
              for letter in reduced_list_of_letters:
                dct['tidy_refs'].append(letter)
              dct['tidy_refs'].append(range_end)
  return caption_sections

In [8]:
caption_sections = expand_ranges(caption_sections)
caption_sections

[{'string': 'Brain CT scan. There is a mass in the frontal lobe (',
  'type': 'caption'},
 {'string': 'A-C)', 'type': 'reference', 'tidy_refs': ['A', 'B', 'C']},
 {'string': ' and an intracerebral hemorrhage in the right parietotemporal lobe (',
  'type': 'caption'},
 {'string': 'C and D)', 'type': 'reference', 'tidy_refs': ['C', 'D)']},
 {'string': '.', 'type': 'caption'}]
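
The idea behind range expansion can also be expressed with character codes. The sketch below is an alternative illustration, not the notebook's implementation: `expand_range` is a hypothetical helper, and it assumes that a reference string contains either a single range or a plain list of single uppercase letters. Unlike `expand_ranges()`, it is not limited to the letters A to L.

```python
import re

def expand_range(reference):
    """Expand 'B-E' or 'B to E' into ['B', 'C', 'D', 'E'] (hypothetical helper)."""
    # Collect standalone uppercase letters; this ignores characters such as ')'.
    letters = re.findall(r'\b[A-Z]\b', reference)
    is_range = re.search(r'-| to ', reference) is not None
    if is_range and len(letters) == 2 and letters[0] <= letters[1]:
        start, end = letters
        # Walk through the character codes from the start to the end letter.
        return [chr(code) for code in range(ord(start), ord(end) + 1)]
    # Not a range: return the letters as they were listed.
    return letters

print(expand_range('B to E'))    # ['B', 'C', 'D', 'E']
print(expand_range('C and D)'))  # ['C', 'D']
```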

### Main Pre-Processing Function

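The `preprocess_caption()` function uses the secondary functions to turn a caption into a dataframe that maps each reference to its corresponding caption text. Leftover characters in a reference (such as the `)` in `'D)'` above) are stripped when references are mapped to captions.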

In [9]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [10]:
def preprocess_caption(caption_paragraph):

  organized_captions = []
  last_ref = ['common_string']
  for sentence in caption_paragraph.split('.'):
    chunk_dicts = classify_chunks(sentence)
    caption_sections = concat_chunks(chunk_dicts)
    caption_sections = expand_ranges(caption_sections)
    refs = [ref for ref in caption_sections if ref['type'] == 'reference']
    caps = [ref for ref in caption_sections if ref['type'] == 'caption']

    # How captions are assigned to references depends on the number of reference and caption strings, and on their order.
    if (len(refs) == 0):
      organized_captions.append({'sentence': sentence, 'reference':last_ref})
    elif (len(refs) == 1):
      last_ref = refs[-1]['tidy_refs']
      organized_captions.append({'sentence': sentence, 'reference':last_ref})
    else:
      if len(refs) == len(caps):
        for i, r in enumerate(refs):
          last_ref = refs[i]['tidy_refs']
          organized_captions.append({'sentence': caps[i]['string'], 'reference': r['tidy_refs']})
      else:
        for i, c in enumerate(caps):
          if i != len(caps)-1:
            split_c = re.split(r'(,|;| and )', c['string'])
            last_ref = refs[i]['tidy_refs']
            if (len(split_c) == 1) or (i==0):
              organized_captions.append({'sentence': c['string'], 'reference': refs[i]['tidy_refs']})
            else:
              organized_captions.append({'sentence': ','.join(split_c[:-1]), 'reference': refs[i-1]['tidy_refs']})
              organized_captions.append({'sentence': split_c[-1], 'reference': refs[i]['tidy_refs']})
          else:
            organized_captions.append({'sentence': c['string'], 'reference': last_ref})

  # A list of all the present references is created.
  references = []
  for c in organized_captions:
    for r in c['reference']:
      if (r != 'common_string') and (r not in references):
        references.append(r)

  # Mapping references to captions
  mapping_dicts = []
  if references:
    for ref in references:
      r = re.sub(r'[^A-Z]', '', ref) # Special characters are removed from references.
      if r:
        reference_caption = '.'.join([c['sentence'] for c in organized_captions if ((c['reference'] == ['common_string']) or (ref in c['reference']))]) # Strings from the same reference are joined.
        mapping_dicts.append({'reference': r, 'caption': reference_caption})
  else:
    mapping_dicts.append({'reference': 'undivided_caption', 'caption': '. '.join([c['sentence'] for c in organized_captions])}) # In case no split is necessary for a specific caption.

  caption_df = pd.DataFrame(mapping_dicts)
  return caption_df

In [47]:
caption_df = preprocess_caption(sample_caption)

In [48]:
caption_df

| | reference | caption |
|---|---|---|
| 0 | A | Brain CT scan. There is a mass in the frontal lobe ( |
| 1 | B | Brain CT scan. There is a mass in the frontal lobe ( |
| 2 | C | Brain CT scan. There is a mass in the frontal lobe (. and an intracerebral hemorrhage in the right parietotemporal lobe (. |
| 3 | D | Brain CT scan. and an intracerebral hemorrhage in the right parietotemporal lobe (. |


### String Post-Processing

The caption was split and each part was assigned to the correct reference. The split leaves some stray characters behind, such as a trailing `(` or a dangling `(.`, so the `postprocess_string()` function is used to remove them.

In [39]:
def postprocess_string(string):

  # Replace any '!' that is immediately followed by a digit with '.'.
  output_string = ""
  for i in range(len(string)):
    if i < len(string) - 1 and string[i+1].isdigit() and string[i] == '!':
      output_string += '.'
    else:
      output_string += string[i]

  # Repeat the clean-up replacements until the string stops changing.
  new_string = ''
  while new_string != output_string:
    new_string = output_string
    output_string = output_string.replace(' , and ', '').replace('(. ', '').replace('(.', '').replace(',,', ',').replace(',;', '').replace('..', '.').replace('  ', '. ').strip()

  # Trim leftover characters at both ends.
  if output_string.startswith('. ') or output_string.startswith('( ') or output_string.startswith(') '):
    output_string = output_string[2:]
  if output_string.endswith(' ('):
    output_string = output_string[:-2]

  # Make sure the caption ends with a period.
  if not output_string.endswith('.'):
    output_string += '.'

  return output_string

In [49]:
caption_df['caption'] = caption_df.caption.apply(postprocess_string)

In [50]:
caption_df

| | reference | caption |
|---|---|---|
| 0 | A | Brain CT scan. There is a mass in the frontal lobe. |
| 1 | B | Brain CT scan. There is a mass in the frontal lobe. |
| 2 | C | Brain CT scan. There is a mass in the frontal lobe and an intracerebral hemorrhage in the right parietotemporal lobe. |
| 3 | D | Brain CT scan. and an intracerebral hemorrhage in the right parietotemporal lobe. |
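
`postprocess_string()` runs its chain of replacements inside a `while` loop because a single pass of `str.replace` does not always reach a stable result: replacing `',,'` with `','` in a run of four commas still leaves two commas behind. The sketch below isolates that "repeat until nothing changes" idiom. The helper name `replace_until_stable` and the input string are made up for illustration.

```python
def replace_until_stable(text, replacements):
    """Apply (old, new) replacements repeatedly until the text stops changing."""
    previous = None
    while previous != text:
        previous = text
        for old, new in replacements:
            text = text.replace(old, new)
    return text

single_pass = 'a,,,,b'.replace(',,', ',')
stable = replace_until_stable('a,,,,b', [(',,', ',')])

print(single_pass)  # a,,b
print(stable)       # a,b
```

The loop terminates here because each replacement shortens the string. A replacement that lengthens the string and recreates its own pattern would loop forever, so the idiom needs some care when new rules are added.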


In [43]:
caption_df.to_csv('captions.csv', index = False)

## 2. Caption Labeling using Annotated N-Grams

In this step, each image is assigned labels based on the content of its caption. First, a list of all the unique sequences of words or tokens (n-grams) present in the captions was created. This list was then manually annotated in a spreadsheet, assigning labels from the dataset taxonomy to each n-gram. The annotated file (included in the `multiversity` package) is used to label each caption according to the n-grams it contains (e.g., the label 'ct' is assigned to captions containing the n-gram 'computed tomography').

In [None]:
!pip install multiversity

In [14]:
import importlib.resources as pkg_resources

In [21]:
annotated_ngrams_path = (pkg_resources.files("multiversity.data") / "annotated_ngrams.csv").as_posix()

In [22]:
annotated_ngrams = pd.read_csv(annotated_ngrams_path)

In [23]:
annotated_ngrams[~annotated_ngrams.label_1.isna()].head(10)

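| | count | n | ngram | label_1 | label_2 | label_3 |
|---|---|---|---|---|---|---|
| 10 | 197630 | 1 | left | left | | |
| 11 | 195024 | 1 | right | right | | |
| 17 | 142115 | 1 | ct | ct | | |
| 25 | 98935 | 1 | cells | pathology | | |
| 34 | 83670 | 1 | mri | mri | | |
| 37 | 81441 | 1 | tomography | ct | | |
| 41 | 8855 | 1 | metastatic | malignant | | |
| 60 | 55521 | 3 | computed tomography | ct | | |
| 61 | 55017 | 1 | contrast | contrast | | |
| 69 | 48359 | 1 | no | _assertion_absent | | |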
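
To make the lookup idea concrete, the sketch below labels a caption with a toy mapping built from a few of the rows shown above. It is a simplified illustration under stated assumptions, not the `CaptionLabeler` implementation: the function `label_caption`, the whole-word matching rule and the sample caption are assumptions, and the real class may tokenize text and handle labels such as `_assertion_absent` differently.

```python
import re

# Toy subset of the annotated n-grams (n-gram -> label), for illustration only.
ngram_labels = {
    'computed tomography': 'ct',
    'ct': 'ct',
    'mri': 'mri',
    'right': 'right',
    'left': 'left',
}

def label_caption(caption, ngram_labels):
    """Return the labels whose n-gram occurs as whole words in the caption."""
    normalized = caption.lower()
    labels = []
    for ngram, label in ngram_labels.items():
        # Word boundaries prevent matches inside longer words.
        pattern = r'\b' + re.escape(ngram) + r'\b'
        if re.search(pattern, normalized) and label not in labels:
            labels.append(label)
    return labels

labels = label_caption('Computed tomography of the right lung.', ngram_labels)
print(labels)  # ['ct', 'right']
```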


To assign the labels to the captions, the `CaptionLabeler` class is used.

In [24]:
from multiversity.multi_labeler import CaptionLabeler

In [44]:
cl = CaptionLabeler(caption_csv_path = 'captions.csv', annotated_ngrams_csv_path = annotated_ngrams_path)

In [45]:
cl.caption_df

| | reference | caption | label_list |
|---|---|---|---|
| 0 | A | Brain CT scan. There is a mass in the frontal lobe. | [head, ct, mass] |
| 1 | B | Brain CT scan. There is a mass in the frontal lobe. | [head, ct, mass] |
| 2 | C | Brain CT scan. There is a mass in the frontal lobe and an intracerebral hemorrhage in the right parietotemporal lobe. | [right, head, ct, mass] |
| 3 | D | Brain CT scan. and an intracerebral hemorrhage in the right parietotemporal lobe. | [right, head, ct] |


In cases where the `label_list` included incompatible labels, those labels were removed during postprocessing.
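
The notebook does not show how incompatible labels are detected, so the following is only a sketch of one possible approach. Everything in it is an assumption made for illustration: the `EXCLUSIVE_GROUPS` list, the choice of 'ct' and 'mri' as a conflicting pair, and the `drop_incompatible` helper are hypothetical and are not taken from the `multiversity` package.

```python
# Hypothetical groups of mutually exclusive labels (an assumption, not the real taxonomy).
EXCLUSIVE_GROUPS = [{'ct', 'mri'}]

def drop_incompatible(label_list, exclusive_groups=EXCLUSIVE_GROUPS):
    """Remove every label that conflicts with another label from the same group."""
    conflicting = set()
    for group in exclusive_groups:
        present = group.intersection(label_list)
        # More than one label from an exclusive group means the labels contradict each other.
        if len(present) > 1:
            conflicting |= present
    return [label for label in label_list if label not in conflicting]

print(drop_incompatible(['head', 'ct', 'mri', 'mass']))  # ['head', 'mass']
print(drop_incompatible(['head', 'ct', 'mass']))         # ['head', 'ct', 'mass']
```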

After this, the MultiCaRe Classifier was used to obtain ML labels for each image given its path (see the [MultiCaReClassifier folder](https://github.com/mauro-nievoff/MultiCaRe_Dataset/tree/main/MultiCaReClassifier) in this repository for more details).