<a href="https://colab.research.google.com/github/mauro-nievoff/MultiCaRe_Dataset/blob/main/2_Data_Extraction_from_PMC's_Case_Reports.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Extraction from PMC's Case Reports

In this notebook, we will discuss data extraction from a given case report article using the BioC API from PMC. In particular, we will cover how to:
1. Retrieve the Article Metadata
2. Extract Clinical Cases from the Article Content
3. Get Demographic Information from Cases
4. Rearrange the Data to get the Final Outcome

## 1. Article Metadata

### Getting all ID Types for a given PMID

`get_id_mapping()` gets the PMCID and other relevant IDs corresponding to a given PMID by using Biopython.

PMIDs are retrieved first and then mapped to the corresponding PMCIDs, instead of searching for PMCIDs directly with PMC's API, because PubMed's search engine is better suited to the task than PMC's.

In [None]:
%%capture
!pip install biopython

In [None]:
from Bio import Entrez

In [None]:
Entrez.email = "your@email.com"
Entrez.api_key = "your_api_key"

In [None]:
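def get_id_mapping(pmid):
  # IdType values used by PubMed are renamed to the key names used in this notebook.
  id_type_names = {'pubmed': 'pmid', 'pmc': 'pmcid'}
  try:
    pmid_handle = Entrez.efetch(db="pubmed", id=pmid, retmode="xml")
    pmid_record = Entrez.read(pmid_handle)
    pmid_handle.close()
    id_list = pmid_record['PubmedArticle'][0]['PubmedData']['ArticleIdList']
    mapping = {id_type_names.get(e.attributes['IdType'], e.attributes['IdType']): str(e) for e in id_list}
    return mapping
  except Exception:
    # Any failure (network error, unknown PMID, unexpected record structure) is reported in the output.
    return {"pmid": pmid, "comment": 'Mapping error.'}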

The following article will be used for all the examples from this notebook: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6744365/.

In [None]:
pmid = '31534908'
id_dict = get_id_mapping(pmid)
id_dict

{'pmid': '31534908',
 'pmcid': 'PMC6744365',
 'doi': '10.1016/j.idcr.2019.e00633',
 'pii': 'S2214-2509(19)30123-4'}
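
Because `get_id_mapping()` returns a dict with a `comment` key instead of raising an error when the mapping fails, it can help to check the result before passing it on. The helper below is only a minimal sketch of such a check; `has_pmcid()` is a hypothetical name that is not part of the original notebook.

```python
def has_pmcid(mapping):
  # True only when the mapping holds a PMCID string such as 'PMC6744365'.
  pmcid = mapping.get('pmcid')
  return isinstance(pmcid, str) and pmcid.startswith('PMC')

ok_mapping = {'pmid': '31534908', 'pmcid': 'PMC6744365'}
failed_mapping = {'pmid': '31534908', 'comment': 'Mapping error.'}

print(has_pmcid(ok_mapping))      # True
print(has_pmcid(failed_mapping))  # False
```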

### Getting Article Metadata

`get_article_metadata()` retrieves the metadata of the article identified in `id_dict`, including all the data points needed for citation, MeSH terms and keywords.

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
def get_article_metadata(id_dict):

  # Creating article_soup using requests and BeautifulSoup.
  article = id_dict['pmcid']
  url = f"https://pubmed.ncbi.nlm.nih.gov/?term={article}&format=abstract&size=10"
  response = requests.get(url)
  article_soup = BeautifulSoup(response.content, 'html.parser')

  # The metadata is retrieved from the id_dict and the article_soup.
  article_dict = {}
  article_dict['doi'] = id_dict.get('doi')
  article_dict['pmid'] = id_dict.get('pmid')
  article_dict['pmcid'] = id_dict.get('pmcid')
  article_dict['title'] = article_soup.find("h1", class_="heading-title").text.replace('\n', '').strip()
  article_dict['journal'] = article_soup.find("div", class_="journal-actions dropdown-block").text.split("\n")[1].strip()
  article_dict['journal_detail'] = article_soup.find("span", class_="cit").text
  article_dict['year'] = article_dict['journal_detail'][:4]
  article_dict['link'] = article_soup.find('meta', {'name': 'citation_abstract_html_url'})['content']

  # Getting the list of authors.
  article_dict['authors'] = []
  divs = article_soup.find_all('div', {'class': 'authors-list'})
  for div in divs:
    name_list = div.find_all('a', {'class': 'full-name'})
    for name in name_list:
      if name.text not in article_dict['authors']:
        article_dict['authors'].append(name.text)

  # Getting MeSH terms and keywords.
  article_dict['major_mesh_terms'] = [term.text.replace('\n', '').strip().replace('*', '') for term in article_soup.find_all('button', class_='keyword-actions-trigger trigger keyword-link major')]
  article_dict['minor_mesh_terms'] = [term.text.replace('\n', '').strip() for term in article_soup.find_all('button', class_='keyword-actions-trigger trigger keyword-link')]
  article_dict['mesh_terms'] = article_dict['major_mesh_terms'] + article_dict['minor_mesh_terms']

  # Keywords default to None, so the key exists even when the page has no 'Keywords:' paragraph.
  article_dict['keywords'] = None
  try:
    for tag in article_soup.find_all('p'):
      strong_tag = tag.find('strong', {'class': 'sub-title'})
      if strong_tag and strong_tag.text.strip() == 'Keywords:':
        article_dict['keywords'] = [kw.strip().lower() for kw in tag.text.split(':')[1].replace('\n', '').replace('.', '').split(';')]
  except Exception:
    article_dict['keywords'] = None

  return article_dict

In [None]:
article_metadata = get_article_metadata(id_dict)

Example of article metadata:

In [None]:
article_metadata

{'doi': '10.1016/j.idcr.2019.e00633',
 'pmid': '31534908',
 'pmcid': 'PMC6744365',
 'title': 'Ceftriaxone use in brucellosis: A case series',
 'journal': 'IDCases',
 'journal_detail': '2019 Sep 5:18:e00633.',
 'year': '2019',
 'link': 'https://pubmed.ncbi.nlm.nih.gov/31534908/',
 'authors': ['Daniah F Fatani',
  'Walaa A Alsanoosi',
  'Mazen A Badawi',
  'Abrar K Thabit'],
 'major_mesh_terms': [],
 'minor_mesh_terms': ['Case Reports'],
 'mesh_terms': ['Case Reports'],
 'keywords': ['brucella',
  'brucellosis',
  'cases',
  'ceftriaxone',
  'saudi arabia',
  'zoonotic infection']}
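
The `year` field is taken from the first four characters of `journal_detail`, which assumes that the citation string always starts with the year. As a hedged alternative, a regular expression could look for a four-digit year anywhere in the string. `extract_year()` is a hypothetical helper written for illustration, not part of the original notebook.

```python
import re

def extract_year(journal_detail):
  # Returns the first 4-digit number starting with 19 or 20, or None if there is none.
  match = re.search(r'\b(19|20)\d{2}\b', journal_detail)
  return match.group() if match else None

print(extract_year('2019 Sep 5:18:e00633.'))  # 2019
```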

## 2. Clinical Cases

### Useful Functions for Case Retrieval

`merge_paragraphs()` merges paragraphs from the article content when they refer to the same case. All consecutive paragraphs are considered to belong to the same case until a new age or gender mention appears, because those mentions usually come right at the beginning of a case. If any paragraph mentions an age, only age mentions are used to split the cases; gender mentions are used otherwise.

In [None]:
import re

In [None]:
def merge_paragraphs(paragraph_list):

  cases = []
  new_case = []
  age_pattern = r'[\s-](year|yr|month|day)s?(\s?-+|-?\s+)old'
  gender_pattern = r' (man|woman|male|female|gentleman|lady|boy|girl|child|baby|patient)( |,)'

  ## Defining split pattern to use (if age is mentioned in text, gender pattern should be ignored).

  n = 0
  for paragraph in paragraph_list:
    age_mention = re.search(age_pattern, paragraph.lower())
    if (age_mention is not None):
      n += 1

  if n > 0:
    split_pattern = age_pattern
  else:
    split_pattern = gender_pattern

  for idx, paragraph in enumerate(paragraph_list):
    relevant_mention = re.search(split_pattern, paragraph.lower())
    if (new_case != []) and (relevant_mention is not None):
      cases.append('\n'.join(new_case))
      new_case = [paragraph]
    elif (new_case == []) and (relevant_mention is not None):
      new_case.append(paragraph)
    elif (new_case != []) and (relevant_mention is None):
      new_case.append(paragraph)
    if (idx == len(paragraph_list) -1) and new_case:
      cases.append('\n'.join(new_case))
  return cases
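
The splitting rule can be illustrated on a toy list of paragraphs. The block below is a simplified sketch that only uses the age pattern; `group_by_age_mention()` and the toy paragraphs are made up for illustration and are not part of the original notebook.

```python
import re

AGE_PATTERN = r'[\s-](year|yr|month|day)s?(\s?-+|-?\s+)old'

def group_by_age_mention(paragraphs):
  # A new case starts at every paragraph that mentions an age.
  # Paragraphs that come before the first age mention are discarded.
  cases, current = [], []
  for paragraph in paragraphs:
    starts_case = re.search(AGE_PATTERN, paragraph.lower()) is not None
    if starts_case and current:
      cases.append('\n'.join(current))
      current = []
    if starts_case or current:
      current.append(paragraph)
  if current:
    cases.append('\n'.join(current))
  return cases

toy_paragraphs = ["Introductory remarks.",
                  "A 25-year-old man presented with headache.",
                  "Imaging was unremarkable.",
                  "A 40 year old woman had fever."]

toy_cases = group_by_age_mention(toy_paragraphs)
print(len(toy_cases))  # 2
```

The second paragraph opens the first case and the third one is appended to it, while the fourth paragraph opens a new case.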

`get_cases()` is a function used to get the cases from the article content. Such case texts can be included in specific 'CASE' sections (option 1), in other sections with 'case' in the header (option 2), or in other paragraphs which include age mentions (option 3).

In [None]:
def get_cases(soup):

  ### Option 1: Articles with specific section type ('CASE'):

  paragraphs = [case_section.parent.find('text').text for case_section in soup.find_all('infon', {'key': 'section_type'}, string = 'CASE') if (case_section.find('infon', {'key': 'type'}, string = 'paragraph')) or (case_section.parent.find('infon', {'key': 'type'}, string = 'paragraph'))]
  if paragraphs:
    return merge_paragraphs(paragraphs)
  elif paragraphs == []:

    ### Option 2: Articles with titles or subtitles including the word 'case':
    case_titles = [case_section for case_section in soup.find_all('infon', {'key': 'type'}, string = re.compile(r'(?i)title')) if case_section.parent.find('text', string = re.compile(r'(?i)case'))]
    cases = []
    if case_titles:
      for title in case_titles:
        title_case = []
        next_sibling = title.parent
        while True:
          next_sibling = next_sibling.find_next_sibling("passage")
          # When there are no more passages, the loop ends instead of raising an AttributeError.
          if next_sibling is None:
            break
          paragraphs = next_sibling.find_all("infon", {'key': 'type'}, string = 'paragraph')
          next_title = next_sibling.find_all('infon', {'key': 'type'}, string = re.compile(r'(?i)title'))
          if paragraphs:
            title_case.append('\n'.join([p.parent.find('text').text for p in paragraphs]))
          elif next_title:
            cases.append('\n'.join(title_case))
            title_case = []
            break
      if title_case:
        cases.append('\n'.join(title_case))
      if cases == ['']:
        cases = []

    if cases:
      return cases
    else:

      ### Option 3: Articles with paragraphs that mention ages.
      age_pattern = r'[\s-](year|yr|month|day)s?[\s-]old'
      sections = [s.parent.find('infon', {'key': 'section_type'}).text for s in soup.find_all('infon', {'key': 'type'}, string = 'paragraph') if s.parent.find('text', string = re.compile(age_pattern))]
      try:
        case_section = [s for s in sections if s.lower() != 'abstract'][0]
        paragraphs = [s.parent.find('text').text for s in soup.find_all('infon', {'key': 'type'}, string = 'paragraph') if s.parent.find('infon', {'key': 'section_type'}, string = case_section)]
        return merge_paragraphs(paragraphs)
      except Exception:
        return []
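
Option 1 relies on how BioC XML tags each passage with `infon` elements. The sketch below shows the idea on a small hand-written snippet (not real API output), and it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra dependencies. `case_paragraphs()` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified BioC-like snippet used only for illustration.
BIOC_SNIPPET = """<collection><document><id>0000000</id>
<passage><infon key="section_type">TITLE</infon><infon key="type">front</infon><text>A toy title</text></passage>
<passage><infon key="section_type">CASE</infon><infon key="type">title_1</infon><text>Case presentation</text></passage>
<passage><infon key="section_type">CASE</infon><infon key="type">paragraph</infon><text>A 30-year-old woman presented with fever.</text></passage>
</document></collection>"""

def case_paragraphs(xml_text):
  # Keeps the text of passages tagged as paragraphs within a 'CASE' section.
  root = ET.fromstring(xml_text)
  texts = []
  for passage in root.iter('passage'):
    infons = {infon.get('key'): infon.text for infon in passage.findall('infon')}
    if infons.get('section_type') == 'CASE' and infons.get('type') == 'paragraph':
      texts.append(passage.findtext('text'))
  return texts

print(case_paragraphs(BIOC_SNIPPET))
```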

`merge_captions()` is a function used to merge caption dicts that refer to the same image.

In [None]:
def merge_captions(caption_dicts):
  file_list = []
  for dct in caption_dicts:
    if dct['file'] not in file_list:
      file_list.append(dct['file'])

  new_dct_list = []
  for file_ in file_list:
    new_dict = {'tag': [t['tag'] for t in caption_dicts if t['file'] == file_][0],
                'caption': '. '.join([t['caption'] for t in caption_dicts if t['file'] == file_]).strip(),
                'file': file_,
                'caption_order': [t['caption_order'] for t in caption_dicts if t['file'] == file_][0]}
    new_dct_list.append(new_dict)

  ### Fixing caption_order after merging captions:

  order_numbers = []
  for dct in new_dct_list:
    order_numbers.append(dct['caption_order'])

  order_numbers.sort()

  for dct in new_dct_list:
    dct['caption_order'] = order_numbers.index(dct['caption_order'])+1

  return new_dct_list
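
As an illustration of the merging and renumbering steps, the block below applies the same idea to toy caption dicts. It is a compact sketch rather than the notebook's function: `merge_by_file()` and the toy data are hypothetical.

```python
def merge_by_file(caption_dicts):
  # Captions that share the same file are joined with '. ' (the first tag and order are kept).
  merged = {}
  for dct in caption_dicts:
    if dct['file'] not in merged:
      merged[dct['file']] = dict(dct)
    else:
      merged[dct['file']]['caption'] += '. ' + dct['caption']
  result = list(merged.values())

  # caption_order is renumbered so that it is consecutive after merging.
  original_orders = sorted(d['caption_order'] for d in result)
  for d in result:
    d['caption_order'] = original_orders.index(d['caption_order']) + 1
  return result

toy_captions = [{'tag': 'F1', 'caption': 'Panel A', 'file': 'a.jpg', 'caption_order': 1},
                {'tag': 'F1', 'caption': 'Panel B', 'file': 'a.jpg', 'caption_order': 2},
                {'tag': 'F2', 'caption': 'CT scan', 'file': 'b.jpg', 'caption_order': 3}]

merged_captions = merge_by_file(toy_captions)
print(merged_captions)
```

The two captions of `a.jpg` become a single entry, and `b.jpg` moves from order 3 to order 2.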

`explode_list()` is a function used to flatten lists of cases.

In [None]:
def explode_list(lst):
  new_list = []
  for element in lst:
    if type(element) != list:
      new_list.append(element)
    else:
      for e in element:
        new_list.append(e)
  return new_list
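
The same one-level flattening can also be written as a comprehension. This is just an equivalent sketch under the assumption that nested elements are plain lists; `flatten_one_level()` is a hypothetical name.

```python
def flatten_one_level(items):
  # Elements that are lists are unpacked; any other element is kept as it is.
  return [e for item in items for e in (item if isinstance(item, list) else [item])]

print(flatten_one_level(['case a', ['case b', 'case c'], 'case d']))
```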

### Case Retrieval

`get_case_report()` is a function that gets the relevant text from the content of a given article. Apart from the clinical cases, it returns the number of cases, the image captions present in the article, the article license and the keywords.

In [None]:
def get_case_report(article_id):

  url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{article_id}/ascii"
  r = requests.get(url)
  if r.status_code != 200:
    return None  # The article could not be retrieved from the BioC API.
  else:
    soup = BeautifulSoup(r.content, 'xml')
    article_dict = {}
    article_dict['pmcid'] = 'PMC' + soup.find('id').text
    article_dict['title'] = soup.find('infon', {'key': 'section_type'}, string = 'TITLE').parent.find('text').text

    # Getting the clinical cases.
    article_dict['cases'] = get_cases(soup)
    article_dict['cases'] = explode_list(article_dict['cases'])
    article_dict['cases'] = merge_paragraphs(article_dict['cases'])
    article_dict['case_amount'] = len(article_dict['cases'])

    # Getting caption data, including the actual caption, the name of the file and the tag given in the article for that image.
    article_dict['caption_dicts'] = [{'tag': caption_soup.parent.find('infon', {'key':'id'}).text,
                                      'caption': caption_soup.parent.find('text').text,
                                      'file': caption_soup.parent.find('infon', {'key': 'file'}).text,
                                      'caption_order': i+1} for i, caption_soup in enumerate(soup.find_all('infon', {'key': 'type'}, string = re.compile(r'fig_')))]
    article_dict['caption_dicts'] = merge_captions(article_dict['caption_dicts'])

    # Getting the article license and keywords if present.
    article_dict['license'] = soup.find('infon', {'key': 'license'}).text
    try:
      article_dict['keywords'] = soup.find('infon', {'key': 'kwd'}).text.split(', ')
    except:
      article_dict['keywords'] = None
    return article_dict

In [None]:
article_cases = get_case_report(id_dict['pmcid'])

Example of clinical case:

In [None]:
article_cases['cases'][0]

"A 25-year-old man, previously healthy, was initially admitted due to slowly progressive headache with blurry vision and fever for nine months. The patient recalls ingesting raw camel milk, which is a major risk factor for brucellosis. There was no previous contact with a tuberculosis case. The headache worsened one week before his admission and the patient lost vision in the left eye. His vital signs and cognitive function were normal. Pupils were reactive, but the patient was barely seeing the flash light with his left eye. Ophthalmologic examination revealed an atrophic optic disc mainly with decreased visual acuity bilaterally. Extraocular muscles were intact. The remaining neurological examination was unremarkable.\nHis diagnostic work up showed total white blood cell (WBC) count of 5.61 x 109 cells/mm3 and a C-reactive protein (CRP) level of 3.76 mg/L. His cerebrospinal fluid (CSF) acid fast bacilli (AFB) stain and Mycobacterium tuberculosis polymerase chain reaction (MTB-PCR) we

### Combining Cases with Article Metadata

`merge_data()` is a function used to combine the metadata dict and the article content dict that belong to the same article.

In [None]:
def merge_data(metadata, article_cases):

  article_dict = metadata
  article_id = article_dict['pmcid']
  # If no keywords were found in PubMed (missing, None or empty), the ones from the article content are used.
  if not article_dict.get('keywords'):
    article_dict['keywords'] = article_cases['keywords']
  article_dict['license'] = article_cases['license']
  article_dict['case_amount'] = article_cases['case_amount']

  cases = []
  if 'cases' in article_cases.keys():
    for i, case_ in enumerate(article_cases['cases']):
      cases.append({'case_id': f"{article_id}_{('0' + str(i+1))[-2:]}", 'case_text': case_.strip()})

  caption_dicts = []
  if 'caption_dicts' in article_cases.keys():
    caption_dicts = article_cases['caption_dicts']

  return {'article_id': article_id, 'article_metadata': article_dict, 'cases': cases, 'captions': caption_dicts}

In [None]:
case_report_data = merge_data(article_metadata, article_cases)
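
Case IDs are built from the PMCID plus a two-digit, one-based case number. The sketch below shows one possible way to write that formatting with a format specifier; `make_case_id()` is a hypothetical helper, not the notebook's code.

```python
def make_case_id(article_id, case_index):
  # case_index is zero-based; the suffix is zero-padded to at least two digits.
  return f"{article_id}_{case_index + 1:02d}"

print(make_case_id('PMC6744365', 0))  # PMC6744365_01
```

Note that this variant differs from the slicing used in `merge_data()` from the 100th case onwards, because the slice keeps only the last two characters of the number.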

## 3. Extracting Demographic Information from Cases

### Extracting the Age of the Patient

As the age is sometimes expressed in words instead of numbers, `string_to_number()` converts those strings into integers. If the patient is less than 1 year old (e.g. the age is given in months or weeks), the age is set to zero.

In [None]:
def string_to_number(text):

  text2int = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
              "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
              "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90, "hundred": 100}

  number = 0

  new_string = 'y'.join(text.split('y')[:-1])

  # Numerical substrings are removed from the original string (starting from the highest values), and the corresponding integer is summed to 'number'.
  for text_number in list(text2int.keys())[::-1]:
    numeric_word = re.search(text_number, new_string)
    if numeric_word:
      number = number + text2int[numeric_word.group()]
      new_string = new_string[:numeric_word.span()[0]] + new_string[numeric_word.span()[1]:]
  return number

These are some examples:

In [None]:
string_to_number('seventy three years-old')

73

In [None]:
string_to_number('forty-two yo')

42

In [None]:
string_to_number('eleven months old')

0

`get_age()` is a function that gets the age of the patient from a given clinical case.

In [None]:
def get_age(text):
  age_pattern_1 = r'(one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|\d+)(\s?-+|-?\s+)(year|yr| yo |month|week|day)s?([^.]+)?(\s?-+|-?\s+)old'
  match_1 = re.search(age_pattern_1, text.lower())
  if match_1:
    age_string = match_1.group()
  else:
    age_pattern_2 = r'( aged| age of) (one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|\d+)'
    match_2 = re.search(age_pattern_2, text.lower())
    if match_2:
      age_string = match_2.group()
    else:
      age_string = ''
  if re.search(r'(age|year| yo |yr)', age_string):
    age_match = re.search(r'\d+', age_string)
    if age_match:
      age = int(age_match.group())
    else:
      age = string_to_number(age_string)
  elif re.search(r'(month|week|day)', age_string):
    age = 0
  else:
    age = None
  return age

The extracted age is added to each case in `case_report_data`.

In [None]:
for c in case_report_data['cases']:
  c['age'] = get_age(c['case_text'])
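
To make the age rules easier to follow, the block below is a reduced sketch that only handles ages written with digits. `extract_numeric_age()` and `NUMERIC_AGE` are hypothetical names; the full `get_age()` above also covers ages written in words and the "aged" / "age of" forms.

```python
import re

NUMERIC_AGE = r'(\d+)(?:\s?-+|-?\s+)(year|yr|month|week|day)s?(?:\s?-+|-?\s+)old'

def extract_numeric_age(text):
  # Returns the age in years, 0 for patients younger than one year, or None if no age is found.
  match = re.search(NUMERIC_AGE, text.lower())
  if match is None:
    return None
  number, unit = match.groups()
  return int(number) if unit in ('year', 'yr') else 0

print(extract_numeric_age('A 25-year-old man, previously healthy.'))  # 25
print(extract_numeric_age('A 3-month-old girl was admitted.'))        # 0
print(extract_numeric_age('The patient was well.'))                   # None
```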

### Extracting the Gender of the Patient

`string_to_gender_class()` returns Transgender, Female or Male depending on the content of the clinical case, or Unknown if no gender mention is found.

In [None]:
def string_to_gender_class(text):

  specific_transgender_pattern = r'(female-to-male|male-to-female|female to male|male to female|transgender|transs?exual)'
  other_transgender_pattern = r'( mtf | ftm )'

  specific_female_pattern = r'\s(woman|girl|mistress|lady|female|puerpera|pregnant)[s\s.,:]'
  other_female_pattern = r"(\s(she|her|herself|hers)[s\s.,']| f )"

  specific_male_pattern = r'\s(man|boy|gentleman|mister|male)[s\s.,:]'
  other_male_pattern = r"(\s(he|his|him|himself)[s\s.,']| m )"

  # Acronyms are searched only in the first sentence of the case, to make sure not to get acronyms with a different meaning.
  if re.search(specific_transgender_pattern, text.lower()) or re.search(other_transgender_pattern, text.lower().split('.')[0]):
    return 'Transgender'
  elif re.search(specific_female_pattern, text.lower()):
    return 'Female'
  elif re.search(specific_male_pattern, text.lower()):
    return 'Male'
  elif re.search(other_female_pattern, text.lower()):
    return 'Female'
  elif re.search(other_male_pattern, text.lower()):
    return 'Male'
  else:
    return 'Unknown'

In [None]:
for c in case_report_data['cases']:
  # To reduce false positives, at first only the first sentence of the case is considered. If no gender is identified, then the whole case is used as an input.
  c['gender'] = string_to_gender_class(c['case_text'].split('.')[0])
  if c['gender'] == 'Unknown':
    c['gender'] = string_to_gender_class(c['case_text'])
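
The two-pass strategy (first sentence first, whole case as a fallback) can be sketched with much simpler patterns. Everything in the block below is a simplified assumption made for illustration: `classify()`, `gender_with_fallback()` and the reduced patterns are hypothetical and do not cover the Transgender class.

```python
import re

FEMALE = r"\s(woman|girl|lady|female|she|her)[s\s.,:']"
MALE = r"\s(man|boy|gentleman|male|he|his|him)[s\s.,:']"

def classify(text):
  text = text.lower()
  if re.search(FEMALE, text):
    return 'Female'
  if re.search(MALE, text):
    return 'Male'
  return 'Unknown'

def gender_with_fallback(case_text):
  # The first sentence is tried first; the whole case is used only if nothing is found there.
  gender = classify(case_text.split('.')[0])
  if gender == 'Unknown':
    gender = classify(case_text)
  return gender

print(gender_with_fallback('A 25-year-old man, previously healthy, was admitted.'))
print(gender_with_fallback('The patient was admitted with fever. She reported raw milk intake.'))
```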

## 4. Data Rearrangement

### Captions and images are assigned to their corresponding Clinical Cases

The first step to assign images and captions to cases is to get all the figure numbers mentioned in a specific case (e.g. turning the string "see fig. 1 and 2" into the list [1, 2]). This is the purpose of `get_fig_numbers()`.

In [None]:
def get_fig_numbers(input_string):

  ### Splitting the string into substrings that start with mentions of a figure (such as "fig." or "Figures") and end with a dot.

  patterns = r"(?:fig |figs |fig\.|figs\.|figure|figures|image|\(fig |\(figs |\(fig\.|\(figs\.|\(figure|\(figures|\(image)(?:[^.]+)\."
  matches = re.findall(patterns, input_string.lower())

  ### Splitting substrings that include more than one figure mention.

  substrings = []
  for m in matches:
    fig_substrings = m.split('fig')
    for s in fig_substrings:
      if s:
        substrings.append(s)

  ### Tokenization of substrings using blank spaces.

  token_lists = []
  for s in substrings:
    token_lists.append(re.split(r'\s+', s))

  ### Adding fig numbers found in tokens to the fig_list object.

  fig_list = []
  for l in token_lists:

    previous_number_idx = None
    fig_number = None
    l_fig_list = []
    range_token = False
    break_id = None # This is used to ignore numbers that appear after closing parenthesis, semi-colons or other similar special characters.

    for idx, token in enumerate(l):

      if re.search(r"[^a-zA-Z0-9\(,.\-]", token):
        break_id = idx +1

      if idx == break_id:
        break

      if 'year' not in token: ### to ignore tokens with \d+-year-old format.

        ### If the first number appears more than 5 tokens away from the figure mention, it's likely to be referring to something else, so the loop is broken.

        if previous_number_idx is None:
          if idx > 5:
            break

        ### If any number appears more than 5 tokens away from the previous number, it's likely to be referring to something else, so the loop is broken.

        else:
          if (idx - previous_number_idx) > 5:
            break

          ### If '-', 'to' or 'through' come between two numbers, it is in fact a range so range_token is True. This is used, for example, to be able to extract '4-7' as [4, 5, 6, 7].

          if re.search(r'-|to|through', token) and (previous_number_idx  is not None) and (l_fig_list != []):
            range_token = True
            range_start = l_fig_list[-1]

        ### All the numbers present in the token are extracted. If the token is not a range, only the first number is added to l_fig_list.

        fig_number = re.findall(r'\d+', token)
        if fig_number:
          previous_number_idx = idx
          if re.search(r'-', token) and len(fig_number) > 1:
            f_range = list(range(int(fig_number[0]), int(fig_number[-1])+1))
            for e in f_range:
              l_fig_list.append(int(e))
          else:
            l_fig_list.append(int(fig_number[0]))

    ### If l_fig_list includes more than one number and it is a range (range_token == True), then all the numbers from that range are added to the l_fig_list.

    if range_token and (len(l_fig_list) > 1):
      range_end_index = l_fig_list.index(range_start) + 1
      if len(l_fig_list) > range_end_index:
        l_fig_list = list(range(range_start, l_fig_list[range_end_index] +1))

    ### The elements from l_fig_list are added to the outcome fig_list. If the number is 16 or higher, it's unlikely to be a figure number, so it's filtered out.

    for element in l_fig_list:
      if element < 16:
        if element not in fig_list:
          fig_list.append(element)

  return fig_list

In [None]:
for c in case_report_data['cases']:
  c['figs_in_text'] = get_fig_numbers(c['case_text'])
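
The core of the number extraction is how a single token is turned into figure numbers: a hyphenated token with two numbers is expanded as a range, and only the first number is kept otherwise. The block below isolates that step as a sketch; `expand_fig_token()` is a hypothetical helper, and it leaves out the distance checks and the 'to' / 'through' ranges handled by `get_fig_numbers()`.

```python
import re

def expand_fig_token(token):
  numbers = [int(n) for n in re.findall(r'\d+', token)]
  if not numbers:
    return []
  # A hyphenated token such as '4-7' is treated as a range of figures.
  if '-' in token and len(numbers) > 1:
    return list(range(numbers[0], numbers[-1] + 1))
  # Otherwise only the first number of the token is kept.
  return [numbers[0]]

print(expand_fig_token('4-7'))  # [4, 5, 6, 7]
print(expand_fig_token('2,'))   # [2]
print(expand_fig_token('see'))  # []
```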

After getting all the figures mentioned in text, captions are assigned to cases if the fig number matches the caption order.

In [None]:
for case_ in case_report_data['cases']:
  case_captions = []
  for c in case_report_data['captions']:
    if c['caption_order'] in case_['figs_in_text']:
      case_captions.append(c)
  case_['case_image_list'] = case_captions

### Text References are assigned to the corresponding Images

`assign_text_references()` assigns to each figure the sentences of the case that mention it. For example, fig 1 is assigned any sentence from the clinical case that contains strings such as '(see fig 1)'.

In [None]:
def assign_text_references(case_dict):

  paragraphs = re.split('\n', case_dict['case_text'])
  sentences = []

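  for p in paragraphs:
    sentence_limits = re.findall(r'\. [A-Z]', p)

    for l in sentence_limits:
      # re.escape() keeps the dot literal, so that only real sentence limits are matched.
      sentence_end_index = re.search(re.escape(l), p).span()[0]+1
      sentences.append(p[:sentence_end_index])
      p = p[sentence_end_index + 1:]
    # The remaining text is the last sentence of the paragraph (or the whole paragraph if it has only one sentence).
    if p:
      sentences.append(p)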

  fig_references = []
  for s in sentences:
    patterns = r"(?:fig |figs |fig\.|figs\.|figure|figures|image|\(fig |\(figs |\(fig\.|\(figs\.|\(figure|\(figures|\(image)(?:[^.]+)\."
    if re.findall(patterns, s.lower()):
      fig_references.append(s)

  for c in case_dict['case_image_list']:
    c['text_references'] = []
    for r in fig_references:
      mentioned_figs = get_fig_numbers(r)
      if c['caption_order'] in mentioned_figs:
        c['text_references'].append(r)

In [None]:
for c in case_report_data['cases']:
  for l in c['case_image_list']:
    l['image_id']  = f"{c['case_id']}_{l['file']}" # An ID is created for each image.
  assign_text_references(c)
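
The sentence splitting and filtering steps can be sketched on a toy text. This block swaps in a lookaround-based `re.split()` for the manual slicing used above, and a much shorter mention pattern; `split_sentences()`, `sentences_mentioning_figures()` and the toy text are hypothetical.

```python
import re

def split_sentences(paragraph):
  # Splits on a space that follows a dot and precedes a capital letter.
  return re.split(r'(?<=\.) (?=[A-Z])', paragraph)

def sentences_mentioning_figures(text):
  sentences = []
  for paragraph in text.split('\n'):
    sentences.extend(split_sentences(paragraph))
  return [s for s in sentences if re.search(r'fig|image', s.lower())]

toy_text = "MRI showed dural-based lesions (Image 1). The patient improved. No new lesions were seen (Image 2)."
figure_sentences = sentences_mentioning_figures(toy_text)
print(figure_sentences)
```

Only the first and the last sentence are kept, because the second one does not mention any figure.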

After rearranging the data, unnecessary data points are removed.

In [None]:
case_report_data.pop('captions')
for c in case_report_data['cases']:
  if 'figs_in_text' in c.keys():
    c.pop('figs_in_text')
  for l in c['case_image_list']:
    if 'caption_order' in l.keys():
      l.pop('caption_order')

## Final Outcome

The final outcome is split into different objects (metadata_dict, case_dict and image_dict) to reduce the size of each file in the whole case report dataset.
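
If each object is then written to its own file, this could be done along the following lines. This is only a sketch under stated assumptions: the JSON format, the `save_outputs()` helper, the file names and the toy dicts are all hypothetical and do not come from the original notebook.

```python
import json
import os
import tempfile

def save_outputs(output_dir, **named_dicts):
  # Each keyword argument is written to '<name>.json' inside output_dir.
  paths = {}
  for name, content in named_dicts.items():
    path = os.path.join(output_dir, f"{name}.json")
    with open(path, 'w', encoding='utf-8') as f:
      json.dump(content, f, ensure_ascii=False, indent=2)
    paths[name] = path
  return paths

# Toy objects that mimic the structure of the real ones.
toy_metadata = {'article_id': 'PMC0000000', 'article_metadata': {'title': 'Toy title'}}
toy_case_dict = {'article_id': 'PMC0000000', 'cases': [{'case_id': 'PMC0000000_01', 'case_text': 'Toy text.'}]}

with tempfile.TemporaryDirectory() as tmp:
  saved = save_outputs(tmp, metadata_dict=toy_metadata, case_dict=toy_case_dict)
  with open(saved['case_dict'], encoding='utf-8') as f:
    reloaded = json.load(f)

print(reloaded == toy_case_dict)  # True
```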

In [None]:
metadata_dict = {'article_id': case_report_data['article_id'], 'article_metadata': {key: value for key, value in case_report_data['article_metadata'].items()}}

In [None]:
metadata_dict

{'article_id': 'PMC6744365',
 'article_metadata': {'doi': '10.1016/j.idcr.2019.e00633',
  'pmid': '31534908',
  'pmcid': 'PMC6744365',
  'title': 'Ceftriaxone use in brucellosis: A case series',
  'journal': 'IDCases',
  'journal_detail': '2019 Sep 5:18:e00633.',
  'year': '2019',
  'link': 'https://pubmed.ncbi.nlm.nih.gov/31534908/',
  'authors': ['Daniah F Fatani',
   'Walaa A Alsanoosi',
   'Mazen A Badawi',
   'Abrar K Thabit'],
  'major_mesh_terms': [],
  'minor_mesh_terms': ['Case Reports'],
  'mesh_terms': ['Case Reports'],
  'keywords': ['brucella',
   'brucellosis',
   'cases',
   'ceftriaxone',
   'saudi arabia',
   'zoonotic infection'],
  'license': 'CC BY-NC-ND',
  'case_amount': 6}}

In [None]:
case_dict = {'article_id': case_report_data['article_id'], 'cases': [{key: value for key, value in case_.items() if key != 'case_image_list'} for case_ in case_report_data['cases']]}

In [None]:
case_dict

{'article_id': 'PMC6744365',
 'cases': [{'case_id': 'PMC6744365_01',
   'case_text': "A 25-year-old man, previously healthy, was initially admitted due to slowly progressive headache with blurry vision and fever for nine months. The patient recalls ingesting raw camel milk, which is a major risk factor for brucellosis. There was no previous contact with a tuberculosis case. The headache worsened one week before his admission and the patient lost vision in the left eye. His vital signs and cognitive function were normal. Pupils were reactive, but the patient was barely seeing the flash light with his left eye. Ophthalmologic examination revealed an atrophic optic disc mainly with decreased visual acuity bilaterally. Extraocular muscles were intact. The remaining neurological examination was unremarkable.\nHis diagnostic work up showed total white blood cell (WBC) count of 5.61 x 109 cells/mm3 and a C-reactive protein (CRP) level of 3.76 mg/L. His cerebrospinal fluid (CSF) acid fast baci

In [None]:
image_dict = {'article_id': case_report_data['article_id'], 'case_images': [{key: value for key, value in case_.items() if key in ['case_id', 'case_image_list']} for case_ in case_report_data['cases']]}

In [None]:
image_dict

{'article_id': 'PMC6744365',
 'case_images': [{'case_id': 'PMC6744365_01',
   'case_image_list': [{'tag': 'img0005',
     'caption': 'Brain magnetic resonance imaging of patient case 1 at baseline on T1 post contrast cuts showing multiple enhancing dural based lesions in both upper frontal lobes at different levels.',
     'file': 'pl1.jpg',
     'image_id': 'PMC6744365_01_pl1.jpg',
     'text_references': ['A magnetic resonance imaging (MRI) of his brain showed multiple, bilateral small dural-based nodular enhancements in both upper frontal lobes (Image 1).']},
    {'tag': 'img0010',
     'caption': 'Brain magnetic resonance imaging of patient case 1 after three weeks of treatment.',
     'file': 'pl2.jpg',
     'image_id': 'PMC6744365_01_pl2.jpg',
     'text_references': ['Fortunately, no interval development of new lesions was seen (Image 2).']}]},
  {'case_id': 'PMC6744365_02', 'case_image_list': []},
  {'case_id': 'PMC6744365_03', 'case_image_list': []},
  {'case_id': 'PMC6744365_