# Mapping CLEF codes to PubMed Ids

Images in the sub-image classification task from the MedicalCLEF task are named with the format *ImageId-FigureNumber-SubfigureNumber*. For instance, the sub-figure *1472-6807-6-9-6-1.jpg* is subfigure 1 of figure 6 of the image ID 1472-6807-6-9. This ImageId is the suffix of the article's DOI (doi: 10.1186/1472-6807-6-9, PMC ID: PMC1508147). We only learned of this relationship from a mapping file provided by Dina Demner (*ids.txt*).

![sample subfigure](./samples/1472-6807-6-9-6-1.jpg)
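
The naming scheme can be split mechanically. The following is a minimal sketch, assuming that the figure and subfigure numbers are always the last two hyphen-separated fields of the file name. The function name `parse_clef_name` is hypothetical and is not part of any CLEF tooling.

```python
import os


def parse_clef_name(filename):
    """Split 'ImageId-FigureNumber-SubfigureNumber.ext' into its three parts.

    Assumes the last two hyphen-separated fields are the figure and
    subfigure numbers; everything before them is taken as the ImageId.
    """
    # Drop any directory part and the file extension.
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Split from the right so hyphens inside the ImageId are preserved.
    image_id, figure, subfigure = stem.rsplit('-', 2)
    return image_id, int(figure), int(subfigure)


print(parse_clef_name('1472-6807-6-9-6-1.jpg'))
# ('1472-6807-6-9', 6, 1)
```

Splitting from the right matters here because the ImageId itself contains hyphens.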

To access an article on PubMed Central (PMC), use the URL https://www.ncbi.nlm.nih.gov/pmc/articles/REPLACE_ID/ and replace the *REPLACE_ID* placeholder with the article's PMC ID, i.e. *PMC* followed by the numeric ID (e.g. PMC1508147). Note that the PMC ID is not the same as the PubMed ID (PMID); the ID converter API used below returns both. PMC also provides a faster way to reach a figure: append /figure/FX to the URL, where X is the figure number (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1508147/figure/F6/). There is no way to link directly to a sub-figure, because the sub-figures are parts of a single compound figure.
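
The URL pattern described above can be sketched as two small helpers. The names `pmc_article_url` and `pmc_figure_url` are hypothetical, and the sketch only builds strings; it does not check that the article or figure exists.

```python
PMC_ARTICLE_URL = 'https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/'


def pmc_article_url(pmcid):
    """Build the article URL; accepts 'PMC1508147', '1508147' or 1508147."""
    pmcid = str(pmcid)
    if not pmcid.upper().startswith('PMC'):
        pmcid = 'PMC' + pmcid
    return PMC_ARTICLE_URL.format(pmcid=pmcid)


def pmc_figure_url(pmcid, figure_number):
    """Build the URL of a whole figure (sub-figures have no URL of their own)."""
    return '{}figure/F{}/'.format(pmc_article_url(pmcid), figure_number)


print(pmc_figure_url('1508147', 6))
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1508147/figure/F6/
```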


In [7]:
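# Query the PMC ID Converter API to map a PubMed ID (PMID) to its PMC ID and DOI.
# Docs: https://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/
import requests

# Replace the tool and email placeholders with your own values.
request_url = 'https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?tool=my_tool&email=my_email@example.com&ids=25640076&versions=no&format=json'
r = requests.get(request_url)
r.status_code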

200

In [8]:
r.text

'{\n "status": "ok",\n "responseDate": "2020-06-10 22:35:54",\n "request": "tool=my_tool;email=my_email%40example.com;ids=25640076;versions=no;format=json",\n "records": [\n   {\n    "pmcid": "PMC4408612",\n    "pmid": "25640076",\n    "doi": "10.1016/j.neuron.2014.12.061"\n   }\n ]\n}\n'

In [10]:
j = r.json()

In [18]:
j['records'][0]['pmcid']

'PMC4408612'
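
The lookup steps above can be gathered into reusable helpers. This is a minimal sketch rather than a definitive client: the names `build_idconv_url`, `extract_pmcid` and `lookup_pmcid` are hypothetical, the endpoint is the one used in the cells above, and the standard library's `urllib` is used in place of `requests` so the block has no third-party dependencies. Only the offline part is exercised here; `lookup_pmcid` performs a network request and is not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the request above; it may change over time.
IDCONV_URL = 'https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/'


def build_idconv_url(article_id, tool='my_tool', email='my_email@example.com'):
    """Build the converter URL; tool and email are placeholder values."""
    params = {'tool': tool, 'email': email, 'ids': article_id,
              'versions': 'no', 'format': 'json'}
    return IDCONV_URL + '?' + urlencode(params)


def extract_pmcid(response):
    """Return the 'pmcid' of the first record, or None if there is none."""
    records = response.get('records') or []
    return records[0].get('pmcid') if records else None


def lookup_pmcid(article_id):
    """Fetch and parse the converter response (performs a network request)."""
    req = Request(build_idconv_url(article_id),
                  headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req) as resp:
        return extract_pmcid(json.load(resp))


# Offline check against the response shown above.
sample = {'status': 'ok',
          'records': [{'pmcid': 'PMC4408612', 'pmid': '25640076',
                       'doi': '10.1016/j.neuron.2014.12.061'}]}
print(extract_pmcid(sample))  # PMC4408612
```

For a CLEF image, a call such as `lookup_pmcid('10.1186/' + image_id)` might work, assuming that the converter accepts a DOI in `ids` and that the `10.1186` prefix from the single example above applies to the other images; neither assumption has been verified here.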

In [3]:
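from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Fetch the PMC page for figure 1 of the article found above.
url = 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4408612/figure/F1'
# Send a browser-like User-Agent header with the request.
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(url, headers=hdr)
page = urlopen(req).read()

# Parse the HTML (the 'lxml' parser requires the lxml package).
soup = BeautifulSoup(page, 'lxml')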

In [6]:
imgs = soup.find_all('img')

In [8]:
imgs[0]

<img alt="Logo of nihpa" src="/corehtml/pmc/pmcgifs/logo-hhspa.png" usemap="#logo-imagemap"/>
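
The first `img` element is the site logo rather than the figure, so the list needs filtering before it is useful. Below is a minimal sketch, assuming that site chrome can be recognised by a `src` starting with `/corehtml/`; that prefix is inferred only from the logo shown above and may not cover every non-figure image. The function name `content_images` is hypothetical. Plain dicts stand in for BeautifulSoup tags, since both support `.get('src')`.

```python
def content_images(imgs, chrome_prefix='/corehtml/'):
    """Drop images whose src starts with chrome_prefix (assumed site chrome)."""
    return [img for img in imgs
            if not (img.get('src') or '').startswith(chrome_prefix)]


# Stand-in data: plain dicts support .get('src') like BeautifulSoup tags.
# The second src is a made-up placeholder, not a real PMC path.
sample_imgs = [
    {'src': '/corehtml/pmc/pmcgifs/logo-hhspa.png', 'alt': 'Logo of nihpa'},
    {'src': '/hypothetical/path/figure.jpg', 'alt': 'hypothetical figure'},
]
print(len(content_images(sample_imgs)))  # 1
```

With the parsed page, the call would be `content_images(imgs)`; the remaining elements still need manual inspection to identify the figure itself.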