# ✍️📜 Cadmus



This project aims to build an automated full-text retrieval system for generating large biomedical corpora from published literature for research purposes. Cadmus has been developed for non-commercial research; use outside this remit is neither recommended nor intended.

GitHub: https://github.com/biomedicalinformaticsgroup/cadmus

# 📋 Requirements


To run the code, you need the following:

- Java 7 or later.
- A local clone of the project, installed with pip.
- An API key from NCBI (this is used to search PubMed for articles using a search string or a list of PubMed IDs; you can find more information [here](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)).

*If you are running cadmus on a shared machine and the tmp directory contains Tika instances that you do not own, those instances must be terminated so that cadmus can restart them for you.*

**Recommended requirements:**

An API key from Wiley. This key gives you access to the OA publications and to the publications that you or your institution has the right to access from Wiley. You can find more information [here](https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining).

An API key from Elsevier. This key gives you access to the OA publications and to the publications that you or your institution has the right to access from Elsevier. You can find more information [here](https://dev.elsevier.com/).
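Before installing, it can help to confirm that the basic prerequisites are in place. The following is a minimal sketch, not part of cadmus: the helper name `missing_prerequisites` is hypothetical, and reading the NCBI key from an environment variable called `NCBI_API_KEY` is an assumption made for illustration (cadmus itself receives the key as a function argument). The check only confirms that a `java` executable is on the PATH; it does not verify the Java version.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    # Tika needs a Java runtime; here we only test that an executable is reachable.
    if which("java") is None:
        problems.append("java executable not found on PATH")

    # Hypothetical convention: the NCBI key is stored in an environment variable.
    if not env.get("NCBI_API_KEY"):
        problems.append("NCBI_API_KEY is not set")

    return problems
```

Calling `missing_prerequisites()` with no arguments inspects the current environment; the `env` and `which` arguments exist only so the checks can be exercised without touching the real system.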


# ✍️📜 Import Cadmus

This notebook uses the following version of Python:

In [2]:
!python --version

Python 3.12.12


In [None]:
!git clone https://github.com/biomedicalinformaticsgroup/cadmus.git

In [None]:
!pip install ./cadmus

# 📚 Import libraries

In [5]:
from cadmus import bioscraping
from cadmus import parsed_to_df
import zipfile
import json
import pandas as pd

# 🚀 Retrieving a corpus using search terms (without subscriptions)

The format we are using for the search term(s) is the same as the one for [PubMed](https://pubmed.ncbi.nlm.nih.gov/). You can first try your search term(s) on PubMed and then use the same search term(s) as input for cadmus `bioscraping`.
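As an illustration of the PubMed search syntax, the sketch below assembles a query string from individual terms and an optional publication date range. The helper `build_pubmed_query` is hypothetical and not part of cadmus; it only concatenates strings, so the result should still be tried on PubMed before being passed to `bioscraping`.

```python
def build_pubmed_query(terms, date_from=None, date_to=None):
    """Join search terms with AND and optionally append a publication date range.

    Dates are expected in PubMed's YYYY/MM/DD form, e.g. "2020/04/01".
    """
    query = " AND ".join(terms)
    if date_from and date_to:
        date_range = (
            f'("{date_from}"[Date - Publication] : '
            f'"{date_to}"[Date - Publication])'
        )
        query = f"{query} AND {date_range}" if query else date_range
    return query


query = build_pubmed_query(
    ["mecp2", "human[mesh]", "English[lang]"],
    date_from="2020/04/01",
    date_to="2020/06/30",
)
print(query)
```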

To create your corpus, use the `bioscraping` function. It takes the following required parameters:

1. A PubMed query string or a Python list of PubMed IDs
2. An email address
3. Your NCBI_API_KEY
   
The function can also receive optional parameters.

1. wiley_api_key parameter allows Wiley to identify which publications you or your institution have the right to access. It will give you access to the OA publications that you would not get access to without the key. **RECOMMENDED**
2. elsevier_api_key parameter allows Elsevier to identify which publications you or your institution have the right to access. It will give you access to the OA publications that you would not normally have access to without the key. **RECOMMENDED**
3. The "start" parameter tells the function which service was being queried before the failure (e.g. crossref, doi, PubMed Central API, ...).
4. The "idx" parameter tells the function the index of the last saved row (article).

"start" and "idx" are designed for restarting cadmus after a program failure. While cadmus is running, a repeated output line appears at the top of the live output. It shows the stage and the index at which your output dataframe was last saved, in case of failure for whatever reason. With these optional parameters, the program picks up where it left off, so you do not have to start the process from the beginning again.
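A restart can be sketched as follows. The helper `restart_kwargs` is hypothetical and only packages the two optional arguments; the values shown are placeholders, and both should be copied exactly as they appear in the live output of the interrupted run, since the accepted stage names and the form of the index are not confirmed here.

```python
def restart_kwargs(stage=None, last_saved_index=None):
    """Build the optional keyword arguments for resuming an interrupted run.

    Arguments left as None are omitted, so a fresh run passes nothing extra.
    """
    candidates = {"start": stage, "idx": last_saved_index}
    return {name: value for name, value in candidates.items() if value is not None}


# Placeholder values standing in for what the live output line reported.
resume = restart_kwargs(stage="crossref", last_saved_index=3)
print(resume)

# The resumed call would then look like this (not executed here):
# bioscraping(pmids, EMAIL, NCBI_API_KEY, **resume)
```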

5. "full_search", in case you want to check whether a document has become available since the last time you tried. "full_search" has three predefined values:

    - The default value is 'None'; the function only looks for the new articles since the last run.
    - 'light': the function looks for the new articles since the last run and retries the rows for which no format was retrieved.
    - 'heavy': the function looks for the new articles since the last run and retries the rows for which it did not retrieve at least one tagged version (i.e. HTML or XML) in combination with the PDF format.

6. The "keep_abstract" parameter has the default value 'True' and can be changed to 'False'. When set to 'True', our parsing loads each format from the beginning of the document. When changed to 'False', our parsing tries to identify the abstract in each format and starts extracting the text after it. We offer the option of removing the abstract, but we cannot guarantee that our approach is the most reliable way of doing so. If you would like to apply your own parsing method for removing the abstract, feel free to load any file saved during the retrieval, available in the output folder:
```"output/formats/{format}s/{index}.{suffix}.zip"```.  
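The three "full_search" modes described above can be read as a rule that decides, for each previously processed row, whether retrieval is attempted again. The sketch below expresses that rule as stated in this notebook; it is an illustration, not the cadmus implementation, and the function name `needs_retry` is hypothetical. It assumes each row carries the 0/1 flags `pdf`, `xml`, `html` and `plain` that appear in the retrieved dataframe.

```python
def needs_retry(row, full_search=None):
    """Return True if a previously processed row would be attempted again.

    row: mapping holding the 0/1 flags 'pdf', 'xml', 'html' and 'plain'.
    """
    if full_search is None:
        # Default: only articles that are new since the last run are processed.
        return False

    has_tagged = bool(row["xml"] or row["html"])
    has_pdf = bool(row["pdf"])
    has_any_format = has_tagged or has_pdf or bool(row["plain"])

    if full_search == "light":
        # Retry rows for which no format at all was retrieved.
        return not has_any_format
    if full_search == "heavy":
        # Retry rows lacking a tagged version (HTML or XML) together with a PDF.
        return not (has_tagged and has_pdf)
    raise ValueError("full_search must be None, 'light' or 'heavy'")
```

Under this reading, a row with only a PDF and plain text is left alone by 'light' but retried by 'heavy', because it has no tagged version.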

In [6]:
pmids = ['19902024', '28089709', '28601864', '31138673', '31605240']

bioscraping(pmids,
            EMAIL, #You need to insert your email address here
            NCBI_API_KEY, #You need to insert your NCBI_API_KEY here
            wiley_api_key = WILEY_API_KEY, #This is an optional parameter.
            #You can insert your WILEY_API_KEY here
            elsevier_api_key = ELSEVIER_API_KEY, #This is an optional parameter.
            #You can insert your ELSEVIER_API_KEY here
            colab1 = True # This parameter is for this Notebook example only.
            #It bypasses the edirect module, which cannot be run on Colab.
            )

Result for retrieved_df : 
Here is the performance thus far:
PDF:3 = 60.0%
XML:3 = 60.0%
HTML:0 = 0.0%
Plain Text:1 = 20.0%

We have a tagged version (HTML, XML) for 3 articles = 60.0%

We only have the abstract but not the associated content text for 1 articles = 20.0%

We have a content text for 4 out of 5 articles = 80.0%


Result for retrieved_df2 : 
Here is the performance thus far:
PDF:3 = 60.0%
XML:3 = 60.0%
HTML:0 = 0.0%
Plain Text:1 = 20.0%

We have a tagged version (HTML, XML) for 3 articles = 60.0%

We only have the abstract but not the associated content text for 1 articles = 20.0%

We have a content text for 4 out of 5 articles = 80.0%


In [7]:
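# Read the zipped metadata file; if the archive holds several files, the last one read is kept.
with zipfile.ZipFile("./output/retrieved_df/retrieved_df2.json.zip", "r") as z:
    for filename in z.namelist():
        with z.open(filename) as f:
            data = f.read()
            data = json.loads(data)

# The with-blocks above already close both the file and the archive.
metadata_retrieved_df = pd.read_json(data, orient='index')
# Keep the PMIDs as strings rather than integers.
metadata_retrieved_df.pmid = metadata_retrieved_df.pmid.astype(str)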
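Since the same loading steps are repeated later in this notebook, they could be wrapped in a small helper. The sketch below is one possible version, not part of cadmus; the name `load_zipped_json` is hypothetical. It assumes the archive contains a single JSON file and that the content may be a JSON string that itself holds JSON, in which case it is decoded a second time.

```python
import json
import zipfile


def load_zipped_json(zip_path):
    """Return the decoded JSON content of the first file inside a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        first_member = archive.namelist()[0]
        with archive.open(first_member) as handle:
            payload = json.loads(handle.read())

    # Assumption: the payload may be double-encoded (a JSON string holding JSON).
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload
```

The returned dictionary could then be turned into a dataframe with `pd.DataFrame.from_dict(records, orient='index')`; note that this may infer column types differently from `pd.read_json`.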

The query returned 5 publication records; the full text was extracted where it could be retrieved.

In [8]:
metadata_retrieved_df.shape

(5, 25)

In [9]:
metadata_retrieved_df

Unnamed: 0,pmid,pmcid,title,abstract,mesh,keywords,authors,journal,pub_type,pub_date,...,pdf,xml,html,plain,pmc_tgz,xml_parse_d,html_parse_d,pdf_parse_d,plain_parse_d,content_text
7575147222f842aba0727e3632c8f6dc,19902024,PMC2771255,Changes on the physiological lactonase activit...,Low caloric diet (LCD) is used for weight loss...,,"[BMI, high-density lipoprotein, low caloric di...","[Kotani K, Sakane N, Sano Y, Tsuzaki K, Matsuo...",Journal of clinical biochemistry and nutrition,[Journal Article],2009-11-24,...,1,1,0,0,1,{'file_path': './output/formats/xmls/757514722...,{},{'file_path': './output/formats/pdfs/757514722...,{},1
8677a2c0e6214547acca432c72af98a2,28089709,PMC5357736,Objective assessment of dietary patterns by us...,BACKGROUND: Accurate monitoring of changes in ...,"[Adult, Biomarkers/*urine, Cross-Over Studies,...",,"[Garcia-Perez I, Posma JM, Gibson R, Chambers ...",The lancet. Diabetes & endocrinology,"[Journal Article, Randomized Controlled Trial,...",2017-03-24,...,1,0,0,1,1,{},{},{'file_path': './output/formats/pdfs/8677a2c0e...,{'file_path': './output/formats/txts/8677a2c0e...,1
e937a9417a6e4723bdb6d07e4cd71a64,28601864,PMC5644969,Effect of a High-Protein Diet versus Standard-...,BACKGROUND: Some studies have shown that prote...,"[Adult, Biomarkers/*blood, Blood Glucose/analy...","[Diet, Metabolic syndrome, Protein intake, Wei...","[Campos-Nonato I, Hernandez L, Barquera S]",Obesity facts,"[Journal Article, Randomized Controlled Trial]",2017-11-24,...,0,0,0,0,0,{},{},{},{},0
0b80f7eaaba9488dba9af951a818e062,31138673,PMC6538848,A Multi-omics Approach to Unraveling the Micro...,Long-term consumption of dietary fiber is gene...,,"[AXOS, dietary fiber, glucose homeostasis, lip...","[Benitez-Paez A, Kjolbaek L, Gomez Del Pulgar ...",mSystems,[Journal Article],2019-05-28,...,1,1,0,0,1,{'file_path': './output/formats/xmls/0b80f7eaa...,{},{'file_path': './output/formats/pdfs/0b80f7eaa...,{},1
8ff3e24cb260420498b5537e372a1a1e,31605240,PMC7165363,Elevated serum ceramides are linked with obesi...,INTRODUCTION: Low gut microbiome richness is a...,"[Adult, Ceramides/*blood/metabolism, Chromatog...","[Ceramides, Endotoxin, Glucose metabolism, Mic...","[Kayser BD, Prifti E, Lhomme M, Belda E, Dao M...",Metabolomics : Official journal of the Metabol...,"[Journal Article, Research Support, Non-U.S. G...",2019-10-11,...,0,1,0,0,0,{'file_path': './output/formats/xmls/8ff3e24cb...,{},{},{},1


In [10]:
metadata_retrieved_df.columns

Index(['pmid', 'pmcid', 'title', 'abstract', 'mesh', 'keywords', 'authors',
       'journal', 'pub_type', 'pub_date', 'doi', 'issn', 'crossref',
       'full_text_links', 'licenses', 'pdf', 'xml', 'html', 'plain', 'pmc_tgz',
       'xml_parse_d', 'html_parse_d', 'pdf_parse_d', 'plain_parse_d',
       'content_text'],
      dtype='object')
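The percentages printed by `bioscraping` can be recomputed from the flag columns listed above. The following sketch works on plain dictionaries so that it stays self-contained; the function name `format_summary` is hypothetical, and the five example rows copy the `pdf`, `xml`, `html`, `plain` and `content_text` flags from the first table shown in this notebook.

```python
def format_summary(records):
    """Count rows per retrieved format and return (count, percentage) pairs."""
    total = len(records)

    def share(count):
        return round(100.0 * count / total, 1) if total else 0.0

    counts = {
        "pdf": sum(1 for r in records if r["pdf"]),
        "xml": sum(1 for r in records if r["xml"]),
        "html": sum(1 for r in records if r["html"]),
        "plain": sum(1 for r in records if r["plain"]),
        # A tagged version means HTML or XML.
        "tagged": sum(1 for r in records if r["xml"] or r["html"]),
        "content_text": sum(1 for r in records if r["content_text"]),
    }
    return {name: (count, share(count)) for name, count in counts.items()}


# Flags copied from the five rows of the first run.
rows = [
    {"pdf": 1, "xml": 1, "html": 0, "plain": 0, "content_text": 1},
    {"pdf": 1, "xml": 0, "html": 0, "plain": 1, "content_text": 1},
    {"pdf": 0, "xml": 0, "html": 0, "plain": 0, "content_text": 0},
    {"pdf": 1, "xml": 1, "html": 0, "plain": 0, "content_text": 1},
    {"pdf": 0, "xml": 1, "html": 0, "plain": 0, "content_text": 1},
]
summary = format_summary(rows)
print(summary)
```

With a dataframe at hand, `metadata_retrieved_df.to_dict('records')` yields a list in the shape this function expects.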

We can now call the 'parsed_to_df' function, here with the default value of the 'path' parameter, to build a dataframe that uses the same indexes and holds the corresponding full texts.

In [11]:
retrieved_df = parsed_to_df(path = './output/retrieved_parsed_files/content_text/')

Only the rows for which the full text was retrieved are included.



In [12]:
retrieved_df.shape

(4, 1)

In [13]:
retrieved_df

Unnamed: 0,content_text
0b80f7eaaba9488dba9af951a818e062,Long-term consumption of dietary fiber is gene...
7575147222f842aba0727e3632c8f6dc,Low caloric diet (LCD) is used for weight loss...
8677a2c0e6214547acca432c72af98a2,S2213-8587(16)30419-3 Objective assessment o...
8ff3e24cb260420498b5537e372a1a1e,INTRODUCTION: Low gut microbiome richness is a...
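Because the two dataframes share the same index, comparing their indexes shows which records ended up without full text. The sketch below does this with plain lists; the helper name `rows_without_full_text` is hypothetical, and the identifiers are the ones displayed in the two tables above.

```python
def rows_without_full_text(metadata_index, text_index):
    """Return the metadata indexes that have no matching full-text row, in order."""
    text_ids = set(text_index)
    return [index for index in metadata_index if index not in text_ids]


metadata_ids = [
    "7575147222f842aba0727e3632c8f6dc",
    "8677a2c0e6214547acca432c72af98a2",
    "e937a9417a6e4723bdb6d07e4cd71a64",
    "0b80f7eaaba9488dba9af951a818e062",
    "8ff3e24cb260420498b5537e372a1a1e",
]
text_ids = [
    "0b80f7eaaba9488dba9af951a818e062",
    "7575147222f842aba0727e3632c8f6dc",
    "8677a2c0e6214547acca432c72af98a2",
    "8ff3e24cb260420498b5537e372a1a1e",
]
missing = rows_without_full_text(metadata_ids, text_ids)
print(missing)
```

With the dataframes themselves, the call would be `rows_without_full_text(metadata_retrieved_df.index, retrieved_df.index)`.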


We now update the current project by adding 5 extra PMIDs to the input list.

In [14]:
pmids = ['19902024', '28089709', '28601864', '31138673', '31605240', '31918705', '32586265', '33037261', '33515003', '33578731']

bioscraping(pmids,
            EMAIL, #You need to insert your email address here
            NCBI_API_KEY, #You need to insert your NCBI_API_KEY here
            wiley_api_key = WILEY_API_KEY, #This is an optional parameter.
            #You can insert your WILEY_API_KEY here
            elsevier_api_key = ELSEVIER_API_KEY, #This is an optional parameter.
            #You can insert your ELSEVIER_API_KEY here
            colab2 = True # This parameter is for this Notebook example only.
            #It bypasses the edirect module, which cannot be run on Colab.
            )

Result for retrieved_df : 
Here is the performance thus far:
PDF:3 = 60.0%
XML:4 = 80.0%
HTML:3 = 60.0%
Plain Text:1 = 20.0%

We have a tagged version (HTML, XML) for 4 articles = 80.0%

We only have the abstract but not the associated content text for 0 articles = 0.0%

We have a content text for 5 out of 5 articles = 100.0%


Result for retrieved_df2 : 
Here is the performance thus far:
PDF:7 = 70.0%
XML:7 = 70.0%
HTML:3 = 30.0%
Plain Text:2 = 20.0%

We have a tagged version (HTML, XML) for 7 articles = 70.0%

We only have the abstract but not the associated content text for 1 articles = 10.0%

We have a content text for 9 out of 10 articles = 90.0%


As the difference in performance between retrieved_df and retrieved_df2 shows, this time bioscraping only looked for the 5 new publications. The metadata_retrieved_df now holds 10 publications in total.
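The incremental behaviour amounts to processing only the identifiers that were not part of the previous run. The sketch below illustrates that idea with the two PMID lists used in this notebook; it is not the cadmus implementation, and the function name `new_pmids` is hypothetical.

```python
def new_pmids(requested, already_processed):
    """Return the requested PMIDs not seen before, keeping order and dropping repeats."""
    seen = set(already_processed)
    fresh = []
    for pmid in requested:
        if pmid not in seen:
            seen.add(pmid)
            fresh.append(pmid)
    return fresh


first_run = ['19902024', '28089709', '28601864', '31138673', '31605240']
second_run = first_run + ['31918705', '32586265', '33037261', '33515003', '33578731']

to_fetch = new_pmids(second_run, first_run)
print(to_fetch)
```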

In [15]:
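# Read the updated metadata file; if the archive holds several files, the last one read is kept.
with zipfile.ZipFile("./output/retrieved_df/retrieved_df2.json.zip", "r") as z:
    for filename in z.namelist():
        with z.open(filename) as f:
            data = f.read()
            data = json.loads(data)

# The with-blocks above already close both the file and the archive.
metadata_retrieved_df = pd.read_json(data, orient='index')
# Keep the PMIDs as strings rather than integers.
metadata_retrieved_df.pmid = metadata_retrieved_df.pmid.astype(str)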

In [16]:
metadata_retrieved_df.shape

(10, 25)

In [17]:
metadata_retrieved_df.sample(5)

Unnamed: 0,pmid,pmcid,title,abstract,mesh,keywords,authors,journal,pub_type,pub_date,...,pdf,xml,html,plain,pmc_tgz,xml_parse_d,html_parse_d,pdf_parse_d,plain_parse_d,content_text
d5bfee539f74496da9d6f0fa639c8d66,33037261,PMC7547065,Altered metabolomic profiling of overweight an...,Exercise training and a healthy diet are the m...,"[Adolescent, Body Composition, Body Mass Index...",,"[Duft RG, Castro A, Bonfante ILP, Lopes WA, da...",Scientific reports,"[Journal Article, Research Support, Non-U.S. G...",2020-10-09,...,1,1,1,0,1,{'file_path': './output/formats/xmls/d5bfee539...,{'file_path': './output/formats/htmls/d5bfee53...,{'file_path': './output/formats/pdfs/d5bfee539...,{},1
0b80f7eaaba9488dba9af951a818e062,31138673,PMC6538848,A Multi-omics Approach to Unraveling the Micro...,Long-term consumption of dietary fiber is gene...,,"[AXOS, dietary fiber, glucose homeostasis, lip...","[Benitez-Paez A, Kjolbaek L, Gomez Del Pulgar ...",mSystems,[Journal Article],2019-05-28,...,1,1,0,0,1,{'file_path': './output/formats/xmls/0b80f7eaa...,{},{'file_path': './output/formats/pdfs/0b80f7eaa...,{},1
6fd66ba037574d1db4530ff3e213778b,33578731,PMC7916506,Gut Microbiota Profile and Changes in Body Wei...,Gut microbiota is essential for the developmen...,,"[16S sequencing, BMI, clinical trial, gut micr...","[Atzeni A, Galie S, Muralidharan J, Babio N, T...",Microorganisms,[Journal Article],2021-02-10,...,1,1,0,0,1,{'file_path': './output/formats/xmls/6fd66ba03...,{},{'file_path': './output/formats/pdfs/6fd66ba03...,{},1
e937a9417a6e4723bdb6d07e4cd71a64,28601864,PMC5644969,Effect of a High-Protein Diet versus Standard-...,BACKGROUND: Some studies have shown that prote...,"[Adult, Biomarkers/*blood, Blood Glucose/analy...","[Diet, Metabolic syndrome, Protein intake, Wei...","[Campos-Nonato I, Hernandez L, Barquera S]",Obesity facts,"[Journal Article, Randomized Controlled Trial]",2017-11-24,...,0,0,0,0,0,{},{},{},{},0
8677a2c0e6214547acca432c72af98a2,28089709,PMC5357736,Objective assessment of dietary patterns by us...,BACKGROUND: Accurate monitoring of changes in ...,"[Adult, Biomarkers/*urine, Cross-Over Studies,...",,"[Garcia-Perez I, Posma JM, Gibson R, Chambers ...",The lancet. Diabetes & endocrinology,"[Journal Article, Randomized Controlled Trial,...",2017-03-24,...,1,0,0,1,1,{},{},{'file_path': './output/formats/pdfs/8677a2c0e...,{'file_path': './output/formats/txts/8677a2c0e...,1


In [18]:
retrieved_df = parsed_to_df()

Only the rows for which the full text was retrieved are included.

In [19]:
retrieved_df.shape

(9, 1)

In [20]:
retrieved_df.sample(5)

Unnamed: 0,content_text
0b80f7eaaba9488dba9af951a818e062,Long-term consumption of dietary fiber is gene...
84927ce9fe4d49a881716ec697a76af7,BACKGROUND: Epidemiologic studies show that co...
8ff3e24cb260420498b5537e372a1a1e,INTRODUCTION: Low gut microbiome richness is a...
7575147222f842aba0727e3632c8f6dc,Low caloric diet (LCD) is used for weight loss...
4fa376bec991422899c69fdfb7616146,S0002-9165(22)00623-2 Maternal gut microbiot...


Below is an example of the function using a PubMed search query. Because we have had difficulties running edirect from Colab, it is shown for illustration only and cannot be run here.

In [None]:
"""
bioscraping(
    'mecp2 AND human[mesh] AND English[lang] AND ("2020/04/01"[Date - Publication] : "2020/06/30"[Date - Publication])',
    EMAIL, #You need to insert your email address here
    NCBI_API_KEY, #You need to insert your NCBI_API_KEY here
    wiley_api_key = WILEY_API_KEY, #This is an optional parameter.
    #You can insert your WILEY_API_KEY here
    elsevier_api_key = ELSEVIER_API_KEY, #This is an optional parameter.
    #You can insert your ELSEVIER_API_KEY here
    colab1 = True # This parameter is for this Notebook example only.
    #It bypasses the edirect module, which cannot be run on Colab.
    )
"""