# 3.2 Searching ClinicalTrials.gov

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-2-searching-clinicaltrials-gov.ipynb) 

In this notebook we will be searching ClinicalTrials.gov for clinical trials. We'll see an example of searching with the same example as before, using "[Blue-Light Therapy for Acne Vulgaris: A Systematic Review and Meta-Analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6846280/)"

In [1]:
!pip install requests pandas -q
import requests
import pandas as pd

## Searching and De-duplicating Results

First, we will explore how to search both PubMed and ClinicalTrials.gov, then deduplicate the results. We'll run the PubMed query from before and the ClinicalTrials.gov query, then analyse the results.

The first cell below contains the PubMed query and the ClinicalTrials.gov query. These could be swapped out to search for different topics and the rest of the notebook would still work.

In [2]:
pubmed_search_string = """
("Acne Vulgaris"[Mesh] OR Acne[tiab] OR Blackheads[tiab] OR Whiteheads[tiab] OR Pimples[tiab]) AND ("Phototherapy"[Mesh] OR "Blue light"[tiab] OR Phototherapy[tiab] OR Phototherapies[tiab] OR "Photoradiation therapy"[tiab] OR "Photoradiation Therapies"[tiab] OR "Light Therapy"[tiab] OR "Light Therapies"[tiab]) AND (Randomized controlled trial[pt] OR controlled clinical trial[pt] OR randomized[tiab] OR randomised[tiab] OR placebo[tiab] OR "drug therapy"[sh] OR randomly[tiab] OR trial[tiab] OR groups[tiab]) NOT (Animals[Mesh] not (Animals[Mesh] and Humans[Mesh]))
"""
clinicaltrials_search_string = "(Acne AND (Phototherapy OR light))"

We'll start by searching PubMed and ClinicalTrials.gov for the queries above. We'll then deduplicate the results by comparing the PMIDs from PubMed with the NCT IDs from ClinicalTrials.gov.

We'll start by searching PubMed for the query above.

In [3]:
def get_pmids_from_pubmed(query, retstart=0):
    # The query doesn't return more than 10,000 results, so we can retrieve all the PMIDs at once.
    pubmed_response = requests.get(  # GET request
        url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",  # URL of the API
        params={  # Parameters of the request
            "db": "pubmed",
            "term": pubmed_search_string,
            "retmax": 10_000,  # We can retrieve up to 10,000 studies at a time
            "retstart": retstart,  # Start at the first study
            "format": "json"
        }
    ).json()  # Parse the response as JSON

    n_studies = int(pubmed_response["esearchresult"]["count"])

    # Get the list of PMIDs from the response
    for pmid in pubmed_response["esearchresult"]["idlist"]:
        yield pmid

    # If we haven't reached the total number of studies, get the next page
    if n_studies > retstart:
        yield from get_pmids_from_pubmed(query, retstart=retstart + 10_000)


pubmed_pmids = list(get_pmids_from_pubmed(pubmed_search_string))
len(pubmed_pmids)

496

The above cell should print how many PMIDs were found in PubMed. We'll now search ClinicalTrials.gov for the query above.

In [4]:
# The query retrieves more than 100 results, so we need to page through the results
def get_ntcids_from_clinicaltrials(query, min_rnk=1, max_rnk=100):
    clinicaltrials_response = requests.get(  # GET request
        url="https://classic.clinicaltrials.gov/api/query/full_studies",  # URL of the API
        params={  # Parameters of the request
            "expr": query,
            "min_rnk": min_rnk,
            "max_rnk": max_rnk,
            "fmt": "json",
        }
    ).json()  # Parse the response as JSON

    # Grab the total number of studies
    n_studies = int(clinicaltrials_response["FullStudiesResponse"]["NStudiesFound"])

    # Yield the NCT IDs
    for study in clinicaltrials_response["FullStudiesResponse"]["FullStudies"]:
        nct_id = study["Study"]["ProtocolSection"]["IdentificationModule"]["NCTId"]
        yield nct_id

    # If we haven't reached the total number of studies, get the next page
    if n_studies > max_rnk:
        yield from get_ntcids_from_clinicaltrials(query, min_rnk=max_rnk + 1, max_rnk=max_rnk + 100)


clinicaltrials_nctids = list(get_ntcids_from_clinicaltrials(clinicaltrials_search_string))
len(clinicaltrials_nctids)

113

The above cell should print how many NCT IDs were found in ClinicalTrials.gov. We can now compare the results from PubMed and ClinicalTrials.gov to deduplicate the results.

PubMed provides the ability to seach for articles by a "secondary ID", which can bea search using the [SI] field. We can use this to search for the NCT IDs we found in ClinicalTrials.gov. We can then compare the results from PubMed and ClinicalTrials.gov. It's pretty easy for us to automatically create this query:

In [5]:
pubmed_ntcid_search_string = " OR ".join([f"{nct_id}[SI]" for nct_id in clinicaltrials_nctids])
pubmed_ntcid_search_string

'NCT03650881[SI] OR NCT04433143[SI] OR NCT00613444[SI] OR NCT04300010[SI] OR NCT03128723[SI] OR NCT02431494[SI] OR NCT00833183[SI] OR NCT04112407[SI] OR NCT00706433[SI] OR NCT05245045[SI] OR NCT02698436[SI] OR NCT03124381[SI] OR NCT02924428[SI] OR NCT01689935[SI] OR NCT00237978[SI] OR NCT01347879[SI] OR NCT01678482[SI] OR NCT04636242[SI] OR NCT06311890[SI] OR NCT00814918[SI] OR NCT02313467[SI] OR NCT00933543[SI] OR NCT01328080[SI] OR NCT04631250[SI] OR NCT01830764[SI] OR NCT01276535[SI] OR NCT01119651[SI] OR NCT01115322[SI] OR NCT00129428[SI] OR NCT05080764[SI] OR NCT05073211[SI] OR NCT03961607[SI] OR NCT01584674[SI] OR NCT04698239[SI] OR NCT01472900[SI] OR NCT00476697[SI] OR NCT04156815[SI] OR NCT01160848[SI] OR NCT02180282[SI] OR NCT03203122[SI] OR NCT01677221[SI] OR NCT06225570[SI] OR NCT00113425[SI] OR NCT04167982[SI] OR NCT03279003[SI] OR NCT06043102[SI] OR NCT04873089[SI] OR NCT00673933[SI] OR NCT04709289[SI] OR NCT00594425[SI] OR NCT01257555[SI] OR NCT05622253[SI] OR NCT03303170

We can now search PubMed for the NCT IDs we found in ClinicalTrials.gov. We'll then deduplicate the results by comparing the PMIDs from PubMed with the NCT IDs from ClinicalTrials.gov.

In [6]:
# The query doesn't return more than 10,000 results, so we can retrieve all the PMIDs at once.
pubmed_ntcid_response = requests.get(  # GET request
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",  # URL of the API
    params={  # Parameters of the request
        "db": "pubmed",
        "term": pubmed_ntcid_search_string,
        "retmax": 10_000,  # We can retrieve up to 10,000 studies at a time
        "format": "json"
    }
).json()  # Parse the response as JSON

# Get the list of PMIDs from the response
pubmed_ntcid_pmids = pubmed_ntcid_response["esearchresult"]["idlist"]
len(pubmed_ntcid_pmids)

5

The above cell should print how many PMIDs were found in PubMed for the NCT IDs we found in ClinicalTrials.gov. The following cell displays which the actual PMIDs we found.

In [7]:
pubmed_ntcid_pmids

['36946749', '30829754', '30452511', '29905384', '23538621']

Now that we have the list of PMIDs from PubMed and the PMIDs for the NCT IDs from ClinicalTrials.gov, we can deduplicate the results by taking the union of the two sets.

In [8]:
deduplicated_pmids = list(set(pubmed_pmids).union(set(pubmed_ntcid_pmids)))
len(deduplicated_pmids)

500

The cell aboveshould print how many PMIDs were found in total after deduplication. We now have everything we need to retrieve the study data for screening: The list of PMIDs from PubMed and the NCT IDs from ClinicalTrials.gov, without any overlapping studies from either service.

## Retrieving Study Data for Screening 

We'd like to minimise the time we spend processing all the data into a common format to screen it, so we'll normalise the data from both PubMed and ClinicalTrials.gov so we can either immediately screen it, or import it into a screening tool. Now that we have all the IDs for the studies we want to screen, we can retrieve the study data for screening. We'll start by retrieving the study data from PubMed.

The first cell here sets up what our data format will look like. In this basic example, studies will contain the title, abstract and either the PMID or NCT ID.

In [9]:
from collections import namedtuple

Study = namedtuple("Study", ["title", "abstract", "pmid", "nct_id"])

We'll now retrieve the study data from PubMed for the PMIDs we found. We'll slice up the PMIDs into chunks of 25, the API is unlcear on how many PMIDs can be retrieved at once, so we'll play it safe and use 25. We'll then parse the response and create a list of studies.

We also have to do a little bit more data processing, since we can't directly access the study data from PubMed in JSON. For an example of how the data we get back from the API we use below looks like, take a look at this URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=25594129&retmode=text&rettype=medline

In [10]:
# Slice up the PMIDs into chunks of 25
sliced_pmids = [deduplicated_pmids[i:i + 25] for i in range(0, len(deduplicated_pmids), 25)]

response = ""
for pmid_slice in sliced_pmids:
    response += requests.get(  # GET request
        url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",  # URL of the API
        params={  # Parameters of the request
            "db": "pubmed",
            "id": ",".join(pmid_slice),  # We can get multiple PMIDs at once
            "rettype": "medline",
            "retmode": "text"
        }
    ).text

We basically have a big string of raw data that we need to process manually. The following cell takes the raw data, splits it into sections (where a section corresponds to a study), and normalises it into the data format we specified above. There is much more data in the response than we need, but we'll only be extracting the title and abstract. 

In [11]:
pubmed_studies = []  # This will contain all the studies once processed
sections = response.split("\n\n")  # Thankfully, the responses can be split easily on two empty lines

for section in sections:  # Now, we process each section.
    # The next few lines of code convert the lines into a JSON format
    data_dict = {}
    last_key = None
    for line in section.splitlines():
        if line.strip() == "":
            continue
        if line[4] == "-":
            line = line.split("-")
            last_key = line[0].strip()
            data_dict[last_key] = line[1].strip()
        else:
            data_dict[last_key] += line.strip()

    # Here is really where we normalise the data into the format we want
    pubmed_studies.append(Study(
        title=data_dict["TI"],  # Note the "TI" field corresponds to the TI line in the raw data
        abstract=data_dict["AB"] if "AB" in data_dict else None,  # Some studies don't have an abstract!
        pmid=data_dict["PMID"],
        nct_id=None
    ))

We now have the study data from PubMed in the format we want. We can explore the results by displaying them in a table.

In [12]:
pd.DataFrame(pubmed_studies)

Unnamed: 0,title,abstract,pmid,nct_id
0,Optimization of hydrogel containing toluidine ...,Antibiotics and photodynamic therapy (PDT) are...,30825010,
1,Stem cell secretome as a mechanism for restori...,BACKGROUND: Living organisms are continuously ...,36199062,
2,Photodynamic action of red light for treatment...,BACKGROUND: Erythrasma is a superficial cutane...,16719870,
3,Characteristics and management of Asian skin.,Color differences in skin are due to the amoun...,30039861,
4,Photodynamic therapy: a new antimicrobial appr...,"Photodynamic therapy (PDT) employs a non(PS), ...",15122361,
...,...,...,...,...
495,Treatment of acne with photodynamic therapy.,Photodynamic therapy (PDT) with aminolevuninic...,22095176,
496,Lightcontrolled trials.,"OBJECTIVE: In dermatology, patient and physici...",29356026,
497,Drugs for discoid lupus erythematosus.,BACKGROUND: Discoid lupus erythematosus (DLE) ...,28476075,
498,Photofrinnonmelanomatous skin tumors in elderl...,OBJECTIVES/HYPOTHESIS: Aggressive nonmelanomat...,11404627,


Luklily for us, PubMed will tell us the list of clinical trials that it couldn't find when we searched them before. That means we can just use this list to search ClinicalTrials.gov for the missing studies.

In [13]:
missing_clinicaltrial_studies = pubmed_ntcid_response["esearchresult"]["errorlist"]["phrasesnotfound"]
len(missing_clinicaltrial_studies)

108

The number above contains the number of clinical trials minus those that were found in PubMed. We can now search ClinicalTrials.gov for the missing studies.

Since there are a lot of studies, we'll page through the results to get all the studies. We'll then parse the response and create a list of studies.

In [14]:
# The query retrieves more than 100 results, so we need to page through the results
def get_studies_from_clinicaltrials(query, min_rnk=1, max_rnk=100):
    clinicaltrials_response = requests.get(  # GET request
        url="https://classic.clinicaltrials.gov/api/query/full_studies",  # URL of the API
        params={  # Parameters of the request
            "expr": query,
            "min_rnk": min_rnk,
            "max_rnk": max_rnk,
            "fmt": "json",
        }
    ).json()  # Parse the response as JSON

    # Grab the total number of studies
    n_studies = int(clinicaltrials_response["FullStudiesResponse"]["NStudiesFound"])

    # Yield the NCT IDs
    for study in clinicaltrials_response["FullStudiesResponse"]["FullStudies"]:
        if study["Study"]["ProtocolSection"]["IdentificationModule"]["NCTId"] in missing_clinicaltrial_studies:
            yield Study(
                title=study["Study"]["ProtocolSection"]["IdentificationModule"]["BriefTitle"],
                abstract=study["Study"]["ProtocolSection"]["DescriptionModule"]["BriefSummary"],
                pmid=None,
                nct_id=study["Study"]["ProtocolSection"]["IdentificationModule"]["NCTId"]
            )

    # If we haven't reached the total number of studies, get the next page
    if n_studies > max_rnk:
        yield from get_studies_from_clinicaltrials(query, min_rnk=max_rnk + 1, max_rnk=max_rnk + 100)


clinicaltrials_studies = list(get_studies_from_clinicaltrials(clinicaltrials_search_string))

We now have the study data from ClinicalTrials.gov in the format we want. We can explore the results by displaying them in a table.

In [15]:
pd.DataFrame(clinicaltrials_studies)

Unnamed: 0,title,abstract,pmid,nct_id
0,The Comparative Efficacy of an Over the Counte...,This is a single-center prospective study of t...,,NCT03650881
1,Evaluate the Efficacy and Safety of Intense Pu...,Acne is a chronic inflammatory disease involvi...,,NCT04433143
2,Photodynamic Therapy in the Treatment of Acne,The purpose of this research project is to stu...,,NCT00613444
3,Blue Light Therapy of C. Acnes,This proposal aims to investigate a novel ligh...,,NCT04300010
4,A Study to Evaluate the Tolerance of an Acne T...,The study will look to evaluate the tolerance ...,,NCT03128723
...,...,...,...,...
103,Straberi Epistamp Needling Treatment For Skin ...,This pilot study will expand knowledge and app...,,NCT04742803
104,Study of Secukinumab Compared to Fumaderm® in ...,"This is a randomized, controlled, multicenter,...",,NCT02474082
105,Long-term Safety Study of Chronocort in the Tr...,This phase III study is an open-label extensio...,,NCT05299554
106,Effects of Interleukin-1 Receptor Antagonism o...,"A prospective, interventional, open-label, sin...",,NCT03578497


Since all the data is now in the same format, we can combine the results from PubMed and ClinicalTrials.gov into a single table.

In [16]:
pd.DataFrame(pubmed_studies + clinicaltrials_studies)

Unnamed: 0,title,abstract,pmid,nct_id
0,Optimization of hydrogel containing toluidine ...,Antibiotics and photodynamic therapy (PDT) are...,30825010,
1,Stem cell secretome as a mechanism for restori...,BACKGROUND: Living organisms are continuously ...,36199062,
2,Photodynamic action of red light for treatment...,BACKGROUND: Erythrasma is a superficial cutane...,16719870,
3,Characteristics and management of Asian skin.,Color differences in skin are due to the amoun...,30039861,
4,Photodynamic therapy: a new antimicrobial appr...,"Photodynamic therapy (PDT) employs a non(PS), ...",15122361,
...,...,...,...,...
603,Straberi Epistamp Needling Treatment For Skin ...,This pilot study will expand knowledge and app...,,NCT04742803
604,Study of Secukinumab Compared to Fumaderm® in ...,"This is a randomized, controlled, multicenter,...",,NCT02474082
605,Long-term Safety Study of Chronocort in the Tr...,This phase III study is an open-label extensio...,,NCT05299554
606,Effects of Interleukin-1 Receptor Antagonism o...,"A prospective, interventional, open-label, sin...",,NCT03578497


Once we have all the data in this DataFrame format, we can easily save the file to a CSV, Excel, or many other kinds of files.

In [17]:
pd.DataFrame(pubmed_studies + clinicaltrials_studies).to_csv("studies_to_screen.csv", index=False)
# There should now be a file called "studies_to_screen.csv" in the current directory

## Summary

In this notebook, we've shown how to search ClinicalTrials.gov for clinical trials and how to deduplicate the results with PubMed. We've also shown how to retrieve the study data for screening. We've saved the results to a CSV file for further analysis.

---
[top](https://github.com/hscells/apis-for-evidence-identification)<br/>
[next: Implementing a Synonym Finder](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-3-implementing-a-synonym-finder.ipynb)<br/>
[previous: Searching PubMed](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-1-searching-pubmed.ipynb)<br/>