## Prepare the ground truth data

Load the necessary libraries.

In [22]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

import re
import pandas as pd


Custom function to extract the text of the inclusion criteria:
- Trial records in the [clinical trial repository](https://clinicaltrials.gov/) follow a standard format.
- The criteria are normally introduced by the headers 'Inclusion Criteria:' and 'Exclusion Criteria:', so the inclusion text is whatever lies between the two headers.



In [None]:
# Extract the inclusion criteria from a web page, given a URL from the clinical trial repository
start ='Inclusion Criteria:\n'
end = '\nExclusion Criteria:'

def criteria_ext(url, start=start,end=end):

  #open url and use html parser to read the text
  html = urlopen(url).read()
  soup = BeautifulSoup(html, features="html.parser")

  # kill all script and style elements
  for script in soup(["script", "style"]):
    script.extract() 
  # get text
  text = soup.get_text()

  # break into lines and remove leading and trailing space on each
  lines = (line.strip() for line in text.splitlines())
  # break multi-headlines into a line each
  chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
  # drop blank lines
  text = '\n'.join(chunk for chunk in chunks if chunk)

  # Return 'Error' instead of raising when either header is missing
  try:
    inclusion = text[text.index(start)+len(start):text.index(end)]
  except ValueError:
    inclusion ='Error'

  return inclusion
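
The slicing step at the end of `criteria_ext` can be tried on a small hand-written string, with no network access. The sketch below is only an illustration: `sample_text` and the helper name `slice_between` are invented here, and the real page text is much longer.

```python
# Toy page text; the wording is invented for illustration only.
sample_text = (
    "Eligibility\n"
    "Inclusion Criteria:\n"
    "Female\n"
    "Age 18 or older\n"
    "Exclusion Criteria:\n"
    "Pregnant\n"
)

start = "Inclusion Criteria:\n"
end = "\nExclusion Criteria:"


def slice_between(text, start=start, end=end):
    """Return the text between the two headers, or 'Error' if either is missing."""
    try:
        # Same slice as in criteria_ext: from just after `start` up to `end`
        return text[text.index(start) + len(start):text.index(end)]
    except ValueError:
        # str.index raises ValueError when the substring is not found
        return "Error"


with_both = slice_between(sample_text).split("\n")
without_exclusion = slice_between("Inclusion Criteria:\nFemale\n").split("\n")

print(with_both)          # ['Female', 'Age 18 or older']
print(without_exclusion)  # ['Error']
```

The second call shows what happens when a page has no 'Exclusion Criteria:' header: `str.index` raises `ValueError`, and the function returns the placeholder string 'Error'.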

### Preparation of breast cancer data

Read the breast cancer data.

In [30]:
BC_raw = pd.read_csv("/content/BreastCancer.csv")

In [31]:
BC_raw.head()

Unnamed: 0,Rank,NCT Number,Title,Acronym,Status,Study Results,Conditions,Interventions,Outcome Measures,Sponsor/Collaborators,...,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents,URL
0,1,NCT05376241,Promoting Informed Choice for Breast Cancer Sc...,,Recruiting,No Results Available,Breast Cancer,Behavioral: Survey,Reactance|Disbelief|Source derogation|Self exe...,"University of Colorado, Denver|National Cancer...",...,20-1866.cc|R01CA254926,"August 7, 2020",August 2023,August 2025,"May 17, 2022",,"May 17, 2022","University of Colorado Hospital, Aurora, Color...",,https://ClinicalTrials.gov/show/NCT05376241
1,2,NCT05600257,The Effect of Digital Breast Tomosynthesis in ...,,Completed,No Results Available,Breast Cancer,Device: DBT|Device: DM,Overall survival rate,Kaohsiung Veterans General Hospital.,...,KSVGH22-CT11-01,"September 1, 2011","August 31, 2021","August 31, 2021","October 31, 2022",,"October 31, 2022","Kaohsiung Veterans General Hospital, Kaohsiung...",,https://ClinicalTrials.gov/show/NCT05600257
2,3,NCT05132790,Breast Cancer Study of Stereotactic Body Radia...,,Recruiting,No Results Available,Breast Cancer,Drug: SHR-1316 at a dose 20mg/kg q3w|Drug: SHR...,Pathological complete response (pCR) for TNBC ...,Shengjing Hospital|Jiangsu HengRui Medicine Co...,...,BC-NEO-IIT-SHR1316-SHR6390-RT,"November 12, 2021","December 15, 2022","October 15, 2023","November 24, 2021",,"September 13, 2022",Shengjing Hospital of China Medical University...,,https://ClinicalTrials.gov/show/NCT05132790
3,4,NCT03254875,Rehabilitation After Breast Cancer,REBECCA II,Completed,No Results Available,Breast Cancer,Behavioral: Individually tailored nurse naviga...,Distress|Depression|Anxiety|Health related qua...,"Danish Cancer Society|Rigshospitalet, Denmark|...",...,REBECCA II,"August 15, 2017","March 31, 2021","March 31, 2021","August 21, 2017",,"August 27, 2021","Rigshospitalet, Copenhagen, Denmark",,https://ClinicalTrials.gov/show/NCT03254875
4,5,NCT05563220,Open-Label Umbrella Study To Evaluate Safety A...,ELEVATE,Not yet recruiting,No Results Available,Breast Cancer|Metastatic Breast Cancer,Drug: Elacestrant|Drug: Alpelisib|Drug: Everol...,Determine the recommended Phase 2 dose (RP2D) ...,"Stemline Therapeutics, Inc.",...,STML-ELA-0222,"December 31, 2022","December 31, 2024","August 31, 2026","October 3, 2022",,"October 3, 2022",,,https://ClinicalTrials.gov/show/NCT05563220


In [33]:
# Store URLs and NCT IDs separately
BC_url= BC_raw['URL']
BC_NCTID= BC_raw['NCT Number']

A few studies list no exclusion criteria. For these, the search for the closing 'Exclusion Criteria:' header fails and the function returns 'Error'.

The following cell demonstrates this:

In [86]:
# Sample demonstration of Error message
print('Inclusion criteria for trial:',BC_NCTID[43])
criteria_ext(url=BC_url[43]).split('\n')

Inclusion criteria for trial: NCT02663973


['Error']

In [127]:
# Sample demonstration with no Error
print('Inclusion criteria for trial:',BC_NCTID[10])
criteria_ext(url=BC_url[10]).split('\n')

Inclusion criteria for trial: NCT04360330


['Female, ≥ 50 years of age.',
 'Oncotype or MammaPrint diagnosis results are required prior to the start of treatment',
 'Histologically confirmed invasive breast cancer.',
 'Clinical stage T1N0M0.',
 'Receptor status: Estrogen-Receptor (ER)/Progesterone-Receptor (PR) positive and Human Epidermal Growth Factor Receptor 2 (HER2) negative.',
 'Unifocal breast cancer.',
 'Eastern Cooperative Oncology Group (ECOG) 0, 1.',
 'Ability to undergo MRI.',
 'Women of child-bearing potential (WOCBP) must agree to use adequate contraception or agree to undergo sexual abstinence prior to study entry and for the duration of study participation. WOCBP must have a negative serum or urine pregnancy test at time of enrollment. Should a woman become pregnant or suspect she is pregnant while she is participating in this study, she should inform her treating physician immediately.',
 'Ability to understand the investigational nature, potential risks and benefits of the research study and willingness to sig

All inclusion criteria for a trial are extracted as a single list. Each criterion is a separate element of this list.

In [85]:
#Initialize an empty list to store inclusion criteria for each trial
## Extract only 250 studies to start with
n_study = 250
incl_list = list()
from tqdm import tqdm
for i in tqdm(range(n_study)):
  incl= criteria_ext(url=BC_url[i]).split('\n')
  incl_list.append(incl)

100%|██████████| 250/250 [00:59<00:00,  4.21it/s]


In [103]:
# repeat clinical trial ID as many times as number of inclusion criteria
NCTID=[]
for i in range(len(incl_list)):
  ID=[BC_NCTID[i]]*len(incl_list[i])
  NCTID.extend(ID)

In [108]:
# flatten all inclusion criteria into a single long list
all_incl = [item for sublist in incl_list for item in sublist]

In [112]:
# check : total number of data points
print('Number of inclusion criteria:',len(all_incl))
print('\nNumber of NCTIDs:',len(NCTID))

Number of inclusion criteria: 2016

Number of NCTIDs: 2016
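
The repeat-and-flatten steps above can also be written as one pass that pairs each criterion with its trial ID directly. This is a minimal sketch on invented toy values (`toy_ids` and `toy_incl_list` stand in for `BC_NCTID` and `incl_list`); it is an alternative formulation, not the code used in this notebook.

```python
# Toy stand-ins for BC_NCTID and incl_list; the values are invented.
toy_ids = ["NCT00000001", "NCT00000002", "NCT00000003"]
toy_incl_list = [
    ["Female", "Age 18 or older"],
    ["Error"],
    ["Diagnosed with asthma"],
]

# Pair every criterion with the ID of the trial it came from.
# zip stops at the shorter input, so extra IDs beyond the extracted
# trials are ignored.
pairs = [
    (trial_id, criterion)
    for trial_id, criteria in zip(toy_ids, toy_incl_list)
    for criterion in criteria
]

for pair in pairs:
    print(pair)
```

Because the IDs and criteria are built together, the two lists cannot drift out of step, and the length check becomes unnecessary. The resulting list of tuples can be passed to `pd.DataFrame` with the same `columns` argument.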


Convert the lists to a data frame.

In [115]:
BC_data=pd.DataFrame(list(zip(NCTID,all_incl)), columns=['Trial_ID','Incl_crit'])

Remove the records for which the extraction returned 'Error'.

In [123]:
BC_data= BC_data[BC_data['Incl_crit']!='Error']

In [124]:
BC_data.head()

Unnamed: 0,Trial_ID,Incl_crit
0,NCT05376241,Female
1,NCT05376241,Between 39-49 years of age
2,NCT05376241,No history of breast cancer
3,NCT05376241,No known BRCA 1/2 mutation
4,NCT05600257,individuals were diagnosed with breast cancer ...


In [125]:
# Store data frame to a csv for future use
BC_data.to_csv(r'/content/BreastCancer_incl.csv')

In [None]:
print('Number of studies dropped from first', n_study,'studies=', len(all_incl)- BC_data.shape[0])

### Preparation of Asthma data

The same steps as for the breast cancer data are used.
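
Since the steps are repeated for each condition, they could be wrapped in one function. The sketch below is an assumption about how that might look, not part of the original workflow: `collect_criteria` and `fake_extractor` are hypothetical names, and the extractor is passed in as an argument so that the example runs without network access. In the notebook, `criteria_ext` would be passed instead.

```python
from itertools import islice


def collect_criteria(trial_ids, urls, extractor, n_study):
    """Return (Trial_ID, criterion) rows for the first n_study trials,
    plus the IDs of trials whose extraction returned 'Error'."""
    rows, failed = [], []
    for trial_id, url in islice(zip(trial_ids, urls), n_study):
        text = extractor(url)
        if text == "Error":
            # Keep track of failed trials instead of storing an 'Error' row
            failed.append(trial_id)
            continue
        rows.extend((trial_id, criterion) for criterion in text.split("\n"))
    return rows, failed


def fake_extractor(url):
    """Hypothetical stand-in for criteria_ext; returns canned text per URL."""
    pages = {
        "url-a": "Female\nAge 18 or older",
        "url-b": "Error",
        "url-c": "Diagnosed with asthma",
    }
    return pages[url]


rows, failed = collect_criteria(
    ["NCT00000001", "NCT00000002", "NCT00000003"],
    ["url-a", "url-b", "url-c"],
    fake_extractor,
    n_study=2,
)
print(rows)    # criteria rows for the first two trials
print(failed)  # IDs of trials with no extractable criteria
```

Returning the failed IDs separately makes the number of dropped studies available directly as `len(failed)`.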

In [132]:
AS_raw=pd.read_csv('/content/Asthma.csv')

In [133]:
AS_raw.head()

Unnamed: 0,Rank,NCT Number,Title,Acronym,Status,Study Results,Conditions,Interventions,Outcome Measures,Sponsor/Collaborators,...,Other IDs,Start Date,Primary Completion Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Locations,Study Documents,URL
0,1,NCT04293588,Asthma: Phenotyping EXacerbations,APEX,Recruiting,No Results Available,Asthma,"Diagnostic Test: spirometry, FOT, Induced sput...",Proportion of study participants with an eosin...,University of Nottingham|AstraZeneca,...,19051,"November 22, 2019",July 2022,July 2022,"March 3, 2020",,"April 29, 2021","Nottingham respriatory research unit, Nottingh...",,https://ClinicalTrials.gov/show/NCT04293588
1,2,NCT04293445,Asthma: Phenotyping Exacerbations 2,APEX 2,Recruiting,No Results Available,ASTHMA,"Diagnostic Test: spirometry, FOT, Induced sput...",Proportion of study participants with an eosin...,University of Nottingham|AstraZeneca,...,19055,"November 22, 2019",July 2022,July 2022,"March 3, 2020",,"April 29, 2021","Nottingham respriatory research unit, Nottingh...",,https://ClinicalTrials.gov/show/NCT04293445
2,3,NCT03520881,Pediatric ASTHMA-Educator,,Completed,No Results Available,Asthma,Other: Pediatric ASTHMA-Educator mobile applic...,"Change from baseline asthma control to 2, 4, a...",Montefiore Medical Center,...,2013-2693A,"July 1, 2016","June 30, 2019","June 30, 2019","May 11, 2018",,"August 16, 2019","Montefiore Medical Center, Bronx, New York, Un...",,https://ClinicalTrials.gov/show/NCT03520881
3,4,NCT05439915,Asthma Diagnosis Through Peak Flows,DAPF-CSL,Not yet recruiting,No Results Available,Asthma,,Diagnosis of asthma|Acceptation|Ratios|Sensiti...,Consorci Sanitari de Terrassa,...,02-22-161-062,July 2022,January 2023,July 2023,"June 30, 2022",,"July 8, 2022",,,https://ClinicalTrials.gov/show/NCT05439915
4,5,NCT04125316,Level of FeNO in Chinese Asthma Patients,,Recruiting,No Results Available,Asthma,"Other: Observation, no intervention",the level of FeNO|FeNO and risk of asthma exac...,Chinese University of Hong Kong,...,FeNO_study_protocol V1,"October 15, 2019","June 30, 2022","December 31, 2022","October 14, 2019",,"March 10, 2022","The Chinese University of Hong Kong, Hong Kong...",,https://ClinicalTrials.gov/show/NCT04125316


In [140]:
# Store URLs and NCT IDs separately
AS_url= AS_raw['URL']
AS_NCTID= AS_raw['NCT Number']

All inclusion criteria for a trial are extracted as a single list. Each criterion is a separate element of this list.

In [141]:
#Initialize an empty list to store inclusion criteria for each trial
## Extract only 250 studies to start with
n_study = 250
incl_list = list()
from tqdm import tqdm
for i in tqdm(range(n_study)):
  incl= criteria_ext(url=AS_url[i]).split('\n')
  incl_list.append(incl)

100%|██████████| 250/250 [01:01<00:00,  4.09it/s]


In [149]:
incl_list[10]

['Ages 12 to 21 years, inclusive, of both genders',
 'Physician diagnosis of persistent asthma or symptoms consistent with persistent asthma based on expert guidelines for diagnosis and management of asthma (1).',
 'Current use of a controller therapy such as an inhaled corticosteroid (ICS), ICS in combination with long-acting beta agonist (LABA), or leukotriene receptor antagonist (LTRA).',
 'Asthma is "not well controlled" (participant must have ≥1 of the following):',
 'Asthma Control Test (ACT) score <20,',
 'FEV1 <80% of predicted,',
 'Meets Global Initiative on Asthma (GINA) criteria for partly controlled or uncontrolled asthma (2):',
 'In the past 4 weeks, has the patient had:',
 'Daytime symptoms >2x/week?',
 'Any night waking due to asthma?',
 'SABA reliever needed >2x/week?',
 'Any activity limitation due to asthma?',
 '[0 = Well controlled; 1-2 = Partly controlled; 3-4 = Uncontrolled]',
 'A history of at least one exacerbation requiring systemic corticosteroids (oral, IM or 

In [150]:
# repeat clinical trial ID as many times as number of inclusion criteria
NCTID=[]
for i in range(len(incl_list)):
  ID=[AS_NCTID[i]]*len(incl_list[i])
  NCTID.extend(ID)

In [151]:
# flatten all inclusion criteria into a single long list
all_incl = [item for sublist in incl_list for item in sublist]

In [152]:
# check : total number of data points
print('Number of inclusion criteria:',len(all_incl))
print('\nNumber of NCTIDs:',len(NCTID))

Number of inclusion criteria: 1235

Number of NCTIDs: 1235


Convert the lists to a data frame.

In [155]:
AS_data=pd.DataFrame(list(zip(NCTID,all_incl)), columns=['Trial_ID','Incl_crit'])

Remove the records for which the extraction returned 'Error'.

In [156]:
AS_data= AS_data[AS_data['Incl_crit']!='Error']

In [161]:
AS_data.to_csv('/content/Asthma_incl.csv')

In [159]:
AS_data.shape

(1196, 2)

In [160]:
print('Number of studies dropped from first', n_study,'studies=', len(all_incl)- AS_data.shape[0])

Number of studies dropped from first 250 studies: 39
