## NCT ID extraction check

This script checks the use of a regex to extract NCT IDs from record names and descriptions.

The regex for extracting an NCT ID is: `r"NCT\d\d\d\d\d\d\d\d"`
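As a quick sanity check of this pattern, here is a minimal sketch that applies it with Python's `re` module to a few made-up strings. The sample text and the `find_nct_ids` helper are illustrative assumptions and are not part of the notebook. It also shows that the quantifier form `NCT\d{8}` finds the same IDs.

```python
import re

# Pattern from the notebook: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"NCT\d\d\d\d\d\d\d\d")
# Equivalent, more compact spelling using a quantifier.
NCT_PATTERN_COMPACT = re.compile(r"NCT\d{8}")


def find_nct_ids(text):
    """Return every NCT ID found in text (hypothetical helper)."""
    return NCT_PATTERN.findall(text)


# Made-up sample strings, for illustration only.
samples = [
    "Data from trial NCT01234567 and its follow-up NCT07654321.",
    "No registry number here.",
    "Truncated ID NCT1234 should not match.",
]
for sample in samples:
    print(find_nct_ids(sample))
```

Note that the pattern has no trailing boundary, so a longer digit run after `NCT` would still match on its first eight digits.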

In [2]:
import os
import json
import pandas as pd
import requests
import math
import time

In [None]:
sources = ["Harvard+Dataverse","Qiita","Mendeley","Zenodo","MassIVE","VEuPathDB","ClinEpiDB","NCBI+SRA",
           "Omics+Discovery+Index+(OmicsDI)","NCBI+GEO","The+Database+of+Genotypes+and+Phenotypes",
           "Human+Cell+Atlas"]
#sources = ["Harvard+Dataverse","Qiita","Mendeley"]
#sources = ["Zenodo"]

In [None]:
allresults = pd.DataFrame(columns=["_id","name","description"])
for eachsource in sources:
    r = requests.get(f'https://api-staging.data.niaid.nih.gov/v1/query?=&q=*NCT* AND includedInDataCatalog.name:"{eachsource}"&fields=_id,name,description&fetch_all=true')
    results = json.loads(r.text)
    tmpdf = pd.DataFrame(results['hits'])
    allresults = pd.concat((allresults,tmpdf),ignore_index=True)
    if results['total']>=500:
        i=0
        maxscrolls = math.ceil(results['total']/500)
        scroll_id = results['_scroll_id']
        while i < maxscrolls:
            try:
                r2 = requests.get(f'https://api-staging.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
                tmp = json.loads(r2.text)
                scroll_id = tmp['_scroll_id']
                tmpdf = pd.DataFrame(tmp['hits'])
                allresults = pd.concat((allresults,tmpdf),ignore_index=True)
                i=i+1
            except Exception:
                # stop scrolling if the request fails or the response has no '_scroll_id'/'hits'
                break
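The scroll loop above can be sketched as a small reusable helper. This is only a sketch under assumptions: `fetch_all_hits` and `fake_fetch` are hypothetical names, the response shape (`total`, `hits`, `_scroll_id`) is taken from the fields the cell reads, and the fake backend stands in for the real HTTP calls so the example runs offline.

```python
import math


def fetch_all_hits(fetch_page, page_size=500):
    """Collect hits from a scroll-style API.

    fetch_page(scroll_id) is a hypothetical callable returning a dict with
    'total', 'hits' and '_scroll_id' keys; scroll_id is None on the first call.
    """
    first = fetch_page(None)
    hits = list(first["hits"])
    total = first["total"]
    scroll_id = first.get("_scroll_id")
    # Upper bound on follow-up requests, mirroring ceil(total / 500) in the cell.
    max_scrolls = math.ceil(total / page_size)
    for _ in range(max_scrolls):
        if len(hits) >= total or scroll_id is None:
            break
        page = fetch_page(scroll_id)
        if not page.get("hits"):
            break
        hits.extend(page["hits"])
        scroll_id = page.get("_scroll_id", scroll_id)
    return hits


# Fake three-page backend (1200 integer "hits") standing in for the API.
def fake_fetch(scroll_id, _data=list(range(1200))):
    start = 0 if scroll_id is None else int(scroll_id)
    chunk = _data[start:start + 500]
    return {"total": len(_data), "hits": chunk, "_scroll_id": str(start + 500)}


all_hits = fetch_all_hits(fake_fetch)
print(len(all_hits))
```

Passing the fetch function in keeps the paging logic testable without network access; a real `fetch_page` would wrap the `requests.get` calls.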

In [None]:
testdf = allresults[['_id','name','description']].copy()
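# Fill missing values so a NaN name or description does not blank out the combined text
testdf['text'] = testdf['name'].fillna('')+'\n'+testdf['description'].fillna('')
# expand=False returns a Series for the single capture group
testdf['nctid'] = testdf['text'].str.extract(r'(NCT\d\d\d\d\d\d\d\d)', expand=False)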
clean = testdf.loc[~testdf['nctid'].isna()]

In [None]:
print(clean.head(n=2))
recordlist = clean[['_id','nctid']]
recordlist.to_csv(os.path.join('data','nctid_list.tsv'),sep='\t',header=True)

## Compare GPT Zenodo results with NCT extraction results

In [20]:
clinical_ids = pd.read_csv(os.path.join('data','zenodo_clinical_preds_ids.csv'))
print(clinical_ids.head(n=2))
print(len(clinical_ids))

               id
0  zenodo_3686539
1  zenodo_1138009
155719


In [21]:
gpt_data = pd.read_csv(os.path.join('data','zenodo_biobert_outputs.csv'))
print(gpt_data.head(n=2))
print(len(gpt_data))

              id                                               name  \
0  zenodo_573982                     California Synthetic Ecosystem   
1  zenodo_902790  Supplementary material 1 from: Schmidt BC (201...   

                                         description  medical_relevance_score  \
0  California synthetic population dataset consis...                 0.486943   
1  Table S1. Specimen data for mtDNA barcode vouc...                 0.006291   

   non_medical_score  
0           0.513057  
1           0.993708  
425094


In [22]:
medically_relevant = gpt_data.loc[gpt_data['medical_relevance_score']>0.9957]
print(len(medically_relevant))
med_ids = medically_relevant['id'].unique().tolist()

not_clinical = ['zenodo_3686539','zenodo_1138009','zenodo_1153738','zenodo_3578408','zenodo_3383144',
                'zenodo_1137168','zenodo_45128','zenodo_495244','zenodo_60899','zenodo_50007',
                'zenodo_159222','zenodo_159986','zenodo_55054','zenodo_3866488','zenodo_4560431']
clinical = ['zenodo_161544','zenodo_3441790','zenodo_56248','zenodo_5035745']
for each_id in not_clinical:
    if each_id in med_ids:
        print("Not clinical present")


print(med_ids[0:20])

581
Not clinical present
['zenodo_56248', 'zenodo_4560431', 'zenodo_5035745', 'zenodo_5784074', 'zenodo_5796053', 'zenodo_4980488', 'zenodo_5012813', 'zenodo_4088859', 'zenodo_5012782', 'zenodo_4956158', 'zenodo_4961743', 'zenodo_5002582', 'zenodo_5024104', 'zenodo_4947709', 'zenodo_6370554', 'zenodo_5009776', 'zenodo_5025284', 'zenodo_4404285', 'zenodo_5259372', 'zenodo_4934655']


In [None]:
search_results = pd.read_csv(os.path.join('data','nde-results-clinical.csv'),usecols=['_id', 'name','description'])
print(search_results.head(n=2))

Non-BERT classification approach:
The above results indicate that, once again, GPT fails when the name/description is too short. Additionally, the clinical vs. non-clinical classification results are very poor in general. Even at a threshold of 0.999, the results were just the records with the shortest name+description.

We can achieve a more relevant result than GPT by simply searching for 'clinical' in the NDE portal and filtering for Zenodo results. At least those results generally appear to be more relevant.

BERT-based classification approach:
The results appear to be much better than with the non-BERT approach. At a threshold of 0.9957, most of the known non-clinical datasets are filtered out, although the check above still prints "Not clinical present" once (zenodo_4560431 remains in `med_ids`).
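To make this kind of threshold check repeatable, one option is to report which known non-clinical and known clinical IDs survive a given cutoff. The sketch below does that on a toy table; the scores, the IDs, and the `leakage_at_threshold` helper are made up for illustration and do not reproduce the notebook's data.

```python
import pandas as pd


def leakage_at_threshold(scores, threshold, negatives, positives):
    """List known negatives/positives whose score exceeds the threshold.

    scores needs 'id' and 'medical_relevance_score' columns.
    """
    kept = set(scores.loc[scores["medical_relevance_score"] > threshold, "id"])
    return {
        "kept": len(kept),
        "negatives_kept": sorted(kept & set(negatives)),
        "positives_kept": sorted(kept & set(positives)),
    }


# Toy scores (illustrative values only).
toy_scores = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "medical_relevance_score": [0.9990, 0.9960, 0.5000, 0.9999],
})
report = leakage_at_threshold(
    toy_scores, 0.9957, negatives=["b", "c"], positives=["a", "d"]
)
print(report)
```

Running the same helper over several thresholds would show where known non-clinical records stop leaking through and how many known clinical ones are lost along the way.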

It would be interesting to see how the BERT-based classification approach compares with search results for terms such as:
* "randomized controlled trial"
* "randomized trial"
* "retrospective cohort"
* "prospective cohort"
* "interventional trial"
* "double blind study"
* "observational study"
* "cross-sectional study"
* "case-control study"

Filter out terms like "review", and save "species" so that records with non-human species can be filtered out later.
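The term filter mentioned above could be sketched as follows. This is an assumption-laden illustration: `build_term_mask` is a hypothetical helper, the toy records are invented, and whole-word matching is a swapped-in choice rather than the plain substring matching used elsewhere in this notebook.

```python
import re

import pandas as pd


def build_term_mask(text, terms):
    """Boolean mask: True where text contains any term as a whole word or phrase."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return text.fillna("").str.contains(pattern, case=False, regex=True)


# Invented records for illustration.
toy_records = pd.DataFrame({
    "_id": ["r1", "r2", "r3", "r4"],
    "name": [
        "A systematic review of asthma trials",
        "Randomized trial of drug X",
        "Peer-reviewed cohort data",
        None,
    ],
})
term_mask = build_term_mask(toy_records["name"], ["review", "soil"])
kept_ids = toy_records.loc[~term_mask, "_id"].tolist()
print(kept_ids)
```

With word boundaries, "Peer-reviewed" is kept because "review" is only part of a longer word; a plain substring search would reject it. Which behavior is preferable depends on how aggressive the exclusion should be.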

In [6]:
#test = allresults[['_id','includedInDataCatalog','species']].head(n=5).copy()
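def get_nested_name(row):
    """Pull 'name' values out of a stringified dict or list of dicts.

    Returns a list of names for string input and -1 for anything else (e.g. NaN).
    """
    if isinstance(row,str):
        chunks = row.split(',')
        nname = []
        for eachchunk in chunks:
            # any chunk containing 'name' is treated as a key:value pair
            # ('alternateName' has a capital N, so it is skipped)
            if 'name' in eachchunk:
                kv = eachchunk.split(':')
                nname.append(kv[1].strip().strip("'"))
    else:
        nname = -1
    return nname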

    
#test['sources'] = test.apply(lambda row: get_nested_name(row['includedInDataCatalog']),axis=1)
#test['speciesname'] = test.apply(lambda row: get_nested_name(row['species']),axis=1)
#print(test)
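Splitting on commas is fragile when values themselves contain commas or colons. As a possible alternative, the sketch below parses the stringified value with `ast.literal_eval`. It assumes the stored strings are valid Python literals (as the printed `{'@type': 'DataCatalog', ...}` output suggests); `get_names` and the sample strings are hypothetical. It only reads top-level `name` keys, so its results can differ from `get_nested_name` when nested objects also carry a `name`.

```python
import ast


def get_names(value):
    """Return the 'name' of each dict in a stringified dict or list of dicts.

    Returns -1 for missing or unparseable values, matching the -1 sentinel
    used by get_nested_name.
    """
    if not isinstance(value, str):
        return -1
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return -1
    if isinstance(parsed, dict):
        parsed = [parsed]
    if not isinstance(parsed, list):
        return -1
    return [item["name"] for item in parsed
            if isinstance(item, dict) and "name" in item]


# Invented sample values shaped like the printed columns.
catalog = "{'@type': 'DataCatalog', 'name': 'Zenodo'}"
species = "[{'alternateName': ['Human'], 'name': 'Homo sapiens'}, {'name': 'Mus musculus'}]"
print(get_names(catalog))
print(get_names(species))
print(get_names(float("nan")))
```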

In [None]:
clin_terms = ["randomized controlled trial","randomized trial","interventional trial",
              "prospective cohort","double blind study", "observational study", 
              "clinical study","cross-sectional study","retrospective cohort","case-control study"]
#clin_terms = ["randomized+controlled+trial"]


In [None]:
allresults = pd.DataFrame(columns = ["_id","name","description","includedInDataCatalog","species","_score","_ignored","searchphrase"])

for eachterm in clin_terms:
    r = requests.get(f'https://api.data.niaid.nih.gov/v1/query?=&q="{eachterm}"&fields=_id,name,description,includedInDataCatalog,species&fetch_all=true')
    results = json.loads(r.text)
    tmpdf = pd.DataFrame(results['hits'])
    print(eachterm, results['total'])
    tmpdf['searchphrase'] = eachterm
    allresults = pd.concat((allresults,tmpdf),ignore_index=True)
    if results['total']>=500:
        i=0
        scroll_id = results['_scroll_id']
        maxscrolls = math.ceil(results['total']/500)
        while i < maxscrolls:
            try:
                r2 = requests.get(f'https://api.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
                tmp = json.loads(r2.text)
                scroll_id = tmp['_scroll_id']
                tmpdf = pd.DataFrame(tmp['hits'])
                tmpdf['searchphrase'] = eachterm
                allresults = pd.concat((allresults,tmpdf),ignore_index=True)
                i=i+1
            except Exception:
                # stop scrolling if the request fails or the response has no '_scroll_id'/'hits'
                break

allresults.sort_values(by='_score', ascending=False, inplace=True)
allresults.drop_duplicates(subset='_id',keep='first',inplace=True)
clean_results = allresults[['_id','name',"description","includedInDataCatalog","_score","species","searchphrase"]].copy()
clean_results.to_csv(os.path.join('data','unfiltered_raw_search_results.tsv'),sep='\t',header=True)

In [2]:
allrawresults = pd.read_csv(os.path.join('data','unfiltered_raw_search_results.tsv'),delimiter='\t',header=0,index_col=0)
print(len(allrawresults))
print(allrawresults.tail(n=2))

100039
                    _id                                               name  \
91693         phs000185                  Genetic Studies in the Hutterites   
27278  phs000209.v10.p2  Multi-Ethnic Study of Atherosclerosis (MESA) C...   

                                             description  \
91693  We conducted genetic studies of disease-associ...   
27278  MESA The Multi-Ethnic Study of Atherosclerosis...   

                                   includedInDataCatalog    _score  \
91693  {'@type': 'DataCatalog', 'archivedAt': 'https:...  2.021154   
27278  {'@type': 'DataCatalog', 'archivedAt': 'https:...  1.985125   

                                                 species  \
91693  [{'alternateName': ['Human', 'Homo sapiens Lin...   
27278  [{'alternateName': ['Human', 'Homo sapian', 'H...   

                searchphrase  
91693  cross-sectional study  
27278     prospective cohort  


In [3]:
exempt_terms = ["review","literature","ecology","soil","biodiversity","forest","taxonomy",
                "ocean", "ice shelf", "space", "glacier", "marine", "conservation", "predation"]

In [13]:
reject_ids = set()
allrawresults['lowername'] = allrawresults['name'].astype(str).str.lower()
allrawresults['lowerdesc'] = allrawresults['description'].astype(str).str.lower()

for eachterm in exempt_terms:
    # plain substring match against the lower-cased name or description
    in_name = allrawresults['lowername'].str.contains(eachterm, regex=False)
    in_desc = allrawresults['lowerdesc'].str.contains(eachterm, regex=False)
    reject_ids.update(allrawresults.loc[in_name | in_desc, '_id'])
    print(eachterm, len(reject_ids))
reject_list = list(reject_ids)

filtered_results = allrawresults.loc[~allrawresults['_id'].isin(reject_list)]
print(len(filtered_results))

review 12032
literature 14099
ecology 14291
soil 14430
biodiversity 14469
forest 14864
taxonomy 14917
ocean 14957
ice shelf 14957
space 15549
glacier 15552
marine 15594
conservation 15687
predation 15695
84344


In [14]:
# Work on an explicit copy so adding columns does not trigger SettingWithCopyWarning
filtered_results = filtered_results.copy()
filtered_results['source'] = filtered_results['includedInDataCatalog'].apply(get_nested_name)
filtered_results['speciesname'] = filtered_results['species'].apply(get_nested_name)
cleanresults = filtered_results[['_id','name',"description","source","_score","speciesname","searchphrase"]].copy()
cleanresults.to_csv(os.path.join('data','filtered_search_results.tsv'),sep='\t',header=True)
print(cleanresults.head(n=2))

                        _id  \
41375  dde_39de363a93dfc491   
41376  dde_94c560d2fe14b1b1   

                                                    name  \
41375  Immune Tolerance Network TrialShare Clinical T...   
41376                         AIDS Clinical Trials Group   

                                             description  \
41375  ITN TrialShare shares information about ITN's ...   
41376  ACTG is a global clinical trials network that ...   

                        source     _score  \
41375  [Data Discovery Engine]  996.93054   
41376  [Data Discovery Engine]  994.64970   

                                 speciesname    searchphrase  
41375  [Data Discovery Engine, Homo sapiens]  clinical study  
41376  [Data Discovery Engine, Homo sapiens]  clinical study  


In [15]:
filteredresults = pd.read_csv(os.path.join('data','filtered_search_results.tsv'),delimiter='\t',header=0,index_col=0)
print(len(filteredresults))
print(filteredresults.tail(n=2))

84344
             _id                                            name  \
76983  phs003210  The Nephrotic Syndrome Study Network (NEPTUNE)   
91693  phs000185               Genetic Studies in the Hutterites   

                                             description  \
76983  The Nephrotic Syndrome Study Network (NEPTUNE)...   
91693  We conducted genetic studies of disease-associ...   

                                             source    _score  \
76983  ['The Database of Genotypes and Phenotypes']  2.048550   
91693  ['The Database of Genotypes and Phenotypes']  2.021154   

                                             speciesname  \
76983  ['Homo sapiens', 'Rattus norvegicus', 'Molva m...   
91693              ['Homo sapiens', 'Rattus norvegicus']   

                searchphrase  
76983         clinical study  
91693  cross-sectional study  


In [16]:
# Exclude records whose _id contains any of the clinical-repository markers
clin_repo_pattern = 'phs|vivli|clinepidb|dde|nichd|biotools|radx'
nonclinrepos = filteredresults.loc[~filteredresults['_id'].astype(str).str.contains(clin_repo_pattern)]
print(len(nonclinrepos))
print(nonclinrepos.tail(n=2))

nonclinrepos.to_csv(os.path.join('data','clin_in_nonclin_repos.tsv'),sep='\t',header=True)

72570
                    _id                                               name  \
106630        mtbls9087  Gut microbiota-derived butyric acid protected ...   
27267   model2307180001  Zerrouk2023 - Large scale computational modeli...   

                                              description  \
106630  BACKGROUND: Subarachnoid hemorrhage (SAH) is a...   
27267                                      No description   

                                     source    _score  \
106630  ['Omics Discovery Index (OmicsDI)']  2.496785   
27267   ['Omics Discovery Index (OmicsDI)']  2.217721   

                         speciesname        searchphrase  
106630  ['PubTator', 'Homo sapiens']  case-control study  
27267                             -1  prospective cohort  


In [17]:
# Keep records whose _id contains any of the clinical-repository markers
clin_repo_pattern = 'phs|vivli|clinepidb|dde|nichd|biotools|radx'
clinrepos = filteredresults.loc[filteredresults['_id'].astype(str).str.contains(clin_repo_pattern)]
print(len(clinrepos))
print(clinrepos.tail(n=2))

clinrepos.to_csv(os.path.join('data','clin_in_clin_repos.tsv'),sep='\t',header=True)

11774
             _id                                            name  \
76983  phs003210  The Nephrotic Syndrome Study Network (NEPTUNE)   
91693  phs000185               Genetic Studies in the Hutterites   

                                             description  \
76983  The Nephrotic Syndrome Study Network (NEPTUNE)...   
91693  We conducted genetic studies of disease-associ...   

                                             source    _score  \
76983  ['The Database of Genotypes and Phenotypes']  2.048550   
91693  ['The Database of Genotypes and Phenotypes']  2.021154   

                                             speciesname  \
76983  ['Homo sapiens', 'Rattus norvegicus', 'Molva m...   
91693              ['Homo sapiens', 'Rattus norvegicus']   

                searchphrase  
76983         clinical study  
91693  cross-sectional study  
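One caveat with the split above: `str.contains` matches the marker anywhere in the `_id`, so an ID from another repository that happens to contain, say, `dde` or `phs` would be counted as clinical. A minimal sketch of a prefix-based alternative follows. Treating the markers as ID prefixes is an assumption about how the IDs are formed, and `mendeley_xphsdde123` is an invented ID used only to show the difference.

```python
import pandas as pd

# Markers taken from the cells above; using them as prefixes is an assumption.
CLIN_PREFIXES = ("phs", "vivli", "clinepidb", "dde", "nichd", "biotools", "radx")

toy_ids = pd.DataFrame({"_id": [
    "phs000185",
    "dde_39de363a93dfc491",
    "mendeley_xphsdde123",  # invented ID containing 'phs' and 'dde' mid-string
    "zenodo_3712941",
]})

ids = toy_ids["_id"].astype(str)
substring_mask = ids.str.contains("|".join(CLIN_PREFIXES))
# Series.str.startswith accepts a tuple of prefixes in recent pandas versions.
prefix_mask = ids.str.startswith(CLIN_PREFIXES)

print(toy_ids.assign(substring=substring_mask, prefix=prefix_mask))
```

Comparing the two masks on the real `filteredresults` frame would show whether any records are affected before switching approaches.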


In [3]:
nonclinrepos = pd.read_csv(os.path.join('data','clin_in_nonclin_repos.tsv'),delimiter='\t',header=0,index_col=0)
clinrepos = pd.read_csv(os.path.join('data','clin_in_clin_repos.tsv'),delimiter='\t',header=0,index_col=0)
nonclinrepoids = nonclinrepos['_id'].unique()
with open(os.path.join('data','clin_ids_in_nonclin_repos.txt'),'w') as outwrite:
    for eachid in nonclinrepoids:
        outwrite.write(str(eachid)+"\n")
clinrepoids = clinrepos['_id'].unique()
with open(os.path.join('data','clin_ids_in_clin_repos.txt'),'w') as outwrite:
    for eachid in clinrepoids:
        outwrite.write(str(eachid)+"\n")

In [19]:
zenodo_subset = filteredresults.loc[filteredresults['_id'].astype(str).str.contains('zenodo')]
#print(zenodo_subset.tail(n=5))
print(zenodo_subset.loc[zenodo_subset['searchphrase']=='observational study'].tail(n=10))

                   _id                                               name  \
39468   zenodo_3712941  Dataset related to article "Oocyte Cryopreserv...   
39919  zenodo_13284412  CLINICAL PROFILE OF PATIENTS WITH HEMOPTYSIS A...   
40208   zenodo_3860012  Effectiveness of prone positioning in non-intu...   
40209   zenodo_3799739  Epidemiology, risk factors and clinical course...   
40214   zenodo_6198802  Multi-center observational study on occurrence...   
40443   zenodo_7778290  CONCEPT-DIABETES DATA MODEL TO ANALYSE HEALTHC...   
40451   zenodo_5140893  Chest x-ray in the COVID-19 pandemic: Radiolog...   
40454  zenodo_13352917  EXPLORING THE INFLUENCE OF ECO-FRIENDLY INITIA...   
40552  zenodo_14801683  Morbidity and mortality in adults with a Fonta...   
40642   zenodo_4524729  Data set from Moons P, Apers S, Kovacs AH, Tho...   

                                             description      source  \
39468  Objective: The aim of the present study is to ...  ['Zenodo']   
39919  B

In [23]:
zenodo_overlap = zenodo_subset.loc[zenodo_subset['_id'].isin(med_ids)]
print(len(zenodo_overlap))

78


In [24]:
species_results = nonclinrepos.loc[~nonclinrepos['speciesname'].astype(str).str.contains('-1')]
nonhuman_results = species_results.loc[~species_results['speciesname'].astype(str).str.contains('Homo')]
print(nonhuman_results)

                        _id  \
1445    mendeley_dm2y8ztgvm   
91813   mendeley_wzcsrcvswz   
2826    mendeley_hnx44vxv5k   
92102   mendeley_7yy955mh7r   
3282    mendeley_srfrr9vf8f   
...                     ...   
40722       dryad_f4qrfj6zd   
40726             gse171645   
75573              gse30552   
75576             gse185748   
106605            mtbls9100   

                                                     name  \
1445    Supplemental Material - A Randomized Controlle...   
91813   Data for: Configuration of short- and long-thr...   
2826    Distribution of vaginal sensors per cows per w...   
92102   Supplemental Material - Treatment of pityriasi...   
3282    Effects of manual perineal protection and push...   
...                                                   ...   
40722   Subtyping of common complex diseases and disor...   
40726   Integrated molecular landscape perturbations u...   
75573   Expression data from mice lacking SIRT3 under ...   
75576   Combinato

### Further filter down the clinical subsets by removing ones that have non-human species annotations

This strategy will not work well, because a percentage of records have non-human species annotations that are false positives from the EXTRACT pipeline. Nonetheless, it would be good to get some stats.

To do:
- get counts of records with no species info and subset by source
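A rough sketch of how this to-do item might be computed, using a toy frame shaped like the saved TSV (stringified `source` lists and `-1` for missing species). The rows are invented and the column handling is an assumption about how the real file reads back in.

```python
import pandas as pd

# Invented rows mimicking the 'source' and 'speciesname' columns after read_csv.
toy_species = pd.DataFrame({
    "source": ["['Zenodo']", "['Zenodo']", "['Mendeley']", "['Mendeley']", "['NCBI GEO']"],
    "speciesname": ["-1", "['Homo sapiens']", "-1", "-1", "['Mus musculus']"],
})

# Treat the "-1" sentinel (and true NaN) as "no species info".
no_species = toy_species["speciesname"].isna() | (
    toy_species["speciesname"].astype(str) == "-1"
)
summary = (
    toy_species.assign(no_species=no_species)
    .groupby("source")["no_species"]
    .agg(missing="sum", total="size")
    .reset_index()
)
summary["fraction_missing"] = summary["missing"] / summary["total"]
print(summary)
```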

In [None]:
# 'source' and 'speciesname' exist on filteredresults, not on the raw allresults frame
counts = filteredresults.groupby(['source','speciesname']).size().reset_index(name="counts")
counts.sort_values(by=['source','counts'], ascending=[True,False], inplace=True)
print(counts.head(n=2))
counts.to_csv(os.path.join('data','source_species_frequency.tsv'),sep='\t',header=True)

### Check for ecology, environmental studies for exclusion

In [None]:
nonclinrepos = pd.read_csv(os.path.join('data','clin_in_nonclin_repos.tsv'),delimiter='\t',header=0, index_col=0)
