## Fetching missing study design info

Some repositories only host clinical study data, and are missing any sort of study design info (which can be ingested to measTech). For these repositories, it will be possible to pull the corresponding NCT ID and use that to pull the study design information from clinicaltrials.gov

Info on the REST API for clinicaltrials.gov is here: https://clinicaltrials.gov/data-api/api#extapi
Note that NCT-ID's will generally be listed under 'identifier', but there may be more than one identifier listed per record.

These repos include:
- ACD@NIAID 
- Vivli 

In [24]:
import os
import pandas as pd
import json
import requests
import math
from datetime import datetime

In [3]:
script_path = os.getcwd()
parent_path = os.path.abspath(os.path.join(script_path, os.pardir))
result_path = os.path.join(script_path,'results')

### Generate mappings based on different combinations of clinical trial enum options

1. Generate a table with different combinations of the ENUM options
2. Map the different combinations to NCIT (whenever possible)
  * Search for matching options in OLS for NCIT. If unavailable, perform google search to see if the combinations are reasonable/possible.
    * Important, do NOT trust the AI generated summary without checking its sources. It can be flat out wrong.
  * https://docs.google.com/spreadsheets/d/1HMtmWAShwlKaCUCRiWNn6iIV6qv0ZhaODNIEShdWGV0/edit?gid=0#gid=0
3. Convert mapping into a dictionary by hashing the options into a single key and providing the array urls as the value

In [3]:
def generate_mapping_options():
    studytypes = ['OBSERVATIONAL','EXPANDED_ACCESS','INTERVENTIONAL']
    observationalmodeltypes = ['COHORT','CASE_CONTROL','CASE_ONLY','CASE_CROSSOVER',
                               'ECOLOGIC_OR_COMMUNITY','FAMILY_BASED','DEFINED_POPULATION',
                               'NATURAL_HISTORY','OTHER','NONE']
    observationaldesmodels = ['PATIENT_REGISTRY','NOT_REGISTRY','NONE']
    observationalmethodtypes = ['RETROSPECTIVE','PROSPECTIVE','CROSS_SECTIONAL','OTHER','NONE']
    interventionalmodeltypes = ['SINGLE_GROUP','PARALLEL','CROSSOVER','FACTORIAL','SEQUENTIAL','NONE']
    interventionaldesmodels = ['RANDOMIZED','NON_RANDOMIZED','NA']
    interventionalmethodtypes = ['NONE','SINGLE','DOUBLE','TRIPLE','QUADRUPLE']
    tmplist = []
    for eachtype in studytypes:
        studytype = eachtype
        if eachtype == 'OBSERVATIONAL':
            for eachmodel in observationalmodeltypes:
                studymodel = eachmodel
                for eachdesign in observationaldesmodels:
                    designmodel = eachdesign
                    for eachmethod in observationalmethodtypes:
                        designmethod = eachmethod
                        tmplist.append({
                        'studytype':studytype,
                        'studymodel':studymodel,
                        'designmodel':designmodel,
                        'designmethod':designmethod
                        }) 
        if eachtype == 'INTERVENTIONAL':
            for eachmodel in interventionalmodeltypes:
                studymodel = eachmodel
                for eachdesign in interventionaldesmodels:
                    designmodel = eachdesign
                    for eachmethod in interventionalmethodtypes:
                        designmethod = eachmethod
                        tmplist.append({
                        'studytype':studytype,
                        'studymodel':studymodel,
                        'designmodel':designmodel,
                        'designmethod':designmethod
                        }) 
        if eachtype == 'EXPANDED_ACCESS':
            studymodel = 'NA'
            designmodel = 'NA'
            designmethod = 'NA'
            tmplist.append({
                'studytype':studytype,
                'studymodel':studymodel,
                'designmodel':designmodel,
                'designmethod':designmethod
                })       
    tmpdf = pd.DataFrame(tmplist)
    return tmpdf      

In [4]:
tmpdf = generate_mapping_options()
tmpdf.to_csv(os.path.join(result_path,'clinical_combinations.tsv'),sep='\t',header=True)

The combinations of the different options were exported to a tab delimited file and then uploaded as a google spreadsheet where the manual mapping was performed.

The spreadsheet is at: https://docs.google.com/spreadsheets/d/1HMtmWAShwlKaCUCRiWNn6iIV6qv0ZhaODNIEShdWGV0/edit?gid=0#gid=0

## Using the mappings

### Formatting the mappings for ease of use

In [5]:
nct_map = pd.read_csv(os.path.join(result_path,'NCT_combo_mappings.csv'),delimiter=',',header=0).fillna('NA')
print(nct_map.head(n=2))

def generate_study_hash(row):
    study_hash = f"{row['studytype']}_{row['studymodel']}_{row['designmodel']}_{row['designmethod']}"
    return study_hash

def generate_IRI_list(row):
    IRI_list = [row['IRI']]
    if row['IRI.1'] != (None or 'NONE'):
        IRI_list.append(row['IRI.1'])
    if row['IRI.2'] != (None or 'NONE'):
        IRI_list.append(row['IRI.2'])
    return IRI_list

       studytype studymodel       designmodel   designmethod Exact match?  \
0  OBSERVATIONAL     COHORT  PATIENT_REGISTRY  RETROSPECTIVE           no   
1  OBSERVATIONAL     COHORT  PATIENT_REGISTRY    PROSPECTIVE           no   

                                          IRI  \
0  http://purl.obolibrary.org/obo/NCIT_C93228   
1  http://purl.obolibrary.org/obo/NCIT_C53308   

                                         IRI.1 IRI.2  
0  http://purl.obolibrary.org/obo/NCIT_C129000  NONE  
1  http://purl.obolibrary.org/obo/NCIT_C129000  NONE  


In [6]:
nct_map['studyhash'] = nct_map.apply(lambda row: generate_study_hash(row), axis=1)
nct_map['IRI_list'] = nct_map.apply(lambda row: generate_IRI_list(row), axis=1)
nct_map_ready = nct_map[['studyhash','IRI_list']].copy()
print(nct_map_ready.loc[nct_map_ready['studyhash'].str.contains('EXPANDED')])
print(nct_map_ready.head(n=2))

                    studyhash                                      IRI_list
150  EXPANDED_ACCESS_NA_NA_NA  [http://purl.obolibrary.org/obo/NCIT_C98722]
                                           studyhash  \
0  OBSERVATIONAL_COHORT_PATIENT_REGISTRY_RETROSPE...   
1  OBSERVATIONAL_COHORT_PATIENT_REGISTRY_PROSPECTIVE   

                                            IRI_list  
0  [http://purl.obolibrary.org/obo/NCIT_C93228, h...  
1  [http://purl.obolibrary.org/obo/NCIT_C53308, h...  


## Fetching applicable records

In [47]:
def fetch_records(repo_name,apibase):
    query_url = f'https://{apibase}.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"{repo_name}"&fields=_id,identifier&fetch_all=true'
    r = requests.get(query_url)
    cleanr = json.loads(r.text)
    hits = cleanr['hits']
    total_hits = cleanr['total']
    print('total hits for ',repo_name,': ',total_hits)
    df1 = pd.DataFrame(cleanr['hits'])
    print('total records in df1: ',len(df1))
    if total_hits > 500:
        scroll_id = cleanr['_scroll_id']
        i = 0 
        k = math.ceil(total_hits/500)
        while i < k:
            try:
                r2 = requests.get(f'https://{apibase}.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
                tmp = json.loads(r2.text)
                scroll_id = tmp['_scroll_id']
                tmpdf = pd.DataFrame(tmp['hits'])
                df1 = pd.concat((df1,tmpdf),ignore_index=True)
                #print(len(df1))
            except:
                print("attempt ", i, " failed")
            i = i+1
        print('total records in df1: ',len(df1))
    return df1

def pop_nctID (row):
    if row['identifier'] == None:
        return -1
    elif isinstance(row['identifier'],str):
        tmpid = row['identifier'].split('')
    elif isinstance(row['identifier'],list):
        tmpid = row['identifier']
    if len(tmpid)>0:
        nctid = [x for x in tmpid if x[0:3]=='NCT']
        
        if len(nctid)==1:
            return nctid[0]
        if len(nctid)==0:
            return -1
        else:
            return nctid
    else:
        return -1

### Pull the NCT-IDS for each qualifying record

In [14]:
clinical_repos = ["Vivli","AccessClinicalData%40NIAID"]
repo_name = clinical_repos[0]
apibase = 'api-staging'
print(repo_name)

Vivli


In [10]:
%time
df1 = fetch_records(repo_name,apibase)

CPU times: total: 0 ns
Wall time: 0 ns
total hits for  Vivli :  7705
total records in df1:  500
attempt  15  failed
total records in df1:  7705


In [48]:
df1['nctid'] = df1.apply(lambda row: pop_nctID(row), axis=1) ## pull out the nctids
tmp = df1.explode('nctid') ## ensure 1 nctid per row
tmpnctlist = tmp['nctid'].unique().tolist()
nctidlist = [x for x in tmpnctlist if x!=-1]
print('total nctids found + empty: ',len(tmpnctlist))
print('unique nctids found: ',len(nctidlist))

total nctids found + empty:  5903
unique nctids found:  5902


In [49]:
tmp.drop(columns=['_ignored','_score','identifier'],inplace=True)
df2 = tmp.loc[tmp['nctid']!=-1]
print(df2)

                                             _id        nctid
0     vivli_27755b7f-bcfa-47d2-a22a-265e03eb7e3b  NCT00535782
1     vivli_279f5183-b02a-4eed-9380-915439b33e43  NCT01261572
2     vivli_27b73679-733f-48e6-81f2-859be82cd364  NCT02428231
3     vivli_27cca688-a4ef-4291-afc4-d6769d221e05  NCT00749190
4     vivli_27d7abc7-27cd-4864-af85-fbb8e16a14b9  NCT00118209
...                                          ...          ...
7700  vivli_bc1b19fb-4db8-410a-a150-57b026a1a685  NCT00765895
7701  vivli_bc733042-4c71-46af-b669-4852a5889719  NCT00871741
7702  vivli_bceed22b-44a1-47bb-980e-5a37340e708d  NCT00191906
7703  vivli_bd142d21-0bcc-41e0-b4ae-5e88336dae49  NCT00317174
7704  vivli_bd193ab9-9d51-4de0-bf1d-a693788a0cb0  NCT04442490

[5902 rows x 2 columns]


### fetch the CT study details from clinicaltrials.gov

In [50]:
## pull study details from the REST API for clinicaltrials.gov
def parse_study_data(eachid):
    r = requests.get(f"https://clinicaltrials.gov/api/v2/studies/{eachid}?format=json&fields=StudyType%7CDesignAllocation%7CDesignInterventionModel%7CDesignObservationalModel%7CDesignMasking%7CDesignTimePerspective%7CPatientRegistry%7CExpandedAccessTypes")
    tmpdict = json.loads(r.text)
    studytype = tmpdict['protocolSection']['designModule']['studyType']
    designmodule = tmpdict['protocolSection']['designModule']
    if 'designInfo' not in designmodule.keys():
        studymodel = 'NA'
        designmodel = 'NA'
        designmethod = 'NA'
    else:
        if studytype == "OBSERVATIONAL":
            designinfo = designmodule['designInfo']
            if 'observationModel' in designinfo.keys():
                studymodel = designinfo['observationalModel']
            else:
                studymodel = 'NONE'
            if 'patientRegistry' in designmodule.keys():
                if designmodule['patientRegistry'] == True:
                    designmodel = 'PATIENT_REGISTRY'
                else:
                    designmodel = 'NOT_REGISTRY'
            else:
                designmodel = 'NONE'
            if 'timePerspective' in designinfo.keys():
                designmethod = designinfo['timePerspective']
            else:
                designmethod = 'NONE'
        elif studytype == "EXPANDED_ACCESS":
            studymodel = 'NA'
            designmodel = 'NA'
            designmethod = 'NA'
        else:
            designinfo = designmodule['designInfo']
            if 'interventionModel' in designinfo.keys():
                studymodel = designinfo['interventionModel']
            else:
                studymodel = 'NONE'
            if 'allocation' in designinfo.keys():     
                designmodel = designinfo['allocation']
            else:
                designmodel = 'NA'
            if 'maskInfo' in designinfo.keys():
                designmethod = designinfo['maskingInfo']['masking']
            else:
                designmethod = 'NONE'
    resultdict = {
        'nctid':eachid,
        'studytype': studytype,
        'studymodel': studymodel,
        'designmodel': designmodel,
        'designmethod': designmethod,
        'studyhash': f'{studytype}_{studymodel}_{designmodel}_{designmethod}'
    }
    return resultdict

## Currently there isn't much point to mapping the expanded access details as there aren't good mappings
## For this reason, so it was split out into its own module (in case this changes)
def parse_expanded_access(designmodule):
    if 'individual' in designmodule['expandedAccessTypes']:
        if designmodule['expandedAccessTypes']['individual'] == True:
            studymodel = 'individual'
        else:
            studymodel = 'not individual'
    else:
        studymodel = 'not individual'
    if 'intermediate' in designmodule['expandedAccessTypes']:
        if designmodule['expandedAccessTypes']['intermediate'] == True:
            designmodel = 'intermediate-size participant populations'
        else:
            designmodel = 'not intermediate-size participation'
    else:
        designmodel = 'not intermediate-size participation'
    if 'treatment' in designmodule['expandedAccessTypes']:
        if designmodule['expandedAccessTypes']['treatment'] == True:
            designmethod = 'expanded treatment IND'
        else:
            designmethod = 'not expanded treatment IND'
    else:
        designmethod = 'not expanded treatment IND'
    return studymodel, designmodel, designmethod

In [51]:
%%time
testidlist = ['NCT06836648','NCT06841874','NCT06799221','NCT06843330','NCT06843031','NCT06843356']

studyinfolist = []
for eachid in nctidlist:
#for eachid in testidlist:
    print(eachid)
    studyinfo = parse_study_data(eachid)
    studyinfolist.append(studyinfo)

studydf = pd.DataFrame(studyinfolist)
print(studydf)

NCT00535782
NCT01261572
NCT02428231
NCT00749190
NCT00118209
NCT02224846
NCT00355641
NCT00496015
NCT01518478
NCT00420927
NCT01044862
NCT00000482
NCT00452400
NCT00034853
NCT03670810
NCT00273052
NCT00113165
NCT02314117
NCT01405820
NCT02800642
NCT00005009
NCT03093324
NCT02683785
NCT00430521
NCT00325143
NCT04262921
NCT04134728
NCT01706263
NCT00134056
NCT01294592
NCT00938158
NCT00929331
NCT01363440
NCT01016483
NCT01061684
NCT00532298
NCT04288856
NCT03038100
NCT00058526
NCT00109330
NCT01405833
NCT05252468
NCT00305448
NCT00858780
NCT04209634
NCT00288119
NCT00449774
NCT01103063
NCT03021304
NCT00236665
NCT01316653
NCT00054925
NCT02801669
NCT00415194
NCT04401579
NCT00321854
NCT00683488
NCT00514943
NCT00411697
NCT01023581
NCT03031496
NCT00290966
NCT02131064
NCT04425902
NCT03562195
NCT00807040
NCT03088904
NCT02791438
NCT00806572
NCT03041311
NCT00197041
NCT02322671
NCT02075255
NCT02045953
NCT01645930
NCT02569801
NCT02410278
NCT00358488
NCT03170232
NCT03441984
NCT00419562
NCT00239421
NCT00389207
NCT0

NCT00169455
NCT02802345
NCT01563042
NCT00461097
NCT00325728
NCT00484198
NCT01005680
NCT01009047
NCT02453282
NCT02586805
NCT01453296
NCT00325169
NCT01215097
NCT00115765
NCT01767116
NCT00036309
NCT00191646
NCT00574873
NCT02892409
NCT02876835
NCT00673452
NCT00313170
NCT00118690
NCT00236353
NCT01137812
NCT00693966
NCT00236444
NCT01998906
NCT00619489
NCT00319696
NCT00500656
NCT02121483
NCT02064907
NCT00468546
NCT01068743
NCT00088465
NCT01970865
NCT01647711
NCT00536835
NCT00545688
NCT00564850
NCT01011868
NCT01002742
NCT01262365
NCT00806819
NCT00688519
NCT00308438
NCT01874665
NCT04323527
NCT01640990
NCT02215096
NCT00779766
NCT00286429
NCT00745901
NCT01702467
NCT00966004
NCT00518336
NCT01640080
NCT00408993
NCT00462709
NCT00274547
NCT01822639
NCT01041404
NCT01459367
NCT01177813
NCT02066298
NCT03661138
NCT00857766
NCT01596504
NCT00190840
NCT00735696
NCT00975195
NCT00352534
NCT01931475
NCT00473668
NCT00119743
NCT00140738
NCT01595438
NCT01769469
NCT01169402
NCT04335552
NCT00196989
NCT01156311
NCT0

NCT00283712
NCT00442039
NCT02433366
NCT02348658
NCT01527916
NCT00986648
NCT02230332
NCT00317538
NCT00560313
NCT01281527
NCT00023244
NCT00244387
NCT00320489
NCT01829464
NCT00629642
NCT00935532
NCT00778895
NCT02691325
NCT00391092
NCT00274937
NCT02302079
NCT03175367
NCT00761150
NCT00075231
NCT00707967
NCT00251758
NCT00274066
NCT00546871
NCT01696058
NCT00296530
NCT00763451
NCT02203032
NCT02990338
NCT00976560
NCT01238861
NCT01742234
NCT01152190
NCT02171637
NCT00808067
NCT00411229
NCT02846545
NCT01147926
NCT00453687
NCT00816166
NCT00434811
NCT00538473
NCT01248065
NCT00244322
NCT02969356
NCT02451943
NCT02564900
NCT00936065
NCT00102440
NCT00207740
NCT00657150
NCT00775437
NCT01192152
NCT00317109
NCT01014988
NCT01064414
NCT01946880
NCT00281567
NCT01711736
NCT01340911
NCT00325156
NCT01086384
NCT00866918
NCT00372593
NCT02751931
NCT00065845
NCT00593736
NCT00075829
NCT01855750
NCT01069913
NCT01357889
NCT00226421
NCT04614948
NCT02474355
NCT00937950
NCT02833974
NCT02666287
NCT00443651
NCT01318070
NCT0

NCT00566020
NCT00703846
NCT00792623
NCT00102804
NCT00231413
NCT00098722
NCT00696423
NCT03246230
NCT04330690
NCT02252172
NCT04569825
NCT00446849
NCT01144507
NCT01103960
NCT00109226
NCT00337818
NCT01892930
NCT00715403
NCT01846416
NCT02683239
NCT00716976
NCT00006459
NCT00937326
NCT00107029
NCT00003088
NCT02447276
NCT01323946
NCT01147302
NCT02698267
NCT02006537
NCT02296125
NCT00689351
NCT00856284
NCT00320788
NCT01227668
NCT02051335
NCT00259870
NCT00246220
NCT00871117
NCT00962780
NCT00549120
NCT01114880
NCT02774278
NCT00419094
NCT01808612
NCT02014519
NCT00000476
NCT00168779
NCT01998971
NCT02172378
NCT01171963
NCT00260936
NCT01299610
NCT00790907
NCT04372186
NCT01620528
NCT01017666
NCT00384033
NCT01381874
NCT02985398
NCT00829010
NCT01884519
NCT00361335
NCT02788474
NCT00790933
NCT01424163
NCT00056407
NCT01214239
NCT01453023
NCT00910962
NCT00578227
NCT00370318
NCT02242318
NCT01491919
NCT00706563
NCT00569062
NCT02446418
NCT00425854
NCT00231608
NCT03626714
NCT00430248
NCT00379340
NCT01342913
NCT0

NCT04492475
NCT00382772
NCT00814762
NCT00796653
NCT00731120
NCT01415518
NCT01049568
NCT02257736
NCT00473330
NCT00190775
NCT00753675
NCT01272180
NCT01323192
NCT00335738
NCT00183248
NCT00152997
NCT01203332
NCT02606877
NCT02687451
NCT00236756
NCT01506895
NCT00969436
NCT02135445
NCT00237289
NCT01928914
NCT00558064
NCT00499616
NCT02013206
NCT01887912
NCT00421733
NCT01258738
NCT01777243
NCT01258530
NCT00321620
NCT02551653
NCT00168805
NCT00043914
NCT00981409
NCT04323800
NCT00348140
NCT01712984
NCT02521493
NCT02466425
NCT00814307
NCT03311724
NCT03401671
NCT02075541
NCT03906045
NCT02172391
NCT00360646
NCT01916720
NCT01594125
NCT00740051
NCT00306943
NCT02360215
NCT00367835
NCT02613520
NCT00328627
NCT00771615
NCT00005274
NCT00288509
NCT02918071
NCT01014091
NCT00406640
NCT01736475
NCT00000491
NCT00333775
NCT00390221
NCT00197028
NCT00699699
NCT00087737
NCT00160550
NCT03325881
NCT00299546
NCT03017885
NCT01099449
NCT02859350
NCT00000555
NCT01257230
NCT02388724
NCT04322123
NCT02096263
NCT00678886
NCT0

NCT00376168
NCT02116530
NCT02367781
NCT00826280
NCT00094367
NCT00734071
NCT00744471
NCT02326272
NCT00452699
NCT00621582
NCT02584959
NCT00676052
NCT00574249
NCT01040689
NCT00502775
NCT01702571
NCT01213849
NCT00064662
NCT02551159
NCT00601900
NCT00207662
NCT02446496
NCT00153088
NCT00000606
NCT00207766
NCT00753714
NCT02293148
NCT01817075
NCT01194414
NCT01343966
NCT02100644
NCT00197236
NCT01371799
NCT00504881
NCT00455572
NCT00372112
NCT00398801
NCT01206062
NCT01957163
NCT01217229
NCT00114413
NCT00349531
NCT01152437
NCT02788279
NCT01904383
NCT00133848
NCT00721396
NCT01762774
NCT02934607
NCT02625207
NCT01940965
NCT00249132
NCT00246233
NCT01488071
NCT01325571
NCT00235859
NCT00190814
NCT00789373
NCT02321436
NCT00091910
NCT03448419
NCT00144196
NCT00591578
NCT00347360
NCT03979313
NCT01842620
NCT02239120
NCT02194179
NCT00830869
NCT02604017
NCT00569127
NCT00333034
NCT00552279
NCT00391443
NCT00262600
NCT01646177
NCT05152485
NCT02420821
NCT01340664
NCT00316706
NCT02509104
NCT02348372
NCT00650832
NCT0

NCT00671125
NCT00044577
NCT01109004
NCT01587807
NCT00947115
NCT02429791
NCT01722045
NCT01035515
NCT00135486
NCT00453154
NCT00622284
NCT00949091
NCT00989612
NCT01550744
NCT00520676
NCT00569153
NCT00456807
NCT00279747
NCT03312907
NCT00639158
NCT01207440
NCT00216606
NCT01829711
NCT00957268
NCT02118792
NCT00307541
NCT00385840
NCT00647491
NCT01983813
NCT00195702
NCT03057977
NCT00734539
NCT01604928
NCT00523328
NCT00005300
NCT01988961
NCT00632736
NCT02651194
NCT01743001
NCT03532009
NCT00627393
NCT03493854
NCT00338884
NCT01386658
NCT00424268
NCT03984825
NCT00799487
NCT00240500
NCT01974999
NCT01196052
NCT00048568
NCT01261611
NCT00386009
NCT00589914
NCT00432276
NCT00903331
NCT00236431
NCT02879305
NCT00515099
NCT02382640
NCT00460265
NCT00144287
NCT04138056
NCT02497937
NCT03329937
NCT01790594
NCT03434977
NCT04015518
NCT01454739
NCT01328756
NCT00103454
NCT02054481
NCT00271882
NCT03951753
NCT01251653
NCT00195689
NCT01879423
NCT01166438
NCT03266172
NCT01371734
NCT00738920
NCT00816660
NCT00843024
NCT0

NCT00969228
NCT00531193
NCT01504412
NCT02793622
NCT02742766
NCT01691547
NCT00580606
NCT03478657
NCT00603746
NCT01667419
NCT00103922
NCT00510146
NCT00688688
NCT02033668
NCT00096733
NCT01366560
NCT00246025
NCT00703326
NCT02187055
NCT00938392
NCT00210912
NCT01145625
NCT00950300
NCT02097745
NCT04425629
NCT01499290
NCT02254447
NCT00000459
NCT01772199
NCT00896480
NCT00370097
NCT03095638
NCT01649297
NCT01141712
NCT01517984
NCT00307125
NCT00092911
NCT01485406
NCT01008995
NCT04435483
NCT00475358
NCT02993861
NCT02258542
NCT01313312
NCT00024921
NCT01071083
NCT00549302
NCT03178487
NCT01810692
NCT00675584
NCT02183415
NCT02750410
NCT01204658
NCT02155660
NCT01529515
NCT01483625
NCT01624233
NCT01564537
NCT00339183
NCT00282308
NCT00885118
NCT00152958
NCT03292588
NCT01088412
NCT00768079
NCT00247962
NCT00004560
NCT01086449
NCT00891176
NCT02337725
NCT01010750
NCT03948581
NCT00293176
NCT00549900
NCT01499095
NCT00883909
NCT02831517
NCT01894386
NCT01436110
NCT01673555
NCT00370591
NCT00356174
NCT01323803
NCT0

NCT03231709
NCT02207530
NCT02553317
NCT01430403
NCT01336608
NCT03693612
NCT00337571
NCT00333450
NCT01431963
NCT01721057
NCT00755846
NCT02182102
NCT03129100
NCT01453998
NCT01183780
NCT03170271
NCT02829307
NCT00363870
NCT01533922
NCT00354835
NCT01498679
NCT02972905
NCT00104299
NCT00885378
NCT00650390
NCT02072096
NCT00246636
NCT01165138
NCT02424539
NCT00672958
NCT01051661
NCT01519791
NCT00670501
NCT02462759
NCT04177108
NCT01850524
NCT00049517
NCT00637195
NCT00333788
NCT00289783
NCT00197184
NCT01591122
NCT02181335
NCT02636595
NCT00558571
NCT01689142
NCT00920803
NCT03355820
NCT01863550
NCT00460512
NCT01862328
NCT02070991
NCT00566631
NCT00235872
NCT01573442
NCT01418352
NCT02066129
NCT00364351
NCT00144872
NCT00292448
NCT00067613
NCT01019694
NCT00433511
NCT02291874
NCT00546078
NCT00139334
NCT00000616
NCT02138747
NCT00488319
NCT00842348
NCT00269997
NCT02922751
NCT00920660
NCT00000551
NCT00149214
NCT02709746
NCT01947036
NCT02706873
NCT01433393
NCT02913326
NCT03000530
NCT02377414
NCT00274534
NCT0

### standardize the CT measTech data according to the mapping

In [52]:
studydf_mapped = studydf.merge(nct_map_ready,on='studyhash',how='left')
print(studydf_mapped.head(n=2))

         nctid       studytype studymodel designmodel designmethod  \
0  NCT00535782  INTERVENTIONAL   PARALLEL  RANDOMIZED         NONE   
1  NCT01261572  INTERVENTIONAL   PARALLEL  RANDOMIZED         NONE   

                                 studyhash  \
0  INTERVENTIONAL_PARALLEL_RANDOMIZED_NONE   
1  INTERVENTIONAL_PARALLEL_RANDOMIZED_NONE   

                                            IRI_list  
0  [http://purl.obolibrary.org/obo/NCIT_C82639, h...  
1  [http://purl.obolibrary.org/obo/NCIT_C82639, h...  


### link the CT measTech data to the record

In [53]:
df3 = df2.merge(studydf_mapped,on='nctid',how='left')
print(df3.head(n=2))

                                          _id        nctid       studytype  \
0  vivli_27755b7f-bcfa-47d2-a22a-265e03eb7e3b  NCT00535782  INTERVENTIONAL   
1  vivli_279f5183-b02a-4eed-9380-915439b33e43  NCT01261572  INTERVENTIONAL   

  studymodel designmodel designmethod  \
0   PARALLEL  RANDOMIZED         NONE   
1   PARALLEL  RANDOMIZED         NONE   

                                 studyhash  \
0  INTERVENTIONAL_PARALLEL_RANDOMIZED_NONE   
1  INTERVENTIONAL_PARALLEL_RANDOMIZED_NONE   

                                            IRI_list  
0  [http://purl.obolibrary.org/obo/NCIT_C82639, h...  
1  [http://purl.obolibrary.org/obo/NCIT_C82639, h...  


### Export the results

In [54]:
df4 = df3[['_id','IRI_list']].copy()
df4.rename(columns={"IRI_list":'measurementTechniqueIRIs'},inplace=True)
df4.to_csv(os.path.join(result_path,f'{repo_name}_measTechIRIs_from_CT.tsv'),sep='\t',header=True)

## Getting statistics on the augmentation potential of this approach

This approach will only work for Clinical study datasets that have an associated identifier (nctid) from the National Clinical Trials registry (clinicaltrials.gov). This identifier will generally begin with 'NCT'.

To get statistics on the expected level of improvement by this approach:
1. Get total number of records from repositories expected to be hosting Clinical study datasets
2. Get total number of records in these repositories that have a measurementTechnique value
3. Get total number of records in these repositories that have an nctid but no measurementTechnique value

Dump result:
Repo | Total records | with measTech | without measTech | without measTech, but with nctid

In [5]:
clinrepos = ["Vivli","AccessClinicalData%40NIAID","RADx+Data+Hub","The+Database+of+Genotypes+and+Phenotypes" ]
repo_name = clinrepos[0]
apibase = 'api-staging'

In [7]:
query_url = f'https://{apibase}.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"{repo_name}"AND+identifier:NCT*&fetch_all=true'
r = requests.get(query_url)
cleanr = json.loads(r.text)
hits = cleanr['hits']
total_hits = cleanr['total']
print(total_hits)

5921


In [21]:
parameterdict = {
    "total":"",
    "with_nctid":"+AND+identifier:NCT*",
    "with_measTech":"+AND+_exists_:measurementTechnique.name",
    "without_measTech":"+AND+-_exists_:measurementTechnique.name",
    "no_measTech_has_nctid":"+AND+identifier:NCT* AND -_exists_:measurementTechnique.name"
}

In [22]:
def generate_stats(parameterdict, apibase, repo_name):
    stats_dict = {"repo_name":repo_name}
    for eachkey in list(parameterdict.keys()):
        query_url = f'https://{apibase}.data.niaid.nih.gov/v1/query?q=includedInDataCatalog.name:"{repo_name}"{parameterdict[eachkey]}&fetch_all=true'
        r = requests.get(query_url)
        cleanr = json.loads(r.text)
        total_hits = cleanr['total']
        stats_dict[eachkey] = total_hits
    return stats_dict

In [23]:
%%time
repo_stats = []
for eachrepo in clinrepos:
    repo_stats.append(generate_stats(parameterdict,apibase,eachrepo))

stats_df = pd.DataFrame(repo_stats)
print(stats_df)

                                  repo_name  total  with_nctid  with_measTech  \
0                                     Vivli   7725        5921              0   
1                AccessClinicalData%40NIAID     10           8              0   
2                             RADx+Data+Hub    178           0            178   
3  The+Database+of+Genotypes+and+Phenotypes   2523           0           2509   

   without_measTech  no_measTech_has_nctid  
0              7725                   5921  
1                10                      8  
2                 0                      0  
3                14                      0  
CPU times: total: 3.69 s
Wall time: 19.3 s


In [26]:
today = datetime.now()
stats_df.to_csv(os.path.join(result_path,f'augmentation_potential_stats_{datetime.strftime(today,"%Y.%m.%d")}.tsv'),sep='\t',header=True)


## Testing stuff

In [45]:
r = requests.get("https://clinicaltrials.gov/api/v2/studies/NCTP000001745?format=json&fields=StudyType%7CDesignAllocation%7CDesignInterventionModel%7CDesignObservationalModel%7CDesignMasking%7CDesignTimePerspective%7CPatientRegistry%7CExpandedAccessTypes")

print(r.text)

Parameter `nctId` has incorrect format
