## Pull trial info demo
- parses the individual XML files found in the static download from clinicaltrials.gov and pulls whatever fields we want
- the format of a field can vary between files because some records are missing information

In [1]:
# import necessary packages
import os
import xmltodict
import pprint
import pandas as pd
import re

The data is stored in separate XML files, each of which can be parsed into a nested dictionary. We could extract any of these headers; the eligibility criteria are found under the 'eligibility' key, spread over several sub-keys.
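Since each parsed file is a nested dictionary, one way to see which headers are available is to list every key path instead of reading the full `pprint` output. The sketch below assumes a hand-written stand-in for a parsed record (the `sample` dictionary is illustrative, not a real file), and `key_paths` is a hypothetical helper that is not part of the notebook.

```python
def key_paths(node, prefix=()):
    """Yield every path of keys that leads to a non-dict, non-list value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from key_paths(value, prefix + (key,))
    elif isinstance(node, list):
        # Repeated XML elements are parsed into lists, so index into them.
        for index, value in enumerate(node):
            yield from key_paths(value, prefix + (index,))
    else:
        yield prefix


# Hand-written stand-in for a parsed trial record (illustrative only).
sample = {
    'clinical_study': {
        'id_info': {'nct_id': 'NCT00000102'},
        'condition': 'Congenital Adrenal Hyperplasia',
        'eligibility': {
            'criteria': {'textblock': 'Inclusion Criteria: ...'},
            'gender': 'All',
            'minimum_age': '14 Years',
            'maximum_age': '35 Years',
            'healthy_volunteers': 'No',
        },
    }
}

paths = ['/'.join(str(part) for part in path) for path in key_paths(sample)]
for path in paths:
    print(path)
```

Each printed line is one route through the dictionary, so the criteria text shows up as `clinical_study/eligibility/criteria/textblock`.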

In [2]:
# Initial exploration
os.chdir('/Users/meldrumapple/Desktop/Capstone/AllPublicXML/NCT0000xxxx')
with open('NCT00000102.xml', 'r', encoding='utf-8') as doc:
    file = doc.read()
dct=xmltodict.parse(file)
pprint.pprint(dct)

{'clinical_study': {'brief_summary': {'textblock': 'This study will test the '
                                                   'ability of extended '
                                                   'release nifedipine '
                                                   '(Procardia XL), a blood\r\n'
                                                   '      pressure medication, '
                                                   'to permit a decrease in '
                                                   'the dose of glucocorticoid '
                                                   'medication children\r\n'
                                                   '      take to treat '
                                                   'congenital adrenal '
                                                   'hyperplasia (CAH).'},
                    'brief_title': 'Congenital Adrenal Hyperplasia: Calcium '
                                   'Channels as Therapeutic Targets',
       

This is our main function
- right now it pulls just the ID of the trial, the condition that the trial is about, the intervention, the age range, gender, whether healthy volunteers are accepted, and the criteria
- a lot of the trial records are incomplete, so we record NA when the information we are looking for does not exist
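The "record NA when the field is missing" idea can also be written as one small lookup helper instead of a separate `try`/`except` per field. This is a minimal sketch, not the notebook's implementation: `get_nested` is a hypothetical name, the `record` dictionary is hand-written, and the default is `None` here so the example needs no third-party imports (passing `default=pd.NA` would match the notebook).

```python
def get_nested(record, keys, default=None):
    """Follow `keys` through nested dicts/lists; return `default` if a step is missing."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # KeyError: key absent; IndexError: list too short;
            # TypeError: tried to index into a string, None, etc.
            return default
    return current


# Hand-written stand-in for a record that has no criteria section.
record = {
    'clinical_study': {
        'id_info': {'nct_id': 'NCT00000180'},
        'eligibility': {'gender': 'All'},
    }
}

nct_id = get_nested(record, ['clinical_study', 'id_info', 'nct_id'])
criteria = get_nested(record, ['clinical_study', 'eligibility', 'criteria', 'textblock'])
print(nct_id, criteria)
```

Catching only the lookup errors means an unrelated problem, such as a typo in a variable name, still surfaces instead of silently turning into NA.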

In [3]:
def pull_trial(filename):
    '''
    main function to pull information from each file and append it as a row to the global dataframe df
    filename: xml file name, e.g. 'NCT00000102.xml'
    return: None; the appended row is [nct id, condition, intervention, age range, gender, healthy participants, criteria]
    '''
    #move into the folder where the file is stored
    folder=filename[0:7]+'xxxx'
    os.chdir('/Users/meldrumapple/Desktop/Capstone/AllPublicXML/'+str(folder))
    #open file and convert to dictionary
    with open(filename, 'r', encoding='utf-8') as doc:
        file = doc.read()
    dct=xmltodict.parse(file)
    
    #Extract trial info
    try: 
        num=dct['clinical_study']['id_info']['nct_id'] #nct id
    except: 
        num=pd.NA
    
    #pull condition
    try: 
        condition= dct['clinical_study']['condition']
    except: 
        condition=pd.NA
    
    # pull intervention data - this can be formatted as a list of dictionaries or as a single dictionary
    try: 
        intv=[dct['clinical_study']['intervention']['intervention_name'],dct['clinical_study']['intervention']['intervention_type']]
    except:
        try:
            intv=[dct['clinical_study']['intervention'][0]['intervention_name'],dct['clinical_study']['intervention'][0]['intervention_type']]
        except:
            intv=pd.NA #sometimes trial has no intervention category at all
    
    #pull age range
    try:
        ages=[dct['clinical_study']['eligibility']['minimum_age'],dct['clinical_study']['eligibility']['maximum_age']]
    except: 
        ages=pd.NA
        
    #pull gender
    try:
        gender=dct['clinical_study']['eligibility']['gender']
    except:
        gender=pd.NA
        
    # pull healthy
    try: 
        healthy=dct['clinical_study']['eligibility']['healthy_volunteers']
    except:
        healthy=pd.NA
    
    # Extract criteria and clean up
    try: 
        criteria= dct['clinical_study']['eligibility']['criteria']['textblock']
        # Cleaning Criteria Text: 
        criteria=criteria.lower() # make lowercase
        criteria = re.sub(r'\d+\.', ' ', criteria) #remove numbering; note this also removes any number followed by a period, e.g. the '60.' in 'age 18-60.'
        for each in ['-','(',')',"'",':','i.e.','.','inclusion criteria', 'inclusion', ',']: #punctuation and header phrases to remove
            criteria=criteria.replace(each, '')
        criteria=criteria.split('exclusion criteria')
        if len(criteria)==1: 
            criteria=str(criteria[0]).split('exclusion')
        for i in range(len(criteria)):
            criteria[i]=re.sub(r'\s\s+', '##', criteria[i])
            criteria[i]=criteria[i].split('##')
            criteria[i]=[x for x in criteria[i] if x] #remove empty strings
    except:
        criteria=pd.NA
        
    # add all variables as a new row in the initialized dataframe (df must already exist as a global)
    df.loc[len(df.index)] = [num, condition, intv, ages, gender, healthy, criteria]
    return None
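The intervention lookup in the function handles two shapes: a single `intervention` element is parsed into a dictionary, while repeated elements are parsed into a list of dictionaries. A minimal sketch of normalizing both shapes before reading the first entry is shown below. `first_intervention` is a hypothetical helper, and unlike the function above it returns `None` for an individual missing key rather than dropping the whole field.

```python
def first_intervention(intervention):
    """Return [name, type] for the first intervention, or None if there is none."""
    if intervention is None:
        return None
    # One element parses to a dict, several parse to a list: wrap the dict case.
    entries = intervention if isinstance(intervention, list) else [intervention]
    if not entries:
        return None
    first = entries[0]
    return [first.get('intervention_name'), first.get('intervention_type')]


single = {'intervention_type': 'Drug', 'intervention_name': 'Nifedipine'}
several = [
    {'intervention_type': 'Drug', 'intervention_name': 'Desipramine'},
    {'intervention_type': 'Drug', 'intervention_name': 'Placebo'},  # illustrative second entry
]

print(first_intervention(single))
print(first_intervention(several))
print(first_intervention(None))
```

With the shape normalized up front, the same code path could also return every intervention by iterating over `entries` instead of taking only the first.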

This is an example of pulling a few files
- each call adds a row to the dataframe
- the criteria are sometimes not in the format that we want

In [4]:
#initialize empty df
df=pd.DataFrame(columns=['nct_id', 'condition', 'intervention','ages','gender','healthy','criteria']) 

In [5]:
#example of pulling a few files
pull_trial('NCT00000102.xml')
pull_trial('NCT00000277.xml')
pull_trial('NCT00000180.xml')
pull_trial('NCT00000271.xml')
df

| | nct_id | condition | intervention | ages | gender | healthy | criteria |
|---|---|---|---|---|---|---|---|
| 0 | NCT00000102 | Congenital Adrenal Hyperplasia | [Nifedipine, Drug] | [14 Years, 35 Years] | All | No | [[diagnosed with congenital adrenal hyperplasi... |
| 1 | NCT00000277 | Cocaine-Related Disorders | [Mazindol, Drug] | [18 Years, N/A] | All | No | [[please contact site for information]] |
| 2 | NCT00000180 | Memory Disorders | [AIT-082, Drug] | [N/A, N/A] | All | Accepts Healthy Volunteers | |
| 3 | NCT00000271 | [Cocaine-Related Disorders, Substance-Related ... | [Desipramine, Drug] | [18 Years, 60 Years] | All | No | [[meets dsmiv criteria for current cocaine dep... |


In [6]:
# some criteria are now formatted nicely
df['criteria'][0]

[['diagnosed with congenital adrenal hyperplasia cah',
  'normal ecg during baseline evaluation'],
 ['history of liver disease or elevated liver function tests',
  'history of cardiovascular disease']]

In [7]:
# some have duplicate spaces or newlines that don't indicate separate criteria, which makes them hard to split cleanly
with open('NCT00000271.xml', 'r', encoding='utf-8') as doc:
    file = doc.read()
dct=xmltodict.parse(file)
print(dct['clinical_study']['eligibility']['criteria']['textblock'])
df['criteria'][3]

Inclusion:

          1. Meets DSM-IV criteria for current cocaine dependence.

          2. Used cocaine at least one day in the past month.

          3. Currently meets DSM-IV criteria for Major Depression or Dysthymia.

          4. Depressive disorder is either:

               1. primary (antedates earliest lifetime substance abuse or

               2. persistent during 6 months of abstinence in the past or

               3. at least 3 months duration in the current episode

          5. Age 18-60.

          6. Able to give informed consent and comply with study procedures.

        Exclusion:

          1. Meets DSM-IV criteria for past mania (i.e. bipolar disorder), schizophrenia or any
             psychotic disorder other than transient psychosis due to drug abuse.

          2. History of seizures.

          3. History of allergic reaction to desipramine or imipramine.

          4. Chronic organic mental disorder.

          5. Significant current suicidal risk.

      

[['meets dsmiv criteria for current cocaine dependence',
  'used cocaine at least one day in the past month',
  'currently meets dsmiv criteria for major depression or dysthymia',
  'depressive disorder is either',
  'primary antedates earliest lifetime substance abuse or',
  'persistent during 6 months of abstinence in the past or',
  'at least 3 months duration in the current episode',
  'age 18',
  'able to give informed consent and comply with study procedures'],
 ['meets dsmiv criteria for past mania',
  'bipolar disorder schizophrenia or any',
  'psychotic disorder other than transient psychosis due to drug abuse',
  'history of seizures',
  'history of allergic reaction to desipramine or imipramine',
  'chronic organic mental disorder',
  'significant current suicidal risk',
  'pregnancy lactation or failure in sexually active female patients to use adequate',
  'contraceptive methods',
  'unstable physical disorders which might make participation hazardous such as',
  'hyperten
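One concrete cause of the damaged output above is the numbering pattern `\d+\.`, which matches any number followed by a period. In the source line `5. Age 18-60.` it removes both `5.` and `60.`, and after the hyphen is stripped the criterion becomes `age 18`. A possible alternative, sketched here as an assumption rather than the notebook's method, is to strip a number only when it opens a line. `strip_list_numbers` and `LEADING_NUMBER` are hypothetical names.

```python
import re

# Matches "1." or "12." only at the start of a line, after optional indentation.
LEADING_NUMBER = re.compile(r'^[ \t]*\d+\.[ \t]+', flags=re.MULTILINE)


def strip_list_numbers(text):
    """Remove list numbering at the start of lines and leave other numbers alone."""
    return LEADING_NUMBER.sub('', text)


raw = "  5. Age 18-60.\n\n  6. Able to give informed consent."

original_style = re.sub(r'\d+\.', ' ', raw)   # the pattern used in pull_trial
cleaned = strip_list_numbers(raw)

print(repr(original_style))
print(repr(cleaned))
```

This only addresses the numbering step. The wrapped lines and nested sub-lists in the raw text would still need their own handling.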

In [8]:
#re-initialize an empty df before the bulk pull
df=pd.DataFrame(columns=['nct_id', 'condition', 'intervention','ages','gender','healthy','criteria']) 

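There are 5757 files among the first 10000 NCT ids. We can currently parse 5754 of them, so 3 files have formatting that resists our approach. The failure count below also includes the 4243 ids that have no file at all, since trying to open a missing file raises an error too.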

In [9]:
#pull NCT ids 0 through 9999 and count the number of failures
fails=0
for i in range(0, 10000):
    j=str(i).zfill(8)             #left-pad the number with zeros to 8 digits
    try:
        pull_trial('NCT'+j+'.xml')
    except Exception:             #missing files and unparseable files both land here
        fails+=1

In [10]:
fails

4246
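The count of 4246 equals 10000 - 5754, so it combines two different cases: ids with no file (10000 - 5757 = 4243) and files that exist but fail to parse (3). A minimal sketch of counting the two separately is shown below. It assumes the same folder layout as above; `trial_path`, `tally`, and `toy_parse` are hypothetical names, and `toy_parse` stands in for `pull_trial` so the example can run against a temporary directory.

```python
import tempfile
from pathlib import Path


def trial_path(base_dir, number):
    """Build the expected path, e.g. <base_dir>/NCT0000xxxx/NCT00000102.xml."""
    nct_id = 'NCT' + str(number).zfill(8)
    return Path(base_dir) / (nct_id[:7] + 'xxxx') / (nct_id + '.xml')


def tally(base_dir, numbers, parse):
    """Count parsed, missing, and failed files for the given id numbers."""
    counts = {'parsed': 0, 'missing': 0, 'failed': 0}
    for number in numbers:
        path = trial_path(base_dir, number)
        if not path.exists():
            counts['missing'] += 1      # no file for this id
            continue
        try:
            parse(path)
            counts['parsed'] += 1
        except Exception:
            counts['failed'] += 1       # file exists but could not be processed
    return counts


def toy_parse(path):
    """Stand-in parser: treats an empty file as a bad record."""
    if not path.read_text():
        raise ValueError('empty file')


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / 'NCT0000xxxx'
    folder.mkdir()
    (folder / 'NCT00000001.xml').write_text('<clinical_study/>')
    (folder / 'NCT00000002.xml').write_text('')
    counts = tally(tmp, range(0, 5), toy_parse)

print(counts)
```

Building full paths with `pathlib` also avoids calling `os.chdir` for every file, which keeps the working directory unchanged for later cells.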

In [None]:
#df.to_csv('/Users/meldrumapple/Desktop/Capstone/sample')