Parsing the 'eligibility' block. Separate the 'inclusion' and 'exclusion' criteria into a dictionary in which each criterion is one item of a list stored as the dictionary value:
{'inclusion criteria':\[item1, item2, item3\],'exclusion criteria':\[item1,item2,item3\]} <br>
The final output will be a CSV with the columns NCTID, criteria_item, and an inclusion/exclusion flag.

In [1]:
import csv
import re
import pandas as pd

In [3]:
# use the xmltodict output as the input file (raw string so the backslash is kept literally)
data = pd.read_csv(r"output\CT69_xmltodict_01_criteria.csv", sep="\t", names=["NCTID", "inclusion_exclusion"])
# row 0 holds the file's first (header) line, which is not part of the data
criteria_data = data.drop(0, axis=0)
criteria_data.head()

Unnamed: 0,NCTID,inclusion_exclusion
1,NCT02599389,Inclusion Criteria: - Patient of age > 18 year...
2,NCT02710656,Inclusion Criteria: 1. Patient must sign the i...
3,NCT01221610,"Inclusion Criteria: 1. Age ≥ 50 years, 2. Info..."
4,NCT01867736,Inclusion Criteria: 1. Subject has provided wr...
5,NCT00696956,Inclusion Criteria: - Age between 18 and 95 ye...
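
The header line can also be skipped at read time instead of dropping row 0 afterwards. The following is a minimal sketch, assuming the file's first line is a header; the in-memory sample and its NCT number are hypothetical placeholders, not rows from the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated input file.
sample = "id\ttext\nNCT00000001\tInclusion Criteria: - A - B Exclusion Criteria: - C\n"

# header=0 marks the first line as a header, and `names` replaces it,
# so the header line never becomes a data row.
demo = pd.read_csv(io.StringIO(sample), sep="\t", header=0,
                   names=["NCTID", "inclusion_exclusion"])
print(demo.shape)
```

With this variant the index starts at 0 rather than 1, which only matters if later code relies on the index labels.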


In [189]:
# Parsing the inclusion_exclusion text block:
# 1. separate it into an inclusion text string and an exclusion text string.
# 2. deal with the following situations: 1) the criteria items are separated by " - "
#                                        2) the criteria items are separated by numerical bullets (" 1. ", " 2. ", ...)
#                                        3) the criteria items are separated by ";"

def extractCriteria (text):
    '''
    Split an inclusion or exclusion text string into its individual criteria.
    Param: text --> inclusion or exclusion string, as returned by splitCriteria
    return: text_list --> list of criteria items
    '''
    text_list = []
    # items separated by " - "
    pattern_1 = re.compile(r'\s{1}-\s{1}.{5}')
    # items separated by numerical bullets such as " 1. "
    # NOTE: [and|or|\>|\<] is a character class, so the lookbehind rejects a bullet
    # preceded by any single one of the characters a, n, d, o, r, |, > or <.
    pattern_2 = re.compile(r'(?<![and|or|\>|\<])\s{1}[0-9]{1,2}\.{1}\s*.{5}')
    # items separated by ";"
    pattern_3 = re.compile(r'.{5};.{5}')
    if len(pattern_1.findall(text)) >= 2:
        if re.match(r'^ -', text):
            text = re.sub(r'^ -\s*', '', text)
        text_list = text.split(' - ')
    elif len(pattern_2.findall(text)) >= 3:
        if re.match(r'^[0-9]', text):
            text = re.sub(r'^[0-9]\. ', '', text)
        text_list = re.split(r'(?<![and|or|\>|\<])\s{1}[0-9]{1,2}\.{1}\s*', text)
    elif len(pattern_3.findall(text)) >= 3:
        text_list = re.split(';', text)
    else:
        text_list = [text]
    if len(text_list) <= 1:
        print("warning!  " + text + "  may not be processed.")
    return text_list

#split the text into inclusion criteria and exclusion criteria
def splitCriteria (text):
    '''
    Split text into inclusion criteria and exclusion criteria.
    Param: text --> column 'inclusion_exclusion' from the csv file
    return: two text strings, inclusion and exclusion
    '''
    inclusion = ""
    exclusion = ""
    # re.search finds the heading anywhere in the text; re.match only tests the
    # start of the string, where the inclusion heading sits.
    if re.search('General Exclusion Criteria', text):
        parts = text.split('General Exclusion Criteria')
        inclusion = parts[0]
        # replace the colon and whitespace left after the heading with one space
        exclusion = re.sub(r'^\s*:?\s*', ' ', parts[1])
    elif re.search('Exclusion Criteria', text):
        parts = text.split('Exclusion Criteria')
        inclusion = parts[0]
        exclusion = re.sub(r'^\s*:?\s*', ' ', parts[1])
    else:
        inclusion = text
    return (inclusion, exclusion)
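
In a regular expression, a bracket expression matches exactly one character, so the lookbehind `(?<![and|or|\>|\<])` used for numbered bullets rejects a bullet preceded by any of the characters a, n, d, o, r, |, > or <. If the intent was to avoid splitting after the words "and" / "or" or after a comparison sign (an assumption, not stated in the notebook), one possible word-level variant is sketched below. The names `NUMBERED_BULLET` and `split_numbered` and the sample text are hypothetical.

```python
import re

# Assumption: do not split when the number follows "and", "or", ">" or "<".
# Each alternative gets its own lookbehind because Python's re module
# requires every lookbehind to have a fixed width.
NUMBERED_BULLET = re.compile(r'(?<!\band)(?<!\bor)(?<![<>])\s[0-9]{1,2}\.\s*')

def split_numbered(text):
    """Split on ' 1. ', ' 2. ', ... and drop empty fragments."""
    return [part.strip() for part in NUMBERED_BULLET.split(text) if part.strip()]

# Hypothetical sample: "signed" ends in "d", and two numbers follow "and" / ">".
sample = " 1. Consent signed 2. Age between 18 and 80. 3. Stenosis > 70. 4. No prior stent"
print(split_numbered(sample))
```

The trade-off is that this variant splits more often than the character-class version, so it should be checked against real trial texts before replacing the pattern in `extractCriteria`.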


In [141]:
# avoid naming the variable `list`, which would shadow the built-in type
criteria_texts = criteria_data['inclusion_exclusion']
inclusion_list = []
exclusion_list = []
for text in criteria_texts:
    inclusion = ""
    exclusion = ""
    if re.search('General Exclusion Criteria', text) is not None:
        parts = text.split('General Exclusion Criteria')
        inclusion = parts[0]
        exclusion = parts[1]
        # replace the colon and whitespace left after the heading with one space
        exclusion = re.sub(r'^\s*:?\s*', ' ', exclusion)
    elif re.search('Exclusion Criteria', text) is not None:
        parts = text.split('Exclusion Criteria')
        inclusion = parts[0]
        exclusion = parts[1]
        exclusion = re.sub(r'^\s*:?\s*', ' ', exclusion)
    else:
        inclusion = text
    inclusion_list.append(inclusion)
    exclusion_list.append(exclusion)

criteria_data['inclusion'] = inclusion_list
criteria_data['exclusion'] = exclusion_list

criteria_data.head()

Unnamed: 0,NCTID,inclusion_exclusion,inclusion,exclusion
1,NCT02599389,Inclusion Criteria: - Patient of age > 18 year...,Inclusion Criteria: - Patient of age > 18 year...,- Life expectancy >18 months - Patient alread...
2,NCT02710656,Inclusion Criteria: 1. Patient must sign the i...,Inclusion Criteria: 1. Patient must sign the i...,1. Patient is already included in this study ...
3,NCT01221610,"Inclusion Criteria: 1. Age ≥ 50 years, 2. Info...","Inclusion Criteria: 1. Age ≥ 50 years, 2. Info...",1. Co-morbid conditions limiting life expecta...
4,NCT01867736,Inclusion Criteria: 1. Subject has provided wr...,Inclusion Criteria: 1. Subject has provided wr...,1. Flow-limiting (> 50% DS) inflow lesion pro...
5,NCT00696956,Inclusion Criteria: - Age between 18 and 95 ye...,Inclusion Criteria: - Age between 18 and 95 ye...,- Disease associated with life-expectancy les...


In [190]:
#parse inclusion
inclusion_item = []
for i in inclusion_list:
    i = re.sub(r'^Inclusion Criteria?:*\s*',' ',i)
    y = extractCriteria(i)
    inclusion_item.append(y)

#parse exclusion
exclusion_item = []
for j in exclusion_list:
    x = extractCriteria(j)
    #print (x)
    exclusion_item.append(x)



In [191]:
criteria_data['inclusion_item'] = inclusion_item
criteria_data['exclusion_item'] = exclusion_item

criteria_data.head()

Unnamed: 0,NCTID,inclusion_exclusion,inclusion,exclusion,inclusion_item,exclusion_item
1,NCT02599389,Inclusion Criteria: - Patient of age > 18 year...,Inclusion Criteria: - Patient of age > 18 year...,- Life expectancy >18 months - Patient alread...,"[Patient of age > 18 years, Patient with previ...","[Life expectancy >18 months, Patient already i..."
2,NCT02710656,Inclusion Criteria: 1. Patient must sign the i...,Inclusion Criteria: 1. Patient must sign the i...,1. Patient is already included in this study ...,"[, Patient must sign the informed consent form...","[, Patient is already included in this study (..."
3,NCT01221610,"Inclusion Criteria: 1. Age ≥ 50 years, 2. Info...","Inclusion Criteria: 1. Age ≥ 50 years, 2. Info...",1. Co-morbid conditions limiting life expecta...,"[ 1. Age ≥ 50 years, 2. Informed consent signe...","[, Co-morbid conditions limiting life expectan..."
4,NCT01867736,Inclusion Criteria: 1. Subject has provided wr...,Inclusion Criteria: 1. Subject has provided wr...,1. Flow-limiting (> 50% DS) inflow lesion pro...,"[, Subject has provided written informed conse...","[, Flow-limiting (> 50% DS) inflow lesion prox..."
5,NCT00696956,Inclusion Criteria: - Age between 18 and 95 ye...,Inclusion Criteria: - Age between 18 and 95 ye...,- Disease associated with life-expectancy les...,"[Age between 18 and 95 years,, peripheral vasc...",[Disease associated with life-expectancy less ...
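
Several of the item lists above start with an empty string, because the text begins with a bullet and the split returns whatever precedes it. A small clean-up step could remove those fragments. This is a sketch only; `clean_items` and the sample list are hypothetical names and values, not part of the notebook.

```python
def clean_items(items):
    """Strip surrounding whitespace and drop empty strings from one item list."""
    return [item.strip() for item in items if item and item.strip()]

# Hypothetical list shaped like the output above (leading empty string from the split).
raw_items = ['', 'Patient must sign the informed consent form ', ' Age over 18']
print(clean_items(raw_items))
```

It could be applied to a whole column with `criteria_data['inclusion_item'].apply(clean_items)`.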


In [194]:
# create a new dataframe with one row per (NCTID, inclusion item) pair
new_data_inclu = criteria_data[['NCTID','inclusion_item']]
NCTIDs = []
inclusion_items = []
for __, row in new_data_inclu.iterrows():
    NCTID = row.NCTID
    for inclusion in row.inclusion_item:
        NCTIDs.append(NCTID)
        inclusion_items.append(inclusion)
inclusion_to_NCTID = pd.DataFrame({"NCTID":NCTIDs,"Inclusion Items":inclusion_items, "inclusion/exclusion":"inclusion"})
inclusion_to_NCTID.head()

Unnamed: 0,Inclusion Items,NCTID,inclusion/exclusion
0,Patient of age > 18 years,NCT02599389,inclusion
1,Patient with previously peripheral implanted s...,NCT02599389,inclusion
2,Patient who received this stent between 3-36 m...,NCT02599389,inclusion
3,Patient with one or more in-stent restenosis l...,NCT02599389,inclusion
4,Reference vessel diameter between 4 and 7 mm,NCT02599389,inclusion
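
The table above covers inclusion items only, while the stated goal is a single table with a flag column. One way to build it, assuming pandas 0.25 or later for `DataFrame.explode`, is sketched here. The frame `demo`, the helper `to_long`, and the output column names `criteria_item` and `inclusion_exclusion_flag` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-trial frame with the same column layout as criteria_data.
demo = pd.DataFrame({
    "NCTID": ["NCT00000001", "NCT00000002"],
    "inclusion_item": [["Age > 18", "Signed consent"], ["Age > 50"]],
    "exclusion_item": [["Pregnancy"], ["Renal failure", "Prior stent"]],
})

def to_long(frame, column, flag):
    """Return one row per criterion, tagged as inclusion or exclusion."""
    long = frame[["NCTID", column]].explode(column)
    long = long.rename(columns={column: "criteria_item"})
    long["inclusion_exclusion_flag"] = flag
    return long

combined = pd.concat(
    [to_long(demo, "inclusion_item", "inclusion"),
     to_long(demo, "exclusion_item", "exclusion")],
    ignore_index=True,
)
print(combined)
```

`explode` repeats the NCTID for every list element, which is what the `iterrows` loop does by hand.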


In [195]:
outfile = input('please give the file name for your resulted inclusion item table: ' )
file = "output\\"+outfile+".csv"
# the table is written tab-separated, like the input file
inclusion_to_NCTID.to_csv(file, index=False, header=True, sep='\t')

please give the file name for your resulted inclusion item table: CT69_inclusion_v1


In [134]:
pattern_1 = re.compile(r'\s{1}-\s{1}.{5}')
t = "Clinical Criteria - Male or non-pregnant female ≥18 years of age. - Rutherford Clinical Category 2-5 - Patient is willing to provide informed consent and comply with the required follow up visits, testing schedule, and medication regimen Angiographic Criteria - A single de novo or restenotic atherosclerotic lesion >70% in the SFA or popliteal artery that is ≥4 cm and ≤15 cm in total length. - Reference vessel diameter ≥4 mm and ≤ 6mm - Successful wire crossing of lesion - A patent inflow artery free from significant lesion (>50% stenosis) as confirmed by angiography (treatment of target lesion acceptable after successful treatment of inflow artery lesions) "
print (pattern_1.findall(t))
# raw strings keep the regex backslashes from being read as string escapes
re.split(r'\s{1}-\s{1}', t)
t2 = "- Male or non-pregnant female ≥18 years of age. - Rutherford Clinical Category 2-5 - Patient is willing to provide informed consent and comply with the required follow up visits,"
re.split(r'\s{1}-\s{1}', t2)

[' - Male ', ' - Ruthe', ' - Patie', ' - A sin', ' - Refer', ' - Succe', ' - A pat']


['- Male or non-pregnant female ≥18 years of age.',
 'Rutherford Clinical Category 2-5',
 'Patient is willing to provide informed consent and comply with the required follow up visits,']

In [135]:
pattern_2 = re.compile(r'(?<![and|or|\>|\<])\s{1}[0-9]{1,2}\.{1}\s*.{5}')
t1 = "Clinical Inclusion Criteria: 1. Male or non-pregnant female ≥18 years of age; 2. Rutherford Clinical Category 2-4; 3. Patient is willing to provide informed consent, is geographically stable and comply with the required follow up visits, testing schedule and medication regimen; Angiographic Lesion Inclusion Criteria: 4. Length ≤15 cm; 5. Up to two focal lesions or segments within the designated 15 cm length of vessel may be treated (e.g. two discrete segments, separated by several cm, but both falling within a composite length of <15 cm); 6. ≥70% stenosis by visual estimate; 7. Lesion location starts ≥1 cm below the common femoral bifurcation and terminates distally ≤2 cm below the tibial plateau AND ≥1 cm above the origin of the TP trunk; 8. de novo lesion(s) or non-stented restenotic lesion(s) >90 days from prior angioplasty procedure; 9. Lesion is located at least 3 cm from any stent, if target vessel was previously stented; 10. Target vessel diameter between ≥4 and ≤6 mm and able to be treated with available device size matrix; 11. Successful, uncomplicated (without use of a crossing device) antegrade wire crossing of lesion; 12. A patent inflow artery free from significant lesion (≥50% stenosis) as confirmed by angiography (treatment of target lesion acceptable after successful treatment of inflow artery lesions); NOTE: Successful inflow artery treatment is defined as attainment of residual diameter stenosis ≤30% without death or major vascular complication. 13. At least one patent native outflow artery to the ankle, free from significant (≥50%) stenosis as confirmed by angiography that has not previously been revascularized (treatment of outflow disease is NOT permitted during the index procedure); 14. Contralateral limb lesion(s) cannot be treated within 2 weeks before and/or planned 30 days after the protocol treatment in order to avoid confounding complications; 15. No other prior vascular interventions within 2 weeks before and/or planned 30 days after the protocol treatment. "
print (pattern_2.findall(t1))
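# raw strings keep the regex backslashes from being read as string escapes
re.split(r'(?<![and|or|\>|\<])\s{1}[0-9]{1,2}\.{1}\s*', t1)
# remove the heading first, so the first item is not merged with it
t3 = re.sub('Clinical Inclusion Criteria: ', '', t1)
re.split(r'(?<![and|or|\>|\<])\s{1}[0-9]{1,2}\.{1}\s*', t3)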

[' 1. Male ', ' 2. Ruthe', ' 3. Patie', ' 4. Lengt', ' 5. Up to', ' 6. ≥70% ', ' 7. Lesio', ' 8. de no', ' 9. Lesio', ' 10. Targe', ' 11. Succe', ' 12. A pat', ' 13. At le', ' 14. Contr', ' 15. No ot']


['1. Male or non-pregnant female ≥18 years of age;',
 'Rutherford Clinical Category 2-4;',
 'Patient is willing to provide informed consent, is geographically stable and comply with the required follow up visits, testing schedule and medication regimen; Angiographic Lesion Inclusion Criteria:',
 'Length ≤15 cm;',
 'Up to two focal lesions or segments within the designated 15 cm length of vessel may be treated (e.g. two discrete segments, separated by several cm, but both falling within a composite length of <15 cm);',
 '≥70% stenosis by visual estimate;',
 'Lesion location starts ≥1 cm below the common femoral bifurcation and terminates distally ≤2 cm below the tibial plateau AND ≥1 cm above the origin of the TP trunk;',
 'de novo lesion(s) or non-stented restenotic lesion(s) >90 days from prior angioplasty procedure;',
 'Lesion is located at least 3 cm from any stent, if target vessel was previously stented;',
 'Target vessel diameter between ≥4 and ≤6 mm and able to be treated with a

In [150]:
t4 = "Subject need 18 to 80 years of aged male or non-pregnant women; Subject has stable anina pectoris, or unstable anina pectoris, or old myocardial infarction patients, or ischemia with evidence but without symptom; Subject has a life expectancy of more than 2 years; Lesion vessel reference diameter is less than or equal to 2.5mm, target lesion length is less than or equal to 40mm; Target lesion stenosis is equal or greater than 70% or 50% and with ischemia; Single or two coronary small vessel lesions in situ; Subject can understand the trail purpose, sign informed consent voluntarily, agree to accept clinical telephone follow-up and angiographic follow-up at 9 months; Target lesion can be pre-expanded successfully (Guide wire can get through lesion, balloon pre-expand the remnant stenosis of vessel lumen is less or equal to 50%, without current limiting interlayer and thrombosis)."
pattern_3 = re.compile(r'.{5};.{5}')
print (pattern_3.findall(t4))
re.split(';',t4)

['women; Subj', 'mptom; Subj', 'years; Lesi', ' 40mm; Targ', 'hemia; Sing', ' situ; Subj', 'onths; Targ']


['Subject need 18 to 80 years of aged male or non-pregnant women',
 ' Subject has stable anina pectoris, or unstable anina pectoris, or old myocardial infarction patients, or ischemia with evidence but without symptom',
 ' Subject has a life expectancy of more than 2 years',
 ' Lesion vessel reference diameter is less than or equal to 2.5mm, target lesion length is less than or equal to 40mm',
 ' Target lesion stenosis is equal or greater than 70% or 50% and with ischemia',
 ' Single or two coronary small vessel lesions in situ',
 ' Subject can understand the trail purpose, sign informed consent voluntarily, agree to accept clinical telephone follow-up and angiographic follow-up at 9 months',
 ' Target lesion can be pre-expanded successfully (Guide wire can get through lesion, balloon pre-expand the remnant stenosis of vessel lumen is less or equal to 50%, without current limiting interlayer and thrombosis).']