# 2. Data Wrangling

https://www.kaggle.com/auriml/eligibilityforcancerclinicaltrials

A total of 6,186,572 labeled clinical statements were extracted from 49,201 interventional clinical trial (CT) protocols on cancer (the protocols are freely available for download at https://clinicaltrials.gov/ct2/results?term=neoplasmtype=Intrshowdow). Each downloaded CT is an XML file that follows a structure of fields defined by an XML schema of clinical trials [16]. The relevant data for this project are derived from the intervention, condition, and eligibility fields, which are written in unstructured free-text language. The eligibility criteria, both exclusion and inclusion, are sets of phrases and/or sentences displayed in a free format, such as paragraphs, bulleted lists, enumeration lists, etc. None of these fields use common standards, nor do they enforce the use of standardized terms from medical dictionaries and ontologies. Moreover, the language has the problems of both polysemy and synonymy.

The original data were processed by merging the eligibility criteria with the study condition and intervention, and then transforming them into lists of short labeled clinical statements. Each statement consists of two extracted features (see example in Figure 2): the label (Eligible or Not Eligible) and the processed text, which includes the original eligibility criterion merged with the study interventions and the study conditions.

For more details, see: Bustos, A.; Pertusa, A. Learning Eligibility in Cancer Clinical Trials Using Deep Neural Networks. Appl. Sci. 2018, 8, 1206. www.mdpi.com/2076-3417/8/7/1206

## About the data set

Eligibility criteria in interventional clinical trials on cancer, labeled as Eligible (`__label__0`) or Not Eligible (`__label__1`). The criteria are split into short sequences of plain words (and bigrams) separated by whitespace and are augmented with information on the study intervention and cancer type (source of clinical trial protocols: clinicaltrials.gov).
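
As a rough illustration of the record layout described above, the sketch below parses a single raw line in plain Python. It assumes the layout seen in the cells of this notebook: a label tag ending in `0` or `1`, a tab, then text that starts with `study interventions are ` and uses ` . ` to separate the intervention from the remaining text. The function name `parse_record` and the sample line are hypothetical and only serve the example.

```python
PREFIX = "study interventions are "


def parse_record(line):
    """Split one raw line into (is_eligible, intervention, remaining_text)."""
    # The label and the text are separated by a single tab.
    label, text = line.rstrip("\n").split("\t", 1)
    # A label ending in "0" marks an Eligible statement.
    is_eligible = label.endswith("0")
    # Drop the constant prefix when it is present.
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    # Everything before the first " . " is treated as the intervention.
    intervention, _, remaining_text = text.partition(" . ")
    return is_eligible, intervention, remaining_text


# Made-up sample line that only mimics the layout of the real file.
sample = "__label__0\tstudy interventions are Paclitaxel . ovarian cancer diagnosis and made-up criterion"
print(parse_record(sample))
```

The pandas cells in this notebook apply the same three steps column-wise to the full sample.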

In [1]:
import zipfile
import pandas as pd

url = 'https://www.kaggle.com/auriml/eligibilityforcancerclinicaltrials/download'

# to-do: to make the code cleaner, open an HTTP stream and pass it to ZipFile
with zipfile.ZipFile('data/raw/917_1673_bundle_archive.zip', 'r') as myzip:
    myzip.extractall('data/raw')
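
One possible way to address the to-do note in the cell above is to keep the archive in memory and hand it to `ZipFile` through `io.BytesIO`. The sketch below shows only that part. How the bytes are fetched is left out on purpose, because Kaggle downloads generally require authentication; for a URL that serves the archive directly, the payload could come from `urllib.request.urlopen(...).read()`. The helper name `extract_zip_bytes` and the demo archive are hypothetical.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_zip_bytes(payload, dest):
    """Extract a ZIP archive held in memory into dest and return its member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # BytesIO gives ZipFile the seekable file-like object it needs.
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Demo with a tiny archive built in memory, standing in for downloaded bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "__label__0\tplaceholder text\n")

with tempfile.TemporaryDirectory() as tmp:
    print(extract_zip_bytes(buffer.getvalue(), tmp))
```

Holding the whole archive in memory is convenient for small files; for a large bundle, writing the download to disk first, as the cell above does, may be the safer choice.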

In [2]:
df = pd.read_csv('data/raw/labeledEligibilitySample1000000.csv', sep='\t', header=None, names=['is_eligible','intervention'])

In [3]:
df.shape

(1000000, 2)

In [4]:
df.describe()

| | is_eligible | intervention |
|---|---|---|
| count | 1000000 | 1000000 |
| unique | 2 | 957870 |
| top | `__label__0` | study interventions are Cyclophosphamide . lym... |
| freq | 500000 | 49 |


In [5]:
df.intervention

0         study interventions are recombinant CD40-ligan...
1         study interventions are Liposomal doxorubicin ...
2         study interventions are BI 836909 . multiple m...
3         study interventions are Immunoglobulins . recu...
4         study interventions are Paclitaxel . stage ova...
                                ...                        
999995    study interventions are Pazopanib . carcinoma ...
999996    study interventions are Dexamethasone 21-phosp...
999997    study interventions are Camptothecin . rectal ...
999998    study interventions are Cyclophosphamide . sta...
999999    study interventions are Cyclophosphamide . lym...
Name: intervention, Length: 1000000, dtype: object

In [6]:
df.is_eligible = df.is_eligible.str.endswith('0')

In [7]:
df

| | is_eligible | intervention |
|---|---|---|
| 0 | True | study interventions are recombinant CD40-ligan... |
| 1 | True | study interventions are Liposomal doxorubicin ... |
| 2 | True | study interventions are BI 836909 . multiple m... |
| 3 | True | study interventions are Immunoglobulins . recu... |
| 4 | True | study interventions are Paclitaxel . stage ova... |
| ... | ... | ... |
| 999995 | False | study interventions are Pazopanib . carcinoma ... |
| 999996 | False | study interventions are Dexamethasone 21-phosp... |
| 999997 | False | study interventions are Camptothecin . rectal ... |
| 999998 | False | study interventions are Cyclophosphamide . sta... |


In [8]:
df['intervention'].str.startswith('study interventions are ').sum()

1000000

All interventions start with `study interventions are `, so we remove this prefix.

In [9]:
i = len('study interventions are ')
df.intervention = df['intervention'].str.slice(start=i)

In [10]:
df

| | is_eligible | intervention |
|---|---|---|
| 0 | True | recombinant CD40-ligand . melanoma skin diagno... |
| 1 | True | Liposomal doxorubicin . colorectal cancer diag... |
| 2 | True | BI 836909 . multiple myeloma diagnosis and ind... |
| 3 | True | Immunoglobulins . recurrent fallopian tube car... |
| 4 | True | Paclitaxel . stage ovarian cancer diagnosis an... |
| ... | ... | ... |
| 999995 | False | Pazopanib . carcinoma renal cell diagnosis and... |
| 999996 | False | Dexamethasone 21-phosphate . uveal melanoma di... |
| 999997 | False | Camptothecin . rectal cancer diagnosis and cre... |
| 999998 | False | Cyclophosphamide . stage iii non hodgkin lymph... |


In [11]:
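# Split on the " . " separator (the dot is escaped because the pattern is a regex).
# n=1 splits only at the first separator, so the result always has two columns.
itv = df.intervention.str.split(' \\. ', n=1, expand=True)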
itv.columns = ['intervention', 'diagnosis']
itv

| | intervention | diagnosis |
|---|---|---|
| 0 | recombinant CD40-ligand | melanoma skin diagnosis and no active cns meta... |
| 1 | Liposomal doxorubicin | colorectal cancer diagnosis and cardiovascular |
| 2 | BI 836909 | multiple myeloma diagnosis and indwelling cent... |
| 3 | Immunoglobulins | recurrent fallopian tube carcinoma diagnosis a... |
| 4 | Paclitaxel | stage ovarian cancer diagnosis and patients mu... |
| ... | ... | ... |
| 999995 | Pazopanib | carcinoma renal cell diagnosis and pregnant or... |
| 999996 | Dexamethasone 21-phosphate | uveal melanoma diagnosis and presence of any o... |
| 999997 | Camptothecin | rectal cancer diagnosis and creatinine clearan... |
| 999998 | Cyclophosphamide | stage iii non hodgkin lymphoma diagnosis and c... |


In [12]:
df = pd.concat([df.iloc[:,0], itv], axis=1)

In [13]:
df.intervention.value_counts()

Antibodies, Monoclonal           34457
Antibodies                       22897
Paclitaxel                       22723
Albumin-Bound Paclitaxel         21169
Bevacizumab                      20445
                                 ...  
Helical Tomotherapy treatment        1
Multimodal Intervention              1
Webinar                              1
Muscle Tissue Biopsy Sample          1
No further treatment                 1
Name: intervention, Length: 15262, dtype: int64

In [14]:
df[df.intervention.str.startswith('Antibodies, Monoclonal')].groupby('is_eligible').count()

| is_eligible | intervention | diagnosis |
|---|---|---|
| False | 18321 | 18321 |
| True | 16136 | 16136 |
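
The two counts above can be turned into a share of Eligible statements for this intervention and compared with the balanced sample as a whole (500,000 of the 1,000,000 rows carry `__label__0`). The helper below is a minimal sketch of that arithmetic; the name `eligible_share` is hypothetical.

```python
def eligible_share(n_eligible, n_not_eligible):
    """Return the fraction of statements labeled Eligible."""
    total = n_eligible + n_not_eligible
    if total == 0:
        raise ValueError("no statements to summarize")
    return n_eligible / total


# Counts taken from the groupby output for 'Antibodies, Monoclonal'.
share = eligible_share(16136, 18321)
print(f"{share:.1%}")  # prints 46.8%
```

For this intervention the statements lean slightly toward Not Eligible, whereas the full sample is split evenly.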


In [15]:
df['intervention'].value_counts().index[-1:-20:-1]

Index(['No further treatment', 'Muscle Tissue Biopsy Sample', 'Webinar',
       'Multimodal Intervention', 'Helical Tomotherapy treatment',
       'Sham tibial nerve stimulation', 'Auto Plus', 'Routine Data collection',
       'bioelectrical impedance analysis', 'Air colonoscopy',
       'Agreement between oral and cervical HPV infection',
       'Educational and counseling in group', 'Serum proteomics',
       'The interactive e-Assist tool', 'preserving the left colic artery',
       'Stamm Approach', 'MRI scanner with sequence',
       'Goal directed fluid management based on continuous monitoring of stroke volume',
       'MR guided High Intensity focused ultrasound'],
      dtype='object')

In [16]:
df['intervention'].value_counts().index[:50]

Index(['Antibodies, Monoclonal', 'Antibodies', 'Paclitaxel',
       'Albumin-Bound Paclitaxel', 'Bevacizumab', 'Cyclophosphamide',
       'Immunoglobulins', 'Carboplatin', 'Cisplatin', 'Fludarabine',
       'Fludarabine phosphate', 'Gemcitabine', 'Sirolimus', 'Doxorubicin',
       'Everolimus', 'Liposomal doxorubicin', 'Docetaxel', 'Vaccines',
       'Mycophenolate mofetil', 'Rituximab', 'Mycophenolic Acid',
       'Fluorouracil', 'Vidarabine', 'Oxaliplatin', 'Capecitabine',
       'Irinotecan', 'Etoposide', 'Thalidomide', 'Erlotinib Hydrochloride',
       'Cyclosporins', 'Cetuximab', 'Cyclosporine', 'Etoposide phosphate',
       'Camptothecin', 'Pembrolizumab', 'Bortezomib', 'Tacrolimus',
       'Dexamethasone', 'Sorafenib', 'Melphalan', 'Lenalidomide',
       'Niacinamide', 'laboratory biomarker analysis', 'Temozolomide',
       'Dexamethasone 21-phosphate', 'BB 1101', 'Dexamethasone acetate',
       'Endothelial Growth Factors', 'Trastuzumab',
       'Laboratory Biomarker Analysis']
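
The list above contains both `laboratory biomarker analysis` and `Laboratory Biomarker Analysis`, so some interventions appear to be counted separately only because of letter case. The sketch below shows one way such variants could be merged before counting, using plain Python on a made-up list. The function name `merge_case_variants` is hypothetical, and whether merging is appropriate for every intervention name is an assumption that would need checking against the data.

```python
from collections import Counter


def merge_case_variants(names):
    """Count names case-insensitively, reporting the first spelling seen."""
    counts = Counter()
    display = {}
    for name in names:
        key = name.casefold()
        # Remember the first spelling encountered for each folded key.
        display.setdefault(key, name)
        counts[key] += 1
    # most_common() orders the keys from most to least frequent.
    return {display[key]: n for key, n in counts.most_common()}


# Toy input; the real column has 15,262 distinct values.
toy = [
    "Laboratory Biomarker Analysis",
    "laboratory biomarker analysis",
    "Paclitaxel",
    "laboratory biomarker analysis",
]
print(merge_case_variants(toy))
```

On the DataFrame itself, a comparable result could be obtained with `df.intervention.str.lower().value_counts()`, at the cost of losing the original capitalization.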

In [17]:
df.loc[10, 'intervention']

'study of high risk factors'

In [18]:
df.loc[10, 'diagnosis']

'precancerous condition diagnosis and able to understand the procedures and the potential risks involved as determined by clinic staff'

In [19]:
df[df.intervention.str.contains('high risk')]

| | is_eligible | intervention | diagnosis |
|---|---|---|---|
| 10 | True | study of high risk factors | precancerous condition diagnosis and able to u... |
| 18605 | True | study of high risk factors | melanoma skin diagnosis and not specified |
| 43726 | True | study of high risk factors | unspecified adult solid tumor protocol specifi... |
| 47365 | True | study of high risk factors | precancerous condition diagnosis and patient c... |
| 66759 | True | study of high risk factors | increased risk of breast cancer as determined ... |
| ... | ... | ... | ... |
| 765935 | False | study of high risk factors | lung cancer diagnosis and cardiac surgery |
| 794786 | False | study of high risk factors | lung cancer diagnosis and deep ray therapy |
| 872408 | False | study of high risk factors | metastatic cancer diagnosis and diffuse pleura... |
| 937083 | False | study of high risk factors | concurrent untreated malignancy except nonmela... |
