# 2. Data Wrangling

https://www.kaggle.com/auriml/eligibilityforcancerclinicaltrials

A total of 6,186,572 labeled clinical statements were extracted from 49,201 interventional clinical trial (CT) protocols on cancer (the dataset can be downloaded freely from https://clinicaltrials.gov/ct2/results?term=neoplasmtype=Intrshowdow). Each downloaded CT is an XML file whose fields are defined by an XML schema of clinical trials [16]. The relevant data for this project are derived from the intervention, condition, and eligibility fields, which are written in unstructured free-text language. The eligibility criteria, both inclusion and exclusion, are sets of phrases and/or sentences displayed in a free format, such as paragraphs, bulleted lists, enumeration lists, etc. None of these fields use common standards, nor do they enforce the use of standardized terms from medical dictionaries and ontologies. Moreover, the language suffers from both polysemy and synonymy.
The original data were processed by merging each eligibility criterion with the study condition and intervention, and then transforming the result into lists of short labeled clinical statements. Each statement consists of two extracted features (see example in Figure 2): the label (Eligible or Not Eligible) and the processed text, which combines the original eligibility criterion with the study interventions and the study conditions. For more details, see: Bustos, A.; Pertusa, A. Learning Eligibility in Cancer Clinical Trials Using Deep Neural Networks. Appl. Sci. 2018, 8, 1206. www.mdpi.com/2076-3417/8/7/1206

## About the data set

Eligibility criteria from interventional clinical trials on cancer, labeled as Eligible (label 0) or Not Eligible (label 1). The criteria are split into short sequences of plain words (and bigrams) separated by whitespace and are augmented with information on the study intervention and cancer type (source of clinical trial protocols: clinicaltrials.gov).

In [1]:
import zipfile
import pandas as pd

url = 'https://www.kaggle.com/auriml/eligibilityforcancerclinicaltrials/download'

# TODO: for cleaner code, open an HTTP stream and feed it straight into ZipFile
with zipfile.ZipFile('data/raw/917_1673_bundle_archive.zip', 'r') as myzip:
    myzip.extractall('data/raw')
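
The to-do in the cell above could be approached roughly as follows. This is a minimal sketch, not the notebook's actual download step: `extract_zip_bytes` is a hypothetical helper, and the tiny archive built in memory only stands in for the real download. Fetching the bytes from the Kaggle `url` would most likely require authentication, so that part is left out on purpose.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_zip_bytes(payload, dest):
    """Extract a ZIP archive held in memory into `dest`; return the member names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Build a tiny archive in memory to stand in for the downloaded bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.csv', '__label__0\tstudy interventions are X . y diagnosis and z\n')

# Extract into a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    names = extract_zip_bytes(buffer.getvalue(), tmp)
    extracted = (Path(tmp) / 'sample.csv').read_text()
```

With a real HTTP response, the `payload` argument would be the response body, so no intermediate `.zip` file needs to be written to disk.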

In [2]:
df = pd.read_csv('data/raw/labeledEligibilitySample1000000.csv', sep='\t', header=None, names=['is_eligible','intervention'])

In [3]:
df.shape

(1000000, 2)

In [4]:
df.describe()

Unnamed: 0,is_eligible,intervention
count,1000000,1000000
unique,2,957870
top,__label__0,study interventions are Cyclophosphamide . lym...
freq,500000,49


In [5]:
# '__label__0' marks an eligible statement, so eligible rows become True
df.is_eligible = df.is_eligible.str.endswith('0')

In [6]:
df

Unnamed: 0,is_eligible,intervention
0,True,study interventions are recombinant CD40-ligan...
1,True,study interventions are Liposomal doxorubicin ...
2,True,study interventions are BI 836909 . multiple m...
3,True,study interventions are Immunoglobulins . recu...
4,True,study interventions are Paclitaxel . stage ova...
...,...,...
999995,False,study interventions are Pazopanib . carcinoma ...
999996,False,study interventions are Dexamethasone 21-phosp...
999997,False,study interventions are Camptothecin . rectal ...
999998,False,study interventions are Cyclophosphamide . sta...


In [7]:
df.intervention[5], df.intervention[5000]

('study interventions are Antibodies, Monoclonal . recurrent verrucous carcinoma of the oral cavity diagnosis and must have undergone radiotherapy as component of prior treatment',
 'study interventions are Dideoxynucleosides . brain and central nervous system tumors diagnosis and platelets greater_than one hundred zero µl')

In [8]:
df['intervention'].str.startswith('study interventions are ').sum()

1000000

All interventions start with `study interventions are `, so we remove that prefix.

In [9]:
i = len('study interventions are ')
df.intervention = df['intervention'].str.slice(start=i)

In [10]:
df

Unnamed: 0,is_eligible,intervention
0,True,recombinant CD40-ligand . melanoma skin diagno...
1,True,Liposomal doxorubicin . colorectal cancer diag...
2,True,BI 836909 . multiple myeloma diagnosis and ind...
3,True,Immunoglobulins . recurrent fallopian tube car...
4,True,Paclitaxel . stage ovarian cancer diagnosis an...
...,...,...
999995,False,Pazopanib . carcinoma renal cell diagnosis and...
999996,False,Dexamethasone 21-phosphate . uveal melanoma di...
999997,False,Camptothecin . rectal cancer diagnosis and cre...
999998,False,Cyclophosphamide . stage iii non hodgkin lymph...


In [11]:
# Raw string keeps the regex escapes intact; n=2 caps the result at three parts
itv = df.intervention.str.split(r'\s\.\s|\sdiagnosis\sand\s', n=2, expand=True)
itv.columns = ['intervention', 'condition', 'criteria']
itv

Unnamed: 0,intervention,condition,criteria
0,recombinant CD40-ligand,melanoma skin,no active cns metastases by ct scan or mri
1,Liposomal doxorubicin,colorectal cancer,cardiovascular
2,BI 836909,multiple myeloma,indwelling central venous cateder or willingne...
3,Immunoglobulins,recurrent fallopian tube carcinoma,patients are allowed to receive but are not re...
4,Paclitaxel,stage ovarian cancer,patients must have recovered from the effects ...
...,...,...,...
999995,Pazopanib,carcinoma renal cell,pregnant or lactating female
999996,Dexamethasone 21-phosphate,uveal melanoma,presence of any ocular condition that in the o...
999997,Camptothecin,rectal cancer,creatinine clearance less_than fifty ml min
999998,Cyclophosphamide,stage iii non hodgkin lymphoma,concurrent administration of any other investi...
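
To make the split easier to follow, here is a minimal sketch of the same pattern applied with the standard `re` module to one of the sample statements shown earlier. `split_statement` is a hypothetical helper, and the second input is an invented string with no ` diagnosis and ` delimiter. `maxsplit=2` plays the role of `n=2`, and padding with `None` mirrors what `expand=True` does for rows that yield fewer than three parts.

```python
import re

# Same delimiters as in the split above, written as a raw string.
DELIMS = re.compile(r'\s\.\s|\sdiagnosis\sand\s')


def split_statement(text):
    """Split into (intervention, condition, criteria); pad missing parts with None."""
    parts = DELIMS.split(text, maxsplit=2)
    parts += [None] * (3 - len(parts))
    return tuple(parts)


# Sample statement from the data, after the common prefix has been removed.
full = split_statement(
    'Dideoxynucleosides . brain and central nervous system tumors '
    'diagnosis and platelets greater_than one hundred zero µl'
)

# Hypothetical input that lacks the ' diagnosis and ' delimiter.
short = split_statement('Some Drug . some condition')
```

Note that the plain ` and ` inside `brain and central nervous system tumors` is not a delimiter; only ` diagnosis and ` is, so the condition stays intact.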


In [12]:
df = pd.concat([df.iloc[:,0], itv], axis=1)

In [13]:
df.describe()

Unnamed: 0,is_eligible,intervention,condition,criteria
count,1000000,1000000,1000000,927650
unique,2,15262,52832,232930
top,True,"Antibodies, Monoclonal",breast cancer,donor
freq,500000,34457,33368,9426


In [14]:
df.intervention.value_counts()

Antibodies, Monoclonal                                                 34457
Antibodies                                                             22897
Paclitaxel                                                             22723
Albumin-Bound Paclitaxel                                               21169
Bevacizumab                                                            20445
                                                                       ...  
Direct Transthoracic Cardiac Tumor Radio Frequency Ablation Therapy        1
Conventional Radiotherapy                                                  1
Salvia sample                                                              1
Bladder tumor biopsy                                                       1
light microscopy                                                           1
Name: intervention, Length: 15262, dtype: int64

In [15]:
df[df.intervention.str.startswith('Antibodies, Monoclonal')].groupby('is_eligible').count()

is_eligible,intervention,condition,criteria
False,18321,18321,17167
True,16136,16136,15369


In [16]:
df['intervention'].value_counts().index[-1:-20:-1]

Index(['light microscopy', 'Bladder tumor biopsy', 'Salvia sample',
       'Conventional Radiotherapy',
       'Direct Transthoracic Cardiac Tumor Radio Frequency Ablation Therapy',
       'stretching/toning', 'mMammogram', 'Electro-Acupuncture',
       'Gastric surgery', 'Protein and creatine',
       'Warm Water Immersion Colonoscopy', 'body-mind-spirit group therapy',
       'Intervention Arm', 'Mindfulness based stress reduction (MBSR)',
       'Receptors, Antigen, B-Cell', 'NK Immunotherapy', 'PET/CT (low-dose)',
       'Reduced port-flush schedule', 'Baseline Strength Test'],
      dtype='object')

In [17]:
df['intervention'].value_counts().index[:50]

Index(['Antibodies, Monoclonal', 'Antibodies', 'Paclitaxel',
       'Albumin-Bound Paclitaxel', 'Bevacizumab', 'Cyclophosphamide',
       'Immunoglobulins', 'Carboplatin', 'Cisplatin', 'Fludarabine',
       'Fludarabine phosphate', 'Gemcitabine', 'Sirolimus', 'Doxorubicin',
       'Everolimus', 'Liposomal doxorubicin', 'Docetaxel', 'Vaccines',
       'Mycophenolate mofetil', 'Rituximab', 'Mycophenolic Acid',
       'Fluorouracil', 'Vidarabine', 'Oxaliplatin', 'Capecitabine',
       'Irinotecan', 'Etoposide', 'Thalidomide', 'Erlotinib Hydrochloride',
       'Cyclosporins', 'Cetuximab', 'Cyclosporine', 'Etoposide phosphate',
       'Camptothecin', 'Pembrolizumab', 'Bortezomib', 'Tacrolimus',
       'Dexamethasone', 'Sorafenib', 'Melphalan', 'Lenalidomide',
       'Niacinamide', 'laboratory biomarker analysis', 'Temozolomide',
       'Dexamethasone 21-phosphate', 'BB 1101', 'Dexamethasone acetate',
       'Endothelial Growth Factors', 'Trastuzumab',
       'Laboratory Biomarker Analysis']

In [18]:
df.loc[10, 'intervention']

'study of high risk factors'

In [19]:
df.loc[10, 'criteria']

'able to understand the procedures and the potential risks involved as determined by clinic staff'

In [20]:
df[df.intervention.str.contains('high risk')]

Unnamed: 0,is_eligible,intervention,condition,criteria
10,True,study of high risk factors,precancerous condition,able to understand the procedures and the pote...
18605,True,study of high risk factors,melanoma skin,not specified
43726,True,study of high risk factors,unspecified adult solid tumor protocol specific,nci zero zero69
47365,True,study of high risk factors,precancerous condition,patient characteristics
66759,True,study of high risk factors,increased risk of breast cancer as determined ...,
...,...,...,...,...
765935,False,study of high risk factors,lung cancer,cardiac surgery
794786,False,study of high risk factors,lung cancer,deep ray therapy
872408,False,study of high risk factors,metastatic cancer,diffuse pleural fibrosis
937083,False,study of high risk factors,concurrent untreated malignancy except nonmela...,


In [21]:
df[df.criteria=='donor']

Unnamed: 0,is_eligible,intervention,condition,criteria
93,True,Mycophenolic Acid,stage ii contiguous adult diffuse large cell l...,donor
103,True,Methylprednisolone Hemisuccinate,splenic marginal zone lymphoma,donor
136,True,Cyclosporins,recurrent adult immunoblastic large cell lymphoma,donor
140,True,Cyclophosphamide,recurrent childhood hodgkin lymphoma,donor
192,True,Tacrolimus,recurrent cutaneous cell non hodgkin lymphoma,donor
...,...,...,...,...
999663,False,Cyclosporine,recurrent adult grade iii lymphomatoid granulo...,donor
999701,False,Everolimus,lymphoma,donor
999764,False,Cyclosporins,contiguous stage ii adult diffuse small cleave...,donor
999767,False,Vidarabine,splenic marginal zone lymphoma,donor


`donor` as an inclusion or exclusion criterion for a cancer trial does not make sense. We will probably drop these rows from the data set.
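
One way such a removal could look is sketched below on a toy frame. The rows are illustrative and not taken from the data set, and treating `donors` as a variant of `donor` is an assumption rather than a decision made in this notebook.

```python
import pandas as pd

# Toy frame standing in for `df`; the values are illustrative only.
toy = pd.DataFrame({
    'is_eligible': [True, False, True, False],
    'criteria': ['donor', 'pregnancy', 'donors', None],
})

# Assumption: 'donors' is treated the same way as 'donor'.
placeholders = {'donor', 'donors'}
is_placeholder = toy['criteria'].isin(placeholders)

# Keep every row whose criterion is not a bare placeholder.
cleaned = toy.loc[~is_placeholder].reset_index(drop=True)
```

Rows with a missing criterion are kept here, because `isin` returns `False` for them; they would need a separate decision.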

In [22]:
df.loc[999663, 'condition']

'recurrent adult grade iii lymphomatoid granulomatosis'

In [23]:
df.loc[df.condition.str.endswith('lymphomatoid granulomatosis')].criteria.value_counts()

donor                                                                                                                                                                                                                        199
adult                                                                                                                                                                                                                         39
fertile men or women unwilling to use contraceptive techniques during and for twelve months following treatment                                                                                                               23
pregnancy                                                                                                                                                                                                                     22
donors                                                                                              

In [24]:
df.loc[df.criteria=='donor'].condition.value_counts()

recurrent small lymphocytic lymphoma                 223
recurrent mantle cell lymphoma                       184
recurrent adult hodgkin lymphoma                     161
recurrent marginal zone lymphoma                     134
recurrent adult immunoblastic large cell lymphoma    130
                                                    ... 
lymphoma non hodgkin                                   1
lymphoma cell                                          1
blastoid_variant mantle cell lymphoma                  1
myeloma                                                1
hepatocellular carcinoma                               1
Name: condition, Length: 255, dtype: int64

In [25]:
df.loc[df.criteria=='donor'].intervention.value_counts()

Fludarabine phosphate            1189
Fludarabine                      1169
Mycophenolic Acid                1161
Mycophenolate mofetil            1129
Vidarabine                        788
                                 ... 
Glycine                             1
Cadexomer iodine                    1
Leukapheresis                       1
Mesna                               1
Laboratory Biomarker Analysis       1
Name: intervention, Length: 62, dtype: int64