In [2]:
import pandas as pd
import xml.etree.ElementTree as et
from collections import defaultdict

## Preparing the Data

This notebook combines the labels and features of the clinical trials data.

The label is binary relevance. The features combine topic information (e.g. gene and disease) with clinical trial information (e.g. brief title and brief summary).

### Trial Info

In [2]:
LABELS_PATH = "../data/pm_labels_2018/qrels-treceval-clinical_trials-2018-v2.txt"
FEATURES_PATH = "../data/trials_pickle_2018/all_trials_2018.pickle"

df_labels = pd.read_csv(
    LABELS_PATH,
    names=["topic", "_", "id", "label"], sep=" "
)
df_features = pd.read_pickle(FEATURES_PATH)

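# astype returns a new Series, so the result must be assigned back
df_features["id"] = df_features["id"].astype(str)
df_labels["id"] = df_labels["id"].astype(str)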

print(df_features.shape)
print(df_labels.shape)

# Raw combined
df = df_features.merge(df_labels, left_on="id", right_on="id", how="inner")
df.iloc[:2, ]

(7508, 16)
(14188, 4)


Unnamed: 0,score,id,brief_summary,brief_title,minimum_age,gender,primary_outcome,detailed_description,keywords,official_title,intervention_type,intervention_name,intervention_browse,condition_browse,inclusion,exclusion,topic,_,label
0,1.0,NCT01513603,"CLAG-M is an active, well tolerated regimen in...","Trial of Cladribine, Cytarabine, Mitoxantrone,...",,All,Complete remission percentage 1 month,Patients will receive standard dose CLAG-M (cl...,,Phase II Trial of CLAG-M in Relapsed ALL,Drug Drug,CLAG-M CLAG-M,Cytarabine Cladribine Mitoxantrone,Lymphoma Leukemia Precursor Cell Lymphoblastic...,- Relapsed or refractory acute lymphoblastic l...,,32,0,0
1,1.0,NCT00582257,The purpose of this study is to establish a ga...,Early Onset and Familial Gastric Cancer Registry,6570.0,All,Create registry of families w/ early onset & f...,,Gastric Cancer Stomach Cancer 05-118,Early Onset and Familial Gastric Cancer Registry,Behavioral,questionnaires,,Stomach Neoplasms,Patient/Relative Cohort: Must meet one or more...,Patients are ineligible for the study if they:...,33,0,0
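A note on casting the join key: `Series.astype` returns a new Series rather than changing the column in place, so a cast only takes effect when its result is assigned back. The sketch below illustrates this on toy frames (the data and the names `toy_features` and `toy_labels` are made up for illustration) and uses the `validate` argument of `merge` to assert that each ID is unique on the features side.

```python
import pandas as pd

# Toy stand-ins for df_features and df_labels (hypothetical data).
toy_features = pd.DataFrame({"id": [101, 102], "brief_title": ["Trial A", "Trial B"]})
toy_labels = pd.DataFrame(
    {"topic": [1, 2, 1], "id": ["101", "101", "102"], "label": [0, 1, 2]}
)

# Without assignment the cast is discarded and the column keeps its dtype.
toy_features["id"].astype(str)
dtype_before = toy_features["id"].dtype

# Assigning the result back makes the cast stick.
toy_features["id"] = toy_features["id"].astype(str)

# validate="one_to_many" raises MergeError if an id repeats on the features side.
toy_merged = toy_features.merge(toy_labels, on="id", how="inner", validate="one_to_many")
print(toy_merged.shape)
```

The merged frame has one row per label row, which matches the pattern seen above where the merged shape follows the labels file rather than the features file.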


## Checking duplicates

One topic can have several trial IDs. The cells below check whether a single ID can also be relevant to several topics.

In [38]:
df[df["id"] == "NCT00001832"][["id", "label", "topic"]]

Unnamed: 0,id,label,topic
13157,NCT00001832,0,20
13158,NCT00001832,1,21
13159,NCT00001832,1,22


In [31]:
# Rows whose id occurs more than once in df and that are labelled relevant
df_dup_id = df[df.duplicated(subset=["id"], keep=False) & (df["label"] == 1)]
print(df_dup_id.shape)
df_dup_id.head()

(788, 19)


Unnamed: 0,score,id,brief_summary,brief_title,minimum_age,gender,primary_outcome,detailed_description,keywords,official_title,intervention_type,intervention_name,intervention_browse,condition_browse,inclusion,exclusion,topic,_,label
13,1.0,NCT01369875,Background: - Tumor infiltrating lymphocytes (...,Modified Tumor Infiltrating Lymphocytes for Me...,6570,All,Clinical Tumor Regression. 3 years Clinical tu...,Background: - Tumor Infiltrating Lymphocyte (T...,Immunotherapy Cell Therapy Metastatic Cancer M...,Phase II Study of Lymphocytes Generated With E...,Drug Drug Biological Biological,Cyclophosphamide Fludarabine Aldesleukin Tumor...,Cyclophosphamide Fludarabine phosphate Fludara...,Melanoma Skin Neoplasms,1. Measurable metastatic melanoma with at leas...,1. Women of child-bearing potential who are pr...,21,0,1
14,1.0,NCT01369875,Background: - Tumor infiltrating lymphocytes (...,Modified Tumor Infiltrating Lymphocytes for Me...,6570,All,Clinical Tumor Regression. 3 years Clinical tu...,Background: - Tumor Infiltrating Lymphocyte (T...,Immunotherapy Cell Therapy Metastatic Cancer M...,Phase II Study of Lymphocytes Generated With E...,Drug Drug Biological Biological,Cyclophosphamide Fludarabine Aldesleukin Tumor...,Cyclophosphamide Fludarabine phosphate Fludara...,Melanoma Skin Neoplasms,1. Measurable metastatic melanoma with at leas...,1. Women of child-bearing potential who are pr...,22,0,1
258,1.0,NCT00397384,This phase I trial is studying the side effect...,Erlotinib Hydrochloride and Cetuximab in Treat...,6570,All,"Incidence of DLT, defined as recurring grade 2...",PRIMARY OBJECTIVES: I. To identify the maximum...,,A Phase I Clinical and Biological Evaluation o...,Drug Drug Other,cetuximab erlotinib hydrochloride laboratory b...,"Antibodies Immunoglobulins Antibodies, Monoclo...","Carcinoma Lung Neoplasms Carcinoma, Non-Small-...",- Patients must have histologically or cytolog...,- Patients who have had chemotherapy or radiot...,29,0,1
260,1.0,NCT00397384,This phase I trial is studying the side effect...,Erlotinib Hydrochloride and Cetuximab in Treat...,6570,All,"Incidence of DLT, defined as recurring grade 2...",PRIMARY OBJECTIVES: I. To identify the maximum...,,A Phase I Clinical and Biological Evaluation o...,Drug Drug Other,cetuximab erlotinib hydrochloride laboratory b...,"Antibodies Immunoglobulins Antibodies, Monoclo...","Carcinoma Lung Neoplasms Carcinoma, Non-Small-...",- Patients must have histologically or cytolog...,- Patients who have had chemotherapy or radiot...,33,0,1
264,1.0,NCT00397384,This phase I trial is studying the side effect...,Erlotinib Hydrochloride and Cetuximab in Treat...,6570,All,"Incidence of DLT, defined as recurring grade 2...",PRIMARY OBJECTIVES: I. To identify the maximum...,,A Phase I Clinical and Biological Evaluation o...,Drug Drug Other,cetuximab erlotinib hydrochloride laboratory b...,"Antibodies Immunoglobulins Antibodies, Monoclo...","Carcinoma Lung Neoplasms Carcinoma, Non-Small-...",- Patients must have histologically or cytolog...,- Patients who have had chemotherapy or radiot...,48,0,1


In [35]:
df_dup_id.shape

(788, 19)

In [36]:
df_dup_id.groupby(["id"])["label"].apply(len)

id
NCT00001832    2
NCT00003646    2
NCT00027586    3
NCT00062036    2
NCT00096382    2
              ..
NCT03026517    2
NCT03066661    2
NCT03088176    6
NCT03093116    2
NCT03101254    4
Name: label, Length: 238, dtype: int64
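The question of whether one ID can be relevant to several topics can also be answered with a single aggregation. A minimal sketch on toy data (the frame `toy` and its values are invented; only the column names follow this notebook) that counts, per ID, the number of distinct topics with a positive label:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["NCT_A", "NCT_A", "NCT_A", "NCT_B", "NCT_C", "NCT_C"],
    "topic": [20, 21, 22, 20, 30, 31],
    "label": [0, 1, 1, 1, 0, 1],
})

# Keep only relevant judgements, then count distinct topics per trial ID.
relevant = toy[toy["label"] > 0]
topics_per_id = relevant.groupby("id")["topic"].nunique()

# IDs that are relevant to more than one topic.
multi_topic_ids = topics_per_id[topics_per_id > 1]
print(multi_topic_ids)
```

Filtering on `label > 0` rather than `label == 1` also covers data where the label has not yet been collapsed to binary.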

In [18]:
map_id_to_multi_label_topics = df.groupby(["id"])[["topic", "label"]].apply(lambda x: x)
map_id_to_multi_label_topics = map_id_to_multi_label_topics.reset_index()

In [3]:
print(df["id"].nunique())
print(df.shape)

7508
(14188, 19)


In [13]:
# Number of topic rows per trial id
map_id_to_multi_topics = df.groupby(["id"])["topic"].apply(len).reset_index()

In [5]:
map_id_to_multi_topics[map_id_to_multi_topics["id"] == "NCT00001144"]

Unnamed: 0,id,topic
7,NCT00001144,3


In [6]:
df[df["id"] == "NCT00001144"]

Unnamed: 0,score,id,brief_summary,brief_title,minimum_age,gender,primary_outcome,detailed_description,keywords,official_title,intervention_type,intervention_name,intervention_browse,condition_browse,inclusion,exclusion,topic,_,label
4686,1.0,NCT00001144,This study will investigate the safety and eff...,Modified Bone Marrow Stem Cell Transplantation...,,All,,CML is a disease which progresses to blast cri...,Peripheral Blood Stem Cells Engraftment Graft ...,Non-Myeloablative Allogeneic Peripheral Blood ...,Drug,Modified bone marrow stem cell transplantation,,"Leukemia Leukemia, Myeloid Leukemia, Myelogeno...",PATIENTS: Patients in chronic phase CML. Age 1...,,20,0,0
4687,1.0,NCT00001144,This study will investigate the safety and eff...,Modified Bone Marrow Stem Cell Transplantation...,,All,,CML is a disease which progresses to blast cri...,Peripheral Blood Stem Cells Engraftment Graft ...,Non-Myeloablative Allogeneic Peripheral Blood ...,Drug,Modified bone marrow stem cell transplantation,,"Leukemia Leukemia, Myeloid Leukemia, Myelogeno...",PATIENTS: Patients in chronic phase CML. Age 1...,,21,0,0
4688,1.0,NCT00001144,This study will investigate the safety and eff...,Modified Bone Marrow Stem Cell Transplantation...,,All,,CML is a disease which progresses to blast cri...,Peripheral Blood Stem Cells Engraftment Graft ...,Non-Myeloablative Allogeneic Peripheral Blood ...,Drug,Modified bone marrow stem cell transplantation,,"Leukemia Leukemia, Myeloid Leukemia, Myelogeno...",PATIENTS: Patients in chronic phase CML. Age 1...,,22,0,0


In [7]:
df[df.duplicated(subset=["id"], keep=False)].sort_values("id")

Unnamed: 0,score,id,brief_summary,brief_title,minimum_age,gender,primary_outcome,detailed_description,keywords,official_title,intervention_type,intervention_name,intervention_browse,condition_browse,inclusion,exclusion,topic,_,label
4687,1.0,NCT00001144,This study will investigate the safety and eff...,Modified Bone Marrow Stem Cell Transplantation...,,All,,CML is a disease which progresses to blast cri...,Peripheral Blood Stem Cells Engraftment Graft ...,Non-Myeloablative Allogeneic Peripheral Blood ...,Drug,Modified bone marrow stem cell transplantation,,"Leukemia Leukemia, Myeloid Leukemia, Myelogeno...",PATIENTS: Patients in chronic phase CML. Age 1...,,21,0,0
4688,1.0,NCT00001144,This study will investigate the safety and eff...,Modified Bone Marrow Stem Cell Transplantation...,,All,,CML is a disease which progresses to blast cri...,Peripheral Blood Stem Cells Engraftment Graft ...,Non-Myeloablative Allogeneic Peripheral Blood ...,Drug,Modified bone marrow stem cell transplantation,,"Leukemia Leukemia, Myeloid Leukemia, Myelogeno...",PATIENTS: Patients in chronic phase CML. Age 1...,,22,0,0
4686,1.0,NCT00001144,This study will investigate the safety and eff...,Modified Bone Marrow Stem Cell Transplantation...,,All,,CML is a disease which progresses to blast cri...,Peripheral Blood Stem Cells Engraftment Graft ...,Non-Myeloablative Allogeneic Peripheral Blood ...,Drug,Modified bone marrow stem cell transplantation,,"Leukemia Leukemia, Myeloid Leukemia, Myelogeno...",PATIENTS: Patients in chronic phase CML. Age 1...,,20,0,0
13191,1.0,NCT00001238,We will investigate the clinical manifestation...,Clinical Manifestations and Molecular Bases of...,730,All,Characterize the natural and clinical historie...,Background: - Disorders under investigation ar...,Urologic Malignant Disorders Birt Hogg Dube Vo...,Clinical Manifestations and Molecular Bases of...,,,,Disease Kidney Neoplasms,- Subject Category A: Category A will include ...,Persons unable to give informed consent.,41,0,0
13190,1.0,NCT00001238,We will investigate the clinical manifestation...,Clinical Manifestations and Molecular Bases of...,730,All,Characterize the natural and clinical historie...,Background: - Disorders under investigation ar...,Urologic Malignant Disorders Birt Hogg Dube Vo...,Clinical Manifestations and Molecular Bases of...,,,,Disease Kidney Neoplasms,- Subject Category A: Category A will include ...,Persons unable to give informed consent.,27,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8368,1.0,NCT03106155,"[Study Design] This study is a single arm, mul...",Vistusertib (AZD2014) Monotherapy in Relapsed ...,6570,All,Objective reponse rate (ORR) by RECIST 1.1 Up ...,,,"Phase II, Single-arm Study of Vistusertib (AZD...",Drug,vistusertib (AZD2014),,Lung Neoplasms Small Cell Lung Carcinoma,] - Provision of informed consent prior to any...,- Participation in another clinical study with...,46,0,0
8367,1.0,NCT03106155,"[Study Design] This study is a single arm, mul...",Vistusertib (AZD2014) Monotherapy in Relapsed ...,6570,All,Objective reponse rate (ORR) by RECIST 1.1 Up ...,,,"Phase II, Single-arm Study of Vistusertib (AZD...",Drug,vistusertib (AZD2014),,Lung Neoplasms Small Cell Lung Carcinoma,] - Provision of informed consent prior to any...,- Participation in another clinical study with...,36,0,0
8366,1.0,NCT03106155,"[Study Design] This study is a single arm, mul...",Vistusertib (AZD2014) Monotherapy in Relapsed ...,6570,All,Objective reponse rate (ORR) by RECIST 1.1 Up ...,,,"Phase II, Single-arm Study of Vistusertib (AZD...",Drug,vistusertib (AZD2014),,Lung Neoplasms Small Cell Lung Carcinoma,] - Provision of informed consent prior to any...,- Participation in another clinical study with...,30,0,0
3858,1.0,NCT03106415,This phase I/II trial studies the best dose of...,Pembrolizumab and Binimetinib in Treating Pati...,6570,All,MTD of pembrolizumab in combination with binim...,PRIMARY OBJECTIVES: I. To determine the maximu...,,Phase I/II Trial of Pembrolizumab in Combinati...,Drug Other Biological,Binimetinib Laboratory Biomarker Analysis Pemb...,Pembrolizumab,Breast Neoplasms Adenocarcinoma Triple Negativ...,- Histological confirmation of adenocarcinoma ...,- Any of the following: - Pregnant women - Nur...,35,0,0


In [8]:
df.columns

Index(['score', 'id', 'brief_summary', 'brief_title', 'minimum_age', 'gender',
       'primary_outcome', 'detailed_description', 'keywords', 'official_title',
       'intervention_type', 'intervention_name', 'intervention_browse',
       'condition_browse', 'inclusion', 'exclusion', 'topic', '_', 'label'],
      dtype='object')

For the baseline experiment, we use `brief_summary` and `brief_title`.

The `id` is kept so that individual trials can be referenced later. It MUST be dropped before the input is passed to training.

The `topic` is kept for randomising the train-test split.

The `label` is the relevance judgement (0, 1 or 2); it is restricted to binary below.
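One way to use `topic` when randomising the train-test split is to split by topic, so that no topic appears in both sets. That is an assumption about the intended design, not something this notebook confirms. A sketch on toy data (the helper name `split_by_topic` and its parameters are hypothetical), which also drops `id` before training:

```python
import random
import pandas as pd

def split_by_topic(frame, test_fraction=0.2, seed=0):
    """Split rows so that each topic lands entirely in train or in test."""
    topics = sorted(frame["topic"].unique().tolist())
    rng = random.Random(seed)
    rng.shuffle(topics)
    n_test = max(1, int(len(topics) * test_fraction))
    test_topics = topics[:n_test]
    is_test = frame["topic"].isin(test_topics)
    return frame[~is_test], frame[is_test]

toy = pd.DataFrame({
    "id": [f"T{i}" for i in range(10)],
    "topic": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "label": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
})

train, test = split_by_topic(toy, test_fraction=0.2)

# The id column is only for bookkeeping, so drop it from the model input.
train_input = train.drop(columns=["id"])
test_input = test.drop(columns=["id"])
```

Splitting on whole topics avoids evaluating on a topic whose disease and gene strings were already seen during training.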

In [9]:
SUBSET_COLUMNS = [
    "id", "brief_summary", "brief_title",
    "topic", "label"
]

df_input = df[SUBSET_COLUMNS].copy()

# Restrict to binary class
df_input["label"] = df_input["label"].replace(to_replace=2, value=1)

### Topics

In [10]:
TOPICS_PATH = "../data/pm_labels_2018"
TOPICS_XML = "topics2018.xml"

topics_xml = et.parse(f"{TOPICS_PATH}/{TOPICS_XML}")
topics_root = topics_xml.getroot()
topic_root = topics_root

In [11]:
topics_dict = defaultdict(list)
for child in topics_root:
    feature_arr = []
    topic_num = child.attrib["number"]
    for cc in child:
        feature_arr.append(cc.text)
    if int(topic_num) % 10 == 0:
        print(f"child.attrib[\"number\"]: {topic_num}")
        print(feature_arr)
    topics_dict[int(topic_num)] = feature_arr

child.attrib["number"]: 10
['melanoma', 'KIT (L576P)', '65-year-old female']
child.attrib["number"]: 20
['melanoma', 'high tumor mutational burden', '86-year-old female']
child.attrib["number"]: 30
['lung cancer', 'ROS1', '71-year-old female']
child.attrib["number"]: 40
['breast cancer', 'ERBB2', '56-year-old female']
child.attrib["number"]: 50
['acute myeloid leukemia', 'FLT3', '13-year-old male']
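The loop above stores each topic's fields positionally. A variant that keys the fields by their XML tag makes downstream code less dependent on field order. The sketch below parses an inline string instead of the real file; the tag names `disease`, `gene` and `demographic` and the overall layout are assumptions for illustration, since the file's tags are not shown in this notebook.

```python
import xml.etree.ElementTree as et

# Inline stand-in for topics2018.xml; the tag names are assumed.
SAMPLE_XML = """
<topics>
  <topic number="10">
    <disease>melanoma</disease>
    <gene>KIT (L576P)</gene>
    <demographic>65-year-old female</demographic>
  </topic>
  <topic number="20">
    <disease>melanoma</disease>
    <gene>high tumor mutational burden</gene>
    <demographic>86-year-old female</demographic>
  </topic>
</topics>
"""

def parse_topics_by_tag(xml_text):
    """Return {topic_number: {tag: text}} for every topic element."""
    root = et.fromstring(xml_text.strip())
    return {
        int(topic.attrib["number"]): {field.tag: field.text for field in topic}
        for topic in root
    }

topics_by_tag = parse_topics_by_tag(SAMPLE_XML)
print(topics_by_tag[10]["gene"])
```

With this shape, a topic that carries an extra field simply gains an extra key instead of shifting positions in a list.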


### Combining them

In [8]:
df_input.head()

Unnamed: 0,id,brief_summary,brief_title,topic,label
0,NCT01513603,"CLAG-M is an active, well tolerated regimen in...","Trial of Cladribine, Cytarabine, Mitoxantrone,...",32,0
1,NCT00582257,The purpose of this study is to establish a ga...,Early Onset and Familial Gastric Cancer Registry,33,0
2,NCT02472678,Patients with a neuroendocrine tumor (NET) fre...,Web-based Tailored Information and Support for...,20,0
3,NCT02956889,"This is a Fleming-A' Hern, single arm, multice...",To Assess The Efficacy And Safety Of Vismodegi...,43,0
4,NCT02510001,This trial is designed to try two new cancer d...,MErCuRIC1: MEK and MET Inhibition in Colorecta...,26,0


In [9]:
df_input["topic_info"] = df_input["topic"].map(topics_dict)
df_input.head()

Unnamed: 0,id,brief_summary,brief_title,topic,label,topic_info
0,NCT01513603,"CLAG-M is an active, well tolerated regimen in...","Trial of Cladribine, Cytarabine, Mitoxantrone,...",32,0,"[leukemia, ABL1, 4-year-old female]"
1,NCT00582257,The purpose of this study is to establish a ga...,Early Onset and Familial Gastric Cancer Registry,33,0,"[gastric cancer, EGFR, 60-year-old female]"
2,NCT02472678,Patients with a neuroendocrine tumor (NET) fre...,Web-based Tailored Information and Support for...,20,0,"[melanoma, high tumor mutational burden, 86-ye..."
3,NCT02956889,"This is a Fleming-A' Hern, single arm, multice...",To Assess The Efficacy And Safety Of Vismodegi...,43,0,"[basal cell carcinoma, PTCH1, 56-year-old female]"
4,NCT02510001,This trial is designed to try two new cancer d...,MErCuRIC1: MEK and MET Inhibition in Colorecta...,26,0,"[colorectal cancer, NRAS, 49-year-old male]"


In [10]:
# Be sure to specify index!
df_input[["disease", "gene", "age_disease"]] = pd.DataFrame(
    df_input["topic_info"].tolist(),
    index=df_input.index
)

In [11]:
import re

def find_age(string):
    # Match checks for patterns starting from the beginning
    x = re.match(r'\d+', string)
    if x is None:
        raise ValueError("Did not find age feature!")
    return int(x.group())

def find_gender(string):
    # Search checks for patterns throughout entire string
    x = re.search(r'(fe)?(male)+', string)
    if x is None:
        raise ValueError("Did not find gender feature!")
    return x.group()

In [12]:
df_input["age"] = df_input["age_disease"].apply(find_age)
df_input["gender"] = df_input["age_disease"].apply(find_gender)

In [13]:
df_input.iloc[:2, ]

Unnamed: 0,id,brief_summary,brief_title,topic,label,topic_info,disease,gene,age_disease,age,gender
0,NCT01513603,"CLAG-M is an active, well tolerated regimen in...","Trial of Cladribine, Cytarabine, Mitoxantrone,...",32,0,"[leukemia, ABL1, 4-year-old female]",leukemia,ABL1,4-year-old female,4,female
1,NCT00582257,The purpose of this study is to establish a ga...,Early Onset and Familial Gastric Cancer Registry,33,0,"[gastric cancer, EGFR, 60-year-old female]",gastric cancer,EGFR,60-year-old female,60,female
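The two helpers can also be folded into one pattern that extracts age and gender together. This is a sketch, assuming every demographic string has the form `<age>-year-old <gender>` as in the samples above; `parse_demographic` and `DEMOGRAPHIC_RE` are hypothetical names. Using `fullmatch` means any string that deviates from the expected form raises instead of being partially parsed.

```python
import re

DEMOGRAPHIC_RE = re.compile(r"(?P<age>\d+)-year-old (?P<gender>female|male)")

def parse_demographic(text):
    """Return (age, gender) from a string such as '4-year-old female'."""
    found = DEMOGRAPHIC_RE.fullmatch(text.strip())
    if found is None:
        raise ValueError(f"Unexpected demographic format: {text!r}")
    return int(found.group("age")), found.group("gender")

print(parse_demographic("4-year-old female"))
print(parse_demographic("49-year-old male"))
```

Named groups keep the two fields tied to a single match, so age and gender cannot come from different parts of the string.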


### Subset data

To make experimenting with the code faster, we work on a subset of the data.

In [14]:
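def subset_data(df, col, p):
    """Sample a fraction p of each binary class (0 and 1) in column col."""
    # value_counts() is indexed by label value, so [0] and [1] are class counts
    counts = df[col].value_counts()
    num_lab_0 = counts[0]
    num_lab_1 = counts[1]

    lab0_subset = df[df[col] == 0].sample(int(p * num_lab_0))
    lab1_subset = df[df[col] == 1].sample(int(p * num_lab_1))

    return pd.concat([lab0_subset, lab1_subset])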

In [19]:
df_inputs_sub = subset_data(df_input, "label", p=0.3).reset_index(drop=True)
df_inputs_sub["label"].value_counts()

0    3642
1     614
Name: label, dtype: int64
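Per-class sampling can also be written with `DataFrameGroupBy.sample` (available since pandas 1.1), which draws the same fraction from every label group and accepts `random_state` for reproducibility. A sketch on toy data; note that `frac` may round group sizes slightly differently from the `int(p * n)` truncation used in `subset_data`.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(14)],
    "label": [0] * 10 + [1] * 4,
})

# Sample half of each class; random_state fixes the draw across runs.
toy_sub = (
    toy.groupby("label")
       .sample(frac=0.5, random_state=42)
       .reset_index(drop=True)
)
print(toy_sub["label"].value_counts())
```

Fixing `random_state` makes the subset stable between notebook runs, which helps when comparing experiments.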

In [20]:
df["label"].value_counts()

0    12141
1     1174
2      873
Name: label, dtype: int64

In [None]:
break

# Sandbox

In [None]:
df_labels = pd.read_csv("../data/pm_labels_2018/qrels-treceval-clinical_trials-2018-v2.txt",
                 names=["topic", "_", "id", "label"], sep=" ")
df_labels.head()

In [None]:
df_features = pd.read_pickle("../data/trials_pickle_2018/all_trials_2018.pickle")
df_features.columns

In [None]:
df_features.head()

In [None]:
df_labels.dtypes

In [None]:
df_features.dtypes

In [None]:
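df_features["id"] = df_features["id"].astype(str)
df_labels["id"] = df_labels["id"].astype(str)

# join matches the "id" column against the other frame's index
df = df_features.join(df_labels.set_index("id"), on="id")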
df.head()

In [None]:
df_features["id"] = df_features["id"].astype(str)
df_labels["id"] = df_labels["id"].astype(str)

print(df_features.shape)
print(df_labels.shape)

df = df_features.merge(df_labels, left_on="id", right_on="id", how="inner")
df.head()

In [None]:
df["input"] = df["brief_summary"] + df["brief_title"] \
                + df["keywords"] + df["primary_outcome"]

In [None]:
df["keywords"].value_counts()

In [None]:
df["primary_outcome"].value_counts()

In [None]:
df["input"].loc[0]

In [None]:
df_input = df[["label", "input"]].copy()
df_input.head()

In [None]:
df_input["label"] = df_input["label"].replace(to_replace=2, value=1)

In [None]:
df_input["label"].value_counts()

In [None]:
from random import sample

def subset_data(p):
    """ 
    Subsetting data for binary only 
    
    p: Fraction in [0, 1]
    """
    num_lab_1 = df_input["label"].value_counts()[0]
    num_lab_2 = df_input["label"].value_counts()[1]
    
    idx_arr_lab_1 = sample(range(0, num_lab_1), int(p*num_lab_1))
    idx_arr_lab_2 = sample(range(0, num_lab_2), int(p*num_lab_2))
    
    return idx_arr_lab_1, idx_arr_lab_2

In [None]:
p = 0.3
idx_arr_lab_1, idx_arr_lab_2 = subset_data(p)

In [None]:
df_input[df_input["label"] == 0]

In [None]:
p = 0.3

num_lab_1 = df_input["label"].value_counts()[0]
num_lab_2 = df_input["label"].value_counts()[1]

lab1_subset = df_input[df_input["label"] == 0].sample(int(p*num_lab_1))
lab2_subset = df_input[df_input["label"] == 1].sample(int(p*num_lab_2))

In [None]:
lab1_subset.shape

In [None]:
lab2_subset.shape

## Prepping Topic Info

In [None]:
TOPICS_PATH = "../data/pm_labels_2018"
TOPICS_XML = "topics2018.xml"

topics_xml = et.parse(f"{TOPICS_PATH}/{TOPICS_XML}")
topics_root = topics_xml.getroot()
topic_root = topics_root

In [None]:
topics_root.attrib

In [None]:
for child in topics_root:
    print(child.attrib["number"])
    for cc in child:
        print(cc.tag)

In [None]:
topics_arr = []
for child in topics_root:
    topic_arr = []
    print(f"child.attrib[\"number\"]: {child.attrib['number']}")
    topic_arr.append(child.attrib["number"])
    for cc in child:
        print(cc.text)
        topic_arr.append(cc.text)
    topics_arr.append(topic_arr)

In [None]:
topics_arr = []
for child in topics_root:
    feature_arr = []
    topic_num = child.attrib["number"]
    print(f"child.attrib[\"number\"]: {topic_num}")
    for cc in child:
        print(cc.text)
        feature_arr.append(cc.text)
    topics_arr.append({topic_num: feature_arr})

In [None]:
topics_arr

In [None]:
df_inputs

In [45]:
seq_a = [1, 2, 3]
seq_b = None
for pair_a, pair_b in zip(seq_a, seq_b):
    print(pair_a)
    print(pair_b)

TypeError: zip argument #2 must support iteration

In [47]:
from tqdm import tqdm

for x in tqdm([1, 2, 3]):
    print(x)

100%|██████████| 3/3 [00:00<00:00, 7815.47it/s]

1
2
3





In [1]:
def parse_topics(topics_path, topics_xml):
    """ Specific XML parsing of the topics file for the TREC PM dataset """
    topics_xml = et.parse(f"{topics_path}/{topics_xml}")
    topics_root = topics_xml.getroot()
    topic_root = topics_root

    topics_dict = defaultdict(list)
    for child in topics_root:
        feature_arr = []
        topic_num = child.attrib["number"]
        for cc in child:
            feature_arr.append(cc.text)
        if int(topic_num) % 10 == 0:
            print(f"child.attrib[\"number\"]: {topic_num}")
            print(feature_arr)
        topics_dict[int(topic_num)] = feature_arr

    return topics_dict


## 2017 Topics

In [5]:
year = 2017  # this section processes the 2017 data

In [27]:
labels_path = f"../data/pm_labels_{year}/qrels_treceval_clinical_trials_{year}.txt"
features_path = f"../data/trials_pickle_{year}/all_trials_{year}.pickle"

df_labels = pd.read_csv(
    labels_path,
    names=["topic", "_", "id", "label"], sep=" "
)
df_features = pd.read_pickle(features_path)

df_features["id"] = df_features["id"].astype(str)
df_labels["id"] = df_labels["id"].astype(str)

print(df_features.shape)
print(df_labels.shape)

# Raw combined
df = df_features.merge(df_labels, left_on="id", right_on="id", how="inner")
print(df.iloc[:2, ])

SUBSET_COLUMNS = [
    "id", "brief_summary", "brief_title",
    "topic", "label"
]

df

(7643, 16)
(13019, 4)
   score           id                                      brief_summary  \
0    1.0  NCT01774162  Endoscopic ultrasound (EUS) is a well-establis...   
1    1.0  NCT01774162  Endoscopic ultrasound (EUS) is a well-establis...   

                                         brief_title minimum_age gender  \
0  EUS-guided Fine Needle Biopsy With a New Core ...        6570    All   
1  EUS-guided Fine Needle Biopsy With a New Core ...        6570    All   

                                     primary_outcome  \
0  Sampling Adequacy at time of procedure The abi...   
1  Sampling Adequacy at time of procedure The abi...   

                                detailed_description  \
0  Background: Endoscopic ultrasound (EUS) is a w...   
1  Background: Endoscopic ultrasound (EUS) is a w...   

                                            keywords  \
0  Endoscopic Ultrasound Fine needle aspiration F...   
1  Endoscopic Ultrasound Fine needle aspiration F...   

                

Unnamed: 0,score,id,brief_summary,brief_title,minimum_age,gender,primary_outcome,detailed_description,keywords,official_title,intervention_type,intervention_name,intervention_browse,condition_browse,inclusion,exclusion,topic,_,label
0,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,18,0,0
1,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,27,0,0
2,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,28,0,0
3,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,30,0,0
4,1.0,NCT01226147,An open-label study to evaluate the efficacy a...,Efficacy and Safety of Tamibarotene(AM80) for ...,7300,All,Renal Function 24 weeks Urinary Protein values...,Tamibarotene is a synthetic retinoid presently...,Lupus Nephritis SLE retinoid tamibarotene,,Drug,Tamibarotene,,Nephritis Lupus Nephritis,- Steroid refractory lupus nephritis - more th...,- Pregnant or breastfeeding female patients - ...,7,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13014,1.0,NCT02595554,Neoadjuvant chemotherapy (NACT) and radical su...,Neoadjuvant Chemotherapy and Radical Surgery i...,6570,Female,Disease free survival 5 years,Patients with International Federation of Gyne...,"Cervical cancer, Stage IIB, Neoadjuvant chemot...",Neoadjuvant Chemotherapy and Radical Surgery V...,Radiation Drug Procedure Drug,Concurrent chemoirradiation Paclitaxel Radical...,Paclitaxel Albumin-Bound Paclitaxel Cisplatin,Uterine Cervical Neoplasms,- Patients with newly histologically confirmed...,- The presence of uncontrolled life-threatenin...,15,0,0
13015,1.0,NCT03031288,Aim of the study is to evaluate the effects of...,Effects of Vented Base Feeding Bottle in Prete...,,All,Frequency of cardiorespiratory events (feeding...,In preterm infants a deficit of both coordinat...,gastroesophageal reflux swallow feeding bottle...,Vented Base Feeding Bottle in Preterm Infants ...,Device Device,A B,,,- weight ≥ 1500 g at the time of examination -...,- congenital abnormalities - perinatal asphyxi...,1,0,0
13016,1.0,NCT02900677,The aims of this three-year study are to: 1. e...,Developing Prehabilitation Program in Patients...,7300,All,Self-report Questionnaire Change from baseline...,Develop and evaluate the effect of patient-cen...,Pancreatic Cancer Fatigue Nutrition Quality of...,Developing and Testing the Effects of Patient-...,Behavioral,Physical and nutrition program,,Pancreatic Neoplasms,- patients with pancreatic cancer and are goin...,- poor functional status,16,0,0
13017,1.0,NCT00021125,RATIONALE: Radiation therapy uses high-energy ...,Radiation Therapy in Treating Patients With He...,,All,Time to local recurrence,OBJECTIVES: - Compare adjuvant continuous hype...,squamous cell carcinoma of the skin stage I sq...,A Randomized Controlled Trial Of CHARTWEL (a C...,Procedure Radiation,adjuvant therapy radiation therapy,,Head and Neck Neoplasms Skin Neoplasms,DISEASE CHARACTERISTICS: - Histologically conf...,,29,0,0


In [28]:
TOPICS_PATH = f"../data/pm_labels_{year}"
TOPICS_XML = f"topics{year}.xml"

topics_dict = parse_topics(TOPICS_PATH, TOPICS_XML)

child.attrib["number"]: 10
['Lung adenocarcinoma', 'KRAS (G12C)', '61-year-old female', 'Hypertension, Hypercholesterolemia']
child.attrib["number"]: 20
['Liposarcoma', 'MDM2 Amplification', '26-year-old male', 'None']
child.attrib["number"]: 30
['Pancreatic adenocarcinoma', 'RB1, TP53, KRAS', '57-year-old female', 'None']


In [29]:
topics_dict

defaultdict(list,
            {1: ['Liposarcoma',
              'CDK4 Amplification',
              '38-year-old male',
              'GERD'],
             2: ['Colon cancer',
              'KRAS (G13D), BRAF (V600E)',
              '52-year-old male',
              'Type II Diabetes, Hypertension'],
             3: ['Meningioma',
              'NF2 (K322), AKT1(E17K)',
              '45-year-old female',
              'None'],
             4: ['Breast cancer',
              'FGFR1 Amplification, PTEN (Q171)',
              '67-year-old female',
              'Depression, Hypertension, Heart Disease'],
             5: ['Melanoma',
              'BRAF (V600E), CDKN2A Deletion',
              '45-year-old female',
              'None'],
             6: ['Melanoma',
              'NRAS (Q61K)',
              '55-year-old male',
              'Hypertension'],
             7: ['Lung cancer', 'EGFR (L858R)', '50-year-old female', 'Lupus'],
             8: ['Lung cancer',
              'EML4-

In [30]:
df["topic_info"] = df["topic"].map(topics_dict)

In [31]:
df.head()

Unnamed: 0,score,id,brief_summary,brief_title,minimum_age,gender,primary_outcome,detailed_description,keywords,official_title,intervention_type,intervention_name,intervention_browse,condition_browse,inclusion,exclusion,topic,_,label,topic_info
0,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,18,0,0,"[Pancreatic cancer, CDK6 Amplification, 48-yea..."
1,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,27,0,0,"[Pancreatic adenocarcinoma, KRAS, TP53, 49-yea..."
2,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,28,0,0,"[Pancreatic ductal adenocarcinoma, ERBB3, 73-y..."
3,1.0,NCT01774162,Endoscopic ultrasound (EUS) is a well-establis...,EUS-guided Fine Needle Biopsy With a New Core ...,6570,All,Sampling Adequacy at time of procedure The abi...,Background: Endoscopic ultrasound (EUS) is a w...,Endoscopic Ultrasound Fine needle aspiration F...,Endoscopic Ultrasound Guided Fine Needle Biops...,Device Device,Fine needle biopsy using ProCore needle for hi...,,Adenocarcinoma Gastrointestinal Stromal Tumors...,- Adult patient 18 years or older - Able to re...,- No detectable lesion - lesion inaccessible t...,30,0,0,"[Pancreatic adenocarcinoma, RB1, TP53, KRAS, 5..."
4,1.0,NCT01226147,An open-label study to evaluate the efficacy a...,Efficacy and Safety of Tamibarotene(AM80) for ...,7300,All,Renal Function 24 weeks Urinary Protein values...,Tamibarotene is a synthetic retinoid presently...,Lupus Nephritis SLE retinoid tamibarotene,,Drug,Tamibarotene,,Nephritis Lupus Nephritis,- Steroid refractory lupus nephritis - more th...,- Pregnant or breastfeeding female patients - ...,7,0,0,"[Lung cancer, EGFR (L858R), 50-year-old female..."


In [32]:
df.index

Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            13009, 13010, 13011, 13012, 13013, 13014, 13015, 13016, 13017,
            13018],
           dtype='int64', length=13019)

In [33]:
print(df["topic_info"].iloc[9])

['Melanoma', 'BRAF (V600E), CDKN2A Deletion', '45-year-old female', 'None']


In [34]:
ti = [len(x) for x in df["topic_info"].tolist()]
set(ti)

{4}

In [35]:
set(ti).pop()

4

In [36]:
df[["disease", "gene", "age_disease"]] = pd.DataFrame(
    df["topic_info"].tolist(),
    index=df.index
)

ValueError: Columns must be same length as key
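
The `ValueError` above follows from the length check a few cells earlier: every `topic_info` list here has four elements, while the assignment names only three target columns. A sketch of one fix on toy data, naming a fourth column; the name `other` is an assumption, since the meaning of the fourth field is not stated in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "topic_info": [
        ["Liposarcoma", "CDK4 Amplification", "38-year-old male", "GERD"],
        ["Melanoma", "NRAS (Q61K)", "55-year-old male", "Hypertension"],
    ]
})

# One column name per list element; "other" is a placeholder name.
columns = ["disease", "gene", "age_disease", "other"]

# Fail early with a clear message if the lists and names disagree.
lengths = set(toy["topic_info"].map(len))
assert lengths == {len(columns)}, f"expected {len(columns)} fields, got {lengths}"

toy[columns] = pd.DataFrame(toy["topic_info"].tolist(), index=toy.index)
print(toy[columns])
```

Checking the list lengths before the assignment turns a mismatch into an explicit message about the number of fields, which is easier to act on than the generic pandas error.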