## Data Cleaning and Reformatting

This Jupyter notebook preprocesses the free-text descriptions from detailed_descriptions.txt, together with information from designs.txt and conditions.txt, and converts them into the input formats required by different models.

In [2]:
import pandas as pd

First, screening:

We are only interested in the clinical trials whose purpose is treatment.

In [3]:
# import detailed_descriptions.txt
data_filename = "data/detailed_descriptions.txt"
df = pd.read_csv(data_filename, sep='|')
# import designs.txt
d_file = "data/designs.txt"
d_df = pd.read_csv(d_file, sep='|')
# import conditions.txt
codi_file = "data/conditions.txt"
condi_df = pd.read_csv(codi_file, sep='|')

In [4]:
# extract the trials (nct_id) whose primary purpose is treatment
treatment_id_df = d_df[d_df['primary_purpose']=="Treatment"]['nct_id']

In [5]:
# select the descriptions and conditions of the treatment trials
screened_des = pd.merge(df,treatment_id_df,on='nct_id')
screened_cond = pd.merge(screened_des,condi_df, on='nct_id')
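
To make the screening step concrete, here is a minimal sketch on toy data. The two small frames and their values are made up for illustration and only mimic the columns used above; the filter-then-inner-join pattern is the same as in the cells above.

```python
import pandas as pd

# Toy stand-ins for designs.txt and detailed_descriptions.txt (values are made up)
toy_designs = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "primary_purpose": ["Treatment", "Prevention", "Treatment"],
})
toy_descriptions = pd.DataFrame({
    "id": [1, 2, 3],
    "nct_id": ["NCT001", "NCT002", "NCT003"],
    "description": ["first trial", "second trial", "third trial"],
})

# Keep only the ids of treatment trials, then inner-join to select their descriptions
toy_treatment_ids = toy_designs.loc[toy_designs["primary_purpose"] == "Treatment", "nct_id"]
toy_screened = pd.merge(toy_descriptions, toy_treatment_ids, on="nct_id")
print(toy_screened["nct_id"].tolist())
```

Because the merge is an inner join, any trial whose id is missing from the treatment list is dropped from the result.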

Second, cleaning:

In [6]:
# extract the free text from the file
description = screened_des['description'].values.tolist()

In [7]:
# cleaning the free text data:
# removing '\r~~', '\r~', '~', '|', dangling hyphens, and repeated whitespace

import re

def clean_description(text):
    # remove '\r~' and '\r~~'
    s = re.sub(r'\r~~?', '', text)
    # remove '~' followed by whitespace
    s = re.sub(r'~\s+', '', s)
    # collapse runs of whitespace into a single space
    s = re.sub(r'\s\s+', ' ', s)
    # remove '|' (used later as the field separator)
    s = re.sub(r'\|', '', s)
    # remove ' - ' and '- '
    s = re.sub(r'\s-\s|-\s', '', s)
    return s

description_cleaned = list(map(clean_description, description))
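
As a quick sanity check of the cleaning rules, the sketch below applies the same substitutions, in the same order, to a made-up sample string. The helper name `clean_text_sketch` and the sample text are hypothetical and exist only for this illustration; precompiling the patterns once is a design choice, not something the pipeline above requires.

```python
import re

# Same substitutions as in clean_description, precompiled once and applied in order
_SUBSTITUTIONS = [
    (re.compile(r'\r~~?'), ''),       # '\r~' and '\r~~'
    (re.compile(r'~\s+'), ''),        # '~' followed by whitespace
    (re.compile(r'\s\s+'), ' '),      # runs of whitespace
    (re.compile(r'\|'), ''),          # field separator
    (re.compile(r'\s-\s|-\s'), ''),   # ' - ' and '- '
]

def clean_text_sketch(text):
    # hypothetical helper: apply each substitution to the running result
    for pattern, replacement in _SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text

sample = "Phase II study.\r~~  Patients receive   drug A."
cleaned_sample = clean_text_sketch(sample)
print(cleaned_sample)  # Phase II study. Patients receive drug A.
```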


In [10]:
# write the cleaned descriptions to file as the input data of CubNER
# screened_des['cleaned_description'] = description_cleaned
# temp = screened_des.sort_values(['nct_id'],ascending=False)
# des_only_ordered = temp['cleaned_description']
# description_file = open("data/treatment_description_only.txt", "w")
# for element in des_only_ordered:
#     description_file.write(element + "\n")
# description_file.close()

In [9]:
# id extraction
# des_only_nctid_ordered = temp['nct_id']
# des_only_nctid_ordered.to_csv('ids.csv', sep=',',header=True, index=False)

The following code is for preparing the input data for DNorm.

DNorm expects its input in PubTator format, because it was built to normalize disease mentions in biomedical articles and mines text from their titles and abstracts.

-------------------------------------------------------------------------------------------------------------------

An example of the PubTator input format:

20085714|t|Autosomal-dominant striatal degeneration is caused by a mutation in the phosphodiesterase 8B gene.

20085714|a|Autosomal-dominant striatal degeneration (ADSD) is an autosomal-dominant movement disorder affecting the striatal part of the basal ganglia. ADSD is characterized by bradykinesia, dysarthria, and muscle rigidity. These symptoms resemble idiopathic Parkinson disease, but tremor is not present. Using genetic linkage analysis, we have mapped the causative genetic defect to a 3.25 megabase candidate region on chromosome 5q13.3-q14.1. A maximum LOD score of 4.1 (Theta = 0) was obtained at marker D5S1962. Here we show that ADSD is caused by a complex frameshift mutation (c.94G>C+c.95delT) in the phosphodiesterase 8B (PDE8B) gene, which results in a loss of enzymatic phosphodiesterase activity. We found that PDE8B is highly expressed in the brain, especially in the putamen, which is affected by ADSD. PDE8B degrades cyclic AMP, a second messenger implied in dopamine signaling. Dopamine is one of the main neurotransmitters involved in movement control and is deficient in Parkinson disease. We believe that the functional analysis of PDE8B will help to further elucidate the pathomechanism of ADSD as well as contribute to a better understanding of movement disorders.

--------------------------------------------------------------------------------------------------------------------

't' stands for 'title' and 'a' stands for 'abstract'. The numbers in the first column are document identifiers. The idea here is to use the conditions specified by the clinical trial registers as the 'title', since these conditions are already summarized, and to use the detailed description of the trial as the 'abstract'.
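
The mapping described above can be sketched as a small formatting helper. The function name `to_pubtator`, the identifier, and the abstract text are hypothetical; only the `id|t|...` and `id|a|...` line layout comes from the example above.

```python
def to_pubtator(doc_id, title, abstract):
    """Return the title line and the abstract line for one document."""
    # hypothetical helper: 't' marks the title line, 'a' marks the abstract line
    return f"{doc_id}|t|{title}\n{doc_id}|a|{abstract}\n"

# The conditions play the role of the title, the cleaned description that of the abstract
record = to_pubtator("NCT00000000", "septic shock", "A toy description of the trial.")
print(record, end="")
```

Since '|' separates the fields, the title and abstract must not contain that character, which is why the cleaning step removes it.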

In [125]:
# create a list of "a" representing 'abstract'
alist = ["a"]*len(description_cleaned)

In [126]:
# append the 'a' and cleaned description to the original dataframe
screened_des['cleaned_description'] = description_cleaned
screened_des['type'] = alist

# drop the id and description columns since they are not useful
ndf = screened_des.drop(['id', 'description'], axis=1)

In [128]:
# reorder the columns making it consistent with the PubTator input format:
order = [0,2,1] 
ndf = ndf[[ndf.columns[i] for i in order]]

In [61]:
# I generated a toy dataset to test the format and tried it with DNorm.

short_ndf = ndf.head(10)

In [63]:
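# fill the title rows with the placeholder 'title'
# (DataFrame.append was removed in pandas 2.0, so the rows are added with pd.concat)
ids = short_ndf['nct_id'].values.tolist()
title_rows = pd.DataFrame({'nct_id': ids, 'type': 't', 'cleaned_description': 'title'})
short_ndf = pd.concat([short_ndf, title_rows], ignore_index=True)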

In [66]:
# sort so that rows with the same id are together and 't' comes before 'a'
short_ndf = short_ndf.sort_values(['nct_id', 'type'],ascending=False)
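
The descending sort works because 't' sorts after 'a', so reversing the order puts the title row first within each id. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy rows (made up): one 'a' row and one 't' row per id, in arbitrary order
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002", "NCT001", "NCT002"],
    "type": ["a", "a", "t", "t"],
})

# Descending on both keys: larger ids first, and 't' before 'a' within each id
ordered = toy.sort_values(["nct_id", "type"], ascending=False)
print(ordered.values.tolist())
```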

In [67]:
short_ndf.to_csv('PubTator_short.txt', sep='|',header=False, index=False)

In [129]:
# the following code generates the data in PubTator input format
# nctid | type | content

condi_ndf = screened_cond.drop(['id_x','id_y','description','name'], axis=1)

In [130]:
condi_ndf_dic = condi_ndf.to_dict('records')

In [133]:
# combine the conditions that share the same nct_id with 'and'
# store in a dictionary for easily accessing the content(value) by id(key) later
nctid_names_dict = {}
for d in condi_ndf_dic:
    nid = d['nct_id']
    if nid in nctid_names_dict.keys():
        nctid_names_dict[nid] = nctid_names_dict[nid] + ' and ' + clean_description(d['downcase_name'])
    else:
        nctid_names_dict[nid] = clean_description(d['downcase_name'])
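
The same grouping can also be expressed with a pandas `groupby`, which avoids the explicit loop. This is a sketch on made-up rows, and it assumes the condition names are already cleaned, so it does not call `clean_description`.

```python
import pandas as pd

# Toy stand-in for the merged conditions table (values are made up)
toy_conditions = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001", "NCT002"],
    "downcase_name": ["insulin resistance", "obesity, morbid", "stroke"],
})

# Join all condition names that share an nct_id with ' and '
joined = (
    toy_conditions
    .groupby("nct_id", sort=False)["downcase_name"]
    .agg(" and ".join)
    .to_dict()
)
print(joined)
```

The result is a dictionary keyed by nct_id, which matches the shape of `nctid_names_dict`.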

In [134]:
nctid_names_dict

{'NCT05080569': 'infertility unexplained and luteal phase defect and fertility issues and pregnancy related',
 'NCT05080543': 'septic shock',
 'NCT05080530': 'painful diabetic neuropathy',
 'NCT05077644': 'post-partum depression',
 'NCT05080322': 'allergic rhinitis',
 'NCT05080283': 'adult cochlear implant recipients',
 'NCT05080205': 'insulin resistance and obesity, morbid',
 'NCT05080114': 'laparoscopy and surgery and hysterectomy',
 'NCT05080075': 'knee osteoarthritis',
 'NCT05080010': 'retinoblastoma',
 'NCT04978467': 'stroke',
 'NCT05079802': 'resorption of tooth or root',
 'NCT05079789': 'sodium retention and edema and nephrotic syndrome',
 'NCT05079646': 'fracture resistance',
 'NCT05079568': 'gastroparesis',
 'NCT05079555': 'hip fractures',
 'NCT05079529': 'cardiometabolic syndrome',
 'NCT05079503': 'locally advanced rectal cancer',
 'NCT05079490': 'developmental coordination disorder',
 'NCT05079464': 'mild cognitive impairment and vascular cognitive impairment',
 'NCT05079269

In [136]:
# For an nct id that has no condition in conditions.txt, 'title' is used as placeholder content
ids = list(set(ndf['nct_id'].values.tolist()))
tol = len(ids)
title_rows = []
for i, nid in enumerate(ids, start=1):
    # collect the rows first; DataFrame.append was removed in pandas 2.0 and is slow in a loop
    title_rows.append({'nct_id': nid, 'type': 't', 'cleaned_description': nctid_names_dict.get(nid, 'title')})
    if i % 10000 == 0:
        print(f'processing {i}/{tol}')
ndf = pd.concat([ndf, pd.DataFrame(title_rows)], ignore_index=True)

processing 10000/130396
processing 20000/130396
processing 30000/130396
processing 40000/130396
processing 50000/130396
processing 60000/130396
processing 70000/130396
processing 80000/130396
processing 90000/130396
processing 100000/130396
processing 110000/130396
processing 120000/130396
processing 130000/130396


In [137]:
ndf = ndf.sort_values(['nct_id', 'type'],ascending=False)

In [138]:
compression_opts = dict(method='zip', archive_name='PubTator_cleaned_des.tsv')
ndf.to_csv('out.zip', sep='|', index=False, header=False, compression=compression_opts)
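
To check that the exported archive can be read back, a round trip on a toy frame could look like the sketch below. The file names and rows are made up, and the write goes to a temporary directory so that nothing in the working directory is touched.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same three columns as the exported data (values are made up)
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001"],
    "type": ["t", "a"],
    "cleaned_description": ["stroke", "A toy description."],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_out.zip")
    opts = dict(method="zip", archive_name="toy_pubtator.txt")
    toy.to_csv(path, sep="|", index=False, header=False, compression=opts)
    # read_csv infers zip compression from the .zip extension
    restored = pd.read_csv(path, sep="|", header=None, names=list(toy.columns))

print(restored.shape)
```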