## Treatments and Diseases/Problems Extraction from Tagged Entities

treatment_description_only.txt_treatmenttagged and treatment_description_only.txt_problemtagged are the outputs of CubNER; each line holds one token and its NER tag. The following code extracts the multi-token mentions from these NER results. For instance, from

proprioceptive B-treatment

neuromuscular I-treatment

facilitation I-treatment

exercises I-treatment

or O

to 'proprioceptive neuromuscular facilitation exercises'.
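
The B-/I-/O convention above can be sketched as a small standalone function. This is an illustrative sketch, not the loop used in the cells that follow: the name `extract_mentions`, its `entity` parameter, and the (token, tag) input format are assumptions made for the example, and it closes a mention as soon as a tag other than I- is seen.

```python
def extract_mentions(tagged_tokens, entity="treatment"):
    """Collect multi-token mentions from (token, tag) pairs in B-/I-/O format."""
    begin, inside = f"B-{entity}", f"I-{entity}"
    mentions, current = [], []
    for token, tag in tagged_tokens:
        if tag == begin:
            # a new mention starts; store the previous one if there is one
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == inside and current:
            # continue the mention that is currently open
            current.append(token)
        else:
            # any other tag (or an I- tag with no open mention) closes the mention
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


example = [
    ("proprioceptive", "B-treatment"),
    ("neuromuscular", "I-treatment"),
    ("facilitation", "I-treatment"),
    ("exercises", "I-treatment"),
    ("or", "O"),
]
print(extract_mentions(example))
# ['proprioceptive neuromuscular facilitation exercises']
```

Passing `entity="problem"` would apply the same logic to the B-problem / I-problem tags.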

In [55]:
import pandas as pd

In [60]:
import re
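def clean(text):
    # strip parentheses, vertical bars, commas, periods and the ® sign
    # (periods inside numbers are removed too, e.g. '3.6' becomes '36')
    p = re.compile(r'[()|,.®]')
    return p.sub('', text)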
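
Because the pattern in clean removes every period, a decimal number inside a mention loses its decimal point. The sketch below shows that effect and one possible variant that keeps a period when it sits between two digits. The name `clean_keep_decimals` and the sample string are made up for this illustration; the notebook itself does not use this variant.

```python
import re

# same character set as clean(): ( ) | , . and the ® sign
STRIP_ALL = re.compile(r'[()|,.®]')

# variant: drop a period only when it is not surrounded by digits on both sides
STRIP_KEEP_DECIMALS = re.compile(r'[()|,®]|(?<!\d)\.|\.(?!\d)')


def clean_keep_decimals(text):
    """Hypothetical alternative to clean() that preserves decimal points."""
    return STRIP_KEEP_DECIMALS.sub('', text)


sample = 'ZOLADEX® 3.6 mg'            # made-up input for illustration
print(STRIP_ALL.sub('', sample))       # ZOLADEX 36 mg
print(clean_keep_decimals(sample))     # ZOLADEX 3.6 mg
```

A leading or trailing period such as in '.5' or '5.' is still removed by this variant, since it only keeps periods that have a digit on each side.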

In [61]:
# Extract treatments from treatment_description_only.txt_treatmenttagged
# output: a list with one list of extracted treatment mentions per trial

treatments = []
one_trial = []
intext = False  # mention currently being built, or False if there is none
with open('data/treatment_description_only.txt_treatmenttagged') as file:
    for line in file:
        l = line.split()
        if not l:
            # a blank line marks the end of a trial
            treatments.append(list(set(one_trial))) # remove duplicate elements
            one_trial = []
        elif len(l) < 3:
            pass  # skip lines that have no tag in the third column
        else:
            tag = l[2]
            # combine tokens into a string using the tags as hints (B-treatment: beginning of a term; I-treatment: inside the term)
            # note: intext is stored only when the next B-treatment tag is read; O tags and blank lines do not reset it
            if tag == 'B-treatment':
                if intext:
                    one_trial.append(clean(intext))
                intext = l[0]
            elif tag == 'I-treatment':
                intext = intext + ' ' + l[0]
        

In [62]:
# check the extracted treatment mentions of the first 10 trials

treatments[:10]

[['proprioceptive neuromuscular facilitation exercises',
  'Motion',
  'proprioceptive neuromuscular facilitation',
  'cardiac overload'],
 ['conventional exercise programmers', 'radical mastectomy', 'kg weight'],
 ['required neuromuscular relaxation',
  'external force',
  'n=214 Hypothesis: Post-intervention balance and gait assessments',
  'specific rehabilitative interventions',
  'their postural control Conventional balance perturbation',
  'balance perturbation intervention',
  'gait and balance-control',
  'inpatient rehabilitation services',
  'standard BWSS training',
  'rehabilitative methods'],
 [],
 ['LY01005 36 mg', 'ZOLADEX  36 mg', 'stroke lesion', 'chemotherapy yes'],
 [],
 ['blocks',
  'serum E2 LH and FSH Safety evaluation including vital signs physical examination laboratory tests'],
 ['Placebo 2 pills'],
 ['therapy including subthalamic nucleus deep brain stimulation',
  'the brain cholinergic neurotransmitter system',
  'Gait problems postural instability',
  'righ

In [65]:
# Extract diseases from treatment_description_only.txt_problemtagged
# output: a list with one list of extracted disease mentions per trial

diseases = []
one_trial = []
intext = False  # mention currently being built, or False if there is none
with open('data/treatment_description_only.txt_problemtagged') as file:
    for line in file:
        l = line.split()
        if not l:
            # a blank line marks the end of a trial
            diseases.append(list(set(one_trial))) # remove duplicate elements
            one_trial = []
        elif len(l) < 3:
            pass  # skip lines that have no tag in the third column
        else:
            tag = l[2]
            # combine tokens into a string using the tags as hints (B-problem: beginning of a term; I-problem: inside the term)
            # note: intext is stored only when the next B-problem tag is read; O tags and blank lines do not reset it
            if tag == 'B-problem':
                if intext:
                    one_trial.append(clean(intext))
                intext = l[0]
            elif tag == 'I-problem':
                intext = intext + ' ' + l[0]
        

In [66]:
# check the extracted disease mentions of the first 10 trials

diseases[:10]

[['repetition maximum', 'cardiac overload'],
 ['Motion', 'kg weight'],
 ['required neuromuscular relaxation',
  'n=214 Hypothesis: Post-intervention balance and gait assessments',
  'stroke survivors',
  'balance',
  'impaired balance regulation loss',
  'functional independence',
  'fall risk',
  'stroke lesion',
  'falling',
  'mobility-related functional tasks',
  'falls',
  'any injurious falls',
  'poor coordination',
  'a BBS score',
  'gait and balance-control',
  'daily living activities Fear',
  'inpatient rehabilitation services',
  'other injurious falls'],
 ['BWSS sessions and time',
  '14 skipping mutation positive results screening period'],
 ['mg/tablet 1 tablet/time Blood samples',
  'ZOLADEX  36 mg',
  'withdrawal',
  'follow-up death',
  'LY01005 36 mg'],
 ['serum E2 LH and FSH Safety evaluation including vital signs physical examination laboratory tests'],
 ['Amitriptyline Blinded Period 2 2nd 4 weeks:',
  'blocks',
  'schizophrenia',
  'Amitriptyline + 25mg',
  'scr

In [67]:
# import ids.csv
data_filename = "data/ids.csv"
df = pd.read_csv(data_filename)

In [68]:
# reconnect each nct_id with the corresponding disease and treatment mentions
df['disease'] = diseases
df['treatment'] = treatments
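
The assignment above pairs ids and mention lists purely by position, so the i-th list must belong to the i-th row of ids.csv. A minimal sketch of an explicit check that could run before the assignment is shown below; the helper name `attach_mentions` and the placeholder ids are assumptions made for this example.

```python
def attach_mentions(ids, mentions):
    """Pair each id with its mention list, failing loudly on a length mismatch."""
    if len(ids) != len(mentions):
        raise ValueError(f"{len(ids)} ids but {len(mentions)} mention lists")
    return list(zip(ids, mentions))


# placeholder ids and mentions, for illustration only
pairs = attach_mentions(["NCT-A", "NCT-B"], [["x", "y"], []])
print(pairs)
# [('NCT-A', ['x', 'y']), ('NCT-B', [])]
```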

In [69]:
df

Unnamed: 0,nct_id,disease,treatment
0,NCT05110365,"[repetition maximum, cardiac overload]",[proprioceptive neuromuscular facilitation exe...
1,NCT05110339,"[Motion, kg weight]","[conventional exercise programmers, radical ma..."
2,NCT05110300,"[required neuromuscular relaxation, n=214 Hypo...","[required neuromuscular relaxation, external f..."
3,NCT05110196,"[BWSS sessions and time, 14 skipping mutation ...",[]
4,NCT05110170,"[mg/tablet 1 tablet/time Blood samples, ZOLADE...","[LY01005 36 mg, ZOLADEX 36 mg, stroke lesion,..."
...,...,...,...
130391,NCT00000116,"[Retinitis pigmentosa RP, inherited retinal de...","[reductions, infusions, vitamin]"
130392,NCT00000115,"[intraocular pressure, the retinal pigment epi...",[standardized Early Treatment Diabetic Retinop...
130393,NCT00000114,"[vision, standardized Early Treatment Diabetic...","[reductions, four treatment groups: 15000 IU/d..."
130394,NCT00000113,"[vision and blindness, vitamin E, eye growth, ...",[]


In [70]:
# save in file
df.to_csv('extracted_disease_treatment_terms.csv', sep=',',header=True, index=False)
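
Since the disease and treatment columns hold Python lists, to_csv writes each cell as the text representation of the list, and read_csv returns that text as a plain string. One way to get the lists back when reloading is sketched here, assuming the cells contain only string literals; the toy frame and the in-memory buffer stand in for the real file.

```python
import ast
import io

import pandas as pd

# toy frame with a placeholder id, standing in for the real table
toy = pd.DataFrame({"nct_id": ["NCT-A"], "treatment": [["x", "y z"]]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# read back without a converter: the list comes back as a string
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw["treatment"][0]).__name__)   # str

# read back with ast.literal_eval to rebuild the list
restored = pd.read_csv(io.StringIO(csv_text),
                       converters={"treatment": ast.literal_eval})
print(restored["treatment"][0])             # ['x', 'y z']
```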

In [83]:
# for treatment: expand the list of mentions so that each row holds one nct id and one mention
ids = df['nct_id']
frames = []
for i in range(len(ids)):
    if not treatments[i]:
        continue  # trials without mentions contribute no rows
    df_temp = pd.DataFrame({'nct_id': [ids[i]] * len(treatments[i]),
                            'treatment': treatments[i]})
    frames.append(df_temp)
# DataFrame.append was removed in pandas 2.0; concatenate all pieces once instead
df_treatment = pd.concat(frames)
    

In [85]:
df_treatment.to_csv('data/treatment_terms.csv', sep=',',header=True, index=False)
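
The one-row-per-mention layout can also be produced without an explicit loop by using DataFrame.explode. The sketch below runs on a toy frame with placeholder ids; unlike the loop version it resets the index, so the row labels run from 0 upward instead of restarting for every trial.

```python
import pandas as pd

# toy frame with placeholder ids, standing in for df
toy = pd.DataFrame({
    "nct_id": ["NCT-A", "NCT-B", "NCT-C"],
    "treatment": [["x", "y"], [], ["z"]],
})

long_df = (
    toy[["nct_id", "treatment"]]
    .explode("treatment")            # one row per list element
    .dropna(subset=["treatment"])    # empty lists become NaN rows; drop them
    .reset_index(drop=True)
)
print(long_df)
```

The same call with the disease column would give the corresponding disease table.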

In [84]:
df_treatment

Unnamed: 0,nct_id,treatment
0,NCT05110365,proprioceptive neuromuscular facilitation exer...
1,NCT05110365,Motion
2,NCT05110365,proprioceptive neuromuscular facilitation
3,NCT05110365,cardiac overload
0,NCT05110339,conventional exercise programmers
...,...,...
1,NCT00000114,four treatment groups: 15000 IU/day vitamin
2,NCT00000114,Laser acuity
0,NCT00000102,nifedipine vs placebo
1,NCT00000102,adrenocorticotropic hormone


In [87]:
# for disease: expand the list of mentions so that each row holds one nct id and one mention
frames = []
for i in range(len(ids)):
    if not diseases[i]:
        continue  # trials without mentions contribute no rows
    df_temp = pd.DataFrame({'nct_id': [ids[i]] * len(diseases[i]),
                            'disease': diseases[i]})
    frames.append(df_temp)
# DataFrame.append was removed in pandas 2.0; concatenate all pieces once instead
df_disease = pd.concat(frames)

In [88]:
df_disease

Unnamed: 0,nct_id,disease
0,NCT05110365,repetition maximum
1,NCT05110365,cardiac overload
0,NCT05110339,Motion
1,NCT05110339,kg weight
0,NCT05110300,required neuromuscular relaxation
...,...,...
5,NCT00000113,Myopia Evaluation Trial COMET
6,NCT00000113,progressive addition lenses
7,NCT00000113,single vision lenses
8,NCT00000113,accommodation and myopia


In [89]:
df_disease.to_csv('data/disease_terms.csv', sep=',',header=True, index=False)