# Preprocessing training data for temporal relation extraction in medical texts

Goal of temporal relation expression is to link named entites - in the medical domain clinical event and clinical event or temporal expression and temporal expression. Prerequisite of temporal relation in general is named-entity recognition (NER) of clinical events (CE) and temporal expressions (TE).

In this notebook, the MIMIC-III dataset (https://physionet.org/content/mimiciii/1.4/) will accesses and pre-processed. The following aspects of the data will be weakly annotated:
* clinical events
* temporal expressions
* temporal links

Keywords for clinical events and for temporal expressions will be used for annotation. Goal is to obtain weak labels and create $T$ and $Z$ matrices that can be further used for experiments within `knodle`.

Steps in this notebook: 
* loading the data and preparing the pandas dataframe
* generate keywod lists and tag the data (clinical events & temporal expressions)
* generate z- and t- matrices (including a weak labeling for temporal links)

### Loading the data
As a first step, the following part of the MIMIC-III dataset will be loaded:
* NOTEEVENTS.csv

Part of the D_ICD_DIAGNOSES.csv will be employed as keywords for labeling clinical events (diseases).

In [1]:
import os
from time import time # enables time measurement
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
import nltk
import re
import scipy

from minio import Minio
from pprint import pprint
from itertools import chain
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from scipy import sparse
from nltk.corpus import stopwords
from string import punctuation
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.util import bigrams, trigrams
from joblib import dump, load

# plotly fuctions
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
import torch                                        # root package
from torch.utils.data import Dataset, DataLoader    # dataset representation and loading

# pandas-plotting via plotly
pd.options.plotting.backend = "plotly"

pd.set_option('display.max_colwidth', None)

## Load dataset

In [2]:
Notes = pd.read_csv("./data/NOTEEVENTS.csv", low_memory=False)

## Data preview

In [3]:
Notes.head(1)

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CHARTTIME,STORETIME,CATEGORY,DESCRIPTION,CGID,ISERROR,TEXT
0,174,22532,167853.0,2151-08-04,,,Discharge summary,Report,,,"Admission Date: [**2151-7-16**] Discharge Date: [**2151-8-4**]\n\n\nService:\nADDENDUM:\n\nRADIOLOGIC STUDIES: Radiologic studies also included a chest\nCT, which confirmed cavitary lesions in the left lung apex\nconsistent with infectious process/tuberculosis. This also\nmoderate-sized left pleural effusion.\n\nHEAD CT: Head CT showed no intracranial hemorrhage or mass\neffect, but old infarction consistent with past medical\nhistory.\n\nABDOMINAL CT: Abdominal CT showed lesions of\nT10 and sacrum most likely secondary to osteoporosis. These can\nbe followed by repeat imaging as an outpatient.\n\n\n\n [**First Name8 (NamePattern2) **] [**First Name4 (NamePattern1) 1775**] [**Last Name (NamePattern1) **], M.D. [**MD Number(1) 1776**]\n\nDictated By:[**Hospital 1807**]\nMEDQUIST36\n\nD: [**2151-8-5**] 12:11\nT: [**2151-8-5**] 12:21\nJOB#: [**Job Number 1808**]\n"


The most important column is **TEXT**. It contains a clinical note stored as a string. For temporal relation extraction, maybe **CHARTDATE** could also be relevant for relative temporal links.

In [4]:
Notes.TEXT[0]

'Admission Date:  [**2151-7-16**]       Discharge Date:  [**2151-8-4**]\n\n\nService:\nADDENDUM:\n\nRADIOLOGIC STUDIES:  Radiologic studies also included a chest\nCT, which confirmed cavitary lesions in the left lung apex\nconsistent with infectious process/tuberculosis.  This also\nmoderate-sized left pleural effusion.\n\nHEAD CT:  Head CT showed no intracranial hemorrhage or mass\neffect, but old infarction consistent with past medical\nhistory.\n\nABDOMINAL CT:  Abdominal CT showed lesions of\nT10 and sacrum most likely secondary to osteoporosis. These can\nbe followed by repeat imaging as an outpatient.\n\n\n\n                            [**First Name8 (NamePattern2) **] [**First Name4 (NamePattern1) 1775**] [**Last Name (NamePattern1) **], M.D.  [**MD Number(1) 1776**]\n\nDictated By:[**Hospital 1807**]\nMEDQUIST36\n\nD:  [**2151-8-5**]  12:11\nT:  [**2151-8-5**]  12:21\nJOB#:  [**Job Number 1808**]\n'

## Preparing a Pandas Dataframe
### Partially remove punctuation

'\n' needs to be replaced by space, as it is partially directly between words (they would not be tokenised if there is no space inbetween). '[\*\*' and '\*\*]' also need to be replaced so that the temporal tagger is able to find the dates.

Comments:
* ',' is not replaced, as it could be used to link clinical events as OVERLAP (currently, it is not)
* p.o., q.d., b.i.d., q.h.s., are not replaced, as they are needed for temporal expression extraction
* ':' is not replace, because of time (e.g. '10:43')

In [5]:
Notes.TEXT = Notes.TEXT.apply(lambda x: x.lower())
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r'\n', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r'\(', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r'\)', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r'\/', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r'=', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r' \-+ ', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r' * ', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r' # ', ' ', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r'\[[*]{2}[a-z0-9 ]*[*]{2}\]', '', x))
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r'\[[*]{2}|[*]{2}\]', '', x))
# removes duplicates of spaces 
Notes.TEXT = Notes.TEXT.apply(lambda x: re.sub(r' +', ' ', x.lower()))
Notes.TEXT[0]

'admission date: 2151-7-16 discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis. this also moderate-sized left pleural effusion. head ct: head ct showed no intracranial hemorrhage or mass effect, but old infarction consistent with past medical history. abdominal ct: abdominal ct showed lesions of t10 and sacrum most likely secondary to osteoporosis. these can be followed by repeat imaging as an outpatient. , m.d. dictated by: medquist36 d: 2151-8-5 12:11 t: 2151-8-5 12:21 job#: '

In [6]:
# sentence-tokenisation
Notes['SENT']  = Notes.TEXT.apply(lambda x: sent_tokenize(x))
print(Notes.SENT[0])

['admission date: 2151-7-16 discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis.', 'this also moderate-sized left pleural effusion.', 'head ct: head ct showed no intracranial hemorrhage or mass effect, but old infarction consistent with past medical history.', 'abdominal ct: abdominal ct showed lesions of t10 and sacrum most likely secondary to osteoporosis.', 'these can be followed by repeat imaging as an outpatient.', ', m.d.', 'dictated by: medquist36 d: 2151-8-5 12:11 t: 2151-8-5 12:21 job#:']


In [7]:
# save tokenized sentences in joblib format
pickle_file = './sent_notes.joblib'
with open(pickle_file, 'wb') as f:
    dump(Notes.SENT, f, compress='zlib')

In [8]:
# load file:
t0 = time()
with open('./sent_notes.joblib', 'rb') as f:
    sent_notes = load(f)
print('loading time: ', time()-t0)

loading time:  773.1161766052246


In [9]:
del Notes  # if 'Notes' is loaded at the beginning, it can be deleted (will not be used anymore further down)
print(sent_notes[0])

['admission date: 2151-7-16 discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis.', 'this also moderate-sized left pleural effusion.', 'head ct: head ct showed no intracranial hemorrhage or mass effect, but old infarction consistent with past medical history.', 'abdominal ct: abdominal ct showed lesions of t10 and sacrum most likely secondary to osteoporosis.', 'these can be followed by repeat imaging as an outpatient.', ', m.d.', 'dictated by: medquist36 d: 2151-8-5 12:11 t: 2151-8-5 12:21 job#:']


In [10]:
# explode --> each sentence is an instance, adding an ID for each note

all_notes = pd.DataFrame(sent_notes)
all_sents = all_notes.explode('SENT')
all_sents.reset_index(inplace = True)
all_sents.rename(columns = {'index': 'note_id'}, inplace = True)

In [11]:
all_sents.head()

Unnamed: 0,note_id,SENT
0,0,"admission date: 2151-7-16 discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis."
1,0,this also moderate-sized left pleural effusion.
2,0,"head ct: head ct showed no intracranial hemorrhage or mass effect, but old infarction consistent with past medical history."
3,0,abdominal ct: abdominal ct showed lesions of t10 and sacrum most likely secondary to osteoporosis.
4,0,these can be followed by repeat imaging as an outpatient.


In [12]:
# add sentence ID per note ID
all_sents['sent_id'] = all_sents.groupby(by = 'note_id').cumcount()

In [13]:
all_sents.head()

Unnamed: 0,note_id,SENT,sent_id
0,0,"admission date: 2151-7-16 discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis.",0
1,0,this also moderate-sized left pleural effusion.,1
2,0,"head ct: head ct showed no intracranial hemorrhage or mass effect, but old infarction consistent with past medical history.",2
3,0,abdominal ct: abdominal ct showed lesions of t10 and sacrum most likely secondary to osteoporosis.,3
4,0,these can be followed by repeat imaging as an outpatient.,4


In [14]:
all_sents.shape

(34983602, 3)

## Generate keyword-lists

As a first step, clinical events and temproal expressions will be tagged:
* **clinical events**: keywords from diseases_exp.txt will be used, extracted from MeSH (https://www.nlm.nih.gov/mesh/meshhome.html), and D_ICD_DIAGNOSES.csv (part of the MIMIC-III dataset)
* **temporal expressions**: the Transformer_Temporal_Tagger was first used to annotate temporal expressions (TIMEX3 schema (time, date, frequency, duration)), as the model is trained on data tagged with HeidelTime https://github.com/satya77/Transformer_Temporal_Tagger/blob/master/README.md). However, the model was not applicable due to a very long running time. Therefore regular expressions are defined further down in this notebook.


### Tag NEs of clinical events

Keywords will be later exchanged by annotated keywords using the pandas function .replace(), for which regex can be set as true. As I do not want granulation to be tagged as 'g ranula=EVENT-B tion', '\W' is added before and after each keyword ('\Wranula\W') in the keyword-list. When later keywords are replaced by annotated keywords (e.g. 'ranula' by 'ranula=EVENT-B'), they are only replaced if the keyword does not have a preceding letter or number --> a part of granulation will not be tagged with ranula. In order for that to work, a whitespace is added in the list of annotated keywords ' ranula=EVENT-B ', as otherwise the keywords can not be tokenised later.

In [15]:
# load keywords and add all keywords to a keyword-list (keywords_regex)

# some additional keywords are added manually - they can be removed
additional_keywords = ['admission', 'discharge', 'fevers', 'chills', 'nausea', 'vomiting', 'chest pain', 'confusion', 'neck pain', 'cough', 'smoking smoke', 'alcohol abuse', 'confusion', 'vomiting', 'weakness', 'ct scan', 'pain', 'confusion', 'numbness', 'shortness of breath']

data_path = "./"

# MeSH data
with open('./data/diseases_exp.txt') as f1:
    lines1 = f1.readlines()

# column 'SHORT_TITLE - diseaseClasses1'    
with open('./data/DC1.txt') as f2:
    lines2 = f2.readlines()

# column 'LONG_TITLE - specificDiseases1'
with open('./data/SD1.txt') as f3:
    lines3 = f3.readlines()
    
def clean_key(word):
    word = word.lower()
    word = re.sub(r',|\n', '', word)
    word = re.sub(r'\/| +', ' ', word)
    return word

dis_keywords = sorted(set([clean_key(l1) for l1 in lines1]))
dis_keywords = sorted(dis_keywords, key=len, reverse=True)

dc1_keywords = list()
dc1_keywords = sorted(set([clean_key(l2) for l2 in lines2]))
dc1_keywords = sorted(dc1_keywords, key=len, reverse=True)

sd1_keywords = list()
sd1_keywords = sorted(set([clean_key(l3) for l3 in lines3]))
sd1_keywords = sorted(sd1_keywords, key=len, reverse=True)

all_keywords = set(dis_keywords + dc1_keywords + sd1_keywords + additional_keywords)
    
keywords_regex = list()
for k in all_keywords:
    regex = r'\W' + k + '\W'
    keywords_regex.append(regex)

print(len(keywords_regex), 'keywords currently used')
print(keywords_regex[:50])

5123 keywords currently used
['\\Wcandidiasis invasive\\W', '\\Wmyelolipoma\\W', '\\Wthreaten abortantepart\\W', '\\Wviral arthritisup arm\\W', '\\Wrectum injuryclosed\\W', '\\Wpoiselectro cal wat agt\\W', '\\Wmeralgia paresthetica\\W', '\\Warterivirus infections\\W', '\\Walternating esotropia\\W', '\\Wcarcinoma lobular\\W', '\\Wschizophrenia necsubchr\\W', '\\Wattn concentrate deficit\\W', '\\Whemarthrosis forearm\\W', '\\Wconjunctival edema\\W', '\\Weosinophilia\\W', '\\Wtraum arthropathypelvis\\W', '\\Wendocarditis nos\\W', '\\Wsarcoidosis\\W', '\\Whantavirus pulmonary syndrome\\W', '\\Whypocalcem hypomagnes nb\\W', '\\Wuterine tumorpostpartum\\W', '\\Wjoint effusionankle\\W', '\\Wodontogenic tumors\\W', '\\Wplague unspecified\\W', '\\Wtuberculosis avian\\W', '\\Weye infections\\W', '\\Whypermobility syndrome\\W', '\\Wbagassosis\\W', '\\Wallerg arthritishand\\W', '\\Wassaultpoisoning nos\\W', '\\Wcommon cold\\W', '\\Wcicatricial lagophthalm\\W', '\\Wkeratoconus unspecified\\W', '\\W

### Generate a keyword list of tagged clinical events (BIO-schema)

The BIO-schema (inside outside beginning) was introduced by Ramshaw & Marcus 1999. 

In [16]:
# generate list with keywords containing a NE-tag
# list of annotated keywords
dis_ce_keywords = list()

for k in all_keywords:
    kwds = k.split(' ')
    kwds_string = ''
    if len(kwds) > 1:
        kwds_string = kwds[0] + '=EVENT-B'
        for i in range(1, len(kwds)):
            kwds_string += ' ' + kwds[i] + '=EVENT-I'
    elif len(kwds) == 1:
        kwds_string = k + '=EVENT-B'
    kwds_string = ' '+kwds_string+' '
    dis_ce_keywords.append(kwds_string)
    
# examples
print(dis_ce_keywords[-100:-70])

[' bact=EVENT-B arthritispelvis=EVENT-I ', ' instrumnt=EVENT-B failinfusion=EVENT-I ', ' thrombophlebitantepart=EVENT-B ', ' seborrhea=EVENT-B ', ' drowning=EVENT-B nonfatal=EVENT-I submer=EVENT-I ', ' hxthrombophlebitis=EVENT-B ', ' talipes=EVENT-B varus=EVENT-I ', ' flavivirus=EVENT-B infections=EVENT-I ', ' hypothermia=EVENT-B ', ' otorrhea=EVENT-B unspecified=EVENT-I ', ' amyloidosis=EVENT-B nec=EVENT-I ', ' amyotrophic=EVENT-B sclerosis=EVENT-I ', ' chronic=EVENT-B nasopharyngitis=EVENT-I ', ' heredit=EVENT-B elliptocytosis=EVENT-I ', ' tuberculosis=EVENT-B male=EVENT-I genital=EVENT-I ', ' rosacea=EVENT-B ', ' subacute=EVENT-B thyroiditis=EVENT-I ', ' cough=EVENT-B ', ' chondromalacia=EVENT-B patellae=EVENT-I ', ' gonococcal=EVENT-B spondylitis=EVENT-I ', ' zoster=EVENT-B sine=EVENT-I herpete=EVENT-I ', ' rheumatoid=EVENT-B arthritis=EVENT-I ', ' carcinoma=EVENT-B 256=EVENT-I walker=EVENT-I ', ' conns=EVENT-B syndrome=EVENT-I ', ' mediastinal=EVENT-B cyst=EVENT-I ', ' infestation

### Generate an additional keyword list of NEs (NEs marked with '\*')
An additional keyword-list is generated, as the algorithm used to obtain the embeddings (X-matrix) for relation extraction needs the NE to be marked with '\*' at the beginning and the end. The paper where the apporach is described in further detail can be found here: https://ojs.aaai.org/index.php/AAAI/article/view/17717 (Zhou, W., Huang, K., Ma, T., & Huang, J. (2021, May). Document-level relation extraction with adaptive thresholding and localized context pooling. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 16, pp. 14612-14620).)

In [17]:
# list of annotated keywords
ast_keywords = list()

for k in all_keywords:
    kwds = k.split(' ')
    kwds_string = ''
    if len(kwds) > 1:
        kwds_string = '*' + kwds[0]
        for i in range(1, len(kwds)):
            kwds_string += ' ' + kwds[i]
        kwds_string += '*'
    elif len(kwds) == 1:
        kwds_string = '*' + k + '*'
    kwds_string = ' '+kwds_string+' '
    ast_keywords.append(kwds_string)

# examples
print(ast_keywords[-100:-70])

[' *bact arthritispelvis* ', ' *instrumnt failinfusion* ', ' *thrombophlebitantepart* ', ' *seborrhea* ', ' *drowning nonfatal submer* ', ' *hxthrombophlebitis* ', ' *talipes varus* ', ' *flavivirus infections* ', ' *hypothermia* ', ' *otorrhea unspecified* ', ' *amyloidosis nec* ', ' *amyotrophic sclerosis* ', ' *chronic nasopharyngitis* ', ' *heredit elliptocytosis* ', ' *tuberculosis male genital* ', ' *rosacea* ', ' *subacute thyroiditis* ', ' *cough* ', ' *chondromalacia patellae* ', ' *gonococcal spondylitis* ', ' *zoster sine herpete* ', ' *rheumatoid arthritis* ', ' *carcinoma 256 walker* ', ' *conns syndrome* ', ' *mediastinal cyst* ', ' *infestation nos* ', ' *silicotuberculosis* ', ' *meconium obstruction* ', ' *lyme neuroborreliosis* ', ' *achromatopsia* ']


### Replace keywords in the DF with annotated keywords (BIO-schema)

In the following, only a subset of the data (40000 sentences) will be employed.

In [18]:
#to try the algorithm, only use a subset of the data

nobs = 40000 
test_df = all_sents.iloc[0:nobs,:]
ce_annotations = test_df.copy()

In [19]:
ce_annotations.shape

(40000, 3)

In [20]:
# replace keywords with annotated keywords
t0=time()
ce_annotations = test_df.copy()
ce_annotations.SENT = test_df.SENT.replace(to_replace=keywords_regex, value=dis_ce_keywords, regex = True)
print('estimated hours for annotation on full data (hours):', (time()-t0)*all_sents.shape[0]/60/60/nobs)

estimated hours for annotation on full data (hours): 183.41505795662692


In [21]:
## lookup certain sentences within notes
# ce_annotations.SENT[(ce_annotations.note_id == 10) & (ce_annotations.sent_id == 93)]

In [22]:
no_kwd_id = (test_df.SENT == ce_annotations.SENT)

In [23]:
# ce_only contains only sentences with at least 1 clinical event
# the left index: all 40000 sentences are enumerated (from 0-39999), but only those containing a clinical event are listed

ce_only = ce_annotations[~no_kwd_id]
ce_only.head()

Unnamed: 0,note_id,SENT,sent_id
0,0,"admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0
7,1,"admission date: 2118-6-2 discharge=EVENT-B date: 2118-6-14 date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three days of shortness=EVENT-B of=EVENT-I breath=EVENT-I thought by her primary care doctor to be a copd flare.",0
8,1,two days prior to admission=EVENT-B she was started on a prednisone taper and one day prior to admission=EVENT-B she required oxygen at home in order to maintain oxygen saturation greater than 90%.,1
12,1,"review of systems is negative for the following: fevers=EVENT-B chills=EVENT-B nausea=EVENT-B vomiting=EVENT-B night sweats, change in weight, gastrointestinal complaints, neurologic changes, rashes, palpitations=EVENT-B orthopnea=EVENT-B",5
13,1,"is positive for the following: chest pressure occasionally with shortness=EVENT-B of=EVENT-I breath=EVENT-I with exertion, some shortness=EVENT-B of=EVENT-I breath=EVENT-I that is positionally related, but is improved with nebulizer treatment.",6


### Tag NEs of temporal expressions - TIMEX3

According to the TIMEX3 schema, there are 4 classes:
* date
* duration
* frequency
* time

This annotation scheme is explained in more detail in Sohn et al. (2013).

Examples:

**Date** description—explicit (eg, 09/29/1993, 2015-11-13, etc.) and relative (eg, 'the day of admission', 'hospital day three', 'this time', etc)

**Frequency**: can be identified via indication words 'each', 'every','x','times', or Latin abbreviations such as
b.i.d., p.r.n., q6h, q day

**Duration**: 'for 10 days', 'for half a year' 

**Time**: [hh]:[mm]:[ss] or [YYYY]-[MM]-[DD]T[hh]:[mm]:[ss]. 

#### Simple rules to identify the four types of temporal expressions:

In [24]:
t0 = time()

month = 'January|February|March|April|May|June|July|August|September|October|November|December|Jan\.|Feb\.|Mar\.|Apr\.|Jun\.|Jul\.|Aug\.|Sept\.|Oct\.|Nov\.|Dec\.'
count = 'one|two|three|four|five|six|seven|eight|nine|ten|eleven'

date_regex = r'[0-9]{4}(?:\-[0-9]{1,2}\-[0-9]{1,2})?|\d+\-\s[0-9]{4}|today|\d+ of (?:'+month+')|(?:'+month+') \d+(?:,)?(?: \d{4})?|day (?:'+count+')'
date_regex = r'[\s\.,^]' + date_regex + r'?!\-' #r'[\s\.,]'
frequency_regex = r'(?:each|every|\d+|'+count+') time(?:s)?\s?(?:a|per)?\s?(?:day|week|month|year|hours)?(?:daily|weekly)?|daily|weekly|b\.i\.d\.| p\.r\.n\.|q\.h\.s\.|q\.d\.|\d+h|q day|a regular basis|many times' # q.d. = once a day, b.i.d. = twice a day, q.h.s. = before bed
frequency_regex = frequency_regex + r'\W' # not 3 year, but 3 years needs to be tagged
duration_regex = r'(?:\d{1,2}-)?(?:\d+|'+count+') (?:day(?:s)?|week(?:s)?|month(?:s)?|years?)|(?:1|one) year|q\d{1,2}h|d{1,2}\s?wk|\d+\s?mos' # 40 year should not be tagged (e.g. in 40 year old patient), 40 years should be tagged (e.g. smoked for 40 years); '3 day' oin the other hand should be tagged (e.g. '3 day course')
duration_regex = duration_regex + r'\W'
time_regex = r'(?:[0-9]{1,2}\:[0-9]{2}(?:\:[0-9]{2})?|[0-9]{2}(?:\s)?(?:am|a.m.|pm|p.m.)|night|noon|midnight|morning)'

#### 3 keyword lists containing temporal relations will be created:
* list, containing the temporal expressions occurring in the text, which map the regular expressions for the four types
* list, containing the temporal expressions tagged according to the BIO-schema, including the type of temporal expression
* list, containing the temporal expressions tagged with an '*' at the beginning and the end (the tags do not contain the type of temporal expression)

By creating these three lists in parallel, the tags can be easily replaced by one another.

In [25]:
def split_temp_exp(temp_exp_list, exp_type):
    """ function to tag whether the word is at the beginning or 
    inside of the temporal expression (BIO-schema) """
    ann_temp_exp = list()
    for k in temp_exp_list:
        k = k.strip()
        kwds = k.split(' ')
        kwds_string = ''
        if len(kwds) > 1:
            kwds_string = ' ' + kwds[0] + '=' + exp_type + '-B'
            for i in range(1, len(kwds)):
                kwds_string += ' ' + kwds[i] + '=' + exp_type + '-I'
        elif len(kwds) == 1:
            kwds_string = ' ' + k + '=' + exp_type + '-B'
        kwds_string += ' '
        ann_temp_exp.append(kwds_string)
    return ann_temp_exp

In [26]:
def ast_temp_exp(temp_exp_list):
    """ function to tag the temporal expression with an '*' 
    at the beginning and the end (*-schema) """
    ann_ast_te = list()
    for k in temp_exp_list:
        k = k.strip()
        kwds = k.split(' ')
        kwds_string = ''
        if len(kwds) > 1:
            kwds_string = ' *' + kwds[0]
            for i in range(1, len(kwds)):
                kwds_string += ' ' + kwds[i]
            kwds_string += '* '
        elif len(kwds) == 1:
            kwds_string = ' *' + k + '* '
        ann_ast_te.append(kwds_string)
    return ann_ast_te

In [27]:
#dates
found_dates = ce_only.SENT.str.findall(date_regex)
date_values = [v for v in found_dates.values.tolist() if v]
# how the TE is mentioned in the data
# create a list containing all occurring dates
date_val = list(set(np.concatenate(date_values).flat))
# annotated TE mentioned in the data
# create another list with all occurring dates including annotations
te_date_val = split_temp_exp(date_val, 'DATE')
ast_te_date_val = ast_temp_exp(date_val)

#frequencies
found_frequency = ce_only.SENT.str.findall(frequency_regex)
frequency_values = [v for v in found_frequency.values.tolist() if v]
frequency_val = list(set(np.concatenate(frequency_values).flat))
te_frequency_val = split_temp_exp(frequency_val, 'FREQUENCY')
ast_te_frequency_val = ast_temp_exp(frequency_val)

#duration
found_duration = ce_only.SENT.str.findall(duration_regex)
duration_values = [v for v in found_duration.values.tolist() if v]
duration_val = list(set(np.concatenate(duration_values).flat))
te_duration_val = split_temp_exp(duration_val, 'DURATION')
ast_te_duration_val = ast_temp_exp(duration_val)

#time
found_time = ce_only.SENT.str.findall(time_regex)
time_values = [v for v in found_time.values.tolist() if v]
time_val = list(set(np.concatenate(time_values).flat))
te_time_val = split_temp_exp(time_val, 'TIME')
ast_te_time_val = ast_temp_exp(time_val)

#temporal expression keyword-list
all_TE_keyw = date_val + frequency_val + duration_val + time_val
all_TE_ann= te_date_val + te_frequency_val + te_duration_val + te_time_val
all_TE_ast = ast_te_date_val + ast_te_frequency_val + ast_te_duration_val + ast_te_time_val
print('Example: 30th temporal expression found in the data')
print('TE occurring in the text:', all_TE_keyw[30])
print('TE tagged according to BIO-schema:', all_TE_ann[30])
print('TE tagged with \'*\' for generating the embeddings for the relation extraction:', all_TE_ast[30])

Example: 30th temporal expression found in the data
TE occurring in the text:  2118-7-15
TE tagged according to BIO-schema:  2118-7-15=DATE-B 
TE tagged with '*' for generating the embeddings for the relation extraction:  *2118-7-15* 


In [28]:
# replace the temporal expressions by annotated temporal expressions (BIO-schema)
ce_and_te_ann = ce_only.copy()
ce_and_te_ann.SENT = ce_only.SENT.replace(to_replace=all_TE_keyw, value=all_TE_ann, regex = True)
ce_and_te_ann.head(10)

Unnamed: 0,note_id,SENT,sent_id
0,0,"admission date: 2151-7-16=DATE-B discharge=EVENT-B date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0
7,1,"admission date: 2118-6-2=DATE-B discharge=EVENT-B date: 2118-6-14=DATE-B date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three=DURATION-B days=DURATION-I of shortness=EVENT-B of=EVENT-I breath=EVENT-I thought by her primary care doctor to be a copd flare.",0
8,1,two=DURATION-B day=DURATION-I s prior to admission=EVENT-B she was started on a prednisone taper and one=DURATION-B day=DURATION-I prior to admission=EVENT-B she required oxygen at home in order to maintain oxygen saturation greater than 90%.,1
12,1,"review of systems is negative for the following: fevers=EVENT-B chills=EVENT-B nausea=EVENT-B vomiting=EVENT-B night=TIME-B sweats, change in weight, gastrointestinal complaints, neurologic changes, rashes, palpitations=EVENT-B orthopnea=EVENT-B",5
13,1,"is positive for the following: chest pressure occasionally with shortness=EVENT-B of=EVENT-I breath=EVENT-I with exertion, some shortness=EVENT-B of=EVENT-I breath=EVENT-I that is positionally related, but is improved with nebulizer treatment.",6
19,1,2. lacunar cva=EVENT-B,12
34,1,medications on admission=EVENT-B 1. hydrochlorothiazide 25 q.d.=FREQUENCY-B,27
44,1,allergies: norvasc leads to lightheadedness and headache=EVENT-B,37
49,1,"physical exam at time of admission=EVENT-B blood pressure 142 76, heart rate 100 and regular, respirations at 17-21, and 97% axillary temperature.",42
58,1,"heart exam: tachycardic, regular, obscured by loud bilateral wheezing=EVENT-B with increase in the expiratory phase as well as profuse scattered rhonchi throughout the lung fields.",51


In [29]:
# get rid of annotation duplicates
# e.g. if 2 keywords match, such as 'bacterial infections' and 'infections'
duplicates = ['EVENT-B (?:=)?EVENT-I','EVENT-I (?:=)?EVENT-B', 'EVENT-I (?:=)?EVENT-I', 'EVENT-B (?:=)?EVENT-B', 'DATE-I\s+=DATE-I', 'FREQUENCY-I\s+=FREQUENCY-I', 'FREQUENCY-B\s+=FREQUENCY-I', 'FREQUENCY-I\s+=FREQUENCY-B']
corrections = ['EVENT-I', 'EVENT-I', 'EVENT-I', 'EVENT-B', 'DATE-I', 'FREQUENCY-I', 'FREQUENCY-I', 'FREQUENCY-I']
ann_clean = ce_and_te_ann.replace(to_replace=duplicates, value=corrections, regex = True)

ce_and_te_ann = ann_clean

In [30]:
# filter sentences which contain 2 NE
# if there are more than 2 NE, generate a list where always 2 NE are marked and the other NEs not, then explode the list, keeping sentence ID and note ID

def check_number_NE(x):
    row = '-'                             # '-' == sentence contains less than two named entities
    if x.count("-B") == 2:
        row = x
    # if there are more than 2 NE, generate a list where always 2 NE are marked
    if x.count("-B") > 2:
        amount_NEs = x.count("-B")                # count NEs
        row = list()
        new_row = x                               # sent with all NEs marked
        for i in range(1,(amount_NEs)):
            keep_row = new_row
            for y in range(i,(amount_NEs)):
                row_part3 = '-B'
                row_part1 = new_row.split("-B", 2)
                new_row_p3 = ''
                if len(row_part1) > 2:
                    row_part3 = row_part1[2]
                    new_row_p3 = row_part1[2]
                    if re.search('-B ', row_part1[2]):
                        row_part3 = row_part1[2].split("-B", 1)
                        row_part4 = re.sub(r'\=(?:EVENT|DATE|DURATION|FREQUENCY|TIME)$', '', row_part3[0])
                        row_part5 = re.sub(r'\=(?:EVENT|DATE|DURATION|FREQUENCY|TIME)(?:\-)?(?:I|B)?', '', row_part3[1])
                        row_part3 = row_part4 + row_part5
                        # new part 3 - exclude 'I' tags before the next 'B' tag
                        new_r_p = row_part1[2].split('-B', 1)
                        new_r_p2 = re.sub(r'\=(?:EVENT|DATE|DURATION|FREQUENCY|TIME)\-I', '', new_r_p[0])
                        if len(new_r_p) >= 2:
                            new_row_p3 = new_r_p2 + '-B' + new_r_p[1]
                        else:
                            new_row_p3 = new_r_p2 + '-B'
                new_row = row_part1[0] + "-B" + row_part1[1] + "-B" + row_part3
                row.append(new_row)
                # remove tags from 2nd NE
                middle_part = re.sub(r'\=(?:EVENT|DATE|DURATION|FREQUENCY|TIME)$', '', row_part1[1])
                new_row = row_part1[0] + "-B" + middle_part + new_row_p3
            new_row_p1 = keep_row.split("-B", 1)
            p1_new = re.sub(r'\=(?:EVENT|DATE|DURATION|FREQUENCY|TIME)', '', new_row_p1[0])
            second_part = new_row_p1[1].split("-B", 1)
            p2_new = re.sub(r'\=(?:EVENT|DATE|DURATION|FREQUENCY|TIME)\-I', '', second_part[0])
            if len(second_part)>1:
                new_row = p1_new + p2_new + '-B' + second_part[1]
            else:
                new_row = p1_new + p2_new + '-B' 
    return row

ct_te_filtered = ce_and_te_ann.copy() 
ct_te_filtered.SENT = ct_te_filtered.SENT.apply(lambda x: check_number_NE(x))
ct_te_filtered[:2]

Unnamed: 0,note_id,SENT,sent_id
0,0,"[admission date: 2151-7-16=DATE-B discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis , admission date: 2151-7-16=DATE-B discharge date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis , admission date: 2151-7-16=DATE-B discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B , admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis , admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B , admission date: 2151-7-16 discharge date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B ]",0
7,1,"[admission date: 2118-6-2=DATE-B discharge=EVENT-B date: 2118-6-14 date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three days of shortness of breath thought by her primary care doctor to be a copd flare., admission date: 2118-6-2=DATE-B discharge date: 2118-6-14=DATE-B date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three days of shortness of breath thought by her primary care doctor to be a copd flare., admission date: 2118-6-2=DATE-B discharge date: 2118-6-14 date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three=DURATION-B days=DURATION-I of shortness of breath thought by her primary care doctor to be a copd flare., admission date: 2118-6-2=DATE-B discharge date: 2118-6-14 date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three days of shortness=EVENT-B of=EVENT-I breath=EVENT-I thought by her primary care doctor to be a copd flare., admission date: 2118-6-2 discharge=EVENT-B date: 2118-6-14=DATE-B date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three days of shortness of breath thought by her primary care doctor to be a copd flare., admission date: 2118-6-2 discharge=EVENT-B date: 2118-6-14 date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three=DURATION-B days=DURATION-I of shortness of breath thought by her primary care doctor to be a copd flare., admission date: 2118-6-2 discharge=EVENT-B date: 2118-6-14 date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three days of shortness=EVENT-B of=EVENT-I breath=EVENT-I thought by her primary care doctor to be a copd flare., admission date: 2118-6-2 discharge date: 2118-6-14=DATE-B date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three=DURATION-B days=DURATION-I of shortness of breath thought by her primary care doctor to be a copd flare., admission date: 2118-6-2 discharge date: 2118-6-14=DATE-B date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three days of shortness=EVENT-B of=EVENT-I breath=EVENT-I thought by her primary care doctor to be a copd flare., admission date: 2118-6-2 discharge date: 2118-6-14 date of birth: sex: f service: micu and then to medicine history of present illness: this is an 81-year-old female with a history of emphysema not on home o2 , who presents with three=DURATION-B days=DURATION-I of shortness=EVENT-B of=EVENT-I breath=EVENT-I thought by her primary care doctor to be a copd flare.]",0


In [31]:
# explode the data, so that each sentence with 2 NE marked is in one row
more_than_2NE = ct_te_filtered.explode('SENT')
more_than_2NE.head()

Unnamed: 0,note_id,SENT,sent_id
0,0,"admission date: 2151-7-16=DATE-B discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
0,0,"admission date: 2151-7-16=DATE-B discharge date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
0,0,"admission date: 2151-7-16=DATE-B discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0
0,0,"admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
0,0,"admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0


In [32]:
# reset the index, so that in the next step, when the data is exploded to words, it is still clear which words are part of which sentence
more_than_2NE = more_than_2NE.reset_index(drop=True, inplace=False)
more_than_2NE.head()

Unnamed: 0,note_id,SENT,sent_id
0,0,"admission date: 2151-7-16=DATE-B discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
1,0,"admission date: 2151-7-16=DATE-B discharge date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
2,0,"admission date: 2151-7-16=DATE-B discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0
3,0,"admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
4,0,"admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0


In [33]:
# filter sentences with two temporal expressions tagged
# --> no temporal link can be established between such 2 NE

def remove_sent(x):
    # two temporal expressions tagged
    if re.match(r'.*(?:DATE-B|FREQUENCY-B|TIME-B|DURATION-B).*(?:DATE-B|FREQUENCY-B|TIME-B|DURATION-B).*', x):
        x = '-'
    # twice the same clinical event tagged
    # as this is included in the test data, it is currently also included here - although it is debateable
#    elif re.match(r'.*EVENT-B.*EVENT-B.*', x):
#        strsplit = x.split('=EVENT-B')
#        first_event = strsplit[0].split(' ')
#        second_event = strsplit[1].split(' ')
#        if first_event[-1] == second_event[-1]:
#            x = '-'
    return x

more_than_2NE.SENT = more_than_2NE.SENT.apply(lambda x: remove_sent(x))
more_than_2NE.head()

Unnamed: 0,note_id,SENT,sent_id
0,0,"admission date: 2151-7-16=DATE-B discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
1,0,-,0
2,0,"admission date: 2151-7-16=DATE-B discharge date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0
3,0,"admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4=DATE-B service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis",0
4,0,"admission date: 2151-7-16 discharge=EVENT-B date: 2151-8-4 service: addendum: radiologic studies: radiologic studies also included a chest ct, which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B",0


In [34]:
# think about also filtering sentences including negations, such as
# - patient was without ...
# - patient denies ...

In [35]:
# word tokenisation --> each sentence is a list of words
word_notes = more_than_2NE.copy()
word_notes.SENT  = more_than_2NE.SENT.apply(lambda x: word_tokenize(x))

In [36]:
# remove sentences with more than x words and with only one word ('-')
max_sent_length = 40
word_notes['SENT_length'] = word_notes.SENT.str.len()
word_notes = word_notes[word_notes.SENT_length < max_sent_length]
word_notes = word_notes[1 < word_notes.SENT_length]
word_notes.head(10)

Unnamed: 0,note_id,SENT,sent_id,SENT_length
0,0,"[admission, date, :, 2151-7-16=DATE-B, discharge=EVENT-B, date, :, 2151-8-4, service, :, addendum, :, radiologic, studies, :, radiologic, studies, also, included, a, chest, ct, ,, which, confirmed, cavitary, lesions, in, the, left, lung, apex, consistent, with, infectious, process, tuberculosis]",0,37
2,0,"[admission, date, :, 2151-7-16=DATE-B, discharge, date, :, 2151-8-4, service, :, addendum, :, radiologic, studies, :, radiologic, studies, also, included, a, chest, ct, ,, which, confirmed, cavitary, lesions, in, the, left, lung, apex, consistent, with, infectious, process, tuberculosis=EVENT-B]",0,37
3,0,"[admission, date, :, 2151-7-16, discharge=EVENT-B, date, :, 2151-8-4=DATE-B, service, :, addendum, :, radiologic, studies, :, radiologic, studies, also, included, a, chest, ct, ,, which, confirmed, cavitary, lesions, in, the, left, lung, apex, consistent, with, infectious, process, tuberculosis]",0,37
4,0,"[admission, date, :, 2151-7-16, discharge=EVENT-B, date, :, 2151-8-4, service, :, addendum, :, radiologic, studies, :, radiologic, studies, also, included, a, chest, ct, ,, which, confirmed, cavitary, lesions, in, the, left, lung, apex, consistent, with, infectious, process, tuberculosis=EVENT-B]",0,37
5,0,"[admission, date, :, 2151-7-16, discharge, date, :, 2151-8-4=DATE-B, service, :, addendum, :, radiologic, studies, :, radiologic, studies, also, included, a, chest, ct, ,, which, confirmed, cavitary, lesions, in, the, left, lung, apex, consistent, with, infectious, process, tuberculosis=EVENT-B]",0,37
16,1,"[two=DURATION-B, day=DURATION-I, s, prior, to, admission=EVENT-B, she, was, started, on, a, prednisone, taper, and, one, day, prior, to, admission, she, required, oxygen, at, home, in, order, to, maintain, oxygen, saturation, greater, than, 90, %, .]",1,35
18,1,"[two=DURATION-B, day=DURATION-I, s, prior, to, admission, she, was, started, on, a, prednisone, taper, and, one, day, prior, to, admission=EVENT-B, she, required, oxygen, at, home, in, order, to, maintain, oxygen, saturation, greater, than, 90, %, .]",1,35
19,1,"[two, day, s, prior, to, admission=EVENT-B, she, was, started, on, a, prednisone, taper, and, one=DURATION-B, day=DURATION-I, prior, to, admission, she, required, oxygen, at, home, in, order, to, maintain, oxygen, saturation, greater, than, 90, %, .]",1,35
20,1,"[two, day, s, prior, to, admission=EVENT-B, she, was, started, on, a, prednisone, taper, and, one, day, prior, to, admission=EVENT-B, she, required, oxygen, at, home, in, order, to, maintain, oxygen, saturation, greater, than, 90, %, .]",1,35
21,1,"[two, day, s, prior, to, admission, she, was, started, on, a, prednisone, taper, and, one=DURATION-B, day=DURATION-I, prior, to, admission=EVENT-B, she, required, oxygen, at, home, in, order, to, maintain, oxygen, saturation, greater, than, 90, %, .]",1,35


In [37]:
# explode the data, so that each word is its own instance
# this is used to remove punctuation symbols and tag the words
all_words = word_notes.copy()
all_words = all_words.explode('SENT')
print('amount of words: ', len(all_words))
all_words.head(10)
all_words[:10]

amount of words:  125205


Unnamed: 0,note_id,SENT,sent_id,SENT_length
0,0,admission,0,37
0,0,date,0,37
0,0,:,0,37
0,0,2151-7-16=DATE-B,0,37
0,0,discharge=EVENT-B,0,37
0,0,date,0,37
0,0,:,0,37
0,0,2151-8-4,0,37
0,0,service,0,37
0,0,:,0,37


In [38]:
# if the word does not contain [a-z0-9], it will be removed
# this includes punctuation, as well as sentences with no or only one NE, which were replace by '-'
filter_punct = all_words['SENT'].str.contains("^\W$")
all_words = all_words[~filter_punct]
all_words[:10]

Unnamed: 0,note_id,SENT,sent_id,SENT_length
0,0,admission,0,37
0,0,date,0,37
0,0,2151-7-16=DATE-B,0,37
0,0,discharge=EVENT-B,0,37
0,0,date,0,37
0,0,2151-8-4,0,37
0,0,service,0,37
0,0,addendum,0,37
0,0,radiologic,0,37
0,0,studies,0,37


In [39]:
def create_tuple(x):
    """replace word by tuple: (word, tag)"""
    if re.match(r'(?:.*)(?:EVENT|DATE|DURATION|FREQUENCY|TIME)(?:.*)', x):
        wtuple = x.split('=')
        return (wtuple[0], wtuple[1])
    else:
        return (x, 'OTHER')  
word_tag_inst = all_words.copy()   
word_tag_inst.SENT = word_tag_inst.SENT.apply(lambda x: create_tuple(x))
word_tag_inst.head(10)

Unnamed: 0,note_id,SENT,sent_id,SENT_length
0,0,"(admission, OTHER)",0,37
0,0,"(date, OTHER)",0,37
0,0,"(2151-7-16, DATE-B)",0,37
0,0,"(discharge, EVENT-B)",0,37
0,0,"(date, OTHER)",0,37
0,0,"(2151-8-4, OTHER)",0,37
0,0,"(service, OTHER)",0,37
0,0,"(addendum, OTHER)",0,37
0,0,"(radiologic, OTHER)",0,37
0,0,"(studies, OTHER)",0,37


In [40]:
## check, how many NEs are tagged

#counter = 0
#counter2 = 0
#for word, tag in  word_tag_inst.SENT:
#    if re.search(r'.*-B$', tag):      
#        counter += 1
#    if re.search(r'.*\-.*', tag):
#        counter2 += 1
#print('number of NEs (counting -B tags):', counter)
#print('number of words with tags other than OTHER (counting -B and -I tags):', counter2)

In [41]:
# reverse the explode function
merge_sent = word_tag_inst.copy()
merge_sent = (merge_sent.groupby(by = [merge_sent.index, merge_sent.note_id,merge_sent.sent_id])
      .agg({'SENT': lambda x: x.tolist()})) # to remove index, add: '.reset_index()'
merge_sent.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,SENT
Unnamed: 0_level_1,note_id,sent_id,Unnamed: 3_level_1
0,0,0,"[(admission, OTHER), (date, OTHER), (2151-7-16, DATE-B), (discharge, EVENT-B), (date, OTHER), (2151-8-4, OTHER), (service, OTHER), (addendum, OTHER), (radiologic, OTHER), (studies, OTHER), (radiologic, OTHER), (studies, OTHER), (also, OTHER), (included, OTHER), (a, OTHER), (chest, OTHER), (ct, OTHER), (which, OTHER), (confirmed, OTHER), (cavitary, OTHER), (lesions, OTHER), (in, OTHER), (the, OTHER), (left, OTHER), (lung, OTHER), (apex, OTHER), (consistent, OTHER), (with, OTHER), (infectious, OTHER), (process, OTHER), (tuberculosis, OTHER)]"
2,0,0,"[(admission, OTHER), (date, OTHER), (2151-7-16, DATE-B), (discharge, OTHER), (date, OTHER), (2151-8-4, OTHER), (service, OTHER), (addendum, OTHER), (radiologic, OTHER), (studies, OTHER), (radiologic, OTHER), (studies, OTHER), (also, OTHER), (included, OTHER), (a, OTHER), (chest, OTHER), (ct, OTHER), (which, OTHER), (confirmed, OTHER), (cavitary, OTHER), (lesions, OTHER), (in, OTHER), (the, OTHER), (left, OTHER), (lung, OTHER), (apex, OTHER), (consistent, OTHER), (with, OTHER), (infectious, OTHER), (process, OTHER), (tuberculosis, EVENT-B)]"
3,0,0,"[(admission, OTHER), (date, OTHER), (2151-7-16, OTHER), (discharge, EVENT-B), (date, OTHER), (2151-8-4, DATE-B), (service, OTHER), (addendum, OTHER), (radiologic, OTHER), (studies, OTHER), (radiologic, OTHER), (studies, OTHER), (also, OTHER), (included, OTHER), (a, OTHER), (chest, OTHER), (ct, OTHER), (which, OTHER), (confirmed, OTHER), (cavitary, OTHER), (lesions, OTHER), (in, OTHER), (the, OTHER), (left, OTHER), (lung, OTHER), (apex, OTHER), (consistent, OTHER), (with, OTHER), (infectious, OTHER), (process, OTHER), (tuberculosis, OTHER)]"
4,0,0,"[(admission, OTHER), (date, OTHER), (2151-7-16, OTHER), (discharge, EVENT-B), (date, OTHER), (2151-8-4, OTHER), (service, OTHER), (addendum, OTHER), (radiologic, OTHER), (studies, OTHER), (radiologic, OTHER), (studies, OTHER), (also, OTHER), (included, OTHER), (a, OTHER), (chest, OTHER), (ct, OTHER), (which, OTHER), (confirmed, OTHER), (cavitary, OTHER), (lesions, OTHER), (in, OTHER), (the, OTHER), (left, OTHER), (lung, OTHER), (apex, OTHER), (consistent, OTHER), (with, OTHER), (infectious, OTHER), (process, OTHER), (tuberculosis, EVENT-B)]"
5,0,0,"[(admission, OTHER), (date, OTHER), (2151-7-16, OTHER), (discharge, OTHER), (date, OTHER), (2151-8-4, DATE-B), (service, OTHER), (addendum, OTHER), (radiologic, OTHER), (studies, OTHER), (radiologic, OTHER), (studies, OTHER), (also, OTHER), (included, OTHER), (a, OTHER), (chest, OTHER), (ct, OTHER), (which, OTHER), (confirmed, OTHER), (cavitary, OTHER), (lesions, OTHER), (in, OTHER), (the, OTHER), (left, OTHER), (lung, OTHER), (apex, OTHER), (consistent, OTHER), (with, OTHER), (infectious, OTHER), (process, OTHER), (tuberculosis, EVENT-B)]"
16,1,1,"[(two, DURATION-B), (day, DURATION-I), (s, OTHER), (prior, OTHER), (to, OTHER), (admission, EVENT-B), (she, OTHER), (was, OTHER), (started, OTHER), (on, OTHER), (a, OTHER), (prednisone, OTHER), (taper, OTHER), (and, OTHER), (one, OTHER), (day, OTHER), (prior, OTHER), (to, OTHER), (admission, OTHER), (she, OTHER), (required, OTHER), (oxygen, OTHER), (at, OTHER), (home, OTHER), (in, OTHER), (order, OTHER), (to, OTHER), (maintain, OTHER), (oxygen, OTHER), (saturation, OTHER), (greater, OTHER), (than, OTHER), (90, OTHER)]"
18,1,1,"[(two, DURATION-B), (day, DURATION-I), (s, OTHER), (prior, OTHER), (to, OTHER), (admission, OTHER), (she, OTHER), (was, OTHER), (started, OTHER), (on, OTHER), (a, OTHER), (prednisone, OTHER), (taper, OTHER), (and, OTHER), (one, OTHER), (day, OTHER), (prior, OTHER), (to, OTHER), (admission, EVENT-B), (she, OTHER), (required, OTHER), (oxygen, OTHER), (at, OTHER), (home, OTHER), (in, OTHER), (order, OTHER), (to, OTHER), (maintain, OTHER), (oxygen, OTHER), (saturation, OTHER), (greater, OTHER), (than, OTHER), (90, OTHER)]"
19,1,1,"[(two, OTHER), (day, OTHER), (s, OTHER), (prior, OTHER), (to, OTHER), (admission, EVENT-B), (she, OTHER), (was, OTHER), (started, OTHER), (on, OTHER), (a, OTHER), (prednisone, OTHER), (taper, OTHER), (and, OTHER), (one, DURATION-B), (day, DURATION-I), (prior, OTHER), (to, OTHER), (admission, OTHER), (she, OTHER), (required, OTHER), (oxygen, OTHER), (at, OTHER), (home, OTHER), (in, OTHER), (order, OTHER), (to, OTHER), (maintain, OTHER), (oxygen, OTHER), (saturation, OTHER), (greater, OTHER), (than, OTHER), (90, OTHER)]"
20,1,1,"[(two, OTHER), (day, OTHER), (s, OTHER), (prior, OTHER), (to, OTHER), (admission, EVENT-B), (she, OTHER), (was, OTHER), (started, OTHER), (on, OTHER), (a, OTHER), (prednisone, OTHER), (taper, OTHER), (and, OTHER), (one, OTHER), (day, OTHER), (prior, OTHER), (to, OTHER), (admission, EVENT-B), (she, OTHER), (required, OTHER), (oxygen, OTHER), (at, OTHER), (home, OTHER), (in, OTHER), (order, OTHER), (to, OTHER), (maintain, OTHER), (oxygen, OTHER), (saturation, OTHER), (greater, OTHER), (than, OTHER), (90, OTHER)]"
21,1,1,"[(two, OTHER), (day, OTHER), (s, OTHER), (prior, OTHER), (to, OTHER), (admission, OTHER), (she, OTHER), (was, OTHER), (started, OTHER), (on, OTHER), (a, OTHER), (prednisone, OTHER), (taper, OTHER), (and, OTHER), (one, DURATION-B), (day, DURATION-I), (prior, OTHER), (to, OTHER), (admission, EVENT-B), (she, OTHER), (required, OTHER), (oxygen, OTHER), (at, OTHER), (home, OTHER), (in, OTHER), (order, OTHER), (to, OTHER), (maintain, OTHER), (oxygen, OTHER), (saturation, OTHER), (greater, OTHER), (than, OTHER), (90, OTHER)]"


### Alternative tagging schema

Up to now, the BIO-schema was used to tag the data. However, for generating the embeddings for relation extraction (X-matrix in Knodle), another tag-schema is necessary. In this notebook, the apporach from Zhou et al. (2021) is used to generate the embeddings (https://github.com/wzhouad/ATLOP/blob/main/prepro.py). For that, NEs are marked with '\*' at the beginning and the end. When the keyword-lists are generated above, there were two different tagged keyword lists generated in parallel: one with the BIO-schema and one which I call '\*'-schema. This allows to replace the BIO-tags with '\*'-tags.

For that, the DF all_words will be used. This way, the DF merge_sent (BIO-schema) and the reulting DF ast_sent ('\*'-schema) contain the same data (with the same preprocessing) with different tags.

In [42]:
# reverse the explode function to merge the words to sentences (lists of words) and then the words to a string (sentence)
ast_sent = all_words.copy()
ast_sent = (ast_sent.groupby(by = [ast_sent.index, ast_sent.note_id, ast_sent.sent_id])
      .agg({'SENT': lambda x: x.tolist()})) # to remove index, add: '.reset_index()'
ast_sent.SENT = ast_sent.SENT.str.join(' ')
# add ' ' at the end and the beginning of the sentence, because otherwise the keyword is not found 
ast_sent.SENT = ast_sent.SENT.apply(lambda x: ' ' + x + ' ')
ast_sent.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,SENT
Unnamed: 0_level_1,note_id,sent_id,Unnamed: 3_level_1
0,0,0,admission date 2151-7-16=DATE-B discharge=EVENT-B date 2151-8-4 service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis
2,0,0,admission date 2151-7-16=DATE-B discharge date 2151-8-4 service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B
3,0,0,admission date 2151-7-16 discharge=EVENT-B date 2151-8-4=DATE-B service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis
4,0,0,admission date 2151-7-16 discharge=EVENT-B date 2151-8-4 service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B
5,0,0,admission date 2151-7-16 discharge date 2151-8-4=DATE-B service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis=EVENT-B


In [43]:
# all_TE_ann - temporal expressions with BIO-schema
# all_TE_ast - temporal expressions annotated with *-schema
# dis_ce_keywords - clinical events annotated with BIO-schema
# ast_keywords - clinical events annotated with *-schema

ast_ann_sent = ast_sent.copy()

ast_ann_sent.SENT = ast_ann_sent.SENT.replace(to_replace=all_TE_ann, value=all_TE_ast, regex = True)
ast_ann_sent.SENT = ast_ann_sent.SENT.replace(to_replace=dis_ce_keywords, value=ast_keywords, regex = True)
ast_ann_sent.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,SENT
Unnamed: 0_level_1,note_id,sent_id,Unnamed: 3_level_1
0,0,0,admission date *2151-7-16* *discharge* date 2151-8-4 service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis
2,0,0,admission date *2151-7-16* discharge date 2151-8-4 service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process *tuberculosis*
3,0,0,admission date 2151-7-16 *discharge* date *2151-8-4* service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process tuberculosis
4,0,0,admission date 2151-7-16 *discharge* date 2151-8-4 service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process *tuberculosis*
5,0,0,admission date 2151-7-16 discharge date *2151-8-4* service addendum radiologic studies radiologic studies also included a chest ct which confirmed cavitary lesions in the left lung apex consistent with infectious process *tuberculosis*


In [44]:
# save tokenized sentences in joblib format
pickle_file2 = './ast_ann_sent.joblib'
with open(pickle_file2, 'wb') as f:
    dump(ast_ann_sent.SENT, f, compress='zlib')

## Building labeling functions for temporal relation extraction

The i2b2 dataset was annotated with the i2b2 schema. In the i2b2 corpus schema, there are 8 different classes to temporally link clinical events and/or temporal expressions: BEFORE, BEF_OVERLAP, OVERLAP, DURING, ENDS_BY, AFTER, BEGUN_BY, SIMULTANEOUS (Sun et al. 2013). As an analysis of Sun et al. showed that the data agreement was low on some TLINKs, they merged them to three labels:
* BEFORE: BEFORE, ENDED_BY, BEFORE_OVERLAP
* AFTER: BEGUN_BY, AFTER
* OVERLAP: SIMULTANEOUS, OVERLAP, DURING
 
As a first step, temporal relation between two NEs within a sentence are taken into account (later maybe also include neighboring sentences)


* dimensions of Z and T matrices:
    - Z(instances * LF) = sentence with 2 NE marked (bio-schema) * LF (keywords for 3 different classes + regular expressions)
    - T(LF * classes) = LF (keywords for 3 different classes + regular expressions) * classes (TR --> 3 temporal links)

### Generate Z matrix

* The instances of the matrix: word_tag_inst containing (word, tag) tuples including the ID of the note as well as the sentence within this note
* Features: all keywords for temporal links 

In [45]:
# load keywords for temporal links

data_path = "./"
with open('./before.txt') as f:
    before = f.readlines()
    
with open('./after.txt') as f:
    after = f.readlines()

with open('./overlap.txt') as f:
    overlap = f.readlines()

before_kw = sorted(set([clean_key(kw) for kw in before])) 
after_kw = sorted(set([clean_key(kw) for kw in after]))
overlap_kw = sorted(set([clean_key(kw) for kw in overlap]))

# generate tuples: (keyword, class)
before_tuples = [(kw, 'before') for kw in before_kw]
after_tuples = [(kw, 'after') for kw in after_kw]
overlap_tuples = [(kw, 'overlap') for kw in overlap_kw]
all_TL_tuples = before_tuples + after_tuples + overlap_tuples

all_TLinks = before_kw + after_kw + overlap_kw

print(len(all_TLinks), 'keywords currently mapped')
print(all_TLinks)

# all_Tinks is needed to generate the Z matrix
# all_TL_tuples is needed to generate the T matrix

72 keywords currently mapped
['afore', 'aforetime', 'ago', 'ahead', 'ante', 'antecedently', 'anteriorly', 'before', 'before present', 'began', 'ere', 'fore', 'former', 'formerly', 'forward', 'heretofore', 'in advance', 'in front', 'in the past', 'past', 'precendently', 'previous', 'previously', 'prior', 'since', 'sooner', 'started', 'up to now', 'after', 'afterwards', 'back', 'back of', 'behind', 'below', 'ensuing', 'hind', 'hindmost', 'in the rear', 'later', 'next', 'posterior', 'postliminary', 'rear', 'subsequential', 'subsequently', 'succeeding', 'thereafter ', 'accompanying', 'agreeing', 'at the same time', 'coetaneous', 'coeval', 'coexistent', 'coexisting', 'coincide', 'coincident', 'coinciding', 'concurrent', 'concurring', 'contemporaneous', 'contemporary', 'in sync', 'intersect', 'or', 'over', 'overlap', 'simultaneous', 'synchronal', 'synchronic', 'synchronous', 'while', 'with']


In [46]:
# generate Z matrix
# check for each keyword, whether it is in a sentence and then add a column for that keyword

def merge_words(x):
    words = [w for w, t in x]
    sentence = ' '.join(words)
    return sentence

for kwd in all_TLinks:
    kwd_regex = ' ' + kwd + ' '
    # create one column (sentence_merged)
    sentence_merged = merge_sent.SENT.apply(
        lambda x: (re.search(kwd_regex, merge_words(x)) != None)
    ).astype('int')
    column_name = kwd 
    merge_sent[kwd] = sentence_merged # column added for each keyword
    
# keywords with regular expressions - keywords also need to be added to keyword-lists
    
# 'in' + date in the past (e.g. 'in 2009')
in_regex_p = 'in \d{4}'
sent_in_p = merge_sent.SENT.apply(
    lambda x: (re.search(in_regex_p, merge_words(x)) != None)
    ).astype('int') # --> 1 if the keyword is found in the sentence
column_name = 'in-past'
merge_sent['in-past'] = sent_in_p
# add keyword to keyword-lists
all_TL_tuples += [('in-past', 'before')]
all_TLinks += ['in-past']

# 'in' + date in the future (e.g. 'in three weeks')
in_regex_f = 'in .{2,8}(?:hour|day|week|month|year)'
sent_in_f = merge_sent.SENT.apply(
    lambda x: (re.search(in_regex_f, merge_words(x)) != None)
    ).astype('int')
column_name = 'in-future'
merge_sent['in-future'] = sent_in_f
all_TL_tuples += [('in-future', 'after')]
all_TLinks += ['in-future']

# 'for' + duration (e.g. 'for 10 years')
for_regex = 'for .{2,8}(?:hour|day|week|month|year)'
sent_for = merge_sent.SENT.apply(
    lambda x: (re.search(for_regex, merge_words(x)) != None)
    ).astype('int')
column_name = 'for'
merge_sent['for'] = sent_for
all_TL_tuples += [('for', 'before')]
all_TLinks += ['for']

# what is not included yet, but has potential to increase the model:
# - 'in' and 'for' are only relevant it one NE is a TE and one a clinical event
# - ',' could be used to annotated 'simultaneous' (headache, nausea and fever)
# - 'on' + date (e.g. 12/12) --> overlap
# - 'and' between two CE --> overlap
    
# temporal links in toy example
merge_sent.iloc[:,5::].sum()

ante                0
antecedently        0
anteriorly          0
before             24
before present      0
                 ... 
while              24
with              981
in-past            30
in-future         113
for               121
Length: 71, dtype: int64

In [47]:
# print(merge_sent[all_TLinks][:10])
# print(len(all_TLinks))

In [48]:
test_data = merge_sent.copy()
test_data.reset_index(inplace = True)
test_data.drop(columns = ['sent_id', 'note_id', 'SENT', 'level_0'], inplace = True)

test_data.head()

Unnamed: 0,afore,aforetime,ago,ahead,ante,antecedently,anteriorly,before,before present,began,...,overlap,simultaneous,synchronal,synchronic,synchronous,while,with,in-past,in-future,for
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [49]:
# print(test_data.keys())
# print(all_TLinks)

In [50]:
zmatrix = test_data.values.astype('int') 
zmatrix = torch.tensor(zmatrix)

In [51]:
pickle_file3 = './z_torch_matrix.joblib'
with open(pickle_file3, 'wb') as f:
    dump(zmatrix, f, compress='zlib')

### Generate T-matrix

* The instances of the matrix: word_tag_inst containing (word, tag) tuples including the ID of the note as well as the sentence within this note
* Features: all keywords for temporal links 

In [57]:
t_matrix = pd.DataFrame(all_TL_tuples)

all_classes = set([cl for kwd, cl in all_TL_tuples])

for cl in all_classes:
    t_matrix[cl] = t_matrix[1].apply(
        lambda x: x == cl
    ).astype('int')                        # boolean is converted to int
    column_name = cl 

del t_matrix[1]                            # delete class in tuple

print(t_matrix)
print(t_matrix[0])

            0  after  overlap  before
0       afore      0        0       1
1   aforetime      0        0       1
2         ago      0        0       1
3       ahead      0        0       1
4        ante      0        0       1
..        ...    ...      ...     ...
70      while      0        1       0
71       with      0        1       0
72    in-past      0        0       1
73  in-future      1        0       0
74        for      0        0       1

[75 rows x 4 columns]
0         afore
1     aforetime
2           ago
3         ahead
4          ante
        ...    
70        while
71         with
72      in-past
73    in-future
74          for
Name: 0, Length: 75, dtype: object


In [58]:
t_matrix2 = t_matrix.copy()
t_matrix2.drop(columns = [0], inplace = True)

In [59]:
t_matrix = t_matrix2.values.astype('int') # float?
t_matrix2 = torch.tensor(t_matrix)

# save tokenized sentences in joblib format
pickle_file_v02 = './t_matrix.joblib'
with open(pickle_file_v02, 'wb') as f:
    dump(t_matrix2, f, compress='zlib')

### Obtain contextual embeddings for the X-matrix

The rows of the X-Matrix contain the same columns as the Z-matrix: sentences with two marked NE. However, the sentences are not marked according to the bio-schema, but the \*-schema. This is due to requirements for using the code provided at  https://github.com/wzhouad/ATLOP (Zhou et al. 2021). 

How the contextual embeddings are obtained for the X-matrix can be found in the Jupyter notebook word_embeddings_for_relation_extraction-task_3.ipynb

### Test- and development data

The preprocessing of the test- and development data can be found in the Jupyter notebook preprocessing_the_test_data-i2b2_data.ipynb

# References

Gumiel, Y. B., Silva e Oliveira, L. E., Claveau, V., Grabar, N., Paraiso, E. C., Moro, C., & Carvalho, D. R. (2021). Temporal Relation Extraction in Clinical Texts: A Systematic Review. ACM Computing Surveys (CSUR), 54(7), 1-36.

Sohn, S., Wagholikar, K. B., Li, D., Jonnalagadda, S. R., Tao, C., Komandur Elayavilli, R., & Liu, H. (2013). Comprehensive temporal information detection from clinical text: medical events, time, and TLINK identification. Journal of the American Medical Informatics Association, 20(5), 836-842.

Sun, W., Rumshisky, A., & Uzuner, O. (2013). Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association, 20(5), 806-813.

Zhou, W., Huang, K., Ma, T., & Huang, J. (2021, May). Document-level relation extraction with adaptive thresholding and localized context pooling. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 16, pp. 14612-14620).