# Fine-tuning XLM-R on CodiEsp-D

In this notebook, the XLM-R model is fine-tuned on both the training and development sets of the CodiEsp-D corpus, following a multi-label sequence classification approach. The predictions made by the model on the test set are also saved, in order to further evaluate its clinical coding performance (see `results/CodiEsp-D/Evaluation.ipynb`).

In [1]:
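# os, numpy and pandas are used throughout the notebook
import os
import sys

import numpy as np
import pandas as pd
import tensorflow as tf

# Auxiliary components
sys.path.append("..")
from nlp_utils import *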

# XLM-R tokenizer
from transformers import XLMRobertaTokenizer
import sentencepiece_pb2
model_name = "xlm-roberta-base"
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
spt = sentencepiece_pb2.SentencePieceText()

# Hyper-parameters
text_col = "raw_text"
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 37
LR = 3e-5
train_weight = 4
all_abs_weight = 1

random_seed = 0
tf.random.set_seed(random_seed)

## Load text

First, the text files of the CodiEsp training and development corpora are loaded into separate dataframes.

The CIE-Diagnóstico codes are loaded as well.

In [2]:
corpus_path = "../datasets/codiesp_v4/"
abs_corpus_path = "../datasets/abstractsWithCIE10_v2/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train/text_files/"
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f)]
train_data = load_text_files(train_files, train_path)
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 3.74 ms, sys: 8.12 ms, total: 11.9 ms
Wall time: 11.5 ms
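
`load_text_files` comes from the project's `nlp_utils` module, which is not shown in this notebook. The following is a minimal sketch of what such a helper might look like, assuming it simply reads each file as UTF-8 and returns the contents in the same order as the file names. The name `load_text_files_sketch` and the encoding are assumptions, not the project's actual code.

```python
import os
import tempfile


def load_text_files_sketch(file_names, path):
    """Hypothetical stand-in for nlp_utils.load_text_files.

    Reads every file in `file_names` from directory `path` and returns
    the raw contents as a list, preserving the order of `file_names`.
    """
    texts = []
    for name in file_names:
        with open(os.path.join(path, name), encoding="utf-8") as f:
            texts.append(f.read())
    return texts


# Usage on a throwaway directory (not the CodiEsp corpus)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in [("a.txt", "Varón de 72 años."), ("b.txt", "Mujer de 47 años.")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(content)
    demo_texts = load_text_files_sketch(["a.txt", "b.txt"], tmp_dir)
```

Keeping the returned list aligned with `file_names` matters because the dataframe above pairs each text with a `doc_id` derived from the file name at the same position.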


In [4]:
df_text_train.shape

(500, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,S0365-66912006000600011-1,Un varón de 13 años es remitido para valoració...
1,S1139-13752009000200010-2,Paciente de 42 años diagnosticado de pancoliti...
2,S1130-05582017000100037-1,"Varón de 72 años, sin antecedentes médicos de ..."
3,S1139-76322016000300015-1,Lactante de ocho días cuyos padres consultan p...
4,S0211-69952011000100019-1,Mujer de 47 años de edad con antecedentes de g...


In [6]:
df_text_train.raw_text[0]

'Un varón de 13 años es remitido para valoración oftalmológica por mala visión. Fenotípicamente era un niño de talla corta con una estatura de 133 cm, braquimorfia y braquidactilia en las cuatro extremidades.\nEl paciente presentaba un error refractivo corregido de -13,00 -6,50 a 1º en el ojo derecho y de -16,00-6,25 a 179º en el izquierdo. Con dicha corrección alcanzaba una agudeza visual de 0,4 y 0,2 respectivamente. No existía diplopía monocular ni hallazgos en la motilidad ocular extrínseca e intrínseca.\nEl diámetro corneal horizontal era de 12,0 mm en ambos ojos y la paquimetría de 613 y 611 micras respectivamente. La cámara anterior era estrecha, apreciándose iridofacodonesis bilateral. Se evidenció microesferofaquia con desplazamiento anterior de ambos cristalinos dentro de la cámara posterior.\nLa presión intraocular era de 20 mmHg bilateralmente. Gonioscópicamente se apreció un ángulo estrecho simétrico en ambos ojos grado II según Schaffer.\nLa exploración mediante topógrafo

We also load the CIE-Diagnóstico codes table:

In [7]:
df_codes_train = pd.read_table(corpus_path + "train/trainD.tsv", sep='\t', header=None)

In [8]:
df_codes_train.columns = ["doc_id", "code"]

In [9]:
df_codes_train.shape

(5639, 2)

In [10]:
df_codes_train.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000700014-1,n44.8
1,S0004-06142005000700014-1,z20.818
2,S0004-06142005000700014-1,r60.9
3,S0004-06142005000700014-1,r52
4,S0004-06142005000700014-1,a23.9


In [11]:
len(set(df_codes_train["doc_id"]))

500

### Development corpus

In [12]:
%%time
dev_path = corpus_path + "dev/text_files/"
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f)]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 5.72 ms, sys: 0 ns, total: 5.72 ms
Wall time: 5.44 ms


In [13]:
df_text_dev.shape

(250, 2)

In [14]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,S1130-63432016000600013-1,"Varón de 75 años, con antecedentes de hiperuri..."
1,S0365-66912003000600010-1,"Paciente de 33 años de edad, gestante de 34 se..."
2,S0211-69952012000200030-1,Mujer de 67 años con múltiples factores de rie...
3,S0365-66912004000900009-1,Paciente de 55 años que acudió a urgencias por...
4,S1139-76322016000300016-2,"Lactante de 1 mes y 29 días, sin antecedentes ..."


In [15]:
df_text_dev.raw_text[0]

'Varón de 75 años, con antecedentes de hiperuricemia en tratamiento con Alopurinol que ingresa para realización de resección transuretral de próstata.\nPostoperatorio inmediato sin incidencias con tratamiento con Pantoprazol, Ciprofloxacino, Paracetamol, Enantyum y Alopurinol. Al cuarto día de postoperatorio presenta mareos, temblor con componente mioclónico en extremidades y tronco e incapacidad para caminar, sin verse alteraciones analíticas. En esta situación se pauta Rivotril y se suspende el tratamiento con Ciprofloxacino, desapareciendo la clínica mioclónica y mejorando el estado del paciente, por lo que se decide el alta hospitalaria.\n\n'

We also load the CIE-Diagnóstico codes table:

In [16]:
df_codes_dev = pd.read_table(corpus_path + "dev/devD.tsv", sep='\t', header=None)

In [17]:
df_codes_dev.columns = ["doc_id", "code"]

In [18]:
df_codes_dev.shape

(2677, 2)

In [19]:
df_codes_dev.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000900016-1,q62.11
1,S0004-06142005000900016-1,n28.89
2,S0004-06142005000900016-1,n39.0
3,S0004-06142005000900016-1,r31.9
4,S0004-06142005000900016-1,n23


In [20]:
len(set(df_codes_dev["doc_id"]))

250

### Train-Dev abstracts corpus

From the additional abstracts corpus, we load the text from the abstracts containing CIE-Diagnóstico codes which are present either in the training or development CodiEsp-D corpora:

In [21]:
%%time
df_text_all_abs = pd.read_table(abs_corpus_path + "train_dev_abstracts_text.tsv", sep='\t')

CPU times: user 1.58 s, sys: 124 ms, total: 1.7 s
Wall time: 1.7 s


In [22]:
df_text_all_abs.shape

(100397, 3)

We keep only the abstracts whose subtoken sequence length does not exceed the maximum input sequence size used by the XLM-R model (128 subtokens):

In [23]:
all_abs_doc_one_frag = set(pd.read_table(abs_corpus_path + "all_abstracts_seq_len_xlm_r_128.tsv", 
                                               sep='\t', header=None)[0])

In [24]:
len(all_abs_doc_one_frag)

20438

In [25]:
df_text_all_abs = df_text_all_abs[df_text_all_abs["doc_id"].isin(all_abs_doc_one_frag)]

In [26]:
df_text_all_abs.shape

(13339, 3)
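
The list of abstracts that fit in a single fragment is read from a precomputed file, so the length check itself is not shown here. A minimal sketch of such a check follows, assuming that two positions are reserved for the `<s>` and `</s>` special tokens. Whether the precomputed file counts special tokens is not stated in this notebook, and `fits_in_window` and the toy tokenizer are hypothetical names used only for illustration.

```python
def fits_in_window(text, tokenize, seq_len=128, n_special=2):
    """True if `text` fits in one model input of `seq_len` positions.

    `tokenize` is any callable mapping a string to a list of subtokens;
    `n_special` reserves room for the <s> and </s> tokens (an assumption).
    """
    return len(tokenize(text)) + n_special <= seq_len


# Toy whitespace tokenizer standing in for the XLM-R SentencePiece model
toy_tokenize = str.split

docs = {"short": "derrame pleural purulento", "long": "palabra " * 200}
one_frag_ids = {doc_id for doc_id, text in docs.items()
                if fits_in_window(text, toy_tokenize)}
```

With the real tokenizer, `tokenize` would be replaced by the tokenizer's own subword splitting, which generally produces more pieces than whitespace splitting.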

In [27]:
df_text_all_abs.head()

Unnamed: 0,doc_id,raw_text,punc_text
38,biblio-1000756,Este libro es el resultado de un trabajo minuc...,Este libro es el resultado de un trabajo minuc...
48,biblio-1002637,Se efectuó una revisión actualizada sobre el d...,Se efectuó una revisión actualizada sobre el d...
105,biblio-1005037,La microlitiasis testicular (TM) es una patolo...,La microlitiasis testicular TM es una patologí...
115,biblio-1005118,Se proponen algunas consideraciones teóricas s...,Se proponen algunas consideraciones teóricas s...
144,biblio-1005452,Objetivo: entender la relación entre la depres...,Objetivo entender la relación entre la depresi...


We also load the CIE-Diagnóstico codes from the previously loaded abstracts:

In [28]:
df_codes_d_all_abs = pd.read_table(abs_corpus_path + "train_dev_abstracts_codes.tsv", sep='\t', 
                                   header=None)

In [29]:
df_codes_d_all_abs.columns = ["doc_id", "code"]

In [30]:
df_codes_d_all_abs = df_codes_d_all_abs[df_codes_d_all_abs["doc_id"].isin(all_abs_doc_one_frag)]

In [31]:
df_codes_d_all_abs.shape

(18290, 2)

In [32]:
df_codes_d_all_abs.head()

Unnamed: 0,doc_id,code
1,lil-286177,i82.40
2,lil-286177,i82.90
10,lil-506160,q03.1
45,lil-176866,g51.0
46,lil-176866,r29.810


We concatenate the code dataframes of the CodiEsp training and development corpora with the code dataframe of the abstracts:

In [33]:
df_codes_train_dev_abs = pd.concat([df_codes_train, df_codes_dev, df_codes_d_all_abs])

In [34]:
df_codes_train_dev_abs.shape

(26606, 2)

In [35]:
df_codes_train_dev_abs.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000700014-1,n44.8
1,S0004-06142005000700014-1,z20.818
2,S0004-06142005000700014-1,r60.9
3,S0004-06142005000700014-1,r52
4,S0004-06142005000700014-1,a23.9


## Creating corpora of annotated sentences

Leveraging the information available for the named-entity-recognition and normalization (NER-N) CodiEsp-X task, we create both a training and a development corpus of sentences annotated with CIE-Diagnóstico codes.

First, we pre-process the NER-N diagnosis-code annotations available for both the training and development corpora.

In [36]:
# Training corpus

In [37]:
%%time

codiesp_x_train = pd.read_table(corpus_path + "train/trainX.tsv", sep='\t', header=None)

CPU times: user 7.46 ms, sys: 3.56 ms, total: 11 ms
Wall time: 10.4 ms


In [38]:
codiesp_x_train.columns = ["doc_id", "type", "code", "word", "location"]

In [39]:
codiesp_x_train.shape

(9181, 5)

In [40]:
codiesp_x_train.head()

Unnamed: 0,doc_id,type,code,word,location
0,S0004-06142005000700014-1,PROCEDIMIENTO,bw03zzz,Rx tórax,2163 2171
1,S0004-06142005000700014-1,PROCEDIMIENTO,3e02329,Estreptomicina intramuscular,2787 2801;2810 2823
2,S0004-06142005000700014-1,DIAGNOSTICO,n44.8,teste derecho aumentado de tamaño,1343 1376
3,S0004-06142005000700014-1,DIAGNOSTICO,z20.818,exposición a Brucella,594 615
4,S0004-06142005000700014-1,DIAGNOSTICO,r60.9,edemas,1250 1256


In [41]:
codiesp_x_train = codiesp_x_train[codiesp_x_train["type"] == "DIAGNOSTICO"]

In [42]:
codiesp_x_train.shape

(7209, 5)

In [43]:
df_codes_train_ner = process_ner_labels(codiesp_x_train).sort_values(["doc_id", "start", "end"])

In [44]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,type,code,word,start,end
3,S0004-06142005000700014-1,DIAGNOSTICO,r52,dolores,78,85
13,S0004-06142005000700014-1,DIAGNOSTICO,m25.50,dolores osteoarticulares,78,102
9,S0004-06142005000700014-1,DIAGNOSTICO,r50.9,fiebre,147,153
14,S0004-06142005000700014-1,DIAGNOSTICO,a23.9,brucella,360,368
10,S0004-06142005000700014-1,DIAGNOSTICO,r50.9,síndrome febril,534,549


In [45]:
df_codes_train_ner.shape

(8272, 6)

In [46]:
# Development corpus

In [47]:
%%time

codiesp_x_dev = pd.read_table(corpus_path + "dev/devX.tsv", sep='\t', header=None)

CPU times: user 0 ns, sys: 6.5 ms, total: 6.5 ms
Wall time: 6.15 ms


In [48]:
codiesp_x_dev.columns = ["doc_id", "type", "code", "word", "location"]

In [49]:
codiesp_x_dev.shape

(4477, 5)

In [50]:
codiesp_x_dev.head()

Unnamed: 0,doc_id,type,code,word,location
0,S0004-06142005000900016-1,PROCEDIMIENTO,bt41zzz,ecografía renal derecha,307 316;348 361
1,S0004-06142005000900016-1,PROCEDIMIENTO,ct13,gammagrafía renal,739 756
2,S0004-06142005000900016-1,DIAGNOSTICO,q62.11,estenosis en la unión pieloureteral derecha,540 583
3,S0004-06142005000900016-1,DIAGNOSTICO,n28.89,ectasia pielocalicial,326 347
4,S0004-06142005000900016-1,DIAGNOSTICO,n39.0,infecciones del tracto urinario,198 229


In [51]:
codiesp_x_dev = codiesp_x_dev[codiesp_x_dev["type"] == "DIAGNOSTICO"]

In [52]:
codiesp_x_dev.shape

(3431, 5)

In [53]:
df_codes_dev_ner = process_ner_labels(codiesp_x_dev).sort_values(["doc_id", "start", "end"])

In [54]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,type,code,word,start,end
11,S0004-06142005000900016-1,DIAGNOSTICO,k26.9,ulcus duodenal,37,51
14,S0004-06142005000900016-1,DIAGNOSTICO,k59.00,estreñimiento,54,67
4,S0004-06142005000900016-1,DIAGNOSTICO,n23,dolor en fosa renal,85,104
5,S0004-06142005000900016-1,DIAGNOSTICO,n28.0,crisis renoureteral,128,147
13,S0004-06142005000900016-1,DIAGNOSTICO,n20.0,nefrolitiasis,168,181


In [55]:
df_codes_dev_ner.shape

(3947, 6)
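
The `location` column stores character offsets as `start end` pairs, with `;` separating the parts of a discontinuous mention, and the processed dataframes contain more rows than the filtered ones. One plausible reading, sketched below, is that every span becomes its own row with integer `start` and `end` columns. This is an assumption about `process_ner_labels`, whose source is not shown here; `split_location` and `expand_annotations` are hypothetical names.

```python
def split_location(location):
    """Parse a CodiEsp-X location string into (start, end) integer pairs.

    Discontinuous mentions use ';' between spans, e.g. "2787 2801;2810 2823".
    """
    spans = []
    for span in location.split(";"):
        start, end = span.split()
        spans.append((int(start), int(end)))
    return spans


def expand_annotations(rows):
    """Emit one record per character span (hypothetical row layout)."""
    expanded = []
    for row in rows:
        for start, end in split_location(row["location"]):
            expanded.append({"doc_id": row["doc_id"], "code": row["code"],
                             "start": start, "end": end})
    # Same ordering as the sort_values call used above
    return sorted(expanded, key=lambda r: (r["doc_id"], r["start"], r["end"]))


rows = [
    {"doc_id": "S0004-06142005000900016-1", "code": "n28.89", "location": "326 347"},
    {"doc_id": "S0004-06142005000900016-1", "code": "q62.11", "location": "540 583"},
    {"doc_id": "S0004-06142005000900016-1", "code": "n39.0", "location": "198 229"},
]
expanded_rows = expand_annotations(rows)
```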

Now, using the character start-end positions of each sentence from the CodiEsp corpus (see `datasets/CodiEsp-Sentence-Split.ipynb`), we annotate the sentences with CIE-Diagnóstico codes. Using the XLM-R tokenizer, each sentence is also converted into a sequence of subwords, which are in turn converted into vocabulary indices (input IDs) and attention mask arrays (the XLM-R input tensors). We also generate a *fragments* dataset indicating the number of annotated sentences produced for each document.

In [56]:
# Sentence-Split information
ss_corpus_path = "../datasets/CodiEsp-SSplit-text/"
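
The sentence annotation step can be pictured as an interval-overlap test between sentence spans and mention spans. The sketch below assumes half-open character offsets and assigns a code to every sentence its mention overlaps. The actual rule implemented by `ss_create_frag_input_data_xlmr` is not shown in this notebook and may differ; `annotate_sentences` and the sentence spans are hypothetical, while the three annotations are taken from the training output above.

```python
def annotate_sentences(sentence_spans, annotations):
    """Assign each annotation's code to every sentence it overlaps.

    sentence_spans: list of (start, end) character offsets, end exclusive.
    annotations: list of (code, start, end) with the same offset convention.
    Returns one sorted list of codes per sentence.
    """
    labels = [set() for _ in sentence_spans]
    for code, ann_start, ann_end in annotations:
        for i, (sent_start, sent_end) in enumerate(sentence_spans):
            # Two half-open intervals overlap iff each starts before the other ends
            if ann_start < sent_end and sent_start < ann_end:
                labels[i].add(code)
    return [sorted(codes) for codes in labels]


sentence_spans = [(0, 120), (121, 300), (301, 420)]  # made-up sentence boundaries
annotations = [("r52", 78, 85), ("m25.50", 78, 102), ("r50.9", 147, 153)]
sentence_labels = annotate_sentences(sentence_spans, annotations)
```

Sentences without any overlapping mention end up with an empty label set, which matches the many `[()]` rows printed in the inspection cells.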

### Training corpus

In [57]:
label_list = list(df_codes_train_dev_abs["code"])

In [58]:
len(label_list)

26606

In [59]:
len(set(label_list))

2194

In [60]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb_encoder = MultiLabelBinarizer()
mlb_encoder.fit([label_list])

MultiLabelBinarizer()

In [61]:
# Number of distinct codes
num_labels = len(mlb_encoder.classes_)

In [62]:
num_labels

2194
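
Passing `[label_list]` to `fit` treats the whole list as the label set of a single sample, so the encoder simply learns the sorted unique codes. A pure-Python sketch of the resulting multi-hot encoding follows; `fit_classes` and `multi_hot` are illustrative names, not part of scikit-learn.

```python
def fit_classes(label_list):
    """Sorted unique labels, mirroring what MultiLabelBinarizer.classes_ holds."""
    return sorted(set(label_list))


def multi_hot(codes, classes):
    """Binary vector with a 1 at the position of every code in `codes`."""
    index = {label: i for i, label in enumerate(classes)}
    vector = [0] * len(classes)
    for code in codes:
        vector[index[code]] = 1
    return vector


label_list = ["n44.8", "r52", "a23.9", "r52", "r60.9"]  # toy subset with a duplicate
classes = fit_classes(label_list)
encoded = multi_hot(["r52", "a23.9"], classes)
```

In the notebook, each sentence is encoded this way into a vector of length `num_labels`, and a sentence without codes maps to the all-zero vector.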

Only training texts that are annotated with CIE-Diagnóstico codes are considered:

In [63]:
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner["doc_id"]))

0

In [64]:
train_doc_list = sorted(set(df_codes_train_ner["doc_id"]))

In [65]:
len(train_doc_list)

500

In [66]:
# Sentence-Split data

In [67]:
%%time
ss_sub_corpus_path = ss_corpus_path + "train/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 15.5 ms, sys: 0 ns, total: 15.5 ms
Wall time: 15.2 ms


In [68]:
%%time
train_ind, train_att, train_y, train_frag, train_start_end_frag = ss_create_frag_input_data_xlmr(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 500/500 [00:10<00:00, 45.59it/s]

CPU times: user 11.1 s, sys: 21.3 ms, total: 11.1 s
Wall time: 11.1 s





In [69]:
# Sanity check

In [70]:
train_ind.shape

(7741, 128)

In [71]:
train_att.shape

(7741, 128)

In [72]:
train_y.shape

(7741, 2194)

In [73]:
len(train_frag)

500

In [74]:
len(train_start_end_frag)

7741
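
Every row of `train_ind` and `train_att` has exactly `SEQ_LEN` positions. A minimal sketch of how one sentence could be turned into such a pair of rows is given below, assuming right-padding and the special-token IDs of the XLM-R vocabulary (`<s>` = 0, `<pad>` = 1, `</s>` = 2). The truncation of over-long sequences is a simplification of this sketch, since the helper used above may split them instead, and `build_inputs` is a hypothetical name.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # <s>, <pad>, </s> in the XLM-R vocabulary


def build_inputs(subtoken_ids, seq_len=128):
    """Wrap ids in <s> ... </s>, then right-pad to `seq_len`.

    Returns (input_ids, attention_mask); the mask is 1 on real tokens
    and 0 on padding. Over-long sequences are truncated here, which is
    a simplification of this sketch.
    """
    body = subtoken_ids[:seq_len - 2]
    input_ids = [BOS_ID] + body + [EOS_ID]
    n_pad = seq_len - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return input_ids + [PAD_ID] * n_pad, attention_mask


# The ids 101-103 are arbitrary placeholders, not real vocabulary entries
ids, mask = build_inputs([101, 102, 103], seq_len=8)
```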

In [75]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    500.000000
mean      15.482000
std        7.666762
min        2.000000
25%       10.000000
50%       14.000000
75%       19.000000
max       54.000000
dtype: float64

In [76]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [77]:
check_id

10

In [78]:
train_doc_list[check_id]

'S0004-06142006000500002-2'

In [79]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Paciente de 46 años que consultó por dolor a nivel de hipogastrio, eyaculación dolorosa, hemospermia y sensación de peso a nivel testicular atribuido hasta entonces a varicocele derecho ya conocido desde hacía un año. Entre sus antecedentes personales destacaba un episodio de prostatitis aguda un año antes de la consulta.\nA la exploración física el paciente presentaba buen estado general, varicocele derecho y no se palpaban masas a nivel de ambos testículos. El tacto rectal mostraba una próstata irregular, ligeramente aumentada de tamaño con zonas induradas y algo dolorosa a la exploración.\nSe solicitó ecografía urológica integral que mostró una imagen nodular hipoecoica en teste derecho con hipervascularización circundante y un quiste simple en testículo izquierdo.\nEn la TAC abdómino-pélvica se objetivaban a nivel hepático varias imágenes que podían corresponder con metástasis o hemangiomas. En RMN realizada posteriormente nos confirman que se trata de hemangiomas. La Rx de tórax 

In [80]:
check_id_frag = sum(train_frag[:check_id])
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([train_y[i]])), "\n")

[('i86.1', 'n53.12', 'r10.30', 'r36.1', 'r52')] 

[('n41.0',)] 

[('i86.1',)] 

[('r52',)] 

[('d40.11', 'n44.2')] 

[('d18.00',)] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[('d40.11',)] 

[()] 
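
The expression `sum(train_frag[:check_id])` works because the fragment arrays store the sentences of all documents back to back, in document order. The sketch below precomputes the row range of every document once with a running sum, which avoids recomputing the prefix sum for each lookup; `fragment_ranges` is an illustrative name.

```python
from itertools import accumulate


def fragment_ranges(n_frags):
    """Map each document to its half-open row range in the fragment arrays.

    n_frags[i] is the number of fragments of document i, stored in the
    same document order as the fragment arrays.
    """
    ends = list(accumulate(n_frags))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


ranges = fragment_ranges([3, 1, 4])
```

For a document at position `i`, `ranges[i]` gives the same start as `sum(n_frags[:i])` and an end that is `n_frags[i]` rows later.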



In [81]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i])))
    print("\n")

[('▁Pacient', (0, 7)), ('e', (7, 8)), ('▁de', (8, 11)), ('▁46', (11, 14)), ('▁años', (14, 20)), ('▁que', (20, 24)), ('▁consult', (24, 32)), ('ó', (32, 34)), ('▁por', (34, 38)), ('▁dolor', (38, 44)), ('▁a', (44, 46)), ('▁nivel', (46, 52)), ('▁de', (52, 55)), ('▁hipo', (55, 60)), ('gast', (60, 64)), ('rio', (64, 67)), (',', (67, 68)), ('▁e', (68, 70)), ('ya', (70, 72)), ('cula', (72, 76)), ('ción', (76, 81)), ('▁dolor', (81, 87)), ('osa', (87, 90)), (',', (90, 91)), ('▁hemos', (91, 97)), ('per', (97, 100)), ('mia', (100, 103)), ('▁y', (103, 105)), ('▁sensación', (105, 116)), ('▁de', (116, 119)), ('▁peso', (119, 124)), ('▁a', (124, 126)), ('▁nivel', (126, 132)), ('▁testi', (132, 138)), ('cular', (138, 143)), ('▁atribui', (143, 151)), ('do', (151, 153)), ('▁hasta', (153, 159)), ('▁entonces', (159, 168)), ('▁a', (168, 170)), ('▁var', (170, 174)), ('ico', (174, 177)), ('cele', (177, 181)), ('▁derecho', (181, 189)), ('▁ya', (189, 192)), ('▁conocido', (192, 201)), ('▁desde', (201, 207)), ('▁ha

In [82]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Pacient e ▁de ▁46 ▁años ▁que ▁consult ó ▁por ▁dolor ▁a ▁nivel ▁de ▁hipo gast rio , ▁e ya cula ción ▁dolor osa , ▁hemos per mia ▁y ▁sensación ▁de ▁peso ▁a ▁nivel ▁testi cular ▁atribui do ▁hasta ▁entonces ▁a ▁var ico cele ▁derecho ▁ya ▁conocido ▁desde ▁hacía ▁un ▁año . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁Entre ▁sus ▁antecede ntes ▁personales ▁destaca ba ▁un ▁episodio ▁de ▁prostat itis ▁a guda ▁un ▁año ▁antes ▁de ▁la ▁consulta . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <p

In [83]:
# Fragment labels distribution
pd.Series(np.sum(train_y, axis=1)).describe()

count    7741.000000
mean        0.930242
std         1.344062
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        14.000000
dtype: float64

### Development corpus

Only development texts that are annotated with CIE-Diagnóstico codes are considered:

In [84]:
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner["doc_id"]))

0

In [85]:
dev_doc_list = sorted(set(df_codes_dev_ner["doc_id"]))

In [86]:
len(dev_doc_list)

250

In [87]:
%%time
ss_sub_corpus_path = ss_corpus_path + "dev/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 82 µs, sys: 8.05 ms, total: 8.13 ms
Wall time: 7.75 ms


In [88]:
%%time
dev_ind, dev_att, dev_y, dev_frag, dev_start_end_frag = ss_create_frag_input_data_xlmr(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 250/250 [00:05<00:00, 46.47it/s]

CPU times: user 5.45 s, sys: 11.2 ms, total: 5.46 s
Wall time: 5.44 s





In [89]:
# Sanity check

In [90]:
dev_ind.shape

(4109, 128)

In [91]:
dev_att.shape

(4109, 128)

In [92]:
dev_y.shape

(4109, 2194)

In [93]:
len(dev_frag)

250

In [94]:
len(dev_start_end_frag)

4109

In [95]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      16.436000
std        8.415771
min        4.000000
25%       11.000000
50%       15.000000
75%       20.750000
max       65.000000
dtype: float64

In [96]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [97]:
check_id

190

In [98]:
dev_doc_list[check_id]

'S1134-80462015000200004-1'

In [99]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Varón de 42 años con antecedentes personales de alcoholismo, abuso de cocaína y fractura de meseta tibial derecha en enero de 2011 que se trató quirúrgicamente (osteosíntesis-placa con injerto cresta iliaca). Se produce una infección de herida quirúrgica que acaba generando una osteomielitis tibial derecha a los 4 meses, requiriendo varias intervenciones quirúrgicas para extracción de material de osteosíntesis, desbridamiento y limpieza.\nEl paciente está ingresado desde el 4.o mes y en tratamiento con antibioterapia endovenosa de última generación que se administra por una vía central, y vía oral con 2 comprimidos de distraneurine y bromazepam/12 h desde su ingreso (para prevenir un delirium por deprivación alcohólica); como analgesia: pregabalina 75-0-150, paracetamol 1 g/8 h endovenosa, metamizol 2 g/8 h endovenosa, tramadol 100 mg/8 h endovenosa, Enantyun® 25/8 h vía oral y morfina 5 mg/6 h endovenosa (un total de 20 mg endovenosos al día).\nAl 8.o mes de evolución del proceso int

In [100]:
check_id_frag = sum(dev_frag[:check_id])
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([dev_y[i]])), "\n")

[('f10.20', 'f14.10', 's82.101')] 

[('b99.9', 'm86.161', 'm86.9', 't81.31x')] 

[('f10.921',)] 

[()] 

[('r52',)] 

[('f32.9', 'r52')] 

[('r50.9',)] 

[('r52',)] 

[()] 

[()] 

[()] 

[('r40.1',)] 

[()] 

[('f11.93', 'r45.1')] 

[()] 

[('f05', 'f11.93')] 

[()] 

[('r45.1',)] 

[()] 

[('r52',)] 

[()] 

[()] 

[('f32.9', 'r40.1', 'r45.1')] 

[()] 

[()] 

[('g06.0', 'i38', 'm86.9', 'r50.9')] 

[('f05',)] 

[('i63.9',)] 



In [101]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i])))
    print("\n")

[('▁Var', (0, 3)), ('ón', (3, 6)), ('▁de', (6, 9)), ('▁42', (9, 12)), ('▁años', (12, 18)), ('▁con', (18, 22)), ('▁antecede', (22, 31)), ('ntes', (31, 35)), ('▁personales', (35, 46)), ('▁de', (46, 49)), ('▁alcohol', (49, 57)), ('ismo', (57, 61)), (',', (61, 62)), ('▁abuso', (62, 68)), ('▁de', (68, 71)), ('▁coca', (71, 76)), ('ína', (76, 80)), ('▁y', (80, 82)), ('▁fra', (82, 86)), ('ctura', (86, 91)), ('▁de', (91, 94)), ('▁mese', (94, 99)), ('ta', (99, 101)), ('▁tibi', (101, 106)), ('al', (106, 108)), ('▁derecha', (108, 116)), ('▁en', (116, 119)), ('▁enero', (119, 125)), ('▁de', (125, 128)), ('▁2011', (128, 133)), ('▁que', (133, 137)), ('▁se', (137, 140)), ('▁tra', (140, 144)), ('tó', (144, 147)), ('▁qui', (147, 151)), ('rú', (151, 154)), ('rg', (154, 156)), ('icamente', (156, 164)), ('▁(', (164, 166)), ('os', (166, 168)), ('te', (168, 170)), ('os', (170, 172)), ('í', (172, 174)), ('ntes', (174, 178)), ('is', (178, 180)), ('-', (180, 181)), ('plac', (181, 185)), ('a', (185, 186)), ('▁con

In [102]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Var ón ▁de ▁42 ▁años ▁con ▁antecede ntes ▁personales ▁de ▁alcohol ismo , ▁abuso ▁de ▁coca ína ▁y ▁fra ctura ▁de ▁mese ta ▁tibi al ▁derecha ▁en ▁enero ▁de ▁2011 ▁que ▁se ▁tra tó ▁qui rú rg icamente ▁( os te os í ntes is - plac a ▁con ▁in jer to ▁cre sta ▁ili aca ). </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁Se ▁produce ▁una ▁infección ▁de ▁her ida ▁qui rú r gica ▁que ▁acaba ▁genera ndo ▁una ▁osteo miel itis ▁tibi al ▁derecha ▁a ▁los ▁4 ▁meses , ▁requiri endo ▁varias ▁interven ciones ▁qui rú rg icas ▁para ▁extra cción ▁de ▁material ▁de ▁osteo sí ntes is , ▁des brid amiento ▁y ▁limpieza . </s> <pad> <pad> <p

In [103]:
# Fragment labels distribution
pd.Series(np.sum(dev_y, axis=1)).describe()

count    4109.000000
mean        0.832806
std         1.250330
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        14.000000
dtype: float64

### Train-Dev abstracts corpus

In [104]:
len(set(df_text_all_abs["doc_id"]) - set(df_codes_d_all_abs["doc_id"]))

0

In [105]:
all_abs_doc_list = sorted(set(df_codes_d_all_abs["doc_id"]))

In [106]:
len(all_abs_doc_list)

13339

In [107]:
%%time
all_abs_ind, all_abs_att, all_abs_y, all_abs_frag = create_frag_input_data_xlmr(df_text=df_text_all_abs, text_col=text_col, 
                                     df_label=df_codes_d_all_abs, doc_list=all_abs_doc_list, tokenizer=tokenizer, sp_pb2=spt, 
                                     lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 13339/13339 [00:27<00:00, 484.57it/s]


CPU times: user 27.8 s, sys: 137 ms, total: 28 s
Wall time: 27.8 s


In [108]:
all_abs_ind.shape

(13339, 128)

In [109]:
all_abs_att.shape

(13339, 128)

In [110]:
all_abs_y.shape

(13339, 2194)

In [111]:
len(all_abs_frag)

13339

In [112]:
# Check n_frag distribution across texts
pd.Series(all_abs_frag).describe()

count    13339.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
dtype: float64

In [113]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(all_abs_doc_list), size=1)[0]

In [114]:
check_id

5925

In [115]:
all_abs_doc_list[check_id]

'lil-249367'

In [116]:
df_text_all_abs[df_text_all_abs["doc_id"] == all_abs_doc_list[check_id]][text_col].values[0]

'Describe sobre el derrame pleural purulento, señala el clasico razonamiento clinico frente al estudio de RX y el tratamiento de de la pleuresia purulenta'

In [117]:
check_id_frag = sum(all_abs_frag[:check_id])
for i in range(check_id_frag, check_id_frag + all_abs_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([all_abs_y[i]])), "\n")

[('j90',)] 



In [118]:
check_id_frag = sum(all_abs_frag[:check_id])
for frag in all_abs_ind[check_id_frag:check_id_frag + all_abs_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁De scribe ▁sobre ▁el ▁der ram e ▁ple ural ▁pur ul ento , ▁señala ▁el ▁clasic o ▁raz on amiento ▁clinic o ▁frente ▁al ▁estudio ▁de ▁R X ▁y ▁el ▁tratamiento ▁de ▁de ▁la ▁ple ures ia ▁pur ul enta </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 



In [119]:
# Fragment labels distribution
pd.Series(np.sum(all_abs_y, axis=1)).describe()

count    13339.000000
mean         1.371167
std          0.733460
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max         10.000000
dtype: float64

### Training & Development & Abstracts corpus

We merge the previously generated datasets:

In [120]:
# Indices
train_dev_abs_ind = np.concatenate((train_ind, dev_ind, all_abs_ind))

In [121]:
train_dev_abs_ind.shape

(25189, 128)

In [122]:
# Attention masks
train_dev_abs_att = np.concatenate((train_att, dev_att, all_abs_att))

In [123]:
train_dev_abs_att.shape

(25189, 128)

In [124]:
# y
train_dev_abs_y = np.concatenate((train_y, dev_y, all_abs_y))

In [125]:
train_dev_abs_y.shape

(25189, 2194)

## Fine-tuning

Using the corpus of labeled sentences, we fine-tune the model on a multi-label sentence classification task.

In [126]:
from transformers import TFXLMRobertaForSequenceClassification

model = TFXLMRobertaForSequenceClassification.from_pretrained(model_name, from_pt=True)

All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [127]:
model.summary()

Model: "tfxlm_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  277453056 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 278,045,186
Trainable params: 278,045,186
Non-trainable params: 0
_________________________________________________________________


In [128]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')
attention_mask = Input(shape=(SEQ_LEN,), name='attention_mask', dtype='int64')
inputs = [input_ids, attention_mask]

cls_token = model.layers[0](input_ids=inputs[0], attention_mask=inputs[1])[0][:, 0, :] # take <s> token output representation (equiv. to [CLS]) 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(cls_token) # Multi-label classification
out_act = Activation('sigmoid')(out_logits)

model = Model(inputs=[input_ids, attention_mask], outputs=out_act)
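
The head defined above applies a sigmoid to each of the `num_labels` logits, so every code is scored independently and several codes can be active for the same sentence (unlike a softmax, the probabilities need not sum to 1). The pure-Python sketch below illustrates this together with the binary cross-entropy that such a head is typically trained with; the function names are illustrative and the logits are made up.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one sample."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -1.0, 0.0]
probs = [sigmoid(z) for z in logits]  # each label is scored independently
loss = binary_cross_entropy([1, 0, 0], probs)
```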

In [129]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
roberta (TFRobertaMainLayer)    TFBaseModelOutputWit 277453056   input_ids[0][0]                  
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 768)]        0           roberta[0][0]                    
______________________________________________________________________________________________

In [130]:
model.input

[<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>,
 <tf.Tensor 'attention_mask:0' shape=(None, 128) dtype=int64>]

In [131]:
model.output

<tf.Tensor 'activation_4/Identity:0' shape=(None, 2194) dtype=float32>

In [132]:
# Sample weights

In [133]:
n_train_dev_frags = sum(train_frag) + sum(dev_frag)

In [134]:
n_train_dev_frags

11850

In [135]:
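# The first n_train_dev_frags rows (CodiEsp train + dev fragments) get train_weight;
# the remaining rows get all_abs_weight
n_abs_frags = train_dev_abs_y.shape[0] - n_train_dev_frags
train_dev_abs_weights = np.array([train_weight] * n_train_dev_frags + [all_abs_weight] * n_abs_frags)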
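
The weighting scheme above can be illustrated on a toy example. This is only a sketch: the names `toy_train_weight`, `toy_abs_weight`, `n_corpus` and `n_total` are hypothetical stand-ins for `train_weight`, `all_abs_weight`, `n_train_dev_frags` and `train_dev_abs_y.shape[0]`, and it assumes, as the cell above does, that the corpus fragments come first in the array.

```python
import numpy as np

# Hypothetical toy values mirroring train_weight and all_abs_weight
toy_train_weight = 4
toy_abs_weight = 1

n_corpus = 3  # stand-in for n_train_dev_frags
n_total = 5   # stand-in for train_dev_abs_y.shape[0]

# One weight per fragment: corpus fragments first, then the remaining ones
toy_weights = np.concatenate([
    np.full(n_corpus, toy_train_weight),
    np.full(n_total - n_corpus, toy_abs_weight),
])
print(toy_weights)
```

When passed as `sample_weight` to `model.fit`, each fragment's loss contribution is scaled by its weight, so here a corpus fragment would count four times as much as any other fragment.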

In [None]:
%%time
from tensorflow.keras import optimizers, losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = losses.BinaryCrossentropy(from_logits=False)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_abs_ind, 'attention_mask': train_dev_abs_att}, y=train_dev_abs_y,
                    batch_size=BATCH_SIZE, epochs=EPOCHS, sample_weight=train_dev_abs_weights, shuffle=True)

Epoch 1/37
Epoch 2/37
Epoch 3/37

## Test set predictions

Finally, the model's predictions on the test set are saved. To do so, each sentence of the test corpus is first converted into a sequence of subwords (input IDs and attention mask arrays). The sentence-level predictions are then saved so that they can be evaluated at document level (see `results/CodiEsp-D/Evaluation.ipynb`).
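
As a rough illustration of how sentence-level predictions could later be combined into document-level ones, the sketch below max-pools the fragment probabilities of each document. It is an assumption, not necessarily the procedure used in `Evaluation.ipynb`: the helper `aggregate_max` and the toy arrays are hypothetical, and `toy_frag` is assumed to hold the number of fragments per document, in the same way `test_frag` appears to.

```python
import numpy as np

def aggregate_max(frag_preds, n_frags_per_doc):
    """Max-pool fragment-level probabilities into one row per document (hypothetical helper)."""
    doc_preds = []
    start = 0
    for n in n_frags_per_doc:
        # Rows start .. start + n - 1 belong to the same document
        doc_preds.append(frag_preds[start:start + n].max(axis=0))
        start += n
    return np.vstack(doc_preds)

# Toy data: 5 fragments, 3 labels; two documents with 2 and 3 fragments
toy_preds = np.array([[0.1, 0.9, 0.2],
                      [0.4, 0.3, 0.1],
                      [0.8, 0.1, 0.1],
                      [0.2, 0.2, 0.6],
                      [0.1, 0.7, 0.3]])
toy_frag = [2, 3]

doc_level = aggregate_max(toy_preds, toy_frag)
print(doc_level.shape)
```

Max-pooling treats a code as present in a document if any of its sentences supports it strongly. Other aggregation rules, such as the mean, are equally possible.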

In [136]:
%%time
test_path = corpus_path + "test/text_files/"
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f)]
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 0 ns, sys: 18.5 ms, total: 18.5 ms
Wall time: 94 ms


In [137]:
df_text_test.shape

(250, 2)

In [138]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,S0365-66912007000900014-1,Paciente varón de 34 años de edad diagnosticad...
1,S0211-69952014000200012-1,"Un varón de 48 años, de raza caucásica, con IR..."
2,S1139-76322017000200009-1,Presentamos el caso clínico de un niño de cinc...
3,S0210-48062010000100019-1,"Paciente varón de 53 años, diagnosticado de es..."
4,S1130-14732005000500006-1,Se trata de un varón de 20 años diagnosticado ...


In [139]:
df_text_test.raw_text[0]

'Paciente varón de 34 años de edad diagnosticado de varicela tres semanas antes ya resuelta sin complicaciones. Acude a urgencias por presentar disminución de agudeza visual en su ojo izquierdo.\nEn la exploración oftalmológica presenta una agudeza visual corregida de 1 en el ojo derecho (OD) y de 0,6 en el ojo izquierdo (OI). El estudio con lámpara de hendidura demuestra en el OI un tyndall celular de 4+, precipitados queráticos inferiores (3+) y sin presentar la cornea tinción con fluoresceína, siendo normal el OD. La presión intraocular fue de 16mmHg en ambos ojos.\nEn la exploración fundoscópica inicial del OI se aprecia leve vitritis (1+) sin focos de retinitis.\nSe instaura tratamiento tópico con corticoides y midriáticos. A los 2 días se observa leve disminución del tyndall celular (3+) en cámara anterior pero en fondo de ojo aparece un foco periférico de retinitis necrotizante en el área temporal asociado a vasculitis retiniana.\nSe ingresa al paciente y se instaura tratamiento

In [140]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [141]:
len(test_doc_list)

250

In [142]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 45.2 ms, sys: 12.1 ms, total: 57.3 ms
Wall time: 205 ms


In [143]:
%%time
# Labels are ignored at test time, so df_codes_train_ner is passed as a placeholder for df_ann
test_ind, test_att, _, test_frag, _ = ss_create_frag_input_data_xlmr(
    df_text=df_text_test, text_col=text_col,
    df_ann=df_codes_train_ner, doc_list=test_doc_list, ss_dict=ss_dict_test,
    tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 250/250 [00:01<00:00, 248.37it/s]


CPU times: user 1.08 s, sys: 8.49 ms, total: 1.09 s
Wall time: 1.07 s


In [113]:
%%time
test_preds = model.predict({'input_ids': test_ind, 'attention_mask': test_att})

CPU times: user 14.9 s, sys: 1.68 s, total: 16.6 s
Wall time: 16.9 s


In [114]:
test_preds.shape

(3950, 727)

In [144]:
results_dir_path = "../results/CodiEsp-D/"

In [178]:
%%time
np.save(file=results_dir_path + "predictions/xlm_r_seed_" + str(random_seed) + "_test_preds.npy", arr=test_preds)

CPU times: user 2.21 ms, sys: 4.65 ms, total: 6.87 ms
Wall time: 6.02 ms


In [145]:
# To be further used when evaluating model performance at document level
np.save(file=results_dir_path + "xlm_r_test_frags.npy", arr=test_frag)
np.save(file=results_dir_path + "classes.npy", arr=mlb_encoder.classes_)
np.save(file=results_dir_path + "test_docs.npy", arr=test_doc_list)
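
When these arrays are read back for evaluation, one detail may matter: if `mlb_encoder.classes_` has `object` dtype (which is typically the case for string labels, but is an assumption here), NumPy stores it via pickling and `np.load` needs `allow_pickle=True`. A minimal round-trip sketch with a hypothetical toy array and a temporary directory:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for mlb_encoder.classes_ (object-dtype array of code strings)
toy_classes = np.array(["a00", "b01"], dtype=object)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classes.npy")
    np.save(file=path, arr=toy_classes)
    # Object arrays are pickled on save, so loading must opt in explicitly
    loaded = np.load(path, allow_pickle=True)

print(loaded)
```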