# Fine-tuning mBERT on CodiEsp-P

In this notebook, the mBERT model is fine-tuned on both the training and development sets of the CodiEsp-P corpus, following a multi-label sequence classification approach. The predictions made by the model on the test set are also saved, so that its clinical coding performance can be evaluated further (see `results/CodiEsp-P/Evaluation.ipynb`).

In [1]:
import tensorflow as tf

# Auxiliary components
import sys
sys.path.append("..")
from nlp_utils import *

# mBERT tokenizer
from keras_bert import load_vocabulary, Tokenizer
model_path = "multi_cased_L-12_H-768_A-12/"
config_path = model_path + "bert_config.json"
checkpoint_path = model_path + "bert_model.ckpt"
vocab_file = "vocab.txt"
tokenizer = Tokenizer(token_dict=load_vocabulary(model_path + vocab_file), pad_index=0, cased=True)

# Hyper-parameters
text_col = "raw_text"
training = False
trainable = True
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 30
LR = 3e-5

random_seed = 0
tf.set_random_seed(random_seed)

Using TensorFlow backend.


## Load text

First, all text files from the training and development CodiEsp corpora are loaded into separate dataframes.

The CIE-Procedimiento codes are loaded as well.
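
The helper `load_text_files` comes from `nlp_utils` and is not shown in this notebook. As a rough idea of what such a loader might look like, the sketch below reads a list of files from a directory and derives document IDs from the file names. The function name `read_text_files`, the UTF-8 encoding, and the temporary demo files are assumptions made for illustration, not the actual implementation.

```python
import os
import tempfile

def read_text_files(file_names, dir_path, encoding="utf-8"):
    """Hypothetical stand-in for nlp_utils.load_text_files: one string per file."""
    texts = []
    for name in file_names:
        with open(os.path.join(dir_path, name), encoding=encoding) as f:
            texts.append(f.read())
    return texts

# Demo on two throwaway files (names and contents are made up)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, content in [("doc-1.txt", "Varón de 72 años."), ("doc-2.txt", "Mujer de 47 años.")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(content)
    # Sort the listing so that the row order is reproducible
    demo_files = sorted(f for f in os.listdir(tmp_dir) if os.path.isfile(os.path.join(tmp_dir, f)))
    demo_texts = read_text_files(demo_files, tmp_dir)
    demo_ids = [os.path.splitext(f)[0] for f in demo_files]
```

Sorting the file list is optional, but `os.listdir` returns entries in arbitrary order, so sorting keeps the resulting dataframe rows stable across runs.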

In [2]:
corpus_path = "../datasets/codiesp_v4/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train/text_files/"
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f)]
train_data = load_text_files(train_files, train_path)
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 12.1 ms, sys: 17 µs, total: 12.1 ms
Wall time: 11.4 ms


In [4]:
df_text_train.shape

(500, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,S0365-66912006000600011-1,Un varón de 13 años es remitido para valoració...
1,S1139-13752009000200010-2,Paciente de 42 años diagnosticado de pancoliti...
2,S1130-05582017000100037-1,"Varón de 72 años, sin antecedentes médicos de ..."
3,S1139-76322016000300015-1,Lactante de ocho días cuyos padres consultan p...
4,S0211-69952011000100019-1,Mujer de 47 años de edad con antecedentes de g...


In [6]:
df_text_train.raw_text[0]

'Un varón de 13 años es remitido para valoración oftalmológica por mala visión. Fenotípicamente era un niño de talla corta con una estatura de 133 cm, braquimorfia y braquidactilia en las cuatro extremidades.\nEl paciente presentaba un error refractivo corregido de -13,00 -6,50 a 1º en el ojo derecho y de -16,00-6,25 a 179º en el izquierdo. Con dicha corrección alcanzaba una agudeza visual de 0,4 y 0,2 respectivamente. No existía diplopía monocular ni hallazgos en la motilidad ocular extrínseca e intrínseca.\nEl diámetro corneal horizontal era de 12,0 mm en ambos ojos y la paquimetría de 613 y 611 micras respectivamente. La cámara anterior era estrecha, apreciándose iridofacodonesis bilateral. Se evidenció microesferofaquia con desplazamiento anterior de ambos cristalinos dentro de la cámara posterior.\nLa presión intraocular era de 20 mmHg bilateralmente. Gonioscópicamente se apreció un ángulo estrecho simétrico en ambos ojos grado II según Schaffer.\nLa exploración mediante topógrafo

We also load the CIE-Procedimiento codes table:

In [7]:
df_codes_train = pd.read_table(corpus_path + "train/trainP.tsv", sep='\t', header=None)

In [8]:
df_codes_train.columns = ["doc_id", "code"]

In [9]:
df_codes_train.shape

(1550, 2)

In [10]:
df_codes_train.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


In [11]:
len(set(df_codes_train["doc_id"]))

435

### Development corpus

In [12]:
%%time
dev_path = corpus_path + "dev/text_files/"
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f)]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 3.67 ms, sys: 3.97 ms, total: 7.64 ms
Wall time: 6.95 ms


In [13]:
df_text_dev.shape

(250, 2)

In [14]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,S1130-63432016000600013-1,"Varón de 75 años, con antecedentes de hiperuri..."
1,S0365-66912003000600010-1,"Paciente de 33 años de edad, gestante de 34 se..."
2,S0211-69952012000200030-1,Mujer de 67 años con múltiples factores de rie...
3,S0365-66912004000900009-1,Paciente de 55 años que acudió a urgencias por...
4,S1139-76322016000300016-2,"Lactante de 1 mes y 29 días, sin antecedentes ..."


In [15]:
df_text_dev.raw_text[0]

'Varón de 75 años, con antecedentes de hiperuricemia en tratamiento con Alopurinol que ingresa para realización de resección transuretral de próstata.\nPostoperatorio inmediato sin incidencias con tratamiento con Pantoprazol, Ciprofloxacino, Paracetamol, Enantyum y Alopurinol. Al cuarto día de postoperatorio presenta mareos, temblor con componente mioclónico en extremidades y tronco e incapacidad para caminar, sin verse alteraciones analíticas. En esta situación se pauta Rivotril y se suspende el tratamiento con Ciprofloxacino, desapareciendo la clínica mioclónica y mejorando el estado del paciente, por lo que se decide el alta hospitalaria.\n\n'

We also load the CIE-Procedimiento codes table:

In [16]:
df_codes_dev = pd.read_table(corpus_path + "dev/devP.tsv", sep='\t', header=None)

In [17]:
df_codes_dev.columns = ["doc_id", "code"]

In [18]:
df_codes_dev.shape

(817, 2)

In [19]:
df_codes_dev.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000900016-1,bt41zzz
1,S0004-06142005000900016-1,ct13
2,S0004-06142005001000011-1,3e1m39z
3,S0004-06142005001000011-1,0tcb
4,S0004-06142005001000011-1,bt02


In [20]:
len(set(df_codes_dev["doc_id"]))

222

We join the training and development CodiEsp codes dataframes together:

In [21]:
df_codes_train_dev = pd.concat([df_codes_train, df_codes_dev])

In [22]:
df_codes_train_dev.shape

(2367, 2)

In [23]:
df_codes_train_dev.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


## Creating corpora of annotated sentences

Leveraging the information available for the named-entity-recognition and normalization (NER-N) CodiEsp-X task, we create both a training and a development corpus of annotated sentences with CIE-Procedimiento codes.

First, we pre-process the NER-N procedure-code annotations available for both the training and development corpora.

In [24]:
# Training corpus

In [24]:
%%time

codiesp_x_train = pd.read_table(corpus_path + "train/trainX.tsv", sep='\t', header=None)

CPU times: user 13.8 ms, sys: 191 µs, total: 14 ms
Wall time: 12.5 ms


In [25]:
codiesp_x_train.columns = ["doc_id", "type", "code", "word", "location"]

In [26]:
codiesp_x_train.shape

(9181, 5)

In [27]:
codiesp_x_train.head()

Unnamed: 0,doc_id,type,code,word,location
0,S0004-06142005000700014-1,PROCEDIMIENTO,bw03zzz,Rx tórax,2163 2171
1,S0004-06142005000700014-1,PROCEDIMIENTO,3e02329,Estreptomicina intramuscular,2787 2801;2810 2823
2,S0004-06142005000700014-1,DIAGNOSTICO,n44.8,teste derecho aumentado de tamaño,1343 1376
3,S0004-06142005000700014-1,DIAGNOSTICO,z20.818,exposición a Brucella,594 615
4,S0004-06142005000700014-1,DIAGNOSTICO,r60.9,edemas,1250 1256


In [28]:
codiesp_x_train = codiesp_x_train[codiesp_x_train["type"] == "PROCEDIMIENTO"]

In [29]:
codiesp_x_train.shape

(1972, 5)

In [30]:
df_codes_train_ner = process_ner_labels(codiesp_x_train).sort_values(["doc_id", "start", "end"])

In [31]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,type,code,word,start,end
0,S0004-06142005000700014-1,PROCEDIMIENTO,bw03zzz,Rx tórax,2163,2171
3,S0004-06142005000700014-1,PROCEDIMIENTO,bw40zzz,Ecografía abdominal,2173,2192
5,S0004-06142005000700014-1,PROCEDIMIENTO,bn20,TAC craneal,2194,2205
4,S0004-06142005000700014-1,PROCEDIMIENTO,bv44zzz,Ecografía testicular,2287,2307
1,S0004-06142005000700014-1,PROCEDIMIENTO,3e02329,Estreptomicina intramuscular,2787,2801


In [32]:
df_codes_train_ner.shape

(2769, 6)
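
Judging from the tables above, `process_ner_labels` appears to replace the `location` column with integer `start` and `end` columns, producing one row per contiguous span when an annotation is discontinuous (spans separated by `;`). That would explain the growth from 1972 to 2769 rows. The helper itself lives in `nlp_utils` and is not shown, so the following is only a dependency-free sketch of that behavior; `parse_location`, `explode_locations` and the dictionary-based rows are hypothetical.

```python
def parse_location(location):
    """Parse '2787 2801;2810 2823' into [(2787, 2801), (2810, 2823)]."""
    spans = []
    for chunk in location.split(";"):
        start, end = chunk.split()
        spans.append((int(start), int(end)))
    return spans

def explode_locations(rows):
    """Hypothetical equivalent of process_ner_labels on a list of dict rows."""
    out = []
    for row in rows:
        for start, end in parse_location(row["location"]):
            new_row = {k: v for k, v in row.items() if k != "location"}
            new_row.update(start=start, end=end)
            out.append(new_row)
    # Same ordering as in the notebook: by document, then by position
    return sorted(out, key=lambda r: (r["doc_id"], r["start"], r["end"]))

rows = [
    {"doc_id": "S0004-06142005000700014-1", "code": "3e02329",
     "word": "Estreptomicina intramuscular", "location": "2787 2801;2810 2823"},
    {"doc_id": "S0004-06142005000700014-1", "code": "bw03zzz",
     "word": "Rx tórax", "location": "2163 2171"},
]
spans = explode_locations(rows)
```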

In [34]:
# Development corpus

In [33]:
%%time

codiesp_x_dev = pd.read_table(corpus_path + "dev/devX.tsv", sep='\t', header=None)

CPU times: user 7.39 ms, sys: 3.57 ms, total: 11 ms
Wall time: 8.47 ms


In [34]:
codiesp_x_dev.columns = ["doc_id", "type", "code", "word", "location"]

In [35]:
codiesp_x_dev.shape

(4477, 5)

In [36]:
codiesp_x_dev.head()

Unnamed: 0,doc_id,type,code,word,location
0,S0004-06142005000900016-1,PROCEDIMIENTO,bt41zzz,ecografía renal derecha,307 316;348 361
1,S0004-06142005000900016-1,PROCEDIMIENTO,ct13,gammagrafía renal,739 756
2,S0004-06142005000900016-1,DIAGNOSTICO,q62.11,estenosis en la unión pieloureteral derecha,540 583
3,S0004-06142005000900016-1,DIAGNOSTICO,n28.89,ectasia pielocalicial,326 347
4,S0004-06142005000900016-1,DIAGNOSTICO,n39.0,infecciones del tracto urinario,198 229


In [37]:
codiesp_x_dev = codiesp_x_dev[codiesp_x_dev["type"] == "PROCEDIMIENTO"]

In [38]:
codiesp_x_dev.shape

(1046, 5)

In [39]:
df_codes_dev_ner = process_ner_labels(codiesp_x_dev).sort_values(["doc_id", "start", "end"])

In [40]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,type,code,word,start,end
0,S0004-06142005000900016-1,PROCEDIMIENTO,bt41zzz,ecografía renal derecha,307,316
1,S0004-06142005000900016-1,PROCEDIMIENTO,bt41zzz,ecografía renal derecha,348,361
2,S0004-06142005000900016-1,PROCEDIMIENTO,ct13,gammagrafía renal,739,756
3,S0004-06142005001000011-1,PROCEDIMIENTO,3e1m39z,diálisis peritoneal,95,114
7,S0004-06142005001000011-1,PROCEDIMIENTO,0270,angioplastia transluminal de la coronaria derecha,424,473


In [41]:
df_codes_dev_ner.shape

(1540, 6)

Now, using the character start-end positions of each sentence in the CodiEsp corpus (see `datasets/CodiEsp-Sentence-Split.ipynb`), we annotate the sentences with CIE-Procedimiento codes. Using the mBERT tokenizer, each sentence is also converted into a sequence of subwords, which are then converted into vocabulary indices (input IDs) and segment arrays (the BERT input tensors). We also generate a *fragments* dataset recording the number of annotated sentences produced for each document.
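
The sentence-level annotation step can be pictured as matching annotation spans against sentence spans. The sketch below assigns a code to every sentence whose character range overlaps the annotation span. This overlap rule, the function name `label_sentences`, and the toy offsets are assumptions, since the actual logic lives in `ss_create_frag_input_data_bert` (including its `greedy` option) in `nlp_utils`.

```python
def label_sentences(sent_spans, annotations):
    """Return, for each sentence span, the sorted list of codes whose span overlaps it.

    sent_spans: list of (start, end) character offsets, end exclusive.
    annotations: list of (start, end, code) tuples.
    """
    labels = []
    for s_start, s_end in sent_spans:
        codes = {code for a_start, a_end, code in annotations
                 if a_start < s_end and a_end > s_start}
        labels.append(sorted(codes))
    return labels

# Made-up offsets for a three-sentence document
sent_spans = [(0, 142), (143, 343), (344, 512)]
annotations = [(158, 170, "0dj68zz"), (250, 262, "0dj68zz"),
               (280, 300, "0db6"), (400, 420, "5a1d")]
sentence_labels = label_sentences(sent_spans, annotations)
n_frag = len(sent_spans)  # this document's entry in the "fragments" dataset
```

Sentences without any overlapping annotation simply receive an empty label list, which is why most fragments end up with an all-zero target vector.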

In [42]:
# Sentence-Split information
ss_corpus_path = "../datasets/CodiEsp-SSplit-text/"

### Training corpus

In [43]:
label_list = list(df_codes_train_dev["code"])

In [44]:
len(label_list)

2367

In [45]:
len(set(label_list))

727

In [46]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb_encoder = MultiLabelBinarizer()
mlb_encoder.fit([label_list])

MultiLabelBinarizer(classes=None, sparse_output=False)

In [47]:
# Number of distinct codes
num_labels = len(mlb_encoder.classes_)

In [48]:
num_labels

727
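
Fitting the encoder on `[label_list]`, that is, on a single sample holding every code, makes `classes_` the sorted set of distinct codes, so each code gets a fixed column in the 727-dimensional target vector. As an illustration of that encoding, here is a small NumPy re-implementation of the multi-hot transform and its inverse. The helper names and the short label list are hypothetical; the notebook itself uses scikit-learn's `MultiLabelBinarizer`.

```python
import numpy as np

def fit_classes(label_list):
    """Sorted distinct labels, mirroring what MultiLabelBinarizer stores in classes_."""
    return sorted(set(label_list))

def multi_hot(label_sets, classes):
    """Encode each collection of labels as a 0/1 row vector over `classes`."""
    col = {c: j for j, c in enumerate(classes)}
    y = np.zeros((len(label_sets), len(classes)), dtype=int)
    for i, labels in enumerate(label_sets):
        for lab in labels:
            y[i, col[lab]] = 1
    return y

def decode(y, classes):
    """Inverse transform: a tuple of labels for each row."""
    return [tuple(classes[j] for j in np.flatnonzero(row)) for row in y]

classes = fit_classes(["bw03zzz", "3e02329", "bw03zzz", "bn20"])
y = multi_hot([["bn20", "bw03zzz"], []], classes)
decoded = decode(y, classes)
```

A sentence without any code maps to an all-zero row, which decodes to an empty tuple.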

Only training texts that are annotated with CIE-Procedimiento codes are considered:

In [49]:
# Some train documents (texts) are not annotated 
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner["doc_id"]))

65

In [50]:
train_doc_list = sorted(set(df_codes_train_ner["doc_id"]))

In [51]:
len(train_doc_list)

435

In [54]:
# Sentence-Split data

In [52]:
%%time
ss_sub_corpus_path = ss_corpus_path + "train/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 18.3 ms, sys: 0 ns, total: 18.3 ms
Wall time: 17.6 ms


In [53]:
%%time
train_ind, train_seg, train_y, train_frag, train_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 435/435 [00:06<00:00, 63.87it/s]


CPU times: user 6.91 s, sys: 44.3 ms, total: 6.96 s
Wall time: 6.89 s


In [57]:
# Sanity check

In [54]:
train_ind.shape

(7025, 128)

In [55]:
train_seg.shape

(7025, 128)

In [56]:
train_y.shape

(7025, 727)

In [57]:
len(train_frag)

435

In [58]:
len(train_start_end_frag)

7025

In [59]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    435.000000
mean      16.149425
std        7.778513
min        4.000000
25%       10.500000
50%       15.000000
75%       20.000000
max       54.000000
dtype: float64
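
Because fragments are stored document by document, the per-document counts in `train_frag` are enough to recover which rows of `train_ind` and `train_y` belong to a given text. One way to do this, sketched below with made-up counts, is to turn the counts into row slices with a cumulative sum. `doc_slices` is a hypothetical helper, not part of `nlp_utils`.

```python
import numpy as np

def doc_slices(frag_counts):
    """Turn per-document fragment counts into row slices over the fragment arrays."""
    counts = np.asarray(frag_counts)
    ends = np.cumsum(counts)
    starts = ends - counts
    return [slice(int(s), int(e)) for s, e in zip(starts, ends)]

toy_frag = [2, 3, 1]               # three documents with 2, 3 and 1 fragments
toy_rows = np.arange(6)            # stand-in for the rows of train_ind / train_y
slices = doc_slices(toy_frag)
second_doc_rows = toy_rows[slices[1]]
```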

In [60]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [61]:
check_id

110

In [62]:
train_doc_list[check_id]

'S0211-69952013000500035-1'

In [63]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Se presenta el caso de un varón de 64 años sin antecedentes de interés, que consulta por pérdida de 17 kg de peso, astenia, anorexia y anemia. Se realizó una colonoscopia que mostraba cambios inflamatorios inespecíficos de la mucosa colónica y una gastroscopia que evidenciaba neoplasia gástrica y gastropatía antral, con toma de biopsias. La histología de la biopsia gástrica confirmó una proliferación neoplásica con patrón sólido sugestiva de GIST, y la inmunohistoquímica presentaba positividad para CD117. Se observaron depósitos de amiloide AA en las biopsias de la mucosa gástrica, del tumor y de la mucosa colónica.\nPosteriormente, comenzó con edemas en los miembros inferiores y diarrea. En la analítica destacaba: hemoglobina 8,7 g/dl, hematocrito 28%, volumen corpuscular medio 75, creatinina sérica 1,3 mg/dl; proteínas totales 5,9 g/dl; albúmina 1,36 g/dl; colesterol 96 mg/dl; índice de saturación de transferrina 13%. Proteinuria de 5,1 g/día y sedimento con 4-6 hematíes/campo. Dada

In [64]:
check_id_frag = sum(train_frag[:check_id])
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([train_y[i]])), "\n")

[()] 

[('0db68zx', '0dbe8zx', '0dj68zz', '0djd8zz')] 

[('0db6',)] 

[('0db6',)] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[('5a1d',)] 



In [67]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i])))
    print("\n")

[('Se', (0, 2)), ('presenta', (3, 11)), ('el', (12, 14)), ('caso', (15, 19)), ('de', (20, 22)), ('un', (23, 25)), ('var', (26, 29)), ('##ón', (29, 31)), ('de', (32, 34)), ('64', (35, 37)), ('años', (38, 42)), ('sin', (43, 46)), ('ante', (47, 51)), ('##cedent', (51, 57)), ('##es', (57, 59)), ('de', (60, 62)), ('interés', (63, 70)), (',', (70, 71)), ('que', (72, 75)), ('consulta', (76, 84)), ('por', (85, 88)), ('pérdida', (89, 96)), ('de', (97, 99)), ('17', (100, 102)), ('kg', (103, 105)), ('de', (106, 108)), ('peso', (109, 113)), (',', (113, 114)), ('as', (115, 117)), ('##tenia', (117, 122)), (',', (122, 123)), ('ano', (124, 127)), ('##re', (127, 129)), ('##xia', (129, 132)), ('y', (133, 134)), ('ane', (135, 138)), ('##mia', (138, 141)), ('.', (141, 142))]


[('Se', (143, 145)), ('realizó', (146, 153)), ('una', (154, 157)), ('colonos', (158, 165)), ('##co', (165, 167)), ('##pia', (167, 170)), ('que', (171, 174)), ('mostra', (175, 181)), ('##ba', (181, 183)), ('cambios', (184, 191)), ('i

In [69]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Se presenta el caso de un var ##ón de 64 años sin ante ##cedent ##es de interés , que consulta por pérdida de 17 kg de peso , as ##tenia , ano ##re ##xia y ane ##mia . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Se realizó una colonos ##co ##pia que mostra ##ba cambios in ##f ##lama ##torio ##s in ##es ##pec ##ífic ##os de la mu ##cosa col ##ónica y una gas ##tros ##co ##pia que evidencia ##ba neo ##pla ##sia g ##ást ##rica y gas ##tro ##pat ##ía ant ##ral , con toma de bio ##psia ##s . [SEP
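
Each decoded row above has the layout `[CLS]` + subwords + `[SEP]`, padded with `[PAD]` up to `SEQ_LEN`. A minimal sketch of building such an input from already-tokenized subwords follows. The tiny vocabulary, its indices, and the function `encode_fragment` are invented for illustration; the notebook relies on the keras-bert tokenizer and the real mBERT `vocab.txt`.

```python
def encode_fragment(subwords, vocab, seq_len):
    """Map subwords to ids, add [CLS]/[SEP], then truncate and pad to seq_len."""
    body = subwords[:seq_len - 2]  # leave room for the two special tokens
    tokens = ["[CLS]"] + body + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    ids += [vocab["[PAD]"]] * (seq_len - len(ids))
    segments = [0] * seq_len  # single-sentence input: every segment id is 0
    return ids, segments

# Toy vocabulary (indices are arbitrary, not those of mBERT)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "Se": 4, "presenta": 5, "el": 6, "caso": 7}
ids, segments = encode_fragment(["Se", "presenta", "el", "caso", "clínico"], toy_vocab, seq_len=10)
short_ids, _ = encode_fragment(["Se", "presenta", "el", "caso"], toy_vocab, seq_len=4)
```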

In [70]:
# Fragment labels distribution
pd.Series(np.sum(train_y, axis=1)).describe()

count    7025.000000
mean        0.304911
std         0.652322
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         6.000000
dtype: float64

### Development corpus

Only development texts that are annotated with CIE-Procedimiento codes are considered:

In [71]:
# Some dev documents (texts) are not annotated 
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner["doc_id"]))

28

In [72]:
dev_doc_list = sorted(set(df_codes_dev_ner["doc_id"]))

In [73]:
len(dev_doc_list)

222

In [74]:
%%time
ss_sub_corpus_path = ss_corpus_path + "dev/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 6.13 ms, sys: 3.83 ms, total: 9.96 ms
Wall time: 9.37 ms


In [75]:
%%time
dev_ind, dev_seg, dev_y, dev_frag, dev_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 222/222 [00:03<00:00, 60.41it/s]


CPU times: user 3.73 s, sys: 23.8 ms, total: 3.75 s
Wall time: 3.72 s


In [77]:
# Sanity check

In [76]:
dev_ind.shape

(3808, 128)

In [77]:
dev_seg.shape

(3808, 128)

In [78]:
dev_y.shape

(3808, 727)

In [79]:
len(dev_frag)

222

In [80]:
len(dev_start_end_frag)

3808

In [81]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    222.000000
mean      17.153153
std        8.327785
min        4.000000
25%       11.000000
50%       15.000000
75%       21.000000
max       65.000000
dtype: float64

In [82]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [83]:
check_id

166

In [84]:
dev_doc_list[check_id]

'S1130-14732009000300006-1'

In [85]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Una mujer de 26 años sin antecedentes neurológicos previos, asmática y en tratamiento con anticonceptivos orales, presenta un cuadro de cefalea holocraneal muy intensa acompañada de fotofobia, nauseas y vómitos sin fiebre de unas 24 horas de evolución, que se complica a la mañana siguiente de forma súbita con inestabilidad para la marcha y diplopia. Acude a urgencias de nuestro centro; en el examen neurológico presenta discreta rigidez nucal, oftalmoplejia internuclear derecha, Romberg tambaleante y lateropulsión izquierda en la marcha, siendo el resto normal. Se le práctica una tomografía axial computarizada objetivándose una lesión con densidad negativa (-20 a -67 Unidades Hounsfield) localizada en la región temporal así como múltiples imágenes ovulares diseminadas por las cisternas supraselares derechas, cuadrigéminas bilaterales, ángulo pontocerebeloso y asta frontal del ventrículo lateral izquierdo que se interpretan como partículas de grasa. Con la sospecha de meningitis química

In [86]:
check_id_frag = sum(dev_frag[:check_id])
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([dev_y[i]])), "\n")

[()] 

[()] 

[('b020',)] 

[()] 

[('b020', 'b030')] 

[()] 

[()] 

[()] 

[('4a02x4z',)] 

[()] 

[()] 

[()] 

[()] 



In [87]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i])))
    print("\n")

[('Una', (0, 3)), ('mujer', (4, 9)), ('de', (10, 12)), ('26', (13, 15)), ('años', (16, 20)), ('sin', (21, 24)), ('ante', (25, 29)), ('##cedent', (29, 35)), ('##es', (35, 37)), ('neu', (38, 41)), ('##rol', (41, 44)), ('##ógicos', (44, 50)), ('pre', (51, 54)), ('##vios', (54, 58)), (',', (58, 59)), ('as', (60, 62)), ('##mática', (62, 68)), ('y', (69, 70)), ('en', (71, 73)), ('tratamiento', (74, 85)), ('con', (86, 89)), ('antico', (90, 96)), ('##nce', (96, 99)), ('##pti', (99, 102)), ('##vos', (102, 105)), ('oral', (106, 110)), ('##es', (110, 112)), (',', (112, 113)), ('presenta', (114, 122)), ('un', (123, 125)), ('cuadro', (126, 132)), ('de', (133, 135)), ('ce', (136, 138)), ('##fal', (138, 141)), ('##ea', (141, 143)), ('hol', (144, 147)), ('##oc', (147, 149)), ('##rane', (149, 153)), ('##al', (153, 155)), ('muy', (156, 159)), ('intensa', (160, 167)), ('ac', (168, 170)), ('##om', (170, 172)), ('##pa', (172, 174)), ('##ñada', (174, 178)), ('de', (179, 181)), ('foto', (182, 186)), ('##fo',

In [88]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Una mujer de 26 años sin ante ##cedent ##es neu ##rol ##ógicos pre ##vios , as ##mática y en tratamiento con antico ##nce ##pti ##vos oral ##es , presenta un cuadro de ce ##fal ##ea hol ##oc ##rane ##al muy intensa ac ##om ##pa ##ñada de foto ##fo ##bia , nau ##sea ##s y v ##óm ##itos sin fie ##bre de unas 24 horas de evolución , que se com ##pli ##ca a la mañana siguiente de forma sú ##bita con in ##esta ##bilidad para la marcha y di ##plo ##pia . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] A ##cud ##e a ur ##gen ##cias de nuestro centro ; en el examen neu ##rol ##ógico presenta disc ##reta ri ##gide ##z nu ##cal , ofta ##lm ##op ##lej ##ia intern ##uc ##lea ##r derecha , Rom ##berg tam ##bal ##ean ##te y later ##op ##ul ##sión izquierda en la marcha , siendo el resto normal . [SEP] [PAD] [PAD] [PAD] [PAD] [

In [89]:
# Fragment labels distribution
pd.Series(np.sum(dev_y, axis=1)).describe()

count    3808.000000
mean        0.309086
std         0.637504
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         5.000000
dtype: float64

### Training & Development corpus

We merge the previously generated datasets:

In [90]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [91]:
train_dev_ind.shape

(10833, 128)

In [92]:
# Segments
train_dev_seg = np.concatenate((train_seg, dev_seg))

In [93]:
train_dev_seg.shape

(10833, 128)

In [94]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [95]:
train_dev_y.shape

(10833, 727)

## Fine-tuning

Using the corpus of labeled sentences, we fine-tune the model on a multi-label sentence classification task.
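
In a multi-label setting, each of the `num_labels` outputs is treated as an independent binary decision: a sigmoid turns each logit into a probability, and the loss is the binary cross-entropy averaged over labels and samples. The NumPy sketch below works through that computation on toy numbers. It is meant as an illustration of the objective, not as the training code, and the clipping constant `eps` is an assumption.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all samples and labels."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(losses.mean())

# Two sentences, three candidate codes; a sentence may carry zero or several codes
y_true = np.array([[0.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0]])
logits = np.zeros_like(y_true)   # an uninformed model: every probability is 0.5
probs = sigmoid(logits)
loss = binary_cross_entropy(y_true, probs)  # ln(2) when all predictions are 0.5
```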

In [96]:
from keras.backend.tensorflow_backend import set_session

# Prevent GPU memory allocation problems
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

In [None]:
from keras_bert import load_trained_model_from_checkpoint

bert_model = load_trained_model_from_checkpoint(
    config_file=config_path, 
    checkpoint_file=checkpoint_path, 
    training=training,                                       
    trainable=trainable, 
    seq_len=SEQ_LEN
)

In [99]:
model.summary()

Model: "tfxlm_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  277453056 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 278,045,186
Trainable params: 278,045,186
Non-trainable params: 0
_________________________________________________________________


In [100]:
model.layers

[<transformers.models.roberta.modeling_tf_roberta.TFRobertaMainLayer at 0x7f4c78857090>,
 <transformers.models.roberta.modeling_tf_roberta.TFRobertaClassificationHead at 0x7f4c494b8650>]

In [101]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')
attention_mask = Input(shape=(SEQ_LEN,), name='attention_mask', dtype='int64')
inputs = [input_ids, attention_mask]

cls_token = model.layers[0](input_ids=inputs[0], attention_mask=inputs[1])[0][:, 0, :] # take <s> token output representation (equiv. to [CLS]) 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(cls_token) # Multi-label classification
out_act = Activation('sigmoid')(out_logits)

model = Model(inputs=[input_ids, attention_mask], outputs=out_act)

In [102]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
roberta (TFRobertaMainLayer)    TFBaseModelOutputWit 277453056   input_ids[0][0]                  
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 768)]        0           roberta[0][0]                    
______________________________________________________________________________________________

In [103]:
model.input

[<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>,
 <tf.Tensor 'attention_mask:0' shape=(None, 128) dtype=int64>]

In [104]:
model.output

<tf.Tensor 'activation_4/Identity:0' shape=(None, 727) dtype=float32>

In [None]:
%%time
from tensorflow.keras import optimizers, losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = losses.BinaryCrossentropy(from_logits=False)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_ind, 'attention_mask': train_dev_att}, y=train_dev_y,
          batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)

Epoch 1/41
Epoch 2/41
Epoch 3/41
Epoch 4/41
Epoch 5/41
Epoch 6/41
Epoch 7/41
Epoch 8/41
Epoch 9/41
Epoch 10/41
Epoch 11/41
Epoch 12/41
Epoch 13/41
Epoch 14/41
Epoch 15/41
Epoch 16/41
Epoch 17/41

## Test set predictions

Finally, the predictions made by the model on the test set are saved. To do so, each sentence from the test corpus is first converted into a sequence of subwords (input IDs and attention mask arrays). The sentence-level predictions are then saved so that they can be evaluated at document level (see `results/CodiEsp-P/Evaluation.ipynb`).

In [98]:
%%time
test_path = corpus_path + "test/text_files/"
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f)]
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 6.74 ms, sys: 0 ns, total: 6.74 ms
Wall time: 6.07 ms


In [99]:
df_text_test.shape

(250, 2)

In [100]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,S0365-66912007000900014-1,Paciente varón de 34 años de edad diagnosticad...
1,S0211-69952014000200012-1,"Un varón de 48 años, de raza caucásica, con IR..."
2,S1139-76322017000200009-1,Presentamos el caso clínico de un niño de cinc...
3,S0210-48062010000100019-1,"Paciente varón de 53 años, diagnosticado de es..."
4,S1130-14732005000500006-1,Se trata de un varón de 20 años diagnosticado ...


In [101]:
df_text_test.raw_text[0]

'Paciente varón de 34 años de edad diagnosticado de varicela tres semanas antes ya resuelta sin complicaciones. Acude a urgencias por presentar disminución de agudeza visual en su ojo izquierdo.\nEn la exploración oftalmológica presenta una agudeza visual corregida de 1 en el ojo derecho (OD) y de 0,6 en el ojo izquierdo (OI). El estudio con lámpara de hendidura demuestra en el OI un tyndall celular de 4+, precipitados queráticos inferiores (3+) y sin presentar la cornea tinción con fluoresceína, siendo normal el OD. La presión intraocular fue de 16mmHg en ambos ojos.\nEn la exploración fundoscópica inicial del OI se aprecia leve vitritis (1+) sin focos de retinitis.\nSe instaura tratamiento tópico con corticoides y midriáticos. A los 2 días se observa leve disminución del tyndall celular (3+) en cámara anterior pero en fondo de ojo aparece un foco periférico de retinitis necrotizante en el área temporal asociado a vasculitis retiniana.\nSe ingresa al paciente y se instaura tratamiento

In [102]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [103]:
len(test_doc_list)

250

In [111]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 34.7 ms, sys: 0 ns, total: 34.7 ms
Wall time: 34.1 ms


In [112]:
%%time
test_ind, test_att, _, test_frag, _ = ss_create_frag_input_data_xlmr(df_text=df_text_test, 
                                                  text_col=text_col,
                                                  # Since labels are ignored, we pass df_codes_train_ner as df_ann
                                                  df_ann=df_codes_train_ner, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 250/250 [00:00<00:00, 279.35it/s]


CPU times: user 956 ms, sys: 12.4 ms, total: 968 ms
Wall time: 951 ms


In [113]:
%%time
test_preds = model.predict({'input_ids': test_ind, 'attention_mask': test_att})

CPU times: user 14.9 s, sys: 1.68 s, total: 16.6 s
Wall time: 16.9 s


In [114]:
test_preds.shape

(3950, 727)

In [104]:
results_dir_path = "../results/CodiEsp-P/"

In [178]:
%%time
np.save(file=results_dir_path + "predictions/mbert_seed_" + str(random_seed) + "_test_preds.npy", arr=test_preds)

CPU times: user 2.21 ms, sys: 4.65 ms, total: 6.87 ms
Wall time: 6.02 ms


In [111]:
# To be further used when evaluating model performance at document level
np.save(file=results_dir_path + "mbert_test_frags.npy", arr=test_frag)
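
The saved fragment counts let the evaluation step map the sentence-level probabilities in `test_preds` back to documents. The aggregation actually used in `Evaluation.ipynb` is not shown here; one common choice, assumed in the sketch below, is to take the maximum probability per code over a document's sentences. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def max_pool_by_doc(frag_preds, frag_counts):
    """Aggregate fragment-level probabilities into one row per document (max per label)."""
    frag_preds = np.asarray(frag_preds)
    if sum(frag_counts) != len(frag_preds):
        raise ValueError("fragment counts do not match the number of prediction rows")
    doc_preds, start = [], 0
    for n in frag_counts:
        doc_preds.append(frag_preds[start:start + n].max(axis=0))
        start += n
    return np.vstack(doc_preds)

# Three fragments, two labels; the first document owns the first two fragments
toy_preds = np.array([[0.1, 0.8],
                      [0.6, 0.2],
                      [0.3, 0.4]])
toy_counts = [2, 1]
doc_preds = max_pool_by_doc(toy_preds, toy_counts)
```

The length check guards against pairing a predictions file with the fragment counts of a different run.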