# Fine-tuning BETO on CodiEsp-P

In this notebook, following a multi-label sequence classification approach, the BETO model is fine-tuned on both the training and development sets of the CodiEsp-P corpus. The predictions made by the model on the test set are also saved, in order to further evaluate its clinical coding performance (see `results/CodiEsp-P/Evaluation.ipynb`).

In [1]:
import tensorflow as tf

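# Auxiliary components
# os, numpy and pandas are used throughout the notebook; they are imported
# explicitly here rather than relying on the star import below.
import os
import sys

import numpy as np
import pandas as pd

sys.path.append("..")
from nlp_utils import *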

# BETO tokenizer
from keras_bert import load_vocabulary, Tokenizer
model_path = "BETO_cased/"
config_path = model_path + "config.json"
checkpoint_path = model_path + "model.ckpt-2000000"
vocab_file = "vocab.txt"
tokenizer = Tokenizer(token_dict=load_vocabulary(model_path + vocab_file), pad_index=1, cased=True)

# Hyper-parameters
text_col = "raw_text"
training = False
trainable = True
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 36
LR = 3e-5

random_seed = 0
tf.set_random_seed(random_seed)

Using TensorFlow backend.


## Load text

First, all text files from the CodiEsp training and development corpora are loaded into separate dataframes, together with their CIE-Procedimiento codes.

In [2]:
corpus_path = "../datasets/codiesp_v4/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train/text_files/"
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f)]
train_data = load_text_files(train_files, train_path)
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 2.97 ms, sys: 8.33 ms, total: 11.3 ms
Wall time: 10.7 ms


In [4]:
df_text_train.shape

(500, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,S0004-06142007000600016-2,Paciente varón de 35 años con tumoración en po...
1,S1137-66272009000500017-1,Lactante de sexo femenino que ingresó a los 7 ...
2,S0365-66912007001100010-1,Paciente de 63 años que refería déficit de agu...
3,S0365-66912009000300010-1,Se presenta el caso de un varón de 24 años de ...
4,S0211-69952013000500035-1,Se presenta el caso de un varón de 64 años sin...


In [6]:
df_text_train.raw_text[0]

'Paciente varón de 35 años con tumoración en polo superior de teste derecho hallada de manera casual durante una autoexploración, motivo por el cual acude a consulta de urología donde se realiza exploración física, apreciando masa de 1cm aproximado de diámetro dependiente de epidídimo, y ecografía testicular, que se informa como lesión nodular sólida en cabeza de epidídimo derecho. Se realiza RMN. Confirmando masa nodular, siendo el tumor adenomatoide de epidídimo la primera posibilidad diagnóstica.\n\nSe decide, en los dos casos, resección quirúrgica de tumoración nodular en cola epidídimo derecho, sin realización de orquiectomía posterior.\nEn ambos casos se realizó examen anátomopatológico de la pieza quirúrgica. Hallazgos histológicos macroscópicos: formación nodular de 1,5 cms (caso1) y 1,2 cms (caso 2) de consistencia firme, coloración blanquecina y bien delimitada. Microscópicamente se observa proliferación tumoral constituida por estructuras tubulares en las que la celularidad 

We also load the CIE-Procedimiento codes table:

In [7]:
df_codes_train = pd.read_table(corpus_path + "train/trainP.tsv", sep='\t', header=None)

In [8]:
df_codes_train.columns = ["doc_id", "code"]

In [9]:
df_codes_train.shape

(1550, 2)

In [10]:
df_codes_train.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


In [11]:
len(set(df_codes_train["doc_id"]))

435

### Development corpus

In [12]:
%%time
dev_path = corpus_path + "dev/text_files/"
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f)]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 2.15 ms, sys: 3.47 ms, total: 5.62 ms
Wall time: 5.06 ms


In [13]:
df_text_dev.shape

(250, 2)

In [14]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,S1698-44472004000100009-1,Varón de 64 años de edad con tumefacción mandi...
1,S1139-76322015000300013-1,Niña de tres años que acude a Urgencias tras l...
2,S1130-05582015000100004-1,Se presenta el caso de una mujer de 60 años de...
3,S1887-85712015000200005-1,Paciente varón de cinco años de edad que tras ...
4,S1699-65852010000300002-1,"LTR. Paciente de sexo masculino, de 32 años de..."


In [15]:
df_text_dev.raw_text[0]

'Varón de 64 años de edad con tumefacción mandibular derecha de 6 meses de evolución. La radiografía simple mostraba una lesión expansiva bien delimitada, osteolítica, multiloculada, localizada en rama horizontal mandibular. La tomografía computerizada presentaba una lesión expansiva con destrucción de la cortical ósea. Con el diagnóstico provisional de probable ameloblastoma se procedió a la resección-biopsia de la lesión. Mediante incisión interpapilar se expuso la mandíbula que mostraba la superficie abombada y destruída por una tumoración carnosa de consistencia densa que rodeaba la rama del nervio dentario inferior. Tras un cuidadoso curetaje de la cavidad ósea se reconstruyó la mandíbula y se repuso la mucosa. No hubo complicaciones postquirúrgicas.\n\nEl material remitido a Anatomía Patológica consistía en fragmentos tumorales de unos 2x1.5 cm, blanco-grisáceos al corte y de consistencia firme. Se tomaron diversas muestras que tras fijarse en formaldehído se incluyeron en parafi

We also load the CIE-Procedimiento codes table:

In [16]:
df_codes_dev = pd.read_table(corpus_path + "dev/devP.tsv", sep='\t', header=None)

In [17]:
df_codes_dev.columns = ["doc_id", "code"]

In [18]:
df_codes_dev.shape

(817, 2)

In [19]:
df_codes_dev.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000900016-1,bt41zzz
1,S0004-06142005000900016-1,ct13
2,S0004-06142005001000011-1,3e1m39z
3,S0004-06142005001000011-1,0tcb
4,S0004-06142005001000011-1,bt02


In [20]:
len(set(df_codes_dev["doc_id"]))

222

We join the training and development CodiEsp codes dataframes together:

In [21]:
df_codes_train_dev = pd.concat([df_codes_train, df_codes_dev])

In [22]:
df_codes_train_dev.shape

(2367, 2)

In [23]:
df_codes_train_dev.head()

Unnamed: 0,doc_id,code
0,S0004-06142005000700014-1,bw03zzz
1,S0004-06142005000700014-1,3e02329
2,S0004-06142005000700014-1,bw40zzz
3,S0004-06142005000700014-1,bv44zzz
4,S0004-06142005000700014-1,bn20


## Creating corpora of annotated sentences

Leveraging the information available for the named-entity-recognition and normalization (NER-N) CodiEsp-X task, we create both a training and a development corpus of sentences annotated with CIE-Procedimiento codes.

First, we pre-process the NER-N procedure-code annotations available for both the training and development corpora.
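Judging from the outputs shown in this section, the pre-processing step appears to split each `location` string (for instance `2787 2801;2810 2823`) into one row per contiguous character span, which would explain why the number of rows grows after processing. The internals of `process_ner_labels` are not shown in this notebook, so the following is only a minimal sketch of that idea under this assumption; the function name `explode_locations` and the dictionary-based rows are hypothetical stand-ins for the dataframe operations.

```python
from typing import Dict, List


def explode_locations(annotations: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Return one row per contiguous character span (hypothetical helper).

    Each input row carries a `location` string made of semicolon-separated
    "start end" pairs, e.g. "2787 2801;2810 2823".
    """
    rows = []
    for ann in annotations:
        for span in ann["location"].split(";"):
            start, end = (int(pos) for pos in span.split())
            # Copy every field except the raw location string
            row = {k: v for k, v in ann.items() if k != "location"}
            row.update({"start": start, "end": end})
            rows.append(row)
    # Same ordering as in the notebook: by document, then by position
    return sorted(rows, key=lambda r: (r["doc_id"], r["start"], r["end"]))


example = [
    {"doc_id": "S0004-06142005000700014-1", "code": "3e02329",
     "word": "Estreptomicina intramuscular", "location": "2787 2801;2810 2823"},
    {"doc_id": "S0004-06142005000700014-1", "code": "bw03zzz",
     "word": "Rx tórax", "location": "2163 2171"},
]
spans = explode_locations(example)
for row in spans:
    print(row["code"], row["start"], row["end"])
```

A discontinuous mention therefore contributes two rows with the same code, one per span.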

In [24]:
# Training corpus

In [24]:
%%time

codiesp_x_train = pd.read_table(corpus_path + "train/trainX.tsv", sep='\t', header=None)

CPU times: user 10.8 ms, sys: 0 ns, total: 10.8 ms
Wall time: 10.2 ms


In [25]:
codiesp_x_train.columns = ["doc_id", "type", "code", "word", "location"]

In [26]:
codiesp_x_train.shape

(9181, 5)

In [27]:
codiesp_x_train.head()

Unnamed: 0,doc_id,type,code,word,location
0,S0004-06142005000700014-1,PROCEDIMIENTO,bw03zzz,Rx tórax,2163 2171
1,S0004-06142005000700014-1,PROCEDIMIENTO,3e02329,Estreptomicina intramuscular,2787 2801;2810 2823
2,S0004-06142005000700014-1,DIAGNOSTICO,n44.8,teste derecho aumentado de tamaño,1343 1376
3,S0004-06142005000700014-1,DIAGNOSTICO,z20.818,exposición a Brucella,594 615
4,S0004-06142005000700014-1,DIAGNOSTICO,r60.9,edemas,1250 1256


In [28]:
codiesp_x_train = codiesp_x_train[codiesp_x_train["type"] == "PROCEDIMIENTO"]

In [29]:
codiesp_x_train.shape

(1972, 5)

In [30]:
df_codes_train_ner = process_ner_labels(codiesp_x_train).sort_values(["doc_id", "start", "end"])

In [31]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,type,code,word,start,end
0,S0004-06142005000700014-1,PROCEDIMIENTO,bw03zzz,Rx tórax,2163,2171
3,S0004-06142005000700014-1,PROCEDIMIENTO,bw40zzz,Ecografía abdominal,2173,2192
5,S0004-06142005000700014-1,PROCEDIMIENTO,bn20,TAC craneal,2194,2205
4,S0004-06142005000700014-1,PROCEDIMIENTO,bv44zzz,Ecografía testicular,2287,2307
1,S0004-06142005000700014-1,PROCEDIMIENTO,3e02329,Estreptomicina intramuscular,2787,2801


In [32]:
df_codes_train_ner.shape

(2769, 6)

In [34]:
# Development corpus

In [33]:
%%time

codiesp_x_dev = pd.read_table(corpus_path + "dev/devX.tsv", sep='\t', header=None)

CPU times: user 6.37 ms, sys: 189 µs, total: 6.56 ms
Wall time: 6.07 ms


In [34]:
codiesp_x_dev.columns = ["doc_id", "type", "code", "word", "location"]

In [35]:
codiesp_x_dev.shape

(4477, 5)

In [36]:
codiesp_x_dev.head()

Unnamed: 0,doc_id,type,code,word,location
0,S0004-06142005000900016-1,PROCEDIMIENTO,bt41zzz,ecografía renal derecha,307 316;348 361
1,S0004-06142005000900016-1,PROCEDIMIENTO,ct13,gammagrafía renal,739 756
2,S0004-06142005000900016-1,DIAGNOSTICO,q62.11,estenosis en la unión pieloureteral derecha,540 583
3,S0004-06142005000900016-1,DIAGNOSTICO,n28.89,ectasia pielocalicial,326 347
4,S0004-06142005000900016-1,DIAGNOSTICO,n39.0,infecciones del tracto urinario,198 229


In [37]:
codiesp_x_dev = codiesp_x_dev[codiesp_x_dev["type"] == "PROCEDIMIENTO"]

In [38]:
codiesp_x_dev.shape

(1046, 5)

In [39]:
df_codes_dev_ner = process_ner_labels(codiesp_x_dev).sort_values(["doc_id", "start", "end"])

In [40]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,type,code,word,start,end
0,S0004-06142005000900016-1,PROCEDIMIENTO,bt41zzz,ecografía renal derecha,307,316
1,S0004-06142005000900016-1,PROCEDIMIENTO,bt41zzz,ecografía renal derecha,348,361
2,S0004-06142005000900016-1,PROCEDIMIENTO,ct13,gammagrafía renal,739,756
3,S0004-06142005001000011-1,PROCEDIMIENTO,3e1m39z,diálisis peritoneal,95,114
7,S0004-06142005001000011-1,PROCEDIMIENTO,0270,angioplastia transluminal de la coronaria derecha,424,473


In [41]:
df_codes_dev_ner.shape

(1540, 6)

Now, using the character start-end positions of each sentence from the CodiEsp corpus (see `datasets/CodiEsp-Sentence-Split.ipynb`), we annotate the sentences with CIE-Procedimiento codes. Using the BETO tokenizer, each sentence is also converted into a sequence of subwords, which are in turn converted into vocabulary indices (input IDs) and segment arrays (the BERT input tensors). Finally, we generate a *fragments* dataset that records, for each document, the number of annotated sentences produced.
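To make the sentence-annotation step more concrete, the sketch below assigns a code to every sentence whose character span overlaps the span of the annotated mention. This overlap rule is an assumption made for illustration: the actual logic lives in `ss_create_frag_input_data_bert`, whose implementation is not shown here and may differ (for example, by requiring full containment). The names `annotate_sentences`, `sentences` and `annotations` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]


def annotate_sentences(sentences: List[Span],
                       annotations: List[Tuple[int, int, str]]) -> List[Set[str]]:
    """Give each sentence the codes of all annotations overlapping it."""
    labels: List[Set[str]] = [set() for _ in sentences]
    for ann_start, ann_end, code in annotations:
        for i, (sent_start, sent_end) in enumerate(sentences):
            # Two half-open intervals overlap iff each starts before the other ends
            if ann_start < sent_end and sent_start < ann_end:
                labels[i].add(code)
    return labels


# Toy document made of three sentences, given as character offsets
sentences = [(0, 40), (41, 90), (91, 150)]
# The second annotation straddles the boundary between sentences 2 and 3
annotations = [(10, 20, "bw03zzz"), (85, 95, "bn20")]

sentence_labels = annotate_sentences(sentences, annotations)
n_fragments = len(sentences)  # this document's entry in the fragments dataset
print(sentence_labels, n_fragments)
```

Sentences that overlap no annotation keep an empty label set, which matches the many `[()]` rows seen when the encoded labels are inspected.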

In [42]:
# Sentence-Split information
ss_corpus_path = "../datasets/CodiEsp-SSplit-text/"

### Training corpus

In [43]:
label_list = list(df_codes_train_dev["code"])

In [44]:
len(label_list)

2367

In [45]:
len(set(label_list))

727

In [46]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb_encoder = MultiLabelBinarizer()
mlb_encoder.fit([label_list])

MultiLabelBinarizer(classes=None, sparse_output=False)

In [47]:
# Number of distinct codes
num_labels = len(mlb_encoder.classes_)

In [48]:
num_labels

727
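Note that the encoder is fitted on `[label_list]`, that is, on a single "sample" containing every code, so that its classes become the sorted set of distinct codes (727 here, out of 2367 annotations). The following hand-written sketch mimics that behaviour with plain Python lists to show what the resulting multi-hot rows look like; `fit_classes` and `to_multi_hot` are hypothetical helpers, not part of scikit-learn.

```python
from typing import Iterable, List


def fit_classes(all_labels: Iterable[str]) -> List[str]:
    """Sorted distinct labels, playing the role of the encoder's `classes_`."""
    return sorted(set(all_labels))


def to_multi_hot(label_sets: Iterable[Iterable[str]],
                 classes: List[str]) -> List[List[int]]:
    """Encode each label set as a 0/1 row with one column per class."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows


# Toy list of codes; duplicates are expected, as in label_list
toy_label_list = ["bw03zzz", "3e02329", "bw03zzz", "bn20"]
classes = fit_classes(toy_label_list)

# One sentence with two codes, and one sentence without any code
encoded = to_multi_hot([["bn20", "bw03zzz"], []], classes)
print(classes)
print(encoded)
```

Each sentence is thus represented by a vector with as many positions as distinct codes, and a sentence without annotations maps to an all-zero vector.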

Only training texts that are annotated with CIE-Procedimiento codes are considered:

In [49]:
# Some train documents (texts) are not annotated 
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner["doc_id"]))

65

In [50]:
train_doc_list = sorted(set(df_codes_train_ner["doc_id"]))

In [51]:
len(train_doc_list)

435

In [54]:
# Sentence-Split data

In [52]:
%%time
ss_sub_corpus_path = ss_corpus_path + "train/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 11.9 ms, sys: 3.89 ms, total: 15.8 ms
Wall time: 15.2 ms


In [53]:
%%time
train_ind, train_seg, train_y, train_frag, train_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 435/435 [00:06<00:00, 68.43it/s]

CPU times: user 6.45 s, sys: 21.2 ms, total: 6.47 s
Wall time: 6.44 s





In [57]:
# Sanity check

In [54]:
train_ind.shape

(7004, 128)

In [55]:
train_seg.shape

(7004, 128)

In [56]:
train_y.shape

(7004, 727)

In [57]:
len(train_frag)

435

In [58]:
len(train_start_end_frag)

7004

In [59]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    435.000000
mean      16.101149
std        7.767881
min        3.000000
25%       10.000000
50%       15.000000
75%       20.000000
max       54.000000
dtype: float64

In [60]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [61]:
check_id

162

In [62]:
train_doc_list[check_id]

'S0365-66912004001000008-1'

In [63]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Mujer de 33 años referida a la Unidad de Oncología Ocular, para evaluación de una lesión pigmentada en iris de ojo derecho (OD) descubierta en examen rutinario. A la exploración presentaba una miopía media en ambos ojos con agudeza visual de 0,8. El ojo izquierdo no mostraba hallazgos patológicos, mientras que en OD se observaba una masa pigmentada redondeada de aproximadamente 3-4 mm de diámetro en el cuadrante superoexterno de cámara anterior (CA) entre los meridianos de las 9 y las 11. La masa presentaba una superficie lisa y regular que no infiltraba el iris. La PIO era normal, y la gonioscopia mostraba la ocupación del ángulo por la masa sin lesiones satélites. En la exploración bajo midriasis se observó que se originaba en cuerpo ciliar, produciendo una muesca en el cristalino transparente y un fondo de ojo sin patología valorable. La Tomografia Axial Computarizada (TAC) mostró la existencia de una masa redondeada localizada entre la raíz del iris y el cuerpo ciliar con captació

In [64]:
check_id_frag = sum(train_frag[:check_id])
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([train_y[i]])), "\n")

[('b825',)] 

[()] 

[('08j0xzz',)] 

[()] 

[()] 

[('08j0xzz',)] 

[('b825',)] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 
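The loop above relies on the fact that the fragment counts are stored in document order: the rows belonging to document `check_id` start at `sum(train_frag[:check_id])` and span `train_frag[check_id]` rows. As a small illustration of this indexing scheme, the sketch below precomputes the row range of every document with a running sum; `fragment_ranges` and the toy counts are hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def fragment_ranges(frag_counts: List[int]) -> List[Tuple[int, int]]:
    """Return the half-open row range [start, end) of each document."""
    ends = list(accumulate(frag_counts))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


frag_counts = [3, 5, 2]  # toy stand-in for train_frag
ranges = fragment_ranges(frag_counts)

doc = 1
start, end = ranges[doc]
print(ranges)
print(start, end)
```

Precomputing the ranges avoids recomputing the prefix sum for every document when many documents are inspected.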



In [65]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i])))
    print("\n")

[('Mujer', (0, 5)), ('de', (6, 8)), ('33', (9, 11)), ('años', (12, 16)), ('referida', (17, 25)), ('a', (26, 27)), ('la', (28, 30)), ('Unidad', (31, 37)), ('de', (38, 40)), ('On', (41, 43)), ('##col', (43, 46)), ('##o', (46, 47)), ('##gía', (47, 50)), ('Oc', (51, 53)), ('##ular', (53, 57)), (',', (57, 58)), ('para', (59, 63)), ('evaluación', (64, 74)), ('de', (75, 77)), ('una', (78, 81)), ('lesión', (82, 88)), ('pig', (89, 92)), ('##menta', (92, 97)), ('##da', (97, 99)), ('en', (100, 102)), ('iris', (103, 107)), ('de', (108, 110)), ('ojo', (111, 114)), ('derecho', (115, 122)), ('(', (123, 124)), ('OD', (124, 126)), (')', (126, 127)), ('descubierta', (128, 139)), ('en', (140, 142)), ('examen', (143, 149)), ('rutina', (150, 156)), ('##rio', (156, 159)), ('.', (159, 160))]


[('A', (161, 162)), ('la', (163, 165)), ('exploración', (166, 177)), ('presentaba', (178, 188)), ('una', (189, 192)), ('mio', (193, 196)), ('##pía', (196, 199)), ('media', (200, 205)), ('en', (206, 208)), ('ambos', (20

In [66]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Mujer de 33 años referida a la Unidad de On ##col ##o ##gía Oc ##ular , para evaluación de una lesión pig ##menta ##da en iris de ojo derecho ( OD ) descubierta en examen rutina ##rio . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] A la exploración presentaba una mio ##pía media en ambos ojos con agu ##deza visual de 0 , 8 . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA

In [67]:
# Fragment labels distribution
pd.Series(np.sum(train_y, axis=1)).describe()

count    7004.000000
mean        0.305682
std         0.653261
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         6.000000
dtype: float64

### Development corpus

Only development texts that are annotated with CIE-Procedimiento codes are considered:

In [68]:
# Some dev documents (texts) are not annotated 
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner["doc_id"]))

28

In [69]:
dev_doc_list = sorted(set(df_codes_dev_ner["doc_id"]))

In [70]:
len(dev_doc_list)

222

In [71]:
%%time
ss_sub_corpus_path = ss_corpus_path + "dev/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 42.9 ms, sys: 15 µs, total: 42.9 ms
Wall time: 42.1 ms


In [72]:
%%time
dev_ind, dev_seg, dev_y, dev_frag, dev_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 222/222 [00:03<00:00, 62.81it/s]

CPU times: user 3.58 s, sys: 20 ms, total: 3.6 s
Wall time: 3.58 s





In [77]:
# Sanity check

In [73]:
dev_ind.shape

(3797, 128)

In [74]:
dev_seg.shape

(3797, 128)

In [75]:
dev_y.shape

(3797, 727)

In [76]:
len(dev_frag)

222

In [77]:
len(dev_start_end_frag)

3797

In [78]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    222.000000
mean      17.103604
std        8.313598
min        4.000000
25%       11.000000
50%       15.000000
75%       21.000000
max       65.000000
dtype: float64

In [79]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [80]:
check_id

42

In [81]:
dev_doc_list[check_id]

'S0210-48062007001000004-3'

In [82]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Paciente de 29 años que es remitido a nuestra consulta tras cuadro de dolor en teste derecho 3 meses atrás que remitió con tratamiento antiinflamatorio. Un mes después tuvo un nuevo cuadro de dolor en testículo derecho que fue diagnosticado de orquiepididimitis pero que no se resolvió con tratamiento médico.\nComo antecedentes personales sólo destaca el ser alérgico a las Sulfamidas. No antecedentes urológicos de interés.\nA la exploración física se palpa tumoración en polo posteroinferior de teste derecho indolora.\nAnte la sospecha de tumor testicular se le solicita una analítica completa sanguínea con marcadores tumorales y una ecografía testicular, siendo los resultados de AFP 36.4 ng/ml y una Beta-hCG 4.2 mUI/ml. En la ecografía se observa una tumoración de 23 mm en polo inferior de teste derecho de características no quísticas sospechosa de neoplasia.\nSe le interviene quirúrgicamente mediante una orquiectomía radical inguinal derecha cuya Anatomía Patológica informa de Tumor de

In [83]:
check_id_frag = sum(dev_frag[:check_id])
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([dev_y[i]])), "\n")

[()] 

[()] 

[()] 

[()] 

[()] 

[('bv44zzz',)] 

[('bv44zzz',)] 

[('0vt90zz',)] 

[()] 

[()] 

[()] 

[('bv44zzz',)] 

[()] 

[()] 

[('0vtb0zz',)] 

[()] 

[()] 

[()] 



In [84]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i])))
    print("\n")

[('Paci', (0, 4)), ('##ente', (4, 8)), ('de', (9, 11)), ('29', (12, 14)), ('años', (15, 19)), ('que', (20, 23)), ('es', (24, 26)), ('remitido', (27, 35)), ('a', (36, 37)), ('nuestra', (38, 45)), ('consulta', (46, 54)), ('tras', (55, 59)), ('cuadro', (60, 66)), ('de', (67, 69)), ('dolor', (70, 75)), ('en', (76, 78)), ('test', (79, 83)), ('##e', (83, 84)), ('derecho', (85, 92)), ('3', (93, 94)), ('meses', (95, 100)), ('atrás', (101, 106)), ('que', (107, 110)), ('remi', (111, 115)), ('##tió', (115, 118)), ('con', (119, 122)), ('tratamiento', (123, 134)), ('anti', (135, 139)), ('##inf', (139, 142)), ('##lam', (142, 145)), ('##atori', (145, 150)), ('##o', (150, 151)), ('.', (151, 152))]


[('Un', (153, 155)), ('mes', (156, 159)), ('después', (160, 167)), ('tuvo', (168, 172)), ('un', (173, 175)), ('nuevo', (176, 181)), ('cuadro', (182, 188)), ('de', (189, 191)), ('dolor', (192, 197)), ('en', (198, 200)), ('test', (201, 205)), ('##ículo', (205, 210)), ('derecho', (211, 218)), ('que', (219, 22

In [85]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Paci ##ente de 29 años que es remitido a nuestra consulta tras cuadro de dolor en test ##e derecho 3 meses atrás que remi ##tió con tratamiento anti ##inf ##lam ##atori ##o . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Un mes después tuvo un nuevo cuadro de dolor en test ##ículo derecho que fue diagnos ##tica ##do de or ##qui ##ep ##idi ##di ##mit ##is pero que no se resolvió con tratamiento médico . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P

In [86]:
# Fragment labels distribution
pd.Series(np.sum(dev_y, axis=1)).describe()

count    3797.000000
mean        0.309455
std         0.637639
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         5.000000
dtype: float64

### Training & Development corpus

We merge the previously generated datasets:

In [87]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [88]:
train_dev_ind.shape

(10801, 128)

In [89]:
# Segments
train_dev_seg = np.concatenate((train_seg, dev_seg))

In [90]:
train_dev_seg.shape

(10801, 128)

In [91]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [92]:
train_dev_y.shape

(10801, 727)

## Fine-tuning

Using the corpus of labeled sentences, we fine-tune the model on a multi-label sentence classification task.

In [93]:
from keras.backend.tensorflow_backend import set_session

# Prevent GPU memory allocation problems
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

In [94]:
from keras_bert import load_trained_model_from_checkpoint

model = load_trained_model_from_checkpoint(
    config_file=config_path, 
    checkpoint_file=checkpoint_path, 
    training=training,                                       
    trainable=trainable, 
    seq_len=SEQ_LEN
)

In [95]:
model.inputs

[<tf.Tensor 'Input-Token:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Segment:0' shape=(?, 128) dtype=float32>]

In [96]:
model.outputs

[<tf.Tensor 'Encoder-12-FeedForward-Norm/add_1:0' shape=(?, 128, 768) dtype=float32>]

In [97]:
from keras.layers import Dense, Activation
from keras.models import Model
from keras.initializers import glorot_uniform
from keras_bert.layers import Extract

dense_cls = Extract(index=0, name='Extract')(model.output) # In order to extract CLS token embedding
dense_out = Dense(units=num_labels, kernel_initializer=glorot_uniform(seed=random_seed))(dense_cls) # Multi-label classification
outputs = Activation('sigmoid')(dense_out)

model = Model(model.inputs, outputs)

In [98]:
model.inputs

[<tf.Tensor 'Input-Token:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Segment:0' shape=(?, 128) dtype=float32>]

In [99]:
model.outputs

[<tf.Tensor 'activation_1/Sigmoid:0' shape=(?, 727) dtype=float32>]
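The classification head applies an independent sigmoid to each of the 727 outputs, so each code receives its own probability and several codes can be active for the same sentence. To illustrate how this interacts with a binary cross-entropy objective, here is a small worked example in plain Python. It is a sketch of the standard formulas rather than of the Keras internals, and the clipping constant `eps` is an assumption.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true: List[int], y_prob: List[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over the labels of one sentence."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -3.0, 0.0]   # toy outputs of the dense layer for 3 codes
probs = [sigmoid(z) for z in logits]
target = [1, 0, 0]          # only the first code is annotated

loss = binary_cross_entropy(target, probs)
print([round(p, 4) for p in probs], round(loss, 4))
```

Unlike a softmax, the probabilities do not have to sum to one, which is what makes this head suitable for multi-label classification.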

In [None]:
%%time
from keras_radam import RAdam

model.compile(
    optimizer=RAdam(learning_rate=LR),
    loss='binary_crossentropy'
)

history = model.fit(
    x=[train_dev_ind, train_dev_seg],
    y=train_dev_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    shuffle=True
)

Epoch 1/36
Epoch 2/36
Epoch 3/36
  256/10801 [..............................] - ETA: 2:16 - loss: 0.0121

## Test set predictions

Finally, the predictions made by the model on the test set are saved. For this purpose, each sentence from the test corpus must first be converted into a sequence of subwords (input IDs and segment arrays). Then, the sentence-level predictions made by the model are saved, to be further evaluated at document level (see `results/CodiEsp-P/Evaluation.ipynb`).

In [100]:
%%time
test_path = corpus_path + "test/text_files/"
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f)]
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 10.9 ms, sys: 103 µs, total: 11 ms
Wall time: 9.58 ms


In [101]:
df_text_test.shape

(250, 2)

In [102]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,S1698-44472004000400012-1,"Varón de 54 años de edad, remitido a nuestro s..."
1,S1130-05582012000300005-1,Acude a nuestras consultas a un paciente que p...
2,S0212-16112009000300015-1,Se trató de un varón de 77 años con antecedent...
3,S1139-76322014000500014-1,Niño de cinco años derivado por su pediatra de...
4,S0212-71992004000300009-1,Varón de 22 años de edad que acude a consultas...


In [103]:
df_text_test.raw_text[0]

'Varón de 54 años de edad, remitido a nuestro servicio en mayo del 2003 por presentar odontalgia en relación con un tercer molar inferior derecho erupcionado. A la inspección oral se pudo observar la existencia de una tumefacción que expandía las corticales vestibulo-linguales, en la región del tercer molar mandibular derecho, cariado por distal. La mucosa oral estaba indemne, y no se palpaban adenomegalias cervicales. El paciente refería la existencia de una hipoestesia en el territorio de distribución del nervio mentoniano de quince días de evolución. En la ortopantomografía, se evidenció la presencia de una imagen radiolúcida, de contornos poco definidos, en el cuerpo mandibular derecho. Dos días después, bajo anestesia local, se procedió a la exodoncia del tercer molar y curetaje-biopsia del tejido subyacente. Durante el acto operatorio, se produjo una intensa hemorragia, que pudo ser cohibida con el empleo de Surgicel (Johnson & Johnson, Nuevo Brunswick, NJ) y mediante el empaquet

In [104]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [105]:
len(test_doc_list)

250

In [106]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 9.02 ms, sys: 33 µs, total: 9.05 ms
Wall time: 8.27 ms


In [107]:
%%time
test_ind, test_seg, _, test_frag, _ = ss_create_frag_input_data_bert(df_text=df_text_test, 
                                                  text_col=text_col,
                                                  # Since labels are ignored, we pass df_codes_train_ner as df_ann
                                                  df_ann=df_codes_train_ner, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 250/250 [00:01<00:00, 182.41it/s]

CPU times: user 1.43 s, sys: 28.5 ms, total: 1.46 s
Wall time: 1.42 s





In [108]:
%%time
test_preds = model.predict([test_ind, test_seg])

CPU times: user 5.78 s, sys: 581 ms, total: 6.36 s
Wall time: 13 s


In [109]:
test_preds.shape

(3948, 727)

In [110]:
results_dir_path = "../results/CodiEsp-P/"

In [178]:
%%time
np.save(file=results_dir_path + "predictions/beto_seed_" + str(random_seed) + "_test_preds.npy", arr=test_preds)

CPU times: user 2.21 ms, sys: 4.65 ms, total: 6.87 ms
Wall time: 6.02 ms


In [111]:
# To be further used when evaluating model performance at document level
np.save(file=results_dir_path + "beto_test_frags.npy", arr=test_frag)
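The saved fragment counts make it possible to map the sentence-level predictions back to documents. As a rough illustration, the sketch below takes, for each document, the element-wise maximum of the probabilities predicted for its sentences. Max-pooling is only one possible aggregation strategy and is an assumption here: the procedure actually used is defined in the evaluation notebook. The function `aggregate_by_document` and the toy values are hypothetical, and plain lists stand in for the NumPy arrays.

```python
from typing import List


def aggregate_by_document(sent_probs: List[List[float]],
                          frag_counts: List[int]) -> List[List[float]]:
    """Element-wise maximum over the sentences of each document."""
    doc_probs = []
    start = 0
    for n in frag_counts:
        block = sent_probs[start:start + n]  # rows of the current document
        doc_probs.append([max(col) for col in zip(*block)])
        start += n
    return doc_probs


# Toy predictions: 3 sentences, 2 codes; first document has 2 sentences
sent_probs = [
    [0.1, 0.9],
    [0.2, 0.3],
    [0.7, 0.1],
]
frag_counts = [2, 1]  # toy stand-in for test_frag

doc_probs = aggregate_by_document(sent_probs, frag_counts)
print(doc_probs)
```

Under this scheme a code is scored highly for a document as soon as a single sentence supports it.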