# Fine-tuning mBERT-Galén on CodiEsp-D

In this notebook, the mBERT-Galén model is fine-tuned on both the training and development sets of the CodiEsp-D corpus, following a multi-label sequence classification approach. The predictions made by the model on the test set are also saved, so that its clinical coding performance can be evaluated further (see `results/CodiEsp-D/Evaluation.ipynb`).

In [1]:
import tensorflow as tf

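# Standard and third-party modules used throughout the notebook
import os
import sys
import numpy as np
import pandas as pd

# Auxiliary components
sys.path.append("..")
from nlp_utils import *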

# mBERT tokenizer
from keras_bert import load_vocabulary, Tokenizer
model_path = "multi_cased_L-12_H-768_A-12/"
config_path = model_path + "bert_config.json"
checkpoint_path = model_path + "mBERT-Galen/model.ckpt-1000000"
vocab_file = "vocab.txt"
tokenizer = Tokenizer(token_dict=load_vocabulary(model_path + vocab_file), pad_index=0, cased=True)

# Hyper-parameters
text_col = "raw_text"
training = False
trainable = True
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 31
LR = 3e-5
train_weight = 4
all_abs_weight = 1

random_seed = 0
tf.set_random_seed(random_seed)

Using TensorFlow backend.


## Load text

First, all text files from the training and development CodiEsp corpora are loaded into separate dataframes.

The CIE-Diagnóstico codes are loaded as well.

In [2]:
corpus_path = "../datasets/codiesp_v4/"
abs_corpus_path = "../datasets/abstractsWithCIE10_v2/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train/text_files/"
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f)]
train_data = load_text_files(train_files, train_path)
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 6.37 ms, sys: 4.26 ms, total: 10.6 ms
Wall time: 10.3 ms


In [4]:
df_text_train.shape

(500, 2)

In [5]:
df_text_train.head()

| | doc_id | raw_text |
|---|---|---|
| 0 | S0365-66912006000600011-1 | Un varón de 13 años es remitido para valoració... |
| 1 | S1139-13752009000200010-2 | Paciente de 42 años diagnosticado de pancoliti... |
| 2 | S1130-05582017000100037-1 | Varón de 72 años, sin antecedentes médicos de ... |
| 3 | S1139-76322016000300015-1 | Lactante de ocho días cuyos padres consultan p... |
| 4 | S0211-69952011000100019-1 | Mujer de 47 años de edad con antecedentes de g... |


In [6]:
df_text_train.raw_text[0]

'Un varón de 13 años es remitido para valoración oftalmológica por mala visión. Fenotípicamente era un niño de talla corta con una estatura de 133 cm, braquimorfia y braquidactilia en las cuatro extremidades.\nEl paciente presentaba un error refractivo corregido de -13,00 -6,50 a 1º en el ojo derecho y de -16,00-6,25 a 179º en el izquierdo. Con dicha corrección alcanzaba una agudeza visual de 0,4 y 0,2 respectivamente. No existía diplopía monocular ni hallazgos en la motilidad ocular extrínseca e intrínseca.\nEl diámetro corneal horizontal era de 12,0 mm en ambos ojos y la paquimetría de 613 y 611 micras respectivamente. La cámara anterior era estrecha, apreciándose iridofacodonesis bilateral. Se evidenció microesferofaquia con desplazamiento anterior de ambos cristalinos dentro de la cámara posterior.\nLa presión intraocular era de 20 mmHg bilateralmente. Gonioscópicamente se apreció un ángulo estrecho simétrico en ambos ojos grado II según Schaffer.\nLa exploración mediante topógrafo

We also load the CIE-Diagnóstico codes table:

In [7]:
df_codes_train = pd.read_table(corpus_path + "train/trainD.tsv", sep='\t', header=None)

In [8]:
df_codes_train.columns = ["doc_id", "code"]

In [9]:
df_codes_train.shape

(5639, 2)

In [10]:
df_codes_train.head()

| | doc_id | code |
|---|---|---|
| 0 | S0004-06142005000700014-1 | n44.8 |
| 1 | S0004-06142005000700014-1 | z20.818 |
| 2 | S0004-06142005000700014-1 | r60.9 |
| 3 | S0004-06142005000700014-1 | r52 |
| 4 | S0004-06142005000700014-1 | a23.9 |


In [11]:
len(set(df_codes_train["doc_id"]))

500

### Development corpus

In [12]:
%%time
dev_path = corpus_path + "dev/text_files/"
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f)]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 5.48 ms, sys: 155 µs, total: 5.64 ms
Wall time: 5.46 ms


In [13]:
df_text_dev.shape

(250, 2)

In [14]:
df_text_dev.head()

| | doc_id | raw_text |
|---|---|---|
| 0 | S1130-63432016000600013-1 | Varón de 75 años, con antecedentes de hiperuri... |
| 1 | S0365-66912003000600010-1 | Paciente de 33 años de edad, gestante de 34 se... |
| 2 | S0211-69952012000200030-1 | Mujer de 67 años con múltiples factores de rie... |
| 3 | S0365-66912004000900009-1 | Paciente de 55 años que acudió a urgencias por... |
| 4 | S1139-76322016000300016-2 | Lactante de 1 mes y 29 días, sin antecedentes ... |


In [15]:
df_text_dev.raw_text[0]

'Varón de 75 años, con antecedentes de hiperuricemia en tratamiento con Alopurinol que ingresa para realización de resección transuretral de próstata.\nPostoperatorio inmediato sin incidencias con tratamiento con Pantoprazol, Ciprofloxacino, Paracetamol, Enantyum y Alopurinol. Al cuarto día de postoperatorio presenta mareos, temblor con componente mioclónico en extremidades y tronco e incapacidad para caminar, sin verse alteraciones analíticas. En esta situación se pauta Rivotril y se suspende el tratamiento con Ciprofloxacino, desapareciendo la clínica mioclónica y mejorando el estado del paciente, por lo que se decide el alta hospitalaria.\n\n'

We also load the CIE-Diagnóstico codes table:

In [16]:
df_codes_dev = pd.read_table(corpus_path + "dev/devD.tsv", sep='\t', header=None)

In [17]:
df_codes_dev.columns = ["doc_id", "code"]

In [18]:
df_codes_dev.shape

(2677, 2)

In [19]:
df_codes_dev.head()

| | doc_id | code |
|---|---|---|
| 0 | S0004-06142005000900016-1 | q62.11 |
| 1 | S0004-06142005000900016-1 | n28.89 |
| 2 | S0004-06142005000900016-1 | n39.0 |
| 3 | S0004-06142005000900016-1 | r31.9 |
| 4 | S0004-06142005000900016-1 | n23 |


In [20]:
len(set(df_codes_dev["doc_id"]))

250

### Train-Dev abstracts corpus

From the additional abstracts corpus, we load the text from the abstracts containing CIE-Diagnóstico codes which are present either in the training or development CodiEsp-D corpora:

In [21]:
%%time
df_text_all_abs = pd.read_table(abs_corpus_path + "train_dev_abstracts_text.tsv", sep='\t')

CPU times: user 1.34 s, sys: 96.3 ms, total: 1.44 s
Wall time: 1.44 s


In [22]:
df_text_all_abs.shape

(100397, 3)

We keep only the abstracts whose subtoken sequence fits within the maximum input sequence length used by the mBERT model (128 subtokens):

In [23]:
all_abs_doc_one_frag = set(pd.read_table(abs_corpus_path + "all_abstracts_seq_len_mbert_128.tsv", 
                                               sep='\t', header=None)[0])

In [24]:
len(all_abs_doc_one_frag)

17723

In [25]:
df_text_all_abs = df_text_all_abs[df_text_all_abs["doc_id"].isin(all_abs_doc_one_frag)]

In [26]:
df_text_all_abs.shape

(11515, 3)
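The length filter above relies on a precomputed list of document IDs, whose generation is not shown in this notebook. As a rough sketch of the idea, assuming the limit counts the `[CLS]` and `[SEP]` special tokens (an assumption), a document fits in a single fragment when its subtoken count plus two does not exceed the sequence length. The function name and toy documents below are hypothetical.

```python
def fits_in_one_fragment(tokens, seq_len=128):
    """Return True if the subtokens plus [CLS] and [SEP] fit in seq_len positions."""
    return len(tokens) + 2 <= seq_len

# Toy documents: doc_id -> list of subtokens (contents are irrelevant here)
docs = {"a": ["tok"] * 126, "b": ["tok"] * 127, "c": ["tok"] * 10}

# Set of IDs that fit, playing the role of all_abs_doc_one_frag
one_frag_ids = {doc_id for doc_id, toks in docs.items() if fits_in_one_fragment(toks)}

# Membership filter, analogous to DataFrame.isin on the doc_id column
kept = [doc_id for doc_id in docs if doc_id in one_frag_ids]
print(kept)
```

Keeping only single-fragment abstracts means each abstract maps to exactly one input row, so its document-level codes can be used directly as the row's labels.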

In [27]:
df_text_all_abs.head()

| | doc_id | raw_text | punc_text |
|---|---|---|---|
| 38 | biblio-1000756 | Este libro es el resultado de un trabajo minuc... | Este libro es el resultado de un trabajo minuc... |
| 48 | biblio-1002637 | Se efectuó una revisión actualizada sobre el d... | Se efectuó una revisión actualizada sobre el d... |
| 105 | biblio-1005037 | La microlitiasis testicular (TM) es una patolo... | La microlitiasis testicular TM es una patologí... |
| 115 | biblio-1005118 | Se proponen algunas consideraciones teóricas s... | Se proponen algunas consideraciones teóricas s... |
| 144 | biblio-1005452 | Objetivo: entender la relación entre la depres... | Objetivo entender la relación entre la depresi... |


We also load the CIE-Diagnóstico codes from the previously loaded abstracts:

In [28]:
df_codes_d_all_abs = pd.read_table(abs_corpus_path + "train_dev_abstracts_codes.tsv", sep='\t', 
                                   header=None)

In [29]:
df_codes_d_all_abs.columns = ["doc_id", "code"]

In [30]:
df_codes_d_all_abs = df_codes_d_all_abs[df_codes_d_all_abs["doc_id"].isin(all_abs_doc_one_frag)]

In [31]:
df_codes_d_all_abs.shape

(15725, 2)

In [32]:
df_codes_d_all_abs.head()

| | doc_id | code |
|---|---|---|
| 1 | lil-286177 | i82.40 |
| 2 | lil-286177 | i82.90 |
| 10 | lil-506160 | q03.1 |
| 45 | lil-176866 | g51.0 |
| 46 | lil-176866 | r29.810 |


We concatenate the code dataframes of the CodiEsp training and development sets with the code dataframe of the abstracts:

In [33]:
df_codes_train_dev_abs = pd.concat([df_codes_train, df_codes_dev, df_codes_d_all_abs])

In [34]:
df_codes_train_dev_abs.shape

(24041, 2)

In [35]:
df_codes_train_dev_abs.head()

| | doc_id | code |
|---|---|---|
| 0 | S0004-06142005000700014-1 | n44.8 |
| 1 | S0004-06142005000700014-1 | z20.818 |
| 2 | S0004-06142005000700014-1 | r60.9 |
| 3 | S0004-06142005000700014-1 | r52 |
| 4 | S0004-06142005000700014-1 | a23.9 |


## Creating corpora of annotated sentences

Leveraging the information available for the named-entity recognition and normalization (NER-N) CodiEsp-X task, we create both a training and a development corpus of sentences annotated with CIE-Diagnóstico codes.

First, we pre-process the NER-N diagnosis-code annotations available for both the training and development corpora.

In [36]:
# Training corpus

In [37]:
%%time

codiesp_x_train = pd.read_table(corpus_path + "train/trainX.tsv", sep='\t', header=None)

CPU times: user 11.3 ms, sys: 281 µs, total: 11.5 ms
Wall time: 11.3 ms


In [38]:
codiesp_x_train.columns = ["doc_id", "type", "code", "word", "location"]

In [39]:
codiesp_x_train.shape

(9181, 5)

In [40]:
codiesp_x_train.head()

| | doc_id | type | code | word | location |
|---|---|---|---|---|---|
| 0 | S0004-06142005000700014-1 | PROCEDIMIENTO | bw03zzz | Rx tórax | 2163 2171 |
| 1 | S0004-06142005000700014-1 | PROCEDIMIENTO | 3e02329 | Estreptomicina intramuscular | 2787 2801;2810 2823 |
| 2 | S0004-06142005000700014-1 | DIAGNOSTICO | n44.8 | teste derecho aumentado de tamaño | 1343 1376 |
| 3 | S0004-06142005000700014-1 | DIAGNOSTICO | z20.818 | exposición a Brucella | 594 615 |
| 4 | S0004-06142005000700014-1 | DIAGNOSTICO | r60.9 | edemas | 1250 1256 |


In [41]:
codiesp_x_train = codiesp_x_train[codiesp_x_train["type"] == "DIAGNOSTICO"]

In [42]:
codiesp_x_train.shape

(7209, 5)

In [43]:
df_codes_train_ner = process_ner_labels(codiesp_x_train).sort_values(["doc_id", "start", "end"])

In [44]:
df_codes_train_ner.head()

| | doc_id | type | code | word | start | end |
|---|---|---|---|---|---|---|
| 3 | S0004-06142005000700014-1 | DIAGNOSTICO | r52 | dolores | 78 | 85 |
| 13 | S0004-06142005000700014-1 | DIAGNOSTICO | m25.50 | dolores osteoarticulares | 78 | 102 |
| 9 | S0004-06142005000700014-1 | DIAGNOSTICO | r50.9 | fiebre | 147 | 153 |
| 14 | S0004-06142005000700014-1 | DIAGNOSTICO | a23.9 | brucella | 360 | 368 |
| 10 | S0004-06142005000700014-1 | DIAGNOSTICO | r50.9 | síndrome febril | 534 | 549 |


In [45]:
df_codes_train_ner.shape

(8272, 6)
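The implementation of `process_ner_labels` lives in `nlp_utils` and is not shown here. The growth in row count (7209 to 8272) and the new `start`/`end` columns are consistent with each `location` string being split into one row per character span. The following is a minimal sketch of that idea under this assumption; `split_location` and `expand_annotations` are hypothetical names, and the real helper may differ (for example, in how it fills the `word` column).

```python
def split_location(location):
    """Parse a 'start end;start end' string into a list of (start, end) tuples."""
    spans = []
    for part in location.split(";"):
        start, end = part.split()
        spans.append((int(start), int(end)))
    return spans

def expand_annotations(rows):
    """Emit one row per character span, sorted by document and position."""
    out = []
    for row in rows:
        for start, end in split_location(row["location"]):
            out.append({"doc_id": row["doc_id"], "code": row["code"],
                        "start": start, "end": end})
    return sorted(out, key=lambda r: (r["doc_id"], r["start"], r["end"]))

# Toy annotations: the second one is discontinuous (two spans)
rows = [
    {"doc_id": "doc-1", "code": "n44.8", "location": "1343 1376"},
    {"doc_id": "doc-1", "code": "r60.9", "location": "100 110;120 130"},
]
expanded = expand_annotations(rows)
print(len(expanded), expanded[0])
```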

In [46]:
# Development corpus

In [47]:
%%time

codiesp_x_dev = pd.read_table(corpus_path + "dev/devX.tsv", sep='\t', header=None)

CPU times: user 6.85 ms, sys: 209 µs, total: 7.06 ms
Wall time: 6.82 ms


In [48]:
codiesp_x_dev.columns = ["doc_id", "type", "code", "word", "location"]

In [49]:
codiesp_x_dev.shape

(4477, 5)

In [50]:
codiesp_x_dev.head()

| | doc_id | type | code | word | location |
|---|---|---|---|---|---|
| 0 | S0004-06142005000900016-1 | PROCEDIMIENTO | bt41zzz | ecografía renal derecha | 307 316;348 361 |
| 1 | S0004-06142005000900016-1 | PROCEDIMIENTO | ct13 | gammagrafía renal | 739 756 |
| 2 | S0004-06142005000900016-1 | DIAGNOSTICO | q62.11 | estenosis en la unión pieloureteral derecha | 540 583 |
| 3 | S0004-06142005000900016-1 | DIAGNOSTICO | n28.89 | ectasia pielocalicial | 326 347 |
| 4 | S0004-06142005000900016-1 | DIAGNOSTICO | n39.0 | infecciones del tracto urinario | 198 229 |


In [51]:
codiesp_x_dev = codiesp_x_dev[codiesp_x_dev["type"] == "DIAGNOSTICO"]

In [52]:
codiesp_x_dev.shape

(3431, 5)

In [53]:
df_codes_dev_ner = process_ner_labels(codiesp_x_dev).sort_values(["doc_id", "start", "end"])

In [54]:
df_codes_dev_ner.head()

| | doc_id | type | code | word | start | end |
|---|---|---|---|---|---|---|
| 11 | S0004-06142005000900016-1 | DIAGNOSTICO | k26.9 | ulcus duodenal | 37 | 51 |
| 14 | S0004-06142005000900016-1 | DIAGNOSTICO | k59.00 | estreñimiento | 54 | 67 |
| 4 | S0004-06142005000900016-1 | DIAGNOSTICO | n23 | dolor en fosa renal | 85 | 104 |
| 5 | S0004-06142005000900016-1 | DIAGNOSTICO | n28.0 | crisis renoureteral | 128 | 147 |
| 13 | S0004-06142005000900016-1 | DIAGNOSTICO | n20.0 | nefrolitiasis | 168 | 181 |


In [55]:
df_codes_dev_ner.shape

(3947, 6)

Now, using the character start-end positions of each sentence from the CodiEsp corpus (see `datasets/CodiEsp-Sentence-Split.ipynb`), we annotate the sentences with CIE-Diagnóstico codes. Using the mBERT tokenizer, each sentence is also converted into a sequence of subwords, which are then converted into vocabulary indices (input IDs) and segment arrays (the BERT input tensors). We also generate a *fragments* dataset indicating the number of annotated sentences produced for each document.

In [56]:
# Sentence-Split information
ss_corpus_path = "../datasets/CodiEsp-SSplit-text/"
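The sentence annotation step is performed by `ss_create_frag_input_data_bert`, whose source is not included in this notebook. As an illustration only, one plausible rule is to assign a code to every sentence whose character span overlaps the annotation span. The sketch below assumes that overlap rule; `annotate_sentences` is a hypothetical name and the spans are illustrative values.

```python
def annotate_sentences(sent_spans, annotations):
    """For each (start, end) sentence span, collect the codes of overlapping annotations."""
    labels = []
    for s_start, s_end in sent_spans:
        codes = {code for a_start, a_end, code in annotations
                 if a_start < s_end and a_end > s_start}  # half-open interval overlap
        labels.append(sorted(codes))
    return labels

# Illustrative sentence spans and (start, end, code) annotations
sent_spans = [(0, 213), (214, 285), (286, 318)]
annotations = [(80, 86, "r50.9"), (181, 212, "n50.9")]

print(annotate_sentences(sent_spans, annotations))
```

Sentences with no overlapping annotation receive an empty label set, which is why many fragments in the inspection cells further down decode to `()`.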

### Training corpus

In [57]:
label_list = list(df_codes_train_dev_abs["code"])

In [58]:
len(label_list)

24041

In [59]:
len(set(label_list))

2194

In [60]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb_encoder = MultiLabelBinarizer()
mlb_encoder.fit([label_list])

MultiLabelBinarizer(classes=None, sparse_output=False)

In [61]:
# Number of distinct codes
num_labels = len(mlb_encoder.classes_)

In [62]:
num_labels

2194
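To make the label encoding concrete, here is a plain-Python sketch of multi-hot encoding over a sorted set of codes. It mirrors what a fitted label binarizer produces but is not the scikit-learn implementation; `fit_classes`, `encode` and `decode` are hypothetical helper names.

```python
def fit_classes(label_list):
    """Sorted list of distinct codes; the position of a code is its output index."""
    return sorted(set(label_list))

def encode(codes, classes):
    """Multi-hot vector with a 1 at the index of every code present."""
    index = {code: i for i, code in enumerate(classes)}
    row = [0] * len(classes)
    for code in codes:
        row[index[code]] = 1
    return row

def decode(row, classes):
    """Inverse mapping: tuple of the codes whose position is set."""
    return tuple(code for code, flag in zip(classes, row) if flag)

classes = fit_classes(["n50.9", "r50.9", "n50.9", "d49.59"])
row = encode(["n50.9", "r50.9"], classes)
print(classes, row, decode(row, classes))
```

Each fragment therefore becomes a vector of length `num_labels`, and a fragment without any code is the all-zeros vector.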

Only training texts that are annotated with CIE-Diagnóstico codes are considered:

In [63]:
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner["doc_id"]))

0

In [64]:
train_doc_list = sorted(set(df_codes_train_ner["doc_id"]))

In [65]:
len(train_doc_list)

500

In [66]:
# Sentence-Split data

In [67]:
%%time
ss_sub_corpus_path = ss_corpus_path + "train/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 13 ms, sys: 4.1 ms, total: 17.1 ms
Wall time: 16.9 ms


In [68]:
%%time
train_ind, train_seg, train_y, train_frag, train_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 500/500 [00:15<00:00, 18.45it/s]

CPU times: user 15.9 s, sys: 54.8 ms, total: 16 s
Wall time: 15.9 s





In [69]:
# Sanity check

In [70]:
train_ind.shape

(7753, 128)

In [71]:
train_seg.shape

(7753, 128)

In [72]:
train_y.shape

(7753, 2194)

In [73]:
len(train_frag)

500

In [74]:
len(train_start_end_frag)

7753
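Every row of `train_ind` and `train_seg` has exactly `SEQ_LEN` positions. A simplified sketch of how one fragment could be laid out is shown below: `[CLS]`, the subtokens, `[SEP]`, then padding, with an all-zero segment array because each input is a single sentence. The toy vocabulary and the truncation of overlong inputs are assumptions made for illustration; they do not reproduce the notebook's helper or the real mBERT vocabulary IDs.

```python
def encode_fragment(tokens, vocab, seq_len):
    """Build fixed-length input IDs and segment IDs for a single-sentence input."""
    # Truncation is a simplification for this sketch
    tokens = ["[CLS]"] + tokens[: seq_len - 2] + ["[SEP]"]
    ids = [vocab[t] for t in tokens]
    ids += [vocab["[PAD]"]] * (seq_len - len(ids))  # right-pad to seq_len
    segments = [0] * seq_len                        # one segment only
    return ids, segments

# Toy vocabulary (hypothetical IDs)
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "Pac": 3, "##iente": 4}

ids, segments = encode_fragment(["Pac", "##iente"], vocab, seq_len=8)
print(ids, segments)
```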

In [75]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    500.000000
mean      15.506000
std        7.682709
min        2.000000
25%       10.000000
50%       14.000000
75%       19.000000
max       54.000000
dtype: float64
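The fragment counts are what link documents to rows of the input arrays: the rows of a document start at the sum of the counts of all preceding documents, which is the same arithmetic used by the inspection cells in this notebook. A small sketch with a hypothetical helper name:

```python
def fragment_rows(frag_counts, doc_index):
    """Row indices (in the stacked fragment arrays) that belong to one document."""
    start = sum(frag_counts[:doc_index])
    return range(start, start + frag_counts[doc_index])

frag_counts = [2, 3, 4]  # toy example: three documents
print(list(fragment_rows(frag_counts, 1)))
```

This only works if the fragment arrays were built in the same document order as the counts, which is why the document list is sorted once and reused.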

In [76]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [77]:
check_id

73

In [78]:
train_doc_list[check_id]

'S0210-48062007001000004-1'

In [79]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Paciente de 25 años que, durante ingreso en el servicio de Medicina Interna por fiebre vespertina de 3 días de evolución, se le descubre incidentalmente mediante estudio ecográfico tumoración en testículo derecho.\nComo antecedentes personales, refiere no tener alergias medicamentosas. Herniorrafia inguinal bilateral.\nRefiere tumoración indolora de un año de evolución que ha ido aumentando progresivamente.\nA la exploración física se palpa aumento del testículo derecho con tumoración indolora en polo superior.\nAnte la sospecha de tumor testicular se le realizan diferentes pruebas complementarias. En la Ecografía testicular, se aprecia tumoración en testículo derecho de 78 mm x 57 mm x 61 mm heterogénea, sólida compatible con tumor testicular primario. Teste izquierdo normal. Tomografía axial computerizada (TAC): tumoración testicular derecha con adenopatías retrocavas y paraaórticas > 2 y <5 cm. Lactato deshidrogenasa (LDH): 1890 UI/I; alfafetoproteína (AFP): 51 ng/ml; beta-gonadotr

In [80]:
check_id_frag = sum(train_frag[:check_id])
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([train_y[i]])), "\n")

[('n50.9', 'r50.9')] 

[()] 

[()] 

[()] 

[('n50.9',)] 

[('d49.59',)] 

[('d49.59', 'n50.9')] 

[()] 

[('n50.9', 'r59.0')] 

[()] 

[()] 

[('c62.91',)] 

[('n50.9',)] 

[('n50.9',)] 

[()] 

[()] 

[('n50.9',)] 

[()] 

[()] 

[()] 

[('r59.9',)] 

[('c62.90',)] 

[()] 

[()] 

[()] 

[('n13.30', 'r59.0')] 

[()] 

[()] 

[()] 



In [81]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i])))
    print("\n")

[('Pac', (0, 3)), ('##iente', (3, 8)), ('de', (9, 11)), ('25', (12, 14)), ('años', (15, 19)), ('que', (20, 23)), (',', (23, 24)), ('durante', (25, 32)), ('ingreso', (33, 40)), ('en', (41, 43)), ('el', (44, 46)), ('servicio', (47, 55)), ('de', (56, 58)), ('Medicina', (59, 67)), ('Inter', (68, 73)), ('##na', (73, 75)), ('por', (76, 79)), ('fie', (80, 83)), ('##bre', (83, 86)), ('ve', (87, 89)), ('##sper', (89, 93)), ('##tina', (93, 97)), ('de', (98, 100)), ('3', (101, 102)), ('días', (103, 107)), ('de', (108, 110)), ('evolución', (111, 120)), (',', (120, 121)), ('se', (122, 124)), ('le', (125, 127)), ('descubre', (128, 136)), ('incident', (137, 145)), ('##almente', (145, 152)), ('mediante', (153, 161)), ('estudio', (162, 169)), ('e', (170, 171)), ('##co', (171, 173)), ('##gráfico', (173, 180)), ('tumor', (181, 186)), ('##ación', (186, 191)), ('en', (192, 194)), ('test', (195, 199)), ('##ículo', (199, 204)), ('derecho', (205, 212)), ('.', (212, 213))]


[('Como', (214, 218)), ('ante', (21

In [82]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Pac ##iente de 25 años que , durante ingreso en el servicio de Medicina Inter ##na por fie ##bre ve ##sper ##tina de 3 días de evolución , se le descubre incident ##almente mediante estudio e ##co ##gráfico tumor ##ación en test ##ículo derecho . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Como ante ##cedent ##es personales , refiere no tener ale ##rgia ##s med ##ica ##mentos ##as . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P

In [83]:
# Fragment labels distribution
pd.Series(np.sum(train_y, axis=1)).describe()

count    7753.000000
mean        0.928415
std         1.339604
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        14.000000
dtype: float64

### Development corpus

Only development texts that are annotated with CIE-Diagnóstico codes are considered:

In [84]:
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner["doc_id"]))

0

In [85]:
dev_doc_list = sorted(set(df_codes_dev_ner["doc_id"]))

In [86]:
len(dev_doc_list)

250

In [87]:
%%time
ss_sub_corpus_path = ss_corpus_path + "dev/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 9.04 ms, sys: 18 µs, total: 9.06 ms
Wall time: 8.84 ms


In [88]:
%%time
dev_ind, dev_seg, dev_y, dev_frag, dev_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 250/250 [00:07<00:00, 32.26it/s]

CPU times: user 7.86 s, sys: 28.1 ms, total: 7.88 s
Wall time: 7.84 s





In [89]:
# Sanity check

In [90]:
dev_ind.shape

(4118, 128)

In [91]:
dev_seg.shape

(4118, 128)

In [92]:
dev_y.shape

(4118, 2194)

In [93]:
len(dev_frag)

250

In [94]:
len(dev_start_end_frag)

4118

In [95]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      16.472000
std        8.425031
min        4.000000
25%       11.000000
50%       15.000000
75%       21.000000
max       65.000000
dtype: float64

In [96]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [97]:
check_id

168

In [98]:
dev_doc_list[check_id]

'S1130-05582015000100004-1'

In [99]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Se presenta el caso de una mujer de 60 años de edad, no fumadora ni bebedora de alcohol, sin antecedentes clínicos de interés, que no refiere procedimientos médicos ni quirúrgicos relevantes y que acude a nuestro servicio para realizarse una exodoncia del primer premolar inferior izquierdo. En el momento del examen clínico intraoral se observa una lesión tumoral asintomática de 15 mm de diámetro en la mucosa bucal izquierda. Su color es rosa pálido similar al de la mucosa y de superficie lisa. En la mucosa bucal derecha se observa una lesión tumoral de 10 mm de diámetro. Esta lesión es hipocrómica comparada con la mucosa bucal y de superficie lisa. El diagnóstico provisional para ambas lesiones es un fibroma traumático asociado a una prótesis mal adaptada. En el momento de realizar la historia clínica la paciente no relató haber sido sometida a un tratamiento de infiltraciones faciales con biopolímeros. El método diagnóstico es una biopsia excisional y el tratamiento es la remoción qu

In [100]:
check_id_frag = sum(dev_frag[:check_id])
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([dev_y[i]])), "\n")

[('f10.10', 'f17.200')] 

[('k13.79',)] 

[()] 

[('k13.79',)] 

[('k13.79',)] 

[('k13.79',)] 

[()] 

[()] 

[()] 

[()] 

[('k13.79',)] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 



In [101]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i])))
    print("\n")

[('Se', (0, 2)), ('presenta', (3, 11)), ('el', (12, 14)), ('caso', (15, 19)), ('de', (20, 22)), ('una', (23, 26)), ('mujer', (27, 32)), ('de', (33, 35)), ('60', (36, 38)), ('años', (39, 43)), ('de', (44, 46)), ('edad', (47, 51)), (',', (51, 52)), ('no', (53, 55)), ('fu', (56, 58)), ('##mad', (58, 61)), ('##ora', (61, 64)), ('ni', (65, 67)), ('be', (68, 70)), ('##bed', (70, 73)), ('##ora', (73, 76)), ('de', (77, 79)), ('alcohol', (80, 87)), (',', (87, 88)), ('sin', (89, 92)), ('ante', (93, 97)), ('##cedent', (97, 103)), ('##es', (103, 105)), ('c', (106, 107)), ('##lín', (107, 110)), ('##icos', (110, 114)), ('de', (115, 117)), ('interés', (118, 125)), (',', (125, 126)), ('que', (127, 130)), ('no', (131, 133)), ('refiere', (134, 141)), ('pro', (142, 145)), ('##ced', (145, 148)), ('##imientos', (148, 156)), ('médicos', (157, 164)), ('ni', (165, 167)), ('qui', (168, 171)), ('##rú', (171, 173)), ('##rgi', (173, 176)), ('##cos', (176, 179)), ('relevante', (180, 189)), ('##s', (189, 190)), ('y

In [102]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Se presenta el caso de una mujer de 60 años de edad , no fu ##mad ##ora ni be ##bed ##ora de alcohol , sin ante ##cedent ##es c ##lín ##icos de interés , que no refiere pro ##ced ##imientos médicos ni qui ##rú ##rgi ##cos relevante ##s y que acu ##de a nuestro servicio para realizar ##se una ex ##odon ##cia del primer pre ##mol ##ar inferior izquierdo . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] En el momento del examen c ##lín ##ico intra ##oral se observa una lesión tumor ##al asi ##nto ##mática de 15 mm de diámetro en la mu ##cosa bu ##cal izquierda . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

In [103]:
# Fragment labels distribution
pd.Series(np.sum(dev_y, axis=1)).describe()

count    4118.000000
mean        0.830986
std         1.249374
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max        14.000000
dtype: float64

### Train-Dev abstracts corpus

In [104]:
len(set(df_text_all_abs["doc_id"]) - set(df_codes_d_all_abs["doc_id"]))

0

In [105]:
all_abs_doc_list = sorted(set(df_codes_d_all_abs["doc_id"]))

In [106]:
len(all_abs_doc_list)

11515

In [107]:
%%time
all_abs_ind, all_abs_seg, all_abs_y, all_abs_frag = create_frag_input_data_bert(df_text=df_text_all_abs, text_col=text_col, 
                                     df_label=df_codes_d_all_abs, doc_list=all_abs_doc_list, tokenizer=tokenizer, 
                                     lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 11515/11515 [00:24<00:00, 464.07it/s]


CPU times: user 25.2 s, sys: 106 ms, total: 25.3 s
Wall time: 25 s


In [108]:
all_abs_ind.shape

(11515, 128)

In [109]:
all_abs_seg.shape

(11515, 128)

In [110]:
all_abs_y.shape

(11515, 2194)

In [111]:
len(all_abs_frag)

11515

In [112]:
# Check n_frag distribution across texts
pd.Series(all_abs_frag).describe()

count    11515.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
dtype: float64

In [113]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(all_abs_doc_list), size=1)[0]

In [114]:
check_id

10019

In [115]:
all_abs_doc_list[check_id]

'lil-657011'

In [116]:
df_text_all_abs[df_text_all_abs["doc_id"] == all_abs_doc_list[check_id]][text_col].values[0]

'El tumor de Krukenberg (descrito por un médico alemán que tenía el mismo nombre) es un tumor de ovario que representa la metástasis de una tumoración primaria usualmente ubicada en el estómago. En este trabajo se realiza una breve reseña histórica y se presentan cinco casos manejados en nuestro servicio, con el fin de mostrar la complejidad de su diagnóstico, el abordaje terapéutico y el pésimo pronóstico que esta enfermedad tiene.'

In [117]:
check_id_frag = sum(all_abs_frag[:check_id])
for i in range(check_id_frag, check_id_frag + all_abs_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([all_abs_y[i]])), "\n")

[('c16.9',)] 



In [118]:
check_id_frag = sum(all_abs_frag[:check_id])
for frag in all_abs_ind[check_id_frag:check_id_frag + all_abs_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] El tumor de Kr ##uke ##nberg ( descrito por un médico alemán que tenía el mismo nombre ) es un tumor de ova ##rio que representa la met ##ást ##asis de una tumor ##ación primaria usualmente ubicada en el est ##óm ##ago . En este trabajo se realiza una breve res ##eña histórica y se presentan cinco casos man ##eja ##dos en nuestro servicio , con el fin de mostrar la com ##ple ##ji ##dad de su dia ##gnóstico , el abord ##aje ter ##ap ##éu ##tico y el pés ##imo pro ##nó ##stico que esta enfermedad tiene . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 



In [119]:
# Fragment labels distribution
pd.Series(np.sum(all_abs_y, axis=1)).describe()

count    11515.000000
mean         1.365610
std          0.731603
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max         10.000000
dtype: float64

### Training & Development & Abstracts corpus

We merge the previously generated datasets:

In [120]:
# Indices
train_dev_abs_ind = np.concatenate((train_ind, dev_ind, all_abs_ind))

In [121]:
train_dev_abs_ind.shape

(23386, 128)

In [122]:
# Segments
train_dev_abs_seg = np.concatenate((train_seg, dev_seg, all_abs_seg))

In [123]:
train_dev_abs_seg.shape

(23386, 128)

In [124]:
# y
train_dev_abs_y = np.concatenate((train_y, dev_y, all_abs_y))

In [125]:
train_dev_abs_y.shape

(23386, 2194)

## Fine-tuning

Using the corpus of labeled sentences, we fine-tune the model on a multi-label sentence classification task.

In [126]:
from keras.backend.tensorflow_backend import set_session

# Prevent GPU memory allocation problems
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

In [127]:
from keras_bert import load_trained_model_from_checkpoint

model = load_trained_model_from_checkpoint(
    config_file=config_path, 
    checkpoint_file=checkpoint_path, 
    training=training,                                       
    trainable=trainable, 
    seq_len=SEQ_LEN
)

In [128]:
model.inputs

[<tf.Tensor 'Input-Token:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Segment:0' shape=(?, 128) dtype=float32>]

In [129]:
model.outputs

[<tf.Tensor 'Encoder-12-FeedForward-Norm/add_1:0' shape=(?, 128, 768) dtype=float32>]

In [130]:
from keras.layers import Dense, Activation
from keras.models import Model
from keras.initializers import glorot_uniform
from keras_bert.layers import Extract

dense_cls = Extract(index=0, name='Extract')(model.output) # In order to extract CLS token embedding
dense_out = Dense(units=num_labels, kernel_initializer=glorot_uniform(seed=random_seed))(dense_cls) # Multi-label classification
outputs = Activation('sigmoid')(dense_out)

model = Model(model.inputs, outputs)

In [131]:
model.inputs

[<tf.Tensor 'Input-Token:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Segment:0' shape=(?, 128) dtype=float32>]

In [132]:
model.outputs

[<tf.Tensor 'activation_1/Sigmoid:0' shape=(?, 2194) dtype=float32>]
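The classification head applies a sigmoid to each of the output units independently, so every code gets its own probability and the probabilities need not sum to one. The following plain-Python sketch shows the sigmoid and a per-sample binary cross-entropy averaged over labels; it is meant to illustrate the arithmetic, not to replicate the Keras implementation exactly (the clipping constant is an assumption).

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one sample."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

logits = [2.0, 2.0, -2.0]
probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy([1, 1, 0], probs)
print(probs, loss)
```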

In [133]:
# Sample weights

In [134]:
n_train_dev_frags = sum(train_frag) + sum(dev_frag)

In [135]:
n_train_dev_frags

11871

In [136]:
train_dev_abs_weights = np.array([train_weight] * n_train_dev_frags + [all_abs_weight] * (train_dev_abs_y.shape[0] - 
                                                                                      n_train_dev_frags))
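The weight vector gives each CodiEsp fragment `train_weight` times the influence of an abstract on the loss. The sketch below shows how such a vector is laid out and how weights scale per-sample losses; the exact normalization that Keras applies to weighted losses is not reproduced here, and the helper names are hypothetical.

```python
def build_sample_weights(n_gold, n_total, gold_weight, other_weight):
    """Weights for the first n_gold samples, then for the remaining ones."""
    return [gold_weight] * n_gold + [other_weight] * (n_total - n_gold)

def weighted_mean_loss(losses, weights):
    """Average of per-sample losses after scaling each by its weight."""
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)

weights = build_sample_weights(n_gold=2, n_total=5, gold_weight=4, other_weight=1)
print(weights, weighted_mean_loss([0.5] * 5, weights))
```

The layout depends on the concatenation order used earlier (training, development, abstracts): the first 11871 rows are CodiEsp fragments and the remaining 11515 rows are abstracts.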

In [None]:
%%time
from keras_radam import RAdam

model.compile(
    optimizer=RAdam(learning_rate=LR),
    loss='binary_crossentropy'
)

history = model.fit(
    x=[train_dev_abs_ind, train_dev_abs_seg],
    y=train_dev_abs_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    sample_weight=train_dev_abs_weights,
    shuffle=True
)

Epoch 1/34
Epoch 2/34

## Test set predictions

Finally, the predictions made by the model on the test set are saved. To this end, each sentence from the test corpus is first converted into a sequence of subwords (input IDs and segment arrays). The sentence-level predictions made by the model are then saved, to be evaluated later at the document level (see `results/CodiEsp-D/Evaluation.ipynb`).

In [137]:
%%time
test_path = corpus_path + "test/text_files/"
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f)]
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 3.67 ms, sys: 3.93 ms, total: 7.6 ms
Wall time: 6.56 ms


In [138]:
df_text_test.shape

(250, 2)

In [139]:
df_text_test.head()

| | doc_id | raw_text |
|---|---|---|
| 0 | S0365-66912007000900014-1 | Paciente varón de 34 años de edad diagnosticad... |
| 1 | S0211-69952014000200012-1 | Un varón de 48 años, de raza caucásica, con IR... |
| 2 | S1139-76322017000200009-1 | Presentamos el caso clínico de un niño de cinc... |
| 3 | S0210-48062010000100019-1 | Paciente varón de 53 años, diagnosticado de es... |
| 4 | S1130-14732005000500006-1 | Se trata de un varón de 20 años diagnosticado ... |


In [140]:
df_text_test.raw_text[0]

'Paciente varón de 34 años de edad diagnosticado de varicela tres semanas antes ya resuelta sin complicaciones. Acude a urgencias por presentar disminución de agudeza visual en su ojo izquierdo.\nEn la exploración oftalmológica presenta una agudeza visual corregida de 1 en el ojo derecho (OD) y de 0,6 en el ojo izquierdo (OI). El estudio con lámpara de hendidura demuestra en el OI un tyndall celular de 4+, precipitados queráticos inferiores (3+) y sin presentar la cornea tinción con fluoresceína, siendo normal el OD. La presión intraocular fue de 16mmHg en ambos ojos.\nEn la exploración fundoscópica inicial del OI se aprecia leve vitritis (1+) sin focos de retinitis.\nSe instaura tratamiento tópico con corticoides y midriáticos. A los 2 días se observa leve disminución del tyndall celular (3+) en cámara anterior pero en fondo de ojo aparece un foco periférico de retinitis necrotizante en el área temporal asociado a vasculitis retiniana.\nSe ingresa al paciente y se instaura tratamiento

In [141]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [142]:
len(test_doc_list)

250

In [143]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 2.39 ms, sys: 7.95 ms, total: 10.3 ms
Wall time: 9.41 ms


In [144]:
%%time
test_ind, test_seg, _, test_frag, _ = ss_create_frag_input_data_bert(df_text=df_text_test, 
                                                  text_col=text_col,
                                                  # Since labels are ignored, we pass df_codes_train_ner as df_ann
                                                  df_ann=df_codes_train_ner, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 250/250 [00:01<00:00, 168.13it/s]

CPU times: user 1.54 s, sys: 36.5 ms, total: 1.58 s
Wall time: 1.53 s





In [113]:
%%time
test_preds = model.predict([test_ind, test_seg])

CPU times: user 14.9 s, sys: 1.68 s, total: 16.6 s
Wall time: 16.9 s


In [114]:
test_preds.shape

(3950, 727)

In [145]:
results_dir_path = "../results/CodiEsp-D/"

In [178]:
%%time
np.save(file=results_dir_path + "predictions/mbert_galen_seed_" + str(random_seed) + "_test_preds.npy", arr=test_preds)

CPU times: user 2.21 ms, sys: 4.65 ms, total: 6.87 ms
Wall time: 6.02 ms
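The saved array holds one row of probabilities per sentence, so the evaluation has to combine rows that belong to the same document. How the evaluation notebook does this is not shown here; one common option, assumed purely for illustration, is to take the maximum probability per code across a document's fragments, using the per-document fragment counts (`test_frag`) to group rows. The function name is hypothetical.

```python
def max_pool_by_document(frag_probs, frag_counts):
    """Collapse fragment-level probabilities into one row per document (max per label)."""
    doc_probs = []
    start = 0
    for n in frag_counts:
        rows = frag_probs[start:start + n]
        doc_probs.append([max(col) for col in zip(*rows)])  # column-wise maximum
        start += n
    return doc_probs

# Toy example: three fragments, two labels, two documents (2 + 1 fragments)
frag_probs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.4]]
print(max_pool_by_document(frag_probs, [2, 1]))
```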
