# Fine-tuning mBERT-Galén on Cantemist-Coding

In this notebook, following a multi-label sequence classification approach, the mBERT-Galén model is fine-tuned on both the training and development sets of the Cantemist-Coding corpus. The model's predictions on the test set are also saved, so that its clinical coding performance can be evaluated further (see `results/Cantemist-Coding/Evaluation.ipynb`).

In [1]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf

# Auxiliary components (helper functions used throughout the notebook)
import sys
sys.path.append("..")
from nlp_utils import *

# mBERT tokenizer
from keras_bert import load_vocabulary, Tokenizer
model_path = "multi_cased_L-12_H-768_A-12/"
config_path = model_path + "bert_config.json"
checkpoint_path = model_path + "mBERT-Galen/model.ckpt-1000000"
vocab_file = "vocab.txt"
tokenizer = Tokenizer(token_dict=load_vocabulary(model_path + vocab_file), pad_index=0, cased=True)

# Hyper-parameters
text_col = "raw_text"
training = False
trainable = True
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 18
LR = 3e-5

random_seed = 0
tf.set_random_seed(random_seed)

Using TensorFlow backend.


## Load text

First, all text files from the training and development Cantemist corpora are loaded into separate dataframes.

The CIE-O codes are loaded as well.

In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-coding/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train-set/" + sub_task_path + "txt/"
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f)]
n_train_files = len(train_files)
train_data = load_text_files(train_files, train_path)
dev1_path = corpus_path + "dev-set1/" + sub_task_path + "txt/"
train_files.extend([f for f in os.listdir(dev1_path) if os.path.isfile(dev1_path + f)])
train_data.extend(load_text_files(train_files[n_train_files:], dev1_path))
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 16.8 ms, sys: 92 µs, total: 16.9 ms
Wall time: 16.7 ms


In [4]:
df_text_train.shape

(751, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco453,"Anamnesis\nSe trata de un varón de 55 años, ex..."
1,cc_onco962,Anamnesis\nMujer de 40 años que consulta por l...
2,cc_onco989,"Anamnesis\nPaciente de 43 años, perimenopáusic..."
3,cc_onco187,"Anamnesis\nVarón de 72 años, exfumador y bebed..."
4,cc_onco164,"Anamnesis\nMujer de 51 años, sin alergias medi..."


In [6]:
df_text_train.raw_text[0]

'Anamnesis\nSe trata de un varón de 55 años, ex fumador con un índice tabáquico de 40 paquetes-año, HTA, sin antecedentes familiares de interés, en tratamiento con ácido fólico, omeprazol, hierro oral y risperidona.\nEn junio de 2017, es diagnosticado a raíz de una trombosis iliaca derecha de una masa tumoral pobremente diferenciada que infiltraba tercio distal de apéndice cecal, con obliteración del paquete vascular iliaco, así como infiltración del uréter derecho, condicionando ureterohidronefrosis derecha grado cuatro.\nSe realiza biopsia de la lesión, con hallazgos anatomopatológicos de neoplasia fusocelular con inmunohistoquímica (IHC) sugerente de origen urotelial con diferenciación sarcomatoide (positividad para p63, p40, GATA3, CD99, EMA CAM 5.2, CK 34, beta E12, CK7, negatividad para CK20, S100, CD34, cKIT y TTF1), con SYT no reordenado.\nSe lleva a cabo intervención en julio de 2017 con extirpación de la masa tumoral, resección ileocecal, dejando ileostomía terminal y bypass 

We also load the CIE-O codes table:

In [7]:
df_codes_train = pd.concat([pd.read_table(corpus_path + "train-set/" + sub_task_path + "train-coding.tsv", 
                                 sep='\t', header=0), 
                            pd.read_table(corpus_path + "dev-set1/" + sub_task_path + "dev1-coding.tsv", 
                                 sep='\t', header=0)])

In [8]:
df_codes_train.shape

(4142, 2)

In [9]:
df_codes_train["code"] = df_codes_train["code"].str.lower()

In [10]:
df_codes_train.head()

Unnamed: 0,file,code
0,cc_onco860,8140/3
1,cc_onco860,8000/6
2,cc_onco860,8140/6
3,cc_onco98,8000/6
4,cc_onco98,8070/3


In [11]:
df_codes_train[df_codes_train.duplicated(keep=False)]

Unnamed: 0,file,code
1862,cc_onco768,9080/1
1863,cc_onco768,9080/1


In [12]:
df_codes_train = df_codes_train[~df_codes_train.duplicated(keep='first')]

In [13]:
df_codes_train.shape

(4141, 2)

In [14]:
len(set(df_codes_train["file"]))

750

### Development corpus

In [15]:
%%time
dev_path = corpus_path + "dev-set2/" + sub_task_path + "txt/"
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f)]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 6.33 ms, sys: 0 ns, total: 6.33 ms
Wall time: 6.07 ms


In [16]:
df_text_dev.shape

(250, 2)

In [17]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco1183,Anamnesis\nVarón de 48 años que acude en agost...
1,cc_onco751,Anamnesis\nHombre de 71 años de edad con antec...
2,cc_onco1384,"Anamnesis\nVarón de 30 años, sin antecedentes ..."
3,cc_onco1208,Anamnesis\nVarón de 76 años diagnosticado en a...
4,cc_onco734,Anamnesis\nTras canalización del reservorio ce...


In [18]:
df_text_dev.raw_text[0]

'Anamnesis\nVarón de 48 años que acude en agosto de 2012 tras la realización de amputación en hallux derecho a la primera valoración por Oncología Médica.\nAntecedentes: no alérgicos. Patológicos: diabético hace 3 años sin tratamiento. HTA en tratamiento. Enolismo crónico.\nTrastorno afectivo bipolar. Quirúrgicos: apendicectomizado.\nEnfermedad actual: desde hace 2 años presentaba una lesión hiperpigmentada en el hallux del pie derecho, que ocasionalmente sangraba.\n\nExamen físico\nMuñón de hallux derecho en buen estado, presencia de adenopatías inguinales derechas. Resto de exploración sin alteraciones.\n\nPruebas complementarias\n- Biopsia cutánea: melanoma ulcerado.\n- AP de resección de melanoma: melanoma lentiginoso acral en fase de crecimiento vertical, ulcerado. Clark IV. Breslow 6,8 mm, sin infiltración linfovascular.\n- TC de tórax-abdomen-pelvis: sin signos concluyentes de extensión tóraco-abdómino-pélvica de melanoma.\nMicronódulos en el lóbulo superior derecho a valorar en

We also load the CIE-O codes table:

In [19]:
df_codes_dev = pd.read_table(corpus_path + "dev-set2/" + sub_task_path + "dev2-coding.tsv", 
                                 sep='\t', header=0)

In [20]:
df_codes_dev.shape

(1279, 2)

In [21]:
df_codes_dev["code"] = df_codes_dev["code"].str.lower()

In [22]:
df_codes_dev.head()

Unnamed: 0,file,code
0,cc_onco1258,8000/6
1,cc_onco1258,8010/3
2,cc_onco1258,8010/3/h
3,cc_onco1258,8520/3
4,cc_onco1258,8000/3


In [23]:
len(set(df_codes_dev["file"]))

250

We join the training and development Cantemist codes dataframes together:

In [24]:
df_codes_train_dev = pd.concat([df_codes_train, df_codes_dev])

In [25]:
df_codes_train_dev.shape

(5420, 2)

In [26]:
df_codes_train_dev.head()

Unnamed: 0,file,code
0,cc_onco860,8140/3
1,cc_onco860,8000/6
2,cc_onco860,8140/6
3,cc_onco98,8000/6
4,cc_onco98,8070/3


## Creating corpora of annotated sentences

Leveraging the information available for the named-entity-recognition and normalization (NER-N) Cantemist-NORM task, we create both a training and a development corpus of annotated sentences with CIE-O codes.

Firstly, we pre-process the NER-N oncology-codes annotations available for both the training and development corpora.

In [27]:
# Training corpus

In [28]:
train_norm_path = corpus_path + "train-set/cantemist-norm/"
train_ann_files = [train_norm_path + f for f in os.listdir(train_norm_path) if f.split('.')[-1] == "ann"]
dev1_norm_path = corpus_path + "dev-set1/cantemist-norm/"
train_ann_files.extend([dev1_norm_path + f for f in os.listdir(dev1_norm_path) if f.split('.')[-1] == "ann"])

In [29]:
len(train_ann_files)

751

In [30]:
df_codes_train_ner = process_brat_labels(train_ann_files).sort_values(["doc_id", "start", "end"])

In [31]:
df_codes_train_ner.shape

(9737, 5)

In [32]:
df_codes_train_ner["code"] = df_codes_train_ner["code"].str.lower()

In [33]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,code,text_ref,start,end
5230,cc_onco1,8041/3,Carcinoma microcítico,2719,2740
5231,cc_onco1,8041/3,carcinoma microcítico,2950,2971
5232,cc_onco1,8000/6,M0,2988,2990
97,cc_onco10,8000/1,tumor,212,217
95,cc_onco10,8000/1,neoplasia,976,985


In [34]:
len(set(df_codes_train_ner["doc_id"]))

750

In [35]:
# Development corpus

In [36]:
dev_norm_path = corpus_path + "dev-set2/cantemist-norm/"
dev_ann_files = [dev_norm_path + f for f in os.listdir(dev_norm_path) if f.split('.')[-1] == "ann"]

In [37]:
len(dev_ann_files)

250

In [38]:
df_codes_dev_ner = process_brat_labels(dev_ann_files).sort_values(["doc_id", "start", "end"])

In [39]:
df_codes_dev_ner.shape

(2660, 5)

In [40]:
df_codes_dev_ner["code"] = df_codes_dev_ner["code"].str.lower()

In [41]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,code,text_ref,start,end
1852,cc_onco1001,8070/3,carcinoma epidermoide,576,597
1854,cc_onco1001,8000/1,neoplasia,790,799
1857,cc_onco1001,8140/6,adenocarcinoma T4N3M1b,836,858
1853,cc_onco1001,8000/6,enfermedad hepática,1205,1224
1855,cc_onco1001,8000/1,tumoral,2303,2310


In [42]:
len(set(df_codes_dev_ner["doc_id"]))

250

Now, using the character start-end positions of each sentence in the Cantemist corpus (see `datasets/Cantemist-Sentence-Split.ipynb`), we annotate the sentences with CIE-O codes. In addition, using the mBERT tokenizer, each sentence is converted into a sequence of subwords, which are then converted into vocabulary indices (input IDs) and segment arrays (the BERT input tensors). We also generate a *fragments* dataset that records the number of annotated sentences produced for each document.
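As a rough illustration of the annotation step, the sketch below assigns each code to the sentence whose character span contains the annotated mention, and records the number of sentences for the document. The function name `annotate_sentences`, the toy offsets and the containment rule are assumptions made for illustration; the notebook delegates the real work to `ss_create_frag_input_data_bert`, whose implementation is not shown here.

```python
def annotate_sentences(sent_spans, annotations):
    """Assign codes to sentences by character offsets.

    sent_spans: list of (start, end) character spans, one per sentence.
    annotations: list of (code, start, end) mention annotations.
    Returns one sorted list of codes per sentence.
    """
    labels = [set() for _ in sent_spans]
    for code, a_start, a_end in annotations:
        for i, (s_start, s_end) in enumerate(sent_spans):
            # Assumption: a mention belongs to the sentence that fully contains it
            if s_start <= a_start and a_end <= s_end:
                labels[i].add(code)
    return [sorted(codes) for codes in labels]


# Toy document with three sentences (hypothetical offsets)
sent_spans = [(0, 50), (51, 120), (121, 200)]
annotations = [("8500/3", 60, 85), ("8000/6", 100, 102), ("8000/1", 130, 135)]

sent_labels = annotate_sentences(sent_spans, annotations)
n_frag = len(sent_spans)  # entry of the "fragments" dataset for this document
print(sent_labels, n_frag)
```

Sentences without any mention keep an empty label list, which matches the many empty tuples seen when inspecting the encoded labels later in the notebook.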

In [43]:
# Sentence-Split information
ss_corpus_path = "../datasets/Cantemist-SSplit-text/"

### Training corpus

In [44]:
label_list = list(df_codes_train_dev["code"])

In [45]:
len(label_list)

5420

In [46]:
len(set(label_list))

743

In [47]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb_encoder = MultiLabelBinarizer()
mlb_encoder.fit([label_list])

MultiLabelBinarizer(classes=None, sparse_output=False)
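Fitting on `[label_list]` treats the full list of codes as a single sample, so the encoder learns the sorted set of distinct codes as its classes. The pure-Python sketch below mimics that behaviour on a handful of toy codes; `fit_classes` and `multi_hot` are hypothetical helpers written for illustration and are not part of scikit-learn.

```python
def fit_classes(samples):
    """Collect the sorted set of labels seen across all samples."""
    return sorted({label for sample in samples for label in sample})


def multi_hot(sample, classes):
    """Encode one sample (an iterable of labels) as a 0/1 vector over classes."""
    present = set(sample)
    return [1 if c in present else 0 for c in classes]


label_list = ["8140/3", "8000/6", "8140/6", "8000/6", "8070/3"]
classes = fit_classes([label_list])  # one sample holding every code
print(classes)
print(multi_hot(["8000/6", "8140/3"], classes))
print(multi_hot([], classes))  # a sentence without codes maps to all zeros
```

Repeated codes collapse into one class, which is why the 5420 code occurrences above yield only 743 distinct labels.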

In [48]:
# Number of distinct codes
num_labels = len(mlb_encoder.classes_)

In [49]:
num_labels

743

Only training texts that are annotated with CIE-O codes are considered:

In [50]:
# Number of training documents (texts) without annotations
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner["doc_id"]))

1

In [51]:
train_doc_list = sorted(set(df_codes_train_ner["doc_id"]))

In [52]:
len(train_doc_list)

750

In [53]:
# Sentence-Split data

In [54]:
%%time
ss_sub_corpus_path = ss_corpus_path + "training/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 28.9 ms, sys: 8.16 ms, total: 37 ms
Wall time: 36.8 ms


In [55]:
%%time
train_ind, train_seg, train_y, train_frag, train_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 750/750 [00:54<00:00, 14.62it/s]


CPU times: user 54.4 s, sys: 243 ms, total: 54.7 s
Wall time: 54.5 s


In [56]:
# Sanity check

In [57]:
train_ind.shape

(27708, 128)

In [58]:
train_seg.shape

(27708, 128)

In [59]:
train_y.shape

(27708, 743)

In [60]:
len(train_frag)

750

In [61]:
len(train_start_end_frag)

27708

In [62]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    750.000000
mean      36.944000
std       14.595647
min       11.000000
25%       27.000000
50%       34.500000
75%       44.000000
max      102.000000
dtype: float64

In [63]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [64]:
check_id

146

In [65]:
train_doc_list[check_id]

'cc_onco255'

In [66]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Anamnesis\nMujer de 66 años de edad, con antecedentes de hipertensión arterial e hipercolesterolemia, intervenida de tumorectomía en mama derecha en 1975 por patología benigna, incontinencia urinaria, varices, meniscopatía y miopía, en tratamiento habitual con losartán y atorvastatina. Diagnosticada en diciembre de 2015 de Carcinoma ductal infiltrante de mama derecha localmente avanzado E-IIB (T2N1M0) fenotipo HER2 positivo. RE+++. RP +. Ki 67 15%. Se decide quimioterapia (QT) neoadyuvante esquema Doxorrubicina liposomal 50 mg/m2 (día 1) + Paclitaxel 80 mg/m2 (día 1,8,15) + Trastuzumab 4 mg/kg (día 1) y 2 mg (día 8,15), con dosis de carga en ciclo 1. Tras recibir 2 ciclos completos de tratamiento, acudió a urgencias por cuadro febril de 3 días de evolución de hasta 39ºC, con leves molestias para deglutir y artromialgias genealizadas, sin tos ni expectoración ni otra semiología infecciosa por aparatos, con radiografía de tórax normal, siendo dada de alta con el diagnóstico de faringiti

In [67]:
check_id_frag = sum(train_frag[:check_id])
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([train_y[i]])), "\n")

[()] 

[('8000/6', '8500/3')] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[('8001/3',)] 

[()] 

[()] 

[()] 

[()] 

[()] 

[('8000/1',)] 
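The slice used above relies on fragments being stored document by document, so the first fragment of document `check_id` sits at the sum of the fragment counts of all preceding documents. A small sketch with toy counts makes the indexing explicit; the helper name `doc_fragment_range` is an assumption introduced here.

```python
from itertools import accumulate


def doc_fragment_range(frag_counts, doc_idx):
    """Return the (start, stop) row range of a document's fragments."""
    # offsets[i] is the index of the first fragment of document i
    offsets = [0] + list(accumulate(frag_counts))
    return offsets[doc_idx], offsets[doc_idx + 1]


frag_counts = [3, 2, 4]  # toy stand-in for train_frag
start, stop = doc_fragment_range(frag_counts, 1)
print(start, stop)
```

Precomputing the offsets once avoids re-summing the prefix on every lookup, which may be convenient when many documents are inspected.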



In [68]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i])))
    print("\n")

[('Ana', (0, 3)), ('##mne', (3, 6)), ('##sis', (6, 9)), ('Mujer', (10, 15)), ('de', (16, 18)), ('66', (19, 21)), ('años', (22, 26)), ('de', (27, 29)), ('edad', (30, 34)), (',', (34, 35)), ('con', (36, 39)), ('ante', (40, 44)), ('##cedent', (44, 50)), ('##es', (50, 52)), ('de', (53, 55)), ('hip', (56, 59)), ('##erten', (59, 64)), ('##sión', (64, 68)), ('arteria', (69, 76)), ('##l', (76, 77)), ('e', (78, 79)), ('hip', (80, 83)), ('##er', (83, 85)), ('##coles', (85, 90)), ('##tero', (90, 94)), ('##lem', (94, 97)), ('##ia', (97, 99)), (',', (99, 100)), ('inter', (101, 106)), ('##veni', (106, 110)), ('##da', (110, 112)), ('de', (113, 115)), ('tumor', (116, 121)), ('##ect', (121, 124)), ('##om', (124, 126)), ('##ía', (126, 128)), ('en', (129, 131)), ('mama', (132, 136)), ('derecha', (137, 144)), ('en', (145, 147)), ('1975', (148, 152)), ('por', (153, 156)), ('pat', (157, 160)), ('##ología', (160, 166)), ('beni', (167, 171)), ('##gna', (171, 174)), (',', (174, 175)), ('in', (176, 178)), ('##c

In [69]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Ana ##mne ##sis Mujer de 66 años de edad , con ante ##cedent ##es de hip ##erten ##sión arteria ##l e hip ##er ##coles ##tero ##lem ##ia , inter ##veni ##da de tumor ##ect ##om ##ía en mama derecha en 1975 por pat ##ología beni ##gna , in ##con ##tinen ##cia uri ##nari ##a , vari ##ces , meni ##sco ##pat ##ía y mio ##pía , en tratamiento habitual con los ##art ##án y ator ##vas ##tati ##na . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Dia ##gno ##stica ##da en diciembre de 2015 de Car ##cino ##ma duc ##tal in ##fil ##tra ##nte de mama derecha local ##mente av ##anza ##do E - II ##B ( T2 ##N ##1 ##M ##0 ) fe ##not ##ipo H ##ER ##2 positivo . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA
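The printed fragments follow the usual BERT single-sentence layout: `[CLS]`, the subword tokens, `[SEP]`, and then `[PAD]` up to `SEQ_LEN`. The sketch below reproduces that layout under two assumptions: overly long sequences are simply truncated, and all segment ids are zero. `build_bert_input` and the toy vocabulary are hypothetical; the notebook's own handling of long sentences may differ.

```python
def build_bert_input(tokens, token_dict, seq_len):
    """Map subword tokens to fixed-length index and segment lists."""
    # Assumption: overly long sequences are simply truncated here
    tokens = ["[CLS]"] + tokens[: seq_len - 2] + ["[SEP]"]
    indices = [token_dict.get(t, token_dict["[UNK]"]) for t in tokens]
    indices += [token_dict["[PAD]"]] * (seq_len - len(indices))
    segments = [0] * seq_len  # single-sentence input: one segment
    return indices, segments


# Toy vocabulary; the real one is read from vocab.txt
token_dict = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
              "Ana": 4, "##mne": 5, "##sis": 6}
indices, segments = build_bert_input(["Ana", "##mne", "##sis"], token_dict, seq_len=8)
print(indices)
print(segments)
```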

In [70]:
# Fragment labels distribution
pd.Series(np.sum(train_y, axis=1)).describe()

count    27708.000000
mean         0.316335
std          0.620517
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         10.000000
dtype: float64

### Development corpus

Only development texts that are annotated with CIE-O codes are considered:

In [71]:
# Number of dev documents (texts) without annotations
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner["doc_id"]))

0

In [72]:
dev_doc_list = sorted(set(df_codes_dev_ner["doc_id"]))

In [73]:
len(dev_doc_list)

250

In [74]:
%%time
ss_sub_corpus_path = ss_corpus_path + "development/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 10 ms, sys: 4.03 ms, total: 14 ms
Wall time: 13.6 ms


In [75]:
%%time
dev_ind, dev_seg, dev_y, dev_frag, dev_start_end_frag = ss_create_frag_input_data_bert(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 250/250 [00:13<00:00, 18.00it/s]


CPU times: user 14 s, sys: 48 ms, total: 14 s
Wall time: 14 s


In [76]:
# Sanity check

In [77]:
dev_ind.shape

(8319, 128)

In [78]:
dev_seg.shape

(8319, 128)

In [79]:
dev_y.shape

(8319, 743)

In [80]:
len(dev_frag)

250

In [81]:
len(dev_start_end_frag)

8319

In [82]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      33.276000
std       16.370945
min       10.000000
25%       22.250000
50%       30.500000
75%       39.000000
max      175.000000
dtype: float64

In [83]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [84]:
check_id

161

In [85]:
dev_doc_list[check_id]

'cc_onco1325'

In [86]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Anamnesis\nPaciente de 19 años que consulta por crecimiento progresivo de tumoración en nalga derecha de un año de evolución.\n\nExamen físico\nA la exploración, destaca tumoración dependiente de nalga derecha de, aproximadamente, 20 cm de diámetro.\nResto del examen físico sin hallazgos.\n\nPruebas complementarias\nResonancia magnética pélvica: tumoración excéntrica en nalga derecha de 13,5 x 16 x 8,8 cm con aporte de pequeños vasos a nivel craneal, condicionando alta vascularización en su porción media y áreas de fibrosis y necrosis en su porción periférica.\nTomografia computarizada toracoabdominopélvica (TC): ganglios linfáticos aumentados en regiones inguinales y cadenas ilíacas derechas con una adenopatía ilíaca externa derecha de 11 mm de eje corto.\nSe realiza exéresis quirúrgica completa de la tumoración con resultado anatomopatológico (AP) de sarcoma de células redondas grandes de alto grado de 20 cm de diámetro mayor. Márgenes libres. Invasión vascular y linfática.\nInmunoh

In [87]:
check_id_frag = sum(dev_frag[:check_id])
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([dev_y[i]])), "\n")

[('8000/1',)] 

[('8000/1',)] 

[()] 

[('8000/1',)] 

[()] 

[('8000/1', '8800/34/h')] 

[()] 

[()] 

[()] 

[('8800/3/h', '9260/3')] 

[()] 

[()] 

[('8800/34/h', '8805/3')] 

[('8000/6',)] 

[()] 

[()] 

[()] 

[()] 

[('8800/3/h', '8805/6')] 

[()] 



In [88]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._token_dict_inv[ind] for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i])))
    print("\n")

[('Ana', (0, 3)), ('##mne', (3, 6)), ('##sis', (6, 9)), ('Pac', (10, 13)), ('##iente', (13, 18)), ('de', (19, 21)), ('19', (22, 24)), ('años', (25, 29)), ('que', (30, 33)), ('consulta', (34, 42)), ('por', (43, 46)), ('crecimiento', (47, 58)), ('pro', (59, 62)), ('##gres', (62, 66)), ('##ivo', (66, 69)), ('de', (70, 72)), ('tumor', (73, 78)), ('##ación', (78, 83)), ('en', (84, 86)), ('na', (87, 89)), ('##lga', (89, 92)), ('derecha', (93, 100)), ('de', (101, 103)), ('un', (104, 106)), ('año', (107, 110)), ('de', (111, 113)), ('evolución', (114, 123)), ('.', (123, 124))]


[('Ex', (126, 128)), ('##amen', (128, 132)), ('físico', (133, 139)), ('A', (140, 141)), ('la', (142, 144)), ('ex', (145, 147)), ('##plo', (147, 150)), ('##ración', (150, 156)), (',', (156, 157)), ('destaca', (158, 165)), ('tumor', (166, 171)), ('##ación', (171, 176)), ('de', (177, 179)), ('##pend', (179, 183)), ('##iente', (183, 188)), ('de', (189, 191)), ('na', (192, 194)), ('##lga', (194, 197)), ('derecha', (198, 205)

In [89]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._token_dict_inv[ind] for ind in frag]), "\n")

[CLS] Ana ##mne ##sis Pac ##iente de 19 años que consulta por crecimiento pro ##gres ##ivo de tumor ##ación en na ##lga derecha de un año de evolución . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Ex ##amen físico A la ex ##plo ##ración , destaca tumor ##ación de ##pend ##iente de na ##lga derecha de , aproximadamente , 20 cm de diámetro . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

In [90]:
# Fragment labels distribution
pd.Series(np.sum(dev_y, axis=1)).describe()

count    8319.000000
mean        0.293665
std         0.585919
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         6.000000
dtype: float64

### Training & Development corpus

We merge the previously generated datasets:

In [91]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [92]:
train_dev_ind.shape

(36027, 128)

In [93]:
# Segments
train_dev_seg = np.concatenate((train_seg, dev_seg))

In [94]:
train_dev_seg.shape

(36027, 128)

In [96]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [97]:
train_dev_y.shape

(36027, 743)

## Fine-tuning

Using the corpus of labeled sentences, we fine-tune the model on a multi-label sentence classification task.

In [98]:
from keras.backend.tensorflow_backend import set_session

# Prevent GPU memory allocation problems
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))

In [100]:
from keras_bert import load_trained_model_from_checkpoint

model = load_trained_model_from_checkpoint(
    config_file=config_path, 
    checkpoint_file=checkpoint_path, 
    training=training,                                       
    trainable=trainable, 
    seq_len=SEQ_LEN
)

In [101]:
model.inputs

[<tf.Tensor 'Input-Token:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Segment:0' shape=(?, 128) dtype=float32>]

In [102]:
model.outputs

[<tf.Tensor 'Encoder-12-FeedForward-Norm/add_1:0' shape=(?, 128, 768) dtype=float32>]

In [103]:
from keras.layers import Dense, Activation
from keras.models import Model
from keras.initializers import glorot_uniform
from keras_bert.layers import Extract

dense_cls = Extract(index=0, name='Extract')(model.output) # In order to extract CLS token embedding
dense_out = Dense(units=num_labels, kernel_initializer=glorot_uniform(seed=random_seed))(dense_cls) # Multi-label classification
outputs = Activation('sigmoid')(dense_out)

model = Model(model.inputs, outputs)

In [104]:
model.inputs

[<tf.Tensor 'Input-Token:0' shape=(?, 128) dtype=float32>,
 <tf.Tensor 'Input-Segment:0' shape=(?, 128) dtype=float32>]

In [105]:
model.outputs

[<tf.Tensor 'activation_1/Sigmoid:0' shape=(?, 743) dtype=float32>]

In [None]:
%%time
from keras_radam import RAdam

model.compile(
    optimizer=RAdam(learning_rate=LR),
    loss='binary_crossentropy'
)

history = model.fit(
    x=[train_dev_ind, train_dev_seg],
    y=train_dev_y,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    shuffle=True
)

Epoch 1/30
Epoch 2/30
Epoch 3/30
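The model above pairs one sigmoid output per code with the `binary_crossentropy` loss, so each code is treated as an independent yes/no decision. The worked example below computes that loss for a single toy sentence with three labels, using only the standard library; the logits and targets are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -1.0, 0.0]   # toy outputs of the dense layer
y_true = [1, 0, 0]          # multi-hot target
probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(y_true, probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

Because most sentences carry no code at all, the bulk of the loss comes from negative labels, which is worth keeping in mind when choosing a decision threshold.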

## Test set predictions

Finally, the model's predictions on the test set are saved. To this end, each sentence of the test corpus is first converted into a sequence of subwords (input IDs and segment arrays). The sentence-level predictions are then saved so that they can be evaluated at document level (see `results/Cantemist-Coding/Evaluation.ipynb`).
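One plausible way to move from sentence-level to document-level predictions is to take, for each code, the maximum probability over a document's sentences and keep the codes above a threshold. This is an assumption for illustration only: the actual aggregation is defined in the evaluation notebook, which is not shown here. `max_pool_by_doc` and the 0.5 threshold are hypothetical.

```python
def max_pool_by_doc(sent_probs, frag_counts):
    """Per-document, per-label maximum over sentence-level probabilities."""
    doc_probs, start = [], 0
    for n in frag_counts:
        rows = sent_probs[start:start + n]
        doc_probs.append([max(col) for col in zip(*rows)])
        start += n
    return doc_probs


# Toy data: 2 documents (2 and 1 sentences), 3 labels
sent_probs = [[0.1, 0.9, 0.2],
              [0.7, 0.3, 0.1],
              [0.2, 0.1, 0.6]]
frag_counts = [2, 1]
labels = ["8000/6", "8140/3", "8500/3"]

doc_probs = max_pool_by_doc(sent_probs, frag_counts)
doc_codes = [[l for l, p in zip(labels, probs) if p >= 0.5] for probs in doc_probs]
print(doc_codes)
```

The per-document fragment counts play the same role here as `test_frag`: they are what allows the flat array of sentence predictions to be split back into documents.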

In [98]:
%%time
test_path = corpus_path + "test-set/cantemist-ner/"
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f) and f.split('.')[-1] == 'txt']
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 40.2 ms, sys: 15.9 ms, total: 56.1 ms
Wall time: 290 ms


In [109]:
df_text_test.shape

(250, 2)

In [99]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco877,"Anamnesis\nMujer de 59 años, alérgica a penici..."
1,cc_onco1075,"Anamnesis\nMujer de 52 años, sin alergias cono..."
2,cc_onco1450,"Anamnesis\nMujer de 51 años de edad, sin antec..."
3,cc_onco1165,Anamnesis\nPaciente varón de 75 años sin hábit...
4,cc_onco1298,"Anamnesis\nMujer de 60 años, exfumadora de 20 ..."


In [100]:
df_text_test.raw_text[0]

'Anamnesis\nMujer de 59 años, alérgica a penicilina y procaína. Fumadora activa (IPA: 43).\nAntecedentes familiares: abuelo materno diagnosticado de carcinoma colon a los 70 años; madre diagnosticada de carcinoma de mama bilateral a los 50 años; padre fallecido de carcinoma gástrico a los 47 años; tres tías maternas diagnosticadas de carcinoma de mama a los 55, 56 y 57 años respectivamente; y tres primas afectas de cáncer de mama.\nAntecedentes personales: bronquitis crónica, poliposis colónica, carcinoma ductal infiltrante clásico mama pT2pN0M0 G2 subtipo tumoral luminal a (RH: +, HER-2: negativo) intervenido en agosto de 2013 mediante tumorectomía mama izquierda (patrón round block) + biopsia selectiva ganglio centinela (negativo) y posterior QT adyuvante con esquema TC (paclitaxel-ciclofosfamida) x 4 ciclos.\nAcude en noviembre de 2013 a visita de seguimiento tras finalizar tratamiento adyuvante. Asintomática.\n\nExploración física\nTemperatura axilar 36,5ºC, tensión arterial 130/83

In [101]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [102]:
len(test_doc_list)

300

In [103]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test-background/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 141 ms, sys: 36.1 ms, total: 177 ms
Wall time: 177 ms


In [107]:
%%time
test_ind, test_seg, _, test_frag, _ = ss_create_frag_input_data_bert(df_text=df_text_test, 
                                                  text_col=text_col,
                                                  # Since labels are ignored, we pass df_codes_train_ner as df_ann
                                                  df_ann=df_codes_train_ner, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, lab_encoder=mlb_encoder, seq_len=SEQ_LEN,
                                                  greedy=False)

100%|██████████| 300/300 [00:03<00:00, 88.74it/s] 

CPU times: user 3.42 s, sys: 56.3 ms, total: 3.48 s
Wall time: 3.44 s





In [None]:
%%time
test_preds = model.predict([test_ind, test_seg])

In [120]:
test_preds.shape

(3955, 727)

In [105]:
results_dir_path = "../results/Cantemist-Coding/"

In [178]:
%%time
np.save(file=results_dir_path + "predictions/mbert_galen_seed_" + str(random_seed) + "_test_preds.npy", arr=test_preds)

CPU times: user 2.21 ms, sys: 4.65 ms, total: 6.87 ms
Wall time: 6.02 ms
