# Fine-tuning mBERT-Galén on Cantemist-NER

In this notebook, following a multi-class token classification approach, the mBERT-Galén model is fine-tuned on both the training and development sets of the Cantemist-NER corpus. Additionally, the predictions made by the model on the test set are saved, in order to further evaluate the NER performance of the model (see `results/Evaluation.ipynb`).

In [1]:
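import os

import numpy as np
import pandas as pd
import tensorflow as tf

# Auxiliary components (project helpers such as load_text_files and process_brat_ner)
from nlp_utils import *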

from transformers import BertTokenizerFast
model_name = "mBERT-Galen/"
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=False)

# Hyper-parameters
text_col = "raw_text"
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 88
LR = 3e-5

GREEDY = True
IGNORE_VALUE = -100
ANN_STRATEGY = "word-all"
EVAL_STRATEGY = "word-max"
LOGITS = True

random_seed = 0
tf.random.set_seed(random_seed)

## Load text

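First, all text files from the Cantemist training and development corpora are loaded into separate dataframes. Note that `dev-set1` is merged into the training dataframe, while `dev-set2` is kept as the development corpus.

The NER annotations are loaded in the next section.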
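
The text files are read with `load_text_files`, a helper from `nlp_utils` whose implementation is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it simply reads each file as UTF-8 and returns the contents in the same order as the file names, a stand-in could look like the following (the names `read_txt_files` and `list_files_with_ext` are hypothetical):

```python
import os


def list_files_with_ext(dir_path, ext):
    """Return the sorted names of regular files in dir_path with the given extension."""
    return sorted(
        name for name in os.listdir(dir_path)
        if os.path.isfile(os.path.join(dir_path, name)) and name.endswith("." + ext)
    )


def read_txt_files(file_names, dir_path):
    """Hypothetical stand-in for nlp_utils.load_text_files: read each file as UTF-8 text."""
    texts = []
    for name in file_names:
        with open(os.path.join(dir_path, name), encoding="utf-8") as f:
            texts.append(f.read())
    return texts
```

Sorting the file names is a choice made in this sketch only; the cells below rely on the order returned by `os.listdir`, which is not guaranteed to be stable across file systems.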

In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-ner/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train-set/" + sub_task_path
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f) and f.split('.')[-1] == "txt"]
n_train_files = len(train_files)
train_data = load_text_files(train_files, train_path)
dev1_path = corpus_path + "dev-set1/" + sub_task_path
train_files.extend([f for f in os.listdir(dev1_path) if os.path.isfile(dev1_path + f) and f.split('.')[-1] == "txt"])
train_data.extend(load_text_files(train_files[n_train_files:], dev1_path))
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 16 ms, sys: 2.66 ms, total: 18.6 ms
Wall time: 18.3 ms


In [4]:
df_text_train.shape

(751, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco453,"Anamnesis\nSe trata de un varón de 55 años, ex..."
1,cc_onco962,Anamnesis\nMujer de 40 años que consulta por l...
2,cc_onco989,"Anamnesis\nPaciente de 43 años, perimenopáusic..."
3,cc_onco187,"Anamnesis\nVarón de 72 años, exfumador y bebed..."
4,cc_onco164,"Anamnesis\nMujer de 51 años, sin alergias medi..."


In [6]:
len(set(df_text_train['doc_id']))

751

In [7]:
df_text_train.raw_text[0]

'Anamnesis\nSe trata de un varón de 55 años, ex fumador con un índice tabáquico de 40 paquetes-año, HTA, sin antecedentes familiares de interés, en tratamiento con ácido fólico, omeprazol, hierro oral y risperidona.\nEn junio de 2017, es diagnosticado a raíz de una trombosis iliaca derecha de una masa tumoral pobremente diferenciada que infiltraba tercio distal de apéndice cecal, con obliteración del paquete vascular iliaco, así como infiltración del uréter derecho, condicionando ureterohidronefrosis derecha grado cuatro.\nSe realiza biopsia de la lesión, con hallazgos anatomopatológicos de neoplasia fusocelular con inmunohistoquímica (IHC) sugerente de origen urotelial con diferenciación sarcomatoide (positividad para p63, p40, GATA3, CD99, EMA CAM 5.2, CK 34, beta E12, CK7, negatividad para CK20, S100, CD34, cKIT y TTF1), con SYT no reordenado.\nSe lleva a cabo intervención en julio de 2017 con extirpación de la masa tumoral, resección ileocecal, dejando ileostomía terminal y bypass 

### Development corpus

In [8]:
%%time
dev_path = corpus_path + "dev-set2/" + sub_task_path
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f) and f.split('.')[-1] == "txt"]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 7.09 ms, sys: 0 ns, total: 7.09 ms
Wall time: 6.6 ms


In [9]:
df_text_dev.shape

(250, 2)

In [10]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco1183,Anamnesis\nVarón de 48 años que acude en agost...
1,cc_onco751,Anamnesis\nHombre de 71 años de edad con antec...
2,cc_onco1384,"Anamnesis\nVarón de 30 años, sin antecedentes ..."
3,cc_onco1208,Anamnesis\nVarón de 76 años diagnosticado en a...
4,cc_onco734,Anamnesis\nTras canalización del reservorio ce...


In [11]:
len(set(df_text_dev['doc_id']))

250

In [12]:
df_text_dev.raw_text[0]

'Anamnesis\nVarón de 48 años que acude en agosto de 2012 tras la realización de amputación en hallux derecho a la primera valoración por Oncología Médica.\nAntecedentes: no alérgicos. Patológicos: diabético hace 3 años sin tratamiento. HTA en tratamiento. Enolismo crónico.\nTrastorno afectivo bipolar. Quirúrgicos: apendicectomizado.\nEnfermedad actual: desde hace 2 años presentaba una lesión hiperpigmentada en el hallux del pie derecho, que ocasionalmente sangraba.\n\nExamen físico\nMuñón de hallux derecho en buen estado, presencia de adenopatías inguinales derechas. Resto de exploración sin alteraciones.\n\nPruebas complementarias\n- Biopsia cutánea: melanoma ulcerado.\n- AP de resección de melanoma: melanoma lentiginoso acral en fase de crecimiento vertical, ulcerado. Clark IV. Breslow 6,8 mm, sin infiltración linfovascular.\n- TC de tórax-abdomen-pelvis: sin signos concluyentes de extensión tóraco-abdómino-pélvica de melanoma.\nMicronódulos en el lóbulo superior derecho a valorar en

## Process NER annotations

We load and pre-process the NER annotations in BRAT format available for the Cantemist-NER subtask.
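
The parsing itself is done by `process_brat_ner` from `nlp_utils`, which is not shown here. For orientation, a minimal sketch of reading BRAT standoff text-bound annotations into the same four columns (`doc_id`, `text_ref`, `start`, `end`) might look like this. The function name `parse_brat_ann` is hypothetical, the entity type in the usage example is only illustrative, and discontinuous spans are skipped, which may differ from what the real helper does:

```python
def parse_brat_ann(ann_text, doc_id):
    """Parse text-bound ('T') BRAT lines into dicts with doc_id, text_ref, start, end."""
    rows = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip annotator notes ('#'), relations ('R'), etc.
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # malformed line
        # fields[1] looks like "<TYPE> <start> <end>"
        _type, span = fields[1].split(" ", 1)
        if ";" in span:
            continue  # discontinuous span: ignored in this sketch
        start, end = span.split()
        rows.append({"doc_id": doc_id, "text_ref": fields[2],
                     "start": int(start), "end": int(end)})
    return rows
```

For example, `parse_brat_ann("T1\tMORFOLOGIA_NEOPLASIA 2719 2740\tCarcinoma microcítico", "cc_onco1")` yields one row with `start=2719` and `end=2740`, matching the first row of the training dataframe shown below.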

In [13]:
# Training corpus

In [14]:
train_ann_files = [train_path + f for f in os.listdir(train_path) if f.split('.')[-1] == "ann"]
train_ann_files.extend([dev1_path + f for f in os.listdir(dev1_path) if f.split('.')[-1] == "ann"])

In [15]:
len(train_ann_files)

751

In [16]:
df_codes_train_ner = process_brat_ner(train_ann_files).sort_values(["doc_id", "start", "end"])

In [17]:
df_codes_train_ner.shape

(9737, 4)

In [18]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,text_ref,start,end
5230,cc_onco1,Carcinoma microcítico,2719,2740
5231,cc_onco1,carcinoma microcítico,2950,2971
5232,cc_onco1,M0,2988,2990
97,cc_onco10,tumor,212,217
95,cc_onco10,neoplasia,976,985


In [19]:
len(set(df_codes_train_ner["doc_id"]))

750

In [20]:
assert not df_codes_train_ner[["doc_id", "start", "end"]].duplicated().any()

In [21]:
# Development corpus

In [22]:
dev_ann_files = [dev_path + f for f in os.listdir(dev_path) if f.split('.')[-1] == "ann"]

In [23]:
len(dev_ann_files)

250

In [24]:
df_codes_dev_ner = process_brat_ner(dev_ann_files).sort_values(["doc_id", "start", "end"])

In [25]:
df_codes_dev_ner.shape

(2660, 4)

In [26]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,text_ref,start,end
1852,cc_onco1001,carcinoma epidermoide,576,597
1854,cc_onco1001,neoplasia,790,799
1857,cc_onco1001,adenocarcinoma T4N3M1b,836,858
1853,cc_onco1001,enfermedad hepática,1205,1224
1855,cc_onco1001,tumoral,2303,2310


In [27]:
df_codes_dev_ner.tail()

Unnamed: 0,doc_id,text_ref,start,end
1630,cc_onco994,tumoración,1604,1614
1629,cc_onco994,metastásica,3064,3075
1628,cc_onco994,macroadenoma,3752,3764
1632,cc_onco994,macroadenoma de la hipófisis,4068,4096
1627,cc_onco994,lesiones hepáticas,5378,5396


In [28]:
assert not df_codes_dev_ner[["doc_id", "start", "end"]].duplicated().any()

### Remove overlapping annotations

In [29]:
# Training corpus

In [30]:
%%time
df_codes_train_ner_final = eliminate_overlap(df_ann=df_codes_train_ner)

100%|██████████| 750/750 [00:21<00:00, 35.01it/s]

CPU times: user 21.4 s, sys: 38.2 ms, total: 21.5 s
Wall time: 21.4 s





In [31]:
df_codes_train_ner_final.shape

(9605, 4)

In [32]:
# Development corpus

In [33]:
%%time
df_codes_dev_ner_final = eliminate_overlap(df_ann=df_codes_dev_ner)

100%|██████████| 250/250 [00:04<00:00, 55.07it/s]

CPU times: user 4.56 s, sys: 106 µs, total: 4.56 s
Wall time: 4.54 s





In [34]:
df_codes_dev_ner_final.shape

(2623, 4)
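
The criterion used by `eliminate_overlap` is defined in `nlp_utils` and is not visible in this notebook. One common way to resolve overlapping annotations, shown here purely as an assumption, is to keep the longest span and discard any span that overlaps one already kept. The function name `drop_overlaps` is hypothetical:

```python
def drop_overlaps(spans):
    """Keep a non-overlapping subset of (start, end) spans, preferring longer spans."""
    kept = []
    # Visit the longest spans first; ties are broken by the earliest start offset.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # Two half-open spans are disjoint if one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)
```

Applied per document, a rule of this kind would reduce the number of annotations, as observed above (9737 to 9605 for training and 2660 to 2623 for development), although the exact counts depend on the actual criterion.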

## Creation of annotated sequences

We create the corpus used to fine-tune the transformer model on the NER task. To do so, we split the texts into sentences and convert each sentence into a sequence of subtokens. Each generated subtoken is then assigned a NER label in IOB-2 format.
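
To make the labeling step concrete, here is a minimal sketch of IOB-2 labeling from character offsets. It assumes non-overlapping annotation spans, labels the first word that touches an annotation as `B` and the following ones as `I`, and then copies each word label to all of its subtokens, which is one plausible reading of the `"word-all"` strategy. The function names are hypothetical and the real logic lives in `ss_create_input_data_ner`:

```python
def iob2_word_labels(word_spans, ann_spans):
    """Label each word (start, end) as B, I or O given non-overlapping annotation spans."""
    labels = []
    for w_start, w_end in word_spans:
        label = "O"
        for a_start, a_end in ann_spans:
            if w_start < a_end and w_end > a_start:  # the word overlaps the annotation
                # The word containing the annotation start opens the entity.
                label = "B" if w_start <= a_start else "I"
                break
        labels.append(label)
    return labels


def propagate_to_subtokens(word_labels, word_ids):
    """Give every subtoken the label of the word it belongs to (assumed 'word-all' rule)."""
    return [word_labels[w] for w in word_ids]
```

With the annotation `metástasis de melanoma` at offsets 511 to 533, the three words it covers receive `B`, `I`, `I`, and the surrounding words receive `O`.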

In [35]:
# Sentence-Split information
ss_corpus_path = "../datasets/Cantemist-SSplit-text/"

In [36]:
from sklearn.preprocessing import LabelEncoder

lab_encoder = LabelEncoder()
# IOB-2 format
lab_encoder.fit(["B", "I", "O"])

LabelEncoder()

### Training corpus

Only training texts with NER annotations are considered:

In [37]:
# Some train documents (texts) are not annotated 
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner_final["doc_id"]))

1

In [38]:
train_doc_list = sorted(set(df_codes_train_ner_final["doc_id"]))

In [39]:
len(train_doc_list)

750

In [40]:
# Sentence-Split data

In [41]:
%%time
ss_sub_corpus_path = ss_corpus_path + "training/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 33.8 ms, sys: 49 µs, total: 33.9 ms
Wall time: 33.6 ms


In [42]:
%%time
train_ind, train_att, train_type, train_y, train_frag, train_start_end_frag, train_word_id = ss_create_input_data_ner(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner_final, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 750/750 [01:10<00:00, 10.62it/s]


CPU times: user 1min 10s, sys: 196 ms, total: 1min 11s
Wall time: 1min 10s


In [43]:
# Sanity check

In [43]:
train_ind.shape

(10619, 128)

In [44]:
train_att.shape

(10619, 128)

In [45]:
train_type.shape

(10619, 128)

In [46]:
train_y.shape

(10619, 128)

In [47]:
len(train_frag)

750

In [48]:
len(train_start_end_frag)

10619

In [49]:
len(train_word_id)

10619

In [50]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    750.000000
mean      14.158667
std        4.858494
min        4.000000
25%       11.000000
50%       14.000000
75%       17.000000
max       41.000000
dtype: float64
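
Each document is therefore encoded as roughly 14 fragments of `SEQ_LEN` subtokens on average. How sentences are grouped into fragments is decided inside `ss_create_input_data_ner`; the sketch below only illustrates one plausible meaning of `greedy=True`, namely packing consecutive sentences into a fragment while they still fit. The function name `pack_sentences` is hypothetical, and the budget of `SEQ_LEN - 2` assumes two positions are reserved for `[CLS]` and `[SEP]`:

```python
def pack_sentences(sent_lengths, max_tokens):
    """Greedily group consecutive sentence indices so each group fits in max_tokens subtokens."""
    fragments, current, current_len = [], [], 0
    for i, n_tokens in enumerate(sent_lengths):
        # Close the current fragment if adding this sentence would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            fragments.append(current)
            current, current_len = [], 0
        current.append(i)
        current_len += n_tokens
    if current:
        fragments.append(current)
    return fragments
```

A sentence longer than `max_tokens` ends up alone in its fragment in this sketch and would still need to be split; that case is not handled here.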

In [51]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [52]:
check_id

329

In [53]:
train_doc_list[check_id]

'cc_onco467'

In [54]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'ANTECEDENTES E HISTORIA PREVIA\nPresentamos el caso de un varón de 67 años sin antecedentes personales de interés, independiente para las actividades básicas de la vida diaria y vida activa, que es diagnosticado en marzo de 2012 de melanoma en región dorso lumbar izquierda (índice de Breslow de 4 mm y nivel 4 de Clark, con 12 mm de diámetro) con biopsia de ganglio centinela positiva. Se realiza exéresis de la lesión y linfadenectomía inguinal radical izquierda, con resultado de una adenopatía positiva para metástasis de melanoma en región obturatriz. Inicia tratamiento adyuvante con interferón en septiembre de 2012 concluyéndolo en septiembre de 2013 sin incidencias.\nEn marzo de 2014 en tomografía computarizada (TC) de control se objetivan adenopatías superiores al centímetro a nivel inguinofemoral derecho, por lo que en junio de 2014 se lleva a cabo linfadenectomía con resultado de recidiva ganglionar de melanoma. Se lleva a cabo estudio de mutaciones BRAF siendo no mutado.\nEn sept

In [55]:
df_codes_train_ner_final[df_codes_train_ner_final["doc_id"] == train_doc_list[check_id]]

Unnamed: 0,doc_id,text_ref,start,end
4217,cc_onco467,melanoma,231,239
4219,cc_onco467,metástasis de melanoma,511,533
4218,cc_onco467,melanoma,918,926
4220,cc_onco467,recidiva intestinal de melanoma,1456,1487
4216,cc_onco467,Melanoma maligno,5924,5940


In [56]:
check_id_frag = sum(train_frag[:check_id])

In [57]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i], train_word_id[i], 
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in train_y[i][1:len(train_start_end_frag[i])+1]])))
    print("\n")

[('AN', (0, 12), 0, 'O'), ('##TE', (0, 12), 0, 'O'), ('##CE', (0, 12), 0, 'O'), ('##DE', (0, 12), 0, 'O'), ('##NT', (0, 12), 0, 'O'), ('##ES', (0, 12), 0, 'O'), ('E', (13, 14), 1, 'O'), ('H', (15, 23), 2, 'O'), ('##IS', (15, 23), 2, 'O'), ('##TO', (15, 23), 2, 'O'), ('##RI', (15, 23), 2, 'O'), ('##A', (15, 23), 2, 'O'), ('PR', (24, 30), 3, 'O'), ('##E', (24, 30), 3, 'O'), ('##VI', (24, 30), 3, 'O'), ('##A', (24, 30), 3, 'O'), ('Presenta', (31, 42), 4, 'O'), ('##mos', (31, 42), 4, 'O'), ('el', (43, 45), 5, 'O'), ('caso', (46, 50), 6, 'O'), ('de', (51, 53), 7, 'O'), ('un', (54, 56), 8, 'O'), ('var', (57, 62), 9, 'O'), ('##ón', (57, 62), 9, 'O'), ('de', (63, 65), 10, 'O'), ('67', (66, 68), 11, 'O'), ('años', (69, 73), 12, 'O'), ('sin', (74, 77), 13, 'O'), ('ante', (78, 90), 14, 'O'), ('##cedent', (78, 90), 14, 'O'), ('##es', (78, 90), 14, 'O'), ('personales', (91, 101), 15, 'O'), ('de', (102, 104), 16, 'O'), ('interés', (105, 112), 17, 'O'), (',', (112, 113), 18, 'O'), ('independiente', (

In [58]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

[CLS] AN ##TE ##CE ##DE ##NT ##ES E H ##IS ##TO ##RI ##A PR ##E ##VI ##A Presenta ##mos el caso de un var ##ón de 67 años sin ante ##cedent ##es personales de interés , independiente para las actividades básica ##s de la vida diari ##a y vida activa , que es diagnostic ##ado en marzo de 2012 de me ##lano ##ma en región dor ##so lu ##mbar izquierda ( índice de Br ##es ##low de 4 mm y nivel 4 de Clark , con 12 mm de diámetro ) con bio ##psia de gang ##lio cent ##ine ##la positiva . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Se realiza ex ##ére ##sis de la lesión y li ##n ##fa ##dene ##cto ##mí ##a ing ##uin ##al radical izquierda , con resultado de una ad ##eno ##pat ##ía positiva para met ##ást ##asis de me ##lano ##ma en región ob ##tura ##tri ##z . Ini ##cia tratamiento ad ##yu ##vant ##e con inter ##fer ##ón en septiembre de 2012 con ##clu ##y ##én ##dolo en septie
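
The inspection cells above locate the fragments of a document with `sum(train_frag[:check_id])`, since `train_frag` stores the number of fragments per document in the order of `train_doc_list`. When the ranges of many documents are needed, the same offsets can be computed once with a cumulative sum. A small sketch (the function name `fragment_ranges` is hypothetical):

```python
from itertools import accumulate


def fragment_ranges(frag_counts):
    """Map each document index to the half-open [start, stop) row range of its fragments."""
    stops = list(accumulate(frag_counts))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))
```

For instance, `fragment_ranges([3, 2, 4])` gives `[(0, 3), (3, 5), (5, 9)]`, so the rows of the second document are `train_ind[3:5]`.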

### Development corpus

Only development texts with NER annotations are considered:

In [59]:
# All development documents (texts) are annotated 
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner_final["doc_id"]))

0

In [60]:
dev_doc_list = sorted(set(df_codes_dev_ner_final["doc_id"]))

In [61]:
len(dev_doc_list)

250

In [64]:
# Sentence-Split data

In [62]:
%%time
ss_sub_corpus_path = ss_corpus_path + "development/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 46.5 ms, sys: 20 ms, total: 66.4 ms
Wall time: 154 ms


In [63]:
%%time
dev_ind, dev_att, dev_type, dev_y, dev_frag, dev_start_end_frag, dev_word_id = ss_create_input_data_ner(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner_final, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

 84%|████████▍ | 210/250 [00:16<00:02, 17.64it/s]

I
doc_id      cc_onco1427
text_ref        pT3N2Mx
start              1928
end                1935
Name: 803, dtype: object
11
11
[[1849 1858]
 [1859 1867]
 [1868 1872]
 [1873 1887]
 [1888 1898]
 [1899 1900]
 [1900 1905]
 [1906 1916]
 [1917 1918]
 [1919 1924]
 [1924 1925]
 [1926 1935]
 [1935 1936]
 [1937 1939]
 [1940 1948]
 [1949 1952]
 [1953 1961]
 [1962 1963]
 [1964 1971]
 [1972 1980]
 [1981 1985]
 [1986 1996]
 [1996 1997]]


100%|██████████| 250/250 [00:19<00:00, 12.75it/s]

CPU times: user 19.7 s, sys: 27.8 ms, total: 19.7 s
Wall time: 19.7 s





In [67]:
# Sanity check

In [64]:
dev_ind.shape

(2856, 128)

In [65]:
dev_att.shape

(2856, 128)

In [66]:
dev_type.shape

(2856, 128)

In [67]:
dev_y.shape

(2856, 128)

In [68]:
len(dev_frag)

250

In [69]:
len(dev_start_end_frag)

2856

In [70]:
len(dev_word_id)

2856

In [71]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      11.424000
std        5.025659
min        3.000000
25%        8.000000
50%       10.000000
75%       14.000000
max       38.000000
dtype: float64

In [72]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [73]:
check_id

33

In [74]:
dev_doc_list[check_id]

'cc_onco1061'

In [75]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Anamnesis\nVarón de 44 años, con antecedentes de obesidad mórbida, de profesión camionero, que acude a Urgencias en repetidas ocasiones por un episodio de dorsalgia difusa. Indica molestias de 4 meses de evolución, de carácter progresivo, que no ceden a analgesia de primer escalón. Refiere empeoramiento de su estado general, no tolerando el decúbito.\nNo astenia, no pérdida de apetito. No fiebre termometrada. No náuseas, no vómitos, no otra sintomatología de interés.\nEn la última visita a Urgencias, refiere inestabilidad a la marcha asociada a dolor transfixiante interescapular.\nSe le solicita una angio-TC de tórax, en la que se observó: aneurisma de aorta torácica de inicio distal a la salida de la subclavia izquierda de 4 x 5,5 x 4,3 cm que fue valorada por el Servicio Vascular, que no encontró signos de complicación.\nAcude de nuevo a Urgencias 7 días después aquejando pérdida de fuerza y parestesias en ambos miembros inferiores, más acusado en el izquierdo, llegando a realizarse

In [76]:
df_codes_dev_ner_final[df_codes_dev_ner_final["doc_id"] == dev_doc_list[check_id]]

Unnamed: 0,doc_id,text_ref,start,end
1860,cc_onco1061,metástasis,2710,2720
1859,cc_onco1061,malignas,2810,2818
1861,cc_onco1061,metástasis,3042,3052
1858,cc_onco1061,carcinoma infiltrante,3247,3268
1862,cc_onco1061,metástasis de cáncer,3610,3630


In [77]:
check_id_frag = sum(dev_frag[:check_id])

In [78]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i], dev_word_id[i],
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in dev_y[i][1:len(dev_start_end_frag[i])+1]])))
    print("\n")

[('Ana', (0, 9), 0, 'O'), ('##mne', (0, 9), 0, 'O'), ('##sis', (0, 9), 0, 'O'), ('Var', (10, 15), 1, 'O'), ('##ón', (10, 15), 1, 'O'), ('de', (16, 18), 2, 'O'), ('44', (19, 21), 3, 'O'), ('años', (22, 26), 4, 'O'), (',', (26, 27), 5, 'O'), ('con', (28, 31), 6, 'O'), ('ante', (32, 44), 7, 'O'), ('##cedent', (32, 44), 7, 'O'), ('##es', (32, 44), 7, 'O'), ('de', (45, 47), 8, 'O'), ('ob', (48, 56), 9, 'O'), ('##esi', (48, 56), 9, 'O'), ('##dad', (48, 56), 9, 'O'), ('mór', (57, 64), 10, 'O'), ('##bida', (57, 64), 10, 'O'), (',', (64, 65), 11, 'O'), ('de', (66, 68), 12, 'O'), ('prof', (69, 78), 13, 'O'), ('##esión', (69, 78), 13, 'O'), ('cam', (79, 88), 14, 'O'), ('##ione', (79, 88), 14, 'O'), ('##ro', (79, 88), 14, 'O'), (',', (88, 89), 15, 'O'), ('que', (90, 93), 16, 'O'), ('acu', (94, 99), 17, 'O'), ('##de', (94, 99), 17, 'O'), ('a', (100, 101), 18, 'O'), ('Ur', (102, 111), 19, 'O'), ('##gen', (102, 111), 19, 'O'), ('##cias', (102, 111), 19, 'O'), ('en', (112, 114), 20, 'O'), ('rep', (115

In [79]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

[CLS] Ana ##mne ##sis Var ##ón de 44 años , con ante ##cedent ##es de ob ##esi ##dad mór ##bida , de prof ##esión cam ##ione ##ro , que acu ##de a Ur ##gen ##cias en rep ##eti ##das ocasiones por un episodio de dorsal ##gia di ##fusa . In ##dica mol ##esti ##as de 4 meses de evolución , de carácter pro ##gres ##ivo , que no ce ##den a anal ##gesi ##a de primer es ##cal ##ón . Re ##fier ##e em ##pe ##ora ##miento de su estado general , no tol ##eran ##do el de ##c ##ú ##bito . No as ##tenia , no pérdida de ap ##eti ##to . No fie ##bre termo ##met ##rada . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] No ná ##use ##as , no v ##óm ##itos , no otra sin ##toma ##tol ##ogía de interés . En la última visita a Ur ##gen ##cias , refiere in ##esta ##bilidad a la marcha as ##ociada a dolor trans ##fix ##iante interes ##cap ##ular . Se le soli ##cita una ang ##io - TC de tó ##rax , en la que se ob ##ser ##vó : ane ##uris ##ma de ao ##rta to ##rá ##ci ##ca de inicio dis ##tal a la salida 

### Training & Development corpus

We merge the previously generated datasets:

In [80]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [81]:
train_dev_ind.shape

(13475, 128)

In [82]:
# Attention
train_dev_att = np.concatenate((train_att, dev_att))

In [83]:
train_dev_att.shape

(13475, 128)

In [84]:
# Type
train_dev_type = np.concatenate((train_type, dev_type))

In [85]:
train_dev_type.shape

(13475, 128)

In [86]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [87]:
train_dev_y.shape

(13475, 128)

## Fine-tuning

Using the corpus of labeled subtoken sequences, we fine-tune the model on a multi-class token classification task.

In [88]:
from transformers import TFBertForTokenClassification

model = TFBertForTokenClassification.from_pretrained(model_name, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [89]:
model.summary()

Model: "tf_bert_for_token_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  177262848 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 177,264,386
Trainable params: 177,264,386
Non-trainable params: 0
_________________________________________________________________


In [90]:
model.layers

[<transformers.models.bert.modeling_tf_bert.TFBertMainLayer at 0x7f4654782bd0>,
 <tensorflow.python.keras.layers.core.Dropout at 0x7f4603421490>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f4602741790>]

In [91]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')

num_labels = len(lab_encoder.classes_)

out_seq = model.layers[0](input_ids=input_ids)[0] # take the output sub-token sequence 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(out_seq) # Multi-class classification

model = Model(inputs=input_ids, outputs=out_logits)

In [92]:
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_ids (InputLayer)       [(None, 128)]             0         
_________________________________________________________________
bert (TFBertMainLayer)       TFBaseModelOutputWithPool 177262848 
_________________________________________________________________
dense (Dense)                (None, 128, 3)            2307      
Total params: 177,265,155
Trainable params: 177,265,155
Non-trainable params: 0
_________________________________________________________________


In [93]:
model.input

<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>

In [94]:
model.output

<tf.Tensor 'dense/BiasAdd:0' shape=(None, 128, 3) dtype=float32>
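
The model outputs one vector of 3 logits per subtoken position, and it is trained with `TokenClassificationLoss(from_logits=LOGITS, ignore_val=IGNORE_VALUE)`, a class defined in `nlp_utils`. Its code is not shown here, so the following is only a sketch of the general idea, assuming the loss is a cross-entropy averaged over the positions whose label differs from the ignore value. It is written in plain Python for readability rather than with TensorFlow ops, and the function name is hypothetical:

```python
import math

IGNORE_VALUE = -100


def masked_token_cross_entropy(logits, labels, ignore_value=IGNORE_VALUE):
    """Mean cross-entropy over tokens whose label is not ignore_value.

    logits: list of per-token lists of class scores; labels: list of int class ids.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_value:
            continue  # masked position: contributes nothing to the loss
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[label]  # -log softmax(scores)[label]
        count += 1
    return total / count if count else 0.0
```

Masking in this way keeps positions without a real label from influencing the gradients.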

In [100]:
%%time
from tensorflow.keras import optimizers, losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = TokenClassificationLoss(from_logits=LOGITS, ignore_val=IGNORE_VALUE)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_ind}, 
                    y=train_dev_y, batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)

Epoch 1/88
Epoch 2/88
Epoch 3/88
Epoch 4/88
Epoch 5/88
Epoch 6/88
Epoch 7/88
Epoch 8/88
Epoch 9/88
Epoch 10/88
Epoch 11/88
Epoch 12/88
Epoch 13/88
Epoch 14/88
Epoch 15/88
Epoch 16/88
Epoch 17/88
Epoch 18/88
Epoch 19/88
Epoch 20/88
Epoch 21/88
Epoch 22/88
Epoch 23/88
Epoch 24/88
Epoch 25/88
Epoch 26/88
Epoch 27/88
Epoch 28/88
Epoch 29/88
Epoch 30/88
Epoch 31/88
Epoch 32/88
Epoch 33/88
Epoch 34/88
Epoch 35/88
Epoch 36/88
Epoch 37/88
Epoch 38/88
Epoch 39/88
Epoch 40/88
Epoch 41/88
Epoch 42/88
Epoch 43/88
Epoch 44/88
Epoch 45/88
Epoch 46/88
Epoch 47/88
Epoch 48/88
Epoch 49/88
Epoch 50/88
Epoch 51/88
Epoch 52/88
Epoch 53/88
Epoch 54/88
Epoch 55/88
Epoch 56/88
Epoch 57/88
Epoch 58/88
Epoch 59/88
Epoch 60/88
Epoch 61/88
Epoch 62/88
Epoch 63/88
Epoch 64/88
Epoch 65/88
Epoch 66/88
Epoch 67/88
Epoch 68/88
Epoch 69/88
Epoch 70/88
Epoch 71/88
Epoch 72/88
Epoch 73/88
Epoch 74/88
Epoch 75/88
Epoch 76/88
Epoch 77/88
Epoch 78/88
Epoch 79/88
Epoch 80/88
Epoch 81/88
Epoch 82/88
Epoch 83/88
Epoch 84/88
E

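As a sanity check, we evaluate the model predictions on the development set. Because this set was also used for fine-tuning, the scores below do not estimate generalization performance: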

In [96]:
%%time
dev_preds = tf.nn.softmax(logits=model.predict({'input_ids': dev_ind}), 
                           axis=-1).numpy()

CPU times: user 10.3 s, sys: 1.33 s, total: 11.6 s
Wall time: 15.7 s


In [97]:
dev_preds.shape

(2856, 128, 3)
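
The softmax turns the 3 logits of every subtoken into class probabilities. These subtoken probabilities then have to be reduced to one label per word, which `ner_preds_brat_format` does according to `EVAL_STRATEGY = "word-max"`. The exact rule is implemented in `nlp_utils`; the sketch below assumes it means "use the class of the most confident subtoken of each word", which may not match the real implementation. Function names are hypothetical, and the class order `("B", "I", "O")` follows the alphabetical order used by `LabelEncoder`:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_labels(sub_probs, word_ids, classes=("B", "I", "O")):
    """For each word, take the class of its most confident subtoken (assumed 'word-max' rule)."""
    best = {}  # word id -> (confidence, class index)
    for probs, w in zip(sub_probs, word_ids):
        conf = max(probs)
        if w not in best or conf > best[w][0]:
            best[w] = (conf, probs.index(conf))
    return [classes[best[w][1]] for w in sorted(best)]
```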

In [98]:
out_dev_path = "dev_preds/"

In [99]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=dev_doc_list, fragments=dev_frag, preds=dev_preds, 
                                    start_end=dev_start_end_frag, word_id=dev_word_id, 
                                    lb_encoder=lab_encoder, df_text=df_text_dev, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_dev_path)

100%|██████████| 250/250 [00:00<00:00, 283.50it/s]
100%|██████████| 250/250 [00:06<00:00, 39.00it/s]
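
Writing predictions in BRAT format requires turning word-level IOB-2 labels back into character spans. The project does this inside `ner_preds_brat_format`; as an illustration only, a minimal decoder could merge each `B` word with the `I` words that follow it. The function name is hypothetical, and treating a stray `I` (one not preceded by `B` or `I`) as the start of a new entity is a choice made in this sketch:

```python
def iob2_to_spans(labels, word_spans):
    """Merge word-level IOB-2 labels into (start, end) character spans."""
    spans, current = [], None
    for label, (w_start, w_end) in zip(labels, word_spans):
        if label == "B" or (label == "I" and current is None):
            if current is not None:
                spans.append(tuple(current))  # close the previous entity
            current = [w_start, w_end]
        elif label == "I":
            current[1] = w_end  # extend the open entity
        else:  # "O"
            if current is not None:
                spans.append(tuple(current))
                current = None
    if current is not None:
        spans.append(tuple(current))
    return spans
```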


In [105]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/dev-set2/cantemist-ner/ -p ./dev_preds/ -s ner 


-----------------------------------------------------
Clinical case name			Precision
-----------------------------------------------------
cc_onco1001.ann		1.0
-----------------------------------------------------
cc_onco1007.ann		1.0
-----------------------------------------------------
cc_onco1008.ann		1.0
-----------------------------------------------------
cc_onco1009.ann		1.0
-----------------------------------------------------
cc_onco1010.ann		1.0
-----------------------------------------------------
cc_onco1011.ann		1.0
-----------------------------------------------------
cc_onco1012.ann		1.0
-----------------------------------------------------
cc_onco1014.ann		1.0
-----------------------------------------------------
cc_onco1016.ann		1.0
-----------------------------------------------------
cc_onco1018.ann		1.0
-----------------------------------------------------
cc_onco1019.ann		1.0
-----------------------------------------------------
cc_onco10

CPU times: user 10.7 ms, sys: 37.4 ms, total: 48.1 ms
Wall time: 1.03 s
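
The official evaluation script compares predicted and gold `.ann` files. Its exact matching rules are not reproduced in this notebook; the following sketch only shows how exact-match precision, recall and F1 can be computed once both sides are expressed as sets of `(doc_id, start, end)` tuples. The function name `span_prf` is hypothetical:

```python
def span_prf(gold, pred):
    """Exact-match precision, recall and F1 between two collections of (doc_id, start, end)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # spans predicted with exactly the gold offsets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A quick cross-check of this kind can be run on the dataframes directly, for example by building the gold set from `df_codes_dev_ner[["doc_id", "start", "end"]]`.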


## Test set predictions

In [101]:
%%time
test_path = corpus_path + "test-set/" + sub_task_path
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f) and f.split('.')[-1] == 'txt']
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 55.7 ms, sys: 20.1 ms, total: 75.7 ms
Wall time: 410 ms


In [102]:
df_text_test.shape

(300, 2)

In [103]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco877,"Anamnesis\nMujer de 59 años, alérgica a penici..."
1,cc_onco1075,"Anamnesis\nMujer de 52 años, sin alergias cono..."
2,cc_onco1450,"Anamnesis\nMujer de 51 años de edad, sin antec..."
3,cc_onco1165,Anamnesis\nPaciente varón de 75 años sin hábit...
4,cc_onco1298,"Anamnesis\nMujer de 60 años, exfumadora de 20 ..."


In [104]:
len(set(df_text_test['doc_id']))

300

In [105]:
df_text_test.raw_text[0]

'Anamnesis\nMujer de 59 años, alérgica a penicilina y procaína. Fumadora activa (IPA: 43).\nAntecedentes familiares: abuelo materno diagnosticado de carcinoma colon a los 70 años; madre diagnosticada de carcinoma de mama bilateral a los 50 años; padre fallecido de carcinoma gástrico a los 47 años; tres tías maternas diagnosticadas de carcinoma de mama a los 55, 56 y 57 años respectivamente; y tres primas afectas de cáncer de mama.\nAntecedentes personales: bronquitis crónica, poliposis colónica, carcinoma ductal infiltrante clásico mama pT2pN0M0 G2 subtipo tumoral luminal a (RH: +, HER-2: negativo) intervenido en agosto de 2013 mediante tumorectomía mama izquierda (patrón round block) + biopsia selectiva ganglio centinela (negativo) y posterior QT adyuvante con esquema TC (paclitaxel-ciclofosfamida) x 4 ciclos.\nAcude en noviembre de 2013 a visita de seguimiento tras finalizar tratamiento adyuvante. Asintomática.\n\nExploración física\nTemperatura axilar 36,5ºC, tensión arterial 130/83

In [106]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [107]:
len(test_doc_list)

300

In [113]:
# Sentence-Split data

In [108]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test-background/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 583 ms, sys: 274 ms, total: 857 ms
Wall time: 2.52 s


In [109]:
%%time
test_ind, test_att, test_type, _, test_frag, test_start_end_frag, test_word_id = ss_create_input_data_ner(df_text=df_text_test, 
                                                  text_col=text_col, 
                                                  # Test labels are discarded, so df_codes_train_ner_final is passed only as a placeholder for df_ann
                                                  df_ann=df_codes_train_ner_final, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 300/300 [00:26<00:00, 11.27it/s]

CPU times: user 26.7 s, sys: 115 ms, total: 26.8 s
Wall time: 26.7 s





In [116]:
# Sanity check

In [110]:
test_ind.shape

(3853, 128)

In [111]:
test_att.shape

(3853, 128)

In [112]:
test_type.shape

(3853, 128)

In [113]:
len(test_frag)

300

In [114]:
len(test_start_end_frag)

3853

In [115]:
len(test_word_id)

3853

In [116]:
%%time
test_preds = tf.nn.softmax(logits=model.predict({'input_ids': test_ind}), 
                           axis=-1).numpy()

CPU times: user 11.4 s, sys: 1.94 s, total: 13.3 s
Wall time: 19 s


In [117]:
test_preds.shape

(3853, 128, 3)

In [118]:
out_test_path = "test_preds/"

In [119]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, preds=test_preds, 
                                    start_end=test_start_end_frag, word_id=test_word_id, lb_encoder=lab_encoder, 
                                    df_text=df_text_test, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_test_path)

100%|██████████| 300/300 [00:01<00:00, 249.91it/s]
100%|██████████| 300/300 [00:08<00:00, 33.87it/s]


In [None]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/test-set/cantemist-ner/ -p ./test_preds/ -s ner 

In [None]:
# Save predictions on the test set

In [128]:
model_name = "mbert_galen_" + str(random_seed)

In [129]:
np.save(file="test_preds_" + model_name + ".npy", arr=test_preds)

In [134]:
doc_word_preds, doc_word_start_end = seq_ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, 
                           arr_start_end=test_start_end_frag, arr_word_id=test_word_id, arr_preds=test_preds, 
                           strategy=EVAL_STRATEGY)

100%|██████████| 300/300 [00:01<00:00, 251.08it/s]


In [135]:
import pickle

with open("test_doc_word_preds_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_preds, f)

with open("test_doc_word_start_end_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_start_end, f)