# Fine-tuning mBERT on Cantemist-NER

In this notebook, following a multi-class token classification approach, the mBERT model is fine-tuned on both the training and development sets of the Cantemist-NER corpus. Additionally, the predictions made by the model on the test set are saved in order to further evaluate its NER performance (see `results/Evaluation.ipynb`).

In [1]:
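import os

import numpy as np
import pandas as pd
import tensorflow as tf

# Auxiliary components (project-local helpers used throughout the notebook)
from nlp_utils import *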

from transformers import BertTokenizerFast
model_name = "multi_cased_L-12_H-768_A-12/"
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=False)

# Hyper-parameters
text_col = "raw_text"
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 69
LR = 3e-5

GREEDY = True
IGNORE_VALUE = -100
ANN_STRATEGY = "word-all"
EVAL_STRATEGY = "word-max"
LOGITS = True

random_seed = 0
tf.random.set_seed(random_seed)

## Load text

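First, all text files from the training and development Cantemist corpora are loaded into separate dataframes. Here, the training corpus combines `train-set` and `dev-set1`, while `dev-set2` serves as the development corpus.

The NER annotations are loaded afterwards.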

In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-ner/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train-set/" + sub_task_path
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f) and f.split('.')[-1] == "txt"]
n_train_files = len(train_files)
train_data = load_text_files(train_files, train_path)
dev1_path = corpus_path + "dev-set1/" + sub_task_path
train_files.extend([f for f in os.listdir(dev1_path) if os.path.isfile(dev1_path + f) and f.split('.')[-1] == "txt"])
train_data.extend(load_text_files(train_files[n_train_files:], dev1_path))
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 20.5 ms, sys: 0 ns, total: 20.5 ms
Wall time: 20.3 ms


In [4]:
df_text_train.shape

(751, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco453,"Anamnesis\nSe trata de un varón de 55 años, ex..."
1,cc_onco962,Anamnesis\nMujer de 40 años que consulta por l...
2,cc_onco989,"Anamnesis\nPaciente de 43 años, perimenopáusic..."
3,cc_onco187,"Anamnesis\nVarón de 72 años, exfumador y bebed..."
4,cc_onco164,"Anamnesis\nMujer de 51 años, sin alergias medi..."


In [6]:
len(set(df_text_train['doc_id']))

751

In [7]:
df_text_train.raw_text[0]

'Anamnesis\nSe trata de un varón de 55 años, ex fumador con un índice tabáquico de 40 paquetes-año, HTA, sin antecedentes familiares de interés, en tratamiento con ácido fólico, omeprazol, hierro oral y risperidona.\nEn junio de 2017, es diagnosticado a raíz de una trombosis iliaca derecha de una masa tumoral pobremente diferenciada que infiltraba tercio distal de apéndice cecal, con obliteración del paquete vascular iliaco, así como infiltración del uréter derecho, condicionando ureterohidronefrosis derecha grado cuatro.\nSe realiza biopsia de la lesión, con hallazgos anatomopatológicos de neoplasia fusocelular con inmunohistoquímica (IHC) sugerente de origen urotelial con diferenciación sarcomatoide (positividad para p63, p40, GATA3, CD99, EMA CAM 5.2, CK 34, beta E12, CK7, negatividad para CK20, S100, CD34, cKIT y TTF1), con SYT no reordenado.\nSe lleva a cabo intervención en julio de 2017 con extirpación de la masa tumoral, resección ileocecal, dejando ileostomía terminal y bypass 

### Development corpus

In [8]:
%%time
dev_path = corpus_path + "dev-set2/" + sub_task_path
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f) and f.split('.')[-1] == "txt"]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 6.89 ms, sys: 0 ns, total: 6.89 ms
Wall time: 6.66 ms


In [9]:
df_text_dev.shape

(250, 2)

In [10]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco1183,Anamnesis\nVarón de 48 años que acude en agost...
1,cc_onco751,Anamnesis\nHombre de 71 años de edad con antec...
2,cc_onco1384,"Anamnesis\nVarón de 30 años, sin antecedentes ..."
3,cc_onco1208,Anamnesis\nVarón de 76 años diagnosticado en a...
4,cc_onco734,Anamnesis\nTras canalización del reservorio ce...


In [11]:
len(set(df_text_dev['doc_id']))

250

In [12]:
df_text_dev.raw_text[0]

'Anamnesis\nVarón de 48 años que acude en agosto de 2012 tras la realización de amputación en hallux derecho a la primera valoración por Oncología Médica.\nAntecedentes: no alérgicos. Patológicos: diabético hace 3 años sin tratamiento. HTA en tratamiento. Enolismo crónico.\nTrastorno afectivo bipolar. Quirúrgicos: apendicectomizado.\nEnfermedad actual: desde hace 2 años presentaba una lesión hiperpigmentada en el hallux del pie derecho, que ocasionalmente sangraba.\n\nExamen físico\nMuñón de hallux derecho en buen estado, presencia de adenopatías inguinales derechas. Resto de exploración sin alteraciones.\n\nPruebas complementarias\n- Biopsia cutánea: melanoma ulcerado.\n- AP de resección de melanoma: melanoma lentiginoso acral en fase de crecimiento vertical, ulcerado. Clark IV. Breslow 6,8 mm, sin infiltración linfovascular.\n- TC de tórax-abdomen-pelvis: sin signos concluyentes de extensión tóraco-abdómino-pélvica de melanoma.\nMicronódulos en el lóbulo superior derecho a valorar en

## Process NER annotations

We load and pre-process the NER annotations in BRAT format available for the Cantemist-NER subtask.
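For reference, a BRAT `.ann` file stores one text-bound annotation per line as `ID<TAB>TYPE START END<TAB>TEXT`. The sketch below parses such lines into records with the `doc_id`, `text_ref`, `start` and `end` fields used in this notebook. It is not the `process_brat_ner` implementation from `nlp_utils`: the helper name `parse_brat_lines` and the sample lines are illustrative assumptions.

```python
def parse_brat_lines(doc_id, lines):
    """Parse text-bound BRAT annotations (IDs starting with 'T') into dicts."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("T"):
            continue  # skip notes, relations and any other non text-bound line
        ann_id, type_span, text_ref = line.split("\t", 2)
        if ";" in type_span:
            continue  # discontinuous spans are not handled in this sketch
        label, start, end = type_span.split(" ")
        records.append({
            "doc_id": doc_id,
            "text_ref": text_ref,
            "start": int(start),
            "end": int(end),
        })
    return records


# Illustrative input: one text-bound annotation followed by an annotator note
sample = [
    "T1\tMORFOLOGIA_NEOPLASIA 2719 2740\tCarcinoma microcítico\n",
    "#1\tAnnotatorNotes T1\t8041/3\n",
]
records = parse_brat_lines("cc_onco1", sample)
print(records)
```

Only the character offsets and the mention text are kept, since the NER subtask has a single entity type.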

In [13]:
# Training corpus

In [14]:
train_ann_files = [train_path + f for f in os.listdir(train_path) if f.split('.')[-1] == "ann"]
train_ann_files.extend([dev1_path + f for f in os.listdir(dev1_path) if f.split('.')[-1] == "ann"])

In [15]:
len(train_ann_files)

751

In [16]:
df_codes_train_ner = process_brat_ner(train_ann_files).sort_values(["doc_id", "start", "end"])

In [17]:
df_codes_train_ner.shape

(9737, 4)

In [18]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,text_ref,start,end
5230,cc_onco1,Carcinoma microcítico,2719,2740
5231,cc_onco1,carcinoma microcítico,2950,2971
5232,cc_onco1,M0,2988,2990
97,cc_onco10,tumor,212,217
95,cc_onco10,neoplasia,976,985


In [19]:
len(set(df_codes_train_ner["doc_id"]))

750

In [20]:
assert not df_codes_train_ner[["doc_id", "start", "end"]].duplicated().any()

In [21]:
# Development corpus

In [22]:
dev_ann_files = [dev_path + f for f in os.listdir(dev_path) if f.split('.')[-1] == "ann"]

In [23]:
len(dev_ann_files)

250

In [24]:
df_codes_dev_ner = process_brat_ner(dev_ann_files).sort_values(["doc_id", "start", "end"])

In [25]:
df_codes_dev_ner.shape

(2660, 4)

In [26]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,text_ref,start,end
1852,cc_onco1001,carcinoma epidermoide,576,597
1854,cc_onco1001,neoplasia,790,799
1857,cc_onco1001,adenocarcinoma T4N3M1b,836,858
1853,cc_onco1001,enfermedad hepática,1205,1224
1855,cc_onco1001,tumoral,2303,2310


In [27]:
df_codes_dev_ner.tail()

Unnamed: 0,doc_id,text_ref,start,end
1630,cc_onco994,tumoración,1604,1614
1629,cc_onco994,metastásica,3064,3075
1628,cc_onco994,macroadenoma,3752,3764
1632,cc_onco994,macroadenoma de la hipófisis,4068,4096
1627,cc_onco994,lesiones hepáticas,5378,5396


In [28]:
assert not df_codes_dev_ner[["doc_id", "start", "end"]].duplicated().any()

### Remove overlapping annotations
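
Token-level IOB-2 labelling cannot represent two annotations that share characters, so overlapping spans have to be resolved first. One common rule is to keep the longest span in each group of overlapping annotations. The sketch below illustrates that rule only; whether `eliminate_overlap` applies exactly this criterion is an assumption, and `drop_overlaps` and its toy offsets are hypothetical.

```python
def drop_overlaps(spans):
    """Keep the longest spans first; discard any span overlapping one already kept.

    `spans` is a list of (start, end) character offsets, with `end` exclusive.
    """
    kept = []
    # Longest spans are visited first; ties are broken by start offset
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


# Toy offsets: the first three spans overlap, the last one stands alone
toy_spans = [(10, 20), (10, 40), (22, 40), (50, 60)]
resolved = drop_overlaps(toy_spans)
print(resolved)
```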

In [29]:
# Training corpus

In [30]:
%%time
df_codes_train_ner_final = eliminate_overlap(df_ann=df_codes_train_ner)

100%|██████████| 750/750 [00:21<00:00, 34.29it/s]

CPU times: user 21.9 s, sys: 51.9 ms, total: 21.9 s
Wall time: 21.9 s





In [31]:
df_codes_train_ner_final.shape

(9605, 4)

In [32]:
# Development corpus

In [33]:
%%time
df_codes_dev_ner_final = eliminate_overlap(df_ann=df_codes_dev_ner)

100%|██████████| 250/250 [00:04<00:00, 52.16it/s]

CPU times: user 4.73 s, sys: 84.1 ms, total: 4.82 s
Wall time: 4.79 s





In [34]:
df_codes_dev_ner_final.shape

(2623, 4)

## Creation of annotated sequences

We create the corpus used to fine-tune the transformer model on the NER task. To do so, we split the texts into sentences and convert them into sequences of subtokens. Each generated subtoken is then assigned a NER label in IOB-2 format.
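
To make the labelling step concrete, the sketch below assigns IOB-2 tags to words from character-offset annotations and then copies each word's tag to all of its subtokens. This is how the `"word-all"` value of `ANN_STRATEGY` is interpreted here, which is an assumption, although it agrees with the printed examples in this notebook where every subtoken of an annotated word carries the same tag. The helper names and the toy word/subtoken split are illustrative; the actual logic lives in `ss_create_input_data_ner`.

```python
def word_iob2(word_spans, ann_spans):
    """Assign a B/I/O tag to each word given non-overlapping annotation spans."""
    labels = []
    for w_start, w_end in word_spans:
        label = "O"
        for a_start, a_end in ann_spans:
            if w_start < a_end and w_end > a_start:  # word overlaps annotation
                # The word covering the annotation start opens the entity
                label = "B" if w_start <= a_start else "I"
                break
        labels.append(label)
    return labels


def propagate_to_subtokens(word_labels, subtoken_word_ids):
    """Copy each word-level tag to every subtoken of that word."""
    return [word_labels[word_id] for word_id in subtoken_word_ids]


# Toy example: "tumor neuroendocrino bien", annotation covering the first two words
word_spans = [(0, 5), (6, 20), (21, 25)]
ann_spans = [(0, 20)]
word_labels = word_iob2(word_spans, ann_spans)

# Hypothetical subtoken split: the second word is broken into three pieces
subtoken_word_ids = [0, 1, 1, 1, 2]
subtoken_labels = propagate_to_subtokens(word_labels, subtoken_word_ids)
print(word_labels, subtoken_labels)
```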

In [35]:
# Sentence-Split information
ss_corpus_path = "../datasets/Cantemist-SSplit-text/"

In [36]:
from sklearn.preprocessing import LabelEncoder

lab_encoder = LabelEncoder()
# IOB-2 format
lab_encoder.fit(["B", "I", "O"])

LabelEncoder()

### Training corpus

Only training texts with NER annotations are considered:

In [37]:
# Number of training documents (texts) without any annotation
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner_final["doc_id"]))

1

In [38]:
train_doc_list = sorted(set(df_codes_train_ner_final["doc_id"]))

In [39]:
len(train_doc_list)

750

In [40]:
# Sentence-Split data

In [41]:
%%time
ss_sub_corpus_path = ss_corpus_path + "training/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 31.7 ms, sys: 4 ms, total: 35.7 ms
Wall time: 35.5 ms


In [42]:
%%time
train_ind, train_att, train_type, train_y, train_frag, train_start_end_frag, train_word_id = ss_create_input_data_ner(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner_final, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 750/750 [01:16<00:00,  9.86it/s]

CPU times: user 1min 16s, sys: 169 ms, total: 1min 16s
Wall time: 1min 16s





In [43]:
# Sanity check

In [44]:
train_ind.shape

(10619, 128)

In [45]:
train_att.shape

(10619, 128)

In [46]:
train_type.shape

(10619, 128)

In [47]:
train_y.shape

(10619, 128)

In [48]:
len(train_frag)

750

In [49]:
len(train_start_end_frag)

10619

In [50]:
len(train_word_id)

10619

In [51]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    750.000000
mean      14.158667
std        4.858494
min        4.000000
25%       11.000000
50%       14.000000
75%       17.000000
max       41.000000
dtype: float64
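
The fragment counts above result from packing the sentences of each document into sequences of at most `SEQ_LEN` subtokens. Assuming that `GREEDY = True` means consecutive sentences are appended to the current fragment until the next one no longer fits, the idea can be sketched as follows. The function `greedy_pack`, the two reserved positions for `[CLS]` and `[SEP]`, and the toy sentence lengths are assumptions made for illustration.

```python
def greedy_pack(sentence_lengths, seq_len=128, n_special=2):
    """Group consecutive sentences into fragments of at most seq_len - n_special subtokens.

    Returns a list of fragments, each a list of sentence indices.
    """
    capacity = seq_len - n_special
    fragments, current, used = [], [], 0
    for idx, length in enumerate(sentence_lengths):
        if length > capacity:
            raise ValueError("splitting over-long sentences is not covered in this sketch")
        if used + length > capacity:
            # The sentence does not fit: close the current fragment and start a new one
            fragments.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        fragments.append(current)
    return fragments


# Toy subtoken counts for five consecutive sentences of one document
packed = greedy_pack([50, 60, 30, 100, 20])
print(packed)
```

Packing several short sentences into one fragment reduces padding compared with encoding each sentence separately.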

In [52]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [53]:
check_id

545

In [54]:
train_doc_list[check_id]

'cc_onco767'

In [55]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Anamnesis\nHistoria oncológica:\nOctubre/2011: realización de ecografía abdominal por cólico nefrítico donde se observan metástasis hepáticas. Explica aumento durante los últimos 6 meses del flushing habitual, diarrea y pérdida de peso. Inicia estudio diagnóstico ambulatorio.\n\nExploración física\nECOG 1. Índice de masa corporal (IMC) 30 mg/m2. Hallazgos destacables: hepatomegalia de 3 traveses y flushing en bipedestación.\n\nPruebas complementarias\nEn el análisis sanguíneo no había hallazgos remarcables. Destaca en orina de 24 h una elevación de 5HIIA de 89,4 mg/24 h (valores de referencia < 8,2 mg/24 h) y cromogranina A de 6.110 mg/ml (valores de referencia < 134 ng/ml).\nComo pruebas radiológicas, la TC mostraba una masa parahiliar izquierda de 18 mm e imágenes hepáticas hipodensas sugestivas de metástasis. En cuanto al Octreoscan®, únicamente captaban las lesiones hepáticas con expresión de receptores de la somatostatina.\nLa biopsia hepática reveló la presencia de un tumor neur

In [56]:
df_codes_train_ner_final[df_codes_train_ner_final["doc_id"] == train_doc_list[check_id]]

Unnamed: 0,doc_id,text_ref,start,end
3021,cc_onco767,metástasis,119,129
3022,cc_onco767,metástasis,803,813
3025,cc_onco767,lesiones hepáticas,865,883
3026,cc_onco767,tumor neuroendocrino bien diferenciado,979,1017
3024,cc_onco767,tumor neuroendocrino,1298,1318
3027,cc_onco767,carcinoide atípico,1320,1338
3028,cc_onco767,lesiones hepáticas,1474,1492
3023,cc_onco767,tumor,1498,1503
3029,cc_onco767,tumor carcinoide atípico pulmonar bien diferen...,1572,1638
3030,cc_onco767,carcinoide,1703,1713


In [57]:
check_id_frag = sum(train_frag[:check_id])

In [58]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i], train_word_id[i], 
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in train_y[i][1:len(train_start_end_frag[i])+1]])))
    print("\n")

[('Ana', (0, 9), 0, 'O'), ('##mne', (0, 9), 0, 'O'), ('##sis', (0, 9), 0, 'O'), ('Historia', (10, 18), 1, 'O'), ('on', (19, 29), 2, 'O'), ('##col', (19, 29), 2, 'O'), ('##ógica', (19, 29), 2, 'O'), (':', (29, 30), 3, 'O'), ('Oct', (31, 38), 4, 'O'), ('##ub', (31, 38), 4, 'O'), ('##re', (31, 38), 4, 'O'), ('/', (38, 39), 5, 'O'), ('2011', (39, 43), 6, 'O'), (':', (43, 44), 7, 'O'), ('realización', (45, 56), 8, 'O'), ('de', (57, 59), 9, 'O'), ('e', (60, 69), 10, 'O'), ('##co', (60, 69), 10, 'O'), ('##grafía', (60, 69), 10, 'O'), ('ab', (70, 79), 11, 'O'), ('##dom', (70, 79), 11, 'O'), ('##inal', (70, 79), 11, 'O'), ('por', (80, 83), 12, 'O'), ('có', (84, 90), 13, 'O'), ('##lico', (84, 90), 13, 'O'), ('nef', (91, 100), 14, 'O'), ('##rí', (91, 100), 14, 'O'), ('##tico', (91, 100), 14, 'O'), ('donde', (101, 106), 15, 'O'), ('se', (107, 109), 16, 'O'), ('observa', (110, 118), 17, 'O'), ('##n', (110, 118), 17, 'O'), ('met', (119, 129), 18, 'B'), ('##ást', (119, 129), 18, 'B'), ('##asis', (119

In [59]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

[CLS] Ana ##mne ##sis Historia on ##col ##ógica : Oct ##ub ##re / 2011 : realización de e ##co ##grafía ab ##dom ##inal por có ##lico nef ##rí ##tico donde se observa ##n met ##ást ##asis hep ##áticas . Ex ##pli ##ca aumento durante los últimos 6 meses del fl ##ush ##ing habitual , dia ##rrea y pérdida de peso . Ini ##cia estudio dia ##gnóstico amb ##ulator ##io . Ex ##plo ##ración física EC ##O ##G 1 . Í ##ndi ##ce de masa corporal ( IM ##C ) 30 mg / m2 . Hall ##azgo ##s destaca ##bles : hep ##ato ##me ##gali ##a de 3 tra ##ves ##es y fl ##ush ##ing en bi ##ped ##esta ##ción . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Pr ##ue ##bas complement ##arias En el análisis sang ##u ##íne ##o no había hall ##azgo ##s re ##marca ##bles . Des ##taca en ori ##na de 24 h una elev ##ación de 5 ##HI ##IA de 89 , 4 mg / 24 h ( valores de referencia < 8 , 2 mg / 24 h ) y c ##romo ##gra ##nina A de 6 . 110 mg / ml ( valores de referencia < 134 ng / ml ) . Como pruebas radio ##lógica ##s ,

### Development corpus

Only development texts with NER annotations are considered:

In [61]:
# All development documents (texts) are annotated 
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner_final["doc_id"]))

0

In [62]:
dev_doc_list = sorted(set(df_codes_dev_ner_final["doc_id"]))

In [63]:
len(dev_doc_list)

250

In [64]:
# Sentence-Split data

In [65]:
%%time
ss_sub_corpus_path = ss_corpus_path + "development/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 11.7 ms, sys: 19 µs, total: 11.7 ms
Wall time: 11.5 ms


In [66]:
%%time
dev_ind, dev_att, dev_type, dev_y, dev_frag, dev_start_end_frag, dev_word_id = ss_create_input_data_ner(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner_final, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

 84%|████████▎ | 209/250 [00:17<00:03, 12.95it/s]

I
doc_id      cc_onco1427
text_ref        pT3N2Mx
start              1928
end                1935
Name: 803, dtype: object
11
11
[[1849 1858]
 [1859 1867]
 [1868 1872]
 [1873 1887]
 [1888 1898]
 [1899 1900]
 [1900 1905]
 [1906 1916]
 [1917 1918]
 [1919 1924]
 [1924 1925]
 [1926 1935]
 [1935 1936]
 [1937 1939]
 [1940 1948]
 [1949 1952]
 [1953 1961]
 [1962 1963]
 [1964 1971]
 [1972 1980]
 [1981 1985]
 [1986 1996]
 [1996 1997]]


100%|██████████| 250/250 [00:20<00:00, 12.03it/s]

CPU times: user 20.9 s, sys: 12 ms, total: 20.9 s
Wall time: 20.9 s





In [67]:
# Sanity check

In [68]:
dev_ind.shape

(2856, 128)

In [69]:
dev_att.shape

(2856, 128)

In [70]:
dev_type.shape

(2856, 128)

In [71]:
dev_y.shape

(2856, 128)

In [72]:
len(dev_frag)

250

In [73]:
len(dev_start_end_frag)

2856

In [74]:
len(dev_word_id)

2856

In [75]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      11.424000
std        5.025659
min        3.000000
25%        8.000000
50%       10.000000
75%       14.000000
max       38.000000
dtype: float64

In [76]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [77]:
check_id

54

In [78]:
dev_doc_list[check_id]

'cc_onco1098'

In [79]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Anamnesis\nPaciente de 44 años sin antecedentes médicos de interés salvo miomas uterinos en seguimiento por Ginecología.\nAcude a Urgencias en noviembre de 2013 refiriendo molestias orales de una semana de evolución con tumoración en la región madibular izquierda sangrante a la manipulación. Acompañaba acorcharmiento hemimandibular y del labio inferior izquierdo.\n\nExamen físico\nA nivel oral se identificaba una tumoración ovalada de aproximadamente 3 cm diámetro, móvil, firme y no dolorosa, pediculada a la mucosa gingival de la zona molar.\n\nPruebas complementarias\n- Ortopantografía (5/11/2013). osteólisis mandibular en el cuarto cuadrante con extensión a la rama mandibular izquierda.\n- Biopsia de la lesión oral (5/11/2013): leiomiosarcoma grado 2 (sistema de grado FNCLCC).\n- TC cervicofacial (7/11/2013): lesión osteolítica en la hemimandíbula izquierda que asienta en su región retromolar y ángulo mandibular, con afectación del canal del nervio dentario inferior, que expande y r

In [80]:
df_codes_dev_ner_final[df_codes_dev_ner_final["doc_id"] == dev_doc_list[check_id]]

Unnamed: 0,doc_id,text_ref,start,end
834,cc_onco1098,miomas,72,78
838,cc_onco1098,tumoración,218,228
839,cc_onco1098,tumoración,412,422
841,cc_onco1098,leiomiosarcoma grado 2 (sistema de grado FNCLCC),731,779
823,cc_onco1098,lesión osteolítica,813,831
836,cc_onco1098,neoplasia maligna,1076,1093
842,cc_onco1098,leiomiosarcoma grado 2,1128,1150
827,cc_onco1098,metástasis,1572,1582
828,cc_onco1098,metástasis,1702,1712
817,cc_onco1098,Leiomiosarcoma,1958,1972


In [81]:
check_id_frag = sum(dev_frag[:check_id])

In [82]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i], dev_word_id[i],
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in dev_y[i][1:len(dev_start_end_frag[i])+1]])))
    print("\n")

[('Ana', (0, 9), 0, 'O'), ('##mne', (0, 9), 0, 'O'), ('##sis', (0, 9), 0, 'O'), ('Pac', (10, 18), 1, 'O'), ('##iente', (10, 18), 1, 'O'), ('de', (19, 21), 2, 'O'), ('44', (22, 24), 3, 'O'), ('años', (25, 29), 4, 'O'), ('sin', (30, 33), 5, 'O'), ('ante', (34, 46), 6, 'O'), ('##cedent', (34, 46), 6, 'O'), ('##es', (34, 46), 6, 'O'), ('médicos', (47, 54), 7, 'O'), ('de', (55, 57), 8, 'O'), ('interés', (58, 65), 9, 'O'), ('salvo', (66, 71), 10, 'O'), ('mio', (72, 78), 11, 'B'), ('##mas', (72, 78), 11, 'B'), ('ute', (79, 87), 12, 'O'), ('##rinos', (79, 87), 12, 'O'), ('en', (88, 90), 13, 'O'), ('seg', (91, 102), 14, 'O'), ('##ui', (91, 102), 14, 'O'), ('##miento', (91, 102), 14, 'O'), ('por', (103, 106), 15, 'O'), ('G', (107, 118), 16, 'O'), ('##ine', (107, 118), 16, 'O'), ('##colo', (107, 118), 16, 'O'), ('##gía', (107, 118), 16, 'O'), ('.', (118, 119), 17, 'O'), ('A', (120, 125), 18, 'O'), ('##cud', (120, 125), 18, 'O'), ('##e', (120, 125), 18, 'O'), ('a', (126, 127), 19, 'O'), ('Ur', (12

In [83]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

[CLS] Ana ##mne ##sis Pac ##iente de 44 años sin ante ##cedent ##es médicos de interés salvo mio ##mas ute ##rinos en seg ##ui ##miento por G ##ine ##colo ##gía . A ##cud ##e a Ur ##gen ##cias en noviembre de 2013 ref ##iri ##endo mol ##esti ##as oral ##es de una semana de evolución con tumor ##ación en la región ma ##di ##bula ##r izquierda sang ##rante a la mani ##pul ##ación . A ##com ##pa ##ña ##ba ac ##or ##char ##miento hem ##iman ##di ##bula ##r y del lab ##io inferior izquierdo . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Ex ##amen físico A nivel oral se identifica ##ba una tumor ##ación oval ##ada de aproximadamente 3 cm diámetro , mó ##vil , firme y no dolor ##osa , pe ##dic ##ula ##da a la mu ##cosa ging ##ival de la zona mol ##ar . Pr ##ue ##bas complement ##arias - Ort ##opa ##nto ##grafía ( 5 / 11 / 2013 ) . os 

### Training & Development corpus

We merge the previously generated datasets:

In [85]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [86]:
train_dev_ind.shape

(13475, 128)

In [87]:
# Attention
train_dev_att = np.concatenate((train_att, dev_att))

In [88]:
train_dev_att.shape

(13475, 128)

In [89]:
# Type
train_dev_type = np.concatenate((train_type, dev_type))

In [90]:
train_dev_type.shape

(13475, 128)

In [91]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [92]:
train_dev_y.shape

(13475, 128)

## Fine-tuning

Using the corpus of labeled subtoken sequences, we fine-tune the model on a multi-class token classification task.

In [93]:
from transformers import TFBertForTokenClassification

model = TFBertForTokenClassification.from_pretrained(model_name, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [94]:
model.summary()

Model: "tf_bert_for_token_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  177262848 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 177,264,386
Trainable params: 177,264,386
Non-trainable params: 0
_________________________________________________________________


In [95]:
model.layers

[<transformers.models.bert.modeling_tf_bert.TFBertMainLayer at 0x7efee95e0990>,
 <tensorflow.python.keras.layers.core.Dropout at 0x7efeaee7bf50>,
 <tensorflow.python.keras.layers.core.Dense at 0x7efeacff4250>]

In [96]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')

num_labels = len(lab_encoder.classes_)

out_seq = model.layers[0](input_ids=input_ids)[0] # take the output sub-token sequence 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(out_seq) # Multi-class classification

model = Model(inputs=input_ids, outputs=out_logits)

In [97]:
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_ids (InputLayer)       [(None, 128)]             0         
_________________________________________________________________
bert (TFBertMainLayer)       TFBaseModelOutputWithPool 177262848 
_________________________________________________________________
dense (Dense)                (None, 128, 3)            2307      
Total params: 177,265,155
Trainable params: 177,265,155
Non-trainable params: 0
_________________________________________________________________


In [98]:
model.input

<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>

In [99]:
model.output

<tf.Tensor 'dense/Identity:0' shape=(None, 128, 3) dtype=float32>

In [None]:
%%time
from tensorflow.keras import optimizers, losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = TokenClassificationLoss(from_logits=LOGITS, ignore_val=IGNORE_VALUE)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_ind}, 
                    y=train_dev_y, batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)

Epoch 1/69
 11/843 [..............................] - ETA: 3:47 - loss: 1.4236
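
The loss used above, `TokenClassificationLoss`, comes from `nlp_utils` and its code is not shown in this notebook. Assuming it behaves like a standard masked cross-entropy, positions labelled with `IGNORE_VALUE` (presumably special and padding tokens) contribute nothing to the loss, and the remaining positions are averaged. The plain-Python sketch below illustrates that computation for a single sequence; `masked_cross_entropy` is a hypothetical stand-in, not the TensorFlow implementation.

```python
import math

IGNORE_VALUE = -100


def masked_cross_entropy(logits, labels, ignore_val=IGNORE_VALUE):
    """Mean softmax cross-entropy over positions whose label differs from ignore_val."""
    total, count = 0.0, 0
    for position_logits, label in zip(logits, labels):
        if label == ignore_val:
            continue  # masked position: no contribution to the loss
        # Numerically stable log-sum-exp of the logits
        max_logit = max(position_logits)
        log_norm = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in position_logits)
        )
        total += log_norm - position_logits[label]
        count += 1
    return total / count if count else 0.0


# Three positions with three classes each; the last position is masked
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_labels = [0, 2, IGNORE_VALUE]
toy_loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(toy_loss, 4))
```

Because the masked position is skipped, changing its logits leaves the loss unchanged.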

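As a sanity check, we evaluate the model predictions on the development set. Since this set was also used for fine-tuning, the resulting scores are not an estimate of generalization performance: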

In [101]:
%%time
dev_preds = tf.nn.softmax(logits=model.predict({'input_ids': dev_ind}), 
                           axis=-1).numpy()

CPU times: user 10.2 s, sys: 1.39 s, total: 11.6 s
Wall time: 15.6 s


In [102]:
dev_preds.shape

(2856, 128, 3)

In [103]:
out_dev_path = "dev_preds/"

In [104]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=dev_doc_list, fragments=dev_frag, preds=dev_preds, 
                                    start_end=dev_start_end_frag, word_id=dev_word_id, 
                                    lb_encoder=lab_encoder, df_text=df_text_dev, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_dev_path)

100%|██████████| 250/250 [00:00<00:00, 284.71it/s]
100%|██████████| 250/250 [00:00<00:00, 291.69it/s]


In [105]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/dev-set2/cantemist-ner/ -p ./dev_preds/ -s ner 


-----------------------------------------------------
Clinical case name			Precision
-----------------------------------------------------
cc_onco1001.ann		0.857
-----------------------------------------------------
cc_onco1007.ann		1.0
-----------------------------------------------------
cc_onco1008.ann		1.0
-----------------------------------------------------
cc_onco1009.ann		1.0
-----------------------------------------------------
cc_onco1010.ann		1.0
-----------------------------------------------------
cc_onco1011.ann		1.0
-----------------------------------------------------
cc_onco1012.ann		1.0
-----------------------------------------------------
cc_onco1014.ann		1.0
-----------------------------------------------------
cc_onco1016.ann		1.0
-----------------------------------------------------
cc_onco1018.ann		1.0
-----------------------------------------------------
cc_onco1019.ann		1.0
-----------------------------------------------------
cc_onco

CPU times: user 9.34 ms, sys: 37.5 ms, total: 46.8 ms
Wall time: 1.03 s
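
As a rough illustration of how span-level NER scores such as the precision reported above can be obtained, the following sketch counts a prediction as correct only when its document and character offsets exactly match a gold annotation. It is a simplified stand-in, not the code of the Cantemist evaluation library; `span_prf`, the document id and the offsets are made up for the example.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall and F1 over (doc_id, start, end) triples."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for a single hypothetical document
gold_spans = {("doc_a", 72, 78), ("doc_a", 218, 228), ("doc_a", 412, 422)}
pred_spans = {("doc_a", 72, 78), ("doc_a", 218, 228),
              ("doc_a", 500, 510), ("doc_a", 600, 610)}
precision, recall, f1 = span_prf(gold_spans, pred_spans)
print(precision, recall, f1)
```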


## Test set predictions

In [106]:
%%time
test_path = corpus_path + "test-set/" + sub_task_path
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f) and f.split('.')[-1] == 'txt']
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 10.6 ms, sys: 0 ns, total: 10.6 ms
Wall time: 9.89 ms


In [107]:
df_text_test.shape

(300, 2)

In [108]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco877,"Anamnesis\nMujer de 59 años, alérgica a penici..."
1,cc_onco1075,"Anamnesis\nMujer de 52 años, sin alergias cono..."
2,cc_onco1450,"Anamnesis\nMujer de 51 años de edad, sin antec..."
3,cc_onco1165,Anamnesis\nPaciente varón de 75 años sin hábit...
4,cc_onco1298,"Anamnesis\nMujer de 60 años, exfumadora de 20 ..."


In [109]:
len(set(df_text_test['doc_id']))

300

In [110]:
df_text_test.raw_text[0]

'Anamnesis\nMujer de 59 años, alérgica a penicilina y procaína. Fumadora activa (IPA: 43).\nAntecedentes familiares: abuelo materno diagnosticado de carcinoma colon a los 70 años; madre diagnosticada de carcinoma de mama bilateral a los 50 años; padre fallecido de carcinoma gástrico a los 47 años; tres tías maternas diagnosticadas de carcinoma de mama a los 55, 56 y 57 años respectivamente; y tres primas afectas de cáncer de mama.\nAntecedentes personales: bronquitis crónica, poliposis colónica, carcinoma ductal infiltrante clásico mama pT2pN0M0 G2 subtipo tumoral luminal a (RH: +, HER-2: negativo) intervenido en agosto de 2013 mediante tumorectomía mama izquierda (patrón round block) + biopsia selectiva ganglio centinela (negativo) y posterior QT adyuvante con esquema TC (paclitaxel-ciclofosfamida) x 4 ciclos.\nAcude en noviembre de 2013 a visita de seguimiento tras finalizar tratamiento adyuvante. Asintomática.\n\nExploración física\nTemperatura axilar 36,5ºC, tensión arterial 130/83

In [111]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [112]:
len(test_doc_list)

300

In [113]:
# Sentence-Split data

In [113]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test-background/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 139 ms, sys: 22.1 ms, total: 161 ms
Wall time: 161 ms


In [114]:
%%time
test_ind, test_att, test_type, _, test_frag, test_start_end_frag, test_word_id = ss_create_input_data_ner(df_text=df_text_test, 
                                                  text_col=text_col, 
                                                  # Since labels are ignored, we pass df_codes_train_ner_final as df_ann
                                                  df_ann=df_codes_train_ner_final, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 300/300 [00:27<00:00, 11.01it/s]

CPU times: user 27.5 s, sys: 0 ns, total: 27.5 s
Wall time: 27.4 s





In [116]:
# Sanity check

In [115]:
test_ind.shape

(3853, 128)

In [116]:
test_att.shape

(3853, 128)

In [117]:
test_type.shape

(3853, 128)

In [118]:
len(test_frag)

300

In [119]:
len(test_start_end_frag)

3853

In [120]:
len(test_word_id)

3853

In [121]:
%%time
test_preds = tf.nn.softmax(logits=model.predict({'input_ids': test_ind}), 
                           axis=-1).numpy()

CPU times: user 11.7 s, sys: 1.81 s, total: 13.5 s
Wall time: 18.9 s


In [122]:
test_preds.shape

(3853, 128, 3)

In [131]:
out_test_path = "test_preds/"

In [132]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, preds=test_preds, 
                                    start_end=test_start_end_frag, word_id=test_word_id, lb_encoder=lab_encoder, 
                                    df_text=df_text_test, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_test_path)

100%|██████████| 300/300 [00:01<00:00, 244.00it/s]
100%|██████████| 300/300 [00:01<00:00, 257.71it/s]


In [None]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/test-set/cantemist-ner/ -p ./test_preds/ -s ner 

In [None]:
# Save predictions on the test set

In [126]:
model_name = "mbert_" + str(random_seed)

In [127]:
np.save(file="test_preds_" + model_name + ".npy", arr=test_preds)

In [134]:
doc_word_preds, doc_word_start_end = seq_ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, 
                           arr_start_end=test_start_end_frag, arr_word_id=test_word_id, arr_preds=test_preds, 
                           strategy=EVAL_STRATEGY)

100%|██████████| 300/300 [00:01<00:00, 245.01it/s]


In [135]:
import pickle

with open("test_doc_word_preds_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_preds, f)

with open("test_doc_word_start_end_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_start_end, f)