# Fine-tuning BETO-Galén on Cantemist-NER

In this notebook, the BETO-Galén model is fine-tuned on both the training and development sets of the Cantemist-NER corpus, following a multi-class token classification approach. The model's predictions on the test set are also saved so that its NER performance can be evaluated further (see `results/Evaluation.ipynb`).

In [1]:
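import os

import numpy as np
import pandas as pd
import tensorflow as tf

# Auxiliary components (os, np and pd are imported explicitly above
# rather than relying on the star import to provide them)
from nlp_utils import *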

from transformers import BertTokenizerFast
model_name = "BETO-Galen/"
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=False)

# Hyper-parameters
text_col = "raw_text"
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 73
LR = 3e-5

GREEDY = True
IGNORE_VALUE = -100
ANN_STRATEGY = "word-all"
EVAL_STRATEGY = "word-max"
LOGITS = True

random_seed = 0
tf.random.set_seed(random_seed)

## Load text

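First, all text files from the training and development Cantemist corpora are loaded into separate dataframes. The training dataframe combines `train-set` and `dev-set1`, while `dev-set2` is used as the development dataframe.

The NER annotations are loaded as well.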

In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-ner/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train-set/" + sub_task_path
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f) and f.split('.')[-1] == "txt"]
n_train_files = len(train_files)
train_data = load_text_files(train_files, train_path)
dev1_path = corpus_path + "dev-set1/" + sub_task_path
train_files.extend([f for f in os.listdir(dev1_path) if os.path.isfile(dev1_path + f) and f.split('.')[-1] == "txt"])
train_data.extend(load_text_files(train_files[n_train_files:], dev1_path))
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 9.74 ms, sys: 11.4 ms, total: 21.1 ms
Wall time: 19.6 ms


In [4]:
df_text_train.shape

(751, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco453,"Anamnesis\nSe trata de un varón de 55 años, ex..."
1,cc_onco962,Anamnesis\nMujer de 40 años que consulta por l...
2,cc_onco989,"Anamnesis\nPaciente de 43 años, perimenopáusic..."
3,cc_onco187,"Anamnesis\nVarón de 72 años, exfumador y bebed..."
4,cc_onco164,"Anamnesis\nMujer de 51 años, sin alergias medi..."


In [6]:
len(set(df_text_train['doc_id']))

751

In [7]:
df_text_train.raw_text[0]

'Anamnesis\nSe trata de un varón de 55 años, ex fumador con un índice tabáquico de 40 paquetes-año, HTA, sin antecedentes familiares de interés, en tratamiento con ácido fólico, omeprazol, hierro oral y risperidona.\nEn junio de 2017, es diagnosticado a raíz de una trombosis iliaca derecha de una masa tumoral pobremente diferenciada que infiltraba tercio distal de apéndice cecal, con obliteración del paquete vascular iliaco, así como infiltración del uréter derecho, condicionando ureterohidronefrosis derecha grado cuatro.\nSe realiza biopsia de la lesión, con hallazgos anatomopatológicos de neoplasia fusocelular con inmunohistoquímica (IHC) sugerente de origen urotelial con diferenciación sarcomatoide (positividad para p63, p40, GATA3, CD99, EMA CAM 5.2, CK 34, beta E12, CK7, negatividad para CK20, S100, CD34, cKIT y TTF1), con SYT no reordenado.\nSe lleva a cabo intervención en julio de 2017 con extirpación de la masa tumoral, resección ileocecal, dejando ileostomía terminal y bypass 

### Development corpus

In [8]:
%%time
dev_path = corpus_path + "dev-set2/" + sub_task_path
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f) and f.split('.')[-1] == "txt"]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 7.02 ms, sys: 0 ns, total: 7.02 ms
Wall time: 6.78 ms


In [9]:
df_text_dev.shape

(250, 2)

In [10]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco1183,Anamnesis\nVarón de 48 años que acude en agost...
1,cc_onco751,Anamnesis\nHombre de 71 años de edad con antec...
2,cc_onco1384,"Anamnesis\nVarón de 30 años, sin antecedentes ..."
3,cc_onco1208,Anamnesis\nVarón de 76 años diagnosticado en a...
4,cc_onco734,Anamnesis\nTras canalización del reservorio ce...


In [11]:
len(set(df_text_dev['doc_id']))

250

In [12]:
df_text_dev.raw_text[0]

'Anamnesis\nVarón de 48 años que acude en agosto de 2012 tras la realización de amputación en hallux derecho a la primera valoración por Oncología Médica.\nAntecedentes: no alérgicos. Patológicos: diabético hace 3 años sin tratamiento. HTA en tratamiento. Enolismo crónico.\nTrastorno afectivo bipolar. Quirúrgicos: apendicectomizado.\nEnfermedad actual: desde hace 2 años presentaba una lesión hiperpigmentada en el hallux del pie derecho, que ocasionalmente sangraba.\n\nExamen físico\nMuñón de hallux derecho en buen estado, presencia de adenopatías inguinales derechas. Resto de exploración sin alteraciones.\n\nPruebas complementarias\n- Biopsia cutánea: melanoma ulcerado.\n- AP de resección de melanoma: melanoma lentiginoso acral en fase de crecimiento vertical, ulcerado. Clark IV. Breslow 6,8 mm, sin infiltración linfovascular.\n- TC de tórax-abdomen-pelvis: sin signos concluyentes de extensión tóraco-abdómino-pélvica de melanoma.\nMicronódulos en el lóbulo superior derecho a valorar en

## Process NER annotations

We load and pre-process the NER annotations in BRAT format available for the Cantemist-NER subtask.
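As a rough illustration of what this pre-processing involves, the sketch below parses text-bound BRAT lines into `(doc_id, text_ref, start, end)` rows, the same columns shown in the dataframes of this section. It is not the `process_brat_ner` implementation from `nlp_utils`, which is not shown here; the function name `parse_brat_ann`, the entity label and the sample lines are assumptions made for the example.

```python
from typing import List, Tuple


def parse_brat_ann(doc_id: str, ann_text: str) -> List[Tuple[str, str, int, int]]:
    """Parse text-bound BRAT annotations into (doc_id, text_ref, start, end) rows."""
    rows = []
    for line in ann_text.splitlines():
        # Text-bound annotations start with "T"; other lines (e.g. "#" notes) are skipped
        if not line.startswith("T"):
            continue
        _ann_id, type_and_span, text_ref = line.split("\t", 2)
        _label, span = type_and_span.split(" ", 1)
        if ";" in span:
            # Discontinuous spans are out of scope for this sketch
            continue
        start, end = (int(x) for x in span.split())
        rows.append((doc_id, text_ref, start, end))
    # Same ordering as in the notebook: by document, then start, then end
    return sorted(rows, key=lambda r: (r[0], r[2], r[3]))


# Hypothetical .ann content (the label name is illustrative)
example = (
    "T1\tMORFOLOGIA_NEOPLASIA 2950 2971\tcarcinoma microcítico\n"
    "T2\tMORFOLOGIA_NEOPLASIA 2719 2740\tCarcinoma microcítico\n"
    "#1\tAnnotatorNotes T1\texample note"
)
rows = parse_brat_ann("cc_onco1", example)
for row in rows:
    print(row)
```

Only the character offsets and the mention text are kept, since this subtask has a single entity type and the IOB-2 labels used later do not encode the type.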

In [13]:
# Training corpus

In [14]:
train_ann_files = [train_path + f for f in os.listdir(train_path) if f.split('.')[-1] == "ann"]
train_ann_files.extend([dev1_path + f for f in os.listdir(dev1_path) if f.split('.')[-1] == "ann"])

In [15]:
len(train_ann_files)

751

In [16]:
df_codes_train_ner = process_brat_ner(train_ann_files).sort_values(["doc_id", "start", "end"])

In [17]:
df_codes_train_ner.shape

(9737, 4)

In [18]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,text_ref,start,end
5230,cc_onco1,Carcinoma microcítico,2719,2740
5231,cc_onco1,carcinoma microcítico,2950,2971
5232,cc_onco1,M0,2988,2990
97,cc_onco10,tumor,212,217
95,cc_onco10,neoplasia,976,985


In [19]:
len(set(df_codes_train_ner["doc_id"]))

750

In [20]:
assert not df_codes_train_ner[["doc_id", "start", "end"]].duplicated().any()

In [21]:
# Development corpus

In [22]:
dev_ann_files = [dev_path + f for f in os.listdir(dev_path) if f.split('.')[-1] == "ann"]

In [23]:
len(dev_ann_files)

250

In [24]:
df_codes_dev_ner = process_brat_ner(dev_ann_files).sort_values(["doc_id", "start", "end"])

In [25]:
df_codes_dev_ner.shape

(2660, 4)

In [26]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,text_ref,start,end
1852,cc_onco1001,carcinoma epidermoide,576,597
1854,cc_onco1001,neoplasia,790,799
1857,cc_onco1001,adenocarcinoma T4N3M1b,836,858
1853,cc_onco1001,enfermedad hepática,1205,1224
1855,cc_onco1001,tumoral,2303,2310


In [27]:
df_codes_dev_ner.tail()

Unnamed: 0,doc_id,text_ref,start,end
1630,cc_onco994,tumoración,1604,1614
1629,cc_onco994,metastásica,3064,3075
1628,cc_onco994,macroadenoma,3752,3764
1632,cc_onco994,macroadenoma de la hipófisis,4068,4096
1627,cc_onco994,lesiones hepáticas,5378,5396


In [28]:
assert not df_codes_dev_ner[["doc_id", "start", "end"]].duplicated().any()

### Remove overlapping annotations

In [29]:
# Training corpus

In [30]:
%%time
df_codes_train_ner_final = eliminate_overlap(df_ann=df_codes_train_ner)

100%|██████████| 750/750 [00:21<00:00, 34.63it/s]

CPU times: user 21.7 s, sys: 16.4 ms, total: 21.7 s
Wall time: 21.7 s





In [31]:
df_codes_train_ner_final.shape

(9605, 4)

In [32]:
# Development corpus

In [33]:
%%time
df_codes_dev_ner_final = eliminate_overlap(df_ann=df_codes_dev_ner)

100%|██████████| 250/250 [00:04<00:00, 53.73it/s]

CPU times: user 4.67 s, sys: 8.03 ms, total: 4.67 s
Wall time: 4.65 s





In [34]:
df_codes_dev_ner_final.shape

(2623, 4)
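The rule applied by `eliminate_overlap` is not shown in this notebook. One plausible strategy, sketched below purely as an assumption, is to keep the longest mention whenever two annotated spans of the same document overlap. The helper name `drop_overlaps` and the sample offsets are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), with end exclusive


def drop_overlaps(spans: List[Span]) -> List[Span]:
    """Greedily keep the longest spans and discard any span overlapping a kept one."""
    kept: List[Span] = []
    # Visit longer spans first so that they win over shorter overlapping ones
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


spans = [(836, 850), (836, 858), (1205, 1224), (851, 858)]
filtered = drop_overlaps(spans)
print(filtered)
```

Removing overlaps is needed because a flat IOB-2 tagging scheme can assign only one label per token, so nested or overlapping mentions cannot be represented.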

## Creation of annotated sequences

We now build the corpus used to fine-tune the transformer model on the NER task. Each text is split into sentences, which are converted into sequences of at most `SEQ_LEN` subtokens. Every generated subtoken is then assigned a NER label in IOB-2 format.
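The labeling step can be sketched as follows. Words are labeled `B`, `I` or `O` from the character spans of the annotations, and each word label is then copied to all of its subtokens, which is how the `"word-all"` strategy appears to behave in the outputs of this notebook (e.g. `car ##cino ##ma` all receive `B`). This is a simplified sketch, not the `ss_create_input_data_ner` code: words are split on whitespace instead of using the BERT tokenizer, and the function names are hypothetical.

```python
import re
from typing import List, Tuple


def iob2_word_labels(text: str, spans: List[Tuple[int, int]]) -> List[Tuple[str, int, int, str]]:
    """Label each whitespace-delimited word as B, I or O from character spans."""
    labeled = []
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.span()
        label = "O"
        for s_start, s_end in spans:
            if w_start >= s_start and w_end <= s_end:
                # The first word of a mention gets B, the following ones get I
                label = "B" if w_start == s_start else "I"
                break
        labeled.append((match.group(), w_start, w_end, label))
    return labeled


def propagate_to_subtokens(word_labels: List[str], subtokens_per_word: List[int]) -> List[str]:
    """Copy each word label to all of its subtokens (assumed 'word-all' behaviour)."""
    return [label for label, n_sub in zip(word_labels, subtokens_per_word) for _ in range(n_sub)]


text = "diagnosticada de carcinoma epidermoide de cérvix"
start = text.index("carcinoma")
mention = (start, start + len("carcinoma epidermoide"))

word_labels = [label for _, _, _, label in iob2_word_labels(text, [mention])]
# Subtoken counts per word, as produced by the tokenizer in the outputs of this notebook
subtoken_labels = propagate_to_subtokens(word_labels, [3, 1, 3, 4, 1, 4])
print(word_labels)
print(subtoken_labels)
```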

In [35]:
# Sentence-Split information
ss_corpus_path = "../datasets/Cantemist-SSplit-text/"

In [36]:
from sklearn.preprocessing import LabelEncoder

lab_encoder = LabelEncoder()
# IOB-2 format
lab_encoder.fit(["B", "I", "O"])

LabelEncoder()

### Training corpus

Only training texts with NER annotations are considered:

In [37]:
# Some train documents (texts) are not annotated 
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner_final["doc_id"]))

1

In [38]:
train_doc_list = sorted(set(df_codes_train_ner_final["doc_id"]))

In [39]:
len(train_doc_list)

750

In [40]:
# Sentence-Split data

In [41]:
%%time
ss_sub_corpus_path = ss_corpus_path + "training/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 29 ms, sys: 4.03 ms, total: 33.1 ms
Wall time: 32.9 ms


In [42]:
%%time
train_ind, train_att, train_type, train_y, train_frag, train_start_end_frag, train_word_id = ss_create_input_data_ner(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner_final, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 750/750 [01:09<00:00, 10.82it/s]


CPU times: user 1min 9s, sys: 180 ms, total: 1min 9s
Wall time: 1min 9s


In [43]:
# Sanity check

In [43]:
train_ind.shape

(9326, 128)

In [44]:
train_att.shape

(9326, 128)

In [45]:
train_type.shape

(9326, 128)

In [46]:
train_y.shape

(9326, 128)

In [47]:
len(train_frag)

750

In [48]:
len(train_start_end_frag)

9326

In [49]:
len(train_word_id)

9326

In [50]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    750.000000
mean      12.434667
std        4.298647
min        4.000000
25%       10.000000
50%       12.000000
75%       15.000000
max       36.000000
dtype: float64
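Since `train_frag` stores the number of fragments generated for each document, the rows of the stacked arrays that belong to a given document can be recovered with a running sum, which is what the inspection cells of this section do with `sum(train_frag[:check_id])`. A minimal sketch, with a hypothetical helper name and toy counts:

```python
from itertools import accumulate
from typing import List, Tuple


def fragment_ranges(n_frags: List[int]) -> List[Tuple[int, int]]:
    """Return the [first, last) row range of each document in the stacked arrays."""
    ends = list(accumulate(n_frags))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


n_frags = [4, 12, 9]  # toy fragment counts for three documents
ranges = fragment_ranges(n_frags)
print(ranges)

# The start of document i equals sum(n_frags[:i]), as used in the notebook
assert all(start == sum(n_frags[:i]) for i, (start, _) in enumerate(ranges))
```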

In [51]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [52]:
check_id

37

In [53]:
train_doc_list[check_id]

'cc_onco14'

In [54]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Anamnesis\nMujer de 26 años, natural de Rumanía, diagnosticada de carcinoma epidermoide de cérvix, sin otros antecedentes de interés, que refiere cefalea e hipertensión arterial desde hace varias semanas.\nA raíz de un cuadro de sangrado postcoital, se diagnostica en agosto de 2013 de carcinoma epidermoide de cérvix EIIB (afectación parametrial) con adenopatías patológicas a nivel locorregional.\nSe inicia tratamiento con QT + RT concurrente, con seis ciclos de cisplatino a dosis de 40 mg/m2 i.v. En la PET-TC de reevaluación de febrero de 2014 se objetiva persistencia de afectación local, con aparición de lesiones a nivel adenopático cervical y mediastínico y metástasis viscerales a nivel pulmonar y óseo. Se remite al Servicio de Oncología Médica y se inicia primera línea de QT paliativa con platino-taxano-bevacizumab a dosis estándar, con un total de seis ciclos completados. En una prueba de imagen de control realizada en mayo de 2014, destaca mejoría de la afectación descrita previa

In [55]:
df_codes_train_ner_final[df_codes_train_ner_final["doc_id"] == train_doc_list[check_id]]

Unnamed: 0,doc_id,text_ref,start,end
135,cc_onco14,carcinoma epidermoide,65,86
136,cc_onco14,carcinoma epidermoide,284,305
137,cc_onco14,metástasis,665,675
139,cc_onco14,tumoral,1268,1275
138,cc_onco14,maligna,3319,3326


In [56]:
check_id_frag = sum(train_frag[:check_id])

In [57]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i], train_word_id[i], 
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in train_y[i][1:len(train_start_end_frag[i])+1]])))
    print("\n")

[('Ana', (0, 9), 0, 'O'), ('##mne', (0, 9), 0, 'O'), ('##sis', (0, 9), 0, 'O'), ('Mujer', (10, 15), 1, 'O'), ('de', (16, 18), 2, 'O'), ('26', (19, 21), 3, 'O'), ('años', (22, 26), 4, 'O'), (',', (26, 27), 5, 'O'), ('natural', (28, 35), 6, 'O'), ('de', (36, 38), 7, 'O'), ('Rumanía', (39, 46), 8, 'O'), (',', (46, 47), 9, 'O'), ('diagnos', (48, 61), 10, 'O'), ('##tica', (48, 61), 10, 'O'), ('##da', (48, 61), 10, 'O'), ('de', (62, 64), 11, 'O'), ('car', (65, 74), 12, 'B'), ('##cino', (65, 74), 12, 'B'), ('##ma', (65, 74), 12, 'B'), ('epi', (75, 86), 13, 'I'), ('##der', (75, 86), 13, 'I'), ('##mo', (75, 86), 13, 'I'), ('##ide', (75, 86), 13, 'I'), ('de', (87, 89), 14, 'O'), ('cé', (90, 96), 15, 'O'), ('##r', (90, 96), 15, 'O'), ('##vi', (90, 96), 15, 'O'), ('##x', (90, 96), 15, 'O'), (',', (96, 97), 16, 'O'), ('sin', (98, 101), 17, 'O'), ('otros', (102, 107), 18, 'O'), ('antecedentes', (108, 120), 19, 'O'), ('de', (121, 123), 20, 'O'), ('interés', (124, 131), 21, 'O'), (',', (131, 132), 22,

In [58]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

[CLS] Ana ##mne ##sis Mujer de 26 años , natural de Rumanía , diagnos ##tica ##da de car ##cino ##ma epi ##der ##mo ##ide de cé ##r ##vi ##x , sin otros antecedentes de interés , que refiere ce ##fal ##ea e hiper ##tensión arterial desde hace varias semanas . A raíz de un cuadro de san ##grado post ##co ##ital , se diagnos ##tica en agosto de 2013 de car ##cino ##ma epi ##der ##mo ##ide de cé ##r ##vi ##x EI ##IB ( afecta ##ción para ##met ##ria ##l ) con aden ##op ##atía ##s pato ##lógicas a nivel loco ##r ##re ##gional . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Se inicia tratamiento con Q ##T [UNK] RT concur ##rente , con seis ciclos de cis ##pla ##tino a dosis de 40 mg [UNK] m ##2 i . v . En la PE ##T - TC de ree ##valuación de febrero de 2014 se objetiva persistencia de afecta ##ción local , con aparición de lesiones a nivel aden ##op ##ático cerv ##ical y medias ##tín ##ico y metá ##

### Development corpus

Only development texts with NER annotations are considered:

In [59]:
# All development documents (texts) are annotated 
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner_final["doc_id"]))

0

In [60]:
dev_doc_list = sorted(set(df_codes_dev_ner_final["doc_id"]))

In [61]:
len(dev_doc_list)

250

In [62]:
# Sentence-Split data

In [63]:
%%time
ss_sub_corpus_path = ss_corpus_path + "development/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 11.8 ms, sys: 0 ns, total: 11.8 ms
Wall time: 11.6 ms


In [64]:
%%time
dev_ind, dev_att, dev_type, dev_y, dev_frag, dev_start_end_frag, dev_word_id = ss_create_input_data_ner(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner_final, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

 84%|████████▍ | 210/250 [00:16<00:02, 17.75it/s]

I
doc_id      cc_onco1427
text_ref        pT3N2Mx
start              1928
end                1935
Name: 803, dtype: object
11
11
[[1849 1858]
 [1859 1867]
 [1868 1872]
 [1873 1887]
 [1888 1898]
 [1899 1900]
 [1900 1905]
 [1906 1916]
 [1917 1918]
 [1919 1924]
 [1924 1925]
 [1926 1935]
 [1935 1936]
 [1937 1939]
 [1940 1948]
 [1949 1952]
 [1953 1961]
 [1962 1963]
 [1964 1971]
 [1972 1980]
 [1981 1985]
 [1986 1996]
 [1996 1997]]


100%|██████████| 250/250 [00:19<00:00, 12.71it/s]

CPU times: user 19.8 s, sys: 12 ms, total: 19.8 s
Wall time: 19.7 s





In [65]:
# Sanity check

In [66]:
dev_ind.shape

(2541, 128)

In [67]:
dev_att.shape

(2541, 128)

In [68]:
dev_type.shape

(2541, 128)

In [69]:
dev_y.shape

(2541, 128)

In [70]:
len(dev_frag)

250

In [71]:
len(dev_start_end_frag)

2541

In [72]:
len(dev_word_id)

2541

In [73]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      10.164000
std        4.404395
min        3.000000
25%        7.000000
50%        9.000000
75%       13.000000
max       32.000000
dtype: float64

In [74]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [75]:
check_id

188

In [76]:
dev_doc_list[check_id]

'cc_onco1382'

In [77]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Anamnesis\nMujer de 55 años, postmenopáusica sin antecedentes médico-quirúrgicos de interés. Consulta en junio de 2008 por notar retracción del pezón de la mama derecha, sin otra clínica asociada. Mamografía previa 12 meses antes sin alteraciones relevantes.\n\nExamen físico\nDestaca asimetría mamaria con marcada retracción del pezón de la mama derecha. Mama izquierda sin alteraciones.\nNo presenta adenopatías locorregionales palpables. El resto de la exploración es anodina.\n\nPruebas complementarias\nMamografía: área de 4 cm de aumento de densidad, desestructurada con microcalcificaciones dismórficas por detrás del pezón derecho.\nBiopsia con aguja gruesa (BAG): compatible con carcinoma ductal infiltrante de mama grado II.\nRadiografía de tórax: normal.\nMarcadores tumorales: CEA y CA 15.3 normales.\n\nDiagnóstico\nAdenocarcinoma ductal infiltrante de mama por estudio preliminar con biopsia.\n\nTratamiento\nEs intervenida en agosto de 2008, realizándose biopsia selectiva del ganglio

In [78]:
df_codes_dev_ner_final[df_codes_dev_ner_final["doc_id"] == dev_doc_list[check_id]]

Unnamed: 0,doc_id,text_ref,start,end
1240,cc_onco1382,carcinoma ductal infiltrante de mama grado II,679,724
1241,cc_onco1382,Adenocarcinoma ductal infiltrante,815,848
1238,cc_onco1382,metástasis,1027,1037
1242,cc_onco1382,carcinoma endocrino de célula pequeña,1153,1190
1236,cc_onco1382,Micrometástasis,1208,1223
1237,cc_onco1382,células tumorales,1371,1388
1239,cc_onco1382,progresión hepática,2069,2088


In [79]:
check_id_frag = sum(dev_frag[:check_id])

In [80]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i], dev_word_id[i],
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in dev_y[i][1:len(dev_start_end_frag[i])+1]])))
    print("\n")

[('Ana', (0, 9), 0, 'O'), ('##mne', (0, 9), 0, 'O'), ('##sis', (0, 9), 0, 'O'), ('Mujer', (10, 15), 1, 'O'), ('de', (16, 18), 2, 'O'), ('55', (19, 21), 3, 'O'), ('años', (22, 26), 4, 'O'), (',', (26, 27), 5, 'O'), ('post', (28, 43), 6, 'O'), ('##men', (28, 43), 6, 'O'), ('##op', (28, 43), 6, 'O'), ('##áus', (28, 43), 6, 'O'), ('##ica', (28, 43), 6, 'O'), ('sin', (44, 47), 7, 'O'), ('antecedentes', (48, 60), 8, 'O'), ('médico', (61, 67), 9, 'O'), ('-', (67, 68), 10, 'O'), ('quirúr', (68, 79), 11, 'O'), ('##gicos', (68, 79), 11, 'O'), ('de', (80, 82), 12, 'O'), ('interés', (83, 90), 13, 'O'), ('.', (90, 91), 14, 'O'), ('Consul', (92, 100), 15, 'O'), ('##ta', (92, 100), 15, 'O'), ('en', (101, 103), 16, 'O'), ('junio', (104, 109), 17, 'O'), ('de', (110, 112), 18, 'O'), ('2008', (113, 117), 19, 'O'), ('por', (118, 121), 20, 'O'), ('notar', (122, 127), 21, 'O'), ('retra', (128, 138), 22, 'O'), ('##cción', (128, 138), 22, 'O'), ('del', (139, 142), 23, 'O'), ('pez', (143, 148), 24, 'O'), ('##ó

In [81]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

[CLS] Ana ##mne ##sis Mujer de 55 años , post ##men ##op ##áus ##ica sin antecedentes médico - quirúr ##gicos de interés . Consul ##ta en junio de 2008 por notar retra ##cción del pez ##ón de la mama derecha , sin otra clínica asociada . Mam ##o ##grafía previa 12 meses antes sin alteraciones relevantes . Examen físico Destaca asim ##et ##ría mama ##ria con marcada retra ##cción del pez ##ón de la mama derecha . Mama izquierda sin alteraciones . No presenta aden ##op ##atía ##s loco ##r ##re ##gional ##es pal ##pa ##bles . El resto de la exploración es ano ##dina . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

[CLS] Prueba ##s complementarias Mam ##o ##grafía : área de 4 cm de aumento de densidad , desest ##ruc ##tura ##da con micro ##cal ##ci ##ficaciones dis ##mó ##r ##ficas por detrás del pez ##ón derecho . Bio ##psia con aguja grues ##a ( BA ##G ) : compatible con car ##cino ##ma du ##cta ##l in

### Training & Development corpus

We merge the previously generated datasets:

In [82]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [83]:
train_dev_ind.shape

(11867, 128)

In [84]:
# Attention
train_dev_att = np.concatenate((train_att, dev_att))

In [85]:
train_dev_att.shape

(11867, 128)

In [86]:
# Type
train_dev_type = np.concatenate((train_type, dev_type))

In [87]:
train_dev_type.shape

(11867, 128)

In [88]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [89]:
train_dev_y.shape

(11867, 128)

## Fine-tuning

Using the corpus of labeled subtoken sequences, we fine-tune the model on a multi-class token classification task.
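Padding and special-token positions carry the label `IGNORE_VALUE` (-100), so they must not contribute to the loss. The sketch below shows the idea behind a masked token-level cross-entropy computed from logits, in plain Python. It is an assumption about what `TokenClassificationLoss` does, since that class lives in `nlp_utils` and is not shown here; in particular, the actual reduction may differ.

```python
import math
from typing import List

IGNORE_VALUE = -100


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one token's class logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def masked_token_cross_entropy(logits: List[List[float]], labels: List[int],
                               ignore_value: int = IGNORE_VALUE) -> float:
    """Mean cross-entropy over the positions whose label is not ignore_value."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_value:
            continue  # padding / special tokens do not contribute
        total -= log_softmax(token_logits)[label]
        count += 1
    return total / count if count else 0.0


# One toy sequence of three tokens and three classes (B=0, I=1, O=2)
logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [9.0, -9.0, 3.0]]
labels = [2, 0, IGNORE_VALUE]
loss = masked_token_cross_entropy(logits, labels)
print(round(loss, 4))
```

Averaging only over the non-ignored positions keeps the loss scale independent of how much padding a batch contains.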

In [90]:
from transformers import TFBertForTokenClassification

model = TFBertForTokenClassification.from_pretrained(model_name, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [91]:
model.summary()

Model: "tf_bert_for_token_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109260288 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,261,826
Trainable params: 109,261,826
Non-trainable params: 0
_________________________________________________________________


In [92]:
model.layers

[<transformers.models.bert.modeling_tf_bert.TFBertMainLayer at 0x7f45d9c02190>,
 <tensorflow.python.keras.layers.core.Dropout at 0x7f45cbdfff10>,
 <tensorflow.python.keras.layers.core.Dense at 0x7f45db981a10>]

In [93]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')

num_labels = len(lab_encoder.classes_)

out_seq = model.layers[0](input_ids=input_ids)[0] # take the output sub-token sequence 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(out_seq) # Multi-class classification

model = Model(inputs=input_ids, outputs=out_logits)

In [94]:
model.summary()

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_ids (InputLayer)       [(None, 128)]             0         
_________________________________________________________________
bert (TFBertMainLayer)       TFBaseModelOutputWithPool 109260288 
_________________________________________________________________
dense (Dense)                (None, 128, 3)            2307      
Total params: 109,262,595
Trainable params: 109,262,595
Non-trainable params: 0
_________________________________________________________________


In [95]:
model.input

<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>

In [96]:
model.output

<tf.Tensor 'dense/BiasAdd:0' shape=(None, 128, 3) dtype=float32>

In [None]:
%%time
from tensorflow.keras import optimizers, losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = TokenClassificationLoss(from_logits=LOGITS, ignore_val=IGNORE_VALUE)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_ind}, 
                    y=train_dev_y, batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)

Epoch 1/73
105/742 [===>..........................] - ETA: 2:52 - loss: 0.4232

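As a sanity check, we evaluate the model's predictions on the development set. Since the development set was also used for fine-tuning, these scores only verify that the prediction pipeline works and do not estimate generalization performance: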

In [101]:
%%time
dev_preds = tf.nn.softmax(logits=model.predict({'input_ids': dev_ind}), 
                           axis=-1).numpy()

CPU times: user 10.6 s, sys: 1.31 s, total: 11.9 s
Wall time: 13 s


In [102]:
dev_preds.shape

(2856, 128, 3)
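The predictions are produced per subtoken, whereas the evaluation works on words, so the subtoken probabilities have to be aggregated with `EVAL_STRATEGY = "word-max"`. Its implementation is in `nlp_utils` and is not shown here. One plausible reading, sketched below as an assumption, is that each word takes the class predicted by its most confident subtoken. The helper name and the toy probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def word_max_labels(probs: List[List[float]], word_ids: List[int]) -> Dict[int, int]:
    """For each word, keep the class predicted by its most confident subtoken."""
    best: Dict[int, Tuple[float, int]] = {}
    for token_probs, word_id in zip(probs, word_ids):
        confidence = max(token_probs)
        label = token_probs.index(confidence)
        if word_id not in best or confidence > best[word_id][0]:
            best[word_id] = (confidence, label)
    return {word_id: label for word_id, (_, label) in best.items()}


# Four subtokens belonging to three words; class order is B=0, I=1, O=2
probs = [
    [0.10, 0.05, 0.85],
    [0.70, 0.20, 0.10],
    [0.55, 0.40, 0.05],
    [0.30, 0.60, 0.10],
]
word_ids = [0, 1, 1, 2]
word_labels = word_max_labels(probs, word_ids)
print(word_labels)
```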

In [103]:
out_dev_path = "dev_preds/"

In [104]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=dev_doc_list, fragments=dev_frag, preds=dev_preds, 
                                    start_end=dev_start_end_frag, word_id=dev_word_id, 
                                    lb_encoder=lab_encoder, df_text=df_text_dev, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_dev_path)

100%|██████████| 250/250 [00:00<00:00, 284.79it/s]
100%|██████████| 250/250 [00:00<00:00, 311.27it/s]
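Writing BRAT annotations requires turning word-level IOB-2 labels back into character spans. The sketch below shows one way to do this, assuming each word comes with its `(start, end)` offsets. It is not the `ner_preds_brat_format` code, and the treatment of an `I` label with no preceding `B` (ignored here) is a design choice of this example.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(labels: List[str], offsets: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge word-level IOB-2 labels into (start, end) character spans."""
    spans: List[Tuple[int, int]] = []
    current: Optional[List[int]] = None  # [start, end] of the mention being built
    for label, (start, end) in zip(labels, offsets):
        if label == "B":
            if current:
                spans.append((current[0], current[1]))
            current = [start, end]
        elif label == "I" and current:
            current[1] = end  # extend the open mention
        else:
            # "O", or an "I" with no open mention (treated as outside here)
            if current:
                spans.append((current[0], current[1]))
            current = None
    if current:
        spans.append((current[0], current[1]))
    return spans


labels = ["O", "B", "I", "O", "B"]
offsets = [(62, 64), (65, 74), (75, 86), (87, 89), (90, 96)]
spans = iob2_to_spans(labels, offsets)
print(spans)
```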


In [105]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/dev-set2/cantemist-ner/ -p ./dev_preds/ -s ner 


-----------------------------------------------------
Clinical case name			Precision
-----------------------------------------------------
cc_onco1001.ann		1.0
-----------------------------------------------------
cc_onco1007.ann		1.0
-----------------------------------------------------
cc_onco1008.ann		1.0
-----------------------------------------------------
cc_onco1009.ann		1.0
-----------------------------------------------------
cc_onco1010.ann		1.0
-----------------------------------------------------
cc_onco1011.ann		1.0
-----------------------------------------------------
cc_onco1012.ann		1.0
-----------------------------------------------------
cc_onco1014.ann		1.0
-----------------------------------------------------
cc_onco1016.ann		1.0
-----------------------------------------------------
cc_onco1018.ann		1.0
-----------------------------------------------------
cc_onco1019.ann		0.933
-----------------------------------------------------
cc_onco

CPU times: user 28.4 ms, sys: 18.9 ms, total: 47.3 ms
Wall time: 988 ms


## Test set predictions

In [106]:
%%time
test_path = corpus_path + "test-set/" + sub_task_path
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f) and f.split('.')[-1] == 'txt']
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 8.71 ms, sys: 0 ns, total: 8.71 ms
Wall time: 7.92 ms


In [107]:
df_text_test.shape

(300, 2)

In [108]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco1446,"Anamnesis\nVarón de 70 años de edad, agriculto..."
1,cc_onco629,"Anamnesis\nMujer de 57 años, sin alergias medi..."
2,cc_onco624,Anamnesis\nSe trata de un varón de 67 años sin...
3,cc_onco332,"Anamnesis\nVarón 54 años en la actualidad, sin..."
4,cc_onco441,"Anamnesis\nVarón de 17 años, sin hermanos ni a..."


In [109]:
len(set(df_text_test['doc_id']))

300

In [110]:
df_text_test.raw_text[0]

'Anamnesis\nVarón de 70 años de edad, agricultor, sin alergias y fumador de un paquete al día. Padeció brucelosis a los 40 años y bronquitis crónica en la actualidad. Intervenido en 1992 de colesteatoma izquierdo y sometido en 1998 a una mastoidectomía radical izquierda por mastoiditis con absceso cervical y cerebeloso provocado por persistencia de la enfermedad.\nEn marzo de 2004 consultó por una ulceración en hemiescroto derecho de tres meses de evolución, realizándosele una resección del área afectada. La pieza presentaba un leiomiosarcoma con alto índice mitótico y borde de resección afecto positivo para actina de músculo liso y negativo para queratinas AE1-AE3, S100 y CD34. En mayo de 2004 se ampliaronmárgenes y, al no encontrarse afectación tumoral, el paciente continuó revisiones por el Servicio de Urología.\nEn febrero de 2005, en el curso del estudio de un tumor deltoideo derecho de 9 cm que presentaba desde hacía ocho años y que parecía un neurofibroma o neurinoma por RM, se 

In [111]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [112]:
len(test_doc_list)

300

In [113]:
# Sentence-Split data

In [114]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test-background/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 127 ms, sys: 22.6 ms, total: 150 ms
Wall time: 150 ms


In [115]:
%%time
test_ind, test_att, test_type, _, test_frag, test_start_end_frag, test_word_id = ss_create_input_data_ner(df_text=df_text_test, 
                                                  text_col=text_col, 
                                                  # Since labels are ignored, we pass df_codes_train_ner_final as df_ann
                                                  df_ann=df_codes_train_ner_final, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 300/300 [00:26<00:00, 11.50it/s]

CPU times: user 26.3 s, sys: 0 ns, total: 26.3 s
Wall time: 26.2 s





In [116]:
# Sanity check

In [117]:
test_ind.shape

(3853, 128)

In [118]:
test_att.shape

(3853, 128)

In [119]:
test_type.shape

(3853, 128)

In [120]:
len(test_frag)

300

In [121]:
len(test_start_end_frag)

3853

In [122]:
len(test_word_id)

3853

In [123]:
%%time
test_preds = tf.nn.softmax(logits=model.predict({'input_ids': test_ind}), 
                           axis=-1).numpy()

CPU times: user 12.7 s, sys: 1.52 s, total: 14.2 s
Wall time: 15.5 s


In [124]:
test_preds.shape

(3853, 128, 3)

In [125]:
out_test_path = "test_preds/"

In [126]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, preds=test_preds, 
                                    start_end=test_start_end_frag, word_id=test_word_id, lb_encoder=lab_encoder, 
                                    df_text=df_text_test, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_test_path)

100%|██████████| 300/300 [00:01<00:00, 256.75it/s]
100%|██████████| 300/300 [00:01<00:00, 268.22it/s]


In [None]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/test-set/cantemist-ner/ -p ./test_preds/ -s ner 

In [None]:
# Save predictions on the test set

In [128]:
model_name = "beto_galen_" + str(random_seed)

In [129]:
np.save(file="test_preds_" + model_name + ".npy", arr=test_preds)

In [133]:
doc_word_preds, doc_word_start_end = seq_ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, 
                           arr_start_end=test_start_end_frag, arr_word_id=test_word_id, arr_preds=test_preds, 
                           strategy=EVAL_STRATEGY)

100%|██████████| 300/300 [00:01<00:00, 244.53it/s]


In [134]:
import pickle

with open("test_doc_word_preds_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_preds, f)

with open("test_doc_word_start_end_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_start_end, f)
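The saved files can later be read back with `pickle.load`, for instance from the evaluation notebook. The round trip below uses a temporary directory and a toy nested list as a stand-in, since the exact structure of `doc_word_preds` is an assumption here.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in: one document, two words, three class probabilities each
doc_word_preds = [[[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test_doc_word_preds_beto_galen_0.pck")
    with open(path, "wb") as f:
        pickle.dump(doc_word_preds, f)
    # Files must be opened in binary mode for both writing and reading
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == doc_word_preds)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.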