# Fine-tuning XLM-R-Galén on Cantemist-NER

In this notebook, following a multi-class token classification approach, the XLM-R-Galén model is fine-tuned on both the training and development sets of the Cantemist-NER corpus. Additionally, the predictions made by the model on the test set are saved, in order to further evaluate the NER performance of the model (see `results/Evaluation.ipynb`).

In [1]:
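import os

import numpy as np
import pandas as pd
import tensorflow as tf

# Auxiliary components (helper functions and loss used throughout this notebook)
from nlp_utils import *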

from transformers import XLMRobertaTokenizerFast
model_name = "XLM-R-Galen/"
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)

# Hyper-parameters
text_col = "raw_text"
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 71
LR = 3e-5

GREEDY = True
IGNORE_VALUE = -100
ANN_STRATEGY = "word-all"
EVAL_STRATEGY = "word-max"
LOGITS = True

random_seed = 0
tf.random.set_seed(random_seed)

## Load text

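First, the text files of the training and development Cantemist corpora are loaded into separate dataframes. The training dataframe combines the `train-set` and `dev-set1` partitions, while `dev-set2` is used as the development corpus.

The NER annotations are loaded in the next section.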
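The `load_text_files` helper comes from `nlp_utils` and is not shown in this notebook. As a rough sketch of what such a loader might do, assuming one UTF-8 `.txt` file per document, the hypothetical function `read_txt_dir` below collects document identifiers and raw texts from a directory:

```python
import os
import tempfile


def read_txt_dir(dir_path):
    """Return (doc_ids, texts) for every .txt file directly inside dir_path.

    Hypothetical stand-in for the notebook's loader, not the actual helper."""
    doc_ids, texts = [], []
    for name in sorted(os.listdir(dir_path)):
        full = os.path.join(dir_path, name)
        if os.path.isfile(full) and name.endswith(".txt"):
            with open(full, encoding="utf-8") as fh:
                texts.append(fh.read())
            doc_ids.append(name[: -len(".txt")])
    return doc_ids, texts


# Small demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [("cc_demo2.txt", "Anamnesis\nMujer"),
                  ("cc_demo1.txt", "Anamnesis\nVarón"),
                  ("cc_demo1.ann", "")]
    for name, body in demo_files:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(body)
    demo_ids, demo_texts = read_txt_dir(tmp)
```

Sorting the file names makes the document order reproducible, which `os.listdir` alone does not guarantee.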

In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-ner/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train-set/" + sub_task_path
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f) and f.split('.')[-1] == "txt"]
n_train_files = len(train_files)
train_data = load_text_files(train_files, train_path)
dev1_path = corpus_path + "dev-set1/" + sub_task_path
train_files.extend([f for f in os.listdir(dev1_path) if os.path.isfile(dev1_path + f) and f.split('.')[-1] == "txt"])
train_data.extend(load_text_files(train_files[n_train_files:], dev1_path))
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 5.12 ms, sys: 15.9 ms, total: 21.1 ms
Wall time: 20.2 ms


In [4]:
df_text_train.shape

(751, 2)

In [5]:
df_text_train.head()

| | doc_id | raw_text |
|---|---|---|
| 0 | cc_onco453 | Anamnesis\nSe trata de un varón de 55 años, ex... |
| 1 | cc_onco962 | Anamnesis\nMujer de 40 años que consulta por l... |
| 2 | cc_onco989 | Anamnesis\nPaciente de 43 años, perimenopáusic... |
| 3 | cc_onco187 | Anamnesis\nVarón de 72 años, exfumador y bebed... |
| 4 | cc_onco164 | Anamnesis\nMujer de 51 años, sin alergias medi... |


In [6]:
len(set(df_text_train['doc_id']))

751

In [7]:
df_text_train.raw_text[0]

'Anamnesis\nSe trata de un varón de 55 años, ex fumador con un índice tabáquico de 40 paquetes-año, HTA, sin antecedentes familiares de interés, en tratamiento con ácido fólico, omeprazol, hierro oral y risperidona.\nEn junio de 2017, es diagnosticado a raíz de una trombosis iliaca derecha de una masa tumoral pobremente diferenciada que infiltraba tercio distal de apéndice cecal, con obliteración del paquete vascular iliaco, así como infiltración del uréter derecho, condicionando ureterohidronefrosis derecha grado cuatro.\nSe realiza biopsia de la lesión, con hallazgos anatomopatológicos de neoplasia fusocelular con inmunohistoquímica (IHC) sugerente de origen urotelial con diferenciación sarcomatoide (positividad para p63, p40, GATA3, CD99, EMA CAM 5.2, CK 34, beta E12, CK7, negatividad para CK20, S100, CD34, cKIT y TTF1), con SYT no reordenado.\nSe lleva a cabo intervención en julio de 2017 con extirpación de la masa tumoral, resección ileocecal, dejando ileostomía terminal y bypass 

### Development corpus

In [8]:
%%time
dev_path = corpus_path + "dev-set2/" + sub_task_path
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f) and f.split('.')[-1] == "txt"]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 22.1 ms, sys: 11.4 ms, total: 33.5 ms
Wall time: 31 ms


In [9]:
df_text_dev.shape

(250, 2)

In [10]:
df_text_dev.head()

| | doc_id | raw_text |
|---|---|---|
| 0 | cc_onco1183 | Anamnesis\nVarón de 48 años que acude en agost... |
| 1 | cc_onco751 | Anamnesis\nHombre de 71 años de edad con antec... |
| 2 | cc_onco1384 | Anamnesis\nVarón de 30 años, sin antecedentes ... |
| 3 | cc_onco1208 | Anamnesis\nVarón de 76 años diagnosticado en a... |
| 4 | cc_onco734 | Anamnesis\nTras canalización del reservorio ce... |


In [11]:
len(set(df_text_dev['doc_id']))

250

In [12]:
df_text_dev.raw_text[0]

'Anamnesis\nVarón de 48 años que acude en agosto de 2012 tras la realización de amputación en hallux derecho a la primera valoración por Oncología Médica.\nAntecedentes: no alérgicos. Patológicos: diabético hace 3 años sin tratamiento. HTA en tratamiento. Enolismo crónico.\nTrastorno afectivo bipolar. Quirúrgicos: apendicectomizado.\nEnfermedad actual: desde hace 2 años presentaba una lesión hiperpigmentada en el hallux del pie derecho, que ocasionalmente sangraba.\n\nExamen físico\nMuñón de hallux derecho en buen estado, presencia de adenopatías inguinales derechas. Resto de exploración sin alteraciones.\n\nPruebas complementarias\n- Biopsia cutánea: melanoma ulcerado.\n- AP de resección de melanoma: melanoma lentiginoso acral en fase de crecimiento vertical, ulcerado. Clark IV. Breslow 6,8 mm, sin infiltración linfovascular.\n- TC de tórax-abdomen-pelvis: sin signos concluyentes de extensión tóraco-abdómino-pélvica de melanoma.\nMicronódulos en el lóbulo superior derecho a valorar en

## Process NER annotations

We load and pre-process the NER annotations in BRAT format available for the Cantemist-NER subtask.
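The parsing itself is done by `process_brat_ner` from `nlp_utils`, which is not shown here. As a hedged illustration of the BRAT standoff layout, the sketch below parses text-bound (`T`) lines into the `doc_id`, `text_ref`, `start`, `end` fields used in this notebook. The function name `parse_brat_entities`, the sample lines and the entity type string are assumptions made for illustration only.

```python
def parse_brat_entities(ann_text, doc_id):
    """Parse the text-bound ('T') lines of a BRAT .ann file into dict rows.

    Illustrative only; the notebook's own parser may behave differently."""
    rows = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip notes ('#'), relations ('R') and other line types
        _, type_and_span, text_ref = line.split("\t", 2)
        _label, span = type_and_span.split(" ", 1)
        # Discontinuous entities separate fragments with ';': keep outer bounds
        offsets = [int(v) for frag in span.split(";") for v in frag.split()]
        rows.append({"doc_id": doc_id, "text_ref": text_ref,
                     "start": min(offsets), "end": max(offsets)})
    return rows


# Sample content constructed for illustration (entity type name is assumed)
sample = ("T1\tMORFOLOGIA_NEOPLASIA 2719 2740\tCarcinoma microcítico\n"
          "#1\tAnnotatorNotes T1\texample note\n"
          "T2\tMORFOLOGIA_NEOPLASIA 2988 2990\tM0\n")
rows = parse_brat_entities(sample, "cc_onco1")
```

Each resulting row can be fed to `pd.DataFrame(rows)` to obtain a table with the same four columns as the annotation dataframes shown in this section.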

In [13]:
# Training corpus

In [13]:
train_ann_files = [train_path + f for f in os.listdir(train_path) if f.split('.')[-1] == "ann"]
train_ann_files.extend([dev1_path + f for f in os.listdir(dev1_path) if f.split('.')[-1] == "ann"])

In [14]:
len(train_ann_files)

751

In [15]:
df_codes_train_ner = process_brat_ner(train_ann_files).sort_values(["doc_id", "start", "end"])

In [16]:
df_codes_train_ner.shape

(9737, 4)

In [17]:
df_codes_train_ner.head()

| | doc_id | text_ref | start | end |
|---|---|---|---|---|
| 5230 | cc_onco1 | Carcinoma microcítico | 2719 | 2740 |
| 5231 | cc_onco1 | carcinoma microcítico | 2950 | 2971 |
| 5232 | cc_onco1 | M0 | 2988 | 2990 |
| 97 | cc_onco10 | tumor | 212 | 217 |
| 95 | cc_onco10 | neoplasia | 976 | 985 |


In [18]:
len(set(df_codes_train_ner["doc_id"]))

750

In [19]:
assert not df_codes_train_ner[["doc_id", "start", "end"]].duplicated().any()

In [21]:
# Development corpus

In [20]:
dev_ann_files = [dev_path + f for f in os.listdir(dev_path) if f.split('.')[-1] == "ann"]

In [21]:
len(dev_ann_files)

250

In [22]:
df_codes_dev_ner = process_brat_ner(dev_ann_files).sort_values(["doc_id", "start", "end"])

In [23]:
df_codes_dev_ner.shape

(2660, 4)

In [24]:
df_codes_dev_ner.head()

| | doc_id | text_ref | start | end |
|---|---|---|---|---|
| 1852 | cc_onco1001 | carcinoma epidermoide | 576 | 597 |
| 1854 | cc_onco1001 | neoplasia | 790 | 799 |
| 1857 | cc_onco1001 | adenocarcinoma T4N3M1b | 836 | 858 |
| 1853 | cc_onco1001 | enfermedad hepática | 1205 | 1224 |
| 1855 | cc_onco1001 | tumoral | 2303 | 2310 |


In [25]:
df_codes_dev_ner.tail()

| | doc_id | text_ref | start | end |
|---|---|---|---|---|
| 1630 | cc_onco994 | tumoración | 1604 | 1614 |
| 1629 | cc_onco994 | metastásica | 3064 | 3075 |
| 1628 | cc_onco994 | macroadenoma | 3752 | 3764 |
| 1632 | cc_onco994 | macroadenoma de la hipófisis | 4068 | 4096 |
| 1627 | cc_onco994 | lesiones hepáticas | 5378 | 5396 |


In [26]:
assert not df_codes_dev_ner[["doc_id", "start", "end"]].duplicated().any()

### Remove overlapping annotations

In [29]:
# Training corpus

In [27]:
%%time
df_codes_train_ner_final = eliminate_overlap(df_ann=df_codes_train_ner)

100%|██████████| 750/750 [00:21<00:00, 34.17it/s]

CPU times: user 22 s, sys: 22.3 ms, total: 22 s
Wall time: 22 s





In [28]:
df_codes_train_ner_final.shape

(9605, 4)

In [32]:
# Development corpus

In [29]:
%%time
df_codes_dev_ner_final = eliminate_overlap(df_ann=df_codes_dev_ner)

100%|██████████| 250/250 [00:04<00:00, 53.06it/s]

CPU times: user 4.72 s, sys: 15.8 ms, total: 4.73 s
Wall time: 4.71 s





In [30]:
df_codes_dev_ner_final.shape

(2623, 4)
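The `eliminate_overlap` helper is defined in `nlp_utils`, so its exact rule is not visible in this notebook. One common way to resolve overlapping mentions, shown here purely as an assumption, is to visit spans from longest to shortest and keep a span only if it does not overlap any span already kept. The function name `drop_overlaps` and the toy spans are hypothetical.

```python
def drop_overlaps(spans):
    """Greedy longest-first filter over (start, end) spans of one document.

    End offsets are exclusive. This is an assumed strategy, not necessarily
    the one implemented by the notebook's helper."""
    kept = []
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        # Two spans are disjoint when one ends before the other starts
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)


# Toy spans: the second and fourth overlap the longer first span
demo = [(836, 858), (836, 850), (790, 799), (855, 870)]
result = drop_overlaps(demo)
print(result)  # [(790, 799), (836, 858)]
```

Removing overlaps is needed because IOB-2 tagging assigns exactly one label per token, so nested or crossing mentions cannot be represented.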

## Creation of annotated sequences

We create the corpus used to fine-tune the transformer model on the NER task. To do so, we split the texts into sentences and convert them into sequences of subtokens. Each generated subtoken is then assigned a NER label in IOB-2 format.
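To make the IOB-2 labelling concrete, here is a minimal sketch that labels whitespace-delimited words from character-level annotation spans and then copies each word label to all of its subtokens. Both function names are hypothetical, whitespace splitting stands in for the real tokenizer, and reading `"word-all"` as "every subtoken inherits its word's label" is an assumption.

```python
import re


def iob2_word_labels(text, annotations):
    """Label each whitespace-delimited word with B, I or O.

    annotations: non-overlapping (start, end) character spans.
    Returns a list of (word, start, end, label) tuples."""
    labeled = []
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.span()
        label = "O"
        for a_start, a_end in annotations:
            if w_start < a_end and w_end > a_start:  # word overlaps the span
                # The first word of a mention gets B, later words get I
                label = "B" if w_start <= a_start else "I"
                break
        labeled.append((match.group(), w_start, w_end, label))
    return labeled


def propagate_to_subtokens(word_labels, n_subtokens):
    """Copy each word label to all of that word's subtokens (assumed rule)."""
    out = []
    for label, n in zip(word_labels, n_subtokens):
        out.extend([label] * n)
    return out


single = [w[3] for w in iob2_word_labels("fallecido por cáncer de pulmón", [(14, 20)])]
multi = [w[3] for w in iob2_word_labels("cáncer no microcítico", [(0, 21)])]
sub = propagate_to_subtokens(["O", "B"], [2, 3])
```

Special tokens and padding would receive the ignore value (`IGNORE_VALUE`) instead of a label, so that they can be excluded from the loss.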

In [31]:
# Sentence-Split information
ss_corpus_path = "../datasets/Cantemist-SSplit-text/"

In [32]:
from sklearn.preprocessing import LabelEncoder

lab_encoder = LabelEncoder()
# IOB-2 format
lab_encoder.fit(["B", "I", "O"])

LabelEncoder()

### Training corpus

Only training texts with NER annotations are considered:

In [33]:
# Some train documents (texts) are not annotated 
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner_final["doc_id"]))

1

In [34]:
train_doc_list = sorted(set(df_codes_train_ner_final["doc_id"]))

In [35]:
len(train_doc_list)

750

In [40]:
# Sentence-Split data

In [36]:
%%time
ss_sub_corpus_path = ss_corpus_path + "training/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 21.2 ms, sys: 19.7 ms, total: 40.9 ms
Wall time: 38.1 ms


In [37]:
%%time
train_ind, train_att, train_type, train_y, train_frag, train_start_end_frag, train_word_id = ss_create_input_data_ner(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner_final, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 750/750 [01:09<00:00, 10.76it/s]

CPU times: user 1min 9s, sys: 156 ms, total: 1min 10s
Wall time: 1min 9s





In [43]:
# Sanity check

In [38]:
train_ind.shape

(10914, 128)

In [39]:
train_att.shape

(10914, 128)

In [40]:
train_type.shape

(10914, 128)

In [41]:
train_y.shape

(10914, 128)

In [42]:
len(train_frag)

750

In [43]:
len(train_start_end_frag)

10914

In [44]:
len(train_word_id)

10914

In [45]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    750.000000
mean      14.552000
std        5.020315
min        4.000000
25%       11.000000
50%       14.000000
75%       17.000000
max       44.000000
dtype: float64
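The number of fragments per document depends on how sentences are grouped into sequences of at most `SEQ_LEN` subtokens. The sketch below shows one plausible greedy grouping, assuming that `GREEDY = True` means consecutive sentences are packed together while they fit and that two positions are reserved for `<s>` and `</s>`. The function `greedy_pack` and its inputs are hypothetical; the actual logic lives in `ss_create_input_data_ner`.

```python
def greedy_pack(sentence_lengths, seq_len=128, n_special=2):
    """Group consecutive sentences into fragments that fit in seq_len.

    sentence_lengths: subtoken count of each sentence, in document order.
    Returns a list of fragments, each a list of sentence indices."""
    capacity = seq_len - n_special  # room left after <s> and </s>
    fragments, current, used = [], [], 0
    for idx, length in enumerate(sentence_lengths):
        if length > capacity:
            # A real implementation would have to split such a sentence
            raise ValueError("sentence %d does not fit in one fragment" % idx)
        if current and used + length > capacity:
            fragments.append(current)  # close the fragment that is full
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        fragments.append(current)
    return fragments


frags = greedy_pack([60, 50, 30, 100, 20, 10])
print(frags)  # [[0, 1], [2], [3, 4], [5]]
```

Packing several short sentences into one sequence reduces padding, which is consistent with the multi-sentence fragments printed in the inspection cells of this section.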

In [46]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [47]:
check_id

347

In [48]:
train_doc_list[check_id]

'cc_onco493'

In [49]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Anamnesis\nMujer de 73 años con antecedentes de hipertensión arterial, dislipemia, diabetes mellitus no insulinodependiente y nunca fumadora, aunque sí fumadora pasiva (marido fumador activo fallecido por cáncer de pulmón). Intervenida de prótesis de rodilla bilateral. Tratamiento crónico con atorvastatina 20mg, glimepirida 4 mg y ramipril 2,5 mg. No tiene alergias medicamentosas. Está jubilada tras trabajar como panadera. Antecedente oncológico familiar de hermano con cáncer no microcítico de pulmón.\nPresenta en junio de 2018 episodio de derrame pericárdico, por lo que estuvo ingresada en Cardiología donde se realizó pericardiocentesis de 1.500 l de líquido serohemático, con citología negativa para malignidad. Actualmente vuelve a presentar dolor torácico y disnea.\n\nExploración física\nRegular estado general, hipotensión, taquicardia, taquipnea, regurgitación yugular, pulso paradójico. Ruidos cardiacos apagados.\n\nPruebas complementarias\n» Radiografía de tórax: aumento silueta c

In [50]:
df_codes_train_ner_final[df_codes_train_ner_final["doc_id"] == train_doc_list[check_id]]

| | doc_id | text_ref | start | end |
|---|---|---|---|---|
| 4489 | cc_onco493 | cáncer | 204 | 210 |
| 4490 | cc_onco493 | cáncer no microcítico | 473 | 494 |
| 4492 | cc_onco493 | malignidad | 708 | 718 |
| 4487 | cc_onco493 | adenocarcinoma | 1113 | 1127 |
| 4488 | cc_onco493 | adenocarcinoma | 1393 | 1407 |
| 4493 | cc_onco493 | Adenocarcinoma de pulmón E-IVa (c T4N3M1a) | 1503 | 1545 |
| 4491 | cc_onco493 | enfermedad pulmonar y hepática | 2034 | 2064 |


In [51]:
check_id_frag = sum(train_frag[:check_id])

In [52]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i], train_word_id[i], 
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in train_y[i][1:len(train_start_end_frag[i])+1]])))
    print("\n")

[('▁Ana', (0, 9), 0, 'O'), ('m', (0, 9), 0, 'O'), ('nesi', (0, 9), 0, 'O'), ('s', (0, 9), 0, 'O'), ('▁Mu', (10, 15), 1, 'O'), ('jer', (10, 15), 1, 'O'), ('▁de', (16, 18), 2, 'O'), ('▁73', (19, 21), 3, 'O'), ('▁años', (22, 26), 4, 'O'), ('▁con', (27, 30), 5, 'O'), ('▁antecede', (31, 43), 6, 'O'), ('ntes', (31, 43), 6, 'O'), ('▁de', (44, 46), 7, 'O'), ('▁hipertensi', (47, 59), 8, 'O'), ('ón', (47, 59), 8, 'O'), ('▁arterial', (60, 68), 9, 'O'), ('▁', (68, 69), 10, 'O'), (',', (68, 69), 10, 'O'), ('▁dis', (70, 80), 11, 'O'), ('lip', (70, 80), 11, 'O'), ('emia', (70, 80), 11, 'O'), ('▁', (80, 81), 12, 'O'), (',', (80, 81), 12, 'O'), ('▁diabetes', (82, 90), 13, 'O'), ('▁mell', (91, 99), 14, 'O'), ('itus', (91, 99), 14, 'O'), ('▁no', (100, 102), 15, 'O'), ('▁insulin', (103, 122), 16, 'O'), ('o', (103, 122), 16, 'O'), ('depend', (103, 122), 16, 'O'), ('i', (103, 122), 16, 'O'), ('ente', (103, 122), 16, 'O'), ('▁y', (123, 124), 17, 'O'), ('▁nunca', (125, 130), 18, 'O'), ('▁fum', (131, 139), 19,

In [53]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Ana m nesi s ▁Mu jer ▁de ▁73 ▁años ▁con ▁antecede ntes ▁de ▁hipertensi ón ▁arterial ▁ , ▁dis lip emia ▁ , ▁diabetes ▁mell itus ▁no ▁insulin o depend i ente ▁y ▁nunca ▁fum adora ▁ , ▁aunque ▁sí ▁fum adora ▁pasi va ▁( ▁marido ▁fum ador ▁activo ▁fall ecido ▁por ▁cáncer ▁de ▁pul món ▁) ▁ . ▁Interven ida ▁de ▁pró tes is ▁de ▁rodil la ▁bilateral ▁ . ▁Trata miento ▁cr ónico ▁con ▁ ator vasta tina ▁20 mg ▁ , ▁gli me piri da ▁4 ▁mg ▁y ▁rami pri l ▁2 ▁ , ▁5 ▁mg ▁ . ▁No ▁tiene ▁alergi as ▁medicamentos as ▁ . ▁Está ▁jubila da ▁tras ▁trabajar ▁como ▁pana dera ▁ . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁Ante ce dente ▁on c ológico ▁familiar ▁de ▁hermano ▁con ▁cáncer ▁no ▁micro cí tico ▁de ▁pul món ▁ . ▁Presenta ▁en ▁junio ▁de ▁2018 ▁episodio ▁de ▁der ram e ▁per ic ár dico ▁ , ▁por ▁lo ▁que ▁estuvo ▁in gres ada ▁en ▁Card i ología ▁donde ▁se ▁realizó ▁per i car dio centes is ▁de ▁1 ▁ . ▁500 ▁l ▁de ▁líquido ▁ser o he mático ▁ , ▁con ▁cit ología ▁negativa ▁para ▁malign idad ▁ . ▁Actua

### Development corpus

Only development texts with NER annotations are considered:

In [54]:
# All development documents (texts) are annotated 
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner_final["doc_id"]))

0

In [55]:
dev_doc_list = sorted(set(df_codes_dev_ner_final["doc_id"]))

In [56]:
len(dev_doc_list)

250

In [64]:
# Sentence-Split data

In [57]:
%%time
ss_sub_corpus_path = ss_corpus_path + "development/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 9.25 ms, sys: 4 ms, total: 13.2 ms
Wall time: 12.5 ms


In [58]:
%%time
dev_ind, dev_att, dev_type, dev_y, dev_frag, dev_start_end_frag, dev_word_id = ss_create_input_data_ner(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner_final, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

 84%|████████▍ | 211/250 [00:16<00:02, 18.14it/s]

I
doc_id      cc_onco1427
text_ref        pT3N2Mx
start              1928
end                1935
Name: 803, dtype: object
11
11
[[1849 1858]
 [1859 1867]
 [1868 1872]
 [1873 1887]
 [1888 1898]
 [1899 1900]
 [1900 1905]
 [1906 1916]
 [1917 1918]
 [1919 1924]
 [1924 1925]
 [1926 1935]
 [1935 1936]
 [1937 1939]
 [1940 1948]
 [1949 1952]
 [1953 1961]
 [1962 1963]
 [1964 1971]
 [1972 1980]
 [1981 1985]
 [1986 1996]
 [1996 1997]]


100%|██████████| 250/250 [00:19<00:00, 12.89it/s]

CPU times: user 19.5 s, sys: 8.08 ms, total: 19.5 s
Wall time: 19.5 s





In [67]:
# Sanity check

In [59]:
dev_ind.shape

(2965, 128)

In [60]:
dev_att.shape

(2965, 128)

In [61]:
dev_type.shape

(2965, 128)

In [62]:
dev_y.shape

(2965, 128)

In [63]:
len(dev_frag)

250

In [64]:
len(dev_start_end_frag)

2965

In [65]:
len(dev_word_id)

2965

In [66]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      11.860000
std        5.209699
min        3.000000
25%        8.000000
50%       10.000000
75%       14.000000
max       37.000000
dtype: float64

In [67]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [68]:
check_id

49

In [69]:
dev_doc_list[check_id]

'cc_onco1090'

In [70]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Anamnesis\nMujer de 30 años, con antecedente de poliartropatía inflamatoria seronegativa de 7 años de evolución, que debuta en septiembre de 2013 con un cuadro de astenia generalizada, pérdida de peso, disnea de moderados esfuerzos, xeroftalmia y tos, adenopatías laterocervicales y aumento de tamaño de la glándula tiroides. Se realizan una ecografía y una PAAF tiroidea, sugerentes de malignidad. Se completa el estudio de extensión con una PET-TC, apreciándose metástasis óseas y ganglionares.\nEn octubre de 2013 se realiza rastreo corporal con yodo 131, con captación a nivel del lecho tiroideo, sin objetivarse focos de captación extratiroideos. En noviembre de 2013 es intervenida con tiroidectomía completa y vaciamiento ganglionar bilateral (en el lado derecho incompleto), siendo diagnosticada de carcinoma papilar de tiroides variante esclerosante difusa con metástasis óseas y linfáticas. Presenta además derrame pleural derecho (citología positiva para cáncer de tiroides), derrame peri

In [71]:
df_codes_dev_ner_final[df_codes_dev_ner_final["doc_id"] == dev_doc_list[check_id]]

| | doc_id | text_ref | start | end |
|---|---|---|---|---|
| 236 | cc_onco1090 | malignidad | 386 | 396 |
| 238 | cc_onco1090 | metástasis | 463 | 473 |
| 241 | cc_onco1090 | carcinoma papilar de tiroides variante esclero... | 805 | 878 |
| 235 | cc_onco1090 | cáncer | 964 | 970 |
| 239 | cc_onco1090 | metástasis | 2017 | 2027 |
| 237 | cc_onco1090 | metastásica | 2350 | 2361 |
| 242 | cc_onco1090 | Carcinoma papilar de tiroides, variante escler... | 3297 | 3356 |
| 240 | cc_onco1090 | metástasis | 3370 | 3380 |


In [72]:
check_id_frag = sum(dev_frag[:check_id])

In [73]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i], dev_word_id[i],
               [lab_encoder.inverse_transform([label])[0] if label != IGNORE_VALUE else label \
                for label in dev_y[i][1:len(dev_start_end_frag[i])+1]])))
    print("\n")

[('▁Ana', (0, 9), 0, 'O'), ('m', (0, 9), 0, 'O'), ('nesi', (0, 9), 0, 'O'), ('s', (0, 9), 0, 'O'), ('▁Mu', (10, 15), 1, 'O'), ('jer', (10, 15), 1, 'O'), ('▁de', (16, 18), 2, 'O'), ('▁30', (19, 21), 3, 'O'), ('▁años', (22, 26), 4, 'O'), ('▁', (26, 27), 5, 'O'), (',', (26, 27), 5, 'O'), ('▁con', (28, 31), 6, 'O'), ('▁antecede', (32, 43), 7, 'O'), ('nte', (32, 43), 7, 'O'), ('▁de', (44, 46), 8, 'O'), ('▁poli', (47, 61), 9, 'O'), ('ar', (47, 61), 9, 'O'), ('trop', (47, 61), 9, 'O'), ('at', (47, 61), 9, 'O'), ('ía', (47, 61), 9, 'O'), ('▁inflama', (62, 74), 10, 'O'), ('toria', (62, 74), 10, 'O'), ('▁ser', (75, 87), 11, 'O'), ('o', (75, 87), 11, 'O'), ('nega', (75, 87), 11, 'O'), ('tiva', (75, 87), 11, 'O'), ('▁de', (88, 90), 12, 'O'), ('▁7', (91, 92), 13, 'O'), ('▁años', (93, 97), 14, 'O'), ('▁de', (98, 100), 15, 'O'), ('▁evolución', (101, 110), 16, 'O'), ('▁', (110, 111), 17, 'O'), (',', (110, 111), 17, 'O'), ('▁que', (112, 115), 18, 'O'), ('▁debut', (116, 122), 19, 'O'), ('a', (116, 122),

In [74]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Ana m nesi s ▁Mu jer ▁de ▁30 ▁años ▁ , ▁con ▁antecede nte ▁de ▁poli ar trop at ía ▁inflama toria ▁ser o nega tiva ▁de ▁7 ▁años ▁de ▁evolución ▁ , ▁que ▁debut a ▁en ▁septiembre ▁de ▁2013 ▁con ▁un ▁cuadro ▁de ▁a st enia ▁general izada ▁ , ▁pérdida ▁de ▁peso ▁ , ▁dis ne a ▁de ▁moder ados ▁esfuerzo s ▁ , ▁xer of tal mia ▁y ▁to s ▁ , ▁ade no pat ías ▁later o cer vica les ▁y ▁aumento ▁de ▁tamaño ▁de ▁la ▁g lán dula ▁tiro ides ▁ . ▁Se ▁realizan ▁una ▁ec ografía ▁y ▁una ▁PA AF ▁tiro idea ▁ , ▁suger entes ▁de ▁malign idad ▁ . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁Se ▁completa ▁el ▁estudio ▁de ▁extensión ▁con ▁una ▁PET ▁- ▁TC ▁ , ▁a preci ándose ▁met ást asis ▁ó se as ▁y ▁gang lion ares ▁ . ▁En ▁octubre ▁de ▁2013 ▁se ▁realiza ▁rast reo ▁corporal ▁con ▁yo do ▁131 ▁ , ▁con ▁capta ción ▁a ▁nivel ▁del ▁le cho ▁tiro ide o ▁ , ▁sin ▁objetiva rse ▁foco s ▁de ▁capta ción ▁extra ti roid e os ▁ . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

### Training & Development corpus

We merge the previously generated datasets:

In [75]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [76]:
train_dev_ind.shape

(13879, 128)

In [77]:
# Attention
train_dev_att = np.concatenate((train_att, dev_att))

In [78]:
train_dev_att.shape

(13879, 128)

In [79]:
# Type
train_dev_type = np.concatenate((train_type, dev_type))

In [80]:
train_dev_type.shape

(13879, 128)

In [81]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [82]:
train_dev_y.shape

(13879, 128)
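As a quick check of the merged corpus size and of the training length it implies, the following arithmetic uses the sequence counts reported above together with `BATCH_SIZE` and `EPOCHS`. It assumes that the last, partial batch of each epoch is kept, which is the usual behaviour when fitting on in-memory arrays.

```python
import math

n_train, n_dev = 10914, 2965   # sequences in the training and development sets
batch_size, epochs = 16, 71    # BATCH_SIZE and EPOCHS from the first cell

n_total = n_train + n_dev
steps_per_epoch = math.ceil(n_total / batch_size)  # last batch may be partial
total_steps = steps_per_epoch * epochs

print(n_total, steps_per_epoch, total_steps)  # 13879 868 61628
```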

## Fine-tuning

Using the corpus of labeled sequences, we fine-tune the model on a multi-class token classification task.

In [83]:
from transformers import TFXLMRobertaForTokenClassification

model = TFXLMRobertaForTokenClassification.from_pretrained(model_name, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFXLMRobertaForTokenClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFXLMRobertaForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFXLMRobertaForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFXLMRobertaForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [84]:
model.summary()

Model: "tfxlm_roberta_for_token_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  277453056 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 277,454,594
Trainable params: 277,454,594
Non-trainable params: 0
_________________________________________________________________


In [85]:
model.layers

[<transformers.models.roberta.modeling_tf_roberta.TFRobertaMainLayer at 0x7fdeb32c8310>,
 <tensorflow.python.keras.layers.core.Dropout at 0x7fde66733f90>,
 <tensorflow.python.keras.layers.core.Dense at 0x7fde67d85650>]

In [86]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')
attention_mask = Input(shape=(SEQ_LEN,), name='attention_mask', dtype='int64')
token_type_ids = Input(shape=(SEQ_LEN,), name='token_type_ids', dtype='int64')
inputs = [input_ids, attention_mask, token_type_ids]

num_labels = len(lab_encoder.classes_)

out_seq = model.layers[0](input_ids=inputs[0], attention_mask=inputs[1], token_type_ids=inputs[2])[0] # take the output sub-token sequence 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(out_seq) # Multi-class classification

model = Model(inputs=inputs, outputs=out_logits)

In [87]:
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
roberta (TFRobertaMainLayer)    TFBaseModelOutputWit 277453056   input_ids[0][0]                  
                                                                 attention_mask[0][0]  

In [88]:
model.input

[<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>,
 <tf.Tensor 'attention_mask:0' shape=(None, 128) dtype=int64>,
 <tf.Tensor 'token_type_ids:0' shape=(None, 128) dtype=int64>]

In [89]:
model.output

<tf.Tensor 'dense/BiasAdd:0' shape=(None, 128, 3) dtype=float32>

In [None]:
%%time
from tensorflow.keras import optimizers, losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = TokenClassificationLoss(from_logits=LOGITS, ignore_val=IGNORE_VALUE)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_ind, 'attention_mask': train_dev_att, 'token_type_ids': train_dev_type}, 
                    y=train_dev_y, batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)

Epoch 1/71
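`TokenClassificationLoss` is imported from `nlp_utils`, so its implementation is not shown here. Judging from its arguments (`from_logits`, `ignore_val`), it presumably computes a cross-entropy over logits while skipping positions labelled with `IGNORE_VALUE`. The pure-Python sketch below illustrates that idea; the function `masked_token_cross_entropy` is a hypothetical stand-in, not the actual loss class.

```python
import math


def masked_token_cross_entropy(logits, labels, ignore_val=-100):
    """Mean cross-entropy over the positions whose label is not ignore_val.

    logits: list of per-token lists of raw class scores
    labels: list of integer class ids (or ignore_val)"""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_val:
            continue  # ignored positions do not contribute to the loss
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]  # negative log-softmax of the gold class
        count += 1
    return total / count if count else 0.0


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
labels = [0, 2, -100]  # the third position is ignored
loss = masked_token_cross_entropy(logits, labels)
```

Averaging only over the non-ignored positions keeps heavily padded sequences from diluting the loss.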

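As a sanity check, we evaluate the model predictions on the development set. Since the development set was also used for fine-tuning, these scores do not estimate generalization performance: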

In [101]:
%%time
dev_preds = tf.nn.softmax(logits=model.predict({'input_ids': dev_ind, 'attention_mask': dev_att, 
                                                'token_type_ids': dev_type}), 
                           axis=-1).numpy()

CPU times: user 11.7 s, sys: 1.5 s, total: 13.2 s
Wall time: 16.3 s


In [102]:
dev_preds.shape

(2965, 128, 3)
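The predictions hold one probability vector per subtoken, whereas annotations are produced per word. The sketch below shows one possible reading of the `"word-max"` evaluation strategy: for each word, take the class with the single highest probability among all of its subtokens. This interpretation, the function name `word_max_labels` and the toy probabilities are assumptions; the actual rule is implemented in `nlp_utils`.

```python
def word_max_labels(probs, word_ids, classes=("B", "I", "O")):
    """Pick, for each word, the class with the highest subtoken probability.

    probs: per-subtoken class probabilities
    word_ids: word index of each subtoken (same length as probs)
    Returns a dict {word_id: label}."""
    best = {}  # word_id -> (probability, class index)
    for p, w in zip(probs, word_ids):
        top = max(range(len(p)), key=p.__getitem__)
        if w not in best or p[top] > best[w][0]:
            best[w] = (p[top], top)
    return {w: classes[idx] for w, (_, idx) in best.items()}


probs = [[0.1, 0.1, 0.8],   # word 0, single subtoken: O
         [0.6, 0.1, 0.3],   # word 1, first subtoken: B with 0.6 (highest)
         [0.3, 0.2, 0.5]]   # word 1, second subtoken: O with 0.5
result = word_max_labels(probs, [0, 1, 1])
print(result)  # {0: 'O', 1: 'B'}
```

The class order `("B", "I", "O")` matches the alphabetical order in which `LabelEncoder` stores the fitted labels.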

In [103]:
out_dev_path = "dev_preds/"

In [104]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=dev_doc_list, fragments=dev_frag, preds=dev_preds, 
                                    start_end=dev_start_end_frag, word_id=dev_word_id, 
                                    lb_encoder=lab_encoder, df_text=df_text_dev, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_dev_path)

100%|██████████| 250/250 [00:00<00:00, 275.22it/s]
100%|██████████| 250/250 [00:00<00:00, 300.63it/s]
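Writing predictions in BRAT format requires turning word-level IOB-2 labels back into character spans. The sketch below is a minimal version of that decoding step. The function name `iob2_to_spans`, the handling of a stray `I` label and the entity type string are assumptions; the notebook's own conversion is done by `ner_preds_brat_format` and `write_ner_ann`.

```python
def iob2_to_spans(word_labels, word_offsets):
    """Merge word-level IOB-2 labels into (start, end) character spans.

    An 'I' with no open mention is treated as the start of a new mention."""
    spans, current = [], None
    for label, (start, end) in zip(word_labels, word_offsets):
        if label == "B" or (label == "I" and current is None):
            if current is not None:
                spans.append(tuple(current))
            current = [start, end]
        elif label == "I":
            current[1] = end  # extend the open mention
        else:  # 'O' closes any open mention
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


text = "hermano con cáncer no microcítico de pulmón"
offsets = [(0, 7), (8, 11), (12, 18), (19, 21), (22, 33), (34, 36), (37, 43)]
labels = ["O", "O", "B", "I", "I", "O", "O"]
spans = iob2_to_spans(labels, offsets)
# One BRAT text-bound line per span (entity type name is assumed)
brat_lines = ["T%d\tMORFOLOGIA_NEOPLASIA %d %d\t%s" % (i + 1, s, e, text[s:e])
              for i, (s, e) in enumerate(spans)]
```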


In [105]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/dev-set2/cantemist-ner/ -p ./dev_preds/ -s ner 


-----------------------------------------------------
Clinical case name			Precision
-----------------------------------------------------
cc_onco1001.ann		1.0
-----------------------------------------------------
cc_onco1007.ann		1.0
-----------------------------------------------------
cc_onco1008.ann		1.0
-----------------------------------------------------
cc_onco1009.ann		1.0
-----------------------------------------------------
cc_onco1010.ann		1.0
-----------------------------------------------------
cc_onco1011.ann		1.0
-----------------------------------------------------
cc_onco1012.ann		1.0
-----------------------------------------------------
cc_onco1014.ann		1.0
-----------------------------------------------------
cc_onco1016.ann		1.0
-----------------------------------------------------
cc_onco1018.ann		1.0
-----------------------------------------------------
cc_onco1019.ann		1.0
-----------------------------------------------------
cc_onco10

CPU times: user 7.54 ms, sys: 39.1 ms, total: 46.6 ms
Wall time: 1.04 s
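The scores above are computed by the Cantemist evaluation library. To clarify what span-level NER scoring measures, here is a minimal sketch of micro-averaged precision, recall and F1 under exact span matching. The function `span_prf` is hypothetical, the official script may differ in its details, and the predicted spans below are invented for illustration.

```python
def span_prf(gold, pred):
    """Micro-averaged precision, recall and F1 under exact span matching.

    gold, pred: sets of (doc_id, start, end) tuples."""
    tp = len(gold & pred)  # spans whose document and offsets match exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = {("cc_onco1090", 386, 396), ("cc_onco1090", 463, 473),
        ("cc_onco1090", 964, 970)}
# Toy predictions: two exact matches, one spurious span, one boundary error
pred = {("cc_onco1090", 386, 396), ("cc_onco1090", 463, 473),
        ("cc_onco1090", 100, 105), ("cc_onco1090", 964, 971)}
p, r, f1 = span_prf(gold, pred)
```

Under exact matching, a prediction that is off by a single character counts as both a false positive and a false negative.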


## Test set predictions

In [106]:
%%time
test_path = corpus_path + "test-set/" + sub_task_path
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f) and f.split('.')[-1] == 'txt']
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 10.6 ms, sys: 0 ns, total: 10.6 ms
Wall time: 9.81 ms


In [107]:
df_text_test.shape

(300, 2)

In [108]:
df_text_test.head()

| | doc_id | raw_text |
|---|---|---|
| 0 | cc_onco877 | Anamnesis\nMujer de 59 años, alérgica a penici... |
| 1 | cc_onco1075 | Anamnesis\nMujer de 52 años, sin alergias cono... |
| 2 | cc_onco1450 | Anamnesis\nMujer de 51 años de edad, sin antec... |
| 3 | cc_onco1165 | Anamnesis\nPaciente varón de 75 años sin hábit... |
| 4 | cc_onco1298 | Anamnesis\nMujer de 60 años, exfumadora de 20 ... |


In [109]:
len(set(df_text_test['doc_id']))

300

In [110]:
df_text_test.raw_text[0]

'Anamnesis\nMujer de 59 años, alérgica a penicilina y procaína. Fumadora activa (IPA: 43).\nAntecedentes familiares: abuelo materno diagnosticado de carcinoma colon a los 70 años; madre diagnosticada de carcinoma de mama bilateral a los 50 años; padre fallecido de carcinoma gástrico a los 47 años; tres tías maternas diagnosticadas de carcinoma de mama a los 55, 56 y 57 años respectivamente; y tres primas afectas de cáncer de mama.\nAntecedentes personales: bronquitis crónica, poliposis colónica, carcinoma ductal infiltrante clásico mama pT2pN0M0 G2 subtipo tumoral luminal a (RH: +, HER-2: negativo) intervenido en agosto de 2013 mediante tumorectomía mama izquierda (patrón round block) + biopsia selectiva ganglio centinela (negativo) y posterior QT adyuvante con esquema TC (paclitaxel-ciclofosfamida) x 4 ciclos.\nAcude en noviembre de 2013 a visita de seguimiento tras finalizar tratamiento adyuvante. Asintomática.\n\nExploración física\nTemperatura axilar 36,5ºC, tensión arterial 130/83

In [111]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [112]:
len(test_doc_list)

300

In [113]:
# Sentence-Split data

In [114]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test-background/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 135 ms, sys: 27.7 ms, total: 162 ms
Wall time: 162 ms


In [115]:
%%time
test_ind, test_att, test_type, _, test_frag, test_start_end_frag, test_word_id = ss_create_input_data_ner(df_text=df_text_test, 
                                                  text_col=text_col, 
                                                  # Test labels are ignored, so df_codes_train_ner_final is passed only as a placeholder for df_ann
                                                  df_ann=df_codes_train_ner_final, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, lab_encoder=lab_encoder, seq_len=SEQ_LEN, 
                                                  ign_value=IGNORE_VALUE, strategy=ANN_STRATEGY, greedy=GREEDY)

100%|██████████| 300/300 [00:27<00:00, 11.01it/s]

CPU times: user 27.4 s, sys: 0 ns, total: 27.4 s
Wall time: 27.4 s





In [116]:
# Sanity check

In [117]:
test_ind.shape

(3974, 128)

In [118]:
test_att.shape

(3974, 128)

In [119]:
test_type.shape

(3974, 128)

In [120]:
len(test_frag)

300

In [121]:
len(test_start_end_frag)

3974

In [122]:
len(test_word_id)

3974

In [123]:
%%time
test_preds = tf.nn.softmax(logits=model.predict({'input_ids': test_ind, 'attention_mask': test_att, 
                                                 'token_type_ids': test_type}), 
                           axis=-1).numpy()

CPU times: user 13.5 s, sys: 1.99 s, total: 15.5 s
Wall time: 19.8 s


In [124]:
test_preds.shape

(3974, 128, 3)

In [125]:
out_test_path = "test_preds/"

In [126]:
write_ner_ann(df_pred_ann=ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, preds=test_preds, 
                                    start_end=test_start_end_frag, word_id=test_word_id, lb_encoder=lab_encoder, 
                                    df_text=df_text_test, text_col=text_col, strategy=EVAL_STRATEGY), 
              out_path=out_test_path)

100%|██████████| 300/300 [00:01<00:00, 251.07it/s]
100%|██████████| 300/300 [00:01<00:00, 257.76it/s]


In [None]:
%%time
!python ../resources/cantemist-evaluation-library/src/main.py -g ../datasets/cantemist_v6/test-set/cantemist-ner/ -p ./test_preds/ -s ner 

In [None]:
# Save predictions on the test set

In [128]:
model_name = "xlmr_galen_" + str(random_seed)

In [129]:
np.save(file="test_preds_" + model_name + ".npy", arr=test_preds)

In [134]:
doc_word_preds, doc_word_start_end = seq_ner_preds_brat_format(doc_list=test_doc_list, fragments=test_frag, 
                           arr_start_end=test_start_end_frag, arr_word_id=test_word_id, arr_preds=test_preds, 
                           strategy=EVAL_STRATEGY)

100%|██████████| 300/300 [00:01<00:00, 251.53it/s]


In [135]:
import pickle

with open("test_doc_word_preds_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_preds, f)

with open("test_doc_word_start_end_" + model_name + ".pck", "wb") as f:
    pickle.dump(doc_word_start_end, f)