# Fine-tuning XLM-R-Galén on Cantemist-Coding

In this notebook, the XLM-R-Galén model is fine-tuned on both the training and development sets of the Cantemist-Coding corpus, following a multi-label sequence classification approach. The predictions made by the model on the test set are also saved, in order to further evaluate its clinical coding performance (see `results/Cantemist-Coding/Evaluation.ipynb`).

In [1]:
import tensorflow as tf

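# Auxiliary components
import os
import sys

import numpy as np
import pandas as pd

sys.path.append("..")
from nlp_utils import *  # project helpers (load_text_files, process_brat_labels, ...)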

# XLM-R tokenizer
from transformers import XLMRobertaTokenizer
import sentencepiece_pb2
model_name = "XLM-R-Galen/"
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
spt = sentencepiece_pb2.SentencePieceText()

# Hyper-parameters
text_col = "raw_text"
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 36
LR = 3e-5

random_seed = 0
tf.random.set_seed(random_seed)

## Load text

First, all text files from the training and development Cantemist corpora are loaded into separate dataframes.

The CIE-O codes are loaded as well.
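
The loading step relies on the `load_text_files` helper from `nlp_utils`, whose implementation is not shown in this notebook. As a rough illustration only, the sketch below reads a list of files from a directory and derives the document IDs in the same way as the cells that follow. The function name `load_text_files_sketch` and the sample file contents are hypothetical; the real helper may behave differently (for example, regarding encoding).

```python
import os
import tempfile


def load_text_files_sketch(file_names, path):
    """Read each file in `file_names` from directory `path`, keeping the given order."""
    texts = []
    for name in file_names:
        with open(os.path.join(path, name), encoding="utf-8") as f:
            texts.append(f.read())
    return texts


# Usage on a throwaway directory holding two tiny, made-up documents
with tempfile.TemporaryDirectory() as tmp_dir:
    samples = [("cc_onco1.txt", "Anamnesis\nVarón de 55 años."),
               ("cc_onco2.txt", "Anamnesis\nMujer de 40 años.")]
    for name, body in samples:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(body)
    file_names = sorted(os.listdir(tmp_dir))
    texts = load_text_files_sketch(file_names, tmp_dir)
    # Document IDs are the file names without the ".txt" extension
    doc_ids = [name.split(".txt")[0] for name in file_names]
```

The texts and document IDs stay aligned because both are derived from the same ordered list of file names, which is what allows them to be placed side by side in a dataframe.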

In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-coding/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train-set/" + sub_task_path + "txt/"
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f)]
n_train_files = len(train_files)
train_data = load_text_files(train_files, train_path)
dev1_path = corpus_path + "dev-set1/" + sub_task_path + "txt/"
train_files.extend([f for f in os.listdir(dev1_path) if os.path.isfile(dev1_path + f)])
train_data.extend(load_text_files(train_files[n_train_files:], dev1_path))
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 960 µs, sys: 15.7 ms, total: 16.7 ms
Wall time: 16.3 ms


In [4]:
df_text_train.shape

(751, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco453,"Anamnesis\nSe trata de un varón de 55 años, ex..."
1,cc_onco962,Anamnesis\nMujer de 40 años que consulta por l...
2,cc_onco989,"Anamnesis\nPaciente de 43 años, perimenopáusic..."
3,cc_onco187,"Anamnesis\nVarón de 72 años, exfumador y bebed..."
4,cc_onco164,"Anamnesis\nMujer de 51 años, sin alergias medi..."


In [6]:
df_text_train.raw_text[0]

'Anamnesis\nSe trata de un varón de 55 años, ex fumador con un índice tabáquico de 40 paquetes-año, HTA, sin antecedentes familiares de interés, en tratamiento con ácido fólico, omeprazol, hierro oral y risperidona.\nEn junio de 2017, es diagnosticado a raíz de una trombosis iliaca derecha de una masa tumoral pobremente diferenciada que infiltraba tercio distal de apéndice cecal, con obliteración del paquete vascular iliaco, así como infiltración del uréter derecho, condicionando ureterohidronefrosis derecha grado cuatro.\nSe realiza biopsia de la lesión, con hallazgos anatomopatológicos de neoplasia fusocelular con inmunohistoquímica (IHC) sugerente de origen urotelial con diferenciación sarcomatoide (positividad para p63, p40, GATA3, CD99, EMA CAM 5.2, CK 34, beta E12, CK7, negatividad para CK20, S100, CD34, cKIT y TTF1), con SYT no reordenado.\nSe lleva a cabo intervención en julio de 2017 con extirpación de la masa tumoral, resección ileocecal, dejando ileostomía terminal y bypass 

We also load the CIE-O codes table:

In [7]:
df_codes_train = pd.concat([pd.read_table(corpus_path + "train-set/" + sub_task_path + "train-coding.tsv", 
                                 sep='\t', header=0), 
                            pd.read_table(corpus_path + "dev-set1/" + sub_task_path + "dev1-coding.tsv", 
                                 sep='\t', header=0)])

In [8]:
df_codes_train.shape

(4142, 2)

In [9]:
df_codes_train["code"] = df_codes_train["code"].str.lower()

In [10]:
df_codes_train.head()

Unnamed: 0,file,code
0,cc_onco860,8140/3
1,cc_onco860,8000/6
2,cc_onco860,8140/6
3,cc_onco98,8000/6
4,cc_onco98,8070/3


In [11]:
df_codes_train[df_codes_train.duplicated(keep=False)]

Unnamed: 0,file,code
1862,cc_onco768,9080/1
1863,cc_onco768,9080/1


In [12]:
df_codes_train = df_codes_train.drop_duplicates(keep='first')

In [13]:
df_codes_train.shape

(4141, 2)

In [14]:
len(set(df_codes_train["file"]))

750
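
The duplicate removal performed on the codes table keeps the first occurrence of each repeated `(file, code)` pair. A minimal pure-Python sketch of that behaviour follows; the helper name `drop_duplicate_pairs` is hypothetical, and the third row is made up for illustration.

```python
def drop_duplicate_pairs(rows):
    """Keep the first occurrence of each (file, code) pair, preserving order."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique_rows.append(row)
    return unique_rows


rows = [
    ("cc_onco768", "9080/1"),
    ("cc_onco768", "9080/1"),  # exact duplicate, dropped
    ("cc_onco860", "8140/3"),
]
unique_rows = drop_duplicate_pairs(rows)

# Number of distinct documents that have at least one code
n_docs = len({file for file, _ in unique_rows})
```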

### Development corpus

In [15]:
%%time
dev_path = corpus_path + "dev-set2/" + sub_task_path + "txt/"
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f)]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 5.63 ms, sys: 93 µs, total: 5.73 ms
Wall time: 5.55 ms


In [16]:
df_text_dev.shape

(250, 2)

In [17]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco1183,Anamnesis\nVarón de 48 años que acude en agost...
1,cc_onco751,Anamnesis\nHombre de 71 años de edad con antec...
2,cc_onco1384,"Anamnesis\nVarón de 30 años, sin antecedentes ..."
3,cc_onco1208,Anamnesis\nVarón de 76 años diagnosticado en a...
4,cc_onco734,Anamnesis\nTras canalización del reservorio ce...


In [18]:
df_text_dev.raw_text[0]

'Anamnesis\nVarón de 48 años que acude en agosto de 2012 tras la realización de amputación en hallux derecho a la primera valoración por Oncología Médica.\nAntecedentes: no alérgicos. Patológicos: diabético hace 3 años sin tratamiento. HTA en tratamiento. Enolismo crónico.\nTrastorno afectivo bipolar. Quirúrgicos: apendicectomizado.\nEnfermedad actual: desde hace 2 años presentaba una lesión hiperpigmentada en el hallux del pie derecho, que ocasionalmente sangraba.\n\nExamen físico\nMuñón de hallux derecho en buen estado, presencia de adenopatías inguinales derechas. Resto de exploración sin alteraciones.\n\nPruebas complementarias\n- Biopsia cutánea: melanoma ulcerado.\n- AP de resección de melanoma: melanoma lentiginoso acral en fase de crecimiento vertical, ulcerado. Clark IV. Breslow 6,8 mm, sin infiltración linfovascular.\n- TC de tórax-abdomen-pelvis: sin signos concluyentes de extensión tóraco-abdómino-pélvica de melanoma.\nMicronódulos en el lóbulo superior derecho a valorar en

We also load the CIE-O codes table:

In [19]:
df_codes_dev = pd.read_table(corpus_path + "dev-set2/" + sub_task_path + "dev2-coding.tsv", 
                                 sep='\t', header=0)

In [20]:
df_codes_dev.shape

(1279, 2)

In [21]:
df_codes_dev["code"] = df_codes_dev["code"].str.lower()

In [22]:
df_codes_dev.head()

Unnamed: 0,file,code
0,cc_onco1258,8000/6
1,cc_onco1258,8010/3
2,cc_onco1258,8010/3/h
3,cc_onco1258,8520/3
4,cc_onco1258,8000/3


In [23]:
len(set(df_codes_dev["file"]))

250

We join the training and development Cantemist codes dataframes together:

In [24]:
df_codes_train_dev = pd.concat([df_codes_train, df_codes_dev])

In [25]:
df_codes_train_dev.shape

(5420, 2)

In [26]:
df_codes_train_dev.head()

Unnamed: 0,file,code
0,cc_onco860,8140/3
1,cc_onco860,8000/6
2,cc_onco860,8140/6
3,cc_onco98,8000/6
4,cc_onco98,8070/3


## Creating corpora of annotated sentences

Leveraging the information available for the Cantemist-NORM named-entity recognition and normalization (NER-N) task, we create both a training and a development corpus of sentences annotated with CIE-O codes.

First, we pre-process the NER-N oncology-code annotations available for both the training and development corpora.
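
The pre-processing is done by the `process_brat_labels` helper, which is not shown here. As a hedged sketch, the code below parses brat-style annotation text under the assumption that each mention is a `T` line holding the character span and each code is a `#` note line pointing at that mention. The function name `parse_brat_ann`, the exact file layout, and the `MORFOLOGIA_NEOPLASIA` label are assumptions; the sample values simply mirror the kind of rows this step produces.

```python
def parse_brat_ann(ann_text, doc_id):
    """Parse brat-style text: 'T' lines hold spans, '#' lines hold the code of a span."""
    spans, codes = {}, {}
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        if fields[0].startswith("T"):
            # Assumed layout: "T1<TAB>LABEL start end<TAB>mention text"
            parts = fields[1].split(" ")
            spans[fields[0]] = (fields[2], int(parts[1]), int(parts[-1]))
        elif fields[0].startswith("#"):
            # Assumed layout: "#1<TAB>AnnotatorNotes T1<TAB>code"
            target = fields[1].split(" ")[1]
            codes[target] = fields[2].strip().lower()
    rows = [
        {"doc_id": doc_id, "code": codes[t], "text_ref": text, "start": start, "end": end}
        for t, (text, start, end) in spans.items()
        if t in codes
    ]
    return sorted(rows, key=lambda r: (r["start"], r["end"]))


sample = (
    "T1\tMORFOLOGIA_NEOPLASIA 2719 2740\tCarcinoma microcítico\n"
    "#1\tAnnotatorNotes T1\t8041/3\n"
    "T2\tMORFOLOGIA_NEOPLASIA 2988 2990\tM0\n"
    "#2\tAnnotatorNotes T2\t8000/6\n"
)
rows = parse_brat_ann(sample, "cc_onco1")
```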

In [27]:
# Training corpus

In [28]:
train_norm_path = corpus_path + "train-set/cantemist-norm/"
train_ann_files = [train_norm_path + f for f in os.listdir(train_norm_path) if f.split('.')[-1] == "ann"]
dev1_norm_path = corpus_path + "dev-set1/cantemist-norm/"
train_ann_files.extend([dev1_norm_path + f for f in os.listdir(dev1_norm_path) if f.split('.')[-1] == "ann"])

In [29]:
len(train_ann_files)

751

In [30]:
df_codes_train_ner = process_brat_labels(train_ann_files).sort_values(["doc_id", "start", "end"])

In [31]:
df_codes_train_ner.shape

(9737, 5)

In [32]:
df_codes_train_ner["code"] = df_codes_train_ner["code"].str.lower()

In [33]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,code,text_ref,start,end
5230,cc_onco1,8041/3,Carcinoma microcítico,2719,2740
5231,cc_onco1,8041/3,carcinoma microcítico,2950,2971
5232,cc_onco1,8000/6,M0,2988,2990
97,cc_onco10,8000/1,tumor,212,217
95,cc_onco10,8000/1,neoplasia,976,985


In [34]:
len(set(df_codes_train_ner["doc_id"]))

750

In [35]:
# Development corpus

In [36]:
dev_norm_path = corpus_path + "dev-set2/cantemist-norm/"
dev_ann_files = [dev_norm_path + f for f in os.listdir(dev_norm_path) if f.split('.')[-1] == "ann"]

In [37]:
len(dev_ann_files)

250

In [38]:
df_codes_dev_ner = process_brat_labels(dev_ann_files).sort_values(["doc_id", "start", "end"])

In [39]:
df_codes_dev_ner.shape

(2660, 5)

In [40]:
df_codes_dev_ner["code"] = df_codes_dev_ner["code"].str.lower()

In [41]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,code,text_ref,start,end
1852,cc_onco1001,8070/3,carcinoma epidermoide,576,597
1854,cc_onco1001,8000/1,neoplasia,790,799
1857,cc_onco1001,8140/6,adenocarcinoma T4N3M1b,836,858
1853,cc_onco1001,8000/6,enfermedad hepática,1205,1224
1855,cc_onco1001,8000/1,tumoral,2303,2310


In [42]:
len(set(df_codes_dev_ner["doc_id"]))

250

Now, using the character start-end positions of each sentence from the Cantemist corpus (see `datasets/Cantemist-Sentence-Split.ipynb`), we annotate the sentences with CIE-O codes. Using the XLM-R tokenizer, each sentence is also converted into a sequence of subwords, which are then converted into vocabulary indices (input IDs) and attention mask arrays (the XLM-R input tensors). We also generate a *fragments* dataset recording the number of annotated sentences produced for each document.
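
To make the sentence annotation step concrete, the sketch below assigns each annotated code to the sentences whose character span overlaps the annotation span. This is only an assumption about how the matching could work; the actual logic lives in `ss_create_frag_input_data_xlmr` and is not shown in this notebook. The function name `annotate_sentences` and the offsets used are hypothetical.

```python
def annotate_sentences(sentence_spans, annotations):
    """Assign each annotation's code to every sentence whose span overlaps it."""
    labels = [set() for _ in sentence_spans]
    for code, ann_start, ann_end in annotations:
        for i, (sent_start, sent_end) in enumerate(sentence_spans):
            # Half-open intervals [start, end) overlap when each starts before the other ends
            if ann_start < sent_end and ann_end > sent_start:
                labels[i].add(code)
    return [sorted(codes) for codes in labels]


# Three sentences of one document, given as (start, end) character offsets
sentence_spans = [(0, 100), (100, 250), (250, 400)]
# Annotations as (code, start, end)
annotations = [("8000/1", 212, 217), ("8000/6", 300, 302), ("8000/1", 120, 129)]

sentence_labels = annotate_sentences(sentence_spans, annotations)
# One entry of the fragments dataset: number of sentences produced for this document
fragments_per_doc = len(sentence_spans)
```

Sentences without any overlapping annotation end up with an empty label set, which is consistent with the many unlabeled sentences seen in the sanity checks further down.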

In [43]:
# Sentence-Split information
ss_corpus_path = "../datasets/Cantemist-SSplit-text/"

### Training corpus

In [44]:
label_list = list(df_codes_train_dev["code"])

In [45]:
len(label_list)

5420

In [46]:
len(set(label_list))

743

In [47]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb_encoder = MultiLabelBinarizer()
mlb_encoder.fit([label_list])

MultiLabelBinarizer()

In [48]:
# Number of distinct codes
num_labels = len(mlb_encoder.classes_)

In [49]:
num_labels

743
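
The label encoder maps each set of codes to a fixed-length binary vector with one position per distinct code. The pure-Python sketch below illustrates the idea on a handful of codes; the helper names are hypothetical, and it is meant as an illustration of multi-hot encoding rather than a replacement for `MultiLabelBinarizer`.

```python
def fit_classes(all_codes):
    """Return the sorted list of distinct codes (one output column per code)."""
    return sorted(set(all_codes))


def encode(codes, classes):
    """Turn a collection of codes into a multi-hot vector aligned with `classes`."""
    index = {code: i for i, code in enumerate(classes)}
    vector = [0] * len(classes)
    for code in codes:
        vector[index[code]] = 1
    return vector


def decode(vector, classes):
    """Recover the tuple of codes whose positions are set to 1."""
    return tuple(code for code, flag in zip(classes, vector) if flag)


classes = fit_classes(["8140/3", "8000/6", "8140/6", "8000/6", "8070/3"])
vector = encode(["8000/6", "8140/3"], classes)
decoded = decode(vector, classes)
empty = decode([0] * len(classes), classes)  # a sentence with no codes
```

A sentence with no annotated code is encoded as an all-zero vector and decodes to an empty tuple.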

Only training texts that are annotated with CIE-O codes are considered:

In [50]:
# Some train documents (texts) are not annotated 
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner["doc_id"]))

1

In [51]:
train_doc_list = sorted(set(df_codes_train_ner["doc_id"]))

In [52]:
len(train_doc_list)

750

In [53]:
# Sentence-Split data

In [54]:
%%time
ss_sub_corpus_path = ss_corpus_path + "training/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 32.2 ms, sys: 4.02 ms, total: 36.2 ms
Wall time: 35.8 ms


In [55]:
%%time
train_ind, train_att, train_y, train_frag, train_start_end_frag = ss_create_frag_input_data_xlmr(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 750/750 [00:39<00:00, 18.85it/s]


CPU times: user 40.2 s, sys: 115 ms, total: 40.3 s
Wall time: 40.2 s


In [56]:
# Sanity check

In [57]:
train_ind.shape

(27633, 128)

In [58]:
train_att.shape

(27633, 128)
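
Both arrays have `SEQ_LEN` columns because every sentence is brought to a fixed length. The sketch below shows one way to build the input IDs and attention mask for a single sentence, assuming the usual XLM-R special token IDs (`<s>` = 0, `<pad>` = 1, `</s>` = 2) and simple truncation of overly long sentences. The helper name `pad_to_length` and the subword IDs are made up, and the project helper may handle long sentences differently.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed XLM-R ids for <s>, <pad>, </s>


def pad_to_length(token_ids, seq_len):
    """Wrap subword ids in <s> ... </s>, then pad (or truncate) to `seq_len`."""
    body = token_ids[: seq_len - 2]  # leave room for the two special tokens
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)            # 1 marks real tokens
    n_pad = seq_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 marks padding


# Three made-up subword ids, padded to a length of 8
input_ids, attention_mask = pad_to_length([1401, 39, 2750], seq_len=8)

# A sentence longer than the limit is truncated in this sketch
long_ids, long_mask = pad_to_length(list(range(10, 30)), seq_len=8)
```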

In [59]:
train_y.shape

(27633, 743)

In [60]:
len(train_frag)

750

In [61]:
len(train_start_end_frag)

27633

In [62]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    750.000000
mean      36.844000
std       14.591672
min       11.000000
25%       27.000000
50%       34.000000
75%       44.000000
max      102.000000
dtype: float64
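
Since the fragment counts are stored in document order, the rows belonging to a given document can be located with cumulative sums, which is how the inspection cells below index into the encoded arrays. A small sketch with made-up counts (the helper name `fragment_ranges` is hypothetical):

```python
from itertools import accumulate


def fragment_ranges(frag_counts):
    """Return the (start, end) row range of each document, given per-document counts."""
    ends = list(accumulate(frag_counts))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


frag_counts = [3, 2, 4]  # made-up number of sentences for three documents
ranges = fragment_ranges(frag_counts)

# Rows of the second document (index 1) in the stacked sentence arrays
start, end = ranges[1]
```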

In [63]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [64]:
check_id

219

In [65]:
train_doc_list[check_id]

'cc_onco337'

In [66]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Anamnesis\nVarón de 61 años con antecedentes personales de tabaquismo, diabetes mellitus tipo 2, enfermedad renal crónica secundaria a nefropatía diabética y melanoma cutáneo intervenido (estadío Ib) en 1993, sin evidencia de recidiva en seguimiento posterior. Consultó en el servicio de Urgencias por cuadro catarral y astenia de 1 mes de evolución sin otra sintomatología añadida.\n\nExploración física\nEl paciente presentaba un buen estado general y las constantes vitales se encontraban en rango de normalidad. En la exploración física destacaba una cicatriz quirúrgica de melanoma extirpado en región dorsal, así como disminución del murmullo vesicular en base izquierda a la auscultación pulmonar. No se objetivaron lesiones dérmicas sospechosas de malignidad ni otros hallazgos de interés en la exploración.\n\nPruebas complementarias\nEn la radiografía de tórax se produjo hallazgo casual de una masa retrocardíaca de bordes bien definidos. El TAC torácico confirmó la presencia de una lesi

In [67]:
check_id_frag = sum(train_frag[:check_id])
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([train_y[i]])), "\n")

[('8720/3',)] 

[()] 

[()] 

[('8720/3',)] 

[('8000/3',)] 

[()] 

[('8000/6',)] 

[()] 

[()] 

[()] 

[('8000/1', '8000/6')] 

[('8000/34',)] 

[()] 

[('8720/3',)] 

[('8000/6',)] 

[()] 

[()] 

[('8720/6',)] 

[()] 



In [68]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i])))
    print("\n")

[('▁Ana', (0, 3)), ('m', (3, 4)), ('nesi', (4, 8)), ('s', (8, 9)), ('▁Var', (9, 13)), ('ón', (13, 16)), ('▁de', (16, 19)), ('▁61', (19, 22)), ('▁años', (22, 28)), ('▁con', (28, 32)), ('▁antecede', (32, 41)), ('ntes', (41, 45)), ('▁personales', (45, 56)), ('▁de', (56, 59)), ('▁taba', (59, 64)), ('quis', (64, 68)), ('mo', (68, 70)), (',', (70, 71)), ('▁diabetes', (71, 80)), ('▁mell', (80, 85)), ('itus', (85, 89)), ('▁tipo', (89, 94)), ('▁2', (94, 96)), (',', (96, 97)), ('▁enfermedad', (97, 108)), ('▁renal', (108, 114)), ('▁crónica', (114, 123)), ('▁secundaria', (123, 134)), ('▁a', (134, 136)), ('▁ne', (136, 139)), ('fro', (139, 142)), ('pat', (142, 145)), ('ía', (145, 148)), ('▁di', (148, 151)), ('ab', (151, 153)), ('ética', (153, 159)), ('▁y', (159, 161)), ('▁melan', (161, 167)), ('oma', (167, 170)), ('▁cu', (170, 173)), ('tá', (173, 176)), ('neo', (176, 179)), ('▁interven', (179, 188)), ('ido', (188, 191)), ('▁(', (191, 193)), ('esta', (193, 197)), ('dí', (197, 200)), ('o', (200, 201))

In [69]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Ana m nesi s ▁Var ón ▁de ▁61 ▁años ▁con ▁antecede ntes ▁personales ▁de ▁taba quis mo , ▁diabetes ▁mell itus ▁tipo ▁2 , ▁enfermedad ▁renal ▁crónica ▁secundaria ▁a ▁ne fro pat ía ▁di ab ética ▁y ▁melan oma ▁cu tá neo ▁interven ido ▁( esta dí o ▁I b ) ▁en ▁1993 , ▁sin ▁evidencia ▁de ▁reci di va ▁en ▁seguimiento ▁posterior . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁Consul tó ▁en ▁el ▁servicio ▁de ▁Ur ge ncias ▁por ▁cuadro ▁ca tar ral ▁y ▁a st enia ▁de ▁1 ▁mes ▁de ▁evolución ▁sin ▁otra ▁sin tomat ología ▁a ña dida . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad

In [70]:
# Fragment labels distribution
pd.Series(np.sum(train_y, axis=1)).describe()

count    27633.000000
mean         0.316976
std          0.621133
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         10.000000
dtype: float64

### Development corpus

Only development texts that are annotated with CIE-O codes are considered:

In [71]:
# Some dev documents (texts) are not annotated 
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner["doc_id"]))

0

In [72]:
dev_doc_list = sorted(set(df_codes_dev_ner["doc_id"]))

In [73]:
len(dev_doc_list)

250

In [74]:
%%time
ss_sub_corpus_path = ss_corpus_path + "development/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 11 ms, sys: 22 µs, total: 11 ms
Wall time: 10.7 ms


In [75]:
%%time
dev_ind, dev_att, dev_y, dev_frag, dev_start_end_frag = ss_create_frag_input_data_xlmr(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 250/250 [00:10<00:00, 24.76it/s]


CPU times: user 10.2 s, sys: 20 ms, total: 10.3 s
Wall time: 10.2 s


In [76]:
# Sanity check

In [77]:
dev_ind.shape

(8303, 128)

In [78]:
dev_att.shape

(8303, 128)

In [79]:
dev_y.shape

(8303, 743)

In [80]:
len(dev_frag)

250

In [81]:
len(dev_start_end_frag)

8303

In [82]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      33.212000
std       16.315386
min       10.000000
25%       22.250000
50%       30.000000
75%       39.000000
max      175.000000
dtype: float64

In [83]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [84]:
check_id

151

In [85]:
dev_doc_list[check_id]

'cc_onco1311'

In [86]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Anamnesis\nVarón de 63 años, fumador, que ingresa para estudio en febrero de 2011 ante la presencia de una masa inguinal izquierda, tos escasamente productiva, astenia y pérdida ponderal de 8 kg de un mes de evolución.\n\nExamen físico\nDestacaba la presencia de adenopatías duras y adheridas en la región laterocervical izquierda, supraclavicular izquierda e inguinal izquierda de hasta 1,5 cm de tamaño. Resto sin hallazgos de interés.\n\nPruebas complementarias\nSe completa estudio con analítica sanguínea, donde destaca Hb 9,9 g/dl y VSG 120 mm, con serologías de VHC, VHB y VIH negativas. En la radiografía de tórax se aprecia engrosamiento hiliar bilateral junto con nódulo pulmonar en el lóbulo superior izquierdo. La tomografía computarizada (TC) confirma la presencia de adenopatías patológicas en mediastino y retroperitoneo, así como de una masa espiculada en el lóbulo superior izquierdo.\nSe realiza biopsia de la adenopatía cervical que confirma diagnóstico de linfoma B difuso de cél

In [87]:
check_id_frag = sum(dev_frag[:check_id])
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([dev_y[i]])), "\n")

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[('8070/32', '9680/3')] 

[('9680/3',)] 

[('8046/3', '8070/3')] 

[()] 

[('8000/1', '8000/6')] 

[()] 

[('8070/3',)] 

[()] 

[('8000/6',)] 

[('8000/1', '8070/6')] 

[()] 



In [88]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i])))
    print("\n")

[('▁Ana', (0, 3)), ('m', (3, 4)), ('nesi', (4, 8)), ('s', (8, 9)), ('▁Var', (9, 13)), ('ón', (13, 16)), ('▁de', (16, 19)), ('▁63', (19, 22)), ('▁años', (22, 28)), (',', (28, 29)), ('▁fum', (29, 33)), ('ador', (33, 37)), (',', (37, 38)), ('▁que', (38, 42)), ('▁ing', (42, 46)), ('resa', (46, 50)), ('▁para', (50, 55)), ('▁estudio', (55, 63)), ('▁en', (63, 66)), ('▁febrero', (66, 74)), ('▁de', (74, 77)), ('▁2011', (77, 82)), ('▁ante', (82, 87)), ('▁la', (87, 90)), ('▁presencia', (90, 100)), ('▁de', (100, 103)), ('▁una', (103, 107)), ('▁masa', (107, 112)), ('▁in', (112, 115)), ('guin', (115, 119)), ('al', (119, 121)), ('▁izquierda', (121, 131)), (',', (131, 132)), ('▁to', (132, 135)), ('s', (135, 136)), ('▁esca', (136, 141)), ('s', (141, 142)), ('amente', (142, 148)), ('▁product', (148, 156)), ('iva', (156, 159)), (',', (159, 160)), ('▁a', (160, 162)), ('st', (162, 164)), ('enia', (164, 168)), ('▁y', (168, 170)), ('▁pérdida', (170, 179)), ('▁ponder', (179, 186)), ('al', (186, 188)), ('▁de',

In [89]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Ana m nesi s ▁Var ón ▁de ▁63 ▁años , ▁fum ador , ▁que ▁ing resa ▁para ▁estudio ▁en ▁febrero ▁de ▁2011 ▁ante ▁la ▁presencia ▁de ▁una ▁masa ▁in guin al ▁izquierda , ▁to s ▁esca s amente ▁product iva , ▁a st enia ▁y ▁pérdida ▁ponder al ▁de ▁8 ▁kg ▁de ▁un ▁mes ▁de ▁evolución . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁Exam en ▁físico ▁Destaca ba ▁la ▁presencia ▁de ▁ade no pat ías ▁dura s ▁y ▁ad her idas ▁en ▁la ▁región ▁later oce rvi cal ▁izquierda , ▁supra cla vi cular ▁izquierda ▁e ▁in guin al ▁izquierda ▁de ▁hasta ▁1,5 ▁cm ▁de ▁tamaño . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad

In [90]:
# Fragment labels distribution
pd.Series(np.sum(dev_y, axis=1)).describe()

count    8303.000000
mean        0.294231
std         0.586341
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         6.000000
dtype: float64

### Training & Development corpus

We merge the previously generated datasets:

In [91]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [92]:
train_dev_ind.shape

(35936, 128)

In [93]:
# Attention masks
train_dev_att = np.concatenate((train_att, dev_att))

In [94]:
train_dev_att.shape

(35936, 128)

In [95]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [96]:
train_dev_y.shape

(35936, 743)

## Fine-tuning

Using the corpus of labeled sentences, we fine-tune the model on a multi-label sentence classification task.
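
In a multi-label setting each code is scored independently: a sigmoid turns every logit into a probability, and the loss is the binary cross-entropy averaged over the labels. The pure-Python sketch below works through one sentence with three hypothetical labels. It is meant to illustrate the idea, not to reproduce the exact Keras implementation; the clipping constant `eps` is an assumption.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -3.0, 0.0]  # made-up scores for three codes
y_true = [1, 0, 0]         # only the first code is present
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(y_true, probs)
```

Unlike a softmax, the sigmoid outputs do not have to sum to one, so a sentence can be assigned several codes at once, or none.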

In [98]:
from transformers import TFXLMRobertaForSequenceClassification

model = TFXLMRobertaForSequenceClassification.from_pretrained(model_name, from_pt=True)

All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [99]:
model.summary()

Model: "tfxlm_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  277453056 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 278,045,186
Trainable params: 278,045,186
Non-trainable params: 0
_________________________________________________________________


In [100]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')
attention_mask = Input(shape=(SEQ_LEN,), name='attention_mask', dtype='int64')
inputs = [input_ids, attention_mask]

cls_token = model.layers[0](input_ids=inputs[0], attention_mask=inputs[1])[0][:, 0, :] # take <s> token output representation (equiv. to [CLS]) 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(cls_token) # Multi-label classification
out_act = Activation('sigmoid')(out_logits)

model = Model(inputs=[input_ids, attention_mask], outputs=out_act)

In [101]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
roberta (TFRobertaMainLayer)    TFBaseModelOutputWit 277453056   input_ids[0][0]                  
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 768)]        0           roberta[0][0]                    
______________________________________________________________________________________________

In [102]:
model.inputs

[<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>,
 <tf.Tensor 'attention_mask:0' shape=(None, 128) dtype=int64>]

In [103]:
model.outputs

[<tf.Tensor 'activation_4/Identity:0' shape=(None, 743) dtype=float32>]

In [None]:
%%time
from tensorflow.keras import optimizers, losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = losses.BinaryCrossentropy(from_logits=False)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_ind, 'attention_mask': train_dev_att}, y=train_dev_y,
          batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)

Epoch 1/30
Epoch 2/30
  15/2246 [..............................] - ETA: 11:27 - loss: 0.0042

## Test set predictions

Finally, the predictions made by the model on the test set are saved. To this end, each sentence from the test corpus is first converted into a sequence of subwords (input IDs and attention mask arrays). The sentence-level predictions made by the model are then saved, to be further evaluated at document level (see `results/Cantemist-Coding/Evaluation.ipynb`).
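
The document-level evaluation itself is performed in the evaluation notebook and is not shown here. As an illustration of how sentence-level probabilities could be combined per document, the sketch below takes, for each code, the maximum probability over the sentences of a document. Max-pooling is an assumption made for this example, not necessarily the strategy used in the evaluation; the helper name `aggregate_by_document` and the numbers are made up.

```python
def aggregate_by_document(frag_probs, frag_counts):
    """Max-pool sentence-level probabilities into one probability vector per document."""
    doc_probs = []
    start = 0
    for n in frag_counts:
        rows = frag_probs[start:start + n]
        # For each label (column), keep the highest probability across the sentences
        doc_probs.append([max(column) for column in zip(*rows)])
        start += n
    return doc_probs


# Three sentences with two labels each: the first two belong to document 0
frag_probs = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]
frag_counts = [2, 1]
doc_probs = aggregate_by_document(frag_probs, frag_counts)
```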

In [97]:
%%time
test_path = corpus_path + "test-set/cantemist-ner/"
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f) and f.split('.')[-1] == 'txt']
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 1.07 ms, sys: 8.01 ms, total: 9.08 ms
Wall time: 8.42 ms


In [98]:
df_text_test.shape

(300, 2)

In [99]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco877,"Anamnesis\nMujer de 59 años, alérgica a penici..."
1,cc_onco1075,"Anamnesis\nMujer de 52 años, sin alergias cono..."
2,cc_onco1450,"Anamnesis\nMujer de 51 años de edad, sin antec..."
3,cc_onco1165,Anamnesis\nPaciente varón de 75 años sin hábit...
4,cc_onco1298,"Anamnesis\nMujer de 60 años, exfumadora de 20 ..."


In [100]:
df_text_test.raw_text[0]

'Anamnesis\nMujer de 59 años, alérgica a penicilina y procaína. Fumadora activa (IPA: 43).\nAntecedentes familiares: abuelo materno diagnosticado de carcinoma colon a los 70 años; madre diagnosticada de carcinoma de mama bilateral a los 50 años; padre fallecido de carcinoma gástrico a los 47 años; tres tías maternas diagnosticadas de carcinoma de mama a los 55, 56 y 57 años respectivamente; y tres primas afectas de cáncer de mama.\nAntecedentes personales: bronquitis crónica, poliposis colónica, carcinoma ductal infiltrante clásico mama pT2pN0M0 G2 subtipo tumoral luminal a (RH: +, HER-2: negativo) intervenido en agosto de 2013 mediante tumorectomía mama izquierda (patrón round block) + biopsia selectiva ganglio centinela (negativo) y posterior QT adyuvante con esquema TC (paclitaxel-ciclofosfamida) x 4 ciclos.\nAcude en noviembre de 2013 a visita de seguimiento tras finalizar tratamiento adyuvante. Asintomática.\n\nExploración física\nTemperatura axilar 36,5ºC, tensión arterial 130/83

In [101]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [102]:
len(test_doc_list)

300

In [103]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test-background/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 126 ms, sys: 36.1 ms, total: 162 ms
Wall time: 161 ms


In [104]:
%%time
test_ind, test_att, _, test_frag, _ = ss_create_frag_input_data_xlmr(df_text=df_text_test, 
                                                  text_col=text_col,
                                                  # Since labels are ignored, we pass df_codes_train_ner as df_ann
                                                  df_ann=df_codes_train_ner, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 300/300 [00:02<00:00, 135.27it/s]

CPU times: user 2.35 s, sys: 44.1 ms, total: 2.39 s
Wall time: 2.37 s





In [None]:
%%time
test_preds = model.predict({'input_ids': test_ind, 'attention_mask': test_att})

In [120]:
test_preds.shape

(3955, 727)

In [105]:
results_dir_path = "../results/Cantemist-Coding/"

In [178]:
%%time
np.save(file=results_dir_path + "predictions/xlm_r_galen_seed_" + str(random_seed) + "_test_preds.npy", arr=test_preds)

CPU times: user 2.21 ms, sys: 4.65 ms, total: 6.87 ms
Wall time: 6.02 ms
