# Fine-tuning XLM-R on Cantemist-Coding

This notebook fine-tunes the XLM-R model on both the training and development sets of the Cantemist-Coding corpus, following a multi-label sequence classification approach. The model's predictions on the test set are then saved so that its clinical coding performance can be evaluated further (see `results/Cantemist-Coding/Evaluation.ipynb`).

In [1]:
import tensorflow as tf

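# Libraries used throughout the notebook (imported explicitly rather than
# relying on the star import below)
import os
import numpy as np
import pandas as pd

# Auxiliary components
import sys
sys.path.append("..")
from nlp_utils import *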

# XLM-R tokenizer
from transformers import XLMRobertaTokenizer
import sentencepiece_pb2
model_name = "xlm-roberta-base"
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
spt = sentencepiece_pb2.SentencePieceText()

# Hyper-parameters
text_col = "raw_text"
SEQ_LEN = 128
BATCH_SIZE = 16
EPOCHS = 30
LR = 3e-5

random_seed = 0
tf.random.set_seed(random_seed)

## Load text

First, the text files of the Cantemist training and development corpora are loaded into separate dataframes.

The CIE-O codes are loaded as well.

In [2]:
corpus_path = "../datasets/cantemist_v6/"
sub_task_path = "cantemist-coding/"

### Training corpus

In [3]:
%%time
train_path = corpus_path + "train-set/" + sub_task_path + "txt/"
train_files = [f for f in os.listdir(train_path) if os.path.isfile(train_path + f)]
n_train_files = len(train_files)
train_data = load_text_files(train_files, train_path)
dev1_path = corpus_path + "dev-set1/" + sub_task_path + "txt/"
train_files.extend([f for f in os.listdir(dev1_path) if os.path.isfile(dev1_path + f)])
train_data.extend(load_text_files(train_files[n_train_files:], dev1_path))
df_text_train = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in train_files], 'raw_text': train_data})

CPU times: user 12.7 ms, sys: 4.07 ms, total: 16.8 ms
Wall time: 16.5 ms


In [4]:
df_text_train.shape

(751, 2)

In [5]:
df_text_train.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco453,"Anamnesis\nSe trata de un varón de 55 años, ex..."
1,cc_onco962,Anamnesis\nMujer de 40 años que consulta por l...
2,cc_onco989,"Anamnesis\nPaciente de 43 años, perimenopáusic..."
3,cc_onco187,"Anamnesis\nVarón de 72 años, exfumador y bebed..."
4,cc_onco164,"Anamnesis\nMujer de 51 años, sin alergias medi..."


In [6]:
df_text_train.raw_text[0]

'Anamnesis\nSe trata de un varón de 55 años, ex fumador con un índice tabáquico de 40 paquetes-año, HTA, sin antecedentes familiares de interés, en tratamiento con ácido fólico, omeprazol, hierro oral y risperidona.\nEn junio de 2017, es diagnosticado a raíz de una trombosis iliaca derecha de una masa tumoral pobremente diferenciada que infiltraba tercio distal de apéndice cecal, con obliteración del paquete vascular iliaco, así como infiltración del uréter derecho, condicionando ureterohidronefrosis derecha grado cuatro.\nSe realiza biopsia de la lesión, con hallazgos anatomopatológicos de neoplasia fusocelular con inmunohistoquímica (IHC) sugerente de origen urotelial con diferenciación sarcomatoide (positividad para p63, p40, GATA3, CD99, EMA CAM 5.2, CK 34, beta E12, CK7, negatividad para CK20, S100, CD34, cKIT y TTF1), con SYT no reordenado.\nSe lleva a cabo intervención en julio de 2017 con extirpación de la masa tumoral, resección ileocecal, dejando ileostomía terminal y bypass 

We also load the CIE-O codes table:

In [7]:
df_codes_train = pd.concat([pd.read_table(corpus_path + "train-set/" + sub_task_path + "train-coding.tsv", 
                                 sep='\t', header=0), 
                            pd.read_table(corpus_path + "dev-set1/" + sub_task_path + "dev1-coding.tsv", 
                                 sep='\t', header=0)])

In [8]:
df_codes_train.shape

(4142, 2)

In [9]:
df_codes_train["code"] = df_codes_train["code"].str.lower()

In [10]:
df_codes_train.head()

Unnamed: 0,file,code
0,cc_onco860,8140/3
1,cc_onco860,8000/6
2,cc_onco860,8140/6
3,cc_onco98,8000/6
4,cc_onco98,8070/3


In [11]:
df_codes_train[df_codes_train.duplicated(keep=False)]

Unnamed: 0,file,code
1862,cc_onco768,9080/1
1863,cc_onco768,9080/1


In [12]:
df_codes_train = df_codes_train[~df_codes_train.duplicated(keep='first')]

In [13]:
df_codes_train.shape

(4141, 2)

In [14]:
len(set(df_codes_train["file"]))

750

### Development corpus

In [15]:
%%time
dev_path = corpus_path + "dev-set2/" + sub_task_path + "txt/"
dev_files = [f for f in os.listdir(dev_path) if os.path.isfile(dev_path + f)]
dev_data = load_text_files(dev_files, dev_path)
df_text_dev = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in dev_files], 'raw_text': dev_data})

CPU times: user 7.1 ms, sys: 0 ns, total: 7.1 ms
Wall time: 6.51 ms


In [16]:
df_text_dev.shape

(250, 2)

In [17]:
df_text_dev.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco1183,Anamnesis\nVarón de 48 años que acude en agost...
1,cc_onco751,Anamnesis\nHombre de 71 años de edad con antec...
2,cc_onco1384,"Anamnesis\nVarón de 30 años, sin antecedentes ..."
3,cc_onco1208,Anamnesis\nVarón de 76 años diagnosticado en a...
4,cc_onco734,Anamnesis\nTras canalización del reservorio ce...


In [18]:
df_text_dev.raw_text[0]

'Anamnesis\nVarón de 48 años que acude en agosto de 2012 tras la realización de amputación en hallux derecho a la primera valoración por Oncología Médica.\nAntecedentes: no alérgicos. Patológicos: diabético hace 3 años sin tratamiento. HTA en tratamiento. Enolismo crónico.\nTrastorno afectivo bipolar. Quirúrgicos: apendicectomizado.\nEnfermedad actual: desde hace 2 años presentaba una lesión hiperpigmentada en el hallux del pie derecho, que ocasionalmente sangraba.\n\nExamen físico\nMuñón de hallux derecho en buen estado, presencia de adenopatías inguinales derechas. Resto de exploración sin alteraciones.\n\nPruebas complementarias\n- Biopsia cutánea: melanoma ulcerado.\n- AP de resección de melanoma: melanoma lentiginoso acral en fase de crecimiento vertical, ulcerado. Clark IV. Breslow 6,8 mm, sin infiltración linfovascular.\n- TC de tórax-abdomen-pelvis: sin signos concluyentes de extensión tóraco-abdómino-pélvica de melanoma.\nMicronódulos en el lóbulo superior derecho a valorar en

We also load the CIE-O codes table:

In [19]:
df_codes_dev = pd.read_table(corpus_path + "dev-set2/" + sub_task_path + "dev2-coding.tsv", 
                                 sep='\t', header=0)

In [20]:
df_codes_dev.shape

(1279, 2)

In [21]:
df_codes_dev["code"] = df_codes_dev["code"].str.lower()

In [22]:
df_codes_dev.head()

Unnamed: 0,file,code
0,cc_onco1258,8000/6
1,cc_onco1258,8010/3
2,cc_onco1258,8010/3/h
3,cc_onco1258,8520/3
4,cc_onco1258,8000/3


In [23]:
len(set(df_codes_dev["file"]))

250

We join the training and development Cantemist codes dataframes together:

In [24]:
df_codes_train_dev = pd.concat([df_codes_train, df_codes_dev])

In [25]:
df_codes_train_dev.shape

(5420, 2)

In [26]:
df_codes_train_dev.head()

Unnamed: 0,file,code
0,cc_onco860,8140/3
1,cc_onco860,8000/6
2,cc_onco860,8140/6
3,cc_onco98,8000/6
4,cc_onco98,8070/3


## Creating corpora of annotated sentences

Leveraging the information available for the named-entity-recognition and normalization (NER-N) Cantemist-NORM task, we create both a training and a development corpus of annotated sentences with CIE-O codes.

First, we pre-process the NER-N oncology-code annotations available for both the training and development corpora.
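
`process_brat_labels` comes from the local `nlp_utils` module and is not shown in this notebook. As a rough illustration of what such a step involves, the sketch below parses a BRAT-style annotation string. It assumes that each `T` line carries the entity type, contiguous character offsets and the mention text, and that the CIE-O code is stored in an `AnnotatorNotes` line pointing at that entity. The function name `parse_brat_ann` and the sample string are hypothetical.

```python
def parse_brat_ann(ann_text, doc_id):
    """Parse BRAT text-bound annotations and their code notes (hypothetical helper)."""
    entities = {}  # entity id -> (start, end, text_ref)
    codes = {}     # entity id -> code taken from the AnnotatorNotes line
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        if fields[0].startswith("T"):
            # Assumed layout: "T1<TAB>TYPE start end<TAB>mention text"
            _, start, end = fields[1].split(" ")
            entities[fields[0]] = (int(start), int(end), fields[2])
        elif fields[0].startswith("#"):
            # Assumed layout: "#1<TAB>AnnotatorNotes T1<TAB>code"
            target = fields[1].split(" ")[1]
            codes[target] = fields[2].strip().lower()
    rows = [
        {"doc_id": doc_id, "code": codes[ent_id], "text_ref": text,
         "start": start, "end": end}
        for ent_id, (start, end, text) in entities.items() if ent_id in codes
    ]
    # Same ordering as the notebook: by start, then end offset
    return sorted(rows, key=lambda r: (r["start"], r["end"]))

# Hypothetical annotation content for one document
sample_ann = (
    "T1\tMORFOLOGIA_NEOPLASIA 2719 2740\tCarcinoma microcítico\n"
    "#1\tAnnotatorNotes T1\t8041/3\n"
    "T2\tMORFOLOGIA_NEOPLASIA 2988 2990\tM0\n"
    "#2\tAnnotatorNotes T2\t8000/6\n"
)
rows = parse_brat_ann(sample_ann, "cc_onco1")
```

Each resulting row has the same five fields as the dataframes shown below (`doc_id`, `code`, `text_ref`, `start`, `end`).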

In [27]:
# Training corpus

In [28]:
train_norm_path = corpus_path + "train-set/cantemist-norm/"
train_ann_files = [train_norm_path + f for f in os.listdir(train_norm_path) if f.split('.')[-1] == "ann"]
dev1_norm_path = corpus_path + "dev-set1/cantemist-norm/"
train_ann_files.extend([dev1_norm_path + f for f in os.listdir(dev1_norm_path) if f.split('.')[-1] == "ann"])

In [29]:
len(train_ann_files)

751

In [30]:
df_codes_train_ner = process_brat_labels(train_ann_files).sort_values(["doc_id", "start", "end"])

In [31]:
df_codes_train_ner.shape

(9737, 5)

In [32]:
df_codes_train_ner["code"] = df_codes_train_ner["code"].str.lower()

In [33]:
df_codes_train_ner.head()

Unnamed: 0,doc_id,code,text_ref,start,end
5230,cc_onco1,8041/3,Carcinoma microcítico,2719,2740
5231,cc_onco1,8041/3,carcinoma microcítico,2950,2971
5232,cc_onco1,8000/6,M0,2988,2990
97,cc_onco10,8000/1,tumor,212,217
95,cc_onco10,8000/1,neoplasia,976,985


In [34]:
len(set(df_codes_train_ner["doc_id"]))

750

In [35]:
# Development corpus

In [36]:
dev_norm_path = corpus_path + "dev-set2/cantemist-norm/"
dev_ann_files = [dev_norm_path + f for f in os.listdir(dev_norm_path) if f.split('.')[-1] == "ann"]

In [37]:
len(dev_ann_files)

250

In [38]:
df_codes_dev_ner = process_brat_labels(dev_ann_files).sort_values(["doc_id", "start", "end"])

In [39]:
df_codes_dev_ner.shape

(2660, 5)

In [40]:
df_codes_dev_ner["code"] = df_codes_dev_ner["code"].str.lower()

In [41]:
df_codes_dev_ner.head()

Unnamed: 0,doc_id,code,text_ref,start,end
1852,cc_onco1001,8070/3,carcinoma epidermoide,576,597
1854,cc_onco1001,8000/1,neoplasia,790,799
1857,cc_onco1001,8140/6,adenocarcinoma T4N3M1b,836,858
1853,cc_onco1001,8000/6,enfermedad hepática,1205,1224
1855,cc_onco1001,8000/1,tumoral,2303,2310


In [42]:
len(set(df_codes_dev_ner["doc_id"]))

250

Now, using the character start-end positions of each sentence in the Cantemist corpus (see `datasets/Cantemist-Sentence-Split.ipynb`), we annotate the sentences with CIE-O codes. The XLM-R tokenizer then converts each sentence into a sequence of subwords, which are mapped to vocabulary indices (input IDs) and attention-mask arrays (the XLM-R input tensors). We also generate a *fragments* dataset that records the number of annotated sentences produced for each document.
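
`ss_create_frag_input_data_xlmr` is defined in `nlp_utils` and its internals are not reproduced here. The sketch below only illustrates the underlying idea, assuming that an annotation belongs to the sentence whose character span fully contains it. The helper name `assign_codes_to_sentences` and the toy spans are hypothetical.

```python
def assign_codes_to_sentences(sent_spans, annotations):
    """Return one set of codes per sentence (hypothetical helper).

    sent_spans: list of (start, end) character offsets, one per sentence.
    annotations: list of (code, start, end) tuples for the same document.
    """
    labels = [set() for _ in sent_spans]
    for code, a_start, a_end in annotations:
        for i, (s_start, s_end) in enumerate(sent_spans):
            # Assumption: an annotation counts only if it lies fully inside the sentence
            if s_start <= a_start and a_end <= s_end:
                labels[i].add(code)
                break
    return labels

sent_spans = [(0, 40), (41, 90), (91, 130)]
annotations = [("8041/3", 50, 71), ("8000/6", 95, 97), ("8041/3", 100, 121)]
sent_labels = assign_codes_to_sentences(sent_spans, annotations)
n_fragments = len(sent_labels)  # the per-document count a fragments dataset would record
```

Sentences without any annotation keep an empty label set, which is consistent with the many `[()]` rows printed in the inspection cells further down.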

In [43]:
# Sentence-Split information
ss_corpus_path = "../datasets/Cantemist-SSplit-text/"

### Training corpus

In [44]:
label_list = list(df_codes_train_dev["code"])

In [45]:
len(label_list)

5420

In [46]:
len(set(label_list))

743

In [47]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb_encoder = MultiLabelBinarizer()
mlb_encoder.fit([label_list])

MultiLabelBinarizer()

In [48]:
# Number of distinct codes
num_labels = len(mlb_encoder.classes_)

In [49]:
num_labels

743
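
To make the label encoding concrete, the following NumPy sketch mimics what a fitted binarizer produces: one column per distinct code, in sorted order, and a 1 wherever a sentence carries that code. It is a simplified stand-in for `MultiLabelBinarizer`, not its implementation, and the toy codes are chosen only for illustration.

```python
import numpy as np

def fit_classes(label_list):
    """Sorted list of distinct codes, one output column per code."""
    return sorted(set(label_list))

def to_multi_hot(code_sets, classes):
    """Encode each set of codes as a 0/1 row vector over `classes`."""
    col = {code: j for j, code in enumerate(classes)}
    y = np.zeros((len(code_sets), len(classes)), dtype=int)
    for i, codes in enumerate(code_sets):
        for code in codes:
            y[i, col[code]] = 1
    return y

# Duplicated codes in the flat list collapse into a single class
classes = fit_classes(["8140/3", "8000/6", "8140/6", "8000/6"])
# First sentence has no code, second sentence has two
y = to_multi_hot([set(), {"8000/6", "8140/3"}], classes)
```

Fitting on the flat list of all codes, as the notebook does with `mlb_encoder.fit([label_list])`, fixes the number of columns to the number of distinct codes.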

Only training texts that are annotated with CIE-O codes are considered:

In [50]:
# Number of training documents (texts) without annotations
len(set(df_text_train["doc_id"]) - set(df_codes_train_ner["doc_id"]))

1

In [51]:
train_doc_list = sorted(set(df_codes_train_ner["doc_id"]))

In [52]:
len(train_doc_list)

750

In [53]:
# Sentence-Split data

In [54]:
%%time
ss_sub_corpus_path = ss_corpus_path + "training/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_train = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 34 ms, sys: 0 ns, total: 34 ms
Wall time: 33.8 ms


In [55]:
%%time
train_ind, train_att, train_y, train_frag, train_start_end_frag = ss_create_frag_input_data_xlmr(df_text=df_text_train, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_train_ner, doc_list=train_doc_list, ss_dict=ss_dict_train,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 750/750 [00:40<00:00, 18.39it/s]


CPU times: user 41.1 s, sys: 133 ms, total: 41.3 s
Wall time: 41.2 s


In [56]:
# Sanity check

In [57]:
train_ind.shape

(27633, 128)

In [58]:
train_att.shape

(27633, 128)

In [59]:
train_y.shape

(27633, 743)

In [60]:
len(train_frag)

750

In [61]:
len(train_start_end_frag)

27633
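
The shapes above reflect that every sentence is stored as a fixed-length row of `SEQ_LEN` values. A minimal sketch of that padding step follows, using the XLM-R special-token IDs (`<s>` = 0, `<pad>` = 1, `</s>` = 2). It assumes that overlong sentences are simply truncated, whereas the notebook's helper may handle them differently; the function name `pad_sentence` and the subword IDs are arbitrary illustrations.

```python
import numpy as np

BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # XLM-R ids for <s>, <pad>, </s>

def pad_sentence(subword_ids, seq_len):
    """Wrap subword ids in <s> ... </s> and pad the result to seq_len."""
    body = list(subword_ids)[: seq_len - 2]  # assumption: truncate long sentences
    ids = [BOS_ID] + body + [EOS_ID]
    n_real = len(ids)
    input_ids = np.array(ids + [PAD_ID] * (seq_len - n_real), dtype=np.int64)
    # 1 marks real tokens, 0 marks padding that attention should ignore
    attention_mask = np.array([1] * n_real + [0] * (seq_len - n_real), dtype=np.int64)
    return input_ids, attention_mask

ids, mask = pad_sentence([3293, 39, 1087], seq_len=8)
```

The decoded fragments printed further down show the same layout: `<s>`, the subwords, `</s>`, then a run of `<pad>` tokens.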

In [62]:
# Check n_frag distribution across texts
pd.Series(train_frag).describe()

count    750.000000
mean      36.844000
std       14.591672
min       11.000000
25%       27.000000
50%       34.000000
75%       44.000000
max      102.000000
dtype: float64

In [63]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(train_doc_list), size=1)[0]

In [64]:
check_id

358

In [65]:
train_doc_list[check_id]

'cc_onco509'

In [66]:
df_text_train[df_text_train["doc_id"] == train_doc_list[check_id]][text_col].values[0]

'Anamnesis\nMujer de 49 años remitida en octubre de 2018 desde Atención Primaria a consultas externas de Cirugía General, por aparición de nódulo palpable en mama derecha, de un mes de evolución.\nEn ese momento, la paciente no presentaba antecedentes medicoquirúrgicos de interés, ni refería hábitos tóxicos. Trabajaba como ama de casa y tenía buen apoyo familiar. Como antecedentes familiares, una tía paterna fue diagnosticada de carcinoma de mama a los 65 años.\nComo tratamiento crónico, se había pautado recientemente acetato de medroxiprogesterona por hiperplasia endometrial simple.\n\nExploración física\nLa paciente presentaba buen estado general, con un performance status (PS) 0. En la exploración física, se palpaba un nódulo sólido, de aproximadamente 4 cm en cuadrante superoexterno de mama derecha. La exploración axilar resultó dentro de la normalidad, así como el resto de la exploración física.\n\nPruebas complementarias\nSe solicitó mamografía/ecografía, que mostraba un nódulo o

In [67]:
check_id_frag = sum(train_frag[:check_id])
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([train_y[i]])), "\n")

[()] 

[()] 

[()] 

[('8010/3',)] 

[()] 

[()] 

[()] 

[()] 

[()] 

[('8004/3', '8140/0', '8890/3')] 

[('8000/3', '8890/3')] 

[('8000/1',)] 

[('8000/6', '8890/3')] 

[('8890/34',)] 

[()] 

[()] 

[()] 

[()] 

[()] 



In [68]:
for i in range(check_id_frag, check_id_frag + train_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in train_ind[i]][1:len(train_start_end_frag[i])+1], 
               train_start_end_frag[i])))
    print("\n")

[('▁Ana', (0, 3)), ('m', (3, 4)), ('nesi', (4, 8)), ('s', (8, 9)), ('▁Mu', (9, 12)), ('jer', (12, 15)), ('▁de', (15, 18)), ('▁49', (18, 21)), ('▁años', (21, 27)), ('▁remit', (27, 33)), ('ida', (33, 36)), ('▁en', (36, 39)), ('▁octubre', (39, 47)), ('▁de', (47, 50)), ('▁2018', (50, 55)), ('▁desde', (55, 61)), ('▁Atención', (61, 71)), ('▁Primaria', (71, 80)), ('▁a', (80, 82)), ('▁consulta', (82, 91)), ('s', (91, 92)), ('▁externa', (92, 100)), ('s', (100, 101)), ('▁de', (101, 104)), ('▁Ci', (104, 107)), ('rug', (107, 110)), ('ía', (110, 113)), ('▁General', (113, 121)), (',', (121, 122)), ('▁por', (122, 126)), ('▁aparición', (126, 137)), ('▁de', (137, 140)), ('▁nó', (140, 144)), ('du', (144, 146)), ('lo', (146, 148)), ('▁palp', (148, 153)), ('able', (153, 157)), ('▁en', (157, 160)), ('▁mama', (160, 165)), ('▁derecha', (165, 173)), (',', (173, 174)), ('▁de', (174, 177)), ('▁un', (177, 180)), ('▁mes', (180, 184)), ('▁de', (184, 187)), ('▁evolución', (187, 198)), ('.', (198, 199))]


[('▁En', 

In [69]:
check_id_frag = sum(train_frag[:check_id])
for frag in train_ind[check_id_frag:check_id_frag + train_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Ana m nesi s ▁Mu jer ▁de ▁49 ▁años ▁remit ida ▁en ▁octubre ▁de ▁2018 ▁desde ▁Atención ▁Primaria ▁a ▁consulta s ▁externa s ▁de ▁Ci rug ía ▁General , ▁por ▁aparición ▁de ▁nó du lo ▁palp able ▁en ▁mama ▁derecha , ▁de ▁un ▁mes ▁de ▁evolución . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁En ▁ese ▁momento , ▁la ▁paciente ▁no ▁presenta ba ▁antecede ntes ▁medico qui rú rg icos ▁de ▁interés , ▁ni ▁refer ía ▁hábitos ▁tó xico s . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>

In [70]:
# Fragment labels distribution
pd.Series(np.sum(train_y, axis=1)).describe()

count    27633.000000
mean         0.316976
std          0.621133
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         10.000000
dtype: float64

### Development corpus

Only development texts that are annotated with CIE-O codes are considered:

In [71]:
# Number of dev documents (texts) without annotations
len(set(df_text_dev["doc_id"]) - set(df_codes_dev_ner["doc_id"]))

0

In [72]:
dev_doc_list = sorted(set(df_codes_dev_ner["doc_id"]))

In [73]:
len(dev_doc_list)

250

In [74]:
%%time
ss_sub_corpus_path = ss_corpus_path + "development/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_dev = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 11.4 ms, sys: 0 ns, total: 11.4 ms
Wall time: 11.2 ms


In [75]:
%%time
dev_ind, dev_att, dev_y, dev_frag, dev_start_end_frag = ss_create_frag_input_data_xlmr(df_text=df_text_dev, 
                                                  text_col=text_col, 
                                                  df_ann=df_codes_dev_ner, doc_list=dev_doc_list, ss_dict=ss_dict_dev,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 250/250 [00:10<00:00, 23.81it/s]


CPU times: user 10.6 s, sys: 23.9 ms, total: 10.7 s
Wall time: 10.6 s


In [76]:
# Sanity check

In [77]:
dev_ind.shape

(8303, 128)

In [78]:
dev_att.shape

(8303, 128)

In [79]:
dev_y.shape

(8303, 743)

In [80]:
len(dev_frag)

250

In [81]:
len(dev_start_end_frag)

8303

In [82]:
# Check n_frag distribution across texts
pd.Series(dev_frag).describe()

count    250.000000
mean      33.212000
std       16.315386
min       10.000000
25%       22.250000
50%       30.000000
75%       39.000000
max      175.000000
dtype: float64

In [83]:
# Inspect a randomly selected text and its encoded version
check_id = np.random.randint(low=0, high=len(dev_doc_list), size=1)[0]

In [84]:
check_id

224

In [85]:
dev_doc_list[check_id]

'cc_onco1475'

In [86]:
df_text_dev[df_text_dev["doc_id"] == dev_doc_list[check_id]][text_col].values[0]

'Anamnesis\nMujer de 62 años de edad, sin alergias medicamentosas conocidas, con antecedentes personales de dislipemia mixta, síndrome ansioso-depresivo, hipertransaminasemia atribuída a fármacos. Intervenida de meniscectomía, extirpación de quiste epidérmico mandibular, apendicectomía hace 40 años. Fumadora de 40 cigarrillos/día.\nEn enero de 2007 es diagnosticada de hipotiroidismo subclínico después de consultar por aumento progresivo de peso, iniciándose tratamiento hormonal sustitutivo.\nSe indica PAAF de nódulo de 3 cm en el lóbulo tiroideo izquierdo, que la paciente no se realiza, y no acude a consultas hasta julio de 2009.\n\nExploración física\nPeso 120 kg; talla 162 cm; IMC 42,67 kg/m2.\nPS 0, ECOG 0. Consciente, orientada. Buena coloración de piel y mucosas, bien perfundida, tiroides palpable con nódulo de aproximadamente 3 cm en el lóbulo izquierdo. No presenta adenopatías cervicales palpables. Resto sin alteraciones.\n\nPruebas complementarias\n• Analítica: con hemograma y 

In [87]:
check_id_frag = sum(dev_frag[:check_id])
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(mlb_encoder.inverse_transform(np.array([dev_y[i]])), "\n")

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[('8290/0',)] 

[('9671/3', '9699/3')] 

[()] 

[('9699/3',)] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 

[()] 



In [88]:
for i in range(check_id_frag, check_id_frag + dev_frag[check_id]):
    print(list(zip([tokenizer._convert_id_to_token(int(ind)) for ind in dev_ind[i]][1:len(dev_start_end_frag[i])+1], 
               dev_start_end_frag[i])))
    print("\n")

[('▁Ana', (0, 3)), ('m', (3, 4)), ('nesi', (4, 8)), ('s', (8, 9)), ('▁Mu', (9, 12)), ('jer', (12, 15)), ('▁de', (15, 18)), ('▁62', (18, 21)), ('▁años', (21, 27)), ('▁de', (27, 30)), ('▁edad', (30, 35)), (',', (35, 36)), ('▁sin', (36, 40)), ('▁alergi', (40, 47)), ('as', (47, 49)), ('▁medicamentos', (49, 62)), ('as', (62, 64)), ('▁conocida', (64, 73)), ('s', (73, 74)), (',', (74, 75)), ('▁con', (75, 79)), ('▁antecede', (79, 88)), ('ntes', (88, 92)), ('▁personales', (92, 103)), ('▁de', (103, 106)), ('▁dis', (106, 110)), ('lip', (110, 113)), ('emia', (113, 117)), ('▁mix', (117, 121)), ('ta', (121, 123)), (',', (123, 124)), ('▁síndrome', (124, 134)), ('▁an', (134, 137)), ('si', (137, 139)), ('oso', (139, 142)), ('-', (142, 143)), ('de', (143, 145)), ('pres', (145, 149)), ('ivo', (149, 152)), (',', (152, 153)), ('▁hiper', (153, 159)), ('trans', (159, 164)), ('ami', (164, 167)), ('nas', (167, 170)), ('emia', (170, 174)), ('▁atribu', (174, 181)), ('ída', (181, 185)), ('▁a', (185, 187)), ('▁fá'

In [89]:
check_id_frag = sum(dev_frag[:check_id])
for frag in dev_ind[check_id_frag:check_id_frag + dev_frag[check_id]]:
    print(' '.join([tokenizer._convert_id_to_token(int(ind)) for ind in frag]), "\n")

<s> ▁Ana m nesi s ▁Mu jer ▁de ▁62 ▁años ▁de ▁edad , ▁sin ▁alergi as ▁medicamentos as ▁conocida s , ▁con ▁antecede ntes ▁personales ▁de ▁dis lip emia ▁mix ta , ▁síndrome ▁an si oso - de pres ivo , ▁hiper trans ami nas emia ▁atribu ída ▁a ▁fá rma cos . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

<s> ▁Interven ida ▁de ▁men isce c tom ía , ▁ex tir pa ción ▁de ▁qui ste ▁e pid ér mico ▁mandi bu lar , ▁a pendi ce c tom ía ▁hace ▁40 ▁años . </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <

In [90]:
# Fragment labels distribution
pd.Series(np.sum(dev_y, axis=1)).describe()

count    8303.000000
mean        0.294231
std         0.586341
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         6.000000
dtype: float64

### Training & Development corpus

We merge the previously generated datasets:

In [91]:
# Indices
train_dev_ind = np.concatenate((train_ind, dev_ind))

In [92]:
train_dev_ind.shape

(35936, 128)

In [93]:
# Attention masks
train_dev_att = np.concatenate((train_att, dev_att))

In [94]:
train_dev_att.shape

(35936, 128)

In [95]:
# y
train_dev_y = np.concatenate((train_y, dev_y))

In [96]:
train_dev_y.shape

(35936, 743)

## Fine-tuning

Using the corpus of labeled sentences, we fine-tune the model on a multi-label sentence classification task.
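
In a multi-label setting each code is treated as an independent binary decision, so the loss averages a binary cross-entropy term over every sentence and every label. The NumPy sketch below illustrates that computation on predicted probabilities, matching the `from_logits=False` setting used in the training cell. The clipping constant `eps` is an assumption made here for numerical safety.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all sentences and labels."""
    p = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
y_prob = np.full_like(y_true, 0.5)  # an uninformative prediction for every label
loss = binary_cross_entropy(y_true, y_prob)  # equals ln(2) when every probability is 0.5
```

Because most sentences carry no code at all (see the fragment label distributions above), a model that predicts near-zero probabilities everywhere already reaches a small loss, so the loss value alone says little about coding quality.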

In [98]:
from transformers import TFXLMRobertaForSequenceClassification

model = TFXLMRobertaForSequenceClassification.from_pretrained(model_name, from_pt=True)

All PyTorch model weights were used when initializing TFXLMRobertaForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFXLMRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [99]:
model.summary()

Model: "tfxlm_roberta_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
roberta (TFRobertaMainLayer) multiple                  277453056 
_________________________________________________________________
classifier (TFRobertaClassif multiple                  592130    
Total params: 278,045,186
Trainable params: 278,045,186
Non-trainable params: 0
_________________________________________________________________


In [100]:
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.initializers import GlorotUniform

input_ids = Input(shape=(SEQ_LEN,), name='input_ids', dtype='int64')
attention_mask = Input(shape=(SEQ_LEN,), name='attention_mask', dtype='int64')
inputs = [input_ids, attention_mask]

cls_token = model.layers[0](input_ids=inputs[0], attention_mask=inputs[1])[0][:, 0, :] # take <s> token output representation (equiv. to [CLS]) 
out_logits = Dense(units=num_labels, kernel_initializer=GlorotUniform(seed=random_seed))(cls_token) # Multi-label classification
out_act = Activation('sigmoid')(out_logits)

model = Model(inputs=[input_ids, attention_mask], outputs=out_act)
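
The classification head built above can be traced in plain NumPy to check the tensor shapes: the encoder output is sliced at position 0 (the `<s>` token), projected to one logit per code, and passed through an element-wise sigmoid. The random arrays below are stand-ins for the encoder output and the `Dense` weights, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, n_labels = 2, 128, 768, 743

hidden_states = rng.normal(size=(batch, seq_len, hidden))  # stand-in for the encoder output
cls_repr = hidden_states[:, 0, :]                          # representation at the <s> position
W = rng.normal(scale=0.02, size=(hidden, n_labels))        # stand-in for the Dense kernel
b = np.zeros(n_labels)                                     # stand-in for the Dense bias
logits = cls_repr @ W + b
probs = 1.0 / (1.0 + np.exp(-logits))                      # independent probability per code
```

Unlike a softmax, the sigmoid outputs of one sentence do not need to sum to 1, which is what allows a sentence to receive several codes or none.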

In [101]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
roberta (TFRobertaMainLayer)    TFBaseModelOutputWit 277453056   input_ids[0][0]                  
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 768)]        0           roberta[0][0]                    
______________________________________________________________________________________________

In [102]:
model.inputs

[<tf.Tensor 'input_ids:0' shape=(None, 128) dtype=int64>,
 <tf.Tensor 'attention_mask:0' shape=(None, 128) dtype=int64>]

In [103]:
model.outputs

[<tf.Tensor 'activation_4/Identity:0' shape=(None, 743) dtype=float32>]

In [None]:
%%time
from tensorflow.keras import losses
import tensorflow_addons as tfa

optimizer = tfa.optimizers.RectifiedAdam(learning_rate=LR)
loss = losses.BinaryCrossentropy(from_logits=False)
model.compile(optimizer=optimizer, loss=loss)

history = model.fit(x={'input_ids': train_dev_ind, 'attention_mask': train_dev_att}, y=train_dev_y,
          batch_size=BATCH_SIZE, epochs=EPOCHS, shuffle=True)

Epoch 1/30
Epoch 2/30
  15/2246 [..............................] - ETA: 11:27 - loss: 0.0042

## Test set predictions

Finally, the model's predictions on the test set are saved. To this end, each sentence of the test corpus is first converted into a sequence of subwords (input IDs and attention-mask arrays). The sentence-level predictions are then saved so that they can be evaluated at document level (see `results/Cantemist-Coding/Evaluation.ipynb`).
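
The evaluation notebook is not reproduced here, so the following is only one plausible way to go from sentence-level to document-level predictions: take the maximum probability of each code over the sentences of a document, then apply a threshold. Both the max-pooling rule and the 0.5 threshold are assumptions, and `aggregate_by_document` is a hypothetical helper.

```python
import numpy as np

def aggregate_by_document(sent_probs, frag_counts):
    """Max-pool sentence-level probabilities into one row per document."""
    # offsets[i] is the index of the first sentence row of document i
    offsets = np.cumsum([0] + list(frag_counts))
    return np.stack([
        sent_probs[offsets[i]:offsets[i + 1]].max(axis=0)
        for i in range(len(frag_counts))
    ])

sent_probs = np.array([[0.1, 0.9], [0.2, 0.3], [0.7, 0.1]])
frag_counts = [2, 1]  # document 0 has two sentences, document 1 has one
doc_probs = aggregate_by_document(sent_probs, frag_counts)
doc_pred = (doc_probs >= 0.5).astype(int)  # assumption: fixed 0.5 threshold
```

The per-document sentence counts play the role of the fragments array saved at the end of this notebook, which is what makes this regrouping possible.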

In [99]:
%%time
test_path = corpus_path + "test-set/cantemist-ner/"
test_files = [f for f in os.listdir(test_path) if os.path.isfile(test_path + f) and f.split('.')[-1] == 'txt']
test_data = load_text_files(test_files, test_path)
df_text_test = pd.DataFrame({'doc_id': [s.split('.txt')[0] for s in test_files], 'raw_text': test_data})

CPU times: user 10.1 ms, sys: 8 µs, total: 10.2 ms
Wall time: 9.26 ms


In [100]:
df_text_test.shape

(300, 2)

In [101]:
df_text_test.head()

Unnamed: 0,doc_id,raw_text
0,cc_onco877,"Anamnesis\nMujer de 59 años, alérgica a penici..."
1,cc_onco1075,"Anamnesis\nMujer de 52 años, sin alergias cono..."
2,cc_onco1450,"Anamnesis\nMujer de 51 años de edad, sin antec..."
3,cc_onco1165,Anamnesis\nPaciente varón de 75 años sin hábit...
4,cc_onco1298,"Anamnesis\nMujer de 60 años, exfumadora de 20 ..."


In [102]:
df_text_test.raw_text[0]

'Anamnesis\nMujer de 59 años, alérgica a penicilina y procaína. Fumadora activa (IPA: 43).\nAntecedentes familiares: abuelo materno diagnosticado de carcinoma colon a los 70 años; madre diagnosticada de carcinoma de mama bilateral a los 50 años; padre fallecido de carcinoma gástrico a los 47 años; tres tías maternas diagnosticadas de carcinoma de mama a los 55, 56 y 57 años respectivamente; y tres primas afectas de cáncer de mama.\nAntecedentes personales: bronquitis crónica, poliposis colónica, carcinoma ductal infiltrante clásico mama pT2pN0M0 G2 subtipo tumoral luminal a (RH: +, HER-2: negativo) intervenido en agosto de 2013 mediante tumorectomía mama izquierda (patrón round block) + biopsia selectiva ganglio centinela (negativo) y posterior QT adyuvante con esquema TC (paclitaxel-ciclofosfamida) x 4 ciclos.\nAcude en noviembre de 2013 a visita de seguimiento tras finalizar tratamiento adyuvante. Asintomática.\n\nExploración física\nTemperatura axilar 36,5ºC, tensión arterial 130/83

In [103]:
test_doc_list = sorted(set(df_text_test["doc_id"]))

In [104]:
len(test_doc_list)

300

In [105]:
%%time
ss_sub_corpus_path = ss_corpus_path + "test-background/"
ss_files = [f for f in os.listdir(ss_sub_corpus_path) if os.path.isfile(ss_sub_corpus_path + f)]
ss_dict_test = load_ss_files(ss_files, ss_sub_corpus_path)

CPU times: user 150 ms, sys: 16 ms, total: 166 ms
Wall time: 166 ms


In [106]:
%%time
test_ind, test_att, _, test_frag, _ = ss_create_frag_input_data_xlmr(df_text=df_text_test, 
                                                  text_col=text_col,
                                                  # Since labels are ignored, we pass df_codes_train_ner as df_ann
                                                  df_ann=df_codes_train_ner, doc_list=test_doc_list, ss_dict=ss_dict_test,
                                                  tokenizer=tokenizer, sp_pb2=spt, lab_encoder=mlb_encoder, seq_len=SEQ_LEN)

100%|██████████| 300/300 [00:02<00:00, 133.27it/s]


CPU times: user 2.41 s, sys: 12.1 ms, total: 2.42 s
Wall time: 2.4 s


In [None]:
%%time
test_preds = model.predict({'input_ids': test_ind, 'attention_mask': test_att})

In [120]:
test_preds.shape

(3955, 727)

In [97]:
results_dir_path = "../results/Cantemist-Coding/"

In [178]:
%%time
np.save(file=results_dir_path + "predictions/xlm_r_seed_" + str(random_seed) + "_test_preds.npy", arr=test_preds)

CPU times: user 2.21 ms, sys: 4.65 ms, total: 6.87 ms
Wall time: 6.02 ms


In [107]:
# To be further used when evaluating model performance at document level
np.save(file=results_dir_path + "xlm_r_test_frags.npy", arr=test_frag)
np.save(file=results_dir_path + "classes.npy", arr=mlb_encoder.classes_)
np.save(file=results_dir_path + "test_docs.npy", arr=test_doc_list)
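
When these files are read back, the class labels may need special handling: assuming `mlb_encoder.classes_` is an object-dtype array (as scikit-learn typically produces for string labels), `np.save` pickles it and `np.load` then requires `allow_pickle=True`. The sketch below demonstrates the round trip in a temporary directory with stand-in arrays.

```python
import os
import tempfile
import numpy as np

classes = np.array(["8000/6", "8140/3"], dtype=object)  # stand-in for mlb_encoder.classes_
frags = [2, 1]                                           # stand-in for test_frag

with tempfile.TemporaryDirectory() as tmp_dir:
    np.save(os.path.join(tmp_dir, "classes.npy"), classes)
    np.save(os.path.join(tmp_dir, "xlm_r_test_frags.npy"), frags)
    # Object arrays are pickled by np.save, so loading needs allow_pickle=True
    loaded_classes = np.load(os.path.join(tmp_dir, "classes.npy"), allow_pickle=True)
    # Plain integer arrays load without any extra argument
    loaded_frags = np.load(os.path.join(tmp_dir, "xlm_r_test_frags.npy"))
```

Only enable `allow_pickle=True` for files you created yourself, since unpickling untrusted data can execute arbitrary code.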