# Abstracts BERT One-fragment

In this notebook, we first analyze the tokenization produced by different BERT tokenizers on the corpus of additional abstracts annotated with CIE-D codes. We then select and save the identifiers of the texts that fit in a single fragment, considering a different maximum sequence length for each tokenizer.

In [1]:
import pandas as pd
import numpy as np

# Auxiliary components
from nlp_utils import *

Using TensorFlow backend.


In [2]:
corpus_path = "../datasets/abstractsWithCIE10_v2/"

## Load text

First, we load the text of all additional abstracts that have associated CIE-D codes (see the `BERT-Keras-All-Abstracts` notebook). In addition to the raw text, we also store a pre-processed version in which every punctuation character is replaced by a space.
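As a small illustration of this pre-processing step, the sketch below applies the same `str.translate` idea to a short sentence. The helper name `replace_punctuation` and the sample sentence are illustrative assumptions and are not part of the notebook.

```python
import string

# Translation table mapping every ASCII punctuation character to a single space.
PUNC_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def replace_punctuation(text):
    """Return `text` with each ASCII punctuation character replaced by a space."""
    return text.translate(PUNC_TO_SPACE)


# Hypothetical sample sentence, used only for illustration.
sample = "Introducción: A pesar del difícil acceso, la resección sigue."
cleaned = replace_punctuation(sample)
print(cleaned)
```

Because each character is swapped one-for-one, the cleaned text keeps the same length as the original. Note that `string.punctuation` only covers ASCII, so characters such as `¿`, `¡` or `«` would be left untouched by this approach.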

In [3]:
%%time
# Pre-processed version (BETO len >= 30) used
df_text_all_abs = pd.read_table(corpus_path + "all_abstracts_valid_codes_D_text_raw_sw_v2_30.tsv", sep='\t')

CPU times: user 1.97 s, sys: 152 ms, total: 2.12 s
Wall time: 2.12 s


In [4]:
df_text_all_abs.shape

(149424, 3)

In [5]:
import string

In [6]:
%%time
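# Build the translation table once: every ASCII punctuation character maps to a space
punc_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
df_text_all_abs["punc_text"] = df_text_all_abs["raw_text"].apply(lambda x: x.translate(punc_table))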

CPU times: user 8.66 s, sys: 35.2 ms, total: 8.7 s
Wall time: 8.7 s


In [7]:
df_text_all_abs.shape

(149424, 4)

In [8]:
df_text_all_abs.head()

Unnamed: 0,doc_id,raw_text,sw_text,punc_text
0,biblio-1000005,Introducción: A pesar del difícil acceso anató...,Introducción : pesar difícil acceso anatómico ...,Introducción A pesar del difícil acceso anató...
1,biblio-1000026,Introducción: La enterocolitis neutropénica se...,Introducción : enterocolitis neutropénica defi...,Introducción La enterocolitis neutropénica se...
2,biblio-1000027,Introducción: La presencia de anticuerpos anti...,Introducción : presencia anticuerpos anti erit...,Introducción La presencia de anticuerpos anti...
3,biblio-1000028,Introducción: El Carcinoma de lengua móvil es ...,Introducción : Carcinoma lengua móvil tumores ...,Introducción El Carcinoma de lengua móvil es ...
4,biblio-1000029,Introducción: El cáncer de ovario epitelial au...,Introducción : cáncer ovario epitelial aunque ...,Introducción El cáncer de ovario epitelial au...


In [9]:
df_text_all_abs.raw_text[0]

'Introducción: A pesar del difícil acceso anatómico para los tumores de mediastino, la resección quirúrgica sigue siendo el mejor enfoque diagnóstico y terapéutico. El objetivo de la presente serie de casos presentamos la experiencia de un centro oncológico en el abordaje de tumores del mediastino y sus resultados.  Métodos: En el departamento de Jefatura de Cirugía Oncológica del Instituto Oncológico nacional de Solca-Guayaquil, durante los meses de Enero del 2013 a Enero 2017 se realizó un estudio descriptivo, retrospectivo. Se analizaron todos los casos de pacientes derivados del área de pre admisión con diagnóstico inicial de tumor de mediastino, a los cuales previo a realizarles marcadores tumorales, Tomografía de Tórax, y a quienes se les realizó como método diagnóstico y en algunos casos terapéutico con abordaje quirúrgico. Se excluyeron pacientes con neoplasias de origen secundario, con historias clínicas incompletas que imposibilitaron el análisis. Se estudiaron las variables 

In [10]:
df_text_all_abs.punc_text[0]

'Introducción  A pesar del difícil acceso anatómico para los tumores de mediastino  la resección quirúrgica sigue siendo el mejor enfoque diagnóstico y terapéutico  El objetivo de la presente serie de casos presentamos la experiencia de un centro oncológico en el abordaje de tumores del mediastino y sus resultados   Métodos  En el departamento de Jefatura de Cirugía Oncológica del Instituto Oncológico nacional de Solca Guayaquil  durante los meses de Enero del 2013 a Enero 2017 se realizó un estudio descriptivo  retrospectivo  Se analizaron todos los casos de pacientes derivados del área de pre admisión con diagnóstico inicial de tumor de mediastino  a los cuales previo a realizarles marcadores tumorales  Tomografía de Tórax  y a quienes se les realizó como método diagnóstico y en algunos casos terapéutico con abordaje quirúrgico  Se excluyeron pacientes con neoplasias de origen secundario  con historias clínicas incompletas que imposibilitaron el análisis  Se estudiaron las variables 

## BERT Tokenizers

Inspecting the [Keras BERT Tokenizer source code](https://github.com/CyberZHG/keras-bert/blob/26bdfe3c36e77fa0524902f31263a920ccd62efb/keras_bert/tokenizer.py#L101), we can see that whitespace characters (' ', '\n', '\t', etc.) are used only to split the text into tokens and are not considered during sub-word (WordPiece) tokenization, whereas punctuation marks ('.', ',', ':', etc.) are kept as separate tokens, since they can be part of the vocabularies.
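The behaviour described above can be sketched in plain Python. This is a simplified illustration and not the keras-bert implementation: the function names and the toy vocabulary are assumptions, and the sketch maps a whole unmatched word to `[UNK]`, which may differ in detail from what a given library does.

```python
import unicodedata


def basic_split(text):
    """Split on whitespace and emit each punctuation character as its own token."""
    tokens, current = [], ""
    for ch in text:
        if ch.isspace():
            # Whitespace only separates tokens; it never becomes a token itself.
            if current:
                tokens.append(current)
                current = ""
        elif unicodedata.category(ch).startswith("P"):
            # Punctuation closes the current word and is kept as a separate token.
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def word_piece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No vocabulary entry matches at this position.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Apply the basic split, then the sub-word split, to every word."""
    return [p for word in basic_split(text) for p in word_piece(word, vocab)]


# Toy vocabulary, hypothetical and far smaller than any real BERT vocabulary.
toy_vocab = {"tumor", "##es", "de", "media", "##stino", ",", "."}
tokens = tokenize("tumores de mediastino, xyz.", toy_vocab)
print(tokens)
```

With a vocabulary that lacks punctuation entries, every punctuation mark would become an unknown token in this sketch, which is one reason to compare the raw and the punctuation-free versions of the text.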

In [11]:
from keras_bert import load_vocabulary, Tokenizer

In [12]:
base_path = "../bert_models/"
vocab_file = "vocab.txt"

### Multilingual

This tokenizer uses a cased multilingual vocabulary containing tokens from 104 different languages (see [BERT multilingual](https://github.com/google-research/bert/blob/master/multilingual.md) for more details).

In [14]:
multi_path = "multi_cased_L-12_H-768_A-12/"

In [15]:
multi_token_dict = load_vocabulary(base_path + multi_path + vocab_file)

In [16]:
# 119547 expected
len(multi_token_dict)

119547
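The size check above relies on the vocabulary file holding one token per line, with the line position serving as the token id. The following is a minimal sketch of that mapping under this assumption; `read_vocab` is a hypothetical helper written for illustration and is not keras-bert's `load_vocabulary`.

```python
import os
import tempfile


def read_vocab(path):
    """Map each token (one per line) to its zero-based line index (later duplicates win)."""
    token_dict = {}
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            token_dict[line.rstrip("\n")] = index
    return token_dict


# Write a tiny hypothetical vocabulary to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\n[CLS]\n[SEP]\ntumor\n##es\n")
    toy_vocab = read_vocab(vocab_path)

print(len(toy_vocab), toy_vocab["[UNK]"])
```

Comparing `len(...)` against the expected vocabulary size, as done in the cell above, is a quick way to catch a truncated or mismatched vocabulary file before tokenizing the whole corpus.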

In [17]:
bert_tokenizer = Tokenizer(token_dict=multi_token_dict, cased=True)

In [18]:
tokenizer_name = "Multi"

In [19]:
unk_token_id = multi_token_dict["[UNK]"]

In [20]:
%%time
text_token_raw = [bert_tokenizer.tokenize(text) for text in df_text_all_abs.raw_text]
text_token_punc = [bert_tokenizer.tokenize(text) for text in df_text_all_abs.punc_text]

df_text_all_abs["raw_" + tokenizer_name] = [len(token_list) for token_list in text_token_raw]
df_text_all_abs["punc_" + tokenizer_name] = [len(token_list) for token_list in text_token_punc]

CPU times: user 6min 52s, sys: 1.45 s, total: 6min 54s
Wall time: 6min 55s


In [21]:
%%time
df_text_all_abs["raw_" + tokenizer_name + "_UNK"] = [bert_tokenizer.encode(text)[0].count(unk_token_id) 
                                                   for text in df_text_all_abs.raw_text]
df_text_all_abs["punc_" + tokenizer_name + "_UNK"] = [bert_tokenizer.encode(text)[0].count(unk_token_id) 
                                                   for text in df_text_all_abs.punc_text]

CPU times: user 6min 58s, sys: 11.8 ms, total: 6min 58s
Wall time: 6min 58s


### BETO

This tokenizer uses a cased Spanish vocabulary.

In [22]:
beto_path = "BETO_cased/"

In [23]:
beto_token_dict = load_vocabulary(base_path + beto_path + vocab_file)

In [24]:
# 31002 expected
len(beto_token_dict)

31002

In [25]:
bert_tokenizer = Tokenizer(token_dict=beto_token_dict, pad_index=1, cased=True)

In [26]:
tokenizer_name = "BETO"

In [27]:
unk_token_id = beto_token_dict["[UNK]"]

In [28]:
%%time
text_token_raw = [bert_tokenizer.tokenize(text) for text in df_text_all_abs.raw_text]
text_token_punc = [bert_tokenizer.tokenize(text) for text in df_text_all_abs.punc_text]

df_text_all_abs["raw_" + tokenizer_name] = [len(token_list) for token_list in text_token_raw]
df_text_all_abs["punc_" + tokenizer_name] = [len(token_list) for token_list in text_token_punc]

CPU times: user 6min 38s, sys: 1.13 s, total: 6min 39s
Wall time: 6min 39s


In [29]:
%%time
df_text_all_abs["raw_" + tokenizer_name + "_UNK"] = [bert_tokenizer.encode(text)[0].count(unk_token_id) 
                                                   for text in df_text_all_abs.raw_text]
df_text_all_abs["punc_" + tokenizer_name + "_UNK"] = [bert_tokenizer.encode(text)[0].count(unk_token_id) 
                                                   for text in df_text_all_abs.punc_text]

CPU times: user 6min 42s, sys: 0 ns, total: 6min 42s
Wall time: 6min 42s


### BERT-Scielo

This tokenizer uses a custom cased Spanish clinical vocabulary, obtained from this [article under review](https://www.researchsquare.com/article/rs-13271/v1).

In [30]:
sci_path = "BERT_Scielo_cased/"

In [31]:
sci_token_dict = load_vocabulary(base_path + sci_path + vocab_file)

In [32]:
# 128000 expected
len(sci_token_dict)

128000

In [33]:
bert_tokenizer = Tokenizer(token_dict=sci_token_dict, pad_index=0, cased=True)

In [34]:
tokenizer_name = "Scielo"

In [35]:
unk_token_id = sci_token_dict["[UNK]"]

In [36]:
%%time
text_token_raw = [bert_tokenizer.tokenize(text) for text in df_text_all_abs.raw_text]
text_token_punc = [bert_tokenizer.tokenize(text) for text in df_text_all_abs.punc_text]

df_text_all_abs["raw_" + tokenizer_name] = [len(token_list) for token_list in text_token_raw]
df_text_all_abs["punc_" + tokenizer_name] = [len(token_list) for token_list in text_token_punc]

CPU times: user 6min 19s, sys: 1 s, total: 6min 20s
Wall time: 6min 20s


In [37]:
%%time
df_text_all_abs["raw_" + tokenizer_name + "_UNK"] = [bert_tokenizer.encode(text)[0].count(unk_token_id) 
                                                   for text in df_text_all_abs.raw_text]
df_text_all_abs["punc_" + tokenizer_name + "_UNK"] = [bert_tokenizer.encode(text)[0].count(unk_token_id) 
                                                   for text in df_text_all_abs.punc_text]

CPU times: user 6min 27s, sys: 3.97 ms, total: 6min 27s
Wall time: 6min 27s


In [38]:
df_text_all_abs.head()

Unnamed: 0,doc_id,raw_text,sw_text,punc_text,raw_Multi,punc_Multi,raw_Multi_UNK,punc_Multi_UNK,raw_BETO,punc_BETO,raw_BETO_UNK,punc_BETO_UNK,raw_Scielo,punc_Scielo,raw_Scielo_UNK,punc_Scielo_UNK
0,biblio-1000005,Introducción: A pesar del difícil acceso anató...,Introducción : pesar difícil acceso anatómico ...,Introducción A pesar del difícil acceso anató...,491,442,0,0,444,395,8,4,348,299,49,0
1,biblio-1000026,Introducción: La enterocolitis neutropénica se...,Introducción : enterocolitis neutropénica defi...,Introducción La enterocolitis neutropénica se...,461,425,0,0,393,357,4,3,318,282,36,0
2,biblio-1000027,Introducción: La presencia de anticuerpos anti...,Introducción : presencia anticuerpos anti erit...,Introducción La presencia de anticuerpos anti...,551,503,0,0,487,439,13,4,406,358,48,0
3,biblio-1000028,Introducción: El Carcinoma de lengua móvil es ...,Introducción : Carcinoma lengua móvil tumores ...,Introducción El Carcinoma de lengua móvil es ...,444,389,0,0,391,336,10,3,344,289,55,0
4,biblio-1000029,Introducción: El cáncer de ovario epitelial au...,Introducción : cáncer ovario epitelial aunque ...,Introducción El cáncer de ovario epitelial au...,457,399,0,0,405,347,17,6,323,265,58,0
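The three tokenizer sections above repeat the same statements and call `tokenize` and `encode` separately, so every text is tokenized twice per tokenizer. A possible consolidation is sketched below. It assumes that `encode`, when called without a maximum length, returns one id per token with no padding, which should be checked against the installed keras-bert version. `WhitespaceTokenizer` and `token_stats` are hypothetical names, and the stand-in tokenizer only exists to make the sketch runnable.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in exposing an `encode` method shaped like the one used above."""

    def __init__(self, token_dict):
        self.token_dict = token_dict

    def encode(self, text):
        unk = self.token_dict["[UNK]"]
        tokens = ["[CLS]"] + text.split() + ["[SEP]"]
        ids = [self.token_dict.get(tok, unk) for tok in tokens]
        # Second element mimics the segment ids returned next to the token ids.
        return ids, [0] * len(ids)


def token_stats(tokenizer, texts, unk_id):
    """Return (n_tokens, n_unk) for each text using a single encode() call per text."""
    stats = []
    for text in texts:
        ids, _segments = tokenizer.encode(text)
        stats.append((len(ids), ids.count(unk_id)))
    return stats


# Toy vocabulary and texts, for illustration only.
toy_dict = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "tumor": 4, "de": 5}
toy_tokenizer = WhitespaceTokenizer(toy_dict)
stats = token_stats(toy_tokenizer, ["tumor de mediastino", "de"], toy_dict["[UNK]"])
print(stats)
```

Under that assumption, the same helper could be called once per tokenizer and text column, and the two elements of each pair would fill the length and `_UNK` columns respectively.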


### Token sequence analysis

To compare the distribution of the number of tokens extracted from the texts by each BERT tokenizer, we generate the following tables:

In [41]:
col_names = ["Multilingual", "BETO", "Scielo"]

In [42]:
df_res = df_text_all_abs
raw_all_abs_res = pd.DataFrame({col_names[0]: df_res["raw_Multi"].describe(), 
              col_names[1]: df_res["raw_BETO"].describe(), 
              col_names[2]: df_res["raw_Scielo"].describe()})

In [43]:
punc_all_abs_res = pd.DataFrame({col_names[0]: df_res["punc_Multi"].describe(), 
              col_names[1]: df_res["punc_BETO"].describe(), 
              col_names[2]: df_res["punc_Scielo"].describe()})

In [44]:
raw_all_abs_res

Unnamed: 0,Multilingual,BETO,Scielo
count,149424.0,149424.0,149424.0
mean,290.583942,257.412069,216.811576
std,141.119278,127.63405,108.508242
min,23.0,30.0,15.0
25%,182.0,160.0,133.0
50%,272.0,239.0,202.0
75%,391.0,344.0,292.0
max,2096.0,1863.0,1660.0


In [45]:
punc_all_abs_res

Unnamed: 0,Multilingual,BETO,Scielo
count,149424.0,149424.0,149424.0
mean,262.490925,229.319052,188.718559
std,122.123818,108.237142,88.981236
min,21.0,21.0,13.0
25%,169.0,147.0,120.0
50%,249.0,217.0,178.0
75%,352.0,306.0,254.0
max,1957.0,1724.0,1521.0


We also analyze the number of unknown tokens per text produced by each tokenizer:

In [46]:
raw_all_abs_res = pd.DataFrame({col_names[0]: df_res["raw_Multi_UNK"].describe(), 
              col_names[1]: df_res["raw_BETO_UNK"].describe(), 
              col_names[2]: df_res["raw_Scielo_UNK"].describe()})

In [47]:
punc_all_abs_res = pd.DataFrame({col_names[0]: df_res["punc_Multi_UNK"].describe(), 
              col_names[1]: df_res["punc_BETO_UNK"].describe(), 
              col_names[2]: df_res["punc_Scielo_UNK"].describe()})

In [48]:
raw_all_abs_res

Unnamed: 0,Multilingual,BETO,Scielo
count,149424.0,149424.0,149424.0
mean,0.004531,5.025096,28.618435
std,0.135611,8.105564,24.155509
min,0.0,0.0,0.0
25%,0.0,0.0,12.0
50%,0.0,2.0,21.0
75%,0.0,6.0,38.0
max,25.0,187.0,322.0


In [49]:
punc_all_abs_res

Unnamed: 0,Multilingual,BETO,Scielo
count,149424.0,149424.0,149424.0
mean,0.003453,2.053111,0.526903
std,0.125467,3.505599,2.101695
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,1.0,0.0
75%,0.0,3.0,0.0
max,25.0,110.0,114.0
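The tables above summarize absolute counts of unknown tokens per text. To compare tokenizers whose sequences differ in length, the counts can be normalized by the number of tokens. The sketch below does this for the first raw abstract, using the values shown in the earlier `head()` output; `unk_rate` is a hypothetical helper, not part of the notebook.

```python
def unk_rate(n_unk, n_tokens):
    """Fraction of tokens mapped to [UNK]; 0.0 for an empty sequence."""
    return n_unk / n_tokens if n_tokens else 0.0


# (token count, unknown-token count) of the first raw abstract for each tokenizer,
# copied from the raw_* and raw_*_UNK columns of the head() output.
first_abstract = {"Multilingual": (491, 0), "BETO": (444, 8), "Scielo": (348, 49)}

rates = {name: unk_rate(n_unk, n_tokens)
         for name, (n_tokens, n_unk) in first_abstract.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

The same ratio could be computed column-wise on the full data frame to obtain a per-text unknown-token rate for each tokenizer and text version.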


### Select one-fragment texts

Finally, we select and save the identifiers of the texts that fit in a single fragment, considering a different maximum sequence length for each tokenizer:

##### BERT-Scielo

In [16]:
SEQ_LEN = 230-2
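The `-2` in `SEQ_LEN` is not explained in the notebook. A plausible reading, assumed here, is that two positions of the maximum input length are reserved for special tokens such as `[CLS]` and `[SEP]`. Under that assumption, the selection rule used in the following cells can be sketched as below; `one_fragment_ids`, the `reserved` parameter and the document ids are hypothetical.

```python
def one_fragment_ids(token_counts, max_len, reserved=2):
    """Return the ids of documents whose token count fits in `max_len - reserved` positions."""
    limit = max_len - reserved
    return [doc_id for doc_id, n_tokens in token_counts.items() if n_tokens <= limit]


# Hypothetical token counts per document, for illustration only.
toy_counts = {"doc-a": 120, "doc-b": 228, "doc-c": 229, "doc-d": 512}
selected = one_fragment_ids(toy_counts, max_len=230)
print(selected)
```

A document exactly at the limit is kept, while anything longer would need to be split into several fragments and is therefore excluded from the one-fragment subset.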

###### Space-Punctuation

Punctuation is replaced by spaces:

In [20]:
df_text_all_abs_frag = df_text_all_abs[df_text_all_abs.punc_Scielo <= SEQ_LEN]

In [36]:
df_text_all_abs_frag.shape

(100240, 8)

In [37]:
df_text_all_abs_frag[["doc_id"]].head()

Unnamed: 0,doc_id
6,biblio-1000049
8,biblio-1000083
9,biblio-1000087
18,biblio-1000153
21,biblio-1000221


In [38]:
df_text_all_abs_frag[["doc_id"]].to_csv(path_or_buf=corpus_path + "all_abs_doc_one_frag_space_punc_Scielo_" + str(SEQ_LEN+2) + ".tsv", 
                                                   sep="\t", header=False, index=False)

##### BETO

In [50]:
SEQ_LEN = 275-2

###### Raw text

In [19]:
df_text_all_abs_frag = df_text_all_abs[df_text_all_abs.raw_BETO <= SEQ_LEN]

In [20]:
df_text_all_abs_frag.shape

(87871, 16)

In [21]:
df_text_all_abs_frag[["doc_id"]].head()

Unnamed: 0,doc_id
8,biblio-1000083
9,biblio-1000087
18,biblio-1000153
21,biblio-1000221
22,biblio-1000235


In [22]:
df_text_all_abs_frag[["doc_id"]].to_csv(path_or_buf=corpus_path + "all_abs_doc_one_frag_BETO_" + str(SEQ_LEN+2) + ".tsv", 
                                                   sep="\t", header=False, index=False)

###### Space-Punctuation

In [51]:
df_text_all_abs_frag = df_text_all_abs[df_text_all_abs.punc_BETO <= SEQ_LEN]

In [52]:
df_text_all_abs_frag.shape

(98997, 16)

In [53]:
df_text_all_abs_frag[["doc_id"]].head()

Unnamed: 0,doc_id
6,biblio-1000049
8,biblio-1000083
9,biblio-1000087
18,biblio-1000153
21,biblio-1000221


In [54]:
df_text_all_abs_frag[["doc_id"]].to_csv(path_or_buf=corpus_path + "all_abs_doc_one_frag_space_punc_BETO_" + str(SEQ_LEN+2) + ".tsv", 
                                                   sep="\t", header=False, index=False)