# Abstracts BERT Tokenizer

We analyze how different BERT Tokenizers split the texts of the additional abstracts corpus annotated with CIE-D codes.

In [1]:
import pandas as pd
import numpy as np

# Auxiliary components
from nlp_utils import *

Using TensorFlow backend.


In [2]:
corpus_path = "../datasets/abstractsWithCIE10_v2/"

## Load text

First, we load all documents from the additional abstracts corpus that have associated CIE-D codes (see the `CodiEsp_Exploration` notebook). Besides the raw text, we also keep a pre-processed version of each text in which stop words are removed.
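The `remove_sw` helper used below comes from `nlp_utils`, whose source is not shown in this notebook. As a minimal sketch of what such a step might look like, assuming it separates words from punctuation and compares tokens case-insensitively against the stop-word list (the name `remove_sw_sketch`, the regular expression and the tiny stop-word list are illustrative assumptions, not the actual implementation):

```python
import re

def remove_sw_sketch(text, stop_words):
    """Drop stop words from text; hypothetical stand-in for nlp_utils.remove_sw."""
    stop_words = set(stop_words)
    # Words and single punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Stop-word lists are lowercase, so compare on the lowercased token.
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    return " ".join(kept)

# Tiny illustrative subset; the notebook uses NLTK's full Spanish list.
toy_stop_words = ["a", "del", "para", "los", "de", "la"]
example = "Introducción: A pesar del difícil acceso anatómico para los tumores de mediastino"
print(remove_sw_sketch(example, toy_stop_words))
```

Lowercasing before the lookup is what lets the capitalized "A" be dropped even though the stop-word list only contains lowercase entries.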

In [3]:
# All abstracts
df_codes_d_abs_all = pd.read_table(corpus_path + "all_abstracts_table_valid_codes_D.tsv", sep='\t', header=None)

In [4]:
df_codes_d_abs_all.columns = ["doc_id", "code"]

In [5]:
df_codes_d_abs_all.shape

(403856, 2)

In [6]:
df_codes_d_abs_all.head()

| | doc_id | code |
|---|---|---|
| 0 | biblio-994981 | r68.82 |
| 1 | biblio-1008268 | f91.9 |
| 2 | biblio-1008268 | f99 |
| 3 | biblio-1008288 | r09.02 |
| 4 | biblio-1010254 | h91.9 |


In [7]:
# Number of distinct docs
len(set(df_codes_d_abs_all["doc_id"]))

170120

In [8]:
# Number of distinct codes
len(set(df_codes_d_abs_all["code"]))

2984

In [9]:
from nltk.corpus import stopwords

spanish_sw = stopwords.words('spanish')

In [10]:
# All Spanish stop words are lowercase
all([w.islower() for w in spanish_sw])

True

In [11]:
from functools import partial

### All abstracts with valid codes corpus

In [12]:
all_abs_doc_list = sorted(set(df_codes_d_abs_all["doc_id"]))

In [13]:
len(all_abs_doc_list)

170120

In [14]:
all_abs_doc_list[:10]

['biblio-1000005',
 'biblio-1000026',
 'biblio-1000027',
 'biblio-1000028',
 'biblio-1000029',
 'biblio-1000030',
 'biblio-1000049',
 'biblio-1000075',
 'biblio-1000083',
 'biblio-1000087']

In [15]:
%%time
abs_path = corpus_path + "ankush_txt/"
abs_text_data = load_text_files([d + ".txt" for d in all_abs_doc_list], abs_path)
df_text_all_abs = pd.DataFrame({'doc_id': all_abs_doc_list, 'raw_text': abs_text_data})
df_text_all_abs["sw_text"] = df_text_all_abs["raw_text"].apply(partial(remove_sw, stop_words=spanish_sw))

CPU times: user 4min 4s, sys: 1.13 s, total: 4min 6s
Wall time: 4min 6s


In [16]:
df_text_all_abs.shape

(170120, 3)

In [17]:
df_text_all_abs.head()

| | doc_id | raw_text | sw_text |
|---|---|---|---|
| 0 | biblio-1000005 | Introducción: A pesar del difícil acceso anató... | Introducción : pesar difícil acceso anatómico ... |
| 1 | biblio-1000026 | Introducción: La enterocolitis neutropénica se... | Introducción : enterocolitis neutropénica defi... |
| 2 | biblio-1000027 | Introducción: La presencia de anticuerpos anti... | Introducción : presencia anticuerpos anti erit... |
| 3 | biblio-1000028 | Introducción: El Carcinoma de lengua móvil es ... | Introducción : Carcinoma lengua móvil tumores ... |
| 4 | biblio-1000029 | Introducción: El cáncer de ovario epitelial au... | Introducción : cáncer ovario epitelial aunque ... |


In [18]:
df_text_all_abs.raw_text.values[0]

'Introducción: A pesar del difícil acceso anatómico para los tumores de mediastino, la resección quirúrgica sigue siendo el mejor enfoque diagnóstico y terapéutico. El objetivo de la presente serie de casos presentamos la experiencia de un centro oncológico en el abordaje de tumores del mediastino y sus resultados.  Métodos: En el departamento de Jefatura de Cirugía Oncológica del Instituto Oncológico nacional de Solca-Guayaquil, durante los meses de Enero del 2013 a Enero 2017 se realizó un estudio descriptivo, retrospectivo. Se analizaron todos los casos de pacientes derivados del área de pre admisión con diagnóstico inicial de tumor de mediastino, a los cuales previo a realizarles marcadores tumorales, Tomografía de Tórax, y a quienes se les realizó como método diagnóstico y en algunos casos terapéutico con abordaje quirúrgico. Se excluyeron pacientes con neoplasias de origen secundario, con historias clínicas incompletas que imposibilitaron el análisis. Se estudiaron las variables 

In [19]:
df_text_all_abs.sw_text.values[0]

'Introducción : pesar difícil acceso anatómico tumores mediastino , resección quirúrgica sigue siendo mejor enfoque diagnóstico terapéutico . objetivo presente serie casos presentamos experiencia centro oncológico abordaje tumores mediastino resultados . Métodos : departamento Jefatura Cirugía Oncológica Instituto Oncológico nacional Solca - Guayaquil , meses Enero 2013 Enero 2017 realizó estudio descriptivo , retrospectivo . analizaron casos pacientes derivados área pre admisión diagnóstico inicial tumor mediastino , cuales previo realizarles marcadores tumorales , Tomografía Tórax , realizó método diagnóstico casos terapéutico abordaje quirúrgico . excluyeron pacientes neoplasias origen secundario , historias clínicas incompletas imposibilitaron análisis . estudiaron variables sexo , edad , Tipo Técnica quirúrgica , localización tumor , diagnostico histopatológico mortalidad perioperatoria . análisis estadístico realizado descriptivo . Resultados : evaluaron 22 pacientes diagnóstico 

## BERT Tokenizers

Inspecting the [Keras BERT Tokenizer source code](https://github.com/CyberZHG/keras-bert/blob/26bdfe3c36e77fa0524902f31263a920ccd62efb/keras_bert/tokenizer.py#L101), we can see that whitespace characters (' ', '\n', '\t', etc.) are only used to split the text into tokens and are not considered during sub-word (WordPiece) tokenization, while punctuation marks ('.', ',', ':', etc.) are kept as separate tokens, since they appear as entries in the different vocabularies.
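To make the behaviour described above concrete, here is a simplified sketch of greedy longest-match-first WordPiece tokenization. It is not the keras-bert code: the function names, the toy vocabulary and the handling of unknown words (mapped to `[UNK]` here) are assumptions made for illustration.

```python
import re

def basic_split(text):
    # Whitespace only separates tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def wordpiece_sketch(word, vocab, unk="[UNK]"):
    """Greedily match the longest vocabulary entry from left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # assumption: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize_sketch(text, vocab):
    tokens = ["[CLS]"]
    for word in basic_split(text):
        tokens.extend(wordpiece_sketch(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Hypothetical toy vocabulary, far smaller than any real BERT vocabulary.
toy_vocab = {"tumor", "##es", "de", "media", "##stino", ",", "."}
print(tokenize_sketch("tumores de mediastino.", toy_vocab))
print(len(tokenize_sketch(".", toy_vocab)))
```

A vocabulary with more complete Spanish words needs fewer `##` continuation pieces per word, which is why the sequence lengths computed in the following sections differ between Tokenizers.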

In [20]:
from keras_bert import load_vocabulary, Tokenizer

In [21]:
base_path = "../bert_models/"
vocab_file = "vocab.txt"

### BioBERT

This Tokenizer uses the same cased English vocabulary as the original BERT model (see [BioBERT paper](https://academic.oup.com/bioinformatics/article/36/4/1234/5566506) for more details).

In [7]:
bio_path = "biobert_v1.1_pubmed/"

In [8]:
bio_token_dict = load_vocabulary(base_path + bio_path + vocab_file)

In [9]:
# 28996 expected
len(bio_token_dict)

28996

In [10]:
bio_tokenizer = Tokenizer(token_dict=bio_token_dict, cased=True)

In [11]:
%%time
df_text_all_abs["raw_BioBERT"] = [len(bio_tokenizer.tokenize(text)) for text in df_text_all_abs.raw_text]
df_text_all_abs["sw_BioBERT"] = [len(bio_tokenizer.tokenize(text)) for text in df_text_all_abs.sw_text]

CPU times: user 7min 21s, sys: 211 ms, total: 7min 21s
Wall time: 7min 21s


### Multilingual

This Tokenizer uses a cased multilingual vocabulary containing tokens from 104 different languages (see [BERT multilingual](https://github.com/google-research/bert/blob/master/multilingual.md) for more details).

In [12]:
multi_path = "multi_cased_L-12_H-768_A-12/"

In [13]:
multi_token_dict = load_vocabulary(base_path + multi_path + vocab_file)

In [14]:
# 119547 expected
len(multi_token_dict)

119547

In [15]:
multi_tokenizer = Tokenizer(token_dict=multi_token_dict, cased=True)

In [16]:
%%time
df_text_all_abs["raw_Multi"] = [len(multi_tokenizer.tokenize(text)) for text in df_text_all_abs.raw_text]
df_text_all_abs["sw_Multi"] = [len(multi_tokenizer.tokenize(text)) for text in df_text_all_abs.sw_text]

CPU times: user 6min 33s, sys: 52 ms, total: 6min 33s
Wall time: 6min 33s


### BETO

This Tokenizer uses a cased Spanish vocabulary (see [BETO](https://github.com/dccuchile/beto) for more details).

In [17]:
beto_path = "BETO_cased/"

In [18]:
beto_token_dict = load_vocabulary(base_path + beto_path + vocab_file)

In [19]:
# 31002 expected
len(beto_token_dict)

31002

In [20]:
beto_tokenizer = Tokenizer(token_dict=beto_token_dict, pad_index=1, cased=True)

In [21]:
%%time
df_text_all_abs["raw_BETO"] = [len(beto_tokenizer.tokenize(text)) for text in df_text_all_abs.raw_text]
df_text_all_abs["sw_BETO"] = [len(beto_tokenizer.tokenize(text)) for text in df_text_all_abs.sw_text]

CPU times: user 6min 13s, sys: 32 ms, total: 6min 13s
Wall time: 6min 13s


In [22]:
df_text_all_abs.head()

| | doc_id | raw_text | sw_text | raw_BioBERT | sw_BioBERT | raw_Multi | sw_Multi | raw_BETO | sw_BETO |
|---|---|---|---|---|---|---|---|---|---|
| 0 | biblio-1000005 | Introducción: A pesar del difícil acceso anató... | Introducción : pesar difícil acceso anatómico ... | 755 | 613 | 491 | 376 | 444 | 330 |
| 1 | biblio-1000026 | Introducción: La enterocolitis neutropénica se... | Introducción : enterocolitis neutropénica defi... | 707 | 579 | 461 | 361 | 393 | 293 |
| 2 | biblio-1000027 | Introducción: La presencia de anticuerpos anti... | Introducción : presencia anticuerpos anti erit... | 851 | 693 | 551 | 413 | 487 | 349 |
| 3 | biblio-1000028 | Introducción: El Carcinoma de lengua móvil es ... | Introducción : Carcinoma lengua móvil tumores ... | 646 | 492 | 444 | 317 | 391 | 264 |
| 4 | biblio-1000029 | Introducción: El cáncer de ovario epitelial au... | Introducción : cáncer ovario epitelial aunque ... | 649 | 546 | 457 | 367 | 405 | 315 |


### Token sequence analysis

To compare, for each BERT Tokenizer, the distribution of the number of tokens extracted from the texts, we generate the following tables:

In [4]:
col_names = ["BioBERT", "Multilingual", "BETO"]

In [5]:
df_res = df_text_all_abs
raw_all_abs_res = pd.DataFrame({col_names[0]: df_res["raw_BioBERT"].describe(), 
              col_names[1]: df_res["raw_Multi"].describe(), 
              col_names[2]: df_res["raw_BETO"].describe()})

In [6]:
sw_all_abs_res = pd.DataFrame({col_names[0]: df_res["sw_BioBERT"].describe(), 
              col_names[1]: df_res["sw_Multi"].describe(), 
              col_names[2]: df_res["sw_BETO"].describe()})

In [7]:
raw_all_abs_res

| | BioBERT | Multilingual | BETO |
|---|---|---|---|
| count | 170120.0 | 170120.0 | 170120.0 |
| mean | 375.978815 | 255.762626 | 226.619463 |
| std | 233.06396 | 162.010409 | 145.448418 |
| min | 3.0 | 3.0 | 3.0 |
| 25% | 208.0 | 139.0 | 122.0 |
| 50% | 369.0 | 247.0 | 216.0 |
| 75% | 551.0 | 374.0 | 329.0 |
| max | 3158.0 | 2096.0 | 1863.0 |


In [8]:
sw_all_abs_res

| | BioBERT | Multilingual | BETO |
|---|---|---|---|
| count | 170120.0 | 170120.0 | 170120.0 |
| mean | 298.874871 | 190.132894 | 161.132871 |
| std | 187.681522 | 124.563585 | 109.135381 |
| min | 3.0 | 3.0 | 3.0 |
| 25% | 163.0 | 100.0 | 83.0 |
| 50% | 291.0 | 180.0 | 149.0 |
| 75% | 437.0 | 276.0 | 232.0 |
| max | 2358.0 | 1471.0 | 1226.0 |


To sum up, we generate a final table showing, for each Tokenizer, the fraction of texts whose sub-token sequence length is <= 512 (the maximum input sequence length of the BERT model).
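The fraction described above is simply the mean of a boolean condition. As a minimal standard-library sketch (the helper name `fraction_within` is hypothetical), applied to the raw-text token counts of the first five abstracts shown earlier:

```python
def fraction_within(lengths, max_len):
    """Fraction of sequences whose length does not exceed max_len."""
    lengths = list(lengths)
    return sum(n <= max_len for n in lengths) / len(lengths)

# Raw-text token counts of the first five abstracts (from the head() output).
sample_lengths = {
    "BioBERT": [755, 707, 851, 646, 649],
    "Multilingual": [491, 461, 551, 444, 457],
    "BETO": [444, 393, 487, 391, 405],
}

for max_len in (512, 256):
    fractions = {name: fraction_within(vals, max_len)
                 for name, vals in sample_lengths.items()}
    print(max_len, fractions)
```

With pandas, the same quantity can be written as `(df_res[col] <= max_len).mean()`, which avoids repeating the `sum(...)/df_res.shape[0]` expression for every column.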

In [9]:
max_len = 512
row_names = ["All (raw)", "All (sw)"]
res_512 = pd.DataFrame({col_names[0]: [sum(df_res["raw_BioBERT"] <= max_len)/df_res.shape[0], 
                                       sum(df_res["sw_BioBERT"] <= max_len)/df_res.shape[0]],
                        col_names[1]: [sum(df_res["raw_Multi"] <= max_len)/df_res.shape[0], 
                                       sum(df_res["sw_Multi"] <= max_len)/df_res.shape[0]],
                        col_names[2]: [sum(df_res["raw_BETO"] <= max_len)/df_res.shape[0], 
                                       sum(df_res["sw_BETO"] <= max_len)/df_res.shape[0]]}, 
                       index=row_names)

In [10]:
res_512

| | BioBERT | Multilingual | BETO |
|---|---|---|---|
| All (raw) | 0.704044 | 0.945556 | 0.973119 |
| All (sw) | 0.861774 | 0.991629 | 0.995521 |


As expected, ordered from the most to the least complete-word coverage of Spanish (that is, from shorter to longer sub-token sequences), the vocabularies rank as follows: BETO, Multilingual and BioBERT.
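To quantify the gap, the mean raw-text sequence lengths reported in the summary table above can be expressed relative to BETO. This is only a small arithmetic sketch; the dictionary simply copies the `mean` row of that table.

```python
# Mean raw-text sequence lengths copied from the summary table above.
raw_means = {"BioBERT": 375.978815, "Multilingual": 255.762626, "BETO": 226.619463}

baseline = raw_means["BETO"]
# Ratio > 1 means the tokenizer produces longer sequences than BETO on average.
relative = {name: round(mean / baseline, 2) for name, mean in raw_means.items()}
print(relative)
```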

In [11]:
max_len = 256
row_names = ["All (raw)", "All (sw)"]
res_256 = pd.DataFrame({col_names[0]: [sum(df_res["raw_BioBERT"] <= max_len)/df_res.shape[0], 
                                       sum(df_res["sw_BioBERT"] <= max_len)/df_res.shape[0]],
                        col_names[1]: [sum(df_res["raw_Multi"] <= max_len)/df_res.shape[0], 
                                       sum(df_res["sw_Multi"] <= max_len)/df_res.shape[0]],
                        col_names[2]: [sum(df_res["raw_BETO"] <= max_len)/df_res.shape[0], 
                                       sum(df_res["sw_BETO"] <= max_len)/df_res.shape[0]]}, 
                       index=row_names)

In [12]:
res_256

| | BioBERT | Multilingual | BETO |
|---|---|---|---|
| All (raw) | 0.318628 | 0.523366 | 0.601722 |
| All (sw) | 0.428051 | 0.705808 | 0.809775 |


## Text content 

We also analyze the frequency of each sub-token sequence length across the abstracts, as a simple sanity check of the text content.

Something looks odd: the minimum token sequence length for every Tokenizer is 3, which means that some texts contain only a single token (since '[CLS]' and '[SEP]' are always added).

In [12]:
# We analyze the BETO lengths, as BETO is the tokenizer that produces the shortest sub-token sequences
col_analyze = "raw_BETO"

In [13]:
beto_len = df_text_all_abs[col_analyze].drop_duplicates().sort_values()

In [14]:
beto_len

155011       3
9337         4
21162        5
11544        6
17047        7
          ... 
1308      1405
3144      1453
2641      1504
3599      1727
4649      1863
Name: raw_BETO, Length: 940, dtype: int64

In [15]:
beto_len_min = beto_len.values[0]

In [16]:
beto_len_min

3

In [17]:
beto_len_freq = df_text_all_abs[col_analyze].value_counts()

In [18]:
# More than 20K texts have a length of only 4
beto_len_freq

4       20348
200       545
212       539
205       529
210       522
        ...  
815         1
1071        1
816         1
944         1
1151        1
Name: raw_BETO, Length: 940, dtype: int64

In [19]:
# Two texts of length 3
beto_len_freq[beto_len_min]

2

In [20]:
df_text_all_abs[df_text_all_abs[col_analyze] == beto_len_min]["raw_text"].drop_duplicates().values

array(['.'], dtype=object)

In [21]:
df_text_all_abs[df_text_all_abs[col_analyze] == beto_len_min]

| | doc_id | raw_text | sw_text | raw_BioBERT | sw_BioBERT | raw_Multi | sw_Multi | raw_BETO | sw_BETO |
|---|---|---|---|---|---|---|---|---|---|
| 155011 | lil-695802 | . | . | 3 | 3 | 3 | 3 | 3 | 3 |
| 157046 | lil-712416 | . | . | 3 | 3 | 3 | 3 | 3 | 3 |


We check the codes associated with these "texts":

In [22]:
df_codes_d_abs_all.index = pd.Index(df_codes_d_abs_all.doc_id)

In [23]:
df_codes_d_abs_all.loc[df_text_all_abs[df_text_all_abs[col_analyze] == beto_len_min]["doc_id"]]

| doc_id (index) | doc_id | code |
|---|---|---|
| lil-695802 | lil-695802 | a91 |
| lil-712416 | lil-712416 | a49 |
| lil-712416 | lil-712416 | a49.9 |


In [40]:
df_text_all_abs[df_text_all_abs[col_analyze] < 30]["raw_text"].drop_duplicates().values

array(['Contribuir en la reducción de la incidencia de morbilidad y mortalidad del cáncer de cuello uterino en el Perú.',
       'Se ofrecen datos epidemiológicos sobre los registros del cáncer en Guatemala durante el año 2011.',
       'Se presentan los trabajos que fueron seleccionados para su presentación oral en el Congreso Iberoamericano de Neurología Pediátrica.',
       'La preeclampsia \xad eclampsia es una de las principales causas de morbimortalidad materna y perinatal a nivel mundial',
       'Evaluar los factores del riesgo de caída en ancianos. Estudio descriptivo, transversal con enfoque cuantitativo...',
       'Describir características sociodemográficas y opiniones sobre la institución de pacientes con lepra residentes en un hospital...',
       'El objetivo del estudio es describir dos pacientes con patologías poco frecuentes, de elevada mortalidad que se presentaron con diez días de diferencia.',
       'Contribuir a mejorar a nivel de salud de la población y mantene

In [44]:
df_text_all_abs[df_text_all_abs[col_analyze] < 30].shape

(20696, 9)

As a sanity-check step, we also save a pre-processed version of the corpus in which texts with fewer than 30 BETO sub-tokens (`raw_BETO` < 30) are removed:

In [27]:
# Fraction of texts that will be removed
df_text_all_abs[df_text_all_abs[col_analyze] < 30].shape[0]/df_text_all_abs.shape[0]

0.12165530213966612

In [29]:
df_text_all_abs_v2_30 = df_text_all_abs[df_text_all_abs[col_analyze] >= 30]

In [30]:
df_text_all_abs_v2_30.shape

(149424, 9)

In [47]:
%%time
df_text_all_abs_v2_30[["doc_id", "raw_text", "sw_text"]].to_csv(path_or_buf=corpus_path + 
                                                                        "all_abstracts_valid_codes_D_text_raw_sw_v2_30.tsv", 
                                                                        sep="\t", header=True, index=False)

CPU times: user 4.07 s, sys: 127 ms, total: 4.2 s
Wall time: 4.2 s


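In [34]:
# Keep only the code rows of the retained documents (the index is doc_id).
# A list is passed because recent pandas versions reject a set as an indexer.
df_codes_d_abs_all_v2_30 = df_codes_d_abs_all.loc[list(set(df_text_all_abs_v2_30["doc_id"]))]

In [37]:
df_codes_d_abs_all_v2_30.shape

(351518, 2)

In [38]:
# Fraction of code occurrences that will be removed (~ same as texts)
1 - df_codes_d_abs_all_v2_30.shape[0]/df_codes_d_abs_all.shape[0]

0.12959569747632815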

In [45]:
%%time
df_codes_d_abs_all_v2_30.to_csv(path_or_buf=corpus_path + "all_abstracts_table_valid_codes_D_v2_30.tsv", 
                                                   sep="\t", header=False, index=False)

CPU times: user 187 ms, sys: 7.87 ms, total: 195 ms
Wall time: 195 ms
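The filtering steps above can be summarized on toy data as follows. This is a minimal standard-library sketch: the document ids, lengths and some of the codes are hypothetical, and plain lists stand in for the notebook's DataFrames.

```python
# Toy stand-ins for the notebook's tables (hypothetical ids and lengths).
texts = [
    {"doc_id": "doc-1", "raw_BETO": 444},
    {"doc_id": "doc-2", "raw_BETO": 3},
    {"doc_id": "doc-3", "raw_BETO": 4},
    {"doc_id": "doc-4", "raw_BETO": 216},
]
codes = [("doc-1", "c00"), ("doc-1", "c01"), ("doc-2", "a91"),
         ("doc-3", "a49"), ("doc-4", "r68.82"), ("doc-4", "f99")]

MIN_LEN = 30  # same threshold as in the notebook

# 1) Keep texts with at least MIN_LEN sub-tokens.
kept_texts = [row for row in texts if row["raw_BETO"] >= MIN_LEN]
# 2) Keep only the code rows that belong to a retained document.
kept_ids = {row["doc_id"] for row in kept_texts}
kept_codes = [(doc_id, code) for doc_id, code in codes if doc_id in kept_ids]

# 3) Fractions removed from each table.
removed_text_fraction = 1 - len(kept_texts) / len(texts)
removed_code_fraction = 1 - len(kept_codes) / len(codes)
print(removed_text_fraction, removed_code_fraction)
```

The two fractions need not coincide, because documents carry different numbers of codes; in the notebook they happen to be close (about 12.2% of texts and 13.0% of code occurrences).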


We check the updated sub-token sequence length frequency distribution tables:

In [62]:
df_text_all_abs_v2_30["raw_BETO"].drop_duplicates().sort_values()

7294       30
15518      31
7087       32
8107       33
5098       34
         ... 
1308     1405
3144     1453
2641     1504
3599     1727
4649     1863
Name: raw_BETO, Length: 914, dtype: int64

In [63]:
df_text_all_abs_v2_30["raw_BETO"].value_counts()

200     545
212     539
205     529
210     522
197     515
       ... 
1213      1
912       1
958       1
1023      1
1151      1
Name: raw_BETO, Length: 914, dtype: int64

In [50]:
col_names = ["BioBERT", "Multilingual", "BETO"]

In [51]:
df_res = df_text_all_abs_v2_30
raw_all_abs_res = pd.DataFrame({col_names[0]: df_res["raw_BioBERT"].describe(), 
              col_names[1]: df_res["raw_Multi"].describe(), 
              col_names[2]: df_res["raw_BETO"].describe()})

In [52]:
sw_all_abs_res = pd.DataFrame({col_names[0]: df_res["sw_BioBERT"].describe(), 
              col_names[1]: df_res["sw_Multi"].describe(), 
              col_names[2]: df_res["sw_BETO"].describe()})

In [53]:
raw_all_abs_res

| | BioBERT | Multilingual | BETO |
|---|---|---|---|
| count | 149424.0 | 149424.0 | 149424.0 |
| mean | 427.013659 | 290.583942 | 257.412069 |
| std | 201.072809 | 141.119278 | 127.63405 |
| min | 28.0 | 23.0 | 30.0 |
| 25% | 271.0 | 182.0 | 160.0 |
| 50% | 404.0 | 272.0 | 239.0 |
| 75% | 576.0 | 391.0 | 344.0 |
| max | 3158.0 | 2096.0 | 1863.0 |


In [58]:
# The corpus contains a few non-Spanish examples (raw_BioBERT < 30 even though raw_BETO >= 30)
df_res[df_res["raw_BioBERT"] < 30]["raw_text"].values

array(['Uterine mucinosis and vasculitis associated with lupus erythematous(AU)'],
      dtype=object)

In [59]:
df_res[df_res["raw_Multi"] < 30]["raw_text"].values

array(['Uterine mucinosis and vasculitis associated with lupus erythematous(AU)',
       'A poroceratose de Mibelli é uma genodermatose disceratósica (..) (AU)'],
      dtype=object)

In [54]:
sw_all_abs_res

| | BioBERT | Multilingual | BETO |
|---|---|---|---|
| count | 149424.0 | 149424.0 | 149424.0 |
| mean | 339.383325 | 216.014998 | 183.006083 |
| std | 163.134923 | 110.263527 | 98.117481 |
| min | 27.0 | 18.0 | 15.0 |
| 25% | 213.0 | 131.0 | 109.0 |
| 50% | 320.0 | 199.0 | 166.0 |
| 75% | 457.0 | 290.0 | 244.0 |
| max | 2358.0 | 1471.0 | 1226.0 |
