In [0]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/12/b5/ac41e3e95205ebf53439e4dd087c58e9fd371fd8e3724f2b9b4cdb8282e5/transformers-2.10.0-py3-none-any.whl (660kB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [0]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


## Process Dataset



### Preprocessing

The data extracted by the WikiExtractor tool is saved in the following format:

```xml
<doc id="1438711" url="https://pt.wikipedia.org/wiki?curid=1438711" title="Lista de rainhas de Aragão">
Lista de rainhas de Aragão

Esta é uma lista das mulheres que usaram o título de Rainha de Aragão, ...

"Consortes de Aragão e de Navarra"

A partir de 1516, a união dos reinos espanhóis passou ...
</doc>

```

We must remove the enclosing `<doc>` tags and the empty lines so that they are not considered during training.

In [0]:
import re


WIKI_DOC_REGEX = re.compile(r'</?doc')


def pre_process(all_sentences):
    """ Remove empty sentences and doc declarations from Wikipedia """
    return filter(lambda s: s and not WIKI_DOC_REGEX.match(s), all_sentences)

In [0]:
# Sanity check

raw_sentences = [
    '<doc id="__id" url="__url" title="__title">',
    'Some title',
    '',
    'some initial text',
    '',
    '</doc>']

list(pre_process(raw_sentences))

['Some title', 'some initial text']

### Load and Cache

After extraction by WikiExtractor, we have 1,680 files containing 6,180,082 sentences (after preprocessing).

That is a considerable number of files to read, so we read them all once, preprocess them, and cache the final list. This lets us run tests on the final list without touching every wiki file again.

In [0]:
import os
import pickle


RAW_DATA_ROOT   = '/content/drive/My Drive/PF13/text'
CACHED_PATH     = '/content/drive/My Drive/PF13/text/preprocessed.pkl'


def load_raw_sentences(path):
    """
    Load all sentences in the Wiki extracted structure.
    The files are considered to be generated from WikiExtractor tool:
    - https://github.com/attardi/wikiextractor
    """
    sentences = []

    for root, _, files in os.walk(path):
        for file in files:
            # Read as UTF-8 explicitly: the corpus is accented Portuguese text
            with open(os.path.join(root, file), encoding='utf-8') as f:
                sentences_in_file = [line.strip() for line in f]
                sentences.extend(sentences_in_file)

    return sentences


def load_sentences(cache_it=True):
    """
    Load sentences from cache.
    If the cache is not available, process the raw sentences and cache them.

    Parameters:
    - cache_it (default True): indicates whether raw sentences should be cached.
    """
    if os.path.exists(CACHED_PATH):
        with open(CACHED_PATH, 'rb') as pic:
            return pickle.load(pic)

    all_sentences = list(pre_process(load_raw_sentences(RAW_DATA_ROOT)))

    if cache_it:
        with open(CACHED_PATH, 'wb') as pic:
            pickle.dump(all_sentences, pic)

    return all_sentences

In [0]:
all_sentences = load_sentences()

print('Loaded:', len(all_sentences))
print('Example:', all_sentences[:20])

Loaded: 6180082
Example: ['Manuel Scorza', 'Manuel Scorza (Lima, 9 de setembro de 1928 - Madrid, 27 de novembro em 1983) foi um romancista e poeta Peruano da geração dos anos 50, pertencente ao Indigenismo ou Neoindigenismo peruano em conjunto com seus companheiros Ciro Alegría e José María Arguedas.', 'Scorza nasceu em 1928 de pai camponês e mãe índia. Mestiço, como quarenta e cinco por cento da população peruana, passou toda sua infância em Acoria (Huancavelica), um vilarejo dos Andes centrais. Ele completou seus estudos na Colégio Militar Leoncio Prado, que também estudaram os escritores peruanos Mario Vargas Llosa e Herbert Morote Rebolledo, dentre outros. Após os primeiros estudos em escolas públicas, obteve uma bolsa que lhe permitiu retornar para Lima, local de nascimento. Em 1945 entrou para a Universidade Nacional Mayor de San Marcos e iniciou um período febril de atividade política.', 'Scorza escrevia poemas desde os 16 anos, e pertencia à redação oposicionista em 1948, quand

### Save in txt Format

After preprocessing, we dump the sentences to one or more txt files so the tokenizer can be trained on them.

An example of training the Hugging Face BERT WordPiece tokenizer can be found at https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py.

The cached file is about 2 GB, so each sentence occupies about 0.3 KB. I decided to split the corpus into txt files of about 300 MB each, so each file holds approximately 900,000 sentences.
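
The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes binary units (1 GB = 1024³ bytes) and treats the pickle size as a proxy for the text size, which is an approximation. The constant names are illustrative and are not used elsewhere in the notebook.

```python
import math

# Figures quoted in the text; the 2 GB cache size is approximate.
TOTAL_SENTENCES = 6_180_082
CACHE_BYTES = 2 * 1024**3           # ~2 GB pickle (assumed binary units)
TARGET_FILE_BYTES = 300 * 1024**2   # ~300 MB per txt file
SENTENCES_PER_FILE = 900_000        # rounded value used in this notebook

# Average size of a sentence and how many of them fit in one target file
bytes_per_sentence = CACHE_BYTES / TOTAL_SENTENCES
estimated_per_file = TARGET_FILE_BYTES / bytes_per_sentence

# Number of files produced with the rounded chunk size, and the size of the last one
total_files = math.ceil(TOTAL_SENTENCES / SENTENCES_PER_FILE)
last_file = TOTAL_SENTENCES - (total_files - 1) * SENTENCES_PER_FILE

print(f'~{bytes_per_sentence / 1024:.2f} KB per sentence')
print(f'~{estimated_per_file:,.0f} sentences fit in 300 MB')
print(f'{total_files} files, the last one with {last_file:,} sentences')
```

With the rounded chunk size of 900,000 sentences, the corpus splits into 7 files, the last one holding 780,082 sentences.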

In [0]:
import math
import shutil


SENTENCES_BY_FILE = 900000
TXT_LOCATION = '/content/wiki_pt'


def save_txt_corpus(sentences):
    """ 
    Save all preprocessed sentences in txt files.
    Every file contains up to 900k sentences, one sentence per line.
    """
    total_files = math.ceil(len(sentences) / SENTENCES_BY_FILE)
    sentences_per_file = {}

    # Create the output directory (and any missing parents) if needed
    os.makedirs(TXT_LOCATION, exist_ok=True)

    for i in range(total_files):
        start_pos = i * SENTENCES_BY_FILE
        file_sentences = sentences[start_pos:start_pos + SENTENCES_BY_FILE]
        file_name = f'{TXT_LOCATION}/wiki_pt_{i}.txt'

        sentences_per_file[file_name] = len(file_sentences)

        with open(file_name, 'w', encoding='utf-8') as txt:
            txt.write('\n'.join(file_sentences))

    return sentences_per_file


def move_txt_to_drive(
    dest_location='/content/drive/My Drive/PF13/text_preprocessed/wiki_pt'):
    """ Copy the local txt corpus to Drive (the local files are kept). """
    shutil.copytree(TXT_LOCATION, dest_location)

In [0]:
sentences_saved = save_txt_corpus(all_sentences)

In [0]:
sentences_saved

{'/content/wiki_pt/wiki_pt_0.txt': 900000,
 '/content/wiki_pt/wiki_pt_1.txt': 900000,
 '/content/wiki_pt/wiki_pt_2.txt': 900000,
 '/content/wiki_pt/wiki_pt_3.txt': 900000,
 '/content/wiki_pt/wiki_pt_4.txt': 900000,
 '/content/wiki_pt/wiki_pt_5.txt': 900000,
 '/content/wiki_pt/wiki_pt_6.txt': 780082}

In [0]:
sum(sentences_saved.values())

6180082

In [0]:
# move_txt_to_drive()

In [0]:
!tail /content/wiki_pt/wiki_pt_2.txt

A ligação entre as prática destas duas artes marciais chinesas internas, "xingyiquan" e "baguazhang", ocorreu a partir das reuniões de Cheng Tinghua com seus amigos Li Tsun I, Chang Chao Tung, Liu Te Kuan, e Liu Wai Hsiang (aluno de "Hsing-I" de Chang Chao Tung).
Os encontros tinham como finalidade comparar seus estilos de luta e compartilhar suas descobertas, num ambiente de aprendizado mútuo.
Cheng Tinghua foi morto durante o Levante dos boxers, em 1900, quando os "oito exércitos estrangeiros" invadiram Pequim.
Um grupo de soldados alemães estava recrutando à força passantes locais para um trabalho a ser realizado perto da porta "Chung Wen", local onde Cheng tinha sua loja.
Ele foi detido pelos soldados, que tentaram alinhá-lo aos demais recrutas.
Cheng resistiu e tentou lutar, derrubando alguns dos seus algozes.
Ao tentar escapar saltando um muro, foi atingido por um disparo dos soldados.
Cheng Yulung (seu filho mais velho, 1875-1928), Cheng Youxin (segundo filho), Cheng Yougong, Fe

# BERT Tokenizer

Now we train our BERT tokenizer using BERT WordPiece. We follow the same rationale as Artetxe, Ruder and Yogatama (2020): we use the same vocabulary size as the English model we'll use, and we do not perform any normalization or lowercasing.


In [0]:
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer, BertModel


We check the parameters of our target tokenizer and model: bert-base-cased.

In [0]:
bert_base_cased = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = BertModel.from_pretrained('bert-base-cased')

print('=' * 50)
print('Tokenizer')
print('=' * 50)
print('Vocab Size=', bert_base_cased.vocab_size)
print('Max Length=', bert_base_cased.max_len)

print('Max Length Sentence Pairs=', bert_base_cased.max_len_sentences_pair)
print('Special Tokens=', bert_base_cased.special_tokens_map)
print('Initial Config=', bert_base_cased.pretrained_init_configuration['bert-base-cased'])
print()
print('[SEP]:', bert_base_cased.sep_token_id)
print('[PAD]:', bert_base_cased.pad_token_id)
print('[UNK]:', bert_base_cased.unk_token_id)
print('[MASK]:', bert_base_cased.mask_token_id)
print('[CLS]:', bert_base_cased.cls_token_id)
print('=' * 50)

print('BERT')
print('=' * 50)
print('Vocab Size=', bert_model.config.vocab_size)
print('Max Position Embeddings=', bert_model.config.max_position_embeddings)
print('=' * 50)

Tokenizer
Vocab Size= 28996
Max Length= 512
Max Length Sentence Pairs= 509
Special Tokens= {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
Initial Config= {'do_lower_case': False}

[SEP]: 102
[PAD]: 0
[UNK]: 100
[MASK]: 103
[CLS]: 101
BERT
Vocab Size= 28996
Max Position Embeddings= 512


We create an instance of `BertWordPieceTokenizer`, which implements the same algorithm as the pre-trained tokenizer in the `transformers` library. We do not lowercase or strip accents (accents are common in Portuguese and carry meaning), and we enable Chinese character handling in case such characters appear frequently in the corpus.

In [0]:
pt_tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False,
                                      handle_chinese_chars=True)
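
To illustrate why `strip_accents=False` matters for Portuguese, here is a minimal sketch of what accent stripping roughly amounts to: Unicode NFD decomposition followed by dropping combining marks. The helper `strip_accents` below is hypothetical and written only for this illustration; it is not the tokenizer's own implementation.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFD decomposition (illustrative only)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed
                   if unicodedata.category(ch) != 'Mn')


# 'avó' (grandmother) and 'avô' (grandfather) collapse into the same string
print(strip_accents('avó'), strip_accents('avô'))
print(strip_accents('Aragão'))
```

Because distinct Portuguese words can become identical once accents are removed, keeping accents preserves information the model could otherwise not recover.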

We'll train the tokenizer on the following files:

In [0]:
all_files = [
    os.path.join(TXT_LOCATION, txt)
    for txt in os.listdir(TXT_LOCATION)
    if txt.endswith('.txt')
]
all_files

['/content/wiki_pt/wiki_pt_2.txt',
 '/content/wiki_pt/wiki_pt_3.txt',
 '/content/wiki_pt/wiki_pt_6.txt',
 '/content/wiki_pt/wiki_pt_1.txt',
 '/content/wiki_pt/wiki_pt_4.txt',
 '/content/wiki_pt/wiki_pt_5.txt',
 '/content/wiki_pt/wiki_pt_0.txt']

First, we do some plumbing to stay compatible with the pre-trained BERT from `transformers`. The pre-trained vocabulary starts with `[PAD]` followed by a run of unused tokens, which places the special tokens at fixed positions (ids 100 to 103).

Since I reuse the BERT special tokens, following Artetxe, Ruder and Yogatama (2020), I add the tokens below in the same order to preserve their ids.

In [0]:
initial_ =  ['[PAD]'] + \
            [f'[unused{i}]' for i in range(1, 100)] + \
            ['[UNK]', '[CLS]', '[SEP]', '[MASK]'] + \
            ['[unused100]', '[unused101]']
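
As a quick check of this layout, the sketch below rebuilds the same list and compares the positions of the special tokens with the ids printed for `bert-base-cased` earlier in this notebook. It assumes the trainer assigns ids to special tokens in list order starting at 0, which is consistent with the vocabulary inspected later but is not verified here. The names `special_tokens`, `expected_ids` and `token_to_id` are illustrative.

```python
# Same layout as the list passed to the tokenizer trainer
special_tokens = (['[PAD]']
                  + [f'[unused{i}]' for i in range(1, 100)]
                  + ['[UNK]', '[CLS]', '[SEP]', '[MASK]']
                  + ['[unused100]', '[unused101]'])

# Ids printed for bert-base-cased earlier in this notebook
expected_ids = {'[PAD]': 0, '[UNK]': 100, '[CLS]': 101,
                '[SEP]': 102, '[MASK]': 103}

# Position in the list stands in for the id (assumption, see above)
token_to_id = {token: idx for idx, token in enumerate(special_tokens)}

for token, idx in expected_ids.items():
    assert token_to_id[token] == idx, (token, token_to_id[token])

print(len(special_tokens), 'reserved tokens')
```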

We save the vocabulary so that it can be reused later. We name our model `bert-base-cased-pt`.

In [0]:
pt_tokenizer.train(
    files=all_files, special_tokens=initial_,
    vocab_size=bert_base_cased.vocab_size) # Using the same vocab size

tokenizer_files = pt_tokenizer.save('/content/', 'bert-base-cased-pt')

In [0]:
# for tokenizer_file in tokenizer_files:
#     print('Copying', tokenizer_file)
#     shutil.copy(tokenizer_file, '/content/drive/My Drive/PF13')

Copying /content/bert-base-cased-pt-vocab.txt


## Test Pre Trained Vocabulary

In [0]:
from itertools import islice

VOCAB_LOCATION = '/content/drive/My Drive/PF13/bert-base-cased-pt-vocab.txt'

pt_bert_tokenizer = BertTokenizer(VOCAB_LOCATION, do_lower_case=False,
                                  model_max_length=512)

list(islice(pt_bert_tokenizer.vocab.items(), 10))

[('[PAD]', 0),
 ('[unused1]', 1),
 ('[unused2]', 2),
 ('[unused3]', 3),
 ('[unused4]', 4),
 ('[unused5]', 5),
 ('[unused6]', 6),
 ('[unused7]', 7),
 ('[unused8]', 8),
 ('[unused9]', 9)]

In [0]:
print('=' * 50)
print('Tokenizer')
print('=' * 50)
print('Vocab Size=', pt_bert_tokenizer.vocab_size)
print('Max Length=', pt_bert_tokenizer.max_len)

print('Max Length Sentence Pairs=', pt_bert_tokenizer.max_len_sentences_pair)
print('Special Tokens=', pt_bert_tokenizer.special_tokens_map)
print()
print('[SEP]:', pt_bert_tokenizer.sep_token_id)
print('[PAD]:', pt_bert_tokenizer.pad_token_id)
print('[UNK]:', pt_bert_tokenizer.unk_token_id)
print('[MASK]:', pt_bert_tokenizer.mask_token_id)
print('[CLS]:', pt_bert_tokenizer.cls_token_id)
print('=' * 50)

Tokenizer
Vocab Size= 28996
Max Length= 512
Max Length Sentence Pairs= 509
Special Tokens= {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

[SEP]: 102
[PAD]: 0
[UNK]: 100
[MASK]: 103
[CLS]: 101


# References

Artetxe, Mikel, Sebastian Ruder, and Dani Yogatama. "On the cross-lingual transferability of monolingual representations." arXiv preprint arXiv:1910.11856 (2020).
