# Train Tokenizer and Prepare Dataset

This notebook presents the process for creating the PictoBERT tokenizer and preparing the dataset.



## Dataset
Because our task is word-sense language modeling, we need a dataset labeled with word senses. Moreover, because the task consists of predicting word senses in sequence, every noun, verb, adjective, and adverb in the dataset must be labeled. The well-known dataset that comes closest to this is SemCor 3.0 \cite{miller1993semantic}, which is labeled with senses from WordNet 3.0 and contains 20 thousand annotated sentences. However, it is too small for BERT pre-training; BERT was originally trained on a 3,300M-word dataset. In addition, SemCor consists of formal rather than conversational text, and we consider conversational text more relevant for a conversational task such as pictogram prediction.

The Child Language Data Exchange System (CHILDES) \cite{macwhinney2014childes} is a multilingual corpus of about 2 million sentences of transcribed children's speech. Because it consists of conversational data, we decided to use it as the training dataset. To make this possible, we labeled part of CHILDES with word senses using SupWSD \cite{papandreaetal:EMNLP2017Demos}, choosing sentences in North American English. The result is a labeled corpus of 955 thousand sentences that we call SemCHILDES (Semantic CHILDES).

This [Notebook](https://github.com/jayralencar/pictoBERT/blob/main/SemCHILDES.ipynb) presents the procedure for building SemCHILDES.



### Download Dataset

The dataset used in this notebook can be downloaded [here](https://drive.google.com/file/d/18xuy-PmffJxTgG76x5nio9f18lCjE_kL/view?usp=sharing) or by running the following cell.

In [1]:
!wget http://jayr.clubedosgeeks.com.br/pictobert/all_mt_2.txt

Downloading...
From: https://drive.google.com/uc?id=18xuy-PmffJxTgG76x5nio9f18lCjE_kL
To: /content/all_mt_2.txt
52.5MB [00:01, 47.0MB/s]


In [3]:
# Read one example per line, stripping trailing whitespace and newlines
with open("./all_mt_2.txt", "r") as f:
    examples = [s.rstrip() for s in f]
len(examples)

955489

## Training Tokenizer

To use a different vocabulary with BERT, we have to train a new tokenizer. Data must be tokenized before it is fed into a language model. Tokenization consists of splitting a sentence into tokens according to some rules and then mapping those tokens to numbers, which are what the model processes. The original BERT uses a WordPiece tokenizer, which splits sentences into words or subwords (e.g., "playing" into "play" and "##ing"). To allow the use of word senses, we trained a Word Level tokenizer, which splits a sentence on whitespace only, so each sense key is kept as a single token.
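As a rough illustration of word-level tokenization, the following pure-Python sketch builds a vocabulary from whitespace-separated tokens and maps unseen tokens to `[UNK]`. It is not the tokenizer used in this notebook: the helper names `build_vocab` and `encode` are hypothetical, and the sense-key-like strings in the toy corpus are invented for the example.

```python
# Minimal pure-Python sketch of word-level tokenization (illustrative only;
# the notebook itself relies on the Hugging Face tokenizers library).

def build_vocab(sentences, unk_token="[UNK]"):
    """Assign an integer id to every whitespace-separated token."""
    vocab = {unk_token: 0}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab, unk_token="[UNK]"):
    """Split on whitespace and look up each token, falling back to [UNK]."""
    return [vocab.get(token, vocab[unk_token]) for token in sentence.split()]

# Hypothetical sense-key-like tokens, invented for this example.
corpus = [
    "i want%2:37:00:: water%1:27:00::",
    "i drink%2:34:00:: water%1:27:00::",
]
vocab = build_vocab(corpus)

# The last token was never seen during vocabulary building, so it maps to [UNK].
ids = encode("i want%2:37:00:: juice%1:13:00::", vocab)
print(ids)
```

Because the split is on whitespace only, a token such as `want%2:37:00::` is never broken into subwords, which is the property the Word Level tokenizer provides for sense keys.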

We use the Hugging Face `tokenizers` library.

In [2]:
!pip install tokenizers

Collecting tokenizers
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.10.2


### Create Tokenizer

In [4]:
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.processors import BertProcessing

sense_tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
sense_tokenizer.add_special_tokens(["[SEP]", "[CLS]", "[PAD]", "[MASK]","[UNK]"])
sense_tokenizer.pre_tokenizer = WhitespaceSplit()

sep_token = "[SEP]"
cls_token = "[CLS]"
pad_token = "[PAD]"
unk_token = "[UNK]"
sep_token_id = sense_tokenizer.token_to_id(str(sep_token))
cls_token_id = sense_tokenizer.token_to_id(str(cls_token))
pad_token_id = sense_tokenizer.token_to_id(str(pad_token))
unk_token_id = sense_tokenizer.token_to_id(str(unk_token))


sense_tokenizer.post_processor = BertProcessing(
                (str(sep_token), sep_token_id), (str(cls_token), cls_token_id)
            )

### Train tokenizer

In [6]:
from tokenizers.trainers import WordLevelTrainer
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
sense_tokenizer.train_from_iterator(examples, trainer=trainer)
print("Vocab size: ", sense_tokenizer.get_vocab_size())

Vocab size:  13584


### Save tokenizer

It is necessary to export the created tokenizer so that it can be used later. If you want to use a tokenizer different from the one we used for training PictoBERT, download the JSON file and upload it in the notebooks for the next steps (create model, train).

In [7]:
sense_tokenizer.save("./senses_tokenizer.json")

## Dataset Preparation

We load the trained tokenizer and the dataset, then encode and split the data.

### Split Data

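We split the data 98/1/1 into train, test, and validation sets. To change this, alter `TEST_SIZE` below; it is the total held-out fraction, which is then divided equally between the test and validation sets.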

In [15]:
TEST_SIZE = 0.02
from sklearn.model_selection import train_test_split
train_idx, val_idx = train_test_split(list(range(len(examples))), test_size=TEST_SIZE, random_state=32)
test_idx, val_idx = train_test_split(val_idx, test_size=0.5, random_state=3)

In [16]:
import numpy as np
train_examples = np.array(examples).take(train_idx)
val_examples = np.array(examples).take(val_idx)
test_examples = np.array(examples).take(test_idx)
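To make the split proportions concrete, the sketch below reproduces the arithmetic of the two-stage split without scikit-learn. It assumes that a fractional held-out size is rounded up to a whole number of examples, which is our understanding of how `train_test_split` treats a float `test_size`; the helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_examples, test_size=0.02):
    """Return (n_train, n_test, n_val) for a two-stage split.

    Assumption: the held-out portion is rounded up at each stage.
    """
    held_out = math.ceil(n_examples * test_size)  # first split: train vs. held-out
    n_train = n_examples - held_out
    n_val = math.ceil(held_out * 0.5)             # second split: half of held-out
    n_test = held_out - n_val
    return n_train, n_test, n_val

sizes = split_sizes(955489)
print(sizes)
print([round(100 * s / 955489, 2) for s in sizes])  # percentages per split
```

With the 955,489 examples loaded above and `TEST_SIZE = 0.02`, this gives roughly 98% of the sentences for training and about 1% each for test and validation.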

### Load tokenizer

It is necessary to load the trained tokenizer using the `PreTrainedTokenizerFast` class from the Hugging Face Transformers library.

To ensure the success of this demonstration, the next cell downloads the [final tokenizer](https://drive.google.com/file/d/1-2g-GCxjBwESqDn3JByAJABU9Dkuqy0m/view?usp=sharing) used for training PictoBERT.




In [9]:
!wget http://jayr.clubedosgeeks.com.br/pictobert/childes_all_new.json

Downloading...
From: https://drive.google.com/uc?id=1-2g-GCxjBwESqDn3JByAJABU9Dkuqy0m
To: /content/childes_all_new.json
100% 332k/332k [00:00<00:00, 46.3MB/s]


In [11]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Installing collected packages: sacremoses, transformers
Successfully installed sacremoses-0.0.45 transformers-4.5.1


In [13]:
TOKENIZER_PATH = "./childes_all_new.json" # you can change this path to use your custom tokenizer

from transformers import PreTrainedTokenizerFast

loaded_tokenizer = PreTrainedTokenizerFast(tokenizer_file=TOKENIZER_PATH)
loaded_tokenizer.pad_token = "[PAD]"
loaded_tokenizer.sep_token = "[SEP]"
loaded_tokenizer.mask_token = "[MASK]"
loaded_tokenizer.cls_token = "[CLS]"
loaded_tokenizer.unk_token = "[UNK]"

### Tokenizer function

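This function encodes the examples using the tokenizer. It returns two encodings: one with special tokens, padded or truncated to a fixed length, and one without special tokens, which is later stored under the `ngrams` key. We use a sequence length of 32, but you can change this value.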

In [14]:
max_len = 32

def tokenize_function(tokenizer,examples):
    # Remove empty lines
    examples = [line for line in examples if len(line) > 0 and not line.isspace()]
    bert = tokenizer(
        examples,
        padding="max_length",
        max_length=max_len,
        return_special_tokens_mask=True,
        truncation=True
    )
    ngram = tokenizer(examples,add_special_tokens=False).input_ids
    return bert,ngram

In [17]:
train_tokenized_examples, train_ngram = tokenize_function(loaded_tokenizer,train_examples)
val_tokenized_examples, val_ngram = tokenize_function(loaded_tokenizer,val_examples)
test_tokenized_examples, test_ngram = tokenize_function(loaded_tokenizer,test_examples)
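The effect of `padding="max_length"`, `truncation=True`, and `return_special_tokens_mask=True` can be sketched in plain Python. This is only an approximation of what the tokenizer returns for a single sentence: the function `encode_fixed_length` and all token ids are hypothetical, and we assume that padding positions are flagged as special tokens in `special_tokens_mask`.

```python
def encode_fixed_length(token_ids, max_len, cls_id, sep_id, pad_id):
    """Wrap ids in [CLS] ... [SEP], then truncate or pad to exactly max_len."""
    body = token_ids[: max_len - 2]           # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    n_pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 for real tokens (including [CLS]/[SEP]), 0 for padding
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
        # 1 for [CLS], [SEP], and padding; 0 for ordinary tokens
        "special_tokens_mask": [1] + [0] * len(body) + [1] + [1] * n_pad,
    }

# Hypothetical ids: [SEP]=0, [CLS]=1, [PAD]=2, ordinary tokens 7, 8, 9.
encoded = encode_fixed_length([7, 8, 9], max_len=8, cls_id=1, sep_id=0, pad_id=2)
for key, value in encoded.items():
    print(key, value)

# A sentence longer than max_len - 2 tokens is cut before [SEP] is appended.
truncated = encode_fixed_length(list(range(10, 30)), max_len=8, cls_id=1, sep_id=0, pad_id=2)
print(truncated["input_ids"])
```

Every encoded sentence therefore has the same length, which is what allows the examples to be batched later.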

### Save data

We gather the encoded data into dicts and save them using pickle.

In [18]:
def make_dict(examples, ngrams):
    # Collect the fields needed for training into a plain, picklable dict
    return {
        "input_ids": examples.input_ids,
        "attention_mask": examples.attention_mask,
        "special_tokens_mask": examples.special_tokens_mask,
        "ngrams": ngrams,
    }

In [19]:
import pickle

TRAIN_DATA_PATH = "./train_data.pt"
TEST_DATA_PATH = "./test_data.pt"
VAL_DATA_PATH = "./val_data.pt"

# Save each split to its own file; `with` closes every file after writing.
for path, data in [
    (TRAIN_DATA_PATH, make_dict(train_tokenized_examples, train_ngram)),
    (VAL_DATA_PATH, make_dict(val_tokenized_examples, val_ngram)),
    (TEST_DATA_PATH, make_dict(test_tokenized_examples, test_ngram)),
]:
    with open(path, "wb") as f:
        pickle.dump(data, f)
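A later notebook can read these files back with `pickle.load`. The round trip below is a minimal sketch on a toy dict written to a temporary directory; the helpers `save_split` and `load_split` and the toy values are hypothetical, and the file name is chosen only for the example.

```python
import os
import pickle
import tempfile

def save_split(data, path):
    """Write one split to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(data, f)

def load_split(path):
    """Read one split back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy data with the same keys as the dicts saved above (values are invented).
toy = {
    "input_ids": [[1, 7, 0, 2]],
    "attention_mask": [[1, 1, 1, 0]],
    "special_tokens_mask": [[1, 0, 1, 1]],
    "ngrams": [[7]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_data.pt")
    save_split(toy, path)
    restored = load_split(path)

print(restored == toy)
print(sorted(restored.keys()))
```

Note that pickle can execute arbitrary code while loading, so only load files that you created yourself or obtained from a trusted source.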