### Tokenization

This notebook covers tokenization with SentencePiece, an unsupervised text tokenizer developed by [Google](https://github.com/google/sentencepiece).
 
* Its vocabulary size is fixed before the neural model is trained (e.g., 8k, 16k, or 32k)
* It is used mainly for neural network-based text generation systems
* It implements subword units using byte-pair encoding (BPE) and a unigram language model
* It can be trained directly from raw sentences
* It does not depend on language-specific pre/postprocessing
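
To make the BPE idea in the list above concrete, here is a minimal pure-Python sketch of a single merge step: count adjacent symbol pairs, then fuse the most frequent pair into one new symbol. The toy corpus and the helper names (`most_frequent_pair`, `merge_pair`) are made up for illustration; this is a didactic sketch, not how SentencePiece implements BPE internally.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair in a {symbols: freq} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): words split into characters, with frequencies
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
          tuple("bun"): 4, tuple("hugs"): 5}

best = most_frequent_pair(corpus)   # the pair ('u', 'g') occurs 20 times
corpus = merge_pair(corpus, best)   # 'u' + 'g' becomes the new symbol 'ug'
print(best, corpus)
```

Repeating this step until the symbol inventory reaches the target size is what ties the vocabulary size to the number of merges.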


In [1]:
#!pip install sentencepiece

First, we load the data.

In [2]:
import json

data_dict = {}

for dataset_name in ['train', 'dev', 'test']:
    filename = f'data/preprocessed_data/{dataset_name}.json'
    
    with open(filename, 'r') as f:
        data_dict[dataset_name] = json.load(f)

train_data = data_dict['train']
dev_data = data_dict['dev']
test_data = data_dict['test']

print(f'Train data loaded: {len(train_data)} entries')
print(f'Dev data loaded: {len(dev_data)} entries')
print(f'Test data loaded: {len(test_data)} entries')


Train data loaded: 34000 entries
Dev data loaded: 3368 entries
Test data loaded: 979 entries


To train the tokenizer, we need to pass it a text file containing both source and target sentences, one sentence per line, so we prepare the data accordingly.

In [3]:
# Prepare training data
import os
os.makedirs('data/tokenizer_input', exist_ok=True)
with open('data/tokenizer_input/train.txt', 'w') as f:
    for entry in train_data:
        # One sentence per line: the source first, then its target
        f.write(entry['src'] + '\n')
        f.write(entry['tgt'] + '\n')
print('Train data saved to data/tokenizer_input/train.txt')


Train data saved to data/tokenizer_input/train.txt
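
Because the trainer reads one sentence per line, a sentence containing an embedded newline would end up split across two lines. The sketch below shows one way to guard against that by collapsing whitespace before writing and checking the resulting line count. The helper `write_tokenizer_input` and the sample entries are hypothetical; whether the preprocessed data actually contains such newlines is an assumption, and collapsing whitespace does alter the text slightly.

```python
import os
import tempfile

def write_tokenizer_input(entries, path):
    """Write each src and tgt sentence on its own line; return the lines written."""
    count = 0
    with open(path, 'w', encoding='utf-8') as f:
        for entry in entries:
            for key in ('src', 'tgt'):
                # Collapse runs of whitespace (including newlines) to single spaces
                f.write(' '.join(entry[key].split()) + '\n')
                count += 1
    return count

# Hypothetical entries standing in for train_data
sample = [
    {'src': 'I has a cat .', 'tgt': 'I have a cat .'},
    {'src': 'Line one\nline two', 'tgt': 'Line one , line two'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.txt')
    written = write_tokenizer_input(sample, path)
    with open(path, encoding='utf-8') as f:
        lines = f.read().splitlines()

# Two lines per entry, regardless of newlines inside the sentences
print(written, len(lines))
```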


After the data has been prepared, we can train the tokenizer with various vocabulary sizes and model types.

In [5]:
import sentencepiece as spm

os.makedirs('tokenizer_models', exist_ok=True)

# Configurations
vocab_sizes = [8000, 16000]
model_types = ['unigram', 'bpe']
input_file = 'data/tokenizer_input/train.txt'

# Train and save models for each configuration
for vocab_size in vocab_sizes:
    for model_type in model_types:
        model_prefix = f'tokenizer_models/{model_type}_{vocab_size}'
        spm.SentencePieceTrainer.train(input=input_file,
                                       vocab_size=vocab_size,
                                       model_prefix=model_prefix,
                                       model_type=model_type,
                                       pad_id=0, unk_id=1,
                                       bos_id=2, eos_id=3,
                                       pad_piece='[PAD]',
                                       unk_piece='[UNK]',
                                       bos_piece='[BOS]', 
                                       eos_piece='[EOS]')
        print(f'Model trained and saved: {model_prefix}.model')



Model trained and saved: tokenizer_models/unigram_8000.model
Model trained and saved: tokenizer_models/bpe_8000.model
Model trained and saved: tokenizer_models/unigram_16000.model
Model trained and saved: tokenizer_models/bpe_16000.model


The following functions add tokenization info to our data: each sentence is represented both as tokens and as ids, with the ids stored as tensors.

In [6]:
def tokenize_sentences(sentences, sp):
    tokenized_data = []
    for sentence in sentences:
        # Get token IDs including BOS and EOS
        ids = [sp.bos_id()] + sp.encode_as_ids(sentence) + [sp.eos_id()]
        # Get token pieces, framed by plain 'BOS'/'EOS' labels
        # (the vocabulary pieces themselves are '[BOS]' and '[EOS]')
        tokens = ['BOS'] + sp.encode_as_pieces(sentence) + ['EOS']
        tokenized_data.append((ids, tokens))
    return tokenized_data

In [7]:
import torch

def prepare_tokenized_data(src_sentences, tgt_sentences, sp):
    src_data = tokenize_sentences(src_sentences, sp)
    tgt_data = tokenize_sentences(tgt_sentences, sp)
    tokenized_data = []
    for i in range(len(src_sentences)):
        src_ids, src_tokens = src_data[i]
        tgt_ids, tgt_tokens = tgt_data[i]
        tokenized_data.append({
            'src': src_sentences[i],
            'tgt': tgt_sentences[i],
            'src_tokens': src_tokens,
            'tgt_tokens': tgt_tokens,
            'src_ids': torch.tensor(src_ids, dtype=torch.int64),
            'tgt_ids': torch.tensor(tgt_ids, dtype=torch.int64)
        })
    return tokenized_data

In [8]:
# Extract sentences from the training, dev, and test data
src_train = [entry['src'] for entry in train_data]
tgt_train = [entry['tgt'] for entry in train_data]

src_dev = [entry['src'] for entry in dev_data]
tgt_dev = [entry['tgt'] for entry in dev_data]

src_test = [entry['src'] for entry in test_data]
tgt_test = [entry['tgt'] for entry in test_data]

In [16]:
# Tokenize the train, dev, and test datasets using each model
tokenized_datasets = {}

for model_name in ['bpe_8000', 'bpe_16000', 'unigram_8000', 'unigram_16000']:
    print(f'Tokenizing with model: {model_name}')
    # Load the SentencePiece model
    model_filename = 'tokenizer_models/' + model_name + '.model'
    sp = spm.SentencePieceProcessor(model_file=model_filename)
    train_tokenized = prepare_tokenized_data(src_train, tgt_train, sp)
    dev_tokenized = prepare_tokenized_data(src_dev, tgt_dev, sp)
    test_tokenized = prepare_tokenized_data(src_test, tgt_test, sp)
    
    tokenized_datasets[model_name] = {
        'train': train_tokenized,
        'dev': dev_tokenized,
        'test': test_tokenized
    }

# Save the tokenized datasets to .pt files with torch.save
output_dir = 'data/tokenized_data'
os.makedirs(output_dir, exist_ok=True)

for model_name, datasets in tokenized_datasets.items():
    for dataset_type in ['train', 'dev', 'test']:
        filename = f'{output_dir}/{model_name}_{dataset_type}.pt'
        torch.save(datasets[dataset_type], filename)
        print(f'Successfully saved {dataset_type} data for {model_name} to {filename}')

Tokenizing with model: bpe_8000
Tokenizing with model: bpe_16000
Tokenizing with model: unigram_8000
Tokenizing with model: unigram_16000
Successfully saved train data for bpe_8000 to data/tokenized_data/bpe_8000_train.pt
Successfully saved dev data for bpe_8000 to data/tokenized_data/bpe_8000_dev.pt
Successfully saved test data for bpe_8000 to data/tokenized_data/bpe_8000_test.pt
Successfully saved train data for bpe_16000 to data/tokenized_data/bpe_16000_train.pt
Successfully saved dev data for bpe_16000 to data/tokenized_data/bpe_16000_dev.pt
Successfully saved test data for bpe_16000 to data/tokenized_data/bpe_16000_test.pt
Successfully saved train data for unigram_8000 to data/tokenized_data/unigram_8000_train.pt
Successfully saved dev data for unigram_8000 to data/tokenized_data/unigram_8000_dev.pt
Successfully saved test data for unigram_8000 to data/tokenized_data/unigram_8000_test.pt
Successfully saved train data for unigram_16000 to data/tokenized_data/unigram_16000_train.pt


#### Experimenting with vocabulary

This section shows some examples from the tokenized data.

In [17]:
train_data = torch.load('data/tokenized_data/bpe_8000_train.pt')

In [18]:
# Print tokenized data examples
print("Tokenized Train Data:")
print(train_data[0])

Tokenized Train Data:
{'src': 'My town is a medium size city with eighty thousand inhabitants .', 'tgt': 'My town is a medium - sized city with eighty thousand inhabitants .', 'src_tokens': ['BOS', '▁My', '▁town', '▁is', '▁a', '▁medium', '▁size', '▁city', '▁with', '▁eight', 'y', '▁thousand', '▁inhabitants', '▁.', 'EOS'], 'tgt_tokens': ['BOS', '▁My', '▁town', '▁is', '▁a', '▁medium', '▁-', '▁s', 'ized', '▁city', '▁with', '▁eight', 'y', '▁thousand', '▁inhabitants', '▁.', 'EOS'], 'src_ids': tensor([   2,  453,  703,   51,    5, 6149, 3413,  459,  103, 3198, 7946, 4656,
        5970,   11,    3]), 'tgt_ids': tensor([   2,  453,  703,   51,    5, 6149,  232,   12, 1282,  459,  103, 3198,
        7946, 4656, 5970,   11,    3])}
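
In the pieces above, `▁` marks the start of a new word, so the original sentence can be recovered by concatenating the pieces and turning each `▁` back into a space. The helper `pieces_to_text` below is a hypothetical illustration of that rule, using the `src_tokens` from the example; in practice the processor's own `decode` method would likely be the more robust choice.

```python
def pieces_to_text(pieces, specials=('BOS', 'EOS')):
    """Join subword pieces and map the '▁' word-boundary marker back to spaces."""
    body = [p for p in pieces if p not in specials]
    return ''.join(body).replace('▁', ' ').strip()

# src_tokens copied from the first training example
src_tokens = ['BOS', '▁My', '▁town', '▁is', '▁a', '▁medium', '▁size', '▁city',
              '▁with', '▁eight', 'y', '▁thousand', '▁inhabitants', '▁.', 'EOS']

text = pieces_to_text(src_tokens)
print(text)
```

Note how `▁eight` and `y` rejoin into `eighty` because the second piece carries no boundary marker.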


In [19]:
sp = spm.SentencePieceProcessor(model_file='tokenizer_models/bpe_8000.model')

In [20]:
# Returns the vocabulary size
print(sp.get_piece_size())

8000


In [21]:
print('bos=', sp.bos_id())
print('eos=', sp.eos_id())
print('unk=', sp.unk_id())
print('pad=', sp.pad_id())

bos= 2
eos= 3
unk= 1
pad= 0
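
The pad id becomes relevant once sequences of different lengths are batched together. The following is a minimal sketch, in plain Python lists, of right-padding id sequences with `pad_id=0` and building a matching mask; `pad_batch` is a hypothetical helper and the short id lists are illustrative. With tensors, `torch.nn.utils.rnn.pad_sequence` with `padding_value` set to the pad id should achieve the same effect.

```python
def pad_batch(id_lists, pad_id=0):
    """Right-pad each id list to the longest length and build a 0/1 mask."""
    max_len = max(len(ids) for ids in id_lists)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in id_lists]
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in id_lists]
    return padded, mask

# Illustrative sequences, each starting with bos (2) and ending with eos (3)
batch = [[2, 453, 703, 51, 3], [2, 453, 3]]

padded, mask = pad_batch(batch)
print(padded)
print(mask)
```

The mask lets a model ignore padded positions, for instance when computing attention or the loss.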


In [22]:
# Use `i` rather than `id` to avoid shadowing the built-in id()
vocabs = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
vocabs[:10]

['[PAD]', '[UNK]', '[BOS]', '[EOS]', '▁t', '▁a', 'he', 'in', 're', '▁w']

In [23]:
type(train_data[0]["src_ids"])

torch.Tensor