# **Torchtext + Dataloaders Tutorial**

#### Requirements
* Python >=3.7, <=3.10
* PyTorch >=1.13.0
* Torchtext >= 0.13.0
* Torchdata >= 0.5
* SpaCy's [`en_core_web_sm`](https://spacy.io/models/en) and [`de_core_news_sm`](https://spacy.io/models/de) models 

## **Torchtext**

### Import dataset

In [5]:
from torchtext.datasets import Multi30k

train_iter = iter(Multi30k(split='train'))
de_sent, eng_sent = next(train_iter)
print(de_sent)
print(eng_sent)

Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.


### Load Tokenizer

The `torchtext.data.utils` module provides the `get_tokenizer` function, which loads different tokenizers depending on the use case.

In [6]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(eng_sent)
print(tokens)

['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


The `tokenizer` argument of `get_tokenizer` accepts any of the following:
- `None`: returns a function that applies Python's `split()` to the input string.
- `"basic_english"`: returns a function that normalizes and then splits the sentence (see [source](https://pytorch.org/text/stable/_modules/torchtext/data/utils.html#get_tokenizer)).
- A callable (e.g., `def mytokenizer`): returns that callable unchanged.
- The name of a tokenizer library (e.g., `"spacy"`): returns a tokenizer function from that library.
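
To make the `basic_english` option more concrete, here is a rough, standard-library-only approximation of what it does: lowercase the text, put spaces around punctuation, and split on whitespace. The function name `simple_basic_english` and the reduced set of punctuation rules are assumptions made for illustration; the actual torchtext implementation applies a longer list of regex replacements.

```python
import re

def simple_basic_english(line):
    """Rough approximation of the "basic_english" normalization (not the real one)."""
    line = line.lower()
    # Surround common punctuation with spaces so each mark becomes its own token.
    line = re.sub(r"([.,!?()])", r" \1 ", line)
    # str.split() with no arguments also collapses runs of whitespace.
    return line.split()

print(simple_basic_english("Two young, White males are outside near many bushes."))
```

For this particular sentence, the sketch yields the same tokens as the `basic_english` output shown above.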

In [7]:
spacy_en_tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
tokens = spacy_en_tokenizer(eng_sent)
print(tokens)



['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


SpaCy comes with multilingual support, so let's try tokenizing our source sentence in German:

In [8]:
spacy_de_tokenizer = get_tokenizer("spacy", language="de_core_news_sm")
tokens = spacy_de_tokenizer(de_sent)
print(tokens)

['Zwei', 'junge', 'weiße', 'Männer', 'sind', 'im', 'Freien', 'in', 'der', 'Nähe', 'vieler', 'Büsche', '.']


In [9]:
def mytokenizer(sent):
    return sent.upper().split()

my_tokenizer = get_tokenizer(mytokenizer)
tokens = my_tokenizer(eng_sent)
print(tokens)

['TWO', 'YOUNG,', 'WHITE', 'MALES', 'ARE', 'OUTSIDE', 'NEAR', 'MANY', 'BUSHES.']


Finally, we can also generate n-gram sequences using `ngrams_iterator`:

In [10]:
from torchtext.data.utils import ngrams_iterator

list(ngrams_iterator(tokens, 2))

['TWO',
 'YOUNG,',
 'WHITE',
 'MALES',
 'ARE',
 'OUTSIDE',
 'NEAR',
 'MANY',
 'BUSHES.',
 'TWO YOUNG,',
 'YOUNG, WHITE',
 'WHITE MALES',
 'MALES ARE',
 'ARE OUTSIDE',
 'OUTSIDE NEAR',
 'NEAR MANY',
 'MANY BUSHES.']
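
Note that the output contains every unigram first, followed by every bigram. A minimal pure-Python sketch of that behavior might look like the following; `simple_ngrams` is a hypothetical name used for illustration and is not part of torchtext.

```python
def simple_ngrams(tokens, n):
    """Yield all unigrams, then all 2-grams, and so on up to n-grams."""
    # Unigrams come first, in their original order.
    for token in tokens:
        yield token
    # Then every contiguous window of size 2..n, joined with a space.
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])

print(list(simple_ngrams(["TWO", "YOUNG,", "WHITE"], 2)))
# ['TWO', 'YOUNG,', 'WHITE', 'TWO YOUNG,', 'YOUNG, WHITE']
```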

### More advanced Tokenizers

Torchtext also supports more sophisticated tokenizers (again, depending on our use case!). Let's start with the [SentencePiece](https://github.com/google/sentencepiece) tokenizer. It supports two segmentation algorithms, [Byte-Pair Encoding (BPE)](http://www.aclweb.org/anthology/P16-1162) and the [unigram language model](https://arxiv.org/abs/1804.10959), and is an effective way to address the open-vocabulary problem that is common in Neural Machine Translation (and other NLP tasks).

Furthermore, since `SentencePiece` treats sentences simply as sequences of Unicode characters, it has no language-dependent logic, which makes it practical for multilingual tokenization tasks (see the [docs](https://github.com/google/sentencepiece) and the [paper](https://aclanthology.org/D18-2012/)).

* Note: If you're interested in a deep dive on SentencePiece, here's a [good resource](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15) to consult.

In [11]:
from torchtext.transforms import SentencePieceTokenizer

# using default SPM model by Torch
spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"
sp_tokenizer = SentencePieceTokenizer(spm_model_path)

print(sp_tokenizer([eng_sent, de_sent]))

[['▁Two', '▁young', ',', '▁White', '▁male', 's', '▁are', '▁outside', '▁near', '▁many', '▁bu', 'shes', '.'], ['▁Zwei', '▁junge', '▁weiß', 'e', '▁Männer', '▁sind', '▁im', '▁Frei', 'en', '▁in', '▁der', '▁Nähe', '▁viel', 'er', '▁Bü', 'sche', '.']]


As you can see, we just introduced a new object into our workflow: `transform`. It comes from the `torchtext.transforms` module, which offers a powerful approach to declare text processing pipelines sequentially. More on this later. 

In [12]:
type(sp_tokenizer)

torchtext.transforms.SentencePieceTokenizer

In [13]:
from torchtext.transforms import CLIPTokenizer


In [14]:
CLIPTokenizer

torchtext.transforms.CLIPTokenizer

`SentencePieceTokenizer` is just one of the many tokenizers we can load. Let's take a look at some others.

`CLIPTokenizer` treats spaces as part of the tokens (a bit like SentencePiece), so a word may be encoded differently depending on whether it appears at the beginning of the sentence (without a preceding space) or not ([see docs](https://pytorch.org/text/stable/transforms.html#cliptokenizer)).

In [9]:
from torchtext.transforms import CLIPTokenizer

MERGES_FILE = "http://download.pytorch.org/models/text/clip_merges.bpe"
ENCODER_FILE = "http://download.pytorch.org/models/text/clip_encoder.json"

clip_tokenizer = CLIPTokenizer(
    merges_path=MERGES_FILE, 
    encoder_json_path=ENCODER_FILE,
    return_tokens=True
)

clip_tokenizer(eng_sent)

['two</w>',
 'young</w>',
 ',</w>',
 'white</w>',
 'males</w>',
 'are</w>',
 'outside</w>',
 'near</w>',
 'many</w>',
 'bushes</w>',
 '.</w>']

In [10]:
from torchtext.transforms import BERTTokenizer

VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"

bert_tokenizer = BERTTokenizer(
    vocab_path=VOCAB_FILE, 
    do_lower_case=True, 
    return_tokens=True
)

print(f"Single sentence output: {bert_tokenizer(eng_sent)}")
print()
print(f"Batch sentence output: {bert_tokenizer([eng_sent, de_sent])}")

Single sentence output: ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']

Batch sentence output: [['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.'], ['z', '##wei', 'jung', '##e', 'wei', '##ße', 'manner', 'sin', '##d', 'im', 'fr', '##ei', '##en', 'in', 'der', 'nah', '##e', 'vie', '##ler', 'busch', '##e', '.']]





### Building our vocabulary

Using the `vocab` factory function from the `torchtext.vocab` module, we can build a `Vocab` object by passing just a few parameters:

* `ordered_dict` – Ordered Dictionary mapping tokens to their corresponding frequencies
* `min_freq` – The minimum frequency needed to include a token in the vocabulary
* `specials` – Special symbols to add. The order of supplied tokens will be preserved
* `special_first` – Indicates whether to insert the special symbols at the beginning or at the end
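
To illustrate how these four parameters interact, here is a small pure-Python sketch of the resulting index order. `build_itos` is a hypothetical helper written for this explanation, not the torchtext implementation, and the toy frequencies are made up.

```python
from collections import OrderedDict

def build_itos(ordered_dict, min_freq=1, specials=None, special_first=True):
    """Return the index-to-token list implied by the four parameters above."""
    specials = list(specials or [])
    # Keep tokens that are frequent enough; specials are handled separately.
    tokens = [tok for tok, freq in ordered_dict.items()
              if freq >= min_freq and tok not in specials]
    # Specials keep their given order, at the front or at the back.
    return specials + tokens if special_first else tokens + specials

freqs = OrderedDict([("a", 5), ("in", 3), ("dog", 1)])
itos = build_itos(freqs, min_freq=2, specials=["<PAD>", "<unk>"])
stoi = {tok: idx for idx, tok in enumerate(itos)}

print(itos)       # ['<PAD>', '<unk>', 'a', 'in']
print(stoi["a"])  # 2
```

Because the remaining tokens follow the insertion order of the dictionary, sorting it by frequency beforehand gives the most common tokens the lowest non-special indices.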

In [4]:
from torchtext.vocab import vocab
vocab

<function torchtext.vocab.vocab_factory.vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) -> torchtext.vocab.vocab.Vocab>

In [15]:
from collections import OrderedDict, Counter
from torchtext.vocab import vocab

# recreating our iterator object
train_iter = iter(Multi30k(split='train'))

en_counter, de_counter = Counter(), Counter()
for paired_sents in train_iter:
    de_sent, en_sent = paired_sents
    en_counter.update(sp_tokenizer(en_sent))
    de_counter.update(sp_tokenizer(de_sent))

In [16]:
en_counter.most_common(5)

[('▁a', 32023), ('.', 27598), ('▁A', 17479), ('▁in', 14947), ('s', 13067)]

Even though we could use our counter dictionary as input to the `vocab` constructor, we should use `OrderedDict` since this data structure preserves the order in which keys and values were inserted when iterating over it. Also, if a new entry overwrites an existing entry, then the order of items is left unchanged.

Since token frequencies can be relevant when doing machine translation, we should sort our counter by frequencies in descending order and then feed this mapping into `OrderedDict`.

In [17]:
ordered_en = OrderedDict(sorted(en_counter.items(), key=lambda x: x[1], reverse=True))
ordered_de = OrderedDict(sorted(de_counter.items(), key=lambda x: x[1], reverse=True))

In [18]:
list(ordered_en.items())[:5]

[('▁a', 32023), ('.', 27598), ('▁A', 17479), ('▁in', 14947), ('s', 13067)]

In [19]:
en_vocab = vocab(
    ordered_en, 
    min_freq=1, 
    specials=('<BOS>', '<EOS>', '<PAD>', '<unk>')
)

de_vocab = vocab(
    ordered_de, 
    min_freq=1, 
    specials=('<BOS>', '<EOS>', '<PAD>', '<unk>')
)

With a `Vocab` object ([see docs](https://pytorch.org/text/stable/vocab.html#torchtext.vocab.Vocab)), you can do things like:
*   Get the total length of the vocabulary
*   Generate mappings: String2Index (stoi) and Index2String (itos)
*   Build a purpose-specific vocabulary that only keeps tokens appearing at least N times (via `min_freq`)

In [20]:
print("The length of the English vocab is", len(en_vocab))
en_stoi = en_vocab.get_stoi()
print("The index of '<BOS>' is", en_stoi['<BOS>'])
en_itos = en_vocab.get_itos()
print(f"The token at index 200 is '{en_itos[200]}'")
print(f"Special tokens: {en_itos[:4]}")
print(f"5 most common tokens: {en_itos[4:9]}")
print(f"5 least common tokens: {en_itos[-5:]}")

The length of the English vocab is 7367
The index of '<BOS>' is 0
The token at index 200 is 'a'
Special tokens: ['<BOS>', '<EOS>', '<PAD>', '<unk>']
5 most common tokens: ['▁a', '.', '▁A', '▁in', 's']
5 least common tokens: ['fit', 'ig', 'rah', '▁maj', '▁scroll']


### Using `Sequential` to define a text processing pipeline

So far, we have looked at how individual components work. But what if we wanted to define them as part of a text processing pipeline? Here's where `transforms.Sequential` can be very useful. But before jumping into defining our pipeline, let's take a look at some of the other features in `transforms` that can help us extend our text processing beyond what we have done so far ([see docs](https://pytorch.org/text/stable/transforms.html)).

In [17]:
from torchtext.transforms import Sequential, VocabTransform, Truncate, AddToken

max_seq_len = 512
# hard-coded for demonstration; <BOS> and <EOS> already sit at indices 0 and 1 in both vocabs (see `specials`)
bos_idx = 0  
eos_idx = 1

en_text_transform = Sequential(
    SentencePieceTokenizer(spm_model_path),
    VocabTransform(en_vocab),
    Truncate(max_seq_len - 2),
    AddToken(token=bos_idx, begin=True),
    AddToken(token=eos_idx, begin=False)
)

de_text_transform = Sequential(
    SentencePieceTokenizer(spm_model_path),
    VocabTransform(de_vocab),
    Truncate(max_seq_len - 2),
    AddToken(token=bos_idx, begin=True),
    AddToken(token=eos_idx, begin=False)
)

def apply_transform(x):
    return de_text_transform(x[0]), en_text_transform(x[1])
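
Conceptually, each pipeline above chains four steps: tokenize, look up indices, truncate, and wrap with `<BOS>`/`<EOS>`. The following standard-library sketch mimics that chain on a toy vocabulary; `make_pipeline`, `toy_stoi`, and the whitespace tokenizer are assumptions for illustration only and stand in for the torchtext transforms.

```python
def make_pipeline(tokenize, stoi, max_seq_len, bos_idx, eos_idx):
    """Compose tokenize -> index lookup -> truncate -> add BOS/EOS."""
    def pipeline(sentence):
        ids = [stoi[token] for token in tokenize(sentence)]
        # Reserve two positions so the final length never exceeds max_seq_len.
        ids = ids[: max_seq_len - 2]
        return [bos_idx] + ids + [eos_idx]
    return pipeline

toy_stoi = {"<BOS>": 0, "<EOS>": 1, "two": 2, "young": 3, "males": 4}
toy_pipeline = make_pipeline(str.split, toy_stoi, max_seq_len=4, bos_idx=0, eos_idx=1)

print(toy_pipeline("two young males"))  # [0, 2, 3, 1]
```

This also shows why `Truncate` receives `max_seq_len - 2`: the two `AddToken` steps that follow each add one more position.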

In [18]:
train_datapipe = Multi30k(split='train')
train_datapipe = train_datapipe.map(apply_transform)


Let's see how our first sentence pair looks after running it through the pipeline:

```
Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.
```


In [19]:
train_iter_seq = iter(train_datapipe)
num_de, num_en = next(train_iter_seq)
print(num_de)
print(num_en)

[0, 28, 105, 43, 17, 41, 143, 29, 139, 8, 7, 19, 186, 1541, 24, 1227, 240, 4, 1]
[0, 22, 29, 18, 1399, 202, 8, 20, 69, 97, 454, 259, 892, 5, 1]


In [20]:
# we had only defined mappings on our English vocab
de_itos = de_vocab.get_itos()
de_stoi = de_vocab.get_stoi()

print(" ".join([de_itos[i] for i in num_de]))
print(" ".join([en_itos[i] for i in num_en]))

<BOS> ▁Zwei ▁junge ▁weiß e ▁Männer ▁sind ▁im ▁Frei en ▁in ▁der ▁Nähe ▁viel er ▁Bü sche . <EOS>
<BOS> ▁Two ▁young , ▁White ▁male s ▁are ▁outside ▁near ▁many ▁bu shes . <EOS>


Now that we have our text processing pipeline ready to roll, let's see how we can turn it into a trainable dataset.

## **Datasets & Dataloaders**

In [21]:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
import io

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


The first thing we need to do is convert the numericalized data produced by the processing pipeline into `torch` tensors.

In [22]:
# Adapted from: https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html

BATCH_SIZE = 128
PAD_IDX = de_stoi['<PAD>']
BOS_IDX = de_stoi['<BOS>']
EOS_IDX = de_stoi['<EOS>']

def data_process(datapipe):
  """Converts numericalized inputs into Torch tensors."""
  iter_seq = iter(datapipe)
  data = []
  for num_de, num_en in iter_seq:
    de_tensor_ = torch.tensor(num_de, dtype=torch.long)
    en_tensor_ = torch.tensor(num_en, dtype=torch.long)
    data.append((de_tensor_, en_tensor_))
  return data

train_datapipe = Multi30k(split='train').map(apply_transform)

train_data = data_process(train_datapipe)
train_data[0]



(tensor([   0,   28,  105,   43,   17,   41,  143,   29,  139,    8,    7,   19,
          186, 1541,   24, 1227,  240,    4,    1]),
 tensor([   0,   22,   29,   18, 1399,  202,    8,   20,   69,   97,  454,  259,
          892,    5,    1]))

Now, the next step is to generate a data batch using `DataLoader`. This object combines a dataset and a sampler, and provides an iterable over the given dataset ([see docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)).

There are a couple of things we need to do first:
* Be sure to convert our special tokens `<BOS>` and `<EOS>` into tensors and fit them into the sequence where they belong using `cat`.
* Generate fixed-length tensors using padding.

The function `generate_batch` takes care of that for us.

In [23]:
def generate_batch(data_batch):
  """Generates batch of Torch tensors."""
  de_batch, en_batch = [], []
  for (de_item, en_item) in data_batch:
    de_batch.append(torch.cat([torch.tensor([BOS_IDX]), de_item, torch.tensor([EOS_IDX])], dim=0))
    en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
  de_batch = pad_sequence(de_batch, padding_value=PAD_IDX)
  en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
  return de_batch, en_batch
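
By default (`batch_first=False`), `pad_sequence` returns a tensor shaped `(max_len, batch)`, which is why the batch sizes printed further down have the batch dimension second. The pure-Python sketch below reproduces that padding-and-transposing logic with plain lists; `pad_time_major` is a hypothetical name used only for this illustration.

```python
def pad_time_major(sequences, padding_value):
    """Pad lists to a common length and return them as (max_len, batch)."""
    max_len = max(len(seq) for seq in sequences)
    # Right-pad every sequence with the padding value.
    padded = [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (batch, max_len) to (max_len, batch).
    return [list(step) for step in zip(*padded)]

batch = pad_time_major([[0, 5, 6, 1], [0, 7, 1]], padding_value=2)
print(batch)                      # [[0, 0], [5, 7], [6, 1], [1, 2]]
print(len(batch), len(batch[0]))  # 4 2
```

Reading one column of `batch` (for example `[row[1] for row in batch]`) recovers a single padded sequence, which mirrors indexing a tensor batch with `[:, i]`.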

Now we are ready to instantiate `DataLoader`, which we can do by passing the following parameters into the constructor:

* `dataset`: Dataset from which to load the data
* `batch_size`: How many samples per batch to load (defaults to 1)
* `shuffle`: Reshuffles data at every epoch (defaults to `False`)
* `collate_fn`: Callable that merges a list of samples into a mini-batch; used when batch-loading from a map-style dataset

Take a look at the [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) to explore the full range of parameters you can pass to `DataLoader`.
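
As a mental model for how these four parameters fit together, here is a heavily simplified, standard-library-only sketch. `simple_loader` is a hypothetical stand-in, not how `DataLoader` is implemented: the real class also handles samplers, worker processes, reshuffling at every epoch, and much more.

```python
import random

def simple_loader(dataset, batch_size=1, shuffle=False, collate_fn=list, seed=0):
    """Yield collated batches from an indexable (map-style) dataset."""
    indices = list(range(len(dataset)))
    if shuffle:
        # A fixed seed keeps this toy example reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        # collate_fn turns a list of samples into one batch object.
        yield collate_fn(samples)

toy_data = [(i, i * 10) for i in range(5)]
batches = list(simple_loader(toy_data, batch_size=2))
print(batches)  # [[(0, 0), (1, 10)], [(2, 20), (3, 30)], [(4, 40)]]
```

The last batch is smaller than `batch_size` because five samples do not divide evenly into batches of two.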

In [28]:
train_iter = DataLoader(
  train_data, 
  batch_size=BATCH_SIZE,
  shuffle=True, 
  collate_fn=generate_batch
)

de_batch, en_batch = next(iter(train_iter))
print(f"Size of de_batch: {de_batch.size()}")
print(f"Size of en_batch: {en_batch.size()}")

Size of de_batch: torch.Size([46, 128])
Size of en_batch: torch.Size([48, 128])


This is what one example within the minibatch looks like. Please note that the indices in this example may differ from those in `train_data[0]` because the data was shuffled.

Also, note that we are getting two 0s and two 1s, which correspond to `<BOS>` and `<EOS>` respectively. This is redundant: the text preprocessing pipeline defined with `Sequential` already adds these tokens through `AddToken`, and `generate_batch` then concatenates them a second time. This was done for demonstration purposes, and **only one of the two approaches should be used.**

In [29]:
de_batch[:, 0]

tensor([   0,    0,   15, 1657,   96,   91,   12,    6, 2204,    4,    1,    1,
           2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
           2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
           2,    2,    2,    2,    2,    2,    2,    2,    2,    2])
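
The duplication can be reproduced in isolation with plain lists. In this small sketch, `add_bos_eos` is a hypothetical helper that stands in for both the `AddToken` steps and the concatenation inside `generate_batch`; the token ids are taken from the output above.

```python
BOS_IDX, EOS_IDX = 0, 1

def add_bos_eos(ids):
    """Wrap a list of token ids with the BOS and EOS indices."""
    return [BOS_IDX] + list(ids) + [EOS_IDX]

token_ids = [15, 1657, 96]
after_pipeline = add_bos_eos(token_ids)      # what the transform pipeline produces
after_collate = add_bos_eos(after_pipeline)  # what wrapping again at batch time produces

print(after_pipeline)  # [0, 15, 1657, 96, 1]
print(after_collate)   # [0, 0, 15, 1657, 96, 1, 1]
```

Dropping either of the two wrapping steps leaves a single `<BOS>` and a single `<EOS>` per sequence.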