# Transformers Training Example

#### Loading Data
We will use [TensorFlow Datasets](https://www.tensorflow.org/datasets) to load the Portuguese-English translation dataset built from TED Talks. First, install the library:
```bash
pip install tensorflow_datasets
```
The `datasets` folder also includes a pickle file containing the whole dataset. It can be generated like this:
```python
import tensorflow_datasets as tfds
import pickle
import json
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

dataset_ted_portuguese_english_train = []
dataset_ted_portuguese_english_val = []

for pt_examples, en_examples in train_examples.take(len(train_examples)):
    pt = pt_examples.numpy().decode('utf-8')
    en = en_examples.numpy().decode('utf-8')
    dataset_ted_portuguese_english_train.append({'english': en, 'portuguese': pt})

for pt_examples, en_examples in val_examples.take(len(val_examples)):
    pt = pt_examples.numpy().decode('utf-8')
    en = en_examples.numpy().decode('utf-8')
    dataset_ted_portuguese_english_val.append({'english': en, 'portuguese': pt})
    
# Collect both splits in a single dictionary
dataset_ted = {
    'train': dataset_ted_portuguese_english_train,
    'validation': dataset_ted_portuguese_english_val,
}

with open('nlp_ted_pt_en.pkl', 'wb') as f:
    pickle.dump(dataset_ted, f)

# Optionally, export each split as JSON
with open('../datasets/train.json', 'w') as fout:
    json.dump(dataset_ted['train'], fout)
with open('../datasets/validation.json', 'w') as fout:
    json.dump(dataset_ted['validation'], fout)
```
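
To read the pickle back later, something like the following should work. This is a minimal sketch: `load_ted_pickle` is a hypothetical helper name, and the tiny `sample` dictionary only stands in for the real data so that the snippet runs without downloading anything.

```python
import pickle
import tempfile
from pathlib import Path


def load_ted_pickle(path):
    """Load the {'train': [...], 'validation': [...]} dictionary from a pickle file."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in with the same layout as the real file (illustrative only).
sample = {
    'train': [{'english': 'but what if it were active ?',
               'portuguese': 'mas e se estes fatores fossem ativos ?'}],
    'validation': [],
}

# Round trip through a temporary file to show that the layout is preserved.
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / 'nlp_ted_pt_en.pkl'
    with open(pkl_path, 'wb') as f:
        pickle.dump(sample, f)
    loaded = load_ted_pickle(pkl_path)

print(loaded['train'][0]['portuguese'])
```

With the real file, calling `load_ted_pickle('nlp_ted_pt_en.pkl')` should return the same dictionary that was dumped above. Only unpickle files from sources you trust, because `pickle.load` can execute arbitrary code.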

### References
* https://www.tensorflow.org/text/tutorials/transformer#download_the_dataset
* https://www.tensorflow.org/text/guide/subwords_tokenizer
* https://spacy.io/usage/spacy-101
* https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
* https://towardsdatascience.com/deep-learning-for-nlp-with-pytorch-and-torchtext-4f92d69052f
* https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb
* https://pytorch.org/tutorials/beginner/translation_transformer.html
* https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
* https://stackoverflow.com/questions/66838185/build-vocab-using-spacy

In [8]:
import sys
sys.path.append('../')
from model.sublayers.transformer import TransformerEncoderDecoder
import pickle, json, torch, spacy
import numpy as np
spacy_pt = spacy.load('pt_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Device:', device)

# Training hyperparameters
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1
EPOCHS = 20
BATCH_SIZE = 64

Device: cpu
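
As a quick sanity check on these values, the sketch below assumes the model follows the standard multi-head attention layout, where `d_model` is split evenly across the heads. That is an assumption about `TransformerEncoderDecoder`, not something confirmed here, and `depth_per_head` and `ffn_expansion` are illustrative names.

```python
# Same values as in the hyperparameter cell above.
d_model = 128
dff = 512
num_heads = 8

# Assumption: each attention head works on an equal slice of d_model.
if d_model % num_heads != 0:
    raise ValueError('d_model must be divisible by num_heads')

depth_per_head = d_model // num_heads   # 128 // 8 = 16
ffn_expansion = dff // d_model          # 512 // 128 = 4

print('Depth per head:', depth_per_head)
print('Feed-forward expansion factor:', ffn_expansion)
```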


### Load and Prepare Data

In [9]:
with open('../datasets/train.json', 'rb') as f:
    train = json.load(f)
    
with open('../datasets/validation.json', 'rb') as f:
    validation = json.load(f)

In [10]:
train[1]

{'english': 'but what if it were active ?',
 'portuguese': 'mas e se estes fatores fossem ativos ?'}

#### Tokenize
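Next, we convert each sentence into a list of tokens, where every token is a string (a word or a punctuation mark). We use spaCy for this, so install it together with the Portuguese and English models: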
```bash
pip install spacy
python -m spacy download pt_core_news_sm
python -m spacy download en_core_web_sm
```

In [11]:
# Define the beginning-of-sentence, end-of-sentence, and padding tokens
BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = "<pad>"

def tokenize_pt(text):
    return [tok.text for tok in spacy_pt.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

print(tokenize_pt(train[1]['portuguese']))
print(tokenize_en(train[1]['english']))

['mas', 'e', 'se', 'estes', 'fatores', 'fossem', 'ativos', '?']
['but', 'what', 'if', 'it', 'were', 'active', '?']
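
The special tokens defined above are not used yet. A common next step is to map tokens to integer ids, wrap each sentence in `<s>` and `</s>`, and pad a batch to equal length. The sketch below is only one way to do that and is not necessarily what the rest of this notebook does: `build_vocab`, `encode`, `pad_batch`, and the `<unk>` token are hypothetical names introduced for illustration, and the token lists are hard-coded (the second one is made up) so that the snippet runs without spaCy.

```python
from collections import Counter

BOS_WORD = '<s>'
EOS_WORD = '</s>'
BLANK_WORD = '<pad>'
UNK_WORD = '<unk>'  # hypothetical: not defined in the original notebook
SPECIALS = [BLANK_WORD, UNK_WORD, BOS_WORD, EOS_WORD]


def build_vocab(token_lists, min_freq=1):
    """Map each token to an integer id; special tokens take ids 0 to 3."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    regular = sorted(tok for tok, freq in counter.items()
                     if freq >= min_freq and tok not in SPECIALS)
    return {tok: idx for idx, tok in enumerate(SPECIALS + regular)}


def encode(tokens, vocab):
    """Convert tokens to ids and wrap them in BOS/EOS ids."""
    unk_id = vocab[UNK_WORD]
    ids = [vocab.get(tok, unk_id) for tok in tokens]
    return [vocab[BOS_WORD]] + ids + [vocab[EOS_WORD]]


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Token lists as a tokenizer such as tokenize_en would return them.
tokenized_en = [
    ['but', 'what', 'if', 'it', 'were', 'active', '?'],
    ['what', 'if', '?'],  # made-up shorter sentence to show padding
]

vocab_en = build_vocab(tokenized_en)
batch = pad_batch([encode(toks, vocab_en) for toks in tokenized_en],
                  pad_id=vocab_en[BLANK_WORD])

for row in batch:
    print(row)
```

Keeping the padding token at id 0 is a design choice that makes padded positions easy to mask later. Any fixed id works as long as the model's padding mask uses the same value.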
