<a href="https://colab.research.google.com/github/kairavkkp/ML-Tutorials/blob/basic-transformers/Basic-Transformer/basic_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Transformer

This notebook follows the [original TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/transformer).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [9]:
!nvidia-smi

Thu Apr  1 06:19:26 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    71W / 149W |    124MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install -q tensorflow_datasets
!pip install -q tensorflow_text


In [5]:
import tensorflow as tf
import tensorflow_text as text
import tensorflow_datasets as tfds
import os
import sys
import numpy as np

In [22]:
os.chdir('/content/drive/MyDrive/Basic-Transformer')

In [6]:
## Suppress Tensorflow warnings

import logging
logging.getLogger('tensorflow').setLevel(logging.ERROR)

In [23]:
# Download the dataset
# Portuguese-to-English sentence translation (TED talk transcripts)
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True, as_supervised=True)

train_examples, val_examples = examples['train'], examples['validation']

In [24]:
# Peek at one example pair from train_examples
for pt_examples, en_examples in train_examples.batch(1).take(1):

  print("Portuguese Example:")
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  print("\nEnglish Example:")
  for en in en_examples.numpy():
    print(en.decode('utf-8'))

Portuguese Example:
e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .

English Example:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .


In [25]:
# Tokenization and detokenization

model_name = "ted_hrlr_translate_pt_en_converter"
tf.keras.utils.get_file(f"{model_name}.zip",
                        f"https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip",
                        cache_dir='.', cache_subdir='.', extract=True)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/models/ted_hrlr_translate_pt_en_converter.zip


'././ted_hrlr_translate_pt_en_converter.zip'

In [26]:
# Loading the tokenizers from the downloaded model
tokenizers = tf.saved_model.load(model_name)

In [28]:
# Let's look at the public methods exposed by the tokenizers

# This is for the English tokenizer
[item for item in dir(tokenizers.en) if not item.startswith('_')]

['detokenize',
 'get_reserved_tokens',
 'get_vocab_path',
 'get_vocab_size',
 'lookup',
 'tokenize',
 'tokenizer',
 'vocab']

In [29]:
# This is for the Portuguese tokenizer
[item for item in dir(tokenizers.pt) if not item.startswith('_')]

['detokenize',
 'get_reserved_tokens',
 'get_vocab_path',
 'get_vocab_size',
 'lookup',
 'tokenize',
 'tokenizer',
 'vocab']

As we can see, both tokenizers expose the same methods, so we can apply the same operations to the Portuguese and English text.

In [35]:
print("Let's tokenize this string.\n")

print("Non Tokenized String: ")
for en in en_examples.numpy():
  print(en.decode('utf-8'))

print()

# Encoding the string
encoded = tokenizers.en.tokenize(en_examples)

print("Tokenized string: ")
for row in encoded.to_list():
  print(row)
print()

print("Now, let's detokenize the token IDs and see if we get the original string back.\n")

# Decoding the string
decoded = tokenizers.en.detokenize(encoded)

print('Detokenized String:')
for row in decoded.numpy():
  print(row.decode('utf-8'))

Let's tokenize this string.

Non Tokenized String: 
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .

Tokenized string: 
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]

Now, let's detokenize the token IDs and see if we get the original string back.

Detokenized String:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .


In [36]:
# Use the lookup method to map token IDs back to their token text

tokens = tokenizers.en.lookup(encoded)
tokens

<tf.RaggedTensor [[b'[START]', b'and', b'when', b'you', b'improve', b'search', b'##ability', b',', b'you', b'actually', b'take', b'away', b'the', b'one', b'advantage', b'of', b'print', b',', b'which', b'is', b's', b'##ere', b'##nd', b'##ip', b'##ity', b'.', b'[END]']]>

We can see that the word `searchability` is split into sub-word tokens: `search` and `##ability`. Likewise, `serendipity` is split into `s`, `##ere`, `##nd`, `##ip`, and `##ity`. The `##` prefix marks a piece that continues the preceding one.
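
To make the sub-word split concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy commonly used by WordPiece-style tokenizers. The function `wordpiece_split` and the toy vocabulary are hypothetical and only for illustration; they are not the downloaded tokenizer's implementation or its real vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"search", "##ability", "s", "##ere", "##nd", "##ip", "##ity", "print"}

print(wordpiece_split("searchability", toy_vocab))
print(wordpiece_split("serendipity", toy_vocab))
print(wordpiece_split("print", toy_vocab))
```

With this toy vocabulary the first two words split into the same pieces as in the `lookup` output above, while `print` stays whole because it is in the vocabulary as a single token.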


## Set up the input pipeline


In [37]:
def tokenize_pairs(pt, en):
  pt = tokenizers.pt.tokenize(pt)
  pt = pt.to_tensor()

  en = tokenizers.en.tokenize(en)
  en = en.to_tensor()

  return pt, en

In [38]:
BUFFER_SIZE = 20000
BATCH_SIZE = 64

In [39]:
def make_batches(ds):
  return (
      ds.cache()
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(tokenize_pairs, num_parallel_calls=tf.data.AUTOTUNE)
      .prefetch(tf.data.AUTOTUNE)
  )

In [41]:
# Preparing batches for Training and Validation
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)
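
In `tokenize_pairs`, `tokenize` returns a ragged batch (rows of different lengths) and `to_tensor()` pads it into a rectangular tensor, using 0 as the default fill value. The pure-Python sketch below imitates only that padding step; `pad_batch` is a hypothetical helper and the token IDs are made up.

```python
def pad_batch(rows, pad_id=0):
    """Right-pad every row with pad_id up to the length of the longest row."""
    max_len = max(len(row) for row in rows)
    return [row + [pad_id] * (max_len - len(row)) for row in rows]


# Three tokenized sentences of different lengths (made-up IDs).
ragged = [[2, 72, 117, 3], [2, 79, 3], [2, 13, 148, 80, 3]]
dense = pad_batch(ragged)

for row in dense:
    print(row)
```

Because `make_batches` calls `batch` before `map`, each batch is padded only to the length of its own longest sentence rather than to a global maximum.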

Self-attention has no built-in notion of token order: without positional information, a Transformer treats a sentence as a bag of words. A positional encoding is therefore added to the token embeddings so that the model can tell positions apart.
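
The bag-of-words behaviour can be checked numerically. The sketch below assumes a single attention head with identity projections, which is a simplification and not the tutorial's model; `self_attention` is a hypothetical helper. It shows that shuffling the input tokens only shuffles the outputs in the same way, so any order-independent summary of the outputs is unchanged.

```python
import numpy as np


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ x


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 token embeddings of width 8, no position information
perm = rng.permutation(5)         # a random reordering of the 5 tokens

out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# Shuffling the inputs shuffles the outputs identically (permutation equivariance).
print(np.allclose(out[perm], out_shuffled))
# An order-independent summary such as the mean is unchanged.
print(np.allclose(out.mean(axis=0), out_shuffled.mean(axis=0)))
```

Both checks should print `True`: nothing in the attention computation itself depends on where a token sits in the sequence.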

### Positional Encoding

$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$$
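
As a small worked example, the two formulas can be evaluated entry by entry. The helper `pe_value` and the tiny `d_model = 4` are assumptions chosen for readability, not part of the tutorial code.

```python
import math


def pe_value(pos, k, d_model):
    """Positional-encoding entry for position `pos` and embedding dimension `k`."""
    i = k // 2  # dimensions 2i and 2i+1 share the same frequency
    angle = pos / (10000 ** (2 * i / d_model))
    # Even dimensions use sin, odd dimensions use cos.
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)


d_model = 4
for pos in range(3):
    row = [round(pe_value(pos, k, d_model), 4) for k in range(d_model)]
    print(pos, row)
```

At position 0 every sine entry is 0 and every cosine entry is 1. Higher dimensions use a larger divisor, so their values change more slowly as the position grows.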

In [45]:
# Positional encoding helpers
def get_angles(pos, i, d_model):
  # Dimensions 2i and 2i+1 share one frequency, hence i // 2
  angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  # Angles for every (position, dimension) pair: shape (position, d_model)
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)
  # Apply sin to the even indices (2i)
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # Apply cos to the odd indices (2i+1)
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  # Add a leading batch axis: shape (1, position, d_model)
  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)


In [48]:
n, d = 2048, 512
pos_encoding = positional_encoding(n, d)
pos_encoding = pos_encoding[0]
pos_encoding.shape

TensorShape([2048, 512])
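
A few properties of this encoding can be sanity-checked with NumPy alone. The sketch below re-implements the computation without TensorFlow; `positional_encoding_np` is a hypothetical stand-in for the function above and omits the batch axis and the float32 cast.

```python
import numpy as np


def positional_encoding_np(position, d_model):
    """NumPy-only version of the sinusoidal encoding, shape (position, d_model)."""
    pos = np.arange(position)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])  # even indices
    angles[:, 1::2] = np.cos(angles[:, 1::2])  # odd indices
    return angles


pe = positional_encoding_np(2048, 512)
print(pe.shape)

# Position 0: every sin entry is 0 and every cos entry is 1.
print(np.allclose(pe[0, 0::2], 0.0), np.allclose(pe[0, 1::2], 1.0))

# Each (sin, cos) pair contributes 1 to the squared norm, so every row sums to d_model / 2.
print(np.allclose((pe ** 2).sum(axis=1), 256.0))

# The dot product of two rows depends only on the offset between the positions.
print(np.allclose(pe[10] @ pe[15], pe[100] @ pe[105]))
```

The last check hints at why this encoding suits attention: the similarity between two positions is a function of their relative distance, not of where they sit in the sequence.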