# Transformers

> Implementation of the Transformer architecture and some basic documentation

In [None]:
#| default_exp core

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
#| hide
import numpy as np
import random
import importlib

### 1.1 - Dataset and Preprocessing

Run the following cells to read the Shakespeare text dataset, build a list of its unique characters (such as a-z), and compute the dataset and vocabulary sizes.

* `data_size` is the total number of characters in the file.
* `vocab_size` is the number of unique characters used in the file.

In [None]:
# Read the file contents into a string; open() alone only returns a file object
data = open('../shakespeare.txt', 'r').read()

In [None]:
#| export
data = data.lower()  # lowercase so 'A' and 'a' map to the same character
chars = list(set(data))  # unique characters, in arbitrary order
data_size, vocab_size = len(data), len(chars)
print('There are %d total characters and %d unique characters in your data.' % (data_size, vocab_size))

There are 94275 total characters and 38 unique characters in your data.


* The characters are a-z (26 characters), the newline character "\n", the space, and 10 punctuation marks, for 38 in total.
* In this dataset, the newline character "\n" plays a role similar to the `<EOS>` (or "End of sentence") token.

In [None]:
#| export
chars = sorted(chars)
chars

['\n', ' ', '!', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


* `char_to_ix`: In the cell below, we'll create a Python dictionary (i.e., a hash table) to map each character to an index from 0 to 37 (that is, `vocab_size - 1`).
* `ix_to_char`: Then, we'll create a second Python dictionary that maps each index back to the corresponding character.
    -  This will help us figure out which index corresponds to which character in the probability distribution output by the softmax layer.

In [None]:
#| export
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }
ix_to_char

{0: '\n', 1: ' ', 2: '!', 3: "'", 4: '(', 5: ')', 6: ',', 7: '-', 8: '.', 9: ':', 10: ';', 11: '?', 12: 'a', 13: 'b', 14: 'c', 15: 'd', 16: 'e', 17: 'f', 18: 'g', 19: 'h', 20: 'i', 21: 'j', 22: 'k', 23: 'l', 24: 'm', 25: 'n', 26: 'o', 27: 'p', 28: 'q', 29: 'r', 30: 's', 31: 't', 32: 'u', 33: 'v', 34: 'w', 35: 'x', 36: 'y', 37: 'z'}
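
The two dictionaries can be checked with a quick round trip from text to indices and back. This is a minimal sketch: `encode` and `decode` are hypothetical helper names that do not appear in the notebook, and the tiny vocabulary is built from a toy string rather than from `shakespeare.txt` so the cell runs on its own.

```python
# Toy vocabulary built the same way as in the notebook (sorted unique characters).
chars = sorted(set("hello world\n"))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(text):
    """Hypothetical helper: map each character to its integer index."""
    return [char_to_ix[ch] for ch in text]

def decode(indices):
    """Hypothetical helper: map each index back to its character."""
    return "".join(ix_to_char[i] for i in indices)

sample = "hello"
encoded = encode(sample)
decoded = decode(encoded)
print(encoded, repr(decoded))
```

If the mapping is consistent, `decode(encode(text))` returns the original text for any string made of characters in the vocabulary.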


### 1.2 - Overview of the Model

Your model will have the following structure: 

- Initialize parameters 
- Run the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients of the loss function with respect to the parameters
    - Clip the gradients to avoid exploding gradients
    - Using the gradients, update your parameters with the gradient descent update rule.
- Return the learned parameters 
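
The clipping and update steps of the loop above can be sketched with NumPy. This is an illustration under stated assumptions, not the notebook's implementation: the function names `clip_gradients` and `update_parameters`, the dictionary layout (a gradient for parameter `W` is stored under `dW`), and the numeric values are all made up for the example.

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every gradient element-wise to the range [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

def update_parameters(parameters, gradients, learning_rate):
    """One gradient descent step: theta <- theta - learning_rate * d_theta."""
    return {
        name: parameters[name] - learning_rate * gradients["d" + name]
        for name in parameters
    }

# Assumed toy parameters and gradients, only to exercise the two functions.
parameters = {"W": np.array([[1.0, -2.0]]), "b": np.array([0.5])}
gradients = {"dW": np.array([[10.0, -0.3]]), "db": np.array([-7.0])}

clipped = clip_gradients(gradients, max_value=5.0)
updated = update_parameters(parameters, clipped, learning_rate=0.1)
print(clipped)
print(updated)
```

Element-wise clipping is only one option; clipping by the global gradient norm is a common alternative, and the source does not say which one is intended.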
    
<img src="./images/rnn.png" style="width:450px;height:300px;">
<caption><center><font color='purple'><b>Figure 1</b>: Recurrent Neural Network, similar to what you built in the previous notebook "Building a Recurrent Neural Network - Step by Step."</font></center></caption>

* At each time-step, the RNN tries to predict what the next character is, given the previous characters. 
* $\mathbf{X} = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, \ldots, x^{\langle T_x \rangle})$ is a list of characters from the training set.
* $\mathbf{Y} = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, \ldots, y^{\langle T_x \rangle})$ is the same list of characters, shifted one position ahead.
* At every time-step $t$, $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$. The target at time $t$ is the same as the input at time $t + 1$.
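
The relation $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$ can be made concrete by building one input/target pair from a short string. This is a sketch only: the toy text and the slicing scheme are assumptions, and the notebook may construct its training pairs differently (for example, with a padding value at the first time-step).

```python
# Toy text ending in the newline character, which acts like the <EOS> token.
text = "hello\n"
chars = sorted(set(text))
char_to_ix = {ch: i for i, ch in enumerate(chars)}

# Inputs are every character except the last; targets are the same
# sequence shifted by one, so Y[t] is the character that follows X[t].
X = [char_to_ix[ch] for ch in text[:-1]]
Y = [char_to_ix[ch] for ch in text[1:]]

for x_ix, y_ix in zip(X, Y):
    print(repr(chars[x_ix]), "->", repr(chars[y_ix]))
```

With this construction the last target is the newline character, so the model is also trained to predict where a line ends.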

In [None]:
#| hide
import nbdev; nbdev.nbdev_export()