# We build a transformer for English-Chinese translation

Source: [The Annotated Transformer: English-to-Chinese Translator](https://cuicaihao.com/the-annotated-transformer-english-to-chinese-translator/)

Other resources:

- [Build your own Transformer from scratch using Pytorch](https://towardsdatascience.com/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb)
- [Vanilla-transformer Repo by Aryan Shekarlaban](https://github.com/arxyzan/vanilla-transformer?tab=readme-ov-file)
- [The Illustrated Transformer by Jay Alammar](https://jalammar.github.io/illustrated-transformer/)
- [The Original Transformer (PyTorch) by Aleksa Gordic](https://github.com/gordicaleksa/pytorch-original-transformer)
- [Attention is all you need from scratch by Aladdin Persson](https://github.com/aladdinpersson/Machine-Learning-Collection/blob/master/ML/Pytorch/more_advanced/Seq2Seq_attention/seq2seq_attention.py)
- [PyTorch Seq2Seq by Ben Trevett](https://github.com/bentrevett/pytorch-seq2seq)
- [Transformers: Attention in Disguise by Mihail Eric](https://www.mihaileric.com/posts/transformers-attention-in-disguise/)
- [The Annotated Transformer by Harvard NLP](http://nlp.seas.harvard.edu/2018/04/03/attention.html)


## High-level structure

- __Encoder__: transforms the input sentence (a list of tokens) into a numeric matrix representation.
- __Decoder__: maps that representation back to a sentence in the target language.

### Encoder structure:
1. Data preparation
    * tokenize the original sentences in the input language.
        - apply the tokenizer to each original sentence, then add `BOS` at the start and `EOS` at the end of the resulting token sequence.
    * build the vocab for each language (encoder side and decoder side), including "word-to-id" and "id-to-word" mappings.
        - covers all words in the corpus, up to a tunable maximum vocab size.
    * sort the sentences by length to reduce padding.
        - padding: sequences shorter than the max length are padded on the right with zeros.
    * split the dataset into batches.
        - define a `Batch` object to hold the __src__ and __target__ sentences.
        - `Batch` has a `mask` method.
2. Positional encoding
3. Self-attention + masking
4. Layer norm + residual
5. Feedforward layer
6. Layer norm + residual
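
The data preparation step above can be sketched in plain Python. This is only a minimal illustration, not the code from the source article: the whitespace tokenizer, the `<pad>` and `<unk>` entries, and the helper names `tokenize`, `build_vocab`, `encode`, `pad` and `make_batches` are all assumptions made for the example, and only the source side is shown (a real `Batch` would also hold the target sentences).

```python
from collections import Counter

# Hypothetical special tokens; id 0 is reserved for padding.
PAD, UNK, BOS, EOS = "<pad>", "<unk>", "BOS", "EOS"


def tokenize(sentence):
    """Toy whitespace tokenizer (assumption) that wraps the tokens in BOS/EOS."""
    return [BOS] + sentence.lower().split() + [EOS]


def build_vocab(tokenized, max_words=50000):
    """Build word-to-id and id-to-word maps, keeping the most frequent words."""
    counts = Counter(tok for sent in tokenized for tok in sent)
    word_to_id = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(max_words):
        word_to_id[word] = len(word_to_id)
    id_to_word = {i: w for w, i in word_to_id.items()}
    return word_to_id, id_to_word


def encode(sent, word_to_id):
    """Map tokens to ids, falling back to UNK for out-of-vocab words."""
    return [word_to_id.get(tok, word_to_id[UNK]) for tok in sent]


def pad(seqs, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def make_batches(seqs, batch_size):
    """Sort by length so each batch holds similar lengths, then pad per batch."""
    ordered = sorted(seqs, key=len)
    return [pad(ordered[i:i + batch_size])
            for i in range(0, len(ordered), batch_size)]


sentences = ["I like tea", "Hello", "The cat sat on the mat", "Good morning"]
tokenized = [tokenize(s) for s in sentences]
word_to_id, id_to_word = build_vocab(tokenized)
ids = [encode(s, word_to_id) for s in tokenized]
batches = make_batches(ids, batch_size=2)
```

Because the sentences are sorted by length first, the two short sentences share one batch and only need padding up to length 4, instead of being padded to the length of the longest sentence in the whole dataset.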

### Masking
- We train on mini-batches, meaning we feed multiple sentences into the model at once. The input `X` therefore has shape `[batch_size, sequence_length]`, and shorter sentences are padded with 0 up to the length of the longest sequence.
- But this causes issues for the softmax computation: the padded positions take part in it, but they shouldn't.
- So we create a mask that assigns a large negative bias to these positions. After the softmax their weights are close to 0, so they are effectively ignored in the computation.
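
The effect of the mask on the softmax can be checked with a small dependency-free sketch. It assumes padding id 0 and uses `-1e9` as the "large negative bias"; the function names `padding_mask` and `masked_softmax` are hypothetical, and a real model would apply the same idea to attention score tensors rather than Python lists.

```python
import math

PAD_ID = 0       # assumed padding id, matching "pad with zeros"
NEG_INF = -1e9   # assumed value for the large negative bias


def padding_mask(batch, pad_id=PAD_ID):
    """True where a real token sits, False where the position is padding."""
    return [[tok != pad_id for tok in row] for row in batch]


def masked_softmax(scores, mask):
    """Softmax over one row of scores, with padded positions biased away."""
    biased = [s + (0.0 if keep else NEG_INF) for s, keep in zip(scores, mask)]
    top = max(biased)                          # subtract max for stability
    exps = [math.exp(s - top) for s in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Two sentences of token ids; the second one has two padded positions.
batch = [[5, 7, 9, 2],
         [5, 3, 0, 0]]
mask = padding_mask(batch)

# Example attention scores for the second sentence.
scores = [1.0, 2.0, 3.0, 4.0]
weights = masked_softmax(scores, mask[1])
```

Without the mask, the two padded positions would receive the largest weights here because their raw scores are the highest. With the bias applied, their weights collapse to zero and the remaining weights still sum to 1.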

# Load Data

## Questions:

1. 