# Getting that last BLEU point
**By changing the way the dataset is batched**

- Neural MT models are trained on a batch of sentence pairs. 
- The way batches are made (and submitted to the optimizer's loss function) is often overlooked (but not anymore after this!).
- Usually we are advised to make random batches of sentences.
- But we have to deal with practical problems:
  - GPUs have finite RAM (currently, about 11GB)
  - Sequences are of unequal length
  - Sequences can be extremely long, or as short as the single-word translations found in dictionaries
- Before closely studying the transformer model in tensor2tensor:
  - `batch_size` used to be the number of sentences.
  - We either filtered away long sequences or truncated them to a specified `max_len`.
  - We made truly random batches! Always! Shuffle the entire dataset, then slice it into batches as we read.
- After closely studying the transformer model in tensor2tensor:
  - `batch_size` is the number of tokens on the target side.
  - We don't need to filter out or truncate long sequences as long as `max_len <= batch_size`, since even the longest sequence then fits in a batch by itself.
  - Truly random batches? Nope. We disrupt the randomness by grouping sentences of similar length into batches. Randomness is preserved at the level of batches, but inside each batch the sentences have similar lengths.
 

## Why is tensor2tensor-style batching good?

The loss of each batch is computed per token on the target side (and then normalized). Thus each target token provides a supervision signal to the optimizer.
We want more of those to get better supervision ==> the more tokens in a batch, the better.

But in practice we deal with sequences of unequal length, so we add padding at the end and mask it during the loss calculation. Every padded token is therefore wasted computation, so we have to minimize them. It is also a lost opportunity: had it been a real token, it could have provided a supervision signal.
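To make the masking concrete, here is a minimal pure-Python sketch of a per-token loss that skips padded positions. The padding id `PAD`, the function name `masked_token_loss`, and the toy numbers are assumptions for illustration; a real model would do this with tensors.

```python
import math

PAD = 0  # hypothetical padding token id

def masked_token_loss(log_probs, targets, pad_id=PAD):
    """Average negative log-likelihood over non-padding target tokens.

    log_probs[i][t] is the log-probability the model assigned to the
    reference token targets[i][t]; padded positions are skipped.
    """
    total, n_tokens = 0.0, 0
    for sent_lp, sent_tgt in zip(log_probs, targets):
        for lp, tok in zip(sent_lp, sent_tgt):
            if tok == pad_id:
                continue  # padding: computed by the model, ignored by the loss
            total += -lp
            n_tokens += 1
    return total / max(n_tokens, 1), n_tokens

# Two target sentences padded to length 3; the second has two padded slots.
targets = [[5, 7, 9], [4, PAD, PAD]]
log_probs = [[math.log(0.5)] * 3,
             [math.log(0.5), math.log(0.1), math.log(0.1)]]
loss, n_tokens = masked_token_loss(log_probs, targets)
```

Only 4 of the 6 positions contribute to `loss`; the other 2 were computed and then thrown away.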

If we make batches by counting the number of sentences, some batches will be full of tokens, but many will contain far fewer. We could reduce the padding by sorting by length, but then the batches of short sentences yield very little signal since they contain few tokens.
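The trade-off can be checked on synthetic data. The sketch below assumes made-up target lengths between 1 and 50 and a hypothetical helper `padding_stats`; it counts real tokens and padded slots per batch when batching by sentence count, in random order and in length-sorted order.

```python
import random

def padding_stats(lengths, sents_per_batch):
    """Return (real_tokens, padded_slots) per batch for sentence-count batching."""
    stats = []
    for i in range(0, len(lengths), sents_per_batch):
        batch = lengths[i:i + sents_per_batch]
        real = sum(batch)
        # Every sentence is padded up to the longest one in its batch.
        padded = max(batch) * len(batch) - real
        stats.append((real, padded))
    return stats

rng = random.Random(0)
lengths = [rng.randint(1, 50) for _ in range(1000)]  # synthetic target lengths

random_order = padding_stats(lengths, 10)
sorted_order = padding_stats(sorted(lengths), 10)

pad_random = sum(p for _, p in random_order)
pad_sorted = sum(p for _, p in sorted_order)
tokens_sorted = [r for r, _ in sorted_order]  # real tokens per sorted batch

print("padded slots, random order:", pad_random)
print("padded slots, sorted order:", pad_sorted)
print("tokens in first / last sorted batch:", tokens_sorted[0], tokens_sorted[-1])
```

Sorting removes most of the padding, but the first sorted batches carry only a handful of tokens while the last ones carry many, which is the imbalance described above.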

- Instead of batching by sentence count, we should batch by token count on the target side.
  This guarantees that each batch's loss is over approximately the same number of tokens (which may help the optimizer).

- We can minimize wasted computation by minimizing the padded tokens. How? By making batches in which all sentences have (nearly) the same number of target tokens.
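A minimal sketch of both points, not tensor2tensor's actual implementation: sort by target length, then fill each batch until its padded target size would exceed a token budget. The name `batch_by_target_tokens`, the `max_tokens` parameter, and the choice to count padded slots against the budget are assumptions made for this example.

```python
def batch_by_target_tokens(pairs, max_tokens):
    """Group (source, target) token lists into batches whose padded
    target size (sentences x longest target) stays within max_tokens."""
    ordered = sorted(pairs, key=lambda p: len(p[1]))
    batches, current = [], []
    for pair in ordered:
        tgt_len = len(pair[1])
        if tgt_len > max_tokens:
            raise ValueError("target longer than the token budget")
        # Sorted by length, so the incoming pair is the longest in the batch.
        if current and (len(current) + 1) * tgt_len > max_tokens:
            batches.append(current)
            current = []
        current.append(pair)
    if current:
        batches.append(current)
    return batches

# Toy corpus: source and target have the same length for simplicity.
pairs = [(["s"] * n, ["t"] * n) for n in [3, 1, 8, 2, 7, 3, 1, 5]]
batches = batch_by_target_tokens(pairs, max_tokens=8)
for b in batches:
    print([len(tgt) for _, tgt in b])
```

Short sentences end up packed many to a batch and long ones few to a batch, so every batch stays close to the same token count with little padding.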

However, randomness and approximate length grouping are somewhat contradictory, i.e.
- if we go for a truly random order of the dataset (by shuffling all sentences), we don't get batches of approximately the same length.
- if we go for same-length grouping, we disrupt the randomness.

So, here is what we should do:
- we make batches of similar lengths ahead of time
- we shuffle the batches, not the individual examples.
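The two steps can be sketched as a small generator. It assumes the batches already exist (shown here as lists of target lengths) and reshuffles their order at every epoch; the per-epoch reshuffle, the name `epochs_of_batches`, and the `seed` parameter are assumptions of this sketch.

```python
import random

def epochs_of_batches(batches, n_epochs, seed=0):
    """Yield (epoch, batch) pairs, reshuffling the batch order every epoch.

    `batches` are assumed to be built ahead of time from similar-length
    sentences; their contents are never mixed, only their order changes.
    """
    rng = random.Random(seed)
    order = list(range(len(batches)))
    for epoch in range(n_epochs):
        rng.shuffle(order)
        for idx in order:
            yield epoch, batches[idx]

# Hypothetical pre-made batches, shown as lists of target lengths.
batches = [[1, 1, 2], [3, 3], [5], [7], [8]]
seen = list(epochs_of_batches(batches, n_epochs=2))
for epoch, batch in seen:
    print(epoch, batch)
```

Each epoch visits every batch exactly once in a random order, so the optimizer does not see short batches first and long batches last.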