This repository serves as my reference implementation of the Transformer architecture, plus the supplementary pieces needed to make up a fully working neural machine translation (NMT) model.
The structure of this repository is inspired by fairseq (heck, even the name), which I had the opportunity to work with while playing around with experimental NMT architectures for my undergraduate thesis. Reimplementing the NMT model from scratch is an excellent way to delve into the intricacies of implementation details that cannot be fully grasped by reading the paper alone. I personally also had a few "naruhodo" moments during the implementation process.
The Transformer architecture itself is quite intuitive. However, implementing an NMT system from scratch is challenging due to the extensive "extra" work required for writing additional modules for data loading, batching, converting, and decoding. Organizing these components effectively also demands a certain level of codebase management experience. At the end of the day, the true complexity lies in the supporting infrastructure rather than the core model itself. Not to mention if I were to deploy this little sh*t into production. Oh my.
Quick project navigation:
- Transformer implementation
- Translation-related logic
- Data collating
- Decoding (greedy search, beam search)
The project currently supports logging training details to Weights & Biases. One can implement a log sink for TensorBoard by extending utils.logger.LoggingHandler.
To be implemented:
- Beam search
- Autoregressive hidden state caching
The specifications of the Transformer model are defined in a config file. A reference config.yaml is included. The parameter names are quite self-explanatory.
Gotchas:
- Batch size is calculated in tokens/batch, not samples/batch. Before training, batches are constructed by sorting the data by increasing source length, and samples are greedily grouped until the total number of tokens in the batch exceeds the specified size (see BaseCollator). Data is not sorted during inference.
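The token-based batching described above can be sketched roughly as follows. This is a minimal illustration, not the actual BaseCollator code: the function name batch_by_tokens, counting only source tokens, and closing a batch just before it would go over the budget are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# One sample: (source token ids, target token ids). Hypothetical type for this sketch.
Pair = Tuple[Sequence[int], Sequence[int]]


def batch_by_tokens(pairs: List[Pair], max_tokens: int, sort: bool = True) -> List[List[Pair]]:
    """Greedily group samples under a token budget (hypothetical helper)."""
    if sort:
        # Training: sort by increasing source length so similar lengths share a batch.
        pairs = sorted(pairs, key=lambda p: len(p[0]))

    batches: List[List[Pair]] = []
    current: List[Pair] = []
    current_tokens = 0
    for pair in pairs:
        n_tokens = len(pair[0])  # assumption: only source tokens count toward the budget
        # Close the running batch if adding this sample would exceed the budget.
        if current and current_tokens + n_tokens > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(pair)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches
```

Passing sort=False mimics the inference behaviour, where the data keeps its original order. A sample longer than max_tokens still gets a batch of its own in this sketch.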
A nano-seq parallel dataset is basically a directory with two txt files whose filenames are the language codes. The source and target languages are specified in the training config.
For example:
data
├── test
│   ├── en.txt
│   └── vi.txt
├── train
│   ├── en.txt
│   └── vi.txt
└── valid
    ├── en.txt
    └── vi.txt
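To make the layout concrete, here is a small sketch of how one split could be read into sentence pairs. It is not the loader nano-seq uses: the function read_parallel is hypothetical, and it assumes the two files are line-aligned (line i of one file is the translation of line i of the other), which is the usual convention for parallel corpora.

```python
from pathlib import Path
from typing import List, Tuple


def read_parallel(split_dir: str, src_lang: str, tgt_lang: str) -> List[Tuple[str, str]]:
    """Read <split_dir>/<src_lang>.txt and <split_dir>/<tgt_lang>.txt as aligned pairs."""
    root = Path(split_dir)
    src_lines = (root / f"{src_lang}.txt").read_text(encoding="utf-8").splitlines()
    tgt_lines = (root / f"{tgt_lang}.txt").read_text(encoding="utf-8").splitlines()
    # Assumption: both files have one sentence per line, in the same order.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{src_lang}.txt has {len(src_lines)} lines but "
            f"{tgt_lang}.txt has {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))
```

With the tree above, read_parallel("data/train", "en", "vi") would return the English-Vietnamese training pairs.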
A dictionary is the vocab file exported from spm_train, in which each line is a token entry and its frequency, separated by a tab character. The frequency information is not used by nano-seq, so you are free to use any subword tokenizer, as long as the dictionary is converted to the described format (e.g. by setting a dummy frequency -1 for all entries).
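As an illustration of the format, the sketch below writes a token list from an arbitrary tokenizer as tab-separated lines with a dummy frequency, and reads such lines back while ignoring the second column. Both helper names are hypothetical, and mapping each token to its line index is an assumption; nano-seq may number tokens differently (for instance by reserving ids for special tokens).

```python
from typing import Dict, Iterable, List


def to_dictionary_lines(tokens: Iterable[str], dummy_freq: int = -1) -> List[str]:
    """Format tokens as '<token>\\t<frequency>' lines with a placeholder frequency."""
    return [f"{token}\t{dummy_freq}" for token in tokens]


def parse_dictionary(lines: Iterable[str]) -> Dict[str, int]:
    """Map each token to its line index, ignoring the frequency column."""
    vocab: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        token = line.split("\t", 1)[0]
        # Keep the first occurrence if a token is listed twice.
        vocab.setdefault(token, len(vocab))
    return vocab
```

Writing the result of to_dictionary_lines to a .vocab file, one entry per line, gives a dictionary in the layout described above.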
If you wish to use two separate vocabularies, one for each language (e.g. to disable embedding sharing between the encoder and decoder), set the config shared_dict to false, and include two files in the dictionary folder.
data
├── dictionary
│   ├── en.vocab
│   └── vi.vocab
In case a shared dictionary is used, it should be named dictionary.vocab.
git clone https://github.com/hungngocphat01/nano-seq
pip install ./nano-seq
python train.py \
-c path/to/config.yaml \
--chkpt-path path/to/checkpoint/dir \
--chkpt-load /path/to/checkpoint_to_load.pt
The last parameter can be omitted if training from the beginning.
python predict.py \
-c path/to/config.yaml \
--data-path path/to/inference/dataset \
--dict-path path/to/dictionary/dir \
--batch-size 256 \
-o output_file.txt
When calling the inference script, the valid_path, dict_path and batch_size values in the config file are ignored and have to be explicitly redefined on the command line.
A reference IWSLT'14 English-Vietnamese dataset (tokenized with subword-nmt and truncated) is included in the Release section of GitHub.
Training CLI
Metrics logged to WandB