nano-seq: a pytorch nmt library implemented from scratch

Introduction

This repository serves as my reference implementation of the Transformer architecture, along with the supporting components needed to build a fully working neural machine translation model.

The structure of this repository is inspired by fairseq (heck, even the name), which I had the opportunity to work with while playing around with experimental NMT architectures for my undergraduate thesis. Reimplementing an NMT model from scratch is an excellent way to delve into the intricacies of implementation details that cannot be fully grasped by reading the paper alone. I personally also had a few "naruhodo" moments during the implementation process.

The Transformer architecture itself is quite intuitive. However, implementing an NMT system from scratch is challenging due to the extensive "extra" work required for writing additional modules for data loading, batching, converting, and decoding. Organizing these components effectively also demands a certain level of codebase management experience. At the end of the day, the true complexity lies in the supporting infrastructure rather than the core model itself. Not to mention if I were to deploy this little sh*t into production. Oh my.

Quick project navigation:

The project currently supports logging training details to Weights & Biases. One can implement a log sink for TensorBoard by extending utils.logger.LoggingHandler.

To be implemented:

Beam search
Autoregressive hidden state caching

Quick guide

Config file

The specification of the Transformer model is defined in a config file. A reference config.yaml is included. The parameter names are quite self-explanatory.

Gotchas:

Batch size is measured in tokens per batch, not samples per batch. Before training, batches are constructed by sorting the data by increasing source length, then greedily grouping samples until the total number of tokens in the batch exceeds the specified size (see BaseCollator). Data is not sorted during inference.
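The token-based batching described above can be sketched as follows. This is only an illustration, not the actual BaseCollator code: the function name `batch_by_tokens` is hypothetical, only source tokens are counted, and a batch is closed just before it would go over the budget. The real collator may count tokens differently or allow a slight overshoot.

```python
from typing import List, Sequence


def batch_by_tokens(src_lengths: Sequence[int], max_tokens: int) -> List[List[int]]:
    """Group sample indices into batches holding roughly `max_tokens` source tokens.

    Hypothetical helper: a sample longer than `max_tokens` still gets a batch
    of its own, so that batch alone can exceed the budget.
    """
    # Sort sample indices by increasing source length (training-time behaviour).
    order = sorted(range(len(src_lengths)), key=lambda i: src_lengths[i])

    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx in order:
        n = src_lengths[idx]
        # Assumption: close the batch when the next sample would exceed the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


lengths = [5, 2, 9, 3, 4]
print(batch_by_tokens(lengths, max_tokens=10))  # [[1, 3, 4], [0], [2]]
```

Because short sentences are grouped together, a batch of short samples contains many more sentences than a batch of long ones, which keeps the amount of padding low.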

Dataset and dictionary

A nano-seq parallel dataset is simply a directory containing two txt files named after their language codes. The source and target languages are specified in the training config.

For example:

data
├── test
│   ├── en.txt
│   └── vi.txt
├── train
│   ├── en.txt
│   └── vi.txt
└── valid
    ├── en.txt
    └── vi.txt
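As a rough sketch of how one split of such a directory could be read, the snippet below pairs the two files line by line. The function name `read_parallel` is hypothetical and is not part of nano-seq; it also assumes one sentence per line and UTF-8 encoding.

```python
from pathlib import Path
from typing import List, Tuple


def _read_lines(path: Path) -> List[str]:
    # Iterate the file object so that only real newlines split lines.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_parallel(split_dir: str, src_lang: str, tgt_lang: str) -> List[Tuple[str, str]]:
    """Pair `<src_lang>.txt` with `<tgt_lang>.txt` from one split directory."""
    root = Path(split_dir)
    src = _read_lines(root / f"{src_lang}.txt")
    tgt = _read_lines(root / f"{tgt_lang}.txt")
    # A parallel corpus must have the same number of lines on both sides.
    if len(src) != len(tgt):
        raise ValueError(
            f"{split_dir}: {len(src)} source lines vs {len(tgt)} target lines"
        )
    return list(zip(src, tgt))
```

For the layout above, `read_parallel("data/train", "en", "vi")` would return a list of (English, Vietnamese) sentence pairs.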

A dictionary is the vocab file exported from spm_train, in which each line holds a token and its frequency, separated by a tab character. The frequency information is not used by nano-seq, so you are free to use any subword tokenizer, as long as the dictionary is converted to the described format (e.g. by setting a dummy frequency of -1 for all entries).
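If your tokenizer produces a plain list of tokens, converting it to this format could look like the sketch below. Both helper names are hypothetical and are not part of nano-seq; the dummy frequency of -1 follows the suggestion above.

```python
from typing import Iterable, List


def write_dictionary(tokens: Iterable[str], path: str, dummy_freq: int = -1) -> None:
    """Write one `token<TAB>frequency` entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(f"{token}\t{dummy_freq}\n")


def read_dictionary(path: str) -> List[str]:
    """Return the tokens in file order, ignoring the frequency column."""
    with open(path, encoding="utf-8") as f:
        # Split on the last tab so that only the trailing frequency is dropped.
        return [line.rstrip("\n").rsplit("\t", 1)[0] for line in f if line.strip()]
```

For example, `write_dictionary(["<unk>", "▁the", "ing"], "dictionary.vocab")` writes three tab-separated lines, each ending in -1.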

If you wish to use a separate vocabulary for each language (e.g. to disable embedding sharing between the encoder and decoder), set shared_dict to false in the config and include two files in the dictionary folder.

data
└── dictionary
    ├── en.vocab
    └── vi.vocab

In case a shared dictionary is used, it should be named dictionary.vocab.
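The naming convention for both cases can be summarized in a few lines. This is a sketch only: `dictionary_paths` is a hypothetical helper, and nano-seq's own lookup logic may differ.

```python
from pathlib import Path
from typing import Tuple


def dictionary_paths(
    dict_dir: str, src_lang: str, tgt_lang: str, shared_dict: bool
) -> Tuple[Path, Path]:
    """Return the (source, target) vocab file paths for a dictionary folder."""
    root = Path(dict_dir)
    if shared_dict:
        # One vocabulary file serves both the encoder and the decoder.
        shared = root / "dictionary.vocab"
        return shared, shared
    # Separate vocabularies, named after the language codes.
    return root / f"{src_lang}.vocab", root / f"{tgt_lang}.vocab"
```

With `shared_dict=False`, `dictionary_paths("data/dictionary", "en", "vi", False)` resolves to `data/dictionary/en.vocab` and `data/dictionary/vi.vocab`.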

Installation

git clone https://github.com/hungngocphat01/nano-seq
pip install ./nano-seq

Training

python train.py \
  -c path/to/config.yaml \
  --chkpt-path path/to/checkpoint/dir \
  --chkpt-load /path/to/checkpoint_to_load.pt

The last parameter can be omitted when training from scratch.

Inference

python predict.py \
  -c path/to/config.yaml \
  --data-path path/to/inference/dataset \
  --dict-path path/to/dictionary/dir \
  --batch-size 256 \
  -o output_file.txt

When calling the inference script, the valid_path, dict_path and batch_size values in the config file are ignored and must be passed explicitly on the command line.

Reference

A reference IWSLT'14 English-Vietnamese dataset (truncated, and tokenized with the subword-nmt tokenizer) is included in the Releases section on GitHub.

Screenshots

Training CLI

Metrics logged to WandB
