Keras library implementing the (Universal) Transformer model


Keras-transformer is a library implementing the nuts and bolts for building (Universal) Transformer models using Keras. It allows you to assemble a multi-step Transformer model in a flexible way, for example:

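# Constructor arguments are elided in this excerpt.
transformer_block = TransformerBlock(...)
add_coordinate_embedding = TransformerCoordinateEmbedding(...)

# Feed each step's output into the next step, reusing the same block.
output = input
for step in range(transformer_depth):
    output = transformer_block(
        add_coordinate_embedding(output, step=step))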

The library supports positional encoding and embeddings, attention masking, memory-compressed attention, and ACT (adaptive computation time). All pieces of the model (such as self-attention, activation functions, and layer normalization) are available as Keras layers, so if necessary you can build your own version of the Transformer by rearranging them or replacing some of them.
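To make the idea of positional encoding concrete, here is a minimal sketch of the fixed sinusoidal encoding from the "Attention is all you need" paper, written in plain Python. It is not the library's code: the function name `sinusoidal_positional_encoding` and its parameters are made up for illustration, and the library's own layers may compute or learn their position embeddings differently.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of fixed sinusoidal encodings.

    Hypothetical helper for illustration only, not part of keras-transformer.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model // 2):
            # Each sin/cos pair uses a wavelength that grows geometrically with i.
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # even dimension 2i
            row.append(math.cos(angle))  # odd dimension 2i + 1
        table.append(row)
    return table


table = sinusoidal_positional_encoding(seq_len=3, d_model=4)
print(table[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

Each row of the table would be added to the embedding of the token at that position, which gives the otherwise order-agnostic attention layers access to word order.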

The (Universal) Transformer is a deep learning architecture described in arguably two of the most impressive DL papers of 2017 and 2018: "Attention is all you need" and "Universal Transformers", by the Google Research and Google Brain teams.

The authors introduced the idea of recurrent multi-head self-attention, which has inspired a large wave of new research models ever since. These models have demonstrated new state-of-the-art results in various NLP tasks, including translation, parsing, question answering, and even some algorithmic tasks.


To install the library you need to clone the repository

git clone

then switch to the cloned directory and run pip

cd keras-transformer
pip install .

Language modelling example

This repository contains a simple example showing how Keras-transformer works. It's not a rigorous evaluation of the model's capabilities, but rather a demonstration of how to use the code.

The code trains a simple language-modeling network on the WikiText-2 dataset and evaluates its perplexity. The model itself is an Adaptive Universal Transformer with five layers.
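The "Adaptive" part refers to ACT, which lets each position decide how many steps to run. As a rough sketch of the halting rule from the adaptive computation time paper (not the library's implementation, which may differ in details), the hypothetical helper `act_weights` below takes the halting probabilities produced at successive steps for one position and returns the weight given to each step's state. The threshold `epsilon` is an assumed parameter name.

```python
def act_weights(halting_probs, epsilon=0.01):
    """Return per-step mixing weights under ACT halting for one position.

    Hypothetical helper for illustration only. The length of halting_probs
    acts as the maximum number of steps.
    """
    weights = []
    cumulative = 0.0
    for n, h in enumerate(halting_probs):
        is_last = n == len(halting_probs) - 1
        # Halt once the accumulated probability reaches 1 - epsilon,
        # or when the step budget runs out.
        if cumulative + h >= 1.0 - epsilon or is_last:
            weights.append(1.0 - cumulative)  # remainder, so weights sum to 1
            break
        weights.append(h)
        cumulative += h
    return weights


weights = act_weights([0.3, 0.4, 0.5, 0.9])
print(len(weights))  # halts at the third step; the fourth is never used
```

The final state for the position is then the weighted sum of the per-step states, so positions that halt early simply stop contributing further computation.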

To launch the code, you will first need to install the requirements listed in example/requirements.txt. Assuming you work from a Python virtual environment, you can do this by running

pip install -r example/requirements.txt

You will also need to make sure you have a backend for Keras. For instance, you can install TensorFlow (the sample was tested using TensorFlow and PlaidML as backends):

pip install tensorflow

Now you can launch the example itself as

python -m <example_module> --save lm_model.h5

To see all command line options and their default values, try

python -m <example_module> --help

If all goes well, after launching the example you should see the perplexity falling with each epoch.

Building vocabulary: 100%|█████████████████████████████████| 36718/36718 [00:04<00:00, 7642.33it/s]
Learning BPE...Done
Building BPE vocabulary: 100%|███████████████████████████████| 36718/36718 [00:06<00:00, 5743.74it/s]
Train on 9414 samples, validate on 957 samples
Epoch 1/50
9414/9414 [==============================] - 76s 8ms/step - loss: 7.0847 - perplexity: 1044.2455
    - val_loss: 6.3167 - val_perplexity: 406.5031

After 200 epochs (~5 hours) of training on a GeForce 1080 Ti, I got a validation perplexity of about 51.61 and a test perplexity of 50.82. The score can be improved further, but that is not the point of this demo.
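For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that definition in plain Python; the function name `perplexity` is hypothetical and this is not the metric code used by the example. Note that the logged loss need not equal the logarithm of the logged perplexity, for instance if the training loss includes extra regularization terms.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to the true tokens.

    Hypothetical helper for illustration only.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 tokens has a
# perplexity of about 4: it is as uncertain as a fair 4-way choice.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

A lower perplexity means the model assigns higher probability to the held-out text.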