The repository for the paper "Exponentially Faster Language Modelling"
https://arxiv.org/abs/2311.10770
- The `training` folder contains a clone of the crammedBERT repository from the beginning of October 2023, with a few new configurations and small modifications to enable the use of FFFs. A masking implementation (i.e. an implementation of FFFs that offers no speed advantage over FFs but simulates their selective engagement of neurons by masking) is provided for training and downstream finetuning.
- The `benchmark_cpu` folder contains C++ code using Intel MKL 2023.2.0 to implement accelerated CPU versions of FFF inference as well as baseline DMM implementations of the traditional FF layers.
- The `benchmark_pytorch` folder contains the C++ code for the "Native fused" and "PyTorch BMM" implementations of both FF and FFF inference.
- The `benchmark_cuda` folder contains the C++/CUDA kernel code for the "Naive CUDA" implementations of FF and FFF.
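For orientation, the idea behind FFFs is that inference descends a binary tree of neurons, so only the nodes on one root-to-leaf path fire instead of the whole layer. Below is a minimal NumPy sketch of that conditional traversal; the heap-style node layout, the sign-based branching rule, and the GELU activation are illustrative assumptions, not the repository's exact implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here for illustration
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fff_forward(x, W_in, W_out, depth):
    """Conditional FFF inference: only `depth` of the 2**depth - 1 neurons fire.

    W_in:  (n_nodes, d_model) input weight vectors, nodes stored heap-style (root = 0)
    W_out: (n_nodes, d_model) output weight vectors
    """
    y = np.zeros_like(x)
    node = 0
    for _ in range(depth):
        pre = W_in[node] @ x                      # scalar pre-activation of this node
        y += gelu(pre) * W_out[node]              # every node on the path contributes
        node = 2 * node + (1 if pre > 0 else 2)   # branch on the sign of the pre-activation
    return y
```

With depth 11 (as in UltraFastBERT-1x11-long), such a traversal touches 11 neurons per token rather than all 2047, which is the source of the claimed speedup.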
The configuration and weights for UltraFastBERT-1x11-long can be found on HuggingFace:
https://huggingface.co/pbelcak/UltraFastBERT-1x11-long
These files have been produced and uploaded using `training/load_local_model.py` with `impl.push_to_huggingface_hub=True`.
UltraFastBERT-1x11-long, as a model, is an instance of our small extension of the crammedBERT setup.
You can simply enter the `training` directory and follow the steps given in the crammedBERT README to use the HuggingFace `AutoTokenizer` and `AutoModelForMaskedLM`, with the difference that you want UltraFastBERT-1x11-long, not crammedBERT.
- Create a new Python/conda environment, or simply use one that does not have any previous version of the original `cramming` project installed. If, by accident, you use the original cramming repository code instead of the one provided in the `/training` folder of this project, you will be warned by `transformers` that there are some extra weights (the FFF weights) and that some weights are missing (the FF weights expected by the original crammedBERT).
- `cd ./training`
- `pip install .`
- Create `minimal_example.py` and paste the code below:
```python
import cramming
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pbelcak/UltraFastBERT-1x11-long")
model = AutoModelForMaskedLM.from_pretrained("pbelcak/UltraFastBERT-1x11-long")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
- Run `python minimal_example.py`.
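Once the example runs, `output.logits` holds one vocabulary-sized score vector per token. A small helper for reading off the top-k predictions at masked positions is sketched below; the function name and the `k=5` default are my own, and it assumes you first convert the logits to a NumPy array (e.g. `output.logits[0].detach().numpy()`):

```python
import numpy as np

def top_predictions(logits, mask_positions, k=5):
    """For each masked position, return the ids of the k highest-scoring
    vocabulary entries.

    logits: (seq_len, vocab_size) array from the masked-LM head.
    Map the returned ids back to strings with tokenizer.convert_ids_to_tokens.
    """
    return [np.argsort(logits[pos])[::-1][:k].tolist() for pos in mask_positions]
```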
- To reproduce our training and finetuning results, simply head straight to the `training` folder and follow the instructions of the README there.
- To reproduce our CPU speed benchmarking results, head to `benchmark_cpu`. If you're on Windows, the easiest way to compile and run the code might be to use Visual Studio 2022 Community with the Intel oneAPI extension. The other option is to use the Intel compilers directly (more information on the Intel oneAPI "Getting started" websites).
- The `benchmark_pytorch` results can be reproduced by running `python main.py` in the folder. The outcomes of these runs are automatically saved into a SQLite `results.db` file for ease of inspection.
- `benchmark_cuda` requires the CUDA Toolkit. Once it is installed, running `python setup.py install` in the extension folder will compile the CUDA code for you and prepare a module that can be imported.
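For a quick look at `results.db` without committing to its schema (the table names and columns depend on which benchmark runs you executed), a small schema-agnostic helper like the following can enumerate whatever the file contains; `dump_results` is a hypothetical name, not part of the repository:

```python
import sqlite3

def dump_results(db_path="results.db"):
    """Print each table in the benchmark database together with its row count."""
    con = sqlite3.connect(db_path)
    tables = [name for (name,) in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for t in tables:
        (count,) = con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()
        print(f"{t}: {count} rows")
    con.close()
    return tables
```

From there, a plain `SELECT * FROM <table>` in the `sqlite3` shell or from Python shows the recorded timings.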