A simplified implementation of OpenAI's BPE encoder for GPT-2.
The original implementation can be found in original.py, which was copied from here.
My re-implementation can be found in bpe.py. I simplified a lot of things, added type hints, and refactored everything to be functional (I use recursion for merging the pairs). This implementation is probably slower than the original.
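To illustrate the recursive-merge idea mentioned above, here is a minimal sketch (not the code from bpe.py; the function name, toy merge table, and dict-of-ranks representation are all assumptions for illustration):

```python
def merge_pairs(tokens: tuple, merges: dict) -> tuple:
    """Recursively apply the best-ranked BPE merge until none applies."""
    pairs = set(zip(tokens, tokens[1:]))
    candidates = [p for p in pairs if p in merges]
    if not candidates:
        return tokens  # base case: no mergeable pair left
    # merge the pair with the lowest rank (i.e. learned earliest)
    best = min(candidates, key=lambda p: merges[p])
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merge_pairs(tuple(merged), merges)

# toy merge ranks (hypothetical, not GPT-2's real vocabulary)
merges = {("h", "e"): 0, ("l", "l"): 1, ("he", "ll"): 2}
print(merge_pairs(tuple("hello"), merges))  # → ('hell', 'o')
```

Recursing once per merge is what makes this slower than the original's loop, but it keeps each step pure: every call takes a token tuple and returns a new one.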
You can test that this implementation gives identical outputs to the original when encoding some_text_file.txt via:
$ python test.py some_text_file.txt
✅ test passed (encode -> decode recovers input text)
✅ test passed (gives same output as original implementation)

Note: you'll need to install regex:

$ pip install regex

Tested with Python 3.9.6.