GPT-2 Based Detokenizer

Notice: this repository was almost entirely written with the assistance of GPT-4.

This package provides a detokenizer built on the GPT-2 language model. It reconstructs coherent text from a list of tokens. For each adjacent pair of tokens, it decides whether to insert a space, using beam search to maximize the probability of the full string under the language model.
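
The space-or-no-space search described above can be sketched as follows. This is a minimal illustration, not the code in detokenizer.py: the names beam_detokenize, toy_score and TOY_VOCAB are hypothetical, and the scorer is a toy word-list stand-in for GPT-2 so that the example runs without downloading a model.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a language model: known words are cheap,
# unknown words are expensive. A real scorer would return the GPT-2
# log-probability of the string instead.
TOY_VOCAB = {"I", "don't", "know", "know."}


def toy_score(text: str) -> float:
    """Return a pseudo log-probability: -1 per known word, -10 per unknown word."""
    return sum(-1.0 if word in TOY_VOCAB else -10.0 for word in text.split())


def beam_detokenize(
    tokens: Sequence[str],
    score_fn: Callable[[str], float],
    beam_width: int = 4,
) -> str:
    """Join tokens, deciding at each boundary whether to insert a space."""
    if not tokens:
        return ""
    beams: List[str] = [tokens[0]]
    for token in tokens[1:]:
        # Every surviving hypothesis branches two ways: no space, or a space.
        candidates: List[str] = []
        for prefix in beams:
            candidates.append(prefix + token)
            candidates.append(prefix + " " + token)
        # Keep only the beam_width highest-scoring hypotheses.
        candidates.sort(key=score_fn, reverse=True)
        beams = candidates[:beam_width]
    # The list is sorted, so the first entry is the best hypothesis.
    return beams[0]


if __name__ == "__main__":
    tokens = ["I", "don", "'", "t", "know", "."]
    print(beam_detokenize(tokens, toy_score))  # I don't know.
```

With beam_width=1 the same toy scorer returns "Idon'tknow.", because the partial string "I don" scores worse than "Idon" and is pruned before "don't" can be completed. Keeping several hypotheses alive is what lets the search recover from locally poor prefixes.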

Installation

This package requires Python and the transformers library. Install transformers with pip:

pip install transformers

Usage

The main class in this package is GPT2Detokenizer. Here is a basic usage example:

from detokenizer import GPT2Detokenizer

detokenizer = GPT2Detokenizer()
tokens = ["I", "don", "'", "t", "know", "."]
print(detokenizer.detokenize(tokens))  # Expected output: "I don't know."

Testing

To run the unit tests, change to the directory containing test_detokenizer.py and run:

python -m unittest test_detokenizer.py

nickatomlin/detokenizer
