add tokenizer readme w/ instructions for convert script
apage43 authored and manyoso committed May 30, 2023
1 parent 840e011 commit 9c15d1f
Showing 1 changed file with 14 additions and 0 deletions.
14 changes: 14 additions & 0 deletions gpt4all-backend/tokenizer/README.md
@@ -0,0 +1,14 @@
# BPE tokenizer

This is a C++ implementation of the encoding and decoding functions of a pretrained GPT-2-style BPE tokenizer. It is meant to be compatible with the GPT-J and MPT-7B tokenizers, which were trained with HuggingFace [`tokenizers`](https://github.com/huggingface/tokenizers), and it implements only the functionality those models need. In particular, it assumes that strings are always [normalized](https://en.wikipedia.org/wiki/Unicode_equivalence) to Unicode NFC form and split with the GPT-2 "pretokenizing" [regular expression](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/bpe.py#L92).
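To make the two assumptions above concrete, here is a minimal Python sketch of NFC normalization followed by GPT-2-style pretokenization. It is an illustration only, not the C++ code in this directory. The function name `pretokenize` is hypothetical, and the pattern is an approximation: the GPT-2 expression uses `\p{L}` and `\p{N}`, which the standard-library `re` module does not support, so `[^\W\d_]` and `\d` stand in for them here.

```python
import re
import unicodedata

# Approximation of the GPT-2 pretokenizing pattern using only stdlib `re`.
# Assumption: [^\W\d_] stands in for \p{L} and \d for \p{N}; the two are not
# identical for every Unicode character, so treat this as a sketch.
PRETOKENIZE_RE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"   # common English contractions
    r"| ?[^\W\d_]+"              # optional space + run of letters
    r"| ?\d+"                    # optional space + run of digits
    r"| ?(?:[^\s\w]|_)+"         # optional space + run of other symbols
    r"|\s+(?!\S)|\s+"            # whitespace runs
)


def pretokenize(text: str) -> list[str]:
    """Normalize to NFC, then split into pretokens (hypothetical helper)."""
    normalized = unicodedata.normalize("NFC", text)
    return PRETOKENIZE_RE.findall(normalized)


print(pretokenize("Hello world, it's 42!"))
```

Each resulting piece would then be byte-encoded and merged by BPE independently; concatenating the pieces gives back the normalized input string.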

## Converting a tokenizer file

`scripts/gen_tokenizer_include.py` converts a HuggingFace `tokenizers` `tokenizer.json` file into a C++ header file:

```bash
cd gpt4all-backend
# get tokenizer.json
wget -O /tmp/gptj-tokenizer.json https://huggingface.co/nomic-ai/gpt4all-j/raw/main/tokenizer.json
# generate the C++ header from it
python ./scripts/gen_tokenizer_include.py /tmp/gptj-tokenizer.json gptj > ./tokenizer/gptj_tokenizer_config.h
```
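Before converting, it can be useful to check that the downloaded file is a BPE tokenizer of the expected shape. The sketch below is not part of the conversion script; `summarize_tokenizer` is a hypothetical helper. It assumes the usual HuggingFace `tokenizer.json` layout, with a `model` object holding a `vocab` mapping and a `merges` list whose entries are either `"left right"` strings or two-element lists.

```python
import json


def summarize_tokenizer(config: dict) -> dict:
    """Summarize the BPE model section of a parsed tokenizer.json (hypothetical helper)."""
    model = config["model"]
    vocab = model["vocab"]
    # Assumption: merges are "left right" strings or [left, right] pairs.
    merges = [
        tuple(m.split(" ", 1)) if isinstance(m, str) else tuple(m)
        for m in model["merges"]
    ]
    return {
        "type": model.get("type"),
        "vocab_size": len(vocab),
        "num_merges": len(merges),
        "first_merge": merges[0] if merges else None,
    }


# Example usage with the file downloaded above:
# with open("/tmp/gptj-tokenizer.json", encoding="utf-8") as f:
#     print(summarize_tokenizer(json.load(f)))
```

A `type` other than `"BPE"` would suggest the file is not one this tokenizer is meant to handle.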
