Support for gpt-3.5 and gpt-4? #2
gpt-3.5 uses this same tokenizer (edit: actually 3.5-turbo, at least, uses the gpt-4 tokenizer I think??), but gpt-4 uses a different one. This might have what you want: https://github.com/dqbd/tiktoken/tree/main You may also want to give this issue a thumbs up: openai/tiktoken#94 (wasm/pyodide would allow running in browsers)
Hi, after a lot of rummaging through some of the helpful issues you raised on Pyodide and Hugging Face tokenizers, I've somewhat managed to figure out the steps to do this:

```shell
git clone https://github.com/pyodide/pyodide && cd pyodide
# pre-built flag is not present, as stated in issue 1010 of the huggingface tokenizers repo
./run_docker
make

# install rust with the nightly toolchain
sudo apt update
sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup toolchain install nightly
rustup target add --toolchain nightly wasm32-unknown-emscripten
rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu

# once this is done we need to install maturin and emsdk
# emsdk seems to install on the make step of the code too, so is this required?
cargo install --git https://github.com/PyO3/maturin.git maturin
pip install ./pyodide-build

# https://github.com/huggingface/tokenizers/issues/1010
git clone https://github.com/openai/tiktoken.git
pip install --upgrade pip
pip install setuptools_rust
sudo apt-get install pkg-config libssl-dev
cd tiktoken
python setup.py install --user
sudo apt-get install vim
vim tiktoken_test.py
```

Once the file opens, paste the following into it:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
if enc.decode(enc.encode("hello world")) == "hello world":
    print("encoding and decoding test passed!")
else:
    print("encoding and decoding test failed...")

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")
```

This should print the following: `encoding and decoding test passed!` For more documentation go here: https://github.com/psymbio/tiktoken_rust_wasm/blob/main/emscripten.md and https://github.com/psymbio/tiktoken_rust_wasm. I needed help in figuring out what exactly builds the wheel file using emscripten?
Potentially useful:
It has gpt-3, gpt-3.5-turbo, gpt-4, llama 2, t5, and more. I think you can basically just do this:

```javascript
import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.6.2';

let tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt-4");
let tokens = tokenizer.encode("hello world");
```
Any ideas on how to add support for gpt-3.5 and gpt-4?