
support for gpt-3.5 and gpt 4? #2

Closed
Augustukas opened this issue Jun 20, 2023 · 3 comments

Comments

@Augustukas

Any ideas to support for gpt-3.5 and gpt 4?

@josephrocca
Owner

josephrocca commented Jun 21, 2023

gpt-3.5 uses this same tokenizer (edit: actually, gpt-3.5-turbo at least uses the gpt-4 tokenizer, I think), but gpt-4 uses a different one.

This might have what you want: https://github.com/dqbd/tiktoken/tree/main

You may also want to give this issue a thumbs up: openai/tiktoken#94 (wasm/pyodide would allow running in browsers)
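To make the model/tokenizer relationship above concrete, here is a minimal sketch of the model-to-encoding mapping as I understand it from tiktoken's `model.py` (the exact entries are hand-copied assumptions, not an authoritative list; use `tiktoken.encoding_for_model()` for the real thing):

```python
# Hypothetical snapshot of which tiktoken encoding each OpenAI model family
# uses; this mirrors what tiktoken.encoding_for_model() resolves internally,
# without importing the library.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",           # gpt-4 tokenizer
    "gpt-3.5-turbo": "cl100k_base",   # same tokenizer as gpt-4
    "text-davinci-003": "p50k_base",  # older gpt-3.5 completion models
    "davinci": "r50k_base",           # original gpt-3 base models
}

def encoding_name_for(model: str) -> str:
    """Look up the encoding name for a model (sketch only)."""
    return MODEL_TO_ENCODING[model]

print(encoding_name_for("gpt-3.5-turbo"))  # cl100k_base
```

The point of the sketch: gpt-3.5-turbo and gpt-4 share `cl100k_base`, so a cl100k_base tokenizer covers both, while the older gpt-3 models need a different encoding.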

@psymbio

psymbio commented Oct 20, 2023

Hi, after a lot of rummaging through some of the helpful issues you raised on Pyodide and Hugging Face tokenizers, I've managed to work out the steps to do this:

git clone https://github.com/pyodide/pyodide && cd pyodide
# the pre-built flag is not present, as noted in huggingface/tokenizers#1010
./run_docker
make

# installing rust with nightly toolchain
sudo apt update
sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup toolchain install nightly
rustup target add --toolchain nightly wasm32-unknown-emscripten
rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu

# once this is done we need to install maturin and emsdk
# emsdk also seems to be installed during the make step, so is this required?
cargo install --git https://github.com/PyO3/maturin.git maturin
pip install ./pyodide-build

# https://github.com/huggingface/tokenizers/issues/1010
git clone https://github.com/openai/tiktoken.git
pip install --upgrade pip
pip install setuptools_rust
sudo apt-get install pkg-config libssl-dev
cd tiktoken
python setup.py install --user
sudo apt-get install vim
vim tiktoken_test.py

Once the file opens, paste the following into it:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
if enc.decode(enc.encode("hello world")) == "hello world":
    print("encoding and decoding test passed!")
else:
    print("encoding and decoding test failed...")

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")

This should print the following:

encoding and decoding test passed!

For more documentation, see https://github.com/psymbio/tiktoken_rust_wasm/blob/main/emscripten.md and https://github.com/psymbio/tiktoken_rust_wasm

I need help figuring out what exactly builds the wheel file using Emscripten.

@josephrocca
Owner

josephrocca commented Oct 20, 2023

Potentially useful: the @xenova/transformers (Transformers.js) package.

It has tokenizers for gpt-3, gpt-3.5-turbo, gpt-4, llama 2, t5, and more.

I think you can basically just do this:

// load Transformers.js from a CDN, then fetch the gpt-4 tokenizer files
import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.6.2'
let tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt-4");
let tokens = tokenizer.encode("hello world");
