Support for gpt-3.5 and gpt-4? #2
gpt-3.5 uses this same tokenizer (edit: actually 3.5-turbo, at least, uses the gpt-4 tokenizer I think??), but gpt-4 uses a different one. This might have what you want: https://github.com/dqbd/tiktoken/tree/main You may also want to give this issue a thumbs up: openai/tiktoken#94 (wasm/pyodide would allow running in browsers)
Hi, after a lot of rummaging through some of the helpful issues you raised on Pyodide and Hugging Face tokenizers, I've somewhat managed to figure out the steps to do this:

```shell
git clone https://github.com/pyodide/pyodide && cd pyodide
# pre-built flag is not present, as stated in issue 1010 of the huggingface tokenizers repo
./run_docker
make

# install rust with the nightly toolchain
sudo apt update
sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env
rustup toolchain install nightly
rustup target add --toolchain nightly wasm32-unknown-emscripten
rustup component add rust-src --toolchain nightly-x86_64-unknown-linux-gnu

# once this is done we need to install maturin and emsdk
# emsdk seems to install on the make step of the code too, so is this required?
cargo install --git https://github.com/PyO3/maturin.git maturin
pip install ./pyodide-build

# https://github.com/huggingface/tokenizers/issues/1010
git clone https://github.com/openai/tiktoken.git
pip install --upgrade pip
pip install setuptools_rust
sudo apt-get install pkg-config libssl-dev
cd tiktoken
python setup.py install --user
sudo apt-get install vim
vim tiktoken_test.py
```

Once the file opens, paste the following into it:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
if enc.decode(enc.encode("hello world")) == "hello world":
    print("encoding and decoding test passed!")
else:
    print("encoding and decoding test failed...")

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")
```

This should print the following: `encoding and decoding test passed!` For more documentation go here: https://github.com/psymbio/tiktoken_rust_wasm/blob/main/emscripten.md and https://github.com/psymbio/tiktoken_rust_wasm. I needed help in figuring out what exactly builds the wheel file using emscripten?
Potentially useful:
It has gpt-3, gpt-3.5-turbo, gpt-4, llama 2, t5, and more. I think you can basically just do this:

```javascript
import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.6.2';

let tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt-4");
let tokens = tokenizer.encode("hello world");
```
Any ideas on how to add support for gpt-3.5 and gpt-4?