
get_encoding error for gpt2, but other encodings fine #63

Closed
mobilestack opened this issue Mar 15, 2023 · 5 comments

Comments

@mobilestack

The code is like this.

import tiktoken

# runs ok
encoding2 = tiktoken.get_encoding("cl100k_base")

# runs ok
encoding4 = tiktoken.encoding_for_model("gpt-3.5-turbo")

# runs ok
encoding3 = tiktoken.get_encoding("p50k_base")

# raises AssertionError
encoding3 = tiktoken.get_encoding("gpt2")

The error message is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[11], line 2
      1 # runs error
----> 2 encoding3 = tiktoken.get_encoding("gpt2")

File ~/work/venv310/lib/python3.10/site-packages/tiktoken/registry.py:63, in get_encoding(encoding_name)
     60     raise ValueError(f"Unknown encoding {encoding_name}")
     62 constructor = ENCODING_CONSTRUCTORS[encoding_name]
---> 63 enc = Encoding(**constructor())
     64 ENCODINGS[encoding_name] = enc
     65 return enc

File ~/work/venv310/lib/python3.10/site-packages/tiktoken_ext/openai_public.py:11, in gpt2()
     10 def gpt2():
---> 11     mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
     12         vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
     13         encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
     14     )
     15     return {
     16         "name": "gpt2",
     17         "explicit_n_vocab": 50257,
   (...)
     20         "special_tokens": {"<|endoftext|>": 50256},
     21     }

File ~/work/venv310/lib/python3.10/site-packages/tiktoken/load.py:95, in data_gym_to_mergeable_bpe_ranks(vocab_bpe_file, encoder_json_file)
     93 encoder_json_loaded.pop(b"<|endoftext|>", None)
     94 encoder_json_loaded.pop(b"<|startoftext|>", None)
---> 95 assert bpe_ranks == encoder_json_loaded
     97 return bpe_ranks

AssertionError: 

Following the steps you suggested in another issue, I ran these commands:

python --version
python -c 'import platform; print(platform.platform())'
python -m venv env
source env/bin/activate
env/bin/python -m pip install wheel
env/bin/python -m pip install tiktoken
env/bin/python -c 'import tiktoken; print(tiktoken.get_encoding("gpt2"))'
env/bin/python -c 'import site; import os; print(os.listdir(site.getsitepackages()[0]))'

I don't have a python command, only python3, so I ran everything inside a venv.

The results look like this:

Python 3.10.3
macOS-13.2-arm64-arm-64bit

(venv310) ➜  ~ pip install wheel
Requirement already satisfied: wheel in ./work/venv310/lib/python3.10/site-packages (0.40.0)
(venv310) ➜  ~ pip install tiktoken
Requirement already satisfied: tiktoken in ./work/venv310/lib/python3.10/site-packages (0.3.1)
Requirement already satisfied: regex>=2022.1.18 in ./work/venv310/lib/python3.10/site-packages (from tiktoken) (2022.10.31)
Requirement already satisfied: requests>=2.26.0 in ./work/venv310/lib/python3.10/site-packages (from tiktoken) (2.28.2)
Requirement already satisfied: charset-normalizer<4,>=2 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (1.26.9)
Requirement already satisfied: idna<4,>=2.5 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (2022.12.7)
(venv310) ➜  ~ python -c 'import site; import os; print(os.listdir(site.getsitepackages()[0]))'
['shellingham-1.5.0.post1.dist-info', 'fastjsonschema', 'dataclasses_json-0.5.7.dist-info', 'typing_extensions-4.5.0.dist-info', 'commonmark-0.9.1.dist-info', 'talib', 'weibo_spider', 'multidict-6.0.4.dist-info', 'async_timeout', 'marshmallow', 'importlib_metadata-4.10.1.dist-info', 'appnope', 'packaging', 'fonttools-4.31.1.dist-info', 'aiohttp', 'rfc3339_validator-0.1.4.dist-info', 'appnope-0.1.3.dist-info', 'certifi-2022.12.7.dist-info', 'pyrsistent-0.19.3.dist-info', 'altgraph-0.17.3.dist-info', 'wcwidth-0.2.6.dist-info', 'qdarkstyle', 'fqdn-1.5.1.dist-info', 'decorator-5.1.1.dist-info', 'tokenizers', 'ffmpeg', 'jupyter_client-8.0.3.dist-info', 'wcwidth', 'idna-3.3.dist-info', 'Jinja2-3.1.2.dist-info', 'websocket', 'markupsafe', 'integv', 'deap', 'jupyter_core', 'lxml-4.8.0.dist-info', 'vnpy_algotrading-1.0.2.dist-info', 'pandocfilters-1.5.0.dist-info', 'ptyprocess-0.7.0.dist-info', 'widgetsnbextension', 'aiosignal-1.3.1.dist-info', 'pytz-2022.1.dist-info', 'bs4-0.0.1.dist-info', 'isoduration-20.11.0.dist-info', 'webcolors-1.12.dist-info', 'webencodings', 'huggingface_hub', 'tinycss2-1.2.1.dist-info', 'yt_dlp', 
'backcall', 'websocket_client-1.5.1.dist-info', 'bleach-6.0.0.dist-info', 'defusedxml', '.DS_Store', 'nbformat', 'mistune', 'webencodings-0.5.1.dist-info', 'shiboken6', 'attrs', 'colorama-0.4.6.dist-info', 'pyrsistent', 'python_dateutil-2.8.2.dist-info', 'pycryptodomex-3.17.dist-info', 'debugpy-1.6.6.dist-info', 'bleach', 'pygments', 'TA_Lib-0.4.24.dist-info', 'pure_eval', 'aiofiles', 'pyparsing-3.0.7.dist-info', 'gpt_index-0.4.28.dist-info', 'asttokens-2.2.1.dist-info', 'pycparser', 'async_timeout-4.0.2.dist-info', 'more_itertools-9.1.0.dist-info', 'soupsieve-2.3.2.post1.dist-info', 
'nbclient-0.7.2.dist-info', 'python_json_logger-2.0.7.dist-info', 'jupyter_server_terminals-0.4.4.dist-info', 'jupyterlab-3.6.1.dist-info', 'vnpy_ctastrategy', 'pylab.py', 'defusedxml-0.7.1.dist-info', 'ipywidgets', 'typer', 'xvideos_dl', 'marshmallow-3.19.0.dist-info', 'youtube_dl', 'shiboken6-6.2.3.dist-info', 'argon2', 'jupyter_core-5.2.0.dist-info', 'pyparsing', 'debugpy', 'cursor', 'requests-2.28.2.dist-info', 'pickleshare-0.7.5.dist-info', 'vnpy_paperaccount', 'stack_data-0.6.2.dist-info', 'stack_data', 'past', 'langchain', 'QDarkStyle-3.0.3.dist-info', 'jinja2', 'nest_asyncio-1.5.6.dist-info', 'jupyter_events-0.6.3.dist-info', 'arrow', 'IPython', 'soupsieve', 'frozenlist', 'Send2Trash-1.8.0.dist-info', 'jupyter_client', 'parso-0.8.3.dist-info', 'seaborn', 'isoduration', 'executing-1.2.0.dist-info', 'six-1.16.0.dist-info', 'mypy_extensions-1.0.0.dist-info', 'EbookLib-0.18.dist-info', 'peewee-3.14.10.dist-info', 'decorator.py', 'filelock-3.9.0.dist-info', 'jupyterlab_widgets-3.0.5.dist-info', 'jupyterlab_plotly', 'llvmlite', 'ipywidgets-8.0.4.dist-info', '_cffi_backend.cpython-310-darwin.so', 'mutagen', 'jsonpointer.py', 'notebook_shim', 'numba-0.56.4.dist-info', 'future-0.18.3.dist-info', 'xvideos_dl-1.3.0.dist-info', 'colorama', 'cffi', 'vnpy_spreadtrading-1.1.4.dist-info', 'aiofiles-22.1.0.dist-info', 
'executing', 'jsonpointer-2.3.dist-info', 'ipykernel_launcher.py', 'llama_index', 'matplotlib_inline', 
'jupyterlab_server', 'jedi', 'send2trash', 'PySide6-6.2.3.dist-info', 'pip-23.0.1.dist-info', 'tests', 'absl', 'ipython_genutils', 'jupyter_server-2.4.0.dist-info', 'Babel-2.12.1.dist-info', 'fqdn', 'youtube_dl-2021.12.17.dist-info', 'vnpy_sqlite', 'fontTools', 'argon2_cffi-21.3.0.dist-info', 'idna', 'json5-0.9.11.dist-info', 'prometheus_client-0.16.0.dist-info', 'importlib_metadata', 'tqdm-4.64.0.dist-info', 
'_argon2_cffi_bindings', 'wheel', 'bs4', 'click', 'pickleshare.py', 'plotly-5.5.0.dist-info', 'tenacity', 'torch', 'comm', 'websockets', 'ipykernel-6.21.3.dist-info', 'aiosqlite', 'mpl_toolkits', 'pytz', 'jupyter_server_fileid-0.8.0.dist-info', 'filelock', 
'langchain-0.0.109.dist-info', 'pydantic-1.10.6.dist-info', 'tiktoken-0.3.1.dist-info', '__pycache__', 'jupyter_ydoc-0.2.3.dist-info', 'transformers-4.26.1.dist-info', 'nbclassic', 'arrow-1.2.3.dist-info', 'altgraph', 'sqlalchemy', 'pyqtgraph', 'shellingham', 'vnpy_ctastrategy-1.0.8.dist-info', 'regex', 'platformdirs-3.1.1.dist-info', 'Pillow-9.0.1.dist-info', 'jupyter_events', 'nbclient', 'plotly', 'numpy', 'jupyterlab_pygments', 'more_itertools', 'SQLAlchemy-1.4.46.dist-info', 'notebook-6.5.3.dist-info', 'pycparser-2.21.dist-info', 'charset_normalizer', 'PIL', 'requests', 'click-7.1.2.dist-info', 'cursor-1.3.5.dist-info', 'absl_py-1.0.0.dist-info', 'pure_eval-0.2.2.dist-info', 'pwiz.py', 'backcall-0.2.0.dist-info', 
'zipp.py', '_plotly_utils', 'ypy_websocket', 'matplotlib-3.5.1-py3.10-nspkg.pth', 'multidict', 'anyio', 'pip', 'cycler-0.11.0.dist-info', 'babel', 'marshmallow_enum', 'tornado', 'pvectorc.cpython-310-darwin.so', 'tomli', 'dataclasses_json', 'seaborn-0.11.2.dist-info', 'jupyter_server_fileid', 'PySide6', 'matplotlib_inline-0.1.6.dist-info', 'nbformat-5.7.3.dist-info', 'jupyterlab_server-2.20.0.dist-info', 'certifi', 'prompt_toolkit', 'pandocfilters.py', 'terminado-0.17.1.dist-info', 'pyinstaller_hooks_contrib-2023.0.dist-info', 'distutils-precedence.pth', 'pyqtgraph-0.12.3.dist-info', 'ipython_genutils-0.2.0.dist-info', 'vnpy_spreadtrading', 'weibo_spider-0.3.0.dist-info', 'sniffio', 'attr', 'pexpect', 'tiktoken', '_pyinstaller_hooks_contrib', 'transformers', 'jsonschema', 'jupyter_ydoc', 'tqdm', 'tzlocal-2.0.0.dist-info', 'PyYAML-6.0.dist-info', 
'yt_dlp-2023.3.4.dist-info', 'Brotli-1.0.9.dist-info', 'jupyterlab_pygments-0.2.2.dist-info', 'mypy_extensions.py', 'ffmpeg_python-0.2.0.dist-info', 'kiwisolver.cpython-310-darwin.so', 'torch-1.13.1.dist-info', 'tokenizers-0.13.2.dist-info', 'MarkupSafe-2.1.2.dist-info', '_yaml', 'huggingface_hub-0.13.2.dist-info', 'aiosqlite-0.18.0.dist-info', 'ptyprocess', 'six.py', 'jupyter_server_terminals', 'playhouse', 'vnpy_algotrading', 'pandas-1.3.5.dist-info', 'json5', 'tinycss2', 'jupyter_server_ydoc-0.6.1.dist-info', 'pexpect-4.8.0.dist-info', 'rfc3339_validator.py', 'macholib-1.16.2.dist-info', 'brotli.py', 'rich', 'cycler.py', 'cffi-1.15.1.dist-info', 'urllib3-1.26.9.dist-info', 'nbclassic-0.5.3.dist-info', 'regex-2022.10.31.dist-info', 'matplotlib', 'yaml', 'prometheus_client', 'vnpy', 'uri_template-1.2.0.dist-info', 'frozenlist-1.3.3.dist-info', 'attrs-22.2.0.dist-info', 'ebooklib', 'rfc3986_validator.py', 'jupyter_server', 'pythonjsonlogger', 
'tiktoken_ext', 'scipy-1.8.0.dist-info', 'numba', 'torchgen', 'urllib3', 'nbconvert', 'wheel-0.40.0.dist-info', 'comm-0.1.2.dist-info', 'rfc3986_validator-0.1.1.dist-info', 'tomli-2.0.1.dist-info', 'ipython-8.11.0.dist-info', 'integv-1.3.0.dist-info', 'rich-10.16.2.dist-info', 'widgetsnbextension-4.0.5.dist-info', 'uri_template', 'prompt_toolkit-3.0.38.dist-info', 'macholib', 'asttokens', 'jupyterlab', 'Cryptodome', 'argon2_cffi_bindings-21.2.0.dist-info', 
'setuptools', 'marshmallow_enum-1.5.1.dist-info', 'Pygments-2.14.0.dist-info', 'numpy-1.21.5.dist-info', 'pkg_resources', 'notebook', 'tenacity-8.2.2.dist-info', 'setuptools-57.0.0.dist-info', 'charset_normalizer-2.0.12.dist-info', '_distutils_hack', 'sniffio-1.3.0.dist-info', '_pyrsistent_version.py', 'pyzmq-25.0.1.dist-info', 'fastjsonschema-2.16.3.dist-info',
 'vnpy-3.0.0.dist-info', 'llvmlite-0.39.1.dist-info', 'notebook_shim-0.2.2.dist-info', 'terminado', 'tornado-6.2.dist-info', 'openai_whisper-20230308.dist-info', 'websockets-10.4.dist-info', 'parso', 'pydantic', 'ypy_websocket-0.8.2.dist-info', 'zipp-3.7.0.dist-info', 'QtPy-2.0.1.dist-info', 'mutagen-1.46.0.dist-info', 'webcolors.py', 'y_py-0.5.9.dist-info', 'beautifulsoup4-4.11.2.dist-info', 'anyio-3.6.2.dist-info', 'openai-0.27.2.dist-info', 'typer-0.3.2.dist-info', 
'peewee.py', 'psutil', 'traitlets', 'libfuturize', 'nbconvert-7.2.9.dist-info', 'matplotlib-3.5.1.dist-info', 'mistune-2.0.5.dist-info', 'future', 'typing_inspect.py', 'lxml', 'aiohttp-3.8.4.dist-info', 'typing_inspect-0.8.0.dist-info', 'scipy', 'vnpy_sqlite-1.0.0.dist-info', 'yarl', 'vnpy_ctabacktester', 'functorch', 'vnpy_paperaccount-1.0.1.dist-info', 'zmq', 'packaging-21.3.dist-info', 'yarl-1.8.2.dist-info', 'qtpy', 'vnpy_ctabacktester-1.0.5.dist-info', 
'kiwisolver-1.4.0.dist-info', 'libpasteurize', '_brotli.cpython-310-darwin.so', 
'plotlywidget', 'ipykernel', 'tzlocal', 'aiosignal', '_plotly_future_', 'jedi-0.18.2.dist-info', 
'y_py', 'pandas', 'dateutil', 'commonmark', 'nest_asyncio.py', 'openai', 'typing_extensions.py', 'whisper', 'gpt_index', 'platformdirs', 'llama_index-0.4.28.dist-info', 'jupyterlab_widgets', 'jupyter.py', 'deap-1.3.1.dist-info', 
'psutil-5.9.4.dist-info', 'traitlets-5.9.0.dist-info', 'jsonschema-4.17.3.dist-info', 'jupyter_server_ydoc']

Hopefully there is a solution. Many thanks!

@hauntsaninja
Collaborator

hauntsaninja commented Mar 15, 2023

Hm, thanks for the detailed environment information, but I'm not able to reproduce.

Can you set export TIKTOKEN_CACHE_DIR="" and retry? This environment variable will prevent tiktoken from using a cache for the vocab files it downloads.
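For code running in an already-started process (the traceback above comes from a notebook cell), a shell export made afterwards will not reach that process. A minimal sketch of setting the variable from Python instead, before the encoding is loaded; the helper names disable_tiktoken_cache and load_gpt2_uncached are made up for illustration and are not part of tiktoken:

```python
import os


def disable_tiktoken_cache() -> None:
    """Set TIKTOKEN_CACHE_DIR to an empty string for the current process.

    Per the comment above, this prevents tiktoken from using a cache
    for the vocab files it downloads.
    """
    os.environ["TIKTOKEN_CACHE_DIR"] = ""


def load_gpt2_uncached():
    """Hypothetical helper: load the gpt2 encoding with caching disabled."""
    disable_tiktoken_cache()
    # Imported here so the variable is already set when tiktoken reads it.
    import tiktoken

    return tiktoken.get_encoding("gpt2")
```

Calling load_gpt2_uncached() needs network access, since the vocab files are downloaded again rather than read from disk.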

Note that even in the simple publicly available tests this code path is tested:

enc = tiktoken.get_encoding("gpt2")

@mobilestack
Author

I set the environment variable, but that didn't solve it. Is there a specific path for the cache? I might need to delete the cache manually.

@hauntsaninja
Collaborator

The logic is here:

cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")

So typically python -c 'import tempfile; import os; print(os.path.join(tempfile.gettempdir(), "data-gym-cache"))'
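A rough sketch of clearing that directory from Python, assuming neither TIKTOKEN_CACHE_DIR nor another override is set so the default path above applies. The function names default_cache_dir and clear_cache are illustrative only:

```python
import os
import shutil
import tempfile


def default_cache_dir() -> str:
    """Return the default cache path built by the line quoted above."""
    return os.path.join(tempfile.gettempdir(), "data-gym-cache")


def clear_cache(cache_dir: str) -> bool:
    """Delete cache_dir and everything in it.

    Returns True if the directory existed and was removed,
    False if there was nothing to delete.
    """
    if not os.path.isdir(cache_dir):
        return False
    shutil.rmtree(cache_dir)
    return True
```

Running clear_cache(default_cache_dir()) removes the cached vocab files, so the next get_encoding call has to download them again.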

@hauntsaninja
Collaborator

If that doesn't help, maybe you could set a breakpoint and see what the difference between those two dictionaries is.
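One way to inspect the difference at that breakpoint is a small helper like the one below. This is only a sketch; diff_dicts is a hypothetical name, and the two arguments would be bpe_ranks and encoder_json_loaded from the failing assert:

```python
def diff_dicts(left: dict, right: dict):
    """Summarize how two dictionaries differ.

    Returns three dicts: entries only in left, entries only in right,
    and keys present in both whose values differ (as (left, right) pairs).
    """
    only_left = {k: left[k] for k in left.keys() - right.keys()}
    only_right = {k: right[k] for k in right.keys() - left.keys()}
    changed = {
        k: (left[k], right[k])
        for k in left.keys() & right.keys()
        if left[k] != right[k]
    }
    return only_left, only_right, changed
```

In a debugger session, diff_dicts(bpe_ranks, encoder_json_loaded) would show whether tokens are missing on one side or mapped to different ranks.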

@mobilestack
Author

Woo, that works! After deleting the cached files, it runs correctly now. Thanks a lot!

The file may have been corrupted during or after the download. I'm not sure whether the cached file should be checked before it is used, or whether the assert bpe_ranks == encoder_json_loaded line could print more information when it fails.
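As an illustration of the first idea, a cached file could be compared against a known SHA-256 digest before it is trusted. This is a sketch under assumptions, not how tiktoken behaves in this thread: the function names are invented here, and the expected digest would have to come from a trusted source.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_file_is_valid(path: str, expected_sha256: str) -> bool:
    """Hypothetical check: True only if the file matches the expected digest."""
    return sha256_of_file(path) == expected_sha256
```

A caller could delete and re-download the file whenever cached_file_is_valid returns False, instead of failing later at the assert.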
