Bump version, sync codebase
hauntsaninja committed Mar 28, 2023
1 parent 82facf9 commit e1c661e
Showing 4 changed files with 23 additions and 4 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,10 @@
 
 This is the changelog for the open source version of tiktoken.
 
+## [v0.3.3]
+- `tiktoken` will now make a best effort attempt to replace surrogate pairs with the corresponding
+  Unicode character and will replace lone surrogates with the Unicode replacement character.
+
 ## [v0.3.2]
 - Add encoding for GPT-4
 
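To illustrate the changelog entry above, here is a minimal sketch of the UTF-16 round trip on its own, using only the Python standard library. The helper name `replace_surrogates` is hypothetical and is not part of tiktoken; it only wraps the expression that this commit adds to `tiktoken/core.py`.

```python
def replace_surrogates(text: str) -> str:
    """Hypothetical helper (not a tiktoken API).

    Encoding with "surrogatepass" lets surrogate code points through as raw
    UTF-16 code units. Decoding then joins valid high/low pairs into one
    character and, with "replace", turns anything invalid into U+FFFD.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


paired = replace_surrogates("\ud83d\udc4d")  # high + low surrogate
lone = replace_surrogates("\ud83d")          # high surrogate with no partner

print(paired == "\U0001F44D")  # True: the pair became a single code point
print(lone == "\ufffd")        # True: the lone surrogate became U+FFFD
```

Text that contains no surrogates passes through this round trip unchanged.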
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "tiktoken"
-version = "0.3.2"
+version = "0.3.3"
 edition = "2021"
 rust-version = "1.57.0"

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "tiktoken"
-version = "0.3.2"
+version = "0.3.3"
 description = "tiktoken is a fast BPE tokeniser for use with OpenAI's models"
 readme = "README.md"
 license = {file = "LICENSE"}
19 changes: 17 additions & 2 deletions tiktoken/core.py
@@ -65,7 +65,12 @@ def encode_ordinary(self, text: str) -> list[int]:
         >>> enc.encode_ordinary("hello world")
         [31373, 995]
         """
-        return self._core_bpe.encode_ordinary(text)
+        try:
+            return self._core_bpe.encode_ordinary(text)
+        except UnicodeEncodeError:
+            # See comment in encode
+            text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
+            return self._core_bpe.encode_ordinary(text)

     def encode(
         self,
@@ -111,7 +116,17 @@ def encode(
             if match := _special_token_regex(disallowed_special).search(text):
                 raise_disallowed_special_token(match.group())

-        return self._core_bpe.encode(text, allowed_special)
+        try:
+            return self._core_bpe.encode(text, allowed_special)
+        except UnicodeEncodeError:
+            # BPE operates on bytes, but the regex operates on unicode. If we pass a str that is
+            # invalid UTF-8 to Rust, it will rightfully complain. Here we do a quick and dirty
+            # fixup for any surrogate pairs that may have sneaked their way into the text.
+            # Technically, this introduces a place where encode + decode doesn't roundtrip a Python
+            # string, but given that this is input we want to support, maybe that's okay.
+            # Also we use errors="replace" to handle weird things like lone surrogates.
+            text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
+            return self._core_bpe.encode(text, allowed_special)

     def encode_ordinary_batch(self, text: list[str], *, num_threads: int = 8) -> list[list[int]]:
         """Encodes a list of strings into tokens, in parallel, ignoring special tokens.
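The try/except fallback in the hunk above can be sketched without the Rust extension. In the sketch below, `strict_backend_encode` is a hypothetical stand-in for `self._core_bpe.encode`: it returns UTF-8 byte values rather than BPE tokens, and it is assumed to fail the same way the diff expects (with `UnicodeEncodeError`) when the string contains surrogates. `encode_with_fallback` is likewise an illustrative name, not a tiktoken function.

```python
from __future__ import annotations


def strict_backend_encode(text: str) -> list[int]:
    """Hypothetical stand-in for the Rust core: accepts only valid UTF-8."""
    # str.encode("utf-8") raises UnicodeEncodeError if text contains surrogates.
    return list(text.encode("utf-8"))


def encode_with_fallback(text: str) -> list[int]:
    """Try the strict path first; repair surrogates only if it fails."""
    try:
        return strict_backend_encode(text)
    except UnicodeEncodeError:
        # Same fixup as in the diff: pairs are merged, lone surrogates -> U+FFFD.
        fixed = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
        return strict_backend_encode(fixed)


ordinary = encode_with_fallback("hello")
repaired = encode_with_fallback("\ud83d")

# The repaired result decodes to U+FFFD, not to the original lone surrogate,
# which is the round-trip caveat mentioned in the code comment.
print(bytes(repaired).decode("utf-8") == "\ud83d")  # False
```

Because the fixup runs only in the `except` branch, input that is already valid takes the original path and pays no extra cost.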