
docs: get_num_tokens - default tokenizer #2439

Closed
nikitajz opened this issue Apr 5, 2023 · 3 comments

Comments

nikitajz (Contributor) commented Apr 5, 2023

Hi!

  1. Upd: Fixed in docs: update tokenizer notice in llms/getting_started #2641
     I noticed the following note about the get_num_tokens function in the Models -> LLMs documentation:

     > Notice that by default the tokens are estimated using a HuggingFace tokenizer.

     This does not look quite right: the HuggingFace tokenizer is only used on legacy Python versions (< 3.8), so the note is probably outdated.

  2. There is also a model-to-encoding mapping in the tiktoken package that could be reused in the get_num_tokens function:
     https://github.com/openai/tiktoken/blob/46287bfa493f8ccca4d927386d7ea9cc20487525/tiktoken/model.py#L13-L53
AlexTs10 commented Apr 6, 2023

I took a look at the repo, and the OpenAI wrapper here uses the tiktoken library. If that doesn't answer your question, could you point to the exact files where you saw this? Your docs link doesn't show the exact code.

nikitajz (Contributor, Author) commented Apr 9, 2023

> I took a look at the repo, and the OpenAI wrapper here uses the tiktoken library. If that doesn't answer your question, could you point to the exact files where you saw this? Your docs link doesn't show the exact code.

I've added a PR with a tiny fix: #2641
Hopefully everything is correct now.

@nikitajz nikitajz changed the title get_num_tokens - default tokenizer docs: get_num_tokens - default tokenizer Apr 10, 2023
hwchase17 pushed a commit that referenced this issue Apr 11, 2023
A tiny update in docs which is spotted here:
#2439
wertycn pushed a commit to wertycn/langchain-zh that referenced this issue Apr 26, 2023
dosubot bot commented Sep 4, 2023

Hi, @nikitajz! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue was about a documentation error regarding the default tokenizer used in the get_num_tokens function. You mentioned that the note about HuggingFace tokenizer being the default is outdated and provided a link to the correct mapping in the tiktoken package. AlexTs10 confirmed that the wrapper for openai uses the tiktoken library, which is the correct mapping for the default tokenizer in the get_num_tokens function. You also added a pull request with a small fix to address the documentation error.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution and please don't hesitate to reach out if you have any further questions or concerns!

dosubot bot added the "stale" label Sep 4, 2023
dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 18, 2023
dosubot bot removed the "stale" label Sep 18, 2023