Add huggingface tokenizer support for splitting text #45

hwchase17 · 2022-11-01T03:54:31Z

It would be good to have some methods that split text on tokens, as in the OpenAI example:

https://github.com/openai/openai-cookbook/blob/459afa7d9bf026c4434f54458dc7d9e7d9f9f5fe/examples/Obtain_dataset.ipynb
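
For illustration, token-based splitting could look roughly like the sketch below. It is not the code from the linked notebook or from any library: the `split_on_tokens` function, its parameters, and the whitespace stand-in tokenizer are all hypothetical names chosen for this example. The idea is to encode the text into tokens, cut the token list into windows of at most `chunk_size` tokens (optionally overlapping), and decode each window back to text.

```python
from typing import Callable, List, Sequence


def split_on_tokens(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    chunk_size: int,
    chunk_overlap: int = 0,
) -> List[str]:
    """Split text into chunks of at most chunk_size tokens (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = list(encode(text))
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in tokenizer (an assumption for this sketch): one token per
# whitespace-separated word. A real tokenizer's encode/decode would go here.
def whitespace_encode(text: str) -> List[str]:
    return text.split()


def whitespace_decode(tokens: Sequence[str]) -> str:
    return " ".join(tokens)


chunks = split_on_tokens(
    "a b c d e", whitespace_encode, whitespace_decode, chunk_size=3, chunk_overlap=1
)
```

With a Hugging Face tokenizer, the `encode` and `decode` methods of a loaded tokenizer could presumably be passed in place of the whitespace functions. Note that `encode` adds special tokens by default for some models, so calling it with `add_special_tokens=False` may be needed to keep them out of the chunks.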

hadsed · 2022-11-04T15:46:49Z

NLTK's sentence and word tokenizers are pretty robust and fast. I'm curious how we could measure their effectiveness for splitting.

(Leaving the module link here for easy reference)

hwchase17 changed the title ~~Add more complex ways of splitting text, rather than just character based.~~ Add huggingface tokenizer support for splitting text Nov 8, 2022

hwchase17 self-assigned this Nov 8, 2022

hwchase17 linked a pull request Nov 10, 2022 that will close this issue

huggingface tokenizer #75

Merged

hwchase17 closed this as completed in #75 Nov 13, 2022
