gpt_tokenize: unknown token '?' #13
Comments
Have you fixed it? I'm running into this bug too.
I'm getting the same error:
gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat several more times, with occasional 'ť' and 'Ł' tokens mixed in)
I get the same error but the query will still get a reply as it should.
same error
up
I hard forked the repository and switched to Qdrant vector storage. It also runs locally, and requests are faster. This solved the issue for me.
Sounds great! Would you open a PR, @su77ungr?
same error: gpt_tokenize: unknown token 'Ô'
Tested it myself. It doesn't solve the "unknown token" warning, and the result is neither faster nor more accurate than using Chroma.
The error has to do with symbols being present in the original doc, and there definitely are some of those in the test document used by this repo. But it is just a warning; it doesn't prevent the tool from working.
It's not possible to work with those characters with the default model; this has nothing to do with the vector storage. You have to use a different model. But Qdrant does not fail on them, e.g. on Chinese text, and it should be faster on the benchmark here. That's what led me to start my own implementation.
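For anyone wondering what the swap looks like in practice: below is a minimal sketch of using Qdrant in place of Chroma in a LangChain-style ingest step. The embedding model, chunk sizes, paths, and collection name are assumptions for illustration, not the actual fork's code.

```python
# Minimal sketch: Qdrant (local, on-disk mode) as the vector store.
# Assumes the classic LangChain API; names and paths are illustrative only.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant

# Load and chunk the document, as the ingest step would.
docs = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(
    TextLoader("source_documents/state_of_the_union.txt").load()
)

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Local on-disk mode: no separate Qdrant server process needed.
db = Qdrant.from_documents(
    docs,
    embeddings,
    path="./qdrant_db",
    collection_name="documents",
)

retriever = db.as_retriever()
```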
Me too: gpt_tokenize: unknown token 'Ô'
I got a similar result, but the tokens were unprintable. It'll also fail with some Unicode characters.
I'll keep an eye on the improvements you pointed out, @su77ungr, and also on your fork. Thanks for sharing!
What do you mean by "Qdrant vector storage"? Can you explain, please? I'm a newbie.
I think that the last line "killed" means that, just like mine, you have a potato PC hehe, not enough memory.
This fixed it for me |
I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes like 1 second, but when I ask a question (on the suggested document), it just freezes my entire PC and the process gets killed (on Fedora 37).
The way I understand it, the RAM usage goes brrrrr and the process gets killed once the limit is reached. That RAM problem should be prioritized over the unknown-token message (see the sketch below for a quick way to check available memory).
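A minimal sketch for checking whether enough RAM is free before loading the model. It assumes Linux (it reads /proc/meminfo), and the 8 GiB threshold is a guess, not a documented requirement:

```python
# Sketch: warn if available RAM looks too low to load the model.
# Linux-only (relies on /proc/meminfo); the threshold below is an assumption.

def available_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # value is in kB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

if __name__ == "__main__":
    avail = available_gib()
    print(f"Available RAM: {avail:.1f} GiB")
    if avail < 8:  # assumed minimum; adjust for your model
        print("Warning: the model may get OOM-killed with this little free memory.")
```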
That's what I suspected too, I just tried running it with
Use 'python privateGPT.py 2>/dev/null' to start privateGPT. The '2>/dev/null' redirects stderr, where the tokenizer warnings are printed, so they get discarded.
The default SotU doc does have some non-ASCII chars. You can check pretty easily:

$ python -c "import chardet; print(chardet.detect(open('source_documents/state_of_the_union.txt', 'rb').read()))"
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

I suspect something in the processing chain (probably whatever is tokenizing the embeddings for the prompt) doesn't like non-ASCII UTF-8 tokens, which is very suboptimal. It may well be making the construction of the prompt lossy, and useless if you're working with non-English content.
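If you want to silence the warnings at the source, one workaround is to transliterate the document to ASCII before ingesting. This is only a sketch using the standard library, assuming you can tolerate lossy conversion (accents dropped, non-Latin characters removed):

```python
# Sketch: strip non-ASCII characters from a source document before ingesting.
# Lossy: accented characters lose their accents; unmapped ones are dropped entirely.
import unicodedata

src = "source_documents/state_of_the_union.txt"

with open(src, encoding="utf-8") as f:
    text = f.read()

# Decompose accented characters (é -> e + combining accent), then drop non-ASCII.
ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

with open(src, "w", encoding="utf-8") as f:
    f.write(ascii_text)
```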
Running the 2>/dev/null command fixed my issue.
|
gpt_tokenize: unknown token '?'
(the line above repeats many more times)
[1] 32658 killed python3 privateGPT.py