gpt_tokenize: unknown token '?' #13

Closed
anonimo28 opened this issue May 9, 2023 · 24 comments

Comments

@anonimo28 commented May 9, 2023

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
(the line above repeats 18 times in total)
[1] 32658 killed python3 privateGPT.py

@moneymouse

Have you fixed it? I'm hitting this bug too.

@bbscout commented May 9, 2023

I'm getting the same error:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat throughout, with occasional variants such as 'ť' and 'Ł')

@x4g4p3x commented May 9, 2023

I get the same error, but the query still returns a reply as it should.

@nssiwi commented May 9, 2023

Same error.

@kamuridesu

up

@su77ungr

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant vector storage. It also runs locally and serves requests faster. This solved the issue for me.

@imartinez (Collaborator) commented May 10, 2023

Sounds great! Would you open a PR, @su77ungr?

@lsotillos commented May 10, 2023

Same error:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat throughout, with occasional variants such as '£' and 'Ø')

@imartinez (Collaborator) commented May 10, 2023

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant vector storage. It also runs locally and serves requests faster. This solved the issue for me.

Tested it myself. It doesn't solve the "unknown token" warning, and the result is neither faster nor more accurate than with Chroma.

@imartinez (Collaborator) commented May 10, 2023

The error has to do with symbols present in the original docs, and there are definitely some in the test document used by this repo. But it is just a warning; it doesn't prevent the tool from working.
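
If you want to confirm which symbols in a document trigger the warning, a quick check along these lines should work (the path assumes the bundled test document):

$ python -c "print(sorted({c for c in open('source_documents/state_of_the_union.txt', encoding='utf-8').read() if ord(c) > 127}))"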

@su77ungr commented May 10, 2023

It's not possible to work with those characters using the default model; this has nothing to do with the vector storage. You have to use a different model. But Qdrant does not fail on them, e.g. on Chinese text.

Qdrant should be faster in the benchmark here. I opted for it for ease of implementation. I'm going to use a different retrieval algorithm too; that's the bottleneck. Also, Qdrant will be way faster with a better implementation like this.

This led me to open my own implementation.

@assuredclean

Me too:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat throughout, with occasional variants such as '£' and 'Ø')

@dennydream

I got similar output, but the characters were unprintable. It also fails on some Unicode characters.

@imartinez (Collaborator)

I'll keep an eye on the improvements you pointed out, @su77ungr, and also on your fork. Thanks for sharing!

@Amarbo commented May 12, 2023

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant vector storage. It also runs locally and serves requests faster. This solved the issue for me.

What do you mean by "Qdrant vector storage"? Can you explain, please? I'm a newbie.

@tk42 commented May 13, 2023

I think MODEL_TYPE in .env does not match the actual model. I got this error when running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.
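
For reference, the two settings need to agree with each other; a minimal sketch of the relevant .env lines (the model path is illustrative):

MODEL_TYPE=LlamaCpp                    # must match the format of the file in MODEL_PATH
MODEL_PATH=models/ggml-model-q4_0.bin  # a LlamaCpp-format model; use MODEL_TYPE=GPT4All for GPT4All models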

@GitEin11 commented May 14, 2023

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
[1] 32658 killed python3 privateGPT.py

The last line, "killed", means that, just like mine, your PC doesn't have enough memory (hehe, potato PC).

@mabry1985

I think MODEL_TYPE in .env does not match the actual model. I got this error when running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.

This fixed it for me

@JMans15 commented May 17, 2023

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
[1] 32658 killed python3 privateGPT.py

The last line, "killed", means that, just like mine, your PC doesn't have enough memory (hehe, potato PC).

I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about a second, but when I ask a question (about the suggested document) it freezes my entire PC and the process gets killed (on Fedora 37).

@GitEin11

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
[1] 32658 killed python3 privateGPT.py

The last line, "killed", means that, just like mine, your PC doesn't have enough memory (hehe, potato PC).

I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about a second, but when I ask a question (about the suggested document) it freezes my entire PC and the process gets killed (on Fedora 37).

The way I understand it, the RAM usage goes brrrrr until the limit is reached and the process gets killed. That RAM usage should be prioritized over the unknown-token messages.
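
If you want to check whether memory really is the limit before loading the model, a quick sketch (assumes the third-party psutil package is installed):

import psutil

mem = psutil.virtual_memory()
# print how much RAM is free right now; a quantized 7B model typically wants several GiB on top of the OS
print(f"available RAM: {mem.available / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB")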

@JMans15 commented May 17, 2023

That's what I suspected too. I just tried running it with __NV_PRIME_RENDER_OFFLOAD=1 and __GLX_VENDOR_LIBRARY_NAME=nvidia (I don't even know if it's supposed to run on the GPU), and now it just freezes my PC until I kill it manually.

@late7 commented May 18, 2023

Use 'python privateGPT.py 2>/dev/null' to start privateGPT.
Appending 2>/dev/null to the command suppresses the error messages (stderr is file descriptor 2). This is far from a fix, but it adds usability. It seems to work in Windows Git Bash as well :-)
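
If you'd rather silence the messages from inside the script, redirecting the stderr file descriptor at the OS level works too; a minimal sketch (the warnings come from native code, so reassigning sys.stderr alone may not catch them):

import os

# point file descriptor 2 (stderr) at /dev/null; tokenizer warnings vanish from here on
devnull = os.open(os.devnull, os.O_WRONLY)
os.dup2(devnull, 2)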

@uogbuji commented May 23, 2023

The default State of the Union doc does have some non-ASCII characters. You can check pretty easily:

$ python -c "import chardet; print(chardet.detect(open('source_documents/state_of_the_union.txt', 'rb').read()))"
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

I suspect something in the processing chain (probably whatever is tokenizing the embeddings for the prompt) doesn't like non-ASCII UTF-8 tokens, which is far from optimal. It may well make the construction of the prompt lossy, and useless if you're working with non-English content.
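
To see exactly which characters are involved and how often, a short sketch (standard library only; the path assumes the repo's default layout):

import collections
import unicodedata

text = open('source_documents/state_of_the_union.txt', encoding='utf-8').read()
# count every character outside the 7-bit ASCII range
counts = collections.Counter(c for c in text if ord(c) > 127)
for char, n in counts.most_common():
    print(f"{char!r} U+{ord(char):04X} {unicodedata.name(char, 'UNKNOWN')} x{n}")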

@veyselyenilmez

Running this fixed my issue:

find /path/to/folder -type f -name "*.txt" -exec sh -c 'iconv -f utf-8 -t utf-8 -c "{}" | sed -e "s/[^[:print:]]/?/g" -e "s/[Çç]/C/g" -e "s/[Ğğ]/G/g" -e "s/[İı]/I/g" -e "s/[Öö]/O/g" -e "s/[Şş]/S/g" -e "s/[Üü]/U/g" > "{}.tmp" && mv "{}.tmp" "{}"' \;
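
A rough Python equivalent of that shell pipeline, for anyone not on a Unix shell (the transliteration table mirrors the sed expressions above; the folder path is illustrative):

import pathlib

# map the Turkish characters from the command above to ASCII look-alikes
TRANSLIT = str.maketrans('ÇçĞğİıÖöŞşÜü', 'CcGgIiOoSsUu')

for path in pathlib.Path('source_documents').glob('*.txt'):
    text = path.read_text(encoding='utf-8', errors='ignore').translate(TRANSLIT)
    # replace any remaining non-ASCII or non-printable characters with '?'
    text = ''.join(c if ord(c) < 128 and (c.isprintable() or c in '\n\r\t') else '?' for c in text)
    path.write_text(text, encoding='utf-8')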
