gpt_tokenize: unknown token '?' #13

Closed
anonimo28 opened this issue May 9, 2023 · 24 comments

Comments

@anonimo28 commented May 9, 2023

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
(the line above repeats 18 times in total)
[1] 32658 killed python3 privateGPT.py

@moneymouse

Have you fixed it? I'm hitting this bug too.

@bbscout commented May 9, 2023

I'm getting the same error:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat throughout, with occasional variants such as 'ť' and 'Ł')

@x4g4p3x commented May 9, 2023

I get the same error, but the query still returns a reply as it should.

@nssiwi commented May 9, 2023

Same error.

@kamuridesu

up

@su77ungr

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant vector storage. It also runs locally and serves requests faster. This solved the issue for me.

@imartinez (Collaborator) commented May 10, 2023

Sounds great! Would you open a PR, @su77ungr?

@lsotillos commented May 10, 2023

Same error:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat throughout, with occasional variants such as '£' and 'Ø')

@imartinez (Collaborator) commented May 10, 2023

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant vector storage. It also runs locally and serves requests faster. This solved the issue for me.

Tested it myself. It doesn't solve the "unknown token" warning, and the result is neither faster nor more accurate than with Chroma.

@imartinez (Collaborator) commented May 10, 2023

The error has to do with symbols present in the original docs, and there are definitely some in the test document used by this repo. But it is just a warning; it doesn't prevent the tool from working.
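
If you want to confirm which symbols in a document trigger the warning, a quick check along these lines should work (the path assumes the bundled test document):

$ python -c "print(sorted({c for c in open('source_documents/state_of_the_union.txt', encoding='utf-8').read() if ord(c) > 127}))"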

@su77ungr commented May 10, 2023

It's not possible to work with those characters using the default model; this has nothing to do with the vector storage. You have to use a different model. But Qdrant does not fail on them, e.g. on Chinese text.

Qdrant should be faster in the benchmark here. I opted for it for ease of implementation. I'm going to use a different retrieval algorithm too; that's the bottleneck. Also, Qdrant will be way faster with a better implementation like this.

This led me to open my own implementation.

@assuredclean

Me too:

gpt_tokenize: unknown token 'Ô'
gpt_tokenize: unknown token 'Ç'
gpt_tokenize: unknown token 'Ö'
(the three lines above repeat throughout, with occasional variants such as '£' and 'Ø')

@dennydream

I got similar output, but the characters were unprintable. It also fails on some Unicode characters.

@imartinez (Collaborator)

I'll keep an eye on the improvements you pointed out, @su77ungr, and also on your fork. Thanks for sharing!

@Amarbo commented May 12, 2023

https://github.com/su77ungr/CASALIOY

I hard-forked the repository and switched to Qdrant vector storage. It also runs locally and serves requests faster. This solved the issue for me.

What do you mean by "Qdrant vector storage"? Can you explain, please? I'm a newbie.

@tk42 commented May 13, 2023

I think MODEL_TYPE in .env does not match the actual model. I got this error when running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.
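
For reference, the two settings need to agree with each other; a minimal sketch of the relevant .env lines (the model path is illustrative):

MODEL_TYPE=LlamaCpp                    # must match the format of the file in MODEL_PATH
MODEL_PATH=models/ggml-model-q4_0.bin  # a LlamaCpp-format model; use MODEL_TYPE=GPT4All for GPT4All models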

@GitEin11 commented May 14, 2023

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
[1] 32658 killed python3 privateGPT.py

The last line, "killed", means that, just like mine, your PC doesn't have enough memory (hehe, potato PC).

@mabry1985

I think MODEL_TYPE in .env does not match the actual model. I got this error when running a LlamaCpp model with MODEL_TYPE=GPT4All; it disappeared when I set MODEL_TYPE=LlamaCpp.

This fixed it for me

@JMans15 commented May 17, 2023

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
[1] 32658 killed python3 privateGPT.py

The last line, "killed", means that, just like mine, your PC doesn't have enough memory (hehe, potato PC).

I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about a second, but when I ask a question (about the suggested document) it freezes my entire PC and the process gets killed (on Fedora 37).

@GitEin11

gpt_tokenize: unknown token '?'
gpt_tokenize: unknown token '?'
[1] 32658 killed python3 privateGPT.py

The last line, "killed", means that, just like mine, your PC doesn't have enough memory (hehe, potato PC).

I get the exact same issue even though I have 32 GB of RAM; isn't that enough? ingest.py takes about a second, but when I ask a question (about the suggested document) it freezes my entire PC and the process gets killed (on Fedora 37).

The way I understand it, the RAM usage goes brrrrr until the limit is reached and the process gets killed. That RAM usage should be prioritized over the unknown-token messages.
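
If you want to check whether memory really is the limit before loading the model, a quick sketch (assumes the third-party psutil package is installed):

import psutil

mem = psutil.virtual_memory()
# print how much RAM is free right now; a quantized 7B model typically wants several GiB on top of the OS
print(f"available RAM: {mem.available / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB")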

@JMans15 commented May 17, 2023

That's what I suspected too. I just tried running it with __NV_PRIME_RENDER_OFFLOAD=1 and __GLX_VENDOR_LIBRARY_NAME=nvidia (I don't even know if it's supposed to run on the GPU), and now it just freezes my PC until I kill it manually.

@late7 commented May 18, 2023

Use 'python privateGPT.py 2>/dev/null' to start privateGPT.
Appending 2>/dev/null to the command suppresses the error messages (stderr is file descriptor 2). This is far from a fix, but it adds usability. It seems to work in Windows Git Bash as well :-)
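
If you'd rather silence the messages from inside the script, redirecting the stderr file descriptor at the OS level works too; a minimal sketch (the warnings come from native code, so reassigning sys.stderr alone may not catch them):

import os

# point file descriptor 2 (stderr) at /dev/null; tokenizer warnings vanish from here on
devnull = os.open(os.devnull, os.O_WRONLY)
os.dup2(devnull, 2)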

@uogbuji commented May 23, 2023

The default State of the Union doc does have some non-ASCII characters. You can check pretty easily:

$ python -c "import chardet; print(chardet.detect(open('source_documents/state_of_the_union.txt', 'rb').read()))"
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

I suspect something in the processing chain (probably whatever is tokenizing the embeddings for the prompt) doesn't like non-ASCII UTF-8 tokens, which is far from optimal. It may well make the construction of the prompt lossy, and useless if you're working with non-English content.
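
To see exactly which characters are involved and how often, a short sketch (standard library only; the path assumes the repo's default layout):

import collections
import unicodedata

text = open('source_documents/state_of_the_union.txt', encoding='utf-8').read()
# count every character outside the 7-bit ASCII range
counts = collections.Counter(c for c in text if ord(c) > 127)
for char, n in counts.most_common():
    print(f"{char!r} U+{ord(char):04X} {unicodedata.name(char, 'UNKNOWN')} x{n}")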

@veyselyenilmez

Running this fixed my issue:

find /path/to/folder -type f -name "*.txt" -exec sh -c 'iconv -f utf-8 -t utf-8 -c "{}" | sed -e "s/[^[:print:]]/?/g" -e "s/[Çç]/C/g" -e "s/[Ğğ]/G/g" -e "s/[İı]/I/g" -e "s/[Öö]/O/g" -e "s/[Şş]/S/g" -e "s/[Üü]/U/g" > "{}.tmp" && mv "{}.tmp" "{}"' \;
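
A rough Python equivalent of that shell pipeline, for anyone not on a Unix shell (the transliteration table mirrors the sed expressions above; the folder path is illustrative):

import pathlib

# map the Turkish characters from the command above to ASCII look-alikes
TRANSLIT = str.maketrans('ÇçĞğİıÖöŞşÜü', 'CcGgIiOoSsUu')

for path in pathlib.Path('source_documents').glob('*.txt'):
    text = path.read_text(encoding='utf-8', errors='ignore').translate(TRANSLIT)
    # replace any remaining non-ASCII or non-printable characters with '?'
    text = ''.join(c if ord(c) < 128 and (c.isprintable() or c in '\n\r\t') else '?' for c in text)
    path.write_text(text, encoding='utf-8')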
