不支持中文吗 (Doesn't it support Chinese?) #207

Closed
liuyang77886 opened this issue May 16, 2023 · 18 comments
Labels
primordial Related to the primordial version of PrivateGPT, which is now frozen in favour of the new PrivateGPT

Comments

@liuyang77886

I uploaded a Chinese PDF and entered a question in Chinese, but the result came out as garbled text.

@K-tang-mkv

Try swapping the embedding model for GanymedeNil/text2vec-large-chinese.

@liuyang77886
Author

Where do I change that?
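
In the primordial privateGPT this lives in .env: ingest.py reads EMBEDDINGS_MODEL_NAME and passes it to HuggingFaceEmbeddings. A minimal sketch of the change (key name as in the stock example.env; everything else stays the same):

# .env
EMBEDDINGS_MODEL_NAME=GanymedeNil/text2vec-large-chinese

# ingest.py already builds the embedder from that value:
# embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)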

@XJXxiaohao

Did you get Chinese working?

@hnuzhoulin

> Try swapping the embedding model for GanymedeNil/text2vec-large-chinese

The error is below:

Loading documents from source_documents
Loaded 1 documents from source_documents
Split into 366 chunks of text (max. 500 characters each)
No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/GanymedeNil_text2vec-large-chinese. Creating a new one with MEAN pooling.
Using embedded DuckDB with persistence: data will be stored in: db
Traceback (most recent call last):
  File "ingest.py", line 97, in <module>
    main()
  File "ingest.py", line 91, in main
    db = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory, client_settings=CHROMA_SETTINGS)
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/langchain/vectorstores/chroma.py", line 413, in from_documents
    return cls.from_texts(
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/langchain/vectorstores/chroma.py", line 381, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/langchain/vectorstores/chroma.py", line 159, in add_texts
    self._collection.add(
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/chromadb/api/models/Collection.py", line 101, in add
    self._client._add(
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/chromadb/api/local.py", line 223, in _add
    self._db.add_incremental(collection_uuid, added_uuids, embeddings)
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/chromadb/db/clickhouse.py", line 605, in add_incremental
    index.add(uuids, embeddings)
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/chromadb/db/index/hnswlib.py", line 132, in add
    self._check_dimensionality(embeddings)
  File "/home/zhoulin/privateGPT/venv/lib/python3.8/site-packages/chromadb/db/index/hnswlib.py", line 119, in _check_dimensionality
    raise InvalidDimensionException(
chromadb.errors.InvalidDimensionException: Dimensionality of (1024) does not match index dimensionality (384)

@schiffma

I had the same error; deleting all the files/subdirectories in the db directory before running ingest.py fixed it.
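
A minimal sketch of that cleanup, assuming the stock PERSIST_DIRECTORY=db from .env (the next ingest.py run then rebuilds the index from scratch with the new embedding dimensionality):

import shutil

# wipe the old Chroma index so ingest.py recreates it
shutil.rmtree("db", ignore_errors=True)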

@propheteeeee

> Try swapping the embedding model for GanymedeNil/text2vec-large-chinese

Could you explain how you load that embedding model from a local directory? I can't get HuggingFaceEmbeddings to load it; there doesn't seem to be a parameter for specifying a path.
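
For what it's worth, model_name is forwarded straight to sentence_transformers.SentenceTransformer, which accepts a local directory as well as a Hub id; a sketch (the local path below is hypothetical):

from langchain.embeddings import HuggingFaceEmbeddings

# point model_name at a directory containing the downloaded
# sentence-transformers files (config.json, pytorch_model.bin, ...)
embeddings = HuggingFaceEmbeddings(model_name="/models/text2vec-large-chinese")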

@hnuzhoulin

> I had the same error; deleting all the files/subdirectories in the db directory before running ingest.py fixed it.

@schiffma thanks, that solved the error. But when I ask a question I now get many gpt_tokenize: unknown token '�' lines before the response, and the response is not what I want.

@NyyCui

NyyCui commented May 19, 2023

Do you just download the .bin file and then point the model path at it? After I swapped it in, it no longer runs.

@lidedongsn

lidedongsn commented May 19, 2023

llama_init_from_file: kv self size = 1000.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from '/root/git/AIGC/privateGPT/models/text2vec-large-chinese.bin' - please wait ...
gptj_model_load: invalid model file '/root/git/AIGC/privateGPT/models/text2vec-large-chinese.bin' (bad magic)
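
That bad magic failure suggests the PyTorch embedding checkpoint was placed under MODEL_PATH, which the GPT4All/GPT-J loader expects to be a ggml file. In the stock .env the LLM and the embedding model are separate keys; a sketch (mirroring the config hanwsf posts below):

# .env
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin          # ggml LLM for generation
EMBEDDINGS_MODEL_NAME=GanymedeNil/text2vec-large-chinese  # sentence-transformers model for ingestion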

@NyyCui

NyyCui commented May 19, 2023

> gptj_model_load: invalid model file '/root/git/AIGC/privateGPT/models/text2vec-large-chinese.bin' (bad magic)

Yes, I get the same error. Not sure where it's going wrong.

@hanwsf

hanwsf commented May 19, 2023

In .env:
PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=DMetaSoul/sbert-chinese-general-v2 <- a model name copied straight from https://huggingface.co/models?library=sentence-transformers&sort=likes&search=chinese. But the output is still garbled.
MODEL_N_CTX=1000

With an English model it works fine.

@royleecn

royleecn commented May 20, 2023

Changing the embedding in .env to this worked:
EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1
Source: the multilingual models at https://www.sbert.net/docs/pretrained_models.html

@hanwsf

hanwsf commented May 21, 2023

> Try swapping the embedding model for GanymedeNil/text2vec-large-chinese
> […]
> chromadb.errors.InvalidDimensionException: Dimensionality of (1024) does not match index dimensionality (384)

You have to empty the previously created db folder first; 1024-dim Chinese embeddings can't be stacked onto the English 384-dim index.
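
A minimal sketch of why the index refuses the new vectors (model names are assumptions: all-MiniLM-L6-v2 is the default shipped in example.env):

from sentence_transformers import SentenceTransformer

# the index was built with the default English model's dimensionality...
old = SentenceTransformer("all-MiniLM-L6-v2")
print(old.get_sentence_embedding_dimension())  # 384

# ...and the Chinese model produces wider vectors, hence InvalidDimensionException
new = SentenceTransformer("GanymedeNil/text2vec-large-chinese")
print(new.get_sentence_embedding_dimension())  # 1024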

@hanwsf

hanwsf commented May 21, 2023

> Changing the embedding in .env to this worked: EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1 (source: the multilingual models at https://www.sbert.net/docs/pretrained_models.html)

Thank you very much! With paraphrase-multilingual-mpnet-base-v2 from that list, Chinese answers do come out. The only issue is the long run of gpt_tokenize: unknown token '�' lines beforehand.

To be improved, @imartinez, please help check:

  1. How to get rid of the gpt_tokenize: unknown token '�' messages.
  2. The answer is in the PDF and should come back in Chinese, but it replies in English, and the answer's sources are inaccurate (a sketch of a possible workaround follows the output below).
    Thanks!

====Output for reference====

python privateGPT.py
Using embedded DuckDB with persistence: data will be stored in: db
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285

Enter a query: RabbitMQ有什么特点?
gpt_tokenize: unknown token '�'
[the line above repeats 234 times in the original output]
RabbitMQ is a messaging system that is used to send and receive messages between different systems. It is a popular choice for building real-time, scalable, and fault-tolerant systems. RabbitMQ is a message queue that is used to store messages until they are processed by the appropriate system. It is a reliable messaging system that is used to ensure that messages are delivered to the correct system at the correct time. RabbitMQ is a popular choice for building real-time, scalable, and fault-tolerant systems.

Question:
RabbitMQ有什么特点?

Answer:
RabbitMQ is a messaging system that is used to send and receive messages between different systems. It is a popular choice for building real-time, scalable, and fault-tolerant systems. RabbitMQ is a message queue that is used to store messages until they are processed by the appropriate system. It is a reliable messaging system that is used to ensure that messages are delivered to the correct system at the correct time. RabbitMQ is a popular choice for building real-time, scalable, and fault-tolerant systems.

source_documents/JAVA核心知识点整理.pdf:
8.1.2.1. 异步通讯 NIO .................................................................................................................................... 148
8.1.2.2. 零拷贝(DIRECT BUFFERS 使用堆外直接内存) .......................................................................... 149
8.1.2.3. 内存池(基于内存池的缓冲区重用机制) ......................................................................................... 149

source_documents/JAVA核心知识点整理.pdf:
4.5.

2.8.

2.9.

2.8.2.1.
2.8.2.2.

source_documents/JAVA核心知识点整理.pdf:
5.1.7.4. 序列化(深 clone 一中实现) ........................................................................................................ 115

source_documents/JAVA核心知识点整理.pdf:
2.2.2. 虚拟机栈(线程私有) .................................................................................................................... 22
2.2.3. 本地方法区(线程私有) ................................................................................................................ 23
2.2.4. 堆(Heap-线程共享)-运行时数据区 ...................................................................................... 23

Enter a query:
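
On point 2: ggml-gpt4all-j is an English-centric model, so a Chinese reply is not guaranteed even with a multilingual embedder, and the unknown token lines suggest its tokenizer cannot represent the Chinese characters at all. One possible (unverified) workaround is to pass RetrievalQA an explicit prompt that asks for Chinese; a sketch against the stock privateGPT.py wiring, with the template text being an assumption rather than anything the project ships:

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# the template asks the model to answer in Chinese; efficacy depends on the LLM
template = """Use the following context to answer the question at the end.
Answer in Chinese (请用中文回答).

{context}

Question: {question}
Answer:"""

qa = RetrievalQA.from_chain_type(
    llm=llm,              # the GPT4All instance created in privateGPT.py
    chain_type="stuff",
    retriever=retriever,  # the Chroma retriever created in privateGPT.py
    return_source_documents=True,
    chain_type_kwargs={"prompt": PromptTemplate(
        template=template, input_variables=["context", "question"])},
)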

@NyyCui

NyyCui commented May 22, 2023

> Thank you very much! With paraphrase-multilingual-mpnet-base-v2 from that list, Chinese answers do come out. […]
> Answer: RabbitMQ is a messaging system that is used to send and receive messages between different systems. […]

But that answer is still in English, isn't it? Only the cited source documents are in Chinese.

@NyyCui

NyyCui commented May 22, 2023

> Changing the embedding in .env to this worked: EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1

After switching to EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1, running python ingest.py reports this error:
Loading documents from source_documents
Loaded 2 documents from source_documents
Split into 91 chunks of text (max. 500 characters each)
Traceback (most recent call last):
File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 446, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "C:\Program Files\Python310\lib\site-packages\torch\serialization.py", line 797, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "C:\Program Files\Python310\lib\site-packages\torch\serialization.py", line 283, in init
super().init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 450, in load_state_dict
if f.read(7) == "version":
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 64: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\AAAAAA\privateGPT\ingest.py", line 96, in
main()
File "D:\AAAAAA\privateGPT\ingest.py", line 87, in main
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
File "C:\Program Files\Python310\lib\site-packages\langchain\embeddings\huggingface.py", line 54, in init
self.client = sentence_transformers.SentenceTransformer(
File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 95, in init
modules = self._load_sbert_model(model_path)
File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 840, in _load_sbert_model
module = module_class.load(os.path.join(model_path, module_config['path']))
File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\models\Transformer.py", line 137, in load
return Transformer(model_name_or_path=input_path, **config)
File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\models\Transformer.py", line 29, in init
self._load_model(model_name_or_path, config, cache_dir)
File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\models\Transformer.py", line 49, in _load_model
self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
File "C:\Program Files\Python310\lib\site-packages\transformers\models\auto\auto_factory.py", line 467, in from_pretrained
return model_class.from_pretrained(
File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 2542, in from_pretrained
state_dict = load_state_dict(resolved_archive_file)
File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 462, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'C:\Users\Administrator/.cache\torch\sentence_transformers\sentence-transformers_distiluse-base-multilingual-cased-v1\pytorch_model.bin' at 'C:\Users\Administrator/.cache\torch\sentence_transformers\sentence-transformers_distiluse-base-multilingual-cased-v1\pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
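
"PytorchStreamReader failed reading zip archive: failed finding central directory" usually means pytorch_model.bin was only partially downloaded. A minimal sketch of the usual fix, deleting the cached copy (path taken from the traceback) so sentence-transformers re-downloads it on the next run:

import shutil

# remove the corrupted cached model; the next run fetches a fresh copy
cache_dir = (r"C:\Users\Administrator\.cache\torch\sentence_transformers"
             r"\sentence-transformers_distiluse-base-multilingual-cased-v1")
shutil.rmtree(cache_dir, ignore_errors=True)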

@hanwsf

hanwsf commented May 23, 2023 via email

@zxjason

zxjason commented May 25, 2023

Can we set up a QQ group to discuss the Chinese-language issues? My QQ is 84095749.

@imartinez imartinez added the primordial Related to the primordial version of PrivateGPT, which is now frozen in favour of the new PrivateGPT label Oct 19, 2023