Doesn't it support Chinese? #207
Comments
Try switching the embedding model to GanymedeNil/text2vec-large-chinese.
Where do I change that?
Did you get Chinese working?
The error is below:
I had the same error; deleting all the files and subdirectories in the db directory before executing ingest.py helped.
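The advice above can be sketched as a small helper. The default `db` path below matches the `PERSIST_DIRECTORY=db` setting mentioned later in this thread; verify it against your own `.env` before deleting anything.

```python
import os
import shutil

def reset_vector_db(db_dir: str = "db") -> None:
    """Remove everything under db_dir so ingest.py rebuilds the index from scratch."""
    if os.path.isdir(db_dir):
        shutil.rmtree(db_dir)
    os.makedirs(db_dir, exist_ok=True)
```

Run this (or simply delete the folder by hand) before re-running `python ingest.py` whenever you change the embedding model.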
Could someone tell me which embedding you used to load a local model? HuggingFaceEmbeddings won't load mine; there seems to be no parameter for specifying a local path.
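A sketch of one workaround, assuming langchain and sentence-transformers are installed: `HuggingFaceEmbeddings` passes `model_name` straight to `SentenceTransformer`, which accepts a local directory path as well as a hub id, so you can download the model once and point `model_name` at the directory. The path below is hypothetical.

```python
def load_local_embeddings(model_dir: str):
    """Load embeddings from a local model directory instead of the HF hub."""
    from langchain.embeddings import HuggingFaceEmbeddings  # lazy import
    return HuggingFaceEmbeddings(model_name=model_dir)

# embeddings = load_local_embeddings("/models/text2vec-large-chinese")  # hypothetical path
```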
@schiffma thanks, that solved the error. But when I ask a question, I get many `gpt_tokenize: unknown token ''` messages before the response, and the response is not what I want.
Do you mean downloading the .bin file and replacing the model path directly? I did that and it no longer runs.
Yes, I'm getting the same error. Not sure what's going wrong.
In .env, changing the embedding to this worked: EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1
You need to empty the previously created db folder first; you can't layer 1024-dimensional Chinese embeddings on top of an index built with 384-dimensional English vectors.
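As a sketch of the change described above (the variable names match privateGPT's sample `.env`; verify against your own copy):

```shell
# .env -- only EMBEDDINGS_MODEL_NAME changes; keep your other settings as they are
PERSIST_DIRECTORY=db
EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1
# paraphrase-multilingual-mpnet-base-v2 is another multilingual option reported to work in this thread
```

Remember to empty the `db` directory before re-running `python ingest.py`, as noted above.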
Thanks a lot! Using paraphrase-multilingual-mpnet-base-v2 from that list produces Chinese output. There are just a lot of `gpt_tokenize: unknown token '�'` messages first. To be improved @imartinez, please help to check:
Output for reference (the query asks "What are the features of RabbitMQ?"):

```
$ python privateGPT.py
Enter a query: RabbitMQ有什么特点?
2.8. 2.9. 2.8.2.1.
Enter a query:
```
Isn't that answer still in English, though? Only the cited documents are Chinese.
After switching to EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1, running `python ingest.py` raises an error ("During handling of the above exception, another exception occurred"); the full traceback is quoted in the reply below.
That looks like two separate problems. First, is the file you put in the folder actually a zip archive? Second, try switching to a different model and see if that helps.
Changing the embedding in .env to this worked: EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1. The source is the multilingual section of https://www.sbert.net/docs/pretrained_models.html
After switching to EMBEDDINGS_MODEL_NAME=distiluse-base-multilingual-cased-v1, running `python ingest.py` throws this error:
```
Loading documents from source_documents
Loaded 2 documents from source_documents
Split into 91 chunks of text (max. 500 characters each)
Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 446, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "C:\Program Files\Python310\lib\site-packages\torch\serialization.py", line 797, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "C:\Program Files\Python310\lib\site-packages\torch\serialization.py", line 283, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 450, in load_state_dict
    if f.read(7) == "version":
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 64: illegal multibyte sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\AAAAAA\privateGPT\ingest.py", line 96, in <module>
    main()
  File "D:\AAAAAA\privateGPT\ingest.py", line 87, in main
    embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
  File "C:\Program Files\Python310\lib\site-packages\langchain\embeddings\huggingface.py", line 54, in __init__
    self.client = sentence_transformers.SentenceTransformer(
  File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 95, in __init__
    modules = self._load_sbert_model(model_path)
  File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 840, in _load_sbert_model
    module = module_class.load(os.path.join(model_path, module_config['path']))
  File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\models\Transformer.py", line 137, in load
    return Transformer(model_name_or_path=input_path, **config)
  File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\models\Transformer.py", line 29, in __init__
    self._load_model(model_name_or_path, config, cache_dir)
  File "C:\Program Files\Python310\lib\site-packages\sentence_transformers\models\Transformer.py", line 49, in _load_model
    self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
  File "C:\Program Files\Python310\lib\site-packages\transformers\models\auto\auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 2542, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 462, in load_state_dict
    raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'C:\Users\Administrator/.cache\torch\sentence_transformers\sentence-transformers_distiluse-base-multilingual-cased-v1\pytorch_model.bin' at 'C:\Users\Administrator/.cache\torch\sentence_transformers\sentence-transformers_distiluse-base-multilingual-cased-v1\pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
```
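The "failed finding central directory" error usually means the cached checkpoint download was truncated or corrupted: modern `torch.save()` files are zip archives, so a quick `zipfile` check can confirm this. A sketch; the cache path mirrors the one in the traceback above and may differ on your machine. If the check fails, delete that model's cache directory and let sentence-transformers re-download it.

```python
import os
import zipfile

def checkpoint_looks_valid(path: str) -> bool:
    """Modern torch.save() checkpoints are zip archives; a truncated download is not."""
    return os.path.exists(path) and zipfile.is_zipfile(path)

# Path copied from the traceback; adjust to your own cache location.
ckpt = os.path.expanduser(
    "~/.cache/torch/sentence_transformers/"
    "sentence-transformers_distiluse-base-multilingual-cased-v1/pytorch_model.bin"
)
if os.path.exists(ckpt) and not checkpoint_looks_valid(ckpt):
    print("Corrupted download: delete this model's cache directory and re-run ingest.py")
```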
Can we set up a QQ group to discuss the Chinese-support issue?
I uploaded a Chinese PDF and entered a question in Chinese, but the result came out garbled.
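Garbled results after ingesting a Chinese PDF often mean the PDF's text layer was extracted badly (or the document is scanned images with no text layer at all). One quick sanity check is to look at what the loader actually extracted and confirm it contains CJK characters; `has_cjk` below is a hypothetical helper for this, not part of privateGPT.

```python
def has_cjk(text: str) -> bool:
    """True if text contains CJK Unified Ideographs (a rough check for Chinese text)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)
```

If the chunks printed by ingest.py fail this check, the problem is in PDF extraction rather than in the embedding or LLM step.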