[Bug]: Using the document_simhash_deduplicator operator raises an error: NameError: name 'fingerprint_warnings' is not defined #92

Closed
3 tasks done
jamestch opened this issue Nov 21, 2023 · 3 comments · Fixed by #94

jamestch commented Nov 21, 2023

Before Reporting

  • I have pulled the latest code of the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu

Installation Method

Docker image built by myself from the Dockerfile

Data-Juicer Version

v0.1.2

Python Version

3.8.18

Describe the bug

(1) Key error messages
An error occurred during Op [document_simhash_deduplicator]

_raise PicklingError(_pickle.PicklingError: Can't pickle <cyfunction _Pyx_CFunc_size__t____hash__t____hash__t___to_py..wrap at 0x7ff22bad4e80>: it's not found as cfunc.to_py.wrap

NameError: name 'fingerprint_warnings' is not defined
(2) Screenshots of the error
[image]
[image]
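
For readers unfamiliar with the first error: the "it's not found as cfunc.to_py.wrap" wording is what the pickler reports when it tries to serialize a function by reference (module name plus function name) and cannot look the function up again under that name. The snippet below is only a minimal sketch of that failure mode with the standard library; it does not use Data-Juicer, datasets, or dill, and the names `can_pickle`, `wrap`, and `no_such_module_for_demo` are made up for illustration. It does not claim to reproduce the exact cause of this bug.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard pickler can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError, ImportError):
        return False


def wrap(x):
    return x


# Pretend the function lives in a module that cannot be imported, loosely
# similar to how the traceback reports the wrapper as "cfunc.to_py.wrap".
# "no_such_module_for_demo" is a hypothetical name used only for this demo.
wrap.__module__ = "no_such_module_for_demo"

print(can_pickle(len))   # built-in function, pickled by reference
print(can_pickle(wrap))  # lookup by module and name fails
```

A probe like this can help narrow down which object in an operator's state is the one the pickler rejects.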

To Reproduce

1. Pulled the latest code and built a Docker image from the Dockerfile myself (image name: data-juicer:v0.1.2s).
2. Started a Docker container as follows:
docker run -dit --name dj -v ~/.cache/:/root/.cache/ data-juicer:v0.1.2s /bin/bash

docker cp -a dj:/data-juicer/ /data/llm-data/data-juicer

docker stop dj && docker rm dj

docker run -dit --name dj -p 8501:8501 -v /data/llm-data/data-juicer:/data-juicer -v /data/llm-data/cache/:/root/.cache/ data-juicer:v0.1.2s /bin/bash
3. docker exec -it dj /bin/bash
4. cd /data-juicer/demos/process_cft_zh_data
streamlit run app.py
5. Access ip:8501 in a browser.
6. Click the "start to process data" button.

Configs

No response

Logs

No response

Screenshots

No response

Additional

Issue #83 describes a similar problem. Neither pulling the latest code nor reinstalling the datasets==2.11.0 and dill==0.3.4 packages resolves it.
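
When reinstalling pinned packages inside a container, it can be worth confirming which versions the interpreter actually sees. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical, and the package names are the two mentioned above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions visible to the current Python environment.
for name in ("datasets", "dill"):
    print(name, installed_version(name))
```

Run it with the same interpreter that launches streamlit, so the check reflects the environment in which the operator fails.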

jamestch added the bug label Nov 21, 2023

jamestch (Author) commented

It seems that this error occurs whenever the document_simhash_deduplicator operator is used.

jamestch changed the title from "[Bug]: NameError: name 'fingerprint_warnings' is not defined" to "[Bug]: Using the document_simhash_deduplicator operator raises an error: NameError: name 'fingerprint_warnings' is not defined" Nov 23, 2023
HYLcool self-assigned this Nov 23, 2023
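HYLcool (Collaborator) commented Nov 23, 2023

Hello, and thank you for your interest in and use of the project!

We have reproduced the problem you ran into. Our preliminary assessment is that it was caused by changes to this operator in a recent commit. We will fix it as soon as possible and get back to you.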

HYLcool linked a pull request Nov 23, 2023 that will close this issue

HYLcool (Collaborator) commented Nov 23, 2023

Hi @jamestch,

We have fixed this problem in PR #94. Please pull the latest code and try again. Thanks!
