[Bug]: Using the document_simhash_deduplicator operator fails with NameError: name 'fingerprint_warnings' is not defined #92
Closed
Labels: bug
Before Reporting
I have pulled the latest code of the main branch and run it again, and the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
OS
Ubuntu
Installation Method
Docker image built from the Dockerfile by myself
Data-Juicer Version
v0.1.2
Python Version
3.8.18
Describe the bug
(1) Key error messages
An error occurred during Op [document_simhash_deduplicator]
_raise PicklingError(_pickle.PicklingError: Can't pickle <cyfunction _Pyx_CFunc_size__t____hash__t____hash__t___to_py..wrap at 0x7ff22bad4e80>: it's not found as cfunc.to_py.wrap
NameError: name 'fingerprint_warnings' is not defined
![image](https://private-user-images.githubusercontent.com/6100838/284630634-637845f4-1a5f-4694-80e7-d58a749d6538.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg3NDA2MDAsIm5iZiI6MTcxODc0MDMwMCwicGF0aCI6Ii82MTAwODM4LzI4NDYzMDYzNC02Mzc4NDVmNC0xYTVmLTQ2OTQtODBlNy1kNThhNzQ5ZDY1MzgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MThUMTk1MTQwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NTg5MGE2ZjJjNmY2OWMzNzNlN2YwZWEyNjEzMzg4N2RhNjI0MmE1OTI2ZGQ2NWVkMjRhNGJiZGZhY2YwZTEyOSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.SU5_F5wBvplj1ktlOqIHre0nG5vzExFlBr6ghKXkhak)
![image](https://private-user-images.githubusercontent.com/6100838/284630816-27ece342-7b5d-4512-a6a4-d8f58a6b84f0.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg3NDA2MDAsIm5iZiI6MTcxODc0MDMwMCwicGF0aCI6Ii82MTAwODM4LzI4NDYzMDgxNi0yN2VjZTM0Mi03YjVkLTQ1MTItYTZhNC1kOGY1OGE2Yjg0ZjAucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDYxOCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA2MThUMTk1MTQwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZjAzM2ZiYmQ5YTQwNjU5NGU0N2U3ZWM1NDNkZjE0YWEzMTZjMGVjOGVhNmNiMmE2NTk2NDU4ZDBmYWE4NDZjMCZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.vRCS-ISi30xB5IWEyC_P7Mw9haMdYj0ymfQXnI32KY8)
(2) Screenshots of the error messages
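The "it's not found as ..." wording in the PicklingError is what the standard pickle module reports when it cannot look an object up again under its module and qualified name. As a minimal sketch of that general failure mode (using only the stdlib pickle module, not the serialization path that Data-Juicer or datasets actually take; the helper names `pickling_error` and `make_wrapper` are hypothetical), a function defined inside another function cannot be pickled by name, while a built-in can:

```python
import pickle


def pickling_error(obj):
    """Return None if obj pickles with the stdlib pickler, else the error text."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def make_wrapper():
    # A function defined inside another function carries "<locals>" in its
    # qualified name, so pickle cannot find it again by name.
    def wrap(x):
        return x
    return wrap


print(pickling_error(len))             # built-in function: pickled by reference
print(pickling_error(make_wrapper()))  # nested function: an error is reported
```

This only illustrates the class of error; whether the cyfunction in the traceback fails for the same reason is an assumption, not something confirmed here.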
To Reproduce
1. Pull the latest code and build the Docker image from the Dockerfile (image name: data-juicer:v0.1.2s).
2. Start the Docker container as follows:
docker run -dit \
  --name dj \
  -v ~/.cache/:/root/.cache/ \
  data-juicer:v0.1.2s /bin/bash
docker cp -a dj:/data-juicer/ /data/llm-data/data-juicer
docker stop dj && docker rm dj
docker run -dit \
  --name dj \
  -p 8501:8501 \
  -v /data/llm-data/data-juicer:/data-juicer \
  -v /data/llm-data/cache/:/root/.cache/ \
  data-juicer:v0.1.2s /bin/bash
3. docker exec -it dj /bin/bash
4. cd /data-juicer/demos/process_cft_zh_data
streamlit run app.py
5. Open ip:8501 in a browser.
6. Click the "start to process data" button.
Configs
No response
Logs
No response
Screenshots
No response
Additional
Issue 83 describes a similar problem. Downloading the latest code and reinstalling the datasets==2.11.0 and dill==0.3.4 packages did not fix it.
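Because the container mounts a copied source tree and cache directory, it may be worth confirming which package versions are actually active inside it. A minimal sketch, assuming Python 3.8+ (for `importlib.metadata`) and using a hypothetical helper name `installed_versions`:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# The two packages named in this report; run inside the container.
print(installed_versions(["datasets", "dill"]))
```

Running this from the same interpreter that launches `streamlit run app.py` shows whether the reinstalled datasets and dill versions are the ones being imported.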