[Bug]: date format changed from input to output #71

Closed
logoshot opened this issue Nov 14, 2023 · 6 comments
Labels: bug, stale-issue

Comments


logoshot commented Nov 14, 2023

Before Reporting

  • I have pulled the latest code of the main branch to run again and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bug reports.

OS

Ubuntu

Installation Method

from source

Data-Juicer Version

latest

Python Version

3.8

Describe the bug

I have a jsonl dataset to be processed. Each record has a time key that originally looks like '2023-10-13 16:06:31'. I process the data with python tools/process_data.py --config configs/demo/process.yaml, and in the output jsonl I found that time is changed to 1678, an integer. I found that this may be caused by datasets.to_json: it has a parameter called date_format, and when I set it to 'iso' the output changes to '1970-01-01T00:00:01.698'. So it is not only the format that is wrong; the value itself also changed.

To Reproduce

  1. Prepare a jsonl dataset with a time field.
  2. Run python tools/process_data.py --config configs/demo/process.yaml.
  3. The time format and value in the output are changed.
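To isolate the export step outside the full Data-Juicer pipeline, here is a minimal sketch of the behavior described above; the toy record and output file names are illustrative, and only the datasets and pandas packages are assumed.

    from datasets import Dataset
    import pandas as pd

    # Build a toy dataset whose `time` column is a real timestamp type,
    # mirroring what the jsonl loader infers for values like '2023-10-13 16:06:31'.
    df = pd.DataFrame({
        "text": ["hello"],
        "time": pd.to_datetime(["2023-10-13 16:06:31"]),
    })
    ds = Dataset.from_pandas(df)

    print(ds["time"])  # the value is still correct in memory at this point

    # Default export: the timestamp comes out as an epoch-style integer.
    ds.to_json("out_default.jsonl", force_ascii=False)

    # Export with date_format='iso': the format becomes ISO, but on the affected
    # setup the value itself is still wrong (see the comments below).
    ds.to_json("out_iso.jsonl", force_ascii=False, date_format="iso")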

Configs

project_name: 'all'                                         # project name for distinguish your configs
dataset_path: '/path/to/dataset/0.jsonl'                     
export_path: '/path/to/result/result.jsonl'              
export_shard_size: 0                                      
export_in_parallel: false                                  
np: 4                                                       # number of subprocess to process your dataset
text_keys: '内容'                                      
suffixes: ['.jsonl']                                                # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: true                                             # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null                                        
use_checkpoint: false                              
open_tracer: false                                          # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []                                        # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10                                               # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false                                            # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null                                        # The compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.

# for distributed processing
executor_type: default                                      # Type of executor, support "default" or "ray" for now.
ray_address: auto                                           # The address of the Ray cluster.

# only for data analysis
save_stats_in_one_file: false                               # whether to store all stats result into one file

# process schedule: a list of several process operators with their arguments
process:
  - clean_email_mapper:                                     # remove emails from text.
  - clean_html_mapper:                                      # remove html formats from text.
  - clean_ip_mapper:                                        # remove ip addresses from text.
  - clean_links_mapper:                                     # remove web links from text.
  - clean_copyright_mapper:                                 # remove copyright comments.
  - punctuation_normalization_mapper:                       # normalize unicode punctuations to English punctuations.
  - whitespace_normalization_mapper:                        # normalize different kinds of whitespaces to English whitespace.

Logs

No response

Screenshots

No response

Additional

No response

logoshot added the bug label on Nov 14, 2023
logoshot (Author) commented:

I tried printing the data inside to_jsonl and found that the time values are still correct right before to_json is called, so the datetime values are changed inside to_json.

    @staticmethod
    def to_jsonl(dataset, export_path, num_proc=1, **kwargs):
        """
        Export method for json/jsonl target files.

        :param dataset: the dataset to export.
        :param export_path: the path to store the exported dataset.
        :param num_proc: the number of processes used to export the dataset.
        :param kwargs: extra arguments.
        :return:
        """
        print(dataset['time'])  # debug: time values are still correct here
        dataset.to_json(export_path, force_ascii=False, num_proc=num_proc)

zhijianma (Collaborator) commented Nov 14, 2023

> I tried printing the data inside to_jsonl and found that the time values are still correct right before to_json is called, so the datetime values are changed inside to_json.

The image below shows the content after exporting the dataset with date_format='iso' on my local machine: the values remain unchanged, only in ISO format. Please check your local time and your Python dependencies, such as datasets, pandas, and pyarrow.

[screenshot: exported jsonl with ISO-formatted time values]

logoshot (Author) commented:

I installed these packages with the command pip install -v -e .[all], so I think they are the default versions. Could you check your versions? Here are mine:

pandas 2.0.0
datasets 2.11.0
pyarrow 14.0.1
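For comparison, here is a quick snippet to print the versions of these packages from the same environment that runs tools/process_data.py (package names as imported):

    # Print the versions of the packages discussed above.
    import datasets
    import pandas
    import pyarrow

    print("datasets:", datasets.__version__)
    print("pandas:  ", pandas.__version__)
    print("pyarrow: ", pyarrow.__version__)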

zhijianma (Collaborator) commented:

> '2023-10-13 16:06:31'

This may be a bug in pyarrow starting from v13.0.
I have tested from v11.0 to v14.0 with date_format 'iso'; here are my results:

  • v11.0.0 and v12.0.0 write the correct text 2023-10-13T16:06:31.000 to the jsonl file.
  • v13.0.0 and v14.0.0 write the wrong text 1970-01-01T00:00:01.697 to the jsonl file.

So we suggest you downgrade pyarrow to v12.0.0.
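As a small illustrative sketch, one could check whether the installed pyarrow falls in the tested range before exporting; the 13.0 threshold comes from the versions tested in this comment, and parsing only the major version is a simplification.

    import pyarrow

    # v11/v12 wrote the correct ISO text in the tests above, while v13/v14 did not;
    # checking only the major version is a simplification for illustration.
    major = int(pyarrow.__version__.split(".")[0])
    if major >= 13:
        print(f"pyarrow {pyarrow.__version__} is in the range reported to mis-serialize "
              "timestamps with date_format='iso'; downgrading to 12.0.0 avoids it.")
    else:
        print(f"pyarrow {pyarrow.__version__} is outside the range reported as affected.")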

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

Close this stale issue.
