[Bug]: date format changed from input to output #71

Closed
logoshot opened this issue Nov 14, 2023 · 6 comments
Labels: bug, stale-issue

Comments


logoshot commented Nov 14, 2023

Before Reporting

  • I have pulled the latest code of the main branch to run again and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bug reports.

OS

Ubuntu

Installation Method

from source

Data-Juicer Version

latest

Python Version

3.8

Describe the bug

I have a jsonl dataset to be processed. Each record has a time key that originally looks like '2023-10-13 16:06:31'. I process the data with python tools/process_data.py --config configs/demo/process.yaml, and in the output jsonl I found that time is changed to 1678, an integer. I found that this may be caused by datasets.to_json: it has a parameter called date_format, and when I set it to 'iso' the output changes to '1970-01-01T00:00:01.698'. So it is not only the format that is wrong; the value itself also changed.

To Reproduce

  1. Prepare a jsonl dataset with a time field.
  2. Run python tools/process_data.py --config configs/demo/process.yaml.
  3. The time format and value in the output are changed.
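To isolate the export step outside the full Data-Juicer pipeline, here is a minimal sketch of the behavior described above; the toy record and output file names are illustrative, and only the datasets and pandas packages are assumed.

    from datasets import Dataset
    import pandas as pd

    # Build a toy dataset whose `time` column is a real timestamp type,
    # mirroring what the jsonl loader infers for values like '2023-10-13 16:06:31'.
    df = pd.DataFrame({
        "text": ["hello"],
        "time": pd.to_datetime(["2023-10-13 16:06:31"]),
    })
    ds = Dataset.from_pandas(df)

    print(ds["time"])  # the value is still correct in memory at this point

    # Default export: the timestamp comes out as an epoch-style integer.
    ds.to_json("out_default.jsonl", force_ascii=False)

    # Export with date_format='iso': the format becomes ISO, but on the affected
    # setup the value itself is still wrong (see the comments below).
    ds.to_json("out_iso.jsonl", force_ascii=False, date_format="iso")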

Configs

project_name: 'all'                                         # project name for distinguish your configs
dataset_path: '/path/to/dataset/0.jsonl'                     
export_path: '/path/to/result/result.jsonl'              
export_shard_size: 0                                      
export_in_parallel: false                                  
np: 4                                                       # number of subprocess to process your dataset
text_keys: '内容'                                      
suffixes: ['.jsonl']                                                # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: true                                             # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null                                        
use_checkpoint: false                              
open_tracer: false                                          # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []                                        # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10                                               # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false                                            # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null                                        # The compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.

# for distributed processing
executor_type: default                                      # Type of executor, support "default" or "ray" for now.
ray_address: auto                                           # The address of the Ray cluster.

# only for data analysis
save_stats_in_one_file: false                               # whether to store all stats result into one file

# process schedule: a list of several process operators with their arguments
process:
  - clean_email_mapper:                                     # remove emails from text.
  - clean_html_mapper:                                      # remove html formats from text.
  - clean_ip_mapper:                                        # remove ip addresses from text.
  - clean_links_mapper:                                     # remove web links from text.
  - clean_copyright_mapper:                                 # remove copyright comments.
  - punctuation_normalization_mapper:                       # normalize unicode punctuations to English punctuations.
  - whitespace_normalization_mapper:                        # normalize different kinds of whitespaces to English whitespace.

Logs

No response

Screenshots

No response

Additional

No response

logoshot added the bug label on Nov 14, 2023
logoshot (Author) commented:

I tried printing the data inside to_jsonl and found that the time values are still correct right before to_json is called, so the datetime values are changed inside to_json.

    @staticmethod
    def to_jsonl(dataset, export_path, num_proc=1, **kwargs):
        """
        Export method for json/jsonl target files.

        :param dataset: the dataset to export.
        :param export_path: the path to store the exported dataset.
        :param num_proc: the number of processes used to export the dataset.
        :param kwargs: extra arguments.
        :return:
        """
        print(dataset['time'])  # debug: time values are still correct here
        dataset.to_json(export_path, force_ascii=False, num_proc=num_proc)

zhijianma (Collaborator) commented Nov 14, 2023

> I tried printing the data inside to_jsonl and found that the time values are still correct right before to_json is called, so the datetime values are changed inside to_json.

The image below shows the content after exporting the dataset with date_format='iso' on my local machine: the values remain unchanged, only in ISO format. Please check your local time and your Python dependencies, such as datasets, pandas, and pyarrow.

[screenshot: exported jsonl with ISO-formatted time values]

logoshot (Author) commented:

I installed these packages with the command pip install -v -e .[all], so I think they are the default versions. Could you check your versions? Here are mine:

pandas 2.0.0
datasets 2.11.0
pyarrow 14.0.1
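For comparison, here is a quick snippet to print the versions of these packages from the same environment that runs tools/process_data.py (package names as imported):

    # Print the versions of the packages discussed above.
    import datasets
    import pandas
    import pyarrow

    print("datasets:", datasets.__version__)
    print("pandas:  ", pandas.__version__)
    print("pyarrow: ", pyarrow.__version__)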

zhijianma (Collaborator) commented:

> '2023-10-13 16:06:31'

This may be a bug in pyarrow starting from v13.0.
I have tested from v11.0 to v14.0 with date_format 'iso'; here are my results:

  • v11.0.0 and v12.0.0 write the correct text 2023-10-13T16:06:31.000 to the jsonl file.
  • v13.0.0 and v14.0.0 write the wrong text 1970-01-01T00:00:01.697 to the jsonl file.

So we suggest you downgrade pyarrow to v12.0.0.
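As a small illustrative sketch, one could check whether the installed pyarrow falls in the tested range before exporting; the 13.0 threshold comes from the versions tested in this comment, and parsing only the major version is a simplification.

    import pyarrow

    # v11/v12 wrote the correct ISO text in the tests above, while v13/v14 did not;
    # checking only the major version is a simplification for illustration.
    major = int(pyarrow.__version__.split(".")[0])
    if major >= 13:
        print(f"pyarrow {pyarrow.__version__} is in the range reported to mis-serialize "
              "timestamps with date_format='iso'; downgrading to 12.0.0 avoids it.")
    else:
        print(f"pyarrow {pyarrow.__version__} is outside the range reported as affected.")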

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

Close this stale issue.
