Enhancement: Improve the checksum for the Dataset (or the equality test) #16

@cwognum

Description

When a Dataset contains pointer columns, we compute the checksum from the file paths rather than the file contents (see here). Since we currently use the checksum to determine equality between Dataset objects (see here), saving a dataset and loading it back can produce two objects that compare as unequal, even though the underlying data is identical:

src_dataset.to_json("/home/cas/Desktop")
dst_dataset = Dataset.from_json("/home/cas/Desktop/dataset.json")
src_dataset == dst_dataset  # This is False
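The effect can be reproduced outside the library. The following is a minimal sketch, not the project's actual checksum code: `path_checksum` and `content_checksum` are hypothetical helpers, MD5 is an assumed hash function, and the file name is made up for illustration. Two files with identical bytes in different directories hash differently by path but identically by content.

```python
import hashlib
import tempfile
from pathlib import Path


def path_checksum(paths):
    """Hypothetical helper: hash the pointer values (the file paths) themselves."""
    digest = hashlib.md5()
    for p in paths:
        digest.update(str(p).encode("utf-8"))
    return digest.hexdigest()


def content_checksum(paths):
    """Hypothetical helper: hash the bytes each pointer refers to."""
    digest = hashlib.md5()
    for p in paths:
        digest.update(Path(p).read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    # Same file name and same bytes, but stored in two different directories.
    src = Path(src_dir) / "molecule.sdf"
    dst = Path(dst_dir) / "molecule.sdf"
    src.write_bytes(b"same content")
    dst.write_bytes(b"same content")

    same_by_path = path_checksum([src]) == path_checksum([dst])
    same_by_content = content_checksum([src]) == content_checksum([dst])

print(same_by_path, same_by_content)  # False True
```

Hashing the content fixes the equality test, but it requires reading every referenced file.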

How could we improve this system? @hadim suggested using the number of bytes in each file as a proxy for its content. That would work a lot better, but I don't think it will be performant enough as datasets get large and are stored remotely.
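A rough sketch of the file-size proxy, for local files only. `size_checksum` is a hypothetical helper, not an existing function in the library, and MD5 is again an assumption. It shows that the checksum becomes independent of where the files live, and also that two different files of the same size collide.

```python
import hashlib
import os
import tempfile


def size_checksum(paths):
    """Hypothetical helper: hash the size in bytes of each referenced file."""
    digest = hashlib.md5()
    for p in paths:
        # Only file metadata is read here, never the file content.
        digest.update(str(os.path.getsize(p)).encode("utf-8"))
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    original = os.path.join(dir_a, "data.bin")
    copy = os.path.join(dir_b, "data.bin")
    different = os.path.join(dir_b, "other.bin")

    for path, payload in [(original, b"abc"), (copy, b"abc"), (different, b"xyz")]:
        with open(path, "wb") as f:
            f.write(payload)

    # A copy in another directory yields the same checksum.
    copy_matches = size_checksum([original]) == size_checksum([copy])
    # Limitation: different content of equal length is indistinguishable.
    collision = size_checksum([original]) == size_checksum([different])

print(copy_matches, collision)  # True True
```

The size lookup still costs one metadata request per file, which is the performance concern for large remote datasets.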

Labels: bug, enhancement, question
