
DataDeps doesn't check if the files themselves exist (?) #10

Closed
Evizero opened this issue Dec 26, 2017 · 5 comments · Fixed by #18

Comments

@Evizero
Collaborator

Evizero commented Dec 26, 2017

It seems that by design the package does not check if the specified folder actually contains the specified files. This seems like a missed opportunity to me. What are your thoughts on this?

@oxinabox
Owner

oxinabox commented Dec 26, 2017

Which files in particular?

Do you mean in-between the fetch step, and the post-fetch step?
Those files we could indeed check.
If a checksum is provided we kinda do check them, don't we?
I guess the default fallback function (which prints the xor'd hash for everything) might not though.

We can't check the files after the post-fetch step,
since we don't know what the post-fetch step will do.
(E.g. extract, or delete, or synthesize)
Related to that is #6
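
A check between the fetch and post-fetch steps could look roughly like this. This is a minimal sketch, not DataDeps' actual implementation; it assumes SHA.jl for hashing, and `checksum_ok` is a hypothetical helper name:

```julia
using SHA  # assumed dependency for hashing

# Hypothetical helper: verify a fetched file against an expected
# SHA-256 hex digest. Returns false if the file is missing or the
# hash does not match.
function checksum_ok(path::AbstractString, expected_sha256::AbstractString)
    isfile(path) || return false
    actual = bytes2hex(open(sha256, path))  # hash the file contents
    return actual == lowercase(expected_sha256)
end
```

This only covers the window before post-fetch runs; as noted above, after post-fetch the file may have been extracted, deleted, or transformed.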

@Evizero
Collaborator Author

Evizero commented Dec 26, 2017

We can't check the files after the post-fetch step, since we don't know what the post-fetch step will do.

That is a good point. I'll think on this a little.

@Evizero
Collaborator Author

Evizero commented Dec 27, 2017

The way I do this currently in MLDatasets is that after DataDeps does its thing (i.e. checks that the folder exists), I check if the requested file exists in that folder. If it doesn't, the code assumes that the file should be present but must have been deleted. Consequently it simply retriggers DataDeps.download and then checks again. (see https://github.com/JuliaML/MLDatasets.jl/blob/0fb774033d5c5ac9be4be41ee111209339dfa188/src/io/download.jl#L31-L41)

In other words I also don't assume that the requested file is in the specified list of to-download files (since as you say we don't know what the post-fetch step does). But I think the above is a fair enough assumption.
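
A minimal sketch of that check-and-retry pattern (illustrative names, not MLDatasets' actual code; `trigger_download` is a hypothetical callback standing in for whatever re-runs fetch and post-fetch):

```julia
# Resolve a file inside an already-registered data folder. If the file
# is missing, assume it was deleted, re-trigger the download once, and
# check again before giving up.
function with_retry(folder::AbstractString, filename::AbstractString, trigger_download)
    path = joinpath(folder, filename)
    if !isfile(path)
        trigger_download(folder)  # re-run fetch + post-fetch (assumed)
        isfile(path) || error("file $filename still missing after re-download")
    end
    return path
end
```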

We could support this mechanism in DataDeps itself with a /-based syntax. For example, CIFAR10 only downloads a single archive "cifar-10-binary.tar.gz", but after post_fetch I end up with a subfolder and a couple of files in it.

For this, datadep"CIFAR10/cifar-10-batches-bin/test_batch.bin" could mean DataDep "CIFAR10", then subfolder "cifar-10-batches-bin", then file "test_batch.bin". This would tell DataDeps that if the folder can't be found, or if the specified subfolder/file doesn't exist, it should trigger fetch and post-fetch for "CIFAR10" and then try again. The macro should in the end return the path to the actual file (e.g. "/home/user/.julia/datadeps/CIFAR10/cifar-10-batches-bin/test_batch.bin").
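
The resolution logic behind such a syntax could be sketched like this. Everything here is hypothetical (the function name, the `fetch_and_postfetch` callback, and the root-folder argument are all illustrative, not DataDeps API):

```julia
# Split "Name/sub/path" into the DataDep name and a path inside its
# folder; if the folder or the requested subpath is missing, re-run
# fetch + post-fetch for that DataDep and check again.
function resolve_datadep(spec::AbstractString, datadeps_root::AbstractString,
                         fetch_and_postfetch)
    parts = split(spec, '/')
    name, rest = first(parts), parts[2:end]
    folder = joinpath(datadeps_root, name)
    path = joinpath(folder, rest...)
    if !ispath(path)                  # folder, subfolder, or file missing
        fetch_and_postfetch(name)     # assumed to recreate the folder contents
        ispath(path) || error("$spec not present even after fetching $name")
    end
    return path
end
```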

@oxinabox
Owner

That seems reasonable.

@Evizero
Collaborator Author

Evizero commented Dec 28, 2017

A nice side-effect of this is that the existence of the downloaded archive file is never checked. As a consequence, a user could have the dataset pre-downloaded and extracted without keeping the archive file around.
