
DataDeps doesn't check if the files themselves exist (?) #10

Closed
Evizero opened this issue Dec 26, 2017 · 5 comments · Fixed by #18

Comments

@Evizero
Collaborator

Evizero commented Dec 26, 2017

It seems that by design the package does not check if the specified folder actually contains the specified files. This seems like a missed opportunity to me. What are your thoughts on this?

@oxinabox
Owner

oxinabox commented Dec 26, 2017

Which files in particular?

Do you mean in-between the fetch step, and the post-fetch step?
Those files we could indeed check.
If a checksum is provided we kinda do check them, don't we?
I guess the default fallback function (which prints the xor'd hash for everything) might not though.

We can't check the files after the post-fetch step,
since we don't know what the post-fetch step will do.
(E.g. extract, or delete, or synthesize)
Related to that is #6
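
A check between the fetch and post-fetch steps could look roughly like this. This is a minimal sketch, not DataDeps' actual implementation; it assumes SHA.jl for hashing, and `checksum_ok` is a hypothetical helper name:

```julia
using SHA  # assumed dependency for hashing

# Hypothetical helper: verify a fetched file against an expected
# SHA-256 hex digest. Returns false if the file is missing or the
# hash does not match.
function checksum_ok(path::AbstractString, expected_sha256::AbstractString)
    isfile(path) || return false
    actual = bytes2hex(open(sha256, path))  # hash the file contents
    return actual == lowercase(expected_sha256)
end
```

This only covers the window before post-fetch runs; as noted above, after post-fetch the file may have been extracted, deleted, or transformed.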

@Evizero
Collaborator Author

Evizero commented Dec 26, 2017

We can't check the files after the post-fetch step, since we don't know what the post-fetch step will do.

That is a good point. I'll think on this a little.

@Evizero
Collaborator Author

Evizero commented Dec 27, 2017

The way I do this currently in MLDatasets is that after DataDeps does its thing (i.e. checks that the folder exists), I check if the requested file exists in that folder. If it doesn't, the code assumes that the file should be present but must have been deleted. Consequently it simply retriggers DataDeps.download and then checks again. (see https://github.com/JuliaML/MLDatasets.jl/blob/0fb774033d5c5ac9be4be41ee111209339dfa188/src/io/download.jl#L31-L41)

In other words I also don't assume that the requested file is in the specified list of to-download files (since as you say we don't know what the post-fetch step does). But I think the above is a fair enough assumption.
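
A minimal sketch of that check-and-retry pattern (illustrative names, not MLDatasets' actual code; `trigger_download` is a hypothetical callback standing in for whatever re-runs fetch and post-fetch):

```julia
# Resolve a file inside an already-registered data folder. If the file
# is missing, assume it was deleted, re-trigger the download once, and
# check again before giving up.
function with_retry(folder::AbstractString, filename::AbstractString, trigger_download)
    path = joinpath(folder, filename)
    if !isfile(path)
        trigger_download(folder)  # re-run fetch + post-fetch (assumed)
        isfile(path) || error("file $filename still missing after re-download")
    end
    return path
end
```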

We could support this mechanism in DataDeps itself with a /-based syntax. For example, CIFAR10 only downloads a single archive "cifar-10-binary.tar.gz", but after post_fetch I end up with a subfolder and a couple of files in it.

For this, datadep"CIFAR10/cifar-10-batches-bin/test_batch.bin" could mean DataDep "CIFAR10", then subfolder "cifar-10-batches-bin", then file "test_batch.bin". This would tell DataDeps that if the folder can't be found, or if the specified subfolder/file doesn't exist, it should trigger fetch and post-fetch for "CIFAR10" and then try again. The macro should in the end return the path to the actual file (e.g. "/home/user/.julia/datadeps/CIFAR10/cifar-10-batches-bin/test_batch.bin").
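
The resolution logic behind such a syntax could be sketched like this. Everything here is hypothetical (the function name, the `fetch_and_postfetch` callback, and the root-folder argument are all illustrative, not DataDeps API):

```julia
# Split "Name/sub/path" into the DataDep name and a path inside its
# folder; if the folder or the requested subpath is missing, re-run
# fetch + post-fetch for that DataDep and check again.
function resolve_datadep(spec::AbstractString, datadeps_root::AbstractString,
                         fetch_and_postfetch)
    parts = split(spec, '/')
    name, rest = first(parts), parts[2:end]
    folder = joinpath(datadeps_root, name)
    path = joinpath(folder, rest...)
    if !ispath(path)                  # folder, subfolder, or file missing
        fetch_and_postfetch(name)     # assumed to recreate the folder contents
        ispath(path) || error("$spec not present even after fetching $name")
    end
    return path
end
```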

@oxinabox
Owner

That seems reasonable.

@Evizero
Collaborator Author

Evizero commented Dec 28, 2017

A nice side-effect of this is that the existence of the downloaded archive file is never checked. As a consequence, a user could have the dataset pre-downloaded and extracted without keeping the archive file around.
