
Old bug (dvc pull from S3 and md5 starts with 00) reappears with latest release #6089

Closed
jnd77 opened this issue Jun 1, 2021 · 17 comments · Fixed by #6151
Assignees
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. research

Comments

@jnd77

jnd77 commented Jun 1, 2021

Bug Report

dvc pull does not seem to work for files whose md5 starts with 00 when the remote is S3.

Description

Bug #4141 has reappeared in release 2.2.0 (it works fine with 2.1.0).
It might have been introduced by this change:
#5683

Let me know if you need more information ... :)

@efiop efiop added the p0-critical Critical issue. Needs to be fixed ASAP. label Jun 1, 2021
@isidentical
Contributor

Would you mind sending the full traceback?

@efiop efiop added research bug Did we break something? labels Jun 1, 2021
@isidentical
Contributor

I just tried this as a test case and it seemed to work ("363"'s hash starts with 00, which is confirmed by the last assert):

@pytest.mark.parametrize("remote", [pytest.lazy_fixture("s3")], indirect=True)
def test_pull_00_prefix(tmp_dir, dvc, remote):
    # md5("363") starts with "00", so its cache entry lands in the 00/ prefix
    tmp_dir.dvc_gen({"foo": "363"})
    dvc.push()
    clean(["foo"], dvc)  # test helper: drop the workspace file and local cache

    stats = dvc.pull("foo")
    assert stats["fetched"] == 1
    assert stats["added"] == ["foo"]

    # the only cache directory on the remote should be the "00" prefix
    [cache_dir] = dvc.cloud.get_remote("upstream").fs.ls(remote)
    assert cache_dir.endswith("00")
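
As a quick sanity check of why "363" is used here (DVC's file hash is the md5 of the file contents), the value can be confirmed with hashlib; the same hash also shows up as foo's md5 in the pull log later in this thread:

import hashlib

# prints 00411460f7c92d2124a67ea0f4cb5f85 -- a hash starting with "00"
print(hashlib.md5(b"363").hexdigest())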

@pmrowla
Contributor

pmrowla commented Jun 1, 2021

@isidentical I think we'd need to test pulling more than one file at a time. If we're only pushing/pulling a single file, we do a direct exists() query instead of using walk/ls when checking whether objects exist in the remote.

It looks like we might not be handling the DVC-specific prefix kwarg for fs.walk_files()/walk() any more; if we're just passing it directly into fsspec's fs.ls now, fsspec won't do anything with it.

See #4141 (comment) for an explanation of the original bug, and #4149 (comment) for discussion of why we added the prefix kwarg.

Essentially, there are times when we do walk("00/0", prefix=True) where we really want to query for S3 keys that start with 00/0, not keys inside 00/0/ (I'm not sure which API call s3fs will actually run in this case).
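
For illustration, a minimal sketch of the difference between the two kinds of query at the S3 API level (using boto3, which s3fs wraps; the bucket name is hypothetical):

import boto3

s3 = boto3.client("s3")

# prefix query: matches every key *starting with* "00/0",
# e.g. "00/00000018e6137ac2caab16074784a6"
resp = s3.list_objects_v2(Bucket="my-dvc-remote", Prefix="00/0")

# directory-style query: only matches keys inside a "00/0/" pseudo-directory,
# which never exists in a DVC remote (cache keys are "<2 chars>/<30 chars>")
resp = s3.list_objects_v2(Bucket="my-dvc-remote", Prefix="00/0/", Delimiter="/")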

@isidentical
Contributor

Essentially, there are times when we do walk("00/0", prefix=True) where we really want to query for S3 keys that start with 00/0, not keys inside 00/0/ (I'm not sure which API call s3fs will actually run in this case).

Probably find() is the right API, but I recall discussing this with @efiop and noticing that it doesn't work on arbitrary prefixes. I'm looking into s3fs, but in the meantime I assume this optimization is something we could control (setting the prefix length to 2 should work, right?), so maybe we could expose that as an option.

@pmrowla
Contributor

pmrowla commented Jun 1, 2021

(setting the prefix length to 2 should work, right?)

Yeah, this should work in the meantime, but ideally we'd be able to use arbitrary prefix lengths longer than 2 for clouds that support it.

@efiop
Contributor

efiop commented Jun 1, 2021

It also looks like the test we added in #4149 is still present, but it only exercises the exists() path 🙁

@isidentical Yes, we actually use just 00/ for all non-S3-like clouds (TRAVERSE_PREFIX_LEN), so we could fall back to that for now if there is no better quick fix. But find should also work, considering that ls currently returns the full list rather than an iterator. Either way we lose some estimation speed (since we'll have to go through more files), but that is expected at this point in our migration.
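
For context, a hypothetical sketch of the fallback being described (the constant name matches the thread; the helper itself is illustrative, not DVC's actual code):

# With a 2-character prefix, every traverse query targets a real "xy/"
# directory, which any backend can list natively; with 3 characters the
# query is a partial key ("xy/z") that needs true prefix support.
TRAVERSE_PREFIX_LEN = 2  # fall back from 3 (S3-like) to 2 everywhere

def traverse_prefixes(prefix_len=TRAVERSE_PREFIX_LEN):
    # 16**2 = 256 two-char hex prefixes ("00/" ... "ff/"), versus
    # 16**3 = 4096 three-char prefixes ("00/0" ... "ff/f")
    if prefix_len == 2:
        return [f"{i:02x}/" for i in range(256)]
    return [f"{i:02x}/{j:x}" for i in range(256) for j in range(16)]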

@isidentical
Contributor

It also looks like the test we added in #4149 is still present, but it only exercises the exists() path 🙁

It would be great if we could add a more generic test for the pull itself. I've modified the case to include 1024 additional files, though it seems that wasn't enough (it still succeeds):

@pytest.mark.parametrize("remote", [pytest.lazy_fixture("s3")], indirect=True)
def test_pull_00_prefix(tmp_dir, dvc, remote):
    # pad the remote with extra objects to try to trigger the traverse path
    random_files = {f"random_{n}": str(n) for n in range(1024)}
    tmp_dir.dvc_gen({"foo": "363", "random": random_files})

    dvc.push()
    clean(["foo", "random"], dvc)

    stats = dvc.pull()
    assert "foo" in stats["added"]

@efiop
Contributor

efiop commented Jun 1, 2021

@isidentical The behavior depends on TRAVERSE_PREFIX_LEN (@pmrowla could correct me if I'm wrong), so we'll need to exceed that or, probably better, just patch it to be lower for the tests.

EDIT: I meant TRAVERSE_THRESHOLD_SIZE 🤦

@pmrowla
Contributor

pmrowla commented Jun 1, 2021

If you set

BaseFileSystem.TRAVERSE_THRESHOLD_SIZE = 0
BaseFileSystem.TRAVERSE_WEIGHT_MULTIPLIER = 1

in your test, it should force the prefix-based query for all push/pull/status operations involving more than one file, on all remote types.

@isidentical
Contributor

Still can't reproduce :/ (I also tried setting LIST_OBJECT_PAGE_SIZE, though that didn't help.)

@pytest.mark.parametrize("remote", [pytest.lazy_fixture("s3")], indirect=True)
def test_pull_00_prefix(tmp_dir, dvc, remote, monkeypatch):
    from dvc.fs.s3 import BaseFileSystem

    # patch via monkeypatch so the class attributes are restored after the test
    monkeypatch.setattr(BaseFileSystem, "TRAVERSE_THRESHOLD_SIZE", 0)
    monkeypatch.setattr(BaseFileSystem, "TRAVERSE_WEIGHT_MULTIPLIER", 1)

    random_files = {f"random_{n}": str(n) for n in range(32)}
    tmp_dir.dvc_gen({"foo": "363", "random": random_files})

    dvc.push()
    clean(["foo", "random"], dvc)

    stats = dvc.pull()
    assert "foo" in stats["added"]

@pmrowla
Contributor

pmrowla commented Jun 1, 2021

OK, it is actually reproducible if I also hardcode LIST_OBJECT_PAGE_SIZE to 1 (so large enough real-world S3 remotes will hit the issue):

$ dvc pull -v
2021-06-01 22:21:54,237 DEBUG: Check for update is disabled.
2021-06-01 22:21:55,042 DEBUG: Preparing to download data from 's3://dvc-temp/peter-test'
2021-06-01 22:21:55,043 DEBUG: Preparing to collect status from s3://dvc-temp/peter-test
2021-06-01 22:21:55,043 DEBUG: Collecting information from local cache...
2021-06-01 22:21:55,043 DEBUG: Collecting information from remote cache...
2021-06-01 22:21:55,044 DEBUG: Matched '0' indexed hashes
2021-06-01 22:21:56,461 DEBUG: Estimated remote size: 4096 files
2021-06-01 22:21:56,461 DEBUG: Querying '3' hashes via traverse
2021-06-01 22:22:07,352 WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: baz, md5: 0000000018e6137ac2caab16074784a6
name: foo, md5: 00411460f7c92d2124a67ea0f4cb5f85
2021-06-01 22:22:07,357 DEBUG: Downloading 's3://dvc-temp/peter-test/37/b51d194a7513e45b56f6524f2d51f2' to '.dvc/cache/37/b51d194a7513e45b56f6524f2d51f2'
2021-06-01 22:22:08,572 DEBUG: state save (103513906, 1622553728568792576, 3) 37b51d194a7513e45b56f6524f2d51f2
2021-06-01 22:22:08,579 DEBUG: 'bar' doesn't exist.
2021-06-01 22:22:08,580 DEBUG: Checking out 'bar' with cache 'object md5: 37b51d194a7513e45b56f6524f2d51f2'.
2021-06-01 22:22:08,581 DEBUG: Created 'reflink': .dvc/cache/37/b51d194a7513e45b56f6524f2d51f2 -> bar
2021-06-01 22:22:08,582 DEBUG: state save (103513913, 1622553728568792576, 3) 37b51d194a7513e45b56f6524f2d51f2
2021-06-01 22:22:08,586 DEBUG: cache for 'foo'('md5: 00411460f7c92d2124a67ea0f4cb5f85') has changed.
2021-06-01 22:22:08,587 WARNING: Cache 'md5: 00411460f7c92d2124a67ea0f4cb5f85' not found. File 'foo' won't be created.
2021-06-01 22:22:08,589 DEBUG: cache for 'baz'('md5: 0000000018e6137ac2caab16074784a6') has changed.
2021-06-01 22:22:08,590 WARNING: Cache 'md5: 0000000018e6137ac2caab16074784a6' not found. File 'baz' won't be created.
A       bar
1 file added and 2 files failed

Setting TRAVERSE_PREFIX_LEN to 2 fixes it.

For testing this I think a dataset with >4096 files and these values should hit the issue as well:

BaseFileSystem.TRAVERSE_THRESHOLD_SIZE = 0
BaseFileSystem.TRAVERSE_WEIGHT_MULTIPLIER = 1
BaseFileSystem.LIST_OBJECT_PAGE_SIZE = 1
BaseFileSystem.TRAVERSE_PREFIX_LEN = 3

We should probably also introduce an internal flag for forcing the different traverse behaviors, to simplify testing.
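
A hypothetical shape for such a flag (illustrative only, not an actual DVC API): force one strategy regardless of the size estimate, keeping the heuristic as the default.

class BaseFileSystem:
    # None = use the size-estimate heuristic; "traverse" or "exists"
    # forces one strategy, so tests don't have to tune thresholds,
    # multipliers, and page sizes to steer the decision
    _TRAVERSE_MODE = None

    def _should_traverse(self, estimated_size, n_queries):
        if self._TRAVERSE_MODE is not None:
            return self._TRAVERSE_MODE == "traverse"
        ...  # fall through to the existing threshold-based heuristic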

@efiop
Contributor

efiop commented Jun 2, 2021

@isidentical Thanks for the quick fix! 🙏

@jnd77 We are releasing 2.3.0 right now with a temporary fix. Thanks for the feedback!

@jnd77
Author

jnd77 commented Jun 3, 2021

Thanks a lot for the quick fix. Works fine now. :)

@jnd77 jnd77 closed this as completed Jun 3, 2021
@pmrowla
Contributor

pmrowla commented Jun 3, 2021

Keeping this issue open for now, since the current hotfix is just a workaround until we have a proper solution in the fsspec backends.

@pmrowla pmrowla reopened this Jun 3, 2021
@isidentical
Contributor

All cloud providers now support find(..., prefix=...). I'm awaiting some changes on the adlfs side for an edge case, but after that we should be good to go.
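
For example, with s3fs (bucket and prefix hypothetical), this lists every key under 00/ whose remainder starts with 0, i.e. the walk("00/0", prefix=True) case from earlier, in a single call:

import s3fs

fs = s3fs.S3FileSystem()
# keys matching "00/0*", with no client-side filtering needed
keys = fs.find("my-dvc-remote/00", prefix="0")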

@efiop
Copy link
Contributor

efiop commented Jun 8, 2021

@karajan1001 FYI ^^^ Please check that ossfs supports it as well.
