_list_oids_traverse is not traversing 00 prefix for s3 remote #192

Closed · petrchmelar opened this issue Feb 3, 2023 · 6 comments · Fixed by #193
Labels: bug, p0-critical

petrchmelar commented Feb 3, 2023

Hi,

We are using DVC with an S3 remote, and dvc status -c has started to behave a bit strangely.
With dvc 2.43.2 (dvc-data 0.37.3), dvc status -c reports a new file to push, even though I'm 100% sure the file already exists in storage (I've run dvc push several times and also checked with aws s3 ls just to be sure).

I've also tried the same with dvc==2.30.0 (dvc-data 0.17.1), and that version behaves as expected: it reports no file to push.

I did a bit of debugging and found that the md5 of the problematic file starts with 00.
After more digging I found a piece of code, refactored a few weeks ago, that looks like the culprit.
The _list_oids_traverse function traverses over prefixes, but traverse_prefixes only contains the prefixes 01 to ff and 001 to 00f; the 00 prefix is never visited.
When I hacked the function by adding a simple traverse_prefixes.append("00"), dvc status -c reported no file to push, as expected.
That smells a bit fishy to me and I suspect it indicates a bug.

https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/db.py#L302

    def _list_oids_traverse(self, remote_size, remote_oids, jobs=None):
        """Iterate over all oids found in this fs.
        Hashes are fetched in parallel according to prefix, except in
        cases where the remote size is very small.

        All oids from the remote (including any from the size
        estimation step passed via the `remote_oids` argument) will be
        returned.

        NOTE: For large remotes the list of oids will be very
        big (e.g. 100M entries; each md5 is 32 bytes, so a ~3200MB list)
        and we don't really need all of it at the same time, so it makes
        sense to use a generator to gradually iterate over it, without
        keeping all of it in memory.
        """
        num_pages = remote_size / self.fs.LIST_OBJECT_PAGE_SIZE
        if num_pages < 256 / self.fs.jobs:
            # Fetching prefixes in parallel requires at least 255 more
            # requests, for small enough remotes it will be faster to fetch
            # entire cache without splitting it into prefixes.
            #
            # NOTE: this ends up re-fetching oids that were already
            # fetched during remote size estimation
            traverse_prefixes = None
        else:
            yield from remote_oids
            traverse_prefixes = [f"{i:02x}" for i in range(1, 256)]
            if self.fs.TRAVERSE_PREFIX_LEN > 2:
                traverse_prefixes += [
                    "{0:0{1}x}".format(i, self.fs.TRAVERSE_PREFIX_LEN)
                    for i in range(1, pow(16, self.fs.TRAVERSE_PREFIX_LEN - 2))
                ]

        yield from self._list_oids(prefixes=traverse_prefixes, jobs=jobs)
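
To make the gap concrete, here is a small standalone snippet (my own illustration, just replaying the prefix-generation expressions above, assuming TRAVERSE_PREFIX_LEN = 3 as in my setup) showing that neither "00" nor "000" ever makes it into traverse_prefixes:

    TRAVERSE_PREFIX_LEN = 3  # assumed value for the S3 filesystem

    # Same comprehensions as in _list_oids_traverse above:
    traverse_prefixes = [f"{i:02x}" for i in range(1, 256)]
    if TRAVERSE_PREFIX_LEN > 2:
        traverse_prefixes += [
            "{0:0{1}x}".format(i, TRAVERSE_PREFIX_LEN)
            for i in range(1, pow(16, TRAVERSE_PREFIX_LEN - 2))
        ]

    print("00" in traverse_prefixes)   # False
    print("000" in traverse_prefixes)  # False
    print(min(traverse_prefixes))      # '001', so oids starting with 00 are skipped
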
╰─❯ dvc doctor
DVC version: 2.43.2 (pip)
-------------------------
Platform: Python 3.8.10 on Linux-4.18.0-408.el8.x86_64-x86_64-with-glibc2.29
Subprojects:
        dvc_data = 0.37.3
        dvc_objects = 0.19.1
        dvc_render = 0.1.0
        dvc_task = 0.1.11
        dvclive = 1.4.0
        scmrepo = 0.1.7
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: symlink
Cache directory: xfs on /dev/mapper/data-srv
Caches: local
Remotes: s3
Workspace directory: xfs on /dev/mapper/data-srv
Repo: dvc, git
╰─❯ dvc doctor
DVC version: 2.30.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-4.18.0-408.el8.x86_64-x86_64-with-glibc2.29
Subprojects:
        dvc_data = 0.17.1
        dvc_objects = 0.7.0
        dvc_render = 0.0.12
        dvc_task = 0.1.3
        dvclive = 1.4.0
        scmrepo = 0.1.1
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: symlink
Cache directory: xfs on /dev/mapper/data-srv
Caches: local
Remotes: s3
Workspace directory: xfs on /dev/mapper/data-srv
Repo: dvc, git

You are using dvc version 2.30.0; however, version 2.43.2 is available.
To upgrade, run 'pip install --upgrade dvc'.
pmrowla self-assigned this Feb 4, 2023
Maimonator commented

Ahh! Too bad I missed this message before looking into this.
I've been having the same problem and I've found the root cause to be this line:

TRAVERSE_PREFIX_LEN = 3

I've been using dvc 2.10, which presumably had TRAVERSE_PREFIX_LEN = 2, as all of the folders in the relevant S3 bucket have a two-character prefix.
It seems to me that with recent changes (well, probably a lot of changes) TRAVERSE_PREFIX_LEN was changed to 3.
Specifically, in my case existing oids would be filtered out by _estimate_remote_size in the following flow:
the oids_exist function is called with some oids,
then these oids are passed to _estimate_remote_size,
then the oids are overridden by the result of _oids_with_limit, where prefix is now "000" and not "00":

            oids = self._oids_with_limit(
                max_oids / total_prefixes, prefixes=[prefix]
            )

and finally we get an empty set of oids: remote_oids = set(iter_with_pbar(oids))

As a result, I couldn't pull an existing oid because it could not be found in the bucket. A minimal illustration of the effect is below.
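
Here is a minimal plain-Python illustration (not the actual dvc-objects code; the keys are made up) of why an oid under the "00" prefix vanishes when the estimation step lists "000" instead of "00":

    # Hypothetical bucket contents: one oid under the "00" prefix.
    remote_keys = {"00ab12cd", "01ef34ab", "ff56cd78"}

    def list_prefix(prefix):
        # Stand-in for listing remote objects under a given key prefix.
        return {k for k in remote_keys if k.startswith(prefix)}

    print(list_prefix("00"))   # {'00ab12cd'} -> found with the 2-char prefix
    print(list_prefix("000"))  # set()        -> missed with the 3-char prefix
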
Hopefully this relates to the problem and helps shed some light.

Let me know if this sounds right

pmrowla (Contributor) commented Feb 6, 2023

This problem has come up before (iterative/dvc#4141, iterative/dvc#6089) and was fixed, but it may have been reintroduced by the performance-related refactoring in #180.

We do have tests that are supposed to catch this issue, but the performance refactor also touched those tests, which may be why it was not caught.

pmrowla (Contributor) commented Feb 6, 2023

Also, just to clarify:

> The _list_oids_traverse function traverses over prefixes, but traverse_prefixes only contains the prefixes 01 to ff and 001 to 00f; the 00 prefix is never visited.

This is intentional: the missing 000 prefix is supposed to be searched during the remote size estimation that runs before _list_oids_traverse, so to finish listing an entire DVC remote, _list_oids_traverse should only need to search the remaining possible prefixes (001-00f, 01-ff).
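
A quick standalone check (my own sketch, not dvc-objects code) that the estimation prefix "000" plus the traversal prefixes together cover every possible oid prefix:

    # "000" is handled during size estimation; traversal handles the rest.
    estimation = {"000"}
    traversal = {f"{i:03x}" for i in range(1, 16)} | {f"{i:02x}" for i in range(1, 256)}
    covered = estimation | traversal

    # Every possible 3-hex-char oid prefix must start with some covered prefix.
    all_prefixes = {f"{i:03x}" for i in range(16 ** 3)}
    print(all(any(p.startswith(c) for c in covered) for p in all_prefixes))  # True
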

pmrowla (Contributor) commented Feb 6, 2023

The issue is that the fs.find(..., prefix=True) usage was mistakenly dropped in the performance refactor (#180). This was not caught by the dvc-objects tests, since the prefix flag only applies to specific fsspec filesystems for object-storage-based clouds. It should have been caught by downstream dvc tests, but the 00... edge-case tests were mistakenly lost when we migrated to dvc[testing] and separate plugins for testing the remote clouds (iterative/dvc@b6f2a8e).
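
For context, a toy sketch (plain Python; the keys and the find helper are made up, this is not the dvc-objects API) of the difference the prefix flag makes on object storage, where "directories" are just key prefixes:

    keys = ["cache/00ab12", "cache/00/cd34", "cache/01ef56"]

    def find(path, prefix=False):
        # Toy stand-in for an object-storage find().
        if prefix:
            # Treat the last path segment as a key prefix.
            return [k for k in keys if k.startswith(path)]
        # Otherwise treat the path as a directory.
        return [k for k in keys if k.startswith(path + "/")]

    print(find("cache/00", prefix=True))  # ['cache/00ab12', 'cache/00/cd34']
    print(find("cache/00"))               # ['cache/00/cd34']
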

pmrowla (Contributor) commented Feb 6, 2023

This issue is fixed by installing ~~dvc-objects==0.19.2~~ dvc-objects==0.19.3; the fix will also be available in the next dvc release.

pmrowla (Contributor) commented Feb 6, 2023

There was another issue that came up in the downstream DVC CI; the updated fix is in dvc-objects==0.19.3 now (the downstream DVC release should be available shortly).
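
For anyone hitting this before the next DVC release, upgrading the subproject directly should pick up the fix (standard pip usage):

╰─❯ pip install --upgrade "dvc-objects>=0.19.3"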
