Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postmortem: Files could not be fetched from remotes in specific circumstances #8967

Closed
pmrowla opened this issue Feb 6, 2023 · 5 comments
Closed
Assignees

Comments

@pmrowla
Copy link
Contributor

pmrowla commented Feb 6, 2023

High level summary

Performance-related refactoring in dvc-objects re-introduced an old bug where files with an MD5 hash starting with 00 would be reported as missing from a remote in specific situations even though the files had been pushed properly. This would result in incorrect dvc status -c output as well as preventing DVC from fetching the files.

Timeline

All times in UTC+9

Perf indicators

  • Time to identify: 15 days (time between affected DVC release and user report)
  • Time to fix: 1 day
  • Time to resolve issue: 3 days (time from user report, includes a weekend, user report was made ~3AM UTC+9 on a Saturday)

Impact

Users with large remotes on specific object-storage based clouds could encounter this bug when trying to fetch files with MD5 hashes beginning with 00. Which clouds were affected depends on individual fsspec fs.find() implementation (confirmed on S3 and Azure, issue not reproducible with GCS). Encountering the bug is also dependent on triggering the traverse-based existence check behavior in DVC (which is based on the relative difference between total number of files in the remote and total number of files the user wishes to fetch).

Estimating a number or percentage of impacted DVC users is not feasible given the specific edge-case nature of the bug.

Root cause analysis

This issue had been encountered and fixed in the past (#4141, #6089), and tests for this edge case were added in DVC when the issue was fixed at that time. However, when the remote plugins were separated in to their own sub-projects and remote tests were refactored/migrated into dvc[testing], the specific tests for this bug were mistakenly removed entirely and not migrated into the new remote test framework (b6f2a8e).

Given that this was an issue we have encountered and fixed in the past, it should have been caught. Unfortunately, since the old tests were lost, the bug was not caught when it was re-introduced.

Prevention

  • When doing large refactors (especially ones which affect testing), team should take care to ensure that there is no loss of edge/corner case test coverage.
@pmrowla pmrowla self-assigned this Feb 6, 2023
@efiop
Copy link
Contributor

efiop commented Feb 6, 2023

Another thing that we always discuss with @pmrowla regarding that logic is that it is quite complicated and we hope to replace or improve it by incorporating indexes into odbs and maybe basing our workflow on index building and not on tricky prefix estimations.

@shcheklein
Copy link
Member

Thanks folks! This is very helpful. A few comments on this:

First, how about we introduce the "time to triage" (where we should be identifying this is p0 or not). After that it would be easier to reduce this number Time to resolve issue: 3 days for critical issues. Doesn't mean a particular person should be spending weekend on this. What I mean that we all should know that we have a p0 asap and try to discuss if someone could take it and solve asap depending on their capacity.

When doing large refactors (especially ones which affect testing), team should take care to ensure that there is no loss of edge/corner case test coverage.

I can't come up with some other ideas, but I wonder what actual steps we can do to improve this? More reviews (can slow us down though), creating checklists / sharing the plan (not sure it can help in this case), more comments for tests?

Do you remember what actually caused the test to be removed? Did it look too complicated? Did we remove it because we want to migrate to the new logic?

Another thing that we always discuss with @pmrowla regarding that logic is that it is quite complicated and we hope to replace or improve it by incorporating indexes into odbs and maybe basing our workflow on index building and not on tricky prefix estimations.

Make sense. Would be really great to have also an epic (story, whatever) for visibility and knowledge sharing, including a checklist of items to check that we could start collecting (from the top of my head- this edge case).

And the last question: should we pull the broken version from PyPi, or deprecate it? (I'm not sure we've done this before, but in case we detect a severe issue with data reliability - should we consider deprecating versions?)

@pmrowla
Copy link
Contributor Author

pmrowla commented Feb 7, 2023

Do you remember what actually caused the test to be removed?

We migrated all of the remote tests to the dvc.testing framework when the remote plugins were split into their own repositories. I don't think removing this test was intentional. The migration wasn't just a straight cut/paste of the old tests to the new location - some of the tests were duplicates, some of the tests belonged elsewhere (as smaller unit tests in dvc-objects) and some of them were cleaned up/refactored (since a lot of the old tests still used unittest instead of pytest). I'm guessing that someone just thought these tests were duplicated elsewhere and removed them entirely by mistake.

@shcheklein
Copy link
Member

Thanks @pmrowla ! Just a suggestion / an idea for another possible steps- write test descriptions (e.g. link to a ticket, mention that it's an important edge case, etc). Not sure at it would have helped here though.

@skshetry
Copy link
Member

skshetry commented Feb 7, 2023

I think it’s better to ask for review in these kinds of refactoring. It’ll force you to make small and isolated PRs. At worse, reviews won’t make any difference. At best, we might detect these issues.

@pmrowla pmrowla closed this as completed Feb 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

No branches or pull requests

4 participants