Fix some celery queue related ci failure. #8404

Merged
merged 2 commits into iterative:main from karajan1001:fix_ci Nov 11, 2022

Conversation

karajan1001
Contributor

@karajan1001 karajan1001 commented Oct 6, 2022

wait for #8349
fix: #8403

  1. Make `follow` exit after the tasks have finished.
  2. Remove some of the flaky marks.

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@karajan1001 karajan1001 added ci I keep failing, you keep fixing A: experiments Related to dvc exp bugfix fixes bug A: task-queue Related to task queue. labels Oct 6, 2022
@karajan1001 karajan1001 self-assigned this Oct 6, 2022
Contributor

@pmrowla pmrowla left a comment

@karajan1001 looks like there are CI failures with these test changes.

Comment on lines 330 to 332
except FileNotFoundError:
    pass
Member


What are the chances of this file never being created? Worried about an infinite loop here.

Contributor Author

The time cost here is waiting for the data transfer to finish, but we do not know how long that will take.

One solution is to give a warning and exit if it hasn't finished after 5 or 10 seconds.
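As a rough sketch, that bounded warn-and-exit wait could look something like the following; the `_iter_done_tasks()` helper and `entry.stash_rev` follow the names used in the diff below, but the signature and defaults here are assumptions, not DVC's actual code:

import logging
import time

logger = logging.getLogger(__name__)


def wait_for_done(queue, entry, timeout: float = 10.0, interval: float = 1.0) -> bool:
    """Poll the queue's finished tasks until `entry` shows up or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if any(done == entry for _, done in queue._iter_done_tasks()):
            return True
        time.sleep(interval)
    logger.warning("Timed out waiting for experiment %s to finish.", entry.stash_rev)
    return False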

@skshetry
Member

skshetry commented Oct 7, 2022

@karajan1001, try changing the following line:

pytest-filter:
- "import or plot or live or experiment"
- "not (import or plot or live or experiment)"

to:

pytest-filter:
- "test_queue or experiment or exp"
- "test_queue or experiment or exp"

That will run 20 (2x10) jobs. If you need more jobs, add more lines to pytest-filter. There may be other flaky tests besides the one you have parametrized.

@dtrifiro dtrifiro changed the title Fix some celery queue realted ci failure Fix some celery queue related ci failure Oct 10, 2022
@karajan1001 karajan1001 changed the title Fix some celery queue related ci failure [WIP] Fix some celery queue related ci failure Oct 11, 2022
@karajan1001 karajan1001 changed the title [WIP] Fix some celery queue related ci failure Fix some celery queue related ci failure Nov 8, 2022
@karajan1001 karajan1001 changed the title Fix some celery queue related ci failure [WIP]Fix some celery queue related ci failure. Nov 8, 2022
Comment on lines 329 to 340
MAX_RETRY = 5
for _ in range(MAX_RETRY):
    for _, queue_entry in self._iter_done_tasks():
        if queue_entry == entry:
            logger.debug("entry %s finished", entry.stash_rev)
            return
    time.sleep(1)
logger.warning(
    "Post process experiment %s time out with max retries %d.",
    entry.stash_rev,
    MAX_RETRY,
)
Contributor

This doesn't belong in follow(); it will break the use case where the user is using queue logs -f and ctrl-c's to stop viewing the logs (it should exit without waiting for the underlying task to finish).

If there are places that use follow() but actually need to wait for the entire task to finish, we should really be doing something like:

celery_queue.follow(entry)
celery_queue.get_result(entry)

get_result() has better logic for waiting until the given entry is completed; we should avoid this kind of busy-wait sleep() whenever possible.
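For illustration only, a wait built on the Celery result backend (rather than a sleep() loop) might look roughly like this; the task-id plumbing here is an assumption and not DVC's actual get_result() implementation:

from celery.result import AsyncResult


def wait_for_task(task_id: str, app, timeout: float = 60.0):
    """Block on the result backend until the task finishes, instead of polling."""
    result = AsyncResult(task_id, app=app)
    # get() waits for completion and re-raises any exception from the task,
    # so no manual sleep()/retry loop is needed.
    return result.get(timeout=timeout)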

Contributor Author

Yes, the problem is that get_result is leaky: we might get the result before the task is complete.

def _load_collected(rev: str) -> Optional[ExecutorResult]:
    executor_info = _load_info(rev)
    if executor_info.status > TaskStatus.SUCCESS:
        return executor_info.result
    raise FileNotFoundError

try:
    return _load_collected(entry.stash_rev)
except FileNotFoundError:
    # Infofile will not be created until execution begins
    pass

Here we look into the result directly without checking `AsyncResult.ready()` or waiting until `AsyncResult.get()` returns, and this is where the problem is.
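A hedged sketch of that missing check, assuming we can obtain the task's AsyncResult for the entry (the lookup is hypothetical; only `_load_collected` comes from the snippet above):

from celery.result import AsyncResult


def load_result_when_ready(async_result: AsyncResult, rev: str):
    """Read the collected executor result only after Celery reports the task as done."""
    if not async_result.ready():
        # Still queued or running: the infofile may not exist or be final yet.
        return None
    async_result.get()  # re-raises if the task failed
    return _load_collected(rev)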

@codecov

codecov bot commented Nov 9, 2022

Codecov Report

Base: 94.31% // Head: 93.98% // Decreases project coverage by -0.32% ⚠️

Coverage data is based on head (3dbf052) compared to base (929de7c).
Patch coverage: 94.87% of modified lines in pull request are covered.

❗ Current head 3dbf052 differs from pull request most recent head af35413. Consider uploading reports for the commit af35413 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8404      +/-   ##
==========================================
- Coverage   94.31%   93.98%   -0.33%     
==========================================
  Files         430      430              
  Lines       32840    32839       -1     
  Branches     4592     4587       -5     
==========================================
- Hits        30972    30864     -108     
- Misses       1448     1538      +90     
- Partials      420      437      +17     
Impacted Files Coverage Ξ”
dvc/repo/experiments/show.py 91.89% <ΓΈ> (+1.08%) ⬆️
dvc/repo/experiments/executor/base.py 83.10% <50.00%> (+1.08%) ⬆️
dvc/repo/experiments/queue/celery.py 87.26% <86.95%> (-1.74%) ⬇️
dvc/repo/experiments/executor/local.py 89.07% <100.00%> (ΓΈ)
dvc/repo/experiments/run.py 97.43% <100.00%> (-0.07%) ⬇️
tests/func/experiments/test_experiments.py 99.71% <100.00%> (ΓΈ)
tests/func/experiments/test_queue.py 100.00% <100.00%> (ΓΈ)
tests/func/experiments/test_show.py 98.82% <100.00%> (+0.20%) ⬆️
...ests/unit/repo/experiments/test_executor_status.py 98.48% <100.00%> (+0.12%) ⬆️
tests/func/test_unprotect.py 78.57% <0.00%> (-21.43%) ⬇️
... and 23 more


@karajan1001
Contributor Author

karajan1001 commented Nov 10, 2022

@skshetry celery tests in 3.11 are flaky, and when I looked at their home page https://github.com/celery/celery, I found that they only declare support for 3.7 ~ 3.10. Maybe we need to skip some of the tests in 3.11?

@skshetry
Member

@skshetry celery tests in 3.11 are flaky, and when I looked at their home page https://github.com/celery/celery, I found that they only declare support for 3.7 ~ 3.10. Maybe we need to skip some of the tests in 3.11?

The error does look legit. Why is it trying to remove the .dvc directory?

@pmrowla
Contributor

pmrowla commented Nov 10, 2022

@skshetry celery tests in 3.11 are flaky, and when I looked at their home page https://github.com/celery/celery, I found that they only declare support for 3.7 ~ 3.10. Maybe we need to skip some of the tests in 3.11?

@karajan1001 are they flaky (and sometimes pass) in 3.11 or do they always fail? Either way, it is probably ok to just mark the celery tests with:

@pytest.mark.skipif(sys.version_info >= (3, 11), reason="celery unsupported in 3.11")
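For example, applied to a hypothetical celery-backed test:

import sys

import pytest


@pytest.mark.skipif(sys.version_info >= (3, 11), reason="celery unsupported in 3.11")
def test_celery_queue_roundtrip():
    # Hypothetical test body; the marker keeps it from running on Python 3.11+.
    ...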

We may also need to consider disabling the queue-related commands (or at least outputting a warning) on 3.11, but that can be addressed in a separate issue (similar to how we don't support hydra functionality in 3.11).

@skshetry
Member

skshetry commented Nov 10, 2022

They are already failing in main, so we can ignore here. The issue seems unrelated to celery at a quick glance.

So there is no need to xfail/skip it; celery seems to be working fine on 3.11 (given it's pure Python). The failure looks to be our fault, and we need to investigate it separately.

@skshetry
Member

skshetry commented Nov 10, 2022

Looking into it, it always passes for me on Windows (and cls._repro_dvc is correctly closing all the state-db-related handles, so ResourceMonitor does not show any associated handles for cache.db at the end).

@karajan1001
Contributor Author

karajan1001 commented Nov 10, 2022

@karajan1001 are they flaky (and sometimes pass) in 3.11 or do they always fail?

They always fail on Windows.
They always pass on Ubuntu and macOS.

They are already failing in main, so we can ignore here. The issue seems unrelated to celery at a quick glance.

Let's track them in a separate issue.

@skshetry
Member

@karajan1001, can you remove the changes to pytest-filter in the GitHub workflow? After @pmrowla approves, we can merge this.

Contributor

@pmrowla pmrowla left a comment

Don't forget to also remove the @pytest.mark.parametrize("repeat", range(10)) usage, in addition to the pytest-filter changes, before merging.

fix: iterative#8403
1. remove some of the flaky mark
2. In `get_result` make sure the celery task is completed.
1. Modify run all to include currently running exps.
2. bump dvc-task to 0.1.5
@skshetry
Member

The failure looks similar to python/cpython#97641; however, I am not able to reproduce it locally.

skshetry added a commit to skshetry/dvc that referenced this pull request Nov 10, 2022
There seems to be a regression in Python 3.11 where sqlite connections
are not deallocated, due to internal changes in Python 3.11 that now use
an LRU cache. They are not deallocated until `gc.collect()` is called.

See python/cpython#97641.
This affects only Windows, because when we try to remove the tempdir for
the exp run, the sqlite connection is still open, which prevents us from
deleting that folder.

Although this may happen in a real scenario in `exp run`, I am only fixing
the tests by mocking `dvc.close()` and extending it to call `gc.collect()`
afterwards. We could also mock `State.close()`, but I did not want to mock
something that is not in dvc itself.

`diskcache` uses thread-local connections, so they are expected to be
garbage collected, and therefore it does not provide a good way to close
the connections. The only API it offers is `self.close()`, and that only
closes the main thread's connection. If we had access to the connection,
an easier way would have been to explicitly call `conn.close()`.
But we don't have such an option at the moment.

Related: iterative#8404 (comment)
GHA Failure: https://github.com/iterative/dvc/actions/runs/3437324559/jobs/5731929385#step:5:57
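As a minimal sketch of the workaround described in that commit message, assuming pytest-mock's `mocker` fixture; the patch target (`Repo.close`) and the fixture name here are illustrative and may differ from the actual fix in #8547:

import gc

import pytest


@pytest.fixture
def collect_after_close(mocker):
    """Run gc.collect() right after Repo.close() so Python 3.11 releases its
    cached sqlite connections (see python/cpython#97641)."""
    from dvc.repo import Repo

    original_close = Repo.close

    def _close(self):
        original_close(self)
        gc.collect()

    mocker.patch.object(Repo, "close", _close)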
@karajan1001
Contributor Author

Looks like I cannot force-merge it. Required statuses must pass before merging.

@karajan1001 karajan1001 changed the title [WIP]Fix some celery queue related ci failure. Fix some celery queue related ci failure. Nov 11, 2022
@skshetry
Member

I have a fix for the test in #8547.

@skshetry skshetry merged commit fa54c1a into iterative:main Nov 11, 2022
skshetry added a commit to skshetry/dvc that referenced this pull request Nov 11, 2022
@karajan1001 karajan1001 deleted the fix_ci branch November 11, 2022 01:35
skshetry added a commit that referenced this pull request Nov 16, 2022
Labels
A: experiments Related to dvc exp A: task-queue Related to task queue. bugfix fixes bug ci I keep failing, you keep fixing
Development

Successfully merging this pull request may close these issues.

Celery queue related CI failures
3 participants