Queued experiments fail, can't see logs to understand what's up #4332

Open
shcheklein opened this issue Jul 23, 2023 · 8 comments
Labels
A: experiments Area: experiments table webview and everything related blocked Issue or pull request blocked due to other dependencies or issues bug Something isn't working DVC-first Needs to be done first for DVC priority-p1 Regular product backlog

Comments

@shcheklein
Member

Screen.Recording.2023-07-23.at.10.29.18.AM.mov

Single experiment runs fine even within a queue. Might be related to some resource allocation when it tries to do 2 of them, but I can't say what's going on.

This is related to a user complaint I am looking into:

But if i run the same queue again..No metrics are getting displayed in Studio
Also these logged experiments are not getting deleted

@shcheklein shcheklein added bug Something isn't working A: experiments Area: experiments table webview and everything related triage labels Jul 23, 2023
@mattseddon
Member

Depends on iterative/dvc#9425

@shcheklein
Member Author

cc @dberenbaum

@dberenbaum
Contributor

dberenbaum commented Jul 24, 2023

@shcheklein Does dvc queue logs motor-abac work?

Edit: asking to see if it's only about iterative/dvc#9425 or if it is also related to iterative/dvc#9616.

@shcheklein shcheklein added the blocked Issue or pull request blocked due to other dependencies or issues label Jul 25, 2023
@mattseddon mattseddon removed the triage label Jul 26, 2023
@mjunker

mjunker commented Aug 4, 2023

I ran into the same issue. My impression is that this happens mainly when running lots of experiments (100 or more). I create experiments using the CLI like this:

dvc exp run -S <several params with ranges> --queue --temp -f -n <name>
dvc queue start -j 8

Also I do not get any output from dvc queue logs . When I apply the failed experiment to workspace it runs without any problem.
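
For anyone trying to reproduce this, the workflow above can be sketched as a small Python helper. It only builds the command lines; nothing runs unless `run` is called. The override `train.lr=range(1,4)` and the experiment name `sweep` are hypothetical placeholders, not values from this report. The `dvc` subcommands and flags are the ones quoted in this thread, plus `dvc queue status` to check task states.

```python
import shlex
import subprocess


def queue_commands(param_overrides, name, jobs=8):
    """Build the DVC commands that queue a sweep and start the workers.

    param_overrides: values passed to -S, e.g. ["train.lr=range(1,4)"]
    (a hypothetical parameter; substitute your own).
    """
    exp_run = ["dvc", "exp", "run"]
    for override in param_overrides:
        exp_run += ["-S", override]
    exp_run += ["--queue", "--temp", "-f", "-n", name]
    return [
        exp_run,
        ["dvc", "queue", "start", "-j", str(jobs)],
        ["dvc", "queue", "status"],
    ]


def logs_command(task):
    """Build the command that asks the queue for one task's logs."""
    return ["dvc", "queue", "logs", task]


def run(commands):
    """Echo and execute each command; failures do not stop the loop."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        subprocess.run(cmd, check=False)
```

Calling `run(queue_commands(["train.lr=range(1,4)"], "sweep"))` inside a DVC repository would queue the sweep and start 8 workers; `run([logs_command("motor-abac")])` then requests the logs for one task, which is the step that returns nothing in this report.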

@shcheklein shcheklein added the priority-p1 Regular product backlog label Aug 17, 2023
@shcheklein
Member Author

Making it p1 (even though it's blocked) since it comes up very often (cc @dberenbaum - let me know what you think). For the record, we also had a new ticket, #4524, and I think I saw it today on Discord.

@dberenbaum
Contributor

The priority makes sense, but there are two underlying issues, and I'm not sure if we are trying to cover both here:

  1. dvc#9425 (exp show: provide executor information for finished experiments): VS Code needs to detect which experiments have logs.
  2. dvc#9616 (queue: log dvc errors): Queued experiments fail to generate logs if error happened before stage runs.

@shcheklein shcheklein added the DVC-first Needs to be done first for DVC label Aug 22, 2023
@shcheklein
Member Author

@dberenbaum I think eventually we try to cover both. If I understand correctly, "Queued experiments fail to generate logs if error happened before stage runs" covers errors that could happen if there is no data, or someone forgot to commit params.yaml, and so on. If there is no easy way to see those errors and then restart those experiments, it becomes a problem during onboarding.

The first item will be fixed automatically if the first one is fixed, right?

What is your take? What is the complexity and scope of both on the DVC side?
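
Until such errors are logged, one workaround is to check for the causes mentioned above before queueing. The sketch below flags watched files that `git status --porcelain` reports as modified, staged, or untracked. It is a minimal sketch under assumptions: the watched list (`params.yaml`, `dvc.yaml`) is a hypothetical default, the parser handles only the plain porcelain v1 layout (two status characters, a space, then the path), and it is not part of DVC or the VS Code extension.

```python
import subprocess


def uncommitted_paths(porcelain_output, watched=("params.yaml", "dvc.yaml")):
    """Return (status, path) pairs for watched files with uncommitted changes.

    porcelain_output is the text printed by `git status --porcelain`.
    """
    flagged = []
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        status, path = line[:2], line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        if path in watched:
            flagged.append((status, path))
    return flagged


def git_porcelain():
    """Run git in the current directory and return its porcelain status text."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A non-empty result from `uncommitted_paths(git_porcelain())` would be a hint to commit or review those files before running `dvc exp run --queue`.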

@dberenbaum
Contributor

The first item will be fixed automatically if the first one is fixed, right?

What did you mean here? I don't know that either one will fix the other. We are discussing in the tickets above what the options are to solve each and what level of effort it takes.

4 participants