parsl provider error messages are lost #679

Open
benclifford opened this issue Jan 27, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@benclifford
Contributor

benclifford commented Jan 27, 2022

Describe the bug
This is based on a report in the #help slack channel

When the slurm provider fails to scale out, the code that is supposed to report that to the user fails in potentially several ways:

  1. This seems to be a static type error in the exception handling code for a failing scale_out: when constructing a more specific exception, the code reads self.config, but Interchange indeed has no config attribute.
Submission of command to scale_out failed
2022-01-27 14:33:58.605 funcx_endpoint.strategies.simple:43 [ERROR] Caught error in strategize : 'Interchange' object has no attribute 'config'
Traceback (most recent call last):
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/strategies/simple.py", line 41, in strategize
    self._strategize(*args, **kwargs)
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/strategies/simple.py", line 143, in _strategize
    self.interchange.scale_out(excess_blocks)
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/executors/high_throughput/interchange.py", line 1151, in scale_out
    self.config.provider.label,
AttributeError: 'Interchange' object has no attribute 'config'
  2. The parsl layer logs an error to e.g. parsl.providers.slurm, but the endpoint admin was unable to find the relevant log message. Maybe it should appear around the same place as the above report? The relevant parsl log line is:
            logger.error("Retcode:%s STDOUT:%s STDERR:%s", retcode, stdout.strip(), stderr.strip())
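
A minimal sketch of the failure mode in the traceback above, under stated assumptions: `Interchange`, `FailingProvider`, `scale_out_buggy`, `scale_out_fixed` and `outcome` are simplified, hypothetical stand-ins rather than the real funcx_endpoint code, and `ScalingFailed` is a local stand-in for the parsl exception of the same name. It shows how reading a missing attribute while building the exception replaces the intended error with an AttributeError.

```python
class ScalingFailed(Exception):
    """Local stand-in for parsl's ScalingFailed; simplified for this sketch."""

    def __init__(self, provider_label, reason):
        self.provider_label = provider_label
        self.reason = reason
        super().__init__(f"Provider {provider_label} failed to scale: {reason}")


class FailingProvider:
    """Hypothetical provider whose submit() always fails by returning None."""

    label = "slurm"

    def submit(self, command, tasks_per_node):
        return None


class Interchange:
    """Hypothetical, heavily simplified interchange; note there is no self.config."""

    def __init__(self, provider):
        self.provider = provider

    def scale_out_buggy(self):
        if self.provider.submit("launch_cmd", 1) is None:
            # Bug: self.config does not exist, so evaluating this argument raises
            # AttributeError before ScalingFailed can be constructed.
            raise ScalingFailed(self.config.provider.label, "submit failed")

    def scale_out_fixed(self):
        if self.provider.submit("launch_cmd", 1) is None:
            # The provider is directly available as an attribute.
            raise ScalingFailed(self.provider.label, "submit failed")


def outcome(method):
    """Return the name of the exception type raised by method()."""
    try:
        method()
    except Exception as exc:
        return type(exc).__name__
    return "no error"


ix = Interchange(FailingProvider())
buggy_result = outcome(ix.scale_out_buggy)
fixed_result = outcome(ix.scale_out_fixed)
print(buggy_result, fixed_result)
```

In this sketch the buggy path surfaces an AttributeError and the fixed path surfaces ScalingFailed carrying the provider label, which is the behaviour the report asks for.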

To Reproduce
Get an endpoint to try to scale out with a broken provider or provider configuration.

Expected behavior
The errors coming from parsl.providers should lead the user towards fixing the problem (in the example user's case, a quota exhaustion reported by sbatch) rather than being hidden

Environment
slurm
other component versions unknown

@benclifford benclifford added the bug Something isn't working label Jan 27, 2022
@benclifford
Contributor Author

I've recreated this in my dev environment by replacing the submit call for my relevant local provider (kube.py) with return None, which emulates the slurm failure for the purposes of this buggy exception report - point 1 in the issue.
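
The same kind of emulation can probably be done without editing the provider source. The following is only a sketch, assuming a provider object with a `submit` method: `LocalProvider` and `scale_out` are hypothetical stand-ins, not the real kube.py provider or the interchange code, and `unittest.mock.patch.object` is used to make every submit return None for the duration of a `with` block.

```python
from unittest import mock


class LocalProvider:
    """Hypothetical stand-in for a real provider such as the one in kube.py."""

    label = "kube"

    def submit(self, command, tasks_per_node):
        return "job-1"  # a working provider returns a job id


def scale_out(provider):
    """Hypothetical scale-out step: treat a None job id as a failure."""
    job_id = provider.submit("launch_cmd", 1)
    if job_id is None:
        raise RuntimeError(f"provider {provider.label} failed to scale out")
    return job_id


provider = LocalProvider()
healthy_job = scale_out(provider)

# Temporarily make every submit return None, without touching the provider source.
with mock.patch.object(provider, "submit", return_value=None):
    try:
        scale_out(provider)
        failure = None
    except RuntimeError as exc:
        failure = str(exc)

print(healthy_job, failure)
```

Once the `with` block exits, the original `submit` is restored, so the broken behaviour stays confined to the test.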

For point 2, I have discussed internally with @sirosen about logging parsl (and more) error messages to the endpoint logs.
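
One possible way to surface those messages, sketched with the standard `logging` module only: attach the endpoint's handler to the `parsl` parent logger so that records from child loggers such as parsl.providers.slurm propagate to it. This is an assumption about a workable approach, not a description of what the endpoint actually does; `endpoint_log` is a hypothetical in-memory stand-in for the endpoint log file, and the retcode and stderr values are made up.

```python
import io
import logging

# Stand-in for the endpoint's log file; a real endpoint would use its own handler.
endpoint_log = io.StringIO()
handler = logging.StreamHandler(endpoint_log)
handler.setFormatter(logging.Formatter("%(name)s [%(levelname)s] %(message)s"))

# Attaching to the "parsl" parent logger also captures child loggers such as
# parsl.providers.slurm, because records propagate up the logger hierarchy.
parsl_logger = logging.getLogger("parsl")
parsl_logger.addHandler(handler)
parsl_logger.setLevel(logging.ERROR)

# Emulate the provider's log call quoted in the issue, with made-up values.
logger = logging.getLogger("parsl.providers.slurm")
logger.error("Retcode:%s STDOUT:%s STDERR:%s", 1, "", "sbatch: error: quota exceeded")

captured = endpoint_log.getvalue()
print(captured)
```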

benclifford added a commit that referenced this issue Feb 2, 2022
This tries to find the provider label inside self.config.provider,
which does not exist. In this interchange, the provider is
directly available as an attribute.

Tested by: modify my local kube provider to return None on
all submits, see that the issue #679 stack trace appears.
Make this change in this commit, and see that a ScalingFailed
correctly appears.

This addresses the first bullet point in issue #679.
benclifford added a commit that referenced this issue Feb 15, 2022
benclifford added a commit that referenced this issue Mar 8, 2022
This tries to find the provider label inside self.config.provider,
which does not exist. In this interchange, the provider is
directly available as an attribute.

Tested by: modify my local kube provider to return None on
all submits, see that the issue #679 stack trace appears.
Make this change in this commit, and see that a ScalingFailed
correctly appears.

Fixes issue #679
@benclifford
Contributor Author

The 2nd part of this might have been fixed by @rjmello 7b22192

@rjmello
Contributor

rjmello commented Apr 9, 2024

> The 2nd part of this might have been fixed by @rjmello 7b22192

Correct; I'd expect the logs to show now.
