parsl provider error messages are lost #679

benclifford · 2022-01-27T22:21:28Z

Describe the bug
This is based on a report in the #help slack channel

When the slurm provider fails to scale out, the code that is supposed to report that failure to the user itself fails, in potentially several ways:

1. There seems to be a static type error in the exception handling code for a failed scale_out: when constructing a more specific exception, the code reads self.config, and the interchange indeed has no config attribute.

Submission of command to scale_out failed
2022-01-27 14:33:58.605 funcx_endpoint.strategies.simple:43 [ERROR] Caught error in strategize : 'Interchange' object has no attribute 'config'
Traceback (most recent call last):
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/strategies/simple.py", line 41, in strategize
    self._strategize(*args, **kwargs)
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/strategies/simple.py", line 143, in _strategize
    self.interchange.scale_out(excess_blocks)
  File "/work2/04372/ejonas/anaconda/envs/s2s/lib/python3.9/site-packages/funcx_endpoint/executors/high_throughput/interchange.py", line 1151, in scale_out
    self.config.provider.label,
AttributeError: 'Interchange' object has no attribute 'config'

2. The parsl layer logs an error to e.g. parsl.providers.slurm, but the endpoint admin was unable to find the relevant log message. Maybe it should appear around the same place as the above report? The relevant parsl log line is:

            logger.error("Retcode:%s STDOUT:%s STDERR:%s", retcode, stdout.strip(), stderr.strip())
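One way the provider's message could end up next to the endpoint's own log lines is to attach the endpoint's log handler to the top-level "parsl" logger. The following is only a sketch using the standard library logging module: the in-memory stream stands in for the endpoint log, the error text is made up, and none of this is taken from the actual endpoint code.

```python
import io
import logging

# Stand-in for the endpoint's own log destination (hypothetical; a real
# endpoint would more likely use a file handler for its endpoint log).
endpoint_log = io.StringIO()
handler = logging.StreamHandler(endpoint_log)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s [%(levelname)s] %(message)s")
)

# Attaching the handler to the top-level "parsl" logger also captures
# records from child loggers such as "parsl.providers.slurm", as long as
# propagation has not been disabled on them.
parsl_logger = logging.getLogger("parsl")
parsl_logger.setLevel(logging.INFO)
parsl_logger.addHandler(handler)

# Emulate the provider's log call quoted above, with made-up values.
provider_logger = logging.getLogger("parsl.providers.slurm")
retcode, stdout, stderr = 1, "", "sbatch: error: quota exceeded (made-up message)\n"
provider_logger.error(
    "Retcode:%s STDOUT:%s STDERR:%s", retcode, stdout.strip(), stderr.strip()
)

print(endpoint_log.getvalue())
```

With this in place the sbatch output is written to the same destination as the endpoint's own records, tagged with the parsl.providers.slurm logger name.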

To Reproduce
Get endpoint to try to scale out with a broken provider/provider configuration

Expected behavior
The errors coming from parsl.providers should lead the user towards fixing the problem (in the example user's case, a quota exhaustion reported by sbatch) rather than being hidden
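As a rough illustration of that expectation, a submit wrapper could carry the scheduler's own stderr inside the exception it raises, so the message reaches whoever handles the failure. This is a hypothetical sketch, not how parsl or the endpoint does it: SubmitFailed and run_submit are invented names, and a small Python one-liner stands in for a failing sbatch.

```python
import subprocess
import sys


class SubmitFailed(Exception):
    """Hypothetical exception carrying the scheduler's own output."""

    def __init__(self, retcode, stdout, stderr):
        super().__init__(f"submit failed with retcode {retcode}: {stderr or stdout}")
        self.retcode = retcode
        self.stdout = stdout
        self.stderr = stderr


def run_submit(cmd):
    """Run a submit command; raise with its stderr if it exits non-zero."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise SubmitFailed(proc.returncode, proc.stdout.strip(), proc.stderr.strip())
    return proc.stdout.strip()


# Emulate a failing sbatch with a tiny Python script; the message is made up.
fake_sbatch = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('quota exceeded (made-up)\\n'); sys.exit(1)",
]

try:
    run_submit(fake_sbatch)
except SubmitFailed as exc:
    message = str(exc)
    print(message)
```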

Environment
slurm
other component versions unknown


benclifford · 2022-02-02T17:39:54Z

I've recreated this in my dev environment by replacing the submit call for my relevant local provider (kube.py) with return None, which emulates the slurm failure for the purposes of this buggy exception report - point 1 in the issue.

For point 2, I have discussed internally with @sirosen about logging parsl (and more) error messages to the endpoint logs.

This tries to find the provider label inside self.config.provider, which does not exist. In this interchange, the provider is directly available as an attribute. Tested by: modify my local kube provider to return None on all submits, see that the issue #679 stack trace appears. Make this change in this commit, and see that a ScalingFailed correctly appears. This addresses the first bullet point in issue #679.

This tries to find the provider label inside self.config.provider, which does not exist. In this interchange, the provider is directly available as an attribute. Tested by: modify my local kube provider to return None on all submits, see that the issue #679 stack trace appears. Make this change in this commit, and see that a ScalingFailed correctly appears. Fixes issue #679
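The bug and the fix described in these commit messages can be sketched in isolation. The classes below are toy stand-ins, not the real funcx_endpoint or parsl code: the ScalingFailed signature, the BrokenProvider class and the two method names are assumptions made for illustration only.

```python
class ScalingFailed(Exception):
    """Hypothetical stand-in for the real ScalingFailed exception."""

    def __init__(self, provider_label, reason):
        super().__init__(f"Provider {provider_label} failed to scale: {reason}")
        self.provider_label = provider_label
        self.reason = reason


class BrokenProvider:
    """Emulates a provider whose submit always fails by returning None."""

    label = "broken-kube"

    def submit(self, *args, **kwargs):
        return None


class Interchange:
    """Toy interchange: the provider is a direct attribute; there is no config."""

    def __init__(self, provider):
        self.provider = provider

    def scale_out_buggy(self, blocks=1):
        for _ in range(blocks):
            if self.provider.submit() is None:
                # Bug: self.config does not exist, so this raises
                # AttributeError before ScalingFailed is constructed.
                raise ScalingFailed(self.config.provider.label, "submit returned None")

    def scale_out_fixed(self, blocks=1):
        for _ in range(blocks):
            if self.provider.submit() is None:
                # Fix: read the label from the provider attribute directly.
                raise ScalingFailed(self.provider.label, "submit returned None")


def outcome(method):
    """Return the name of the exception raised by method, or 'ok'."""
    try:
        method()
    except Exception as exc:
        return type(exc).__name__
    return "ok"


ix = Interchange(BrokenProvider())
print(outcome(ix.scale_out_buggy))  # AttributeError
print(outcome(ix.scale_out_fixed))  # ScalingFailed
```

The buggy path hides the scaling failure behind an unrelated AttributeError, which is what the stack trace in the issue shows; the fixed path raises the intended exception with the provider label attached.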

benclifford · 2024-04-06T12:38:16Z

The 2nd part of this might have been fixed by @rjmello 7b22192

rjmello · 2024-04-09T14:16:48Z

> The 2nd part of this might have been fixed by @rjmello 7b22192

Correct; I'd expect the logs to show now.

benclifford added the bug label Jan 27, 2022

benclifford mentioned this issue Feb 2, 2022

Fix broken ScalingFailed exception construction #690

Merged
