Fall back if scheduler access fails #214

bdice · 2020-01-19T02:00:12Z

Description

I'm running on the Bridges cluster in a singularity container. I tried to check the project status but it wouldn't print any output because the scheduler is not accessible by a singularity container on a compute node. It works fine if I include --ignore-errors, but I'm not sure if this should be considered a legitimate error or not. I would expect the errors that are ignored with --ignore-errors to occur during condition evaluation.

One possible solution is to fall back to StandardEnvironment (no scheduler) and check the status without interfacing with the scheduler.

To reproduce

python project.py status

Error output

Singularity> python project.py status
Using environment configuration: BridgesEnvironment
WARNING:flow.project:Error occurred while querying scheduler: 'SLURM not available.'.
ERROR:flow.project:Error during status update: SLURM not available.
Use '--ignore-errors' to complete the update anyways.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flow/scheduling/slurm.py", line 42, in _fetch
    result = subprocess.check_output(cmd).decode('utf-8', errors='backslashreplace')
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'squeue': 'squeue'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 2658, in _main_status
    self.print_status(jobs=jobs, **args)
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1399, in print_status
    tmp = self._fetch_status(jobs, err, ignore_errors, no_parallelize)
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1041, in _fetch_status
    self._fetch_scheduler_status(jobs, err, ignore_errors)
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1021, in _fetch_scheduler_status
    scheduler_info = {sjob.name(): sjob.status() for sjob in self.scheduler_jobs(scheduler)}
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1021, in <dictcomp>
    scheduler_info = {sjob.name(): sjob.status() for sjob in self.scheduler_jobs(scheduler)}
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 954, in scheduler_jobs
    for sjob in self._expand_bundled_jobs(scheduler.jobs()):
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 930, in _expand_bundled_jobs
    for job in scheduler_jobs:
  File "/usr/local/lib/python3.6/dist-packages/flow/scheduling/slurm.py", line 84, in jobs
    for job in _fetch(user=self.user):
  File "/usr/local/lib/python3.6/dist-packages/flow/scheduling/slurm.py", line 49, in _fetch
    raise RuntimeError("SLURM not available.")
RuntimeError: SLURM not available.
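
The two chained tracebacks above can be reproduced in isolation. The following is a minimal sketch, not the actual flow/scheduling/slurm.py code: the function name `fetch_queue` is hypothetical, and only the `subprocess.check_output(...)` call and the error message are taken from the traceback. It shows how a missing executable surfaces as FileNotFoundError and is then re-raised as a RuntimeError, which is what produces the "During handling of the above exception, another exception occurred" output.

```python
import subprocess


def fetch_queue(cmd):
    """Run a scheduler query command and return its decoded output.

    Hypothetical helper for illustration only. If the executable named in
    ``cmd`` is not on the PATH (as with 'squeue' inside the container),
    subprocess raises FileNotFoundError, which is converted here into a
    RuntimeError with the message seen in the issue.
    """
    try:
        output = subprocess.check_output(cmd)
    except FileNotFoundError:
        # Raising inside the except block keeps the original error attached
        # as __context__, so both tracebacks are printed.
        raise RuntimeError("SLURM not available.")
    return output.decode("utf-8", errors="backslashreplace")
```

Calling `fetch_queue(["squeue"])` on a machine without SLURM would be expected to raise the RuntimeError in the same way.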

System configuration

Please complete the following information:

Operating System [e.g. macOS]: Linux-3.10.0-957.27.2.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
Version of Python [e.g. 3.7]: 3.6.8
Version of signac [e.g. 1.0]: 1.3.0
Version of signac-flow: 0.9.0


kidrahahjo · 2020-02-18T08:26:16Z

@csadorf
I'd like to work on this. May I?

csadorf · 2020-02-18T08:32:12Z

@kidrahahjo Yes, that would be very much appreciated. We will need to have at least one reviewer with access to the Bridges system, who can verify that the proposed solution works.

kidrahahjo · 2020-02-19T06:42:27Z

@bdice @csadorf
How am I supposed to test the issue?
Executing python project.py status raises a ModuleNotFoundError:

Traceback (most recent call last):
  File "flow/project.py", line 48, in <module>
    from .environment import get_environment
ModuleNotFoundError: No module named '__main__.environment'; '__main__' is not a package

csadorf · 2020-02-19T08:01:26Z

@kidrahahjo The project.py here refers to a flow-project project.py file in the sense as described here, not the flow/project.py module.

kidrahahjo · 2020-02-19T08:16:22Z

> @kidrahahjo The project.py here refers to a flow-project project.py file in the sense as described here, not the flow/project.py module.

Thought so, I've replicated this example locally too. I will look into it.

kidrahahjo · 2020-02-23T16:21:11Z

@bdice I can't access StandardEnvironment directly, but I could remove the scheduler access from the _fetch_scheduler_status method and get the status directly from the JobOperation.get_status method.

kidrahahjo · 2020-02-23T16:44:36Z

Line 1077 --> flow/project.py
status[op.get_id()] = int(scheduler_info.get(op.get_id(), JobStatus.unknown))

While debugging, I added the statement print(status[op.get_id()], op.get_status()) and ran pytest on the test_project.py module. The captured stdout (repeated over multiple lines) was:

1 1

The results are not surprising to me, but I wonder why the tests fail when I replace status[op.get_id()] = int(scheduler_info.get(op.get_id(), JobStatus.unknown)) with status[op.get_id()] = int(op.get_status()).

bdice · 2020-02-25T15:41:33Z

@kidrahahjo The code contents have all changed between your previous comment and now, because of #114 (a major change that was just merged). I am not sure that it is useful to answer your previous question at this point because of the scale of changes.

I reviewed the error handling in this function and in the scheduler code, and I am not sure whether this is an issue that should be solved.

signac-flow/flow/project.py

Lines 1560 to 1587 in 1a01303

    
    def _fetch_scheduler_status(self, jobs=None, file=None, ignore_errors=False):
        "Update the status docs."
        if file is None:
            file = sys.stderr
        if jobs is None:
            jobs = list(self)
        try:
            scheduler = self._environment.get_scheduler()
            self.document.setdefault('_status', dict())
            scheduler_info = {sjob.name(): sjob.status() for sjob in self.scheduler_jobs(scheduler)}
            status = dict()
            print("Query scheduler...", file=file)
            for job in tqdm(jobs,
                            desc="Fetching operation status",
                            total=len(jobs), file=file):
                for group in self._groups.values():
                    _id = group._generate_id(job)
                    status[_id] = int(scheduler_info.get(_id, JobStatus.unknown))
            self.document._status.update(status)
        except NoSchedulerError:
            logger.debug("No scheduler available.")
        except RuntimeError as error:
            logger.warning("Error occurred while querying scheduler: '{}'.".format(error))
            if not ignore_errors:
                raise
        else:
            logger.info("Updated job status cache.")

My main concern is that if we try to fall back to a "StandardEnvironment" behavior and allow those kinds of errors to pass with just a warning message, it may not be obvious to a user that their status checks, submission eligibility, etc. could be invalid because the scheduler was not queried correctly. That is, requiring users to pass --ignore-errors is a good-enough solution in this case. I don't think I had fully understood the consequences of what changing this behavior would do when I created the issue. I discussed with @vyasr and he agreed that this is probably unsafe to change in our submission model.
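
The trade-off described above can be illustrated with a small standalone sketch. It is not signac-flow code: `update_status_cache`, the `query_scheduler` callable, and the local `NoSchedulerError` class are hypothetical stand-ins that only mirror the shape of the excerpt. The point is that a scheduler failure is always logged, but it is swallowed only when the caller explicitly opts in, so a stale cache is never presented as fresh by default.

```python
import logging

logger = logging.getLogger("status_sketch")


class NoSchedulerError(Exception):
    """Stand-in for 'this environment has no scheduler at all' (assumption)."""


def update_status_cache(query_scheduler, cache, ignore_errors=False):
    """Refresh ``cache`` from ``query_scheduler()``; return True if updated.

    ``query_scheduler`` is any zero-argument callable returning a dict that
    maps operation ids to status codes.
    """
    try:
        info = query_scheduler()
    except NoSchedulerError:
        # No scheduler is a legitimate configuration, not an error.
        logger.debug("No scheduler available.")
        return False
    except RuntimeError as error:
        logger.warning("Error occurred while querying scheduler: '%s'.", error)
        if not ignore_errors:
            # Default: fail loudly so the user knows the status may be invalid.
            raise
        return False
    cache.update(info)
    logger.info("Updated job status cache.")
    return True
```

With this shape, passing `ignore_errors=True` corresponds to the `--ignore-errors` flag: the status can still be computed, but only after the user has acknowledged that the scheduler was not queried.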

@kidrahahjo Thanks for looking at this issue. I'm going to close it for now since I don't think there is any action to be taken.

kidrahahjo · 2020-02-25T16:01:58Z

@bdice Okay. One more question: might there be an alternative solution to this in the future that I could work on?
Thank you for taking the time to explain it to me.

bdice added the enhancement New feature or request label Jan 19, 2020

csadorf added the good first issue Good for newcomers label Feb 17, 2020

csadorf assigned kidrahahjo Feb 18, 2020

bdice closed this as completed Feb 25, 2020

bdice added wontfix This will not be worked on and removed enhancement New feature or request good first issue Good for newcomers labels Feb 25, 2020
