Fall back if scheduler access fails #214

Closed
bdice opened this issue Jan 19, 2020 · 9 comments

Comments

@bdice
Member

bdice commented Jan 19, 2020

Description

I'm running on the Bridges cluster in a singularity container. I tried to check the project status but it wouldn't print any output because the scheduler is not accessible by a singularity container on a compute node. It works fine if I include --ignore-errors, but I'm not sure if this should be considered a legitimate error or not. I would expect the errors that are ignored with --ignore-errors to occur during condition evaluation.

One possible solution is to fall back to StandardEnvironment (no scheduler) and check the status without interfacing with the scheduler.

To reproduce

python project.py status

Error output

Singularity> python project.py status
Using environment configuration: BridgesEnvironment
WARNING:flow.project:Error occurred while querying scheduler: 'SLURM not available.'.
ERROR:flow.project:Error during status update: SLURM not available.
Use '--ignore-errors' to complete the update anyways.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flow/scheduling/slurm.py", line 42, in _fetch
    result = subprocess.check_output(cmd).decode('utf-8', errors='backslashreplace')
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'squeue': 'squeue'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 2658, in _main_status
    self.print_status(jobs=jobs, **args)
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1399, in print_status
    tmp = self._fetch_status(jobs, err, ignore_errors, no_parallelize)
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1041, in _fetch_status
    self._fetch_scheduler_status(jobs, err, ignore_errors)
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1021, in _fetch_scheduler_status
    scheduler_info = {sjob.name(): sjob.status() for sjob in self.scheduler_jobs(scheduler)}
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 1021, in <dictcomp>
    scheduler_info = {sjob.name(): sjob.status() for sjob in self.scheduler_jobs(scheduler)}
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 954, in scheduler_jobs
    for sjob in self._expand_bundled_jobs(scheduler.jobs()):
  File "/usr/local/lib/python3.6/dist-packages/flow/project.py", line 930, in _expand_bundled_jobs
    for job in scheduler_jobs:
  File "/usr/local/lib/python3.6/dist-packages/flow/scheduling/slurm.py", line 84, in jobs
    for job in _fetch(user=self.user):
  File "/usr/local/lib/python3.6/dist-packages/flow/scheduling/slurm.py", line 49, in _fetch
    raise RuntimeError("SLURM not available.")
RuntimeError: SLURM not available.
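
The chained traceback above can be reproduced outside of signac-flow. The following is a minimal sketch, not the library's actual code: `fetch_queue` is a hypothetical stand-in for the `_fetch` helper named in the traceback, and it assumes that the helper catches `FileNotFoundError` and raises `RuntimeError` in its place, which is what the "During handling of the above exception" message suggests.

```python
import subprocess


def fetch_queue(cmd):
    """Run a scheduler query command and return its decoded output.

    Hypothetical stand-in for the `_fetch` helper seen in the traceback.
    """
    try:
        return subprocess.check_output(cmd).decode("utf-8", errors="backslashreplace")
    except FileNotFoundError:
        # The executable (for example squeue) is not on PATH, as inside the container.
        raise RuntimeError("SLURM not available.")


if __name__ == "__main__":
    try:
        # A deliberately nonexistent command name, used only for this sketch.
        fetch_queue(["no-such-scheduler-command-for-this-sketch"])
    except RuntimeError as error:
        # __context__ holds the exception that was being handled when this one was raised.
        print(error, "| raised while handling:", type(error.__context__).__name__)
```

Because the `RuntimeError` is raised inside the `except` block, Python records the original `FileNotFoundError` as its context and prints both tracebacks.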

System configuration

Please complete the following information:

  • Operating System [e.g. macOS]: Linux-3.10.0-957.27.2.el7.x86_64-x86_64-with-Ubuntu-18.04-bionic
  • Version of Python [e.g. 3.7]: 3.6.8
  • Version of signac [e.g. 1.0]: 1.3.0
  • Version of signac-flow: 0.9.0
@bdice bdice added the enhancement New feature or request label Jan 19, 2020
@csadorf csadorf added the good first issue Good for newcomers label Feb 17, 2020
@kidrahahjo
Member

@csadorf
I'd like to work on this. Can I?

@csadorf
Contributor

csadorf commented Feb 18, 2020

@kidrahahjo Yes, that would be very much appreciated. We will need to have at least one reviewer with access to the Bridges system, who can verify that the proposed solution works.

@kidrahahjo
Member

@bdice @csadorf
How am I supposed to test the issue?
Executing python project.py status raises a ModuleNotFoundError:

Traceback (most recent call last):
  File "flow/project.py", line 48, in <module>
    from .environment import get_environment
ModuleNotFoundError: No module named '__main__.environment'; '__main__' is not a package

@csadorf
Contributor

csadorf commented Feb 19, 2020

@kidrahahjo The project.py here refers to a flow-project project.py file in the sense as described here, not the flow/project.py module.
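
The import error reported earlier is typical of running a module that lives inside a package directly as a script: its relative imports have no parent package to resolve against. The sketch below illustrates this with a throwaway package; the names `pkg`, `environment.py`, and `run_both_ways` are made up for the illustration and are not part of signac-flow. The exact error message varies between Python versions, so the sketch only looks at the exit status and output.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Build a throwaway package and run one of its modules two ways."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "environment.py").write_text(
            "def get_environment():\n    return 'standard'\n"
        )
        (pkg / "project.py").write_text(
            "from .environment import get_environment\n"
            "print(get_environment())\n"
        )
        # Run as a plain script: the relative import cannot be resolved.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "project.py")],
            capture_output=True, text=True, cwd=tmp,
        )
        # Run as a module of the package: the relative import works.
        as_module = subprocess.run(
            [sys.executable, "-m", "pkg.project"],
            capture_output=True, text=True, cwd=tmp,
        )
    return as_script, as_module


if __name__ == "__main__":
    script_result, module_result = run_both_ways()
    print("script exit code:", script_result.returncode)
    print("module exit code:", module_result.returncode)
```

A user-written project.py sits outside the flow package and imports it with absolute imports, so it does not hit this problem.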

@kidrahahjo
Member

@kidrahahjo The project.py here refers to a flow-project project.py file in the sense as described here, not the flow/project.py module.

Thought so, I've replicated this example locally too. I will look into it.

@kidrahahjo
Member

@bdice I can't access StandardEnvironment directly, but I can remove the scheduler access from the _fetch_scheduler_status method and get the status directly from the JobOperation.get_status method.

@kidrahahjo
Member

kidrahahjo commented Feb 23, 2020

Line 1077 --> flow/project.py
status[op.get_id()] = int(scheduler_info.get(op.get_id(), JobStatus.unknown))

While debugging, I added the statement print(status[op.get_id()], op.get_status()) and ran pytest on the test_project.py module. The captured stdout (multiple lines) was:

1 1

The results are not surprising to me, but I wonder why the tests fail when I use status[op.get_id()] = int(op.get_status()) instead of status[op.get_id()] = int(scheduler_info.get(op.get_id(), JobStatus.unknown)).

@bdice
Member Author

bdice commented Feb 25, 2020

@kidrahahjo The code contents have all changed between your previous comment and now, because of #114 (a major change that was just merged). I am not sure that it is useful to answer your previous question at this point because of the scale of changes.

I reviewed the error handling in this function and in the scheduler code, and I am not sure whether this is an issue that should be solved.

signac-flow/flow/project.py

Lines 1560 to 1587 in 1a01303

def _fetch_scheduler_status(self, jobs=None, file=None, ignore_errors=False):
    "Update the status docs."
    if file is None:
        file = sys.stderr
    if jobs is None:
        jobs = list(self)
    try:
        scheduler = self._environment.get_scheduler()
        self.document.setdefault('_status', dict())
        scheduler_info = {sjob.name(): sjob.status() for sjob in self.scheduler_jobs(scheduler)}
        status = dict()
        print("Query scheduler...", file=file)
        for job in tqdm(jobs,
                        desc="Fetching operation status",
                        total=len(jobs), file=file):
            for group in self._groups.values():
                _id = group._generate_id(job)
                status[_id] = int(scheduler_info.get(_id, JobStatus.unknown))
        self.document._status.update(status)
    except NoSchedulerError:
        logger.debug("No scheduler available.")
    except RuntimeError as error:
        logger.warning("Error occurred while querying scheduler: '{}'.".format(error))
        if not ignore_errors:
            raise
    else:
        logger.info("Updated job status cache.")

My main concern is that if we try to fall back to a "StandardEnvironment" behavior and allow those kinds of errors to pass with just a warning message, it may not be obvious to a user that their status checks, submission eligibility, etc. could be invalid because the scheduler was not queried correctly. That is, requiring users to pass --ignore-errors is a good-enough solution in this case. I don't think I had fully understood the consequences of what changing this behavior would do when I created the issue. I discussed with @vyasr and he agreed that this is probably unsafe to change in our submission model.
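
The trade-off described above can be seen in a stripped-down sketch of the same try/except/else pattern. This is an illustration only, not signac-flow code: `fetch_scheduler_status`, the `query` callable, and the local `NoSchedulerError` class are hypothetical stand-ins. A missing scheduler is treated as a normal condition, while a scheduler that exists but cannot be queried fails loudly unless the caller opts out explicitly.

```python
import logging

logger = logging.getLogger("status_sketch")


class NoSchedulerError(Exception):
    """Hypothetical stand-in: the environment defines no scheduler at all."""


def fetch_scheduler_status(query, ignore_errors=False):
    """Return scheduler info from `query`, or None if it could not be obtained."""
    try:
        info = query()
    except NoSchedulerError:
        # Expected on machines without a scheduler: nothing to report.
        logger.debug("No scheduler available.")
        return None
    except RuntimeError as error:
        # A scheduler is expected but the query failed: warn, and only
        # continue if the caller explicitly asked to ignore errors.
        logger.warning("Error occurred while querying scheduler: '%s'.", error)
        if not ignore_errors:
            raise
        return None
    else:
        logger.info("Updated job status cache.")
        return info


def broken_query():
    """Simulate a scheduler that should exist but cannot be reached."""
    raise RuntimeError("SLURM not available.")


if __name__ == "__main__":
    # Opting in to ignore_errors yields no scheduler info instead of an exception.
    print(fetch_scheduler_status(broken_query, ignore_errors=True))
```

Making the caller pass `ignore_errors=True` keeps the decision to proceed without scheduler information visible, which is the role `--ignore-errors` plays on the command line.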

@kidrahahjo Thanks for looking at this issue. I'm going to close it for now since I don't think there is any action to be taken.

@bdice bdice closed this as completed Feb 25, 2020
@bdice bdice added wontfix This will not be worked on and removed enhancement New feature or request good first issue Good for newcomers labels Feb 25, 2020
@kidrahahjo
Member

@bdice Okay. One more question: might there be an alternative solution (perhaps in the future) that I could work on?
Thank you for taking the time to explain this to me.
