
Scheduler test failing #1888

Open
ljstephenson opened this issue May 6, 2022 · 12 comments · May be fixed by #1976

@ljstephenson
Contributor

Bug Report

One-Line Summary

python -m unittest -v artiq.test.test_scheduler.SchedulerCase.test_pending_priority fails.

Issue Details

Disabling the assertion and printing the order in which events come in shows that they arrive in only a slightly different order; I'm not sure whether behaviour has changed recently such that this is expected. The order I see is:

{'action': 'setitem', 'path': [], 'key': 0, 'value': {'pipeline': 'main', 'expid': {'log_level': 50, 'file': '\\\\jake.nist.gov\\688\\users\\ljs3\\scratch\\artiq\\artiq\\test\\test_scheduler.py', 'class_name': 'CheckPauseBackgroundExperiment', 'arguments': {}}, 'priority': 1, 'due_date': None, 'flush': False, 'status': 'pending', 'repo_msg': None}}
{'action': 'setitem', 'path': [], 'key': 1, 'value': {'pipeline': 'main', 'expid': {'log_level': 30, 'file': '\\\\jake.nist.gov\\688\\users\\ljs3\\scratch\\artiq\\artiq\\test\\test_scheduler.py', 'class_name': 'EmptyExperiment', 'arguments': {}}, 'priority': 3, 'due_date': 1651976855.5527163, 'flush': False, 'status': 'pending', 'repo_msg': None}}
{'action': 'setitem', 'path': [], 'key': 2, 'value': {'pipeline': 'main', 'expid': {'log_level': 30, 'file': '\\\\jake.nist.gov\\688\\users\\ljs3\\scratch\\artiq\\artiq\\test\\test_scheduler.py', 'class_name': 'EmptyExperiment', 'arguments': {}}, 'priority': 2, 'due_date': 1651876856.5527163, 'flush': False, 'status': 'pending', 'repo_msg': None}}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'preparing'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'prepare_done'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'preparing'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'running'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'prepare_done'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'paused'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'running'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'run_done'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'running'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'analyzing'}
{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'deleting'}
{'action': 'delitem', 'path': [], 'key': 2}

These two items are reversed compared to the output expected by the test case:

{'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'preparing'}
{'action': 'setitem', 'path': [0], 'key': 'status', 'value': 'running'}

Steps to Reproduce

  1. Run python -m unittest -v artiq.test.test_scheduler.SchedulerCase.test_pending_priority

Expected Behavior

Test passes.

Actual (undesired) Behavior

Test fails.

>python -m unittest -v artiq.test.test_scheduler.SchedulerCase.test_pending_priority
test_pending_priority (artiq.test.test_scheduler.SchedulerCase)
Check due dates take precedence over priorities when waiting to ... C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\asyncio_tools.py:23: DeprecationWarning: There is no current event loop
  self.task = asyncio.ensure_future(log_exceptions(self._do))
C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\asyncio_tools.py:101: DeprecationWarning: There is no current event loop
  self._loop = asyncio.get_event_loop()
Unhandled exception in TaskObject task body
Traceback (most recent call last):
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\asyncio_tools.py", line 16, in log_exceptions
    return await awaitable()
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\artiq\master\scheduler.py", line 215, in _do
    run.status = RunStatus.preparing
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\artiq\master\scheduler.py", line 87, in status
    self._notifier[self.rid]["status"] = self._status.name
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\sipyco\sync_struct.py", line 242, in __setitem__
    self.root.publish({"action": ModAction.setitem.value,
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\site-packages\artiq\test\test_scheduler.py", line 266, in notify
    self.assertEqual(mod, expect[expect_idx])
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\unittest\case.py", line 845, in assertEqual
    assertion_func(first, second, msg=msg)
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\unittest\case.py", line 1144, in assertDictEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\<user>\Anaconda3\envs\artiq-test\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: {'action': 'setitem', 'path': [2], 'key': 'status', 'value': 'preparing'} != {'path': [0], 'action': 'setitem', 'value': 'running', 'key': 'status'}
- {'action': 'setitem', 'key': 'status', 'path': [2], 'value': 'preparing'}
?                                                 ^             - ^^^^

+ {'action': 'setitem', 'key': 'status', 'path': [0], 'value': 'running'}
?

Your System (omit irrelevant parts)

  • Operating System: W10, conda
  • ARTIQ version: (with recent versions of ARTIQ, run artiq_client --version): ARTIQ v7.8044.c440f9f.beta
  • Version of the gateware and runtime loaded in the core device: (in the output of artiq_coremgmt -D .... log) (not relevant)
  • If using Conda, output of conda list (please submit as a file attachment, as this tends to be long)
    conda-list.txt
  • Hardware involved: N/A
@sbourdeauducq
Member

Probably Python 3.10 related, like your other issue. We didn't test anything with 3.10.

@ljstephenson
Contributor Author

Just checked - the same failure occurs in environments with Python 3.8 and 3.9.
conda-list-39.txt
conda-list-38.txt

@sbourdeauducq
Member

Strange - it's passing on the server e.g. https://nixbld.m-labs.hk/build/120442/log
This code has not changed recently.

@dnadlinger
Collaborator

Could this just be a timing/event ordering issue? I haven't had a look at this particular test, but some of them do rely on implicitly defining order via sleeps.

@ljstephenson
Contributor Author

Could this just be a timing/event ordering issue?

Certainly seems to be, but I don't know how that order could be defined more rigidly. The "issue" seems to be that experiment 2 takes longer to prepare on my machine than on the test server, so experiment 0 reaches the run stage before 2 is done preparing; but the prepare stage is empty, so I don't see how it can be made faster.

Strange

Agreed. I repro'd this on a second W10 machine with python=3.9; the generated package list was identical except for zipp. I don't think this is high priority, but I do think it's bad juju for a test to pass on some machines and not others...

@dnadlinger
Collaborator

dnadlinger commented May 12, 2022

Certainly seems to be, but I don't know how that order could be defined more rigidly.

I haven't looked into what this particular test would require, but for testing Scheduler.idle (which we haven't gotten around to properly redesigning and upstreaming yet), I had introduced some asyncio events to ensure the requisite ordering:

    def _make_status_events(self, rid, previous_notify=None):
        status_events = {
            key: asyncio.Event()
            for key in ("prepare_done", "running", "paused", "run_done",
                        "deleting", "idle")
        }

        def notify(mod):
            if (mod["path"] == [rid] and mod["action"] == "setitem"
                    and mod["key"] == "status"):
                event = status_events.get(mod["value"], None)
                if event:
                    event.set()
            if previous_notify:
                previous_notify(mod)

        return status_events, notify

    def _make_scheduler_for_idle_test(self):
        class Flags:
            def __init__(self):
                self.be_idle = False
                self.take_long_in_idle = False
                self.termination_ok = False

        flags = Flags()

        def get_dataset(name):
            return getattr(flags, name)

        def check_termination(mod):
            self.assertEqual(
                mod, {
                    "action": "setitem",
                    "key": "termination_ok",
                    "value": (False, True),
                    "path": []
                })
            flags.termination_ok = True

        handlers = {
            "get_dataset": get_dataset,
            "update_dataset": check_termination
        }
        scheduler = Scheduler(_RIDCounter(0), handlers, None, None)

        status_events, notify = self._make_status_events(0)
        scheduler.notifier.publish = notify

        scheduler.start()

        expid_idle = _get_expid("IdleExperiment")
        # Suppress the SystemExit backtrace when worker process is killed.
        expid_idle["log_level"] = logging.CRITICAL
        scheduler.submit("main", expid_idle, 0, None, False)

        return scheduler, status_events, flags

    def test_close_with_idle_running(self):
        """Check scheduler exits with experiment still idle"""
        loop = self.loop
        scheduler, status_events, flags = self._make_scheduler_for_idle_test()
        loop.run_until_complete(status_events["running"].wait())
        flags.be_idle = True
        loop.run_until_complete(status_events["idle"].wait())
        loop.run_until_complete(scheduler.stop())

    def test_close_with_prepared_bg(self):
        """Check scheduler exits when there is still a prepare_done experiment"""
        loop = self.loop
        scheduler, status_events, _ = self._make_scheduler_for_idle_test()

        loop.run_until_complete(status_events["running"].wait())

        # Submit lower-priority experiment that is still waiting to be run when
        # the scheduler is terminated.
        expid = _get_expid("EmptyExperiment")
        scheduler.submit("main", expid, -1, None, False)

        loop.run_until_complete(scheduler.stop())

    def test_resume_idle(self):
        """Check scheduler resumes previously idle experiments"""
        loop = self.loop
        scheduler, status_events, flags = self._make_scheduler_for_idle_test()
        loop.run_until_complete(status_events["running"].wait())

        # Make sure we can un-idle by returning False from the idle callback.
        flags.be_idle = True
        loop.run_until_complete(status_events["idle"].wait())

        status_events["running"].clear()
        flags.be_idle = False
        loop.run_until_complete(status_events["running"].wait())

        # Make sure we can un-idle by requesting termination.
        flags.be_idle = True
        loop.run_until_complete(status_events["idle"].wait())

        self.assertFalse(flags.termination_ok)
        scheduler.request_termination(0)
        loop.run_until_complete(status_events["deleting"].wait())
        self.assertTrue(flags.termination_ok)

        loop.run_until_complete(scheduler.stop())

    def test_idle_bg(self):
        """Check scheduler runs lower-priority experiments while idle"""
        loop = self.loop
        scheduler, idle_status_events, flags = self._make_scheduler_for_idle_test()

        bg_status_events, notify = self._make_status_events(
            1, scheduler.notifier.publish)
        scheduler.notifier.publish = notify

        loop.run_until_complete(idle_status_events["running"].wait())

        # Submit experiment with lower priority.
        expid = _get_expid("BackgroundExperiment")
        scheduler.submit("main", expid, -1, None, False)

        check_pause_1 = lambda: loop.run_until_complete(scheduler.check_pause(1))

        loop.run_until_complete(bg_status_events["prepare_done"].wait())

        flags.be_idle = True
        loop.run_until_complete(bg_status_events["running"].wait())
        self.assertFalse(check_pause_1())

        idle_status_events["running"].clear()
        flags.be_idle = False
        loop.run_until_complete(idle_status_events["running"].wait())

        bg_status_events["running"].clear()
        flags.be_idle = True
        self.assertFalse(check_pause_1())
        loop.run_until_complete(bg_status_events["running"].wait())

        idle_status_events["running"].clear()
        flags.be_idle = False
        self.assertTrue(check_pause_1())
        loop.run_until_complete(idle_status_events["running"].wait())

        scheduler.request_termination(0)
        scheduler.request_termination(1)
        loop.run_until_complete(bg_status_events["deleting"].wait())

        loop.run_until_complete(scheduler.stop())

    def test_idle_timeout(self):
        """Check that blocking is_idle_callback times out"""
        loop = self.loop
        scheduler, status_events, flags = self._make_scheduler_for_idle_test()

        # Submit empty experiment with lower priority.
        empty_status_events, notify = self._make_status_events(
            1, scheduler.notifier.publish)
        scheduler.notifier.publish = notify

        expid = _get_expid("EmptyExperiment")
        scheduler.submit("main", expid, -1, None, False)

        # Make the is_idle_callback block; silencing the expected error
        # message in the log output.
        logging.disable(logging.ERROR)
        flags.be_idle = True
        flags.take_long_in_idle = True
        loop.run_until_complete(status_events["deleting"].wait())
        logging.disable(logging.NOTSET)

        # Make sure EmptyExperiment completes normally now.
        loop.run_until_complete(empty_status_events["run_done"].wait())

        loop.run_until_complete(scheduler.stop())

@sbourdeauducq
Member

Oh, I see. In this case it is probably simpler to have the test check relevant properties of the schedule instead of comparing it exactly to an expected schedule. Synchronizing the experiments to enforce a fully deterministic schedule would need complicated IPC.

@sbourdeauducq
Member

some of them do rely on implicitly defining order via sleeps.

Which ones? Only moninj tests do that (to ensure a scan and event report by the firmware) IIRC.

@dnadlinger
Collaborator

Oh, just checked the Git history, and apparently I did write the offending test myself two years ago – how embarrassing (in fairness, I was probably just copying test_steps() without thinking about what ordering guarantees the scheduler actually makes).

The above is an example of using get_dataset to do simple synchronisation, but yes, just testing for the expected properties might be cleaner and shorter.

Which ones [rely on sleep]?

It might be that the 0.2 second intervals in which pause/check_pause are called by the test experiments don't actually matter, only some scheduler internals – it's been a while.

@sbourdeauducq
Member

Those are just to prevent 100% CPU utilization while it waits for check_pause() to return True.
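As a rough sketch of the polling pattern being described (`wait_until_pause` and its interval are illustrative helpers, not ARTIQ API): the periodic sleep is what keeps the loop from spinning at full CPU while waiting for the pause condition.

```python
import asyncio

async def wait_until_pause(check_pause, interval=0.2):
    # Hypothetical helper: poll a check_pause-like coroutine until it
    # returns True. The asyncio.sleep() between polls is what prevents
    # 100% CPU utilization while waiting.
    while not await check_pause():
        await asyncio.sleep(interval)
```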

@ljstephenson
Contributor Author

Agree on checking properties; given the current overlap between this test and both test_steps() and test_pause(), it seems like the relevant check is that rid 2 goes through all stages before rid 1 does anything?
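One hedged sketch of such a property check (illustrative only; `status_events` and `finishes_before_starts` are hypothetical helpers, not part of the ARTIQ test suite): the test's notify callback could record every mod, and the assertion could then reduce to an ordering predicate over the recorded stream instead of an exact mod-by-mod comparison.

```python
def status_events(mods, rid):
    """Ordered (index, status) pairs notified for the given RID."""
    return [(i, m["value"]) for i, m in enumerate(mods)
            if m.get("action") == "setitem"
            and m.get("path") == [rid]
            and m.get("key") == "status"]

def finishes_before_starts(mods, done_rid, waiting_rid):
    """True if done_rid's last status change precedes waiting_rid's first."""
    done = status_events(mods, done_rid)
    waiting = status_events(mods, waiting_rid)
    if not done:
        return False
    if not waiting:
        # An experiment that never starts trivially satisfies the property.
        return True
    return done[-1][0] < waiting[0][0]
```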

@marmeladapk
Contributor

I had this test fail when my laptop CPU was forced to its lowest frequency and other processes were running. With the proper frequency restored, the test passed.
