
Scheduler code #7561 (Draft)

wants to merge 18 commits into base: master

Conversation

@wouterdb (Contributor) commented Apr 24, 2024

Review

I would like to get general approval on what is here, before we decide how to proceed.

Description

Potential implementation for the queue scheduler

  1. Built a scheduler toolbox
  2. Integrated it with the agent
  3. Added deploy, dryrun and getfact

Todo

  • add exception handling to the scheduler; right now it stops working on an exception
  • the version-based cache is a mess now. I would propose to make it a timed cache that keeps every version open for a fixed amount of time (a rough sketch follows below)
  • resume after suspend has to be replaced with priorities in the queue
  • clean up the full queue
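
As a rough sketch of the timed-cache idea (the class name, TTL handling and method names here are hypothetical, not code from this PR):

import time
import typing

T = typing.TypeVar("T")


class TimedVersionCache(typing.Generic[T]):
    """Keep every model version open for a fixed amount of time after its last use."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self._ttl = ttl_seconds
        # version -> (cached value, expiry timestamp)
        self._entries: dict[int, tuple[T, float]] = {}

    def put(self, version: int, value: T) -> None:
        self._entries[version] = (value, time.monotonic() + self._ttl)

    def get(self, version: int) -> typing.Optional[T]:
        entry = self._entries.get(version)
        if entry is None:
            return None
        value, _ = entry
        # Every hit extends the lifetime of that version
        self._entries[version] = (value, time.monotonic() + self._ttl)
        return value

    def expire(self) -> None:
        """Drop every version whose timer ran out; meant to be called periodically."""
        now = time.monotonic()
        self._entries = {v: e for v, e in self._entries.items() if e[1] > now}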

Self Check:

Strike through any lines that are not applicable (~~line~~) then check the box

  • Attached issue to pull request
  • Changelog entry
  • Type annotations are present
  • Code is clear and sufficiently documented
  • No (preventable) type errors (check using make mypy or make mypy-diff)
  • Sufficient test cases (reproduces the bug/tests the requested feature)
  • Correct, in line with design
  • End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )
  • If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branch(es) (see test-fixes for more info)

@wouterdb requested review from sanderr and arnaudsjs and removed the review request for arnaudsjs and sanderr on May 7, 2024 10:59
"""Queue to schedule tasks with inter-dependencies and provide an overview of the queue"""

def __init__(self) -> None:
self.full_queue: PriorityQueue[typing.Tuple[int, int, BaseTask]] = PriorityQueue()

Contributor:

It would improve readability if we converted the tuple into a proper object.

Contributor Author:

For now I like the simplicity of it (i.e. the way it is compared is very clear, which is the most relevant property); I'll add a todo for later.
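
For reference, a sketch of what such an object could look like while keeping the comparison behaviour of the current (priority, insertion number, task) tuple (names are illustrative, not from this PR):

import dataclasses
import typing


@dataclasses.dataclass(frozen=True, order=True)
class QueueEntry:
    """Compares exactly like the (priority, insertion number, task) tuple it would replace."""

    priority: int
    insertion_order: int
    # The task itself (BaseTask in this PR) never takes part in the comparison
    task: typing.Any = dataclasses.field(compare=False)

Because order=True only considers the fields with compare=True, a PriorityQueue[QueueEntry] would order entries the same way the tuple does today.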

src/inmanta/scheduler/scheduler.py (outdated, resolved)


class SentinelTask(BaseTask):
    """Task to unblock the queue on shutdown"""

Contributor:

This needs more documentation.
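
For illustration, the docstring could spell out why the sentinel exists; this wording is mine and is only inferred from how the runner loop in this PR blocks on queue.get():

class SentinelTask(BaseTask):
    """
    No-op task used to unblock the queue on shutdown.

    An idle TaskRunner sits in `await self.queue.get()` and would never notice that
    it was asked to stop. Putting a SentinelTask on the queue wakes it up, and the
    runner can then re-check its stop flag instead of executing real work.
    """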

src/inmanta/agent/agent.py (outdated, resolved)
Comment on lines 400 to 416
        if new_request.is_periodic:
            # periodic
            if self.active_repair is not None and not self.active_repair.run.finished():
                # active one
                if self.active_repair.origin.is_full_deploy:
                    self.logger.info(
                        "Ignoring new run '%s' in favor of current '%s'", new_request.reason, self.active_repair.origin.reason
                    )
                    return
                else:
                    self.logger.info("Upgrading run '%s' to '%s'", self.active_repair.origin.reason, new_request.reason)
                    self.active_repair.run.cancel()
            self.active_repair = build_run(PRIO_PERIODIC)
        else:
            # Just do as you are told
            if self.active_deploy is not None and not self.active_deploy.run.finished():
                self.logger.info("Canceling run '%s' in favor of '%s'", self.active_repair.origin.reason, new_request.reason)

@arnaudsjs (Contributor) commented on May 16, 2024:

This part I don't understand. What about periodic deploys?

@@ -524,6 +476,9 @@ def __init__(self, process: "Agent", name: str, uri: str, *, ensure_deploy_on_st

# init
self._nq = ResourceScheduler(self, self._env_id, name)
self.work_queue = scheduler.TaskQueue()
self.work_queue_drainer = scheduler.TaskRunner(self.work_queue)

Contributor:

Why have two different terms for the same thing (queue_drainer vs. TaskRunner)?

Contributor Author:

changed

@@ -754,52 +713,49 @@ async def dryrun(self, dry_run_id: uuid.UUID, version: int) -> Apireturn:
        return 200

    async def do_run_dryrun(self, version: int, dry_run_id: uuid.UUID) -> None:
        async with self.dryrunlock:
            async with self.ratelimiter:

Contributor:

Why do we remove the ratelimiter here?

Contributor Author:

the queue does that now

Contributor:

How?

Contributor Author:

We know that we can run only one task at a time, because only a single logical thread drains the queue.
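
In other words, the single TaskRunner acts as the rate limiter: the next task is only taken from the queue once the previous one has finished. A self-contained sketch of that argument (hypothetical names, not the PR's code):

import asyncio
import typing

Work = typing.Callable[[], typing.Awaitable[None]]


async def drain(queue: "asyncio.Queue[Work]") -> None:
    # A single consumer: the next item is only fetched after the previous one finished,
    # so at most one task runs at a time without an explicit rate limiter.
    while True:
        work = await queue.get()
        await work()
        queue.task_done()


async def main() -> None:
    queue: "asyncio.Queue[Work]" = asyncio.Queue()

    async def dryrun() -> None:
        await asyncio.sleep(0.1)

    for _ in range(3):
        queue.put_nowait(dryrun)

    runner = asyncio.create_task(drain(queue))
    await queue.join()  # the three dryruns ran strictly one after another
    runner.cancel()


asyncio.run(main())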

Co-authored-by: arnaudsjs <2684622+arnaudsjs@users.noreply.github.com>

@sanderr (Contributor) left a comment:

Reviewed scheduler.py. Have yet to review the rest.

# Link to the parent queue, required to jump into run queue
self._queue: typing.Optional["TaskQueue"] = None
# priority / insertion number / self
self._entry: typing.Optional[typing.Tuple[int, int, "BaseTask"]] = None

Contributor:

Suggested change:
- self._entry: typing.Optional[typing.Tuple[int, int, "BaseTask"]] = None
+ self._entry: typing.Optional[tuple[int, int, "BaseTask"]] = None

self.done = False

# Link to the parent queue, required to jump into run queue
self._queue: typing.Optional["TaskQueue"] = None

Contributor:

Because you recently asked about import practice opinions, I'll comment on this. I personally feel like qualifiers are clutter when they're obvious. Especially for common types like Optional, Sequence, TypeVar, Callable, ... we use them so often and so widely that I feel like they're almost native.

This remains very subjective of course.

Contributor:

For abc and typing.TYPE_CHECKING I'm on the fence, and for all other modules used in this file I like the qualified approach you used.

    def is_done(self) -> bool:
        return self.done

    def cancel(self) -> None:

Contributor:

Shouldn't we inform provides as well in this case? Otherwise we potentially create infinite waiters, i.e. a memory leak?

Contributor Author:

I'm going to leave this as a todo for a later stage, I think.

I'm not entirely clear on what we expect from the cancel.

e.g.

  • canceling a deploy for a resource that has already reached its desired state is equivalent to the deploy being done, but if it previously failed, it is equivalent to a failure.
  • if we store the future of the execution in the task, we can even cancel pre-emptively.

It is not unlikely this entire dependency mechanism will change.
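
For the record, the "inform provides" idea could look roughly like this; the provides/requires bookkeeping below is my own sketch, not necessarily how this PR models dependencies:

class Task:
    def __init__(self, name: str) -> None:
        self.name = name
        self.cancelled = False
        # Tasks that declared they wait for this one (reverse edge of "requires")
        self.provides: set["Task"] = set()
        # Tasks this one is still waiting for
        self.requires: set["Task"] = set()

    def wait_for(self, other: "Task") -> None:
        self.requires.add(other)
        other.provides.add(self)

    def cancel(self) -> None:
        self.cancelled = True
        # Inform every waiter so nobody blocks forever on a task that will never run
        for waiter in self.provides:
            waiter.requires.discard(self)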

"""
Declare we wait for some other task

Adding the same task twice will make this one wait forever!

Contributor:

So will adding a cancelled task. Is that prohibited?

Contributor Author:

good catch

Comment on lines 181 to 182
    def __repr__(self) -> str:
        return self._name

Contributor:

Typically, repr is meant to be an unambiguous representation of this object. I think we should at least make it f"Task({self._name})", and preferably include the priority as well.
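
Something along these lines, assuming the task keeps its _name and exposes priority() as elsewhere in this PR:

    def __repr__(self) -> str:
        return f"Task(name={self._name!r}, priority={self.priority()})"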

        return task

    def view(self) -> typing.Sequence[Task]:
        """This leaks a reference to the internal queue: don't touch it"""

Contributor:

This comment is too cryptic imo, what does "don't touch it" mean?

self.queue = queue
self.running = False
self.should_run = True
self.finished: typing.Optional[asyncio.Task[None]] = None

Contributor:

finished is a strange name for a task object.

Contributor Author:

to me, it is the point in the Future where it is finished.

    async def run(self) -> None:
        self.running = True
        while self.should_run:
            task = await self.queue.get()

Contributor:

Doesn't this raise an IndexError when the queue is exhausted?

Contributor Author:

nope, it waits (it is a blocking queue)


Remove and return an item from the queue.
If queue is empty, wait until an item is available.
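
A small self-contained check of that behaviour with asyncio.Queue, which the TaskQueue appears to mirror here (an assumption on my part):

import asyncio


async def main() -> None:
    queue: "asyncio.Queue[int]" = asyncio.Queue()

    async def consumer() -> None:
        item = await queue.get()  # suspends here; no IndexError on an empty queue
        print("got", item)

    task = asyncio.create_task(consumer())
    await asyncio.sleep(0.1)  # the consumer is parked inside get()
    queue.put_nowait(42)      # this wakes it up
    await task


asyncio.run(main())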

        self.should_run = True
        self.finished = asyncio.create_task(self.run())

    def stop(self) -> None:

Contributor:

This stops the runner asynchronously. I think it could use a docstring on what exactly it achieves.
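
A possible docstring for it, plus the sentinel-based wake-up it hints at; the put_nowait call and the exact wording are assumptions on my part, not the PR's code:

    def stop(self) -> None:
        """
        Ask the runner to stop, asynchronously.

        Sets should_run to False so the run() loop exits before picking up new work,
        and wakes up a runner that is parked in `await self.queue.get()`. The
        `finished` task completes once the task currently being executed (if any) is done.
        """
        self.should_run = False
        # Hypothetical: unblock an idle runner; SentinelTask exists for exactly this purpose
        self.queue.put_nowait(SentinelTask())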

    def long_string(self) -> str:
        return "{} awaits {}".format(self.resource_id.resource_str(), " ".join([str(aw) for aw in self.dependencies]))

    def priority(self) -> int:
        return 100  # 90 # 110

Contributor:

What does the comment mean?

self.resource_id = resource_id
self.group_id: uuid.UUID = group_id
self.reason: str = reason
name = f"Deploy {self.resource_id} as part of {self.group_id} because {self.reason}"

Contributor:

This is more a description than a name, isn't it?

    def __init__(self, scheduler: "ResourceScheduler", resource_id: Id, gid: uuid.UUID, reason: str) -> None:
        super().__init__(scheduler, resource_id, gid, reason)

    async def run(self) -> None:
        raise Exception("This task should never be scheduler")

Contributor:

Suggested change:
- raise Exception("This task should never be scheduler")
+ raise Exception("This task should never be scheduled")

Comment on lines +50 to +51
self.cancelled = False
# Indicates if a cancel is requested

Contributor:

Suggested change:
- self.cancelled = False
- # Indicates if a cancel is requested
+ # Indicates if a cancel is requested
+ self.cancelled = False

Comment on lines +31 to +33
resource_container.Provider.set("agent1", "key1", "value1")
resource_container.Provider.set("agent1", "key1", "value1")
resource_container.Provider.set("agent1", "key1", "value1")

Contributor:

Three times the same line?

@@ -2137,6 +2137,7 @@ async def test_reload(
assert dep_state.index == resource_container.Provider.reloadcount("agent1", "key2")


@pytest.mark.skip("Skipped for agent refactor")

Contributor:

We have to be more descriptive in this message. Why do we need to skip this test?

@sanderr (Contributor) left a comment:

On hold until further notice.
