Ensure agents improvements #7612

Closed · wants to merge 22 commits into from

Conversation

@sanderr (Contributor) commented May 10, 2024

Description

Fixes a bug in the autostarted agent manager that occasionally causes agents' sessions to time out, dropping calls and resulting in a "stuck" orchestrator until a redeploy is triggered.

~~The bug has been there forever, but the conditions for it to trigger happen to be set up by #7278.~~

~~The core of the bug is the following. When we need a certain agent to be up, we call `AutostartedAgentManager._ensure_agents`. This makes sure an agent process is running for these agents and waits for them to be up. However, instead of starting a process to track the autostarted agent map and to trust that it would keep up with changes to it, we instead start a process with an explicit list of agent endpoints.~~ If a new call comes in to `_ensure_agents`, and the agent is not yet up, we would kill the current process and start a new one. In killing the process, we would not expire its sessions, letting them time out on the 30s heartbeat timeout, losing any calls made to it in that window.

EDIT: Scratched some of the above. I wrote a test case for the first bullet point below, which I thought covered the root cause for killing the old agent process. However, it turns out the test also succeeds on master: the agent's `on_reconnect` actually updates the process' agent map. So my root cause analysis may have been wrong (or is at least less likely), but the second and third bullet points should fix the issue anyway (it doesn't matter exactly *how* we got into a situation where only part of the agent endpoints were up, as long as we handle it correctly), and even if part of the issue persists, the logging improvements should help future investigation. And the first bullet point is still a good consistency fix imo.

This PR makes the following changes:

  • An agent process for autostarted agents (`use_autostart_agent_map` in the config file) now ignores any agent endpoints in the config file. Instead it runs purely in autostarted agent mode, starting instances for each of the endpoints in the agent map and then tracking any changes made to the agent map. This last part was already in place and resulted in a small inconsistency: the process would respect the agent names in the config at start, and then suddenly consider the agent map the authority as soon as a change came in. This inconsistency is now resolved by treating the agent map as the authority for the entire lifetime of the agent process.
  • The autostarted agent manager now trusts its processes to track the agent map. If a process is already running for an environment but the agent endpoints are not yet up, it waits for them rather than killing the process and starting fresh (see the sketch after this list).
  • When an explicit restart is requested, the autostarted agent manager now expires all sessions for the agents in the agent map, i.e. the endpoints for the killed process.
  • Improved the robustness of the wait condition for an agent to be up; specifically, expired sessions are no longer considered up.
  • Made some log messages a bit more informative, no major changes.
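To make the intended behaviour concrete, here is a minimal, self-contained sketch; the class and method names are illustrative stand-ins (loosely modelled on `_ensure_agents` and `_wait_for_agents`), not the actual inmanta implementation:

```python
# Hedged sketch only: hypothetical names and data structures, not inmanta's real code.
import asyncio


class AutostartedAgentManagerSketch:
    def __init__(self) -> None:
        self._process_running: bool = False      # at most one autostarted process per environment (assumption)
        self._live_endpoints: set[str] = set()   # endpoints that currently have a live, non-expired session

    async def _start_process(self) -> None:
        # stand-in for forking an agent process that runs purely off the agent map
        self._process_running = True

    async def _wait_for_agents(self, endpoints: set[str], timeout: float = 5.0) -> None:
        async def all_up() -> None:
            while not endpoints <= self._live_endpoints:
                await asyncio.sleep(0.1)

        await asyncio.wait_for(all_up(), timeout=timeout)

    async def ensure_agents(self, endpoints: set[str]) -> None:
        # new behaviour: never kill an already-running process here, just wait for it
        if not self._process_running:
            await self._start_process()
        await self._wait_for_agents(endpoints)

    async def restart_agents(self, agent_map: dict[str, str]) -> None:
        # explicit restart: kill the process AND expire its sessions immediately,
        # instead of letting them run into the 30s heartbeat timeout
        self._process_running = False
        self._live_endpoints.clear()
        await self._start_process()
        await self._wait_for_agents(set(agent_map))
```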

Self Check:

Strike through any lines that are not applicable (~~line~~) then check the box

  • [x] ~~Attached issue to pull request~~
  • [x] Changelog entry
  • [x] Type annotations are present
  • [x] Code is clear and sufficiently documented
  • [x] No (preventable) type errors (check using make mypy or make mypy-diff)
  • [x] Sufficient test cases (reproduces the bug/tests the requested feature)
  • [x] Correct, in line with design
  • [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~
  • [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branch(es) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~~

) # we know the type of this map

LOGGER.debug("Expiring sessions for autostarted agents %s", sorted(agent_map.keys()))
await self._agent_manager.expire_sessions_for_agents(env.id, agent_map.keys())
@sanderr (Contributor, author), May 10, 2024:

This expires all sessions for these endpoints (i.e. I don't track which session belongs to the killed process). But I believe an invariant of the autostarted agents is that they are 100% managed by the server, so I think this is a valid approach.

Contributor:

For specific debugging scenarios (when the autostarted agent is acting up or being restarted too often), we have resorted to taking over the endpoints from another agent process. It is rare, but we have done it.

But this fix may make that unnecessary.

@@ -1295,6 +1267,75 @@ async def _fork_inmanta(
if errhandle is not None:
errhandle.close()

async def _wait_for_agents(
@sanderr (Contributor, author), May 10, 2024:

Moved this from an embedded function in __do_start_agent to a dedicated method. I made very limited changes:

  • updated the signature and the docstring
  • added connection wiring
  • improved log messages (previously logged in __do_start_agent) to include the list of instances
  • I'll mark all other significant changes with a comment.

Comment on lines 1324 to 1325
# make sure to check for expiry because sessions are unregistered from the agent manager asynchronously
and not session.expired
@sanderr (Contributor, author):

I added this condition: a session might still be in the tid_endpoint_to_session mapping while it has already been expired (if the agent manager hasn't processed the expiry event yet).
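A small, self-contained sketch of such a liveness check, under assumed data shapes (only the `tid_endpoint_to_session` mapping and the `expired` flag come from the discussion above; everything else is illustrative):

```python
# Hedged sketch only: the real signatures in inmanta differ.
import asyncio
from dataclasses import dataclass
from uuid import UUID


@dataclass
class Session:
    expired: bool = False


async def wait_for_agents(
    tid_endpoint_to_session: dict[tuple[UUID, str], Session],
    tid: UUID,
    endpoints: set[str],
    interval: float = 0.1,
) -> None:
    """Block until every endpoint has a live (non-expired) session for environment tid."""

    def is_up(endpoint: str) -> bool:
        session = tid_endpoint_to_session.get((tid, endpoint))
        # check for expiry as well, because expired sessions are removed from the
        # mapping asynchronously and may still be present here for a short while
        return session is not None and not session.expired

    while not all(is_up(e) for e in endpoints):
        await asyncio.sleep(interval)
```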

@@ -1207,34 +1207,6 @@ async def _dummy_fork_inmanta(
assert exception_message in str(excinfo.value)


async def test_are_agents_active(server, client, environment, agent_factory) -> None:
@sanderr (Contributor, author):

These were unit tests for two methods that I dropped in this PR.

Comment on lines +1513 to +1514
@pytest.mark.parametrize("autostarted", (True, False))
async def test_autostart_mapping_overrides_config(server, client, environment, async_finalizer, caplog, autostarted: bool):
@sanderr (Contributor, author):

This test succeeds on master as well (see note in PR description) but I left it because it's a good check to have.

@@ -1278,6 +1253,99 @@ async def test_dont_start_paused_agent(server, client, environment, caplog) -> N
assert "took too long to start" not in caplog.text


async def test_wait_for_agent_map_update(server, client, environment) -> None:
@sanderr (Contributor, author):

I verified that both these tests fail on master. They verify the core change in this PR.

@wouterdb (Contributor) left a review comment:

I would prefer the autostarted agent to indicate that it is the autostarted agent when setting up its session.

This would make the behavior easier to understand, but I won't stand in the way either.

src/inmanta/agent/agent.py (resolved review thread)
src/inmanta/server/agentmanager.py (resolved review thread)

Comment on lines 1074 to 1076
if "internal" not in agent_map:
raise ValueError("The internal agent must be present in the agent map")

Contributor:

Do we really want to hard fail on this? Shouldn't we take a more robust approach, i.e. 1) add the internal agent to the agent_map if it's missing and 2) log a warning?

@sanderr (Contributor, author):

I'm not sure. The current behavior was to drop the internal agent altogether, so I don't think we have to worry too much about it being a breaking change (the previous behavior is just breaking in a different way). In that sense I think we can afford it, and if we can afford it I think it's the sensible thing to do. It also corresponds with the environment setting update behavior.

@sanderr (Contributor, author):

@wouterdb what's your opinion on this?

@wouterdb (Contributor), May 13, 2024:

I think it is strange for the agent to reject the entire agent map when an entry is missing that it can easily put in there itself.

I would understand if we rejected the user input on the server side; here I find it tricky, in that it can make existing, working setups stop working altogether (this is code on the agent startup path).

(The more I think about it, the more I am convinced this can't go into a patch release; it is breaking.)

@sanderr (Contributor, author):

Ok, I'll change this.

@sanderr (Contributor, author):

  1. Added a warning; I think this is definitely good.
  2. Auto-add it when it's missing (roughly as sketched below). I could also just keep this part to the old behavior if you prefer.
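A minimal sketch of that more robust approach; the helper name and the default value for the internal agent are assumptions, not inmanta's actual code:

```python
# Hedged sketch only: names and the default transport value are placeholders.
import logging

LOGGER = logging.getLogger(__name__)

INTERNAL_AGENT_DEFAULT = "local:"  # placeholder default; the real value is an assumption


def ensure_internal_agent(agent_map: dict[str, str]) -> dict[str, str]:
    """Return the agent map, auto-adding the internal agent (with a warning) if it is missing."""
    if "internal" not in agent_map:
        LOGGER.warning("The internal agent is missing from the agent map; adding it automatically")
        return {"internal": INTERNAL_AGENT_DEFAULT, **agent_map}
    return agent_map
```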

if len(agents) == 0:
return False
if connection is not None and connection.is_in_transaction():
raise Exception("_ensure_agents should not be called in a transaction context")
Contributor:

Let's add a comment explaining why this is required.

@sanderr (Contributor, author):

Turns out I should have made a note for myself even; I'd forgotten the specifics of it. I made an effort to make it clear though.

src/inmanta/server/agentmanager.py (outdated, resolved review thread)
src/inmanta/server/agentmanager.py (outdated, resolved review thread)
tests/agent_server/test_server_agent.py (outdated, resolved review thread)
sanderr and others added 2 commits May 13, 2024 12:00
Co-authored-by: arnaudsjs <2684622+arnaudsjs@users.noreply.github.com>
Co-authored-by: arnaudsjs <2684622+arnaudsjs@users.noreply.github.com>

@sanderr requested a review from wouterdb May 13, 2024 13:20
@sanderr added the merge-tool-ready label (This ticket is ready to be merged in) May 13, 2024
@inmantaci (Contributor):

Processing this pull request

inmantaci pushed a commit that referenced this pull request May 13, 2024
…od (Issue #7612, PR #7612)

@inmantaci removed the merge-tool-ready label (This ticket is ready to be merged in) May 13, 2024
@inmantaci (Contributor):

Failed to merge changes into iso6 due to merge conflict. Please open a pull request for these branches separately by cherry-picking the commit that was made on the branch master (git cherry-pick c15d870).

@inmantaci (Contributor):

Merged into branches master in c15d870

inmantaci pushed a commit that referenced this pull request May 13, 2024
…od (Issue #7612, PR #7612)

@inmantaci (Contributor):

Not closing this pull request due to previously commented issues for some of the destination branches. Please open a separate pull request for those branches by cherry-picking the relevant commit. You can safely close this pull request and delete the source branch.

@inmantaci (Contributor):

This branch was not deleted as it seems to still be in use.

@inmantaci (Contributor):

Processing #7616.

inmantaci pushed a commit that referenced this pull request May 13, 2024
…od (Issue #7612, PR #7612)

Pull request opened by the merge tool on behalf of #7612
sanderr added a commit that referenced this pull request May 15, 2024
…od (Issue #7612, PR #7612)

@sanderr (Contributor, author) commented May 15, 2024:

iso6 merged in 97c0495

@sanderr closed this May 15, 2024
@sanderr deleted the ensure-agents-improvements branch May 15, 2024 08:31
inmantaci pushed a commit that referenced this pull request May 15, 2024
…tostarted (PR #7628)

# Description

#7612 made the autostarted agent manager take full authority over autostarted agents. However, it was decided that it should not start an autostarted agent process if a session is already active for all required agents, even if that session is not with an autostarted process.

This pull request restores that behavior. It adds back, unchanged, two utility methods that were dropped in #7612, along with their associated test case. It adds a new guard in `_ensure_agents` and a test case (a parametrization of an existing one) for the intended behavior. The new test case has been verified to fail on #7612, and to succeed on the commit that precedes it (ensuring backwards compatibility).
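A hedged sketch of that guard; the helper name is only suggested by the re-added test `test_are_agents_active`, and the signatures and data shapes are assumptions rather than the actual inmanta API:

```python
# Hedged sketch only: assumed data shapes, not the real agent manager internals.
from uuid import UUID


def are_agents_active(
    active_sessions: dict[tuple[UUID, str], object], tid: UUID, agents: set[str]
) -> bool:
    """True iff every requested agent endpoint already has an active session in environment tid."""
    return all((tid, agent) in active_sessions for agent in agents)


def needs_autostarted_process(
    active_sessions: dict[tuple[UUID, str], object], tid: UUID, agents: set[str]
) -> bool:
    # guard restored in #7628: if all endpoints are already covered by active sessions
    # (autostarted or not), there is no need to start a new autostarted process
    return not are_agents_active(active_sessions, tid, agents)
```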

# Self Check:

Strike through any lines that are not applicable (`~~line~~`) then check the box

- [x] ~~Attached issue to pull request~~
- [x] Changelog entry
- [x] Type annotations are present
- [x] Code is clear and sufficiently documented
- [x] No (preventable) type errors (check using make mypy or make mypy-diff)
- [x] Sufficient test cases (reproduces the bug/tests the requested feature)
- [x] Correct, in line with design
- [x] ~~End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )~~
- [x] ~~If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branche(s) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)~~
inmantaci pushed a commit that referenced this pull request May 15, 2024
…tostarted (PR #7628)

inmantaci pushed a commit that referenced this pull request May 15, 2024
…tostarted (PR #7628)

arnaudsjs pushed a commit that referenced this pull request May 23, 2024
…od (Issue #7612, PR #7612)

arnaudsjs pushed a commit that referenced this pull request May 23, 2024
…tostarted (PR #7628)

arnaudsjs pushed a commit that referenced this pull request May 23, 2024
…od (Issue #7612, PR #7612)

arnaudsjs pushed a commit that referenced this pull request May 23, 2024
…tostarted (PR #7628)
