
Perform row-level locking when using Agents #3299

Closed
knu wants to merge 3 commits from the lock_agents branch

Conversation

@knu knu commented Jul 12, 2023

Huginn agents are not robust when handling concurrent events because they have no built-in locking mechanism to ensure the integrity of their memory state.

This PR makes Huginn acquire a pessimistic lock whenever it reads and updates agent state.

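As a minimal sketch of this pattern (illustrative only, not the PR's exact diff; agent_id and the check call site are placeholders), the read-modify-write of an agent's state is wrapped in a transaction that holds a row-level lock:

# ActiveRecord's `lock` issues SELECT ... FOR UPDATE, and the row stays
# locked until the surrounding transaction commits.
Agent.transaction do
  agent = Agent.lock.find(agent_id) # agent_id is a placeholder
  agent.check                       # may read and mutate agent.memory
  agent.save!                       # persisted while the row is still locked
end
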
@knu knu requested a review from dsander July 12, 2023 18:13
Collaborator

@dsander dsander left a comment

In general it's a great idea to make this more resilient, but synchronizing things (even in our tiny distributed system) is very tricky. I am mostly worried about putting locks around everything we ever change and then potentially(?) ending up with a bunch of deadlocks.

What seems to be missing is everything in AgentControllerConcern. We do now have locks in the check and receive jobs, but these Agents can also control other Agents.

Not super sure, but aren't the locks in the AgentController unnecessary if we have the locks in the check and receive jobs?

Comment on lines +117 to +120
Agent.transaction do
  @agent = current_user.agents.lock.find(params[:id])
  @agent.update!(memory: {})
end
Collaborator

Mostly just curious: in some places you used .with_lock and in others wrapped .lock.find in a transaction "manually"; afaik they are interchangeable functionality-wise. WDYT about using one pattern everywhere it's feasible? I generally think with_lock is easier to read.

Suggested change
-Agent.transaction do
-  @agent = current_user.agents.lock.find(params[:id])
-  @agent.update!(memory: {})
-end
+@agent = current_user.agents.find(params[:id])
+@agent.with_lock do
+  @agent.update!(memory: {})
+end

Member Author

That would cost you two select queries when you needed just one and could lead to a race condition.

SELECT; -- find
-- The record may change
BEGIN;  -- with_lock do
  SELECT FOR UPDATE;
  -- ...
COMMIT; -- end

Member Author

@knu knu Jul 16, 2023

Of course, Huginn will never change the id or the ownership of an agent, so the race condition would do little to no harm in this particular case, but I'm talking about the situation in general.

@agent = current_user.agents.find(params[:id])
@scenario = current_user.scenarios.find(params[:scenario_id])
@agent.scenarios.destroy(@scenario)
Agent.transaction do
Collaborator

Hmm, do we need this? Afaik we don't have agents that change the scenario of another Agent.

knu commented Jul 16, 2023

> In general it's a great idea to make this more resilient, but synchronizing things (even in our tiny distributed system) is very tricky. I am mostly worried about putting locks around everything we ever change and then potentially(?) ending up with a bunch of deadlocks.

I deeply understand your concern. As far as I have investigated, running an agent is always done asynchronously, either via Agent.async_check or Agent.async_receive, so a deadlock will not occur even if an agent triggers other agents while it's running. There can be a cyclic workflow, but each run is independent and does not involve waiting for another to finish.

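As a simplified illustration of this point (the job class below is hypothetical, not Huginn's actual code), each run locks only its own agent's row and merely enqueues work for other agents, so no transaction ever waits on another agent's run:

class CheckAgentJob < ApplicationJob
  queue_as :default

  def perform(agent_id)
    Agent.transaction do
      agent = Agent.lock.find(agent_id) # lock only this agent's row
      agent.check                       # may enqueue jobs for other agents,
                                        # but never waits for them here
    end                                 # lock released at COMMIT
  end
end
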
> What seems to be missing is everything in AgentControllerConcern. We do now have locks in the check and receive jobs, but these Agents can also control other Agents.

> Not super sure, but aren't the locks in the AgentController unnecessary if we have the locks in the check and receive jobs?

Yes, this is one other thing to consider.

A controller agent can "run" other agents, but in an async way as I said above.

A controller agent can "configure" other agents, but controller agents are not allowed to be a controller of themselves, so it should be safe to acquire a lock on the target agents for updating while a controller agent is running.
Without a lock, updating an agent cannot be performed atomically, leading to a race condition between reading and writing.

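A hedged sketch of that idea (configure_target and new_options are made-up names, not the AgentControllerConcern code): the target agent's row is locked so the read-modify-write of its configuration cannot interleave with a concurrent run.

def configure_target(target_agent, new_options)
  # with_lock re-reads the row under SELECT ... FOR UPDATE inside a
  # transaction, so the merge below sees the latest persisted options.
  target_agent.with_lock do
    target_agent.options = target_agent.options.deep_merge(new_options)
    target_agent.save!
  end # COMMIT releases the lock
end
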
I've been running my instance with this patch applied for 3-4 days now. So far I haven't seen a problem, but I'm going to keep an eye on it for a while longer.

knu commented Jul 18, 2023

It turned out that this should be done in a more fine-grained manner. Putting the whole run of check() in a transaction means any events created in it are committed only after it finishes, and if the asynchronous receive() calls of the downstream agents take place before that, they won't be able to see the newly created events.

So, while there should be a global mechanism to prevent an agent from being run while it is already running, the transactions that keep the memory/event state consistent should be carefully designed and implemented on a per-agent, fine-grained basis. I'll reconsider this and come back with a better implementation.

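A simplified illustration of the visibility problem (names are placeholders, not Huginn's exact code): rows created inside a still-open transaction are invisible to other database sessions until COMMIT.

Agent.transaction do
  agent = Agent.lock.find(agent_id)
  event = agent.events.create!(payload: { "status" => "ok" })
  # Suppose a downstream receive job is enqueued here and a worker picks
  # it up immediately: its database session cannot see `event` yet,
  # because this transaction has not committed.
end
# Only here, at COMMIT, do the created events become visible to other
# sessions, and thus to the receive jobs of downstream agents.
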
@knu knu marked this pull request as draft July 18, 2023 01:30
@knu knu closed this Nov 3, 2023
@knu knu deleted the lock_agents branch November 3, 2023 15:47