This repository has been archived by the owner on Dec 26, 2023. It is now read-only.

Agents seem to be racing for work #486

Closed
take-five opened this issue Jun 28, 2023 · 2 comments · Fixed by #537

Comments

@take-five

Hi

I tried to spin up OTF locally with the following setup - one controller, one organization, a workspace configured to run on an agent.

Then I launched 3 agent processes and began running terraform plan repeatedly in the workspace. Every time one of the agents would report something like this:

2023/06/28 16:45:54 INFO executing phase run=run-23z0M1fCvZC6pGpq phase=plan
2023/06/28 16:45:57 INFO finishing phase run=run-23z0M1fCvZC6pGpq phase=plan

While the other two reported:

2023/06/28 16:45:54 ERROR starting phase run=run-23z0M1fCvZC6pGpq phase=plan error="Internal Server Error: phase already started"

It's probably benign and just error noise, and I'm not sure what a good solution would be, because agents communicate with the controller over HTTPS and thus can't use Postgres directly (otherwise, it'd be possible to do something like SELECT .. FOR UPDATE SKIP LOCKED).

@leg100
Owner

leg100 commented Jun 28, 2023

Hello. Yes, it is benign and just error noise as you say.

They all receive an event notifying them of a new run phase (plan or apply), and then race to be the first to claim the phase, triggering a "thundering herd", albeit a small herd. It's not a terrible approach, performance isn't an issue, etc. I could quash the errors...

But I've been working on a refactor for a little while: instead, an agent manager decides which agent a phase is assigned to. I've noticed that's the same way TFC assigns jobs to its agents. This work will take a little while though; I can't say when it'll be complete.

In the meantime let's keep this issue open for other folks.

leg100 added a commit that referenced this issue Jul 25, 2023
@leg100
Owner

leg100 commented Jul 25, 2023

@take-five I opted to quash the error because it really is distracting and I don't know when this anticipated agent refactor will be complete.
