Agents seem to be racing for work #486

take-five · 2023-06-28T13:53:23Z

Hi

I tried to spin up OTF locally with the following setup - one controller, one organization, a workspace configured to run on an agent.

Then I launched 3 agent processes and began running terraform plan repeatedly in the workspace. Each time, one of the agents would report something like this:

2023/06/28 16:45:54 INFO executing phase run=run-23z0M1fCvZC6pGpq phase=plan
2023/06/28 16:45:57 INFO finishing phase run=run-23z0M1fCvZC6pGpq phase=plan

While the other two reported:

2023/06/28 16:45:54 ERROR starting phase run=run-23z0M1fCvZC6pGpq phase=plan error="Internal Server Error: phase already started"

It's probably benign and just error noise, and I'm not sure what a good solution would be, because agents communicate with the controller over HTTPS and thus can't use Postgres directly (otherwise, it'd be possible to do something like SELECT .. FOR UPDATE SKIP LOCKED).


leg100 · 2023-06-28T14:10:23Z

Hello. Yes, it is benign and just error noise as you say.

They all receive an event notifying them of a new run phase (plan or apply), and then race to be the first to claim the phase, triggering a "thundering herd", albeit a small herd. It's not a terrible approach, performance isn't an issue, etc. I could quash the errors...

But I've been working on a refactor for a little while: instead, an agent manager decides which agent a phase is assigned to. I've noticed that's the same way TFC assigns jobs to its agents. This work will take a little while though; I can't say when it'll be complete.
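To make the manager idea concrete, here is a minimal sketch of central assignment. Everything in it is an assumption for illustration: the `manager` type, its `assign` method and the round-robin policy are hypothetical, and the thread does not say how the refactor (or TFC) actually picks an agent.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// manager is a hypothetical agent manager that decides which agent runs a
// phase, so agents never compete for the same work.
type manager struct {
	mu       sync.Mutex
	agents   []string
	next     int               // index of the next agent in round-robin order
	assigned map[string]string // run ID -> assigned agent
}

func newManager(agents ...string) *manager {
	return &manager{agents: agents, assigned: make(map[string]string)}
}

// assign picks an agent for the run. Round-robin is an assumed policy chosen
// only to keep the sketch short. Asking again for the same run returns the
// same agent, so an assignment is made at most once.
func (m *manager) assign(runID string) (string, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.agents) == 0 {
		return "", errors.New("no agents registered")
	}
	if agent, ok := m.assigned[runID]; ok {
		return agent, nil
	}
	agent := m.agents[m.next%len(m.agents)]
	m.next++
	m.assigned[runID] = agent
	return agent, nil
}

func main() {
	m := newManager("agent-0", "agent-1", "agent-2")
	for _, run := range []string{"run-a", "run-b", "run-a"} {
		agent, err := m.assign(run)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", run, agent)
	}
}
```

With a single decision point there is no losing side, so there is no "phase already started" error to report in the first place.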

In the meantime let's keep this issue open for other folks.

Fixes #486

leg100 · 2023-07-25T18:07:02Z

@take-five I opted to quash the error because it really is distracting and I don't know when this anticipated agent refactor will be complete.

leg100 mentioned this issue Jul 25, 2023

fix: agent race error #537

Merged

leg100 closed this as completed in #537 Jul 25, 2023

leg100 added a commit that referenced this issue Jul 25, 2023

fix: agent race error (#537)

6b9e6b1

Fixes #486
