
2 Events from 1 Agent when running multiple workers #1454

Closed
TildeWill opened this issue Apr 27, 2016 · 20 comments

@TildeWill
Contributor

As I'm digging into this problem I thought I'd open an issue in case someone else comes across it and/or has some insight. All of this is running locally on my MacBook Pro. I ran a script that dsander posted in another thread to create 500 pairs of agents (1000 total) plus one ManualEventAgent to kick off a set of web requests back to the web server.

# ManualEventAgent whose event fans out to every PostAgent created below
trigger = User.find(1).agents.create!(type: "Agents::ManualEventAgent", name: "PostTrigger")

500.times do |i|
  # WebhookAgent that receives the request posted by its matching PostAgent
  options = {"secret" => "supersecretstring", "expected_receive_period_in_days" => 1, "payload_path" => "payload"}
  webhook = User.find(1).agents.create!(type: 'Agents::WebhookAgent', name: "Webhook #{i}", options: options)

  # PostAgent that posts back to the local web server whenever the trigger fires
  options = {"post_url" => "http://localhost:5000/users/1/web_requests/#{webhook.id}/supersecretstring", "expected_receive_period_in_days" => "1", "content_type" => "form", "method" => "post", "payload" => {"payload" => {"agent" => "#{i}"}}, "headers" => {}}
  post_agent = User.find(1).agents.new(type: 'Agents::PostAgent', name: "Post #{i}", options: options)
  post_agent.sources << trigger
  post_agent.save!
end

Expected outcome: 501 Events, Actual: 502 (or more) Events.

My Procfile looks like this:

web: bundle exec rails server -p ${PORT-3000} -b ${IP-0.0.0.0}
scheduler: bundle exec rails runner bin/schedule.rb
worker1: bundle exec script/delayed_job -i 1 run
worker2: bundle exec script/delayed_job -i 2 run

If I comment out the second worker, I get the expected 501 Events.

The extra events look like duplicates: their ids are back-to-back (e.g. 302 and 303) with the same agent_id (a quick console check for these is sketched after the list below).
Other details that might be relevant:
DELAYED_JOB_SLEEP_DELAY=10
Ruby 2.2.3
postgres (PostgreSQL) 9.4.5
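
A quick way to surface these duplicates from a Rails console (just a sketch; it assumes each agent should emit exactly one event per run):

# any agent_id that emitted more than one event, with how many it emitted
Event.group(:agent_id).having("count(*) > 1").count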

@TildeWill
Contributor Author

Maybe it's just late, but now I'm very consistently getting 1001 Events, and I don't think I've changed anything significant. I changed my .env to have DOMAIN=localhost:5000 rather than DOMAIN=localhost:3000 so that DOMAIN would match the URL the PostAgent is posting to, and I thought that was the cause. But then I changed it back to DOMAIN=localhost:3000 and still got 1001 Events. *head scratch*

@dsander
Collaborator

dsander commented Apr 27, 2016

Which version are you running? Before #1432 was merged you could get duplicate events by triggering the event propagation via the web interface at the same time the background processor ran the propagation.

@TildeWill
Contributor Author

Yep, running code that's after #1432. No other agents on the system besides the ones generated by the above script.

@cantino
Member

cantino commented Apr 30, 2016

I wonder if you get the same symptoms with Agents that don't interact with the network?

@TildeWill
Contributor Author

I changed my setup script to:

trigger = User.find(1).agents.create!(type: "Agents::ManualEventAgent", name: "PostTrigger"); nil;

500.times do |i|
  options = {
    :expected_receive_period_in_days => "2",
    :rules => [{
      :type => "regex",
      :value => "rain",
      :path => "conditions"
    }],
    :message => "Agent: #{i} -It looks like '{{conditions}}' tomorrow in {{location}}"
  }

  agent = User.find(1).agents.new(type: 'Agents::TriggerAgent', name: "First Agent #{i}", options: options)
  agent.sources << trigger
  agent.save!
end

and Procfile looks like:

web: bundle exec rails server -p ${PORT-3000} -b ${IP-0.0.0.0}
schedule: bundle exec rails runner bin/schedule.rb
dj: bundle exec rake jobs:work

with a start command to kick off 2 workers: bundle exec foreman start -m dj=2,web=1,schedule=1

Results: Event.count = 507

@dsander
Collaborator

dsander commented May 2, 2016

I see, my guess is that the duplicate events are caused by the scaling feature of foreman. We added the bulky multiple-delayed_job-worker examples to the Procfiles because DJ needs the -i argument to properly lock its jobs when multiple workers are running.
Can you check whether you still get duplicates with this Procfile?

web: bundle exec rails server -p ${PORT-3000} -b ${IP-0.0.0.0}
schedule: bundle exec rails runner bin/schedule.rb
dj2: bundle exec script/delayed_job -i 2 run
dj3: bundle exec script/delayed_job -i 3 run

@TildeWill
Contributor Author

Got 509 with your Procfile.

@dsander
Collaborator

dsander commented May 2, 2016

Darn, thanks for checking that. Looks like another debugging session is in order. 😄

@TildeWill
Contributor Author

Is it possible some jobs are failing and being retried in a way that causes duplicates? Is there something I can look for in the logs?
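
One quick check from a Rails console (just a sketch; it assumes the stock delayed_job schema with attempts and last_error columns) would be to look for jobs that have been retried or have recorded an error:

# jobs still in the queue that were retried or hit an error at least once
# (successfully completed jobs are removed from the table, so this only
#  catches jobs that are still pending or failing)
Delayed::Job.where("attempts > 0 OR last_error IS NOT NULL").count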

@dsander
Collaborator

dsander commented May 3, 2016

It seems unlikely that the jobs of those agents would fail. Do you see a pattern in the duplicated events, e.g. are the duplicates always from the same agents, or is it completely random?

@TildeWill
Contributor Author

It looks like it's all over the place. Here are the duplicate Events (events with overlapping agent_ids) from the last run I did, where the count at the end was 509:

[#<Event:0x007fd9b975d1f0
  id: 72722,
  user_id: 1,
  agent_id: 22732,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 4 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975d038
  id: 72723,
  user_id: 1,
  agent_id: 22732,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 4 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975ce80
  id: 72778,
  user_id: 1,
  agent_id: 22768,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 40 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975ccc8
  id: 72779,
  user_id: 1,
  agent_id: 22768,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 40 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975ca98
  id: 72820,
  user_id: 1,
  agent_id: 22851,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 123 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975c868
  id: 72819,
  user_id: 1,
  agent_id: 22851,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 123 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975c700
  id: 72694,
  user_id: 1,
  agent_id: 22886,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 158 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975c548
  id: 72693,
  user_id: 1,
  agent_id: 22886,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 158 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975c318
  id: 72754,
  user_id: 1,
  agent_id: 22952,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 224 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975c1d8
  id: 72753,
  user_id: 1,
  agent_id: 22952,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 224 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:22 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b975c070
  id: 72793,
  user_id: 1,
  agent_id: 23169,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 441 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b9757e30
  id: 72792,
  user_id: 1,
  agent_id: 23169,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 441 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b9757bd8
  id: 72849,
  user_id: 1,
  agent_id: 23175,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 447 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b9757958
  id: 72850,
  user_id: 1,
  agent_id: 23175,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 447 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:23 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b9757688
  id: 72676,
  user_id: 1,
  agent_id: 23225,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 497 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  expires_at: nil>,
 #<Event:0x007fd9b9757548
  id: 72679,
  user_id: 1,
  agent_id: 23225,
  lat: nil,
  lng: nil,
  payload: {"message"=>"Agent: 497 -It looks like 'rain' tomorrow in "},
  created_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  updated_at: Mon, 02 May 2016 13:47:21 PDT -07:00,
  expires_at: nil>]

@TildeWill
Contributor Author

I'm also curious whether someone else can reproduce the "extra events" I'm seeing. Maybe I just have a bad configuration locally?

@dsander
Collaborator

dsander commented May 7, 2016

I was able to reproduce it in the development environment. After setting config.cache_classes = true the duplicate events went away, so I don't think this is an issue in production. Not sure what we can do about it; setting Rails.configuration.cache_classes = true in script/delayed_job does not work.

@dsander
Collaborator

dsander commented May 7, 2016

Spoke too soon: duplicates are still happening. Really strange.

@TildeWill
Contributor Author

Good to know I'm not alone. Maybe it's worth noting that I'm using postgres (PostgreSQL) 9.4.5 as my database; I haven't tried the same experiment against a MySQL database.

@TildeWill
Contributor Author

When I comment out these lines from config/initializers/delayed_job.rb, everything works as expected (501 events consistently):

Delayed::Backend::ActiveRecord.configure do |config|
  config.reserve_sql_strategy = :default_sql
end

They seem to have been added for #984, which was in response to #977 and is tied to collectiveidea/delayed_job_active_record#111.

TildeWill added a commit to omniscopeio/huginn that referenced this issue May 7, 2016
- huginn#1454
- Now workers can be scaled independently on Heroku to address extra load
@dsander
Collaborator

dsander commented May 7, 2016

Hmm, maybe we can switch back to the optimized version for postgres; I suspect we would still see the deadlocks on mysql.
I suspected a race condition that could still happen when the manual event propagation is still running while the scheduled propagation job starts, but that would mean more than 500 jobs would be enqueued, which I did not notice in my tests.

@TildeWill
Contributor Author

As mentioned above, there were times when I saw > 1000 Events for what should have been 500 new events.

@TildeWill
Contributor Author

I think using the optimized version for postgres makes sense. Is switching on the adapter the best option?
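
As an illustration of what switching on the adapter could look like in config/initializers/delayed_job.rb (just a sketch of the idea, not a tested change): keep delayed_job's default optimized reserve strategy on postgres and only fall back to :default_sql on mysql, where the deadlocks from #977 were seen.

# config/initializers/delayed_job.rb (sketch)
if ActiveRecord::Base.connection.adapter_name =~ /mysql/i
  Delayed::Backend::ActiveRecord.configure do |config|
    # :default_sql works around the MySQL deadlocks; on Postgres the default
    # (optimized) strategy avoids the duplicate job pickups described above
    config.reserve_sql_strategy = :default_sql
  end
end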

TildeWill added a commit to omniscopeio/huginn that referenced this issue May 15, 2016
- Now workers can be scaled independently on Heroku to address extra load
Fixes huginn#1454
cantino pushed a commit that referenced this issue May 16, 2016
…1501)

- Now workers can be scaled independently on Heroku to address extra load
Fixes #1454
@cantino
Member

cantino commented May 16, 2016

Thanks for figuring this out!
