
Node LRU Dispatcher stops dispatching tasks to connected nodes #24

Closed
dkataskin opened this issue Oct 5, 2017 · 6 comments

dkataskin commented Oct 5, 2017

Hi, I'm not sure if this is a bug or if I'm doing something wrong. In my case I create a global queue and connect two nodes with 10 workers on each, and then I see the following picture: node 2 receives only 10 tasks, and after that only node 1 executes tasks. The scheduler that sends tasks to the queue for execution runs on node 1.

I prepared a test project to demonstrate the behaviour: https://github.com/dkataskin/honeydew_nodelru

How to run it: start two nodes and connect them. Node names should follow the format "node{num}@127.0.0.1", since only the node named "node1@127.0.0.1" starts the scheduler. You will see that the second node executes only 10 tasks and then stops processing further jobs. The job id is a monotonically increasing integer.
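For context, here is a minimal sketch of what the setup boils down to. Module and function names are hypothetical (the real ones are in the linked repo), and this assumes Honeydew's `async/2` enqueue API together with the `{:global, :test_queue}` queue name used further below:

```elixir
defmodule TestWorker do
  require Logger

  # Each job just logs which node it ran on, matching the log lines below.
  def work(id) do
    Logger.debug("#{node()}: processed job id=#{id}")
  end
end

# The scheduler, running on node1 only, enqueues one job per second
# with a monotonically increasing integer id, roughly:
#
#   {:work, [id]} |> Honeydew.async({:global, :test_queue})
```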

More instructions are in the repository's README. Here are the console logs, edited for clarity:

Node 1:

11:52:05.962 [info]  app starting...
11:52:05.971 [info]  test scheduler is starting...
11:52:05.971 [info]  app started...

11:52:05.973 [debug] node1@127.0.0.1: processed job id=1
11:52:06.972 [debug] node1@127.0.0.1: processed job id=2
11:52:07.973 [debug] node1@127.0.0.1: processed job id=3

...

11:52:36.001 [debug] node1@127.0.0.1: processed job id=31

11:52:36.621 [info]  [Honeydew] Connection to node2@127.0.0.1 established, looking for workers...

11:52:37.002 [debug] node1@127.0.0.1: processed job id=32
11:52:39.004 [debug] node1@127.0.0.1: processed job id=34
11:52:41.006 [debug] node1@127.0.0.1: processed job id=36
11:52:43.008 [debug] node1@127.0.0.1: processed job id=38
11:52:45.010 [debug] node1@127.0.0.1: processed job id=40
11:52:47.012 [debug] node1@127.0.0.1: processed job id=42
11:52:49.014 [debug] node1@127.0.0.1: processed job id=44
11:52:51.016 [debug] node1@127.0.0.1: processed job id=46
11:52:53.018 [debug] node1@127.0.0.1: processed job id=48
11:52:55.020 [debug] node1@127.0.0.1: processed job id=50
11:52:57.022 [debug] node1@127.0.0.1: processed job id=52

11:52:58.023 [debug] node1@127.0.0.1: processed job id=53
11:52:59.024 [debug] node1@127.0.0.1: processed job id=54
11:53:00.025 [debug] node1@127.0.0.1: processed job id=55
...

Notice that 10 jobs from the interval 32-52 were processed on the just-joined node 2, and all further tasks were processed on node 1.

Node 2:

11:52:27.193 [info]  app starting...
11:52:27.203 [info]  app started...

iex(node2@127.0.0.1)1> Node.connect(:'node1@127.0.0.1')

11:52:36.621 [info]  [Honeydew] Connection to node1@127.0.0.1 established, looking for workers...

11:52:38.004 [debug] node2@127.0.0.1: processed job id=33
11:52:40.006 [debug] node2@127.0.0.1: processed job id=35
11:52:42.008 [debug] node2@127.0.0.1: processed job id=37
11:52:44.009 [debug] node2@127.0.0.1: processed job id=39
11:52:46.012 [debug] node2@127.0.0.1: processed job id=41
11:52:48.013 [debug] node2@127.0.0.1: processed job id=43
11:52:50.015 [debug] node2@127.0.0.1: processed job id=45
11:52:52.018 [debug] node2@127.0.0.1: processed job id=47
11:52:54.019 [debug] node2@127.0.0.1: processed job id=49
11:52:56.022 [debug] node2@127.0.0.1: processed job id=51

...

Honeydew.status({:global, :test_queue})
%{queue: %{count: 0, in_progress: 0, suspended: false},
  workers: %{#PID<18569.165.0> => nil, #PID<0.165.0> => nil,
    #PID<18569.166.0> => nil, #PID<0.166.0> => nil, #PID<18569.167.0> => nil,
    #PID<0.167.0> => nil, #PID<18569.168.0> => nil, #PID<0.168.0> => nil,
    #PID<18569.169.0> => nil, #PID<0.169.0> => nil, #PID<18569.170.0> => nil,
    #PID<0.170.0> => nil, #PID<18569.171.0> => nil, #PID<0.171.0> => nil,
    #PID<18569.172.0> => nil, #PID<0.172.0> => nil, #PID<18569.173.0> => nil,
    #PID<0.173.0> => nil, #PID<18569.174.0> => nil, #PID<0.174.0> => nil}}
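For what it's worth, the status output above shows all 20 workers registered and idle: 10 local and 10 remote (the pids printed as #PID<18569.x.0>). A quick way to group the registered worker pids by owning node, assuming the `:workers` map shape shown above:

```elixir
# node/1 returns the node a pid lives on, so this splits the
# registered workers into one list per cluster node.
{:global, :test_queue}
|> Honeydew.status()
|> Map.fetch!(:workers)
|> Map.keys()
|> Enum.group_by(&node/1)
```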

koudelka commented Oct 5, 2017

Hey @dkataskin,

Thanks for the really excellent demonstration of the issue; that makes it a lot easier for me to figure out what's going on. :)

There's definitely a bug here, and it has to do with running multiple queue processes. Each queue process maintains its own dispatcher state, and it looks like there's an issue with node1 stealing node2's workers out from under it when the two machines connect. I'll have to have a look at that.

In the meantime, you can solve the issue by just running a single queue process per queue name in the cluster.

I'm in the middle of a big effort to fix up the distributed aspect of honeydew, wherein multiple queue processes will behave properly.
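For anyone who needs the workaround in the meantime, here is a rough sketch of "a single queue process per queue name in the cluster": only one node starts the global queue process, and every node starts workers against it. The module name is hypothetical, and this assumes the `Honeydew.start_queue/1` / `Honeydew.start_workers/3` style API from newer releases; an older Honeydew version would use the equivalent supervision-spec helpers instead.

```elixir
defmodule HoneydewNodelru.Setup do
  @queue {:global, :test_queue}

  def start do
    # Only node1 starts the queue process, so the cluster has exactly one
    # dispatcher holding the worker state.
    if node() == :"node1@127.0.0.1" do
      Honeydew.start_queue(@queue)
    end

    # Every node (node1 and node2 alike) contributes 10 workers
    # to the same global queue.
    Honeydew.start_workers(@queue, TestWorker, num: 10)
  end
end
```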


koudelka commented Dec 7, 2017

closing due to inactivity.

koudelka closed this as completed Dec 7, 2017
dkataskin (Author) commented

Hi @koudelka, has it been fixed?


koudelka commented Dec 8, 2017

hey @dkataskin, sorry, i closed the issue because i think i was incorrectly expecting you to let me know if running a single queue was working out for you.

i'm still working on the larger effort, i don't expect to have that done for another month or two though.

koudelka reopened this Dec 8, 2017

koudelka commented Nov 5, 2018

hey @dkataskin, this should be fixed now. want to give it a shot?

koudelka commented

closing…
