
Node LRU Dispatcher stops dispatching tasks to connected nodes #24

Closed
dkataskin opened this issue Oct 5, 2017 · 6 comments

dkataskin commented Oct 5, 2017

Hi, I'm not sure if this is a bug or if I'm doing something wrong. In my case I create a global queue and connect two nodes with 10 workers on each, and then I see the following picture: node 2 receives only 10 tasks, and after that only node 1 executes tasks. The scheduler that sends tasks to the queue for execution runs on node 1.

I prepared a test project to demonstrate the behaviour: https://github.com/dkataskin/honeydew_nodelru

How to run it: start two nodes and connect them. Node names should follow the format "node{num}@127.0.0.1", since only the node named "node1@127.0.0.1" starts the scheduler. You will see that the second node executes only 10 tasks and then stops processing further jobs. The job id is a monotonically increasing integer.
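For context, here is a minimal sketch of what the setup boils down to. Module and function names are hypothetical (the real ones are in the linked repo), and this assumes Honeydew's `async/2` enqueue API together with the `{:global, :test_queue}` queue name used further below:

```elixir
defmodule TestWorker do
  require Logger

  # Each job just logs which node it ran on, matching the log lines below.
  def work(id) do
    Logger.debug("#{node()}: processed job id=#{id}")
  end
end

# The scheduler, running on node1 only, enqueues one job per second
# with a monotonically increasing integer id, roughly:
#
#   {:work, [id]} |> Honeydew.async({:global, :test_queue})
```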

More instructions are in the repository's README. Here are the console logs, edited for clarity:

Node 1:

11:52:05.962 [info]  app starting...
11:52:05.971 [info]  test scheduler is starting...
11:52:05.971 [info]  app started...

11:52:05.973 [debug] node1@127.0.0.1: processed job id=1
11:52:06.972 [debug] node1@127.0.0.1: processed job id=2
11:52:07.973 [debug] node1@127.0.0.1: processed job id=3

...

11:52:36.001 [debug] node1@127.0.0.1: processed job id=31

11:52:36.621 [info]  [Honeydew] Connection to node2@127.0.0.1 established, looking for workers...

11:52:37.002 [debug] node1@127.0.0.1: processed job id=32
11:52:39.004 [debug] node1@127.0.0.1: processed job id=34
11:52:41.006 [debug] node1@127.0.0.1: processed job id=36
11:52:43.008 [debug] node1@127.0.0.1: processed job id=38
11:52:45.010 [debug] node1@127.0.0.1: processed job id=40
11:52:47.012 [debug] node1@127.0.0.1: processed job id=42
11:52:49.014 [debug] node1@127.0.0.1: processed job id=44
11:52:51.016 [debug] node1@127.0.0.1: processed job id=46
11:52:53.018 [debug] node1@127.0.0.1: processed job id=48
11:52:55.020 [debug] node1@127.0.0.1: processed job id=50
11:52:57.022 [debug] node1@127.0.0.1: processed job id=52

11:52:58.023 [debug] node1@127.0.0.1: processed job id=53
11:52:59.024 [debug] node1@127.0.0.1: processed job id=54
11:53:00.025 [debug] node1@127.0.0.1: processed job id=55
...

Notice that 10 jobs from the interval 32-52 were processed on the just-joined node 2, and all further tasks were processed on node 1.

Node 2:

11:52:27.193 [info]  app starting...
11:52:27.203 [info]  app started...

iex(node2@127.0.0.1)1> Node.connect(:'node1@127.0.0.1')

11:52:36.621 [info]  [Honeydew] Connection to node1@127.0.0.1 established, looking for workers...

11:52:38.004 [debug] node2@127.0.0.1: processed job id=33
11:52:40.006 [debug] node2@127.0.0.1: processed job id=35
11:52:42.008 [debug] node2@127.0.0.1: processed job id=37
11:52:44.009 [debug] node2@127.0.0.1: processed job id=39
11:52:46.012 [debug] node2@127.0.0.1: processed job id=41
11:52:48.013 [debug] node2@127.0.0.1: processed job id=43
11:52:50.015 [debug] node2@127.0.0.1: processed job id=45
11:52:52.018 [debug] node2@127.0.0.1: processed job id=47
11:52:54.019 [debug] node2@127.0.0.1: processed job id=49
11:52:56.022 [debug] node2@127.0.0.1: processed job id=51

...

Honeydew.status({:global, :test_queue})
%{queue: %{count: 0, in_progress: 0, suspended: false},
  workers: %{#PID<18569.165.0> => nil, #PID<0.165.0> => nil,
    #PID<18569.166.0> => nil, #PID<0.166.0> => nil, #PID<18569.167.0> => nil,
    #PID<0.167.0> => nil, #PID<18569.168.0> => nil, #PID<0.168.0> => nil,
    #PID<18569.169.0> => nil, #PID<0.169.0> => nil, #PID<18569.170.0> => nil,
    #PID<0.170.0> => nil, #PID<18569.171.0> => nil, #PID<0.171.0> => nil,
    #PID<18569.172.0> => nil, #PID<0.172.0> => nil, #PID<18569.173.0> => nil,
    #PID<0.173.0> => nil, #PID<18569.174.0> => nil, #PID<0.174.0> => nil}}
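For what it's worth, the status output above shows all 20 workers registered and idle: 10 local and 10 remote (the pids printed as #PID<18569.x.0>). A quick way to group the registered worker pids by owning node, assuming the `:workers` map shape shown above:

```elixir
# node/1 returns the node a pid lives on, so this splits the
# registered workers into one list per cluster node.
{:global, :test_queue}
|> Honeydew.status()
|> Map.fetch!(:workers)
|> Map.keys()
|> Enum.group_by(&node/1)
```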

koudelka commented Oct 5, 2017

Hey @dkataskin,

Thanks for the really excellent demonstration of the issue; that makes it a lot easier for me to figure out what's going on. :)

There's definitely a bug here, and it has to do with running multiple queue processes. Each queue process maintains its own dispatcher state, and it looks like there's an issue with node1 stealing node2's workers out from under it when the two machines connect. I'll have to have a look at that.

In the meantime, you can solve the issue by just running a single queue process per queue name in the cluster.

I'm in the middle of a big effort to fix up the distributed aspect of honeydew, wherein multiple queue processes will behave properly.
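For anyone who needs the workaround in the meantime, here is a rough sketch of "a single queue process per queue name in the cluster": only one node starts the global queue process, and every node starts workers against it. The module name is hypothetical, and this assumes the `Honeydew.start_queue/1` / `Honeydew.start_workers/3` style API from newer releases; an older Honeydew version would use the equivalent supervision-spec helpers instead.

```elixir
defmodule HoneydewNodelru.Setup do
  @queue {:global, :test_queue}

  def start do
    # Only node1 starts the queue process, so the cluster has exactly one
    # dispatcher holding the worker state.
    if node() == :"node1@127.0.0.1" do
      Honeydew.start_queue(@queue)
    end

    # Every node (node1 and node2 alike) contributes 10 workers
    # to the same global queue.
    Honeydew.start_workers(@queue, TestWorker, num: 10)
  end
end
```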


koudelka commented Dec 7, 2017

closing due to inactivity.

koudelka closed this as completed Dec 7, 2017
dkataskin (Author) commented

Hi @koudelka, has it been fixed?


koudelka commented Dec 8, 2017

hey @dkataskin, sorry, i closed the issue because i think i was incorrectly expecting you to let me know if running a single queue was working out for you.

i'm still working on the larger effort, i don't expect to have that done for another month or two though.

koudelka reopened this Dec 8, 2017

koudelka commented Nov 5, 2018

hey @dkataskin, this should be fixed now. want to give it a shot?

koudelka commented

closing…
