polling seems not to be working with jinad #1815

Closed
JoanFM opened this issue Jan 29, 2021 · 10 comments · Fixed by #1857
Labels
type/bug Something isn't working

Comments

JoanFM commented Jan 29, 2021

Describe the bug
In the same Pod, there seems to be only one Pea receiving all the load.

JoanFM added the type/bug and priority/critical labels Jan 29, 2021
hanxiao commented Jan 29, 2021

Can't reproduce with the following steps:

  1. run jinad
  2. run the following code
import numpy as np

from jina import Flow

# parallel=3 starts three Peas on the remote jinad; requests should be polled across them
with Flow().add(host='localhost:8000', parallel=3) as f:
    f.index(np.random.random([100000, 10]))
  3. wait for jinad's log; at termination you can see that the Peas have each received ~333 requests on average, which ≈ 100,000 / 100 (request_size) / 3
👻         DAEMON@23375[I]:127.0.0.1:64806 is disconnected
         pod0/3@23491[I]:recv ControlRequest  from ctl▸pod0/3/ZEDRuntime▸⚐
         pod0/3@23491[I]:#sent: 670 #recv: 335 sent_size: 4.0 MB recv_size: 3.9 MB
         pod0/3@23491[I]:no update since 2021-01-29 19:51:50, will not save. If you really want to save it, call "touch()" before "save()" to force saving
         pod0/3@23375[S]:terminated
👻       PeaStore@23375[S]:445d5011-a452-4d78-a9b6-889dee224370 is released from the store.
👻         DAEMON@23375[I]:127.0.0.1:64784 is disconnected
         pod0/2@23489[I]:recv ControlRequest  from ctl▸pod0/2/ZEDRuntime▸⚐
         pod0/2@23489[I]:#sent: 670 #recv: 335 sent_size: 4.0 MB recv_size: 3.9 MB
         pod0/2@23489[I]:no update since 2021-01-29 19:51:49, will not save. If you really want to save it, call "touch()" before "save()" to force saving
         pod0/2@23375[S]:terminated
👻       PeaStore@23375[S]:f7982669-54ab-4518-81b7-1472ded8f191 is released from the store.
👻         DAEMON@23375[I]:127.0.0.1:64762 is disconnected
         pod0/1@23487[I]:recv ControlRequest  from ctl▸pod0/1/ZEDRuntime▸⚐
         pod0/1@23487[I]:#sent: 666 #recv: 333 sent_size: 4.0 MB recv_size: 3.9 MB
         pod0/1@23487[I]:no update since 2021-01-29 19:51:49, will not save. If you really want to save it, call "touch()" before "save()" to force saving
         pod0/1@23375[S]:terminated

JoanFM commented Jan 30, 2021

Did you see them actually receiving IndexRequests? I do not remember what average results were being printed, but we did not see them receiving any data.

hanxiao commented Feb 1, 2021

Yes, they do. You can replicate my steps on your laptop and check it out. To understand the bug I need a reproducible example.

JoanFM commented Feb 2, 2021

It does not seem to be a problem

JoanFM closed this as completed Feb 2, 2021
JoanFM commented Feb 3, 2021

The problem has been seen again!

JoanFM reopened this Feb 3, 2021
JoanFM removed the priority/critical label Feb 3, 2021
JoanFM commented Feb 3, 2021

It seems that when the request arrived, no IDLE Pea existed.

Would it make sense, or is it possible, to randomize which Pea receives the request if none is idle? It seems to always be the first Pea.
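For illustration only, a minimal sketch of the idea being suggested here, not Jina's actual scheduler (pick_pea and its arguments are hypothetical names): prefer an IDLE Pea, and fall back to a random Pea instead of always the first one.

import random

# hypothetical helper, not Jina code: choose which Pea gets the next request
def pick_pea(peas, idle_peas):
    """peas: list of Pea ids; idle_peas: set of ids currently marked IDLE."""
    candidates = [p for p in peas if p in idle_peas]
    if candidates:
        return random.choice(candidates)
    # no Pea is idle: pick one at random instead of defaulting to peas[0]
    return random.choice(peas)

# e.g. pick_pea(['pea-1', 'pea-2', 'pea-3'], set()) no longer always returns 'pea-1'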

hanxiao commented Feb 3, 2021

I don't understand what the problem is. What is the code that can reproduce the problem?

JoanFM commented Feb 3, 2021

> I don't understand what the problem is. What is the code that can reproduce the problem?

I think it is not a problem per se. It is not easy to reproduce, but we were seeing a case where scheduling: load_balance and polling: any were leading to the data always being sent to the same shard.

But it could be due to the fact that by the time the next request arrived, no Pea was IDLE. I am not sure if this is a known behavior.

We did not manage to reproduce it consistently, but we see it often in our tests on AWS.
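
For context, a minimal sketch of the configuration being described, assuming the Flow.add API of that Jina release accepts polling and scheduling keyword arguments (treat the exact argument names and accepted values as assumptions):

import numpy as np

from jina import Flow

# assumed kwargs: polling='any' sends each request to a single Pea of the shard,
# scheduling='load_balance' is meant to dispatch to an IDLE Pea when one exists
with Flow().add(host='localhost:8000', parallel=3,
                polling='any', scheduling='load_balance') as f:
    f.index(np.random.random([100000, 10]))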

JoanFM commented Feb 3, 2021

Will try to debug further

JoanFM commented Feb 4, 2021

https://stackoverflow.com/questions/52278364/is-id-returns-the-actual-memory-address-in-cpython

The problem is that we are relying on id() to establish the identity of a ZMQlet, which can collide when working across different processes.
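
As a standalone illustration (not Jina code) of why id() is not a safe cross-process identity: in CPython it is just the object's memory address, so two distinct objects living in different processes can easily report the same value.

import multiprocessing


def report_id(q):
    # each child creates its own object; since id() is the memory address in
    # CPython, addresses from separate address spaces can coincide
    obj = object()
    q.put((multiprocessing.current_process().name, id(obj)))


if __name__ == '__main__':
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=report_id, args=(q,), name=f'proc-{i}')
             for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # the reported id() values frequently collide across processes
    print([q.get() for _ in procs])

A process-independent identifier, for example a uuid.uuid4() assigned at construction time, avoids this kind of collision.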

JoanFM linked a pull request Feb 4, 2021 that will close this issue