m3coordinator seems very sensitive to latency to m3dbnodes #1184
Comments
@BertHartm Are you running the latest release of M3?
@BertHartm Do you mind adding to this your m3coordinator config?
We realized that the worker pool for writing from m3coordinator -> M3DB might not behave very nicely with the default settings. If that improves your situation, we'll update the default.
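The pool tuning mentioned above presumably refers to the coordinator's write worker pool. A sketch of what that might look like in the m3coordinator YAML config follows; the `writeWorkerPoolPolicy` key and values here are from memory of the M3 docs and should be verified against the release you are running:

```yaml
# Hypothetical m3coordinator config fragment: let the write worker
# pool grow under load instead of blocking when all workers are busy.
writeWorkerPoolPolicy:
  grow: true   # allow the pool to expand beyond its baseline size
  size: 4096   # baseline number of write workers
```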
I spun up a similar situation, and this didn't help. I think it may have slowed down the burn a bit, but I can't actually justify that in the numbers I got.
I was really hoping that would fix it... OK, we may need to spin up some kind of high-latency environment on our end so we can debug more thoroughly. If the change is really that dramatic, maybe I can just inject some synthetic latency...
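One well-known way to inject synthetic latency on a Linux test box is `tc` with the `netem` qdisc. This is a generic kernel facility, not anything M3-specific; the interface name and delay values below are placeholders:

```shell
# Add 10ms +/- 5ms of delay to all egress traffic on eth0 (requires root).
tc qdisc add dev eth0 root netem delay 10ms 5ms

# ... run the m3coordinator -> M3DB write workload here ...

# Remove the synthetic delay again.
tc qdisc del dev eth0 root
```

Applying this on the coordinator host (or inside its network namespace) should roughly reproduce the cross-vnet RTTs described below without needing a second vnet.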
Same problem. My m3coordinator pod resources are the following: I added the following to my coordinator configuration: and I still got m3coordinator OOM-killed. Any help? Thanks
I apologize that I can't give very hard numbers for this, and the setup that caused it no longer exists, but I think it should be easy to reproduce.
When moving my cluster to a new vnet, there was additional latency between the vnets. This meant that as I added new nodes and removed old ones, the average latency from m3coordinator (which was still on the old vnet) to the nodes it was writing to went up.
Intra-vnet latencies seemed to be about 0.8 ms, and between the vnets it was 5-15 ms (usually on the lower end, but less stable).
m3coordinator seemed to fall off a cliff performance-wise when I was at 6 nodes on the new vnet and 3 on the old. I was migrating by isolation group, so the 3 groups had 0, 1, and 2 old boxes respectively, and I had just gone from 0, 2, and 2.
I was doing about 40k samples/s and using the default timeouts from prometheus, and m3coordinator was no longer able to keep up and so was causing prometheus to OOM and cycle. I was seeing latencies of p50 ~2s, p90 ~8s, and p99 was maxed out at 10s. I know from other setups that I can do 400k samples/s if I'm not cross vnet.
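A rough capacity model makes the cliff plausible: if the coordinator pushes writes through a fixed-size pool of synchronous workers, its throughput ceiling is roughly pool_size / RTT, so a ~12x RTT increase cuts the ceiling by ~12x. The pool size below is a made-up illustrative number, not M3's actual default:

```python
# Back-of-envelope: a fixed-size pool of blocking write workers caps
# throughput at pool_size / round_trip_time. Pool size and RTTs here
# are illustrative assumptions, not measured M3 values.

def max_writes_per_sec(pool_size: int, rtt_seconds: float) -> float:
    """Each worker completes at most one blocking write per round trip."""
    return pool_size / rtt_seconds

POOL = 100  # hypothetical worker pool size, purely illustrative

intra_vnet = max_writes_per_sec(POOL, 0.0008)  # ~0.8 ms RTT
cross_vnet = max_writes_per_sec(POOL, 0.010)   # ~10 ms RTT

print(f"intra-vnet ceiling: {intra_vnet:,.0f} writes/s")
print(f"cross-vnet ceiling: {cross_vnet:,.0f} writes/s")
```

If each incoming sample fans out to several replica writes (e.g. with three isolation groups), a ceiling that sat comfortably above 40k samples/s intra-vnet can drop below the workload cross-vnet, at which point queues grow and latency climbs until something OOMs.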
Pointing prometheus to the m3coordinator in the new vnet caused the problem to disappear.
I'll try to reproduce later by running m3coordinator as a sidecar to a prometheus that's not in the same vnet, and provide more examples then if I can, but I wanted to record this.