m3coordinator seems very sensitive to latency to m3dbnodes #1184

Closed
BertHartm opened this issue Nov 16, 2018 · 6 comments
Labels: area:coordinator, T: Optimization, T: Perf

Comments

@BertHartm
Contributor

I apologize that I can't give very hard numbers for this, and the setup that caused it no longer exists, but I think it should be easy to reproduce.

When I moved my cluster to a new vnet, there was additional latency between the two vnets. This meant that as I added new nodes and removed old ones, the average latency from m3coordinator (which stayed on the old vnet) to the nodes it was writing to went up.
Intra-vnet latency seemed to be about 0.8ms, while cross-vnet latency was 5-15ms (usually toward the lower end, but less stable).
m3coordinator seemed to fall off a cliff performance-wise once I was at 6 nodes on the new vnet and 3 on the old. I was migrating by isolation group, so the 3 groups had 0, 1, and 2 old boxes respectively, and I had just gone from 0, 2, and 2.
I was doing about 40k samples/s with the default timeouts from Prometheus, and m3coordinator was no longer able to keep up, which caused Prometheus to OOM and cycle. I was seeing write latencies of p50 ~2s, p90 ~8s, and p99 maxed out at 10s. I know from other setups that I can do 400k samples/s when I'm not going cross-vnet.
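
For reference, Prometheus was pointed at the coordinator's remote write endpoint with an essentially stock remote_write block, roughly like the sketch below (the hostname is a placeholder rather than my actual config, and no timeouts were overridden):

remote_write:
  - url: "http://m3coordinator.old-vnet.example:7201/api/v1/prom/remote/write"
    # no remote_timeout override here; the Prometheus defaults were in effect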

Pointing Prometheus at the m3coordinator in the new vnet made the problem disappear.

I'll try to reproduce this later by running m3coordinator as a sidecar to a Prometheus that's not in the same vnet, and will provide more detail then if I can, but I wanted to get this on record.

@BertHartm BertHartm changed the title m3coordinator is very sensitive to latency to m3dbnodes m3coordinator seems very sensitive to latency to m3dbnodes Nov 16, 2018
@richardartoul
Contributor

@BertHartm Are you running the latest release of M3?

@BertHartm
Contributor Author

I've got c27cff0 deployed. I believe I also tried running a build from this week (maybe a2fab37?).

@richardartoul
Contributor

@BertHartm Do you mind adding this to your m3coordinator config?

writeWorkerPoolPolicy:
  grow: true
  size: 1024
  shards: 64
  killProbability: 0.01

We realized that the worker pool for writing from m3coordinator -> M3DB might not behave very nicely with the default settings. If that improves your situation, we'll update the defaults.
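
For context, here's my rough reading of what those knobs do (the comments are my understanding of the pool behavior, not authoritative docs):

writeWorkerPoolPolicy:
  grow: true            # allow the pool to spawn workers beyond `size` instead of blocking writes
  size: 1024            # baseline number of concurrent write workers
  shards: 64            # shard the pool to reduce lock contention
  killProbability: 0.01 # chance a worker is retired after finishing a task, so the pool can shrink back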

@richardartoul richardartoul added the T: Perf, T: Optimization, and area:coordinator labels Nov 16, 2018
@BertHartm
Contributor Author

I spun up a similar setup, and this didn't help. I think it may have slowed the degradation down a bit, but I can't actually justify that from the numbers I got.

@richardartoul
Contributor

I was really hoping that would fix it... OK, we may need to spin up some kind of high-latency environment on our end so we can debug more thoroughly. If the change is really that dramatic, maybe I can just inject some synthetic latency...
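
For instance, something along these lines with tc/netem on one of the hosts should approximate the cross-vnet delay described above (the interface name is an assumption; adjust for the actual environment):

# add ~10ms of delay with 5ms of jitter on eth0
sudo tc qdisc add dev eth0 root netem delay 10ms 5ms

# remove the rule when done
sudo tc qdisc del dev eth0 root netem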

@xkfen

xkfen commented Feb 26, 2020

Same problem here.
My node info: 4 cores, 8G memory.

My m3coordinator pod resources are the following:

resources:
  limits:
    cpu: "1"
    memory: 4Gi
  requests:
    cpu: "1"
    memory: 4Gi

I added the following to my coordinator configuration:

writeWorkerPoolPolicy:
  grow: true
  size: 1024
  shards: 64
  killProbability: 0.01

and I still see m3coordinator getting OOM killed.

Any help? Thanks.
