m3coordinator seems very sensitive to latency to m3dbnodes #1184
Comments
@BertHartm Are you running the latest release of M3?
@BertHartm Do you mind adding to this your m3coordinator config?
We realized that the worker pool for writing from m3coordinator -> M3DB might not behave very nicely with the default settings. If that improves your situation, we'll update the default.
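The pool tuning mentioned above presumably refers to the coordinator's write worker pool. A sketch of what that might look like in the m3coordinator YAML config follows; the `writeWorkerPoolPolicy` key and values here are from memory of the M3 docs and should be verified against the release you are running:

```yaml
# Hypothetical m3coordinator config fragment: let the write worker
# pool grow under load instead of blocking when all workers are busy.
writeWorkerPoolPolicy:
  grow: true   # allow the pool to expand beyond its baseline size
  size: 4096   # baseline number of write workers
```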
I spun up a similar situation, and this didn't help. I think it may have slowed down the burn a bit, but I can't actually justify that in the numbers I got.
I was really hoping that would fix it... OK, we may need to spin up some kind of high-latency environment on our end so we can debug more thoroughly. If the change is really that dramatic, maybe I can just inject some synthetic latency...
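One well-known way to inject synthetic latency on a Linux test box is `tc` with the `netem` qdisc. This is a generic kernel facility, not anything M3-specific; the interface name and delay values below are placeholders:

```shell
# Add 10ms +/- 5ms of delay to all egress traffic on eth0 (requires root).
tc qdisc add dev eth0 root netem delay 10ms 5ms

# ... run the m3coordinator -> M3DB write workload here ...

# Remove the synthetic delay again.
tc qdisc del dev eth0 root
```

Applying this on the coordinator host (or inside its network namespace) should roughly reproduce the cross-vnet RTTs described below without needing a second vnet.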
Same problem. My m3coordinator pod resources are the following: I added the following to my coordinator configuration: and I still got m3coordinator OOM-killed. Any help? Thanks
I apologize that I can't give very hard numbers for this, and the setup that caused it no longer exists, but I think it should be easy to reproduce.
When moving my cluster to a new vnet, there was additional latency between the vnets. This meant that as I added new nodes and removed old ones, the average latency from m3coordinator (which was still on the old vnet) to the nodes it was writing to went up.
Intra-vnet latencies seemed to be about 0.8 ms, and between the vnets it was 5-15 ms (usually on the lower end, but less stable).
m3coordinator seemed to fall off a cliff performance-wise when I was at 6 nodes on the new vnet and 3 on the old. I was migrating by isolation group, so the 3 groups had 0, 1, and 2 old boxes respectively, and I had just gone from 0, 2, and 2.
I was doing about 40k samples/s and using the default timeouts from prometheus, and m3coordinator was no longer able to keep up and so was causing prometheus to OOM and cycle. I was seeing latencies of p50 ~2s, p90 ~8s, and p99 was maxed out at 10s. I know from other setups that I can do 400k samples/s if I'm not cross vnet.
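A rough capacity model makes the cliff plausible: if the coordinator pushes writes through a fixed-size pool of synchronous workers, its throughput ceiling is roughly pool_size / RTT, so a ~12x RTT increase cuts the ceiling by ~12x. The pool size below is a made-up illustrative number, not M3's actual default:

```python
# Back-of-envelope: a fixed-size pool of blocking write workers caps
# throughput at pool_size / round_trip_time. Pool size and RTTs here
# are illustrative assumptions, not measured M3 values.

def max_writes_per_sec(pool_size: int, rtt_seconds: float) -> float:
    """Each worker completes at most one blocking write per round trip."""
    return pool_size / rtt_seconds

POOL = 100  # hypothetical worker pool size, purely illustrative

intra_vnet = max_writes_per_sec(POOL, 0.0008)  # ~0.8 ms RTT
cross_vnet = max_writes_per_sec(POOL, 0.010)   # ~10 ms RTT

print(f"intra-vnet ceiling: {intra_vnet:,.0f} writes/s")
print(f"cross-vnet ceiling: {cross_vnet:,.0f} writes/s")
```

If each incoming sample fans out to several replica writes (e.g. with three isolation groups), a ceiling that sat comfortably above 40k samples/s intra-vnet can drop below the workload cross-vnet, at which point queues grow and latency climbs until something OOMs.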
Pointing prometheus to the m3coordinator in the new vnet caused the problem to disappear.
I'll try to reproduce later by running m3coordinator as a sidecar to a prometheus that's not in the same vnet, and provide more examples then if I can, but I wanted to record this.