ingester does not hold the calculated load #7042
Currently I think it is an EFS problem.
I changed throughput_mode from the default (bursting) to elastic, and the speed improved dramatically. This may be the final solution; I will monitor for a few more days.
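For anyone making the same change through the AWS API rather than an infrastructure-as-code setting, here is a minimal sketch using the boto3 EFS client's `update_file_system` call. The helper name `switch_to_elastic` and the file system ID in the usage comment are hypothetical, and the thread does not say which tool was actually used to make the change.

```python
def switch_to_elastic(efs_client, file_system_id):
    """Ask EFS to move one file system to elastic throughput mode.

    efs_client is expected to be a boto3 EFS client, e.g. boto3.client("efs").
    Returns the throughput mode reported in the API response, if any.
    """
    response = efs_client.update_file_system(
        FileSystemId=file_system_id,
        ThroughputMode="elastic",
    )
    return response.get("ThroughputMode")


# Usage (requires boto3 and AWS credentials; the ID below is a placeholder):
#   import boto3
#   switch_to_elastic(boto3.client("efs"), "fs-0123456789abcdef0")
```

The client is passed in as an argument so the helper can be exercised with a stub before it is pointed at a real file system.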
Apologies for the tangential comment. @Alexlugovtsov I'm also trying to set up Mimir on AWS ECS Fargate, although I have not mounted any EFS volumes on the tasks. Possibly because of this, I'm seeing metrics disappear for a while after every rolling restart of the service. Can you please share how you are using EFS as the WAL storage on 4 different ingester tasks?
Describe the bug
Mimir is deployed on AWS ECS in microservices mode.
After running for some time, the ingesters become unhealthy in the ring but are not removed from it
(the tasks get restarted, and startup takes about 2-3 hours while the ingester very slowly loads the /data folder),
so write operations fail with the error "too many unhealthy ingesters in the ring".
While it is running, the distributor constantly logs warnings and errors:
However, the ingester at 172.16.148.85 is shown as ACTIVE on the distributor's ring status page.
To Reproduce
Deploy in microservices mode on AWS ECS
1 distributor (8CPU/16GB)
4 ingesters (8CPU/16GB each)
microservice start command
Drive the load to ~300,000 prometheus_tsdb_head_series from Prometheus
According to the capacity plan, this should be enough:
4 CPU (3.36)
10 GB RAM (8.5)
20 GB HDD (16.8)
I have doubled that capacity to 8 CPU / 16 GB RAM / unlimited EFS storage for each ingester
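A rough sketch of the arithmetic that could be behind these figures. It assumes the ingester ratios from Mimir's capacity planning guidance (about 1 CPU core, 2.5 GB of memory, and 5 GB of disk per 300,000 in-memory series) and a replication factor of 3. Neither the ratios nor the replication factor are stated in this thread, so treat them as assumptions and check them against the current documentation. The function name and the input series count are illustrative only.

```python
# Assumed planning ratios per 300,000 in-memory series (not taken from this thread).
SERIES_PER_UNIT = 300_000
CPU_CORES_PER_UNIT = 1.0
MEMORY_GB_PER_UNIT = 2.5
DISK_GB_PER_UNIT = 5.0


def ingester_capacity(active_series, replication_factor=3):
    """Estimate cluster-wide ingester resources for a given active series count."""
    # Each series is held in memory by `replication_factor` ingesters.
    in_memory_series = active_series * replication_factor
    units = in_memory_series / SERIES_PER_UNIT
    return {
        "in_memory_series": in_memory_series,
        "cpu_cores": units * CPU_CORES_PER_UNIT,
        "memory_gb": units * MEMORY_GB_PER_UNIT,
        "disk_gb": units * DISK_GB_PER_UNIT,
    }


if __name__ == "__main__":
    # Illustrative input: about 336,000 active series.
    for name, value in ingester_capacity(336_000).items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, about 336,000 active series lands close to the parenthesised figures above. Note that the disk ratio covers steady-state ingester data only, so a /data volume far larger than the estimate suggests something else is accumulating there.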
Expected behavior
The ingesters handle the load.
Environment
Additional Context
Each ingester has an AWS EFS volume mounted at /data, and it is constantly growing: one of the four is currently at 1 TB, while the other three are just under 100 GB each (still well above the calculated capacity of 20 GB).
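The volume sizes matter for throughput when a file system is in EFS bursting mode, because the baseline rate there scales with the amount of data stored. A minimal sketch of that relationship, assuming the documented baseline of 50 KiB/s per GiB of Standard-class storage and treating the sizes above as GiB; burst credits, read/write metering differences, and any minimum baseline are deliberately ignored, and the function name is illustrative.

```python
# Assumed EFS bursting-mode baseline: 50 KiB/s per GiB stored (Standard class).
BASELINE_KIB_PER_SEC_PER_GIB = 50


def bursting_baseline_mibps(stored_gib):
    """Approximate sustained baseline throughput in MiB/s for a bursting-mode volume."""
    return stored_gib * BASELINE_KIB_PER_SEC_PER_GIB / 1024


if __name__ == "__main__":
    # Sizes roughly matching the volumes described above.
    for size_gib in (20, 100, 1024):
        print(f"{size_gib} GiB stored: {bursting_baseline_mibps(size_gib):.2f} MiB/s baseline")
```

Under these assumptions a volume of around 100 GiB sustains only a few MiB/s once burst credits are spent, which would be consistent with very slow loading of /data after a restart.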

Throughput utilization on the EFS volumes is above 80%:
Currently I assume that either the load is not a good fit for running Mimir on AWS ECS,
or the EFS storage cannot handle the load,
or I missed some configuration elements.
Please help :)