ingester does not hold the calculated load #7042

Alexlugovtsov · 2024-01-03T15:06:57Z

Describe the bug

Mimir is deployed on AWS ECS in microservices mode.
After running for some time, ingesters become unhealthy in the ring but are not removed from it
(the pods get restarted, and startup takes about 2-3 hours while the ingester very slowly loads the /data folder),
so write operations are suspended with the error "too many unhealthy ingesters in the ring".
Even while it is working, the distributor constantly logs warnings and errors:

ts=2024-01-03T13:45:28.051066285Z caller=pool.go:196 level=warn msg="removing ingester failing healthcheck" addr=172.16.148.85:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
ts=2024-01-03T13:45:28.272628175Z caller=logging.go:126 level=warn traceID=15187c2ee37d0666 msg="POST /api/v1/push (500) 2.005029849s Response: \"failed pushing to ingester: rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Encoding: snappy; Content-Length: 79109; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.44.0; X-Amzn-Trace-Id: Root=1-659564f6-4ece345b4ac42c6c7e80cbd3; X-Forwarded-For: 172.16.149.5; X-Forwarded-Port: 443; X-Forwarded-Proto: https; X-Prometheus-Remote-Write-Version: 0.1.0; "

However, the ingester at 172.16.148.85 is shown as ACTIVE on the distributor's ring status page.

To Reproduce

Deploy in microservices mode on AWS ECS:
1 distributor (8 CPU / 16 GB)
4 ingesters (8 CPU / 16 GB each)
Start command arguments (distributor first, then ingester):

        "-target=distributor",
        "-common.storage.backend", "s3",
        "-common.storage.s3.endpoint", "s3.eu-central-1.amazonaws.com",
        "-common.storage.s3.region", "eu-central-1",
        "-common.storage.s3.bucket-name", "${aws_s3_bucket.blocks.id}",
        "-blocks-storage.storage-prefix=blocks",
        "-blocks-storage.tsdb.dir=/data/ingester",
        "-auth.multitenancy-enabled=false",
        "-distributor.ingestion-rate-limit=1000000",
        "-distributor.ingestion-burst-size=10000000",
        "-memberlist.join=${var.resource_prefix}-memberlist-distributor",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-1",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-2",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-3",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-4",

        "-target=ingester",
        "-common.storage.backend", "s3",
        "-common.storage.s3.endpoint", "s3.eu-central-1.amazonaws.com",
        "-common.storage.s3.region", "eu-central-1",
        "-common.storage.s3.bucket-name", "${aws_s3_bucket.blocks.id}",
        "-blocks-storage.storage-prefix=blocks",
        "-blocks-storage.tsdb.dir=/data/ingester",
        "-auth.multitenancy-enabled=false",
        "-ingester.max-global-series-per-user=10000000",
        "-ingester.out-of-order-time-window=4h",
        "-memberlist.join=${var.resource_prefix}-memberlist-distributor",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-1",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-2",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-3",
        "-memberlist.join=${var.resource_prefix}-memberlist-ingester-4",

Generate a load of ~300,000 prometheus_tsdb_head_series from Prometheus.
According to the capacity plan, the following should be enough:
4 CPU (calculated: 3.36)
10 GB RAM (calculated: 8.5)
20 GB HDD (calculated: 16.8)
I simply doubled that to 8 CPU / 16 GB RAM / unlimited EFS storage for each ingester.

sum(prometheus_tsdb_head_series) = 336160
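
The figures in parentheses above look consistent with a rule of thumb of 1 CPU core, 2.5 GB of memory and 5 GB of disk per 300,000 in-memory series, applied after multiplying the head series count by a replication factor of 3. Both the ratios and the replication factor are assumptions here; neither is stated in this post. The function name is hypothetical, and this is only a sketch of the arithmetic:

```python
# Rough ingester sizing arithmetic. The per-300k-series ratios and the
# replication factor are assumptions, not values confirmed in this thread.
SERIES_PER_UNIT = 300_000
CPU_CORES_PER_UNIT = 1.0
MEMORY_GB_PER_UNIT = 2.5
DISK_GB_PER_UNIT = 5.0


def estimate_ingester_capacity(head_series: int, replication_factor: int = 3) -> dict:
    """Estimate total ingester resources, summed over all ingesters."""
    # Each incoming series is held in memory by `replication_factor` ingesters.
    in_memory_series = head_series * replication_factor
    units = in_memory_series / SERIES_PER_UNIT
    return {
        "cpu_cores": units * CPU_CORES_PER_UNIT,
        "memory_gb": units * MEMORY_GB_PER_UNIT,
        "disk_gb": units * DISK_GB_PER_UNIT,
    }


if __name__ == "__main__":
    estimate = estimate_ingester_capacity(336_160)
    for name, value in estimate.items():
        print(f"{name}: {value:.2f}")
```

With 336,160 head series this gives about 3.36 cores and 16.8 GB of disk, matching the quoted figures; memory comes out at about 8.4 GB rather than 8.5. If the ratios are cluster-wide totals, the per-ingester share would be a quarter of each value.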

Start Mimir (SHA or version): mimir:2.10.4
Perform Operations(Read/Write/Others)

Expected behavior

Ingesters handle the load

Environment

Infrastructure: AWS ECS
Deployment tool: terraform

Additional Context

Each ingester has an AWS EFS volume mounted at /data, and it is constantly growing: one of the four is currently at 1 TB, and the other three are just under 100 GB each (still well above the calculated capacity of 20 GB).
Throughput utilization on the EFS volumes is above 80%:

      mountPoints = [{
        sourceVolume  = "service-storage"
        containerPath = "/data"
        readOnly      = false
      }]

  volume {
    name = "service-storage"

    efs_volume_configuration {
      file_system_id          = aws_efs_file_system.mimir-1.id
      transit_encryption      = "ENABLED"
      transit_encryption_port = 2999
      authorization_config {
        access_point_id = aws_efs_access_point.mimir-1.id
        iam             = "ENABLED"
      }
    }
  }

My current assumptions are that the load does not suit an AWS ECS deployment of Mimir,
or that the EFS storage cannot handle the load,
or that I missed some configuration.
Please help :)

dimitarvdimitrov · 2024-01-04T08:23:27Z

Have you checked the logs of the ingester which has a filling disk?
Is it all 4 ingesters that have 2-3 hours startup time or only the one with 1TB consumed disk?

1 reply

Alexlugovtsov Jan 4, 2024
Author

The logs of the 1 TB ingester are attached below:
ingester-1TB-storage.log
I did not find anything in them that explains why this happens.
The 2-3 hour startup time applies to all ingesters with ~100 GB in the /data folder;
for the 1 TB ingester it is much longer, and 99% of the startup time is spent on a single type of event:

ts=2024-01-04T00:29:38.935373758Z caller=head.go:782 level=info user=anonymous msg="WAL segment loaded" segment=1294 maxSegment=1712
ts=2024-01-04T01:28:23.457680146Z caller=head.go:782 level=info user=anonymous msg="WAL segment loaded" segment=1712 maxSegment=1712

So loading WAL segments 1294 through 1712 took ~1 hour (00:29:38 to 01:28:23).
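
The two log lines above are enough to put a number on the replay rate. The sketch below parses them and reports the seconds spent per segment. It also derives an implied read throughput, but only under the assumption that every segment is a full 128 MiB file; the thread does not confirm that size, and real segments may be smaller. The helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches the "WAL segment loaded" lines quoted in this comment.
LINE_RE = re.compile(
    r'ts=(\S+) .*msg="WAL segment loaded" segment=(\d+) maxSegment=(\d+)'
)

# Assumption: every segment is a full 128 MiB file. Real segments may be smaller.
ASSUMED_SEGMENT_MIB = 128

FIRST = ('ts=2024-01-04T00:29:38.935373758Z caller=head.go:782 level=info '
         'user=anonymous msg="WAL segment loaded" segment=1294 maxSegment=1712')
LAST = ('ts=2024-01-04T01:28:23.457680146Z caller=head.go:782 level=info '
        'user=anonymous msg="WAL segment loaded" segment=1712 maxSegment=1712')


def parse_ts(ts: str) -> float:
    """Convert a UTC timestamp with nanosecond precision to Unix seconds."""
    # strptime cannot parse nine fractional digits, so handle them separately.
    whole, _, frac = ts.rstrip("Z").partition(".")
    base = datetime.strptime(whole, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return base.timestamp() + (float("0." + frac) if frac else 0.0)


def parse_line(line: str) -> tuple[float, int]:
    """Return (unix_seconds, segment_number) for one log line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("not a 'WAL segment loaded' line")
    return parse_ts(match.group(1)), int(match.group(2))


def replay_stats(first: str, last: str) -> dict:
    """Summarize replay progress between two 'WAL segment loaded' lines."""
    t0, s0 = parse_line(first)
    t1, s1 = parse_line(last)
    segments = s1 - s0
    seconds = t1 - t0
    return {
        "segments": segments,
        "seconds": seconds,
        "seconds_per_segment": seconds / segments,
        "assumed_mib_per_second": segments * ASSUMED_SEGMENT_MIB / seconds,
    }


if __name__ == "__main__":
    for name, value in replay_stats(FIRST, LAST).items():
        print(f"{name}: {value:.2f}")
```

For these two lines the result is 418 segments in roughly 3,525 seconds, or about 8.4 seconds per segment. Under the 128 MiB assumption that works out to roughly 15 MiB/s of sustained reads.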

Alexlugovtsov · 2024-01-05T14:57:36Z

I now think it is an EFS problem.
The AWS documentation on EFS performance states
that if the volume is below 100 GB, then after 72 min/day of burst load the throughput drops dramatically, to:
• Drive up to 15 MiBps read-only continuously (about 15.7 megabytes/sec)
• Drive up to 5 MiBps write-only continuously (about 5.2 megabytes/sec)
As a temporary solution I switched the volume to ECS ephemeral_storage to monitor the behavior, but of course the hot WAL data on the EFS volumes is stuck there.
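
To see what the continuous read limit means for startup, the following sketch estimates how long it takes to read a given amount of data at a fixed throughput. It treats "100 GB" as 100 GiB and assumes the whole directory is read sequentially at the 15 MiBps figure quoted above. Both are simplifications, so the result is closer to an upper bound than a prediction. The function name is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def hours_to_read(size_bytes: int, throughput_mib_per_s: float) -> float:
    """Hours needed to read `size_bytes` at a constant throughput in MiB/s."""
    if throughput_mib_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_bytes / (throughput_mib_per_s * MIB)
    return seconds / 3600


if __name__ == "__main__":
    # Volume sizes mentioned in the thread, read at the continuous 15 MiBps limit.
    for label, size in (("100 GiB", 100 * GIB), ("1 TiB", 1024 * GIB)):
        print(f"{label}: {hours_to_read(size, 15):.1f} h")
```

At 15 MiBps, 100 GiB takes about 1.9 hours and 1 TiB about 19.4 hours, which is in the same range as the 2-3 hour startups reported earlier in the thread.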

Alexlugovtsov · 2024-01-05T23:18:55Z

I changed throughput_mode from the default (bursting) to elastic, and the speed improved dramatically. This may be the final solution; I will monitor it for a few more days.

resource "aws_efs_file_system" "mimir" {
  throughput_mode  = "elastic"
}

nikhilo · 2026-05-22T17:04:56Z

Apologies for the tangential comment. @Alexlugovtsov I'm also trying to set up Mimir on AWS ECS Fargate, although I have not mounted any EFS volume on the tasks. Possibly because of this, I'm seeing metrics disappear for a while after every rolling restart of the service. Could you please share how you are using EFS as the WAL storage across 4 different ingester tasks?
