
Prometheus / Grafana: store data for longer times #402

Closed
FedericoCeratto opened this issue Nov 27, 2019 · 10 comments

@FedericoCeratto
Contributor

Investigate solutions for long term storage, ideally > 1y.

@hellais
Member

hellais commented Nov 27, 2019

@SuperQ do you have some tips on what we can do for this? Is it recommended to extend the Prometheus retention time from 15d to, say, 300d or more?

@SuperQ
Contributor

SuperQ commented Nov 27, 2019

Yes, it's totally fine to change the Prometheus retention to 365d. Things to consider:

  • Backups (Prometheus provides a snapshot API; a sketch follows below)
  • Capacity planning for storage
  • Capacity planning for memory
  • Queries that need to pull in a year of data
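On the backup point, a minimal sketch of what taking a snapshot looks like (assuming Prometheus was started with --web.enable-admin-api, listens on localhost:9090, and keeps its data under /var/lib/prometheus; adjust for the actual deployment):

# Ask Prometheus to create a snapshot of the current TSDB blocks
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
# {"status":"success","data":{"name":"20191127T120000Z-<hash>"}}

# The snapshot is a directory of hard-linked blocks; copy it off-host
rsync -a /var/lib/prometheus/snapshots/ backup-host:/backups/prometheus/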

The last two are probably the trickiest. It's easy enough for Prometheus to store the data, but depending on the index sizes and queries you want to run over a long period of time, you will need more memory on the Prometheus server to query all that data.

One of the things that can help a lot for this is to have recording rules that summarize the data you want to query over a long period of time. For example, if you have data scraped every 15 seconds, having a recording rule with a 1-minute interval that produces fewer metrics can save an order of magnitude at query time.

For example, node_cpu_seconds_total can have quite a lot of metrics, but if you only care about node-level CPU utilization, a recording rule leaves a lot fewer metrics to look at. A single node utilization series with a 1-minute recording interval needs about 525k samples to cover a full year (one sample per minute times 525,600 minutes in a year). That's a lot less than per-CPU, per-mode data at 15 seconds.

The default Prometheus query limiter is set to 50 million samples per query (--query.max-samples=50000000). To query all this data, you'll need about 100MiB of temporary memory for this very large query.
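For reference, both retention and the query limit are startup flags on current Prometheus 2.x; a sketch of what the invocation could look like (paths and values are illustrative, not a recommendation):

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=365d \
  --query.max-samples=50000000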

@hellais
Member

hellais commented Dec 5, 2019

@SuperQ thanks for the detailed response. What is your take on using something like https://thanos.io/?

It seems to support putting the metrics data into a different storage system, such as an object store, which could work well for us.

@SuperQ
Contributor

SuperQ commented Dec 5, 2019

Thanos or Cortex are both good options for external long-term storage. I use Thanos at work, as we're running in GCP and can use the GCS object storage, and we're mostly in one region so query latency/bandwidth to the individual Prometheus servers isn't a problem.

I don't remember what the Prometheus server setup is like for Ooni. Is there more than one? How widely distributed?

@hellais
Member

hellais commented Dec 5, 2019

I don't remember what the Prometheus server setup is like for Ooni. Is there more than one? How widely distributed?

We currently have a single host doing the scraping, metrics storage and charting.

See: https://github.com/ooni/sysadmin/blob/master/ansible/deploy-prometheus.yml

@SuperQ
Contributor

SuperQ commented Dec 6, 2019

With just a single host, adding Thanos would be overcomplicated and unnecessary. The Prometheus TSDB is just fine for that kind of setup. Things like Thanos are good for when you have many Prometheus servers spread over a large network.

@FedericoCeratto
Contributor Author

@SuperQ can you please clarify how this is achieved: "have recording rules that summarize the data you want to query over a long period of time"? I'm looking at various issues in the Prometheus repository and it seems that downsampling is not supported and out of scope.

@SuperQ
Contributor

SuperQ commented Dec 7, 2019

You can write a recording rule like this:

groups:
- name: CPU rules
  interval: 1m
  rules:
  # CPU in use ratio.
  - record: instance:node_cpu_utilization:ratio
    expr: >
      1 -
      avg without (cpu,mode) (
        rate(node_cpu_seconds_total{mode="idle"}[1m])
      )

This will create a single per-instance downsampled CPU utilization metric. The recorded series contains less granular data, making it easier to query over a long period of time. This works fine for small installations like the Ooni project.
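To make the payoff concrete, dashboards would then query the recorded series instead of the raw counter; for example (illustrative PromQL, assuming the rule above is loaded):

# expensive over a year: per-cpu, per-mode raw counter
1 - avg without (cpu,mode) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# cheap over a year: one pre-aggregated series per instance
avg_over_time(instance:node_cpu_utilization:ratio[1d])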

Once you get into multiple Prometheus servers with many millions of metrics, things like Thanos can be added to provide additional scalability.

@hellais
Member

hellais commented Dec 10, 2019

As an MVP for this cycle we will bump it up by 30 days, check how much the storage increases, and then re-assess.

@hellais
Member

hellais commented Jan 3, 2020

We have for the time being bumped it up to 30 days. Let's see how it goes.
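For the record, on current Prometheus 2.x that change amounts to the retention flag (the actual wiring lives in the deploy-prometheus.yml playbook linked above; the sketch below is illustrative only), and one way to watch the growth for the re-assessment is the built-in TSDB size metric:

# start Prometheus with 30 days of retention
prometheus ... --storage.tsdb.retention.time=30d

# on-disk block size over time, queryable from Prometheus itself
prometheus_tsdb_storage_blocks_bytes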

hellais closed this as completed Jan 3, 2020