
[docs][proposal] Scaled Prometheus Pipeline #2118

Merged
merged 3 commits into magma:master on Aug 4, 2020

Conversation

Scott8440
Contributor

Summary

Design doc discussing how to scale prometheus with the end goal of improving query times and supporting increased capacity.

Test Plan

N/A

@Scott8440 Scott8440 marked this pull request as ready for review July 29, 2020 17:38
@xjtian xjtian self-requested a review July 30, 2020 01:04
@xjtian xjtian added the type: proposal Proposals and design documents label Jul 30, 2020
(cherry picked from commit c91e8ec)
Signed-off-by: Scott8440 <scott8440@gmail.com>
…of object storage based optimizations

Signed-off-by: Scott8440 <scott8440@gmail.com>
(cherry picked from commit a3e67d8)
@xjtian (Contributor) left a comment:

Reorganize the doc a bit for me please:

  • Problem statement: as-is
  • Solution proposal: Thanos with a single prom server (or possibly HA) and object storage. Elaborate on the currently proposed architecture on the push and query sides - what will you deploy, what talks to what, and the data flow for timeseries from the edge. Describe what object storage is and what our options for it are for public cloud and on-prem deployments
  • Implementation details: query-side
  • Implementation details: push-side


Object storage will allow us to store only a few hours of metrics on the Prometheus server itself (potentially keeping everything in memory) and export older metrics to object storage elsewhere. For example, on an AWS deployment, metrics would be stored in S3.

We will then deploy multiple Querier components behind a load balancer, configured to talk to both the Prometheus server and the object storage. This shifts the compute and I/O load away from the Prometheus server to the stateless Querier components, which can be trivially scaled horizontally to handle increased query loads.
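As a sketch of what the object storage wiring might look like, a minimal Thanos object storage configuration for S3 could be written as follows. The bucket name and endpoint region here are hypothetical placeholders, not values from the proposal:

```shell
# Hypothetical: write a minimal Thanos objstore config pointing at S3.
# Thanos components (sidecar, store gateway) take this file via
# --objstore.config-file.
cat > bucket.yml <<'EOF'
type: S3
config:
  bucket: "magma-metrics"                  # assumed bucket name
  endpoint: "s3.us-east-1.amazonaws.com"   # assumed region endpoint
  # Credentials typically come from the environment or an IAM role.
EOF
```

On-prem deployments could point the same config shape at any S3-compatible store (e.g. MinIO) by changing the endpoint.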
> Review comment (Contributor): increased


With this setup, we only need to deploy the Thanos `sidecar` and multiple `Querier` components, along with object storage, to achieve faster queries.
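As a rough sketch of the push- and query-side wiring described above (the flags are real Thanos options, but the hostnames, ports beyond the defaults, and file paths are assumptions for illustration):

```
# Push side: the Thanos sidecar runs next to the Prometheus server,
# exposing its data over the Store API and uploading completed TSDB
# blocks to object storage.
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=bucket.yml

# Query side: stateless Queriers (behind a load balancer) fan out to
# the sidecar for recent data and to object storage for older blocks.
thanos query \
  --http-address=0.0.0.0:10902 \
  --store=sidecar-host:10901 \
  --store=store-gateway-host:10901
```

Because the Querier holds no state, adding capacity is just deploying more `thanos query` instances behind the load balancer.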

### Cortex
> Review comment (Contributor): let's just keep this doc about Thanos, which seems like a far more appropriate solution to our problem space

Signed-off-by: Scott8440 <scott8440@gmail.com>

| Step | Est. Time |
| --- | --- |
| Deploy Thanos locally and experiment with loads to validate query time improvements | 2 wk |
> Review comment (Contributor): you'll probably have to deploy this on AWS yourself; otherwise you won't be able to test what kind of impact object storage has on query performance

@xjtian xjtian merged commit 1eda5f4 into magma:master Aug 4, 2020
@Scott8440 Scott8440 deleted the thanosProposal branch September 3, 2020 17:53