
[docs][proposal] Scaled Prometheus Pipeline #2118

Merged
merged 3 commits into magma:master on Aug 4, 2020

Conversation

Scott8440
Contributor

Summary

Design doc discussing how to scale prometheus with the end goal of improving query times and supporting increased capacity.

Test Plan

N/A

@Scott8440 Scott8440 marked this pull request as ready for review July 29, 2020 17:38
@xjtian xjtian self-requested a review July 30, 2020 01:04
@xjtian xjtian added the type: proposal Proposals and design documents label Jul 30, 2020
(cherry picked from commit c91e8ec)
Signed-off-by: Scott8440 <scott8440@gmail.com>
…of object storage based optimizations

Signed-off-by: Scott8440 <scott8440@gmail.com>
(cherry picked from commit a3e67d8)
@xjtian (Contributor) left a comment:

Reorganize the doc a bit for me please:

  • Problem statement: as-is
  • Solution proposal: Thanos with a single prom server (or possibly HA) and object storage. Elaborate on the currently proposed architecture on the push and query sides - what will you deploy, what talks to what, and the data flow for timeseries from the edge. Describe what object storage is and what our options for it are for public cloud and on-prem deployments
  • Implementation details: query-side
  • Implementation details: push-side


Object storage will allow us to store only a few hours of metrics on the Prometheus server itself (potentially keeping everything in memory) and export older metrics to object storage elsewhere. For example, on an AWS deployment, metrics would be stored in S3.

We will then deploy multiple Querier components behind a load balancer, configured to talk to both the Prometheus server and the object storage. This shifts the compute and I/O load away from the Prometheus server to the stateless Querier components, which can be trivially scaled horizontally to handle increased query loads.
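As a sketch of what the object storage wiring might look like, a minimal Thanos object storage configuration for S3 could be written as follows. The bucket name and endpoint region here are hypothetical placeholders, not values from the proposal:

```shell
# Hypothetical: write a minimal Thanos objstore config pointing at S3.
# Thanos components (sidecar, store gateway) take this file via
# --objstore.config-file.
cat > bucket.yml <<'EOF'
type: S3
config:
  bucket: "magma-metrics"                  # assumed bucket name
  endpoint: "s3.us-east-1.amazonaws.com"   # assumed region endpoint
  # Credentials typically come from the environment or an IAM role.
EOF
```

On-prem deployments could point the same config shape at any S3-compatible store (e.g. MinIO) by changing the endpoint.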
> Review comment (Contributor): increased


With this setup, we only need to deploy the Thanos `sidecar` and multiple `Querier` components, along with object storage, to achieve faster queries.
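As a rough sketch of the push- and query-side wiring described above (the flags are real Thanos options, but the hostnames, ports beyond the defaults, and file paths are assumptions for illustration):

```
# Push side: the Thanos sidecar runs next to the Prometheus server,
# exposing its data over the Store API and uploading completed TSDB
# blocks to object storage.
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=bucket.yml

# Query side: stateless Queriers (behind a load balancer) fan out to
# the sidecar for recent data and to object storage for older blocks.
thanos query \
  --http-address=0.0.0.0:10902 \
  --store=sidecar-host:10901 \
  --store=store-gateway-host:10901
```

Because the Querier holds no state, adding capacity is just deploying more `thanos query` instances behind the load balancer.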

### Cortex
> Review comment (Contributor): let's just keep this doc about Thanos, which seems like a far more appropriate solution to our problem space

Signed-off-by: Scott8440 <scott8440@gmail.com>

| Step | Est. Time |
| --- | --- |
| Deploy Thanos locally and experiment with loads to validate query time improvements | 2 wk |
> Review comment (Contributor): you'll probably have to deploy this on AWS yourself; otherwise you won't be able to test what kind of impact object storage has on query performance

@xjtian xjtian merged commit 1eda5f4 into magma:master Aug 4, 2020
@Scott8440 Scott8440 deleted the thanosProposal branch September 3, 2020 17:53