
receive: Replication #1270

Merged
2 commits merged into thanos-io:master on Jun 24, 2019

Conversation

@squat squat commented Jun 21, 2019

This commit adds a new replication feature to the Thanos receiver.
By default, replication is turned off. However, when replication is
enabled (replication factor >= 2), the target node for a time series
replicates the time series to the other nodes concurrently and
synchronously. If the replication to >= (rf+1)/2 nodes fails, the
original write request fails.
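
For concreteness, a minimal sketch of that quorum rule (the helper name and structure are illustrative, not the actual Thanos code):

```go
package main

import "fmt"

// quorumReached reports whether a write replicated with the given
// replication factor can still be acknowledged after `failed`
// replications have failed. The request is failed once replication to
// (rf+1)/2 or more nodes has failed, i.e. a write must succeed on at
// least (rf+1)/2 nodes.
func quorumReached(replicationFactor, failed int) bool {
	return failed < (replicationFactor+1)/2
}

func main() {
	// With a replication factor of 3, the quorum is (3+1)/2 = 2:
	// one failed replication is tolerated, two are not.
	fmt.Println(quorumReached(3, 1)) // true
	fmt.Println(quorumReached(3, 2)) // false
}
```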

Changes

  • add optional replication to the Thanos receive component
  • add e2e tests to exercise replication

Verification

  • e2e tests
  • manual testing to ensure replicated metrics appear in querier
  • checking new metrics to ensure replication errors don't increase in regular operation

cc @brancz @bwplotka @metalmatze

@brancz brancz changed the title from Replicate to receive: Replication on Jun 21, 2019
brancz commented Jun 24, 2019

As replication wasn't specified in the original design, could you move the paragraph from open questions to the "Rollout/scaling/failure of receiver nodes" section and add appropriate detail?

@brancz brancz left a comment

The general strategy looks sound. Just a few nits to keep this clean.

errs = err
continue
}
errs = errors.Wrap(errs, err.Error())

Member commented:

MultiError makes more sense here, no? In fact, parallelizeRequests could immediately return a MultiError.

Member Author commented:

Sounds good! I didn’t realize we had a util for that :)
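
For illustration, a rough sketch of the suggested shape: parallelizeRequests returning a single multi-error instead of wrapping errors into each other. The MultiError type below is hand-rolled for the example; Thanos has its own multi-error utility, and the request fan-out here is simplified:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// MultiError collects several errors into one, instead of nesting them
// with errors.Wrap. Illustrative only; not the Thanos helper itself.
type MultiError []error

func (m MultiError) Error() string {
	msgs := make([]string, 0, len(m))
	for _, err := range m {
		msgs = append(msgs, err.Error())
	}
	return strings.Join(msgs, "; ")
}

// Err returns nil when no errors were collected.
func (m MultiError) Err() error {
	if len(m) == 0 {
		return nil
	}
	return m
}

// parallelizeRequests runs every request concurrently and returns all
// failures as a single MultiError.
func parallelizeRequests(reqs []func() error) error {
	var (
		mu   sync.Mutex
		errs MultiError
		wg   sync.WaitGroup
	)
	for _, req := range reqs {
		wg.Add(1)
		go func(req func() error) {
			defer wg.Done()
			if err := req(); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(req)
	}
	wg.Wait()
	return errs.Err()
}

func main() {
	err := parallelizeRequests([]func() error{
		func() error { return nil },
		func() error { return fmt.Errorf("replication to node 2 failed") },
	})
	fmt.Println(err) // replication to node 2 failed
}
```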

@@ -49,16 +50,42 @@ var (
},
[]string{"handler"},
)
replicationRequestsTotal = prometheus.NewCounter(

Member commented:

let's move not only the registration, but also the initialization of these to the NewHandler function

Member commented:

Then we should also pass the prometheus.Registry into NewHandler?!

Member commented:

Precisely.

Member Author commented:

The registry is already passed into NewHandler via the options struct 👍
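
A hedged sketch of the pattern being agreed on here: the counter is created inside NewHandler and registered against the registry carried by the options struct. Names and fields are illustrative, not the exact Thanos code:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Options carries the handler's dependencies, including the registry
// used to register its metrics. Field names are illustrative.
type Options struct {
	Registry *prometheus.Registry
}

// Handler owns its metrics instead of relying on package-level vars.
type Handler struct {
	options                  *Options
	replicationRequestsTotal prometheus.Counter
}

// NewHandler both initializes and registers the handler's metrics,
// so nothing is created at package init time.
func NewHandler(o *Options) *Handler {
	h := &Handler{
		options: o,
		replicationRequestsTotal: prometheus.NewCounter(
			prometheus.CounterOpts{
				// Metric name and help text are illustrative.
				Name: "thanos_receive_replication_requests_total",
				Help: "The number of replication requests made by the receiver.",
			},
		),
	}
	if o.Registry != nil {
		o.Registry.MustRegister(h.replicationRequestsTotal)
	}
	return h
}

func main() {
	_ = NewHandler(&Options{Registry: prometheus.NewRegistry()})
}
```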

// Increment the counters as necessary now that
// the requests will go out.
defer func() {
requestCounter.Inc()

Member commented:

This will also be increased, even when there is an error. Is that intended?

Member Author commented:

Yes, this is tracking the total number of requests, failed or successful. Using the error counter we can know the percentage of errors.

Member commented:

It's probably worth unifying these into one metric with different labels for success/failure.

Member commented:

Yep, I'd prefer having a unified counter. 👍

Member Author commented:

So don't track the total; instead use one counter vec with a result label for success or failure. When we want to know the percentage of failed requests, we compute errors/(errors+successes)?

Member commented:

correct
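
A sketch of the unified metric discussed above: one counter vector with a result label, from which the failure percentage is errors/(errors+successes). Metric and label names are illustrative:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// One counter vector with a "result" label replaces the separate
	// total and error counters.
	replications := prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "thanos_receive_replications_total",
			Help: "The number of replication attempts, labeled by result.",
		},
		[]string{"result"},
	)
	reg.MustRegister(replications)

	// On each replication attempt, increment exactly one of the series.
	replications.WithLabelValues("success").Inc()
	replications.WithLabelValues("error").Inc()

	// The failure ratio is then computed from the two series as
	// errors / (errors + successes).
	fmt.Println("registered and incremented the labeled counter")
}
```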

@brancz brancz left a comment

Very slick 👍 And the tests are so simple, really love it!

@brancz brancz merged commit 70ba420 into thanos-io:master Jun 24, 2019

### Replication

The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FATOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:

Member (bwplotka) commented:

Suggested change
The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FATOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:
The Thanos receiver supports replication of received time-series to other receivers in the same hashring. The replication factor is controlled by setting a flag on the receivers and indicates the maximum number of copies of any time-series that should be stored in the hashring. If any time-series in a write request received by a Thanos receiver is not successfully written to at least `(REPLICATION_FACTOR + 1)/2` nodes, the receiver responds with an error. For example, to attempt to store 3 copies of every time-series and ensure that every time-series is successfully written to at least 2 Thanos receivers in the target hashring, all receivers should be configured with the following flag:

squat added a commit to squat/thanos that referenced this pull request Jun 24, 2019
This commit makes a small fix to the spelling of `REPLICATION_FACTOR`
as suggested by bwplotka in
thanos-io#1270 (comment).
brancz pushed a commit that referenced this pull request Jun 24, 2019
This commit makes a small fix to the spelling of `REPLICATION_FACTOR`
as suggested by bwplotka in
#1270 (comment).
@squat squat deleted the replicate branch August 31, 2019 01:47