
Conversation

@stefannegrea
Contributor

…e deleted. A metric becomes eligible for deletion when all the data points for the metric expire.

@stefannegrea
Contributor Author

Please do not merge. I am submitting an early PR to see the impact on write performance.

@burmanm
Contributor

burmanm commented Mar 28, 2017

We removed this feature for metrics_idx earlier, and now we're reintroducing the double writes? This should not happen during the write path.

It will effectively halve our write performance (you can measure this with the JMH benchmarks; ignore the perf-stability-test).

If such a feature is necessary, a much better way would be a daily scan to see which metrics are no longer used and then delete them. The same way we do the compression job, you can scan them there once and not interfere with the write pipeline at all.

@jsanda
Contributor

jsanda commented Mar 28, 2017

I think there are several options we can consider.

  1. Perform the writes in band within the HTTP request
    I agree that this will slow overall write performance with respect to the REST API.

  2. Perform the writes out of band
    We could write the metrics to a queue that is periodically flushed and processed asynchronously with respect to HTTP requests (a rough sketch follows at the end of this comment).

  3. Daily scan of data table
    Yes we do this with the compression job, but we would avoid it if we could.

  4. Daily scan of data_compressed table
    The advantage of this over scanning the data table is that the data_compressed table is flushed much less frequently so the queries should incur a lot less I/O.

Options 1 and 2 minimize the reads we have to do, which is generally a good thing. We never have to touch the data or data_compressed tables with these options. There is no need for the job to be fast, so if we do a daily scan we can certainly throttle the reads to avoid putting too much stress on Cassandra. I guess we need to look at some performance numbers for comparison.
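
For reference, here is a rough sketch of option 2. Everything below is illustrative only: the class and method names are made up, MetricId is the project's type, and the actual write to the expiration index is left as a callback.

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical buffer for option 2: record ids on the ingest path, flush out of band.
class ExpirationUpdateBuffer {

    private final Set<MetricId<?>> pending = ConcurrentHashMap.newKeySet();

    // Called from the write path; no extra Cassandra write happens here.
    <T> void markActive(MetricId<T> id) {
        pending.add(id);
    }

    // Invoked on a timer, asynchronously with respect to HTTP requests.
    // Losing a batch on a crash is tolerable since the data points still expire via TTL.
    void flush(Consumer<MetricId<?>> writeExpirationEntry) {
        Set<MetricId<?>> batch = new HashSet<>(pending);
        pending.removeAll(batch);
        batch.forEach(writeExpirationEntry);
    }
}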

@burmanm
Contributor

burmanm commented Mar 28, 2017

The first option should be ruled out; I don't think a 50% performance loss in the backend is acceptable. The second option would require some sort of aggregation (so we don't update the same row too often), and if that state must be trusted, then it can't be lost.

For those reasons, I think reading is a better choice. Just throttle it enough. It shouldn't take that much time to fetch the min/max timestamps for each metric, like we previously did when fetching the metric definitions. We really don't need to do this operation often.

@jsanda
Contributor

jsanda commented Mar 28, 2017

There is a key difference in writing to metrics_idx with respect to compaction. metrics_idx uses LCS. The new table here uses STCS, which is less I/O intensive than LCS. While I agree that option 1 will slow down write performance, I do not necessarily expect a 50% performance hit.

For option 2, the aggregation is nothing more than storing the MetricId. That's it. And we can tolerate write failures.

@stefannegrea
Contributor Author

stefannegrea commented Mar 28, 2017

Here is an idea for a slightly different approach. The call to update the expiration index is now in three places: create metric, update tags, add data points.

Create metric and update tags would stay the same as in this PR. For the data points, use the compression job. The expiration index is updated only once, when a compressed block is written. This way we leave the index in place for easy removal but leave the data point insertion path almost unchanged. And the number of writes for data points will be drastically reduced, since the update is done once every 2 hours for a metric with data points.

Any thoughts?

@jsanda
Contributor

jsanda commented Mar 28, 2017

Doing it during the compression job makes sense since we have already queried the data.

@burmanm
Contributor

burmanm commented Mar 28, 2017

LCS / STCS makes no difference in the write path as the write happens to WAL & memtable, not to SSTables yet. We do suffer a 50% reduction, and even the REST test showed a 40% drop (although we didn't spend twice the time in the REST layer).

Compression job could do the update, it doesn't hurt there.

@jsanda
Contributor

jsanda commented Mar 28, 2017

LCS / STCS makes no difference in the write path as the write happens to WAL & memtable, not to SSTables yet

If an SSD is used, or if the commit log is on its own dedicated disk, then I agree that the compaction strategy should not matter as much; however, I think it is safest to assume that the commit log is on shared storage, in which case the additional I/O from LCS absolutely can make a significant difference. LCS is also more CPU intensive than STCS, which could also be a factor.

All that aside, I like @stefannegrea's idea of doing the update during the compression job. We get the best of both worlds. We do not impact the ingest write path and we avoid performing extra reads.

@burmanm
Contributor

burmanm commented Mar 28, 2017

Well, even with the same disk being used, we don't fsync on every write (only every 10 seconds), so the write initially goes only to memory. The memtable handling in Cassandra is just very slow.

@stefannegrea
Contributor Author

The only problem with my idea is that string metrics will need to be excluded from the purge for now since they are not compressed.

@jsanda
Contributor

jsanda commented Mar 28, 2017

I think we can handle string metrics separately. At some point, I think they ought to go in a separate table, and they are not even used in OpenShift. For v1, I say we do not worry about string metrics. For v2, let's get a separate table for string metrics, and we can decide what approach to take.

@stefannegrea
Contributor Author

retest this please

// TODO Optimization - new worker per token - use parallelism in Cassandra (with configured parallelism)
- return metricsService.compressBlock(metricIds, startOfSlice, endOfSlice, pageSize)
+ return metricsService.compressBlock(metricIds, startOfSlice, endOfSlice, pageSize, subject)
.doOnError(t -> logger.warn("Failed to compress data", t))
Contributor

@jsanda jsanda Mar 30, 2017

You need to call either subject.onCompleted() or subject.onError() here in the doOnError callback.
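
Something along these lines (just a sketch; the call is copied from the diff above, and whether to complete the subject or propagate the error is a judgment call):

return metricsService.compressBlock(metricIds, startOfSlice, endOfSlice, pageSize, subject)
        .doOnError(t -> {
            logger.warn("Failed to compress data", t);
            // release downstream subscribers instead of leaving them waiting forever
            subject.onCompleted(); // or subject.onError(t) if the failure should be propagated
        });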

}

@Override
public <T> void updateMetricExpiration(Metric<T> metric) {
Contributor

Let's keep things reactive and functional and return either Observable or (preferably) Completable instead of void.
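
A rough sketch of that shape (the dataAccess call, its parameters, and computeExpiration are assumptions made for illustration; only the method names come from this PR):

@Override
public <T> Completable updateMetricExpiration(Metric<T> metric) {
    if (MetricType.STRING.equals(metric.getType())) {
        return Completable.complete();
    }
    // assumes DataAccess exposes the expiration index write reactively
    return dataAccess.updateMetricExpirationIndex(metric.getMetricId(), computeExpiration(metric))
            .toCompletable();
}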

@jsanda
Copy link
Contributor

jsanda commented Mar 30, 2017

It would be good to have a test (or tests) in DeleteExpiredMetricsJobITest that executes the repeating job. Look at CompressDataJobITest.testCompressJob() for an example of how to do this, and/or hit me up with questions.

You also probably want to change the test method name from DeleteExpiredMetricsJobITest.testCompressJob to something like DeleteExpiredMetricsJobITest.testDeleteExpiredMetricsJob or DeleteExpiredMetricsJobITest.runDeleteExpiredMetricsJob.

@Override
public <T> void updateMetricExpiration(Metric<T> metric) {
if (!MetricType.STRING.equals(metric.getType())) {
long expiration = 0;
Contributor

Should we calculate expiration using the latest timestamp in the compressed block?

Contributor Author

I would not do any extra work, because that will just change the time by an hour or two; since the job runs in daily increments, it will not make much of a difference.

@jsanda
Contributor

jsanda commented Mar 30, 2017

@stefannegrea, I just now thought about something we probably need to handle. What if the compression job is not running for some reason? The metrics expiration index won't get updated, and if the delete job is running, we could potentially wind up deleting metrics that should not be deleted. Considering we allow the compression job to be disabled, this scenario is entirely possible.

Stefan Negrea added 2 commits March 30, 2017 12:11
…ration index reactive.

Also, complete the publish subject in case of a compression error.
@jsanda
Contributor

jsanda commented Mar 30, 2017

Can you change the return type of DataAccess.updateMetricExpirationIndex to be Observable so it is consistent with the rest of DataAccess?
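
Roughly this kind of change (the parameter list and the ResultSet element type are assumptions; only the method name is from this PR):

// before (assumed)
void updateMetricExpirationIndex(MetricId<?> id, long expirationTime);

// after: callers can compose it with the rest of the reactive pipeline
Observable<ResultSet> updateMetricExpirationIndex(MetricId<?> id, long expirationTime);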

@jsanda
Contributor

jsanda commented Mar 31, 2017

In JobServiceImpl.start, we need to move the first statement, scheduler.start(), to the end of the method, after we register and schedule the jobs. The reason for this is the changes for rescheduling the DeleteExpiredMetricsJob. The new Scheduler.unscheduleJob method can currently only be called safely when the scheduler is not running. Please add Javadocs to the unschedule method indicating this, and add a TODO (a ticket would be great too) that says the method should support unscheduling jobs while the scheduler is running.
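
In other words, something like this (the registration calls are placeholders for the existing code; only scheduler.start() comes from the method being discussed):

public void start() {
    // register and schedule the recurring jobs first...
    registerJobs();          // placeholder for the existing registration code
    scheduleRecurringJobs(); // placeholder for the existing scheduling code

    // ...and only then start the scheduler, since Scheduler.unscheduleJob is currently
    // safe to call only while the scheduler is not running
    scheduler.start();
}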

}
if (!compressJobEnabled) {
expirationIndexResults = expirationIndexResults
.flatMap(r -> session
Contributor

I think there are a couple of problems here. First, I do not think you can unconditionally use expirationIndexResults. What if compression has always been disabled? And there is no guarantee that metrics are explicitly created. I think we need some extra bookkeeping to know if/when metrics_expiration_idx has been updated. Secondly, you need to update metrics_expiration_idx in this call chain as you iterate through the data table.

Contributor Author

If there is no entry in the expiration index, then that is good; it means the data will automatically expire based on the TTL. We are only concerned about manually cleaning data that does not have a TTL (in general, that is indexes). For that case, an entry in the expiration index is created every time a metric is intentionally created or when tags are inserted or updated.

All the compression job does is extend the expiration time. If the compression job is disabled, then we have no way of reliably extending the expiration time; hence we need to query and check whether there is unexpired data.

.map(empty -> r));
}

return expirationIndexResults
Contributor

I would be inclined to use concatMap here instead of flatMap just as a way to throttle. I would also add in some failure handling. If deleting one metric fails for whatever reason, we want to continue with the stream.

Contributor

There's a reusable retry mechanism in MetricsServiceImpl; use that one.

…urn Observable. The scheduler is now started after all the recurring jobs are scheduled. And log errors when metrics cannot be deleted.
return expirationIndexResults
- .flatMap(metricId -> metricsService.deleteMetric(metricId))
+ .concatMap(metricId -> metricsService.deleteMetric(metricId))
.doOnError(e -> {
Contributor

This logs the error which is good, but we still terminate the stream early. You need onErrorResumeNext or something similar.
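
For example, something like this (a sketch only; the log message is illustrative, and deleteMetric's exact return type is assumed):

return expirationIndexResults
        .concatMap(metricId -> metricsService.deleteMetric(metricId)
                .doOnError(e -> logger.warn("Failed to delete expired metric " + metricId, e))
                // swallow the failure for this metric so the rest of the stream keeps going;
                // the entry stays in the expiration index and gets retried on the next run
                .onErrorResumeNext(Observable.empty()));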

Stefan Negrea added 2 commits March 31, 2017 10:37
…a delete operation fails, it will be tried again when the expiration index is reprocessed.
@jsanda jsanda merged commit 6d67ed2 into master Mar 31, 2017
@stefannegrea stefannegrea deleted the HWKMETRICS-613 branch May 11, 2017 15:32