Create new metrics earlier #888

Open
wants to merge 6 commits into master

Conversation

piotr1212
Member

In large Graphite installations the queues can get really long; it can take
an hour for Graphite to write all metrics in the queue. New db files are only created
when the metric is written, which can take too long. This change separates the creation
of metrics from writing data to them and moves the creation to an earlier moment.

Whenever a new metric is received, its name is pushed to a new_metrics list. The
first step in the writer loop is to check whether any new metrics have been received
and to create them if they don't exist on disk yet. After the creation, the writer
continues as usual with writing metrics from the queue, but it does not check whether
the file already exists, to avoid performing the check twice and impacting IO. If the
file does not exist at this point, it is logged.

Fixes: graphite-project/graphite-web#629
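
A minimal sketch of the mechanism described above, with simplified stand-in names (the real MetricCache and writer code differ, and the database object here stands in for carbon's state.database, whose create() takes archive-configuration arguments):

new_metrics = []   # metric names seen for the first time by the cache
cache = {}         # metric name -> {timestamp: value}

def store(metric, timestamp, value):
    # Receive path: remember first-seen names so the writer can create them early.
    if metric not in cache:
        new_metrics.append(metric)
    cache.setdefault(metric, {})[timestamp] = value

def create_new_files(database):
    # First step of the writer loop: create files for newly seen metrics
    # so graphite-web can find them before their datapoints are flushed.
    while new_metrics:
        metric = new_metrics.pop(0)
        if not database.exists(metric):
            database.create(metric)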

@piotr1212
Member Author

Not sure if I missed something, but I think this would fix the issues in graphite-project/graphite-web#629 with a much simpler solution than everything discussed there.

I see I broke Python2, not sure if we still care?

(metric, datapoints) = cache.drain_metric()
if metric is None:
# end the loop
break
Member

This will only create new files as long as there are also updates to process, is that what we want?

Member Author

Not sure if I am following: new files are created first, on every loop; the break comes later.

Note that the diff rendering is a bit misleading: the part with the break is old code.

Contributor

imagine if there is more to get from pop_new_metric() than there is to get from drain_metric(): then the loop would do create-new, drain-old, create-new, drain-old, alternating, and stop when drain-old was done, before all new metrics were handled ...

but that seems impossible, there can't be more new metrics than total cached metrics, right ... but I understand how one might be unsure, when looking at this code in isolation.

Here's another question: is it possible for drain_metric() to return a new metric before pop_new_metric()? Imagine there are two new cached metrics and nothing else, A and B. A is received first, but B gets more datapoints. pop_new_metric() returns A, then drain_metric() returns B (then pop_new_metric returns B on the next iteration). The FileNotFound will be hit below, and B datapoints will be lost, I think, needlessly.

I have an alternate design idea: keep the original logic here in this loop, and instead modify drain_metric() to preferentially return new metrics.

Member Author

but that seems impossible, there can't be more new metrics than total cached metrics, right ... but I understand how one might be unsure, when looking at this code in isolation.

Yup, that should be impossible.

Here's another question: is it possible for drain_metric() to return a new metric before pop_new_metric()? Imagine there are two new cached metrics and nothing else, A and B. A is received first, but B gets more datapoints. pop_new_metric() returns A, then drain_metric() returns B (then pop_new_metric returns B on the next iteration). The FileNotFound will be hit below, and B datapoints will be lost, I think, needlessly.

Yes, that is possible when the MAX_CREATES_PER_MINUTE limiter is hit. This is on purpose; currently the metrics get dropped as well, see the old XXX TODO part.
I'm wondering whether it wouldn't be better to drain the new_metrics first before continuing with drain_metric().

I have an alternate design idea: keep the original logic here in this loop, and instead modify drain_metric() to preferentially return new metrics.

That is exactly what the naive strategy does. That one is not efficient, as it does not write the metric with the most datapoints first, which results in small writes.

The cache doesn't know whether a metric is really new (as in: not on disk); it can only see that it is not in the cache dict. I didn't want to add IO to the cache thread to check whether it is on disk; no particular reason, it just felt wrong to mix the two. Maybe it would be better?

Contributor

Hmm you're exactly right ... my idea was equivalent to the existing naive strategy ...

That is an intriguing idea, doing the exists() check just before adding to new_metrics (only if the metric is not in the cache). That would give the best signal, but might be too expensive, as you worried.

I don't have any better ideas, maybe your current method will work well in practice :)
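
For illustration, a rough sketch of the exists()-on-add idea discussed above (hypothetical; the real MetricCache.store() is more involved, and the concern is exactly that this puts filesystem I/O on the receive path):

from carbon import state  # assumes this runs inside carbon, where state.database is set

class MetricCacheSketch(dict):
    def __init__(self):
        super().__init__()
        self.new_metrics = []

    def store(self, metric, datapoint):
        timestamp, value = datapoint
        # Only treat a metric as "new" if there really is no file on disk:
        # the best signal, but it costs an exists() check on the receive path.
        if metric not in self and not state.database.exists(metric):
            self.new_metrics.append(metric)
        self.setdefault(metric, {})[timestamp] = value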

@@ -236,6 +237,10 @@ def get_datapoints(self, metric):
"""Return a list of currently cached datapoints sorted by timestamp"""
return sorted(self.get(metric, {}).items(), key=by_timestamp)

def pop_new_metric(self):
# return first seen metric
return self.new_metrics.pop(0)
Contributor

this can raise "IndexError: pop from empty list"
(which does not appear to be handled where this is called)

Member Author

Thanks; it was caught later on but not logged, so it got lost. I changed the code so we log exceptions in the writer and handled this case.
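
One possible shape of that fix, sketched here for reference (the actual commits may differ): make the pop safe by returning None when the list is empty instead of letting IndexError escape.

def pop_new_metric(self):
    # Return the first-seen new metric, or None if none are pending,
    # rather than raising IndexError from an empty list.
    try:
        return self.new_metrics.pop(0)
    except IndexError:
        return None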

handle indexexception
log exceptions in writer
@ploxiln
Contributor

ploxiln commented Apr 21, 2020

Some more thoughts, hopefully helpful:

Currently (without this pull-request), state.database.exists(metric) is called for each metric as it is removed from the cache by cache.drain_metric(). When more datapoints for the metric are reported, it will be added to the cache again, and the exists() check will run again when it is removed.

If you instead called state.database.exists(metric) for each metric when it is first added to the cache, that would result in the same overall number of exists() checks. Each metric is added, remains in cache for some time, then is removed (and written). Adds and removes are symmetric, in total count. But they would be distributed in time differently (may or may not be more bursty).

Regarding the file-create-rate-limit: with the status-quo (check-on-remove), when the limit is hit, all cached data for the affected metric is lost. With the alternative (create-on-add), you can skip adding the new metric if the file-create-rate-limit is hit, and only lose the last bit of data submitted, and cause it to check again when the next bit of data is submitted. This loses as little data as possible, while honoring the file-create-rate-limit, but doing more exists() checks (maybe a lot more if the limit is hit a lot). I think exists() checks should be relatively cheap.

Finally, if the whisper file is deleted while the metric is already in the cache (by the user or some separate cleanup job), then in the create-on-add design, you lose all the cached data when it later fails to write it out. This seems like a very unlikely corner case.

So I think it could be a good idea to move the exists() check and create attempt, to just when the metric is being added to the cache. Or, instead of just one exists()/create before each drain_metric(), do all available new_metrics in a loop, before moving on to drain_metric(), and if new-file-create fails, remove that metric from the cache so it doesn't collect more before being lost.
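
A rough sketch of the create-on-add variant described above (hypothetical names; create_metric_file stands in for the archive-config lookup and state.database.create() call that live in writer.py today):

from carbon import instrumentation, state  # assumes this runs inside carbon

def store_create_on_add(cache, metric, datapoint, create_bucket, create_metric_file):
    timestamp, value = datapoint
    if metric not in cache and not state.database.exists(metric):
        if create_bucket and not create_bucket.drain(1):
            # Create-rate limit hit: skip caching this sample, so only the latest
            # datapoint is lost, and retry the create when the next one arrives.
            instrumentation.increment('droppedCreates')
            return
        create_metric_file(metric)
    cache.setdefault(metric, {})[timestamp] = value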

@piotr1212
Member Author

Yes, basically what I was thinking. Already have a patch to loop through all new_metrics before moving on to drain_metric().

@piotr1212
Member Author

The last commit needs testing. I'm quite busy at the moment and don't know when I will get around to it.

@lgtm-com

lgtm-com bot commented Apr 22, 2020

This pull request introduces 1 alert when merging b9dbd1b into 42c1242 - view on LGTM.com

new alerts:

  • 1 for Unused import

# return self.new_metrics.pop(0)
#except IndexError:
# return None
return next(self.get_new_metrics())
Contributor

looks like these methods are not used? the writer currently just does:

for new_metric in cache.new_metrics:

@ploxiln
Contributor

ploxiln commented Apr 22, 2020

I have a more minimal version of this idea, with a couple of tweaks, possibly useful:

--- a/lib/carbon/cache.py
+++ b/lib/carbon/cache.py
@@ -189,6 +189,7 @@ class _MetricCache(defaultdict):
   def __init__(self, strategy=None):
     self.lock = threading.Lock()
     self.size = 0
+    self.new_metrics = []
     self.strategy = None
     if strategy:
       self.strategy = strategy(self)
@@ -253,6 +254,8 @@ class _MetricCache(defaultdict):
           log.msg("MetricCache is full: self.size=%d" % self.size)
           events.cacheFull()
         else:
+          if not self[metric]:
+            self.new_metrics.append(metric)
           self.size += 1
           self[metric][timestamp] = value
           if self.strategy:
diff --git a/lib/carbon/writer.py b/lib/carbon/writer.py
index 7b63cba..f46421b 100644
--- a/lib/carbon/writer.py
+++ b/lib/carbon/writer.py
@@ -95,24 +95,16 @@ def writeCachedDataPoints():
 
   cache = MetricCache()
   while cache:
-    (metric, datapoints) = cache.drain_metric()
-    if metric is None:
-      # end the loop
-      break
-
-    dbFileExists = state.database.exists(metric)
-
-    if not dbFileExists:
-      if CREATE_BUCKET and not CREATE_BUCKET.drain(1):
-        # If our tokenbucket doesn't have enough tokens available to create a new metric
-        # file then we'll just drop the metric on the ground and move on to the next
-        # metric.
-        # XXX This behavior should probably be configurable to no tdrop metrics
-        # when rate limitng unless our cache is too big or some other legit
-        # reason.
-        instrumentation.increment('droppedCreates')
+    # first create new metrics files so graphite-web knows they exist
+    while cache.new_metrics:
+      metric = cache.new_metrics.pop()
+      if state.database.exists(metric):
         continue
 
+      if CREATE_BUCKET:
+        CREATE_BUCKET.drain(1, blocking=True)
+        # log waitTime ?
+
       archiveConfig = None
       xFilesFactor, aggregationMethod = None, None
 
@@ -149,6 +141,12 @@ def writeCachedDataPoints():
         instrumentation.increment('errors')
         continue
 
+    # now drain and persist some data
+    (metric, datapoints) = cache.drain_metric()
+    if metric is None:
+      # end the loop
+      break
+
     # If we've got a rate limit configured lets makes sure we enforce it
     waitTime = 0
     if UPDATE_BUCKET:

@piotr1212
Member Author

Yeah, basically that, but don't block on the create_bucket limiter; just continue with writing datapoints. We don't want to hang the whole writer.
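
A sketch of that non-blocking variant (based on the diff above, with the blocking drain() removed; not necessarily the final PR code):

from carbon import state  # assumes this runs inside carbon

def create_new_metric_files(cache, create_bucket, create_metric_file):
    # Create files for newly seen metrics without stalling the writer: when the
    # create bucket is out of tokens, leave the remaining names in new_metrics
    # and fall through to writing datapoints; they are retried on the next pass.
    while cache.new_metrics:
        metric = cache.new_metrics[0]
        if state.database.exists(metric):
            cache.new_metrics.pop(0)
            continue
        if create_bucket and not create_bucket.drain(1):
            break
        cache.new_metrics.pop(0)
        create_metric_file(metric)  # hypothetical helper wrapping state.database.create(...)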

@stale

stale bot commented Jun 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 26, 2020
@stale stale bot removed the stale label Jun 29, 2020
Successfully merging this pull request may close these issues.

Metric queries return null data until whisper file exists