addition of new metric #2060

Closed · wants to merge 1 commit

Conversation

@pmoogi-redhat (Contributor) commented Feb 15, 2021

Description

Currently the in_tail plugin doesn't support publishing of inbound log loss, i.e. the difference between the total bytes written to disk (log file) and the total bytes collected or read by fluentd. This PR changes fluentd/lib/fluent/plugin/in_tail.rb and fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb to enable publishing of the parameters below:

  1. totalbytes_logged_in_disk (written to disk at each unique inode level per container process)
  2. totalbytes_collected_from_disk (read or collected by fluentd at each unique inode level per container process)

/cc @alanconway @jcantrill
/assign @alanconway

/cherry-pick

Links

  • Depending on PR(s):
  • Bugzilla:
  • Github issue:
  • JIRA: https://issues.redhat.com/browse/LOG-1032
  • Enhancement proposal: many rotations are getting missed by fluentd. A follow-up enhancement proposal is to make fluentd aware of all rotations actually performed by the CRI-O/conmon process by reading extra metadata such as ... Based on which rotations fluentd tracked and which it missed, log loss would be computed as: (number of missed rotations * max log file size) + the sum, over all rotations tracked by fluentd, of (totalbytes_logged_in_disk - totalbytes_collected_from_disk). See the sketch below.
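As a rough illustration of that proposal, the computation could look like the following sketch (all values and variable names are hypothetical, not taken from the PR):

```ruby
# Illustrative sketch of the proposed log-loss formula; every value here is hypothetical.
max_log_file_size = 10 * 1024 * 1024          # assumed per-file rotation limit enforced by conmon
missed_rotations  = 2                          # rotations fluentd never observed
tracked_rotations = [                          # per-rotation byte counts fluentd did track
  { logged: 10_485_760, collected: 10_000_000 },
  { logged: 9_800_000,  collected: 9_800_000 },
]

log_loss = missed_rotations * max_log_file_size +
           tracked_rotations.sum { |r| r[:logged] - r[:collected] }

puts log_loss   # 2 * 10 MiB assumed lost in missed rotations + 485,760 bytes from tracked ones
```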

@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pmoogi-redhat
To complete the pull request process, please assign alanconway after the PR has been reviewed.
You can assign the PR to them by writing /assign @alanconway in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

'Current max fsize of file on rotation event'),
countonrotate: get_gauge(
:countonrotate,
'No of rotation noticed by fluentd'),

s/No/Number

Contributor Author:

Restored all changes done in vendored_gem_src, as it is not supposed to be changed.

'Current max fsize of file on rotation event'),
countonrotate: get_gauge(
:countonrotate,
'No of rotation noticed by fluentd'),

s/rotation/rotations

Contributor Author:

Restored all changes done in vendored_gem_src, as it is not supposed to be changed.

:fluentd_tail_file_inode,
'Current inode of file.'),
maxfsize: get_gauge(
:maxfsize,
'Current max fsize of file on rotation event'),

Maybe you can drop the "current"? If this is the max fsize then it is enough to state that. "Current" might not be a user-facing value.

Contributor:

We don't need to track file size. The interesting metrics are bytes_logged/bytes_collected. Max file size is an implementation detail that could change in the future (e.g. if we start streaming logs from CRI-O directly via pipes).

Contributor Author:

Restored all changes done in vendored_gem_src, as it is not supposed to be changed. The final changes no longer publish the maxfsize and countonrotate parameters.

'totalbytes read by fluentd IOHandler'),
totalbytesavailable: get_gauge(
:totalbytesavailable,
'totalbytes available at each instance of rotation - to be read by fluentd IOHandler'),
@eranra commented Feb 15, 2021:

I am not sure that I understand the term ... maybe something with the words
(After reading the code I understand that this is the potential ... but I am not sure it is useful to expose this to customers)

Contributor:

I changed the terms in the JIRA; @eranra, if you agree these are more expressive we should update the code:

bytes_logged (counter): bytes written to log files by containers.
bytes_collected (counter): bytes read by the collector for forwarding.

Log loss can be computed by a Prometheus query expression: (bytes_logged - bytes_collected)

@pmoogi-redhat note the metrics should be counters, not gauges. They always increase.
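For example, with the prometheus-client Ruby gem (a minimal sketch only, assuming that gem's keyword-argument API rather than the fluent-plugin-prometheus helpers used in this file; the path label is just an assumption):

```ruby
# Sketch only: registering the two proposed metrics as counters rather than gauges.
require 'prometheus/client'

registry = Prometheus::Client.registry

bytes_logged = registry.counter(
  :bytes_logged,
  docstring: 'Bytes written to log files by containers.',
  labels: [:path])                               # label choice is an assumption

bytes_collected = registry.counter(
  :bytes_collected,
  docstring: 'Bytes read by the collector for forwarding.',
  labels: [:path])

# Counters only ever increase; log loss itself stays a Prometheus query,
# e.g. (bytes_logged - bytes_collected), rather than being computed in the collector.
bytes_logged.increment(by: 4096, labels: { path: '/var/log/containers/app.log' })
bytes_collected.increment(by: 2048, labels: { path: '/var/log/containers/app.log' })
```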


@alanconway yes ... those terms are simple and make sense :-)

Contributor Author:

Changed the parameter names from totalbytesavailable and totalbytesread to totalbytes_logged_in_disk and totalbytes_collected_from_disk respectively. Added a separate class implementation that monitors stat to obtain totalbytes_logged_in_disk.

@metrics[:countonrotate].set(label, countonrotate)
@metrics[:totalbytesread].set(label, totalbytesread)
@metrics[:totalbytesavailable].set(label, totalbytesavailable)
#@log.info "IN PROMETHEUS PLUGIN pr.read_inode and pe.read_pos #{pe.read_inode} #{pe.read_pos} maxfsize #{maxfsize} countonrotate #{countonrotate} totalbytesread #{totalbytesread} totalbytesavailable #{totalbytesavailable} "

Consider changing this to log.debug and removing the commented-out line.
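For example (a sketch only, reusing the variables from the quoted hunk; the surrounding plugin code is assumed):

```ruby
# Sketch: debug-level logging in place of the commented-out @log.info line.
@log.debug "tail monitor: read_inode=#{pe.read_inode} read_pos=#{pe.read_pos} " \
           "totalbytesread=#{totalbytesread} totalbytesavailable=#{totalbytesavailable}"
```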

Contributor Author:

Restored all changes done in vendored_gem_src, as that is not supposed to be changed.

else
@monitor_agent = Fluent::MonitorAgentInput.new
end
@monitor_agent = Fluent::Plugin::MonitorAgentInput.new
end

def start
super

@metrics = {

Missing: the high-level value, which is the number of log lines dropped ... can you advise on the formula to calculate this from the metrics below?


I saw this in @logloss, but this is not exposed as a metric (why?)

Contributor:

It's not exposed, as this is a value we should be able to calculate from exposing what is possible to collect and what is actually collected.

Contributor Author:

Restored all changes done in vendored_gem_src, as that is not supposed to be changed.

end

def on_rotate(stat)
@maxfsize=@pe.read_pos

Do you want to take the MAX over all rotations so you get a more accurate value over time? (Assuming this is coming from a fixed value set by conmon.)

Contributor:

We don't care about max file size. We only care about bytes_logged and bytes_collected. I think this code can be simplified to accumulate directly into the prometheus client metrics (like the Go code at https://github.com/alanconway/file-metric/blob/master/cmd/file-metric/main.go) so that the embedded prom client has an up-to-date view whenever there is a pull from prometheus.

Contributor Author:

Took out this change, as maxfsize does not need to be monitored as a metric.

Contributor Author:

@eranra MAX over all rotations is not required, as rotation always happens when fsize > maxsize of the log file.

@@ -519,6 +526,11 @@ def initialize(path, rotate_wait, pe, log, read_from_head, enable_watch_timer, e
attr_accessor :timer_trigger
attr_accessor :line_buffer, :line_buffer_timer_flusher
attr_accessor :unwatched # This is used for removing position entry from PositionFile
attr_accessor :totalbytesavailable # This is used for removing position entry from PositionFile

Fix the copy-pasted remark (to be fixed for all the new variables).

Contributor Author:

Resolved. Please see my earlier comments.

@totalbytesread=0
@totalbytesavailable=0
@inodelastfsize_map={}
@inodereadfsize_map={}

As rotations continue forever this map is growing and growing ... I do not know by how much ... but this feels like a memory leak to me ... we need to see how large this gets in real life and consider limiting it somehow over time.

Contributor:

I agree. But we should come up with a solution instead of allowing it to grow


@jcantrill Maybe aggregation of information from this map every set time for the oldest inodes ... this can be a second phase and not part of this PR from my PoV ... but we should not forget to do it.

Contributor:

I don't think this map is necessary, and the logic can be simpler. See https://github.com/alanconway/file-metric/blob/b0cb7946c3e9dd6b607ef6f5e02218e5c1b33e94/cmd/file-metric/main.go#L69 which is in Go, but the logic can be reproduced in ruby. In fact we don't need the inode, the only thing that matters is when the file size drops, that means we have started a new rotation.
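A minimal Ruby sketch of that idea, mirroring the logic of the linked Go example (the class and method names here are illustrative, not the PR's actual code):

```ruby
# Illustrative sketch: accumulate bytes_logged from successive stat() sizes,
# treating any drop in size as the start of a freshly rotated file.
# No per-inode map is needed; only the size transition matters.
class LoggedBytesTracker
  attr_reader :bytes_logged

  def initialize
    @last_size = 0
    @bytes_logged = 0
  end

  # Call with the current file size on every notification (e.g. from on_notify).
  def observe(size)
    if size < @last_size
      # File shrank: a rotation happened, so the whole new size is new data.
      @bytes_logged += size
    else
      @bytes_logged += size - @last_size
    end
    @last_size = size
  end
end

tracker = LoggedBytesTracker.new
tracker.observe(1_000)    # first observation
tracker.observe(1_500)    # file grew by 500 bytes
tracker.observe(200)      # size dropped: rotation detected, 200 new bytes counted
puts tracker.bytes_logged # => 1700
```

One caveat of this approach: bytes written to the old file between the last observation and the rotation are still missed, which is what the enhancement proposal about counting missed rotations tries to address.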

Contributor Author:

The hash-map-based implementation is taken out to make the implementation more performant. A new write_watcher.rb is introduced to measure totalbytes_logged_in_disk.

@@ -510,6 +510,13 @@ def initialize(path, rotate_wait, pe, log, read_from_head, enable_watch_timer, e
@from_encoding = from_encoding
@encoding = encoding
@open_on_every_update = open_on_every_update
@totalbytesread=0
@totalbytesavailable=0
@inodelastfsize_map={}

Same as next comment

Contributor Author:

@eranra @alanconway resolved this in the new implementation.

if @io_handler && @rotate_handler
@totalbytesread=0
@countonrotate=0
@inodereadfsize_map.each do |k, v|

This might become an expensive loop after some time if the number of rotations is significant (maybe you can consolidate the info into a variable after some time and remove a lot of the data in that map).

Contributor:

So we need to expose metrics for each file instead of exposing a sum?


@jcantrill I think we only need one value ... providing a metric per file is not very useful and not very scalable. But maybe the accumulation can be done in a 'streaming'-like manner instead of doing the entire calculation every time.

Contributor:

I've been learning about prometheus and here's what I think: we have 2 metrics, both counters, bytes_logged and bytes_collected. However those metrics have labels that allow the user to refine the set of log streams they care about. In Go this is called a CounterVec; I'm not sure what the equivalent is in fluentd. See https://github.com/alanconway/file-metric/blob/b0cb7946c3e9dd6b607ef6f5e02218e5c1b33e94/cmd/file-metric/main.go
I need to do some research to figure out what the right set of labels is. In my Go example it's just the file name, but we need to look at other observatorium label schemes and consider our own data model to find a set that makes sense for queries. The cardinality is always going to be per-log-file, but the log file name isn't a very obvious way for users to query.

@@ -854,9 +892,18 @@ def on_notify(stat)
begin
if @inode != inode || fsize < @fsize
@on_rotate.call(stat)
inodelastfsize_map[@inode]=@fsize
@totalbytesavailable=@fsize

What is the purpose of maxfsize if you are using the fsize for logloss computation?

Contributor Author:

maxfsize is taken out now as it is not required to be monitored.

@@ -9,7 +9,7 @@ class PrometheusTailMonitorInput < Fluent::Plugin::Input

helpers :timer

config_param :interval, :time, default: 5
config_param :interval, :time, default: 1

Can you explain why we need to move to a more frequent interval?

Contributor:

If this needs to be changed, the change should be made in the config generated for the plug-in, not here.
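If so, the override might live in the generated collector config rather than in the plugin's default, for example (a hedged sketch; this source block is illustrative, not the actual generated config):

```
<source>
  @type prometheus_tail_monitor
  interval 1
</source>
```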

Contributor Author:

All changes done in vendored_gem_src have been restored, as it is not supposed to be changed when creating a custom plugin. Now all the changes sit in the fluentd/lib/ folder.

@openshift-ci bot (Contributor) commented Feb 15, 2021:

@pmoogi-redhat: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/cluster-logging-operator-e2e
Commit: f96f720
Details: link
Rerun command: /test cluster-logging-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jcantrill (Contributor) left a comment:


These changes will need to be made against the upstream repo. If we intend to carry them here until those changes land, then both of these PRs must be moved to lib.

else
@monitor_agent = Fluent::MonitorAgentInput.new
end
@monitor_agent = Fluent::Plugin::MonitorAgentInput.new
Contributor:

This looks like the original intent of the if was lost


@inodereadfsize_map={}
@countonrotate=0
@maxfsize=0
@logloss=0
Contributor:

This should not be required


@jcantrill (Contributor):
/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2021
@alanconway (Contributor) left a comment:


Along with the embedded comments, the code should be separated into its own class so we have minimal changes to the fluentd in_tail plugin. That will make it easier to upgrade to new versions of in_tail in future.


@pmoogi-redhat (Contributor Author):

Closing this one, as a new PR has been raised with a major revamp of this implementation and of the code placement.

Labels: do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
5 participants