Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOG-1213: Metric for inbound log data loss at the collector #2070

Conversation

pmoogi-redhat
Copy link
Contributor

@pmoogi-redhat pmoogi-redhat commented Feb 25, 2021

Description

Currently in_tail plugin doesn't support publishing of inbound logloss - i.e. difference between total bytes written to disk (logfile) and total bytes collected or read by fluentd. This PR got changes in fluentd/lib/fluent/plugin/in_tail.rb, and fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb plugins to enable publishing of the below parameters

  1. totalbytes_logged_in_disk (written to disk per container process)
  2. totalbytes_collected_from_disk (read or collected by fluentd per container process)

/cc @alanconway @jcantrill
/assign @alanconway

/cherry-pick

Links

  • Depending on PR(s):

  • Bugzilla:

  • Github issue:

  • JIRA: https://issues.redhat.com/browse/LOG-1213

  • Enhancement proposal: many rotations getting missed by fluentd, next enhancement proposal is to get fluentd know about all actual rotations done by CRIO/conmon process by reading extra meta data such as . Based on what rotations fluentd could track and which one got missed computing log-loss as [no-of-rotations-those-missed * maxsizeoflogfile + sum over all tracked rotations by fluentd as [totalbytes_logged_in_disk - totalbytes_collected_from_disk]

Copy link
Contributor

@alanconway alanconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest metric names:

  • fluentd_input_status_total_bytes_logged
  • fluentd_input_status_total_bytes_collected
    This fits the pattern of other fluentd metrics. The "on_disk", seems redundant.

The code is hard to review because so much of it is copied from fluentd. Can you subclass the original plugin classes and just override what you need instead of copying the code?

fluentd/lib/in_tail_metric/write_watcher.rb Outdated Show resolved Hide resolved
fluentd/lib/in_tail_metric/write_watcher.rb Outdated Show resolved Hide resolved
fluentd/lib/in_tail_metric/write_watcher.rb Outdated Show resolved Hide resolved
@pmoogi-redhat pmoogi-redhat force-pushed the Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032 branch 2 times, most recently from b74b907 to cdf1430 Compare March 3, 2021 05:30
Copy link
Contributor

@alanconway alanconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some simple clean up comments in line.

We will need a test to prove that this works before we merge it.
I think it would be sufficient to test it on a stand-alone fluentd with a very simple dummy configuration, and use the Go HTTP client to scrape the metrics and validate them.
Check out tests/functional and check with @jcantrill on using his functional framework, it's designed to deploy a stand-alone fluentd without setting up the whole logging system.

fluentd/lib/in_tail_metric/in_tail_metric.rb Outdated Show resolved Hide resolved
fluentd/lib/in_tail_metric/write_watcher.rb Outdated Show resolved Hide resolved
@jcantrill jcantrill changed the title Metric for inbound log data loss at the collector jira ticket log 1032 LOG-1032: Metric for inbound log data loss at the collector Mar 5, 2021
@jcantrill
Copy link
Contributor

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2021
@pmoogi-redhat pmoogi-redhat force-pushed the Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032 branch from 464437e to fda80db Compare March 8, 2021 04:06
@pmoogi-redhat
Copy link
Contributor Author

get_gauge_or_counter is not working as per expectations, throwing some errors hence leaving it to get_gauge type for total_bytes_collected metric. require some help there. have tested the functionality by looking at prometheus dashboard if total_bytes_collected getting published correctly. it is working fine as per my stand-alone test environment.

@pmoogi-redhat pmoogi-redhat force-pushed the Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032 branch from fda80db to 9ed90c9 Compare March 8, 2021 09:42
Copy link
Contributor

@alanconway alanconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there!

@pmoogi-redhat pmoogi-redhat force-pushed the Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032 branch 4 times, most recently from 3abf323 to 5ba119c Compare March 11, 2021 14:47
Copy link
Contributor

@alanconway alanconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of points to clean up, otherwise LGTM.

fluentd/lib/in_tail_metric/write_watcher.rb Outdated Show resolved Hide resolved
@alanconway
Copy link
Contributor

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 22, 2021
@pmoogi-redhat pmoogi-redhat force-pushed the Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032 branch 3 times, most recently from c94553a to 658b5ff Compare April 15, 2021 06:27
…b and in_tail/in_tail.rb in_tail/write_watcher.rb keeping them in /origin-aggregated-logging/fluentd/lib/
@pmoogi-redhat pmoogi-redhat force-pushed the Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032 branch from 658b5ff to c1b2be4 Compare April 15, 2021 08:12
@alanconway
Copy link
Contributor

Final names for the metrics:

log_logged_bytes_total: total number of bytes written to a single log file path, accounting for rotations.
log_collected_bytes_total: total number of bytes collected from a single log file.
Both metrics have the label "path" which is the absolute path to the log file.

@jcantrill jcantrill changed the title LOG-1032: Metric for inbound log data loss at the collector LOG-1213: Metric for inbound log data loss at the collector Apr 16, 2021
@@ -0,0 +1,120 @@
require 'fluent/plugin/input'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be either removed entirely, appear as a patch to vendored_src, are a new file to vendored_src since you cant patch it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcantrill i would like to know where we can keep this file version controlled as a single piece of truth to all changes that are used to create patch over vendored_src baseline implementation of this plugin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to add it as a net new patch to the lib. Additionally, there are examples where we copy a file wholesale into vendor dir before building. This may be a valid choice for now

@@ -0,0 +1,1048 @@
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be either removed entirely, appear as a patch to vendored_src, are a new file to vendored_src since you cant patch it

fluentd/lib/in_tail/write_watcher.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@alanconway alanconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 26, 2021
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, pmoogi-redhat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -0,0 +1,120 @@
require 'fluent/plugin/input'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to add it as a net new patch to the lib. Additionally, there are examples where we copy a file wholesale into vendor dir before building. This may be a valid choice for now

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2021
@openshift-ci-robot
Copy link

New changes are detected. LGTM label has been removed.

Copy link
Contributor Author

@pmoogi-redhat pmoogi-redhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanconway @jcantrill kindly review - added k8 regex based filename parsing.

@alanconway
Copy link
Contributor

@jcantrill @pmoogi-redhat I think we can re-factor this code into a single, fully independent plug-in belonging to us, with no patches to fluentd. That will solve the labelling problems.
The idea is to follow the model of the fluentd prometheus-tail-monitor:
https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb#L1
That plugin scans for registered in-tail plugins and uses inside knowledge of the in-tail implementation to grab the read-position and inode values collected by in-tail. It samples on a timer, so it doesn't see every read - but that doesn't matter. In-tail keeps watching the old inode for a grace period of several seconds after the file rolls over, which gives us time to register the last read position -so we should get a full count.
I've started writing this out, but I'm not sure we'll have time to get it done and tested for 5.1. I'll try to finish toda so @pmoogi-redhat can review & test it tomorrow.

@pmoogi-redhat pmoogi-redhat force-pushed the Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032 branch from 69e1909 to 1357a78 Compare April 30, 2021 04:08
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 30, 2021

@pmoogi-redhat: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/unit 4c81554 link /test unit
ci/prow/elastic-operator-e2e 4c81554 link /test elastic-operator-e2e
ci/prow/images 4c81554 link /test images
ci/prow/cluster-logging-operator-e2e 4c81554 link /test cluster-logging-operator-e2e
ci/prow/clo-functional 4c81554 link /test clo-functional
ci/prow/smoke 4c81554 link /test smoke

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pmoogi-redhat
Copy link
Contributor Author

Separate PR is raised to enable this new metric as a new plugin - new_plugin_Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032. Hence closing this PR which reflects in_tail and fluent-prometheus-plugin changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants