addition of new metric #2060

Closed · wants to merge 1 commit

Conversation

@pmoogi-redhat (Contributor) commented Feb 15, 2021

Description

Currently the in_tail plugin doesn't support publishing of inbound log loss, i.e. the difference between the total bytes written to disk (log file) and the total bytes collected or read by fluentd. This PR changes fluentd/lib/fluent/plugin/in_tail.rb and fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb to enable publishing of the parameters below:

  1. totalbytes_logged_in_disk (written to disk at each unique inode level per container process)
  2. totalbytes_collected_from_disk (read or collected by fluentd at each unique inode level per container process)

/cc @alanconway @jcantrill
/assign @alanconway

/cherry-pick

Links

  • Depending on PR(s):
  • Bugzilla:
  • Github issue:
  • JIRA: https://issues.redhat.com/browse/LOG-1032
  • Enhancement proposal: many rotations are getting missed by fluentd. A follow-up enhancement proposal is to make fluentd aware of all rotations actually performed by the CRI-O/conmon process by reading extra metadata such as ... Based on which rotations fluentd tracked and which it missed, log loss would be computed as: (number of missed rotations * max log file size) + the sum, over all rotations tracked by fluentd, of (totalbytes_logged_in_disk - totalbytes_collected_from_disk). See the sketch below.
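As a rough illustration of that proposal, the computation could look like the following sketch (all values and variable names are hypothetical, not taken from the PR):

```ruby
# Illustrative sketch of the proposed log-loss formula; every value here is hypothetical.
max_log_file_size = 10 * 1024 * 1024          # assumed per-file rotation limit enforced by conmon
missed_rotations  = 2                          # rotations fluentd never observed
tracked_rotations = [                          # per-rotation byte counts fluentd did track
  { logged: 10_485_760, collected: 10_000_000 },
  { logged: 9_800_000,  collected: 9_800_000 },
]

log_loss = missed_rotations * max_log_file_size +
           tracked_rotations.sum { |r| r[:logged] - r[:collected] }

puts log_loss   # 2 * 10 MiB assumed lost in missed rotations + 485,760 bytes from tracked ones
```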

@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pmoogi-redhat
To complete the pull request process, please assign alanconway after the PR has been reviewed.
You can assign the PR to them by writing /assign @alanconway in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

'Current max fsize of file on rotation event'),
countonrotate: get_gauge(
:countonrotate,
'No of rotation noticed by fluentd'),

s/No/Number

Contributor Author:

Restored all changes done in vendored_gem_src, as it is not supposed to be changed.

'Current max fsize of file on rotation event'),
countonrotate: get_gauge(
:countonrotate,
'No of rotation noticed by fluentd'),

s/rotation/rotations

Contributor Author:

Restored all changes done in vendored_gem_src, as it is not supposed to be changed.

:fluentd_tail_file_inode,
'Current inode of file.'),
maxfsize: get_gauge(
:maxfsize,
'Current max fsize of file on rotation event'),

Maybe you can drop the "current"? If this is the max fsize then it is enough to state that. "Current" might not be a user-facing value.

Contributor:

We don't need to track file size. The interesting metrics are bytes_logged/bytes_collected. Max file size is an implementation detail that could change in the future (e.g. if we start streaming logs from CRI-O directly via pipes).

Contributor Author:

Restored all changes done in vendored_gem_src, as it is not supposed to be changed. The final changes no longer publish the maxfsize and countonrotate parameters.

'totalbytes read by fluentd IOHandler'),
totalbytesavailable: get_gauge(
:totalbytesavailable,
'totalbytes available at each instance of rotation - to be read by fluentd IOHandler'),
@eranra commented Feb 15, 2021:

I am not sure that I understand the term ... maybe something with the words
(After reading the code I understand that this is the potential ... but I am not sure it is useful to expose this to customers)

Contributor:

I changed the terms in the JIRA; @eranra, if you agree these are more expressive we should update the code:

bytes_logged (counter): bytes written to log files by containers.
bytes_collected (counter): bytes read by the collector for forwarding.

Log loss can be computed by a Prometheus query expression: (bytes_logged - bytes_collected)

@pmoogi-redhat note the metrics should be counters, not gauges. They always increase.
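For example, with the prometheus-client Ruby gem (a minimal sketch only, assuming that gem's keyword-argument API rather than the fluent-plugin-prometheus helpers used in this file; the path label is just an assumption):

```ruby
# Sketch only: registering the two proposed metrics as counters rather than gauges.
require 'prometheus/client'

registry = Prometheus::Client.registry

bytes_logged = registry.counter(
  :bytes_logged,
  docstring: 'Bytes written to log files by containers.',
  labels: [:path])                               # label choice is an assumption

bytes_collected = registry.counter(
  :bytes_collected,
  docstring: 'Bytes read by the collector for forwarding.',
  labels: [:path])

# Counters only ever increase; log loss itself stays a Prometheus query,
# e.g. (bytes_logged - bytes_collected), rather than being computed in the collector.
bytes_logged.increment(by: 4096, labels: { path: '/var/log/containers/app.log' })
bytes_collected.increment(by: 2048, labels: { path: '/var/log/containers/app.log' })
```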


@alanconway yes ... those terms are simple and make sense :-)

Contributor Author:

Changed the parameter names from totalbytesavailable and totalbytesread to totalbytes_logged_in_disk and totalbytes_collected_from_disk respectively. Added a separate class implementation that monitors stat to obtain totalbytes_logged_in_disk.

@metrics[:countonrotate].set(label, countonrotate)
@metrics[:totalbytesread].set(label, totalbytesread)
@metrics[:totalbytesavailable].set(label, totalbytesavailable)
#@log.info "IN PROMETHEUS PLUGIN pr.read_inode and pe.read_pos #{pe.read_inode} #{pe.read_pos} maxfsize #{maxfsize} countonrotate #{countonrotate} totalbytesread #{totalbytesread} totalbytesavailable #{totalbytesavailable} "

Consider changing this to log.debug and removing the commented-out line.
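For example (a sketch only, reusing the variables from the quoted hunk; the surrounding plugin code is assumed):

```ruby
# Sketch: debug-level logging in place of the commented-out @log.info line.
@log.debug "tail monitor: read_inode=#{pe.read_inode} read_pos=#{pe.read_pos} " \
           "totalbytesread=#{totalbytesread} totalbytesavailable=#{totalbytesavailable}"
```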

Contributor Author:

Restored all changes done in vendored_gem_src, as that is not supposed to be changed.

else
@monitor_agent = Fluent::MonitorAgentInput.new
end
@monitor_agent = Fluent::Plugin::MonitorAgentInput.new
end

def start
super

@metrics = {

Missing: the high-level value, which is the number of log lines dropped ... can you advise on the formula to calculate this from the metrics below?


I saw this in @logloss, but this is not exposed as a metric (why?)

Contributor:

It's not exposed, as this is a value we should be able to calculate from exposing what is possible to collect and what is actually collected.

Contributor Author:

Restored all changes done in vendored_gem_src, as that is not supposed to be changed.

end

def on_rotate(stat)
@maxfsize=@pe.read_pos

Do you want to take the MAX over all rotations so you get a more accurate value over time? (Assuming this is coming from a fixed value set by conmon.)

Contributor:

We don't care about max file size. We only care about bytes_logged and bytes_collected. I think this code can be simplified to accumulate directly into the prometheus client metrics (like the Go code at https://github.com/alanconway/file-metric/blob/master/cmd/file-metric/main.go) so that the embedded prom client has an up-to-date view whenever there is a pull from prometheus.

Contributor Author:

Took out this change, as maxfsize does not need to be monitored as a metric.

Contributor Author:

@eranra MAX over all rotations is not required, as rotation always happens when fsize > maxsize of the log file.

@@ -519,6 +526,11 @@ def initialize(path, rotate_wait, pe, log, read_from_head, enable_watch_timer, e
attr_accessor :timer_trigger
attr_accessor :line_buffer, :line_buffer_timer_flusher
attr_accessor :unwatched # This is used for removing position entry from PositionFile
attr_accessor :totalbytesavailable # This is used for removing position entry from PositionFile

Fix the copy-pasted remark (to be fixed for all the new variables).

Contributor Author:

Resolved. Please see my earlier comments.

@totalbytesread=0
@totalbytesavailable=0
@inodelastfsize_map={}
@inodereadfsize_map={}

As rotations continue forever this map is growing and growing ... I do not know by how much ... but this feels like a memory leak to me ... we need to see how large this gets in real life and consider limiting it somehow over time.

Contributor:

I agree. But we should come up with a solution instead of allowing it to grow


@jcantrill Maybe aggregation of information from this map every set time for the oldest inodes ... this can be a second phase and not part of this PR from my PoV ... but we should not forget to do it.

Contributor:

I don't think this map is necessary, and the logic can be simpler. See https://github.com/alanconway/file-metric/blob/b0cb7946c3e9dd6b607ef6f5e02218e5c1b33e94/cmd/file-metric/main.go#L69 which is in Go, but the logic can be reproduced in ruby. In fact we don't need the inode, the only thing that matters is when the file size drops, that means we have started a new rotation.
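A minimal Ruby sketch of that idea, mirroring the logic of the linked Go example (the class and method names here are illustrative, not the PR's actual code):

```ruby
# Illustrative sketch: accumulate bytes_logged from successive stat() sizes,
# treating any drop in size as the start of a freshly rotated file.
# No per-inode map is needed; only the size transition matters.
class LoggedBytesTracker
  attr_reader :bytes_logged

  def initialize
    @last_size = 0
    @bytes_logged = 0
  end

  # Call with the current file size on every notification (e.g. from on_notify).
  def observe(size)
    if size < @last_size
      # File shrank: a rotation happened, so the whole new size is new data.
      @bytes_logged += size
    else
      @bytes_logged += size - @last_size
    end
    @last_size = size
  end
end

tracker = LoggedBytesTracker.new
tracker.observe(1_000)    # first observation
tracker.observe(1_500)    # file grew by 500 bytes
tracker.observe(200)      # size dropped: rotation detected, 200 new bytes counted
puts tracker.bytes_logged # => 1700
```

One caveat of this approach: bytes written to the old file between the last observation and the rotation are still missed, which is what the enhancement proposal about counting missed rotations tries to address.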

Contributor Author:

The hash-map-based implementation is taken out to make the implementation more performant. A new write_watcher.rb is introduced to measure totalbytes_logged_in_disk.

@@ -510,6 +510,13 @@ def initialize(path, rotate_wait, pe, log, read_from_head, enable_watch_timer, e
@from_encoding = from_encoding
@encoding = encoding
@open_on_every_update = open_on_every_update
@totalbytesread=0
@totalbytesavailable=0
@inodelastfsize_map={}

Same as next comment

Contributor Author:

@eranra @alanconway resolved this in the new implementation.

if @io_handler && @rotate_handler
@totalbytesread=0
@countonrotate=0
@inodereadfsize_map.each do |k, v|

This might become an expensive loop after some time if the number of rotations is significant (maybe you can consolidate the info into a variable after some time and remove a lot of the data in that map).

Contributor:

So we need to expose metrics for each file instead of exposing a sum?


@jcantrill I think we only need one value ... providing a metric per file is not very useful and not very scalable. But maybe the accumulation can be done in a 'streaming'-like manner instead of doing the entire calculation every time.

Contributor:

I've been learning about prometheus and here's what I think: we have 2 metrics, both counters, bytes_logged and bytes_collected. However those metrics have labels that allow the user to refine the set of log streams they care about. In Go this is called a CounterVec; I'm not sure what the equivalent is in fluentd. See https://github.com/alanconway/file-metric/blob/b0cb7946c3e9dd6b607ef6f5e02218e5c1b33e94/cmd/file-metric/main.go
I need to do some research to figure out what the right set of labels is. In my Go example it's just the file name, but we need to look at other observatorium label schemes and consider our own data model to find a set that makes sense for queries. The cardinality is always going to be per-log-file, but the log file name isn't a very obvious way for users to query.

@@ -854,9 +892,18 @@ def on_notify(stat)
begin
if @inode != inode || fsize < @fsize
@on_rotate.call(stat)
inodelastfsize_map[@inode]=@fsize
@totalbytesavailable=@fsize

What is the purpose of maxfsize if you are using the fsize for logloss computation?

Contributor Author:

maxfsize is taken out now as it is not required to be monitored.

@@ -9,7 +9,7 @@ class PrometheusTailMonitorInput < Fluent::Plugin::Input

helpers :timer

config_param :interval, :time, default: 5
config_param :interval, :time, default: 1

Can you explain why we need to move to a more frequent interval?

Contributor:

If this needs to be changed, the change should be made in the config generated for the plug-in, not here.
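If so, the override might live in the generated collector config rather than in the plugin's default, for example (a hedged sketch; this source block is illustrative, not the actual generated config):

```
<source>
  @type prometheus_tail_monitor
  interval 1
</source>
```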

Contributor Author:

All changes done in vendored_gem_src have been restored, as it is not supposed to be changed when creating a custom plugin. Now all the changes sit in the fluentd/lib/ folder.

@openshift-ci bot (Contributor) commented Feb 15, 2021:

@pmoogi-redhat: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/cluster-logging-operator-e2e
Commit: f96f720
Details: link
Rerun command: /test cluster-logging-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jcantrill (Contributor) left a comment:


These changes will need to be made against the upstream repo. If we intend to carry them here until those changes land, then both of these PRs must be moved to lib.

else
@monitor_agent = Fluent::MonitorAgentInput.new
end
@monitor_agent = Fluent::Plugin::MonitorAgentInput.new
Contributor:

This looks like the original intent of the if was lost


@inodereadfsize_map={}
@countonrotate=0
@maxfsize=0
@logloss=0
Contributor:

This should not be required


@jcantrill (Contributor):
/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2021
@alanconway (Contributor) left a comment:


Along with the embedded comments, the code should be separated into its own class so we have minimal changes to the fluentd in_tail plugin. That will make it easier to upgrade to new versions of in_tail in future.


@pmoogi-redhat (Contributor Author):

Closing this one, as a new PR has been raised with a major revamp of this implementation and of the code placement.

Labels: do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
5 participants