addition of new metric #2060
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pmoogi-redhat. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
  'Current max fsize of file on rotation event'),
countonrotate: get_gauge(
  :countonrotate,
  'No of rotation noticed by fluentd'),
s/No/Number
Restored all changes done in vendored_gem_src, as it is not supposed to be changed.
  'Current max fsize of file on rotation event'),
countonrotate: get_gauge(
  :countonrotate,
  'No of rotation noticed by fluentd'),
s/rotation/rotations
Restored all changes done in vendored_gem_src, as it is not supposed to be changed.
  :fluentd_tail_file_inode,
  'Current inode of file.'),
maxfsize: get_gauge(
  :maxfsize,
  'Current max fsize of file on rotation event'),
Maybe you can drop the "current"? If this is the max fsize then it is enough to state that. "Current" might not be a user-facing value.
We don't need to track file size. The interesting metrics are bytes_logged/bytes_collected. max file size is an implementation detail that could change in future (e.g. if we start streaming logs from crio directly via pipes)
Restored all changes done in vendored_gem_src, as it is not supposed to be changed. The final changes no longer publish the maxfsize and countonrotate parameters.
  'totalbytes read by fluentd IOHandler'),
totalbytesavailable: get_gauge(
  :totalbytesavailable,
  'totalbytes available at each instance of rotation - to be read by fluentd IOHandler'),
I am not sure that I understand the term ... maybe something with the words
(After reading the code I understand that this is the potential ... but I am not sure this is useful for customers to be exposed)
I changed the terms in the JIRA, @eranra if you agree these are more expressive we should update the code:
bytes_logged (counter) bytes written to log files by containers.
bytes_collected (counter) bytes read by the collector for forwarding.
Log loss can be computed by a prometheus query expression: (bytes_logged - bytes_collected)
@pmoogi-redhat note the metrics should be counters, not gauges. They always increase.
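The counter-vs-gauge distinction above can be illustrated with a minimal Ruby sketch. MonotonicCounter is purely hypothetical (not the fluent-plugin-prometheus API), and the metric names come from the JIRA discussion, not necessarily the final code: counters only ever increase, and log loss is derived as a difference at query time rather than exposed directly.

```ruby
# Hypothetical sketch: a counter that can only move up, mirroring the
# proposed bytes_logged / bytes_collected semantics.
class MonotonicCounter
  attr_reader :value

  def initialize
    @value = 0
  end

  def increment(by)
    raise ArgumentError, 'counters only increase' if by < 0
    @value += by
  end
end

bytes_logged = MonotonicCounter.new
bytes_collected = MonotonicCounter.new
bytes_logged.increment(1000)    # bytes written to log files by containers
bytes_collected.increment(900)  # bytes read by the collector for forwarding

# Log loss is computed at query time, e.g. the PromQL expression
# (bytes_logged - bytes_collected); the same difference in Ruby:
log_loss = bytes_logged.value - bytes_collected.value
```

Because both series are monotonic, the difference is always well defined and never needs to be stored or reset in the collector itself.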
@alanconway yes ... those terms are simple and make sense :-)
Changed the parameter names from totalbytesavailable and totalbytesread to totalbytes_logged_in_disk and totalbytes_collected_from_disk respectively. Added a separate class implementation that monitors stat to obtain totalbytes_logged_in_disk.
@metrics[:countonrotate].set(label, countonrotate)
@metrics[:totalbytesread].set(label, totalbytesread)
@metrics[:totalbytesavailable].set(label, totalbytesavailable)
#@log.info "IN PROMETHEUS PLUGIN pr.read_inode and pe.read_pos #{pe.read_inode} #{pe.read_pos} maxfsize #{maxfsize} countonrotate #{countonrotate} totalbytesread #{totalbytesread} totalbytesavailable #{totalbytesavailable} "
Consider changing this to log.debug and removing the commented-out line.
Restored all changes done in vendored_gem_src, as that is not supposed to be changed.
else
  @monitor_agent = Fluent::MonitorAgentInput.new
end
@monitor_agent = Fluent::Plugin::MonitorAgentInput.new
end

def start
  super

  @metrics = {
Missing: the high-level value, the number of log lines dropped ... can you advise on the formula to calculate this from the metrics below?
I saw this in @logLoss, but this is not exposed as a metric (why?)
It's not exposed because it is a derived value: we should be able to calculate it from exposing what is possible to collect and what is actually collected.
Restored all changes done in vendored_gem_src, as that is not supposed to be changed.
end

def on_rotate(stat)
  @maxfsize=@pe.read_pos
Do you want to take the MAX over all rotations so you get a more accurate value over time? (assuming this is coming from a fixed value set by conmon)
We don't care about max file size. We only care about bytes_logged and bytes_collected. I think this code can be simplified to accumulate directly into the prometheus client metrics (like the Go code at https://github.com/alanconway/file-metric/blob/master/cmd/file-metric/main.go) so that the embedded prom client has an up-to-date view whenever there is a pull from prometheus.
Removed this change, as maxfsize need not be monitored as a metric.
@eranra MAX over all rotations is not required, as rotation always happens when fsize > maxsize of the log file.
@@ -519,6 +526,11 @@ def initialize(path, rotate_wait, pe, log, read_from_head, enable_watch_timer, e
  attr_accessor :timer_trigger
  attr_accessor :line_buffer, :line_buffer_timer_flusher
  attr_accessor :unwatched # This is used for removing position entry from PositionFile
  attr_accessor :totalbytesavailable # This is used for removing position entry from PositionFile
Fix the comment, which was copied from the line above (to be fixed for all the variables).
Resolved. Please see my earlier comments.
@totalbytesread=0
@totalbytesavailable=0
@inodelastfsize_map={}
@inodereadfsize_map={}
As rotations continue forever this map is growing and growing ... I do not know by how much ... but this feels like a memory leak to me ... we need to see how large this gets in real life and consider limiting it somehow over time.
I agree. But we should come up with a solution instead of allowing it to grow
@jcantrill Maybe aggregation of the information in this map at a set interval for the oldest inodes ... this can be a second phase and not part of this PR from my PoV ... but we should not forget to do it.
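One possible shape for that aggregation, sketched in Ruby under the assumption that preserving the grand total is enough (BoundedInodeSizes is hypothetical, not code from this PR): cap the map and fold the oldest inodes into a single aggregate. Ruby Hashes iterate in insertion order, so the first entry is the oldest.

```ruby
# Hypothetical sketch: a size-bounded inode => bytes map that evicts its
# oldest entries into a running aggregate, so the total stays correct
# while memory use stays bounded.
class BoundedInodeSizes
  def initialize(max_entries)
    @max = max_entries
    @sizes = {}        # inode => bytes (insertion-ordered)
    @aggregated = 0    # bytes folded out of evicted entries
  end

  def record(inode, bytes)
    @sizes[inode] = bytes
    while @sizes.size > @max
      oldest_inode, oldest_bytes = @sizes.first   # oldest insertion
      @sizes.delete(oldest_inode)
      @aggregated += oldest_bytes
    end
  end

  # Grand total is preserved across evictions.
  def total_bytes
    @aggregated + @sizes.values.sum
  end

  def tracked_inodes
    @sizes.size
  end
end
```

The trade-off: once an inode is folded into the aggregate, later growth of that file would be missed, so the cap should be large enough that only fully rotated-away files are evicted.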
I don't think this map is necessary, and the logic can be simpler. See https://github.com/alanconway/file-metric/blob/b0cb7946c3e9dd6b607ef6f5e02218e5c1b33e94/cmd/file-metric/main.go#L69 which is in Go, but the logic can be reproduced in ruby. In fact we don't need the inode, the only thing that matters is when the file size drops, that means we have started a new rotation.
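The size-drop idea above can be sketched directly in Ruby (BytesLoggedTracker is an illustration, not code from this PR or from the linked Go example): whenever the observed file size drops, the previous size is folded into a running total before tracking the new file, so no per-inode map is needed.

```ruby
# Hypothetical sketch: accumulate total bytes written across rotations by
# watching for a drop in file size, which signals a new (rotated) file.
class BytesLoggedTracker
  attr_reader :rotations

  def initialize
    @last_size = 0    # size seen at the previous notification
    @carried = 0      # bytes from files that were rotated away
    @rotations = 0
  end

  # Call on every stat notification with the current file size.
  def on_notify(size)
    if size < @last_size
      # Size dropped: the old file reached @last_size before rotation.
      @carried += @last_size
      @rotations += 1
    end
    @last_size = size
  end

  # Total bytes logged = rotated-away bytes + current file size.
  def bytes_logged
    @carried + @last_size
  end
end
```

The caveat of this simplification: a rotation that is only noticed after the new file has already grown past the old file's final size would be missed, so the stat interval has to be short relative to the rotation rate.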
The hash-map-based implementation is taken out to make the implementation more performant. A new write_watcher.rb is introduced to measure totalbytes_logged_in_disk.
@@ -510,6 +510,13 @@ def initialize(path, rotate_wait, pe, log, read_from_head, enable_watch_timer, e
  @from_encoding = from_encoding
  @encoding = encoding
  @open_on_every_update = open_on_every_update
  @totalbytesread=0
  @totalbytesavailable=0
  @inodelastfsize_map={}
Same as next comment
@eranra @alanconway resolved this in the new implementation.
if @io_handler && @rotate_handler
  @totalbytesread=0
  @countonrotate=0
  @inodereadfsize_map.each do |k, v|
This might become an expensive loop over time if the number of rotations is significant (maybe you can periodically consolidate the info into a variable and remove a lot of the data in that map).
So we need to expose metrics for each file instead of exposing a sum?
@jcantrill I think we want one value ... providing a metric per file is not very useful and not very scalable. But maybe the accumulation can be done in a 'streaming' manner instead of doing the entire calculation every time.
I've been learning about prometheus and here's what I think: we have 2 metrics, counters, bytes_logged and bytes_collected. However those metrics have labels that allow the user to refine the set of log streams they care about. In Go this is called a CounterVec; not sure what it is in fluentd. See https://github.com/alanconway/file-metric/blob/b0cb7946c3e9dd6b607ef6f5e02218e5c1b33e94/cmd/file-metric/main.go
I need to do some research to figure out what the right set of labels is. In my go example it's just the file name, but we need to look at other observatorium label schemes and consider our own data model to find a set that makes sense for queries. The cardinality is always going to be per-log-file, but the log file name isn't a very obvious way for users to query.
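As a rough illustration of the labeled-counter idea (Go's CounterVec), here is a plain-Ruby sketch; the real fluent-plugin-prometheus API differs, so LabeledCounter and all names below are assumptions. Each distinct label set is one time series, which is why the label choice drives cardinality.

```ruby
# Hypothetical sketch: one logical counter fanned out over label sets,
# the Ruby analogue of Go's prometheus.CounterVec.
class LabeledCounter
  def initialize
    @series = Hash.new(0)   # {labels hash => value}
  end

  def increment(labels, by)
    @series[labels] += by
  end

  def get(labels)
    @series[labels]
  end

  # Cardinality = number of distinct label sets seen so far.
  def series_count
    @series.size
  end
end

bytes_collected = LabeledCounter.new
bytes_collected.increment({ path: '/var/log/containers/app-1.log' }, 512)
bytes_collected.increment({ path: '/var/log/containers/app-2.log' }, 256)
bytes_collected.increment({ path: '/var/log/containers/app-1.log' }, 128)
```

A per-file `path` label (as in the Go example) keeps the sum queryable while still letting users slice by stream; richer labels such as namespace or pod would raise cardinality accordingly.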
@@ -854,9 +892,18 @@ def on_notify(stat)
  begin
    if @inode != inode || fsize < @fsize
      @on_rotate.call(stat)
      inodelastfsize_map[@inode]=@fsize
@totalbytesavailable=@fsize
What is the purpose of maxfsize if you are using the fsize for logloss computation?
maxfsize is taken out now as it is not required to be monitored.
@@ -9,7 +9,7 @@ class PrometheusTailMonitorInput < Fluent::Plugin::Input

  helpers :timer

  config_param :interval, :time, default: 5
  config_param :interval, :time, default: 1
Can you explain why we need to move to a more frequent interval?
If this needs to be changed, the change should be made in the config generated for the plugin, not here.
All changes done in vendored_gem_src are restored, as it is not supposed to be changed when creating a custom plugin. All changes now sit in the fluentd/lib/ folder.
@pmoogi-redhat: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
These changes will need to be made against the upstream repo. If we intend to carry them here until those changes land then both these PRs must be moved to lib
else
  @monitor_agent = Fluent::MonitorAgentInput.new
end
@monitor_agent = Fluent::Plugin::MonitorAgentInput.new
This looks like the original intent of the if was lost
@inodereadfsize_map={}
@countonrotate=0
@maxfsize=0
@logloss=0
This should not be required
/hold
Along with the embedded comments, the code should be separated into its own class so we have minimal changes to the fluentd in_tail plugin. That will make it easier to upgrade to new versions of in_tail in future.
Closing this one, as a new PR has been raised with a substantial revamp of this implementation and of the code placement.
Description
Currently the in_tail plugin doesn't support publishing inbound log loss, i.e. the difference between total bytes written to disk (the log file) and total bytes collected or read by fluentd. This PR changes the fluentd/lib/fluent/plugin/in_tail.rb and fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb plugins to enable publishing of the parameters below.
/cc @alanconway @jcantrill
/assign @alanconway
/cherry-pick
Links