
Conversation

@jotak (Member) commented Sep 23, 2022

  • Allow prefixing operational metrics (this introduces a global setting
    for metrics, which will probably be extended later)
  • Unify metrics usage across stages: e.g. channel size was tracked only
    in the netflow ingester; stages now share a common struct that generates
    metrics. More metrics now include the "stage" label.
  • Add new gauges for tracking channel sizes
  • Some renaming
  • Updated metrics doc

Some refactoring was needed to make doc generation work with metrics that are not defined as globals (the metric definitions need to be decoupled from their instantiation in the Prometheus registry). The doc is updated to include labels and is sorted alphabetically.

Also, I had to move health & operationalMetrics into the same package to avoid repetitions like operationalmetrics.Metrics, which also meant moving health_test into the pipeline package (to avoid an import cycle).

Breaking changes: these only concern metric names (not the API); see below.
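As an illustration of the prefix setting, here is a minimal sketch of how a prefixed operational metric could be created with the Prometheus client library; `MetricsSettings` and `NewPrefixedGaugeVec` are assumed names for this example, not necessarily the actual code in this PR:

```go
package operational

import "github.com/prometheus/client_golang/prometheus"

// MetricsSettings is a hypothetical global settings struct for operational
// metrics; only the prefix is shown here, but it may be extended later.
type MetricsSettings struct {
	Prefix string
}

// NewPrefixedGaugeVec creates a gauge whose name is prepended with the
// configured prefix (e.g. "netobserv_" when FLP is started from the operator).
func NewPrefixedGaugeVec(settings MetricsSettings, name, help string, labels ...string) *prometheus.GaugeVec {
	return prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: settings.Prefix + name,
		Help: help,
	}, labels)
}
```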

@jotak force-pushed the add-queue-size-gauges branch from 04954c0 to d4157c0 on September 23, 2022 16:41
@jotak (Member, Author) commented Sep 23, 2022

Here are the details of the operational metrics.
The names shown here don't include any prefix. When FLP is started from the operator, it is configured with the "netobserv_" prefix, which is then prepended to all these metric names.
(Note that metrics produced by third parties, e.g. goflow, are not affected by this change.)

| Name | Description | Type | Labels | Change |
|------|-------------|------|--------|--------|
| conntrack_input_records | The total number of input records per classification. | counter | classification | unchanged |
| conntrack_memory_connections | The total number of tracked connections in memory. | gauge | (none) | unchanged |
| conntrack_output_records | The total number of output records. | counter | type | unchanged |
| encode_prom_errors | Total errors during metrics generation | counter | error, metric, key | unchanged |
| ingest_flows_processed | Number of flow logs processed by the ingester | counter | stage | renamed from `ingest_collector_flow_logs_processed`, added `stage` label |
| metrics_processed | Number of metrics processed | counter | stage | renamed from `encode_prom_metrics_processed`, added `stage` label |
| records_written | Number of output records written | counter | stage | renamed from `loki_records_written`, added `stage` label, now also works with the Kafka writer |
| stage_duration_ms | Pipeline stage duration in milliseconds | histogram | stage | unchanged |
| stage_in_queue_size | Pipeline stage input queue size (number of elements in queue) | gauge | stage | renamed from `ingest_collector_queue_length`, generalized to all stages |
| stage_out_queue_size | Pipeline stage output queue size (number of elements in queue) | gauge | stage | new metric |
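As a rough illustration of how the new queue-size gauges relate to channel sizes, here is a hedged sketch; the helper name, channel type, and update interval are assumptions for the example, not the actual implementation:

```go
package operational

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeQueueSize is a hypothetical helper: it periodically reports the
// current length of a stage's channel as a queue-size gauge (such as
// stage_in_queue_size), labeled with the stage name.
func observeQueueSize(gauge *prometheus.GaugeVec, stage string, ch <-chan interface{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		gauge.WithLabelValues(stage).Set(float64(len(ch)))
	}
}
```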

@jotak marked this pull request as ready for review on September 23, 2022 16:58
```go
	}
}

// TODO / FIXME / FIGUREOUT: seems like we have 2 input channels for Loki? (this one, and see also pipeline_builder.go / getStageNode / StageWrite)
```
@jotak (Member, Author) commented:

hey @mariomac / @eranra, what do you think about that? It seems that we have 2 chained channels here for the write_loki input: this one and the one defined in pipeline_builder.go for all stages. Any reason for that? Should we get rid of this one?

@mariomac commented Sep 26, 2022:

This channel seems to act as a buffer to avoid blocking the previous pipeline stage. Probably we'd need to verify that removing the in channel wouldn't affect performance.

When we are able to use Go 1.18, maybe we could upgrade to a newer upstream version of the go-pipes library that allows defining the buffer length between pipeline stages.
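For context, a buffered channel like the one discussed acts as a small queue that decouples a slow consumer from its producer; a minimal, self-contained sketch (not the actual pipeline code, and the buffer size is made up):

```go
package main

import "fmt"

func main() {
	// A buffered channel acts as a small queue between two pipeline stages:
	// the producer can stay ahead of the consumer by up to the buffer
	// capacity, and only blocks once the buffer is full.
	in := make(chan map[string]interface{}, 1000) // buffer size is illustrative

	done := make(chan struct{})
	go func() {
		for record := range in {
			fmt.Println(record) // placeholder for the actual Loki write
		}
		close(done)
	}()

	in <- map[string]interface{}{"SrcAddr": "10.0.0.1"}
	close(in)
	<-done
}
```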

@codecov-commenter commented Sep 27, 2022

Codecov Report

Merging #314 (08d3dbe) into main (d2d178b) will increase coverage by 0.46%.
The diff coverage is 73.72%.

```diff
@@            Coverage Diff             @@
##             main     #314      +/-   ##
==========================================
+ Coverage   69.65%   70.11%   +0.46%
==========================================
  Files          82       84       +2
  Lines        4729     4832     +103
==========================================
+ Hits         3294     3388      +94
+ Misses       1240     1229      -11
- Partials      195      215      +20
```

| Flag | Coverage Δ |
|------|------------|
| unittests | 70.11% <73.72%> (+0.46%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| cmd/confgenerator/main.go | 0.00% <0.00%> (ø) |
| cmd/flowlogs-pipeline/main.go | 0.00% <0.00%> (ø) |
| cmd/operationalmetricstodoc/main.go | 0.00% <0.00%> (ø) |
| pkg/confgen/config.go | 33.33% <ø> (ø) |
| pkg/config/config.go | 50.00% <10.00%> (-13.64%) ⬇️ |
| pkg/operational/metrics.go | 66.01% <66.01%> (ø) |
| pkg/pipeline/pipeline_builder.go | 70.03% <70.58%> (-0.98%) ⬇️ |
| pkg/pipeline/write/write_loki.go | 70.43% <75.00%> (-0.39%) ⬇️ |
| pkg/pipeline/encode/encode_prom.go | 77.82% <80.00%> (+0.45%) ⬆️ |
| pkg/pipeline/encode/encode_kafka.go | 69.86% <83.33%> (+0.84%) ⬆️ |
| ... and 18 more | |


```go
if val, found := flow[info.Filter.Key]; found {
	sVal, ok := val.(string)
	if !ok {
		sVal = fmt.Sprintf("%v", val)
```


Nit: maybe fmt.Sprint(val) is slightly faster.
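For reference, a tiny self-contained illustration of the suggested substitution (the value here is made up):

```go
package main

import "fmt"

func main() {
	var val interface{} = 42
	// fmt.Sprint formats the value without parsing a format string, so it can
	// be marginally cheaper than fmt.Sprintf("%v", val) while producing the
	// same output for a single argument.
	sVal, ok := val.(string)
	if !ok {
		sVal = fmt.Sprint(val)
	}
	fmt.Println(sVal) // prints "42"
}
```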

@jotak (Member, Author):

done

@jotak force-pushed the add-queue-size-gauges branch from 08d3dbe to 82dd179 on September 27, 2022 09:24
@jotak added the breaking-change label (This pull request has breaking changes. They should be described in the PR description.) on Sep 27, 2022
@jotak merged commit 0949fbe into netobserv:main on Sep 27, 2022
Comment on lines +30 to +35
```go
type MetricDefinition struct {
	Name   string
	Help   string
	Type   metricType
	Labels []string
}
```
A contributor commented:

@jotak IIUC, Type metricType is only used by the auto-documentation.
So, theoretically, one could create a MetricDefinition of type Counter and pass it to NewGauge(), causing misleading documentation.
If I'm correct, is it worth adding a check in NewGauge()/NewCounter()/NewHistogram() to validate the type?

@jotak (Member, Author):

that's correct, sure we can do what you suggest
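A hedged sketch of the kind of check being suggested, building on the MetricDefinition struct quoted above; it assumes metricType is a string-backed type and that a constant like TypeGauge exists (both are assumptions for this example, and "fmt" must be imported):

```go
// TypeGauge is a hypothetical constant of the metricType type quoted above;
// the actual constant names may differ.
const TypeGauge metricType = "gauge"

// verifyType fails fast when a MetricDefinition declared with one type is
// instantiated as another, which would otherwise only surface as misleading
// auto-generated documentation.
func verifyType(def MetricDefinition, expected metricType) error {
	if def.Type != expected {
		return fmt.Errorf("metric %q is declared as %v but instantiated as %v", def.Name, def.Type, expected)
	}
	return nil
}

// NewGauge, NewCounter and NewHistogram would each call verifyType with
// their respective expected type before creating and registering the metric.
```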
