Metric hacky friday #455
nautobot_golden_config/metrics.py
backup_gauges = GaugeMetricFamily(
    "nautobot_gc_backup_total", "Nautobot Golden Config Backups", labels=["seconds", "status"]
)
time_delta_to_include = PLUGIN_SETTINGS.get("metrics", {}).get("time_delta", timedelta(days=1))
Is the last day the most common metric? I would have thought it would just be the status in general.
So what you propose is this?
success_count = GoldenConfig.objects.filter(backup_last_attempt_date=F('backup_last_success_date')).count()
_attempt_count = GoldenConfig.objects.filter(backup_last_attempt_date__isnull=False).count()
failure_count = _attempt_count - success_count
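To make the proposed counting rule concrete, here is a minimal sketch in plain Python, without Django. It assumes each record carries the two date fields used in the query above; `BackupRecord` and `backup_status_counts` are hypothetical names for illustration, not part of the plugin. An SQL equality filter never matches rows where both dates are NULL, which the sketch mirrors by skipping records that were never attempted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass
class BackupRecord:
    """Hypothetical stand-in for the two GoldenConfig date fields."""

    backup_last_attempt_date: Optional[datetime]
    backup_last_success_date: Optional[datetime]


def backup_status_counts(records: Iterable[BackupRecord]) -> Tuple[int, int]:
    """Return (success_count, failure_count) based on each record's last attempt."""
    success_count = 0
    attempt_count = 0
    for record in records:
        attempt = record.backup_last_attempt_date
        if attempt is None:
            # Never attempted: counts as neither success nor failure.
            continue
        attempt_count += 1
        # The last attempt succeeded only if it is also the last success.
        if attempt == record.backup_last_success_date:
            success_count += 1
    return success_count, attempt_count - success_count
```

With this rule, a device whose last attempt is newer than its last success counts as failing, regardless of how many earlier runs succeeded.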
The status of the last Job run is exposed from the capacity_metrics (which will be migrated into Nautobot core); example visualization here. I like exposing the counts of successful/attempted Job runs within a given interval, because those metrics will be stored in Prometheus and you can then query and process them as you like (e.g. visualize one year of data).
We had a discussion and agreed that on the operational side of things the metric that Ken proposed would be easier to handle. The time dimension then comes from the time-series database. We just have a gauge that displays the current number of successful/failing backups/intended/compliance/etc. Does that make sense? The key point here is GoldenConfig object status vs. JobResult object status.
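As a rough illustration of such a gauge, the sketch below renders current success/failure counts per feature in the Prometheus text exposition format, using only the standard library. The metric name `nautobot_gc_feature_status`, the `feature` and `status` labels, and the `render_gauge` helper are assumptions for illustration; the diff under review builds its metrics with `GaugeMetricFamily`, and its actual names and labels may differ.

```python
from typing import Dict, List, Tuple


def render_gauge(name: str, help_text: str, samples: Dict[Tuple[str, str], int]) -> str:
    """Render (feature, status) -> count samples as Prometheus text exposition lines."""
    lines: List[str] = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    # Sort for a stable output order across scrapes.
    for (feature, status), value in sorted(samples.items()):
        lines.append(f'{name}{{feature="{feature}",status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical current counts, e.g. derived from GoldenConfig object status.
current_counts = {
    ("backup", "success"): 3,
    ("backup", "failure"): 1,
}
print(render_gauge("nautobot_gc_feature_status", "Current status per feature", current_counts))
```

Because the gauge only reports the current state, no time window needs to be encoded in the metric itself: each scrape is stored with its timestamp, and history is queried from the time-series database.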
Thanks! And much better said than I could have.
Thank you! I've updated the metrics to reflect your comments. Please check again!
Looks good after the latest commit.
Can you address the merge conflicts? We may get one more release out before adding this, still thinking about it.
We are good in principle here; we are just determining which release.
Any plans to release this?
* POC for Prometheus instrumentation
* Advance POC.
* add compliance job metric
* forgot yield
* add simple metric rules per feature
* added docstring
* clean up stuff
* cleanup
* update invoke
* fix pylint
* fix black
* fix CI for 1.5.3 for metrics
* fix pylint
* fix metrics based on comments
* fix metrics
* update lock file
* fix linters
* update `poetry.lock`
---------
Co-authored-by: Leo Kirchner <leo.kirchner98@gmail.com>
Some basic metrics for Golden Config plugin #453. @Kircheneer helped me and had the baseline for me :)
* metric_gc_jobs exposes the successful vs failed GC-related jobs
* metric_golden_config exposes the number of devices that are configured per GC feature
* metric_compliance exposes the number of compliant vs non-compliant devices per feature
Below are the metrics exposed at /metrics/
Also, I updated a lot of files to pass black, and updated the CI to use Nautobot version 1.5.13 in order to have the metrics functionality.