Prometheus graphs traffic drops regularly #169

Open
jasonyates-thg opened this issue Jul 4, 2023 · 9 comments

Comments

@jasonyates-thg

I'm using gnmic to monitor various Juniper MX devices. I'm subscribing to interface metrics with the sample interval configured as 10s. My output is Prometheus, which is configured to scrape every 20s.

When I graph the data, there is a routine drop in traffic across every device and every interface roughly every 60 minutes. At this point I'm unsure whether it's something the Juniper devices are doing, something gnmic is doing, or where to begin troubleshooting.
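
The subscription is configured along these lines (a simplified sketch; the subscription name and path shown are illustrative, only the 10s sample interval is as described above):

subscriptions:
  juniper-interfaces:
    paths:
      - /interfaces/interface/state/counters
    mode: stream
    stream-mode: sample
    sample-interval: 10s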

Example graphs from 2 different devices, different hardware (MX10k vs MX204) running different Junos software versions.

[Screenshots: traffic graphs from the two devices, 2023-07-04]
@karimra
Collaborator

karimra commented Jul 5, 2023

I would be surprised if this is coming from the routers, given that it's happening on all devices.
It looks like the period is more like 50 minutes.

Could you share the Prometheus output config used for gNMIc? Do you have export-timestamps set to true there?

I would also look at the PromQL expression used in Grafana; I assume it uses the rate() function.
Which interval is used for the rate() function? It's usually recommended to use $__rate_interval.

@jasonyates-thg
Author

I'd agree they are every 50 minutes. My output config is largely the default:

# Configure outputs
outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    event-processors:
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215

My PromQL expression is using irate rather than rate. If I change to rate, the dips in the graph become less severe, but they are still there. The expression is using $__rate_interval.

Example query:
irate(interfaces_interface_state_counters_in_octets{source=~"router1|router2", description!=""}[$__rate_interval])*8
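
For reference, the rate() version of the same query (the one behind the second graph below) is:

rate(interfaces_interface_state_counters_in_octets{source=~"router1|router2", description!=""}[$__rate_interval])*8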

irate:
[screenshot: graph using irate, showing the dips]

rate:
[screenshot: graph using rate, dips less severe]

@hellt
Collaborator

hellt commented Jul 5, 2023

Maybe worth checking the raw collected metrics over the window where the dip is seen, to see what gnmic actually received from the NE.
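
One way to do that, roughly: add a temporary file output next to the Prometheus one and compare the raw event values and timestamps around a dip. A sketch (output name and filename are placeholders; option names as I recall them from the gnmic file output docs, worth double-checking):

outputs:
  debug-file:
    type: file
    filename: /tmp/gnmic-raw-events.json
    format: event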

@karimra
Collaborator

karimra commented Jul 5, 2023

Could you try with export-timestamps: true? Without it, Prometheus assigns the scrape request timestamp to the metrics instead of the device timestamp.
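
With the output config shared above, that would look like this (only the one option added):

outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    export-timestamps: true # <-- new option
    event-processors:
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215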

The second set of graphs shows that the dips are not in sync across devices while maintaining the ~50 minute interval. I would then manually check the values sent by the devices for potentially lower counter values, as @hellt pointed out above.

@jasonyates-thg
Author

So, adding export-timestamps: true fixed the dips in the graph (see below). However, it has also had an adverse impact on the majority of devices: their interface metrics are no longer being recorded.

[Screenshots: graphs after enabling export-timestamps, 2023-07-05]

I can't find any explanation for it. All devices are synced to an NTP time source and have their timezone set to UTC. What's equally interesting is that it only impacted interface metrics; subscriptions to other metrics such as BGP continued to work correctly.

Looking at the metrics exposed by Prometheus, it's reporting the same number of samples scraped, so my only assumption is that Prometheus is dropping them for some reason?

@karimra
Collaborator

karimra commented Jul 5, 2023

Are the timestamps sent by the devices in nanoseconds?
Is the Prometheus server also synced?
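
You could also check Prometheus's own counters for rejected samples, to confirm whether the missing series are being dropped at scrape time (metric names from memory, worth double-checking for your Prometheus version):

# non-zero rates here mean Prometheus is rejecting samples from the scrape
rate(prometheus_target_scrapes_sample_out_of_bounds_total[5m])
rate(prometheus_target_scrapes_sample_out_of_order_total[5m])
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m])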

@jasonyates-thg
Author

They are sent by the device in milliseconds, e.g. timestamp 1688578740000 (≈ 2023-07-05T17:39:00Z).

I confirmed that both the server running gnmic and the server running Prometheus are synced to the same time source, and the reported time between those servers and the Juniper devices is correct.

@jasonyates-thg
Author

jasonyates-thg commented Jul 5, 2023

By way of a visual example: I have 2 routers in this graph, both in the same physical location, both synced to the same NTP time source, and both showing the same time when running show system status. Both are subscribed to by the same gnmic collector (I run a pair using Consul for target registration; Consul shows the target lock is acquired by gnmic01 for both devices).

One continued to work while export-timestamps: true was set, whilst the other did not. Once I removed the setting, the non-working device started working again.

[Screenshot: graphs for the two routers, 2023-07-05]

@karimra
Collaborator

karimra commented Jul 5, 2023

gNMIc's Prometheus output converts timestamps from nanoseconds to milliseconds.
The gNMI notification timestamp should be sent by the device in ns, while Prometheus expects ms.
If a device is actually sending ms, that conversion divides by 1,000,000 and produces timestamps close to the Unix epoch (your example 1688578740000 would land around 1970-01-01T00:28), which Prometheus then rejects as out of bounds.
Can you double-check, using the CLI, which timestamp unit each device uses?

If the device sends timestamps in ms, there is a processor to convert them to ns:

processors:
  ts-ms-to-ns:
    event-starlark:
      source: |
        def apply(*events):
          for e in events:
            # convert the timestamp from ms to ns
            e.timestamp = int(e.timestamp * 1000 * 1000)
          return events
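
Applied to the example timestamp above, 1688578740000 (ms) becomes 1688578740000000000 (ns), which the Prometheus output then converts back down to 1688578740000 ms on export.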

Then add the processor name under the prometheus output:

outputs:
  prometheus:
    type: prometheus
    listen: :9804
    path: /metrics
    event-processors:
      - ts-ms-to-ns # <--new processor
      - interface-descriptions
    service-registration:
      address: 10.249.0.250:8500
      service-address: 10.249.1.215
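
If I remember the docs correctly, event processors run in the order they are listed under the output, so the timestamp conversion is applied before the interface-descriptions processor.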
