Implement Pagination Support #37
@camrossi do you have any recommendations on what would be good settings? Or do you think it should be defined at the query level?
On the system I tested, the issue got triggered when grabbing certain classes. It is a bit of trial and error, but it seems to me a page size of 1000 works fine for both use cases. Perhaps set a default of 1000 with an option to override it?
@camrossi I have implemented what I believe is a solution to paging, but... just setting …
Hi @thenodon! I made a few tests: on a high-scale fabric we also need to pass an API timeout, or the APIC might stop processing on its side. I just added … This was mostly needed for the … I am still getting a few queries failing, but I think that is due to some other issue... and I am having some trouble with filtering.
Anyhow, the filtering works with …
This will get ~9 pages of data over 110 seconds, but then the APIC refuses the next request with an HTTP 400.
Hi @camrossi - first, the … So, from the … side: when it comes to the problem that you do not get all the pages, I can also reproduce it, but I have to set the page size very low. I ran the query and also got a 400. A 400 means a bad request, but the response does not include any information about what is "bad". Do you have access to any logs on the APIC side that could share some more insight into what the problem relates to? You can run this on a smaller cluster; just change line 44 in aci_connection.go to something small, as long as the response requires more than 10 iterations, like:
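Something along these lines; the constant name is an assumption, so check what actually sits on line 44 of aci_connection.go:

```go
package connection // hypothetical package name

// pageSize drives the paging loop. Shrinking it from the default of 1000
// forces more than 10 page iterations even on a small test fabric.
const pageSize = 10
```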
@camrossi I think I found it. Tested with curl and it worked as expected, like you also said. The only difference is that in aci-exporter I set the Content-Type header.
@camrossi related to the timeout issue: when the aci-exporter creates the HTTP client, the default timeout is 0, which means no timeout. You can define it in config.yml like:
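For example (the keys match the httpclient section of the configuration posted later in this thread; the value is illustrative):

```yaml
httpclient:
  # timeout in seconds for requests against the APIC; 0 (the default) means no timeout
  timeout: 10
```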
If you run from Prometheus, you must also set the Prometheus scrape timeout against the exporter.
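A minimal Prometheus scrape-config sketch; the job name, target, and timings are assumptions, not taken from this thread:

```yaml
scrape_configs:
  - job_name: "aci-exporter"
    # give a large fabric more time than the 10s default before the scrape is aborted
    scrape_interval: 2m
    scrape_timeout: 90s
    static_configs:
      - targets: ["aci-exporter:9643"]
```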
I am not sure if it is the content type… setting it with curl doesn't result in an issue; let me investigate more on my side!
Also, can you dump the full HTTP request (header and body) to the logs? I am not sure how to do it, and I am currently traveling for work so I don't have much time to experiment with code :D
And I can confirm that removing the encoding works, even if it makes no sense...
Alright, I found the real issue! The APIC-cookie keeps getting re-added on each page, so on page 11 there are 11 APIC-cookie values concatenated.
No idea why. I looked at the code but I don't see why the cookiejar would do this, and the APIC is sending …
@camrossi Interesting findings, and they helped me to understand the real problem. It had nothing to do with the Content-Type header…
The default is 1000.
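The duplicated-cookie symptom described above can be reproduced in plain Go; this is an illustration of the header-appending pattern, not the actual aci-exporter code:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, _ := http.NewRequest("GET", "https://apic.example.com/api/class/fvTenant.json", nil)

	// Bug pattern: Add() appends another APIC-cookie value on every page iteration...
	for page := 0; page < 3; page++ {
		req.Header.Add("Cookie", "APIC-cookie=token")
	}
	fmt.Println(req.Header.Values("Cookie")) // three concatenated cookie values

	// ...whereas Set() replaces the value, keeping exactly one cookie per request.
	req.Header.Set("Cookie", "APIC-cookie=token")
	fmt.Println(req.Header.Values("Cookie")) // a single cookie value
}
```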
@thenodon success, it works.... now I am not sure if Prometheus will work. The query I am testing consists of pulling all the static port bindings (…). I can't even fault the APIC too much, as we are pulling static port bindings on a fabric that is pretty much at the highest multi-dimensional scale you can go.
Small note: I had to increase the page size to 10K or it was going to take too long. I think we would eventually want to support setting this per query; the limit is the size (in bytes), so depending on what query you do, 10K is too much or you could go even higher.
@camrossi good to hear that it worked, but I think we need to do some thinking for this large-fabric use case. A scrape time higher than a minute is bad in the context of Prometheus, and I do not see Prometheus managing paging in the scrape logic. That would involve complex discovery, and the exporter would have to expose the page number or page range in the API. And how many pages would have to be scraped?
I think we should test the first alternative first, if you do not have any objection. If my math is not wrong, you did 19 pages in 360 seconds, which would mean each page took approx 19 seconds to complete. So with a parallel approach, the total max time should be roughly x seconds (a single call to get the …
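In other words: 360 s / 19 pages ≈ 19 s per page, so fetching all remaining pages in parallel should bring the wall time down to roughly one count query plus one page fetch, about 2 × 19 ≈ 38 s instead of 360 s (ignoring any contention this creates on the APIC side).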
@camrossi I have implemented the parallel page fetch. It will be interesting to see how that affects the total processing time on your large cluster. On my small cluster I ran …
The fix is in the issue_37 branch.
I will give it a try next time I get access to the cluster.
@camrossi I forgot to mention, but you can disable parallel processing in the config:
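Presumably along these lines, matching the httpclient block from the configuration shown later in this thread:

```yaml
httpclient:
  # enable parallel paging, default is false
  parallel_paging: false
```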
It would be nice to try to find the "sweet spot" for the normal ACI customer with regard to how data is fetched. Or, even better, be able to do some analysis of the cluster on startup and from that set the different configuration parameters for how to process results. I can imagine that the number of pods, spines, and leafs defines the size, but maybe also more "virtual" objects like EPGs etc. Ideas?
@camrossi any feedback?
Hi @ahmedaall, not sure if it will merge without conflicts, but give it a try. I hope to be able to continue to focus on this soon, since for a large fabric it makes so much sense. Looking forward to your feedback and test results. It would be great if you could share the size of the fabric if you do comparison tests.
@thenodon I have 62 switches and 111 EPGs. I confirm that this feature is quite important :). The accuracy of monitoring depends a lot on the response time. I'll keep you informed.
@thenodon Did you get the same output during the Docker build?

```
./aci-api.go:69:26: cannot use *newAciConnection(ctx, fabricConfig) (variable of type AciConnection) as *AciConnection value in struct literal
```
@ahmedaall I am not using Docker when developing. What branch are you on? Have you tried to merge master into branch issue_37? Many things changed between master and branch issue_37; I think it needs more work than just a merge. I will see if I can get some time this week. I have a clear idea of what needs to change, but I will do it from the master branch.
@thenodon Yes, I merged master into issue_37 taking issue_37 as reference. |
@ahmedaall and @camrossi I have now used the current master and created branch pagination_issue_37 with the parallel page handling that was done in branch issue_37. The README.md is updated; search for "Large fabric configuration". Some notes that are important:

* The class used in the query will automatically be added to the final request URL as order-by=<class_name>.dn, so you do not need to add it in the configuration of the query. Without the order-by you will not get a consistent response.
* When using parallel paging, the first request has to be done in a "single" way to get the total number of entities that will be returned. That number is then used to calculate the number of requests to run in parallel, each with the right offset.
* I detected that if the URL query includes rsp-subtree-include=count it will not work in parallel mode, since the response is just the count of items returned. So if that "string" is in the query, a single page request is used. Not sure if there are any more corner cases.

So check out pagination_issue_37 and give it a spin. Looking forward to your feedback and, especially, to what latency decrease it hopefully gives on a larger fabric.
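For reference, this is roughly what a paginated APIC class request looks like with the order-by parameter described above (host and class are placeholders, not from this thread):

```
GET https://<apic>/api/class/fvAEPg.json?order-by=fvAEPg.dn&page-size=1000&page=0
```

Each subsequent page increments the page parameter while keeping the same order-by, which is what makes the responses consistent across pages.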
Thanks Anders!
This seems very promising, I will try to give it a test next week!
Cam
Hi @thenodon, here is my configuration:

```yaml
# Exporter port
port: 9643
# Configuration file name default without postfix
config: config
# The prefix of the metrics
prefix:
fabrics:
  # cisco_sandbox:
  #   username: admin
  #   password: ""
  #   apic:
  #     - https://sandboxapicdc.cisco.com
  MY_FABRIC:
    # Apic username
    username: ACI_USERNAME
    # Apic password.
    password: ACI_PASSWORD
    # The available apic controllers
    # The aci-exporter will use the first apic it can successfully login to, starting with the first in the list
    apic:
      - https://LB-FABRIC
# Http client settings used to access apic
# Below is the default values, where 0 is no timeout
httpclient:
  # insecurehttps: true
  # keepalive: 15
  # timeout: 10
  # this is the max and also the default value
  pagesize: 1000
  # enable parallel paging, default is false
  parallel_paging: true
# Http server settings - this is for the web server aci-exporter expose
# Below is the default values, where 0 is no timeout
httpserver:
  # read_timeout: 0
  # write_timeout: 0
# Define the output format should be in openmetrics format - deprecated from future version after 0.4.0, use below metric_format
#openmetrics: true
metric_format:
  # Output in openmetrics format, default false
  openmetrics: false
  # Transform all label keys to lower case format, default false. E.g. oobMgmtAddr will be oobmgmtaddr
  label_key_to_lower_case: true
  # Transform all label keys to snake case format, default false. E.g. oobMgmtAddr will be oob_mgmt_addr
  label_key_to_snake_case: false
```
@ahmedaall - you should run the branch pagination_issue_37.
@thenodon I misspoke. I do run the pagination_issue_37 branch. Here is my query configuration:

```yaml
class_queries:
  interface_info:
    class_name: l1PhysIf
    query_parameter: "?rsp-subtree=children&rsp-subtree-include=stats&rsp-subtree-class=ethpmPhysIf,eqptIngrBytes5min,eqptEgrBytes5min,eqptIngrDropPkts5min,eqptEgrDropPkts5min&query-target-filter=and(ne( l1PhysIf.adminSt, \"down\"))"
    metrics:
      - name: interface_speed_temp
        value_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        type: gauge
        help: The current operational speed of the interface, in bits per second.
        # value_transform:
        #   'unknown': 0
        #   '100M': 100000000
        #   '1G': 1000000000
        #   '10G': 10000000000
        #   '25G': 25000000000
        #   '40G': 40000000000
        #   '100G': 100000000000
        #   '400G': 400000000000
      - name: interface_admin_state
        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
        value_name: l1PhysIf.attributes.adminSt
        # Type
        type: gauge
        # Help text without prefix of metrics name
        help: The current admin state of the interface.
        value_transform:
          'down': 0 ## ~ disabled interfaces
          'up': 1 ## ~ enabled interfaces
      - name: interface_oper_state
        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
        value_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSt
        # Type
        type: gauge
        # Help text without prefix of metrics name
        help: The current operational state of the interface. (0=unknown, 1=down, 2=up, 3=link-up)
        # A string to float64 transform table of the value
        value_transform:
          "down": 0 ## ~ disabled interfaces
          "up": 1 ## ~ enabled interfaces
    # The labels to extract as regex
    labels:
      # The field in the json used to parse the labels from
      - property_name: l1PhysIf.attributes.dn
        # The regex where the string enclosed in the P<xyz> is the label name
        regex: "^topology/pod-(?P<pod_id>[1-9][0-9]*)/node-(?P<node_id>[1-9][0-9]*)/sys/phys-\\[(?P<interface_name>[^\\]]+)\\]"
      # Add the descr attribute as a label
      - property_name: l1PhysIf.attributes.descr
        regex: "^(?P<interface_description>.*)"
      - property_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        regex: "^(?P<speed_temp>.*)"
  interface_info_more:
    class_name: l1PhysIf
    query_parameter: "?rsp-subtree=children&rsp-subtree-include=stats&rsp-subtree-class=ethpmPhysIf,eqptIngrBytes5min,eqptEgrBytes5min,eqptIngrDropPkts5min,eqptEgrDropPkts5min&query-target-filter=and(ne( l1PhysIf.adminSt, \"down\"))"
    metrics:
      - name: interface_rx_unicast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.unicastCum
        type: counter
        help: The number of unicast bytes received on the interface since it was integrated into the fabric.
        unit: bytes
      - name: interface_rx_multicast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.multicastCum
        type: counter
        unit: bytes
        help: The number of multicast bytes received on the interface since it was integrated into the fabric.
      - name: interface_rx_broadcast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.floodCum
        type: counter
        unit: bytes
        help: The number of broadcast bytes received on the interface since it was integrated into the fabric.
      - name: interface_rx_buffer_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.bufferCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          buffer overrun while receiving since it was integrated into the
          fabric.
      - name: interface_rx_error_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.errorCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          packet error while receiving since it was integrated into the
          fabric.
      - name: interface_rx_forwarding_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.forwardingCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          forwarding issue while receiving since it was integrated into the
          fabric.
      - name: interface_rx_loadbal_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.lbCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          load balancing issue while receiving since it was integrated into
          the fabric.
      - name: interface_tx_unicast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.unicastCum
        type: counter
        help: The number of unicast bytes transmitted on the interface since it was integrated into the fabric.
        unit: bytes
      - name: interface_tx_multicast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.multicastCum
        type: counter
        unit: bytes
        help: The number of multicast bytes transmitted on the interface since it was integrated into the fabric.
      - name: interface_tx_broadcast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.floodCum
        type: counter
        unit: bytes
        help: The number of broadcast bytes transmitted on the interface since it was integrated into the fabric.
      - name: interface_tx_queue_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.afdWredCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface during queue
          management while transmitting since it was integrated into the
          fabric.
      - name: interface_tx_buffer_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.bufferCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          buffer overrun while transmitting since it was integrated into the
          fabric.
      - name: interface_tx_error_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.errorCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          packet error while transmitting since it was integrated into the
          fabric.
    # The labels to extract as regex
    labels:
      # The field in the json used to parse the labels from
      - property_name: l1PhysIf.attributes.dn
        # The regex where the string enclosed in the P<xyz> is the label name
        regex: "^topology/pod-(?P<pod_id>[1-9][0-9]*)/node-(?P<node_id>[1-9][0-9]*)/sys/phys-\\[(?P<interface_name>[^\\]]+)\\]"
      # Add the descr attribute as a label
      - property_name: l1PhysIf.attributes.descr
        regex: "^(?P<interface_description>.*)"
      - property_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        regex: "^(?P<speed_temp>.*)"
```
@ahmedaall sorry, but I have not had the time yet. I hope that @camrossi will also have a chance to test on his mega fabric. I do not see parallel paging having a major impact on a normal-sized fabric: if the payload fits within the max page size, paging will not improve the latency. In the end, the biggest part of the latency is the time spent in the APIC responding to the API call. The suggestion I made in #41 will not directly improve the latency either, but instead of all queries waiting for the one that takes the longest, each separate query will finish without waiting for the query with the largest latency.
@thenodon I have currently been swallowed by a different black hole, but this is still on my to-do list... :)
This issue is now implemented with the release of version 0.8.0, https://github.com/opsdis/aci-exporter/releases/tag/v0.8.0 |
Currently aci-exporter works fine for most configurations, but on a large-scale fabric, if a query returns too many objects, it might hit the maximum response size the APIC can handle and the query will fail. aci-exporter should implement pagination.