
Implement Pagination Support #37

Closed
camrossi opened this issue Sep 28, 2023 · 32 comments

@camrossi
Contributor

Currently aci-exporter works fine for most configurations, but on a large-scale fabric, if a query returns too many objects it might hit the maximum response size the APIC can handle and the query will fail.

aci-exporter should implement pagination

@thenodon
Member

@camrossi do you have any recommendations on what would be good settings? Or do you think it should be defined at the query level?

@camrossi
Contributor Author

camrossi commented Sep 28, 2023

On the system I tested, the issue was triggered when grabbing vlanCktEp as well as fvAEPg with child objects included.

It is a bit of trial and error, but a page size of 1000 seems to work fine for both use cases.
I also doubt anyone will have a fabric that big... I tested against one of the fabrics we use for scale testing, which has something like 15K EPGs, 750K static port bindings and 500 switches.

Perhaps set a default of 1000 with an option to override it?

@thenodon
Member

thenodon commented Oct 7, 2023

@camrossi I have implemented what I believe is a solution to paging, but just setting page and page-size and iterating over the result until all pages are consumed is not enough. For this to work you also need the order-by query parameter to get a consistent response; otherwise the final result randomly includes duplicates and misses objects. Currently the order-by is set implicitly in the doGet method, but it assumes that the class has a dn attribute. I am not sure dn is an attribute on all classes, but it works for all the test queries I have. So the query parameter added is order-by=<class_name of the query>.dn.
Would be great if you can test it on your large cluster. Check out branch issue_37. The page size is hard-coded to 1000, but it can be made configurable. I am not sure if the order-by has performance implications, but maybe you know.
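
For reference, a minimal sketch of how these paging parameters can be appended to a class query URL (hypothetical helper and names, not the actual doGet code):

package paging

import (
	"fmt"
	"net/url"
	"strconv"
)

// pagedURL builds an APIC class query URL with the paging parameters discussed
// above: page, page-size and an order-by on the class dn attribute, which keeps
// the page contents stable across requests (assuming the class has a dn attribute).
func pagedURL(apic, className, extraQuery string, page, pageSize int) string {
	v, _ := url.ParseQuery(extraQuery) // query parameters already defined on the query
	v.Set("page", strconv.Itoa(page))
	v.Set("page-size", strconv.Itoa(pageSize))
	v.Set("order-by", className+".dn")
	return fmt.Sprintf("%s/api/class/%s.json?%s", apic, className, v.Encode())
}

For example, pagedURL("https://apic1", "fvAEPg", "rsp-subtree=children", 0, 1000) yields .../api/class/fvAEPg.json?order-by=fvAEPg.dn&page=0&page-size=1000&rsp-subtree=children.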

@camrossi
Contributor Author

camrossi commented Oct 8, 2023

Hi @thenodon!

I made a few tests: on a high-scale fabric we also need to pass an API timeout, or the APIC might stop processing on its side. I just added &timeout=180 where you set page and page-size. 180 is the timeout we use in the APIC UI as well, so it should be a safe bet.

This was mostly needed for the count queries, but it does not hurt to have it set for all the queries.

I am still getting a few queries failing, but I think that is due to some other issue... and I am having some trouble using the
queries=xxx syntax to execute only some queries.

If I do &queries=compound_queries it doesn't execute any queries; same if I do &queries=object_count. Do you support filtering like that?

@camrossi
Contributor Author

camrossi commented Oct 8, 2023

Anyhow, the filtering works with class_queries, so I managed to run this test and focus on a single request that is failing:

curl -s 'http://localhost:9643/probe?target=scale2-fabric&queries=epg_port_vlan_binding'

This will get ~9 pages of data over 110 seconds, but then the APIC will refuse the next request with an HTTP 400.
There are ~20 pages to be retrieved, and what I am not really clear on at the moment is why this is happening... I did the exact same thing with curl, making 20 requests (without re-authenticating), and it worked just fine.

@thenodon
Member

thenodon commented Oct 9, 2023

Hi @camrossi - first, the queries= query parameter should be set to the name of the "query" in the config file. A query is always one of the types:

  • class_queries
  • compound_queries
  • group_class_queries

So from the example_config.yml, the query interface_info is a class_queries. You cannot set the query parameter queries= to one of the type names, only to a named query. I am not sure if this is clear in the documentation.

When it comes to the problem that you do not get all the pages, I can also reproduce it. I have to set the page-size very low. I ran the query interface_info, which returns 98 entries (totalCount=98). I set the page-size to 3, which means I have to iterate 33 times to get all entries. But what happens is that after 10 successful requests, the 11th gets a 400. So I tried to start at an offset other than 0. I get the same result setting the offset to 10, page=10: when requesting page=20 a 400 is returned. So only 10 requests are successful, independent of the page-size. I also tried removing the order-by, but it did not make any difference.
So it does not matter what offset I start from (page=X); only 10 requests are successful and the next one gives a 400.

A 400 means a bad request, but the response does not include any information about what is "bad". Do you have access to any logs on the APIC side that could give some more insight into what the problem relates to? You can reproduce this on a smaller cluster; just change line 44 in aci_connection.go to something small, as long as the response requires more than 10 iterations, like:
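// page size used for each paged request; set it low to force many page iterations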
var pageCount uint64 = 3

@thenodon
Member

thenodon commented Oct 9, 2023

@camrossi I think I found it. I tested with curl and it worked as expected, like you also said. The only difference is that in aci-exporter I set the header Content-Type: application/json, and when I ran with curl I did not set that header. I removed the header in aci-exporter and it works! I cannot understand why this problem only happens on call number 11 in the iteration. It would still be great if you have access to logs to see what the issue was on the APIC side.
You can do a pull on the issue_37 branch and test. I have also added to the logs the number of "pages" that were collected, based on the page-size.
INFO[0013] api call fabric class=ethpmPhysIf exec_time=286413 fabric=cam length=119622 method=GET pages=3 requestid=2WWFkMYD5BcK4CWV1B5J4FeA3Rk status=0 uri="https://173.36.219.190/api/class/ethpmPhysIf.json"

@thenodon
Member

thenodon commented Oct 9, 2023

@camrossi regarding the timeout issue: when the aci-exporter creates the HTTP client the default timeout is 0, which means no timeout. You can define it in the config.yml like:

httpclient:
  insecurehttps: true
  # Default is 15 seconds
  keepalive: 15
  # Zero means no timeout
  timeout: 0

If you run from Prometheus you must define the Prometheus scrape timeout for the exporter.

@camrossi
Contributor Author

camrossi commented Oct 9, 2023 via email

@camrossi
Contributor Author

camrossi commented Oct 10, 2023

Alright, I found the real issue! The APIC-cookie keeps getting re-added on each page, so on page 11 there are 11 APIC-cookie values concatenated.
I found this out by making the HTTP requests and taking a sniffer trace:

Cookie: APIC-cookie=<DATA>; APIC-cookie=<DATA>; APIC-cookie=<DATA>
etc....

No idea why; I looked at the code and I don't see why the cookiejar would do this, and the APIC sends Set-Cookie: APIC-cookie= only on the login method.
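
As an illustration of how this kind of duplication can happen (a sketch only, not the aci-exporter header code): if the same header set is reused across page requests and the cookie is added with Add instead of Set, every iteration appends another APIC-cookie value.

package main

import (
	"fmt"
	"net/http"
)

func main() {
	header := http.Header{}
	for page := 0; page < 3; page++ {
		// Add appends a value; after three pages the Cookie header holds three copies.
		header.Add("Cookie", "APIC-cookie=<DATA>")
	}
	fmt.Println(header.Values("Cookie")) // [APIC-cookie=<DATA> APIC-cookie=<DATA> APIC-cookie=<DATA>]

	// Set replaces any existing value, so the cookie is only sent once per request.
	header.Set("Cookie", "APIC-cookie=<DATA>")
	fmt.Println(header.Values("Cookie")) // [APIC-cookie=<DATA>]
}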

@thenodon
Member

thenodon commented Oct 10, 2023

@camrossi Interesting findings, and they helped me understand the real problem. It had nothing to do with whether the header Content-Type: application/json was set or not. It was my header logic that was buggy.
I have now, hopefully, fixed this and tested with different page-sizes and with the original Content-Type header set. The page-size can be set in the config as:

httpclient:
  pagesize: 100
  ....

The default is 1000.
Looking forward to your verification on your monster cluster. The branch is still issue_37.

@camrossi
Contributor Author

@thenodon success, it works... but now I am not sure if Prometheus will work. The query I am testing consists of pulling all the static port bindings (765327), and that takes 6 minutes. If that is too long, would it make sense to return data to Prometheus on a per-page basis? I did pull 19 pages in those 6 minutes.

I can't even fault the APIC too much, as we are pulling static port bindings on a fabric that is pretty much at the highest multi-dimensional scale you can go.

Small note: I had to increase the page size to 10K or it was going to take too long. I think we would eventually want to support setting this per query; the limit is the size (in bytes), so depending on the query, 10K may be too much or you could go even higher.

@thenodon
Member

thenodon commented Oct 13, 2023

@camrossi good to hear that it worked, but I think we need to do some thinking for this large fabric use case. A scrape time higher than a minute is bad in the context of Prometheus, and I do not see Prometheus managing paging in the scrape logic. That would involve complex discovery, and the exporter would have to expose a page number or page range in the API. And how many pages would have to be scraped?
I think the way forward is one of these alternatives:

  • Make every page fetch parallel. Currently they are done sequentially. The question is, does this burst of API calls become a resource problem for the APIC? This should be fairly simple to implement and I think we should try it (see the sketch below).
  • Do async fetching of metrics and cache the result. That could be implemented in the aci-exporter or as an external proxy cache. I have developed a number of exporters that use this pattern, where the exporter collects data on its own "scrape" interval and caches it, and Prometheus can scrape every minute, but the metrics in Prometheus will stay the same during the exporter's scrape/fetch interval. This means lower resolution of the metrics in Prometheus, but that may be okay. This is often the scenario for streaming metrics, where the device/platform pushes its metrics every X minutes.

I think we should test the first alternative first, if you do not have any objection. If my math is not wrong, you did 19 pages in 360 sec, which would mean that each page took approx 19 sec to complete. So with a parallel approach, the total time should be in the range of x sec (a single call to get the totalCount using a page size of 1) + 19 sec (the remaining pages in parallel). The question is how it affects the APIC.
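
A minimal sketch of the parallel alternative (assumed helper names, not the aci-exporter implementation): after an initial request has returned totalCount, the remaining pages are fetched concurrently and merged in page order.

package paging

import (
	"context"
	"encoding/json"
	"sync"
)

// fetchAllPages fetches every page concurrently. fetchPage is a hypothetical
// helper that performs one APIC request for the given page and returns its
// imdata entries.
func fetchAllPages(ctx context.Context, totalCount, pageSize int,
	fetchPage func(ctx context.Context, page int) ([]json.RawMessage, error)) ([]json.RawMessage, error) {

	pages := (totalCount + pageSize - 1) / pageSize // ceiling division
	results := make([][]json.RawMessage, pages)
	errs := make([]error, pages)

	var wg sync.WaitGroup
	for page := 0; page < pages; page++ {
		wg.Add(1)
		go func(page int) {
			defer wg.Done()
			results[page], errs[page] = fetchPage(ctx, page)
		}(page)
	}
	wg.Wait()

	var all []json.RawMessage
	for page := 0; page < pages; page++ {
		if errs[page] != nil {
			return nil, errs[page]
		}
		all = append(all, results[page]...)
	}
	return all, nil
}

With ~19 sec per page, the total time then approaches that of a single page plus the initial totalCount call, at the cost of a burst of concurrent requests against the APIC.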

@thenodon
Member

@camrossi I have implemented the parallel page fetch. It will be interesting to see how it affects the total processing time on your large cluster. On my small cluster I ran interface_info from the example_config.yml with pagesize=1:

  • sequential - ~24 sec
  • parallel - ~3 sec

The fix is in the issue_37 branch

@camrossi
Contributor Author

I will give it a try next time I get access to the cluster.
Anyway this is not super urgent :D

@thenodon
Member

@camrossi I forgot to mention it, but you can disable parallel processing in the config:

httpclient:
  # Below are the default values
  pagesize: 1000
  # If false paging will be done sequentially for the query
  parallel_paging: true

It would be nice to try to find the "sweet spot" for the normal ACI customer with regard to how data is fetched. Or, even better, be able to do some analysis of the cluster on startup and from that set the different configuration parameters for how to process results. I can imagine that the number of pods, spines and leafs defines the size, but maybe also more "virtual" objects like EPGs etc. Ideas?

@thenodon
Member

@camrossi any feedback?

@ahmedaall

Hi @thenodon, @camrossi,
I can test the parallel page fetch with my fabric. Today I am up to date with the master branch. I will merge with issue_37 and give you feedback.

@thenodon
Member

Hi @ahmedaall, I am not sure if it will merge without conflicts, but give it a try. I hope to be able to continue focusing on this soon, since for a large fabric it makes so much sense. Looking forward to your feedback and test results. It would be great if you can share the size of the fabric if you do comparative tests.

@ahmedaall

ahmedaall commented Jan 11, 2024

@thenodon I have 62 switches and 111 EPGs. I confirm that this feature is quite important :). The accuracy of the monitoring depends a lot on the response time. I'll keep you informed.

@ahmedaall

@thenodon Did you get the same output during the Docker build:

./aci-api.go:69:26: cannot use *newAciConnection(ctx, fabricConfig) (variable of type AciConnection) as *AciConnection value in struct literal
./aci-connection.go:91:2: jar declared and not used
./aci-connection.go:91:12: undefined: cookiejar
./aci-connection.go:117:21: undefined: responseTime
./aci-connection.go:120:2: undefined: connectionCache
./aci-connection.go:121:9: undefined: connectionCache
./aci-connection.go:127:17: c.tokenProcessing undefined (type *AciConnection has no field or method tokenProcessing)
./aci-connection.go:137:4: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:138:10: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:171:4: c.token undefined (type *AciConnection has no field or method token)
./aci-connection.go:171:4: too many errors
ERROR: process "/bin/sh -c go get -u && go mod tidy && go build -o aci-exporter *.go" did not complete successfully: exit code: 1
RUN go get -u && go mod tidy && go build -o aci-exporter *.go:
./aci-connection.go:91:2: jar declared and not used
./aci-connection.go:91:12: undefined: cookiejar
./aci-connection.go:117:21: undefined: responseTime
./aci-connection.go:120:2: undefined: connectionCache
./aci-connection.go:121:9: undefined: connectionCache
./aci-connection.go:127:17: c.tokenProcessing undefined (type *AciConnection has no field or method tokenProcessing)
./aci-connection.go:137:4: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:138:10: c.tokenMutex undefined (type *AciConnection has no field or method tokenMutex)
./aci-connection.go:171:4: c.token undefined (type *AciConnection has no field or method token)
./aci-connection.go:171:4: too many errors

@thenodon
Member

@ahmedaall I am not using Docker when I develop. What branch are you on? Have you tried to merge master into branch issue_37? Many things changed between master and branch issue_37, so I think it needs more work than just a merge. I will see if I get some time this week. I have a clear idea of what needs to change, but I will do it from the master branch.

@ahmedaall

@thenodon Yes, I merged master into issue_37, taking issue_37 as the reference.
OK, I'm going to roll back to the master branch while the changes are made.

@thenodon
Member

@ahmedaall and @camrossi I have now used the current master and created branch pagination_issue_37 with the parallel page handling that was done in branch issue_37. The README.md is updated; search for Large fabric configuration. Some important notes:

  • The class used in the query will automatically be added to the final request URL as order-by=<class_name>.dn, so you do not need to add it in the configuration of the query. Without the order-by you will not get a consistent response.
  • When using parallel paging, the first request has to be done in a "single" way to get the total number of entities that will be returned. That number is then used to calculate the number of requests to run in parallel with the right offsets (see the sketch after these notes).
  • I detected that if the URL query includes rsp-subtree-include=count it will not work in parallel mode, since the result is the count of items. So if that "string" is in the query, a single page request is used. I am not sure if there are any more corner cases.

So check out pagination_issue_37 and give it a spin. Looking forward to your feedback, and especially to what latency decrease it hopefully gives on a larger fabric.
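
To make the notes above concrete, a rough sketch (assumed names, not the code in pagination_issue_37) of the decision: read totalCount from the initial request, derive the number of page requests, and fall back to a single request when the query contains rsp-subtree-include=count.

package paging

import "strings"

// planPages returns how many page requests are needed for a query and whether
// they may run in parallel, following the notes above.
func planPages(queryParameter string, totalCount, pageSize int, parallelPaging bool) (pages int, parallel bool) {
	// Queries with rsp-subtree-include=count return a count object rather than
	// the objects themselves, so they are handled as a single request.
	if strings.Contains(queryParameter, "rsp-subtree-include=count") {
		return 1, false
	}
	pages = (totalCount + pageSize - 1) / pageSize // ceiling division
	return pages, parallelPaging && pages > 1
}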

@camrossi
Contributor Author

camrossi commented Jan 12, 2024 via email

@ahmedaall

Hi @thenodon
I managed to successfully merge pagination_issue_37.
With the default parameters the scraping time doesn't seem to change.
Maybe I missed something. Here is my config:

 # Exporter port
port: 9643
# Configuration file name default without postfix
config: config
# The prefix of the metrics
prefix: 

fabrics:
#  cisco_sandbox:
#    username: admin
#    password: ""
#    apic:
#      - https://sandboxapicdc.cisco.com

  MY_FABRIC:
    # Apic username
    username: ACI_USERNAME
    # Apic password.
    password: ACI_PASSWORD
  #   # The available apic controllers
  #   # The aci-exporter will use the first apic it can successfully login to, starting with the first in the list
    apic:
      - https://LB-FABRIC

# Http client settings used to access apic
# Below is the default values, where 0 is no timeout
httpclient:
#  insecurehttps: true
#  keepalive: 15
#  timeout: 10
  # this is the max and also the default value
  pagesize: 1000
  # enable parallel paging, default is false
  parallel_paging: true
  
# Http server settings - this is for the web server aci-exporter expose
# Below is the default values, where 0 is no timeout
httpserver:
#  read_timeout: 0
#  write_timeout: 0

# Define the output format should be in openmetrics format - deprecated from future version after 0.4.0, use below metric_format
#openmetrics: true
metric_format:
  # Output in openmetrics format, default false
  openmetrics: false
  # Transform all label keys to lower case format, default false. E.g. oobMgmtAddr will be oobmgmtaddr
  label_key_to_lower_case: true
  # Transform all label keys to snake case format, default false. E.g. oobMgmtAddr will be oob_mgmt_addr
  label_key_to_snake_case: false

@thenodon
Member

thenodon commented Jan 16, 2024

@ahmedaall - you should run the branch pagination_issue_37; you do not have to merge. Are you sure you got the latest changes to that branch? Which queries do you run?

@ahmedaall

ahmedaall commented Jan 17, 2024

@thenodon I misspoke. I run pagination_issue_37 with my custom Dockerfile and my custom YAML config.
Here is my main query:

class_queries:
  interface_info:
    class_name: l1PhysIf
    query_parameter: "?rsp-subtree=children&rsp-subtree-include=stats&rsp-subtree-class=ethpmPhysIf,eqptIngrBytes5min,eqptEgrBytes5min,eqptIngrDropPkts5min,eqptEgrDropPkts5min&query-target-filter=and(ne( l1PhysIf.adminSt, \"down\"))"
    metrics:
      - name: interface_speed_temp
        value_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        type: gauge
        help: The current operational speed of the interface, in bits per second.
      #        value_transform:
      #          'unknown': 0
      #          '100M': 100000000
      #          '1G': 1000000000
      #          '10G': 10000000000
      #          '25G': 25000000000
      #          '40G': 40000000000
      #          '100G': 100000000000
      #          '400G': 400000000000

      - name: interface_admin_state
        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
        value_name: l1PhysIf.attributes.adminSt
        # Type
        type: gauge
        # Help text without prefix of metrics name
        help: The current admin state of the interface.
        value_transform:
          'down':               0       ## ~ disabled interfaces
          'up':                 1       ## ~ enabled interfaces

      - name: interface_oper_state
        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
        value_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSt
        # Type
        type: gauge
        # Help text without prefix of metrics name
        help: The current operational state of the interface. (0=unknown, 1=down, 2=up, 3=link-up)
        # A string to float64 transform table of the value
        value_transform:
          "down":               0       ## ~ disabled interfaces
          "up":                 1       ## ~ enabled interfaces

    # The labels to extract as regex
    labels:
      # The field in the json used to parse the labels from
      - property_name: l1PhysIf.attributes.dn
        # The regex where the string enclosed in the P<xyz> is the label name
        regex: "^topology/pod-(?P<pod_id>[1-9][0-9]*)/node-(?P<node_id>[1-9][0-9]*)/sys/phys-\\[(?P<interface_name>[^\\]]+)\\]"
        # Add the descr attribute to the fields
      - property_name: l1PhysIf.attributes.descr
        regex: "^(?P<interface_description>.*)"
      - property_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        regex: "^(?P<speed_temp>.*)"

  interface_info_more:
    class_name: l1PhysIf
    query_parameter: "?rsp-subtree=children&rsp-subtree-include=stats&rsp-subtree-class=ethpmPhysIf,eqptIngrBytes5min,eqptEgrBytes5min,eqptIngrDropPkts5min,eqptEgrDropPkts5min&query-target-filter=and(ne( l1PhysIf.adminSt, \"down\"))"
    metrics:
      - name: interface_rx_unicast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.unicastCum
        type: counter
        help: The number of unicast bytes received on the interface since it was integrated into the fabric.
        unit: bytes

      - name: interface_rx_multicast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.multicastCum
        type: counter
        unit: bytes
        help: The number of multicast bytes received on the interface since it was integrated into the fabric.

      - name: interface_rx_broadcast
        value_name: l1PhysIf.children.[eqptIngrBytes5min].attributes.floodCum
        type: counter
        unit: bytes
        help: The number of broadcast bytes received on the interface since it was integrated into the fabric.

      - name: interface_rx_buffer_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.bufferCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          buffer overrun while receiving since it was integrated into the
          fabric.
      - name: interface_rx_error_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.errorCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          packet error while receiving since it was integrated into the
          fabric.
      - name: interface_rx_forwarding_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.forwardingCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          forwarding issue while receiving since it was integrated into the
          fabric.
      - name: interface_rx_loadbal_dropped
        value_name: l1PhysIf.children.[eqptIngrDropPkts5min].attributes.lbCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          load balancing issue while receiving since it was integrated into
          the fabric.

      - name: interface_tx_unicast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.unicastCum
        type: counter
        help: The number of unicast bytes transmitted on the interface since it was integrated into the fabric.
        unit: bytes

      - name: interface_tx_multicast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.multicastCum
        type: counter
        unit: bytes
        help: The number of multicast bytes transmitted on the interface since it was integrated into the fabric.

      - name: interface_tx_broadcast
        value_name: l1PhysIf.children.[eqptEgrBytes5min].attributes.floodCum
        type: counter
        unit: bytes
        help: The number of broadcast bytes transmitted on the interface since it was integrated into the fabric.

      - name: interface_tx_queue_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.afdWredCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface during queue
          management while transmitting since it was integrated into the
          fabric.

      - name: interface_tx_buffer_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.bufferCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          buffer overrun while transmitting since it was integrated into the
          fabric.

      - name: interface_tx_error_dropped
        value_name: l1PhysIf.children.[eqptEgrDropPkts5min].attributes.errorCum
        type: counter
        unit: pkts
        help: The number of packets dropped by the interface due to a
          packet error while transmitting since it was integrated into the
          fabric.

    # The labels to extract as regex
    labels:
      # The field in the json used to parse the labels from
      - property_name: l1PhysIf.attributes.dn
        # The regex where the string enclosed in the P<xyz> is the label name
        regex: "^topology/pod-(?P<pod_id>[1-9][0-9]*)/node-(?P<node_id>[1-9][0-9]*)/sys/phys-\\[(?P<interface_name>[^\\]]+)\\]"
        # Add the descr attribute to the fields
      - property_name: l1PhysIf.attributes.descr
        regex: "^(?P<interface_description>.*)"
      - property_name: l1PhysIf.children.[ethpmPhysIf].attributes.operSpeed
        regex: "^(?P<speed_temp>.*)"

@ahmedaall

ahmedaall commented Jan 23, 2024

Hi @thenodon. Do you have any clue about some misconfiguration I could have made for parallel pagination?
For your information, the syntax you gave me in #41 works pretty well to reduce scraping time.

@thenodon
Member

@ahmedaall sorry, but I have not had the time yet. I hope that @camrossi will also have a chance to test on his mega fabric. I do not expect parallel paging to have a major impact on a normal-sized fabric: if the payload fits in the max response size, paging will not improve the latency. In the end, the biggest contributor to the latency is the time the APIC spends responding to the API call. The suggestion I made in #41 will not directly improve latency either, but instead of all queries waiting for the one with the largest latency, each separate query will finish without waiting for it.

@camrossi
Contributor Author

@thenodon I have currently been swallowed by a different black hole, but this is still on my to-do list... :)

camrossi added a commit to camrossi/aci-exporter that referenced this issue Jul 2, 2024
@thenodon
Member

This is now implemented with the release of version 0.8.0: https://github.com/opsdis/aci-exporter/releases/tag/v0.8.0
