Add load balancing between available apic's #1

Closed
thenodon opened this issue Jul 23, 2020 · 65 comments

Comments

@thenodon
Member

Describe the solution you'd like
If more than one APIC is configured for the fabric, the exporter should be able to round-robin between them.
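
A minimal sketch of the round-robin selection requested here, assuming the configured APIC URLs are held in a slice. The roundRobin type and its pick method are hypothetical names for illustration and are not taken from aci-exporter.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin hands out APIC base URLs in rotation.
// Hypothetical type, not part of aci-exporter.
type roundRobin struct {
	urls []string
	next atomic.Uint64 // counter of picks made so far (requires Go 1.19+)
}

// pick returns the next URL in the rotation and is safe for concurrent use.
// It returns an empty string when no URL is configured.
func (r *roundRobin) pick() string {
	if len(r.urls) == 0 {
		return ""
	}
	n := r.next.Add(1) - 1
	return r.urls[n%uint64(len(r.urls))]
}

func main() {
	rr := &roundRobin{urls: []string{"https://APIC1", "https://APIC2", "https://APIC3"}}
	for i := 0; i < 4; i++ {
		fmt.Println(rr.pick())
	}
}
```

Plain rotation does not skip controllers that are down, so a real implementation would probably combine it with a login or health check.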

@ahmedaall

Hi,
I was about to make the same issue request but you read my mind.
Have you been able to make progress on this point ?
Thx

@thenodon
Member Author

thenodon commented Aug 5, 2023

Hi @ahmedaall. I have not done any development on this issue, but if you would like to contribute, that would be great. Another option is to use an external LB that takes care of this and supports more options, like health checks.

@ahmedaall

ahmedaall commented Aug 18, 2023

Hi @thenodon. During my failover test I saw that after shutting down my APIC1 the exporter becomes completely unreachable (while when I cut APIC2 or APIC3 the exporter still works perfectly). So instead of having my 3 APIC addresses in my config, I put the address of the load balancer, which round-robins between them with session persistence:

# Profiles for different fabrics
fabrics:
  # This is the Cisco provided sandbox that is open for testing
  # cisco_sandbox:
  #   username: admin
  #   password: <check the cisco sandbox to get the password>
  #   apic:
  #     - https://sandboxapicdc.cisco.com

  MY_FABRIC:
    username: ACI_USERNAME
    password: ACI_PASSWORD
    apic:
      - https://LB
#      - https://APIC1
#      - https://APIC2
#      - https://APIC3

So I ran my APIC redundancy failover test again by shutting down APIC1. At that moment the aci_up metric goes to DOWN:

# HELP aci_up The connection state 1=UP, 0=DOWN
# TYPE aci_up gauge
aci_up{fabric="MY_FABRIC"} 0

PS: The load balancer round-robins between the APICs properly with browser access during the failover, but the exporter still goes down.

@thenodon
Member Author

Hi @ahmedaall, great that you are testing this. I do not have an environment where I can test this myself. I think there should be something in the logs that could help. It would be great if you could attach them. You could also do some debugging. I think a breakpoint in aci-connection.go in the login function is a good starting point, since it's called on every request.

@ahmedaall

ahmedaall commented Aug 21, 2023

Hi @thenodon. Yes, here are the logs after noticing that the exporter goes down only when apic1 is down.
After restarting apic1, I turned off apic2 then apic3 and the exporter manages to work properly and switch to the next available apic :

BEFORE SHUTTING DOWN APIC1 : 
{"class":"topSystem","exec_time":1194174,"fabric":"MY_FABRIC","length":54981,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/topSystem.json?rsp-subtree-include=health"}
{"class":"fvTenant","exec_time":1194421,"fabric":"MY_FABRIC","length":2164,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/fvTenant.json?rsp-subtree-include=health,required"}
{"class":"faults","exec_time":1195618,"fabric":"MY_FABRIC","length":2879,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/faultCountsWithDetails.json"}
{"class":"fvCtx","exec_time":1196088,"fabric":"MY_FABRIC","length":7894,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/fvCtx.json?rsp-subtree-include=health,required"}
{"class":"eqptTemp5min","exec_time":1206707,"fabric":"MY_FABRIC","length":21586,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/eqptTemp5min.json?rsp-subtree-include=stats\u0026rsp-subtree-class=eqptTemp5min\u0026query-target-filter=wcard(eqptTemp5min.dn,\".*sup/sensor-2/CDeqptTemp5min\")"}
{"class":"fvAp","exec_time":1210511,"fabric":"MY_FABRIC","length":3642,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/fvAp.json?rsp-subtree-include=health,required"}
{"class":"fabricHealthTotal","exec_time":1211007,"fabric":"MY_FABRIC","length":544,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/fabricHealthTotal.json?query-target-filter=wcard(fabricHealthTotal.dn,\"topology/.*/health\")"}
{"class":"procSysMem5min","exec_time":1211795,"fabric":"MY_FABRIC","length":26934,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:44Z","uri":"https://APIC2/api/class/procSysMem5min.json"}
{"exec_time":2271430,"fabric":"MY_FABRIC","level":"info","msg":"total scrape time ","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","time":"2023-08-21T06:58:44Z"}
{"exec_time":35451,"fabric":"MY_FABRIC","level":"info","method":"POST","msg":"api call fabric","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:45Z","uri":"https://APIC2/api/aaaLogout.xml"}
{"exec_time":4286804,"fabric":"MY_FABRIC","length":1064980,"level":"info","method":"GET","msg":"api call","requestid":"2UHeqJJd39oafNMVwa2mHNJU3AN","status":200,"time":"2023-08-21T06:58:46Z","uri":"/probe?target=MY_FABRIC"}

AFTER SHUTTING DOWN APIC1 : 
{"exec_time":90839,"fabric":"MY_FABRIC","level":"info","method":"POST","msg":"api call fabric","requestid":"2UHfYFFTFXy9vN7M48szPfQ8cUD","status":200,"time":"2023-08-21T07:04:31Z","uri":"https://APIC2/api/aaaLogin.xml"}
{"fabric":"MY_FABRIC","level":"info","msg":"Using apic https://APIC2","requestid":"2UHfYFFTFXy9vN7M48szPfQ8cUD","time":"2023-08-21T07:04:31Z"}
{"class":"aci_name","exec_time":90723882,"fabric":"MY_FABRIC","length":0,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHfYFFTFXy9vN7M48szPfQ8cUD","status":503,"time":"2023-08-21T07:06:01Z","uri":"https://APIC2/api/mo/topology/pod-1/node-1/av.json"}
{"fabric":"MY_FABRIC","level":"error","msg":"Request /api/mo/topology/pod-1/node-1/av.json failed - ACI api returned 503.","requestid":"2UHfYFFTFXy9vN7M48szPfQ8cUD","time":"2023-08-21T07:06:01Z"}
{"exec_time":25825,"fabric":"MY_FABRIC","level":"info","method":"POST","msg":"api call fabric","requestid":"2UHfYFFTFXy9vN7M48szPfQ8cUD","status":200,"time":"2023-08-21T07:06:01Z","uri":"https://APIC2/api/aaaLogout.xml"}
{"exec_time":90841072,"fabric":"MY_FABRIC","length":101,"level":"info","method":"GET","msg":"api call","requestid":"2UHfYFFTFXy9vN7M48szPfQ8cUD","status":503,"time":"2023-08-21T07:06:01Z","uri":"/probe?target=MY_FABRIC"}

Looks like the MO (Managed object) is reachable only through the apic1
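
When reading logs like the ones above, it can help to filter for the calls that did not return a 2xx status. The following is a minimal sketch, assuming the log is newline-delimited JSON with the "status", "uri" and "class" fields seen in this thread. The function name failedCalls is hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the few fields of an aci-exporter log line used here.
type logEntry struct {
	Status int    `json:"status"`
	URI    string `json:"uri"`
	Class  string `json:"class"`
}

// failedCalls returns the entries that carry a status outside the 2xx range.
// Lines that are not JSON, or that have no status, are skipped.
func failedCalls(logs string) []logEntry {
	var failed []logEntry
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue // for example a blank line or a stack trace line
		}
		if e.Status != 0 && (e.Status < 200 || e.Status > 299) {
			failed = append(failed, e)
		}
	}
	return failed
}

func main() {
	logs := `{"status":200,"uri":"https://APIC2/api/aaaLogin.xml"}
{"class":"aci_name","status":503,"uri":"https://APIC2/api/mo/topology/pod-1/node-1/av.json"}`
	for _, e := range failedCalls(logs) {
		fmt.Println(e.Status, e.Class, e.URI)
	}
}
```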

@thenodon
Member Author

@ahmedaall Why not test running against apic2 without the LB and verify whether that apic node works for queries? 503 is an interesting response. Running directly against apic2 from aci-exporter will hopefully reveal if the problem is on the apic or the exporter side.

@ahmedaall

@thenodon Yes, you have a point. I tried to run against apic2 and 3 as targets. It works perfectly. But when I shut down apic1 the exporter goes down:

{"class":"aci_name","exec_time":90295767,"fabric":"MY_FABRIC","length":0,"level":"info","method":"GET","msg":"api call fabric","requestid":"2UHqtP9LRXobqtpFKzcGsaTZIrK","status":503,"time":"2023-08-21T08:39:18Z","uri":"https://APIC2/api/mo/topology/pod-1/node-1/av.json"}

{"fabric":"MY_FABRIC","level":"error","msg":"Request /api/mo/topology/pod-1/node-1/av.json failed - ACI api returned 503.","requestid":"2UHqtP9LRXobqtpFKzcGsaTZIrK","time":"2023-08-21T08:39:18Z"}

{"exec_time":26211,"fabric":"MY_FABRIC","level":"info","method":"POST","msg":"api call fabric","requestid":"2UHqtP9LRXobqtpFKzcGsaTZIrK","status":200,"time":"2023-08-21T08:39:18Z","uri":"https://APIC2/api/aaaLogout.xml"}

{"exec_time":90442276,"fabric":"MY_FABRIC","length":101,"level":"info","method":"GET","msg":"api call","requestid":"2UHqtP9LRXobqtpFKzcGsaTZIrK","status":503,"time":"2023-08-21T08:39:18Z","uri":"/probe?target=MY_FABRIC"}

It looks like an APIC issue

@thenodon
Member Author

Do you have the same behavior just using curl? Like to take aci-exporter out of the equation :)

@ahmedaall

ahmedaall commented Aug 21, 2023

The curl works from the exporter to the APICs :)

@thenodon
Member Author

@ahmedaall I meant: does curl work directly against apic2 for queries when you take apic1 down?

@ahmedaall

@thenodon yes, I confirm that curl works from the exporter to the apic2 when apic1 is down

@thenodon
Member Author

@ahmedaall as I interpret your answer, the problem is related to the LB or aci-exporter, since curl requests to any apic continue to work even if one or more apics are down. A couple more questions.
Is the LB set up to do round robin? If so, when all apics are up, will the requests from aci-exporter hit all apics in a round-robin way? Does curl against the LB behave differently than aci-exporter?
Looking at the above logs, it looks like aci-exporter is requesting https://APIC2 and not https://LB - I thought that was what you were testing.

@ahmedaall

ahmedaall commented Aug 21, 2023

@thenodon all the tests were done by directly specifying the target apics as below :

  MY_FABRIC:
    username: ACI_USERNAME
    password: ACI_PASSWORD
    apic:
#      - https://LB
      - https://APIC2
      - https://APIC3
#      - https://APIC1

I did not debug with the load balancer in place because I assumed the problem would not be on that side. But I can confirm that curl from the exporter to the APICs and the LB works properly.
The load balancer is in round robin with session persistence

@thenodon
Member Author

Sounds really strange - aci-exporter connects to https://APIC2 and that works until APIC1 is shut down - correct? But using curl against https://APIC2 works even after APIC1 is shut down. Is this correct?

@ahmedaall

Exactly

@ahmedaall

The only explanation would be that there is a dependency on apic1. In the logs we see that it manages to log in but can't access https://APIC/api/mo/.../.../... It's unlikely, but maybe access to the MO is only possible through APIC1. To verify this hypothesis I will try to run a terraform plan from one of my VMs after first turning off APIC1.

@thenodon
Member Author

But how can it work for curl ?

@thenodon
Member Author

@camrossi do you have any idea what this problem could be related to?

@ahmedaall

Maybe there is a policy or something else that blocks access to the MO, but not to the rest.
Note that in my curl I only put the URL of the APIC as the destination, without specifying a precise path:

curl https://APIC2

@ahmedaall

"The only explanation would be that there is a dependency on apic1. In the logs we see that it manages to log in but can't access https://APIC/api/mo/.../.../... It's unlikely, but maybe access to the MO is only possible through APIC1. To verify this hypothesis I will try to run a terraform plan from one of my VMs after first turning off APIC1."

I tried a terraform plan while apic1 was shut down to see if it is a SPOF for API access. It works... So maybe the problem is on the exporter side.

@thenodon
Member Author

Can you tell me more about the "terraform plan" you did.

@ahmedaall

I wanted to verify the hypothesis that only apic1 can communicate with the api. I therefore launched a terraform plan against the address of the load balancer while apic1 was off. And it worked..

@ahmedaall

ahmedaall commented Aug 22, 2023

Hi @thenodon, I found the issue !

urlMap := make(map[string]string)
urlMap["login"] = "/api/aaaLogin.xml"
urlMap["logout"] = "/api/aaaLogout.xml"
urlMap["faults"] = "/api/class/faultCountsWithDetails.json"
urlMap["aci_name"] = "/api/mo/topology/pod-1/node-1/av.json"

The aci_name entry in urlMap only contains the path to "pod-1/node-1", which corresponds to... APIC1. It is therefore expected that the exporter becomes unavailable whenever APIC1 is rebooted. I think aci-connection.go should be updated so that it can contain the URL maps of all the APIC targets entered in the config file.

@thenodon
Member Author

@ahmedaall - thanks for the trouble shooting. So if I have 2 apic you have node-1 and node-2? Is it possible to get the number of apic nodes? How is node numbering defined in the apic? node-1 will be the first to boot or is it a configuration?

@ahmedaall

@thenodon
So if I have 2 apic you have node-1 and node-2?
--> Yes, but the nodes can be in different pods, and therefore not necessarily in pod-1. Because the mapping between node and pod is specific to each infrastructure, you can use /api/node/class/fabricNode.json?query-target-filter=eq(fabricNode.role,"controller") and read the "dn" attribute (e.g. "dn":"topology/pod-2/node-3"), then add it to the urlMap.

How is node numbering defined in the apic?
--> It depends on each infrastructure. The order is defined by the administrator.

node-1 will be the first to boot or is it a configuration?
--> No, it's fully redundant by design between all apics
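
The suggestion above, reading the "dn" of each controller and deriving the av.json path from it, could be sketched as follows. This assumes the fabricNode class query returns the usual APIC envelope of the form {"imdata":[{"fabricNode":{"attributes":{...}}}]}. The sample body is a trimmed, made-up response and controllerAvPaths is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fabricNodeResponse models only the part of the response needed here.
type fabricNodeResponse struct {
	Imdata []struct {
		FabricNode struct {
			Attributes struct {
				Dn string `json:"dn"`
			} `json:"attributes"`
		} `json:"fabricNode"`
	} `json:"imdata"`
}

// controllerAvPaths turns each controller dn into the matching av.json path,
// instead of hard-coding topology/pod-1/node-1.
func controllerAvPaths(body []byte) ([]string, error) {
	var resp fabricNodeResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	paths := make([]string, 0, len(resp.Imdata))
	for _, item := range resp.Imdata {
		dn := item.FabricNode.Attributes.Dn
		if dn == "" {
			continue // entry without a dn, nothing to build
		}
		paths = append(paths, "/api/mo/"+dn+"/av.json")
	}
	return paths, nil
}

func main() {
	// Made-up sample with two controllers in different pods.
	body := []byte(`{"imdata":[
		{"fabricNode":{"attributes":{"dn":"topology/pod-1/node-1","role":"controller"}}},
		{"fabricNode":{"attributes":{"dn":"topology/pod-2/node-3","role":"controller"}}}]}`)
	paths, err := controllerAvPaths(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```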

@thenodon
Member Author

@ahmedaall I have tried to solve this issue. It would be great if you could test it. You can build aci-exporter from the branch https://github.com/opsdis/aci-exporter/tree/issue_1. The fix is in commit da40506.
The name of the aci can now be set in the configuration file, but if it is not set, the exporter uses /api/node/class/fabricNode.json?query-target-filter=eq(fabricNode.role,"controller") as you suggested. It will loop over the returned controllers until a name can be determined. The name will be cached until aci-exporter is stopped. Hopefully this will solve the issue. Looking forward to your feedback.
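
The behaviour described here (try each returned controller until a name is found, then cache it for the lifetime of the process) might look roughly like the sketch below. It is not the code from commit da40506. The nameResolver type and the fetch callback are hypothetical stand-ins for the real HTTP call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnreachable simulates a controller that does not answer.
var errUnreachable = errors.New("controller unreachable")

// nameResolver caches the ACI name once it has been determined.
type nameResolver struct {
	mu     sync.Mutex
	cached string
	// fetch asks one controller (identified by its dn) for the fabric name.
	// Hypothetical callback standing in for the real API request.
	fetch func(dn string) (string, error)
}

// resolve returns the cached name, or walks the controllers until one answers.
func (r *nameResolver) resolve(controllerDns []string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cached != "" {
		return r.cached, nil
	}
	for _, dn := range controllerDns {
		name, err := r.fetch(dn)
		if err != nil || name == "" {
			continue // this controller could not help, try the next one
		}
		r.cached = name
		return name, nil
	}
	return "", errors.New("could not determine ACI name from any controller")
}

func main() {
	r := &nameResolver{fetch: func(dn string) (string, error) {
		if dn == "topology/pod-1/node-1" {
			return "", errUnreachable // simulate APIC1 being down
		}
		return "my_fabric_name", nil
	}}
	name, err := r.resolve([]string{"topology/pod-1/node-1", "topology/pod-1/node-2"})
	fmt.Println(name, err)
}
```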

@ahmedaall

Hi @thenodon. Thank you for the update! I'll keep you informed when I update my exporter.

@ahmedaall

@thenodon I don't understand what value I have to enter for aci_name

@thenodon
Member Author

thenodon commented Aug 28, 2023

@ahmedaall it's optional, but if you set it the name from the ACI will not be used. This is the value of the aci label. There is an example in example-config.yaml

@thenodon
Member Author

@ahmedaall any update on this?

@camrossi
Contributor

Hi Folks,
Sorry a bit late to the party... Just wanted to say that you can use the infraCont class that returns the av for all the APICs. Same as what you do but perhaps more elegant.

I also tested your code and it is not crashing for me.

I implemented this as a test BUT I do not know Go. I am not joking, this is the first time I have written anything in Go, so it is probably horrible and I am amazed it even works

func (p aciAPI) getAciName() (string, error) {
	if p.connection.fabricConfig.AciName != "" {
		return p.connection.fabricConfig.AciName, nil
	}

	data, err := p.connection.getByClassQuery("infraCont", "query-target=self")
	if err != nil {
		return "", err
	}
	p.connection.fabricConfig.AciName = gjson.Get(data, "imdata.#.infraCont.attributes.fbDmNm").Array()[0].Str

	if p.connection.fabricConfig.AciName != "" {
		return p.connection.fabricConfig.AciName, nil
	}
	return "", fmt.Errorf("could not determine ACI name")
}
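
One thing to watch in the snippet above: indexing the result with [0] would panic if the query returned no infraCont objects. A guarded variant could look like the sketch below. It uses encoding/json instead of gjson so that it runs without third-party packages, assumes the response shape implied by the gjson path (imdata[].infraCont.attributes.fbDmNm), and uses a made-up sample body. The function name aciNameFromInfraCont is hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// infraContResponse models only the fbDmNm attribute of each infraCont object.
type infraContResponse struct {
	Imdata []struct {
		InfraCont struct {
			Attributes struct {
				FbDmNm string `json:"fbDmNm"`
			} `json:"attributes"`
		} `json:"infraCont"`
	} `json:"imdata"`
}

// aciNameFromInfraCont returns the first non-empty fabric domain name.
// It returns an error rather than panicking when the result set is empty.
func aciNameFromInfraCont(body []byte) (string, error) {
	var resp infraContResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	for _, item := range resp.Imdata {
		if name := item.InfraCont.Attributes.FbDmNm; name != "" {
			return name, nil
		}
	}
	return "", errors.New("could not determine ACI name")
}

func main() {
	// Made-up sample body for illustration.
	body := []byte(`{"imdata":[{"infraCont":{"attributes":{"fbDmNm":"my_fabric_name"}}}]}`)
	fmt.Println(aciNameFromInfraCont(body))

	// An empty result set yields an error instead of a panic.
	fmt.Println(aciNameFromInfraCont([]byte(`{"imdata":[]}`)))
}
```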

@thenodon
Member Author

thenodon commented Sep 14, 2023

@camrossi - glad to hear that the branch worked for you. And your golang works fine. The problem is that I cannot get your alternative query using the class infraCont to work. I have tried it against the cisco sandbox and also against a customer's aci. I'm just getting 400. Can it be an aci version issue? Or does infraCont only work if there is a cluster of apic controllers? I think the cisco sandbox is a single apic. I checked https://pubhub.devnetcloud.com/media/apic-mim-ref-311/docs/MO-infraCont.html and it states:
infra:Cont An APIC cluster is comprised of multiple APIC controllers that provide operators with a unified real time monitoring, diagnostic, and configuration management capability for the ACI fabric.

@camrossi the query you suggested worked fine :)

@ahmedaall

ahmedaall commented Sep 14, 2023

  • What version of golang are you using?
  • What OS and version do you build and run on?

I build from golang:latest. I am using version go1.21.1, in a Debian 12

@thenodon
Member Author

@ahmedaall I have the same golang version and running on ubuntu 22.04. I would suggest that you test what I recommended in #1 (comment)

@ahmedaall

@thenodon I just launched the sandbox. I am a newbie with it. I'll do the test as soon as possible.

@thenodon
Member Author

The cisco sandbox is on the internet, https://devnetsandbox.cisco.com/RM/Topology, so just configure aci-exporter with:

fabrics:
  cisco_sandbox:
    username: admin
    password: "!v3G@!4@Y"
    apic:
      - https://sandboxapicdc.cisco.com

@ahmedaall

@thenodon ok I'll take those parameters and for the minimal config I'll directly import example-config.yaml and deactivate my current config yaml file

@ahmedaall

ahmedaall commented Sep 14, 2023

@thenodon ok I'll take those parameters and for the minimal config I'll directly import example-config.yaml and deactivate my current config yaml file

Done. But what url do I need to enter to launch the exporter with sandbox target ?
I used to launch http://my_exporter/probe?target=MY_FABRIC with my local APIC

@thenodon
Member Author

@ahmedaall if you used the above config for the cisco sandbox, the name is cisco_sandbox. So the URL to curl is:

 http://my_exporter/probe?target=cisco_sandbox&queries=my_query
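
For scripted checks, the same probe URL can be assembled with net/url so that the parameters are escaped properly. This is a minimal sketch. The helper probeURL is hypothetical, and my_exporter and my_query are the placeholders used above. Note that url.Values.Encode sorts the parameters by key, so the order differs from the hand-written URL while the request is equivalent.

```go
package main

import (
	"fmt"
	"net/url"
)

// probeURL builds the /probe URL for one fabric and one query name.
func probeURL(exporterHost, target, query string) string {
	q := url.Values{}
	q.Set("target", target)
	if query != "" {
		q.Set("queries", query)
	}
	u := url.URL{Scheme: "http", Host: exporterHost, Path: "/probe", RawQuery: q.Encode()}
	return u.String()
}

func main() {
	fmt.Println(probeURL("my_exporter", "cisco_sandbox", "my_query"))
}
```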

Where my_query is some basic query in the config file.

@ahmedaall

@thenodon Everything works fine with the sandbox :

{"level":"info","msg":"aci-exporter starting on port 9643","time":"2023-09-14T14:04:14Z"}

{"level":"info","msg":"Read timeout 0s, Write timeout 0s","time":"2023-09-14T14:04:14Z"}

{"exec_time":1015415,"fabric":"cisco_sandbox","level":"info","method":"POST","msg":"api call fabric","requestid":"2VOI1DmZOBkw9N0AUieRthm53KK","status":200,"time":"2023-09-14T14:08:03Z","uri":"https://sandboxapicdc.cisco.com/api/aaaLogin.xml"}

{"fabric":"cisco_sandbox","level":"info","msg":"Using apic https://sandboxapicdc.cisco.com","requestid":"2VOI1DmZOBkw9N0AUieRthm53KK","time":"2023-09-14T14:08:03Z"}

{"class":"fabricNode","exec_time":189166,"fabric":"cisco_sandbox","length":580,"level":"info","method":"GET","msg":"api call fabric","requestid":"2VOI1DmZOBkw9N0AUieRthm53KK","status":200,"time":"2023-09-14T14:08:03Z","uri":"https://sandboxapicdc.cisco.com/api/class/fabricNode.json?query-target-filter=eq(fabricNode.role,\"controller\")"}

@thenodon
Member Author

@ahmedaall can you run the same config with the same query against your own apic? Just add your config in the fabrics section of the config file.

@ahmedaall

@thenodon So in the same file I changed the target from the sandbox to my APIC and I get this:

{"level":"info","msg":"aci-exporter starting on port 9643","time":"2023-09-14T14:35:03Z"}
{"level":"info","msg":"Read timeout 0s, Write timeout 0s","time":"2023-09-14T14:35:03Z"}
2023/09/14 14:35:30 http: panic serving 100.64.6.20:35536: runtime error: invalid memory address or nil pointer dereference
goroutine 36 [running]:
net/http.(*conn).serve.func1()
/usr/local/go/src/net/http/server.go:1868 +0xb9
panic({0x9479a0?, 0xe2d440?})
/usr/local/go/src/runtime/panic.go:920 +0x270
main.AciConnection.login({{0xab0cd0, 0xc0002f17d0}, 0x0, 0xc0002d72c8, 0xc00031a930, 0xc00031a900, {{0xaac400, 0xc0002977c0}, 0x0, {0xaae5e8, ...}, ...}, ...})
/build/aci-connection.go:85 +0x25
main.aciAPI.CollectMetrics({{0xab0cd0, 0xc0002f17d0}, {{0xab0cd0, 0xc0002f17d0}, 0x0, 0xc0002d72c8, 0xc00031a930, 0xc00031a900, {{0xaac400, 0xc0002977c0}, ...}, ...}, ...})
/build/aci-api.go:109 +0xa5
main.HandlerInit.getMonitorMetrics({{0xc0002f0030?, 0xc0002f06c0?, 0xc0002f0750?}, 0xc0002f0c60?}, {0xaaffb8, 0xc000321820}, 0xc000327b00)
/build/aci-exporter.go:272 +0x345
net/http.HandlerFunc.ServeHTTP(0x410225?, {0xaaffb8?, 0xc000321820?}, 0xf8?)
/usr/local/go/src/net/http/server.go:2136 +0x29
main.main.promMonitor.func2({0xaaffb8?, 0xc000321800}, 0xaaa4b0?)
/build/aci-exporter.go:345 +0xe3
net/http.HandlerFunc.ServeHTTP(0xc000327a00?, {0xaaffb8?, 0xc000321800?}, 0xaaa4b0?)
/usr/local/go/src/net/http/server.go:2136 +0x29
main.main.logcall.func3({0xab0258?, 0xc000348000}, 0xc000327a00)
/build/aci-exporter.go:322 +0x156
net/http.HandlerFunc.ServeHTTP(0x445220?, {0xab0258?, 0xc000348000?}, 0x70f85a?)
/usr/local/go/src/net/http/server.go:2136 +0x29
net/http.(*ServeMux).ServeHTTP(0xe70140?, {0xab0258, 0xc000348000}, 0xc000327a00)
/usr/local/go/src/net/http/server.go:2514 +0x142
net/http.serverHandler.ServeHTTP({0xc0002f1560?}, {0xab0258?, 0xc000348000?}, 0x6?)
/usr/local/go/src/net/http/server.go:2938 +0x8e
net/http.(*conn).serve(0xc000230900, {0xab0cd0, 0xc0002f1470})
/usr/local/go/src/net/http/server.go:2009 +0x5f4
created by net/http.(*Server).Serve in goroutine 1
/usr/local/go/src/net/http/server.go:3086 +0x5cb

@thenodon
Member Author

@ahmedaall so now against the lb?

fabrics:
  MY_FABRIC:
    username: ACI_USERNAME
    password: ACI_PASSWORD
    apic:
      - https://my-load-balancer

First try to add the aci_name like this:

fabrics:
  MY_FABRIC:
    username: ACI_USERNAME
    password: ACI_PASSWORD
    apic:
      - https://my-load-balancer
    aci_name: foobar

Do you get the same panic?

You should also verify by running against one of the apic endpoints, without the lb.
My thought is that it's something related to the lb and how it's configured. So test with the same config against different fabric sections. You can have multiple fabric settings in the same config file, as described in the example config.

@ahmedaall

@thenodon
I have already tested without the LB. Same issue.
OK, I'll test. aci_name refers to the name of the fabric, I guess?

@thenodon
Member Author

@ahmedaall so everything works with the released version, both through lb and against apic endpoint.
The branch compiled version works against cisco sandbox, but not against any setup you have onprem, with or without lb.
Is this correct?
It's very strange. What version of apic do you run?

@ahmedaall

@thenodon I have version 5.2(6e)
"The branch compiled version works against cisco sandbox, but not against any setup you have onprem, with or without lb.
Is this correct?" --> Exactly

@camrossi
Contributor

I tested against a real ACI 5.2(8) with my code and it worked just fine using infraCont; however, I did manage to make it crash...
It's the name you use: if it starts with a capital letter, it crashes.
Not sure if it's the same, but I see that @ahmedaall's fabric name is MY_FABRIC, so I would say probably?

data:
  config.yaml: |
    fabrics:
      Fab1:

but this works just fine, can you check if you have the same behaviour ?

data:
  config.yaml: |
    fabrics:
      fab1:

@thenodon let me see if I can get you read-only access to one of our DMZ fabrics so you can actually test on something real. It might not have lots of flexibility in terms of versions but would be better than what you have now :D

Here is my crash trace with the uppercase name:

2023/09/14 23:04:00 http: panic serving 10.32.0.11:36052: runtime error: invalid memory address or nil pointer dereference
goroutine 2288 [running]:
net/http.(*conn).serve.func1()
	/usr/local/go/src/net/http/server.go:1854 +0xbf
panic({0x92ec20, 0xda7d70})
	/usr/local/go/src/runtime/panic.go:890 +0x263
main.AciConnection.login({{0xa7bb70, 0xc000b443f0}, 0x0, 0xc000940518, 0xc000b44870, 0xc000b44840, {{0xa78320, 0xc00013dcc0}, 0x0, {0xa7a688, ...}, ...}, ...})
	/build/aci-connection.go:85 +0x31
main.aciAPI.CollectMetrics({{0xa7bb70, 0xc000b443f0}, {{0xa7bb70, 0xc000b443f0}, 0x0, 0xc000940518, 0xc000b44870, 0xc000b44840, {{0xa78320, 0xc00013dcc0}, ...}, ...}, ...})
	/build/aci-api.go:109 +0xb8
main.HandlerInit.getMonitorMetrics({{0xc00032ef30?, 0xc00032f770?, 0xc00032f830?}, 0xc00032fd10?}, {0xa7b400, 0xc0002b0120}, 0xc000296200)
	/build/aci-exporter.go:272 +0x345
net/http.HandlerFunc.ServeHTTP(0x40d90a?, {0xa7b400?, 0xc0002b0120?}, 0x30?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
main.promMonitor.func1({0xa7b400?, 0xc0002b00e0}, 0xa76801?)
	/build/aci-exporter.go:345 +0xf8
net/http.HandlerFunc.ServeHTTP(0xa7bac8?, {0xa7b400?, 0xc0002b00e0?}, 0xa768c0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
main.logcall.func1({0xa7b640?, 0xc00009e000}, 0xc000296100)
	/build/aci-exporter.go:322 +0x263
net/http.HandlerFunc.ServeHTTP(0xc000362000?, {0xa7b640?, 0xc00009e000?}, 0x40d90a?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0xc0003e600b?, {0xa7b640, 0xc00009e000}, 0xc000296100)
	/usr/local/go/src/net/http/server.go:2500 +0x149
net/http.serverHandler.ServeHTTP({0xc000b44090?}, {0xa7b640, 0xc00009e000}, 0xc000296100)
	/usr/local/go/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc00027e240, {0xa7bb70, 0xc0003965d0})
	/usr/local/go/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3089 +0x5ed

@camrossi
Contributor

If I make r.URL.Query().Get("target") lower case with strings.ToLower() the crash is gone.
But it is really odd, as it is not a case of using a fabric name that doesn't exist... very confusing. Anyway, hope this sheds some light!
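
The workaround above can be sketched as a lookup that normalizes the target and reports whether a fabric was found, so that the caller never dereferences a nil pointer. The assumption here, not confirmed in this thread, is that the fabric keys end up in lower case once the config is loaded. The fabric type and lookupFabric are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// fabric is a hypothetical stand-in for the exporter's fabric config.
type fabric struct {
	Username string
}

// lookupFabric lower-cases the requested target before the map lookup.
// The boolean is false when no usable fabric exists for that name.
func lookupFabric(fabrics map[string]*fabric, target string) (*fabric, bool) {
	f, ok := fabrics[strings.ToLower(target)]
	return f, ok && f != nil
}

func main() {
	// Assumption: keys are stored in lower case after the config is loaded.
	fabrics := map[string]*fabric{"my_fabric": {Username: "ACI_USERNAME"}}

	if f, ok := lookupFabric(fabrics, "MY_FABRIC"); ok {
		fmt.Println("found fabric for user", f.Username)
	}
	if _, ok := lookupFabric(fabrics, "unknown"); !ok {
		fmt.Println("unknown fabric, answer with an error instead of panicking")
	}
}
```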

@thenodon
Member Author

@camrossi thanks for your findings. I can confirm that I can reproduce the same behavior with the uppercase fabric name. I am currently not sure why, but I think it's a combination of how the yaml package deserializes the yaml file and how I match the target from the query against the deserialized fabrics structures. I will investigate it more, including what changed in the dependencies between the released version and the branch version, since @ahmedaall got it to work in the released version.
@ahmedaall and everyone else - the named fabrics sections must be lowercase! Change to lowercase and let us know if it works.
@camrossi - it would be great to be given access to your DMZ fabric. So the question is whether infraCont differs between ACI 5.2 and 6 (cisco sandbox), since it did not work for me against the cisco sandbox.

@ahmedaall

@camrossi @thenodon thank you for the debugging. So I changed the name of the fabric from uppercase to lowercase:

# Exporter port
port: 9643
# Configuration file name default without postfix
config: config
# The prefix of the metrics
prefix: 

fabrics:
#  cisco_sandbox:
#    username: admin
#    password: ""
#    apic:
#      - https://sandboxapicdc.cisco.com

  my_fabric:
    username: ACI_USERNAME
    password: ACI_PASSWORD
    apic:
      - https://APIC1
      - https://APIC2
      - https://APIC3

but... I have the same issue. I tried with and without the lb. Did I forget something?

@camrossi
Contributor

Yes @ahmedaall, you need to update the prometheus config as well (assuming you use that) so that it uses the lowercase name.
I have this for my prom lab config; the relevant section is the targets one

prometheus:
  prometheusSpec:
    scrapeInterval: 30s
    evaluationInterval: 30s
    additionalScrapeConfigs:
    - job_name: 'aci'
      scrape_interval: 1m
      scrape_timeout: 30s
      metrics_path: /probe
      static_configs:
      - targets: ['fab1','fab2']

and then in aci-exporter

    fabrics:
      # This is the Cisco provided sandbox that is open for testing
      fab1:
        # Apic username
        username: admin
        # Apic password
        password: <>
        # The available apic controllers
        # The aci-exporter will use the first apic it can successfully login to, starting with the first in the list
        apic:
          - https://<>
      fab2:
        # Apic username
        username: admin
        # Apic password
        password: <>
        # The available apic controllers
        # The aci-exporter will use the first apic it can successfully login to, starting with the first in the list
        apic:
          - https://<>
          - https://<>
          - https://<>
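
The config comments above say that the exporter uses the first apic it can successfully log in to, starting with the first in the list. That selection can be sketched as follows. This is a minimal illustration and not the exporter's actual code. The helper firstReachable and the login callback are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errDown simulates a controller that refuses or times out on login.
var errDown = errors.New("login failed")

// firstReachable tries the APICs in list order and returns the first one
// for which login succeeds. login stands in for the real aaaLogin request.
func firstReachable(apics []string, login func(baseURL string) error) (string, error) {
	lastErr := errors.New("no APIC configured")
	for _, base := range apics {
		if err := login(base); err != nil {
			lastErr = err
			continue // try the next controller in the list
		}
		return base, nil
	}
	return "", fmt.Errorf("no APIC accepted the login: %w", lastErr)
}

func main() {
	apics := []string{"https://APIC1", "https://APIC2", "https://APIC3"}
	login := func(baseURL string) error {
		if baseURL == "https://APIC1" {
			return errDown // simulate APIC1 being shut down
		}
		return nil
	}
	fmt.Println(firstReachable(apics, login))
}
```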

@ahmedaall

ahmedaall commented Sep 15, 2023

@camrossi Yes, I didn't mention it but I did update my target in the prometheus file :)
The problem was that my browser automatically uppercased my target... So I tried with another browser and the exporter works!
Now I will rerun my failover test by shutting down my APIC1. I'll keep you informed.

@thenodon
Member Author

thenodon commented Sep 16, 2023

@ahmedaall @camrossi I have added documentation and a check on the /probe endpoint that the fabric name is in lower case. You can do a pull on the issue_1 branch. For more info see commit 6924973.
I changed the query to @camrossi's suggestion for detecting the aci name, commit acb7428.
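
A check of this kind on the /probe endpoint might look like the sketch below. It is an illustration only and not the code from commit 6924973. The names validateTarget and probeHandler, the error texts and the 400 status are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// validateTarget rejects a missing target or one that is not all lower case.
func validateTarget(target string) error {
	if target == "" {
		return fmt.Errorf("missing target parameter")
	}
	if target != strings.ToLower(target) {
		return fmt.Errorf("fabric name %q must be lower case", target)
	}
	return nil
}

// probeHandler answers 400 for an invalid target instead of failing later.
func probeHandler(w http.ResponseWriter, r *http.Request) {
	if err := validateTarget(r.URL.Query().Get("target")); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	fmt.Fprintln(w, "ok") // the real handler would collect metrics here
}

func main() {
	// Exercise the handler in-process, without opening a network port.
	for _, target := range []string{"my_fabric", "MY_FABRIC"} {
		rec := httptest.NewRecorder()
		req := httptest.NewRequest(http.MethodGet, "/probe?target="+target, nil)
		probeHandler(rec, req)
		fmt.Println(target, rec.Code)
	}
}
```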

@thenodon
Member Author

@ahmedaall have you had a chance to verify your LB solution yet? Would like to close this and make a release.

@ahmedaall

@thenodon not yet... i'll do it today or tomorrow. Sorry for the time it takes.

@ahmedaall

@thenodon Ok I just finished my failover test. Everything works perfectly. Even if I shutdown 2 of my 3 APICs it works good !

@thenodon
Member Author

Great @ahmedaall. If you have the time to share your setup I can add it to the README. Hope you are now happy with aci-exporter and give it a star.
I will make a release in the coming days.

@ahmedaall

@thenodon I can confirm that I am very happy with aci-exporter. It is a really good job. Thanks a lot for your responsiveness on this issue.

@thenodon
Member Author

Thanks @ahmedaall and @camrossi for your support. Will close this with pull request #38
