Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48 #13245

Closed
roman-vynar opened this issue Dec 4, 2023 · 15 comments
Closed

Comments

@roman-vynar
Copy link
Contributor

What did you do?

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48.

  - job_name: tmp
    azure_sd_configs:
      - subscription_id: 'xxx-xxx-xxx-xxx'
        authentication_method: ManagedIdentity
        resource_group: 'xxx'
        proxy_from_environment: true

It was working fine with 2.41 for both VM and VMSS instances but after upgrade it is discovering only regular VMs.
Any instance within VM Scaling Set is not discovered and the error is returned (as many errors as VMSS you have):

ts=2023-12-04T17:19:13.235Z caller=azure.go:391 level=warn component="discovery manager scrape" 
discovery=azure config=tmp msg="Network interface does not exist" 
name=/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic 
err="network interface does not exist"

Moreover, I checked the resource in error by following the path /subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic and there is no problem to see it.

What did you expect to see?

Azure SD working as previously.

What did you see instead? Under which circumstances?

Missing VMSS instances in Targets. Only VMs are present.

System information

No response

Prometheus version

No response

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response

@arukiidou
Copy link
Contributor

@daniel-resdiary
Copy link
Contributor

Looks like this is breaking because getNetworkInterfaceByID uses *armnetwork.InterfacesClient.Get() to return all Network Interfaces. *armnetwork.InterfacesClient.Get() only works with Virtual Machine interfaces.
To get VMSS instance interfaces it needs to use *armnetwork.InterfacesClient.GetVirtualMachineScaleSetNetworkInterface().

VMSS Network Interfaces have a very different Resource ID format to VM Network Interfaces, which is why there are two functions to return these.

This used to work (before #11860) because getNetworkInterfaceByID used to just call GET on the NIC ID path. Doing it that way meant it didn't matter that the ID was formatted differently for each ID type.

@daniel-resdiary
Copy link
Contributor

I've done a little bit of further testing on this.
It looks like it would be possible to use the *arm.ResourceID returned by newAzureResourceFromID (ultimately from arm.ParseResourceID(id)) to determine whether a given NIC ID is a VM NIC or a VMSS NIC.

The ResourceType on that *arm.ResourceID is different in each case.
It looks like it returns:

  • Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces for a VMSS NIC
  • Microsoft.Network/networkInterfaces for a VM NIC

@bwplotka
Copy link
Member

Have you tried current main of Prometheus with #13241 included (not expert in Azure here, but we recently merged, perhaps, related fix)

@roman-vynar
Copy link
Contributor Author

roman-vynar commented Dec 11, 2023

@bwplotka that PR is about Public IPs.

For VMSS, it is a primary private interface (internal IPs).
@daniel-resdiary is right, it should be a different method to get NIC from VMSS rather than VM.

@bwplotka
Copy link
Member

Thanks, help wanted in fixing & testing.

Also volunteers welcome to setup some integration test against Azure (we need sponsorship for that too, but perhaps some kind soul from Microsoft could help us 🙈 ).

@daniel-resdiary
Copy link
Contributor

I'll take a look at this and see if I can get a fix.
I don't know how to write a test for it, though, as it needs to call GET on a real Azure NIC...

@daniel-resdiary
Copy link
Contributor

I think I have a fix for this: #13283
This is my first time contributing to any open source project, so please let me know if I've missed anything.

@bwplotka
Copy link
Member

bwplotka commented Dec 13, 2023

@daniel-resdiary amazing potential fix is merged to main. Before closing this issue, can somebody else confirm this helps e.g. @roman-vynar? You will need to use the main branch e.g. "prom/prometheus:main" docker image (EDIT: Its latest "main" tag includes the fix now).

Once confirmed we can close this AND we can consider cherry-picking the fix for 2.49.0 release (rc.0 is out for now)🤗

@roman-vynar
Copy link
Contributor Author

Hmm, it partially works for me.
VMSS discovery works but I don't have everything discovered because it says there are duplicate registrations on SD:

ts=2023-12-13T14:04:52.744Z caller=refresh.go:96 level=error component="discovery manager scrape" discovery=azure config=mqa_nginx msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"
ts=2023-12-13T14:04:52.744Z caller=consul.go:349 level=error component="discovery manager scrape" discovery=consul config=mqa_golang msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"
ts=2023-12-13T14:04:52.745Z caller=refresh.go:96 level=error component="discovery manager scrape" discovery=azure config=cadvisor msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"

Not sure why, it is whether something else changed or my specific config with relabeling.
I am still debugging and let you know.

@SimonDreher
Copy link

Nice that the fix is already on its way. Until then I can confirm that a downgrade to 2.47.2 worked for us.

@bwplotka
Copy link
Member

bwplotka commented Dec 19, 2023

Any update @roman-vynar?

@SimonDreher does prom/prometheus:main works for you now?

@roman-vynar
Copy link
Contributor Author

Yes, this works fine.

@roman-vynar
Copy link
Contributor Author

roman-vynar commented Dec 19, 2023

@bwplotka looks like I have discovered another bug while testing this, not in v2.48.0 but it is in the main.
Not related to Azure SD of VMSS.

Looks like the following commit 6de80d7
It throws errors like this

ts=2023-12-19T15:35:28.274Z caller=consul.go:349 level=error component="discovery manager scrape"
 discovery=consul config=xxx msg="Unable to register metrics" 
err="failed to register metric: duplicate metrics collector registration attempted"

if you have 1+ jobs with the same type of SD, e.g. 2 jobs with consul_sd_config or azure_sd_config etc.

Is it possible to verify that before a new release of prometheus?

@bwplotka
Copy link
Member

Nice, this commit is not in 2.49. I will create separate issue for it.

Thanks! Closing this particular issue and will cherry-pick the fix onto 2.49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants