Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48 #13245

roman-vynar · 2023-12-04T17:37:41Z

What did you do?

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48.

  - job_name: tmp
    azure_sd_configs:
      - subscription_id: 'xxx-xxx-xxx-xxx'
        authentication_method: ManagedIdentity
        resource_group: 'xxx'
        proxy_from_environment: true

It was working fine with 2.41 for both VM and VMSS instances but after upgrade it is discovering only regular VMs.
Any instance within VM Scaling Set is not discovered and the error is returned (as many errors as VMSS you have):

ts=2023-12-04T17:19:13.235Z caller=azure.go:391 level=warn component="discovery manager scrape" 
discovery=azure config=tmp msg="Network interface does not exist" 
name=/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic 
err="network interface does not exist"

Moreover, I checked the resource in error by following the path /subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic and there is no problem to see it.

What did you expect to see?

Azure SD working as previously.

What did you see instead? Under which circumstances?

Missing VMSS instances in Targets. Only VMs are present.

System information

No response

Prometheus version

No response

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response

The text was updated successfully, but these errors were encountered:

arukiidou · 2023-12-11T13:29:28Z

may relates Upgrade to github.com/Azure/azure-sdk-for-go/sdk #11860

daniel-resdiary · 2023-12-11T16:37:35Z

Looks like this is breaking because getNetworkInterfaceByID uses *armnetwork.InterfacesClient.Get() to return all Network Interfaces. *armnetwork.InterfacesClient.Get() only works with Virtual Machine interfaces.
To get VMSS instance interfaces it needs to use *armnetwork.InterfacesClient.GetVirtualMachineScaleSetNetworkInterface().

VMSS Network Interfaces have a very different Resource ID format to VM Network Interfaces, which is why there are two functions to return these.

This used to work (before #11860) because getNetworkInterfaceByID used to just call GET on the NIC ID path. Doing it that way meant it didn't matter that the ID was formatted differently for each ID type.

daniel-resdiary · 2023-12-11T16:59:42Z

I've done a little bit of further testing on this.
It looks like it would be possible to use the *arm.ResourceID returned by newAzureResourceFromID (ultimately from arm.ParseResourceID(id)) to determine whether a given NIC ID is a VM NIC or a VMSS NIC.

The ResourceType on that *arm.ResourceID is different in each case.
It looks like it returns:

Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces for a VMSS NIC
Microsoft.Network/networkInterfaces for a VM NIC

bwplotka · 2023-12-11T20:26:53Z

Have you tried current main of Prometheus with #13241 included (not expert in Azure here, but we recently merged, perhaps, related fix)

roman-vynar · 2023-12-11T20:33:28Z

@bwplotka that PR is about Public IPs.

For VMSS, it is a primary private interface (internal IPs).
@daniel-resdiary is right, it should be a different method to get NIC from VMSS rather than VM.

bwplotka · 2023-12-12T11:02:15Z

Thanks, help wanted in fixing & testing.

Also volunteers welcome to setup some integration test against Azure (we need sponsorship for that too, but perhaps some kind soul from Microsoft could help us 🙈 ).

daniel-resdiary · 2023-12-12T13:37:21Z

I'll take a look at this and see if I can get a fix.
I don't know how to write a test for it, though, as it needs to call GET on a real Azure NIC...

daniel-resdiary · 2023-12-12T16:39:19Z

I think I have a fix for this: #13283
This is my first time contributing to any open source project, so please let me know if I've missed anything.

bwplotka · 2023-12-13T13:01:06Z

@daniel-resdiary amazing potential fix is merged to main. Before closing this issue, can somebody else confirm this helps e.g. @roman-vynar? You will need to use the main branch e.g. "prom/prometheus:main" docker image (EDIT: Its latest "main" tag includes the fix now).

Once confirmed we can close this AND we can consider cherry-picking the fix for 2.49.0 release (rc.0 is out for now)🤗

roman-vynar · 2023-12-13T14:12:51Z

Hmm, it partially works for me.
VMSS discovery works but I don't have everything discovered because it says there are duplicate registrations on SD:

ts=2023-12-13T14:04:52.744Z caller=refresh.go:96 level=error component="discovery manager scrape" discovery=azure config=mqa_nginx msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"
ts=2023-12-13T14:04:52.744Z caller=consul.go:349 level=error component="discovery manager scrape" discovery=consul config=mqa_golang msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"
ts=2023-12-13T14:04:52.745Z caller=refresh.go:96 level=error component="discovery manager scrape" discovery=azure config=cadvisor msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"

Not sure why, it is whether something else changed or my specific config with relabeling.
I am still debugging and let you know.

SimonDreher · 2023-12-15T11:41:05Z

Nice that the fix is already on its way. Until then I can confirm that a downgrade to 2.47.2 worked for us.

bwplotka · 2023-12-19T13:23:49Z

Any update @roman-vynar?

@SimonDreher does prom/prometheus:main works for you now?

roman-vynar · 2023-12-19T15:14:16Z

Yes, this works fine.

roman-vynar · 2023-12-19T15:39:04Z

@bwplotka looks like I have discovered another bug while testing this, not in v2.48.0 but it is in the main.
Not related to Azure SD of VMSS.

Looks like the following commit 6de80d7
It throws errors like this

ts=2023-12-19T15:35:28.274Z caller=consul.go:349 level=error component="discovery manager scrape"
 discovery=consul config=xxx msg="Unable to register metrics" 
err="failed to register metric: duplicate metrics collector registration attempted"

if you have 1+ jobs with the same type of SD, e.g. 2 jobs with consul_sd_config or azure_sd_config etc.

Is it possible to verify that before a new release of prometheus?

bwplotka · 2023-12-19T17:10:35Z

Nice, this commit is not in 2.49. I will create separate issue for it.

Thanks! Closing this particular issue and will cherry-pick the fix onto 2.49

roman-vynar mentioned this issue Dec 11, 2023

Upgrade to github.com/Azure/azure-sdk-for-go/sdk #11860

Merged

bwplotka added help wanted kind/bug labels Dec 12, 2023

daniel-resdiary mentioned this issue Dec 12, 2023

Get VM Scale Set NIC #13283

Merged

bwplotka closed this as completed Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48 #13245

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48 #13245

roman-vynar commented Dec 4, 2023

arukiidou commented Dec 11, 2023

daniel-resdiary commented Dec 11, 2023

daniel-resdiary commented Dec 11, 2023

bwplotka commented Dec 11, 2023

roman-vynar commented Dec 11, 2023 •

edited

bwplotka commented Dec 12, 2023

daniel-resdiary commented Dec 12, 2023

daniel-resdiary commented Dec 12, 2023

bwplotka commented Dec 13, 2023 •

edited

roman-vynar commented Dec 13, 2023

SimonDreher commented Dec 15, 2023

bwplotka commented Dec 19, 2023 •

edited

roman-vynar commented Dec 19, 2023

roman-vynar commented Dec 19, 2023 •

edited

bwplotka commented Dec 19, 2023

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48 #13245

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48 #13245

Comments

roman-vynar commented Dec 4, 2023

What did you do?

What did you expect to see?

What did you see instead? Under which circumstances?

System information

Prometheus version

Prometheus configuration file

Alertmanager version

Alertmanager configuration file

Logs

arukiidou commented Dec 11, 2023

daniel-resdiary commented Dec 11, 2023

daniel-resdiary commented Dec 11, 2023

bwplotka commented Dec 11, 2023

roman-vynar commented Dec 11, 2023 • edited

bwplotka commented Dec 12, 2023

daniel-resdiary commented Dec 12, 2023

daniel-resdiary commented Dec 12, 2023

bwplotka commented Dec 13, 2023 • edited

roman-vynar commented Dec 13, 2023

SimonDreher commented Dec 15, 2023

bwplotka commented Dec 19, 2023 • edited

roman-vynar commented Dec 19, 2023

roman-vynar commented Dec 19, 2023 • edited

bwplotka commented Dec 19, 2023

roman-vynar commented Dec 11, 2023 •

edited

bwplotka commented Dec 13, 2023 •

edited

bwplotka commented Dec 19, 2023 •

edited

roman-vynar commented Dec 19, 2023 •

edited