discovery/azure: fix system managed identity when client_id is empty#18323

Merged
bwplotka merged 1 commit into prometheus:main from ogulcanaydogan:fix/16634-azure-system-managed-identity
Mar 20, 2026

@ogulcanaydogan ogulcanaydogan commented Mar 19, 2026

Summary

Fixes #16634

When using ManagedIdentity authentication with a system-assigned identity, the client_id field is intentionally left empty. However, the current code unconditionally sets options.ID = azidentity.ClientID(cfg.ClientID), which stores an empty ClientID value instead of leaving options.ID nil. The Azure SDK treats an empty ClientID as a request for a user-assigned identity with an empty client ID, rather than falling back to the system-assigned identity.

This fix only sets options.ID when cfg.ClientID is non-empty, matching the pattern already used in storage/remote/azuread/azuread.go (lines 330-339).
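The difference between the old and new behaviour can be sketched roughly as follows. This is a minimal illustration, not the code from discovery/azure/azure.go: the types below are local stand-ins that only mirror the shape of the azidentity types named above so the example builds without the Azure SDK, and the helper names buggyOptions and fixedOptions are hypothetical.

```go
package main

import "fmt"

// ManagedIDKind, ClientID and ManagedIdentityCredentialOptions are local
// stand-ins (assumptions for illustration) that mimic the shape of the
// azidentity types mentioned in the PR description.
type ManagedIDKind interface {
	fmt.Stringer
}

type ClientID string

func (c ClientID) String() string { return string(c) }

type ManagedIdentityCredentialOptions struct {
	ID ManagedIDKind
}

// buggyOptions (hypothetical name) reproduces the old behaviour: ID is always
// set, so an empty client_id becomes a non-nil ClientID("") interface value.
func buggyOptions(clientID string) *ManagedIdentityCredentialOptions {
	return &ManagedIdentityCredentialOptions{ID: ClientID(clientID)}
}

// fixedOptions (hypothetical name) sets ID only for a non-empty client_id and
// otherwise leaves it nil, which is how system-assigned identity is selected.
func fixedOptions(clientID string) *ManagedIdentityCredentialOptions {
	opts := &ManagedIdentityCredentialOptions{}
	if clientID != "" {
		opts.ID = ClientID(clientID)
	}
	return opts
}

func main() {
	fmt.Println(buggyOptions("").ID == nil) // false: empty ClientID is still set
	fmt.Println(fixedOptions("").ID == nil) // true: ID left unset
	fmt.Println(fixedOptions("example-client-id").ID)
}
```

The key point is that a Go interface holding ClientID("") is not nil, so a check against nil downstream cannot tell "no client ID configured" apart from "empty client ID requested" unless the caller avoids setting the field in the first place.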

Changes

  • discovery/azure/azure.go: Only set options.ID when cfg.ClientID is non-empty in the ManagedIdentity case of newCredential()
  • discovery/azure/azure_test.go: Add TestNewCredentialManagedIdentity covering both system-assigned (empty ClientID) and user-assigned (non-empty ClientID) cases

Context

This was discussed in #16634 and builds on the Workload Identity pattern merged in #17207. @krajorama invited contribution in the issue thread.

Signed-off-by: Ogulcan Aydogan <ogulcanaydogan@hotmail.com>

[BUGFIX] Azure SD: Fix system-assigned managed identity not working when `client_id` is empty.

@ogulcanaydogan ogulcanaydogan requested a review from a team as a code owner March 19, 2026 10:51
@ogulcanaydogan ogulcanaydogan requested a review from bboreham March 19, 2026 10:51
@bwplotka bwplotka left a comment

Amazing; thank you.

We don't have great e2e tests to catch these; feedback is welcome on how to improve test coverage for these tricky Azure integrations.

@bwplotka bwplotka merged commit 166d201 into prometheus:main Mar 20, 2026
33 of 34 checks passed
renovate Bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Apr 8, 2026
##### [\`v3.11.0\`](https://github.com/prometheus/prometheus/releases/tag/v3.11.0)

- \[CHANGE] Hetzner SD: The `__meta_hetzner_datacenter` label is deprecated for the role `robot` but kept for backward compatibility, use the `__meta_hetzner_robot_datacenter` label instead. For the role `hcloud`, the label is deprecated and will stop working after the 1 July 2026. [#17850](prometheus/prometheus#17850)
- \[CHANGE] Hetzner SD: The `__meta_hetzner_hcloud_datacenter_location` and `__meta_hetzner_hcloud_datacenter_location_network_zone` labels are deprecated, use the `__meta_hetzner_hcloud_location` and `__meta_hetzner_hcloud_location_network_zone` labels instead. [#17850](prometheus/prometheus#17850)
- \[CHANGE] Promtool: Redirect debug output to stderr to avoid interfering with stdout-based tool output. [#18346](prometheus/prometheus#18346)
- \[FEATURE] AWS SD: Add Elasticache Role. [#18099](prometheus/prometheus#18099)
- \[FEATURE] AWS SD: Add RDS Role. [#18206](prometheus/prometheus#18206)
- \[FEATURE] Azure SD: Add support for Azure Workload Identity authentication method. [#17207](prometheus/prometheus#17207)
- \[FEATURE] Discovery: Introduce `prometheus_sd_last_update_timestamp_seconds` metric to track the last time a service discovery update was sent to consumers. [#18194](prometheus/prometheus#18194)
- \[FEATURE] Kubernetes SD: Add support for node role selectors for pod roles. [#18006](prometheus/prometheus#18006)
- \[FEATURE] Kubernetes SD: Introduce pod-based labels for deployment, cronjob, and job controller names: `__meta_kubernetes_pod_deployment_name`, `__meta_kubernetes_pod_cronjob_name` and `__meta_kubernetes_pod_job_name`, respectively. [#17774](prometheus/prometheus#17774)
- \[FEATURE] PromQL: Add `</` and `>/` operators for trimming observations from native histograms. [#17904](prometheus/prometheus#17904)
- \[FEATURE] PromQL: Add experimental `histogram_quantiles` variadic function for computing multiple quantiles at once. [#17285](prometheus/prometheus#17285)
- \[FEATURE] TSDB: Add `storage.tsdb.retention.percentage` configuration to configure the maximum percent of disk usable for TSDB storage. [#18080](prometheus/prometheus#18080)
- \[FEATURE] TSDB: Add an experimental `fast-startup` feature flag that writes a `series_state.json` file to the WAL directory to track active series state across restarts. [#18303](prometheus/prometheus#18303)
- \[FEATURE] TSDB: Add an experimental `st-storage` feature flag. When enabled, Prometheus stores ingested start timestamps (ST, previously called Created Timestamp) from scrape or OTLP in the TSDB and Agent WAL, and exposes them via Remote Write 2. [#18062](prometheus/prometheus#18062)
- \[FEATURE] TSDB: Add an experimental `xor2-encoding` feature flag for the new TSDB block float sample chunk encoding that is optimized for scraped data and allows encoding start timestamps. [#18062](prometheus/prometheus#18062)
- \[ENHANCEMENT] HTTP client: Add AWS `external_id` support for sigv4. [#17916](prometheus/prometheus#17916)
- \[ENHANCEMENT] Kubernetes SD: Deduplicate deprecation warning logs from the Kubernetes API to reduce noise. [#17829](prometheus/prometheus#17829)
- \[ENHANCEMENT] TSDB: Remove old temporary checkpoints when creating a Checkpoint. [#17598](prometheus/prometheus#17598)
- \[ENHANCEMENT] UI: Add autocomplete support for experimental `first_over_time` and `ts_of_first_over_time` PromQL functions. [#18318](prometheus/prometheus#18318)
- \[ENHANCEMENT] Vultr SD: Upgrade govultr library from v2 to v3 for continued security patches and maintenance. [#18347](prometheus/prometheus#18347)
- \[PERF] PromQL: Improve performance and reduce heap allocations in joins (VectorBinop)/And/Or/Unless. [#17159](prometheus/prometheus#17159)
- \[PERF] PromQL: Partially address performance regression in native histogram aggregations due to using `KahanAdd`. [#18252](prometheus/prometheus#18252)
- \[PERF] Remote write: Optimize WAL watching used for RW sending to reuse internal buffers. [#18250](prometheus/prometheus#18250)
- \[PERF] TSDB: Optimize LabelValues intersection performance for matchers. [#18069](prometheus/prometheus#18069)
- \[PERF] UI: Skip restacking on hover in stacked series charts. [#18230](prometheus/prometheus#18230)
- \[BUGFIX] AWS SD: Fix EC2 SD ignoring the configured `endpoint` option, a regression from the AWS SDK v2 migration. [#18133](prometheus/prometheus#18133)
- \[BUGFIX] AWS SD: Fix panic in EC2 SD when DescribeAvailabilityZones returns nil ZoneName or ZoneId. [#18133](prometheus/prometheus#18133)
- \[BUGFIX] Agent: Fix memory leak caused by duplicate SeriesRefs being loaded as active series. [#17538](prometheus/prometheus#17538)
- \[BUGFIX] Alerting: Fix alert state incorrectly resetting to pending when the FOR period is increased in the config file. [#18244](prometheus/prometheus#18244)
- \[BUGFIX] Azure SD: Fix system-assigned managed identity not working when `client_id` is empty. [#18323](prometheus/prometheus#18323)
- \[BUGFIX] Consul SD: Fix filter parameter not being applied to health service endpoint, causing Node and Node.Meta filters to be ignored. [#17349](prometheus/prometheus#17349)
- \[BUGFIX] Kubernetes SD: Fix duplicate targets generated by `*DualStack` EndpointSlices policies. [#18192](prometheus/prometheus#18192)
- \[BUGFIX] OTLP: Fix ErrTooOldSample being returned as HTTP 500 instead of 400 in PRW v2 histogram write paths, preventing infinite client retry loops. [#18084](prometheus/prometheus#18084)
- \[BUGFIX] OTLP: Fix exemplars getting mixed between incorrect parts of a histogram. [#18056](prometheus/prometheus#18056)
- \[BUGFIX] PromQL: Do not skip histogram buckets in queries where histogram trimming is used. [#18263](prometheus/prometheus#18263)
- \[BUGFIX] Remote write: Fix `prometheus_remote_storage_sent_batch_duration_seconds` measuring before the request was sent. [#18214](prometheus/prometheus#18214)
- \[BUGFIX] Rules: Fix alert state restoration when rule labels contain Go template expressions. [#18375](prometheus/prometheus#18375)
- \[BUGFIX] Scrape: Fix panic when parsing bare label names without an equal sign in brace-only metric notation. [#18229](prometheus/prometheus#18229)
- \[BUGFIX] TSDB: Fail early when `use-uncached-io` feature flag is set on unsupported environments. [#18219](prometheus/prometheus#18219)
- \[BUGFIX] TSDB: Fall back to CLI flag values when retention is removed from config file. [#18200](prometheus/prometheus#18200)
- \[BUGFIX] TSDB: Fix memory leaks in buffer pools by clearing reference fields before returning buffers to pools. [#17895](prometheus/prometheus#17895)
- \[BUGFIX] TSDB: Fix missing mmap of histogram chunks during WAL replay. [#18306](prometheus/prometheus#18306)
- \[BUGFIX] TSDB: Fix storage.tsdb.retention.time unit mismatch in file causing retention to be 1e6 times longer than configured. [#18200](prometheus/prometheus#18200)
- \[BUGFIX] Tracing: Fix missing traceID in query log when tracing is enabled, previously only spanID was emitted. [#18189](prometheus/prometheus#18189)
- \[BUGFIX] UI: Fix tooltip Y-offset drift when using multiple graph panels. [#18228](prometheus/prometheus#18228)
- \[BUGFIX] UI: Update retention display in runtime info when config is reloaded. [#18200](prometheus/prometheus#18200)

Development

Successfully merging this pull request may close these issues.

ServiceDiscovery using Azure System Managed Identity stopped working after upgrade to 3.4.0