Skip to content

FEAT: Add ActiveDirectoryMSI support for bulk copy#573

Open
bewithgaurav wants to merge 9 commits into
mainfrom
bewithgaurav/gh534-bulkcopy-msi
Open

FEAT: Add ActiveDirectoryMSI support for bulk copy#573
bewithgaurav wants to merge 9 commits into
mainfrom
bewithgaurav/gh534-bulkcopy-msi

Conversation

@bewithgaurav
Copy link
Copy Markdown
Collaborator

@bewithgaurav bewithgaurav commented May 12, 2026

Work Item / Issue Reference

AB#

GitHub Issue: #534


Summary

Adds Authentication=ActiveDirectoryMSI support to bulk copy. Partial fix for #534.

  • Zero-arg ManagedIdentityCredential() for system-assigned MSI.
  • ManagedIdentityCredential(client_id=UID) for user-assigned MSI. Matches ODBC convention where UID carries the identity's client_id under MSI.
  • Threads optional credential_kwargs through get_auth_token / get_raw_token / _acquire_token so future auth methods that need constructor args plug in via the same channel.
  • Credential cache key stays a plain string for zero-arg auth types and becomes a tuple (auth_type, sorted_kwargs) when kwargs are present, so different client_ids get separate cached credentials.

mssql-py-core (the Rust path used by bulk copy) doesn't itself acquire Entra tokens. Python pre-acquires a JWT and passes it as access_token. Today this works for Default, DeviceCode, and Interactive. MSI was missing, the most common Azure-hosted-service auth method.

ServicePrincipal and Password are explicitly out of scope for this PR. They need different design work and ship separately.

How user-assigned MSI survives the bulkcopy fresh-token path

Connection.__init__ calls process_connection_string, which sanitizes self.connection_str (strips UID=, Authentication=, PWD=). Bulkcopy needs a fresh token at copy time. Re-parsing self.connection_str at that point would lose UID and silently degrade user-assigned MSI to system-assigned (wrong-identity bug).

Fix shape: process_connection_string returns a 4-tuple including the captured credential_kwargs. Connection.__init__ persists _credential_kwargs alongside _auth_type. cursor.bulkcopy() reads self.connection._credential_kwargs directly. No re-parse, no chance for the sanitized string to drop the client_id.

Connection string examples

# System-assigned MSI
"Server=tcp:foo.database.windows.net;Authentication=ActiveDirectoryMSI;Database=mydb;"

# User-assigned MSI
"Server=tcp:foo.database.windows.net;Authentication=ActiveDirectoryMSI;UID=<client_id_guid>;Database=mydb;"

Adds Authentication=ActiveDirectoryMSI to the auth pipeline:

- Zero-arg ManagedIdentityCredential() for system-assigned MSI.
- ManagedIdentityCredential(client_id=UID) for user-assigned MSI,
  matching ODBC's convention where UID carries the identity's
  client_id under MSI.
- Threads optional credential_kwargs through get_auth_token /
  get_raw_token / _acquire_token so future auth methods that need
  constructor args (e.g. ClientSecretCredential) can plug in via
  the same channel.
- Cache key remains a plain string for zero-arg auth types and
  becomes a tuple when kwargs are present, so different client_ids
  get separate cached credentials.

Partial fix for #534. ServicePrincipal and
Password to follow as separate PRs.
Copilot AI review requested due to automatic review settings May 12, 2026 12:09
@bewithgaurav bewithgaurav changed the title Add ActiveDirectoryMSI support for bulk copy FIX: Add ActiveDirectoryMSI support for bulk copy May 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for Authentication=ActiveDirectoryMSI (managed identity) token acquisition for the bulk copy (mssql-py-core) path, including user-assigned MSI via UID=<client_id>.

Changes:

  • Add AuthType.MSI (activedirectorymsi) and map it to ManagedIdentityCredential.
  • Thread optional credential_kwargs through token acquisition and update the credential-instance cache key to incorporate kwargs.
  • Attempt to extract MSI client_id from the connection string and pass it into bulk copy token acquisition; add tests and a changelog entry.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
mssql_python/constants.py Adds AuthType.MSI enum value.
mssql_python/auth.py Adds MSI credential support, credential kwargs plumbing, and cache key changes; adds extract_credential_kwargs.
mssql_python/cursor.py Bulk copy now tries to extract MSI credential kwargs and pass them to get_raw_token.
tests/test_008_auth.py Adds tests for MSI auth type parsing, credential construction, cache isolation, and connection-string processing.
CHANGELOG.md Documents MSI support for bulk copy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/cursor.py Outdated
Comment thread mssql_python/auth.py Outdated
Comment thread tests/test_008_auth.py
@bewithgaurav bewithgaurav changed the title FIX: Add ActiveDirectoryMSI support for bulk copy FEAT: Add ActiveDirectoryMSI support for bulk copy May 13, 2026
@bewithgaurav bewithgaurav marked this pull request as draft May 13, 2026 12:15
Connection.__init__ overwrites self.connection_str with the sanitized
(UID-stripped) string returned by process_connection_string. The original
implementation re-parsed self.connection_str at bulkcopy time via
extract_credential_kwargs, which silently dropped the user-assigned MSI
client_id and degraded to system-assigned  a wrong-identity bug.MSI

Changes:

- process_connection_string now returns a 4-tuple including the captured
  credential_kwargs so callers can persist them.
- Connection.__init__ stores _credential_kwargs alongside _auth_type.
- cursor.bulkcopy() reads self.connection._credential_kwargs instead of
  re-parsing self.connection_str.
- The public extract_credential_kwargs helper is removed (it only existed
  to support the broken re-parse path; nothing else needs it).
- black --line-length=100 reformats (CI was red).

Tests:

- test_bulkcopy_path_preserves_user_assigned_msi_client_id: invokes
  cursor.bulkcopy() with a mocked mssql_py_core, patches
  AADAuth.get_raw_token to capture the args it receives, and asserts
  the captured credential_kwargs match Connection._credential_kwargs.
  Fails if cursor reverts to re-parsing self.connection.connection_str.
- test_credential_kwargs_persisted_for_user_assigned_msi: asserts
  Connection.__init__ stores _credential_kwargs from the 4-tuple.
- test_credential_kwargs_none_for_system_assigned_msi.
- test_credential_kwargs_none_for_non_msi_auth.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bewithgaurav bewithgaurav marked this pull request as ready for review May 14, 2026 12:31
@github-actions github-actions Bot added the pr-size: medium Moderate update size label May 14, 2026
Comment thread mssql_python/auth.py Outdated
identity's client_id. Returns None for system-assigned MSI.
"""
for param in parameters:
key, _, value = param.strip().partition("=")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get this from the conn str parser map? Assuming we need the conn str UID

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what will happen if someone provides a UID={hello=world}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - missed this, parser should be used as a standard across
that handles {hello=world} correctly

bewithgaurav and others added 3 commits May 14, 2026 18:40
Per @saurabh500's review: braced ODBC values like UID={hello=world} need
the canonical parser, not naive partition('='). Without this, the helper
returns '{hello=world}' verbatim and ManagedIdentityCredential rejects it.
Worse, a UID containing a literal ';' would be truncated.

_extract_msi_client_id now delegates to _ConnectionStringParser, which
handles braces, escaped '}}' inside braces, and '=' inside braced values
correctly. validate_keywords=False so the helper never raises on keys
the auth flow doesn't care about.

Tests:
- test_msi_braced_uid_value_is_unwrapped: UID={hello=world} -> 'hello=world'
- test_msi_braced_uid_with_semicolon_is_preserved: UID={abc;def;ghi}

Note: process_connection_string and extract_auth_type still use naive
split(';') for Authentication= detection across all Entra ID auth types.
That's pre-existing and tracked separately for a parser-wide refactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread mssql_python/auth.py
except Exception: # noqa: BLE001 — parser raises ConnectionStringParseError on malformed input;
# absence of UID is the safe answer for credential extraction.
return None
uid = (parsed.get("uid") or "").strip()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont know if this is the right place, but we are kind of downgrading to SAMI if UAMI uid is not found.

Since we are falling back, do you think it makes sense to add a logging statement to help debug if needed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you're correct - the fallback to SAMI is expected and as per the ODBC behaviour
added logger statements for better debugging help, thanks for the suggestion!

Comment thread mssql_python/auth.py
try:
with _credential_cache_lock:
if auth_type not in _credential_cache:
if cache_key not in _credential_cache:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no pruning of credential cache. I don't think its needed right now, but I would encourage adding a comment about it.

This case is not usually possible but in some crazy scenario where the same app is serving requests for Multi-tenants (I don't think ODBC supports multi tenant Managed Identity yet).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment to reference this just above the _credential_cache declaration

- Debug log distinguishing user-assigned vs system-assigned MSI when the
  user passes Authentication=ActiveDirectoryMSI. Helps diagnose which
  branch was taken when token acquisition fails. Logs client_id length,
  not value (still identity material).
- Comment above _credential_cache explains the cache key shape so the
  unbounded growth is understood as a deliberate choice rather than an
  oversight.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

📊 Code Coverage Report

🔥 Diff Coverage

100%


🎯 Overall Coverage

25%


📈 Total Lines Covered: 6982 out of 27089
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/auth.py (100%)
  • mssql_python/connection.py (100%)
  • mssql_python/constants.py (100%)

Summary

  • Total: 34 lines
  • Missing: 0 lines
  • Coverage: 100%

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

Connection.__init__ already parses the same connection string through
_ConnectionStringParser via _construct_connection_string (connection.py
line 253) before process_connection_string is ever called. By the time
_extract_msi_client_id runs, the input is guaranteed parseable.

The try/except was dead code. A real parse failure here would indicate
an upstream bug and should propagate, not silently degrade user-assigned
MSI to system-assigned (which is the wrong-identity failure mode this PR
exists to prevent).

Brings diff coverage to 100%.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread mssql_python/auth.py
# time we get here the input is guaranteed parseable. No defensive
# try/except: a parse failure now means a real bug upstream and should
# propagate, not silently degrade user-assigned MSI to system-assigned.
parsed = _ConnectionStringParser(validate_keywords=False)._parse(connection_string)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally correct but a tiny perf hit due to double parsing of connections string.
Perfect solution would require parsing done earlier in the connect api and passing around the map to this function. That's a bigger change. I recommend a follow up PR and tracking with a GH issue.

An adhoc fix is to maintain a hashmap of connection string and uid. But that's prone to other problems esp concurrency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants