Skip to content

fix(oci/openai-compat): auto-refresh instance-principal token (closes silent 401 after ~15min)#206

Merged
fede-kamel merged 3 commits into
mainfrom
fix/oci-instance-principal-token-refresh-v2
May 17, 2026
Merged

fix(oci/openai-compat): auto-refresh instance-principal token (closes silent 401 after ~15min)#206
fede-kamel merged 3 commits into
mainfrom
fix/oci-instance-principal-token-refresh-v2

Conversation

@fede-kamel
Copy link
Copy Markdown
Contributor

Summary

OCIOpenAIModel constructed its OCIRequestSigner without a refresh_signer callback. OCIRequestSigner.auth_flow has both a refresh-on-401 branch and a periodic-refresh branch — both early-return when refresh_signer is None, which meant the federation token captured at process start was used forever. On OKE, instance-principal tokens expire on the order of 15–30 minutes, so any agent pod older than that would 401 on every GenAI call until restarted.

Production symptom: chats silently fall through to reason=error after ~15 minutes of pod uptime; httpx logs show HTTP/1.1 401 Unauthorized on every chat.completions call. Pod restart was the only known workaround.

Fix

Wires the signer's own refresh_security_token method into the wrapper via a new _refresh_callable_for(signer) helper:

  • InstancePrincipalsSecurityTokenSigner — has refresh_security_token
  • get_resource_principals_signer() — returns a signer with the same contract
  • DelegationTokenSigner (and other OCI-SDK-convention variants) — same contract, picked up via hasattr rather than isinstance checks
  • User-principal API-key Signer — no refresh_security_token, so the helper returns None and the refresh path stays dormant

refresh_interval is also tightened from the upstream 3600.0 default to 600.0 — short enough that proactive refresh beats the typical 15–30 minute federation-token TTL even if a 401 doesn't fire first.

No public-API change; the wiring is internal.

Test plan

  • TestRefreshCallableFor — 3 cases (refresh attr present / missing / non-callable defensive path)
  • TestClientWiresRefreshSigner — 2 cases (token signer gets the callback, static signer gets None)
  • All 49 OCI-area unit tests pass (tests/unit/test_oci_*.py, test_rag_embeddings_oci.py)
  • hatch run check clean (format-check + ruff + mypy on Python 3.12)
  • Version bumped to 0.2.0b12 + changelog entry
  • Live verify after 0.2.0b12 PyPI release rolls into the consumer image

@oracle-contributor-agreement
Copy link
Copy Markdown

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement oracle-contributor-agreement Bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label May 17, 2026
@fede-kamel fede-kamel force-pushed the fix/oci-instance-principal-token-refresh-v2 branch from c6a13a9 to 51009dc Compare May 17, 2026 02:51
@oracle-contributor-agreement oracle-contributor-agreement Bot added OCA Verified All contributors have signed the Oracle Contributor Agreement. and removed OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. labels May 17, 2026
… silent 401 after ~15min)

OCIOpenAIModel constructed its OCIRequestSigner without a refresh_signer
callback. OCIRequestSigner.auth_flow has both a refresh-on-401 branch
and a periodic-refresh branch — both early-return when refresh_signer
is None, which meant the federation token captured at process start
was used forever. On OKE, instance-principal tokens expire on the
order of 15–30 minutes, so any agent pod older than that would 401 on
every GenAI call until restarted.

Production symptom: chats silently fall through to reason=error after
~15 minutes of pod uptime, with httpx logs showing
"HTTP/1.1 401 Unauthorized" on every chat.completions call. Pod restart
was the only known workaround.

The fix wires the signer's own refresh_security_token method (present
on InstancePrincipalsSecurityTokenSigner, get_resource_principals_signer()
returns a signer with the same contract, and any DelegationTokenSigner
variant that follows the OCI SDK convention) into the wrapper via a
new _refresh_callable_for(signer) helper. Static signers (user-principal
API key) have no refresh_security_token attribute and the helper returns
None, so the refresh path stays dormant for them.

refresh_interval is also tightened from the upstream 3600s default to
600s — short enough that proactive refresh beats the typical 15–30
minute federation-token TTL even if a 401 doesn't fire first.

No public-API change; the wiring is internal. Bumps version to
0.2.0b12 with a changelog entry.

Test coverage:
  * TestRefreshCallableFor — 3 cases (refresh attr present, missing,
    non-callable defensive path)
  * TestClientWiresRefreshSigner — 2 cases (token signer gets the
    callback, static signer gets None)

All 49 OCI-area unit tests pass; hatch run check (format-check + ruff +
mypy) green.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
OCIResponsesModel uses the same OCIRequestSigner pattern as
OCIOpenAIModel for its /v1/responses HTTP client, and was hitting the
identical bug — signer constructed at startup without a refresh_signer
callback, federation token frozen at process-start, 401 wall after
~15-30 min on OKE with instance-principal auth.

Same fix: import _refresh_callable_for from openai_compat and pass it
plus refresh_interval=600.0 to OCIRequestSigner. Two new test cases
in TestHttpClientWiresRefreshSigner mirror the openai_compat coverage:
token signer gets the callback, static signer gets None.

CHANGELOG entry expanded into a transport-by-transport table so the
scope of the fix is unambiguous (openai_compat + responses are both
fixed; OCIModel/native SDK and OracleGenAIEmbeddings already refresh
internally via the OCI SDK and are unaffected).

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
@fede-kamel fede-kamel force-pushed the fix/oci-instance-principal-token-refresh-v2 branch from 51009dc to 06999e7 Compare May 17, 2026 02:53
OCI instance-principal token now auto-refreshes on both
openai-style HTTP transports (OCIOpenAIModel + OCIResponsesModel).
Closes the silent 401 after ~15min federation-token expiry on OKE
pods with instance-principal auth.

Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
@fede-kamel fede-kamel merged commit 8ed85cc into main May 17, 2026
10 checks passed
@fede-kamel fede-kamel deleted the fix/oci-instance-principal-token-refresh-v2 branch May 17, 2026 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

OCA Verified All contributors have signed the Oracle Contributor Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant