fix(oci/openai-compat): auto-refresh instance-principal token (closes silent 401 after ~15min)#205

Closed
fede-kamel wants to merge 1 commit into main from fix/oci-instance-principal-token-refresh

@fede-kamel (Contributor)

Summary

OCIOpenAIModel constructed its OCIRequestSigner without a refresh_signer callback. OCIRequestSigner.auth_flow has both a refresh-on-401 branch and a periodic-refresh branch — both early-return when refresh_signer is None, which meant the federation token captured at process start was used forever. On OKE, instance-principal tokens expire on the order of 15–30 minutes, so any agent pod older than that would 401 on every GenAI call until restarted.

Production symptom observed in almariel: chats silently fall through to reason=error after ~15 minutes of pod uptime; httpx logs show HTTP/1.1 401 Unauthorized on every chat.completions call. Pod restart was the only known workaround.
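
To make the failure mode concrete, here is a toy model of a signer with the two refresh branches described above. It is a sketch under stated assumptions, not the real `OCIRequestSigner`: the class name `SketchSigner`, its method names, and the injectable `clock` parameter are all hypothetical. The only behaviour taken from this PR is that both branches do nothing when `refresh_signer` is `None`.

```python
import time
from typing import Callable, Optional


class SketchSigner:
    """Toy stand-in for a request signer with two refresh branches.

    Hypothetical model for illustration; not the real OCIRequestSigner.
    """

    def __init__(
        self,
        refresh_signer: Optional[Callable[[], None]] = None,
        refresh_interval: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._refresh_signer = refresh_signer
        self._refresh_interval = refresh_interval
        self._clock = clock
        self._last_refresh = clock()

    def _refresh(self) -> bool:
        # Both branches funnel through here. Without a callback this is a
        # no-op, so the token captured at start-up is never replaced.
        if self._refresh_signer is None:
            return False
        self._refresh_signer()
        self._last_refresh = self._clock()
        return True

    def maybe_refresh_periodically(self) -> bool:
        """Periodic branch: refresh once the interval has elapsed."""
        if self._clock() - self._last_refresh < self._refresh_interval:
            return False
        return self._refresh()

    def handle_response(self, status_code: int) -> bool:
        """401 branch: return True if the request is worth retrying."""
        if status_code != 401:
            return False
        return self._refresh()
```

With `refresh_signer=None`, `handle_response(401)` returns `False` every time, which matches the observed symptom: each call fails with 401 and nothing ever recovers until the process restarts.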

Fix

Wires the signer's own refresh_security_token method into the wrapper via a new _refresh_callable_for(signer) helper. Coverage by signer type:

  • InstancePrincipalsSecurityTokenSigner — has refresh_security_token
  • get_resource_principals_signer() — returns a signer with the same contract
  • DelegationTokenSigner (and other OCI-SDK-convention variants) — same contract, picked up via hasattr rather than isinstance checks
  • User-principal API-key Signer — no refresh_security_token, so the helper returns None and the refresh path stays dormant
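
The duck-typed lookup described in the list above can be sketched as follows. This is a minimal illustration, not the PR's actual helper: the function name `refresh_callable_for_sketch` and the two fake signer classes are hypothetical, and the only assumption about real signers is the one stated in this PR, namely that refreshable ones expose a callable `refresh_security_token` attribute.

```python
from typing import Any, Callable, Optional


def refresh_callable_for_sketch(signer: Any) -> Optional[Callable[[], Any]]:
    """Return the signer's refresh method, or None if it has none.

    Uses attribute lookup rather than isinstance checks, so any signer
    that follows the same convention is picked up automatically.
    """
    refresh = getattr(signer, "refresh_security_token", None)
    # Defensive path: an attribute that exists but is not callable is
    # treated the same as a missing one.
    return refresh if callable(refresh) else None


class FakeTokenSigner:
    """Hypothetical stand-in for a token-based signer."""

    def __init__(self) -> None:
        self.refresh_count = 0

    def refresh_security_token(self) -> None:
        self.refresh_count += 1


class FakeStaticSigner:
    """Hypothetical stand-in for an API-key signer with nothing to refresh."""
```

Calling `refresh_callable_for_sketch(FakeTokenSigner())` yields a bound method that can be handed over as the refresh callback, while `refresh_callable_for_sketch(FakeStaticSigner())` yields `None`, which leaves the refresh path dormant.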

refresh_interval is also tightened from the upstream default of 3600.0 seconds to 600.0 seconds, short enough that proactive refresh beats the typical 15–30 minute federation-token TTL even if a 401 doesn't fire first.
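
The arithmetic behind the interval choice can be checked with a few lines. This is a simplified sketch: `refresh_margin_seconds` is a hypothetical helper, and it assumes the refresh timer starts when the token is issued and that the proactive check actually fires on schedule.

```python
def refresh_margin_seconds(ttl_seconds: float, refresh_interval: float) -> float:
    """Seconds of token lifetime left when the proactive refresh is due.

    A positive value means the refresh happens before expiry; a negative
    value means the token lapses first.
    """
    return ttl_seconds - refresh_interval


for interval in (3600.0, 600.0):
    for ttl_minutes in (15, 30):
        margin = refresh_margin_seconds(ttl_minutes * 60.0, interval)
        verdict = "refreshed in time" if margin > 0 else "expires first"
        print(f"interval={interval:.0f}s ttl={ttl_minutes}min margin={margin:.0f}s ({verdict})")
```

Under these assumptions a 600-second interval leaves 300 seconds of headroom against a 15-minute TTL, whereas the 3600-second default lets the token expire well before the first proactive refresh.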

No public-API change; the wiring is internal.

Test plan

  • TestRefreshCallableFor — 3 cases (refresh attr present / missing / non-callable defensive path)
  • TestClientWiresRefreshSigner — 2 cases (token signer gets the callback, static signer gets None)
  • All 49 OCI-area unit tests pass (tests/unit/test_oci_*.py, test_rag_embeddings_oci.py)
  • ruff check clean
  • Version bumped to 0.2.0b12 + changelog entry
  • Live verify in almariel after pip install locus-sdk==0.2.0b12 rolls into the agent image

🤖 Generated with Claude Code

… silent 401 after ~15min)

OCIOpenAIModel constructed its OCIRequestSigner without a refresh_signer
callback. OCIRequestSigner.auth_flow has both a refresh-on-401 branch
and a periodic-refresh branch — both early-return when refresh_signer
is None, which meant the federation token captured at process start
was used forever. On OKE, instance-principal tokens expire on the
order of 15–30 minutes, so any agent pod older than that would 401 on
every GenAI call until restarted.

Production symptom seen in almariel: chats silently fall through to
reason=error after ~15 minutes of pod uptime, with the underlying
httpx logs showing "HTTP/1.1 401 Unauthorized" on every chat.completions
call. Pod restart was the only known workaround.

The fix wires the signer's own refresh_security_token method into the
wrapper via a new _refresh_callable_for(signer) helper. The method is
present on InstancePrincipalsSecurityTokenSigner, on the signer returned
by get_resource_principals_signer(), and on any DelegationTokenSigner
variant that follows the OCI SDK convention. Static signers
(user-principal API key) have no refresh_security_token attribute, so
the helper returns None and the refresh path stays dormant for them.

refresh_interval is also tightened from the upstream 3600s default to
600s — short enough that proactive refresh beats the typical 15–30
minute federation-token TTL even if a 401 doesn't fire first.

No public-API change; the wiring is internal. Bumps version to
0.2.0b12 + changelog entry.

Test coverage:
  * TestRefreshCallableFor — 3 cases (refresh attr exists, missing,
    non-callable defensive path)
  * TestClientWiresRefreshSigner — 2 cases (token signer gets the
    callback, static signer gets None)

All 49 OCI-area unit tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@oracle-contributor-agreement (bot) added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on May 17, 2026

@fede-kamel (Contributor, Author)

Superseded by the v2 branch with DCO sign-off + ruff format applied + clean commit metadata. New PR: #206

@fede-kamel closed this on May 17, 2026
@fede-kamel deleted the fix/oci-instance-principal-token-refresh branch on May 17, 2026 03:51