fix(oci/openai-compat): auto-refresh instance-principal token (closes silent 401 after ~15min)#206
Merged
Conversation
Closed
6 tasks
|
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA). To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated. If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public. |
c6a13a9 to
51009dc
Compare
… silent 401 after ~15min)
OCIOpenAIModel constructed its OCIRequestSigner without a refresh_signer
callback. OCIRequestSigner.auth_flow has both a refresh-on-401 branch
and a periodic-refresh branch — both early-return when refresh_signer
is None, which meant the federation token captured at process start
was used forever. On OKE, instance-principal tokens expire on the
order of 15–30 minutes, so any agent pod older than that would 401 on
every GenAI call until restarted.
Production symptom: chats silently fall through to reason=error after
~15 minutes of pod uptime, with httpx logs showing
"HTTP/1.1 401 Unauthorized" on every chat.completions call. Pod restart
was the only known workaround.
The fix wires the signer's own refresh_security_token method (present
on InstancePrincipalsSecurityTokenSigner, get_resource_principals_signer()
returns a signer with the same contract, and any DelegationTokenSigner
variant that follows the OCI SDK convention) into the wrapper via a
new _refresh_callable_for(signer) helper. Static signers (user-principal
API key) have no refresh_security_token attribute and the helper returns
None, so the refresh path stays dormant for them.
refresh_interval is also tightened from the upstream 3600s default to
600s — short enough that proactive refresh beats the typical 15–30
minute federation-token TTL even if a 401 doesn't fire first.
No public-API change; the wiring is internal. Bumps version to
0.2.0b12 with a changelog entry.
Test coverage:
* TestRefreshCallableFor — 3 cases (refresh attr present, missing,
non-callable defensive path)
* TestClientWiresRefreshSigner — 2 cases (token signer gets the
callback, static signer gets None)
All 49 OCI-area unit tests pass; hatch run check (format-check + ruff +
mypy) green.
Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
OCIResponsesModel uses the same OCIRequestSigner pattern as OCIOpenAIModel for its /v1/responses HTTP client, and was hitting the identical bug — signer constructed at startup without a refresh_signer callback, federation token frozen at process-start, 401 wall after ~15-30 min on OKE with instance-principal auth. Same fix: import _refresh_callable_for from openai_compat and pass it plus refresh_interval=600.0 to OCIRequestSigner. Two new test cases in TestHttpClientWiresRefreshSigner mirror the openai_compat coverage: token signer gets the callback, static signer gets None. CHANGELOG entry expanded into a transport-by-transport table so the scope of the fix is unambiguous (openai_compat + responses are both fixed; OCIModel/native SDK and OracleGenAIEmbeddings already refresh internally via the OCI SDK and are unaffected). Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
51009dc to
06999e7
Compare
OCI instance-principal token now auto-refreshes on both openai-style HTTP transports (OCIOpenAIModel + OCIResponsesModel). Closes the silent 401 after ~15min federation-token expiry on OKE pods with instance-principal auth. Signed-off-by: Federico Kamelhar <federico.kamelhar@oracle.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
OCIOpenAIModelconstructed itsOCIRequestSignerwithout arefresh_signercallback.OCIRequestSigner.auth_flowhas both a refresh-on-401 branch and a periodic-refresh branch — both early-return whenrefresh_signer is None, which meant the federation token captured at process start was used forever. On OKE, instance-principal tokens expire on the order of 15–30 minutes, so any agent pod older than that would 401 on every GenAI call until restarted.Production symptom: chats silently fall through to
reason=errorafter ~15 minutes of pod uptime; httpx logs showHTTP/1.1 401 Unauthorizedon everychat.completionscall. Pod restart was the only known workaround.Fix
Wires the signer's own
refresh_security_tokenmethod into the wrapper via a new_refresh_callable_for(signer)helper:InstancePrincipalsSecurityTokenSigner— hasrefresh_security_tokenget_resource_principals_signer()— returns a signer with the same contractDelegationTokenSigner(and other OCI-SDK-convention variants) — same contract, picked up viahasattrrather than isinstance checksSigner— norefresh_security_token, so the helper returnsNoneand the refresh path stays dormantrefresh_intervalis also tightened from the upstream3600.0default to600.0— short enough that proactive refresh beats the typical 15–30 minute federation-token TTL even if a 401 doesn't fire first.No public-API change; the wiring is internal.
Test plan
TestRefreshCallableFor— 3 cases (refresh attr present / missing / non-callable defensive path)TestClientWiresRefreshSigner— 2 cases (token signer gets the callback, static signer gets None)tests/unit/test_oci_*.py,test_rag_embeddings_oci.py)hatch run checkclean (format-check + ruff + mypy on Python 3.12)0.2.0b12+ changelog entry0.2.0b12PyPI release rolls into the consumer image