feat: OAuth device flow and per-project server linking#83
Conversation
Implement RFC 8628 device authorization grant for CLI authentication. Users run `osa login` to authenticate via browser, and credentials are stored in ~/.config/osa/credentials.json keyed by server URL. Add `osa link --server <url>` to save the server URL in .osa/config.json so subsequent commands resolve it automatically without --server flag. Server-side: device authorization endpoints, user code verification page, device token polling, and database migration for device_authorizations table. SDK-side: login/logout commands, credential storage with token refresh, resolve_server() chain (--server → OSA_SERVER → .osa/config.json), and deploy command integration with stored credentials. Closes #81
|
Greptile SummaryThis PR implements RFC 8628 OAuth device authorization grant for CLI authentication ( Key findings:
Confidence Score: 3/5
|
| Filename | Overview |
|---|---|
| sdk/py/osa/cli/credentials.py | New credential storage module; prior TOCTOU and dead-code issues addressed with atomic os.open write and resolve_token now calling refresh_access_token. |
| sdk/py/osa/cli/login.py | Device flow polling loop; slow_down backoff sync and previous threading concerns addressed, but unguarded resp.json() + bare key access on the device initiation response (lines 109-114) can raise uncaught ValueError/KeyError. |
| server/osa/application/api/v1/routes/auth.py | Device flow routes added; templates loaded at import time (flagged in previous review), early OAuth error redirects for missing state/error param still send device-flow users to frontend error page rather than a device-specific page (flagged in previous review). |
| server/osa/infrastructure/persistence/repository/auth.py | Correct atomic AUTHORIZED→CONSUMED transition via UPDATE…RETURNING; however delete_expired_before omits CONSUMED records from cleanup, causing the table to grow unboundedly over time. |
| server/osa/domain/auth/model/device_authorization.py | Well-structured entity with correct state-machine transitions (pending→authorized→consumed, pending→expired) and replay-prevention via unique device_code. |
| server/osa/domain/auth/service/auth.py | Device flow methods well-implemented; exchange_device_code correctly uses atomic consume-then-mint pattern to prevent duplicate token issuance. _generate_user_code depends on the large SAFE_CHARS alphabet, making collisions negligible in practice. |
| server/migrations/versions/add_device_authorizations.py | Migration creates device_authorizations table with unique constraints on both codes and an index on (status, expires_at); uses a human-readable string revision ID rather than the standard hex format, which is unconventional but functional. |
Sequence Diagram
sequenceDiagram
participant CLI as osa CLI
participant Server as OSA Server
participant DB as device_authorizations
participant ORCID as ORCID
participant Browser as User Browser
CLI->>Server: POST /api/v1/auth/device
Server->>DB: INSERT (status=pending, user_code, device_code)
Server-->>CLI: device_code, user_code, verification_uri, expires_in, interval
CLI->>Browser: open verification_uri?code=XXXX-XXXX
Browser->>Server: GET /api/v1/auth/device/verify?code=XXXX-XXXX
loop CLI polls every interval seconds
CLI->>Server: POST /api/v1/auth/device/token {device_code}
Server-->>CLI: 400 {error: authorization_pending}
end
Browser->>Server: POST /api/v1/auth/device/verify {user_code}
Server->>DB: get_by_user_code → verify pending
Server-->>Browser: 302 → ORCID authorization URL (state contains device_code)
Browser->>ORCID: Authorization request
ORCID-->>Browser: 302 → /api/v1/auth/callback?code=...&state=...
Browser->>Server: GET /api/v1/auth/callback?code=...&state=...
Server->>ORCID: exchange code for identity
Server->>DB: UPDATE status=authorized, user_id=... (via authorize_device)
Server-->>Browser: 302 → /api/v1/auth/device/complete (success page)
CLI->>Server: POST /api/v1/auth/device/token {device_code}
Server->>DB: UPDATE status=consumed WHERE status=authorized RETURNING * (atomic)
Server->>Server: mint access_token + refresh_token
Server-->>CLI: 200 {access_token, refresh_token}
CLI->>CLI: write credentials to ~/.config/osa/credentials.json (0600)
Last reviewed commit: 89e71b8
… refresh code - Add atomic consume_if_authorized to DeviceAuthorizationRepository that uses UPDATE WHERE status='authorized' so only one concurrent poller wins - Replace write_text+chmod with os.open(O_CREAT, 0o600) in credentials to eliminate window where tokens are briefly world-readable - Wire refresh_access_token into resolve_token so expired tokens are refreshed automatically instead of causing silent 401s
|
@greptile |
| def test_stored_credentials_used_when_no_env(self, cred_file: Path, monkeypatch): | ||
| monkeypatch.delenv("OSA_TOKEN", raising=False) | ||
| write_credentials( | ||
| "https://example.com", | ||
| access_token="stored-at", | ||
| refresh_token="rt", | ||
| path=cred_file, | ||
| ) | ||
|
|
||
| token = resolve_token("https://example.com", path=cred_file) | ||
| assert token == "stored-at" |
There was a problem hiding this comment.
Unintentional real HTTP call inside unit test
resolve_token now always calls refresh_access_token when stored credentials are found. This test doesn't mock that function, so it makes a live httpx.POST to https://example.com/api/v1/auth/refresh on every test run. The assertion passes only because the connection fails (caught as httpx.HTTPError) and the code falls back to the stored token — the test is not actually exercising the intended "stored credentials are returned" path and its outcome depends on network behavior.
The identical pattern exists at sdk/py/tests/test_deploy_auth.py:30-41 (test_stored_credentials_used_without_env).
Both tests should patch refresh_access_token (either returning a refreshed token to verify the happy path, or returning None to verify the fallback). The sibling tests test_attempts_refresh_when_stored_creds_exist and test_falls_back_to_stored_token_when_refresh_fails already demonstrate the correct mocking approach.
- Replace ValueError/RuntimeError with domain errors (InvalidStateError, InfrastructureError) in DeviceAuthorization model and AuthService - Move device flow callback and verify logic from router to command handlers (CompleteDeviceOAuth, VerifyDeviceCode) - Replace tuple return type with DeviceTokenResult dataclass - Filter delete_expired_before to only remove PENDING/EXPIRED rows - Fix slow_down backoff sync bug in SDK polling - Patch refresh_access_token in SDK tests to prevent live HTTP calls - Consolidate deferred imports and clean up debug logging in AuthService
|
@greptile |
| if resp.status_code == 200: | ||
| return resp.json() | ||
|
|
||
| if resp.status_code >= 500: | ||
| # Server error — backoff and retry | ||
| time.sleep(min(backoff, 30)) | ||
| backoff = min(backoff * 2, 30) | ||
| continue | ||
|
|
||
| # 400-level: check error code | ||
| data = resp.json() | ||
| error = data.get("error", "") |
There was a problem hiding this comment.
resp.json() unguarded — crashes login on non-JSON responses
Both the 200 path (line 56) and the 400 path (line 65) call resp.json() without exception handling. If a WAF, reverse-proxy, or the server itself returns a non-JSON body (e.g., an HTML error page) for either response code, Python raises json.JSONDecodeError (a ValueError), which is not caught anywhere in the call stack. The polling loop exits with an unhandled traceback instead of a graceful failure message.
| if resp.status_code == 200: | |
| return resp.json() | |
| if resp.status_code >= 500: | |
| # Server error — backoff and retry | |
| time.sleep(min(backoff, 30)) | |
| backoff = min(backoff * 2, 30) | |
| continue | |
| # 400-level: check error code | |
| data = resp.json() | |
| error = data.get("error", "") | |
| if resp.status_code == 200: | |
| try: | |
| return resp.json() | |
| except ValueError: | |
| logger.error("Device token endpoint returned non-JSON 200 response") | |
| return None | |
| if resp.status_code >= 500: | |
| # Server error — backoff and retry | |
| time.sleep(min(backoff, 30)) | |
| backoff = min(backoff * 2, 30) | |
| continue | |
| # 400-level: check error code | |
| try: | |
| data = resp.json() | |
| except ValueError: | |
| logger.error("Device token endpoint returned non-JSON %d response", resp.status_code) | |
| return None |
| if error: | ||
| logger.warning("OAuth error: %s - %s", error, error_description) | ||
| error_params = urlencode( | ||
| { | ||
| "error": error, | ||
| "error_description": error_description or "Authentication failed", | ||
| } | ||
| return RedirectResponse( | ||
| url=_error_redirect(error, error_description or "Authentication failed") | ||
| ) | ||
| return RedirectResponse(url=f"{frontend_url}/auth/error?{error_params}") | ||
|
|
||
| # Validate signed state token | ||
| if not state: | ||
| logger.warning("OAuth state missing") | ||
| error_params = urlencode( | ||
| { | ||
| "error": "oauth_state_missing", | ||
| "error_description": "Missing state parameter", | ||
| } | ||
| return RedirectResponse( | ||
| url=_error_redirect("oauth_state_missing", "Missing state parameter") | ||
| ) | ||
| return RedirectResponse(url=f"{frontend_url}/auth/error?{error_params}") | ||
|
|
||
| state_data = token_service.verify_oauth_state(state) | ||
| if state_data is None: | ||
| logger.warning("OAuth state invalid or expired") | ||
| error_params = urlencode( | ||
| { | ||
| "error": "oauth_state_invalid", | ||
| "error_description": "Invalid or expired state parameter", | ||
| } | ||
| return RedirectResponse( | ||
| url=_error_redirect("oauth_state_invalid", "Invalid or expired state parameter") | ||
| ) |
There was a problem hiding this comment.
Device flow users sent to frontend on early OAuth errors
At lines 191–202, the code returns early on two conditions — the OAuth provider returning an error query param (e.g., the user clicked "Deny" on ORCID), and a missing state param — before is_device_flow is ever computed (line 213). Both branches call _error_redirect, which redirects the user's browser to {frontend_url}/auth/error. For a device flow session, this drops the user at the web application's error page, which typically prompts them to "return to the app" rather than "return to your terminal and run osa login again."
Although the CLI will eventually time out, the UX is misleading for what is likely the most common device flow failure path (user denies access on ORCID).
The state parameter is present in the OAuth error redirect from ORCID (RFC 6749 §4.1.2.1 requires it), so the state can be decoded early enough to detect device flow before issuing the redirect:
# Parse state early so device_flow errors route correctly
_early_state_data = token_service.verify_oauth_state(state) if state else None
_is_early_device_flow = _early_state_data is not None and _early_state_data.device_code is not None
if error:
logger.warning("OAuth error: %s - %s", error, error_description)
if _is_early_device_flow:
return RedirectResponse(url=_device_error_redirect(error_description or "Authentication failed"))
return RedirectResponse(
url=_error_redirect(error, error_description or "Authentication failed")
)
if not state:
logger.warning("OAuth state missing")
return RedirectResponse(
url=_error_redirect("oauth_state_missing", "Missing state parameter")
)|
@greptile |
| async def delete_expired_before(self, cutoff: datetime) -> int: | ||
| stmt = delete(device_authorizations_table).where( | ||
| device_authorizations_table.c.expires_at < cutoff, | ||
| device_authorizations_table.c.status.in_( | ||
| [ | ||
| DeviceAuthorizationStatus.PENDING.value, | ||
| DeviceAuthorizationStatus.EXPIRED.value, | ||
| ] | ||
| ), | ||
| ) | ||
| result = await self.session.execute(stmt) | ||
| await self.session.flush() | ||
| return result.rowcount |
There was a problem hiding this comment.
CONSUMED records never cleaned up — table grows unboundedly
delete_expired_before only deletes rows with status PENDING or EXPIRED. CONSUMED records (one per successful CLI login) are permanently excluded from cleanup even though their expires_at is also in the past (15 minutes after creation).
This has two concrete consequences:
- Unbounded table growth — every successful
osa loginpermanently adds a row todevice_authorizations. On a popular server this will degrade query performance over time. - Unique constraint exhaustion — the
uq_device_auth_user_codeconstraint covers all rows, including consumed ones. With 27^8 ≈ 282 billion possible codes this won't be hit in practice, but it means those codes can never be reused by new device authorizations.
The fix is straightforward — include CONSUMED in the cleanup filter, since consumed records are no longer needed once their expires_at has passed:
async def delete_expired_before(self, cutoff: datetime) -> int:
stmt = delete(device_authorizations_table).where(
device_authorizations_table.c.expires_at < cutoff,
device_authorizations_table.c.status.in_(
[
DeviceAuthorizationStatus.PENDING.value,
DeviceAuthorizationStatus.EXPIRED.value,
DeviceAuthorizationStatus.CONSUMED.value, # also safe to clean up
]
),
)
result = await self.session.execute(stmt)
await self.session.flush()
return result.rowcountIf audit retention of consumed records is intentional, the decision and its trade-offs should be documented, and a separate TTL-based retention policy (e.g., keep for 90 days) should be added.
|
|
||
| data = resp.json() | ||
| device_code = data["device_code"] | ||
| user_code = data["user_code"] | ||
| verification_uri = data["verification_uri"] | ||
| expires_in = data["expires_in"] |
There was a problem hiding this comment.
Unguarded resp.json() and bare key access on device initiation response
resp.json() can raise json.JSONDecodeError (a ValueError) if the 200 response body is not valid JSON (e.g., an HTML error page from a WAF or reverse proxy). Additionally, the subsequent direct key accesses (data["device_code"], data["user_code"], etc.) will raise KeyError if any field is absent.
The same concern applies to the write step at result["access_token"] / result["refresh_token"] ~line 142, where result comes from the poll endpoint.
Both _poll_for_token (noted in a prior review comment at line 55) and login() share this pattern. Consider wrapping in a try/except and printing a helpful diagnostic message on failure:
try:
data = resp.json()
device_code = data["device_code"]
user_code = data["user_code"]
verification_uri = data["verification_uri"]
expires_in = data["expires_in"]
interval = data["interval"]
except (ValueError, KeyError) as e:
print(f"Error: unexpected response from server: {e}", file=sys.stderr)
return FalseCONSUMED device authorizations are safe to delete once expired — the CLI has already exchanged them for tokens. Only AUTHORIZED rows need protection (user completed login but CLI hasn't polled yet).
Summary
osa login/osa logout)osa link --server <url>for per-project server linking via.osa/config.jsonresolve_server()chain (--server→OSA_SERVER→.osa/config.json)Test plan
osa link --server <url>writes.osa/config.jsonosa loginauthenticates via browser without--serverwhen linkedosa deployresolves server from.osa/config.jsonCloses #81