feat(credential): Support for AWS credential injection #24

Merged
dpup merged 18 commits into main from aws-creds
Jan 21, 2026

@dpup
Collaborator

@dpup dpup commented Jan 17, 2026

Summary

  • Add AWS credential support via IAM role assumption with automatic refresh
  • Refactor README for clearer philosophy and simpler quick start
  • Add CONTRIBUTING.md with development guidelines

AWS Credential Support

Adds an agent grant aws command that stores IAM role configuration (not secrets). At runtime, containers receive credentials via credential_process: a small embedded binary fetches fresh credentials from the proxy on demand.

Usage:

  # One-time setup: configure role to assume
  agent grant aws --role=arn:aws:iam::123456789012:role/AgentRole

  # Run with AWS access
  agent run --grant aws -- aws s3 ls

Key design decisions:

  • Uses credential_process instead of static environment variables for automatic refresh (agent runs can last days, credentials expire in 15m-12h)
  • Embedded Go binary for credential fetching works in any container image (no curl/bash dependency)
  • Host AWS credentials never enter the container; only short-lived assumed role credentials are available
  • Auth token required when proxy binds to all interfaces (Apple containers)

Documentation Updates

  • README.md: Rewritten with clearer philosophy ("Don't manage containers. Manage runs.") and streamlined quick start
  • CONTRIBUTING.md: New file with development commands, code style, and contribution guidelines
  • examples/grant-aws/: Working example with IAM setup instructions
  • docs/plans/: Consolidated AWS implementation plan documenting the architecture

Test plan

  • agent grant aws --help shows AWS-specific flags
  • agent grant aws --role=invalid returns ARN validation error
  • agent grant aws --role=arn:aws:iam::ACCOUNT:role/ROLE succeeds with valid host credentials
  • agent run --grant aws -- aws sts get-caller-identity returns assumed role identity
  • Credentials refresh automatically for long-running agents

@dpup dpup requested a review from andybons January 17, 2026 01:30
@majorcontext majorcontext deleted a comment from claude Bot Jan 17, 2026
@dpup
Collaborator Author

dpup commented Jan 17, 2026

@andybons the main decision to discuss here is whether to pass all traffic through the proxy. There are pros and cons: AWS doesn't strictly need it, because credentials aren't injected at the request level, but without it there would be no auditing of network traffic.

@claude

claude Bot commented Jan 17, 2026

📦 Previous Review (superseded by newer review)

Pull Request Review: AWS Credential Support

This PR adds comprehensive AWS credential support via IAM role assumption. The implementation is well-designed and follows good security practices. Below is my detailed review:


Strengths

Architecture & Design

  • Excellent security model: Uses IAM role assumption instead of credential passthrough, ensuring host credentials never enter containers
  • Smart credential refresh: credential_process approach provides automatic refresh for long-running agents
  • Well-documented: Comprehensive design docs and examples make the feature approachable
  • Clean separation of concerns: Credential provider, proxy handler, and run manager integration are properly separated

Code Quality

  • Good test coverage: Tests for ARN parsing, session duration validation, caching behavior, and expiration refresh
  • Proper concurrency: Double-checked locking pattern in GetCredentials() prevents race conditions
  • Error handling: Clear, actionable error messages guide users through setup issues
  • Interface design: STSAssumeRoler interface enables testing without AWS credentials

Documentation

  • README updates are clear and include working examples
  • Design documents explain architectural decisions
  • Example IAM setup instructions (CloudFormation + CLI) are helpful

🔍 Issues & Concerns

1. ARN Parsing Vulnerability (Security - Medium)

Location: internal/credential/aws.go:15-31

The ARN parser only validates structure but doesn't handle all edge cases:

parts := strings.Split(arn, ":")
if len(parts) < 6 {
    return nil, fmt.Errorf("invalid ARN format: %s", arn)
}

Issues:

  • ARN format allows colons in the resource portion, which would create len(parts) > 6
  • Example: arn:aws:iam::123456789012:role/path:with:colons/RoleName would be incorrectly parsed
  • parts[5] might not contain the full resource string

Recommendation: Join all parts from index 5 onward:

if len(parts) < 6 {
    return nil, fmt.Errorf("invalid ARN format: %s", arn)
}
resource := strings.Join(parts[5:], ":")
if !strings.HasPrefix(resource, "role/") {
    return nil, fmt.Errorf("invalid ARN: must be a role ARN (got %s)", resource)
}
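An alternative to joining parts[5:] is to cap the split at six fields so the resource portion is never broken apart. The sketch below assumes a hypothetical helper named roleResource; it is not the PR's ParseRoleARN and only covers the structural checks discussed here.

```go
package main

import (
	"fmt"
	"strings"
)

// roleResource returns the resource portion ("role/...") of an IAM role ARN.
// SplitN caps the result at six fields, so any colons after the fifth
// separator stay inside the final field instead of creating extra parts.
func roleResource(arn string) (string, error) {
	parts := strings.SplitN(arn, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" {
		return "", fmt.Errorf("invalid ARN format: %s", arn)
	}
	if parts[2] != "iam" {
		return "", fmt.Errorf("invalid ARN: must be an IAM ARN (got %s)", parts[2])
	}
	resource := parts[5]
	if !strings.HasPrefix(resource, "role/") || resource == "role/" {
		return "", fmt.Errorf("invalid ARN: must be a role ARN (got %s)", resource)
	}
	return resource, nil
}

func main() {
	arns := []string{
		"arn:aws:iam::123456789012:role/AgentRole",
		"arn:aws:iam::123456789012:role/path:with:colons/RoleName",
		"arn:aws:iam::123456789012:user/bob",
	}
	for _, arn := range arns {
		resource, err := roleResource(arn)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("resource:", resource)
	}
}
```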

2. Credential Storage Abuse of Scopes Field (Design - Medium)

Location: cmd/agent/cli/grant.go:211-216

AWS config is packed into the Credential struct by abusing the Scopes slice:

cred := credential.Credential{
    Provider:  credential.ProviderAWS,
    Token:     awsCfg.RoleARN, // Store role ARN in Token field
    Scopes:    []string{awsCfg.Region, awsCfg.SessionDurationStr, awsCfg.ExternalID},
    CreatedAt: time.Now(),
}

Issues:

  • Semantic mismatch: Scopes implies permission scopes, not config fields
  • Fragile: Relies on array indices matching specific meanings
  • Hard to maintain: Adding new AWS config fields requires careful index management
  • Inconsistent: Other providers use fields appropriately

Recommendation: Either:

  1. Add a Config field to Credential as map[string]string or interface{}
  2. Serialize AWSConfig to JSON and store in Token or a new field
  3. Create provider-specific credential storage
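Option 2 could look something like the sketch below. The field names follow the AWSConfig fields quoted above; the JSON keys and the encodeAWSConfig/decodeAWSConfig helpers are assumptions made for illustration, not part of the PR.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// AWSConfig mirrors the fields the review mentions. The JSON tags are
// illustrative choices, not the project's actual storage format.
type AWSConfig struct {
	RoleARN            string `json:"role_arn"`
	Region             string `json:"region,omitempty"`
	SessionDurationStr string `json:"session_duration,omitempty"`
	ExternalID         string `json:"external_id,omitempty"`
}

// encodeAWSConfig serializes the config so it can be stored in one string field.
func encodeAWSConfig(cfg AWSConfig) (string, error) {
	b, err := json.Marshal(cfg)
	if err != nil {
		return "", fmt.Errorf("encoding AWS config: %w", err)
	}
	return string(b), nil
}

// decodeAWSConfig restores the config by field name, so adding a field later
// does not shift the meaning of existing ones the way slice indices would.
func decodeAWSConfig(s string) (AWSConfig, error) {
	var cfg AWSConfig
	if err := json.Unmarshal([]byte(s), &cfg); err != nil {
		return AWSConfig{}, fmt.Errorf("decoding AWS config: %w", err)
	}
	if cfg.RoleARN == "" {
		return AWSConfig{}, errors.New("decoding AWS config: role_arn is required")
	}
	return cfg, nil
}

func main() {
	stored, err := encodeAWSConfig(AWSConfig{
		RoleARN:            "arn:aws:iam::123456789012:role/AgentRole",
		Region:             "us-west-2",
		SessionDurationStr: "1h",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("stored:", stored)

	cfg, err := decodeAWSConfig(stored)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("loaded: %+v\n", cfg)
}
```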

3. Missing Context Cancellation Check (Reliability - Low)

Location: internal/proxy/aws.go:158

GetCredentials() doesn't check whether the context is already cancelled before making the expensive STS call:

result, err := p.stsClient.AssumeRole(ctx, input)

Recommendation: Add early check:

if err := ctx.Err(); err != nil {
    return nil, err
}
result, err := p.stsClient.AssumeRole(ctx, input)

4. Credential Helper Dependency on curl (Portability - Low)

Location: internal/run/credential_helper.go:8-18

The shell script requires curl, which may not be available in all images:

exec curl -sf -m 10 -H "Authorization: Bearer $AGENTOPS_CREDENTIAL_TOKEN" "$AGENTOPS_CREDENTIAL_URL"

Issues:

  • Comment says "aws dependency includes curl" but this isn't guaranteed
  • Minimal/distroless images may lack curl
  • Error from missing curl is cryptic

Recommendation:

  • Add explicit validation that curl exists when AWS grant is used
  • Consider embedding a minimal Go binary for credential fetching instead of shell script
  • Document the curl requirement prominently

5. Potential Goroutine Leak in Handler (Performance - Low)

Location: internal/proxy/aws.go:42

ServeHTTP uses r.Context() which is tied to the HTTP request. If the handler takes longer than the request timeout, it continues fetching credentials unnecessarily.

Recommendation: Consider adding timeout to credential fetching or documenting expected behavior.

6. Error Information Disclosure (Security - Low)

Location: internal/proxy/aws.go:44

http.Error(w, fmt.Sprintf("failed to get credentials: %v", err), http.StatusInternalServerError)

Issue: Internal errors from AWS SDK might leak sensitive information to the container

Recommendation: Sanitize or wrap error:

// Log full error server-side
fmt.Fprintf(os.Stderr, "AWS credential fetch failed: %v\n", err)
// Return generic message to client
http.Error(w, "failed to get credentials", http.StatusInternalServerError)

7. Unchecked Array Access (Reliability - Low)

Location: internal/run/manager.go:1767-1777

if len(cred.Scopes) > 0 && cred.Scopes[0] != "" {
    region = cred.Scopes[0]
}
if len(cred.Scopes) > 1 {
    sessionDurationStr = cred.Scopes[1]
}
if len(cred.Scopes) > 2 {
    externalID = cred.Scopes[2]
}

Issue: Brittle if credential format changes. Related to Issue #2 above.


📝 Minor Improvements

Code Style

  1. Inconsistent error wrapping (internal/proxy/aws.go:92):

    • Uses fmt.Errorf("loading AWS config: %w", err)
    • Good! But some other errors don't wrap (line 160, 164)
  2. Magic number (internal/proxy/aws.go:132):

    • 5*time.Minute appears multiple times
    • Consider: const credentialRefreshBuffer = 5 * time.Minute
  3. Comment accuracy (internal/proxy/aws.go:25):

    • Says "ECS container format" but actually uses credential_process format
    • Update comment to: "serves AWS credentials via HTTP in credential_process format"

Testing

  1. Missing edge case tests:

  2. Test timeout: Some tests create real AWS config objects which might be slow or flaky

Documentation

  1. Session duration mismatch: README says "15 minutes by default" but also "Credentials auto-refresh via AWS SDK's built-in mechanism". The SDK doesn't auto-refresh static credentials—only the credential_process does. Clarify this distinction.

  2. Proxy auth documentation: The security model section in proxy.go was updated but doesn't explicitly mention AWS credential endpoint auth.


🎯 Performance Considerations

  1. Credential caching is good: 5-minute refresh buffer is reasonable
  2. Consider: Add metrics/logging for cache hit rate
  3. Double-checked locking: Correctly implemented ✅

🔒 Security Assessment

Positive:

  • ✅ No long-lived credentials in containers
  • ✅ Auth token for credential endpoint on 0.0.0.0 binding
  • ✅ Short-lived sessions (configurable)
  • ✅ Proper isolation model

Concerns:


📊 Test Coverage

Good coverage on:

  • ARN parsing validation
  • Session duration validation
  • Credential caching
  • Refresh on expiration

Missing coverage:

  • Integration test with actual container and credential_process
  • Error scenarios (network failures, STS throttling)
  • Concurrent credential requests

🎬 Recommendations Summary

Must Fix (Before Merge):

  1. Fix ARN parsing to handle colons in resource name (Issue #1)
  2. Fix credential storage abuse of Scopes field (Issue #2)

Should Fix (Before Merge):

  1. Document curl dependency clearly or provide alternative
  2. Add early context cancellation check

Nice to Have (Future):

  1. Sanitize error messages in HTTP responses
  2. Add audit logging for credential fetches
  3. Add integration tests

Overall Assessment

This is a high-quality PR with thoughtful architecture and good security practices. The design documentation is excellent, and the code is generally clean and well-tested.

The main concerns are around the ARN parsing edge case and the abuse of the Scopes field for config storage. These should be addressed before merge.

With the recommended fixes, this feature would be production-ready and a valuable addition to AgentOps.

Recommendation: Request changes for Issues #1 and #2, then approve.


Great work on this feature! The credential_process approach is clever and solves the refresh problem elegantly. 🚀

dpup added 16 commits January 21, 2026 13:42
Design for AWS credential injection via ECS container credential protocol.
Key decisions:
- Role-only (no credential passthrough)
- 15-minute default session duration with auto-refresh
- Proxy serves /_aws/credentials endpoint
- Uses host AWS credentials to assume role at runtime

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Add AWSConfig struct with role ARN, region, session duration, external ID
- Add ParseRoleARN function to validate IAM role ARNs
- Add SessionDuration method with validation (15m-12h range)
- Include comprehensive tests
- Add `agent grant aws --role=ARN` command
- Support --region, --session-duration, --external-id options
- Validate role ARN format and test AssumeRole at grant time
- Store role config (not credentials) for runtime assumption
- Add AWS SDK v2 dependencies
- Add AWSCredentialHandler for serving credentials via HTTP
- Add AWSCredentialProvider with STS AssumeRole and 5-minute cache buffer
- Returns credentials in ECS container credential format
- Include comprehensive tests with mock STS client
- Add AWSCredentialProvider field to Run struct
- Add SetAWSHandler method and routing in proxy.go
- Set up AWS credential endpoint in manager.go
- Configure AWS_CONTAINER_CREDENTIALS_FULL_URI and AWS_REGION env vars
- Add AWS hosts to network policy allow list
- Add AWS credentials section explaining IAM role assumption
- Document --role, --region, --session-duration options
- Explain how credentials auto-refresh
- Note required IAM setup steps
- Add agent.yaml using aws sts get-caller-identity
- Include CloudFormation and CLI snippets for IAM role setup
- Document the credential flow and verification steps
- Add platform detection in BuildImage (linux/arm64 on Apple Silicon)
- Update AWS CLI install to detect architecture via uname -m
- Fixes Rosetta errors when building on Apple Silicon
AWS SDK only trusts container credential endpoints from loopback addresses
unless AWS_CONTAINER_AUTHORIZATION_TOKEN is set. Generate a secure token
for the credential endpoint and verify it in the handler.

- Add authToken to AWSCredentialProvider and handler
- Generate and set AWS_CONTAINER_AUTHORIZATION_TOKEN
- Verify Bearer token in credential handler
- Fixes 'Unsupported host' error on macOS Docker
AWS SDKs only trust container credential endpoints from loopback addresses.
host.docker.internal is not in the allowed list and there's no workaround.
See: boto/botocore#2515

- On Linux (host network): use credential endpoint at localhost
- On macOS/Windows (bridge mode): inject credentials as env vars directly

Credentials are already short-lived from AssumeRole, so security is maintained.
Auto-refresh is not available on macOS/Windows; set longer --session-duration if needed.
Instead of injecting static AWS credentials as env vars (which expire),
use AWS SDK's credential_process feature for automatic refresh.

Implementation:
- Add aws-credential-helper binary that fetches creds from proxy
- Embed cross-compiled linux/amd64 and linux/arm64 binaries
- Mount helper + AWS config into container at /agentops/aws/
- Proxy endpoint returns credential_process JSON format

Also fixes:
- MITM all HTTPS when CA is present (not just for credential injection)
- Include CA cert in TLS chain for Python SSL compatibility
- Add AWS_PAGER="" to disable pager in containers without less

Agents can now run for days without credential expiration issues.
Updates the AWS credentials implementation plan to accurately document
what was built:

- Changed status to IMPLEMENTED
- Documented use of credential_process instead of ECS-style credentials
- Added Task 8 for embedded credential helper binary
- Added Architecture Notes explaining design decisions
- Fixed file paths (run.go vs types.go)
- Added credential flow diagram
- Listed all files changed

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Merge content from the credential_process design doc into the main
implementation plan:
- Added problem statement about credential expiration
- Added endpoint reachability table by platform
- Added error handling notes

Remove the now-redundant design document.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Simplify the AWS credential helper by using a shell script instead of
a cross-compiled Go binary. This works because:

- Containers using --grant aws must have the aws dependency
- The aws dependency installation includes curl
- A shell script with curl is simpler and has no build step

Removes:
- cmd/aws-credential-helper/main.go
- internal/run/generate.go
- internal/run/helpers/ (embedded binaries)

The credential helper is now a simple shell script embedded as a
string constant in internal/run/credential_helper.go.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Fixes from code review:

1. Resource leak: Track and clean up AWS temp directory in Wait() and
   Destroy() - prevents temp dir accumulation on host

2. Nil pointer safety: Check result.Credentials before dereferencing
   and use aws.ToString/aws.ToTime helpers

3. JSON encoding: Handle json.Encode error (log warning since response
   already started)

4. Context propagation: NewAWSCredentialProvider now accepts context
   for proper timeout/cancellation support

5. Session uniqueness: Grant test uses timestamp-based session name
   for clearer CloudTrail logs

6. MITM documentation: Added security note explaining that ALL HTTPS
   traffic is intercepted when CA is configured (intentional for
   observability)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@claude

claude Bot commented Jan 21, 2026

📦 Previous Review (superseded by newer review)

Code Review: AWS Credential Support (PR #24)

Summary

This PR adds comprehensive AWS credential support via IAM role assumption with automatic refresh. The implementation is well-architected and follows security best practices. I've identified several issues ranging from critical security concerns to minor improvements.

Critical Issues

1. Timing Attack Vulnerability in Auth Token Comparison

Location: internal/proxy/aws.go:36

if auth == "" || auth != "Bearer "+h.authToken {
    http.Error(w, "unauthorized", http.StatusUnauthorized)
    return
}

Issue: Using direct string comparison (!=) for auth tokens is vulnerable to timing attacks. An attacker could determine the correct token character by character by measuring response times.

Fix: Use crypto/subtle.ConstantTimeCompare for token comparison:

import "crypto/subtle"

expectedAuth := "Bearer " + h.authToken
if auth == "" || subtle.ConstantTimeCompare([]byte(auth), []byte(expectedAuth)) != 1 {
    http.Error(w, "unauthorized", http.StatusUnauthorized)
    return
}
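One detail worth noting: subtle.ConstantTimeCompare returns 0 immediately when the two slices differ in length, so the comparison above can still reveal the expected header length. A possible variation, sketched here with a hypothetical validBearer helper, hashes both sides to a fixed size first.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// validBearer reports whether header equals "Bearer <token>". Both values are
// hashed with SHA-256 first so the constant-time comparison always runs over
// 32-byte inputs, regardless of what the client sent.
func validBearer(header, token string) bool {
	if token == "" {
		// An unset token must never authorize a request.
		return false
	}
	got := sha256.Sum256([]byte(header))
	want := sha256.Sum256([]byte("Bearer " + token))
	return subtle.ConstantTimeCompare(got[:], want[:]) == 1
}

func main() {
	fmt.Println(validBearer("Bearer secret", "secret")) // prints "true"
	fmt.Println(validBearer("Bearer wrong", "secret"))  // prints "false"
	fmt.Println(validBearer("", "secret"))              // prints "false"
}
```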

2. Missing Error Context Leakage Prevention

Location: internal/proxy/aws.go:44

http.Error(w, fmt.Sprintf("failed to get credentials: %v", err), http.StatusInternalServerError)

Issue: Exposing raw error messages could leak sensitive information about the AWS account, role configuration, or internal state to unauthorized clients.

Fix: Log the detailed error server-side but return generic message to client:

fmt.Fprintf(os.Stderr, "AWS credential fetch error: %v\n", err)
http.Error(w, "internal server error", http.StatusInternalServerError)

3. Race Condition in Credential Caching

Location: internal/proxy/aws.go:129-146

Issue: While the double-checked locking pattern is implemented, there's a subtle issue. After releasing the read lock at line 137, another goroutine could acquire the write lock, refresh credentials, and release it before this goroutine acquires the write lock at line 140. However, this is actually handled correctly by the double-check at line 144-146. This is not a bug, but the pattern is error-prone.

Suggestion: Consider using sync.Once or a channel-based approach for clearer intent, though current implementation is correct.

High Priority Issues

4. ARN Validation Incomplete

Location: internal/credential/aws.go:15-31

Issue: ARN validation doesn't check for:

  • AWS partition (aws, aws-cn, aws-us-gov)
  • Empty account ID
  • Empty role name
  • Special characters in role name that could cause issues

Example of what should be rejected:

  • arn:aws:iam:::role/test (empty account)
  • arn:aws:iam::123:role/ (empty role name)
  • arn:aws-cn:iam::123:role/test (China partition - may need different handling)

Fix:

func ParseRoleARN(arn string) (*AWSConfig, error) {
    if arn == "" {
        return nil, fmt.Errorf("role ARN is required")
    }

    parts := strings.Split(arn, ":")
    if len(parts) != 6 {
        return nil, fmt.Errorf("invalid ARN format: expected 6 parts, got %d", len(parts))
    }

    // parts[3] (region) is not needed here; binding it to an unused variable would not compile
    partition, service, account, resource := parts[1], parts[2], parts[4], parts[5]

    // Validate partition
    validPartitions := map[string]bool{"aws": true, "aws-cn": true, "aws-us-gov": true}
    if !validPartitions[partition] {
        return nil, fmt.Errorf("invalid ARN partition: %s (expected aws, aws-cn, or aws-us-gov)", partition)
    }

    if service != "iam" {
        return nil, fmt.Errorf("invalid ARN: must be an IAM ARN (got %s)", service)
    }

    if account == "" {
        return nil, fmt.Errorf("invalid ARN: account ID is required")
    }

    if !strings.HasPrefix(resource, "role/") {
        return nil, fmt.Errorf("invalid ARN: must be a role ARN (got %s)", resource)
    }

    roleName := strings.TrimPrefix(resource, "role/")
    if roleName == "" {
        return nil, fmt.Errorf("invalid ARN: role name is required")
    }

    return &AWSConfig{
        RoleARN: arn,
        Region:  "us-east-1",
    }, nil
}

5. Insufficient Input Validation on Session Duration

Location: internal/credential/types.go:1253-1269

Issue: Maximum duration depends on the role configuration. Service-linked roles and roles with MFA may have different limits. The hardcoded 12h limit might not work for all scenarios.

Suggestion: Document this limitation and consider validating against the actual role's max session duration during the grant test.

6. Proxy MITM Interception Change Has Broad Impact

Location: internal/proxy/proxy.go:1732-1743

-if _, hasCredential := p.getCredential(host); p.ca != nil && hasCredential {
+if p.ca != nil {

Issue: This changes the proxy to intercept ALL HTTPS traffic, not just credential-injected hosts. While the comment explains this is intentional for observability, this is a significant behavior change that affects all grants, not just AWS.

Concerns:

  • Breaking change for existing users
  • Applications with certificate pinning will fail
  • Increased attack surface
  • Should be documented in release notes

Suggestion: Consider adding a configuration flag to opt-in to full interception, or at least ensure this is prominently documented in upgrade notes.

Medium Priority Issues

7. Temp Directory Not Cleaned on Early Failures

Location: internal/run/manager.go:1842-1856

Issue: If an error occurs after creating awsDir but before it's assigned to r.awsTempDir, the directory won't be cleaned up.

awsDir, err := os.MkdirTemp("", "agentops-aws-*")
if err != nil {
    return nil, fmt.Errorf("creating AWS credential helper directory: %w", err)
}
r.awsTempDir = awsDir // Track for cleanup

// If WriteFile fails here, awsDir leaks
if err := os.WriteFile(helperPath, GetAWSCredentialHelper(), 0755); err != nil {
    return nil, fmt.Errorf("writing AWS credential helper: %w", err)
}

Fix: Use defer cleanup or assign r.awsTempDir immediately:

awsDir, err := os.MkdirTemp("", "agentops-aws-*")
if err != nil {
    return nil, fmt.Errorf("creating AWS credential helper directory: %w", err)
}
r.awsTempDir = awsDir // Assign immediately for cleanup

if err := os.WriteFile(helperPath, GetAWSCredentialHelper(), 0755); err != nil {
    // Now cleanup will happen via existing defer or error path
    return nil, fmt.Errorf("writing AWS credential helper: %w", err)
}

8. Missing Context Cancellation Check

Location: internal/proxy/aws.go:158

The AssumeRole call doesn't check if the context is already canceled before making the expensive STS call.

Suggestion:

if ctx.Err() != nil {
    return nil, ctx.Err()
}

result, err := p.stsClient.AssumeRole(ctx, input)

9. Credential Helper Dependency on curl

Location: internal/run/credential_helper.go

Issue: The helper script assumes curl is available. While the comment states AWS CLI includes curl, this may not be true for all installation methods or container images.

Suggestions:

  • Document the curl dependency explicitly
  • Consider fallback to wget or golang-based helper
  • Add runtime check and better error message if curl is missing

10. Missing Audit Logging for AWS Credential Access

Issue: The design doc mentions audit logging (docs/plans/2025-01-16-aws-credentials-design.md:470-478), but I don't see implementation in internal/proxy/aws.go.

Expected: Each credential fetch should be logged to the audit system for compliance and debugging.

Low Priority / Style Issues

11. Inconsistent Error Messages

Some error messages use title case, others don't. Example:

  • "failed to get credentials" (lowercase)
  • "Cannot assume role" (title case in docs)

Suggestion: Standardize to lowercase for error messages (Go convention).

12. Magic Number - 5 Minute Refresh Buffer

Location: internal/proxy/aws.go:132

The 5-minute buffer is hardcoded. Consider making this configurable or at least a named constant:

const credentialRefreshBuffer = 5 * time.Minute

// In GetCredentials:
if p.cached != nil && time.Now().Add(credentialRefreshBuffer).Before(p.expiration) {

13. Unused Return Value

Location: cmd/moat/cli/grant.go:171

if _, sdErr := awsCfg.SessionDuration(); sdErr != nil {

The duration is validated but discarded. Consider returning it for consistency:

d, sdErr := awsCfg.SessionDuration()
if sdErr != nil {
    return sdErr
}
_ = d // Could store d in awsCfg if needed

14. File Permissions on Credential Helper

Location: internal/run/manager.go:1853

Helper is written with 0755 (world-readable). Consider 0750 or 0700 since it contains the credential endpoint URL in the script.

15. Import Alias Missing for goruntime

Location: internal/container/docker.go:1065

goruntime "runtime"

The alias is used but not shown in the diff. Ensure it's properly imported.

Positive Aspects

  1. Excellent Test Coverage: Unit tests for caching, refresh logic, and authorization are comprehensive
  2. Security-First Design: Using IAM roles instead of credential passthrough is the right approach
  3. Good Documentation: The design docs and README are thorough and well-written
  4. Proper Concurrency: Double-checked locking is implemented correctly despite being complex
  5. Testable Design: Interfaces like STSAssumeRoler enable proper unit testing
  6. Credential Refresh: Automatic refresh with buffer prevents expiration mid-operation

Test Coverage Recommendations

  1. Add integration test for full credential flow (grant → run → AWS API call)
  2. Test failure scenarios:
    • curl not available in container
    • Network timeout to proxy
    • STS API throttling
    • Expired/revoked credentials
  3. Test concurrent credential requests
  4. Test with AWS partition variants (aws-cn, aws-us-gov)

Documentation Suggestions

  1. Add security considerations section to README
  2. Document the MITM proxy behavior change
  3. Add troubleshooting guide for common errors:
    • "Cannot assume role" → check trust policy
    • curl not found → AWS CLI not installed
    • Timeout → firewall blocking proxy

Commit Message Compliance

✅ Follows Conventional Commits format: feat(credential): Support for AWS credential injection
✅ Does not include Co-Authored-By as per CLAUDE.md guidelines

Overall Assessment

This is a solid implementation of AWS credential support with good architecture and security practices. The critical issues around timing attacks and error message leakage should be addressed before merge. The ARN validation improvements and temp directory cleanup are important for robustness.

The change to intercept all HTTPS traffic (not just credential hosts) is significant and should be carefully considered - it may warrant a separate PR or at minimum prominent documentation.

Recommendation: Request changes for critical security issues, then approve after fixes.

@claude

claude Bot commented Jan 21, 2026

Pull Request Review: AWS Credential Support

This PR adds AWS credential injection via IAM role assumption with automatic refresh. Overall, this is a well-designed and thoughtfully implemented feature that aligns well with the project's security model. The code quality is high, tests are comprehensive, and the documentation is thorough.

✅ Strengths

Security Design

  • Strong security model using IAM role assumption instead of credential passthrough
  • Proper use of constant-time comparison for auth token validation (prevents timing attacks)
  • Short-lived credentials with automatic refresh (15-minute default)
  • Auth token properly scoped to Apple containers only (where proxy binds to 0.0.0.0)
  • Credentials never stored in containers, only served via authenticated endpoint

Code Quality

  • Well-structured with clear separation of concerns
  • Comprehensive test coverage (unit tests for ARN parsing, session duration validation, credential caching, and refresh logic)
  • Good error messages with actionable guidance
  • Proper use of interfaces (STSAssumeRoler) for testability

Documentation

  • Excellent design documentation explaining architecture and trade-offs
  • Clear examples with real-world IAM setup (CloudFormation + AWS CLI)
  • README updates match actual behavior

🔍 Issues & Suggestions

1. Security: Credential Helper Script Depends on curl

Location: internal/run/credential_helper.go:15-17

The credential helper uses a shell script that requires curl:

exec curl -sf -m 10 -H "Authorization: Bearer $AGENTOPS_CREDENTIAL_TOKEN" "$AGENTOPS_CREDENTIAL_URL"

Issue: This creates a dependency on curl being installed in the container. While the documentation mentions "the aws dependency includes curl", this is not guaranteed across all container images.

Recommendation:

  • Add validation at grant-time or run-time to check if curl is available in the container image
  • Consider a Go-based credential helper binary that can be embedded and mounted (similar to how other tools work)
  • Update documentation to explicitly require curl or provide a fallback

Evidence from docs/plans/2025-01-16-aws-credentials-implementation.md:166-167:

This works because containers using --grant aws must have the AWS CLI installed,
and the aws dependency includes curl as part of its installation.

This assumption may not hold for minimal container images or custom AWS SDK installations.

2. Error Handling: Comment Typo/Formatting

Location: internal/run/manager.go:272

There's a formatting issue in a comment:

// AWS credentials are handled via credential endpoint, not header injection
/ Parse stored config: Token=roleARN, Scopes=[region, sessionDuration, externalID]

The single forward slash (/) instead of // is inconsistent with Go comment style.

Recommendation: Fix the comment formatting for consistency.

3. Session Duration Validation Edge Case

Location: internal/credential/types.go:48-62

The session duration validation is good, but there's no check for negative durations:

if d < 15*time.Minute {
    return 0, fmt.Errorf("session duration %v is less than minimum 15m", d)
}
if d > 12*time.Hour {
    return 0, fmt.Errorf("session duration %v exceeds maximum 12h", d)
}

Potential Issue: A negative duration (e.g., -1h) is rejected by the minimum check, but the resulting message ("less than minimum 15m") does not say that the value must be positive.

Recommendation: Add explicit check for negative/zero durations:

if d <= 0 {
    return 0, fmt.Errorf("session duration %v must be positive", d)
}

4. Code Duplication: Session Duration Parsing

Location: internal/run/manager.go:287-292 and internal/credential/types.go:48-62

Session duration parsing logic is duplicated. The manager.go version silently ignores parse errors, while types.go properly validates.

Recommendation: Always use AWSConfig.SessionDuration() instead of inline parsing to ensure consistent validation.

5. Test Coverage: Missing Auth Token Tests

Location: internal/proxy/aws_test.go

The tests don't verify auth token validation for the AWS credential handler.

Recommendation: Add tests for unauthorized access (missing token), unauthorized access (invalid token), and authorized access (valid token).

6. Documentation: Cleanup Behavior

Location: internal/run/run.go:66-67

The awsTempDir field is documented for cleanup, but the cleanup code isn't visible in the PR diff. Please verify the AWS temp directory is properly cleaned up in the Destroy() method or equivalent.

📊 Performance Considerations

Credential Caching: ✅ The 5-minute refresh buffer is reasonable and prevents credential expiration during requests. Good design choice.

Concurrent Requests: ✅ The mutex-based caching with double-check locking ensures only one STS call happens at a time during refresh, which is efficient.

🔐 Additional Security Observations

  1. Proxy Authentication: ✅ The security model correctly restricts auth token requirement to Apple containers where the proxy binds to 0.0.0.0
  2. Error Message Sanitization: ✅ Good practice of logging detailed errors server-side while returning generic messages to clients
  3. Constant-Time Comparison: ✅ Proper use of subtle.ConstantTimeCompare prevents timing attacks on auth tokens
  4. Credential Isolation: ✅ Host credentials never enter the container; only short-lived assumed role credentials are available

📝 Conventions & Style

Following CLAUDE.md conventions:

Error Messages: Good actionable error messages (e.g., grant.go:144-153 tells users exactly what to do)
Conventional Commits: PR title follows format feat(credential): ...
Code Style: Follows standard Go conventions

🎯 Overall Assessment

This is a high-quality implementation that follows the project's conventions and security principles. The architecture is sound, the code is well-tested, and the documentation is thorough.

Priority Issues:

  1. HIGH: Document/verify curl dependency or provide Go-based helper
  2. HIGH: Verify AWS temp directory cleanup
  3. MEDIUM: Add negative duration check in session validation
  4. MEDIUM: Refactor to use AWSConfig.SessionDuration() consistently
  5. MEDIUM: Add auth token tests
  6. LOW: Fix comment formatting at manager.go:272

Approval Status: Approve with minor fixes

The core implementation is solid and demonstrates strong software engineering practices. The identified issues are relatively minor and can be addressed in follow-up commits or accepted as-is with documentation updates for the curl dependency.

Great work on maintaining security best practices throughout the implementation!

- Use constant-time comparison for auth token to prevent timing attacks
- Log detailed errors server-side, return generic message to clients
- Improve ARN validation: check partition, account ID, role name
- Add context cancellation check before STS AssumeRole call
- Use named constant for credential refresh buffer (5 min)
- Restrict credential helper permissions to 0700
@majorcontext majorcontext deleted a comment from claude Bot Jan 21, 2026

claude Bot commented Jan 21, 2026

Pull Request Review: AWS Credential Support

This is a well-designed and thorough implementation of AWS credential support. The architecture is sound, with strong security considerations and comprehensive documentation. Below are my findings across several areas:

Code Quality & Architecture

Strengths:

  • Clean separation of concerns: Credential parsing (internal/credential/aws.go), provider logic (internal/proxy/aws.go), and integration (internal/run/manager.go) are well-separated
  • Excellent use of interfaces: STSAssumeRoler interface enables clean testing with mock clients
  • Thread-safe caching: Proper use of sync.RWMutex with double-checked locking pattern in AWSCredentialProvider.GetCredentials (lines 143-159)
  • Proper resource cleanup: AWS temp directory cleanup in both Wait() and Destroy() methods prevents resource leaks
  • Well-structured tests: Comprehensive test coverage for ARN parsing, session duration validation, credential caching, and expiration handling

Minor Issues:

  1. Account ID validation missing (internal/credential/aws.go:39-40): The code checks if account ID is not empty but does not validate it is a 12-digit number. Consider adding validation.

  2. Role name validation incomplete (internal/credential/aws.go:47-50): Checks for non-empty but does not validate AWS role naming constraints (alphanumeric characters plus +=,.@-_). This is low priority since AWS will reject invalid names, but early validation provides better UX.
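
Both checks could be added along these lines. This is a sketch, not the PR's parser: validateRoleARN and the two regular expressions are hypothetical names, the 64-character limit reflects the documented IAM role name length, and the handling of role paths (role/path/Name) is an assumption about what the project wants to accept.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// AWS account IDs are exactly 12 decimal digits.
	accountIDRe = regexp.MustCompile(`^[0-9]{12}$`)
	// IAM role names: alphanumerics plus +=,.@-_ and at most 64 characters.
	roleNameRe = regexp.MustCompile(`^[A-Za-z0-9+=,.@_-]{1,64}$`)
)

// validateRoleARN is a hypothetical validator for IAM role ARNs of the form
// arn:PARTITION:iam::ACCOUNT:role/NAME (an optional path before NAME is allowed).
func validateRoleARN(arn string) error {
	parts := strings.SplitN(arn, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" || parts[1] == "" {
		return fmt.Errorf("invalid ARN %q: expected arn:PARTITION:iam::ACCOUNT:role/NAME", arn)
	}
	if parts[2] != "iam" {
		return fmt.Errorf("invalid ARN %q: service must be iam, got %q", arn, parts[2])
	}
	if !accountIDRe.MatchString(parts[4]) {
		return fmt.Errorf("invalid ARN %q: account ID must be 12 digits", arn)
	}
	resource := parts[5]
	if !strings.HasPrefix(resource, "role/") {
		return fmt.Errorf("invalid ARN %q: resource must start with role/", arn)
	}
	// The role name is the last path segment.
	name := resource[strings.LastIndex(resource, "/")+1:]
	if !roleNameRe.MatchString(name) {
		return fmt.Errorf("invalid ARN %q: role name %q contains unsupported characters or is empty", arn, name)
	}
	return nil
}

func main() {
	for _, arn := range []string{
		"arn:aws:iam::123456789012:role/AgentRole",
		"arn:aws:iam::12345:role/AgentRole",
		"arn:aws:iam::123456789012:role/bad name",
		"arn:aws:s3::123456789012:role/AgentRole",
	} {
		fmt.Printf("%s -> %v\n", arn, validateRoleARN(arn))
	}
}
```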

Security

Excellent security practices:

  • Constant-time comparison for auth tokens (internal/proxy/aws.go:42)
  • Generic error messages to clients, detailed logs server-side (lines 50-53)
  • Context cancellation checks before expensive operations (line 139)
  • Credential helper script permissions set to 0700 (internal/run/manager.go:443)
  • Auth token generation uses crypto/rand with 32 bytes (internal/run/manager.go:322-326)
  • Credentials never stored—only role ARN configuration
  • Automatic refresh via credential_process prevents long-lived credential exposure
  • 5-minute cache buffer ensures credentials refresh before expiration
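
As an illustration of the token generation point above, a per-run token drawn from crypto/rand could be produced like this. The function name newAuthToken and the hex encoding are assumptions; the review only states that 32 random bytes are used.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newAuthToken is a hypothetical helper that returns 32 bytes of
// cryptographically secure randomness, hex-encoded (an assumed encoding).
func newAuthToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", fmt.Errorf("generating auth token: %w", err)
	}
	return hex.EncodeToString(b), nil
}

func main() {
	tok, err := newAuthToken()
	if err != nil {
		panic(err)
	}
	// 32 random bytes encode to 64 hex characters.
	fmt.Println("token length:", len(tok))
}
```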

Considerations:

  1. MITM interception disclosure: The proxy now intercepts ALL HTTPS traffic when CA is configured. This is intentional for observability and is documented in proxy.go security model section.

  2. Auth token in environment variable (internal/run/manager.go:479-481): The AGENTOPS_CREDENTIAL_TOKEN is passed to the container and visible in process listings. This is acceptable because the token is per-run and short-lived, but consider documenting it in the security model section.

  3. Network policy AWS hosts (internal/proxy/hosts.go:102-106): Uses wildcard *.amazonaws.com which is very broad. This is reasonable for AWS given the huge number of services, but worth noting in docs that agents get access to all AWS services reachable via their role permissions.

Performance

Well-optimized:

  • Credential caching with 5-minute buffer reduces STS API calls
  • Double-checked locking pattern prevents cache stampede
  • credential_process approach enables on-demand credential fetching (no upfront cost)
  • Credential helper is lightweight shell script using curl

Potential Bugs

None critical, but verify:

  1. Error handling in credential helper (internal/run/credential_helper.go:10-21): The script uses set -e and curl -sf. If the credential URL is unreachable, curl will exit non-zero and AWS SDK will retry. The 10-second timeout (-m 10) is good.

  2. Context propagation (internal/proxy/aws.go:171): AssumeRole is called with the request context. If a credential fetch request times out on the client side, the STS call may continue executing. This is acceptable—STS calls are usually fast.

  3. Nil credential check (internal/proxy/aws.go:176-178): Good defensive programming. AWS SDK docs do not explicitly guarantee non-nil Credentials in successful responses.

Test Coverage

Strong coverage:

  • ARN validation (valid, invalid partition, wrong service, missing parts)
  • Session duration edge cases (too short, too long, invalid format)
  • Credential caching behavior
  • Credential refresh on near-expiry
  • Auth token validation (missing, invalid, valid)
  • Error handling for STS failures

Suggested additions:

  1. Integration test that verifies credential_process works end-to-end with AWS SDK
  2. Test for concurrent GetCredentials calls (verify no race conditions)
  3. Test for external ID handling in AssumeRole call

Documentation

Excellent documentation:

  • README.md clearly explains AWS credential flow
  • examples/grant-aws/ provides CloudFormation and CLI setup
  • docs/plans/ documents architecture decisions
  • Inline comments explain security model and design choices
  • Error messages provide actionable guidance

Minor improvements:

  1. Document that *.amazonaws.com in network policy allows all AWS services
  2. Consider adding a troubleshooting section for common issues

Conventional Commits

All commits follow the conventional commits format correctly with proper scopes and no Co-Authored-By lines (per CLAUDE.md).

Summary

This is production-ready code with excellent security practices, clean architecture, and comprehensive testing. The implementation elegantly solves the credential refresh problem using credential_process.

Recommendation: Approve with optional minor enhancements for ARN validation and documentation.

The main architectural decision—using credential_process over static environment variables—is the right choice for long-running agents and significantly improves security posture.

@majorcontext majorcontext deleted a comment from claude Bot Jan 21, 2026
@dpup dpup merged commit eb0d9e6 into main Jan 21, 2026
4 checks passed
@dpup dpup deleted the aws-creds branch January 21, 2026 22:32