Merged
39 commits
7a2802e
Add K8s GPU metrics collection design spec
maksimov Apr 19, 2026
d271797
Add K8s GPU metrics collection implementation plan
maksimov Apr 19, 2026
1c2d3d8
Add EnrichK8sGPUMetrics for CloudWatch Container Insights GPU metrics
maksimov Apr 19, 2026
96155c1
Add ProxyGet to K8sClient interface for pod API proxy
maksimov Apr 19, 2026
84a4a9b
Add DCGM exporter scraping for K8s GPU metrics
maksimov Apr 19, 2026
0f460c4
Add Prometheus query enrichment for K8s GPU metrics
maksimov Apr 19, 2026
d605cb4
Add ruleK8sLowGPUUtil for utilization-based K8s GPU waste detection
maksimov Apr 19, 2026
54dc0ce
Wire K8s GPU metrics fallback chain into CLI scan flow
maksimov Apr 19, 2026
c93e0f7
Fix DCGM node matching and CW error spam
maksimov Apr 19, 2026
2640b88
Fix DCGM scrape spam and Prometheus node name mismatch
maksimov Apr 19, 2026
b08c025
Include time window in low GPU utilization recommendation text
maksimov Apr 19, 2026
e846cc8
Skip CW enrichment when AWS creds unavailable, reduce DCGM noise
maksimov Apr 19, 2026
ff5b7f5
Add diff package with Compare function and tests
maksimov Apr 15, 2026
1eed3b8
Add diff table and JSON output formatters
maksimov Apr 15, 2026
5672229
Add diff subcommand to compare two scan results
maksimov Apr 15, 2026
2e95784
Fix box alignment in diff table output
maksimov Apr 15, 2026
01b4512
Fix misleading idle duration in K8s GPU node recommendations
maksimov Apr 15, 2026
c9ee92d
Update README with K8s scanning, diff command, and current output format
maksimov Apr 15, 2026
cd27862
Add multi-target scanning design spec
maksimov Apr 18, 2026
b8c6a35
Add multi-target scanning implementation plan
maksimov Apr 18, 2026
408399e
Add TargetSummary and TargetErrorInfo model types for multi-target sc…
maksimov Apr 18, 2026
e856adb
Extract BuildSummary to summary.go and add BuildTargetSummaries
maksimov Apr 18, 2026
d1c79f3
Implement ResolveTargets with STS AssumeRole for multi-account scanning
maksimov Apr 18, 2026
cd49f91
Refactor Scan() for parallel multi-target scanning
maksimov Apr 18, 2026
698b0f6
Add --targets, --role, --org, --external-id, --skip-self flags to sca…
maksimov Apr 18, 2026
19fbbb7
Add per-target summary table and target column to table formatter
maksimov Apr 18, 2026
8737a44
Add per-target summaries to markdown and Slack formatters
maksimov Apr 18, 2026
68fbeaa
Add cross-account and Organizations permissions to iam-policy output
maksimov Apr 18, 2026
1f58d92
Add multi-account scanning docs to README
maksimov Apr 18, 2026
6001598
Fix callerAccount bug, deduplicate severity logic, clean up dead code
maksimov Apr 18, 2026
41b0867
Add SpotHourlyCost field to GPUInstance model
maksimov Apr 19, 2026
b081710
Implement EnrichSpotPrices with DescribeSpotPriceHistory
maksimov Apr 19, 2026
82ae997
Wire EnrichSpotPrices into scanRegion after EC2 discovery
maksimov Apr 19, 2026
8f4973e
Correct spot instance cost using live spot prices
maksimov Apr 19, 2026
6b7bfab
Add ruleSpotEligible analysis rule for spot recommendations
maksimov Apr 19, 2026
4b8c1c6
Add ec2:DescribeSpotPriceHistory to IAM policy output
maksimov Apr 19, 2026
1abda17
Address review: update signal type comment, add pagination note, guar…
maksimov Apr 19, 2026
2f05b9c
Add Prometheus GPU metrics support for EC2 instances
maksimov Apr 20, 2026
0c33990
Sanitize example output in README
maksimov Apr 21, 2026
220 changes: 186 additions & 34 deletions README.md
@@ -1,22 +1,39 @@
# gpuaudit

Scan your AWS account for GPU waste and get actionable recommendations to cut your cloud spend.
Scan your cloud for GPU waste and get actionable recommendations to cut your spend.

```
$ gpuaudit scan --profile ml-prod
$ gpuaudit scan --skip-eks

GPU Fleet Summary
Total GPU instances: 14
Total monthly GPU spend: $47,832
Estimated monthly waste: $18,240 (38%)
Found 38 GPU nodes across 47 nodes in gpu-cluster

CRITICAL (3 instances, $8,940/mo potential savings)
gpuaudit — GPU Cost Audit for AWS
Account: 123456789012 | Regions: us-east-1 | Duration: 4.2s

i-0a1b2c3d4e g5.12xlarge (4x A10G) $4,380/mo Idle — no activity for 18 days → terminate
i-9f8e7d6c5b p4d.24xlarge (8x A100) $23,652/mo Idle — <1% CPU for 6 days → terminate
sagemaker:asr ml.g6.48xlarge (8x L40S) $9,490/mo GPU util avg 8% → downsize to ml.g5.xlarge
┌──────────────────────────────────────────────────────────┐
│ GPU Fleet Summary                                        │
├──────────────────────────────────────────────────────────┤
│ Total GPU instances:        38                           │
│ Total monthly GPU spend:    $127450                      │
│ Estimated monthly waste:    $18200 ( 14%)                │
└──────────────────────────────────────────────────────────┘

CRITICAL — 3 instance(s), $15400/mo potential savings

Instance                             Type                       Monthly  Signal           Recommendation
──────────────────────────────────── ────────────────────────── ──────── ──────────────── ──────────────────────────────────────────────
gpu-cluster/ip-10-15-255-248         g6e.16xlarge (1× L40S)     $ 6752   idle             Node up 13 days with 0 GPU pods scheduled.
gpu-cluster/ip-10-22-250-15          g6e.16xlarge (1× L40S)     $ 6752   idle             Node up 1 days with 0 GPU pods scheduled.
...
```

## What it scans

- **EC2** — GPU instances (g4dn, g5, g6, g6e, p4d, p4de, p5, inf2, trn1) with CloudWatch metrics
- **SageMaker** — Endpoints with GPU utilization and invocation metrics
- **EKS** — Managed GPU node groups via the AWS EKS API
- **Kubernetes** — GPU nodes and pod allocation via the Kubernetes API (Karpenter, self-managed, any CNI)

## What it detects

- **Idle GPU instances** — running but doing nothing (low CPU + near-zero network for 24+ hours)
@@ -25,6 +42,7 @@ $ gpuaudit scan --profile ml-prod
- **Stale instances** — non-production instances running 90+ days
- **SageMaker low utilization** — endpoints with <10% GPU utilization
- **SageMaker oversized** — endpoints using <30% GPU memory on multi-GPU instances
- **K8s unallocated GPUs** — nodes with GPU capacity but no pods requesting GPUs
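
Each of the detections above reduces to a threshold check over collected metrics. A sketch of the idle rule under assumed cutoffs — the struct, field names, and exact thresholds are illustrative, not gpuaudit's actual code:

```go
package main

import "fmt"

// InstanceMetrics is a simplified stand-in for the CloudWatch data gpuaudit
// collects. The thresholds below mirror the README's description ("low CPU +
// near-zero network for 24+ hours") but the exact cutoffs are assumptions.
type InstanceMetrics struct {
	AvgCPUPercent  float64 // mean CPU utilization over the lookback window
	AvgNetworkKBps float64 // mean network throughput over the lookback window
	IdleHours      float64 // how long both signals have been flat
}

// ruleIdle flags an instance as idle waste when CPU and network have both
// been near zero for at least a day.
func ruleIdle(m InstanceMetrics) (signal string, matched bool) {
	if m.AvgCPUPercent < 1.0 && m.AvgNetworkKBps < 5.0 && m.IdleHours >= 24 {
		return "idle", true
	}
	return "", false
}

func main() {
	sig, ok := ruleIdle(InstanceMetrics{AvgCPUPercent: 0.4, AvgNetworkKBps: 0.2, IdleHours: 432})
	fmt.Println(sig, ok) // idle true
}
```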

## Install

@@ -36,7 +54,7 @@ Or build from source:

```bash
git clone https://github.com/gpuaudit/cli.git
cd gpuaudit
cd cli
go build -o gpuaudit ./cmd/gpuaudit
```

@@ -49,20 +67,155 @@ gpuaudit scan
# Specific profile and region
gpuaudit scan --profile production --region us-east-1

# Kubernetes cluster scan (uses KUBECONFIG or ~/.kube/config)
gpuaudit scan --skip-eks

# Specific kubeconfig and context
gpuaudit scan --kubeconfig ~/.kube/config --kube-context gpu-cluster

# JSON output for automation
gpuaudit scan --format json --output report.json
gpuaudit scan --format json -o report.json

# Markdown for docs/PRs
gpuaudit scan --format markdown
# Compare two scans to see what changed
gpuaudit diff old-report.json new-report.json

# Slack Block Kit payload (pipe to webhook)
gpuaudit scan --format slack --output - | curl -X POST -H 'Content-Type: application/json' -d @- $SLACK_WEBHOOK
gpuaudit scan --format slack -o - | \
curl -X POST -H 'Content-Type: application/json' -d @- $SLACK_WEBHOOK

# Skip CloudWatch metrics (faster, less accurate)
gpuaudit scan --skip-metrics

# Skip SageMaker scanning
# Skip specific scanners
gpuaudit scan --skip-metrics # faster, less accurate
gpuaudit scan --skip-sagemaker
gpuaudit scan --skip-eks # skip AWS EKS API (use --skip-k8s for Kubernetes API)
gpuaudit scan --skip-k8s
```

## Comparing scans

Save scan results as JSON, then diff them later:

```bash
gpuaudit scan --format json -o scan-apr-08.json
# ... time passes, changes happen ...
gpuaudit scan --format json -o scan-apr-15.json
gpuaudit diff scan-apr-08.json scan-apr-15.json
```

```
gpuaudit diff — 2026-04-08 12:00 UTC → 2026-04-15 12:00 UTC

┌──────────────────────────────────────────────────────────┐
│ Cost Delta                                               │
├──────────────────────────────────────────────────────────┤
│ Monthly spend:   $142000 → $127450 (-$14550)             │
│ Estimated waste: $31000 → $18200   (-$12800)             │
│ Instances:       45 → 38           (-9 removed, +2 added)│
└──────────────────────────────────────────────────────────┘

REMOVED — 9 instance(s), -$16200/mo
...
```

Matches instances by ID. Reports added, removed, and changed instances with per-field diffs (instance type, pricing model, cost, state, GPU allocation, waste severity).
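
The matching step can be sketched as follows — the `Instance` fields here are simplified assumptions, not the real model:

```go
package main

import "fmt"

// Instance is a pared-down stand-in for gpuaudit's scan result entries;
// the real model carries more fields (pricing model, state, GPU allocation, ...).
type Instance struct {
	ID         string
	Type       string
	MonthlyUSD float64
}

// diffByID matches instances across two scans by ID and buckets them into
// added, removed, and changed — the same shape the diff output reports.
func diffByID(before, after []Instance) (added, removed, changed []Instance) {
	byID := make(map[string]Instance, len(before))
	for _, inst := range before {
		byID[inst.ID] = inst
	}
	seen := make(map[string]bool, len(after))
	for _, inst := range after {
		seen[inst.ID] = true
		prev, ok := byID[inst.ID]
		switch {
		case !ok:
			added = append(added, inst) // new ID: instance appeared
		case prev != inst:
			changed = append(changed, inst) // same ID, any field differs
		}
	}
	for _, inst := range before {
		if !seen[inst.ID] {
			removed = append(removed, inst) // ID gone: instance disappeared
		}
	}
	return added, removed, changed
}

func main() {
	before := []Instance{{"i-1", "g5.xlarge", 1200}, {"i-2", "p4d.24xlarge", 23652}}
	after := []Instance{{"i-2", "p4d.24xlarge", 20000}, {"i-3", "g6e.16xlarge", 6752}}
	a, r, c := diffByID(before, after)
	fmt.Println(len(a), len(r), len(c)) // 1 1 1
}
```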

## Multi-Account Scanning

Scan multiple AWS accounts in a single invocation using STS AssumeRole.

### Prerequisites

Deploy a read-only IAM role (`gpuaudit-reader`) to each target account. See [Cross-Account Role Setup](#cross-account-role-setup) below.

### Usage

```bash
# Scan specific accounts
gpuaudit scan --targets 111111111111,222222222222 --role gpuaudit-reader

# Scan entire AWS Organization
gpuaudit scan --org --role gpuaudit-reader

# Exclude management account
gpuaudit scan --org --role gpuaudit-reader --skip-self

# With external ID
gpuaudit scan --targets 111111111111 --role gpuaudit-reader --external-id my-secret
```
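
`--targets` takes a comma-separated account list. A sketch of how such a flag might be parsed and validated — the function name and the 12-digit check are assumptions for illustration, not gpuaudit's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTargets splits a --targets value into account IDs and rejects
// anything that is not a 12-digit AWS account ID.
func parseTargets(flag string) ([]string, error) {
	var ids []string
	for _, raw := range strings.Split(flag, ",") {
		id := strings.TrimSpace(raw)
		if id == "" {
			continue // tolerate trailing or doubled commas
		}
		if len(id) != 12 {
			return nil, fmt.Errorf("invalid account ID %q: want 12 digits", id)
		}
		for _, r := range id {
			if r < '0' || r > '9' {
				return nil, fmt.Errorf("invalid account ID %q: non-digit %q", id, r)
			}
		}
		ids = append(ids, id)
	}
	return ids, nil
}

func main() {
	ids, err := parseTargets("111111111111, 222222222222")
	fmt.Println(ids, err) // [111111111111 222222222222] <nil>
}
```

Each resolved ID is then combined with the `--role` name into an ARN of the form `arn:aws:iam::<account>:role/gpuaudit-reader` for the AssumeRole call.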

### Cross-Account Role Setup

#### Terraform

```hcl
variable "management_account_id" {
  description = "AWS account ID where gpuaudit runs"
  type        = string
}

resource "aws_iam_role" "gpuaudit_reader" {
  name = "gpuaudit-reader"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { AWS = "arn:aws:iam::${var.management_account_id}:root" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "gpuaudit_reader" {
  name   = "gpuaudit-policy"
  role   = aws_iam_role.gpuaudit_reader.id
  policy = file("gpuaudit-policy.json") # from: gpuaudit iam-policy > gpuaudit-policy.json
}
```

Deploy to all accounts using Terraform workspaces or CloudFormation StackSets.

#### CloudFormation StackSet

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ManagementAccountId:
    Type: String
Resources:
  GpuAuditRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: gpuaudit-reader
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              AWS: !Sub "arn:aws:iam::${ManagementAccountId}:root"
            Action: sts:AssumeRole
      Policies:
        - PolicyName: gpuaudit-policy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeInstanceTypes
                  - ec2:DescribeRegions
                  - sagemaker:ListEndpoints
                  - sagemaker:DescribeEndpoint
                  - sagemaker:DescribeEndpointConfig
                  - eks:ListClusters
                  - eks:ListNodegroups
                  - eks:DescribeNodegroup
                  - cloudwatch:GetMetricData
                  - cloudwatch:GetMetricStatistics
                  - cloudwatch:ListMetrics
                  - ce:GetCostAndUsage
                  - ce:GetReservationUtilization
                  - ce:GetSavingsPlansUtilization
                  - pricing:GetProducts
                Resource: "*"
```

## IAM permissions
@@ -73,7 +226,7 @@ gpuaudit is read-only. It never modifies your infrastructure. Generate the minim
gpuaudit iam-policy
```

This outputs a JSON policy requiring only `Describe*`, `List*`, `Get*` permissions for EC2, SageMaker, CloudWatch, Cost Explorer, and Pricing APIs.
For Kubernetes scanning, gpuaudit needs `get`/`list` on `nodes` and `pods` cluster-wide.
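
For reference, a minimal ClusterRole sketch granting exactly that access — this manifest is illustrative and not shipped by gpuaudit; bind it to whatever identity runs the scan with a ClusterRoleBinding:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gpuaudit-reader   # illustrative name
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list"]
```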

## GPU pricing reference

@@ -83,27 +236,25 @@ gpuaudit pricing

# Filter by GPU model
gpuaudit pricing --gpu H100
gpuaudit pricing --gpu A10G
gpuaudit pricing --gpu T4
gpuaudit pricing --gpu L4
```

## Output formats

| Format | Flag | Use case |
|---|---|---|
| Table | `--format table` (default) | Terminal viewing |
| JSON | `--format json` | Automation, CI/CD pipelines |
| JSON | `--format json` | Automation, CI/CD, `gpuaudit diff` |
| Markdown | `--format markdown` | PRs, wikis, docs |
| Slack | `--format slack` | Slack webhook integration |

## How it works

1. **Discovery** — Scans EC2 and SageMaker across multiple regions for GPU instance families (g4dn, g5, g6, g6e, p4d, p4de, p5, inf2, trn1)
1. **Discovery** — Scans EC2, SageMaker, EKS node groups, and Kubernetes API across multiple regions for GPU resources
2. **Metrics** — Collects 7-day CloudWatch metrics: CPU, network I/O for EC2; GPU utilization, GPU memory, invocations for SageMaker
3. **Analysis** — Applies 6 waste detection rules with severity levels (critical/warning)
4. **Recommendations** — Generates specific actions (terminate, downsize, switch pricing) with estimated monthly savings

Regions scanned by default: us-east-1, us-east-2, us-west-2, eu-west-1, eu-west-2, eu-central-1, ap-southeast-1, ap-northeast-1, ap-south-1.
3. **K8s allocation** — Lists pods requesting `nvidia.com/gpu` resources and maps them to nodes
4. **Analysis** — Applies 7 waste detection rules with severity levels (critical/warning/info)
5. **Recommendations** — Generates specific actions (terminate, downsize, switch pricing) with estimated monthly savings
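
Step 3 can be sketched as a pure function over node capacity and pod requests — the types and names below are simplified stand-ins for what the Kubernetes API returns, not gpuaudit's actual code:

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for Kubernetes API objects;
// gpuaudit reads the real objects via the Kubernetes API.
type Pod struct {
	NodeName string
	GPUReq   int64 // sum of nvidia.com/gpu requests across containers
}

type Node struct {
	Name        string
	GPUCapacity int64 // nvidia.com/gpu in the node's capacity
}

// unallocatedGPUNodes returns nodes that advertise GPU capacity but have
// zero scheduled pods requesting nvidia.com/gpu — the "K8s unallocated
// GPUs" signal described above.
func unallocatedGPUNodes(nodes []Node, pods []Pod) []string {
	requested := make(map[string]int64)
	for _, p := range pods {
		if p.GPUReq > 0 {
			requested[p.NodeName] += p.GPUReq
		}
	}
	var idle []string
	for _, n := range nodes {
		if n.GPUCapacity > 0 && requested[n.Name] == 0 {
			idle = append(idle, n.Name)
		}
	}
	return idle
}

func main() {
	nodes := []Node{{"ip-10-15-255-248", 1}, {"ip-10-0-0-1", 0}, {"ip-10-22-250-15", 1}}
	pods := []Pod{{"ip-10-22-250-15", 1}}
	fmt.Println(unallocatedGPUNodes(nodes, pods)) // [ip-10-15-255-248]
}
```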

## Project structure

@@ -113,21 +264,22 @@ gpuaudit/
├── internal/
│   ├── models/          Core data types (GPUInstance, WasteSignal, Recommendation)
│   ├── pricing/         Bundled GPU pricing database (40+ instance types)
│   ├── analysis/        Waste detection rules engine
│   ├── output/          Formatters (table, JSON, markdown, Slack)
│   └── providers/aws/   EC2, SageMaker, CloudWatch, scanner orchestrator
│   ├── analysis/        Waste detection rules engine (7 rules)
│   ├── diff/            Scan comparison logic
│   ├── output/          Formatters (table, JSON, markdown, Slack, diff)
│   └── providers/
│       ├── aws/         EC2, SageMaker, EKS, CloudWatch, Cost Explorer
│       └── k8s/         Kubernetes API GPU node/pod discovery
└── LICENSE              Apache 2.0
```

## Roadmap

- [ ] AWS Cost Explorer integration (actual vs projected spend)
- [ ] EKS GPU pod discovery
- [ ] DCGM GPU metrics via Kubernetes (actual GPU utilization, not just allocation)
- [ ] SageMaker training job analysis
- [ ] Multi-account (AWS Organizations) scanning
- [ ] GCP + Azure support
- [ ] GitHub Action for scheduled scans
- [ ] Historical scan comparison (`gpuaudit diff`)

## License
