Commit b9b800d
Parallelize team-provisioner Google Admin SDK calls (#130)
## Summary
The registry's **Provision Groups** workflow has been failing with a
`Read timeout on endpoint URL ...
/javabin-team-provisioner/invocations`.
The Lambda itself isn't timing out — its config has `Timeout=300s` and
real executions complete in ~90s. What's tripping is the AWS CLI's
default **60s read-timeout**, which then triggers client-side retries
and three concurrent Lambda invocations per CI run before exiting with
a false-positive failure.
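
As a minimal sketch of the client-side fix, the helper below builds an `aws lambda invoke` command with the read timeout disabled. The function name is taken from the endpoint URL in the error above. The helper name `build_invoke_command` and the output path are hypothetical, and the real scripts may assemble the command differently.

```python
def build_invoke_command(function_name, out_path, read_timeout=0):
    """Return an argv list for `aws lambda invoke`.

    A --cli-read-timeout of 0 makes the CLI's socket read block without
    timing out, so a ~90 s invocation is not cut off at the 60 s default.
    """
    return [
        "aws", "lambda", "invoke",
        "--function-name", function_name,
        "--cli-read-timeout", str(read_timeout),
        out_path,  # file the CLI writes the Lambda response to
    ]


# Hypothetical usage; the output file name is an assumption.
command = build_invoke_command("javabin-team-provisioner", "response.json")
```

The list can be passed to `subprocess.run(command, check=True)`. Any request payload is omitted here because its shape is not described in this PR.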
CloudWatch confirms: invocation `96299b07-…` on 2026-05-27 ran cleanly
end-to-end with `Duration: 90,978 ms` — the data was synced but CI was
already red.
Root cause: per-hero and per-group Google Admin Directory API calls
were sequential. With 55 heroes and 35 groups, the ~230 round-trips at
~400 ms each sum to ~90 s.
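
The arithmetic can be checked with a short sketch. The function names are illustrative, and the parallel figure is an idealized lower bound that assumes every call takes the same time and ignores per-step barriers and non-Google work.

```python
import math


def sequential_seconds(calls, latency_s):
    """Wall time when every round-trip waits for the previous one."""
    return calls * latency_s


def parallel_seconds(calls, latency_s, workers):
    """Idealized wall time when calls are spread evenly over workers."""
    return math.ceil(calls / workers) * latency_s


# Figures from the root-cause note: ~230 round-trips at ~400 ms each.
sequential = sequential_seconds(230, 0.4)     # about 92 s
parallel = parallel_seconds(230, 0.4, 5)      # about 18.4 s with 5 workers
```

The lower bound of roughly 18 s is consistent with an estimate of 20-25 s once scheduling overhead and the gaps between steps are included.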
## Changes
- Extract `_sync_hero_account`, `_sync_hero_alias`, `_sync_one_group`,
and `_sync_one_access_group` from `handle_sync_groups_and_heros`.
- Dispatch each step's per-item work through `ThreadPoolExecutor` with
`GROUP_SYNC_WORKERS` workers (default `5`, env-tunable on the Lambda
without redeploying code).
- Pre-fetch the Google OAuth token once before parallel work starts so
workers don't race to refresh it.
- Add `--cli-read-timeout 0` to `scripts/provision-groups.py` and
`scripts/provision-teams.sh` as a safety net against future regressions.
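
The dispatch pattern described above might look roughly like the following sketch. Only `GROUP_SYNC_WORKERS` and its default of `5` come from this PR. `fetch_token`, `sync_one_group`, and `sync_all` are hypothetical stand-ins, not the actual helpers in `team_provisioner`.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Env-tunable worker count; the default of 5 matches this PR.
WORKERS = int(os.environ.get("GROUP_SYNC_WORKERS", "5"))


def fetch_token():
    """Hypothetical stand-in for the Google OAuth token fetch."""
    return "token-placeholder"


def sync_one_group(group, token):
    """Hypothetical stand-in for one group's Directory API calls."""
    return f"{group}:synced"


def sync_all(groups, worker=sync_one_group, max_workers=WORKERS):
    # Fetch the token once, before any thread starts, so workers share it
    # instead of racing to refresh it.
    token = fetch_token()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order and re-raises a worker's
        # exception when its result is reached.
        return list(pool.map(lambda g: worker(g, token), groups))
```

Threads suit this workload because each item spends nearly all of its time waiting on network I/O, so a small pool overlaps the round-trips without needing multiple processes.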
## Sizing rationale
Google Directory API per-user quota: **1,500 queries / 100 s** (≈15 QPS).
We impersonate a single admin via domain-wide delegation, so the per-user
quota applies. 5 workers × ~3 QPS each ≈ 15 QPS, which sits at the
per-user ceiling but well below the 40 QPS project quota. If we hit 429s
in practice, drop `GROUP_SYNC_WORKERS` to 3 via the env var (no redeploy
needed).
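
A rough check of that budget, as a sketch: it assumes each worker issues calls back to back at the ~400 ms latency quoted in the summary, which gives about 2.5 QPS per worker (the ~3 QPS figure above rounds this up). The function names are illustrative.

```python
def per_user_qps_budget(queries=1500, window_s=100):
    """Average QPS allowed by the per-user quota."""
    return queries / window_s


def peak_qps(workers, call_latency_s=0.4):
    """Upper bound on request rate if every worker is always busy."""
    return workers / call_latency_s


budget = per_user_qps_budget()   # 15 QPS
with_five = peak_qps(5)          # about 12.5 QPS, under the budget
with_three = peak_qps(3)         # about 7.5 QPS, the fallback setting
```

Under these assumptions, 5 workers stay below the per-user budget only if latency does not drop much below 400 ms, which is why the fallback of 3 workers is worth keeping in mind.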
## Expected effect
| Phase | Sequential | Parallel (5w) |
|---|---|---|
| Step 1 — hero account checks (55) | 26 s | ~6 s |
| Step 3 — groups + IC sync (35) | 60 s | ~12-15 s |
| Step 6 — access groups (5) | 4 s | ~1 s |
| **Total** | **~91 s** | **~20-25 s** |
## Test plan
- [ ] Platform CI plan + apply (deploys new Lambda zip).
- [ ] Next merge to `javaBin/registry`'s `groups/**` triggers the
Provision Groups workflow; verify it completes green and the
Lambda `Duration` metric is well under 60 s.
- [ ] Spot-check CloudWatch logs for any 429 from Google Directory
API — if observed, drop `GROUP_SYNC_WORKERS` env var to 3.
Related: false-positive timeout in [Provision Groups
#26418541309](https://github.com/javaBin/registry/actions/runs/26418541309).

1 parent 1c1016f commit b9b800d
3 files changed
Lines changed: 324 additions & 273 deletions
File tree
- scripts
- terraform/lambda-src/team_provisioner