Skip to content

Commit b9b800d

Browse files
Parallelize team-provisioner Google Admin SDK calls (#130)
## Summary The registry's **Provision Groups** workflow has been failing with a `Read timeout on endpoint URL ... /javabin-team-provisioner/invocations`. The Lambda itself isn't timing out — its config has `Timeout=300s` and real executions complete in ~90s. What's tripping is the AWS CLI's default **60s read-timeout**, which then triggers client-side retries and three concurrent Lambda invocations per CI run before exiting with a false-positive failure. CloudWatch confirms: invocation `96299b07-…` on 2026-05-27 ran cleanly end-to-end with `Duration: 90,978 ms` — the data was synced but CI was already red. Root cause: per-hero and per-group Google Admin Directory API calls were sequential. With 55 heroes and 35 groups, the ~230 round-trips at ~400 ms each sum to ~90 s. ## Changes - Extract `_sync_hero_account`, `_sync_hero_alias`, `_sync_one_group`, and `_sync_one_access_group` from `handle_sync_groups_and_heros`. - Dispatch each step's per-item work through `ThreadPoolExecutor` with `GROUP_SYNC_WORKERS` workers (default `5`, env-tunable on the Lambda without redeploying code). - Pre-fetch the Google OAuth token once before parallel work starts so workers don't race to refresh it. - Add `--cli-read-timeout 0` to `scripts/provision-groups.py` and `scripts/provision-teams.sh` as a safety net against future regressions. ## Sizing rationale Google Directory API per-user quota: **1,500 queries / 100 s** (≈15 QPS). We impersonate a single admin via domain-wide delegation, so per-user applies. 5 workers × ~3 QPS each ≈ 15 QPS — at the budget ceiling, well below the 40 QPS project quota. If we hit 429s in practice, drop `GROUP_SYNC_WORKERS` to 3 via env-var (no redeploy needed). ## Expected effect | Phase | Sequential | Parallel (5w) | |---|---|---| | Step 1 — hero account checks (55) | 26 s | ~6 s | | Step 3 — groups + IC sync (35) | 60 s | ~12-15 s | | Step 6 — access groups (5) | 4 s | ~1 s | | **Total** | **~91 s** | **~20-25 s** | ## Test plan - [ ] Platform CI plan + apply (deploys new Lambda zip). - [ ] Next merge to `javaBin/registry`'s `groups/**` triggers the Provision Groups workflow; verify it completes green and the Lambda `Duration` metric is well under 60 s. - [ ] Spot-check CloudWatch logs for any 429 from Google Directory API — if observed, drop `GROUP_SYNC_WORKERS` env var to 3. Related: false-positive timeout in [Provision Groups #26418541309](https://github.com/javaBin/registry/actions/runs/26418541309).
1 parent 1c1016f commit b9b800d

3 files changed

Lines changed: 324 additions & 273 deletions

File tree

scripts/provision-groups.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,9 @@ def invoke_lambda(payload):
8080
"--function-name", "javabin-team-provisioner",
8181
"--payload", f"fileb://{payload_file}",
8282
"--cli-binary-format", "raw-in-base64-out",
83+
# Lambda timeout is 300s; default CLI read timeout (60s) is too low.
84+
# Disable client-side read timeout so the CLI waits for the Lambda response.
85+
"--cli-read-timeout", "0",
8386
"/tmp/lambda-response.json",
8487
],
8588
capture_output=True,

scripts/provision-teams.sh

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,7 @@ aws lambda invoke \
4545
--function-name javabin-team-provisioner \
4646
--payload "$PAYLOAD" \
4747
--cli-binary-format raw-in-base64-out \
48+
--cli-read-timeout 0 \
4849
/tmp/lambda-response.json
4950

5051
cat /tmp/lambda-response.json

0 commit comments

Comments
 (0)