Parallelize team-provisioner Google Admin SDK calls (#130)

Alexanderamiri · web-flow · commit b9b800d1e9ae · 2026-05-27T23:24:48.000+02:00
## Summary The registry's **Provision Groups** workflow has been failing with a `Read timeout on endpoint URL ... /javabin-team-provisioner/invocations`. The Lambda itself isn't timing out — its config has `Timeout=300s` and real executions complete in ~90s. What's tripping is the AWS CLI's default **60s read-timeout**, which then triggers client-side retries and three concurrent Lambda invocations per CI run before exiting with a false-positive failure. CloudWatch confirms: invocation `96299b07-…` on 2026-05-27 ran cleanly end-to-end with `Duration: 90,978 ms` — the data was synced but CI was already red. Root cause: per-hero and per-group Google Admin Directory API calls were sequential. With 55 heroes and 35 groups, the ~230 round-trips at ~400 ms each sum to ~90 s. ## Changes - Extract `_sync_hero_account`, `_sync_hero_alias`, `_sync_one_group`, and `_sync_one_access_group` from `handle_sync_groups_and_heros`. - Dispatch each step's per-item work through `ThreadPoolExecutor` with `GROUP_SYNC_WORKERS` workers (default `5`, env-tunable on the Lambda without redeploying code). - Pre-fetch the Google OAuth token once before parallel work starts so workers don't race to refresh it. - Add `--cli-read-timeout 0` to `scripts/provision-groups.py` and `scripts/provision-teams.sh` as a safety net against future regressions. ## Sizing rationale Google Directory API per-user quota: **1,500 queries / 100 s** (≈15 QPS). We impersonate a single admin via domain-wide delegation, so per-user applies. 5 workers × ~3 QPS each ≈ 15 QPS — at the budget ceiling, well below the 40 QPS project quota. If we hit 429s in practice, drop `GROUP_SYNC_WORKERS` to 3 via env-var (no redeploy needed). ## Expected effect | Phase | Sequential | Parallel (5w) | |---|---|---| | Step 1 — hero account checks (55) | 26 s | ~6 s | | Step 3 — groups + IC sync (35) | 60 s | ~12-15 s | | Step 6 — access groups (5) | 4 s | ~1 s | | **Total** | **~91 s** | **~20-25 s** | ## Test plan - [ ] Platform CI plan + apply (deploys new Lambda zip). - [ ] Next merge to `javaBin/registry`'s `groups/**` triggers the Provision Groups workflow; verify it completes green and the Lambda `Duration` metric is well under 60 s. - [ ] Spot-check CloudWatch logs for any 429 from Google Directory API — if observed, drop `GROUP_SYNC_WORKERS` env var to 3. Related: false-positive timeout in [Provision Groups #26418541309](https://github.com/javaBin/registry/actions/runs/26418541309).
diff --git a/scripts/provision-groups.py b/scripts/provision-groups.py
@@ -80,6 +80,9 @@ def invoke_lambda(payload):
             "--function-name", "javabin-team-provisioner",
             "--payload", f"fileb://{payload_file}",
             "--cli-binary-format", "raw-in-base64-out",
+            # Lambda timeout is 300s; default CLI read timeout (60s) is too low.
+            # Disable client-side read timeout so the CLI waits for the Lambda response.
+            "--cli-read-timeout", "0",
             "/tmp/lambda-response.json",
         ],
         capture_output=True,
diff --git a/scripts/provision-teams.sh b/scripts/provision-teams.sh
@@ -45,6 +45,7 @@ aws lambda invoke \
   --function-name javabin-team-provisioner \
   --payload "$PAYLOAD" \
   --cli-binary-format raw-in-base64-out \
+  --cli-read-timeout 0 \
   /tmp/lambda-response.json
 
 cat /tmp/lambda-response.json
diff --git a/terraform/lambda-src/team_provisioner/handler.py b/terraform/lambda-src/team_provisioner/handler.py