Fix non-US relay cluster connectivity and improve monitoring#1130
Conversation
The cluster JWT was missing the `cluster: true` claim, causing euc/sea relays to be rejected by the root node. Additionally, the cluster task ran in a spawned tokio task whose panic didn't bring down the process, leaving a half-dead relay that passed health checks.

Changes:
- Use tokio::select! in main so cluster/web/server failures propagate
- Replace GCP uptime checks with a combined monitor service (memory + health) that posts directly to Discord
- Rename moq-token -> moq-token-cli in all scripts and docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cdn/common/monitor.service.tftpl`:
- Around line 2-9: Add systemd network-online ordering to the unit so it waits
for DNS/HTTPS before starting: insert a [Unit] section (above the existing
Description= line or immediately after it) that includes
After=network-online.target and Wants=network-online.target; keep existing
fields like Description=MoQ CDN monitor, Environment=MONITOR_WEBHOOK,
Environment=MONITOR_HEALTH_DOMAIN and ExecStart=/etc/systemd/system/monitor.sh
unchanged.
In `@cdn/common/monitor.sh`:
- Around line 36-38: The curl call that posts the payload to $WEBHOOK can block
the monitor loop; modify the curl invocation (the command that uses payload and
$WEBHOOK) to include explicit timeouts (e.g., --connect-timeout and --max-time)
so requests fail quickly on DNS/connect or slow responses; keep existing flags
(-s -f -X POST -H "Content-Type: application/json" -d "$payload") and surface
failures as before (the || echo warning) but ensure the curl command uses
timeouts like --connect-timeout 5 and --max-time 10 to avoid stalling the
monitor loop.
- Around line 55-61: The current memory alert builds `top` from `ps aux` and
sends it via `send_webhook`, which leaks full COMMAND argv; update the `top`
assignment in monitor.sh to use a sanitized ps format (e.g., `ps -eo
pid,user,%mem,rss,comm --sort=-%mem | head -6`) or pipe `ps aux` through awk to
drop the COMMAND column so only non-sensitive fields are included, keep the rest
of the message assembly (variables HOSTNAME, available, MEMORY_THRESHOLD) and
call `send_webhook "$msg"` with the sanitized `top` variable.
In `@cdn/health.sh`:
- Line 18: The fallback JWT generation command printed by the echo (the
moq-token-cli sign invocation) uses the non-existent flag --get; update that
command string to use --subscribe instead so the printed instruction and
fallback behavior work (change the echo line that prints: "cargo run --bin
moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --get \"\" >
secrets/demo-sub.jwt" to use --subscribe). Also apply the same replacement in
the identical occurrence in cdn/README.md so both references use --subscribe.
In `@cdn/main.tf`:
- Around line 50-53: The template is using the shared var.domain for
MONITOR_HEALTH_DOMAIN causing cross-node health checks; update cdn/main.tf to
pass a node-local identifier into templatefile instead of var.domain (for
example the instance's private IP, private DNS, or a per-instance var like
node_hostname), and adjust cdn/common/monitor.service.tftpl to set
MONITOR_HEALTH_DOMAIN from that node-local parameter; ensure the calling code
that creates instances supplies the per-node value (e.g., instance.private_ip or
per-instance variable) so each node probes itself rather than the global relay
domain.
In `@cdn/pub/justfile`:
- Around line 56-60: Summary: Suspend monitor.service while bouncing
demo-bbb.service to avoid health-probe false failures. Change the command
sequence in the justfile so you stop or restart monitor.service before touching
demo-bbb and only start/restart monitor.service after demo-bbb is back up: call
systemctl enable monitor.service (if needed), then systemctl stop
monitor.service (or systemctl mask it temporarily), then run systemctl start
demo-bbb-prepare.service and systemctl start vacuum.timer and systemctl restart
demo-bbb.service, and finally systemctl start (or unmask+start) monitor.service;
reference monitor.service, demo-bbb.service, demo-bbb-prepare.service and the
health probe in cdn/common/monitor.sh (fetching demo/bbb) when applying the
change.
In `@cdn/README.md`:
- Line 16: Remove the unsafe shell redirection from the example command on line
16: change the shown command "cargo run --bin moq-token-cli -- --key
secrets/root.jwk generate > secrets/root.jwk" to simply use "cargo run --bin
moq-token-cli -- --key secrets/root.jwk generate" so the CLI writes directly to
the --key path and avoid truncating the file via the shell redirect.
In `@cdn/relay/justfile`:
- Around line 66-72: Stop or disable the monitoring service around the relay
restart to avoid flapping: before restarting or bouncing moq-relay.service in
the deploy-all workflow, issue a stop/suspend of monitor.service (monitor.sh
probes immediately on startup), then restart moq-relay.service, and finally
start/re-enable monitor.service once the relay is back up; update the justfile
sequence that currently runs systemctl restart monitor.service && systemctl
restart moq-relay.service to stop monitor.service before restarting
moq-relay.service and start it again afterwards.
In `@rs/moq-relay/src/main.rs`:
- Line 79: The current pattern only matches Err from web.run() and ignores Ok,
so change the handling to capture the full result of Web::run() (callable as
web.run()) and match on it; on Ok(()) treat a stopped web server as terminal
(return an Err with a context like "web server stopped") and on Err(e) return
that error with context (e.g., "web server failed"), mirroring serve() behavior
so the relay exits when the web server stops running.
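The curl-timeout and ps-sanitization findings above can be sketched together. This is an illustrative sketch, not the repo's actual monitor.sh: `send_webhook`, `top_procs`, and the `WEBHOOK` variable are assumed names.

```shell
#!/bin/bash
# Sketch of the hardened monitor helpers suggested above (names are illustrative).

# Post a JSON payload to $WEBHOOK with explicit timeouts so a slow or
# unreachable Discord endpoint cannot stall the monitor loop.
send_webhook() {
    payload=$1
    curl -s -f -X POST \
        -H "Content-Type: application/json" \
        --connect-timeout 5 --max-time 10 \
        -d "$payload" "$WEBHOOK" \
        || echo "warning: webhook post failed" >&2
}

# Report the largest processes without leaking full command argv:
# `comm` is the bare executable name, so tokens passed as CLI flags stay out.
top_procs() {
    ps -eo pid,user,%mem,rss,comm --sort=-%mem | head -6
}
```

The field list `pid,user,%mem,rss,comm` keeps enough to identify the memory hog while dropping the COMMAND column, which is where a JWT passed as an argument would otherwise appear in the alert.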
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 549cd501-b554-4e89-a19f-0e434575567a
📒 Files selected for processing (11)
- cdn/README.md
- cdn/common/memory-alert.sh
- cdn/common/monitor.service.tftpl
- cdn/common/monitor.sh
- cdn/health.sh
- cdn/health.tf
- cdn/main.tf
- cdn/pub/justfile
- cdn/relay/justfile
- justfile
- rs/moq-relay/src/main.rs
💤 Files with no reviewable changes (1)
- cdn/common/memory-alert.sh
cdn/health.sh (Outdated)

  if [ ! -f "$JWT_FILE" ]; then
    echo "Error: $JWT_FILE not found."
-   echo "Generate it with: cargo run --bin moq-token -- --key secrets/root.jwk sign --root \"demo\" --get \"\" > secrets/demo-sub.jwt"
+   echo "Generate it with: cargo run --bin moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --get \"\" > secrets/demo-sub.jwt"
Replace --get with --subscribe in the fallback JWT command.
The moq-token-cli sign subcommand exposes --subscribe, not --get. This command on line 18 will fail when the token is missing. The identical error appears in cdn/README.md line 25.
Suggested fix
- echo "Generate it with: cargo run --bin moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --get \"\" > secrets/demo-sub.jwt"
+ echo "Generate it with: cargo run --bin moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --subscribe \"\" > secrets/demo-sub.jwt"
  content = templatefile("${path.module}/common/monitor.service.tftpl", {
    webhook = var.webhook
    domain  = var.domain
  })
var.domain is too coarse for per-node health checks.
cdn/common/monitor.service.tftpl:4-6 exports this as MONITOR_HEALTH_DOMAIN, and cdn/input.tf:12-15 defines var.domain as the relay domain name. Because this service is shared across instances, every node ends up probing the same public hostname. In a multi-node relay/pub deployment, one healthy node can satisfy the check for another unhealthy node, so local failures go undetected. This template needs a node-local target instead.
  systemctl enable monitor.service && \
  systemctl start demo-bbb-prepare.service && \
  systemctl start vacuum.timer && \
  systemctl restart memory-alert.service && \
  systemctl restart monitor.service && \
  systemctl restart demo-bbb.service"
Suspend the monitor while demo-bbb.service is being bounced.
The health probe fetches demo/bbb on every run (cdn/common/monitor.sh, Line 91). If the monitor stays live during a publisher restart, deploys can emit false failures for an intentional outage.
Suggested fix
- systemctl enable monitor.service && \
+ systemctl enable monitor.service && \
systemctl start demo-bbb-prepare.service && \
systemctl start vacuum.timer && \
- systemctl restart monitor.service && \
- systemctl restart demo-bbb.service"
+ systemctl stop monitor.service && \
+ systemctl restart demo-bbb.service && \
+ systemctl restart monitor.service"
  # generate the root key private key
- cargo run --bin moq-token -- --key secrets/root.jwk generate > secrets/root.jwk
+ cargo run --bin moq-token-cli -- --key secrets/root.jwk generate > secrets/root.jwk
Remove the > secrets/root.jwk redirect from line 16.
The moq-token-cli generate command writes directly to the --key path, making the redirect redundant and unsafe. The shell truncates the output file before cargo run executes, so a failed rerun leaves secrets/root.jwk empty and the previous key lost.
Suggested fix
- cargo run --bin moq-token-cli -- --key secrets/root.jwk generate > secrets/root.jwk
+ cargo run --bin moq-token-cli -- --key secrets/root.jwk generate
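The truncation hazard can be demonstrated in isolation: the shell performs the truncating open of the redirect target before the command ever runs, so even a failing command destroys the existing file.

```shell
#!/bin/bash
# A failing command still empties its redirect target, because the shell
# truncates the file before exec'ing the command.
tmp=$(mktemp)
printf 'old-key-material' > "$tmp"
false > "$tmp" || true     # `false` fails, but "$tmp" is already truncated
wc -c < "$tmp"             # prints 0: the original contents are gone
rm -f "$tmp"
```

This is exactly why a failed rerun of the redirected `generate` command would leave secrets/root.jwk empty with the previous key lost.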
  systemctl enable monitor.service && \
  systemctl enable moq-relay.service && \
  systemctl start certbot-renew.timer && \
  systemctl start vacuum.timer && \
  systemctl start moq-cert.service && \
  systemctl restart memory-alert.service && \
  systemctl restart monitor.service && \
  systemctl restart moq-relay.service"
Suspend the monitor while moq-relay.service is being bounced.
monitor.sh starts probing immediately on startup (cdn/common/monitor.sh, Lines 118-120), and an older monitor instance may already be running during redeploys. As written, deploy-all can flap the health channel on intentional restarts.
Suggested fix
- systemctl enable monitor.service && \
+ systemctl enable monitor.service && \
systemctl enable moq-relay.service && \
systemctl start certbot-renew.timer && \
systemctl start vacuum.timer && \
systemctl start moq-cert.service && \
- systemctl restart monitor.service && \
- systemctl restart moq-relay.service"
+ systemctl stop monitor.service && \
+ systemctl restart moq-relay.service && \
+ systemctl restart monitor.service"
  tokio::select! {
    Err(err) = cluster.clone().run() => return Err(err).context("cluster failed"),
    Err(err) = web.run() => return Err(err).context("web server failed"),
Fix the web.run() branch to handle both success and error cases.
The Web::run() method (web.rs:80) returns anyhow::Result<()> with a successful Ok(()) path (line 126). The current pattern Err(err) = web.run() only matches errors, silently ignoring clean exits. This means the relay continues running without its health/metrics endpoint if the web server stops normally.
Capture the full result and treat a stopped web server as terminal, matching the behavior of serve():
Suggested fix
- Err(err) = web.run() => return Err(err).context("web server failed"),
+ res = web.run() => {
+ res.context("web server failed")?;
+ anyhow::bail!("web server stopped");
+ },
- Add network-online ordering to monitor.service
- Add curl timeouts to webhook posts to avoid stalling the loop
- Sanitize ps output to avoid leaking process argv (e.g. JWTs)
- Fix --get -> --subscribe in health.sh and README.md docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cdn/common/monitor.service.tftpl`:
- Line 8: Replace the direct Environment=MONITOR_WEBHOOK=${webhook} setting in
the systemd unit template with an EnvironmentFile directive that points to a
root-owned credential file (e.g., /etc/default/moq-monitor) and ensure the
webhook is written to that file as MONITOR_WEBHOOK=<webhook_url> with mode 0600;
update any packaging/provisioning that renders cdn/common/monitor.service.tftpl
to create /etc/default/moq-monitor with the secret and correct
ownership/permissions instead of embedding ${webhook} in the unit file.
In `@cdn/common/monitor.sh`:
- Line 67: The MEMORY_BACKOFF doubling can grow unbounded and lengthen
health-check sleeps; modify monitor.sh to cap the backoff and reset it on a
healthy check: introduce constants like INITIAL_MEMORY_BACKOFF and
MAX_MEMORY_BACKOFF, replace the line that doubles MEMORY_BACKOFF
(MEMORY_BACKOFF=$(( MEMORY_BACKOFF * 2 ))) with a capped increment (e.g.,
MEMORY_BACKOFF=$(( MEMORY_BACKOFF * 2 )); if [ "$MEMORY_BACKOFF" -gt
"$MAX_MEMORY_BACKOFF" ]; then MEMORY_BACKOFF="$MAX_MEMORY_BACKOFF"; fi), ensure
the sleep uses the current MEMORY_BACKOFF, and reset
MEMORY_BACKOFF=INITIAL_MEMORY_BACKOFF when the memory condition clears so
health-check frequency returns to normal.
- Line 18: The unquoted array assignment to HEALTH_NODES using
MONITOR_HEALTH_NODES allows glob expansion; change the parsing to safely split
the quoted MONITOR_HEALTH_NODES into an array without word/glob expansion (for
example, set IFS to space and use read -r -a to populate HEALTH_NODES from the
quoted parameter expansion of ${MONITOR_HEALTH_NODES:-usc euc sea}), ensuring
the variable is quoted so wildcard characters are not expanded against the
filesystem.
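The capped backoff and the glob-safe node split described above can be sketched as follows; the constants and helper names mirror the prompt's suggestions and are illustrative, not the repo's actual monitor.sh.

```shell
#!/bin/bash
# Capped exponential backoff for repeated memory alerts: doubles per alert,
# never exceeds MAX_MEMORY_BACKOFF, and resets once memory recovers.
INITIAL_MEMORY_BACKOFF=60
MAX_MEMORY_BACKOFF=600
MEMORY_BACKOFF=$INITIAL_MEMORY_BACKOFF

bump_backoff() {
    MEMORY_BACKOFF=$(( MEMORY_BACKOFF * 2 ))
    if [ "$MEMORY_BACKOFF" -gt "$MAX_MEMORY_BACKOFF" ]; then
        MEMORY_BACKOFF=$MAX_MEMORY_BACKOFF
    fi
}

# Call when the memory condition clears so probe frequency returns to normal.
reset_backoff() {
    MEMORY_BACKOFF=$INITIAL_MEMORY_BACKOFF
}

# Glob-safe parsing of the node list: read -r -a splits the quoted expansion
# on IFS without expanding wildcards against the filesystem.
IFS=' ' read -r -a HEALTH_NODES <<< "${MONITOR_HEALTH_NODES:-usc euc sea}"
```

With a 60s initial backoff, four consecutive alerts reach the 600s ceiling instead of climbing to 960s and beyond.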
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a6b472f5-e770-496b-b868-58a05236ead5
📒 Files selected for processing (4)
- cdn/README.md
- cdn/common/monitor.service.tftpl
- cdn/common/monitor.sh
- cdn/health.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- cdn/README.md
  [Service]
  Type=simple
  Environment=MONITOR_WEBHOOK=${webhook}
Use EnvironmentFile for webhook secret instead of embedding in unit file.
Line 8 embeds the webhook URL directly in the systemd unit file with default Terraform permissions (0644), making it world-readable to local users. This is inconsistent with the project's secret handling elsewhere (e.g., /var/lib/moq/*.jwk, gcp.json stored with 0600).
Move MONITOR_WEBHOOK to a root-owned credential file with restricted permissions:
🔒 Suggested fix
[Service]
Type=simple
-Environment=MONITOR_WEBHOOK=${webhook}
+EnvironmentFile=-/etc/default/moq-monitor
Environment=MONITOR_HEALTH_DOMAIN=${domain}
Environment=PATH=/var/lib/moq/pkg/bin:/usr/bin:/bin
ExecStart=/etc/systemd/system/monitor.sh

Provision /etc/default/moq-monitor with mode 0600 containing:
MONITOR_WEBHOOK=<webhook_url>
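The provisioning side of this suggestion can be sketched as a small helper. `write_credential` and its arguments are hypothetical; the point is the 0600 mode enforced via umask at creation time.

```shell
#!/bin/bash
# Write the webhook secret to a credential file readable only by its owner,
# matching the 0600 handling used for the project's other secrets.
write_credential() {
    # $1 = destination path (e.g. /etc/default/moq-monitor), $2 = webhook URL
    # The subshell keeps the restrictive umask from leaking to the caller.
    ( umask 077 && printf 'MONITOR_WEBHOOK=%s\n' "$2" > "$1" )
}
```

On a real host this would run as root so the file is root-owned; systemd then loads it via `EnvironmentFile=-/etc/default/moq-monitor` and the URL never appears in a world-readable unit file.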
- Cap MEMORY_BACKOFF at 600s so health checks aren't delayed 30+ min
- Reset backoff when memory recovers
- Use read -r -a instead of unquoted expansion for HEALTH_NODES

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cdn/common/monitor.sh`:
- Around line 96-99: The /fetch endpoint currently only accepts JWT via query
param; add Authorization header support in the server by updating the /fetch
route handler (the fetch handler in web.rs) to accept and validate a Bearer
token from the Authorization header when present (fallback to existing ?jwt
behavior for compatibility), then change the scripts (the curl call in
monitor.sh and the similar call in cdn/health.sh) to stop embedding ?jwt in the
URL and instead send the token with -H "Authorization: Bearer ${jwt}" (remove or
omit the ?jwt=... portion of the URL); ensure token parsing/validation in the
fetch handler handles both header and query formats and that unit/integration
tests cover both cases.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 769c89b2-5b07-4144-a8c1-e4760fdfd4da
📒 Files selected for processing (1)
cdn/common/monitor.sh
  local url="https://${node}.${HEALTH_DOMAIN}/fetch/demo/bbb/catalog.json?jwt=${jwt}"
  local status ok=false

  status=$(curl -sf -o /dev/null -w "%{http_code}" --max-time 10 "$url" 2>/dev/null) && ok=true
Add header-based authentication support to the server before switching this script.
The /fetch endpoint currently only accepts JWT via query parameter (not via Authorization header). While placing JWTs in URLs does pose security risks—credentials can leak via process args, proxy logs, and observability pipelines—this script must use query parameters because the server doesn't yet support header-based alternatives.
Recommend a two-step fix:
- Implement Authorization header support in rs/moq-relay/src/web.rs (the /fetch handler at line 253)
- Update this script and cdn/health.sh to send the JWT in the Authorization: Bearer <jwt> header instead
♻️ Duplicate comments (1)
rs/moq-relay/src/main.rs (1)

77-82: ⚠️ Potential issue | 🟠 Major

Handle Ok(()) returns from subsystems to avoid running in a degraded state.
The pattern Err(err) = future only matches errors. If web.run() returns Ok(()) (which it can per web.rs:126), the branch is silently disabled while the relay continues without its web server. The else branch only executes when all futures complete without matching, so one subsystem can stop while others continue.
Apply the fix from the previous review to all branches that can return Ok(()):

Suggested fix for web.run() branch

- Err(err) = web.run() => return Err(err).context("web server failed"),
+ res = web.run() => {
+     res.context("web server failed")?;
+     anyhow::bail!("web server stopped");
+ },

Note: cluster.run() and serve() appear to always return errors (via bail!), so they're less affected, but applying the same pattern consistently would be defensive.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@rs/moq-relay/src/main.rs`:
- Around line 77-82: The select! currently only matches Err branches so a future
that returns Ok(()) (e.g., web.run()) will silently disable that subsystem;
update each branch (web.run(), cluster.clone().run(), and serve(server, cluster,
auth)) to match the full Result and treat Ok(()) as an unexpected shutdown by
returning an Err with context (e.g., map Ok(()) to an anyhow error like "web
server exited unexpectedly") so the relay fails rather than continuing in a
degraded state.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1d899a21-434c-4ee1-b919-9e85aa1a74a7
📒 Files selected for processing (1)
rs/moq-relay/src/main.rs
Summary

- The cluster JWT was missing `cluster: true`, causing euc (Frankfurt) and sea (Singapore) relays to be rejected by the root node with an `ExpectedCluster` error. They would crash-loop and never serve content.
- The cluster task ran in a spawned tokio task with `.expect()`, so a panic only killed the task — the web server stayed alive, passing health checks despite a dead cluster. Now uses `tokio::select!` so any subsystem failure exits the process.
- Replaced GCP uptime checks with a `monitor.sh` service that checks both memory and relay health, posting directly to Discord.
- Renamed `moq-token` → `moq-token-cli` in all scripts and docs.

Test plan

- `cargo check -p moq-relay` passes
- `tofu fmt -check -recursive` passes

🤖 Generated with Claude Code
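To confirm a token actually carries the cluster claim before deploying, the JWT payload can be decoded by hand. `jwt_payload` is an illustrative helper and the exact claim layout inside moq's tokens is an assumption here; only the base64url decoding is generic JWT behavior.

```shell
#!/bin/bash
# Decode a JWT's payload (the second dot-separated, base64url-encoded segment)
# so claims like "cluster" can be inspected without extra tooling.
jwt_payload() {
    raw=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
    # Restore the '=' padding that base64url encoding strips.
    while [ $(( ${#raw} % 4 )) -ne 0 ]; do raw="$raw="; done
    printf '%s' "$raw" | base64 -d
}

# Example with a hand-built token whose payload is {"cluster":true}:
jwt_payload "header.eyJjbHVzdGVyIjp0cnVlfQ.signature"
```

A deploy script could then gate on something like `jwt_payload "$(cat secrets/cluster.jwt)" | grep -q '"cluster":true'`, assuming the claim is serialized without whitespace.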