
Fix non-US relay cluster connectivity and improve monitoring#1130

Merged
kixelated merged 5 commits into main from fix-europe
Mar 19, 2026

Conversation

@kixelated
Collaborator

@kixelated kixelated commented Mar 18, 2026

Summary

  • Cluster fix: The cluster JWT was missing cluster: true, causing euc (Frankfurt) and sea (Singapore) relays to be rejected by the root node with ExpectedCluster error. They would crash-loop and never serve content.
  • Process exit fix: The cluster task ran in a tokio::spawn with .expect(), so a panic only killed the task — the web server stayed alive, passing health checks despite a dead cluster. Now uses tokio::select! so any subsystem failure exits the process.
  • Monitoring: Replaced GCP uptime checks + webhook (which silently failed with Discord) with a combined monitor.sh service that checks both memory and relay health, posting directly to Discord.
  • Docs: Renamed moq-token to moq-token-cli in all scripts and docs.

Test plan

  • Verified all three relays are healthy after deploying new cluster JWT
  • cargo check -p moq-relay passes
  • tofu fmt -check -recursive passes
  • Deploy monitor service and verify Discord alerts work

🤖 Generated with Claude Code

@kixelated kixelated enabled auto-merge (squash) March 18, 2026 20:53
@kixelated kixelated disabled auto-merge March 18, 2026 21:07
The cluster JWT was missing the `cluster: true` claim, causing euc/sea
relays to be rejected by the root node. Additionally, the cluster task
ran in a spawned tokio task whose panic didn't bring down the process,
leaving a half-dead relay that passed health checks.

Changes:
- Use tokio::select! in main so cluster/web/server failures propagate
- Replace GCP uptime checks with a combined monitor service (memory +
  health) that posts directly to Discord
- Rename moq-token -> moq-token-cli in all scripts and docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kixelated kixelated changed the title from "Fix some old docs" to "Fix non-US relay cluster connectivity and improve monitoring" on Mar 18, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 18, 2026

Walkthrough

This PR updates tooling and monitoring:

  • replaces moq-token with moq-token-cli in docs and local scripts
  • removes cdn/common/memory-alert.sh and adds cdn/common/monitor.sh plus cdn/common/monitor.service.tftpl, with related deployment changes
  • removes Google Cloud Monitoring resources in cdn/health.tf
  • updates Terraform to generate common/gen/monitor.service
  • updates deployment justfiles to use monitor.sh/monitor.service
  • adjusts the cdn/health.sh webhook payload and JWT flag
  • refactors the Rust relay main to run the cluster, web, and a new serve task concurrently via tokio::select!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): the title accurately reflects the main changes: fixing non-US relay cluster connectivity (the critical cluster JWT fix) and improving monitoring infrastructure.
  • Description check (✅ Passed): the description aligns with the changeset, covering the cluster JWT fix, process exit handling, the monitoring replacement, and documentation updates.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cdn/common/monitor.service.tftpl`:
- Around line 2-9: Add systemd network-online ordering to the unit so it waits
for DNS/HTTPS before starting: insert a [Unit] section (above the existing
Description= line or immediately after it) that includes
After=network-online.target and Wants=network-online.target; keep existing
fields like Description=MoQ CDN monitor, Environment=MONITOR_WEBHOOK,
Environment=MONITOR_HEALTH_DOMAIN and ExecStart=/etc/systemd/system/monitor.sh
unchanged.

In `@cdn/common/monitor.sh`:
- Around line 36-38: The curl call that posts the payload to $WEBHOOK can block
the monitor loop; modify the curl invocation (the command that uses payload and
$WEBHOOK) to include explicit timeouts (e.g., --connect-timeout and --max-time)
so requests fail quickly on DNS/connect or slow responses; keep existing flags
(-s -f -X POST -H "Content-Type: application/json" -d "$payload") and surface
failures as before (the || echo warning) but ensure the curl command uses
timeouts like --connect-timeout 5 and --max-time 10 to avoid stalling the
monitor loop.
- Around line 55-61: The current memory alert builds `top` from `ps aux` and
sends it via `send_webhook`, which leaks full COMMAND argv; update the `top`
assignment in monitor.sh to use a sanitized ps format (e.g., `ps -eo
pid,user,%mem,rss,comm --sort=-%mem | head -6`) or pipe `ps aux` through awk to
drop the COMMAND column so only non-sensitive fields are included, keep the rest
of the message assembly (variables HOSTNAME, available, MEMORY_THRESHOLD) and
call `send_webhook "$msg"` with the sanitized `top` variable.

In `@cdn/health.sh`:
- Line 18: The fallback JWT generation command printed by the echo (the
moq-token-cli sign invocation) uses the non-existent flag --get; update that
command string to use --subscribe instead so the printed instruction and
fallback behavior work (change the echo line that prints: "cargo run --bin
moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --get \"\" >
secrets/demo-sub.jwt" to use --subscribe). Also apply the same replacement in
the identical occurrence in cdn/README.md so both references use --subscribe.

In `@cdn/main.tf`:
- Around line 50-53: The template is using the shared var.domain for
MONITOR_HEALTH_DOMAIN causing cross-node health checks; update cdn/main.tf to
pass a node-local identifier into templatefile instead of var.domain (for
example the instance's private IP, private DNS, or a per-instance var like
node_hostname), and adjust cdn/common/monitor.service.tftpl to set
MONITOR_HEALTH_DOMAIN from that node-local parameter; ensure the calling code
that creates instances supplies the per-node value (e.g., instance.private_ip or
per-instance variable) so each node probes itself rather than the global relay
domain.

In `@cdn/pub/justfile`:
- Around line 56-60: Summary: Suspend monitor.service while bouncing
demo-bbb.service to avoid health-probe false failures. Change the command
sequence in the justfile so you stop or restart monitor.service before touching
demo-bbb and only start/restart monitor.service after demo-bbb is back up: call
systemctl enable monitor.service (if needed), then systemctl stop
monitor.service (or systemctl mask it temporarily), then run systemctl start
demo-bbb-prepare.service and systemctl start vacuum.timer and systemctl restart
demo-bbb.service, and finally systemctl start (or unmask+start) monitor.service;
reference monitor.service, demo-bbb.service, demo-bbb-prepare.service and the
health probe in cdn/common/monitor.sh (fetching demo/bbb) when applying the
change.

In `@cdn/README.md`:
- Line 16: Remove the unsafe shell redirection from the example command on line
16: change the shown command "cargo run --bin moq-token-cli -- --key
secrets/root.jwk generate > secrets/root.jwk" to simply use "cargo run --bin
moq-token-cli -- --key secrets/root.jwk generate" so the CLI writes directly to
the --key path and avoid truncating the file via the shell redirect.

In `@cdn/relay/justfile`:
- Around line 66-72: Stop or disable the monitoring service around the relay
restart to avoid flapping: before restarting or bouncing moq-relay.service in
the deploy-all workflow, issue a stop/suspend of monitor.service (monitor.sh
probes immediately on startup), then restart moq-relay.service, and finally
start/re-enable monitor.service once the relay is back up; update the justfile
sequence that currently runs systemctl restart monitor.service && systemctl
restart moq-relay.service to stop monitor.service before restarting
moq-relay.service and start it again afterwards.

In `@rs/moq-relay/src/main.rs`:
- Line 79: The current pattern only matches Err from web.run() and ignores Ok,
so change the handling to capture the full result of Web::run() (callable as
web.run()) and match on it; on Ok(()) treat a stopped web server as terminal
(return an Err with a context like "web server stopped") and on Err(e) return
that error with context (e.g., "web server failed"), mirroring serve() behavior
so the relay exits when the web server stops running.
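Taken together, the curl-timeout and ps-sanitization findings above can be sketched as a small monitor.sh fragment. This is a hedged sketch, not the shipped script: `send_webhook`, the `MONITOR_WEBHOOK` variable, and the message wording are assumptions modeled on the review comments.

```shell
#!/bin/bash
# Sketch only: names here follow the review comments above, not
# necessarily the actual monitor.sh.
WEBHOOK="${MONITOR_WEBHOOK:-}"

send_webhook() {
  # JSON escaping of "$1" is omitted for brevity.
  local payload
  payload=$(printf '{"content": "%s"}' "$1")
  # Explicit timeouts so a stalled DNS lookup or slow Discord response
  # cannot block the monitor loop.
  curl -s -f -X POST \
    --connect-timeout 5 --max-time 10 \
    -H "Content-Type: application/json" \
    -d "$payload" "$WEBHOOK" \
    || echo "warning: webhook post failed" >&2
}

# Sanitized process snapshot: pid/user/%mem/rss/comm only, so full argv
# (which can contain JWTs or other secrets) never reaches the webhook.
top=$(ps -eo pid,user,%mem,rss,comm --sort=-%mem | head -6)

if [ -n "$WEBHOOK" ]; then
  send_webhook "high memory on ${HOSTNAME}: ${top}"
fi
```

With `MONITOR_WEBHOOK` unset the snapshot is still computed but nothing is posted, which keeps the sketch safe to dry-run.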

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 549cd501-b554-4e89-a19f-0e434575567a

📥 Commits

Reviewing files that changed from the base of the PR and between 65a8981 and 62a16e6.

📒 Files selected for processing (11)
  • cdn/README.md
  • cdn/common/memory-alert.sh
  • cdn/common/monitor.service.tftpl
  • cdn/common/monitor.sh
  • cdn/health.sh
  • cdn/health.tf
  • cdn/main.tf
  • cdn/pub/justfile
  • cdn/relay/justfile
  • justfile
  • rs/moq-relay/src/main.rs
💤 Files with no reviewable changes (1)
  • cdn/common/memory-alert.sh

cdn/health.sh Outdated
if [ ! -f "$JWT_FILE" ]; then
echo "Error: $JWT_FILE not found."
echo "Generate it with: cargo run --bin moq-token -- --key secrets/root.jwk sign --root \"demo\" --get \"\" > secrets/demo-sub.jwt"
echo "Generate it with: cargo run --bin moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --get \"\" > secrets/demo-sub.jwt"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== moq-token-cli sign flags =="
rg -n -C2 'Sign \{|subscribe: Vec<String>' rs/moq-token-cli/src/main.rs

echo
echo "== current call sites =="
rg -n -- '--get ""|--subscribe ""' cdn/health.sh cdn/README.md justfile

Repository: moq-dev/moq

Length of output: 911


Replace --get with --subscribe in the fallback JWT command.

The moq-token-cli sign subcommand exposes --subscribe, not --get. This command on line 18 will fail when the token is missing. The identical error appears in cdn/README.md line 25.

Suggested fix
-	echo "Generate it with: cargo run --bin moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --get \"\" > secrets/demo-sub.jwt"
+	echo "Generate it with: cargo run --bin moq-token-cli -- --key secrets/root.jwk sign --root \"demo\" --subscribe \"\" > secrets/demo-sub.jwt"

Comment on lines +50 to 53
content = templatefile("${path.module}/common/monitor.service.tftpl", {
webhook = var.webhook
domain = var.domain
})

⚠️ Potential issue | 🟠 Major

var.domain is too coarse for per-node health checks.

cdn/common/monitor.service.tftpl:4-6 exports this as MONITOR_HEALTH_DOMAIN, and cdn/input.tf:12-15 defines var.domain as the relay domain name. Because this service is shared across instances, every node ends up probing the same public hostname. In a multi-node relay/pub deployment, one healthy node can satisfy the check for another unhealthy node, so local failures go undetected. This template needs a node-local target instead.


Comment on lines +56 to 60
systemctl enable monitor.service && \
systemctl start demo-bbb-prepare.service && \
systemctl start vacuum.timer && \
systemctl restart memory-alert.service && \
systemctl restart monitor.service && \
systemctl restart demo-bbb.service"

⚠️ Potential issue | 🟡 Minor

Suspend the monitor while demo-bbb.service is being bounced.

The health probe fetches demo/bbb on every run (cdn/common/monitor.sh, Line 91). If the monitor stays live during a publisher restart, deploys can emit false failures for an intentional outage.

Suggested fix
-		systemctl enable monitor.service && \
+		systemctl enable monitor.service && \
 		systemctl start demo-bbb-prepare.service && \
 		systemctl start vacuum.timer && \
-		systemctl restart monitor.service && \
-		systemctl restart demo-bbb.service"
+		systemctl stop monitor.service && \
+		systemctl restart demo-bbb.service && \
+		systemctl restart monitor.service"


# generate the root key private key
cargo run --bin moq-token -- --key secrets/root.jwk generate > secrets/root.jwk
cargo run --bin moq-token-cli -- --key secrets/root.jwk generate > secrets/root.jwk

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Generate writes to --key =="
rg -n -C2 'Commands::Generate|key\.to_file\(&cli\.key\)' rs/moq-token-cli/src/main.rs

echo
echo "== shell redirection truncates before the command runs =="
tmp=$(mktemp)
printf 'sentinel' > "$tmp"
set +e
bash -lc "false > \"$tmp\""
status=$?
set -e
printf 'exit=%s size=%s\n' "$status" "$(wc -c < "$tmp")"
rm -f "$tmp"

Repository: moq-dev/moq

Length of output: 475


Remove the > secrets/root.jwk redirect from line 16.

The moq-token-cli generate command writes directly to the --key path, making the redirect redundant and unsafe. The shell truncates the output file before cargo run executes, so a failed rerun leaves secrets/root.jwk empty and the previous key lost.

Suggested fix
-   cargo run --bin moq-token-cli -- --key secrets/root.jwk generate > secrets/root.jwk
+   cargo run --bin moq-token-cli -- --key secrets/root.jwk generate

Comment on lines +66 to 72
systemctl enable monitor.service && \
systemctl enable moq-relay.service && \
systemctl start certbot-renew.timer && \
systemctl start vacuum.timer && \
systemctl start moq-cert.service && \
systemctl restart memory-alert.service && \
systemctl restart monitor.service && \
systemctl restart moq-relay.service"

⚠️ Potential issue | 🟡 Minor

Suspend the monitor while moq-relay.service is being bounced.

monitor.sh starts probing immediately on startup (cdn/common/monitor.sh, Lines 118-120), and an older monitor instance may already be running during redeploys. As written, deploy-all can flap the health channel on intentional restarts.

Suggested fix
-		systemctl enable monitor.service && \
+		systemctl enable monitor.service && \
 		systemctl enable moq-relay.service && \
 		systemctl start certbot-renew.timer && \
 		systemctl start vacuum.timer && \
 		systemctl start moq-cert.service && \
-		systemctl restart monitor.service && \
-		systemctl restart moq-relay.service"
+		systemctl stop monitor.service && \
+		systemctl restart moq-relay.service && \
+		systemctl restart monitor.service"


tokio::select! {
Err(err) = cluster.clone().run() => return Err(err).context("cluster failed"),
Err(err) = web.run() => return Err(err).context("web server failed"),

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="$(fd 'web\.rs$' rs/moq-relay/src | head -n1)"
test -n "$file"

echo "Inspecting $file"
rg -n -C4 'struct\s+Web|impl\s+Web|pub\s+async\s+fn\s+run|async\s+fn\s+run|\bOk\(\(\)\)|anyhow::bail!' "$file"

Repository: moq-dev/moq

Length of output: 1280


Fix the web.run() branch to handle both success and error cases.

The Web::run() method (web.rs:80) returns anyhow::Result<()> with a successful Ok(()) path (line 126). The current pattern Err(err) = web.run() only matches errors, silently ignoring clean exits. This means the relay continues running without its health/metrics endpoint if the web server stops normally.

Capture the full result and treat a stopped web server as terminal, matching the behavior of serve():

Suggested fix
-		Err(err) = web.run() => return Err(err).context("web server failed"),
+		res = web.run() => {
+			res.context("web server failed")?;
+			anyhow::bail!("web server stopped");
+		},

- Add network-online ordering to monitor.service
- Add curl timeouts to webhook posts to avoid stalling the loop
- Sanitize ps output to avoid leaking process argv (e.g. JWTs)
- Fix --get -> --subscribe in health.sh and README.md docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cdn/common/monitor.service.tftpl`:
- Line 8: Replace the direct Environment=MONITOR_WEBHOOK=${webhook} setting in
the systemd unit template with an EnvironmentFile directive that points to a
root-owned credential file (e.g., /etc/default/moq-monitor) and ensure the
webhook is written to that file as MONITOR_WEBHOOK=<webhook_url> with mode 0600;
update any packaging/provisioning that renders cdn/common/monitor.service.tftpl
to create /etc/default/moq-monitor with the secret and correct
ownership/permissions instead of embedding ${webhook} in the unit file.

In `@cdn/common/monitor.sh`:
- Line 67: The MEMORY_BACKOFF doubling can grow unbounded and lengthen
health-check sleeps; modify monitor.sh to cap the backoff and reset it on a
healthy check: introduce constants like INITIAL_MEMORY_BACKOFF and
MAX_MEMORY_BACKOFF, replace the line that doubles MEMORY_BACKOFF
(MEMORY_BACKOFF=$(( MEMORY_BACKOFF * 2 ))) with a capped increment (e.g.,
MEMORY_BACKOFF=$(( MEMORY_BACKOFF * 2 )); if [ "$MEMORY_BACKOFF" -gt
"$MAX_MEMORY_BACKOFF" ]; then MEMORY_BACKOFF="$MAX_MEMORY_BACKOFF"; fi), ensure
the sleep uses the current MEMORY_BACKOFF, and reset
MEMORY_BACKOFF=INITIAL_MEMORY_BACKOFF when the memory condition clears so
health-check frequency returns to normal.
- Line 18: The unquoted array assignment to HEALTH_NODES using
MONITOR_HEALTH_NODES allows glob expansion; change the parsing to safely split
the quoted MONITOR_HEALTH_NODES into an array without word/glob expansion (for
example, set IFS to space and use read -r -a to populate HEALTH_NODES from the
quoted parameter expansion of ${MONITOR_HEALTH_NODES:-usc euc sea}), ensuring
the variable is quoted so wildcard characters are not expanded against the
filesystem.
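The two fixes above (capped backoff with reset, and glob-safe node parsing) might look like the following sketch. The constant and function names mirror the review wording and are assumptions, not code taken from the shipped monitor.sh.

```shell
#!/bin/bash
# Hypothetical sketch of the two monitor.sh fixes suggested above.

INITIAL_MEMORY_BACKOFF=60
MAX_MEMORY_BACKOFF=600
MEMORY_BACKOFF=$INITIAL_MEMORY_BACKOFF

# read -r -a splits the quoted expansion on whitespace without glob
# expansion, so a stray '*' in the value is never matched against files.
read -r -a HEALTH_NODES <<< "${MONITOR_HEALTH_NODES:-usc euc sea}"

memory_alert_fired() {
  # Double the backoff, but never beyond the cap, so health checks are
  # delayed by at most MAX_MEMORY_BACKOFF seconds.
  MEMORY_BACKOFF=$(( MEMORY_BACKOFF * 2 ))
  if [ "$MEMORY_BACKOFF" -gt "$MAX_MEMORY_BACKOFF" ]; then
    MEMORY_BACKOFF=$MAX_MEMORY_BACKOFF
  fi
}

memory_recovered() {
  # Reset so health-check frequency returns to normal once memory clears.
  MEMORY_BACKOFF=$INITIAL_MEMORY_BACKOFF
}
```

The loop body would then `sleep "$MEMORY_BACKOFF"` between alerts and call `memory_recovered` whenever the threshold check passes.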

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a6b472f5-e770-496b-b868-58a05236ead5

📥 Commits

Reviewing files that changed from the base of the PR and between 62a16e6 and 31a0a52.

📒 Files selected for processing (4)
  • cdn/README.md
  • cdn/common/monitor.service.tftpl
  • cdn/common/monitor.sh
  • cdn/health.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cdn/README.md


[Service]
Type=simple
Environment=MONITOR_WEBHOOK=${webhook}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== monitor unit secret exposure check =="
rg -n 'MONITOR_WEBHOOK|EnvironmentFile' cdn/common/monitor.service.tftpl cdn/common/gen/monitor.service 2>/dev/null || true

echo
echo "== generated unit permissions (if present) =="
fd 'monitor.service' cdn/common/gen --exec stat -c '%a %n' {} 2>/dev/null || true

Repository: moq-dev/moq

Length of output: 216


🏁 Script executed:

# Check repository structure and file handling
cd cdn/common && ls -la monitor* 2>/dev/null || echo "Files not found in cdn/common"

Repository: moq-dev/moq

Length of output: 185


🏁 Script executed:

# Look for generation scripts or deployment logic
rg -l "monitor.service" --type sh --type yaml --type json --type py

Repository: moq-dev/moq

Length of output: 37


🏁 Script executed:

# Check if there's any credential/secret handling elsewhere
rg "webhook|secret|credential" cdn/common/ -A 2 -B 2

Repository: moq-dev/moq

Length of output: 2420


🏁 Script executed:

# Check how the template variables are substituted
rg "MONITOR_WEBHOOK|webhook" --type tftpl -B 3 -A 3

Repository: moq-dev/moq

Length of output: 84


🏁 Script executed:

# Verify if EnvironmentFile is used elsewhere in systemd units
rg "EnvironmentFile" --type tftpl --type service

Repository: moq-dev/moq

Length of output: 84


🏁 Script executed:

# Search for how tftpl files are processed (likely Terraform or similar)
rg -i "terraform|tftpl" --type go --type py --type yaml --type json --type sh -B 2 -A 2

Repository: moq-dev/moq

Length of output: 37


🏁 Script executed:

# Check for provisioning/deployment scripts
find . -type f \( -name "*.tf" -o -name "Makefile" -o -name "*.mk" -o -name "*.yml" -o -name "*.yaml" \) | head -20

Repository: moq-dev/moq

Length of output: 549


🏁 Script executed:

# Look for infrastructure-as-code patterns
rg -l "provisioner|template_file|templatefile" --type tf

Repository: moq-dev/moq

Length of output: 100


🏁 Script executed:

# Check the full monitor.service.tftpl content
cat cdn/common/monitor.service.tftpl

Repository: moq-dev/moq

Length of output: 486


🏁 Script executed:

# Check how the monitor service unit is deployed in Terraform
rg -A 10 -B 5 "monitor.service.tftpl|monitor.sh" cdn/*.tf

Repository: moq-dev/moq

Length of output: 923


🏁 Script executed:

# Look for webhook variable definition and handling
rg "webhook" cdn/*.tf -B 3 -A 3

Repository: moq-dev/moq

Length of output: 676


🏁 Script executed:

# Check for file permissions settings in Terraform
rg "file_permissions|file_mode|chmod" cdn/*.tf -B 2 -A 2

Repository: moq-dev/moq

Length of output: 37


🏁 Script executed:

# Check how the generated monitor.service file is used/deployed
rg -A 10 "gen/monitor.service|monitor_service" cdn/*.tf

Repository: moq-dev/moq

Length of output: 769


🏁 Script executed:

# Look for any deployment or copy logic that might set permissions
rg "remote-exec|local-exec|provisioner|rsync|scp" cdn/*.tf -B 3 -A 5

Repository: moq-dev/moq

Length of output: 37


🏁 Script executed:

# Check bootstrap.tf for provisioning details
cat cdn/bootstrap.tf | head -100

Repository: moq-dev/moq

Length of output: 653


🏁 Script executed:

# Verify Terraform local_file default behavior and check if file_permission is set anywhere
rg "local_file" cdn/ -A 15 | grep -E "file_permission|file_mode|permissions|mode"

Repository: moq-dev/moq

Length of output: 37


🏁 Script executed:

# Check if there's any post-deployment permission hardening in bootstrap or elsewhere
rg "chmod|0600|0644|permissions" cdn/ -B 2 -A 2

Repository: moq-dev/moq

Length of output: 1277


🏁 Script executed:

# Verify the exact Terraform resource definition
sed -n '/resource "local_file" "monitor_service"/,/^}/p' cdn/main.tf

Repository: moq-dev/moq

Length of output: 285


Use EnvironmentFile for webhook secret instead of embedding in unit file.

Line 8 embeds the webhook URL directly in the systemd unit file with default Terraform permissions (0644), making it world-readable to local users. This is inconsistent with the project's secret handling elsewhere (e.g., /var/lib/moq/*.jwk, gcp.json stored with 0600).

Move MONITOR_WEBHOOK to a root-owned credential file with restricted permissions:

🔒 Suggested fix
 [Service]
 Type=simple
-Environment=MONITOR_WEBHOOK=${webhook}
+EnvironmentFile=-/etc/default/moq-monitor
 Environment=MONITOR_HEALTH_DOMAIN=${domain}
 Environment=PATH=/var/lib/moq/pkg/bin:/usr/bin:/bin
 ExecStart=/etc/systemd/system/monitor.sh

Provision /etc/default/moq-monitor with mode 0600 containing:
MONITOR_WEBHOOK=<webhook_url>
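Provisioning that credential file could be done with a small helper like the one below. The path /etc/default/moq-monitor and the MONITOR_WEBHOOK variable come from the suggested fix above; the function name is hypothetical and the real deploy pipeline may differ.

```shell
#!/bin/bash
# Hypothetical provisioning helper for the EnvironmentFile approach above.
set -euo pipefail

write_monitor_credentials() {
  local webhook_url="$1"
  local cred_file="${2:-/etc/default/moq-monitor}"
  # umask 077 in a subshell means the file is created 0600 from the
  # start, never momentarily world-readable before a chmod.
  ( umask 077; printf 'MONITOR_WEBHOOK=%s\n' "$webhook_url" > "$cred_file" )
}
```

Run as root during deploy, the resulting file is readable only by root, and the systemd `EnvironmentFile=-/etc/default/moq-monitor` directive picks it up at service start.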


- Cap MEMORY_BACKOFF at 600s so health checks aren't delayed 30+ min
- Reset backoff when memory recovers
- Use read -r -a instead of unquoted expansion for HEALTH_NODES

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
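The capped-backoff rule from the commit message above can be sketched as a single function (Rust for illustration; the deployed logic lives in the monitor.sh shell script, and the 600s cap is the value named in the commit):

```rust
/// Double the backoff after each consecutive failure, but never exceed `cap`
/// seconds, so health checks are not delayed indefinitely. On recovery the
/// caller resets to its base interval instead of calling this again.
fn next_backoff(secs: u64, cap: u64) -> u64 {
    secs.saturating_mul(2).min(cap)
}
```

With a 60s base and a 600s cap, the delay sequence is 60, 120, 240, 480, 600, 600, … rather than growing past 30 minutes.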

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cdn/common/monitor.sh`:
- Around line 96-99: The /fetch endpoint currently only accepts JWT via query
param; add Authorization header support in the server by updating the /fetch
route handler (the fetch handler in web.rs) to accept and validate a Bearer
token from the Authorization header when present (fallback to existing ?jwt
behavior for compatibility), then change the scripts (the curl call in
monitor.sh and the similar call in cdn/health.sh) to stop embedding ?jwt in the
URL and instead send the token with -H "Authorization: Bearer ${jwt}" (remove or
omit the ?jwt=... portion of the URL); ensure token parsing/validation in the
fetch handler handles both header and query formats and that unit/integration
tests cover both cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 769c89b2-5b07-4144-a8c1-e4760fdfd4da

📥 Commits

Reviewing files that changed from the base of the PR and between 31a0a52 and 4bf44e8.

📒 Files selected for processing (1)
  • cdn/common/monitor.sh

Comment on lines +96 to +99
local url="https://${node}.${HEALTH_DOMAIN}/fetch/demo/bbb/catalog.json?jwt=${jwt}"
local status ok=false

status=$(curl -sf -o /dev/null -w "%{http_code}" --max-time 10 "$url" 2>/dev/null) && ok=true

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Verify how health/fetch auth is implemented server-side before changing monitor.sh.
rg -nP --type=rs -C3 'catalog\.json|/fetch/|jwt\s*[:=]|query.*jwt|Authorization|Bearer|auth'



🏁 Script executed:

#!/bin/bash
# First, understand the repository structure
git ls-files | head -30



🏁 Script executed:

#!/bin/bash
# Check for health/fetch endpoint implementations
fd --type f -e go -e rs -e ts -e js | head -20



🏁 Script executed:

#!/bin/bash
# Look for any fetch or health endpoint definition
rg -l 'catalog\.json|/fetch/' | head -20



🏁 Script executed:

#!/bin/bash
# Search for /fetch endpoint implementation
rg -n 'fetch|/fetch|catalog\.json' rs/moq-relay/src/web.rs rs/hang/src/catalog/ rs/moq-mux/src/catalog.rs



🏁 Script executed:

#!/bin/bash
# Search for JWT and Authorization header handling
rg -n 'jwt|JWT|Authorization|Bearer|auth' rs/moq-relay/src/web.rs rs/hang/src/catalog/root.rs rs/moq-mux/src/catalog.rs



🏁 Script executed:

#!/bin/bash
# Check the health endpoint implementation
cat cdn/health.sh



🏁 Script executed:

#!/bin/bash
# Get the full context of serve_fetch and auth handling
sed -n '160,280p' rs/moq-relay/src/web.rs



🏁 Script executed:

#!/bin/bash
# Check for any Authorization header or bearer token handling
rg -n 'header|Header|Bearer|Authorization|extract' rs/moq-relay/src/web.rs | head -20



🏁 Script executed:

#!/bin/bash
# Look at AuthQuery struct definition
rg -B5 -A10 'struct AuthQuery' rs/moq-relay/src/web.rs



Add header-based authentication support to the server before switching this script.

The /fetch endpoint currently only accepts JWT via query parameter (not via Authorization header). While placing JWTs in URLs does pose security risks—credentials can leak via process args, proxy logs, and observability pipelines—this script must use query parameters because the server doesn't yet support header-based alternatives.

Recommend a two-step fix:

  1. Implement Authorization header support in rs/moq-relay/src/web.rs (the /fetch handler at line 253)
  2. Update this script and cdn/health.sh to send JWT in the Authorization: Bearer <jwt> header instead
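The header-with-query-fallback behavior step 1 asks for can be sketched as a pure function (a minimal illustration only; the real handler in rs/moq-relay/src/web.rs would wire this into its request parsing, and the actual AuthQuery shape may differ):

```rust
/// Prefer an `Authorization: Bearer <jwt>` header; fall back to the legacy
/// `?jwt=` query parameter so existing callers keep working.
fn extract_jwt<'a>(auth_header: Option<&'a str>, query_jwt: Option<&'a str>) -> Option<&'a str> {
    auth_header
        .and_then(|h| h.strip_prefix("Bearer "))
        .map(str::trim)
        .or(query_jwt)
}
```

Note the precedence: when both are present, the header wins, which lets the scripts migrate one at a time.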


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
rs/moq-relay/src/main.rs (1)

77-82: ⚠️ Potential issue | 🟠 Major

Handle Ok(()) returns from subsystems to avoid running in a degraded state.

The pattern Err(err) = future only matches errors. If web.run() returns Ok(()) (which it can per web.rs:126), the branch is silently disabled while the relay continues without its web server. The else branch only executes when all futures complete without matching, so one subsystem can stop while others continue.

Apply the fix from the previous review to all branches that can return Ok(()):

Suggested fix for web.run() branch
-		Err(err) = web.run() => return Err(err).context("web server failed"),
+		res = web.run() => {
+			res.context("web server failed")?;
+			anyhow::bail!("web server stopped");
+		},

Note: cluster.run() and serve() appear to always return errors (via bail!), so they're less affected, but applying the same pattern consistently would be defensive.
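The branch rewrite above reduces to one rule: a subsystem returning Ok(()) is treated as fatally as one returning Err. A dependency-free sketch of that mapping (the real code uses anyhow and tokio::select!; the names here are illustrative):

```rust
/// Convert a subsystem's exit into an error message either way: a relay whose
/// web server, cluster task, or QUIC server has stopped should fail loudly
/// rather than keep passing health checks in a degraded state.
fn unexpected_exit(name: &str, res: Result<(), String>) -> String {
    match res {
        Ok(()) => format!("{name} stopped unexpectedly"),
        Err(err) => format!("{name} failed: {err}"),
    }
}
```

Each select! branch would then return Err(unexpected_exit(…)) regardless of which variant the future produced, so no subsystem can stop silently.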

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rs/moq-relay/src/main.rs` around lines 77 - 82, The select! currently only
matches Err branches so a future that returns Ok(()) (e.g., web.run()) will
silently disable that subsystem; update each branch (web.run(),
cluster.clone().run(), and serve(server, cluster, auth)) to match the full
Result and treat Ok(()) as an unexpected shutdown by returning an Err with
context (e.g., map Ok(()) to an anyhow error like "web server exited
unexpectedly") so the relay fails rather than continuing in a degraded state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1d899a21-434c-4ee1-b919-9e85aa1a74a7

📥 Commits

Reviewing files that changed from the base of the PR and between 4bf44e8 and 27fab9b.

📒 Files selected for processing (1)
  • rs/moq-relay/src/main.rs

@kixelated kixelated merged commit 01d3d69 into main Mar 19, 2026
2 checks passed
@kixelated kixelated deleted the fix-europe branch March 19, 2026 16:46
@moq-bot moq-bot bot mentioned this pull request Mar 19, 2026