Harden idle-shutdown bootstrap: blocklist activity detection, install/uninstall scripts, systemd hardening#20
Conversation
… scripts, systemd hardening
- idle-check.py: replace sender_id regex with blocklist (automated prefixes). Every role=user message counts unless it starts with a known noise prefix. Fixes silent exclusion of TUI/web/Slack/Discord activity. IDLE_EXTRA_AUTOMATED_PREFIXES env hook for customization. Also peels leading 'Conversation info' metadata wrappers before prefix-matching.
- idle-check.sh: threshold 0.5h -> 1.0h (30 min was too aggressive for agents mid-response). Log + lock move to $HOME/.openclaw/logs/ so systemd PrivateTmp=yes doesn't hide them.
- systemd/idle-check.service: User=${INSTALL_USER} placeholder (was hardcoded ec2-user). Adds NoNewPrivileges=yes, ProtectSystem=strict, PrivateTmp=yes, narrow ReadWritePaths. install-idle.sh substitutes the placeholder.
- README.md: new mermaid flowchart showing BOTH token paths (idle-check and notify-lambda). New 'Supported surfaces' section. New 'Where's the log' callout. Install/uninstall docs point at the new scripts.
- install-idle.sh: idempotent installer. Handles SSM params, both IAM roles with inline policies, both Lambdas (sleep 10 after role create for IAM propagation), API GW with throttling, correct-ordered lambda:InvokeFunction grants for apigateway + events, EventBridge rule + target + permission, systemd install with ${INSTALL_USER} substitution, daemon-reload. Does NOT enable the timer. --reuse-openclaw-bot-token reads ~/.openclaw/openclaw.json .channels.telegram.botToken (chat-id still required explicitly).
- uninstall-idle.sh: mirrors install-idle.sh. Safe to run multiple times. --keep-ssm option.
- tests/test-idle-check.sh: 8 synthetic JSONL fixtures covering Telegram, TUI, heartbeat, System:, Pre-compaction, mixed-wins, env-configured prefix, and Conversation-info wrapper. Discovered automatically by .github/workflows/pack-tests.yml.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b39261e44
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| put_ssm_string "$SSM_CHAT_ID_PARAM" "$CHAT_ID" | ||
| ok "Chat id → $SSM_CHAT_ID_PARAM" | ||
|
|
||
| # Wake URL written after API GW created (step 5) |
There was a problem hiding this comment.
Persist instance ID in SSM during install
This installer path writes bot token and chat ID, but never writes /openclaw/wake-config/instance-id. wake-lambda/index.mjs requires that parameter before calling StartInstances, so a fresh install from this script will return the wake Lambda "Config Error" path and fail to start the instance until someone manually creates the missing SSM key.
Useful? React with 👍 / 👎.
| {"Effect":"Allow","Action":["ssm:GetParameter"],"Resource":[ | ||
| "arn:aws:ssm:${REGION}:${ACCT_ID}:parameter/openclaw/wake-config/*", | ||
| "arn:aws:ssm:${REGION}:${ACCT_ID}:parameter/openclaw/wake-token" | ||
| ]} |
There was a problem hiding this comment.
Grant notify Lambda permission to rotate wake token
The notify role policy only allows ssm:GetParameter, but notify-lambda/handler.py writes both /openclaw/wake-token and the dedup marker via ssm.put_parameter(...) on stopped events. With this policy, those writes raise AccessDenied, so stop-event token rotation/notification flow breaks in production.
Useful? React with 👍 / 👎.
| fi | ||
| } | ||
|
|
||
| NOTIFY_ENV='{"Variables":{"INSTANCE_ID":"'"$INSTANCE_ID"'","TELEGRAM_CHAT_ID":"'"$CHAT_ID"'"}}' |
There was a problem hiding this comment.
Pass wake URL to notify Lambda environment
The notify Lambda environment omits WAKE_URL, but notify-lambda/handler.py builds stop-event links from os.environ.get('WAKE_URL', ''). Without that variable, the Telegram message link becomes just ?token=..., which is not tappable as a valid wake URL after stops that rely on notify-lambda.
Useful? React with 👍 / 👎.
…KE_URL env Three bugs caught by the Codex review on PR #20, all confirmed against the Lambda source: 1. install-idle.sh wrote every SSM param except /openclaw/wake-config/instance-id, but wake-lambda/index.mjs requires it at startup. A fresh install would return the 'Config Error' path and never start the EC2 instance. Now written alongside the other wake-config params. 2. notify-lambda/handler.py rotates /openclaw/wake-token and writes /openclaw/wake-config/last-stop-event-id as a dedup marker on stopped events, but the role only had ssm:GetParameter. Added ssm:PutParameter scoped to just those two keys. 3. notify-lambda/handler.py builds stop-event Telegram links from os.environ.get('WAKE_URL'), but the installer only set INSTANCE_ID + TELEGRAM_CHAT_ID in the function env. Links after stop events were '?token=...' with no host. Env is now updated after API GW creation with the real wake URL. uninstall-idle.sh also cleans up the two newly-tracked SSM keys.
|
Addressed all three findings from the Codex review in 6293bfe:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6293bfe823
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| {"Effect":"Allow","Action":["ec2:StartInstances","ec2:DescribeInstanceStatus"], | ||
| "Resource":"arn:aws:ec2:${REGION}:${ACCT_ID}:instance/${INSTANCE_ID}"}, |
There was a problem hiding this comment.
Allow DescribeInstanceStatus on wildcard resource
wake-lambda calls DescribeInstanceStatus before starting the instance, but this policy scopes that action to a specific instance ARN. EC2 Describe* actions require "Resource": "*", so the call is denied and the wake flow returns the "Check Failed" path instead of starting the instance. Split ec2:DescribeInstanceStatus into its own statement with wildcard resource.
Useful? React with 👍 / 👎.
| WORKSPACE="/home/${INSTALL_USER}/.openclaw/workspace" | ||
| LOG_DIR="/home/${INSTALL_USER}/.openclaw/logs" |
There was a problem hiding this comment.
Resolve home dir dynamically for install user
This installer hardcodes /home/${INSTALL_USER}, but the runtime script uses $HOME/.openclaw and the unit enables ProtectSystem=strict with a matching ReadWritePaths. When the account home is not /home/<user> (for example root on /root), timer runs cannot write log/state paths under $HOME, causing the service to fail. Resolve the user’s actual home directory (e.g., from passwd) and substitute that path into install and unit paths.
Useful? React with 👍 / 👎.
|
@codex do a review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6293bfe823
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Substitute ${INSTALL_USER} in the unit file | ||
| TMP_UNIT="$(mktemp)" | ||
| sed "s|\${INSTALL_USER}|${INSTALL_USER}|g" "$SCRIPT_DIR/systemd/idle-check.service" > "$TMP_UNIT" | ||
| run "install -o root -g root -m 644 \"$TMP_UNIT\" /etc/systemd/system/idle-check.service" |
There was a problem hiding this comment.
Fail fast when installer lacks root privileges
The installer reaches privileged local operations like writing /etc/systemd/system/idle-check.service only after it has already provisioned SSM/IAM/Lambda/API resources, but there is no upfront root/sudo check. If a user runs the documented ./install-idle.sh ... command as a non-root account, the script can mutate AWS state and then abort in step 6 with permission errors, leaving a partially-installed deployment that must be cleaned up manually.
Useful? React with 👍 / 👎.
| WORKSPACE="/home/${INSTALL_USER}/.openclaw/workspace" | ||
| LOG_DIR="/home/${INSTALL_USER}/.openclaw/logs" |
There was a problem hiding this comment.
Resolve install paths from the user's actual home
WORKSPACE/LOG_DIR are derived from a hardcoded /home/${INSTALL_USER} prefix, which is incorrect for valid users whose home is elsewhere (for example, root uses /root). In that case the generated service paths and ReadWritePaths point to /home/... while idle-check.sh writes under $HOME, causing runtime failures under ProtectSystem=strict. Compute the home directory from account metadata (e.g., passwd entry) and build all related paths from that single source of truth.
Useful? React with 👍 / 👎.
Two findings from Codex on commit 6293bfe: P1 — Fail fast on missing root privileges. install-idle.sh writes to /etc/systemd/system and runs systemctl daemon-reload in step 7. Previously if a non-root user ran the documented command, the script would first provision SSM/IAM/Lambda/API GW and only then abort with 'Permission denied', leaving a half-installed deployment. Added an upfront $EUID check before any AWS mutations, with a clear 'try: sudo ./install-idle.sh' error message. P2 — Resolve install user's real home via getent passwd, not hardcoded /home/$USER. Fixes valid install users whose home lives elsewhere (root in /root, LDAP/AD users with custom homes, system users in /var/lib/*). All downstream paths now use a single source of truth ($INSTALL_HOME): WORKSPACE, LOG_DIR, OPENCLAW_JSON lookup, and the systemd unit's ExecStart + ReadWritePaths. Systemd unit template now carries both ${INSTALL_USER} and ${INSTALL_HOME} placeholders; installer substitutes both via a two-expression sed.
|
Addressed both findings from review 4158055898 in 8ae869d:
Tests: 8/8 still green; |
Hardens the
bootstraps/optional/idle-shutdown/bootstrap based on findings from a production install.What's in this PR, mapped to the ask
1. Activity detection: regex → blocklist
File:
bootstraps/optional/idle-shutdown/idle-check.pyReplaces the
"sender_id":"digits"regex with a blocklist approach. Everyrole=usermessage is counted as real activity unless its text begins with a known automated prefix. This fixes silent exclusion of every non-Telegram surface (TUI, web, Slack, Discord, Nostr, etc.).Read HEARTBEAT.md,System:,Pre-compaction memory flush,[Heartbeat],[System]IDLE_EXTRA_AUTOMATED_PREFIXES="Prefix A,Prefix B"for customizationConversation info (untrusted metadata)/Sender (untrusted metadata)wrapper blocks before matching, so real user text underneath them still counts2. systemd unit: placeholder user + hardening
File:
bootstraps/optional/idle-shutdown/systemd/idle-check.serviceUser=ec2-user→User=${INSTALL_USER}(installer substitutes)ExecStartpath uses${INSTALL_USER}tooNoNewPrivileges=yes,ProtectSystem=strict,PrivateTmp=yes,ReadWritePaths=/home/${INSTALL_USER}/.openclawBecause
PrivateTmp=yesnamespaces/tmpper-invocation, the idle-check log and lockfile move out of/tmp; see #4 below.3. README: dual-path flowchart + Supported surfaces
File:
bootstraps/optional/idle-shutdown/README.mdidle-check.shpre-shutdown, path B:notify-lambdaon any stopped event). Text below the diagram states clearly that any stop → fresh wake token via EventBridge, whether planned or not./tmp/idle-check.log→$HOME/.openclaw/logs/idle-check.logand why (PrivateTmp).4. New installer
File:
bootstraps/optional/idle-shutdown/install-idle.shNamed
install-idle.sh(notinstall.sh) so it can't be confused with the repo's top-levelinstall.sh. Idempotent — re-running updates existing resources rather than duplicating them.Order matters and is enforced: SSM params → IAM roles → inline policies →
sleep 10for IAM propagation → Lambdas (create or update) → API GW + throttling →lambda:InvokeFunctionforapigateway.amazonaws.com→ EventBridge rule + target →lambda:InvokeFunctionforevents.amazonaws.com→ on-instance scripts + systemd unit with${INSTALL_USER}substitution →daemon-reload.The timer is NOT enabled. Final output prints the exact
systemctl enable --now idle-check.timercommand.5. New uninstaller
File:
bootstraps/optional/idle-shutdown/uninstall-idle.shMirror of the installer. Safe to run multiple times (existence checks +
|| trueon deletes).--keep-ssmpreserves/openclaw/wake-config/*and/openclaw/wake-tokenif reinstalling soon.6.
--reuse-openclaw-bot-tokenflagFile:
bootstraps/optional/idle-shutdown/install-idle.shReads
.channels.telegram.botTokenfrom~/.openclaw/openclaw.jsonviajq -rand writes it to SSM as SecureString at/openclaw/wake-config/telegram-bot-token.Per discussion: chat id is never auto-derived — the
--chat-idflag is always required. Copying config across boxes should never silently retarget Telegram messages.7. Defaults bumped
File:
bootstraps/optional/idle-shutdown/idle-check.shIDLE_THRESHOLD_HOURS:0.5→1.0. 30 minutes is too aggressive for an assistant mid-response. No other thresholds changed.8. Tests
File:
tests/test-idle-check.shEight synthetic JSONL fixtures covering:
sender_id→ countsRead HEARTBEAT.mdprefix → ignoredSystem:prefix → ignoredPre-compaction memory flushprefix → ignoredlatest_tsIDLE_EXTRA_AUTOMATED_PREFIXESenv respectedConversation info (untrusted metadata)block still countsAll 8 pass locally.
.github/workflows/pack-tests.ymlauto-discoverstests/test-*.sh, so CI will run this without changes.Runtimes / arch
Unchanged: Python 3.13, Node.js 22, arm64.
One genuine disagreement I flagged before shipping
The spec's
ReadWritePaths=/tmp /home/${INSTALL_USER}/.openclawcombined withPrivateTmp=yeswould have hidden the log file (/tmp/idle-check.log) from ordinary user shells, becausePrivateTmpnamespaces/tmpper service invocation —ReadWritePaths=/tmpdoesn't un-namespace it.Resolution (approved): move the log + flock to
$HOME/.openclaw/logs/. All four hardening flags stay,/tmpis no longer inReadWritePaths. Documented in the README's "Where's the log?" section.