Skip to content

Valkey Sentinel: Fixing Auth/ACL gaps in TLS environments and security hardening (Enhancements for PR #137)#1

Merged
khtee merged 5 commits into
khtee:mainfrom
jose-10000:fix/sentinel-acl-hardening-and-schema
Apr 9, 2026
Merged

Valkey Sentinel: Fixing Auth/ACL gaps in TLS environments and security hardening (Enhancements for PR #137)#1
khtee merged 5 commits into
khtee:mainfrom
jose-10000:fix/sentinel-acl-hardening-and-schema

Conversation

@jose-10000
Copy link
Copy Markdown

Hi @khtee,

Your implementation of Sentinel in PR valkey-io#137 is a great addition to the chart. However, during testing in secured environments, I found that the stack encounters critical issues when the default user is disabled (a common security requirement).

Without these fixes, Sentinels are unable to discover each other or the master in Auth+TLS scenarios (the "empty array" issue). This PR addresses those gaps and introduces several security hardening measures and schema fixes to make the implementation production-ready.


Key Improvements & Bug Fixes

1. Security & ACL Logic (Auth + TLS Interop)
  • Native Inter-Sentinel Auth: Added support for sentinel-user and sentinel-pass. This is the missing link that allows Sentinels to form a quorum and gossip when TLS and ACLs are both active.
  • Role Segregation (Least Privilege): Introduced monitorUser for Sentinel-to-Valkey communication. This allows separating the Replication User (sync only) from the Monitor User (Pub/Sub and health), following the principle of least privilege.
  • Valkey 9 Compatibility: Added conditional logic to omit auth-user and sentinel-user when the user is default, preventing fatal config errors.
  • Defense in Depth: Verified support for disabling the default user (off resetchannels -@all) without breaking cluster discovery.
2. Credential Protection & Logging
  • Leak Prevention: Removed the cat "$SENTINEL_CONF" command from the init script to prevent passwords from being exposed in Kubernetes logs.
  • Log Redirection: Modified the log() function to output to stderr (>&2). This prevents log messages from polluting command substitutions or leaking sensitive data into configuration files.
  • SHA256 Hashing: Inbound Sentinel ACLs are now securely hashed with SHA256 before being written to the config, ensuring no plaintext passwords reside in the generated files.
3. Helm Schema & Configuration Fixes
  • Schema Validation: Fixed a missing comma in values.schema.json and added the sentinelAclUsers definition to pass validation.
  • HAProxy Tag Fix: Converted the HAProxy image tag from a number (2.9) to a string ("2.9") in values.yaml, fixing YAML parsing errors and ensuring helm lint passes.

Testing Status

  • Verified: Full Sentinel discovery and failover in a cluster with ACL enabled and TLS active.
  • Verified: No sensitive credentials leaked in kubectl logs.
  • Work in Progress: Currently reviewing the HAProxy watcher configuration with credentials. I will be performing further validation in the coming days to ensure the sidecar can properly authenticate and update backends in this secured setup.

jose added 4 commits April 2, 2026 16:45
fix(chart): resolve schema validation and template errors
- Updating HAProxy watcher for near-instant IP-based failover.
- Refactoring init scripts to support dynamic topology and universal auth/TLS injection.
- Adding smart L7 health checks in HAProxy to handle ACL-protected nodes.
- Fully parameterizing Service and ConfigMap ports for end-to-end flexibility.
@jose-10000
Copy link
Copy Markdown
Author

Hi @khtee,

Following my initial comments on this PR, I have pushed an update to stabilize the Sentinel High Availability architecture. After testing in environments (with TLS and ACLs enabled), I’ve identified and fixed several edge cases where the cluster would fail to recover during failovers.

Here is a summary of the key improvements included in this update:

1. Sentinel-Aware Topology Management (init.sh)

  • Dynamic Role Assignment: Removed hardcoded replicaof during initial boot. Nodes now start in a "clean" state, allowing Sentinel to manage the topology without conflicting with static configuration.
  • State Persistence: The script now captures and restores existing replication states (SAVED_REPLICA_LINE) from persistent volumes. This ensures that Sentinel's runtime decisions survive pod restarts.
  • Universal Credential Injection: masterauth, masteruser, and tls-replication are now injected into all nodes (including the initial master). This is critical because any master can be demoted to a replica by Sentinel; without these credentials pre-configured, demoted nodes were failing to reconnect to the new primary.

2. Enhanced HAProxy Sentinel Watcher

  • Hostname-Based Tracking: Updated the watcher logic to track the Master by hostname instead of Sentinel index. This ensures accurate failover detection even if the Sentinel quorum is partially unstable.
  • DNS Bypass via Runtime API: The watcher now resolves hostnames to IPs and injects them directly into HAProxy's memory using socat. This prevents HAProxy from entering a "DNS NX maintenance" state, which is a common point of failure in Kubernetes when Pod IPs change.

3. Robust L7 Health Checks

  • Auth-Agnostic Validation: Rewrote the tcp-check using a smart regex: ^(\+PONG|-[Nn][Oo][Aa][Uu][Tt][Hh]|-[Ee][Rr][Rr]). This allows the read pool to stay healthy even when nodes require authentication, ensuring traffic isn't dropped due to -NOAUTH responses during the health check phase.
  • Fail-Fast Timeouts: Added specific backend timeouts to ensure health checks are more aggressive than client traffic, speeding up failure detection.

4. Full Parameterization

  • Dynamic Port Mapping: Fully parameterized both write and read ports across the Service, HAProxy ConfigMap, and Values. This removes the hardcoded 6379 targetPort constraint.

@khtee khtee merged commit d8f854d into khtee:main Apr 9, 2026
@jose-10000
Copy link
Copy Markdown
Author

Hey! I've updated this branch with the correct DCO sign-offs and fixed the author email. Since you already merged the previous version, you need to force-push my new changes into your branch to clear the DCO error in the main PR.

You can do this by running:
git fetch origin
git reset --hard origin/fix/sentinel-acl-hardening-and-schema
git push origin [su-rama] --force

Have a nice day
Jose

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants