Further improve Raft functionality and strengthen monitoring #4

Merged
lushenle merged 1 commit into master from feat/distribution
Dec 3, 2025
@lushenle lushenle commented Dec 3, 2025

This pull request introduces major enhancements to the distributed cache system, focusing on improved Raft cluster management, dynamic peer membership, observability, and overall robustness. The changes include adding support for dynamic peer addition/removal via HTTP endpoints, implementing a more robust Raft leader election process with randomized timeouts and persistent term/vote tracking, and exposing new Prometheus metrics for better cluster monitoring. Configuration files and code have been updated to support these features, and logging has been improved for easier debugging.

Raft Cluster Management and Dynamic Membership:

  • Added HTTP endpoints (/cluster/join, /cluster/leave, /cluster/peers) to allow dynamic addition and removal of cluster peers at runtime, as well as querying the current peer list. (pkg/cmd/main.go)
  • Implemented AddPeer, RemovePeer, and Peers methods in the Raft node and transport layers to support dynamic peer management, updating metrics accordingly. (pkg/raft/node.go, pkg/raft/http_transport.go) [1] [2]
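
The membership endpoints above can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/raft/node.go: the `PeerSet` type, the `{"peer": "..."}` request body, and the `{"changed": ...}` response are assumptions made for the example. It also skips the metrics update and any replication of the membership change through the Raft log.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
	"sync"
)

// PeerSet is a hypothetical concurrency-safe set of peer addresses.
type PeerSet struct {
	mu    sync.RWMutex
	peers map[string]struct{}
}

func NewPeerSet(initial ...string) *PeerSet {
	ps := &PeerSet{peers: make(map[string]struct{})}
	for _, p := range initial {
		ps.peers[p] = struct{}{}
	}
	return ps
}

// AddPeer reports whether addr was newly added.
func (ps *PeerSet) AddPeer(addr string) bool {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	if _, ok := ps.peers[addr]; ok {
		return false
	}
	ps.peers[addr] = struct{}{}
	return true
}

// RemovePeer reports whether addr was present and has been removed.
func (ps *PeerSet) RemovePeer(addr string) bool {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	if _, ok := ps.peers[addr]; !ok {
		return false
	}
	delete(ps.peers, addr)
	return true
}

// Peers returns a sorted snapshot of the current peers.
func (ps *PeerSet) Peers() []string {
	ps.mu.RLock()
	defer ps.mu.RUnlock()
	out := make([]string, 0, len(ps.peers))
	for p := range ps.peers {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

// changePeer decodes {"peer": "..."} from a POST body and applies the change.
func changePeer(w http.ResponseWriter, r *http.Request, apply func(string) bool) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var req struct {
		Peer string `json:"peer"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Peer == "" {
		http.Error(w, "invalid peer", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]bool{"changed": apply(req.Peer)})
}

// NewClusterMux wires the three endpoints named in the PR description.
func NewClusterMux(ps *PeerSet) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/cluster/join", func(w http.ResponseWriter, r *http.Request) {
		changePeer(w, r, ps.AddPeer)
	})
	mux.HandleFunc("/cluster/leave", func(w http.ResponseWriter, r *http.Request) {
		changePeer(w, r, ps.RemovePeer)
	})
	mux.HandleFunc("/cluster/peers", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(ps.Peers())
	})
	return mux
}

func main() {
	mux := NewClusterMux(NewPeerSet("http://node1:9001"))

	// Join a second (made-up) peer, then list the membership.
	join := httptest.NewRequest(http.MethodPost, "/cluster/join",
		strings.NewReader(`{"peer":"http://node2:9001"}`))
	mux.ServeHTTP(httptest.NewRecorder(), join)

	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/cluster/peers", nil))
	fmt.Print(rec.Body.String())
}
```

Returning a boolean from `AddPeer` and `RemovePeer` keeps repeated join or leave calls idempotent, which is convenient when a deployment script retries a request.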

Raft Protocol Improvements:

  • Overhauled leader election with a dedicated election loop, randomized election timeouts, persistent term and votedFor tracking, and improved logic for handling vote requests and append-entries requests. This improves split-brain handling and cluster stability. (pkg/raft/node.go) [1] [2] [3] [4]
  • Improved logging for Raft protocol events (elections, heartbeats, append entries, and votes) to aid in debugging and observability. (pkg/raft/node.go, pkg/raft/http_transport.go) [1] [2] [3]
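
To illustrate why randomized timeouts and term/votedFor tracking matter, here is a small sketch under stated assumptions. The names `randomElectionTimeout`, `voteState`, and `baseTimeout` are hypothetical, the 150 ms base is an arbitrary choice, and the vote rule omits the log up-to-date check and the on-disk persistence that a full Raft node performs before replying.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// baseTimeout is an assumed lower bound for the election timeout.
const baseTimeout = 150 * time.Millisecond

// newRNG returns a seeded random source so runs can be reproduced.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// randomElectionTimeout returns a duration in [base, 2*base).
// base must be positive. Spreading timeouts makes it unlikely that
// several followers become candidates at the same moment.
func randomElectionTimeout(rng *rand.Rand, base time.Duration) time.Duration {
	return base + time.Duration(rng.Int63n(int64(base)))
}

// voteState holds the two fields Raft requires to survive restarts.
type voteState struct {
	currentTerm uint64
	votedFor    string // empty means no vote cast in currentTerm
}

// handleVote grants at most one vote per term.
func (s *voteState) handleVote(term uint64, candidate string) bool {
	if term < s.currentTerm {
		return false // stale candidate
	}
	if term > s.currentTerm {
		// A newer term resets the vote.
		s.currentTerm = term
		s.votedFor = ""
	}
	if s.votedFor == "" || s.votedFor == candidate {
		s.votedFor = candidate
		return true
	}
	return false
}

func main() {
	rng := newRNG(time.Now().UnixNano())
	for i := 0; i < 3; i++ {
		fmt.Println("timeout:", randomElectionTimeout(rng, baseTimeout))
	}

	s := &voteState{}
	fmt.Println(s.handleVote(1, "A")) // true: first vote in term 1
	fmt.Println(s.handleVote(1, "B")) // false: already voted for A in term 1
	fmt.Println(s.handleVote(2, "B")) // true: newer term resets the vote
}
```

If votedFor were kept only in memory, a node that restarted mid-term could vote twice in the same term, which is one way two leaders can be elected for one term.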

Observability and Metrics:

  • Introduced a new Prometheus metric (simple_cache_peers_total) to track the number of peers in the Raft cluster. (pkg/metrics/metrics.go) [1] [2] [3]
  • Made the Prometheus metrics server bind address configurable via metrics_addr in configuration files and an environment variable, and updated configuration defaults and examples. (pkg/config/config.go, config.example.yaml, configs/node1.yaml, configs/node2.yaml, configs/node3.yaml) [1] [2] [3] [4] [5] [6]
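
As a rough picture of what a scraper would see, the sketch below exposes simple_cache_peers_total as a gauge by writing the Prometheus text exposition format by hand, using only the standard library. It is not the code in pkg/metrics/metrics.go, which may well use a Prometheus client library. The HELP text, the `METRICS_ADDR` variable name, the `:2112` default, and the precedence of environment over file are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"sync/atomic"
)

// peersTotal backs the simple_cache_peers_total gauge.
var peersTotal int64

// SetPeers records the current number of peers.
func SetPeers(n int) { atomic.StoreInt64(&peersTotal, int64(n)) }

// renderMetrics formats the gauge in the Prometheus text exposition format.
func renderMetrics() string {
	return fmt.Sprintf(
		"# HELP simple_cache_peers_total Number of peers in the Raft cluster.\n"+
			"# TYPE simple_cache_peers_total gauge\n"+
			"simple_cache_peers_total %d\n",
		atomic.LoadInt64(&peersTotal))
}

// metricsHandler serves the exposition text, e.g. on /metrics.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics())
}

// resolveMetricsAddr picks the bind address. The order used here
// (environment, then metrics_addr from the file, then a default)
// is an assumption, as is the variable name METRICS_ADDR.
func resolveMetricsAddr(getenv func(string) string, fromFile string) string {
	if v := getenv("METRICS_ADDR"); v != "" {
		return v
	}
	if fromFile != "" {
		return fromFile
	}
	return ":2112"
}

func main() {
	SetPeers(3)
	fmt.Print(renderMetrics())

	addr := resolveMetricsAddr(os.Getenv, "")
	fmt.Println("metrics would be served on", addr)
	// To serve for real:
	//   http.HandleFunc("/metrics", metricsHandler)
	//   http.ListenAndServe(addr, nil)
}
```

Passing the environment lookup as a function keeps the address resolution testable without touching the process environment.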

Configuration and Logging:

  • Added support for loading configuration from a path specified by the CONFIG_PATH environment variable, improving deployment flexibility. (pkg/cmd/main.go)
  • Changed default log level to Info for less verbose output in production. (pkg/cmd/main.go)

Code Quality and Robustness:

  • Improved shutdown handling for the HTTP gateway server by using the correct context, ensuring graceful termination. (pkg/cmd/main.go)
  • Added missing imports and minor code hygiene improvements. (bench/benchmark_test.go, pkg/raft/http_transport.go, pkg/raft/storage.go, pkg/cmd/main.go, pkg/raft/node.go) [1] [2] [3] [4] [5]
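
The description does not say which context was wrong before the shutdown fix. One common pitfall, shown here purely as an assumption, is passing an already-cancelled context to `http.Server.Shutdown`, which leaves in-flight requests no time to finish. The sketch uses a fresh context with a grace period instead; `serveUntil` and `runDemo` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// serveUntil serves on ln until stop is cancelled, then shuts the
// server down, allowing up to grace for in-flight requests to finish.
func serveUntil(stop context.Context, srv *http.Server, ln net.Listener, grace time.Duration) error {
	errCh := make(chan error, 1)
	go func() { errCh <- srv.Serve(ln) }()

	select {
	case err := <-errCh:
		return err // the server failed before a stop was requested
	case <-stop.Done():
	}

	// stop is already cancelled here, so it must not be reused:
	// Shutdown would return as soon as it found an active connection.
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return err
	}
	// After a clean shutdown, Serve returns http.ErrServerClosed.
	if err := <-errCh; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// runDemo starts a server on an ephemeral local port and stops it shortly after.
func runDemo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	stop, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(50*time.Millisecond, cancel)

	srv := &http.Server{Handler: http.NotFoundHandler()}
	return serveUntil(stop, srv, ln, 2*time.Second)
}

func main() {
	fmt.Println("shutdown error:", runDemo())
}
```

In a real service the stop context would typically come from `signal.NotifyContext`, so that SIGINT or SIGTERM triggers the same path.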

@lushenle lushenle merged commit 7ec8ba3 into master Dec 3, 2025
2 checks passed
@lushenle lushenle deleted the feat/distribution branch December 3, 2025 04:32