[PRD] Real-Time Decentralized Network Monitoring Stack with Consul, Prometheus, and Grafana #292

Open
theMultitude opened this issue May 24, 2024 · 2 comments
Labels
Feature New feature

theMultitude commented May 24, 2024

Overview

Summary

This Product Requirement Document outlines a proposal for the setup and integration of Consul, Prometheus, and Grafana on AWS for real-time monitoring of the Masa Protocol using Docker for deployment and Terraform for managing AWS infrastructure.

Goal

The rapid creation of a resilient, extensible, real-time monitoring system.

Audience

Masa Protocol Team

CPG-Flow-2024-05-24-2300

Background and Context

Problem Statement

At Masa we’re looking to build an event-driven data architecture as a means to gather data from our nodes. This approach provides resilience, flexibility, and scalability. However, it comes with some challenges in the short term:

  1. Events are granular and need to be processed after ingestion into coherent datasets.
  2. Datasets need to be visualized and made available to relevant parties.
  3. This process needs low enough latency to enable quick responses to novel issues.
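
To make the first challenge concrete, here is a minimal sketch of rolling granular node events up into a per-window dataset. The `NodeEvent` shape, the field names, and the 60-second window are assumptions made for illustration; none of them come from the Masa codebase.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeEvent:
    """One granular event emitted by a node (hypothetical shape)."""
    node_id: str
    kind: str         # e.g. "request" or "error"
    timestamp: float  # seconds since the epoch


def summarize(events, window_seconds=60):
    """Count events per (window start, node, kind)."""
    counts = defaultdict(int)
    for ev in events:
        # Align each timestamp to the start of its fixed-size window.
        window_start = int(ev.timestamp // window_seconds) * window_seconds
        counts[(window_start, ev.node_id, ev.kind)] += 1
    return dict(counts)


events = [
    NodeEvent("node-a", "request", 5.0),
    NodeEvent("node-a", "request", 42.0),
    NodeEvent("node-a", "error", 61.0),
    NodeEvent("node-b", "request", 59.9),
]
summary = summarize(events)
```

Prometheus counters scraped on an interval give a similar roll-up without a separate post-processing step, which is one reason the proposed stack can serve as a stopgap while the general event system matures.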

In essence, the proposed stack allows Masa to get access to critical protocol information while our more general event system is still maturing.

In-Scope

Features and Functionality

  • Oracle #SDK with POC metric functionality
    • Integration of Prometheus for metrics collection and alerting.
    • Integration of Consul for node discovery.
  • Real-time monitoring dashboard using Grafana.
  • Docker containers for deploying Consul, Prometheus, and Grafana.
  • Infrastructure setup using Terraform.
  • Secure communication between Prometheus and Protocol nodes using TLS certificates.
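
As a rough illustration of the POC metric functionality, the sketch below renders a counter in the Prometheus text exposition format, which is what a node's metrics endpoint would serve to a scraper. The metric name `masa_requests_total` and the `node` label are hypothetical, and a real SDK would more likely use an official Prometheus client library than hand-rolled formatting.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family.

    `samples` maps a tuple of (label, value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and label names, for illustration only.
text = render_counter(
    "masa_requests_total",
    "Requests handled by the node.",
    {(("node", "node-a"),): 3, (("node", "node-b"),): 1},
)
```

The sketch does not escape the help text and covers counters only; gauges and histograms follow the same line-oriented layout with a different `# TYPE` line.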

Deliverables

  • A fully functional monitoring stack deployed on AWS.
  • Terraform scripts for automating infrastructure setup.
  • Docker configurations for Consul, Prometheus, and Grafana.
  • Documentation for setup, configuration, and usage of the monitoring stack.

Out-of-Scope

Excluded Features

  • Any advanced analytics on the collected metrics.
  • Integration with third-party monitoring tools not mentioned in this document.
  • Support for any Masa services outside of the Masa Protocol.

Testing and Validation

Testing Strategy

  • Perform unit tests for individual components.
  • Conduct integration tests to ensure proper communication between Consul, Prometheus, and Grafana.
  • Execute end-to-end tests to validate the entire monitoring stack.

Validation Criteria

  • All tests pass without errors.
  • Metrics are accurately collected and displayed in Grafana dashboards.
  • Secure communication is verified with mTLS (optional, if the team doesn't think it's necessary).

User Stories

Protocol Monitoring

Title: Utilize Prometheus for Node Monitoring
As an: Oracle Developer
I want: to integrate Prometheus to collect and store metrics from all services
So that: I can monitor system performance, identify issues in real-time, and ensure system reliability
Acceptance Criteria:

  • Prometheus is deployed and configured to scrape metrics from all services.
  • Metrics are accessible via a centralized dashboard.
  • Alerts are configured for key performance indicators (TBD)
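
One way the scraping criterion could be met is a Prometheus scrape job that discovers targets through Consul and scrapes them over TLS. The sketch below builds such a job as a plain dictionary and serializes it as JSON. The job name, Consul address, service name, and certificate paths are all placeholders. Prometheus reads its configuration as YAML, and JSON is for practical purposes a subset of YAML, so the output should load as a config file, though that has not been checked against a running Prometheus here.

```python
import json


def build_scrape_config(consul_addr, service_names, ca_file, cert_file, key_file):
    """Build a Prometheus config fragment with one Consul-discovered scrape job."""
    return {
        "scrape_configs": [
            {
                "job_name": "masa-nodes",  # hypothetical job name
                "scheme": "https",
                # Client certificate and CA used when scraping each target.
                "tls_config": {
                    "ca_file": ca_file,
                    "cert_file": cert_file,
                    "key_file": key_file,
                },
                # Ask Consul which instances of these services exist.
                "consul_sd_configs": [
                    {"server": consul_addr, "services": list(service_names)}
                ],
                # Copy the Consul node name onto every scraped series.
                "relabel_configs": [
                    {"source_labels": ["__meta_consul_node"], "target_label": "node"}
                ],
            }
        ]
    }


# All argument values below are placeholders.
cfg = build_scrape_config(
    "consul:8500",
    ["masa-node"],
    "/etc/prometheus/ca.pem",
    "/etc/prometheus/client.pem",
    "/etc/prometheus/client-key.pem",
)
config_text = json.dumps(cfg, indent=2)
```

Alert rules for the key performance indicators would live in separate rule files once those indicators are decided.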

Node Discovery

Title: Implement Consul for Node Discovery
As an: Oracle Developer
I want: to use Consul for dynamic node discovery and health checks
So that: services can automatically be discovered and relayed to Prometheus
Acceptance Criteria:

  • Consul is deployed and configured in the production environment.
  • Services register with Consul upon startup through the Oracle Analytics SDK.
  • Health checks are configured and working, with failing services automatically deregistered.
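
The registration and automatic deregistration criteria map fairly directly onto the body of a Consul agent service registration request (`PUT /v1/agent/service/register`). Below is a minimal sketch of that payload. The service ID, name, address, port, and `/metrics` path are placeholders, and the sketch only builds the JSON body; it does not send it or deal with certificate verification for the HTTPS check.

```python
import json


def build_registration(service_id, name, address, port, metrics_path="/metrics"):
    """Build the JSON body for registering a service with a local Consul agent."""
    return {
        "ID": service_id,
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            # Consul polls this URL; a non-2xx response marks the check as failing.
            "HTTP": f"https://{address}:{port}{metrics_path}",
            "Interval": "10s",
            "Timeout": "2s",
            # Remove the service once it has been critical for this long.
            "DeregisterCriticalServiceAfter": "1m",
        },
    }


# Placeholder values for illustration.
payload = build_registration("masa-node-1", "masa-node", "10.0.0.5", 8080)
body = json.dumps(payload)
```

Setting `DeregisterCriticalServiceAfter` is what makes failing services drop out of discovery on their own, so Prometheus stops scraping them without manual cleanup.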

Separation of Concerns

Title: Consolidated/Abstracted Node Analytics
As a: Data Lead
I want: to have analytics separated from general oracle function
So that: modification of oracle functionality does not break information services
Acceptance Criteria:

  • Consul and Prometheus are initialized and maintained within the Analytics SDK
  • Oracles reference the SDK for Analytics data delivery

Further Notes

  • Future improvements include:
    • Integration and refactoring of Prometheus data for more complex analysis, using either Thanos with S3 or an external time-series data store to persist Prometheus data.
    • Extended health monitoring of individual nodes.
    • Increased flexibility of node configuration using Ansible (or an alternative).

theMultitude added the Feature New feature label May 24, 2024

theMultitude commented May 24, 2024

@Luka-Loncar @j2d3 @restevens402 @teslashibe @jdutchak @nolanjacobson A first pass at a PRD. Still to be done with the team is final confirmation of design (including scope) and ticketing breakdown.

Comments welcome.

@theMultitude

A Loom for some perspective on how this architecture would work in its end state.

And an example of Consul's key-value store structure:
[image]

Nodes are discrete machines that can have multiple services. Both nodes and services can have health checks if we desire.
