[PRD] Real-Time Decentralized Network Monitoring Stack with Consul, Prometheus, and Grafana #292

theMultitude · 2024-05-24T20:43:19Z

Overview

Summary

This Product Requirements Document proposes setting up and integrating Consul, Prometheus, and Grafana on AWS for real-time monitoring of the Masa Protocol, using Docker for deployment and Terraform to manage the AWS infrastructure.

Goal

The rapid creation of a resilient, extensible, real-time monitoring system.

Audience

Masa Protocol Team

Background and Context

Problem Statement

At Masa we’re looking to build an event-driven data architecture as a means to gather data from our nodes. This approach provides resilience, flexibility, and scalability. However, it comes with some challenges in the short term:

Events are granular and need to be processed after ingestion into coherent datasets.
Datasets need to be visualized and made available to relevant parties.
The whole process needs low enough latency to enable quick responses to novel issues.

In essence, the proposed stack allows Masa to get access to critical protocol information while our more general event system is still maturing.

In-Scope

Features and Functionality

Oracle #SDK with POC metric functionality
- Integration of Prometheus for metrics collection and alerting.
- Integration of Consul for node discovery.
Real-time monitoring dashboard using Grafana.
Docker containers for deploying Consul, Prometheus, and Grafana.
Infrastructure setup using Terraform.
Secure communication between Prometheus and Protocol nodes using TLS certificates.

Deliverables

A fully functional monitoring stack deployed on AWS.
Terraform scripts for automating infrastructure setup.
Docker configurations for Consul, Prometheus, and Grafana.
Documentation for setup, configuration, and usage of the monitoring stack.

Out-of-Scope

Excluded Features

Any advanced analytics on the collected metrics.
Integration with third-party monitoring tools not mentioned in this document.
Support for any Masa services outside of the Masa Protocol.

Testing and Validation

Testing Strategy

Perform unit tests for individual components.
Conduct integration tests to ensure proper communication between Consul, Prometheus, and Grafana.
Execute end-to-end tests to validate the entire monitoring stack.

Validation Criteria

All tests pass without errors.
Metrics are accurately collected and displayed in Grafana dashboards.
Secure communication is verified with mTLS (Optional if y'all don't think it's necessary)
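
For the mTLS criterion above, a minimal sketch of what "mutual" means on the node side, assuming a node serves its metrics endpoint from a Python process. The function name and the file-path parameters are hypothetical and not part of the proposal; only the standard-library `ssl` module is used. The key point is that the server context requires a client certificate, so Prometheus would have to present one signed by a CA the node trusts.

```python
import ssl
from typing import Optional


def build_node_tls_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    client_ca_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that rejects clients without a valid certificate.

    All three paths are hypothetical placeholders; when omitted, nothing is loaded,
    which is only useful for inspecting the resulting settings.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED on a server context is what makes the handshake mutual:
    # the scraper must present a certificate, not just verify the node's.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The node's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # The CA that signed the scraper's client certificate.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx


ctx = build_node_tls_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)
```

If the team decides plain TLS is sufficient, the only change in this sketch would be leaving `verify_mode` at its default for server contexts, `ssl.CERT_NONE`.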

User Stories

Protocol Monitoring

Title: Utilize Prometheus for Node Monitoring
As an: Oracle Developer
I want: to integrate Prometheus to collect and store metrics from all services
So that: I can monitor system performance, identify issues in real-time, and ensure system reliability
Acceptance Criteria:

Prometheus is deployed and configured to scrape metrics from all services.
Metrics are accessible via a centralized dashboard.
Alerts are configured for key performance indicators (TBD)
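
To make "scrape metrics" concrete: Prometheus pulls plain text from an HTTP endpoint on each service, conventionally `/metrics`. The sketch below renders a single gauge in that text exposition format using only the standard library. The metric name `masa_node_peers` and the `node_id` label are made up for illustration; the real oracle would more likely use an official Prometheus client library than hand-rolled formatting.

```python
from typing import Dict, Optional


def render_gauge(
    name: str,
    help_text: str,
    value: float,
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""

    def escape(label_value: str) -> str:
        # Label values must escape backslash, double quote, and newline.
        return (
            label_value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    label_str = ""
    if labels:
        pairs = ",".join(
            f'{key}="{escape(val)}"' for key, val in sorted(labels.items())
        )
        label_str = "{" + pairs + "}"

    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


# Hypothetical metric and label, for illustration only.
print(render_gauge("masa_node_peers", "Connected peers.", 12, {"node_id": "node-a"}))
```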

Node Discovery

Title: Implement Consul for Node Discovery
As an: Oracle Developer
I want: to use Consul for dynamic node discovery and health checks
So that: services can automatically be discovered and relayed to Prometheus
Acceptance Criteria:

Consul is deployed and configured in the production environment.
Services register with Consul upon startup through the Oracle Analytics SDK.
Health checks are configured and working, with failing services automatically deregistered.
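
As a rough illustration of the registration and auto-deregistration criteria above, the sketch below builds the JSON body a service could send to its local Consul agent's `/v1/agent/service/register` endpoint. The field names follow Consul's agent API; the function name, service name, address, port, health path, and timing values are all assumptions chosen for the example, and no network call is made.

```python
import json
from typing import Any, Dict


def build_service_registration(
    service_id: str,
    name: str,
    address: str,
    port: int,
    health_path: str = "/health",  # hypothetical health endpoint on the oracle
) -> Dict[str, Any]:
    """Build a Consul service definition with an HTTP health check."""
    return {
        "ID": service_id,
        "Name": name,
        "Address": address,
        "Port": port,
        "Check": {
            # Consul polls this URL; a non-2xx response marks the check as failing.
            "HTTP": f"http://{address}:{port}{health_path}",
            "Interval": "10s",
            "Timeout": "2s",
            # Consul removes the service once the check has stayed critical this long.
            "DeregisterCriticalServiceAfter": "1m",
        },
    }


# Example values are placeholders, not real Masa node details.
payload = build_service_registration("oracle-node-a", "masa-oracle", "10.0.0.5", 8080)
print(json.dumps(payload, indent=2))
```

Keeping the check definition inside the registration payload means a node only has to make one call at startup, which fits the idea of the SDK owning registration.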

Separation of Concerns

Title: Consolidated/Abstracted Node Analytics
As a: Data Lead
I want: to have analytics separated from general oracle function
So that: modification of oracle functionality does not break information services
Acceptance Criteria:

Consul and Prometheus are initialized and maintained within the Analytics SDK
Oracles reference the SDK for Analytics data delivery

Further Notes

Future improvements include:
- Integration and refactoring of Prometheus data for more complex analysis, using either Thanos with S3 or an external time-series data store to persist Prometheus data.
- Extended health monitoring of individual nodes
- Increased flexibility of node configuration using Ansible (or an alternative)

theMultitude · 2024-05-24T20:46:01Z

@Luka-Loncar @j2d3 @restevens402 @teslashibe @jdutchak @nolanjacobson A first pass at a PRD. Still to be done with the team is final confirmation of design (including scope) and ticketing breakdown.

Comments welcome.

theMultitude · 2024-05-28T14:08:29Z

A Loom for some perspective on how this architecture would work in its end state.

And an example of Consul's key-value store structure:

Nodes are discrete machines that can have multiple services. Both nodes and services can have health checks if we desire.
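
The node, service, and health-check relationship described above can be sketched as data. The example below takes entries shaped like the response of Consul's `/v1/health/service/<name>` endpoint (each with `Node`, `Service`, and `Checks`) and keeps only services whose checks are all passing, emitting them in the target-group shape Prometheus uses for file-based discovery. The sample node names, addresses, and the `passing_targets` helper are invented for illustration. Prometheus can also read Consul directly through its built-in Consul service discovery, so this is a view of the data model rather than a required component.

```python
from typing import Any, Dict, List


def passing_targets(health_entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert Consul health entries into Prometheus-style target groups."""
    targets = []
    for entry in health_entries:
        checks = entry.get("Checks", [])
        # Skip the service if any node-level or service-level check is not passing.
        if any(check.get("Status") != "passing" for check in checks):
            continue
        service = entry["Service"]
        # Consul leaves Service.Address empty when the service uses the node's address.
        host = service.get("Address") or entry["Node"]["Address"]
        targets.append(
            {
                "targets": [f"{host}:{service['Port']}"],
                "labels": {
                    "node": entry["Node"]["Node"],
                    "service": service["Service"],
                },
            }
        )
    return targets


# Made-up sample data: one healthy node and one with a failing check.
sample = [
    {
        "Node": {"Node": "node-a", "Address": "10.0.0.5"},
        "Service": {"ID": "oracle-node-a", "Service": "masa-oracle", "Address": "", "Port": 8080},
        "Checks": [{"Status": "passing"}],
    },
    {
        "Node": {"Node": "node-b", "Address": "10.0.0.6"},
        "Service": {"ID": "oracle-node-b", "Service": "masa-oracle", "Address": "10.0.0.6", "Port": 8080},
        "Checks": [{"Status": "passing"}, {"Status": "critical"}],
    },
]
print(passing_targets(sample))
```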

theMultitude added the Feature New feature label May 24, 2024

theMultitude mentioned this issue May 30, 2024

Epic: Masa Protocol - Real Time Analytics Infrastructure #304

Open

5 tasks
