marcuspat/rustops

RustOps

Intelligent AIOps Platform for Automated Monitoring, Anomaly Detection, and Incident Remediation



Overview

RustOps is an AIOps (Artificial Intelligence for IT Operations) platform built in Rust. It combines real-time monitoring, anomaly detection, automated incident management, and safe remediation workflows so that DevOps teams can automate routine operational work.

Key Capabilities

  • 🔍 Real-time Anomaly Detection - Statistical and ML-based detection with sub-millisecond latency
  • 🚨 Incident Management - Alert correlation, deduplication, and root cause analysis
  • 🔧 Automated Remediation - Safe, risk-based approval workflows with automatic rollback
  • 🗺️ Service Topology - Graph-based dependency mapping with impact analysis
  • 📚 Knowledge Management - Vector embeddings for semantic search and pattern storage
  • 🔄 Event Sourcing - Complete audit trail with CQRS for scalable read models

Quick Start

Prerequisites

  • Rust: 1.85 or later
  • Docker: 20.10+ (optional, for containerized deployment)
  • Kubernetes: 1.21+ (optional, for production deployment)
  • Neo4j: 5.0+ (optional, for topology features)

Local Development

# Clone the repository
git clone https://github.com/rustops/rustops.git
cd rustops

# Install dependencies and build
cargo build --workspace

# Run tests
cargo test --workspace

# Start the API server
RUST_LOG=info cargo run --bin rustops-api
# Server available at http://localhost:8080

# Start the agent service
RUST_LOG=info cargo run --bin rustops-agent
# Service available at http://localhost:8081

Docker Deployment

# Build all images
make docker-build-all

# Start infrastructure services (Kafka, Neo4j, Prometheus, etc.)
docker-compose up -d

# Run RustOps services
docker-compose up -d rustops-api rustops-agent rustops-pipeline

Verify Installation

# Health check
curl http://localhost:8080/health

# Metrics endpoint
curl http://localhost:8080/metrics

# API version
curl http://localhost:8080/api/v1/version

Architecture

RustOps follows Domain-Driven Design (DDD) principles and uses the Event Sourcing and CQRS patterns to keep the architecture scalable and maintainable.

System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              RustOps Platform                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌───────────┐       ┌───────────┐       ┌───────────┐       ┌───────────┐  │
│  │    API    │       │ Pipeline  │       │   Agent   │       │ Knowledge │  │
│  │  Server   │       │  Service  │       │  Service  │       │   Graph   │  │
│  └─────┬─────┘       └─────┬─────┘       └─────┬─────┘       └─────┬─────┘  │
│        │                   │                   │                   │        │
│  ┌─────▼───────────────────▼───────────────────▼───────────────────▼─────┐  │
│  │                     Integration Layer (Adapters)                      │  │
│  │       Prometheus │ Kubernetes │ ServiceNow │ Slack │ PagerDuty        │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                    Bounded Contexts (Domain Layer)                    │  │
│  │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │  │
│  │ │ Telemetry │ │  Anomaly  │ │ Incident  │ │ Topology  │ │Remediation│ │  │
│  │ │           │ │ Detection │ │Management │ │           │ │           │ │  │
│  │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                      Infrastructure & Data Layer                      │  │
│  │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │  │
│  │ │   Kafka   │ │   Neo4j   │ │PostgreSQL │ │   Redis   │ │Prometheus │ │  │
│  │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Design Patterns

| Pattern | Implementation | Benefit |
|---|---|---|
| Domain-Driven Design | Bounded contexts for each domain | Clear separation of concerns |
| Event Sourcing | Complete audit trail via domain events | Temporal queries, replay capability |
| CQRS | Separate read/write models | Optimized for each use case |
| Adapter Pattern | Unified interface for integrations | Swappable implementations |
| Circuit Breaker | Fault tolerance for external calls | Prevents cascading failures |
| Retry with Backoff | Exponential backoff for retries | Handles transient failures |
| Repository Pattern | Aggregate root persistence | Encapsulates data access |
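
As a rough illustration of the Retry with Backoff pattern listed above, the sketch below computes capped exponential delays and retries a fallible operation. It uses only the standard library. `backoff_delay` and `retry_with_backoff` are hypothetical names chosen for this example; they are not RustOps APIs, and the real implementation may differ (for instance, it may be async or add jitter).

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
/// Hypothetical helper for illustration only.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // checked_pow / checked_mul guard against overflow for large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

/// Calls `op` until it succeeds or `max_attempts` calls have been made,
/// sleeping with exponential backoff between failed calls.
/// `op` is always called at least once and receives the 0-based attempt number.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(base, max, attempt));
                attempt += 1;
            }
        }
    }
}
```

With a 100 ms base and a 5 s cap, the delays grow as 100 ms, 200 ms, 400 ms, 800 ms and then stay at the cap. Capping keeps a long outage from producing unbounded waits.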

Project Structure

RustOps is organized as a Cargo workspace with specialized crates:

rustops/
├── Cargo.toml              # Workspace configuration
├── crates/
│   ├── common/             # Shared types and utilities
│   ├── telemetry/          # Metrics, logs, traces collection
│   ├── anomaly/            # Anomaly detection algorithms
│   ├── incident/           # Incident management (CQRS/ES)
│   ├── integration/        # External system adapters
│   ├── topology/           # Service topology graph
│   ├── knowledge/          # Knowledge graph with embeddings
│   └── remediation/        # Automated remediation workflows
├── src/                    # Binary entry points
├── tests/                  # Integration tests
├── docs/                   # Documentation
├── docker-compose.yml      # Local development stack
└── Dockerfile              # Multi-stage builds

Crates Overview

| Crate | Purpose | Key Exports |
|---|---|---|
| common | Foundation types | IDs, Events, Errors, Telemetry primitives |
| telemetry | Collection pipeline | CollectorRegistry, Metric, LogEntry, TraceSpan |
| anomaly | Detection engine | ZScoreDetector, IQRDetector, AnomalyRouter |
| incident | Incident management | IncidentRepository, AlertCorrelator, TopologyGrouping |
| integration | External adapters | PrometheusAdapter, KubernetesAdapter, ServiceNowAdapter |
| topology | Service graph | ServiceGraph, DependencyAnalyzer, ImpactScorer |
| knowledge | Knowledge management | VectorSearch, PatternStorage, Runbook |
| remediation | Remediation engine | WorkflowEngine, SafetyCheck, ApprovalGate |

Features

🔍 Anomaly Detection

Multiple detection algorithms balance accuracy against latency:

  • Statistical Detection (<1ms latency)

    • Z-score for spike/drop detection
    • IQR (Interquartile Range) for outlier detection
    • CUSUM for cumulative change detection
  • ML-Based Detection (~50ms latency)

    • ONNX Runtime integration for model inference
    • Support for TensorFlow, PyTorch, scikit-learn models
    • Model versioning and hot-reloading
  • Pattern Matching

    • Signature-based detection for known anomalies
    • Seasonal decomposition
    • Trend analysis

// Example: Using Z-score detector
let detector = ZScoreDetector::new(2.0);
let result = detector.detect(&metrics).await?;

for anomaly in result.anomalies {
    println!("Anomaly detected: {:?} with confidence: {}",
        anomaly.anomaly_type, anomaly.confidence);
}
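
For readers unfamiliar with the technique, here is a minimal, standard-library-only sketch of Z-score detection. It is not the `ZScoreDetector` implementation. `zscore_outliers` is a hypothetical helper, and it assumes the population standard deviation is computed over the whole window passed in.

```rust
/// Returns the indices of samples whose z-score magnitude exceeds `threshold`.
/// Hypothetical helper for illustration; not part of the RustOps API.
fn zscore_outliers(samples: &[f64], threshold: f64) -> Vec<usize> {
    if samples.len() < 2 {
        return Vec::new(); // not enough data to estimate spread
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Population variance: mean of squared deviations from the mean.
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    if std_dev == 0.0 {
        return Vec::new(); // a constant series has no outliers
    }
    samples
        .iter()
        .enumerate()
        .filter(|(_, x)| ((**x - mean) / std_dev).abs() > threshold)
        .map(|(i, _)| i)
        .collect()
}
```

One property worth knowing: with the population standard deviation, a single point in a window of n samples can never score above sqrt(n - 1), so very short windows cannot reach high thresholds. This may be one reason a configuration would require a minimum number of data points before detection runs.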

🚨 Incident Management

Complete incident lifecycle management:

  • Alert Ingestion - Real-time alert processing from monitoring systems
  • Correlation - Intelligent grouping of related alerts
  • Deduplication - Eliminates duplicate alerts using similarity matching
  • Topology Grouping - Groups alerts by affected service topology
  • Root Cause Ranking - ML-based ranking of probable root causes
  • Event Sourcing - Complete audit trail for compliance

// Create new incident
let incident = Incident::new(
    "High CPU usage detected",
    IncidentSeverity::P2,
    IncidentStatus::New,
);

// Repository persists with event sourcing
repository.save(incident).await?;
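
To make the event-sourcing idea concrete, the following sketch rebuilds an incident's current state by replaying its event log. All types here (`IncidentEvent`, `IncidentState`, `Status`) are hypothetical and deliberately simplified; the actual RustOps aggregates and event names are not shown in this README and may look quite different.

```rust
/// Hypothetical domain events for one incident.
#[derive(Debug, Clone, PartialEq)]
enum IncidentEvent {
    Opened { title: String, severity: u8 },
    Acknowledged { by: String },
    Resolved,
}

#[derive(Debug, Default, Clone, Copy, PartialEq)]
enum Status {
    #[default]
    New,
    Acknowledged,
    Resolved,
}

/// Current state, derived purely from the events applied so far.
#[derive(Debug, Default, PartialEq)]
struct IncidentState {
    title: String,
    severity: u8,
    status: Status,
    acknowledged_by: Option<String>,
}

impl IncidentState {
    /// Applies a single event and returns the updated state.
    fn apply(mut self, event: &IncidentEvent) -> Self {
        match event {
            IncidentEvent::Opened { title, severity } => {
                self.title = title.clone();
                self.severity = *severity;
                self.status = Status::New;
            }
            IncidentEvent::Acknowledged { by } => {
                self.acknowledged_by = Some(by.clone());
                self.status = Status::Acknowledged;
            }
            IncidentEvent::Resolved => self.status = Status::Resolved,
        }
        self
    }

    /// Rebuilds state by folding over the event log from the beginning.
    fn replay(events: &[IncidentEvent]) -> Self {
        events.iter().fold(Self::default(), Self::apply)
    }
}
```

Replaying a prefix of the log (for example `&log[..2]`) yields the state as it was at that point, which is what makes temporal queries and audits possible.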

🔧 Automated Remediation

Safe, risk-based remediation workflows:

  • Workflow Orchestration - Temporal workflow engine for complex processes
  • Risk Assessment - Automatic risk scoring based on change impact
  • Approval Gates - Configurable approval workflows for high-risk changes
  • Blast Radius Limits - Namespace and resource constraints
  • Safety Interlocks - Pre-flight checks and validation
  • Automatic Rollback - Revert changes on failure detection

// Remediation workflow
let workflow = RestartServiceWorkflow::new(executor, config);
// `execute` takes `&mut context`, so the binding must be mutable
let mut context = WorkflowContext::new(incident);

let result = workflow.execute(&mut context).await?;
if result.success {
    println!("Service restarted successfully");
}
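
The interplay of risk scoring, approval gates, and blast radius limits can be sketched as a small decision function. Everything below is an assumption made for illustration: the scoring weights, the `ChangeRequest` and `Policy` fields, and the `decide` function are hypothetical and do not reflect how RustOps actually scores risk.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Decision {
    AutoApprove,
    RequireApproval,
    Reject,
}

/// Hypothetical description of a proposed remediation.
struct ChangeRequest {
    affected_services: usize,
    production: bool,
    reversible: bool,
}

/// Hypothetical policy: a hard blast radius limit plus an auto-approval cutoff.
struct Policy {
    max_affected_services: usize,
    auto_approve_below: u32,
}

/// Toy risk score: more affected services, production targets, and
/// irreversible actions all raise the score. The weights are made up.
fn risk_score(req: &ChangeRequest) -> u32 {
    let mut score = 10 * req.affected_services as u32;
    if req.production {
        score += 30;
    }
    if !req.reversible {
        score += 40;
    }
    score
}

fn decide(req: &ChangeRequest, policy: &Policy) -> Decision {
    // The blast radius limit is a hard stop, checked before any scoring.
    if req.affected_services > policy.max_affected_services {
        return Decision::Reject;
    }
    if risk_score(req) < policy.auto_approve_below {
        Decision::AutoApprove
    } else {
        Decision::RequireApproval
    }
}
```

Checking the blast radius before scoring means no score, however low, can let an oversized change through automatically.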

🗺️ Service Topology

Graph-based service dependency management:

  • Discovery - Automatic service discovery from Kubernetes
  • Graph Database - Neo4j for efficient topology queries
  • Impact Analysis - Predict downstream impact of changes
  • Communication Patterns - Detect HTTP, gRPC, Kafka connections
  • Real-time Updates - Streaming topology changes

// Query service dependencies
let dependencies = graph.get_dependencies("payment-api").await?;
for dep in dependencies {
    println!("payment-api depends on: {}", dep.service_name);
}

// Impact analysis
let impact = graph.estimate_impact("database", "delete").await?;
println!("Would affect {} services", impact.service_count);
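
Impact analysis over a dependency graph is essentially a traversal of reversed edges. The sketch below uses an in-memory `HashMap` and breadth-first search instead of Neo4j; `impacted_services` is a hypothetical function, and the service names in the usage note are examples only.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// `depends_on[a]` lists the services that `a` calls directly.
/// Returns every service that depends on `failed`, directly or transitively.
/// Hypothetical helper for illustration; not the RustOps graph API.
fn impacted_services(
    depends_on: &HashMap<&str, Vec<&str>>,
    failed: &str,
) -> BTreeSet<String> {
    // Invert the edges: for each dependency, record who depends on it.
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (service, deps) in depends_on {
        for dep in deps {
            dependents.entry(*dep).or_default().push(*service);
        }
    }

    // Breadth-first search upstream from the failed service.
    let mut seen = BTreeSet::new();
    let mut queue = VecDeque::from([failed]);
    while let Some(current) = queue.pop_front() {
        for dependent in dependents.get(current).into_iter().flatten() {
            // Skip the failed service itself (possible with cyclic graphs)
            // and anything already visited.
            if *dependent != failed && seen.insert((*dependent).to_string()) {
                queue.push_back(*dependent);
            }
        }
    }
    seen
}
```

For a graph where `checkout` calls `payment-api`, and `payment-api` calls `database` and `auth`, a failure of `database` reaches `payment-api` first and then `checkout` through it.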

📚 Knowledge Management

Intelligent knowledge storage and retrieval:

  • Vector Embeddings - Semantic search using HNSW (150x-12,500x faster)
  • Pattern Storage - Store successful remediation patterns
  • Runbook Automation - Link knowledge to executable actions
  • Learning Loop - Continuously improve from successful resolutions

// Semantic search
let results = knowledge.search("service restart timeout").await?;
for pattern in results {
    println!("Found pattern: {} (confidence: {})",
        pattern.description, pattern.similarity);
}
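
Semantic search boils down to comparing embedding vectors. The sketch below is a brute-force cosine-similarity scan, not HNSW: an HNSW index approximates the same ranking without visiting every vector. The `search` function and the tiny 3-dimensional vectors are assumptions for illustration; real embeddings would have many more dimensions.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Brute-force search: score every stored pattern against the query, keep
/// those at or above `min_similarity`, and return them best match first.
/// Hypothetical helper; an HNSW index would avoid scanning every pattern.
fn search<'a>(
    query: &[f32],
    patterns: &'a [(&'a str, Vec<f32>)],
    min_similarity: f32,
) -> Vec<(&'a str, f32)> {
    let mut hits: Vec<(&str, f32)> = patterns
        .iter()
        .map(|(name, embedding)| (*name, cosine_similarity(query, embedding)))
        .filter(|(_, score)| *score >= min_similarity)
        .collect();
    // Sort descending by similarity; total_cmp gives a total order on f32.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits
}
```

A similarity threshold such as 0.75 discards weak matches before ranking, so callers only see patterns that are reasonably close to the query.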

Installation

From Source

# Clone repository
git clone https://github.com/rustops/rustops.git
cd rustops

# Build workspace
cargo build --release

# Install binaries
cargo install --path .

Using Cargo

# Install API server
cargo install rustops-api --path crates/api

# Install agent service
cargo install rustops-agent --path crates/agent

Docker

# Pull images
docker pull rustops/api:latest
docker pull rustops/agent:latest

# Run with docker-compose
docker-compose up -d

System Requirements

| Component | Minimum | Recommended |
|---|---|---|
| CPU | 2 cores | 4+ cores |
| Memory | 4 GB | 8+ GB |
| Disk | 20 GB | 50+ GB |
| Rust | 1.85 | 1.92+ |
| Docker | 20.10 | 24.0+ |

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Log level (trace, debug, info, warn, error) |
| RUSTOPS_API_PORT | 8080 | API server port |
| RUSTOPS_AGENT_PORT | 8081 | Agent service port |
| RUSTOPS_PIPELINE_PORT | 9090 | Pipeline service port |
| KAFKA_BROKERS | localhost:9092 | Kafka broker addresses |
| NEO4J_URI | bolt://localhost:7687 | Neo4j connection URI |
| PROMETHEUS_PORT | 9090 | Prometheus scrape port |
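
One plausible way to resolve a port variable with its documented default is sketched below. `resolve_port` is a hypothetical helper, not the RustOps configuration loader. It takes the lookup as a closure so the logic can be exercised without touching the real process environment.

```rust
/// Resolves a port setting: uses the value returned by `lookup` when present,
/// otherwise falls back to `default`. A value that is present but not a valid
/// u16 is reported as an error rather than silently replaced.
/// Hypothetical helper for illustration.
fn resolve_port(
    lookup: impl Fn(&str) -> Option<String>,
    key: &str,
    default: u16,
) -> Result<u16, String> {
    match lookup(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse::<u16>()
            .map_err(|e| format!("{key}={raw:?} is not a valid port: {e}")),
    }
}
```

In a real binary the lookup would typically be `|k| std::env::var(k).ok()`, for example `resolve_port(|k| std::env::var(k).ok(), "RUSTOPS_API_PORT", 8080)`. Failing loudly on a malformed value avoids a service quietly starting on the wrong port.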

Configuration File

Create config.yaml:

# Agent configuration
agent:
  collection_interval_seconds: 15
  batch_size: 100

# Pipeline configuration
pipeline:
  kafka_brokers:
    - "localhost:9092"
  consumer_group: "rustops-pipeline"
  auto_offset_reset: "latest"

# Telemetry configuration
telemetry:
  prometheus:
    url: "http://localhost:9090"
    scrape_interval: "15s"

  logging:
    level: "info"
    format: "json"

# Anomaly detection
anomaly:
  z_score_threshold: 2.0
  iqr_multiplier: 1.5
  ml_model_path: "/models/anomaly.onnx"

  detection_window_seconds: 300
  minimum_data_points: 100

# Incident management
incident:
  correlation_window_minutes: 15
  deduplication_similarity_threshold: 0.85

  auto_escalation:
    p1_escalation_minutes: 5
    p2_escalation_minutes: 15

# Remediation
remediation:
  default_approval_strategy: "auto"  # auto, manual, hybrid

  blast_radius:
    enabled: true
    max_aged_services: 5
    max_namespace_depth: 3

  safety_checks:
    - "business_hours_check"
    - "maintenance_window_check"
    - "canary_deployment_check"

# Topology
topology:
  discovery_interval_seconds: 60
  neo4j_uri: "bolt://localhost:7687"

  communication_patterns:
    - "http"
    - "grpc"
    - "kafka"

# Knowledge graph
knowledge:
  hnsw_index_dimension: 384
  similarity_threshold: 0.75
  storage_path: "/data/knowledge"
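
To show what a setting such as `iqr_multiplier: 1.5` in the anomaly section above might control, here is a standard-library sketch of IQR outlier detection. It is not the `IQRDetector` implementation: `quantile` and `iqr_outliers` are hypothetical helpers, and the linear-interpolation quantile is one of several common definitions.

```rust
/// Linearly interpolated quantile of an already sorted, non-empty slice
/// (`q` in 0.0..=1.0).
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Returns the samples lying outside
/// [Q1 - multiplier * IQR, Q3 + multiplier * IQR].
/// Hypothetical helper for illustration; not part of the RustOps API.
fn iqr_outliers(samples: &[f64], multiplier: f64) -> Vec<f64> {
    if samples.len() < 4 {
        return Vec::new(); // too few points for meaningful quartiles
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile(&sorted, 0.25);
    let q3 = quantile(&sorted, 0.75);
    let iqr = q3 - q1;
    let (low, high) = (q1 - multiplier * iqr, q3 + multiplier * iqr);
    samples
        .iter()
        .copied()
        .filter(|x| *x < low || *x > high)
        .collect()
}
```

Because quartiles are barely affected by a few extreme values, this method tends to be more robust than Z-scores when the window already contains outliers.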

Service Configuration

Each service can be configured independently:

# API Server
rustops-api \
  --port 8080 \
  --config /etc/rustops/config.yaml \
  --log-level info

# Agent Service
rustops-agent \
  --port 8081 \
  --telemetry-prometheus http://prometheus:9090 \
  --pipeline-kafka-brokers localhost:9092

Development

Development Setup

# Install development dependencies
cargo install cargo-watch cargo-tarpaulin cargo-criterion

# Run with hot reload
cargo watch -x 'run --bin rustops-api'

# Run tests with output
cargo test --workspace -- --nocapture

# Run benchmarks
cargo bench

Code Organization

  • Bounded Contexts - Each crate represents a domain boundary
  • Aggregate Roots - Incident, ServiceGraph, Workflow are key aggregates
  • Domain Events - All state changes emit events for sourcing
  • Repositories - Data access abstracted through repository pattern
  • Factories - Complex object creation via factory methods
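
As a small illustration of the repository idea from the list above, the sketch below hides storage behind a trait so that domain code does not care where aggregates live. The trait and types are hypothetical and synchronous for brevity; they only borrow the name `IncidentRepository` from the crate table and do not reproduce its real signature.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Incident {
    id: u64,
    title: String,
}

/// Hypothetical repository trait: the only way domain code touches storage.
trait IncidentRepository {
    fn save(&mut self, incident: Incident);
    fn find(&self, id: u64) -> Option<&Incident>;
}

/// In-memory implementation, handy for unit tests.
#[derive(Default)]
struct InMemoryIncidents {
    by_id: HashMap<u64, Incident>,
}

impl IncidentRepository for InMemoryIncidents {
    fn save(&mut self, incident: Incident) {
        self.by_id.insert(incident.id, incident);
    }

    fn find(&self, id: u64) -> Option<&Incident> {
        self.by_id.get(&id)
    }
}

/// Domain logic written against the trait, not a concrete store.
/// Returns false when no incident with that id exists.
fn rename(repo: &mut dyn IncidentRepository, id: u64, title: &str) -> bool {
    match repo.find(id).cloned() {
        Some(mut incident) => {
            incident.title = title.to_string();
            repo.save(incident);
            true
        }
        None => false,
    }
}
```

Swapping `InMemoryIncidents` for a database-backed implementation would not require changes to `rename`, which is the main benefit the pattern aims for.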

Testing

# Unit tests
cargo test --lib

# Integration tests
cargo test --test integration

# Property-based tests
cargo test --test property_tests

# With coverage
cargo tarpaulin --out Html

# Benchmarks in test mode (each benchmark runs once as a smoke check, without measurement)
cargo bench -- --test

Build Configuration

# Development profile
[profile.dev]
opt-level = 0
debug = true

# Release profile
[profile.release]
opt-level = 3
lto = true
codegen-units = 256
strip = true

# Benchmark profile
[profile.bench]
debug = true

Linting

# Check code
cargo clippy --all-targets -- -D warnings

# Format code
cargo fmt

# Check formatting
cargo fmt -- --check

Testing

Test Suite

The project includes 145 tests covering all bounded contexts:

| Crate | Tests | Coverage Target |
|---|---|---|
| common | 60 | >80% |
| telemetry | 14 | >80% |
| anomaly | 8 | >80% |
| incident | 16 | >80% |
| integration | 16 | >80% |
| topology | 9 | >80% |
| knowledge | 6 | >80% |
| remediation | 16 | >80% |

Running Tests

# All tests
cargo test --workspace

# Specific crate
cargo test -p rustops-common

# Specific test
cargo test test_z_score_detector

# With output
cargo test -- --nocapture

# Run tests on 4 threads (--jobs only limits parallel compilation, not test execution)
cargo test --workspace -- --test-threads=4

Property-Based Testing

# Run property tests
cargo test --test property_tests

# Generate test cases
cargo test --test property_tests -- --generate

Benchmarking

# Run all benchmarks
cargo bench

# Specific benchmark
cargo bench --bench id_benchmark

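# Run a benchmark for 10 seconds without analysis, for use under a profiler
# (assumes Criterion benchmarks; a profiler is still needed to produce a flamegraph)
cargo bench --bench id_benchmark -- --profile-time=10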

Deployment

Docker Deployment

# Build image
docker build -t rustops-api:latest .

# Run container
docker run -d \
  --name rustops-api \
  -p 8080:8080 \
  -v /etc/rustops:/etc/rustops:ro \
  -e RUST_LOG=info \
  rustops-api:latest

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rustops-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rustops-api
  template:
    metadata:
      labels:
        app: rustops-api
    spec:
      containers:
      - name: rustops-api
        image: rustops/api:latest
        ports:
        - containerPort: 8080
        env:
        - name: RUST_LOG
          value: "info"
        - name: KAFKA_BROKERS
          value: "kafka:9092"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: rustops-api
spec:
  selector:
    app: rustops-api
  ports:
  - port: 8080
    targetPort: 8080
  type: LoadBalancer

Infrastructure Requirements

Minimum Production Setup

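| Service | vCPU | Memory | Storage |
|---|---|---|---|
| rustops-api | 2 | 4GB | 10GB |
| rustops-agent | 1 | 2GB | 5GB |
| rustops-pipeline | 2 | 4GB | 10GB |
| Kafka | 2 | 4GB | 50GB |
| Neo4j | 2 | 4GB | 50GB |
| PostgreSQL | 2 | 4GB | 50GB |
| Redis | 1 | 2GB | 10GB |
| Prometheus | 1 | 2GB | 20GB |
| Grafana | 1 | 2GB | 10GB |
| Jaeger | 1 | 2GB | 10GB |
| Temporal | 2 | 4GB | 20GB |
| Fluentd | 1 | 2GB | 10GB |
| ClickHouse | 2 | 4GB | 50GB |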

Total: 20 vCPU, 40GB RAM, 305GB storage


API Reference

REST API v1

Health & Status

GET /health
GET /metrics
GET /api/v1/version

Telemetry

POST /api/v1/telemetry/metrics
POST /api/v1/telemetry/logs
POST /api/v1/telemetry/traces

Incidents

GET /api/v1/incidents
GET /api/v1/incidents/:id
POST /api/v1/incidents
PUT /api/v1/incidents/:id
DELETE /api/v1/incidents/:id
GET /api/v1/incidents/:id/timeline

Anomalies

GET /api/v1/anomalies
GET /api/v1/anomalies/:id
POST /api/v1/anomalies/detect
PUT /api/v1/anomalies/:id/acknowledge

Topology

GET /api/v1/topology/services
GET /api/v1/topology/services/:id/dependencies
POST /api/v1/topology/analyze-impact

Knowledge

GET /api/v1/knowledge/search
POST /api/v1/knowledge/patterns
GET /api/v1/knowledge/patterns/:id

Remediation

GET /api/v1/remediation/workflows
POST /api/v1/remediation/workflows
PUT /api/v1/remediation/workflows/:id
POST /api/v1/remediation/workflows/:id/approve
POST /api/v1/remediation/workflows/:id/execute
GET /api/v1/remediation/safety-checks

WebSocket Streams

WS /api/v1/stream/metrics
WS /api/v1/stream/alerts
WS /api/v1/stream/topology
WS /api/v1/stream/workflows

Example Requests

# Create incident
curl -X POST http://localhost:8080/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "High CPU usage on payment-service",
    "severity": "P2",
    "description": "CPU usage at 95% for 5 minutes",
    "labels": {"service": "payment-service"}
  }'

# Search knowledge
curl -X POST http://localhost:8080/api/v1/knowledge/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "database connection pool exhaustion",
    "limit": 10
  }'

# Trigger remediation
curl -X POST http://localhost:8080/api/v1/remediation/workflows \
  -H "Content-Type: application/json" \
  -d '{
    "incident_id": "incident-123",
    "workflow_type": "restart_service",
    "parameters": {
      "service_name": "payment-service",
      "namespace": "production"
    }
  }'

Contributing

We welcome contributions! Please see below for guidelines.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Write tests and run the suite (cargo test --workspace)
  5. Format code (cargo fmt)
  6. Run lints (cargo clippy)
  7. Commit your changes (git commit -m "Add amazing feature")
  8. Push to the branch (git push origin feature/amazing-feature)
  9. Open a Pull Request

Code Style

  • Follow Rust naming conventions
  • Use cargo fmt for formatting
  • Use cargo clippy for linting
  • Document all public APIs
  • Keep functions focused and small
  • Write descriptive commit messages

Testing Requirements

  • All tests must pass (cargo test --workspace)
  • New features require tests
  • Maintain >80% code coverage
  • Add integration tests for external APIs
  • Include property tests for algorithms

Pull Request Guidelines

  • Describe the change in the PR title
  • Provide context in the PR description
  • Link related issues
  • Ensure CI checks pass
  • Request review from relevant maintainers
  • Keep PRs small and focused

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Copyright 2025 RustOps Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Acknowledgments

Built with:

  • Rust - Systems programming language
  • Tokio - Async runtime
  • Serde - Serialization framework
  • Kube - Kubernetes client
  • Prometheus - Metrics monitoring
  • Temporal - Workflow orchestration
  • Neo4j - Graph database

RustOps - Empowering DevOps teams with intelligent automation and observability.

For more information, visit https://github.com/rustops/rustops
