Intelligent AIOps Platform for Automated Monitoring, Anomaly Detection, and Incident Remediation
RustOps is an AIOps (Artificial Intelligence for IT Operations) platform built in Rust. It combines real-time monitoring, anomaly detection, automated incident management, and safe remediation workflows to give DevOps teams dependable automation.
- Real-time Anomaly Detection - Statistical detection with sub-millisecond latency, plus ML-based detection (~50ms)
- Incident Management - Alert correlation, deduplication, and root cause analysis
- Automated Remediation - Safe, risk-based approval workflows with automatic rollback
- Service Topology - Graph-based dependency mapping with impact analysis
- Knowledge Management - Vector embeddings for semantic search and pattern storage
- Event Sourcing - Complete audit trail with CQRS for scalable read models
- Quick Start
- Architecture
- Project Structure
- Features
- Installation
- Configuration
- Development
- Testing
- Deployment
- API Reference
- Contributing
- License
- Rust: 1.85 or later
- Docker: 20.10+ (optional, for containerized deployment)
- Kubernetes: 1.21+ (optional, for production deployment)
- Neo4j: 5.0+ (optional, for topology features)
# Clone the repository
git clone https://github.com/rustops/rustops.git
cd rustops
# Install dependencies and build
cargo build --workspace
# Run tests
cargo test --workspace
# Start the API server
RUST_LOG=info cargo run --bin rustops-api
# Server available at http://localhost:8080
# Start the agent service
RUST_LOG=info cargo run --bin rustops-agent
# Service available at http://localhost:8081
# Build all images
make docker-build-all
# Start infrastructure services (Kafka, Neo4j, Prometheus, etc.)
docker-compose up -d
# Run RustOps services
docker-compose up -d rustops-api rustops-agent rustops-pipeline
# Health check
curl http://localhost:8080/health
# Metrics endpoint
curl http://localhost:8080/metrics
# API version
curl http://localhost:8080/api/v1/version
RustOps follows Domain-Driven Design (DDD) principles with Event Sourcing and CQRS patterns for a scalable, maintainable architecture.
┌────────────────────────────────────────────────────────────────────────┐
│                            RustOps Platform                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│   ┌──────────┐    ┌────────────┐    ┌─────────────┐    ┌───────────┐   │
│   │   API    │    │  Pipeline  │    │    Agent    │    │ Knowledge │   │
│   │  Server  │    │  Service   │    │   Service   │    │   Graph   │   │
│   └────┬─────┘    └─────┬──────┘    └──────┬──────┘    └─────┬─────┘   │
│        │                │                  │                 │         │
│  ┌─────┴────────────────┴──────────────────┴─────────────────┴──────┐  │
│  │                   Integration Layer (Adapters)                   │  │
│  │     Prometheus │ Kubernetes │ ServiceNow │ Slack │ PagerDuty     │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                 Bounded Contexts (Domain Layer)                  │  │
│  │  ┌─────────┐ ┌─────────┐ ┌──────────┐ ┌────────┐ ┌───────────┐   │  │
│  │  │Telemetry│ │ Anomaly │ │ Incident │ │Topology│ │Remediation│   │  │
│  │  │         │ │Detection│ │Management│ │        │ │           │   │  │
│  │  └─────────┘ └─────────┘ └──────────┘ └────────┘ └───────────┘   │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                   Infrastructure & Data Layer                    │  │
│  │ ┌───────┐  ┌───────┐  ┌────────────┐  ┌───────┐  ┌────────────┐  │  │
│  │ │ Kafka │  │ Neo4j │  │ PostgreSQL │  │ Redis │  │ Prometheus │  │  │
│  │ └───────┘  └───────┘  └────────────┘  └───────┘  └────────────┘  │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
| Pattern | Implementation | Benefit |
|---|---|---|
| Domain-Driven Design | Bounded contexts for each domain | Clear separation of concerns |
| Event Sourcing | Complete audit trail via domain events | Temporal queries, replay capability |
| CQRS | Separate read/write models | Optimized for each use case |
| Adapter Pattern | Unified interface for integrations | Swappable implementations |
| Circuit Breaker | Fault tolerance for external calls | Prevents cascading failures |
| Retry with Backoff | Exponential backoff for retries | Handles transient failures |
| Repository Pattern | Aggregate root persistence | Encapsulates data access |
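The two fault-tolerance rows in the table (Circuit Breaker and Retry with Backoff) can be illustrated with a small, standard-library-only sketch. It is not the RustOps implementation: the names `CircuitBreaker`, `BreakerState`, and `backoff_delay`, the failure threshold, and the cool-down are all assumptions made for the example.

```rust
use std::time::{Duration, Instant};

/// Exponential backoff: `base * 2^attempt`, capped at `max`.
/// (Hypothetical helper; real code would usually add jitter.)
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BreakerState {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

/// Minimal circuit breaker: opens after `failure_threshold` consecutive
/// failures and lets one trial call through after `cool_down`.
pub struct CircuitBreaker {
    state: BreakerState,
    failures: u32,
    failure_threshold: u32,
    cool_down: Duration,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, cool_down: Duration) -> Self {
        Self { state: BreakerState::Closed, failures: 0, failure_threshold, cool_down }
    }

    /// Returns true if an external call may be attempted at `now`.
    pub fn allow(&mut self, now: Instant) -> bool {
        if let BreakerState::Open { since } = self.state {
            if now.duration_since(since) >= self.cool_down {
                self.state = BreakerState::HalfOpen; // allow a trial call
            } else {
                return false; // still cooling down: fail fast
            }
        }
        true
    }

    pub fn record_success(&mut self) {
        self.failures = 0;
        self.state = BreakerState::Closed;
    }

    pub fn record_failure(&mut self, now: Instant) {
        self.failures += 1;
        // A failed trial call, or too many consecutive failures, opens the breaker.
        if self.state == BreakerState::HalfOpen || self.failures >= self.failure_threshold {
            self.state = BreakerState::Open { since: now };
        }
    }
}
```

An adapter would typically call `allow` before each external request, sleep for `backoff_delay(attempt, ..)` between retries, and report the outcome through `record_success` or `record_failure`.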
RustOps is organized as a Cargo workspace with specialized crates:
rustops/
├── Cargo.toml            # Workspace configuration
├── crates/
│   ├── common/           # Shared types and utilities
│   ├── telemetry/        # Metrics, logs, traces collection
│   ├── anomaly/          # Anomaly detection algorithms
│   ├── incident/         # Incident management (CQRS/ES)
│   ├── integration/      # External system adapters
│   ├── topology/         # Service topology graph
│   ├── knowledge/        # Knowledge graph with embeddings
│   └── remediation/      # Automated remediation workflows
├── src/                  # Binary entry points
├── tests/                # Integration tests
├── docs/                 # Documentation
├── docker-compose.yml    # Local development stack
└── Dockerfile            # Multi-stage builds
| Crate | Purpose | Key Exports |
|---|---|---|
| common | Foundation types | IDs, Events, Errors, Telemetry primitives |
| telemetry | Collection pipeline | CollectorRegistry, Metric, LogEntry, TraceSpan |
| anomaly | Detection engine | ZScoreDetector, IQRDetector, AnomalyRouter |
| incident | Incident management | IncidentRepository, AlertCorrelator, TopologyGrouping |
| integration | External adapters | PrometheusAdapter, KubernetesAdapter, ServiceNowAdapter |
| topology | Service graph | ServiceGraph, DependencyAnalyzer, ImpactScorer |
| knowledge | Knowledge management | VectorSearch, PatternStorage, Runbook |
| remediation | Remediation engine | WorkflowEngine, SafetyCheck, ApprovalGate |
Multi-algorithm approach for optimal accuracy and performance:
- Statistical Detection (<1ms latency)
  - Z-score for spike/drop detection
  - IQR (Interquartile Range) for outlier detection
  - CUSUM for cumulative change detection
- ML-Based Detection (~50ms latency)
  - ONNX Runtime integration for model inference
  - Support for TensorFlow, PyTorch, scikit-learn models
  - Model versioning and hot-reloading
- Pattern Matching
  - Signature-based detection for known anomalies
  - Seasonal decomposition
  - Trend analysis
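To make the statistical detectors listed above concrete, here is a minimal sketch of Z-score and IQR outlier detection over a slice of samples. It only illustrates the two techniques; the function names `z_score_outliers` and `iqr_outliers` are hypothetical, and the choice of population standard deviation and linearly interpolated quartiles are assumptions, not a description of how `ZScoreDetector` or `IQRDetector` work internally.

```rust
/// Indices of samples whose absolute z-score exceeds `threshold`.
/// Assumption: population standard deviation over the whole window.
pub fn z_score_outliers(data: &[f64], threshold: f64) -> Vec<usize> {
    if data.len() < 2 {
        return Vec::new();
    }
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    let variance = data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std_dev = variance.sqrt();
    if std_dev == 0.0 {
        return Vec::new(); // constant series: nothing stands out
    }
    data.iter()
        .enumerate()
        .filter(|(_, x)| ((**x - mean) / std_dev).abs() > threshold)
        .map(|(i, _)| i)
        .collect()
}

/// Quantile of an already sorted slice, using linear interpolation.
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Indices of samples outside [Q1 - k*IQR, Q3 + k*IQR], where k = `multiplier`.
pub fn iqr_outliers(data: &[f64], multiplier: f64) -> Vec<usize> {
    if data.len() < 4 {
        return Vec::new();
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = quantile(&sorted, 0.25);
    let q3 = quantile(&sorted, 0.75);
    let iqr = q3 - q1;
    let (low, high) = (q1 - multiplier * iqr, q3 + multiplier * iqr);
    data.iter()
        .enumerate()
        .filter(|(_, x)| **x < low || **x > high)
        .map(|(i, _)| i)
        .collect()
}
```

With a threshold of 2.0 and a multiplier of 1.5 (the values used in the sample configuration), both functions flag only the last sample of `[10, 11, 9, 10, 12, 10, 11, 9, 10, 50]`.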
// Example: Using Z-score detector
let detector = ZScoreDetector::new(2.0);
let result = detector.detect(&metrics).await?;
for anomaly in result.anomalies {
    println!("Anomaly detected: {:?} with confidence: {}",
        anomaly.anomaly_type, anomaly.confidence);
}
Complete incident lifecycle management:
- Alert Ingestion - Real-time alert processing from monitoring systems
- Correlation - Intelligent grouping of related alerts
- Deduplication - Eliminates duplicate alerts using similarity matching
- Topology Grouping - Groups alerts by affected service topology
- Root Cause Ranking - ML-based ranking of probable root causes
- Event Sourcing - Complete audit trail for compliance
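The deduplication step above can be sketched with a simple text-similarity measure. The README does not say which similarity function RustOps uses, so the Jaccard index over lowercase word tokens below is an assumption chosen for brevity, and `jaccard` and `deduplicate` are hypothetical names.

```rust
use std::collections::HashSet;

/// Lowercase alphanumeric tokens of an alert title.
fn tokens(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B| over the token sets.
pub fn jaccard(a: &str, b: &str) -> f64 {
    let (ta, tb) = (tokens(a), tokens(b));
    if ta.is_empty() && tb.is_empty() {
        return 1.0; // two empty titles are treated as identical
    }
    let intersection = ta.intersection(&tb).count() as f64;
    let union = ta.union(&tb).count() as f64;
    intersection / union
}

/// Keeps the first alert of every group whose similarity to an
/// already kept alert is at or above `threshold`.
pub fn deduplicate(alerts: &[&str], threshold: f64) -> Vec<String> {
    let mut kept: Vec<String> = Vec::new();
    for &alert in alerts {
        if !kept.iter().any(|k| jaccard(k, alert) >= threshold) {
            kept.push(alert.to_string());
        }
    }
    kept
}
```

For example, `deduplicate(&["High CPU usage on payment-service", "high cpu usage on payment-service", "Disk full on db-01"], 0.85)` keeps two alerts, because the first two titles tokenize to the same set.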
// Create new incident
let incident = Incident::new(
    "High CPU usage detected",
    IncidentSeverity::P2,
    IncidentStatus::New,
);
// Repository persists with event sourcing
repository.save(incident).await?;
Safe, risk-based remediation workflows:
- Workflow Orchestration - Temporal workflow engine for complex processes
- Risk Assessment - Automatic risk scoring based on change impact
- Approval Gates - Configurable approval workflows for high-risk changes
- Blast Radius Limits - Namespace and resource constraints
- Safety Interlocks - Pre-flight checks and validation
- Automatic Rollback - Revert changes on failure detection
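How risk assessment, approval gates, and blast radius limits might fit together is sketched below. Everything in it is an assumption made for illustration: the `ChangeRequest` fields, the scoring weights, the `Decision` type, and the rule that the `auto` strategy skips manual approval are not taken from the RustOps code.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ApprovalStrategy {
    Auto,
    Manual,
    Hybrid,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    AutoApprove,
    RequireApproval,
    Reject(&'static str),
}

/// Hypothetical description of a proposed remediation.
pub struct ChangeRequest {
    pub affected_services: usize,
    pub production: bool,
    pub reversible: bool,
}

/// Illustrative risk score in 0..=100 (the weights are made up).
pub fn risk_score(change: &ChangeRequest) -> u32 {
    let spread = change.affected_services.min(5) as u32 * 10; // up to 50
    let environment = if change.production { 30 } else { 0 };
    let undo = if change.reversible { 0 } else { 20 };
    spread + environment + undo
}

/// Applies the blast radius limit first, then the approval strategy.
pub fn decide(
    change: &ChangeRequest,
    strategy: ApprovalStrategy,
    max_affected_services: usize,
    risk_threshold: u32,
) -> Decision {
    if change.affected_services > max_affected_services {
        return Decision::Reject("blast radius limit exceeded");
    }
    match strategy {
        ApprovalStrategy::Auto => Decision::AutoApprove,
        ApprovalStrategy::Manual => Decision::RequireApproval,
        // Hybrid: only high-risk changes wait for a human.
        ApprovalStrategy::Hybrid if risk_score(change) >= risk_threshold => {
            Decision::RequireApproval
        }
        ApprovalStrategy::Hybrid => Decision::AutoApprove,
    }
}
```

Checking the blast radius before the strategy means that even an `auto` workflow cannot touch more services than the configured limit.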
// Remediation workflow
let workflow = RestartServiceWorkflow::new(executor, config);
// The context is borrowed mutably by execute, so it must be declared `mut`
let mut context = WorkflowContext::new(incident);
let result = workflow.execute(&mut context).await?;
if result.success {
    println!("Service restarted successfully");
}
Graph-based service dependency management:
- Discovery - Automatic service discovery from Kubernetes
- Graph Database - Neo4j for efficient topology queries
- Impact Analysis - Predict downstream impact of changes
- Communication Patterns - Detect HTTP, gRPC, Kafka connections
- Real-time Updates - Streaming topology changes
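Impact analysis over a dependency graph amounts to finding every service that depends, directly or transitively, on the one being changed. The in-memory sketch below does this with a breadth-first search over reverse edges. It stands in for the Neo4j-backed graph only conceptually; `DependencyGraph`, `add_dependency`, and `impacted_by` are hypothetical names.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// In-memory stand-in for a service topology (hypothetical type).
#[derive(Default)]
pub struct DependencyGraph {
    /// Reverse edges: service -> services that depend on it.
    dependents: HashMap<String, Vec<String>>,
}

impl DependencyGraph {
    pub fn new() -> Self {
        Self::default()
    }

    /// Records that `service` depends on `depends_on`.
    pub fn add_dependency(&mut self, service: &str, depends_on: &str) {
        self.dependents
            .entry(depends_on.to_string())
            .or_default()
            .push(service.to_string());
    }

    /// All services transitively affected if `target` fails, sorted by name.
    pub fn impacted_by(&self, target: &str) -> Vec<String> {
        let mut seen: HashSet<&str> = HashSet::new();
        let mut queue: VecDeque<&str> = VecDeque::from([target]);
        while let Some(current) = queue.pop_front() {
            if let Some(deps) = self.dependents.get(current) {
                for d in deps {
                    // The visited set also protects against dependency cycles.
                    if d.as_str() != target && seen.insert(d.as_str()) {
                        queue.push_back(d.as_str());
                    }
                }
            }
        }
        let mut impacted: Vec<String> = seen.into_iter().map(String::from).collect();
        impacted.sort();
        impacted
    }
}
```

The length of the returned list corresponds to the kind of number reported as an affected service count.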
// Query service dependencies
let dependencies = graph.get_dependencies("payment-api").await?;
for dep in dependencies {
    println!("payment-api depends on: {}", dep.service_name);
}
// Impact analysis
let impact = graph.estimate_impact("database", "delete").await?;
println!("Would affect {} services", impact.service_count);
Intelligent knowledge storage and retrieval:
- Vector Embeddings - Semantic search using HNSW (150x-12,500x faster)
- Pattern Storage - Store successful remediation patterns
- Runbook Automation - Link knowledge to executable actions
- Learning Loop - Continuously improve from successful resolutions
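Semantic search over embeddings comes down to ranking stored vectors by their similarity to a query vector. The sketch below uses a brute-force linear scan with cosine similarity, which is the exact computation an HNSW index approximates without visiting every vector. The function names and the tuple-based pattern store are assumptions for the example, and the embeddings are supplied by hand rather than produced by a model.

```rust
/// Cosine similarity of two vectors; 0.0 if either has zero length.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Brute-force search: scores every stored pattern, drops those below
/// `threshold`, and returns the best `limit` matches, highest first.
pub fn search<'a>(
    query: &[f32],
    patterns: &'a [(String, Vec<f32>)],
    threshold: f32,
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut hits: Vec<(&str, f32)> = patterns
        .iter()
        .map(|(name, embedding)| (name.as_str(), cosine_similarity(query, embedding)))
        .filter(|(_, score)| *score >= threshold)
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(limit);
    hits
}
```

A linear scan costs one similarity computation per stored pattern, which is why larger stores move to an approximate index such as HNSW.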
// Semantic search
let results = knowledge.search("service restart timeout").await?;
for pattern in results {
    println!("Found pattern: {} (confidence: {})",
        pattern.description, pattern.similarity);
}
# Clone repository
git clone https://github.com/rustops/rustops.git
cd rustops
# Build workspace
cargo build --release
# Install binaries
cargo install --path .
# Install API server
cargo install rustops-api --path crates/api
# Install agent service
cargo install rustops-agent --path crates/agent
# Pull images
docker pull rustops/api:latest
docker pull rustops/agent:latest
# Run with docker-compose
docker-compose up -d
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 2 cores | 4+ cores |
| Memory | 4 GB | 8+ GB |
| Disk | 20 GB | 50+ GB |
| Rust | 1.85 | 1.92+ |
| Docker | 20.10 | 24.0+ |
| Variable | Default | Description |
|---|---|---|
| RUST_LOG | info | Log level (trace, debug, info, warn, error) |
| RUSTOPS_API_PORT | 8080 | API server port |
| RUSTOPS_AGENT_PORT | 8081 | Agent service port |
| RUSTOPS_PIPELINE_PORT | 9090 | Pipeline service port |
| KAFKA_BROKERS | localhost:9092 | Kafka broker addresses |
| NEO4J_URI | bolt://localhost:7687 | Neo4j connection URI |
| PROMETHEUS_PORT | 9090 | Prometheus scrape port |
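The variables in the table could be read with their documented defaults roughly as follows. This is a sketch, not the actual RustOps startup code: `ServiceConfig`, `config_from`, and `config_from_env` are hypothetical, and treating `KAFKA_BROKERS` as a comma-separated list is an assumption. The lookup is passed in as a closure so the parsing logic can be exercised without touching the process environment.

```rust
use std::env;

/// Hypothetical subset of the settings from the table above.
#[derive(Debug, PartialEq)]
pub struct ServiceConfig {
    pub api_port: u16,
    pub agent_port: u16,
    pub kafka_brokers: Vec<String>,
    pub neo4j_uri: String,
}

/// Parses a port, falling back to `default` when unset or invalid.
fn parse_port(raw: Option<String>, default: u16) -> u16 {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Builds the configuration from any key/value lookup.
pub fn config_from<F>(lookup: F) -> ServiceConfig
where
    F: Fn(&str) -> Option<String>,
{
    let brokers = lookup("KAFKA_BROKERS").unwrap_or_else(|| "localhost:9092".to_string());
    ServiceConfig {
        api_port: parse_port(lookup("RUSTOPS_API_PORT"), 8080),
        agent_port: parse_port(lookup("RUSTOPS_AGENT_PORT"), 8081),
        // Assumption: multiple brokers are separated by commas.
        kafka_brokers: brokers
            .split(',')
            .map(|b| b.trim().to_string())
            .filter(|b| !b.is_empty())
            .collect(),
        neo4j_uri: lookup("NEO4J_URI").unwrap_or_else(|| "bolt://localhost:7687".to_string()),
    }
}

/// Reads the same settings from the process environment.
pub fn config_from_env() -> ServiceConfig {
    config_from(|key| env::var(key).ok())
}
```

Falling back to the default on an unparsable port keeps the service bootable; a stricter variant could return an error instead.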
Create config.yaml:
# Agent configuration
agent:
  collection_interval_seconds: 15
  batch_size: 100
# Pipeline configuration
pipeline:
  kafka_brokers:
    - "localhost:9092"
  consumer_group: "rustops-pipeline"
  auto_offset_reset: "latest"
# Telemetry configuration
telemetry:
  prometheus:
    url: "http://localhost:9090"
    scrape_interval: "15s"
  logging:
    level: "info"
    format: "json"
# Anomaly detection
anomaly:
  z_score_threshold: 2.0
  iqr_multiplier: 1.5
  ml_model_path: "/models/anomaly.onnx"
  detection_window_seconds: 300
  minimum_data_points: 100
# Incident management
incident:
  correlation_window_minutes: 15
  deduplication_similarity_threshold: 0.85
  auto_escalation:
    p1_escalation_minutes: 5
    p2_escalation_minutes: 15
# Remediation
remediation:
  default_approval_strategy: "auto" # auto, manual, hybrid
  blast_radius:
    enabled: true
    max_aged_services: 5
    max_namespace_depth: 3
  safety_checks:
    - "business_hours_check"
    - "maintenance_window_check"
    - "canary_deployment_check"
# Topology
topology:
  discovery_interval_seconds: 60
  neo4j_uri: "bolt://localhost:7687"
  communication_patterns:
    - "http"
    - "grpc"
    - "kafka"
# Knowledge graph
knowledge:
  hnsw_index_dimension: 384
  similarity_threshold: 0.75
  storage_path: "/data/knowledge"
Each service can be configured independently:
# API Server
rustops-api \
--port 8080 \
--config /etc/rustops/config.yaml \
--log-level info
# Agent Service
rustops-agent \
--port 8081 \
--telemetry-prometheus http://prometheus:9090 \
--pipeline-kafka-brokers localhost:9092
# Install development dependencies
cargo install cargo-watch cargo-tarpaulin cargo-criterion
# Run with hot reload
cargo watch -x 'run --bin rustops-api'
# Run tests with output
cargo test --workspace -- --nocapture
# Run benchmarks
cargo bench
- Bounded Contexts - Each crate represents a domain boundary
- Aggregate Roots - Incident, ServiceGraph, Workflow are key aggregates
- Domain Events - All state changes emit events for sourcing
- Repositories - Data access abstracted through repository pattern
- Factories - Complex object creation via factory methods
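The relationship between aggregate roots and domain events can be sketched as follows: commands validate against current state and record events, and the state itself is only ever changed by applying events, so replaying the event log rebuilds the aggregate. The types `IncidentAggregate`, `IncidentEvent`, and `Status` are simplified, hypothetical stand-ins and do not mirror the real `Incident` aggregate.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum IncidentEvent {
    Opened { title: String },
    Acknowledged { by: String },
    Resolved,
}

#[derive(Debug, Clone, PartialEq, Default)]
pub enum Status {
    #[default]
    New,
    Acknowledged,
    Resolved,
}

/// Simplified event-sourced aggregate (hypothetical).
#[derive(Debug, Default)]
pub struct IncidentAggregate {
    pub title: String,
    pub status: Status,
    pub version: u64,
    uncommitted: Vec<IncidentEvent>,
}

impl IncidentAggregate {
    /// The only place where state changes: one event, one transition.
    fn apply(&mut self, event: &IncidentEvent) {
        match event {
            IncidentEvent::Opened { title } => self.title = title.clone(),
            IncidentEvent::Acknowledged { .. } => self.status = Status::Acknowledged,
            IncidentEvent::Resolved => self.status = Status::Resolved,
        }
        self.version += 1;
    }

    fn record(&mut self, event: IncidentEvent) {
        self.apply(&event);
        self.uncommitted.push(event);
    }

    /// Rebuilds an aggregate from its stored event history.
    pub fn replay(events: &[IncidentEvent]) -> Self {
        let mut aggregate = Self::default();
        for event in events {
            aggregate.apply(event);
        }
        aggregate
    }

    pub fn open(title: &str) -> Self {
        let mut aggregate = Self::default();
        aggregate.record(IncidentEvent::Opened { title: title.to_string() });
        aggregate
    }

    /// Command: validates the invariant, then records the event.
    pub fn resolve(&mut self) -> Result<(), &'static str> {
        if self.status == Status::Resolved {
            return Err("incident is already resolved");
        }
        self.record(IncidentEvent::Resolved);
        Ok(())
    }

    /// Events a repository would append to the event store.
    pub fn take_uncommitted(&mut self) -> Vec<IncidentEvent> {
        std::mem::take(&mut self.uncommitted)
    }
}
```

A repository in this style persists the result of `take_uncommitted` and loads aggregates through `replay`, which is what makes the audit trail and temporal queries possible.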
# Unit tests
cargo test --lib
# Integration tests
cargo test --test integration
# Property-based tests
cargo test --test property_tests
# With coverage
cargo tarpaulin --out Html
# Benchmarks
cargo bench -- --test
# Development profile
[profile.dev]
opt-level = 0
debug = true
# Release profile
[profile.release]
opt-level = 3
lto = true
codegen-units = 256
strip = true
# Benchmark profile
[profile.bench]
debug = true
# Check code
cargo clippy --all-targets -- -D warnings
# Format code
cargo fmt
# Check formatting
cargo fmt -- --check
The project's tests cover all bounded contexts; the per-crate counts below sum to 145:
| Crate | Tests | Coverage Target |
|---|---|---|
| common | 60 | >80% |
| telemetry | 14 | >80% |
| anomaly | 8 | >80% |
| incident | 16 | >80% |
| integration | 16 | >80% |
| topology | 9 | >80% |
| knowledge | 6 | >80% |
| remediation | 16 | >80% |
# All tests
cargo test --workspace
# Specific crate
cargo test -p rustops-common
# Specific test
cargo test test_z_score_detector
# With output
cargo test -- --nocapture
# Run tests with four parallel test threads
cargo test --workspace -- --test-threads=4
# Run property tests
cargo test --test property_tests
# Generate test cases
cargo test --test property_tests -- --generate
# Run all benchmarks
cargo bench
# Specific benchmark
cargo bench --bench id_benchmark
# Run a benchmark in profiling mode for 10 seconds (for use with a profiler, e.g. to produce a flamegraph)
cargo bench --bench id_benchmark -- --profile-time=10
# Build image
docker build -t rustops-api:latest .
# Run container
docker run -d \
--name rustops-api \
-p 8080:8080 \
-v /etc/rustops:/etc/rustops:ro \
-e RUST_LOG=info \
rustops-api:latest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rustops-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rustops-api
  template:
    metadata:
      labels:
        app: rustops-api
    spec:
      containers:
        - name: rustops-api
          image: rustops/api:latest
          ports:
            - containerPort: 8080
          env:
            - name: RUST_LOG
              value: "info"
            - name: KAFKA_BROKERS
              value: "kafka:9092"
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: rustops-api
spec:
  selector:
    app: rustops-api
  ports:
    - port: 8080
      targetPort: 8080
  type: LoadBalancer
| Service | vCPU | Memory | Storage |
|---|---|---|---|
| rustops-api | 2 | 4GB | 10GB |
| rustops-agent | 1 | 2GB | 5GB |
| rustops-pipeline | 2 | 4GB | 10GB |
| Kafka | 2 | 4GB | 50GB |
| Neo4j | 2 | 4GB | 50GB |
| PostgreSQL | 2 | 4GB | 50GB |
| Redis | 1 | 2GB | 10GB |
| Prometheus | 1 | 2GB | 20GB |
| Grafana | 1 | 2GB | 10GB |
| Jaeger | 1 | 2GB | 10GB |
| Temporal | 2 | 4GB | 20GB |
| Fluentd | 1 | 2GB | 10GB |
| ClickHouse | 2 | 4GB | 50GB |
Total: 20 vCPU, 40GB RAM, 305GB storage
GET /health
GET /metrics
GET /api/v1/version
POST /api/v1/telemetry/metrics
POST /api/v1/telemetry/logs
POST /api/v1/telemetry/traces
GET /api/v1/incidents
GET /api/v1/incidents/:id
POST /api/v1/incidents
PUT /api/v1/incidents/:id
DELETE /api/v1/incidents/:id
GET /api/v1/incidents/:id/timeline
GET /api/v1/anomalies
GET /api/v1/anomalies/:id
POST /api/v1/anomalies/detect
PUT /api/v1/anomalies/:id/acknowledge
GET /api/v1/topology/services
GET /api/v1/topology/services/:id/dependencies
POST /api/v1/topology/analyze-impact
GET /api/v1/knowledge/search
POST /api/v1/knowledge/patterns
GET /api/v1/knowledge/patterns/:id
GET /api/v1/remediation/workflows
POST /api/v1/remediation/workflows
PUT /api/v1/remediation/workflows/:id
POST /api/v1/remediation/workflows/:id/approve
POST /api/v1/remediation/workflows/:id/execute
GET /api/v1/remediation/safety-checks
WS /api/v1/stream/metrics
WS /api/v1/stream/alerts
WS /api/v1/stream/topology
WS /api/v1/stream/workflows
# Create incident
curl -X POST http://localhost:8080/api/v1/incidents \
-H "Content-Type: application/json" \
-d '{
"title": "High CPU usage on payment-service",
"severity": "P2",
"description": "CPU usage at 95% for 5 minutes",
"labels": {"service": "payment-service"}
}'
# Search knowledge
curl -X POST http://localhost:8080/api/v1/knowledge/search \
-H "Content-Type: application/json" \
-d '{
"query": "database connection pool exhaustion",
"limit": 10
}'
# Trigger remediation
curl -X POST http://localhost:8080/api/v1/remediation/workflows \
-H "Content-Type: application/json" \
-d '{
"incident_id": "incident-123",
"workflow_type": "restart_service",
"parameters": {
"service_name": "payment-service",
"namespace": "production"
}
}'
We welcome contributions! Please see below for guidelines.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Write tests (cargo test --workspace)
- Format code (cargo fmt)
- Run lints (cargo clippy)
- Commit your changes (git commit -m "Add amazing feature")
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow Rust naming conventions
- Use cargo fmt for formatting
- Use cargo clippy for linting
- Document all public APIs
- Keep functions focused and small
- Write descriptive commit messages
- All tests must pass (cargo test --workspace)
- New features require tests
- Maintain >80% code coverage
- Add integration tests for external APIs
- Include property tests for algorithms
- Describe the change in the PR title
- Provide context in the PR description
- Link related issues
- Ensure CI checks pass
- Request review from relevant maintainers
- Keep PRs small and focused
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2025 RustOps Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
- Full Documentation
- API Reference
- Architecture Guide
- Development Guide
- Discussions
- Issues
- Email: support@rustops.com
Built with:
- Rust - Systems programming language
- Tokio - Async runtime
- Serde - Serialization framework
- Kube - Kubernetes client
- Prometheus - Metrics monitoring
- Temporal - Workflow orchestration
- Neo4j - Graph database
RustOps - Empowering DevOps teams with intelligent automation and observability.
For more information, visit https://github.com/rustops/rustops