
Conversation


@gocanto gocanto commented Nov 6, 2025

Go Live

  • GRAFANA_ADMIN_PASSWORD must be set in .env file

Summary

This commit adds Prometheus as a Docker service to monitor Caddy in production:

  • Added Prometheus service to docker-compose.yml with prod profile
  • Created prometheus.yml configuration to scrape Caddy metrics
  • Configured Caddy to expose admin API on port 2019 for metrics
  • Added prometheus_data volume for persistent storage
  • Bound Prometheus UI to localhost:9090 for security
  • Set a 30-day retention period for metrics data

Prometheus will now collect metrics from Caddy's admin API endpoint, enabling monitoring of proxy performance, request rates, and other metrics.
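
For reference, a minimal sketch of what the Caddy scrape job could look like in prometheus.yml (the target name and label values here are assumptions based on the conventions used elsewhere in this PR):

```yaml
global:
  scrape_interval: 15s        # matches the interval described in this PR

scrape_configs:
  - job_name: "caddy"
    static_configs:
      # assumed service name; Prometheus reaches Caddy's admin API over the Docker network
      - targets: ["caddy_prod:2019"]
        labels:
          service: "caddy"
          environment: "production"
```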

Summary by CodeRabbit

Release Notes

  • New Features
    • Added comprehensive monitoring stack with Prometheus and Grafana for real-time observability.
    • Integrated metrics collection from API, database, and web proxy services.
    • Introduced pre-configured dashboards displaying request rates, response times, database connections, and system health.
    • Added monitoring management commands for easy stack orchestration.
    • Included VPS deployment guide for production monitoring setup.


coderabbitai bot commented Nov 6, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR introduces a comprehensive monitoring and observability stack integrating Prometheus, Grafana, and postgres_exporter. It reorganizes build infrastructure from metal/ to infra/, adds Prometheus metrics collection via a new HTTP handler, deploys monitoring services via Docker Compose, and provides operational tooling and documentation for stack management.

Changes

Cohort / File(s) — Summary
Go module dependencies
go.mod
Added Prometheus Go client library (github.com/prometheus/client_golang v1.20.5) and related indirect dependencies for metrics instrumentation.
Go application code
handler/metrics.go, metal/kernel/app.go, metal/router/router.go
Introduced MetricsHandler for exposing Prometheus metrics via /metrics endpoint; integrated metrics initialization in app boot sequence (modem.Metrics()) and registered metrics route in router. A sketch of this handler follows the table.
Makefile infrastructure reorganization
Makefile, infra/makefile/app.mk, infra/makefile/caddy.mk, infra/makefile/monitor.mk
Switched include paths from metal/makefile/ to infra/makefile/; updated path references for Caddy and Docker artifacts; added comprehensive monitoring orchestration targets (monitor-up, monitor-down, monitor-status, monitor-grafana, etc.) in new monitor.mk.
Docker Compose services
docker-compose.yml
Added Prometheus, Grafana, and postgres_exporter services (local and production variants) with volumes, healthchecks, networks, dependencies, and port mappings; updated Caddy and API build contexts to infra/ paths; added dedicated metrics volumes.
Caddy configuration
infra/caddy/Dockerfile, infra/caddy/Caddyfile.local, infra/caddy/Caddyfile.prod
Updated Dockerfile path documentation; added internal metrics endpoint on port 9180 that proxies /metrics from admin API; enabled metrics collection in global blocks; protected /metrics and /api/metrics paths with 403 responses.
Prometheus configuration
infra/metrics/prometheus/provisioning/prometheus.yml, infra/metrics/prometheus/provisioning/prometheus.local.yml, infra/metrics/prometheus/scripts/postgres-exporter-entrypoint.sh
Configured global scraping (15s intervals) and static scrape targets for Caddy, PostgreSQL, API, and Prometheus self-monitoring; created entrypoint script for postgres_exporter credential handling via Docker secrets.
Grafana dashboards and provisioning
infra/metrics/grafana/dashboards/oullin-*.json, infra/metrics/grafana/provisioning/dashboards/default.yml, infra/metrics/grafana/provisioning/datasources/prometheus.yml, infra/metrics/grafana/scripts/export-dashboards.sh
Provisioned three pre-configured dashboards (Caddy Proxy Metrics, PostgreSQL Database Metrics, Oullin Overview) with Prometheus queries; configured datasource and dashboard provisioning; added dashboard export utility script.
Monitoring documentation and guides
infra/metrics/README.md, infra/metrics/VPS_DEPLOYMENT.md
Added comprehensive monitoring stack documentation covering architecture, quick start, security, dashboard management, troubleshooting, and maintenance; provided end-to-end VPS deployment guide with prerequisites, setup, configuration, verification, and production checklist.
Configuration and path updates
.env.example, .gitignore, storage/monitoring/backups/.gitkeep
Added ENV_HTTP_PORT and ENV_SPA_IMAGES_DIR environment variables; updated .gitignore Caddy mtls paths from caddy/ to infra/caddy/; created monitoring backups directory placeholder.
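
As referenced in the Go application code row above, a minimal sketch of what a promhttp-backed MetricsHandler could look like (the struct shape and field names are assumptions; only NewMetricsHandler and the /metrics route are confirmed by the walkthrough):

```go
package handler

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// MetricsHandler exposes the default Prometheus registry over HTTP.
type MetricsHandler struct {
	inner http.Handler
}

// NewMetricsHandler wraps promhttp.Handler so the router can register
// it directly with Handle("/metrics", ...).
func NewMetricsHandler() *MetricsHandler {
	return &MetricsHandler{inner: promhttp.Handler()}
}

// ServeHTTP delegates to the wrapped promhttp handler.
func (h *MetricsHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.inner.ServeHTTP(w, r)
}
```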

Sequence Diagram(s)

sequenceDiagram
    participant Boot as App Boot
    participant Metrics as modem.Metrics()
    participant Router as Router
    participant Handler as MetricsHandler
    participant Prometheus as Prometheus Scraper
    
    Boot->>Metrics: Initialize metrics during startup
    Metrics->>Router: Register /metrics route
    Note over Router: GET /metrics endpoint registered<br/>(internal container access)
    
    Prometheus->>Handler: GET localhost:8080/metrics
    Handler->>Handler: Serve Prometheus metrics<br/>(via promhttp.Handler)
    Handler-->>Prometheus: Return metric data
    
    Note over Prometheus: Scrapes via internal<br/>port 9180 (Caddy proxy)<br/>bypasses public listener
sequenceDiagram
    participant Client as External Client
    participant Caddy as Caddy (Public)
    participant CaddyMetrics as Caddy Metrics<br/>Port 9180
    participant Admin as Caddy Admin<br/>API 2019
    participant Prometheus as Prometheus
    
    Client->>Caddy: GET /metrics
    Caddy->>Caddy: Block /metrics (403)
    Caddy-->>Client: 403 Forbidden
    
    Prometheus->>CaddyMetrics: GET :9180/metrics
    CaddyMetrics->>Admin: Proxy to :2019/metrics
    Admin-->>CaddyMetrics: Metrics data
    CaddyMetrics-->>Prometheus: Return metrics
    
    Note over Prometheus,Admin: Internal Docker network<br/>access only

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Docker Compose service definitions: Verify networking, health checks, volume mounts, environment variable handling, resource limits, and dependencies for Prometheus, Grafana, and postgres_exporter services
  • Caddy metrics routing: Confirm internal port 9180 endpoint isolation, access control rules (403 for protected paths), and proper proxying to admin API on port 2019
  • Prometheus scrape configurations: Validate target addresses, labels, and endpoints for local vs. production profiles; verify postgres_exporter DSN construction and credential handling via Docker secrets
  • Grafana dashboard queries: Review Prometheus PromQL syntax across all dashboard JSON panels (rate functions, histogram_quantiles, aggregations)
  • Boot sequence integration: Ensure modem.Metrics() call placement in App.Boot and router registration of /metrics endpoint do not introduce errors or blocking calls
  • Monitoring Makefile orchestration: Audit target logic for stack management, traffic generation, backup/restore, and Docker Compose profile usage

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check — ❓ Inconclusive: The title 'chore: Metrics' is vague and does not convey meaningful information about the comprehensive changes in this PR, which include Prometheus, Grafana, monitoring stack setup, infrastructure reorganization, and multiple configuration changes. Resolution: consider a more specific title that reflects the main changes, such as 'chore: Add comprehensive monitoring stack with Prometheus and Grafana' or 'chore: Integrate Prometheus and Grafana for infrastructure monitoring'.

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: Check skipped — CodeRabbit's high-level summary is enabled.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

claude added 10 commits November 6, 2025 08:56
This commit extends the monitoring setup to include the API and database:

API Monitoring:
- Added Prometheus client library (prometheus/client_golang v1.20.5)
- Created metrics handler to expose Prometheus metrics at /metrics endpoint
- Integrated metrics route into API boot sequence
- Configured Prometheus to scrape API metrics from api:8080/metrics

Database Monitoring:
- Added postgres_exporter service to expose PostgreSQL metrics
- Configured with database credentials via Docker secrets
- Exposed on port 9187 for Prometheus scraping
- Added to both oullin_net and caddy_net networks

Prometheus Configuration:
- Updated scrape configs to monitor all services:
  * Caddy (proxy metrics)
  * PostgreSQL (database metrics via exporter)
  * API (application metrics)
  * Prometheus (self-monitoring)
- All targets labeled with service and environment tags

The monitoring stack now provides comprehensive observability across
the entire application infrastructure in production.

Bind Caddy admin API port 2019 to 127.0.0.1:2019 for secure
local access. This allows server administrators to access the
admin API and metrics endpoint directly while keeping it
inaccessible from external networks.
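
In docker-compose.yml terms, the binding described above would look roughly like this (service name assumed):

```yaml
caddy_local:
  ports:
    # loopback-only publish: reachable from the host, not from external networks
    - "127.0.0.1:2019:2019"
```
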
This commit adds Grafana as a visualization layer for the monitoring stack:

Grafana Service:
- Added Grafana 11.4.0 service to docker-compose.yml
- Bound to localhost:3000 for secure access
- Configured with admin credentials via environment variable
- Disabled public sign-up and anonymous access
- Added grafana_data volume for persistent storage

Data Source Configuration:
- Auto-provisioned Prometheus data source
- Configured to connect to prometheus:9090
- Set as default data source with 15s scrape interval

Pre-configured Dashboards:
1. Overview Dashboard - High-level metrics from all services
   - Caddy request rates
   - PostgreSQL connections
   - API memory and goroutines

2. PostgreSQL Dashboard - Detailed database monitoring
   - Connection tracking
   - Database size and growth
   - Transaction rates and operations
   - Cache hit ratio
   - Lock statistics

3. Caddy Dashboard - Proxy performance metrics
   - Request rates by status code
   - Response time percentiles (p50, p95, p99)
   - Traffic rates and connection states

Documentation:
- Added comprehensive README with access instructions
- Included example queries for custom analysis
- Provided troubleshooting guidance

The dashboards are automatically loaded on startup and provide
real-time visibility into application performance and health.
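
A sketch of what the auto-provisioned datasource file could contain, based on the settings listed above (the layout follows Grafana's standard provisioning schema; exact values beyond prometheus:9090 and 15s are assumptions):

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # Docker DNS name of the Prometheus service
    isDefault: true
    jsonData:
      timeInterval: 15s           # matches the Prometheus scrape interval
```
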
This commit adds complete local development support for testing
the monitoring stack before deploying to production.

Local Monitoring Services:
- Added prometheus_local service on port 9090
- Added grafana_local service on port 3000
- Added postgres_exporter_local service
- All services use 'local' profile for easy activation

Configuration Changes:
- Created prometheus.local.yml for local scraping targets
- Updated Caddyfile.local to expose admin API on port 2019
- Added metrics exposure to caddy_local service
- Reduced retention time to 7 days for local (vs 30 days prod)

Port Bindings (Local):
- Prometheus: 9090 (accessible from host)
- Grafana: 3000 (accessible from host)
- Caddy Admin: 2019 (accessible from host)
- Caddy Proxy: 8080/8443 (existing)

Documentation:
- Added comprehensive MONITORING.md guide covering:
  * Local testing instructions and verification steps
  * Production deployment guidelines
  * Troubleshooting common issues
  * Useful Prometheus queries for API, DB, and Caddy
  * Maintenance and backup procedures

Testing Workflow:
1. Start local stack: docker compose --profile local up -d
2. Access Grafana: http://localhost:3000
3. Verify targets: http://localhost:9090/targets
4. Generate traffic and observe metrics in dashboards

This enables developers to test the monitoring setup locally
before deploying changes to production, ensuring reliability
and making it easier to develop custom metrics and dashboards.

This commit adds comprehensive Make targets for managing the monitoring
stack, following the project's existing Makefile conventions.

New File:
- metal/makefile/monitor.mk - Complete monitoring management targets

Makefile Updates:
- Include monitor.mk in main Makefile
- Added Monitoring Commands section to help output

Available Targets:

Start/Stop:
- monitor:up                 - Start monitoring stack (local)
- monitor:up:prod            - Start monitoring stack (production)
- monitor:down               - Stop monitoring stack (local)
- monitor:down:prod          - Stop monitoring stack (production)
- monitor:restart            - Restart monitoring stack

Status & Logs:
- monitor:status             - Show status of all monitoring services
- monitor:logs               - Follow logs from all services
- monitor:logs:prometheus    - Show Prometheus logs
- monitor:logs:grafana       - Show Grafana logs
- monitor:logs:db            - Show PostgreSQL exporter logs

Testing & Verification:
- monitor:test               - Run full test suite
- monitor:targets            - Show Prometheus scrape targets status
- monitor:config             - Display Prometheus configuration

Access & Metrics:
- monitor:grafana            - Open Grafana in browser
- monitor:prometheus         - Open Prometheus in browser
- monitor:metrics            - List all metrics endpoints
- monitor:caddy-metrics      - Show Caddy metrics sample
- monitor:api-metrics        - Show API metrics sample
- monitor:db-metrics         - Show PostgreSQL metrics sample

Traffic Generation:
- monitor:traffic            - Generate 100 test requests
- monitor:traffic:heavy      - Generate 500 concurrent requests

Utilities:
- monitor:stats              - Show resource usage
- monitor:backup             - Backup Prometheus data
- monitor:clean              - Remove all monitoring data
- monitor:help               - Show detailed command help

Quick Start:
  make monitor:up      # Start the stack
  make monitor:test    # Verify setup
  make monitor:traffic # Generate traffic
  make monitor:grafana # View dashboards

All targets follow the project's color-coded output style and include
helpful status messages for better developer experience.

This commit extends monitor.mk with comprehensive Docker commands
for direct container and compose management.

New Docker Compose Targets:
- monitor:up:full           - Start full stack (API + DB + monitoring)
- monitor:up:full:prod      - Start full prod stack
- monitor:up:logs           - Start with logs in foreground
- monitor:down:remove       - Stop and remove containers
- monitor:pull              - Pull latest monitoring images

New Docker Container Targets:
- monitor:docker:ps         - Show running containers
- monitor:docker:config     - Display docker compose config
- monitor:docker:inspect    - Inspect container details
- monitor:docker:exec:prometheus - Shell into Prometheus
- monitor:docker:exec:grafana    - Shell into Grafana
- monitor:docker:logs:prometheus - Prometheus docker logs
- monitor:docker:logs:grafana    - Grafana docker logs
- monitor:docker:logs:db         - DB exporter docker logs

Updated Help Section:
- Added "Docker Commands" section with all new targets
- Added "Docker Compose Examples" with common commands
- Better organization and alignment in help output

These targets provide direct access to:
1. Docker Compose orchestration commands
2. Container inspection and debugging
3. Shell access to running containers
4. Alternative log viewing methods
5. Image management

Example Usage:
  make monitor:docker:ps               # List containers
  make monitor:docker:exec:prometheus  # Shell access
  make monitor:up:full                 # Start everything
  make monitor:pull                    # Update images

All targets follow the project's conventions and include
colored output for better readability.

This commit standardizes all monitoring Make targets to use hyphens
instead of colons for better consistency with the project's naming
conventions.

Changes:
- monitor:up          → monitor-up
- monitor:up:prod     → monitor-up-prod
- monitor:up:full     → monitor-up-full
- monitor:down        → monitor-down
- monitor:docker:ps   → monitor-docker-ps
- monitor:logs:grafana → monitor-logs-grafana
- And all other monitoring targets...

Updated Files:
- metal/makefile/monitor.mk - All target definitions
- Makefile - Main help section

Benefits:
- Consistent naming with other project targets (db:local, caddy-gen-cert)
- Easier to type and autocomplete
- Clearer hierarchy with hyphens

Usage Examples:
  make monitor-up                    # Start monitoring
  make monitor-docker-ps             # List containers
  make monitor-logs-prometheus       # View logs
  make monitor-help                  # Show all commands

All functionality remains unchanged, only naming convention updated.
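
For illustration, a hyphenated target might look like this (a sketch only; the color variables and exact recipe follow the project's style shown later in this thread, and recipe lines are tab-indented as Make requires):

```make
## Start monitoring stack (local profile) -- sketch, not the actual target
monitor-up:
	@printf "$(BOLD)$(CYAN)Starting monitoring stack (local)...$(NC)\n"
	@docker compose --profile local up -d
	@printf "$(BOLD)$(GREEN)✓ Monitoring stack started$(NC)\n"
```
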
This commit adds tooling and documentation for creating and managing
Grafana dashboards.

New Export Tool (scripts/export-grafana-dashboards.sh):
- Interactive script to export dashboards from Grafana UI
- Lists all available dashboards
- Supports exporting single or all dashboards
- Automatically formats JSON for provisioning
- Removes metadata and sets proper IDs

New Make Target:
- monitor-export-dashboards - Run the export tool

Comprehensive Dashboard Guide (grafana/DASHBOARD_GUIDE.md):
- Method 1: Create in UI and export (recommended)
- Method 2: Import community dashboards from grafana.com
- Method 3: Generate with Grafonnet (advanced)
- Method 4: Edit JSON directly

Covers:
- Step-by-step dashboard creation in Grafana UI
- Exporting dashboards manually and automatically
- Popular community dashboards for PostgreSQL, Go, Caddy
- Grafonnet example for programmatic generation
- Dashboard JSON structure and editing tips
- Best practices for dashboard design
- Example PromQL queries for all services
- Troubleshooting common issues

How Current Dashboards Were Created:
The existing dashboards (overview.json, postgresql.json, caddy.json)
were manually crafted following Grafana's schema. They serve as
templates and starting points for custom dashboards.

Workflow:
1. Create dashboard in Grafana UI
2. Export with: make monitor-export-dashboards
3. Edit if needed
4. Commit to git
5. Dashboards auto-load on Grafana restart

This makes dashboard management version-controlled, reproducible,
and easy to customize for specific monitoring needs.
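
The core of such an export flow can be sketched with Grafana's HTTP API and jq (UID and SLUG are hypothetical placeholders; the real script's internals may differ):

```sh
# fetch one dashboard by UID, strip the instance-specific id, and save it
# in a provisioning-friendly form
UID="oullin-caddy"            # hypothetical dashboard UID
SLUG="caddy-proxy-metrics"    # hypothetical sanitized title
curl -s -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
  "http://localhost:3000/api/dashboards/uid/${UID}" \
  | jq '.dashboard | .id = null' \
  > "./grafana/dashboards/${UID}-${SLUG}.json"
```
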
This commit reorganizes the dashboard export script to be located
with its related files in the grafana directory.

Changes:
- Moved: scripts/export-grafana-dashboards.sh → grafana/export-dashboards.sh
- Updated: metal/makefile/monitor.mk to reference new location
- Removed: empty scripts/ directory

Benefits:
- Better organization: dashboard tools live with dashboard files
- Clearer structure: grafana/ contains all Grafana-related assets
  * dashboards/     - Dashboard JSON files
  * provisioning/   - Provisioning configs
  * export-dashboards.sh - Export tool
  * DASHBOARD_GUIDE.md   - Documentation

The script still uses relative paths (./grafana/dashboards) so it
works correctly when run from project root via make targets.

Usage remains unchanged:
  make monitor-export-dashboards

- Move MONITORING.md to prometheus/ directory
- Move export-dashboards.sh to grafana/scripts/ directory
- Update monitor.mk to reference new script path

This reorganizes documentation and scripts closer to their respective
modules for better maintainability and discoverability.
@gocanto gocanto marked this pull request as ready for review November 7, 2025 00:58

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🧹 Nitpick comments (5)
prometheus/MONITORING.md (1)

81-86: Consider adding language specifiers to code blocks for better syntax highlighting.

The code blocks at lines 81, 98, and 111 that show example metric output would benefit from language identifiers (e.g., prometheus or text) for improved readability in rendered markdown.

Example: add a language identifier (e.g., "prometheus") to the opening fence for blocks such as:

  caddy_http_requests_total
  caddy_http_request_duration_seconds
  caddy_http_connections_open

docker-compose.yml (1)

`91-112`: **Consider adding resource limits for monitoring services.**

None of the monitoring services (Prometheus, Grafana, postgres_exporter) have resource limits defined. In production, unbounded resource usage can impact the main application services. Prometheus in particular can consume significant CPU and memory depending on the scrape frequency and retention period.



Add resource limits to each monitoring service:

```yaml
prometheus:
    # ... existing config ...
    deploy:
        resources:
            limits:
                cpus: '0.5'
                memory: 512M
            reservations:
                cpus: '0.25'
                memory: 256M
```

Adjust the values based on your expected metrics volume and retention period. For reference:

  • Prometheus: 256M-1G memory, 0.25-1 CPU depending on scrape load
  • Grafana: 128M-512M memory, 0.1-0.5 CPU
  • postgres_exporter: 64M-128M memory, 0.1 CPU

Also applies to: 114-135, 137-155, 157-175, 177-198, 200-221

metal/makefile/monitor.mk (3)

227-235: Traffic generation assumes local profile only.

The monitor-traffic and monitor-traffic-heavy targets hard-code localhost:8080, which is specific to the local profile. The production profile uses different ports (80/443 via Caddy), so these targets won't work for testing production monitoring.

Add separate targets or make the endpoint configurable:

## Generate test traffic to populate metrics (local)
monitor-traffic:
	@printf "$(BOLD)$(CYAN)Generating test traffic (local)...$(NC)\n"
	@printf "Making 100 requests to /ping endpoint...\n"
	@for i in $$(seq 1 100); do \
		curl -s http://localhost:8080/ping > /dev/null && printf "." || printf "$(RED)$(NC)"; \
		sleep 0.1; \
	done
	@printf "\n$(BOLD)$(GREEN)✓ Test traffic generated$(NC)\n\n"

## Generate test traffic to populate metrics (production)
monitor-traffic-prod:
	@printf "$(BOLD)$(CYAN)Generating test traffic (production)...$(NC)\n"
	@printf "Making 100 requests to /ping endpoint...\n"
	@for i in $$(seq 1 100); do \
		curl -s http://localhost/ping > /dev/null && printf "." || printf "$(RED)$(NC)"; \
		sleep 0.1; \
	done
	@printf "\n$(BOLD)$(GREEN)✓ Test traffic generated$(NC)\n\n"

Also applies to: 238-246


152-164: Test suite hard-codes localhost URLs.

The monitor-test target checks endpoints at localhost:9090, localhost:2019, localhost:8080, and localhost:3000. These URLs are specific to the local profile. In production, some of these ports are bound to 127.0.0.1 on the server and won't be accessible from a developer's machine.

Consider:

  1. Adding a separate monitor-test-prod target that adapts the checks for production
  2. Documenting that monitor-test is for local development only
  3. Or making the target detect which profile is running and adjust accordingly

Example documentation addition:

## Run full monitoring stack test suite (local profile only)
monitor-test:
	@printf "$(BOLD)$(CYAN)Running monitoring stack tests (local profile)...$(NC)\n\n"
	@printf "$(YELLOW)Note: This target is for local development. Production monitoring should be verified on the server.$(NC)\n\n"
	# ... rest of the target

272-277: Backup target creates unversioned tar files.

The backup filename includes a timestamp, but there's no cleanup or rotation strategy for old backups. Over time, the ./backups directory could accumulate many files and consume significant disk space.

Add backup rotation or document a cleanup strategy:

## Backup Prometheus data
monitor-backup:
	@printf "$(BOLD)$(CYAN)Backing up Prometheus data...$(NC)\n"
	@mkdir -p ./backups
	@docker run --rm -v prometheus_data:/data -v $(PWD)/backups:/backup alpine \
		tar czf /backup/prometheus-backup-$$(date +%Y%m%d-%H%M%S).tar.gz /data
	@printf "$(BOLD)$(GREEN)✓ Backup created in ./backups/$(NC)\n"
	@printf "$(YELLOW)Cleaning up old backups (keeping last 5)...$(NC)\n"
	@ls -t ./backups/prometheus-backup-*.tar.gz | tail -n +6 | xargs -r rm
	@printf "$(BOLD)$(GREEN)✓ Backup rotation complete$(NC)\n\n"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3143430 and 252d145.

📒 Files selected for processing (20)
  • Makefile (2 hunks)
  • caddy/Caddyfile.local (1 hunks)
  • caddy/Caddyfile.prod (1 hunks)
  • docker-compose.yml (4 hunks)
  • go.mod (1 hunks)
  • grafana/DASHBOARD_GUIDE.md (1 hunks)
  • grafana/README.md (1 hunks)
  • grafana/dashboards/caddy.json (1 hunks)
  • grafana/dashboards/overview.json (1 hunks)
  • grafana/dashboards/postgresql.json (1 hunks)
  • grafana/provisioning/dashboards/default.yml (1 hunks)
  • grafana/provisioning/datasources/prometheus.yml (1 hunks)
  • grafana/scripts/export-dashboards.sh (1 hunks)
  • handler/metrics.go (1 hunks)
  • metal/kernel/app.go (1 hunks)
  • metal/makefile/monitor.mk (1 hunks)
  • metal/router/router.go (1 hunks)
  • prometheus/MONITORING.md (1 hunks)
  • prometheus/prometheus.local.yml (1 hunks)
  • prometheus/prometheus.yml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
metal/router/router.go (1)
handler/metrics.go (1)
  • NewMetricsHandler (11-13)
🪛 GitHub Actions: Tests
handler/metrics.go

[error] 6-6: missing go.sum entry for module providing package github.com/prometheus/client_golang/prometheus/promhttp (imported by github.com/oullin/handler); to add: go mod tidy or go mod download

🪛 GitHub Check: test (1.25.3)
handler/metrics.go

[failure] 6-6:
missing go.sum entry for module providing package github.com/prometheus/client_golang/prometheus/promhttp (imported by github.com/oullin/handler); to add:

🪛 LanguageTool
prometheus/MONITORING.md

[uncategorized] ~3-~3: Possible missing comma found.
Context: ... for running and testing the monitoring stack both locally and in production. ## Sta...

(AI_HYDRA_LEO_MISSING_COMMA)


[typographical] ~163-~163: If specifying a range, consider using an en dash instead of a hyphen.
Context: ...ta: http://localhost:9090/graph 3. Wait 1-2 minutes for initial scraping **Problem...

(HYPHEN_TO_EN)

grafana/DASHBOARD_GUIDE.md

[uncategorized] ~20-~20: Possible missing preposition found.
Context: ...gh-level metrics from all services 2. postgresql.json - Detailed database monitoring 3...

(AI_HYDRA_LEO_MISSING_OF)


[misspelling] ~108-~108: This word is normally spelled as one.
Context: ...or-restart ``` Your dashboard will now auto-load on startup! --- ## Method 2: Use Comm...

(EN_COMPOUNDS_AUTO_LOAD)

🪛 markdownlint-cli2 (0.18.1)
grafana/README.md

114-114: Bare URL used

(MD034, no-bare-urls)

prometheus/MONITORING.md

43-43: Bare URL used

(MD034, no-bare-urls)


44-44: Bare URL used

(MD034, no-bare-urls)


45-45: Bare URL used

(MD034, no-bare-urls)


46-46: Bare URL used

(MD034, no-bare-urls)


66-66: Bare URL used

(MD034, no-bare-urls)


71-71: Bare URL used

(MD034, no-bare-urls)


72-72: Bare URL used

(MD034, no-bare-urls)


81-81: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


98-98: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


111-111: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


119-119: Bare URL used

(MD034, no-bare-urls)


162-162: Bare URL used

(MD034, no-bare-urls)


201-201: Bare URL used

(MD034, no-bare-urls)


202-202: Bare URL used

(MD034, no-bare-urls)


203-203: Bare URL used

(MD034, no-bare-urls)

🔇 Additional comments (16)
caddy/Caddyfile.local (1)

5-6: LGTM: Admin API exposure configured correctly for container environment.

The 0.0.0.0:2019 binding is appropriate within the containerized context, allowing Prometheus to scrape metrics from the caddy_local container over the Docker network. External access control is handled by the port mapping in docker-compose.yml.

prometheus/prometheus.yml (1)

1-40: LGTM: Prometheus configuration is well-structured.

The configuration correctly defines four scrape targets with appropriate labels and service discovery via Docker DNS. The 15-second scrape interval and 30-day retention (from docker-compose) provide a good balance between granularity and storage.

prometheus/MONITORING.md (1)

1-372: Excellent monitoring documentation!

This guide is comprehensive, well-structured, and provides clear instructions for both local testing and production deployment. The troubleshooting sections, PromQL examples, and maintenance procedures will be very helpful for operators.

metal/router/router.go (1)

95-98: LGTM: Metrics endpoint correctly registered without authentication middleware.

The implementation properly bypasses the authentication pipeline for the /metrics endpoint, which is standard practice for Prometheus scraping within a trusted network. The use of Handle() instead of HandleFunc() is correct since MetricsHandler implements http.Handler.

metal/kernel/app.go (1)

90-90: LGTM: Metrics registration properly integrated into boot sequence.

The modem.Metrics() call is appropriately placed after database initialization and ensures the metrics endpoint is available early in the application lifecycle.

grafana/provisioning/dashboards/default.yml (1)

1-13: LGTM: Grafana dashboard provisioning properly configured.

The configuration follows Grafana best practices with appropriate update intervals and UI editability enabled. The use of foldersFromFilesStructure will maintain organized dashboard layouts.

grafana/provisioning/datasources/prometheus.yml (1)

1-12: LGTM: Prometheus datasource correctly configured.

The datasource configuration properly uses Docker service names for discovery and sets appropriate intervals that align with the Prometheus scrape configuration (15s). The editable: false setting prevents accidental modifications in production.

go.mod (1)

16-16: Update prometheus/client_golang to v1.23.2.

The latest stable version is v1.23.2 (released 2025-09-05). The current dependency (v1.20.5) is several versions behind and should be updated to benefit from the latest fixes and improvements.

Makefile (2)

45-45: LGTM!

The inclusion of the monitoring module is well-structured and follows the existing pattern of modular Makefile organization.


108-116: LGTM!

The monitoring commands documentation is clear and provides appropriate entry points for users. The extra blank line maintains consistent spacing with other sections.

grafana/README.md (1)

1-135: LGTM! Comprehensive documentation.

The documentation provides clear access instructions, security notes, dashboard descriptions, and troubleshooting guidance. The structure and content are well-organized for operational use.

grafana/dashboards/postgresql.json (1)

490-496: Verify cache hit ratio handles zero denominator.

The cache hit ratio calculation could result in division by zero if both blks_hit and blks_read are zero (e.g., on a fresh database or during no activity periods). While PromQL typically handles this gracefully by returning NaN, it's worth verifying this doesn't cause panel rendering issues.

Test the dashboard with a fresh database or during idle periods to confirm the panel handles zero/NaN values appropriately. If issues arise, consider adding a check like:

(rate(pg_stat_database_blks_hit[5m]) > 0) and 
(rate(pg_stat_database_blks_hit[5m]) / (rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])))

Or use Grafana's "No value" handling in the panel settings.

grafana/dashboards/overview.json (1)

1-395: LGTM! Well-structured overview dashboard.

The dashboard provides a good high-level view with appropriate metrics from all services. The queries use proper rate functions and job labels for filtering.

grafana/DASHBOARD_GUIDE.md (1)

1-453: LGTM! Excellent dashboard creation guide.

This comprehensive guide provides multiple methods for dashboard creation, best practices, example queries, and troubleshooting steps. The content is well-organized and will be valuable for maintaining the monitoring stack.

metal/makefile/monitor.mk (1)

10-18: LGTM! Clear and informative monitoring stack management.

The monitoring stack start targets are well-structured with clear output, appropriate wait times for services to initialize, and helpful access point information. The separation between local and production profiles is clean and consistent.

Also applies to: 21-29

docker-compose.yml (1)

99-99: Verify Prometheus retention period aligns with monitoring strategy.

Production Prometheus is configured with 30-day retention, while local uses 7 days. Confirm this choice supports your use case:

  • Storage impact: 30 days of metrics consumes significant disk space depending on cardinality and scrape frequency
  • Query performance: Longer retention can slow queries over large time ranges
  • Backup strategy: Extended retention increases importance of backup/disaster recovery planning

If long-term metrics storage is needed, consider:

  1. Using Prometheus remote write to a long-term storage solution (Thanos, Cortex, VictoriaMetrics)
  2. Reducing retention and implementing a separate long-term storage strategy
  3. Ensuring adequate disk space monitoring for the prometheus_data volume

To estimate actual storage consumption after deployment:

docker exec oullin_prometheus du -sh /prometheus
docker exec oullin_prometheus df -h /prometheus

Also applies to: 122-122
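
For option 1, the hook is a small remote_write block in the Prometheus config (the storage backend and URL here are hypothetical):

```yaml
# prometheus.yml -- forward samples to a long-term store via remote write
remote_write:
  - url: http://victoria-metrics:8428/api/v1/write   # hypothetical endpoint
```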

@gocanto gocanto marked this pull request as draft November 7, 2025 01:05
claude and others added 11 commits November 7, 2025 01:07
Replace invalid environment variable shell syntax with command wrapper
that properly reads Docker secret files.

Problem: Docker Compose does not execute shell commands in environment
variables, so $(cat /run/secrets/...) was being passed as a literal
string instead of reading the secret files.

Solution: Use command override with /bin/sh to read secrets at runtime,
construct the DATA_SOURCE_NAME, and exec the postgres_exporter binary.

This fix applies to both postgres_exporter (prod) and
postgres_exporter_local services.

SECURITY FIX: Remove Caddy admin API exposure to host in production

The Caddy admin API exposes dangerous unauthenticated endpoints:
- /load: Load new configuration
- /config: Modify runtime configuration
- /stop: Shutdown Caddy

Changes:
1. Production (caddy_prod):
   - REMOVED port mapping for 2019 (was 127.0.0.1:2019:2019)
   - Admin API only accessible within Docker network
   - Prometheus can still scrape via caddy_prod:2019 using Docker DNS

2. Local (caddy_local):
   - Changed from 2019:2019 to 127.0.0.1:2019:2019
   - Restricts access to localhost only for debugging

3. Documentation (MONITORING.md):
   - Added comprehensive Security Model section
   - Documented admin API security implications
   - Updated production access instructions
   - Added security best practices

Rationale:
- Prometheus scrapes metrics via Docker's internal network, not host
- No legitimate need for host access to admin API in production
- Defense in depth: minimize attack surface
- Follows same security principle as Prometheus/Grafana localhost binding

References:
- Caddy admin API docs: https://caddyserver.com/docs/api
- Docker networking: Services can communicate via service names

…orter

Use a dedicated shell script instead of inline command substitution for
better reliability and maintainability.

Changes:
1. Created prometheus/postgres-exporter-entrypoint.sh:
   - Reads Docker secrets at runtime using shell substitution
   - Constructs DATA_SOURCE_NAME environment variable
   - Executes postgres_exporter with proper argument forwarding

2. Updated both postgres_exporter services (prod and local):
   - Replaced multi-line command override with clean entrypoint
   - Mount entrypoint script as read-only volume
   - More maintainable and follows best practices

Benefits:
- Cleaner docker-compose.yml configuration
- Easier to debug and modify connection string
- Follows standard pattern used by other services (e.g., api-db)
- Entrypoint script can be tested independently
- Supports passing additional arguments to postgres_exporter

The script properly uses "set -e" for error handling and "exec" to
replace the shell process with postgres_exporter (making it PID 1
for proper signal handling).
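
A sketch of what such an entrypoint could look like (secret file names, database host, and binary path are assumptions based on this thread):

```sh
#!/bin/sh
set -e

# read credentials from Docker secrets at runtime (file names assumed)
DB_USER="$(cat /run/secrets/pg_username)"
DB_PASS="$(cat /run/secrets/pg_password)"

# build the DSN postgres_exporter expects; api-db is the database service name
export DATA_SOURCE_NAME="postgresql://${DB_USER}:${DB_PASS}@api-db:5432/postgres?sslmode=disable"

# exec so postgres_exporter becomes PID 1 and receives signals directly
exec /bin/postgres_exporter "$@"
```
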
SECURITY FIX: Require explicit GRAFANA_ADMIN_PASSWORD configuration

The default "admin" password is a well-known credential that poses a
security risk, even with localhost-only binding. An attacker with SSH
access or through other vulnerabilities could exploit this default.

Changes:
1. docker-compose.yml (both grafana and grafana_local services):
   - Changed from: ${GRAFANA_ADMIN_PASSWORD:-admin}
   - Changed to: ${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set in .env file}
   - Docker Compose will now fail with clear error if password not set

2. prometheus/MONITORING.md:
   - Added mandatory password requirement to Prerequisites
   - Added Setup section with clear instructions
   - Included password generation example (openssl rand -base64 32)
   - Updated Security Model section to document this requirement
   - Added Production Prerequisites section
   - Updated credentials table for clarity

Behavior:
- If GRAFANA_ADMIN_PASSWORD is not set, docker-compose will exit with:
  "GRAFANA_ADMIN_PASSWORD must be set in .env file"
- Users must explicitly set a secure password
- No fallback to insecure defaults

Benefits:
- Eliminates well-known default credential vulnerability
- Forces security-conscious configuration
- Clear error message guides users to fix
- Defense in depth: complements localhost-only binding
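
In compose terms the resulting line would look roughly like this (GF_SECURITY_ADMIN_PASSWORD is Grafana's standard admin-password variable; the surrounding service block is assumed):

```yaml
grafana:
  environment:
    # fails fast with a clear error when the variable is unset or empty
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?GRAFANA_ADMIN_PASSWORD must be set in .env file}
```
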
CRITICAL FIX: Dashboard was using incorrect metric names that don't exist
in Caddy's default metrics endpoint, causing all Caddy panels to show no data.

Changes to grafana/dashboards/caddy.json:
1. Fixed caddy_http_requests_total → caddy_http_request_count_total
   - Line 56: Total Request Rate panel
   - Line 194: Requests by Status Code panel

2. Removed non-existent metrics panels:
   - Removed "Active Connections" panel (id: 2)
     Used caddy_http_connections_open which doesn't exist in default Caddy
   - Removed "Connection States" panel (id: 7)
     Used caddy_http_connections_open and caddy_http_connections_idle
     Both metrics require third-party exporters, not provided by Caddy

3. Fixed "Traffic Rate" panel (id: 6):
   - Kept caddy_http_response_size_bytes_sum (exists)
   - Removed caddy_http_request_size_bytes_sum (doesn't exist)
   - Renamed to "Response Traffic Rate"

4. Added "Request Errors" panel (id: 7):
   - Uses caddy_http_request_errors_total (exists)
   - Shows error rate over time

Changes to grafana/dashboards/overview.json:
- Fixed caddy_http_requests_total → caddy_http_request_count_total (2 occurrences)

Changes to documentation (MONITORING.md, README.md, DASHBOARD_GUIDE.md):
- Updated all example metric names to match official Caddy metrics
- Replaced caddy_http_requests_total → caddy_http_request_count_total
- Removed caddy_http_connections_open references
- Added caddy_http_request_errors_total examples

Correct Caddy metrics (verified from official docs):
- caddy_http_request_count_total (counter)
- caddy_http_request_duration_seconds (histogram: _bucket, _sum, _count)
- caddy_http_response_size_bytes (histogram: _bucket, _sum, _count)
- caddy_http_request_errors_total (counter)

Note: Connection metrics (open, idle) are not provided by default Caddy.
These would require additional exporters or plugins if needed.
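
Example PromQL against the corrected names (illustrative queries, not taken from the dashboards themselves):

```
# request rate over the last 5 minutes
rate(caddy_http_request_count_total[5m])

# p95 latency from the duration histogram
histogram_quantile(0.95, rate(caddy_http_request_duration_seconds_bucket[5m]))
```
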
SECURITY/RELIABILITY FIX: Prevent filename collisions when exporting
dashboards with titles that differ only by special characters.

Problem:
The export script sanitized dashboard titles to create filenames by
removing special characters. This could cause collisions:
- "My Dashboard!" → "my-dashboard.json"
- "My Dashboard?" → "my-dashboard.json"
The second export would silently overwrite the first.

Solution:
Prefix filenames with the dashboard UID, which is guaranteed unique
by Grafana:
- "My Dashboard!" → "abc123-my-dashboard.json"
- "My Dashboard?" → "xyz789-my-dashboard.json"

Changes:
1. grafana/scripts/export-dashboards.sh:
   - Line 50 (export all): FILENAME="${UID}-$(sanitized title).json"
   - Line 65 (export single): FILENAME="${UID}-$(sanitized title).json"
   - UID is already extracted, just need to prepend it

2. Renamed existing dashboards to match new convention:
   - caddy.json → oullin-caddy-caddy-proxy-metrics.json
   - overview.json → oullin-overview-oullin-overview.json
   - postgresql.json → oullin-postgresql-postgresql-database-metrics.json

Benefits:
- Guaranteed unique filenames (UIDs are unique in Grafana)
- No silent overwrites
- Easier to identify dashboard files by UID
- Consistent with dashboard provisioning best practices

The UID prefix makes it easy to correlate exported files with their
source dashboards in Grafana's UI.

RELIABILITY FIX: Prevent corrupt or incomplete JSON files when dashboard
exports fail due to network errors, authentication issues, or jq failures.

Problem:
Without error handling, failed curl or jq operations would:
- Create empty or corrupt JSON files
- Continue silently without user awareness
- Break dashboard provisioning on next Grafana restart
- No way to know which exports succeeded or failed

Solution:
Added comprehensive error handling for both export paths:

1. Export all dashboards:
   - Wrap curl+jq pipeline in if statement
   - Temporarily disable errexit (set +e) for the operation
   - Verify exported file is valid JSON and not empty
   - Delete corrupt files immediately (rm -f)
   - Track success/failure counts
   - Show clear status: "✓ Success" or "✗ Failed (reason)"
   - Display summary at end: "X succeeded, Y failed"
   - Exit with code 1 if any exports failed

2. Export single dashboard:
   - Same validation as export-all
   - Exit with code 1 on failure
   - Clear error messages for user
   - Validate selection input

3. Additional improvements:
   - Redirect stderr to /dev/null (2>/dev/null) to suppress jq warnings
   - Use -n with echo for inline status updates
   - Validate selection is not empty

Error scenarios handled:
- Network failures (curl timeout, connection refused)
- Authentication failures (401, 403)
- Dashboard not found (404)
- jq parse errors (invalid JSON response)
- Empty responses
- Filesystem errors (permissions, disk full)

Example output:
```
Exporting: Caddy - Proxy Metrics -> oullin-caddy-... ✓ Success
Exporting: Broken Dashboard -> oullin-broken-... ✗ Failed (invalid JSON)

Export summary: 2 succeeded, 1 failed
```

This ensures dashboard provisioning only uses valid, complete files.

SECURITY/UX FIX: Validate user input before processing dashboard selection
to prevent sed failures and provide clear error messages.

Problem:
Without validation, various invalid inputs could cause issues:
- Non-numeric input (e.g., "abc", "!@#") → sed fails silently
- Zero or negative numbers → sed returns empty or unexpected results
- Out-of-range numbers → sed returns empty string (caught later, but unclear)
- Empty input → sed fails
- Special characters → potential injection risks

Solution:
Added comprehensive input validation before processing selection:

1. Type validation:
   - Check if input matches regex ^[0-9]+$ (positive integers only)
   - Allows "all" as special case
   - Rejects: negative numbers, decimals, text, special chars, empty

2. Range validation:
   - Count total dashboards: DASHBOARD_COUNT=$(echo "$DASHBOARDS" | wc -l)
   - Verify selection is between 1 and DASHBOARD_COUNT (inclusive)
   - Clear error message shows valid range

3. Error messages:
   - "Error: Please enter a valid number or 'all'" - for non-numeric
   - "Error: Selection out of range (1-N)" - for out-of-range numbers

Examples of invalid input now caught:
- Input: "abc" → Error: Please enter a valid number or 'all'
- Input: "0" → Error: Selection out of range (1-3)
- Input: "999" → Error: Selection out of range (1-3)
- Input: "-1" → Error: Please enter a valid number or 'all'
- Input: "1.5" → Error: Please enter a valid number or 'all'
- Input: "1; rm -rf /" → Error: Please enter a valid number or 'all'

Valid input examples:
- Input: "1" → Exports first dashboard ✓
- Input: "all" → Exports all dashboards ✓

Defense in depth:
- Primary validation at input (this commit)
- Secondary validation later (SELECTED_LINE empty check - kept as safeguard)

This improves security, reliability, and user experience with clear,
actionable error messages.
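
A POSIX-sh sketch of the validation described above (variable names are assumptions; the commit's ^[0-9]+$ regex check is expressed here as a case pattern):

```sh
case "$SELECTION" in
  all)
    ;;  # special case: export everything
  ''|*[!0-9]*)
    echo "Error: Please enter a valid number or 'all'"; exit 1 ;;
  *)
    if [ "$SELECTION" -lt 1 ] || [ "$SELECTION" -gt "$DASHBOARD_COUNT" ]; then
      echo "Error: Selection out of range (1-$DASHBOARD_COUNT)"; exit 1
    fi ;;
esac
```
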
PORTABILITY FIX: Replace bash-specific {1..N} syntax with POSIX-compliant
seq command to ensure targets work with /bin/sh on all systems.

Problem:
Brace expansion {1..100} is a bash-specific feature that doesn't work
in POSIX sh. On systems where make uses /bin/sh instead of bash, these
targets would fail with:

  /bin/sh: 1: Syntax error: Bad for loop variable

This affects:
- monitor-traffic: Makes 100 requests for testing
- monitor-traffic-heavy: Makes 500 requests (100 iterations × 5 concurrent)

Solution:
Replace all brace expansions with seq command, which is POSIX-compliant
and available on all Unix-like systems:

  {1..100} → $$(seq 1 100)
  {1..5}   → $$(seq 1 5)

Note: Double $$ is required in Makefiles to pass a literal $ to the shell.

Changes:
- Line 230: for i in {1..100} → for i in $$(seq 1 100)
- Line 241: for i in {1..100} → for i in $$(seq 1 100)
- Line 242: for j in {1..5}   → for j in $$(seq 1 5)

Tested behavior is identical:
- seq 1 100 produces: 1 2 3 ... 100 (same as {1..100})
- seq 1 5 produces: 1 2 3 4 5 (same as {1..5})

This ensures the monitoring traffic generation targets work reliably
across all Unix-like systems regardless of the default shell.

CONSISTENCY FIX: Standardize environment label to 'local' across all
metrics to enable consistent filtering in Grafana.

Problem:
Inconsistent environment labels between global and job-specific configs:
- Global external_labels: environment: 'development' (line 7)
- Job-specific labels: environment: 'local' (lines 16, 24, 33, 41)

This inconsistency causes:
- Confusion when filtering metrics in Grafana dashboards
- Metrics with mixed environment labels (some 'development', some 'local')
- Difficult to query all local metrics consistently

Solution:
Standardize on 'local' throughout the file because:
- Filename is prometheus.local.yml (indicates local environment)
- All 4 job-specific labels already use 'local'
- Matches the naming convention (oullin-local, caddy_local, etc.)
- Clearer distinction from production environment

Changed:
  external_labels:
    monitor: 'oullin-local'
-   environment: 'development'
+   environment: 'local'

Now all metrics will have consistent environment='local' label, making
Grafana queries simpler:
- Before: {environment="development"} OR {environment="local"}
- After: {environment="local"}

This aligns with the production config which uses environment='production'
consistently.
@gocanto gocanto added the testing label Nov 7, 2025
@gocanto gocanto changed the title Add Prometheus monitoring for Caddy in production chore: Add Prometheus monitoring for Caddy in production Nov 7, 2025
@gocanto gocanto marked this pull request as ready for review November 7, 2025 02:51
@gocanto gocanto marked this pull request as ready for review November 11, 2025 09:13

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
docker-compose.yml (1)

238-276: Grafana admin password requirement properly enforced; consider resource limits for local profile.

Both grafana and grafana_local services correctly enforce GRAFANA_ADMIN_PASSWORD via ?GRAFANA_ADMIN_PASSWORD must be set in .env file (lines 248, 287), preventing insecure defaults.

However, grafana_local omits the deploy resource limits section that grafana includes (lines 268-275). While intentional for development flexibility, this inconsistency may lead to unexpected resource consumption during local testing. Consider whether local should also define conservative limits.

Also applies to: 277-306

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d55a68 and 711673b.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • docker-compose.yml (6 hunks)
  • infra/metrics/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • README.md
🔇 Additional comments (9)
infra/metrics/README.md (3)

1-70: Documentation is clear and well-organized.

The monitoring stack documentation provides comprehensive coverage of architecture, deployment, configuration, and troubleshooting. Markdown formatting is clean with properly wrapped URLs. The configuration consistency section accurately reflects environment-specific differences.


151-194: Verify Quick Start instructions match docker-compose service configuration.

The quick start section references make monitor-up, make monitor-targets, make monitor-traffic, and container access on localhost:2019 for Caddy admin API. These should be validated against the actual Makefile targets and docker-compose configuration provided in the PR.


422-541: Verify Hostinger VPS deployment instructions align with current infrastructure.

The deployment guide provides detailed steps for Ubuntu VPS setup. Ensure that:

  • The make monitor-up-prod, make monitor-down-prod, make monitor-logs-* targets exist and work as documented
  • Docker secrets setup (lines 387-399) aligns with the actual docker-compose secret definitions
  • Caddy configuration paths referenced in the SSL section (line 514) match the updated ./infra/caddy/Caddyfile.prod path in docker-compose.yml (line 69)
docker-compose.yml (6)

15-22: New volumes properly configured for monitoring data persistence.

The addition of prometheus_data_prod, prometheus_data_local, grafana_data_prod, and grafana_data_local volumes with driver: local correctly enables persistent storage for both environments. This decouples data from container lifecycle.


52-72: Caddy metrics exposure properly configured.

Both caddy_prod and caddy_local correctly expose port 9180 for Prometheus metrics scraping. Documentation in comments (lines 52-61) clearly explains the separation of admin API (2019, internal for prod) from metrics endpoint (9180, internal network).

Also applies to: 94-102


106-143: Prometheus services well-configured with appropriate retention and resource limits.

Both Prometheus services include:

  • Correct healthcheck using http://localhost:9090/-/healthy
  • Proper retention: 30 days for production (line 114), 7 days for local (line 153)
  • Resource constraints: prod has explicit limits (lines 136-143), local omits for flexibility

Network configuration correctly places Prometheus on both caddy_net and oullin_net to reach all metrics sources.

Also applies to: 145-174


252-252: Verify Grafana datasource connectivity to Prometheus via Docker DNS.

Both grafana and grafana_local services reference Prometheus via environment variable:

  • Prod: GF_DATASOURCE_PROMETHEUS_URL=http://oullin_prometheus:9090 (line 252)
  • Local: GF_DATASOURCE_PROMETHEUS_URL=http://oullin_prometheus_local:9090 (line 291)

Grafana services are configured on caddy_net only, while Prometheus services are on both caddy_net and oullin_net. DNS resolution via container name should work because Prometheus is reachable on the shared caddy_net, but verify connectivity works in practice.

Also applies to: 291-291


316-316: Dockerfile paths correctly updated to infra structure.

Both api-runner and api services now reference ./infra/docker/dockerfile-api, aligning with the centralized infrastructure directory structure. This is consistent with the updated Caddy build context.

Also applies to: 358-358


176-209: Entrypoint script verification confirmed.

The postgres-exporter entrypoint script exists at the expected path with proper executable permissions (755), and both postgres_exporter and postgres_exporter_local services correctly reference and mount it. The configuration properly allows runtime construction of DATA_SOURCE_NAME from Docker secrets. No issues found.

@gocanto gocanto marked this pull request as draft November 12, 2025 03:01
@gocanto gocanto marked this pull request as ready for review November 12, 2025 03:52

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
metal/router/router.go (1)

95-104: Security concern from previous review has been addressed.

The previous review flagged that /api/metrics would be publicly accessible. This has been resolved:

  • Production: @protected_public matcher in Caddyfile.prod (line 46) blocks /api/metrics → returns 403
  • Local: Direct requests to /metrics are blocked by @protected matcher (Caddyfile.local line 49); /api/metrics requests would return 404 since the route doesn't exist
  • Prometheus: Scrapes directly from api:8080/metrics via internal Docker network, bypassing Caddy

The implementation correctly ensures metrics are not publicly exposed.
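
A two-sided check of that model — the domain below is a placeholder and the service names are assumptions:

```bash
# Through Caddy, the path must be blocked (403 from @protected_public).
curl -s -o /dev/null -w '%{http_code}\n' https://your-domain.example/api/metrics
# Inside the Docker network, Prometheus's scrape target should still answer.
docker compose exec prometheus wget -qO- http://api:8080/metrics | head -n 3
```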

docker-compose.yml (2)

184-217: Previous shell substitution issue has been resolved.

Both postgres_exporter services now correctly use an entrypoint script (/postgres-exporter-entrypoint.sh) to build the DATA_SOURCE_NAME from Docker secrets at runtime, addressing the previous review concern about shell command substitution not working in Docker Compose environment variables.

Also applies to: 219-252


254-291: Previous password security issue has been resolved.

Both Grafana services now require GRAFANA_ADMIN_PASSWORD to be explicitly set using the :? syntax, preventing the use of insecure default credentials. This addresses the previous security concern.

Also applies to: 293-330

infra/caddy/Caddyfile.prod (1)

8-11: Admin API should bind to 127.0.0.1 for production security.

The previous review correctly identified that binding the admin API to 0.0.0.0:2019 exposes administrative control endpoints (/load, /config, /stop) to all containers on the Docker network without authentication.

Since the :9180 metrics endpoint proxies to localhost:2019 (line 139), the admin API doesn't need to listen on all interfaces. Binding to 127.0.0.1:2019 would restrict access to within the container only.

Apply this diff:

-	admin 0.0.0.0:2019
+	admin 127.0.0.1:2019

This ensures only the Caddy container itself can access the admin API, while Prometheus can still scrape metrics via the :9180 endpoint.
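
If applied, the change is easy to verify from any peer container (service names assumed):

```bash
# The admin API should now be unreachable from the network...
docker compose exec prometheus wget -qO- -T 2 http://caddy_prod:2019/config/ \
  || echo "admin API blocked, as intended"
# ...while the dedicated metrics listener keeps serving.
docker compose exec prometheus wget -qO- http://caddy_prod:9180/metrics | head -n 3
```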

🧹 Nitpick comments (2)
infra/caddy/Caddyfile.local (2)

11-14: Consider binding admin API to 127.0.0.1 for defense in depth.

The admin API is bound to 0.0.0.0:2019, making it accessible to all containers on the Docker network. Since the :9180 metrics endpoint proxies to localhost:2019 (line 65), the admin API doesn't need to listen on all interfaces.

Apply this diff for additional security:

-	admin 0.0.0.0:2019
+	admin 127.0.0.1:2019

Note: The port is already published to 127.0.0.1:2019 in docker-compose.yml (line 92), so this change would only affect inter-container access.


48-52: Add /api/metrics to protected paths for consistency.

The @protected matcher blocks /metrics but not /api/metrics. While /api/metrics would return 404 (route doesn't exist), explicitly blocking it would maintain consistency with production configuration (Caddyfile.prod line 46) and prevent confusion.

Apply this diff:

-	@protected path /metrics /generate-signature*
+	@protected path /metrics /api/metrics /generate-signature*
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 711673b and e0766fd.

📒 Files selected for processing (5)
  • README.md (1 hunks)
  • docker-compose.yml (6 hunks)
  • infra/caddy/Caddyfile.local (2 hunks)
  • infra/caddy/Caddyfile.prod (3 hunks)
  • metal/router/router.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
metal/router/router.go (1)
handler/metrics.go (1)
  • NewMetricsHandler (12-14)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (1.25.3)
🔇 Additional comments (3)
README.md (1)

19-51: Excellent documentation additions!

The new sections on Grafana dashboard management and metrics endpoint security provide clear, actionable guidance. The security model explanation is particularly valuable for preventing regressions.

docker-compose.yml (2)

15-22: LGTM!

The new persistent volumes for Prometheus and Grafana data are properly configured with separate volumes for prod and local environments.


106-143: Well-configured Prometheus services.

The Prometheus configurations are solid:

  • Appropriate retention periods (30d prod, 7d local)
  • Localhost-only binding for prod (127.0.0.1:9090)
  • Resource limits prevent runaway resource consumption
  • Healthchecks ensure service availability
  • Proper dependencies on exporters

Also applies to: 145-182

@gocanto gocanto marked this pull request as draft November 12, 2025 03:58
@gocanto gocanto force-pushed the claude/add-prometheus-docker-monitoring-011CUrM7Pj8H5qPebaAWNvXv branch from 13dab92 to 4de2096 Compare November 12, 2025 06:23
@gocanto gocanto marked this pull request as ready for review November 12, 2025 07:07
@gocanto
Copy link
Collaborator Author

gocanto commented Nov 12, 2025

@codex review


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
docker-compose.yml (2)

184-253: LGTM: postgres_exporter properly configured with entrypoint script.

The previous issue with shell command substitution in DATA_SOURCE_NAME has been correctly resolved:

  • ✅ Uses entrypoint script to construct DSN at runtime
  • ✅ Reads credentials from Docker secrets
  • ✅ Proper health checks and resource limits
  • ✅ Appropriate network configuration

254-331: LGTM: Grafana security improved with required password.

The previous security issue has been correctly addressed:

  • GRAFANA_ADMIN_PASSWORD now required via :? syntax (no default)
  • ✅ Proper provisioning and dashboard mounts
  • ✅ Appropriate startup time (30s start_period for Grafana)
  • ✅ Reasonable resource limits

This prevents deployment with default credentials.
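
A quick local demonstration of the :? behavior:

```bash
# With the variable unset, compose refuses to even render the config.
unset GRAFANA_ADMIN_PASSWORD
docker compose config >/dev/null
# => fails with an error like:
# required variable GRAFANA_ADMIN_PASSWORD is missing a value
```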

🧹 Nitpick comments (1)
infra/metrics/VPS_DEPLOYMENT.md (1)

248-263: Clarify placeholder path in cron job.

Line 262 uses the hardcoded placeholder path /home/deployer/your-repo. Users may miss this, and the cron job will then fail.

Add a note above the cron example:

+# NOTE: Update /home/deployer/your-repo to your actual repository path
 # Run daily at 2 AM
 0 2 * * * cd /home/deployer/your-repo && make monitor-backup-prod >> /var/log/prometheus-backup.log 2>&1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0766fdad3d70af72865de5c5f30a3dd9c788962 and 6b92a9ac25aced0df624c4d85565579f3bedcdd4.

📒 Files selected for processing (6)
  • docker-compose.yml (6 hunks)
  • infra/caddy/Caddyfile.local (2 hunks)
  • infra/caddy/Caddyfile.prod (3 hunks)
  • infra/makefile/monitor.mk (1 hunks)
  • infra/metrics/README.md (1 hunks)
  • infra/metrics/VPS_DEPLOYMENT.md (1 hunks)

✅ Files skipped from review due to trivial changes (1)
  • infra/metrics/README.md

🚧 Files skipped from review as they are similar to previous changes (1)
  • infra/caddy/Caddyfile.prod

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (1.25.3)

🔇 Additional comments (12)
infra/caddy/Caddyfile.local (3)

5-9: LGTM: Metrics collection enabled correctly.

The servers.metrics block properly enables HTTP handler metrics collection at the global level, which is required for Prometheus scraping.


47-51: LGTM: Protected paths properly blocked.

This configuration correctly prevents external access to sensitive endpoints. Returning 403 for /metrics and /generate-signature* ensures these are only accessible via the internal :9180 listener.


59-70: Excellent security design for internal metrics.

This configuration implements defense in depth:

  • Dedicated port (9180) for Prometheus scraping
  • Only /metrics path exposed (404 for others)
  • Admin API remains on localhost:2019
  • Not published to host (Docker network only)

This follows the principle of least privilege and prevents accidental exposure of the admin API.
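
To see both properties at once (service names assumed):

```bash
# /metrics answers; any other path on :9180 is rejected with a 404.
docker compose exec prometheus_local wget -qO- http://caddy_local:9180/metrics | head -n 3
docker compose exec prometheus_local wget -q -O /dev/null http://caddy_local:9180/other \
  || echo "non-metrics path rejected, as designed"
```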

docker-compose.yml (4)

15-22: LGTM: Volume separation for local and production.

The separate volumes for local and production environments prevent data conflicts and follow best practices for environment isolation.


52-65: LGTM: Admin API comments clarified and metrics endpoint exposed.

The updated comments accurately reflect that:

  • Admin API listens on all interfaces within the container but is NOT published to the host
  • Prometheus scrapes from the dedicated :9180 endpoint via Docker internal DNS

The expose directive correctly makes port 9180 available only on the Docker network, not on the host.


106-183: LGTM: Prometheus services properly configured.

Both Prometheus services have:

  • ✅ Appropriate retention periods (30d prod, 7d local)
  • ✅ Correct port bindings (prod restricted to localhost)
  • ✅ Robust health checks with proper intervals
  • ✅ Reasonable resource limits (1 CPU, 1GB RAM)
  • ✅ Proper startup dependencies via depends_on

The configuration follows container best practices.


332-405: LGTM: Build context paths updated for infra reorganization.

The Dockerfile paths have been correctly updated from ./docker/dockerfile-api to ./infra/docker/dockerfile-api, aligning with the infrastructure reorganization.
infra/makefile/monitor.mk (5)

1-41: LGTM: Well-organized configuration variables.

The configuration section is clean and maintainable:

  • ✅ Parameterized paths and URLs for flexibility
  • ✅ Conditional authentication flag (lines 29-31) only adds credentials when both are present; see the shell sketch below
  • ✅ Clear separation of local vs production endpoints
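
To make the conditional-flag idea concrete, here is the same behavior in plain shell — the variable names (MONITOR_USER, MONITOR_PASS) are illustrative assumptions, not necessarily the names monitor.mk uses:

```bash
# Attach HTTP basic-auth to curl only when both credentials are set;
# otherwise the request goes out unauthenticated.
AUTH_FLAG=""
if [ -n "${MONITOR_USER:-}" ] && [ -n "${MONITOR_PASS:-}" ]; then
  AUTH_FLAG="-u ${MONITOR_USER}:${MONITOR_PASS}"
fi
curl -fsS $AUTH_FLAG "http://localhost:9090/-/healthy"
```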

337-386: LGTM: Traffic generation with proper credential validation.

The traffic generation targets are well-implemented (a rough sketch follows this list):

  • ✅ Validates credentials before proceeding (lines 339-342)
  • ✅ Proper error handling and visual feedback
  • ✅ Reasonable throttling (0.1s between requests)
  • ✅ Separate local and production variants
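
As a rough sketch of what such a throttled loop looks like — the endpoint path and request count here are assumptions, not the actual recipe:

```bash
# Fire 50 throttled requests and report each status code.
# /api/posts and the count are placeholders, not the real target's values.
for i in $(seq 1 50); do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/api/posts)
  echo "request $i -> $code"
  sleep 0.1   # matches the 0.1s throttle noted above
done
```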

429-452: LGTM: Backup rotation uses portable implementation.

The backup rotation (lines 437, 449) correctly uses a portable for-loop pattern instead of xargs -r:

```bash
for f in $$(ls -t ... | tail -n +6); do rm -f "$$f"; done || true
```

This works on macOS, BSD, and Linux, and properly:

  • Keeps the 5 most recent backups
  • Deletes older backups
  • Handles empty results gracefully

249-265: LGTM: Comprehensive monitoring test suite.

The test target provides excellent coverage:

  • ✅ Service status verification
  • ✅ Prometheus target health checks
  • ✅ Endpoint accessibility tests (Caddy, API, Grafana)
  • ✅ Clear success/failure feedback
  • ✅ Appropriate scope limiting (local profile only)

458-530: Excellent documentation with comprehensive help target.

The monitor-help target provides outstanding usability:

  • ✅ Well-organized categories for easy navigation
  • ✅ Quick start guide for new users (lines 521-525)
  • ✅ Docker Compose examples for reference (lines 526-530)
  • ✅ Complete command reference with descriptions

This significantly improves the developer experience.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@gocanto gocanto marked this pull request as draft November 12, 2025 07:14
@gocanto gocanto marked this pull request as ready for review November 12, 2025 07:35
@gocanto gocanto merged commit 0f174da into main Nov 12, 2025
3 checks passed
@gocanto gocanto deleted the claude/add-prometheus-docker-monitoring-011CUrM7Pj8H5qPebaAWNvXv branch November 12, 2025 07:35