4 changes: 3 additions & 1 deletion .sync.yml
@@ -24,7 +24,7 @@ appveyor.yml:
unmanaged: true
.github/workflows/spec.yml:
checks: 'syntax lint metadata_lint check:symlinks check:git_ignore check:dot_underscore check:test_file rubocop'
unmanaged: false
unmanaged: true
.github/workflows/release.yml:
unmanaged: true
.travis.yml:
@@ -34,6 +34,8 @@ spec/spec_helper.rb:
coverage_report: true
Rakefile:
changelog_user: "puppetlabs"
extra_disabled_lint_checks:
- 'lookup_in_parameter'
spec/default_facts.yml:
extra_facts:
pe_build: '2021.5.0'
27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,33 @@

All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) and this project adheres to [Semantic Versioning](http://semver.org).

## [v1.0.0](https://github.com/puppetlabs/puppet_operational_dashboards/tree/v1.0.0) (2022-05-04)

[Full Changelog](https://github.com/puppetlabs/puppet_operational_dashboards/compare/v0.2.0...v1.0.0)

### Changed

- \(SUP-3061\) Install class for ent infrastructure agents [\#36](https://github.com/puppetlabs/puppet_operational_dashboards/pull/36) ([MartyEwings](https://github.com/MartyEwings))

### Added

- Display all http client and function metrics [\#43](https://github.com/puppetlabs/puppet_operational_dashboards/pull/43) ([m0dular](https://github.com/m0dular))
- Add panels for PDB read and write pools [\#42](https://github.com/puppetlabs/puppet_operational_dashboards/pull/42) ([m0dular](https://github.com/m0dular))
- \(SUP-3243\) Add index stats for pe-puppetdb tables [\#40](https://github.com/puppetlabs/puppet_operational_dashboards/pull/40) ([m0dular](https://github.com/m0dular))
- \(SUP-3241\) Add in Dashboard documentation [\#39](https://github.com/puppetlabs/puppet_operational_dashboards/pull/39) ([MartyEwings](https://github.com/MartyEwings))
- Check for existance of keys in dict [\#26](https://github.com/puppetlabs/puppet_operational_dashboards/pull/26) ([m0dular](https://github.com/m0dular))

### Fixed

- \(SUP-3235\) Use latest telegraf package on Ubuntu [\#38](https://github.com/puppetlabs/puppet_operational_dashboards/pull/38) ([m0dular](https://github.com/m0dular))
- make resource ordering specific to install class [\#37](https://github.com/puppetlabs/puppet_operational_dashboards/pull/37) ([MartyEwings](https://github.com/MartyEwings))
- \(SUP-3228\) Fix Ubuntu compatibility issue [\#35](https://github.com/puppetlabs/puppet_operational_dashboards/pull/35) ([MartyEwings](https://github.com/MartyEwings))
- \(SUP-3201\) Check port availability with systemd [\#33](https://github.com/puppetlabs/puppet_operational_dashboards/pull/33) ([m0dular](https://github.com/m0dular))
- \(SUP-3201\) Accept any Sensitive value in template [\#32](https://github.com/puppetlabs/puppet_operational_dashboards/pull/32) ([m0dular](https://github.com/m0dular))
- \(SUP-3209\) Grant pg\_monitor role to telegraf [\#31](https://github.com/puppetlabs/puppet_operational_dashboards/pull/31) ([m0dular](https://github.com/m0dular))
- \(SUP-3201\) Make Grafana datasource idempotent [\#30](https://github.com/puppetlabs/puppet_operational_dashboards/pull/30) ([m0dular](https://github.com/m0dular))
- Fix handling of 'error' entry in dict [\#28](https://github.com/puppetlabs/puppet_operational_dashboards/pull/28) ([m0dular](https://github.com/m0dular))

## [v0.2.0](https://github.com/puppetlabs/puppet_operational_dashboards/tree/v0.2.0) (2022-03-11)

[Full Changelog](https://github.com/puppetlabs/puppet_operational_dashboards/compare/v0.1.2...v0.2.0)
219 changes: 209 additions & 10 deletions README.md
@@ -2,11 +2,26 @@

## Table of Contents

1. [Description](#description)
1. [Setup - The basics of getting started with puppet_operational_dashboards](#setup)
* [Beginning with puppet_operational_dashboards](#beginning-with-puppet_operational_dashboards)
1. [Usage - Configuration options and additional functionality](#usage)
* [Determining where Telegraf runs](#determining-where-telegraf-runs)
- [puppet_operational_dashboards](#puppet_operational_dashboards)
- [Table of Contents](#table-of-contents)
- [Description](#description)
- [Setup](#setup)
- [Prerequisites](#prerequisites)
- [Beginning with puppet_operational_dashboards](#beginning-with-puppet_operational_dashboards)
- [Installing on Puppet Enterprise](#installing-on-puppet-enterprise)
- [Installing on Puppet Open Source](#installing-on-puppet-open-source)
- [What puppet_operational_dashboards affects](#what-puppet_operational_dashboards-affects)
- [Usage](#usage)
- [Evaluation order](#evaluation-order)
- [Determining where Telegraf runs](#determining-where-telegraf-runs)
- [Importing archive metrics](#importing-archive-metrics)
- [Default Dashboards Available](#default-dashboards-available)
- [Puppetserver Performance](#puppetserver-performance)
- [Puppetserver Workload](#puppetserver-workload)
- [File Sync Metrics](#file-sync-metrics)
- [PuppetDB Performance](#puppetdb-performance)
- [PuppetDB Workload](#puppetdb-workload)
- [Postgres Metrics](#postgres-metrics)

## Description

@@ -17,19 +32,45 @@ This module is a replacement for the [puppet_metrics_dashboard module](https://f

### Prerequisites

The toml-rb gem needs to be installed in the Puppetserver gem space, which can be done with the [influxdb::profile::toml](https://github.com/puppetlabs/influxdb/blob/main/manifests/profile/toml.pp) class in the InfluxDB module.
### Beginning with puppet_operational_dashboards

To collect PostgreSQL metrics, classify your PostgreSQL nodes with the [puppet_operational_dashboards::profile::postgres_access](https://github.com/puppetlabs/puppet_operational_dashboards/blob/main/manifests/profile/postgres_access.pp) class. FOSS users will need to manually configure the PostgreSQL authentication settings.
#### Installing on Puppet Enterprise

### Beginning with puppet_operational_dashboards
To install on Puppet Enterprise:

1. Classify `puppet_operational_dashboards::enterprise_infrastructure` to a node group that encompasses all Puppet Infrastructure agents. The default node group `PE Infrastructure Agent` is appropriate.

```
include puppet_operational_dashboards::enterprise_infrastructure
```

This will install the toml-rb gem on compiling nodes and, on all database nodes, grant the dashboard node the appropriate access to the databases.

2. Classify `puppet_operational_dashboards` to the Puppet agent node to be designated as the Operational Dashboard node.

```
include puppet_operational_dashboards
```
This will install and configure Telegraf, InfluxDB, and Grafana.

The easiest way to get started using this module is by including the `puppet_operational_dashboards` class to install and configure Telegraf, InfluxDB, and Grafana. Note that you also need to install the toml-rb gem according to the [prerequisites](#setup-prerequisites).

Note that database access is not granted until the Puppet agent runs on the PostgreSQL nodes after `puppet_operational_dashboards` has been applied on the designated dashboard node.


#### Installing on Puppet Open Source

The toml-rb gem needs to be installed in the Puppetserver gem space, which can be done with the [influxdb::profile::toml](https://github.com/puppetlabs/influxdb/blob/main/manifests/profile/toml.pp) class in the InfluxDB module.

To collect PostgreSQL metrics, FOSS users will need to manually configure the PostgreSQL authentication settings.

The easiest way to get started using this module is by including the `puppet_operational_dashboards` class to install and configure Telegraf, InfluxDB, and Grafana. Note that you also need to install the toml-rb gem as described above.

```
include puppet_operational_dashboards
```

Doing so will:
#### What puppet_operational_dashboards affects
Installing the module will:

* Install and configure InfluxDB using the [puppetlabs/influxdb module](https://forge.puppet.com/modules/puppetlabs/influxdb#what-influxdb-affects)
* Install and configure Telegraf to collect metrics from your PE infrastructure. FOSS users can specify a list of infrastructure nodes via the `puppet_operational_dashboards::telegraf::agent` parameters.
@@ -51,6 +92,14 @@ These parameters take precedence over the file on disk if both are specified.

## Usage

### Evaluation order

When using the default configuration options and the deferred function to retrieve the Telegraf token, note that the token will not be available during the initial Puppet agent run that creates all of the resources. A second run is required to retrieve the token and update the resources that use it. If you are seeing authentication errors from Telegraf and Grafana, make sure the Puppet agent has been run twice and that the token has made its way to the Telegraf service config file:

```
/etc/systemd/system/telegraf.service.d/override.conf
```
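The check described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `token_in_override` is hypothetical, and it assumes the token appears in the override file under the variable name `INFLUX_TOKEN` (the name used in the troubleshooting section), optionally with a quoted value.

```shell
#!/bin/sh
# Sketch: report whether the Telegraf systemd override file carries a
# non-empty INFLUX_TOKEN value.
# Assumption: the token is passed to the service as INFLUX_TOKEN=<value>.

token_in_override() {
    # Default to the path named in the README; a different path can be passed in.
    override_file="${1:-/etc/systemd/system/telegraf.service.d/override.conf}"

    if [ ! -r "$override_file" ]; then
        echo "missing"
        return 1
    fi

    # Match INFLUX_TOKEN= followed by an optional quote and at least one
    # character that is neither a quote nor whitespace (a non-empty value).
    if grep -q 'INFLUX_TOKEN="\{0,1\}[^"[:space:]]' "$override_file"; then
        echo "present"
    else
        echo "absent"
        return 1
    fi
}
```

Calling `token_in_override` with no arguments checks the default path. A result of `missing` or `absent` after the first agent run is expected, given the two-run behavior described above.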

### Determining where Telegraf runs

Which hosts a node collects metrics from is determined by the `puppet_operational_dashboards::telegraf::agent::collection_method` parameter. By default, the `puppet_operational_dashboards` class will collect metrics from all nodes in a PE infrastructure. If you want to change this behavior, set `collection_method` to `local` or `none`. Telegraf can be run on other nodes by applying the `puppet_operational_dashboards::telegraf::agent` class to them, for example:
@@ -93,3 +142,153 @@ Or one service at a time, e.g. for Puppet server
```
telegraf --once --debug --config ~/telegraf.conf --config ~/telegraf.conf.d//puppetserver.conf
```

## Default Dashboards Available
#### Puppetserver Performance
This dashboard is to inspect Puppet server performance and troubleshoot the `pe-puppetserver` service. Available panels:
- Puppetserver Performance
This is a composite panel consisting of the following JRuby related metrics:
- Average free JRubies
- Average requested JRubies
- Average JRuby borrow time
- Average JRuby wait time
- Heap Memory and Uptime
This panel displays the following JVM metrics:
- Heap Committed
- Heap Used
- Uptime
- Average Requested JRubies
- Average Borrow/Compile Time
- Average Free JRubies
- Average Wait Time
- HTTP Client Metrics
This panel displays metrics for the HTTP client requests made by Puppet server. Examples include:
- puppetdb.query.full_response
- facts.find.full_response
- Borrow Timers Mean
Average duration of API requests that require borrowing a JRuby from the pool
- Borrow Timers Rate
Rate at which Puppet server performs the above API requests
- Function Timers
Average duration of functions run as part of catalog compilations
- Function Timers Count
Rate at which Puppet server runs the above functions

**Use Case**
- Puppetserver service performance degraded
- 503 responses to agent requests
- Agent unable to get catalog
- Inspect performance for a particular type of request
- Inspect which type of request could be a performance bottleneck
#### File Sync Metrics
This dashboard is to inspect File-sync related performance. Available Graphs:
- Number of Fetch / Commits vs Lock wait / held
- Average Lock Held Time
- Average Lock Wait Time
- Number of Commits
- Number of Fetches
- File-Sync timing - Client Services
- Average Clone Time
- Average Fetch Time
- Average Sync Time
- Average Sync Clean Time
- File-Sync timing - Storage Services
- Average Commit add / rm Time
- Average Commit time
- Average Clean Check time
- Average Pre-commit Hook Time

**Use Case**
- Code Manager takes a significant time or fails to deploy code
- Puppetserver frequently locked due to file sync
- Compilers do not have the latest code available
#### PuppetDB Performance
This dashboard is to inspect PuppetDB performance and troubleshoot the `pe-puppetdb` service. Available panels:
- Heap
- Commands Per Second
- Command Processing Time
- Queue Depth
- Replace Catalog Time
- Replace Facts Time
- Store Report Time
- Average Read Duration
- Read Pool Pending Connections
- Average Write Duration
- Write Pool Pending Connections

**Use Case**
- Any PuppetDB performance issues
- Troubleshooting Read/Write Pool Errors

#### Postgres Performance
This dashboard is to inspect PostgreSQL database performance. Available panels:
- Temp Files
Changes in temp file sizes per database over the given time interval
- Sizes by Database (total)
Total size of each database, including tables, indexes, and toast
- Sizes by Table
Size of each table, not including indexes or toast
- Sizes by Index
- Sizes by Toast
- Autovacuum Activity
- Vacuum Activity - (not auto, not full)
- I/O - heap toast and index - hits / reads
- Disk Block Reads (Heap)
Changes in the number of disk block reads for PostgreSQL heap files, per table. A read here means the value had to be retrieved from disk instead of the cache.
- Cache Reads (Heap)
Changes in the number of cache reads for PostgreSQL heap files, per table. A read here means the value was retrieved from the cache.
- Disk Block Reads (Index)
Same as above panel, but for indexes
- Cache Reads (Index)
Same as above panel, but for indexes
- Disk Block Reads (Toast)
Same as above panel, but for toast data
- Cache Reads (Toast)
Same as above panel, but for toast data
- Live / Dead Tuples
- Deadlocks

**Use Cases**
- Monitor table sizes
- Monitor Deadlocks and Slow Queries
- Any PostgreSQL performance issues
## Limitations

### Ubuntu Telegraf Package
Currently, only the latest Telegraf package is provided by the Ubuntu repository. Therefore, the only allowed value for `puppet_operational_dashboards::telegraf::agent::version` is `latest`. Setting this parameter to a different value on Ubuntu will produce a warning.

### Upgrading from puppet_metrics_dashboard
This module uses InfluxDB 2.x, while `puppet_metrics_dashboard` uses 1.x. This module does not currently provide an option to upgrade between these versions, so it is recommended to either install this module on a new node or manually upgrade. See the [InfluxDB docs](https://docs.influxdata.com/influxdb/v2.2/upgrade/v1-to-v2/) for more information about upgrading.

## Troubleshooting
If data is not displaying in Grafana or you see errors in Telegraf collections, try checking the following items.

### Grafana datasource and time interval
A common reason for not seeing data in the dashboards is choosing the wrong datasource or time interval. Double-check that you have selected a datasource and window of time for which metrics have been collected. Also, check that the `server` filter at the top of the dashboard contains valid entries.

Also, note that Telegraf performs its first collection after the first collection interval has passed. You may need to wait for this to pass, or manually test using the method below.

Datasources can be tested via the "Data Sources" configuration page in Grafana. Select the datasource, e.g. `influxdb_puppet`, and click the "Test" button. Note that because this is a "provisioned datasource," it cannot be edited in the UI.

### Telegraf errors
A good way to test Telegraf collection is to use the `--test` option. After logging into the node running `telegraf`, first export your token:
```
export INFLUX_TOKEN=<token>
```

The token can either be the admin token written to `/root/.influxdb_token` by default, or the `puppet telegraf token` used specifically for Telegraf. See `REFERENCE.md` for more information.

Prepending a space to the `export` command prevents the token from being written to your shell's history, provided the shell is configured to skip such commands (for example, `HISTCONTROL=ignorespace` in Bash).
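As an alternative to typing the token at all, it can be read from the token file. The following is a minimal sketch, not part of the module: `load_influx_token` is a hypothetical helper, and it assumes the file contains only the token, possibly followed by a newline.

```shell
#!/bin/sh
# Sketch: export INFLUX_TOKEN from a file so the token never appears
# on the command line or in shell history.
# Assumption: the file holds nothing but the token itself.

load_influx_token() {
    # Default to the admin token path named in the README.
    token_file="${1:-/root/.influxdb_token}"

    if [ ! -r "$token_file" ]; then
        echo "cannot read $token_file" >&2
        return 1
    fi

    # Drop the trailing newline and any stray whitespace.
    INFLUX_TOKEN=$(tr -d '[:space:]' < "$token_file")

    if [ -z "$INFLUX_TOKEN" ]; then
        echo "$token_file is empty" >&2
        return 1
    fi

    export INFLUX_TOKEN
}
```

The function has to be defined or sourced in the same shell session that later runs `telegraf`, since an exported variable is only visible to that shell and its child processes.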

Then, test the collection:
```
telegraf --test --debug --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d/
```

Services can also be tested individually, for example:

```
telegraf --test --debug --config /etc/telegraf/telegraf.conf --config /etc/telegraf/telegraf.d/puppetserver_metrics.conf
```

will only collect Puppet server metrics.
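To test every service in turn, the per-file command can be generated for each config file in the directory. This is a sketch under assumptions: `list_telegraf_tests` is a hypothetical helper, it assumes the service configs end in `.conf`, and it only prints the commands so they can be reviewed before being run.

```shell
#!/bin/sh
# Sketch: print one `telegraf --test` command per config file, using the
# same flags as the single-service example above.
# Assumption: paths contain no spaces, since the commands are printed as plain text.

list_telegraf_tests() {
    main_conf="${1:-/etc/telegraf/telegraf.conf}"
    conf_dir="${2:-/etc/telegraf/telegraf.d}"

    for conf in "$conf_dir"/*.conf; do
        # Skip the unexpanded pattern when the directory has no .conf files.
        [ -e "$conf" ] || continue
        echo "telegraf --test --debug --config $main_conf --config $conf"
    done
}
```

Running `list_telegraf_tests` with no arguments uses the default paths shown earlier. Each printed line can then be run on its own to see which service's collection fails.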