Add plan for Puppet agent state summary
bastelfreak committed May 3, 2024
1 parent 793f801 commit a762009
Showing 3 changed files with 156 additions and 0 deletions.
72 changes: 72 additions & 0 deletions README.md
@@ -157,6 +157,78 @@ environment. You can plot it in a more human-readable way with the
[puppet/format](https://github.com/voxpupuli/puppet-format?tab=readme-ov-file#puppet-format)
module.


The plan `pe_status_check::agent_state_summary` returns a hash of all nodes, grouped by state, together with counters:

```json
{
"noop": [ ],
"corrective_changes": [ ],
"used_cached_catalog": [ ],
"failed": [ ],
"changed": [ "student2.local" ],
"unresponsive": [ "student3.local", "student4.local", "student1.local", "login.local" ],
"responsive": [ "pe.bastelfreak.local"],
"unhealthy": [ "student2.local", "student3.local", "student4.local", "student1.local", "login.local" ],
"unhealthy_counter": 5,
"healthy": [ "pe.bastelfreak.local" ],
"healthy_counter": 1,
"total_counter": 6
}
```

* `noop`: The last catalog was applied in noop mode
* `failed`: The last catalog couldn't be compiled, or applying it raised an error
* `changed`: The node reported changes in the last run
* `unresponsive`: The last report is older than 30 minutes (the default; configurable via the `runinterval` parameter)
* `corrective_changes`: The node reported corrective changes
* `used_cached_catalog`: The node didn't apply a new catalog but used a cached version
* `unhealthy`: Nodes that fall into any of the above categories
* `responsive`: The last report is no older than 30 minutes (configurable via the `runinterval` parameter), regardless of whether the report itself is healthy
* `healthy`: All nodes that aren't in `unhealthy`
* `unhealthy_counter`: Number of unhealthy nodes
* `healthy_counter`: Number of healthy nodes
* `total_counter`: Number of all nodes in PuppetDB

This plan is meant to be run before major upgrades, to verify that your agents are in a healthy state.
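
For example, a wrapper plan could refuse to continue an upgrade while any agent is unhealthy. A minimal sketch, assuming Bolt is configured with PuppetDB access; the plan name `myorg::upgrade_preflight` is hypothetical and only `pe_status_check::agent_state_summary` comes from this module:

```puppet
# Sketch of a pre-upgrade gate; everything besides the plan call is illustrative.
plan myorg::upgrade_preflight () {
  $summary = run_plan('pe_status_check::agent_state_summary')

  unless $summary['unhealthy_counter'] == 0 {
    fail_plan("${summary['unhealthy_counter']} unhealthy agent(s): ${summary['unhealthy'].join(', ')}")
  }

  out::message('All agents are healthy, continuing with the upgrade')
}
```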

To render the result as a table from within another plan:

```puppet
$result = run_plan('pe_status_check::agent_state_summary', '_catch_errors' => true)
$table = format::table(
{
title => 'Puppet Agent states',
head => ['status check', 'Nodes'],
rows => $result.map |$key, $data| { [$key, [$data].flatten.join(', ')]},
}
)
out::message($table)
```

Example output:

```
+------------------------------------------------+
| Puppet Agent states |
+---------------------+--------------------------+
| status check | Nodes |
+---------------------+--------------------------+
| noop | |
| corrective_changes | |
| used_cached_catalog | |
| failed | |
| changed | |
| unresponsive | |
| responsive | puppet.bastelfreak.local |
| unhealthy | |
| unhealthy_counter | 0 |
| healthy | puppet.bastelfreak.local |
| healthy_counter | 1 |
| total_counter | 1 |
+---------------------+--------------------------+
```

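The individual groups can also be used on their own. For example, the unresponsive nodes from the `$result` above could be turned into Bolt targets for further inspection (a sketch; the `puppet` service name may differ on your platform):

```puppet
# Sketch: check the agent service on all unresponsive nodes
$stale = get_targets($result['unresponsive'])
run_command('systemctl status puppet', $stale, '_catch_errors' => true)
```
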
### Using a Puppet Query to report status.
As the pe_status_check module uses Puppet's existing fact behavior to gather the status data from each of the agents, it is possible to use PQL (Puppet Query Language) to gather this information.
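
For example, from within a plan such a query might look like the following sketch; it assumes the agent-side facts are published under `agent_status_check` with indicator keys such as `S0001`, both of which may differ in your installation:

```puppet
# Sketch: certnames whose agent_status_check fact reports indicator S0001 as false.
# The fact name and indicator key are assumptions; adjust them to your deployment.
$rows  = puppetdb_query('inventory[certname] { facts.agent_status_check.S0001 = false }')
$names = $rows.map |$row| { $row['certname'] }
out::message($names.join("\n"))
```
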
19 changes: 19 additions & 0 deletions REFERENCE.md
@@ -11,6 +11,7 @@

### Plans

* [`pe_status_check::agent_state_summary`](#pe_status_check--agent_state_summary): provides an overview of all Puppet agents and their error states
* [`pe_status_check::agent_summary`](#pe_status_check--agent_summary): Summary report of the state of agent_status_check on each node
Uses the facts task to get the current status from each node
and produces a summary report in JSON
@@ -84,6 +85,24 @@ Default value: `true`

## Plans

### <a name="pe_status_check--agent_state_summary"></a>`pe_status_check::agent_state_summary`

provides an overview of all Puppet agents and their error states

#### Parameters

The following parameters are available in the `pe_status_check::agent_state_summary` plan:

* [`runinterval`](#-pe_status_check--agent_state_summary--runinterval)

##### <a name="-pe_status_check--agent_state_summary--runinterval"></a>`runinterval`

Data type: `Integer[0]`

The runinterval for the Puppet agent, in minutes. Nodes whose latest report is older than this are considered unresponsive.

Default value: `30`
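
For example, to only treat nodes as unresponsive after 60 minutes, the parameter can be overridden when calling the plan from another plan (a sketch):

```puppet
$summary = run_plan('pe_status_check::agent_state_summary', 'runinterval' => 60)
```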

### <a name="pe_status_check--agent_summary"></a>`pe_status_check::agent_summary`

Summary report of the state of agent_status_check on each node
65 changes: 65 additions & 0 deletions plans/agent_state_summary.pp
@@ -0,0 +1,65 @@
#
# @summary provides an overview of all Puppet agents and their error states
#
# @param runinterval The runinterval for the Puppet agent, in minutes. Nodes whose latest report is older than this are considered unresponsive.
#
# @author Tim Meusel <tim@bastelfreak.de>
#
plan pe_status_check::agent_state_summary (
Integer[0] $runinterval = 30,
) {
# a list of all nodes and their latest catalog state
$nodes = puppetdb_query('nodes[certname,latest_report_noop,latest_report_corrective_change,cached_catalog_status,latest_report_status,report_timestamp]{}')
$fqdns = $nodes.map |$node| { $node['certname'] }

# check if the last catalog is older than X minutes
$current_timestamp = Integer(Timestamp().strftime('%s'))
$runinterval_seconds = $runinterval * 60
$unresponsive = $nodes.map |$node| {
$old_timestamp = Integer(Timestamp($node['report_timestamp']).strftime('%s'))
if ($current_timestamp - $old_timestamp) >= $runinterval_seconds {
$node['certname']
}
}.delete_undef_values

# all nodes that delivered a report in time
$responsive = $fqdns - $unresponsive

# all nodes that used noop for the last catalog
$noop = $nodes.map |$node| { if ($node['latest_report_noop'] == true) { $node['certname'] } }.delete_undef_values

# all nodes that reported corrective changes
$corrective_changes = $nodes.map |$node| { if ($node['latest_report_corrective_change'] == true) { $node['certname'] } }.delete_undef_values

# all nodes that used a cached catalog on the last run
$used_cached_catalog = $nodes.map |$node| { if ($node['cached_catalog_status'] != 'not_used') { $node['certname'] } }.delete_undef_values

# all nodes with failed resources in the last report
$failed = $nodes.map |$node| { if ($node['latest_report_status'] == 'failed') { $node['certname'] } }.delete_undef_values

# all nodes with changes in the last report
$changed = $nodes.map |$node| { if ($node['latest_report_status'] == 'changed') { $node['certname'] } }.delete_undef_values

# all nodes that aren't healthy in any form
$unhealthy = [$noop, $corrective_changes, $used_cached_catalog, $failed, $changed, $unresponsive].flatten.unique

# all healthy nodes
$healthy = $fqdns - $unhealthy

$data = {
'noop' => $noop,
'corrective_changes' => $corrective_changes,
'used_cached_catalog' => $used_cached_catalog,
'failed' => $failed,
'changed' => $changed,
'unresponsive' => $unresponsive,
'responsive' => $responsive,
'unhealthy' => $unhealthy,
'unhealthy_counter' => $unhealthy.count,
'healthy' => $healthy,
'healthy_counter' => $healthy.count,
'total_counter' => $fqdns.count,
}

return $data
}
