This is the repository for the AIX Checklist tool.
The checklist started as a shell script to facilitate health-check assessments and performance evaluation of servers.
As it evolved, some of its contents were split into smaller scripts that could be integrated into 3rd-party monitoring tools, like Splunk.
However, in order to facilitate out-of-the-box analysis of complex environments and better correlation of server behavior, a major rewrite in Python has been initiated.
At this moment about 10% of the original checklist has been ported to Python, and it supports:
- AIX servers
- PowerVM Virtual IO Servers
- Oracle Databases ( using the cx_Oracle python module )
This is intended to be used for data exploration and server troubleshooting, mainly where a combination of factors is crucial.
While it's possible to do performance monitoring and even capacity planning using this tool, for that you should look at njmon.
The Python collector can run locally, within the server, or use Ansible to fetch data from remote servers.
All collected stats are parsed and sent to InfluxDB in order to allow performance measurement over time and health-check operations.
All scripts are under the pq_checklist directory, which works as a Python module.
The entry point is pq_checklist.main, as described in the main.py file.
You can run the checklist as a standard process through the "main.py" script or interactively from a Python shell.
In order to facilitate execution, a shell wrapper is provided:
$ ./main.py -d
Parameter | Description |
---|---|
-h | Help Message |
-l | An influxdb dump file to load into the local database ( cannot be used along with -d ) |
-d | Run the checklist as a foreground daemon |
-c | Specify a config file to be used, skipping the search for a config file or the creation of a new one |
Within a python3 shell it can be executed as a python module, as shown below:
import pq_checklist
config,logger = pq_checklist.get_config(log_start=True,config_path='/etc/pq_checklist.conf')
collector_instance = pq_checklist.general_collector.collector(config=config,logger=logger)
collected_data = collector_instance.collect_all()
Most of the classes are properly documented with docstrings using Sphinx formatting.
This is a general overview of how the checklist components are tied together:
Installation and dependency tracking is done using Python's setuptools:
python3 setup.py install
At this moment the checklist ( regardless of its execution mode ) needs the following modules installed on the server:
- ansible-runner
- configparser
- asyncio
- influxdb_client
- xdg
If Oracle will be monitored too, the cx_Oracle python module is required.
Normally all dependencies should be handled by setuptools during the install;
for further information please check the setup.cfg file.
Optionally you can package it into a bdist_wheel or even an RPM package in order to facilitate distribution.
Please check setuptools documentation for further details.
At first execution the checklist will look for its configuration file following the XDG variables.
If not found, it will also look into /etc and /opt/freeware/etc; if not found there either, it will create one at ${XDG_CONFIG_HOME}, using pq_checklist/pq_checklist.conf.template as a template.
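The search order above can be sketched as follows; the function and variable names here are illustrative, not the actual pq_checklist.get_config() internals:

```python
import os

def find_config(filename="pq_checklist.conf"):
    """Illustrative sketch of the config search order: XDG first,
    then /etc, then /opt/freeware/etc."""
    xdg_home = os.environ.get("XDG_CONFIG_HOME",
                              os.path.expanduser("~/.config"))
    candidates = [os.path.join(xdg_home, filename),
                  os.path.join("/etc", filename),
                  os.path.join("/opt/freeware/etc", filename)]
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Not found anywhere: the checklist would create one at
    # ${XDG_CONFIG_HOME} from pq_checklist/pq_checklist.conf.template.
    return None
```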
Key components of the configuration are organized as sections, which are:
This section defines the database connection parameters used by the checklist.
If no database will be used, all query information will be dumped to the dump_file destination.
It's advisable not to remove the url parameter.
If no database will be used, just leave it with the default value "<influxdb_url>".
Available tags:
Tag | Default | Description |
---|---|---|
url | <influxdb_url> | Url used to connect into influxDB |
token | <influxdb_token> | InfluxDB authentication token |
org | <influxdb_org> | Organization that will be used within InfluxDB |
bucket | <influxdb_bucket> | Bucket into InfluxDB |
timeout | 50000 | Time, in seconds, before the query is considered to have failed |
dump_file | /tmp/influx_queries.json | File to store failed queries, or every query if an invalid db url is provided |
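As a sketch of how these tags might be consumed ( the function name and returned dict are illustrative; the real checklist reads them through its own config loader ):

```python
import configparser

def influx_settings(text):
    """Parse an [INFLUXDB] section, falling back to the documented
    defaults; when url is still the placeholder, queries go to dump_file."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    sec = cfg["INFLUXDB"]
    settings = {
        "url":       sec.get("url", "<influxdb_url>"),
        "token":     sec.get("token", "<influxdb_token>"),
        "org":       sec.get("org", "<influxdb_org>"),
        "bucket":    sec.get("bucket", "<influxdb_bucket>"),
        "timeout":   sec.getint("timeout", fallback=50000),
        "dump_file": sec.get("dump_file", "/tmp/influx_queries.json"),
    }
    # Placeholder url means no reachable database: use the dump file.
    settings["use_dump_file"] = settings["url"] == "<influxdb_url>"
    return settings
```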
When running in Ansible mode, this section defines the targets and which ansible playbook will be played, along with the artifacts and modules location.
It's advisable not to change the playbook or modules, unless you're extending the checklist's capabilities.
Available tags:
Tag | Default | Description |
---|---|---|
host_target | all | Host group which will be the target of the checklist |
playbook | getperf.yml | Playbook that will be executed; right now the checklist looks specifically for what is defined in getperf.yml |
private_data_dir | ansible | Base directory that holds ansible related assets |
This section defines how the checklist will behave: which collectors are to be loaded and whether the health check routines that exist within the collectors are to be called.
Available tags:
Tag | Default | Description |
---|---|---|
mode | local | Whether the checklist will get statistics from the local server or use ansible to fetch data |
healthcheck | True | Whether the healthcheck routine provided by each collector will be called upon its execution |
collectors | [ 'net', 'cpu', 'vio', 'dio' ] | Collectors to load with the checklist, please see the collectors section for further details |
This section defines the interval at which the checklist will collect data.
Available tags:
Tag | Default | Description |
---|---|---|
interval | 300 | Seconds between collection cycles ( only valid when using general_collector's loop ) |
hc_loop_count | 2 | Interval, in collection cycles, between calls to the health check routines; with these defaults the HC will be performed every 600 seconds |
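The relationship between the two tags can be illustrated with a toy helper ( hypothetical, not part of the checklist ):

```python
def cycles_with_hc(total_cycles, hc_loop_count=2):
    """Return the 1-based collection cycles on which the health check
    would fire, given hc_loop_count from the [LOOP] section."""
    return [c for c in range(1, total_cycles + 1) if c % hc_loop_count == 0]

# With interval=300 and hc_loop_count=2, the HC fires on cycles 2, 4, 6, ...
# i.e. every 2 * 300 = 600 seconds.
```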
Those are sections tied to collectors; please check the collectors part of this document for details about these config sections.
In order to simplify development, the checklist data gathering capabilities have been split into modules, which are called collectors.
Each collector provides essentially two things:
- Measurements :
- Data to be inserted into InfluxDB for further analysis
- Health Check :
- Automated analysis of the server health, based on the collected data
Important:
Whatever message comes out of the health check functions just means something that should be checked, from MY perspective.
Don't engage in tuning crusades or HW capacity endeavors without engaging the proper support channels ( like your solution provider ).
Each collector has its own set of Health Check ( HC ) validations, and at each running cycle the HC messages consolidated across the collectors are pushed to syslog.
The messages follow the directives defined at the checklist configuration file
The message report behavior follows a few patterns to minimize duplicity:
- Validate that the counter/metric has changed since the previous reading, wherever possible
- Validate whether it is something that normally changes
- If it changes normally, whether it is changing above its normal rate ( right now this just evaluates the average for the past 24 hours )
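These three rules can be sketched as a small heuristic; the names and the simplistic 24-hour window here are illustrative, not the checklist's actual implementation ( which queries InfluxDB for the history ):

```python
def should_report(history, current, normally_changes, window=24):
    """history: previous hourly readings of a counter (oldest first)."""
    previous = history[-1]
    if current == previous:          # rule 1: counter did not change
        return False
    if not normally_changes:         # rule 2: a normally-static counter moved
        return True
    # rule 3: it changes normally, so only report when the latest delta
    # exceeds the average delta over the past `window` readings.
    recent = history[-window:]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    avg = sum(deltas) / len(deltas) if deltas else 0
    return (current - previous) > avg
```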
All health check messages go to syslog and use the script name as the main tag of the messages:
- HC MSG FROM : health check from a specific server ( message details below )
This collector is responsible for parsing network-related commands.
On AIX and VirtualIO Servers, at this moment it will fetch the following commands:
- netstat -s
- netstat -aon
- entstat -d ( for each ent interface found on the server )
At this moment this collector reports warnings for several counters from the entstat command when they increment in abnormal ways, like:
- a counter has not changed for the majority of its life, but started to increase
- it changes frequently, but began to increase at a faster rate
In case the adapter is an etherchannel-like adapter and is set to use LACP, it will also send messages in case LACP gets out of sync.
Section from entstat | Counters |
---|---|
transmit_stats | ( 'transmit_errors', 'receive_errors', 'transmit_packets_dropped', 'receive_packets_dropped', 'bad_packets', 's_w_transmit_queue_overflow', 'no_carrier_sense', 'crc_errors', 'dma_underrun', 'dma_overrun', 'lost_cts_errors', 'alignment_errors', 'max_collision_errors', 'no_resource_errors', 'late_collision_errors', 'receive_collision_errors', 'packet_too_short_errors', 'packet_too_long_errors', 'timeout_errors', 'packets_discarded_by_adapter', 'single_collision_count', 'multiple_collision_count' ) |
general_stats | ( 'no_mbuf_errors' ) |
dev_stats | ( 'number_of_xoff_packets_transmitted', 'number_of_xon_packets_transmitted', 'number_of_xoff_packets_received', 'number_of_xon_packets_received', 'transmit_q_no_buffers', 'transmit_q_dropped_packets', 'transmit_swq_dropped_packets', 'receive_q_no_buffers', 'receive_q_errors', 'receive_q_dropped_packets' ) |
addon_stats | ( 'rx_error_bytes', 'rx_crc_errors', 'rx_align_errors', 'rx_discards', 'rx_mf_tag_discard', 'rx_brb_discard', 'rx_pause_frames', 'rx_phy_ip_err_discards', 'rx_csum_offload_errors', 'tx_error_bytes', 'tx_mac_errors', 'tx_carrier_errors', 'tx_single_collisions', 'tx_deferred', 'tx_excess_collisions', 'tx_late_collisions', 'tx_total_collisions', 'tx_pause_frames', 'unrecoverable_errors' ) |
veth_stats | ( 'send_errors', 'invalid_vlan_id_packets', 'receiver_failures', 'platform_large_send_packets_dropped' ) |
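Most of the entstat -d output consists of `Label: value` pairs grouped under the sections above; a minimal, illustrative parsing sketch ( the real net collector is considerably more complete, e.g. it handles the two-column transmit/receive layout ) could look like:

```python
import re

def parse_counters(entstat_text):
    """Map normalized counter names to integer values, e.g.
    'Transmit Errors: 3' -> {'transmit_errors': 3}."""
    counters = {}
    for line in entstat_text.splitlines():
        m = re.match(r"\s*([A-Za-z][A-Za-z0-9 _/]+):\s+(\d+)\s*$", line)
        if m:
            # Normalize 'S/W Transmit Queue Overflow' into
            # 's_w_transmit_queue_overflow', matching the names above.
            name = (m.group(1).strip().lower()
                    .replace(" ", "_").replace("/", "_"))
            counters[name] = int(m.group(2))
    return counters
```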
For messages in the transmit_stats section, this usually indicates issues at lower levels of the adapter, like:
- physical switch errors
- physical adapter queue saturation ( usually something at dev_stats or addon_stats will come up too )
- virtual adapter saturation ( usually something at veth_stats will come up )
For messages under dev_stats and addon_stats, this is usually tied to physical adapter saturation, like rx/tx queues, or buffer segmentation issues ( TSO/LSO ).
Normally counters in these sections can be remediated through tuning the queue sizes ( or even the number of queues ) and matching the interrupt coalescing intervals with the CPU resources.
Also, keep in mind that xoff/xon counters incrementing usually indicate CPU/bus starvation on either the server or switch side.
At this moment the checklist will report hints to troubleshoot issues for the following counters:
Session | Counter | Message | Impact, normal action | Priority |
---|---|---|---|---|
veth_stats | send_errors | Error sending packets to the VIOS; if buffers are maxed out please check VIOS resources | As long as the server is not running out of cores ( check cpu_collector ), adjusting the tiny/small/medium/large/huge buffers tends to address these issues | 🟡 |
veth_stats | receiver_failures | Possible starvation errors at the server; if buffers are maxed out please check CPU capabilities | As long as the server is not running out of cores ( check cpu_collector ), adjusting the tiny/small/medium/large/huge buffers tends to address these issues | 🟡 |
veth_stats | platform_large_send_packets_dropped | Error sending PLSO packets to the VIOS; if buffers are maxed out and there is no backend error at the physical adapters, please check VIOS resources | As long as the server is not running out of cores ( check cpu_collector ), adjusting the number of dog_threads on AIX or sea_threads on the VIOS helps | 🟡 |
addon_stats | rx_pause_frames | Possible saturation at the switch side | Nothing can be done at the server side | 🟡 |
addon_stats | tx_pause_frames | Possible saturation at the server side; queue or CPU saturation is likely | As long as the server is not running out of cores ( check cpu_collector ), adjusting intr_priority and intr_time might help with these issues | 🟡 |
dev_stats | number_of_xoff_packets_transmitted | Possible saturation at the server side; queue or CPU saturation is likely | As long as the server is not running out of cores ( check cpu_collector ), adjusting intr_priority and intr_time might help with these issues | 🟡 |
dev_stats | number_of_xoff_packets_received | Possible saturation at the server side; queue or CPU saturation is likely | Nothing can be done at the server side | 🟡 |
dev_stats | transmit_q_no_buffers | Buffer saturation; possibly more TX queues are advisable | As long as the server is not running out of cores ( check cpu_collector ), adjusting queue_size, tx_max_pkts, and tx_limit might help | 🟡 |
dev_stats | transmit_swq_dropped_packets | Buffer saturation; possibly bigger queues are advisable | Can lead to slowdowns | 🟡 |
dev_stats | receive_q_no_buffers | Buffer saturation; possibly more RX queues are advisable | Possibly the combination of packets across all queues exceeded the number of packets allowed in the buffer; more queues might help with better management ( queues_rx increases ), otherwise increasing the total number of packets in the buffer might help | 🔴 |
general_stats | no_mbuf_errors | The network stack lacks memory buffers; a check of thewall is advisable | Possible inability to offload packets from the adapter into the AIX network stack; increasing the network buffers ( via the "no" command ), especially thewall, might help | 🟡 |
lacp_port_stats | partner_state | LACP error ( possible switch port mismatch ) | Can lead to loss of connectivity; check the port-channel on both the switch and server side | 🔴 |
lacp_port_stats | actor_state | LACP error ( possible switch port mismatch ) | Can lead to loss of connectivity; check the port-channel on both the switch and server side | 🔴 |
For further reading on the topic, please check:
- https://community.ibm.com/community/user/power/blogs/jim-cunningham1/2020/06/22/aix-network-tuning-for-10ge-and-virtual-network
- https://www.ibm.com/docs/en/aix/7.2?topic=parameters-network-option-tunable
- https://www.ibm.com/support/pages/10-gbit-ethernet-bad-assumptions-and-best-practice
- https://www.ibm.com/docs/en/ssw_aix_72/performance/performance_pdf.pdf
At this moment the net_collector provides the following measurements:
measurement | tag | Description |
---|---|---|
entstat | host | Server that originated the observation |
entstat | stats_type | Section within the entstat command output that generated the entry; can be: transmit_stats, general_stats, dev_stats, addon_stats, veth_stats |
entstat | interface | Interface that generated the observation |
netstat_general | host | Server that originated the observation |
netstat_general | protocol | Protocol that generated the observation |
netstat_general | session_group | Section within the protocol that generated the information |
netstat_general | session | Section within the section group that generated the information |
This collector is responsible for CPU-related commands.
On AIX it runs:
- mpstat
- lparstat
The supported tags in the [CPU] section of the config file follow:
Tag | Default | Description |
---|---|---|
samples | 2 | Readings from the commands used to calculate the usage |
interval | 1 | Interval between the readings |
rq_relative_100pct | 10 | Run queue length at which the CPU is considered to be at 100% |
max_usage_warn | 90 | Utilization percentage at which the checklist triggers a high CPU usage warning |
min_usage_warn | 30 | Utilization percentage at which the checklist triggers a low CPU usage warning |
min_core_pool_warn | 2 | Minimum number of free cores in the shared processor pool before triggering a warning |
other_warnings | True | Whether warnings related to ilcs and vlcs will be issued |
involuntarycontextswitch_ratio | 30 | Context switch ratio at which the server is considered to need more CPUs to handle the workload |
involuntarycorecontextswitch_ratio | 10 | Core context switch ratio at which the server is considered to need more cores |
The CPU collector uses data from mpstat and lparstat to evaluate whether the server is running out of CPU resources whenever CPU utilization goes beyond the threshold values defined in the config.
- When the CPU is high, running on shared CPUs and the ilcs vs vlcs ratio is high too, it will trigger an alert suggesting to increase the Entitled Capacity of the LPAR
- When the CPU is high, running on shared CPUs and the cs vs ics ratio is high too, it will trigger an alert suggesting to increase the amount of VCPUs assigned to the LPAR
- When the run queue is high for the number of CPUs on the server, it will trigger an alert suggesting to add more VCPUs or cores to the LPAR ( threshold defined in the config )
- If the LPAR is running in a specific shared processor pool and the pool reaches its limits, it will trigger an alert
- If the server is idle, it will trigger an alert suggesting to remove resources
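These rules can be sketched as follows; the thresholds default to the [CPU] tags above, while the function name and the pre-computed ratio inputs are illustrative ( the real collector derives them from mpstat/lparstat output ):

```python
def cpu_alerts(util, ilcs_ratio, cs_ics_ratio, run_queue, vcpus,
               shared=True, max_usage_warn=90, min_usage_warn=30,
               involuntarycontextswitch_ratio=30,
               involuntarycorecontextswitch_ratio=10,
               rq_relative_100pct=10):
    """Illustrative decision rules for the CPU health check.
    ilcs_ratio = ilcs vs vlcs ratio; cs_ics_ratio = cs vs ics ratio."""
    alerts = []
    if util >= max_usage_warn:
        if shared and ilcs_ratio >= involuntarycorecontextswitch_ratio:
            alerts.append("increase Entitled Capacity")   # core starvation
        if shared and cs_ics_ratio >= involuntarycontextswitch_ratio:
            alerts.append("increase VCPUs")               # cpu starvation
        if not alerts:
            alerts.append("high CPU utilization")         # resources look OK
    elif util <= min_usage_warn:
        alerts.append("low utilization, consider removing resources")
    if run_queue >= rq_relative_100pct * vcpus:
        alerts.append("high run queue")
    return alerts
```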
Message | Description | Priority |
---|---|---|
High CPU utilization detected along with possible core starvation of the lpar, due to a high ilcs vs vlcs ratio, values... | This message appears when the application is demanding more cores than are promptly available to the server; core allocation above the entitlement suffers from latency spikes and priority calculations | 🟡 |
High CPU utilization detected along with possible cpu starvation of the lpar, due to a high cs vs ics ratio, values... | This message appears when there aren't enough VCPUs on the server for the amount of running applications; this will lead to spikes in the run queue, which might lead to server crashes | 🟠 |
High CPU utilization detected, values... | High CPU detected and apparently the server has enough resources; this might be a problem and investigation is required | 🔴 |
LOW CPU utilization detected, values... | The server has more resources than it actually needs; CPU/core removal could benefit the whole system | 🟢 |
High run queue detected on the server, value... | Normally this happens when processes start to accumulate on the server; when it reaches roughly 10x the number of VCPUs, the server crashes | 🔴 |
Shared processor pool near its capacity limit, value... | This indicates that the server is running inside a shared processor pool that is at its limit, which will starve the lpars; more cores in the pool are needed | 🟠 |
The actions in this case are mostly self-explanatory.
measurement | tag | Description |
---|---|---|
mpstat | host | Server that generated the entry |
lparstat | host | Server that generated the entry |
This collector works only on PowerVM VirtualIO Servers and wraps the parsing of the following commands:
- ioscli
- seastat
- vnicstat
At this moment all HC-related messages are related to vnicstat, and it will trigger alerts when the following counters increase, per adapter or per adapter CRQ:
- 'low_memory'
- 'error_indications'
- 'device_driver_problem'
- 'adapter_problem'
- 'command_response_errors'
- 'reboots'
- 'client_completions_error'
- 'vf_completions_error'
VNIC implements the SR-IOV specification to allow near-direct access to the physical adapter.
This means that data transfer to the adapter queues can be done directly by the client LPAR, so the behavior is nearly the same as a physical adapter's.
The glue that holds these queues together at the client LPAR is the Logical Port ( Slot ), which also defines the behavior of the SR-IOV Virtual Function. These definitions can be identified at the VIOS as the vnicserver* adapters, and as the ent* adapters at the AIX LPARs.
Errors ( like crc/send/recv/duplicate packets ) originating at the physical level are simply passed to the client lpar through the VF.
When dealing with VNIC errors, queue descriptor errors usually mean that all physical device queues got full for a moment, while VF errors could be tied to CPU/memory starvation at the client or VIOS level.
Assuming that no physical error has been observed at the adapter or switch itself, and CPU/memory resources are available, queue tuning could help.
To evaluate queue sizes, it's good to consider that VNIC began on P8 servers; I think the defaults were something like this:
- A maximum of 2 queues per VNIC
- About 512 packets per queue
This was supposed to handle at least the same number of packets handled by the SEA.
On P9 I've seen 4 and 6 queues per adapter, but as far as I know the limitations are on the NIC and bus themselves, so this should increase fairly easily in the future. But keep in mind that increased capabilities don't mean the defaults will increase too.
Also, more and bigger queues don't mean higher packet throughput, as CPU is still needed to lift the data from the adapter into the server memory; therefore device-specific tuning might be required.
Also keep in mind that when the adapter is shared across multiple VNICs, the queues are shared too; therefore other clients can fill them up.
With that said, VNIC troubleshooting isn't very straightforward, so once the queues have been tweaked, if the issues continue it's advisable to open a PMR with IBM to investigate further.
measurement | tag | Description |
---|---|---|
vnicstat | host | Server that generated the entry |
vnicstat | backing_device_name | Device at the VIOS |
vnicstat | client_partition_id | LPAR ID of the AIX/Linux/i Client |
vnicstat | client_partition_name | Hostname of the client lpar ( sometimes it comes empty when the client is linux ) |
vnicstat | client_operating_system | Client Operating System |
vnicstat | client_device_name | Device name at the client ( sometimes linux comes up with weird names ) |
vnicstat | client_device_location_code | Slot at the Client partition |
vnicstat | adapter | Adapter at VIOS |
vnicstat | crq | CRQ number within the adapter |
vnicstat | direction | rx/tx within the CRQ, within the adapter |
seastat_vlan | host | Server that generated the entry |
seastat_vlan | adapter | Adapter at VIOS |
seastat_vlan | vlan | Vlan which the traffic is using |
seastat_mac | host | Server that generated the entry |
seastat_mac | adapter | Adapter at VIOS |
seastat_mac | mac | Mac ( virtual HW ) address generating traffic |
Regarding the SEA:
SEA statistics usually come from the entstat command, therefore SEA-related statistics are found under the entstat metrics.
This collector handles disk and disk adapter related metrics.
On AIX it handles the following commands:
- iostat
- fcstat ( for all fcs adapters on the lpar )
measurement | tag | Description |
---|---|---|
iostat_disks | host | Server that generated the entry |
iostat_disks | disk | Disk name |
This collector connects to remote Oracle database instances to gather performance measurements and report basic slowdown scenarios.
All configuration for this collector resides under the [ORACLE] section of the config file.
Tag | Default | Description |
---|---|---|
conn_type | local | How the connection to the database will be established; local will use sqlplus to fetch data and remote will use cx_Oracle; right now only remote works |
ora_user | [ 'oracle', 'oracle' ] | Must be a list ( even with only one entry ) of users that will be used to connect to the database |
ora_home | [ '/oracle/database/dbhome_1', '/oracle/grid' ] | Must be a list ( even with only one entry ) of ORACLE_HOMEs; not used when conn_type = remote |
ora_sid | [ 'tasy21', '+ASM1' ] | Must be a list ( even with only one entry ) of SIDs; not used when conn_type = remote |
ora_logon | [ '/ as sysdba', '/ as sysasm' ] | Must be a list ( even with only one entry ) of logon strings used to connect; not used when conn_type = remote |
ora_pass | [ pass, pass ] | Must be a list ( even with only one entry ) of passwords used to connect to the databases |
ora_dsn | [ host/service, host/service ] | Must be a list ( even with only one entry ) of Oracle DSNs used to connect to remote databases |
ora_role | [ 0, 2 ] | User role used to connect to the remote database: 0 = DEFAULT_AUTH, 2 = SYSDBA, 32768 = SYSASM |
ora_users_to_ignore | [ 'PUBLIC', 'APPQOSSYS', 'CTXSYS', 'ORDPLUGINS', 'GSMADMIN_INTERNAL', 'XDB', 'ORDDATA', 'DVSYS', 'OUTLN', 'SYSTEM', 'ORACLE_OCM', 'WMSYS', 'OLAPSYS', 'LBACSYS', 'SYS', 'MDSYS', 'DBSNMP', 'SI_INFORMTN_SCHEMA', 'DVF', 'DBSFWUSER', 'AUDSYS', 'REMOTE_SCHEDULER_AGENT', 'OJVMSYS', 'ORDSYS' ] | List of users to ignore when tracking objects |
check_statistics_days | 2 | Number of days after which the statistics of a modified object are considered old |
log_switches_hour_alert | 3 | Number of log switches per hour tolerated before issuing a warning to syslog |
script_dumpdir | /tmp/oracle_sql | When checking for fragmentation and old statistics, the system can also create defrag and gather-stats scripts to facilitate maintenance; those scripts will be stored in this directory |
dump_longops | True | Whether, upon detecting a longops query, to dump its execution plan in order to look for possible causes of the specific longop |
dump_running_ids | True | Whether a dump of running queries, when detected, is desirable |
table_reclaimable_treshold | 50 | Amount of fragmentation tolerated before issuing a warning so the admin can take action |
stats_max_parallel | 10 | Parallel degree used to gather statistics |
stats_estimate_percent | 60 | Estimate percentage used to gather statistics |
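As an illustration of how script_dumpdir, stats_max_parallel, and stats_estimate_percent could fit together when the gather-stats helper scripts are emitted ( the file naming and exact SQL shape here are assumptions, not the collector's actual output ):

```python
import os

def write_gather_stats_script(owner, table, script_dumpdir="/tmp/oracle_sql",
                              stats_max_parallel=10, stats_estimate_percent=60):
    """Write a DBMS_STATS call for one stale table; names are illustrative."""
    os.makedirs(script_dumpdir, exist_ok=True)
    sql = ("EXEC DBMS_STATS.GATHER_TABLE_STATS("
           f"ownname => '{owner}', tabname => '{table}', "
           f"degree => {stats_max_parallel}, "
           f"estimate_percent => {stats_estimate_percent});\n")
    path = os.path.join(script_dumpdir, f"gather_{owner}_{table}.sql")
    with open(path, "w") as fh:
        fh.write(sql)
    return path
```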
At this moment the oracle collector doesn't retrieve information from influxdb to evaluate the alerts before issuing them, therefore duplicate alerts might happen frequently.
The alert messages being reported follow:
Message | Description | Priority |
---|---|---|
The instance %s of database %s switched logs %d times at : %s | There are too many changes happening in the database and the redo logs are not big enough to absorb them, therefore a log switch is issued, freezing changes until the switch is completed | 🟠 |
The database %s has a total of %d longops happening, please check dumped queries | There are some slow queries running in the database, which indicates slowdowns | 🟢 |
The Query %s from database %s has an execution plan too long, possible problems | A specific query is taking a long time to complete; a logical problem in the way the query is being executed is possible | 🟡 |
The Query %s from database %s has a full table scan, please check | A specific query is taking a long time to complete and is doing a full table scan along the way; there is a high chance a column is not indexed properly | 🟠 |
Long queries detected using full table scan, please check %d | Number of queries performing full table scans detected in the system | 🟢 |
This collector will scan the tables within the database in order to find tables that might need their statistics updated.
The key factor in determining whether the statistics are old is the check_statistics_days tag in the config file; if the statistics are newer than what's defined in the tag, the collector will not check the object.
If the statistics are older than check_statistics_days, the following criteria apply:
- whether the table had changes since the last statistics gathering
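Those criteria boil down to a small decision, sketched here with illustrative names:

```python
from datetime import datetime, timedelta

def stats_are_stale(last_analyzed, modified_since_analyze,
                    check_statistics_days=2, now=None):
    """Stale = statistics older than check_statistics_days AND the
    table was modified since the last analyze."""
    now = now or datetime.now()
    if now - last_analyzed <= timedelta(days=check_statistics_days):
        return False                 # statistics are recent enough
    return modified_since_analyze    # old AND the table changed
```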
measurement | tag | Description |
---|---|---|
oracle_logswitches | database | Name of database that generated the metric |
oracle_logswitches | <instance_name> | Number of log switches this specific instance generated in the designated timeframe |
oracle_stalestats | database | Name of database that generated the metric |
oracle_stalestats | user | owner of the stale objects |
oracle_stalestats | total | amount of objects for the specific user |
oracle_tablespaces | database | Name of database that generated the metric |
oracle_tablespaces | tablespace | tablespace name |
oracle_tablespaces | total | Total amount of bytes |
oracle_tablespaces | total_physical_cap | Total amount of physical bytes |
oracle_tablespaces | free | Free space |
oracle_tablespaces | free_pct | Percentage of free space into the tablespace |
oracle_longops | database | Name of database that generated the metric |
oracle_longops | server | Server where the longop was identified |
oracle_longops | instance | Database instance that originated the longops |
oracle_longops | user | User that was running the query ( longop ) |
oracle_longops | hash_value | Hash value of the query |
oracle_longops | sql_id | sql_id of the query |
oracle_wait_events | database | Name of database that generated the metric |
oracle_wait_events | server | Server where the wait event was identified |
oracle_wait_events | instance | Database instance that originated the wait events |
oracle_wait_events | wait_class | Class of the wait event |
oracle_wait_events | total_waits | Total wait events for this class |
oracle_wait_events | time_waited | Total time waited on this class |
oracle_wait_events | total_waits_fg | Total amount of foreground wait events for this class |
oracle_wait_events | time_waited_fg | Amount of time foreground events spent on this wait class |
oracle_running_queries | database | Name of database that generated the metric |
oracle_running_queries | total | Amount of queries running concurrently on this database |
oracle_sql_monitor | database | Name of database that generated the metric |
oracle_sql_monitor | status | Query status from sql monitor |
oracle_sql_monitor | username | User running the query |
oracle_sql_monitor | module | sql module being used on the query |
oracle_sql_monitor | service_name | Service name |
oracle_sql_monitor | sql_id | Query sql_id |
oracle_sql_monitor | tot_time | Amount of time spent on this query |
oracle_temp_tablespaces | database | Name of database that generated the metric |
oracle_temp_tablespaces | tablespace | tablespace name |
oracle_temp_tablespaces | usage_in_mb | Amount of space used in Megabytes |
oracle_sessions | host | Server name running the database |
oracle_sessions | instance | Instance name |
oracle_sessions | total_sessions | Amount of sessions |
oracle_objects | database | Name of database that generated the metric |
oracle_objects | user | Owner of the object |
oracle_objects | valid | Amount of valid objects |
oracle_objects | invalid | Amount of invalid objects |
Important:
The data model of this collector might, and likely will, change in the near future, in order to provide more useful information.
This is an internal collector that gathers information about the target server and feeds it into the other collectors.
If the checklist's python interface is being used, the device tree, serial numbers, and SMT modes can be found here.
The documentation of this collector is available only through python's help() interface and sphinx.
The scripts in the sh directory are intended to be used in conjunction with Splunk and are not really used by the python collector or ansible anymore. The list of scripts and their purpose follows:
Script | Description |
---|---|
checklist-aix.sh | Do a data capture of the server |
fcstat.sh | Collect Fibre interface statistics |
netstat.sh | Collect Ethernet interface statistics |
cpu.sh | Collect CPU/CORE statistics |
powerha_check.sh | Do an automated PowerHA health check |
vmstat_i.sh | Virtual Memory interrupt related statistics |
vmstat_s.sh | Virtual Memory system wide statistics |
lspath.sh | Disk multipath health ( Rely on AIX MPIO ) |
errpt_count.sh | Count the number of entries reported by errpt |
seastat.sh | Get Network Statistics from VIOS SEA Adapters |
mount.sh | Check filesystem mount parameters for unsafe settings |
- Better documentation ( Better adoption of sphinx into the APIs )
- Send messages to a webhook instead of syslog ( like M$ Teams or Slack )
- Collect data from Linux Servers
- Gather statistics from netstat -aon ( AIX )
- Handle other ioscli commands
- Handle Memory related commands
- Handle process related commands
- Gather data from SAP jobs
- Enable HC using data inside the DB, without fetching data from the server ( Python mode only, probably the next one )
- Provide HC messages through rest APIs ( Using Flask or Tornado )
- Review fcstat data model and HC messages related to it
- When providing data through REST, convert the lists into np.arrays in order to use ML to calculate trends and isolate behaviors