Checklist

This is the repository for the AIX Checklist tool.

Checklist ?!

The checklist started as a shell script to facilitate health-check assessments and performance evaluation of servers.
As it evolved, some of its contents were split into smaller scripts that could be integrated into 3rd party monitoring tools, like Splunk.
However, in order to facilitate out-of-the-box analysis of complex environments and better correlation of server behavior, a major rewrite in Python has been initiated.
At this moment about 10% of the original checklist has been ported to Python, and it supports:

  • AIX servers
  • PowerVM Virtual IO Servers
  • Oracle Databases ( using the cx_Oracle Python module )

Who should use it ? / Target Audience

This is intended for data exploration and server troubleshooting, mainly where the combination of factors is crucial.
While it's possible to do performance monitoring and even capacity planning with this tool, for those use cases you should look at njmon.

Python Script

The Python collector can run locally, within the server, or use Ansible to fetch data from remote servers.
All collected stats are parsed and sent to InfluxDB to allow for performance measurement over time and health-check operations.
All scripts are under the pq_checklist directory, which works as a Python module.
The entry point is pq_checklist.main, as described in the main.py file.

How to run it

You can run the checklist as a standard process through the "main.py" script or interactively from a Python shell.

OS Shell ( Bash/KSH/etc )

To facilitate execution, a shell wrapper is provided:

```sh
$ ./main.py -d
```

Available parameters:

| Parameter | Description |
|-----------|-------------|
| -h | Help message |
| -l | An InfluxDB dump file to load into the local database ( cannot be used along with -d ) |
| -d | Run the checklist as a foreground daemon |
| -c | Specify a config file to be used, skipping the search for a config file or the creation of a new one |
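For example ( illustrative invocations combining the flags above; the paths are simply the ones used elsewhere in this document ):

```sh
# Run as a foreground daemon with an explicit config file
./main.py -d -c /etc/pq_checklist.conf

# Load a previously generated InfluxDB dump file into the local database
./main.py -l /tmp/influx_queries.json
```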

Python shell

Within a python3 shell it can be executed as a Python module, as shown below:

```python
import pq_checklist
config,logger = pq_checklist.get_config(log_start=True,config_path='/etc/pq_checklist.conf')
collector_instance = pq_checklist.general_collector.collector(config=config,logger=logger)
collected_data = collector_instance.collect_all()
```
API Documentation

Most of the classes are properly documented with Python docstrings using Sphinx formatting.
This is a general overview of how the checklist components are tied together: General Overview


Installation

Installation and dependency tracking are done using Python's setuptools:

```sh
python3 setup.py install
```
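As a side note ( an assumption based on the standard setuptools layout, not something documented in this repository ), a pip-based install from the project root should behave equivalently:

```sh
python3 -m pip install .
```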

Dependencies

At this moment the checklist ( regardless of its execution mode ) needs the following modules installed on the server:

  • ansible-runner
  • configparser
  • asyncio
  • influxdb_client
  • xdg

If Oracle will be monitored too, the cx_Oracle Python module is required.

Normally all dependencies should be handled by setuptools during the install;
for further information please check the setup.cfg file.

Packaging and redistribution

Optionally you can package it into a bdist_wheel or even an RPM package to facilitate distribution, for example as shown below.
Please check the setuptools documentation for further details.
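The usual setuptools commands apply here; as an illustrative sketch ( bdist_wheel additionally requires the wheel package to be installed ):

```sh
python3 setup.py bdist_wheel   # build a wheel under dist/
python3 setup.py bdist_rpm     # build an RPM package
```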


Configuration

At first execution the checklist will look for its configuration file following the XDG variables.
If not found, it will also look into /etc and /opt/freeware/etc; if not found there either, it will create one at ${XDG_CONFIG_HOME}, using pq_checklist/pq_checklist.conf.template as a template.
Key components of the configuration are organized as sections, which are:

[INFLUXDB]

This section defines the database connection parameters.
If no database will be used, all query information will be dumped into the dump_file destination.
It's advisable not to remove the url parameter; if no database will be used, just leave it with the default value "<influxdb_url>".
Available tags:

| Tag | Default | Description |
|---|---|---|
| url | <influxdb_url> | URL used to connect to InfluxDB |
| token | <influxdb_token> | InfluxDB authentication token |
| org | <influxdb_org> | Organization that will be used within InfluxDB |
| bucket | <influxdb_bucket> | Bucket within InfluxDB |
| timeout | 50000 | Amount of seconds before considering that the query failed |
| dump_file | /tmp/influx_queries.json | File to store failed queries, or any query if an invalid DB URL is provided |

[ANSIBLE]

When running in Ansible mode, this section defines the targets and which Ansible playbook will be played, along with the artifacts and modules location.
It's advisable not to change the playbook or modules unless you're extending the checklist capabilities.
Available tags:

| Tag | Default | Description |
|---|---|---|
| host_target | all | Host group which will be the target of the checklist |
| playbook | getperf.yml | Playbook that will be executed; right now the checklist looks specifically for what is defined in getperf.yml |
| private_data_dir | ansible | Base directory that holds Ansible related assets |
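Since the checklist relies on ansible-runner, the directory pointed to by private_data_dir presumably follows ansible-runner's standard layout; a rough sketch ( illustrative, not necessarily the repository's exact tree ):

```
ansible/
  inventory/   # hosts file defining the host groups ( e.g. the default "all" target )
  project/     # playbooks, such as getperf.yml, plus any custom modules
  env/         # optional ansible-runner environment settings
```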

[MODE]

This section defines how the checklist will behave, which collectors are to be loaded, and whether the health check routines that exist within the collectors are to be called.
Available tags:

| Tag | Default | Description |
|---|---|---|
| mode | local | Whether the checklist will get statistics from the local server or use Ansible to fetch data |
| healthcheck | True | Whether the healthcheck routine provided by each collector will be called upon its execution |
| collectors | [ 'net', 'cpu', 'vio', 'dio' ] | Collectors to load with the checklist; please see the collectors section for further details |

[LOOP]

This section defines the interval at which the checklist will collect data. Available tags:

| Tag | Default | Description |
|---|---|---|
| interval | 300 | Seconds between collection cycles ( only valid when using general_collector's loop ) |
| hc_loop_count | 2 | Number of collection cycles between calls to the health check routines; with the defaults above the HC is performed every 600 seconds |

[CPU] && [ORACLE]

These are sections tied to collectors; please check the collectors part of this document for details about these config sections. A combined example of the general configuration file is sketched below.
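Putting the sections above together, a configuration file could look roughly like the sketch below ( values are the documented defaults and placeholders; the shipped pq_checklist.conf.template remains the authoritative reference ):

```ini
[INFLUXDB]
url       = <influxdb_url>
token     = <influxdb_token>
org       = <influxdb_org>
bucket    = <influxdb_bucket>
timeout   = 50000
dump_file = /tmp/influx_queries.json

[ANSIBLE]
host_target      = all
playbook         = getperf.yml
private_data_dir = ansible

[MODE]
mode        = local
healthcheck = True
collectors  = [ 'net', 'cpu', 'vio', 'dio' ]

[LOOP]
interval      = 300
hc_loop_count = 2
```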


Collectors

In order to simplify development, the checklist data gathering capabilities have been split into modules, which are called collectors.
Each collector provides essentially two things:

  • Measurements :
    • Data to be inserted into InfluxDB for further analysis
  • Health Check :
    • Automated analysis of the server health, based on the collected data

Important:
Whatever message comes out of the health check functions just means something that should be checked, from MY perspective.
Don't engage in tuning crusades or HW capacity endeavors without engaging the proper support channels ( like your solution provider ).

Health check

Each collector has its own HealthCheck ( HC ) set of validations, and at each running cycle the HC messages consolidated across the collectors are pushed to syslog.
The messages follow the directives defined in the checklist configuration file.

The message reporting behavior follows a few patterns to minimize duplication:

  • Validate that the counter/metric has changed since the previous reading, whenever possible
  • Validate whether it is something that normally changes
  • If it changes normally, check whether it is changing above its normal rate ( right now it just evaluates the average over the past 24 hours )
Message Location

All HealthCheck messages go to syslog and use the script name as the main tag of the messages:

  • HC MGS FROM : health check message from a specific server ( message details below )

net collector

This collector is responsible for parsing network related commands.
On AIX and Virtual I/O Servers, at this moment it will fetch the output of the following commands ( a minimal parsing sketch follows the list ):

  • netstat -s
  • netstat -aon
  • entstat -d ( for each ent interface found on the server )
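As a rough illustration of what this parsing involves ( a hypothetical sketch, not the collector's actual parser ), entstat-style "Label: value" counters can be normalized into the lowercase, underscore-separated names used in the tables below:

```python
import re

def parse_counters(entstat_output: str) -> dict:
    """Hypothetical sketch: turn 'Transmit Errors: 0' style lines into
    {'transmit_errors': 0}. The real pq_checklist parser is more involved."""
    counters = {}
    for line in entstat_output.splitlines():
        # entstat usually prints one or two "Label: value" pairs per line
        for label, value in re.findall(r'([A-Za-z][A-Za-z /_]+):\s*(\d+)', line):
            key = label.strip().lower().replace(' ', '_').replace('/', '_')
            counters[key] = int(value)
    return counters
```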
Health Check routines

At this moment this collector reports warnings for several counters from the entstat command when they increment in abnormal ways, like:

  • a counter has not changed for the majority of its life, but started to increase
  • it changes frequently, but began to increase at a faster rate

In case the adapter is an EtherChannel-like adapter and is set to use LACP, it will also send messages in case LACP gets out of sync.

Counters currently monitored per adapter:
| Section from entstat | Counters |
|---|---|
| transmit_stats | 'transmit_errors', 'receive_errors', 'transmit_packets_dropped', 'receive_packets_dropped', 'bad_packets', 's_w_transmit_queue_overflow', 'no_carrier_sense', 'crc_errors', 'dma_underrun', 'dma_overrun', 'lost_cts_errors', 'alignment_errors', 'max_collision_errors', 'no_resource_errors', 'late_collision_errors', 'receive_collision_errors', 'packet_too_short_errors', 'packet_too_long_errors', 'timeout_errors', 'packets_discarded_by_adapter', 'single_collision_count', 'multiple_collision_count' |
| general_stats | 'no_mbuf_errors' |
| dev_stats | 'number_of_xoff_packets_transmitted', 'number_of_xon_packets_transmitted', 'number_of_xoff_packets_received', 'number_of_xon_packets_received', 'transmit_q_no_buffers', 'transmit_q_dropped_packets', 'transmit_swq_dropped_packets', 'receive_q_no_buffers', 'receive_q_errors', 'receive_q_dropped_packets' |
| addon_stats | 'rx_error_bytes', 'rx_crc_errors', 'rx_align_errors', 'rx_discards', 'rx_mf_tag_discard', 'rx_brb_discard', 'rx_pause_frames', 'rx_phy_ip_err_discards', 'rx_csum_offload_errors', 'tx_error_bytes', 'tx_mac_errors', 'tx_carrier_errors', 'tx_single_collisions', 'tx_deferred', 'tx_excess_collisions', 'tx_late_collisions', 'tx_total_collisions', 'tx_pause_frames', 'unrecoverable_errors' |
| veth_stats | 'send_errors', 'invalid_vlan_id_packets', 'receiver_failures', 'platform_large_send_packets_dropped' |
Actions to perform due to HC messages

For messages in the transmit_stats section, this usually indicates issues at lower levels of the adapter, like:

  • physical switch errors
  • physical adapter queue saturation ( usually something at dev_stats or addon_stats will come up too )
  • virtual adapter saturation ( usually something at veth_stats will come up )

For messages under dev_stats and addon_stats, this is usually tied to physical adapter saturation, like rx/tx queues, or buffer segmentation issues ( TSO/LSO ).
Normally counters in these sections can be remediated by tuning the queue sizes ( or even the amount of queues ) and matching the interrupt coalescing intervals with the available CPU resources. Also, keep in mind that xoff/xon counters incrementing usually indicate CPU/bus starvation on either the server or the switch side.

At this moment the checklist will report troubleshooting hints for the following counters:

| Section | Counter | Message | Impact / normal action | Priority |
|---|---|---|---|---|
| veth_stats | send_errors | Error sending packets to the VIOS; if buffers are maxed out please check VIOS resources | As long as the servers are not running out of cores ( check cpu_collector ), adjusting the tiny/small/medium/large/huge buffers tends to address these issues | 🟡 |
| veth_stats | receiver_failures | Possible starvation errors at the server; if buffers are maxed out please check CPU capabilities | As long as the servers are not running out of cores ( check cpu_collector ), adjusting the tiny/small/medium/large/huge buffers tends to address these issues | 🟡 |
| veth_stats | platform_large_send_packets_dropped | Error sending PLSO packets to the VIOS; if buffers are maxed out and no backend error shows at the physical adapters, please check VIOS resources | As long as the servers are not running out of cores ( check cpu_collector ), adjusting the amount of dog_threads on AIX or sea_threads on the VIOS helps | 🟡 |
| addon_stats | rx_pause_frames | Possible saturation at the switch side | Nothing can be done at the server side | 🟡 |
| addon_stats | tx_pause_frames | Possible saturation at the server side; queue or CPU saturation is likely | As long as the servers are not running out of cores ( check cpu_collector ), adjusting intr_priority and intr_time might help | 🟡 |
| dev_stats | number_of_xoff_packets_transmitted | Possible saturation at the server side; queue or CPU saturation is likely | As long as the servers are not running out of cores ( check cpu_collector ), adjusting intr_priority and intr_time might help | 🟡 |
| dev_stats | number_of_xoff_packets_received | Possible saturation at the server side; queue or CPU saturation is likely | Nothing can be done at the server side | 🟡 |
| dev_stats | transmit_q_no_buffers | Buffer saturation; more TX queues are possibly advisable | As long as the servers are not running out of cores ( check cpu_collector ), adjusting queue_size, tx_max_pkts and tx_limit might help | 🟡 |
| dev_stats | transmit_swq_dropped_packets | Buffer saturation; bigger queues are possibly advisable | Can lead to slowdowns | 🟡 |
| dev_stats | receive_q_no_buffers | Buffer saturation; more RX queues are possibly advisable | Possibly the combination of packets in all queues exceeded the amount of packets allowed in the buffer; more queues might help with better management ( queues_rx increases ), otherwise increasing the total amount of packets in the buffer might help | 🔴 |
| general_stats | no_mbuf_errors | The network stack lacks memory buffers; a check of thewall is advisable | Possible inability to offload the packets from the adapter into the AIX network stack; increasing the network buffers ( via the "no" command ), especially thewall, might help | 🟡 |
| lacp_port_stats | partner_state | LACP error ( possible switch port mismatch ) | Can lead to loss of connectivity; check the port-channel on both the switch and the server side | 🔴 |
| lacp_port_stats | actor_state | LACP error ( possible switch port mismatch ) | Can lead to loss of connectivity; check the port-channel on both the switch and the server side | 🔴 |

For further reading on the topic, please check:

Metrics inserted into InfluxDB

At this moment the net_collector provides the following measurements:

| measurement | tag | Description |
|---|---|---|
| entstat | host | Server that originated the observation |
| entstat | stats_type | Section within the entstat command output that generated the entry; can be: transmit_stats, general_stats, dev_stats, addon_stats, veth_stats |
| entstat | interface | Interface that generated the observation |
| netstat_general | host | Server that originated the observation |
| netstat_general | protocol | Protocol that generated the observation |
| netstat_general | session_group | Section within the protocol that generated the information |
| netstat_general | session | Section within the session group that generated the information |
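Since all measurements land in InfluxDB, they can be pulled back with the influxdb_client module the checklist already depends on; a minimal sketch, assuming an InfluxDB 2.x server and the placeholder values from the [INFLUXDB] section:

```python
from influxdb_client import InfluxDBClient

# Placeholders matching the [INFLUXDB] configuration section
client = InfluxDBClient(url="<influxdb_url>", token="<influxdb_token>", org="<influxdb_org>")

# Flux query: last hour of entstat transmit_stats observations for one interface
flux = '''
from(bucket: "<influxdb_bucket>")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "entstat")
  |> filter(fn: (r) => r.stats_type == "transmit_stats" and r.interface == "ent0")
'''

for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_time(), record.get_field(), record.get_value())
```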

cpu collector

This collector is responsible for parsing CPU related commands.
On AIX it runs:

  • mpstat
  • lparstat
Collector configuration

The following tags are supported in the [CPU] section of the config file:

| Tag | Default | Description |
|---|---|---|
| samples | 2 | Readings from the commands used to calculate the usage |
| interval | 1 | Interval between the readings |
| rq_relative_100pct | 10 | Run queue length at which the CPU is considered to be at 100% |
| max_usage_warn | 90 | Utilization percentage at which the checklist triggers a high CPU usage warning |
| min_usage_warn | 30 | Utilization percentage at which the checklist triggers a low CPU usage warning |
| min_core_pool_warn | 2 | Minimal amount of free cores in the shared processor pool before triggering a warning |
| other_warnings | True | Whether warnings related to ilcs and vlcs will be issued |
| involuntarycontextswitch_ratio | 30 | Context switch ratio at which the server is considered to need more CPUs to handle the workload |
| involuntarycorecontextswitch_ratio | 10 | Core context switch ratio at which the server is considered to need more cores |
Health Check routines

The CPU collector will use data from mpstat and lparstat to evaluate whether the server is running out of CPU resources when the CPU utilization goes beyond the threshold value defined in the config ( a simplified sketch of these checks is shown after the list ).

  • When the CPU usage is high, the LPAR runs on shared CPUs and the ilcs vs vlcs ratio is high too, it will trigger an alert suggesting to increase the Entitled Capacity of the LPAR
  • When the CPU usage is high, the LPAR runs on shared CPUs and the cs vs ics ratio is high too, it will trigger an alert suggesting to increase the amount of VCPUs assigned to the LPAR
  • When the run queue is high for the amount of CPUs on the server, it will trigger an alert suggesting to add more VCPUs or cores to the LPAR ( threshold defined in the config )
  • If the LPAR is running on a specific shared processor pool and the pool reaches its limits, it will trigger an alert
  • If the server is idle, it will trigger an alert suggesting to remove resources
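A simplified, hypothetical sketch of the kind of checks described above, driven by the [CPU] tags ( the function name, signature and ratio formulas are illustrative, not the collector's actual code ):

```python
def cpu_healthcheck(util_pct, run_queue, vcpu_count, ilcs, vlcs, cs, ics,
                    max_usage_warn=90, min_usage_warn=30, rq_relative_100pct=10,
                    involuntarycorecontextswitch_ratio=10,
                    involuntarycontextswitch_ratio=30):
    messages = []
    if util_pct >= max_usage_warn:
        # Core starvation hint: involuntary vs virtual context switches ( shared CPUs )
        if vlcs and (ilcs / vlcs) * 100 >= involuntarycorecontextswitch_ratio:
            messages.append("High CPU + high ilcs vs vlcs ratio: consider raising entitled capacity")
        # VCPU starvation hint: involuntary vs total context switches
        elif cs and (ics / cs) * 100 >= involuntarycontextswitch_ratio:
            messages.append("High CPU + high ics vs cs ratio: consider adding VCPUs")
        else:
            messages.append("High CPU utilization detected")
    elif util_pct <= min_usage_warn:
        messages.append("Low CPU utilization detected, resources could be removed")
    # Run queue relative to the amount of CPUs ( rq_relative_100pct )
    if run_queue >= rq_relative_100pct * vcpu_count:
        messages.append("High run queue detected")
    return messages
```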
Actions to perform due to HC messages
| Message | Description | Priority |
|---|---|---|
| High CPU utilization detected along with possible core starvation of the lpar, due high ilcs vs vlcs ratio, values... | This message appears when the application is demanding more cores than are promptly available to the server; core allocation above the entitlement suffers from latency spikes and priority calculations | 🟡 |
| High CPU utilization detected along with possible cpu starvation of the lpar, due high cs vs ics ratio, values... | This message appears when there aren't enough VCPUs on the server for the amount of running applications; this will lead to spikes in the run queue, which might lead to server crashes | 🟠 |
| High CPU utilization detected, values... | High CPU detected and apparently the server has enough resources; this might be a problem and investigation is required | 🔴 |
| LOW CPU utilization detected, values... | The server has more resources than it actually needs; CPU/core removal could benefit the whole system | 🟢 |
| High run queue detected on the server, value... | Normally this happens when processes start to accumulate on the server; when this reaches roughly 10 x the amount of VCPUs the server crashes | 🔴 |
| Shared processor pool near its capacity limit, value... | This indicates that the server is running inside a shared processor pool that is at its limit, which will starve the LPARs; more cores in the pool are needed | 🟠 |

The actions in this case are kind of self-explanatory.

Metrics inserted into InfluxDB
| measurement | tag | Description |
|---|---|---|
| mpstat | host | Server that generated the entry |
| lparstat | host | Server that generated the entry |

vio collector

This collector works only on PowerVM Virtual I/O Servers and wraps the parsing of the following commands:

  • ioscli
    • seastat
  • vnicstat
Health Check routines

At this moment all HC related messages are related to vnicstat, and it will trigger alerts when the following counters increase, per adapter or per adapter CRQ:

  • 'low_memory'
  • 'error_indications'
  • 'device_driver_problem'
  • 'adapter_problem'
  • 'command_response_errors'
  • 'reboots'
  • 'client_completions_error'
  • 'vf_completions_error'
Actions to perform due to HC messages

VNIC implements the SR-IOV specification to allow near direct access to the physical adapter.
This means that data transfers to the adapter queues can be done directly by the client LPAR, so the behavior is nearly the same as a physical adapter.
The glue that holds these queues together at the client LPAR is the Logical Port ( slot ), which also defines the behavior of the SR-IOV Virtual Function. These definitions can be identified at the VIOS as the vnicserver* adapters and as the ent* adapters at the AIX LPARs.
Errors ( like crc/send/recv/duplicate packets ) that originate at the physical level are simply passed to the client LPAR through the VF.
When dealing with VNIC errors, queue descriptor errors usually mean that all physical device queues got full for a moment, while VF errors could be tied to CPU/memory starvation at the client or VIOS level.
Assuming that no physical error has been observed at the adapter or switch itself... and CPU/memory resources are available, queue tuning could help.
To evaluate queue sizes, it's good to consider that VNIC began on P8 servers; I think the defaults were something like this:

  • A maximum of 02 queues per VNIC
  • About 512 packets per queue

This was supposed to handle at least the same amount of packets handled by SEA.
On P9 I've seen 04 and 06 queues per adapter, but as far as I know the limitations are on the NIC and the bus itself, so this should increase fairly easily in the future. But keep in mind that increased capabilities don't mean that the defaults will increase too.
Also, more and bigger queues don't mean higher packet throughput, as CPU is still needed to lift the data from the adapter into the server memory; therefore device specific tuning might be required.

Also, keep in mind that when the adapter is shared across multiple VNICs the queues are shared too, therefore other clients can fill up the queues.

With that said, VNIC troubleshooting isn't very straightforward, so once the queues have been tweaked, if the issues continue... it's advisable to open a PMR with IBM to investigate further.

Metrics inserted into InfluxDB
| measurement | tag | Description |
|---|---|---|
| vnicstat | host | Server that generated the entry |
| vnicstat | backing_device_name | Device at the VIOS |
| vnicstat | client_partition_id | LPAR ID of the AIX/Linux/IBM i client |
| vnicstat | client_partition_name | Hostname of the client LPAR ( sometimes it comes up empty when the client is Linux ) |
| vnicstat | client_operating_system | Client operating system |
| vnicstat | client_device_name | Device name at the client ( sometimes Linux comes up with weird names ) |
| vnicstat | client_device_location_code | Slot at the client partition |
| vnicstat | adapter | Adapter at the VIOS |
| vnicstat | crq | CRQ number within the adapter |
| vnicstat | direction | rx/tx within the CRQ, within the adapter |
| seastat_vlan | host | Server that generated the entry |
| seastat_vlan | adapter | Adapter at the VIOS |
| seastat_vlan | vlan | VLAN which the traffic is using |
| seastat_mac | host | Server that generated the entry |
| seastat_mac | adapter | Adapter at the VIOS |
| seastat_mac | mac | MAC ( virtual HW ) address generating the traffic |

Regarding SEA:
SEA statistics usually come from the entstat command, therefore SEA related statistics are under the entstat metrics.

dio collector

This collector handles disk and disk adapter related metrics.

On AIX it handles the following commands:

  • iostat
  • fcstat ( for all fcs adapters on the lpar )
Health Check routines
Actions to perform due to HC messages
Metrics inserted into InfluxDB
| measurement | tag | Description |
|---|---|---|
| iostat_disks | host | Server that generated the entry |
| iostat_disks | disk | Disk name |

oracle collector

This collector connects to remote Oracle database instances to gather performance measurements and report basic slowdown scenarios.

Collector configuration

All configuration of this collector resides under the [ORACLE] section within the config file.

| Tag | Default | Description |
|---|---|---|
| conn_type | local | How the connection to the database will be established; local uses sqlplus to fetch data and remote uses cx_Oracle; right now only remote works |
| ora_user | [ 'oracle', 'oracle' ] | Must be a list ( even if only one entry ) of users that will be used to connect to the database |
| ora_home | [ '/oracle/database/dbhome_1', '/oracle/grid' ] | Must be a list ( even if only one entry ) of ORACLE_HOMEs; not used when conn_type = remote |
| ora_sid | [ 'tasy21', '+ASM1' ] | Must be a list ( even if only one entry ) of SIDs; not used when conn_type = remote |
| ora_logon | [ '/ as sysdba', '/ as sysasm' ] | Must be a list ( even if only one entry ) of logon strings used to connect; not used when conn_type = remote |
| ora_pass | [ pass, pass ] | Must be a list ( even if only one entry ) of passwords used to connect to the databases |
| ora_dsn | [ host/service, host/service ] | Must be a list ( even if only one entry ) of Oracle DSNs used to connect to remote databases |
| ora_role | [ 0, 2 ] | User role used to connect to the remote database: 0 = DEFAULT_AUTH, 2 = SYSDBA, 32768 = SYSASM |
| ora_users_to_ignore | [ 'PUBLIC', 'APPQOSSYS', 'CTXSYS', 'ORDPLUGINS', 'GSMADMIN_INTERNAL', 'XDB', 'ORDDATA', 'DVSYS', 'OUTLN', 'SYSTEM', 'ORACLE_OCM', 'WMSYS', 'OLAPSYS', 'LBACSYS', 'SYS', 'MDSYS', 'DBSNMP', 'SI_INFORMTN_SCHEMA', 'DVF', 'DBSFWUSER', 'AUDSYS', 'REMOTE_SCHEDULER_AGENT', 'OJVMSYS', 'ORDSYS' ] | List of users to ignore when tracking objects |
| check_statistics_days | 2 | How many days before the statistics of a modified object are considered old |
| log_switches_hour_alert | 3 | Amount of log switches to be tolerated before issuing a warning into syslog |
| script_dumpdir | /tmp/oracle_sql | When checking for fragmentation and old statistics, the system can also create defrag and gather-stats scripts to facilitate maintenance; those scripts will be stored in this directory |
| dump_longops | True | Whether, when a longops query is detected, its execution plan is dumped in order to look for possible causes of the specific longop |
| dump_running_ids | True | Whether a dump of running queries, when detected, is desirable |
| table_reclaimable_treshold | 50 | Amount of fragmentation tolerated before issuing a warning so the admin might take action |
| stats_max_parallel | 10 | Parallel degree used to gather statistics |
| stats_estimate_percent | 60 | Estimate percentage used to gather statistics |
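As an illustration, a minimal [ORACLE] section for a single remote database could look like the sketch below ( user, password and DSN are placeholders, not real values ):

```ini
[ORACLE]
conn_type = remote
ora_user  = [ 'monitoring_user' ]
ora_pass  = [ 'monitoring_password' ]
ora_dsn   = [ 'dbhost/service_name' ]
ora_role  = [ 0 ]
check_statistics_days   = 2
log_switches_hour_alert = 3
script_dumpdir          = /tmp/oracle_sql
```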
Health Check routines

At this moment the oracle collector doesn't retrieve information from InfluxDB to evaluate the alerts before issuing them, therefore duplicate alerts might happen frequently.
The following alert messages are reported:

| Message | Description | Priority |
|---|---|---|
| The instance %s of database %s switched logs %d times at : %s | There are too many changes happening in the database and the redo logs are not big enough to absorb them, therefore a log switch is issued, freezing changes until the switch is completed | 🟠 |
| The database %s has a total of %d longops happening, please check dumped queries | There are some slow queries running in the database, which indicates slowdowns | 🟢 |
| The Query %s from database %s has a execution plan too long, possible problems | A specific query is taking a long time to complete; a possible logical problem in the way the query is being executed | 🟡 |
| The Query %s from database %s has a full table scan, please check | A specific query is taking a long time to complete and doing a full table scan along with it; there is a high chance of a column not being indexed properly | 🟠 |
| Long queries detected using full table scan, please check %d | Amount of queries performing full table scans detected on the system | 🟢 |
Statistics and scripts

This collector will scan tables within the database in order to find tables that might need to have their statistics updated.
The key factor to determine whether the statistics are old is the check_statistics_days tag within the config file; if the statistics are newer than what's defined in the tag, the collector will not check them.

If the statistics are older than what's defined in check_statistics_days, the following criteria are applied:

  • If the table had changes since the last statistics gathering
Metrics inserted into InfluxDB
| measurement | tag | Description |
|---|---|---|
| oracle_logswitches | database | Name of the database that generated the metric |
| oracle_logswitches | <instance_name> | Amount of log switches this specific instance generated in the designated timeframe |
| oracle_stalestats | database | Name of the database that generated the metric |
| oracle_stalestats | user | Owner of the stale objects |
| oracle_stalestats | total | Amount of objects for the specific user |
| oracle_tablespaces | database | Name of the database that generated the metric |
| oracle_tablespaces | tablespace | Tablespace name |
| oracle_tablespaces | total | Total amount of bytes |
| oracle_tablespaces | total_physical_cap | Total amount of physical bytes |
| oracle_tablespaces | free | Free space |
| oracle_tablespaces | free_pct | Percentage of free space in the tablespace |
| oracle_longops | database | Name of the database that generated the metric |
| oracle_longops | server | Server where the longop was identified |
| oracle_longops | instance | Database instance that originated the longop |
| oracle_longops | user | User that was running the query ( longop ) |
| oracle_longops | hash_value | Hash value of the query |
| oracle_longops | sql_id | sql_id of the query |
| oracle_wait_events | database | Name of the database that generated the metric |
| oracle_wait_events | server | Server where the wait event was identified |
| oracle_wait_events | instance | Database instance that originated the wait events |
| oracle_wait_events | wait_class | Class of the wait event |
| oracle_wait_events | total_waits | Total wait events for this class |
| oracle_wait_events | time_waited | Total time waited on this class |
| oracle_wait_events | total_waits_fg | Total amount of foreground wait events for this class |
| oracle_wait_events | time_waited_fg | Amount of time foreground events spent on this wait class |
| oracle_running_queries | database | Name of the database that generated the metric |
| oracle_running_queries | total | Amount of queries running concurrently on this database |
| oracle_sql_monitor | database | Name of the database that generated the metric |
| oracle_sql_monitor | status | Query status from SQL Monitor |
| oracle_sql_monitor | username | User running the query |
| oracle_sql_monitor | module | SQL module being used by the query |
| oracle_sql_monitor | service_name | Service name |
| oracle_sql_monitor | sql_id | Query sql_id |
| oracle_sql_monitor | tot_time | Amount of time spent on this query |
| oracle_temp_tablespaces | database | Name of the database that generated the metric |
| oracle_temp_tablespaces | tablespace | Tablespace name |
| oracle_temp_tablespaces | usage_in_mb | Amount of space used, in megabytes |
| oracle_sessions | host | Server name running the database |
| oracle_sessions | instance | Instance name |
| oracle_sessions | total_sessions | Amount of sessions |
| oracle_objects | database | Name of the database that generated the metric |
| oracle_objects | user | Owner of the object |
| oracle_objects | valid | Amount of valid objects |
| oracle_objects | invalid | Amount of invalid objects |

Important
The data model of this collector might and will change in the near future, in order to provide more useful information

bos collector

This is an inner collector that gathers information about the target server and feeds it into the other collectors.
If the checklist's Python interface is being used, the device tree, serial numbers and SMT modes can be found here.
The documentation of this collector is available only through Python's help() interface and Sphinx.


Legacy / Standalone shell scripts

The scripts in the sh directory are intended to be used in conjunction with Splunk and are not really used by the Python collector or Ansible anymore. The list of scripts and their purposes follows:

| Script | Description |
|---|---|
| checklist-aix.sh | Do a data capture of the server |
| fcstat.sh | Collect Fibre Channel interface statistics |
| netstat.sh | Collect Ethernet interface statistics |
| cpu.sh | Collect CPU/core statistics |
| powerha_check.sh | Do an automated PowerHA health check |
| vmstat_i.sh | Virtual memory interrupt related statistics |
| vmstat_s.sh | Virtual memory system wide statistics |
| lspath.sh | Disk multipath health ( relies on AIX MPIO ) |
| errpt_count.sh | Count the amount of elements out of errpt |
| seastat.sh | Get network statistics from VIOS SEA adapters |
| mount.sh | Check filesystem mount parameters for unsafe settings |

TODO / IDEAS

  • Better documentation ( Better adoption of sphinx into the APIs )
  • Send messages to a webhook instead of syslog ( like M$ Teams or Slack )
  • Collect data from Linux Servers
  • Gather statistics from netstat -aon ( AIX )
  • Handle other ioscli commands
  • Handle Memory related commands
  • Handle process related commands
  • Gather data from SAP jobs
  • Enable HC using data inside the DB, without fetching data from the server ( Python mode only, probably the next one )
  • Provide HC messages through rest APIs ( Using Flask or Tornado )
  • Review fcstat data model and HC messages related to it
  • When providing data through REST, convert the lists into np.arrays in order to use ML to calculate trends and isolate behaviors
