This is the repository for the AIX Checklist tool.
The checklist started as a shell script to facilitate health-check assessments and performance evaluation of servers.
As it evolved, some of its contents were split into smaller scripts that could be integrated into 3rd-party monitoring tools, like Splunk.
However, in order to facilitate out-of-the-box analysis of complex environments and better correlation of server behavior, a major rewrite in Python has been initiated.
At this moment about 10% of the original checklist has been ported to Python, and it supports:
- AIX servers
- PowerVM Virtual IO Servers
- Oracle Databases ( using the cx_Oracle python module )
This is intended to be used for data exploration and server troubleshooting, mainly where a combination of factors is crucial.
While it's possible to do performance monitoring and even capacity planning using this tool, for that you should look at njmon.
The Python collector can run locally, within the server, or use Ansible to fetch data from remote servers.
All collected stats are parsed and sent to InfluxDB in order to allow performance measurement over time and health-check operations.
All scripts are under the pq_checklist directory, which works as a Python module.
The entry point is pq_checklist.main, as described in the main.py file.
You can run the checklist as a standard process through the "main.py" script or interactively from a Python shell.
In order to facilitate execution, a shell wrapper is provided:
$ ./main.py -d
Parameter | Description |
---|---|
-h | Help Message |
-l | An influxdb dump file to load into the local database ( cannot be used along with -d ) |
-d | Run the checklist as a foreground daemon |
-c | Specify a config file to be used, skipping the search for a config file or the creation of a new one |
Within a python3 shell it can be executed as a python module, as shown below:
import pq_checklist
config,logger = pq_checklist.get_config(log_start=True,config_path='/etc/pq_checklist.conf')
collector_instance = pq_checklist.general_collector.collector(config=config,logger=logger)
collected_data = collector_instance.collect_all()
Most of the classes are properly documented with docstrings using Sphinx formatting.
This is a general overview of how the checklist components are tied together:
Installation and dependency tracking is done using Python's setuptools:
python3 setup.py install
At this moment the checklist ( regardless of its execution mode ) needs the following modules installed on the server:
- ansible-runner
- configparser
- asyncio
- influxdb_client
- xdg
If Oracle will be monitored too, the cx_Oracle python module is required.
Normally all dependencies should be handled by setuptools during the install;
for further information please check the setup.cfg file.
Optionally you can package it into a bdist_wheel or even an RPM package in order to facilitate distribution.
Please check setuptools documentation for further details.
At first execution the checklist will look for its configuration file following the XDG variables.
If not found, it will also look into /etc and /opt/freeware/etc; if not found there either, it will create one at ${XDG_CONFIG_HOME}, using pq_checklist/pq_checklist.conf.template as a template.
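The search order above can be sketched as follows; the function and variable names here are illustrative, not the actual pq_checklist.get_config() internals:

```python
import os

def find_config(filename="pq_checklist.conf"):
    """Illustrative sketch of the config search order: XDG first,
    then /etc, then /opt/freeware/etc."""
    xdg_home = os.environ.get("XDG_CONFIG_HOME",
                              os.path.expanduser("~/.config"))
    candidates = [os.path.join(xdg_home, filename),
                  os.path.join("/etc", filename),
                  os.path.join("/opt/freeware/etc", filename)]
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Not found anywhere: the checklist would create one at
    # ${XDG_CONFIG_HOME} from pq_checklist/pq_checklist.conf.template.
    return None
```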
Key components of the configuration are organized as sections, which are:
This section defines the database connection parameters used by the checklist.
If no database will be used, all query information will be dumped to the dump_file destination.
It's advisable not to remove the url parameter.
If no database will be used, just leave it with the default value "<influxdb_url>".
Available tags:
Tag | Default | Description |
---|---|---|
url | <influxdb_url> | Url used to connect into influxDB |
token | <influxdb_token> | InfluxDB authentication token |
org | <influxdb_org> | Organization that will be used within InfluxDB |
bucket | <influxdb_bucket> | Bucket into InfluxDB |
timeout | 50000 | Time, in seconds, before the query is considered to have failed |
dump_file | /tmp/influx_queries.json | File to store failed queries, or every query if an invalid db url is provided |
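As a sketch of how these tags might be consumed ( the function name and returned dict are illustrative; the real checklist reads them through its own config loader ):

```python
import configparser

def influx_settings(text):
    """Parse an [INFLUXDB] section, falling back to the documented
    defaults; when url is still the placeholder, queries go to dump_file."""
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    sec = cfg["INFLUXDB"]
    settings = {
        "url":       sec.get("url", "<influxdb_url>"),
        "token":     sec.get("token", "<influxdb_token>"),
        "org":       sec.get("org", "<influxdb_org>"),
        "bucket":    sec.get("bucket", "<influxdb_bucket>"),
        "timeout":   sec.getint("timeout", fallback=50000),
        "dump_file": sec.get("dump_file", "/tmp/influx_queries.json"),
    }
    # Placeholder url means no reachable database: use the dump file.
    settings["use_dump_file"] = settings["url"] == "<influxdb_url>"
    return settings
```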
When running in Ansible mode, this section defines the targets and which ansible playbook will be played, along with the artifacts and modules location.
It's advisable not to change the playbook or modules, unless you're extending the checklist's capabilities.
Available tags:
Tag | Default | Description |
---|---|---|
host_target | all | Host group which will be the target of the checklist |
playbook | getperf.yml | Playbook that will be executed; right now the checklist looks specifically for what is defined in getperf.yml |
private_data_dir | ansible | Base directory that holds ansible related assets |
This section defines how the checklist will behave: which collectors are to be loaded and whether the health check routines that exist within the collectors are to be called.
Available tags:
Tag | Default | Description |
---|---|---|
mode | local | Whether the checklist will get statistics from the local server or use ansible to fetch data |
healthcheck | True | Whether the healthcheck routine provided by each collector will be called upon its execution |
collectors | [ 'net', 'cpu', 'vio', 'dio' ] | Collectors to load with the checklist, please see the collectors section for further details |
This section defines the interval at which the checklist will collect data.
Available tags:
Tag | Default | Description |
---|---|---|
interval | 300 | Seconds between collection cycles ( only valid when using general_collector's loop ) |
hc_loop_count | 2 | Interval, in collection cycles, between calls to the health check routines; with these defaults the HC will be performed every 600 seconds |
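The relationship between the two tags can be illustrated with a toy helper ( hypothetical, not part of the checklist ):

```python
def cycles_with_hc(total_cycles, hc_loop_count=2):
    """Return the 1-based collection cycles on which the health check
    would fire, given hc_loop_count from the [LOOP] section."""
    return [c for c in range(1, total_cycles + 1) if c % hc_loop_count == 0]

# With interval=300 and hc_loop_count=2, the HC fires on cycles 2, 4, 6, ...
# i.e. every 2 * 300 = 600 seconds.
```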
Those are sections tied to collectors; please check the collectors part of this document for details about these config sections.
In order to simplify development, the checklist data gathering capabilities have been split into modules, which are called collectors.
Each collector provides essentially two things:
- Measurements :
- Data to be inserted into InfluxDB for further analysis
- Health Check :
- Automated analysis of the server health, based on the collected data
Important:
Whatever message comes out of the health check functions just means something that should be checked, from MY perspective.
Don't engage in tuning crusades or HW capacity endeavors without engaging the proper support channels ( like your solution provider ).
Each collector has its own set of Health Check ( HC ) validations, and at each running cycle the HC messages consolidated across the collectors are pushed to syslog.
The messages follow the directives defined at the checklist configuration file
The message report behavior follows a few patterns to minimize duplicity:
- Validate that the counter/metric has changed since the previous reading, wherever possible
- Validate whether it is something that normally changes
- If it changes normally, whether it is changing above its normal rate ( right now this just evaluates the average for the past 24 hours )
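These three rules can be sketched as a small heuristic; the names and the simplistic 24-hour window here are illustrative, not the checklist's actual implementation ( which queries InfluxDB for the history ):

```python
def should_report(history, current, normally_changes, window=24):
    """history: previous hourly readings of a counter (oldest first)."""
    previous = history[-1]
    if current == previous:          # rule 1: counter did not change
        return False
    if not normally_changes:         # rule 2: a normally-static counter moved
        return True
    # rule 3: it changes normally, so only report when the latest delta
    # exceeds the average delta over the past `window` readings.
    recent = history[-window:]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    avg = sum(deltas) / len(deltas) if deltas else 0
    return (current - previous) > avg
```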
All health check messages go to syslog and use the script name as the main tag of the messages:
- HC MSG FROM : health check from a specific server ( message details below )
This collector is responsible for parsing network-related commands.
On AIX and VirtualIO Servers, at this moment it will fetch the following commands:
- netstat -s
- netstat -aon
- entstat -d ( for each ent interface found on the server )
At this moment this collector reports warnings for several counters from the entstat command when they increment in abnormal ways, like:
- a counter has not changed for the majority of its life, but started to increase
- it changes frequently, but began to increase at a faster rate
In case the adapter is an etherchannel-like adapter and is set to use LACP, it will also send messages in case LACP gets out of sync.
Section from entstat | Counters |
---|---|
transmit_stats | ( 'transmit_errors', 'receive_errors', 'transmit_packets_dropped', 'receive_packets_dropped', 'bad_packets', 's_w_transmit_queue_overflow', 'no_carrier_sense', 'crc_errors', 'dma_underrun', 'dma_overrun', 'lost_cts_errors', 'alignment_errors', 'max_collision_errors', 'no_resource_errors', 'late_collision_errors', 'receive_collision_errors', 'packet_too_short_errors', 'packet_too_long_errors', 'timeout_errors', 'packets_discarded_by_adapter', 'single_collision_count', 'multiple_collision_count' ) |
general_stats | ( 'no_mbuf_errors' ) |
dev_stats | ( 'number_of_xoff_packets_transmitted', 'number_of_xon_packets_transmitted', 'number_of_xoff_packets_received', 'number_of_xon_packets_received', 'transmit_q_no_buffers', 'transmit_q_dropped_packets', 'transmit_swq_dropped_packets', 'receive_q_no_buffers', 'receive_q_errors', 'receive_q_dropped_packets' ) |
addon_stats | ( 'rx_error_bytes', 'rx_crc_errors', 'rx_align_errors', 'rx_discards', 'rx_mf_tag_discard', 'rx_brb_discard', 'rx_pause_frames', 'rx_phy_ip_err_discards', 'rx_csum_offload_errors', 'tx_error_bytes', 'tx_mac_errors', 'tx_carrier_errors', 'tx_single_collisions', 'tx_deferred', 'tx_excess_collisions', 'tx_late_collisions', 'tx_total_collisions', 'tx_pause_frames', 'unrecoverable_errors' ) |
veth_stats | ( 'send_errors', 'invalid_vlan_id_packets', 'receiver_failures', 'platform_large_send_packets_dropped' ) |
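Most of the entstat -d output consists of `Label: value` pairs grouped under the sections above; a minimal, illustrative parsing sketch ( the real net collector is considerably more complete, e.g. it handles the two-column transmit/receive layout ) could look like:

```python
import re

def parse_counters(entstat_text):
    """Map normalized counter names to integer values, e.g.
    'Transmit Errors: 3' -> {'transmit_errors': 3}."""
    counters = {}
    for line in entstat_text.splitlines():
        m = re.match(r"\s*([A-Za-z][A-Za-z0-9 _/]+):\s+(\d+)\s*$", line)
        if m:
            # Normalize 'S/W Transmit Queue Overflow' into
            # 's_w_transmit_queue_overflow', matching the names above.
            name = (m.group(1).strip().lower()
                    .replace(" ", "_").replace("/", "_"))
            counters[name] = int(m.group(2))
    return counters
```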
For messages in the transmit_stats section, this usually indicates issues at lower levels of the adapter, like:
- physical switch errors
- physical adapter queue saturation ( usually something at dev_stats or addon_stats will come up too )
- virtual adapter saturation ( usually something at veth_stats will come up )
For messages under dev_stats and addon_stats, this is usually tied to physical adapter saturation, like rx/tx queues, or buffer segmentation issues ( TSO/LSO ).
Normally counters in these sections can be remediated through tuning the queue sizes ( or even the number of queues ) and matching the interrupt coalescing intervals with the CPU resources.
Also, keep in mind that xoff/xon counters incrementing usually indicate CPU/bus starvation on either the server or switch side.
At this moment the checklist will report hints to troubleshoot issues for the following counters:
Session | Counter | Message | Impact, normal action | Priority |
---|---|---|---|---|
veth_stats | send_errors | Error sending packets to the VIOS; if buffers are maxed out please check VIOS resources | As long as the server is not running out of cores ( check cpu_collector ), adjusting the tiny/small/medium/large/huge buffers tends to address these issues | 🟡 |
veth_stats | receiver_failures | Possible starvation errors at the server; if buffers are maxed out please check CPU capabilities | As long as the server is not running out of cores ( check cpu_collector ), adjusting the tiny/small/medium/large/huge buffers tends to address these issues | 🟡 |
veth_stats | platform_large_send_packets_dropped | Error sending PLSO packets to the VIOS; if buffers are maxed out and there is no backend error at the physical adapters, please check VIOS resources | As long as the server is not running out of cores ( check cpu_collector ), adjusting the number of dog_threads on AIX or sea_threads on the VIOS helps | 🟡 |
addon_stats | rx_pause_frames | Possible saturation at the switch side | Nothing can be done at the server side | 🟡 |
addon_stats | tx_pause_frames | Possible saturation at the server side; queue or CPU saturation is likely | As long as the server is not running out of cores ( check cpu_collector ), adjusting intr_priority and intr_time might help with these issues | 🟡 |
dev_stats | number_of_xoff_packets_transmitted | Possible saturation at the server side; queue or CPU saturation is likely | As long as the server is not running out of cores ( check cpu_collector ), adjusting intr_priority and intr_time might help with these issues | 🟡 |
dev_stats | number_of_xoff_packets_received | Possible saturation at the server side; queue or CPU saturation is likely | Nothing can be done at the server side | 🟡 |
dev_stats | transmit_q_no_buffers | Buffer saturation; possibly more TX queues are advisable | As long as the server is not running out of cores ( check cpu_collector ), adjusting queue_size, tx_max_pkts, and tx_limit might help | 🟡 |
dev_stats | transmit_swq_dropped_packets | Buffer saturation; possibly bigger queues are advisable | Can lead to slowdowns | 🟡 |
dev_stats | receive_q_no_buffers | Buffer saturation; possibly more RX queues are advisable | Possibly the combination of packets across all queues exceeded the number of packets allowed in the buffer; more queues might help with better management ( queues_rx increases ), otherwise increasing the total number of packets in the buffer might help | 🔴 |
general_stats | no_mbuf_errors | The network stack lacks memory buffers; a check of thewall is advisable | Possible inability to offload packets from the adapter into the AIX network stack; increasing the network buffers ( via the "no" command ), especially thewall, might help | 🟡 |
lacp_port_stats | partner_state | LACP error ( possible switch port mismatch ) | Can lead to loss of connectivity; check the port-channel on both the switch and server side | 🔴 |
lacp_port_stats | actor_state | LACP error ( possible switch port mismatch ) | Can lead to loss of connectivity; check the port-channel on both the switch and server side | 🔴 |
For further reading on the topic, please check:
- https://community.ibm.com/community/user/power/blogs/jim-cunningham1/2020/06/22/aix-network-tuning-for-10ge-and-virtual-network
- https://www.ibm.com/docs/en/aix/7.2?topic=parameters-network-option-tunable
- https://www.ibm.com/support/pages/10-gbit-ethernet-bad-assumptions-and-best-practice
- https://www.ibm.com/docs/en/ssw_aix_72/performance/performance_pdf.pdf
At this moment the net_collector provides the following measurements:
measurement | tag | Description |
---|---|---|
entstat | host | Server that originated the observation |
entstat | stats_type | Section within the entstat command output that generated the entry; can be: transmit_stats, general_stats, dev_stats, addon_stats, veth_stats |
entstat | interface | Interface that generated the observation |
netstat_general | host | Server that originated the observation |
netstat_general | protocol | Protocol that generated the observation |
netstat_general | session_group | Section within the protocol that generated the information |
netstat_general | session | Section within the section group that generated the information |
This collector is responsible for CPU-related commands.
On AIX it runs:
- mpstat
- lparstat
The supported tags in the [CPU] section of the config file follow:
Tag | Default | Description |
---|---|---|
samples | 2 | Readings from the commands used to calculate the usage |
interval | 1 | Interval between the readings |
rq_relative_100pct | 10 | Run queue length at which the CPU is considered to be at 100% |
max_usage_warn | 90 | Utilization percentage at which the checklist triggers a high CPU usage warning |
min_usage_warn | 30 | Utilization percentage at which the checklist triggers a low CPU usage warning |
min_core_pool_warn | 2 | Minimum number of free cores in the shared processor pool before triggering a warning |
other_warnings | True | Whether warnings related to ilcs and vlcs will be issued |
involuntarycontextswitch_ratio | 30 | Context switch ratio at which the server is considered to need more CPUs to handle the workload |
involuntarycorecontextswitch_ratio | 10 | Core context switch ratio at which the server is considered to need more cores |
The CPU collector uses data from mpstat and lparstat to evaluate whether the server is running out of CPU resources whenever CPU utilization goes beyond the threshold values defined in the config.
- When the CPU is high, running on shared CPUs and the ilcs vs vlcs ratio is high too, it will trigger an alert suggesting to increase the Entitled Capacity of the LPAR
- When the CPU is high, running on shared CPUs and the cs vs ics ratio is high too, it will trigger an alert suggesting to increase the amount of VCPUs assigned to the LPAR
- When the run queue is high for the number of CPUs on the server, it will trigger an alert suggesting to add more VCPUs or cores to the LPAR ( threshold defined in the config )
- If the LPAR is running in a specific shared processor pool and the pool reaches its limits, it will trigger an alert
- If the server is idle, it will trigger an alert suggesting to remove resources
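These rules can be sketched as follows; the thresholds default to the [CPU] tags above, while the function name and the pre-computed ratio inputs are illustrative ( the real collector derives them from mpstat/lparstat output ):

```python
def cpu_alerts(util, ilcs_ratio, cs_ics_ratio, run_queue, vcpus,
               shared=True, max_usage_warn=90, min_usage_warn=30,
               involuntarycontextswitch_ratio=30,
               involuntarycorecontextswitch_ratio=10,
               rq_relative_100pct=10):
    """Illustrative decision rules for the CPU health check.
    ilcs_ratio = ilcs vs vlcs ratio; cs_ics_ratio = cs vs ics ratio."""
    alerts = []
    if util >= max_usage_warn:
        if shared and ilcs_ratio >= involuntarycorecontextswitch_ratio:
            alerts.append("increase Entitled Capacity")   # core starvation
        if shared and cs_ics_ratio >= involuntarycontextswitch_ratio:
            alerts.append("increase VCPUs")               # cpu starvation
        if not alerts:
            alerts.append("high CPU utilization")         # resources look OK
    elif util <= min_usage_warn:
        alerts.append("low utilization, consider removing resources")
    if run_queue >= rq_relative_100pct * vcpus:
        alerts.append("high run queue")
    return alerts
```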
Message | Description | Priority |
---|---|---|
High CPU utilization detected along with possible core starvation of the lpar, due to a high ilcs vs vlcs ratio, values... | This message appears when the application is demanding more cores than are promptly available to the server; core allocation above the entitlement suffers from latency spikes and priority calculations | 🟡 |
High CPU utilization detected along with possible cpu starvation of the lpar, due to a high cs vs ics ratio, values... | This message appears when there aren't enough VCPUs on the server for the amount of running applications; this will lead to spikes in the run queue, which might lead to server crashes | 🟠 |
High CPU utilization detected, values... | High CPU detected and apparently the server has enough resources; this might be a problem and investigation is required | 🔴 |
LOW CPU utilization detected, values... | The server has more resources than it actually needs; CPU/core removal could benefit the whole system | 🟢 |
High run queue detected on the server, value... | Normally this happens when processes start to accumulate on the server; when it reaches roughly 10x the number of VCPUs, the server crashes | 🔴 |
Shared processor pool near its capacity limit, value... | This indicates that the server is running inside a shared processor pool that is at its limit, which will starve the lpars; more cores in the pool are needed | 🟠 |
The actions in this case are mostly self-explanatory.
measurement | tag | Description |
---|---|---|
mpstat | host | Server that generated the entry |
lparstat | host | Server that generated the entry |
This collector works only on PowerVM VirtualIO Servers and wraps the parsing of the following commands:
- ioscli
- seastat
- vnicstat
At this moment all HC-related messages are related to vnicstat, and it will trigger alerts when the following counters increase, per adapter or per adapter CRQ:
- 'low_memory'
- 'error_indications'
- 'device_driver_problem'
- 'adapter_problem'
- 'command_response_errors'
- 'reboots'
- 'client_completions_error'
- 'vf_completions_error'
VNIC implements the SR-IOV specification to allow near-direct access to the physical adapter.
This means that data transfer to the adapter queues can be done directly by the client LPAR, so the behavior is nearly the same as a physical adapter's.
The glue that holds these queues together at the client LPAR is the Logical Port ( Slot ), which also defines the behavior of the SR-IOV Virtual Function. These definitions can be identified at the VIOS as the vnicserver* adapters, and as the ent* adapters at the AIX LPARs.
Errors ( like crc/send/recv/duplicate packets ) originating at the physical level are simply passed to the client lpar through the VF.
When dealing with VNIC errors, queue descriptor errors usually mean that all physical device queues got full for a moment, while VF errors could be tied to CPU/memory starvation at the client or VIOS level.
Assuming that no physical error has been observed at the adapter or switch itself, and CPU/memory resources are available, queue tuning could help.
To evaluate queue sizes, it's good to consider that VNIC began on P8 servers; I think the defaults were something like this:
- A maximum of 2 queues per VNIC
- About 512 packets per queue
This was supposed to handle at least the same number of packets handled by the SEA.
On P9 I've seen 4 and 6 queues per adapter, but as far as I know the limitations are on the NIC and bus themselves, so this should increase fairly easily in the future. But keep in mind that increased capabilities don't mean the defaults will increase too.
Also, more and bigger queues don't mean higher packet throughput, as CPU is still needed to lift the data from the adapter into the server memory; therefore device-specific tuning might be required.
Also keep in mind that when the adapter is shared across multiple VNICs, the queues are shared too; therefore other clients can fill them up.
With that said, VNIC troubleshooting isn't very straightforward, so once the queues have been tweaked, if the issues continue it's advisable to open a PMR with IBM to investigate further.
measurement | tag | Description |
---|---|---|
vnicstat | host | Server that generated the entry |
vnicstat | backing_device_name | Device at the VIOS |
vnicstat | client_partition_id | LPAR ID of the AIX/Linux/i Client |
vnicstat | client_partition_name | Hostname of the client lpar ( sometimes it comes empty when the client is linux ) |
vnicstat | client_operating_system | Client Operating System |
vnicstat | client_device_name | Device name at the client ( sometimes linux comes up with weird names ) |
vnicstat | client_device_location_code | Slot at the Client partition |
vnicstat | adapter | Adapter at VIOS |
vnicstat | crq | CRQ number within the adapter |
vnicstat | direction | rx/tx within the CRQ, within the adapter |
seastat_vlan | host | Server that generated the entry |
seastat_vlan | adapter | Adapter at VIOS |
seastat_vlan | vlan | Vlan which the traffic is using |
seastat_mac | host | Server that generated the entry |
seastat_mac | adapter | Adapter at VIOS |
seastat_mac | mac | Mac ( virtual HW ) address generating traffic |
Regarding the SEA:
SEA statistics usually come from the entstat command, therefore SEA-related statistics are found under the entstat metrics.
This collector handles disk and disk adapter related metrics.
On AIX it handles the following commands:
- iostat
- fcstat ( for all fcs adapters on the lpar )
measurement | tag | Description |
---|---|---|
iostat_disks | host | Server that generated the entry |
iostat_disks | disk | Disk name |
This collector connects to remote Oracle database instances to gather performance measurements and report basic slowdown scenarios.
All configuration for this collector resides under the [ORACLE] section of the config file.
Tag | Default | Description |
---|---|---|
conn_type | local | How the connection to the database will be established; local will use sqlplus to fetch data and remote will use cx_Oracle; right now only remote works |
ora_user | [ 'oracle', 'oracle' ] | Must be a list ( even with only one entry ) of users that will be used to connect to the database |
ora_home | [ '/oracle/database/dbhome_1', '/oracle/grid' ] | Must be a list ( even with only one entry ) of ORACLE_HOMEs; not used when conn_type = remote |
ora_sid | [ 'tasy21', '+ASM1' ] | Must be a list ( even with only one entry ) of SIDs; not used when conn_type = remote |
ora_logon | [ '/ as sysdba', '/ as sysasm' ] | Must be a list ( even with only one entry ) of logon strings used to connect; not used when conn_type = remote |
ora_pass | [ pass, pass ] | Must be a list ( even with only one entry ) of passwords used to connect to the databases |
ora_dsn | [ host/service, host/service ] | Must be a list ( even with only one entry ) of Oracle DSNs used to connect to remote databases |
ora_role | [ 0, 2 ] | User role used to connect to the remote database: 0 = DEFAULT_AUTH, 2 = SYSDBA, 32768 = SYSASM |
ora_users_to_ignore | [ 'PUBLIC', 'APPQOSSYS', 'CTXSYS', 'ORDPLUGINS', 'GSMADMIN_INTERNAL', 'XDB', 'ORDDATA', 'DVSYS', 'OUTLN', 'SYSTEM', 'ORACLE_OCM', 'WMSYS', 'OLAPSYS', 'LBACSYS', 'SYS', 'MDSYS', 'DBSNMP', 'SI_INFORMTN_SCHEMA', 'DVF', 'DBSFWUSER', 'AUDSYS', 'REMOTE_SCHEDULER_AGENT', 'OJVMSYS', 'ORDSYS' ] | List of users to ignore when tracking objects |
check_statistics_days | 2 | Number of days after which the statistics of a modified object are considered old |
log_switches_hour_alert | 3 | Number of log switches per hour tolerated before issuing a warning to syslog |
script_dumpdir | /tmp/oracle_sql | When checking for fragmentation and old statistics, the system can also create defrag and gather-stats scripts to facilitate maintenance; those scripts will be stored in this directory |
dump_longops | True | Whether, upon detecting a longops query, to dump its execution plan in order to look for possible causes of the specific longop |
dump_running_ids | True | Whether a dump of running queries, when detected, is desirable |
table_reclaimable_treshold | 50 | Amount of fragmentation tolerated before issuing a warning so the admin can take action |
stats_max_parallel | 10 | Parallel degree used to gather statistics |
stats_estimate_percent | 60 | Estimate percentage used to gather statistics |
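As an illustration of how script_dumpdir, stats_max_parallel, and stats_estimate_percent could fit together when the gather-stats helper scripts are emitted ( the file naming and exact SQL shape here are assumptions, not the collector's actual output ):

```python
import os

def write_gather_stats_script(owner, table, script_dumpdir="/tmp/oracle_sql",
                              stats_max_parallel=10, stats_estimate_percent=60):
    """Write a DBMS_STATS call for one stale table; names are illustrative."""
    os.makedirs(script_dumpdir, exist_ok=True)
    sql = ("EXEC DBMS_STATS.GATHER_TABLE_STATS("
           f"ownname => '{owner}', tabname => '{table}', "
           f"degree => {stats_max_parallel}, "
           f"estimate_percent => {stats_estimate_percent});\n")
    path = os.path.join(script_dumpdir, f"gather_{owner}_{table}.sql")
    with open(path, "w") as fh:
        fh.write(sql)
    return path
```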
At this moment the oracle collector doesn't retrieve information from influxdb to evaluate the alerts before issuing them, therefore duplicate alerts might happen frequently.
The alert messages being reported follow:
Message | Description | Priority |
---|---|---|
The instance %s of database %s switched logs %d times at : %s | There are too many changes happening in the database and the redo logs are not big enough to absorb them, therefore a log switch is issued, freezing changes until the switch is completed | 🟠 |
The database %s has a total of %d longops happening, please check dumped queries | There are some slow queries running in the database, which indicates slowdowns | 🟢 |
The Query %s from database %s has an execution plan too long, possible problems | A specific query is taking a long time to complete; a logical problem in the way the query is being executed is possible | 🟡 |
The Query %s from database %s has a full table scan, please check | A specific query is taking a long time to complete and is doing a full table scan along the way; there is a high chance a column is not indexed properly | 🟠 |
Long queries detected using full table scan, please check %d | Number of queries performing full table scans detected in the system | 🟢 |
This collector will scan the tables within the database in order to find tables that might need their statistics updated.
The key factor in determining whether the statistics are old is the check_statistics_days tag in the config file; if the statistics are newer than what's defined in the tag, the collector will not check the object.
If the statistics are older than check_statistics_days, the following criteria apply:
- whether the table had changes since the last statistics gathering
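Those criteria boil down to a small decision, sketched here with illustrative names:

```python
from datetime import datetime, timedelta

def stats_are_stale(last_analyzed, modified_since_analyze,
                    check_statistics_days=2, now=None):
    """Stale = statistics older than check_statistics_days AND the
    table was modified since the last analyze."""
    now = now or datetime.now()
    if now - last_analyzed <= timedelta(days=check_statistics_days):
        return False                 # statistics are recent enough
    return modified_since_analyze    # old AND the table changed
```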
measurement | tag | Description |
---|---|---|
oracle_logswitches | database | Name of database that generated the metric |
oracle_logswitches | <instance_name> | Number of log switches this specific instance generated in the designated timeframe |
oracle_stalestats | database | Name of database that generated the metric |
oracle_stalestats | user | owner of the stale objects |
oracle_stalestats | total | amount of objects for the specific user |
oracle_tablespaces | database | Name of database that generated the metric |
oracle_tablespaces | tablespace | tablespace name |
oracle_tablespaces | total | Total amount of bytes |
oracle_tablespaces | total_physical_cap | Total amount of physical bytes |
oracle_tablespaces | free | Free space |
oracle_tablespaces | free_pct | Percentage of free space into the tablespace |
oracle_longops | database | Name of database that generated the metric |
oracle_longops | server | Server where the longop was identified |
oracle_longops | instance | Database instance that originated the longops |
oracle_longops | user | User that was running the query ( longop ) |
oracle_longops | hash_value | Hash value of the query |
oracle_longops | sql_id | sql_id of the query |
oracle_wait_events | database | Name of database that generated the metric |
oracle_wait_events | server | Server where the wait event was identified |
oracle_wait_events | instance | Database instance that originated the wait events |
oracle_wait_events | wait_class | Class of the wait event |
oracle_wait_events | total_waits | Total wait events for this class |
oracle_wait_events | time_waited | Total time waited on this class |
oracle_wait_events | total_waits_fg | Total amount of foreground wait events for this class |
oracle_wait_events | time_waited_fg | Amount of time foreground events spent on this wait class |
oracle_running_queries | database | Name of database that generated the metric |
oracle_running_queries | total | Amount of queries running concurrently on this database |
oracle_sql_monitor | database | Name of database that generated the metric |
oracle_sql_monitor | status | Query status from sql monitor |
oracle_sql_monitor | username | User running the query |
oracle_sql_monitor | module | sql module being used on the query |
oracle_sql_monitor | service_name | Service name |
oracle_sql_monitor | sql_id | Query sql_id |
oracle_sql_monitor | tot_time | Amount of time spent on this query |
oracle_temp_tablespaces | database | Name of database that generated the metric |
oracle_temp_tablespaces | tablespace | tablespace name |
oracle_temp_tablespaces | usage_in_mb | Amount of space used in Megabytes |
oracle_sessions | host | Server name running the database |
oracle_sessions | instance | Instance name |
oracle_sessions | total_sessions | Amount of sessions |
oracle_objects | database | Name of database that generated the metric |
oracle_objects | user | Owner of the object |
oracle_objects | valid | Amount of valid objects |
oracle_objects | invalid | Amount of invalid objects |
Important:
The data model of this collector might, and likely will, change in the near future, in order to provide more useful information.
This is an internal collector that gathers information about the target server and feeds it into the other collectors.
If the checklist's python interface is being used, the device tree, serial numbers, and SMT modes can be found here.
The documentation of this collector is available only through python's help() interface and sphinx.
The scripts in the sh directory are intended to be used in conjunction with Splunk and are not really used by the python collector or ansible anymore. The list of scripts and their purpose follows:
Script | Description |
---|---|
checklist-aix.sh | Do a data capture of the server |
fcstat.sh | Collect Fibre interface statistics |
netstat.sh | Collect Ethernet interface statistics |
cpu.sh | Collect CPU/CORE statistics |
powerha_check.sh | Do an automated PowerHA health check |
vmstat_i.sh | Virtual Memory interrupt related statistics |
vmstat_s.sh | Virtual Memory system wide statistics |
lspath.sh | Disk multipath health ( Rely on AIX MPIO ) |
errpt_count.sh | Count the number of entries reported by errpt |
seastat.sh | Get Network Statistics from VIOS SEA Adapters |
mount.sh | Check filesystem mount parameters for unsafe settings |
- Better documentation ( Better adoption of sphinx into the APIs )
- Send messages to a webhook instead of syslog ( like M$ Teams or Slack )
- Collect data from Linux Servers
- Gather statistics from netstat -aon ( AIX )
- Handle other ioscli commands
- Handle Memory related commands
- Handle process related commands
- Gather data from SAP jobs
- Enable HC using data inside the DB, without fetching data from the server ( Python mode only, probably the next one )
- Provide HC messages through rest APIs ( Using Flask or Tornado )
- Review fcstat data model and HC messages related to it
- When providing data through REST, convert the lists into np.arrays in order to use ML to calculate trends and isolate behaviors