
WARNING (or CRITICAL) on CPU_IOWAIT #2330

Closed
maravento opened this issue Apr 5, 2023 · 5 comments

Comments

@maravento

maravento commented Apr 5, 2023

Describe the bug
WARNING (or CRITICAL) on CPU_IOWAIT

To Reproduce
Steps to reproduce the behavior:

sudo apt install glances

sudo /etc/init.d/glances status
[sudo] password for adminred: 
● glances.service - Glances
     Loaded: loaded (/lib/systemd/system/glances.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-04-04 19:37:48 -05; 12h ago
       Docs: man:glances(1)
             https://github.com/nicolargo/glances
   Main PID: 1004035 (glances)
      Tasks: 1 (limit: 18893)
     Memory: 88.3M
        CPU: 20min 38.361s
     CGroup: /system.slice/glances.service
             └─1004035 /usr/bin/python3 /usr/bin/glances -w -B 127.0.0.1 -t 10
Apr 04 19:37:48 adminred systemd[1]: Started Glances.

Screenshots
[screenshot: CRITICAL alert]

Desktop (please complete the following information):

  • Ubuntu 22.04.2 x64

  • Glances v3.2.4.2 with PsUtil v5.9.0

  • Log file: /home/user/.local/share/glances/glances.log

  • Glances logs file:

2023-04-05 08:10:08,027 -- INFO -- Start Glances 3.2.4.2
2023-04-05 08:10:08,028 -- INFO -- CPython 3.10.6 (/usr/bin/python3) and psutil 5.9.0 detected
2023-04-05 08:11:44,694 -- INFO -- Start Glances 3.2.4.2
2023-04-05 08:11:44,694 -- INFO -- CPython 3.10.6 (/usr/bin/python3) and psutil 5.9.0 detected
2023-04-05 08:11:44,703 -- INFO -- Read configuration file '/etc/glances/glances.conf'
2023-04-05 08:11:44,729 -- INFO -- Start GlancesStandalone mode
2023-04-05 08:11:44,950 -- ERROR -- docker plugin - Can not connect to Docker (Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory')))
2023-04-05 08:11:44,952 -- WARNING -- Missing Python Lib (No module named 'pymdstat'), Raid plugin is disabled
2023-04-05 08:11:45,775 -- WARNING -- Missing Python Lib (No module named 'py3nvml'), Nvidia GPU plugin is disabled
2023-04-05 08:11:45,778 -- WARNING -- Missing Python Lib (No module named 'pySMART'), HDD Smart plugin is disabled
2023-04-05 08:11:45,806 -- WARNING -- Sparklines module not found (No module named 'sparklines')
2023-04-05 08:11:45,812 -- WARNING -- Missing Python Lib (No module named 'wifi'), Wifi plugin is disabled
2023-04-05 08:11:45,812 -- WARNING -- Wifi lib is not compliant with Python 3, Wifi plugin is disabled
2023-04-05 08:11:46,138 -- INFO -- Issue mode is ON
2023-04-05 08:12:08,189 -- INFO -- Start Glances 3.2.4.2
2023-04-05 08:12:08,189 -- INFO -- CPython 3.10.6 (/usr/bin/python3) and psutil 5.9.0 detected
2023-04-05 08:12:08,197 -- INFO -- Read configuration file '/etc/glances/glances.conf'
2023-04-05 08:12:08,213 -- INFO -- Start GlancesStandalone mode
2023-04-05 08:12:08,366 -- ERROR -- docker plugin - Can not connect to Docker (Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory')))
2023-04-05 08:12:08,368 -- WARNING -- Missing Python Lib (No module named 'pymdstat'), Raid plugin is disabled
2023-04-05 08:12:10,383 -- WARNING -- Missing Python Lib (No module named 'py3nvml'), Nvidia GPU plugin is disabled
2023-04-05 08:12:10,384 -- WARNING -- Missing Python Lib (No module named 'pySMART'), HDD Smart plugin is disabled
2023-04-05 08:12:10,400 -- WARNING -- Sparklines module not found (No module named 'sparklines')
2023-04-05 08:12:10,403 -- WARNING -- Missing Python Lib (No module named 'wifi'), Wifi plugin is disabled
2023-04-05 08:12:10,404 -- WARNING -- Wifi lib is not compliant with Python 3, Wifi plugin is disabled
2023-04-05 08:12:10,624 -- INFO -- Issue mode is ON
2023-04-05 09:50:35,881 -- INFO -- Start Glances 3.2.4.2
2023-04-05 09:50:35,882 -- INFO -- CPython 3.10.6 (/usr/bin/python3) and psutil 5.9.0 detected
2023-04-05 09:50:35,890 -- INFO -- Read configuration file '/etc/glances/glances.conf'
2023-04-05 09:50:35,908 -- INFO -- Start GlancesStandalone mode
2023-04-05 09:50:36,085 -- ERROR -- docker plugin - Can not connect to Docker (Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory')))
2023-04-05 09:50:36,086 -- WARNING -- Missing Python Lib (No module named 'pymdstat'), Raid plugin is disabled
2023-04-05 09:50:37,806 -- WARNING -- Missing Python Lib (No module named 'py3nvml'), Nvidia GPU plugin is disabled
2023-04-05 09:50:37,807 -- WARNING -- Missing Python Lib (No module named 'pySMART'), HDD Smart plugin is disabled
2023-04-05 09:50:37,820 -- WARNING -- Sparklines module not found (No module named 'sparklines')
2023-04-05 09:50:37,822 -- WARNING -- Missing Python Lib (No module named 'wifi'), Wifi plugin is disabled
2023-04-05 09:50:37,822 -- WARNING -- Wifi lib is not compliant with Python 3, Wifi plugin is disabled
2023-04-05 09:50:38,108 -- INFO -- Issue mode is ON
  • Glances test [output of glances --issue] (only available with Glances 3.1.7 or higher)
glances --issue
================================================================================================================================================================================
Glances 3.2.4.2 (/usr/lib/python3/dist-packages/glances/__init__.py)
Python 3.10.6 (/usr/bin/python3)
PsUtil 5.9.0 (/usr/lib/python3/dist-packages/psutil/__init__.py)
================================================================================================================================================================================
alert         [OK]    0.00002s []
amps          [OK]    0.00024s key=name [{'key': 'name', 'name': 'Dropbox', 'result': None, 'refresh': 3.0, 'timer': 3.2943501472473145, 'count': 0, 'countmin': 1.0, 
cloud         [OK]    0.00005s {}
connections   [N/A]
core          [OK]    0.00081s {'phys': 6, 'log': 6}
cpu           [OK]    0.00059s {'total': 25.2, 'user': 5.1, 'nice': 0.0, 'system': 9.2, 'idle': 74.7, 'iowait': 0.1, 'irq': 0.0, 'softirq': 11.0, 'steal': 0.0, 'guest
diskio        [OK]    0.00151s key=disk_name [{'time_since_update': 3.007429361343384, 'disk_name': 'sda', 'read_count': 7, 'write_count': 1, 'read_bytes': 28672, 'wr
docker        [OK]    0.00006s []
folders       [OK]    0.00004s []
fs            [OK]    0.00198s key=mnt_point [{'device_name': '/dev/sda2', 'fs_type': 'ext4', 'mnt_point': '/', 'size': 234622398464, 'used': 99770957824, 'free': 122
gpu           [OK]    0.00004s []
help          [OK]    0.00001s None
ip            [N/A]
irq           [N/A]
load          [OK]    0.00006s {'min1': 1.43115234375, 'min5': 1.005859375, 'min15': 0.80322265625, 'cpucore': 6}
mem           [OK]    0.00026s {'total': 16634073088, 'available': 6792138752, 'percent': 59.2, 'used': 9841934336, 'free': 6792138752, 'active': 3696222208, 'inactiv
memswap       [OK]    0.00034s {'total': 2147487744, 'used': 311730176, 'free': 1835757568, 'percent': 14.5, 'sin': 2664747008, 'sout': 2991501312, 'time_since_update
network       [OK]    0.00067s key=interface_name [{'interface_name': 'lo', 'alias': None, 'time_since_update': 3.008650302886963, 'cumulative_rx': 5049377907, 'rx': 
now           [OK]    0.00003s 2023-04-05 09:50:41 -05
percpu        [OK]    0.00055s key=cpu_number [{'key': 'cpu_number', 'cpu_number': 0, 'total': 25.0, 'user': 5.3, 'system': 19.4, 'idle': 75.0, 'nice': 0.0, 'iowait':
ports         [OK]    0.00001s [{'host': None, 'port': 0, 'description': 'DefaultGateway', 'refresh': 30, 'timeout': 3, 'status': None, 'rtt_warning': None, 'indice':
processcount  [OK]    0.25830s {'total': 394, 'running': 4, 'sleeping': 315, 'thread': 1085, 'pid_max': 0}
processlist   [OK]    0.00043s key=pid [{'cpu_times': pcputimes(user=4487.54, system=19599.39, children_user=0.04, children_system=0.0, iowait=0.0), 'num_threads': 2,
psutilversion [OK]    0.00004s (5, 9, 0)
quicklook     [OK]    0.00052s {'cpu': 25.2, 'percpu': [{'key': 'cpu_number', 'cpu_number': 0, 'total': 25.0, 'user': 5.3, 'system': 19.4, 'idle': 75.0, 'nice': 0.0, 
raid          [N/A]
sensors       [OK]    0.00001s key=label [{'label': 'Package id 0', 'value': 43, 'warning': 80, 'critical': 90, 'unit': 'C', 'type': 'temperature_core', 'key': 'label
smart         [N/A]
system        [OK]    0.00001s {'os_name': 'Linux', 'hostname': 'adminred', 'platform': '64bit', 'linux_distro': 'Ubuntu 22.04', 'os_version': '5.15.0-69-generic', 'h
uptime        [OK]    0.00016s {'seconds': 334798}
wifi          [N/A]

Additional context

  • Related to Issue #1214
  • The image belongs to an HPE ProLiant ML110 Gen9 with 2 network interfaces (datasheet)

Can someone explain clearly what these alerts mean and how I can fix them? Thanks

@RazCrimson
Collaborator

@maravento
IO Wait - Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.

IO wait is the part of idle time (the CPU didn't do anything) caused by outstanding IO. That is, while some IO transfer was in progress, the CPU could not schedule/execute any tasks and sat idle. A more detailed explanation is here
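You can verify this definition yourself on Linux by reading /proc/stat directly (a minimal sketch, independent of Glances; the field order follows the proc(5) man page, and the function name is just for illustration):

```python
def cpu_iowait_percent_since_boot():
    """Return the iowait share of total CPU time since boot, in percent.

    Parses the aggregate "cpu" line of /proc/stat, whose fields are:
    user nice system idle iowait irq softirq steal guest guest_nice
    (iowait is the 5th value). Linux-only.
    """
    with open("/proc/stat") as f:
        fields = f.readline().split()
    assert fields[0] == "cpu"
    values = [int(v) for v in fields[1:]]
    return 100.0 * values[4] / sum(values)
```

Glances (via psutil) samples these same counters over a short interval rather than since boot, but the iowait field it alerts on is the one shown here.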

Having a high IO wait could mean that the CPU is being throttled by IO transfers. But this is not a "bad" situation on servers or PCs with HDDs; there, higher IO wait times are normal.

Now, coming to the alerts: they mean that at some earlier points in time, more than 20% of CPU time was spent idle while some IO operation was happening in the background. The values in the alerts are the IO wait percentages at the moment each spike occurred.

To change the thresholds for the alerts, you can specify exact values in the config file. Ref: https://glances.readthedocs.io/en/latest/config.html

The exact values depend on what kind of workload runs on your system and whether that workload can cause heavy IO. Decide on the threshold values based on your needs. The values mentioned in the docs are quite in line with heavy-IO systems too. The default value calculation is a bit involved and is described here

Here is a quick snippet that you can drop into ~/.config/glances/glances.conf:

[cpu]
iowait_careful=50
iowait_warning=70
iowait_critical=90

@RazCrimson
Collaborator

RazCrimson commented Apr 5, 2023

On another note, we could probably apply a lower limit to the logic for the default IO wait threshold computation.

For example, for the IO critical default threshold, we could use max(30, 100/#cores) instead of just 100/#cores.
So we would use the core-count-based logic up to a point, but cap the threshold at a fixed value when the per-core result becomes very small.
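The proposed cap could be sketched like this (the floor value of 30 and the function name are illustrative, not the actual Glances code):

```python
def iowait_critical_default(n_cores: int, floor: float = 30.0) -> float:
    # Current heuristic: 100 / n_cores, which shrinks as core count grows
    # and becomes noisy on many-core machines.
    # Proposed: never let the default critical threshold drop below `floor`.
    return max(floor, 100.0 / n_cores)

# With the cap, a 6-core box gets 30.0 instead of ~16.7:
# iowait_critical_default(6) -> 30.0
```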

What do you think @nicolargo ?

Also, this probably needs better documentation, rather than only being mentioned in the config file example.

@maravento
Author

maravento commented Apr 6, 2023


The drives are SSDs, not HDDs. Also, there are alerts related to network interfaces, which you don't mention in your answer.

To be honest, your explanation is not very clear. But you don't need to explain further, because I don't see how adding to it will fix the problem.
So I think that if it's not a bad thing (as you claim) and the warnings can be ignored (as you claim), then it's better to turn them off. Please tell me where they can be disabled. Thank you

@RazCrimson
Collaborator

RazCrimson commented Apr 6, 2023

To be honest, your explanation is not very clear. But you don't need to explain further, because I don't see that adding something to your explanation will fix the problem.

TLDR: The amount of time that the CPU was idle when some IO transfer was taking place, is called the IO wait time.

I'm not sure how to explain it any better. If you want a more detailed explanation, check the serverfault question linked above.

According to what I understood from you, this is not a problem and can be ignored. But what I don't understand is: if these warnings are nothing bad and can be ignored, then why are they there? (To make the life of the sysadmin more difficult?)

The problem is that different workloads (tasks) can have different IO wait times, and the threshold needs to be adjusted to match your case.

An example that may help you understand why it's hard to set an exact value for the IO wait time threshold:

Desktops and servers do different things, and even servers differ among themselves: a file-hosting server mostly does IO tasks (reading and transferring files), while a compute-intensive server does complex modelling/simulation or maybe even runs some ML algorithms.

On a file-hosting server, heavy IO tasks are normal, so a higher IO wait time is expected and you don't really mind the IO wait times in this case.

On compute servers, by contrast, the objective is to perform the task as fast as possible, so we want a low IO wait time. In that case we are interested in the IO wait time, as IO might be the cause of slowdowns/throttling. Consider the scenario of a costly compute-optimized VM in the cloud: you don't want the VM to waste all its precious CPU cycles waiting on some IO operation when it could be doing other compute work, wasting $$$ (VMs can be shut down or down-scaled depending on usage to save $$$). So the alerts are preferred in this case.

Concluding: defining a standard IO wait time threshold suitable for all cases is not possible, so we just ship some preset defaults that work for most cases (the logic could be better, as explained in the previous comment).

Users (sysadmins in the current case) can change it according to their needs, depending on their workloads or whatever works for their setup. Not all users have the same needs.

So, I think, if it's not a bad thing (as you claim) and if warnings can be ignored (as you claim), then it's better to turn them off.

It's very case-specific, so I think it's better to defer the choice of disabling it to the user.

Please tell me where they are disabled.

It's only possible to disable plugins, not specific alerts. The CPU plugin, which raises the IO wait alert, has other alerts you would probably not want to miss, so disabling the whole plugin is not a good idea.

If you just don't want the IO wait alerts to pop up, you can set their thresholds (as explained above) to 100. This will practically disable them.
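For example, extending the [cpu] section shown earlier (assuming the same key names), setting every IO wait threshold to 100 effectively silences those alerts:

```ini
[cpu]
iowait_careful=100
iowait_warning=100
iowait_critical=100
```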

@maravento
Author

Thanks

PS: we have published a post about this application: https://www.maravento.com/2023/04/glances.html
