50 changes: 25 additions & 25 deletions alerts/openshift-virtualization-operator/HighNodeCPUFrequency.md
@@ -2,7 +2,7 @@

## Meaning

-This alert triggers when a CPU frequency on a node exceeds 80% of the maximum
+This alert triggers when the CPU frequency on a node exceeds 80% of the maximum
frequency for more than 5 minutes.

## Impact
@@ -17,62 +17,62 @@ High CPU frequency can indicate:

1. Identify the affected node and CPU:
```bash
-oc get nodes
+$ oc get nodes
```

-2. Check current CPU frequency on the node:
+2. Check the current CPU frequency on the node:
```bash
-oc debug node/<node-name> -it --image=registry.redhat.io/ubi8/ubi
+$ oc debug node/<node-name> -it --image=registry.redhat.io/ubi8/ubi
```

Then run inside the debug pod:
```bash
-cat /proc/cpuinfo | grep -i "cpu mhz"
+$ cat /proc/cpuinfo | grep -i "cpu mhz"
```

3. Monitor CPU utilization and temperature:
```bash
-oc top nodes
+$ oc top nodes
```

```bash
-oc top pods --all-namespaces --sort-by=cpu
+$ oc top pods --all-namespaces --sort-by=cpu
```

Check system temperature (if available):
```bash
-sensors
+$ sensors
```

4. Review node resource allocation:
```bash
-oc describe node <node-name>
+$ oc describe node <node-name>
```

5. Check for CPU-intensive workloads:
```bash
-ps aux --sort=-%cpu | head -20
+$ ps aux --sort=-%cpu | head -20
```
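
The alert compares the current frequency with the maximum, which the `cpu MHz` lines in `/proc/cpuinfo` do not show directly. The following is a minimal sketch, not part of the runbook, that prints each CPU's current frequency as a percentage of its maximum. It assumes the node exposes the Linux cpufreq sysfs files `scaling_cur_freq` and `cpuinfo_max_freq` (values in kHz) and that they are readable from the debug pod; the function name `freq_percent` and the `threshold` variable are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch: report each CPU's current frequency as a percentage of its maximum.
# Assumption: the cpufreq sysfs interface is present; both files report kHz.

# freq_percent CUR_KHZ MAX_KHZ -> integer percentage, rounded down
freq_percent() {
  local cur=$1 max=$2
  if [ "$max" -le 0 ]; then
    echo "invalid maximum frequency: $max" >&2
    return 1
  fi
  echo $(( cur * 100 / max ))
}

threshold=80  # percentage taken from the alert description

for dir in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
  # Skip CPUs without cpufreq data (for example, on some virtualized nodes).
  [ -r "$dir/scaling_cur_freq" ] && [ -r "$dir/cpuinfo_max_freq" ] || continue
  cur=$(cat "$dir/scaling_cur_freq")
  max=$(cat "$dir/cpuinfo_max_freq")
  pct=$(freq_percent "$cur" "$max") || continue
  cpu=$(basename "$(dirname "$dir")")
  if [ "$pct" -gt "$threshold" ]; then
    echo "$cpu: ${pct}% of maximum (above ${threshold}%)"
  else
    echo "$cpu: ${pct}% of maximum"
  fi
done
```

A single reading above the threshold does not by itself reproduce the alert condition, which requires the frequency to stay high for more than 5 minutes; repeating the check over time gives a better picture.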

## Mitigation

-1. Immediate actions:
-- Monitor the CPU temperature to ensure it is within safe limits.
-- Check if the high frequency is due to legitimate high CPU demand.
-- Verify CPU settings if needed.
+Immediate actions:
+- Monitor the CPU temperature to ensure it is within safe limits.
+- Check if the high frequency is due to legitimate high CPU demand.
+- Verify CPU settings if needed.

-2. If caused by high CPU utilization:
-- Identify and analyze CPU-intensive pods.
-- Consider redistributing workloads across nodes.
-- Evaluate pod resource limits and requests.
-- Scale horizontally if needed.
+If the alert is triggered by high CPU utilization:
+- Identify and analyze CPU-intensive pods.
+- Consider redistributing workloads across nodes.
+- Evaluate pod resource limits and requests.
+- Scale horizontally if needed.

-3. If thermal throttling is suspected:
-- Check system cooling and ventilation.
-- Monitor ambient temperature.
-- Consider reducing workload temporarily.
-- Verify proper thermal management settings.
+If you suspect thermal throttling:
+- Check system cooling and ventilation.
+- Monitor ambient temperature.
+- Consider reducing workload temporarily.
+- Verify proper thermal management settings.
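
If `sensors` is not available in the debug image, the kernel's thermal zones may offer a fallback for the temperature checks listed above. The sketch below is an illustration under assumptions, not part of the runbook: it assumes the node exposes `/sys/class/thermal/thermal_zone*/temp` (reported in millidegrees Celsius) and that `awk` is installed; the helper name `millideg_to_celsius` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: print the temperature of each kernel thermal zone in degrees Celsius.
# Assumption: thermal zones are exposed under /sys/class/thermal on this node.

# millideg_to_celsius MILLIDEG -> degrees Celsius with one decimal place
millideg_to_celsius() {
  awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

for zone in /sys/class/thermal/thermal_zone*; do
  # Skip zones that do not exist or cannot be read.
  [ -r "$zone/temp" ] || continue
  temp=$(cat "$zone/temp" 2>/dev/null) || continue
  type=$(cat "$zone/type" 2>/dev/null || echo "unknown")
  echo "$(basename "$zone") ($type): $(millideg_to_celsius "$temp") C"
done
```

Which zone corresponds to the CPU package depends on the hardware, so the `type` value is printed next to each reading; safe temperature limits are hardware-specific and are not defined here.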

If you cannot resolve the issue, log in to the
-[Customer Portal](https://access.redhat.com) and open a support case,
+[Red Hat Customer Portal](https://access.redhat.com) and open a support case,
attaching the artifacts gathered during the diagnosis procedure.