diff --git a/alerts/openshift-virtualization-operator/HighNodeCPUFrequency.md b/alerts/openshift-virtualization-operator/HighNodeCPUFrequency.md
index e53a4568..745a89a6 100644
--- a/alerts/openshift-virtualization-operator/HighNodeCPUFrequency.md
+++ b/alerts/openshift-virtualization-operator/HighNodeCPUFrequency.md
@@ -2,7 +2,7 @@
 
 ## Meaning
 
-This alert triggers when a CPU frequency on a node exceeds 80% of the maximum
+This alert triggers when the CPU frequency on a node exceeds 80% of the maximum
 frequency for more than 5 minutes.
 
 ## Impact
@@ -17,62 +17,62 @@ High CPU frequency can indicate:
 
 1. Identify the affected node and CPU:
    ```bash
-   oc get nodes
+   $ oc get nodes
    ```
 
-2. Check current CPU frequency on the node:
+2. Check the current CPU frequency on the node:
    ```bash
-   oc debug node/<node_name> -it --image=registry.redhat.io/ubi8/ubi
+   $ oc debug node/<node_name> -it --image=registry.redhat.io/ubi8/ubi
    ```
 
   Then run inside the debug pod:
    ```bash
-   cat /proc/cpuinfo | grep -i "cpu mhz"
+   $ cat /proc/cpuinfo | grep -i "cpu mhz"
    ```
 
3. Monitor CPU utilization and temperature:
    ```bash
-   oc top nodes
+   $ oc top nodes
    ```
 
    ```bash
-   oc top pods --all-namespaces --sort-by=cpu
+   $ oc top pods --all-namespaces --sort-by=cpu
    ```
 
   Check system temperature (if available):
    ```bash
-   sensors
+   $ sensors
    ```
 
4. Review node resource allocation:
    ```bash
-   oc describe node
+   $ oc describe node
    ```
 
5. Check for CPU-intensive workloads:
    ```bash
-   ps aux --sort=-%cpu | head -20
+   $ ps aux --sort=-%cpu | head -20
    ```
 
 ## Mitigation
 
-1. Immediate actions:
-   - Monitor the CPU temperature to ensure it is within safe limits.
-   - Check if the high frequency is due to legitimate high CPU demand.
-   - Verify CPU settings if needed.
+Immediate actions:
+- Monitor the CPU temperature to ensure it is within safe limits.
+- Check if the high frequency is due to legitimate high CPU demand.
+- Verify CPU settings if needed.
 
-2. If caused by high CPU utilization:
-   - Identify and analyze CPU-intensive pods.
-   - Consider redistributing workloads across nodes.
-   - Evaluate pod resource limits and requests.
-   - Scale horizontally if needed.
+If the alert is triggered by high CPU utilization:
+- Identify and analyze CPU-intensive pods.
+- Consider redistributing workloads across nodes.
+- Evaluate pod resource limits and requests.
+- Scale horizontally if needed.
 
-3. If thermal throttling is suspected:
-   - Check system cooling and ventilation.
-   - Monitor ambient temperature.
-   - Consider reducing workload temporarily.
-   - Verify proper thermal management settings.
+If you suspect thermal throttling:
+- Check system cooling and ventilation.
+- Monitor ambient temperature.
+- Consider reducing workload temporarily.
+- Verify proper thermal management settings.
 
 If you cannot resolve the issue, log in to the
-[Customer Portal](https://access.redhat.com) and open a support case,
+[Red Hat Customer Portal](https://access.redhat.com) and open a support case,
 attaching the artifacts gathered during the diagnosis procedure.
\ No newline at end of file