Troubleshooting High System Load Alert

Sarah Aligbe edited this page Jul 30, 2024 · 1 revision

Resolving High System Load Alert

Alert Description:

This alert triggers when the 1-minute load average, expressed as a percentage of the number of CPU cores, exceeds the configured threshold. A value of 100 means the load average equals the core count.

Alert Rule:

scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 / count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
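To make the arithmetic in the rule concrete, here is a minimal shell sketch of the same calculation done by hand. The function name `load_pct` is hypothetical and not part of the alert configuration.

```shell
# load_pct LOAD CORES
# Print the load average as a percentage of the core count,
# mirroring the alert expression: node_load1 * 100 / <number of CPUs>.
load_pct() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load * 100 / cores }'
}

# Example with fixed numbers: a load of 6.0 on 8 cores.
load_pct 6.0 8    # prints 75.0
```

On a Linux host with `/proc/loadavg` and `nproc` available (an assumption about the target system), the live value could be obtained with `load_pct "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.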

Step 1: Verify the Alert

  1. Log into the monitoring system and confirm the alert details.
  2. Check if the alert is still active or if it was a temporary spike.

Step 2: Assess the Situation

  1. SSH into the affected system.
  2. Run uptime to view the current load averages.
  3. Use top or htop to get an overview of system resource usage.
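When reading the load averages, comparing the 1-minute figure with the 15-minute figure helps tell a fresh spike from sustained load. The sketch below assumes input in the Linux `/proc/loadavg` format (1-, 5-, and 15-minute averages as the first three fields); the function name `load_trend` and the sample numbers are made up for illustration.

```shell
# load_trend: read one line in /proc/loadavg format from stdin and report
# whether the 1-minute average is above or below the 15-minute average.
load_trend() {
    awk '{
        if ($1 > $3)      print "rising: 1-min " $1 " > 15-min " $3
        else if ($1 < $3) print "falling: 1-min " $1 " < 15-min " $3
        else              print "steady: 1-min and 15-min both " $1
    }'
}

# Sample input in /proc/loadavg format (made-up numbers).
echo "7.80 3.10 1.20 9/512 4242" | load_trend
```

On a live Linux system, `load_trend < /proc/loadavg` would report on the current values.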

Step 3: Identify High Resource Consumers

  1. In top/htop, sort processes by CPU usage ('%CPU' column).
  2. Identify any processes consuming an unusually high amount of CPU.
  3. Note the process IDs (PIDs) of high consumers.
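The same triage can be done non-interactively, which is handy when pasting results into a ticket. This is a sketch only: the function name `cpu_hogs`, the 50% cut-off, and the sample rows are illustrative assumptions.

```shell
# cpu_hogs THRESHOLD
# Read "PID %CPU COMMAND" lines (with a header row) from stdin and print
# the PID and command of every process at or above THRESHOLD percent CPU.
cpu_hogs() {
    awk -v t="$1" 'NR > 1 && $2 + 0 >= t + 0 { print $1, $3 }'
}

# Sample input shaped like `ps -eo pid,pcpu,comm` output (made-up values).
printf '%s\n' \
    '  PID %CPU COMMAND' \
    ' 4242 97.3 java' \
    '  811  1.2 sshd' \
    ' 9001 55.0 python3' | cpu_hogs 50
```

On a Linux system with procps, `ps -eo pid,pcpu,comm --sort=-pcpu | cpu_hogs 50` would feed it live data. Note that the `%CPU` reported by `ps` is averaged over the lifetime of the process, so it can differ from the instantaneous figure shown in top.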

Step 4: Investigate Problematic Processes

For each high-consuming process:

  a. Run ps aux | grep <PID> to get more details; ps -fp <PID> avoids matching unrelated lines that happen to contain the same number.
  b. Check whether the process is expected to be running and to consume this much CPU.
  c. Investigate logs related to the process (usually in /var/log/ or application-specific locations).
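If the host uses systemd, knowing which service a PID belongs to makes it easier to find its logs. The sketch below assumes the process's cgroup path contains the unit name, as in the sample line; the function name `unit_from_cgroup` and the service name are hypothetical.

```shell
# unit_from_cgroup: read /proc/<PID>/cgroup content from stdin and print the
# first systemd service unit name found in it, if any.
unit_from_cgroup() {
    grep -o '[^/]*\.service' | head -n 1
}

# Sample cgroup v2 line (made-up service name).
echo '0::/system.slice/nginx.service' | unit_from_cgroup    # prints nginx.service
```

On a live system this would be `unit_from_cgroup < /proc/<PID>/cgroup`, and the resulting name can be passed to `journalctl -u <service-name>`. Processes started from a login session rather than as a service may print a session-level unit or nothing at all.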

Step 5: Address Issues

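If a process is misbehaving:

  a. Try restarting it: sudo systemctl restart <service-name> if it runs as a service. For a standalone process, kill -15 <PID> (SIGTERM) asks it to exit cleanly, after which it can be started again.
  b. If a restart doesn't help, consider stopping the process temporarily: sudo systemctl stop <service-name>, or kill -9 <PID> (SIGKILL) as a last resort, since the process gets no chance to clean up.
  c. If the high load is due to expected behavior (e.g., a batch job), consider rescheduling or optimizing the task.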
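For processes that are not managed by a service manager, the escalation from SIGTERM to SIGKILL can be wrapped in a small helper. This is a minimal sketch: the function name `graceful_stop` and the 10-second default are assumptions, and it must run as a user permitted to signal the target.

```shell
# graceful_stop PID [TIMEOUT_SECONDS]
# Send SIGTERM, wait up to TIMEOUT_SECONDS for the process to exit,
# and send SIGKILL only if it is still running afterwards.
graceful_stop() {
    pid=$1
    timeout=${2:-10}
    # If SIGTERM cannot be delivered, the process is gone or not ours to signal.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    echo "PID $pid still running after SIGTERM, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
}
```

A call such as `graceful_stop 4242 15` (hypothetical PID) gives the process 15 seconds to shut down before it is killed.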

Step 6: Check System Resources

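  1. Run free -h to check memory usage. Low available memory can lead to swapping, which raises the load average because processes stall waiting on disk.
  2. Use df -h to check disk usage. A full disk makes writes fail, which can leave services erroring or retrying in a loop.
  3. Check I/O wait using iostat -x 1. A persistently high %iowait, or high per-device wait times and %util, points to a disk bottleneck.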
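The memory and disk checks above can be reduced to two numbers that are easy to compare against a limit. The sketch below assumes `df -P` (POSIX output format) and Linux `/proc/meminfo` as inputs; the function names, the 90% cut-off, and the sample values are illustrative.

```shell
# full_disks THRESHOLD
# Read `df -P` output from stdin and print mount point and usage for every
# filesystem at or above THRESHOLD percent full.
full_disks() {
    awk -v t="$1" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t + 0) print $6, $5 }'
}

# mem_available_pct
# Read /proc/meminfo content from stdin and print MemAvailable as a
# percentage of MemTotal.
mem_available_pct() {
    awk '/^MemTotal:/ { total = $2 } /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", avail * 100 / total }'
}

# Sample inputs (made-up values).
printf '%s\n' \
    'Filesystem 1024-blocks    Used Available Capacity Mounted on' \
    '/dev/sda1     41152736 38000000   3152736      93% /' \
    '/dev/sdb1    103081248 20000000  83081248      20% /data' | full_disks 90
printf 'MemTotal: 16000000 kB\nMemAvailable: 2000000 kB\n' | mem_available_pct
```

On a live system these would be run as `df -P | full_disks 90` and `mem_available_pct < /proc/meminfo`. Mount points containing spaces are not handled by this simple field split.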

Step 7: Review Recent Changes

  1. Check recent system or application updates that might have caused the issue.
  2. Review any recent configuration changes.
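One quick way to spot recent configuration changes is to list files modified in the last day or two. The helper below is a sketch; the name `recently_changed` and the choice of directory are assumptions.

```shell
# recently_changed DIR [DAYS]
# List regular files under DIR modified within the last DAYS days (default 1).
recently_changed() {
    find "$1" -type f -mtime "-${2:-1}" 2>/dev/null
}
```

For example, `recently_changed /etc 2` lists files under /etc changed in roughly the last 48 hours. Package history lives in distribution-specific places, such as /var/log/dpkg.log on Debian-based systems or the output of `rpm -qa --last` on RPM-based ones.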

Step 8: Implement Short-term Fix

Based on findings, implement a short-term fix to reduce system load. This might include stopping non-critical services, killing runaway processes, or adding resources.

Step 9: Monitor the Situation

  1. Continue monitoring the system load using top or htop.
  2. Verify that the alert resolves in the monitoring system.
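To avoid declaring victory on a single good reading, one option is to wait for several consecutive samples below the limit. The sketch below works on a stream of load percentages, one per line; the function name `settled`, the limit of 80, and the streak length are hypothetical choices, not values taken from the alert rule.

```shell
# settled THRESHOLD COUNT
# Read one load percentage per line from stdin. Exit 0 as soon as COUNT
# consecutive samples are below THRESHOLD; exit 1 if input ends first.
settled() {
    awk -v t="$1" -v need="$2" '
        $1 + 0 <  t + 0 { streak++ }
        $1 + 0 >= t + 0 { streak = 0 }
        streak >= need + 0 { ok = 1; exit }
        END { exit(ok ? 0 : 1) }'
}

# Sample readings (made-up): two high values, then three below 80.
printf '%s\n' 140 95 70 60 55 | settled 80 3 && echo "load has settled"
```

On a Linux host, a live stream could be produced with something like `while sleep 60; do awk -v c="$(nproc)" '{ printf "%.1f\n", $1 * 100 / c }' /proc/loadavg; done | settled 80 5`, which returns once five one-minute samples in a row are below 80%.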

Step 10: Plan Long-term Solution

If the issue is recurring, plan for a long-term solution. This might include:

  • Upgrading hardware resources
  • Optimizing application code
  • Load balancing or scaling out the service
