Troubleshooting High System Load Alert

Resolving High System Load Alert

Alert Description:

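This alert fires when the 1-minute load average, expressed as a percentage of the number of CPU cores, exceeds the configured threshold. A value above 100% means that, on average, more tasks were waiting to run than the system has cores to serve them.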

Alert Rule:

scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 / count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
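
As a rough local cross-check of what the rule computes, the same ratio can be reproduced on the host itself. This is a minimal sketch, not the alerting implementation: load_percent is a hypothetical helper name, and it assumes the logical CPU count reported by os.cpu_count() matches the number of cpu labels that node_exporter exposes.

```python
import os


def load_percent(load1: float, cpu_count: int) -> float:
    """Return the 1-minute load average as a percentage of the CPU count."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load1 * 100 / cpu_count


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5-, and 15-minute load averages (Unix only).
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load1={load1:.2f} cores={cores} -> {load_percent(load1, cores):.1f}%")
```

For example, a load of 2.0 on a 4-core machine gives 50%, while a load of 8.0 on the same machine gives 200%.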

Step 1: Verify the Alert

Log into the monitoring system and confirm the alert details.
Check if the alert is still active or if it was a temporary spike.

Step 2: Assess the Situation

SSH into the affected system.
Run uptime to view the current load averages.
Use top or htop to get an overview of system resource usage.

Step 3: Identify High Resource Consumers

In top/htop, sort processes by CPU usage ('%CPU' column).
Identify any processes consuming an unusually high amount of CPU.
Note the process IDs (PIDs) of high consumers.
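
When an interactive top session is not practical (for example, when collecting evidence for a ticket), the same ranking can be approximated from ps output. The sketch below is an illustration under assumptions: parse_ps is a hypothetical helper, and the %CPU reported by ps is CPU time divided by the process lifetime, so it will not match the instantaneous figure shown in top.

```python
import subprocess


def parse_ps(output: str, limit: int = 5) -> list[tuple[int, float, str]]:
    """Parse 'pid pcpu comm' rows and return the top `limit` rows by CPU."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, pcpu, comm = parts
        rows.append((int(pid), float(pcpu), comm))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    # -e: all processes; -o: choose columns (PID, CPU percentage, command name).
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, pcpu, comm in parse_ps(out):
        print(f"{pid:>7} {pcpu:>6.1f} {comm}")
```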

Step 4: Investigate Problematic Processes

For each high-consuming process:
a. Run ps -fp <PID> to get more details; this selects the process by ID, whereas ps aux | grep <PID> also matches any other line containing the same digits.
b. Check whether the process is expected to be running and consuming this much CPU.
c. Investigate logs related to the process (usually in /var/log/ or application-specific locations).

Step 5: Address Issues

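If a process is misbehaving:
a. Try restarting it: sudo systemctl restart <service-name>. For a process not managed by systemd, kill -15 <PID> (SIGTERM) asks it to exit cleanly; it then has to be started again by hand or by its supervisor.
b. If a restart doesn't help, consider stopping the process temporarily: sudo systemctl stop <service-name>. Use kill -9 <PID> (SIGKILL) only as a last resort, because the process gets no chance to clean up.
c. If the high load is due to expected behavior (e.g., a batch job), consider rescheduling or optimizing the task.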
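
The escalation from SIGTERM to SIGKILL can be sketched as follows. This is an illustration only: stop_gracefully is a hypothetical helper, the timeout is an arbitrary choice, and the example targets a harmless child process that it starts itself so that it is safe to run. For an arbitrary PID, the equivalent call is os.kill(pid, signal.SIGTERM).

```python
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, timeout: float = 10.0) -> str:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX, the equivalent of kill -15
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, the equivalent of kill -9
        proc.wait()
        return "killed"


if __name__ == "__main__":
    # Stand-in for a runaway process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_gracefully(child, timeout=5))
```

Giving the process a grace period first lets it flush buffers and release locks; SIGKILL is reserved for the case where it ignores the request.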

Step 6: Check System Resources

Run free -h to check memory usage. If available memory is low, the system may start swapping, which raises load mainly through I/O wait rather than CPU work.
Use df -h to check disk usage. Full disks can cause various issues.
Check I/O wait using iostat -x 1 (part of the sysstat package). A high %iowait or high per-device await times might indicate disk issues.
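
The memory and disk checks above can also be scripted, which helps when the same figures need to be captured repeatedly. The sketch below is an assumption-laden illustration: parse_meminfo and percent_used are hypothetical helpers, it reads /proc/meminfo (Linux only), and it relies on the MemAvailable field, which older kernels do not provide.

```python
import shutil


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo content into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def percent_used(total: int, free: int) -> float:
    """Return the used share of `total` as a percentage."""
    return 0.0 if total == 0 else (total - free) * 100 / total


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:  # Linux only
        mem = parse_meminfo(handle.read())
    print(f"memory used: {percent_used(mem['MemTotal'], mem['MemAvailable']):.1f}%")
    print(f"swap used:   {percent_used(mem['SwapTotal'], mem['SwapFree']):.1f}%")
    disk = shutil.disk_usage("/")
    print(f"disk used /: {percent_used(disk.total, disk.free):.1f}%")
```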

Step 7: Review Recent Changes

Check recent system or application updates that might have caused the issue.
Review any recent configuration changes.

Step 8: Implement Short-term Fix

Based on findings, implement a short-term fix to reduce system load. This might include stopping non-critical services, killing runaway processes, or adding resources.

Step 9: Monitor the Situation

Continue monitoring the system load using top or htop.
Verify that the alert resolves in the monitoring system.
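
Rather than watching top by hand, recovery can be confirmed by polling the load until it stays below a threshold for several consecutive readings. This is a minimal sketch under stated assumptions: wait_for_recovery and local_load_percent are hypothetical helpers, and the threshold of 80.0 is a placeholder, not the value configured in the alert.

```python
import os
import time
from typing import Callable


def wait_for_recovery(
    read_percent: Callable[[], float],
    threshold: float,
    checks: int = 3,
    interval: float = 60.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `checks` consecutive readings fall below `threshold`."""
    streak = 0
    for poll in range(max_polls):
        if read_percent() < threshold:
            streak += 1
            if streak >= checks:
                return True
        else:
            streak = 0  # a single spike resets the count
        if poll < max_polls - 1:
            sleep(interval)
    return False


def local_load_percent() -> float:
    """1-minute load average as a percentage of logical CPUs (Unix only)."""
    return os.getloadavg()[0] * 100 / (os.cpu_count() or 1)


if __name__ == "__main__":
    # threshold=80.0 is a placeholder, not the value configured in the alert.
    ok = wait_for_recovery(local_load_percent, threshold=80.0,
                           interval=1.0, max_polls=5)
    print("recovered" if ok else "still overloaded")
```

Requiring several consecutive readings avoids declaring recovery during a brief dip between spikes.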

Step 10: Plan Long-term Solution

If the issue is recurring, plan for a long-term solution. This might include:

Upgrading hardware resources
Optimizing application code
Load balancing or scaling out the service

Prepared by the DevOps Python Team

Nwanochie Emmanuel
Omolara Adeboye
Sarah Aligbe
Divine Onyekwuluje
Aisha Muhammad
