Server Troubleshooting and Resolution
Alert Rule
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))
This alert triggers when the average CPU usage over 5 minutes exceeds a certain threshold.
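To cross-check the alert value on the host itself, the same "100 * (1 - idle fraction)" calculation can be reproduced from two samples of /proc/stat. The following is a minimal sketch, not the exporter's implementation: cpu_busy_pct is a hypothetical helper name, and the 5-second sampling interval in the usage comment is an arbitrary example.

```shell
# cpu_busy_pct PREV CUR
# PREV and CUR are two samples of the aggregate "cpu" line from /proc/stat.
# Prints the busy percentage over the interval:
#   100 * (1 - idle_delta / total_delta)
# Only the idle column (field 5) counts as idle, matching mode="idle".
# Fields 2..9 are summed (user, nice, system, idle, iowait, irq, softirq,
# steal); guest time is skipped because it is already included in user/nice.
cpu_busy_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total[NR] = 0
      for (i = 2; i <= 9 && i <= NF; i++) total[NR] += $i
      idle[NR] = $5
    }
    END {
      dt = total[2] - total[1]
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (1 - (idle[2] - idle[1]) / dt)
    }'
}

# Example usage (assumes a Linux host with /proc mounted):
#   prev=$(grep '^cpu ' /proc/stat); sleep 5; cur=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$prev" "$cur"
```

A value close to the one shown in Prometheus suggests the alert reflects real load rather than a scrape or labeling problem.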
Investigation Steps
- Verify Alert
- Check Prometheus/Grafana to confirm high CPU usage
- Ensure alert is not a false positive
- Identify CPU-intensive processes: use top or htop
- Analyze specific processes
ps aux | grep <process_name_or_PID>
- Check system load average
uptime
- Monitor CPU usage over time
sudo sar -u 1 10
- Examine CPU core usage
mpstat -P ALL 1 5
- Investigate high I/O wait times
iostat -xz 1 10
Resolution Steps
- Terminate unnecessary processes
kill <PID>
or force kill:
kill -9 <PID>
- Adjust process priority
renice +10 <PID>
- Limit CPU usage for a process
sudo cpulimit -p <PID> -l 50
- Update or optimize software
sudo apt update && sudo apt upgrade
- Check for malware
sudo rkhunter --check
- Optimize system services
sudo systemctl disable <service_name>
- Document and Report
- Record actions taken and their effects
- Update alert status
- Notify relevant team members (devops team)
- Preventive Measures
- Implement regular system maintenance
- Set up resource usage monitoring
- Optimize application code if applicable
- Follow-up
- Conduct root cause analysis
- Implement long-term solutions
- Update runbook if necessary
Note: Always back up your system before making significant changes, and test in a non-production environment first.
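Two of the investigation steps, checking the load average and identifying CPU-intensive processes, can be scripted for quick triage. This is a rough sketch under stated assumptions: load_per_core and top_cpu_procs are hypothetical helper names, and the --sort option assumes the procps version of ps found on most Linux distributions.

```shell
# load_per_core LOAD CORES
# Prints the load average divided by the number of CPU cores.
# A sustained value above 1.00 means more runnable work than cores.
load_per_core() {
  awk -v load="$1" -v cores="$2" \
    'BEGIN { if (cores <= 0) exit 1; printf "%.2f\n", load / cores }'
}

# top_cpu_procs [N]
# Prints a header plus the N processes using the most CPU (default 5).
top_cpu_procs() {
  ps -eo pid,pcpu,comm --sort=-pcpu | head -n "$(( ${1:-5} + 1 ))"
}

# Example usage on a Linux host:
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
#   top_cpu_procs 10
```

Normalizing the load by core count makes the number comparable across servers of different sizes.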
Alert rule
(1 - (node_memory_MemAvailable_bytes{instance="localhost:9100", job="node_exporter"} / node_memory_MemTotal_bytes{instance="localhost:9100", job="node_exporter"})) * 100
Troubleshooting tips
- Check Current Memory Usage
Use the free command to view memory statistics:
free -h
For a more detailed view, use:
cat /proc/meminfo
- Identify Memory-Intensive Processes: Use top or htop to see which processes are consuming the most memory
# Use top
top
# Use htop
htop
Sort processes by memory usage in top by pressing Shift+M.
- Analyze Specific Processes
For detailed information about a process's memory usage:
ps aux | grep <process_name_or_PID>
To see the memory map of a process:
pmap -x <PID>
- Check for Memory Leaks
Use Valgrind to check for memory leaks in a specific application:
valgrind --leak-check=full /path/to/your/program
- Monitor Swap Usage
Check swap space usage:
swapon --show
- Examine System Logs
Look for any memory-related errors in system logs:
sudo journalctl -p err..emerg
Resolution steps
- Terminate unnecessary processes:
kill <PID>
or force kill:
kill -9 <PID>
- Clear Page Cache: To free up cached memory
sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
- Increase Swap Space: Create a new swap file:
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Add to /etc/fstab for persistence:
/swapfile none swap sw 0 0
- Optimize Applications:
- Update software to latest versions
- Configure applications to use less memory
- Use lightweight alternatives for resource-heavy applications
- Implement Memory Limits: Use cgroups to set memory limits for services (on cgroup v2 systems, MemoryMax= replaces the deprecated MemoryLimit= setting):
sudo systemctl set-property <service_name> MemoryLimit=1G
- Clean Up Disk Space: Remove unnecessary files and uninstall unused applications:
sudo apt autoremove
sudo apt clean
- Consider Hardware Upgrades: If issues persist, consider adding more RAM to your system.
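To confirm whether a resolution step actually helped, the memory alert expression, (1 - MemAvailable / MemTotal) * 100, can be recomputed locally before and after each change. A minimal sketch, assuming a Linux kernel that exposes MemAvailable in /proc/meminfo; mem_used_pct is a hypothetical helper name.

```shell
# mem_used_pct [FILE]
# Reads meminfo-formatted text (default: /proc/meminfo) and prints
# (1 - MemAvailable / MemTotal) * 100, the same formula as the alert.
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%.1f\n", (1 - avail / total) * 100
    }' "${1:-/proc/meminfo}"
}

# Example usage:
#   mem_used_pct          # current host
#   mem_used_pct saved_meminfo.txt   # a previously captured snapshot
```

Because the calculation uses MemAvailable rather than MemFree, dropping the page cache usually changes the result less than the output of free might suggest.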
Alert rule
100 - ((node_filesystem_avail_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"})
Low disk space on a Linux server can cause various issues, including application crashes and system instability. This guide provides steps and commands to troubleshoot and resolve low disk space issues.
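The alert expression is based on available bytes rather than used bytes, so its result can differ slightly from the Use% column that df prints (space reserved for root counts as unavailable). A small sketch that applies the alert's formula to df output; fs_used_pct_from_df is a hypothetical helper name.

```shell
# fs_used_pct_from_df
# Reads the output of `df -P -k <mountpoint>` on stdin and prints
#   100 - (available * 100 / size)
# using the "1024-blocks" (field 2) and "Available" (field 4) columns
# of the first data row.
fs_used_pct_from_df() {
  awk 'NR == 2 {
    if ($2 <= 0) exit 1
    printf "%.1f\n", 100 - ($4 * 100) / $2
  }'
}

# Example usage for the root filesystem:
#   df -P -k / | fs_used_pct_from_df
```

The -P flag forces the POSIX single-line output format, which keeps the column positions stable even for long device names.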
- Check Disk Usage
Use the df command to check disk usage of all mounted filesystems.
df -h
- Identify Large Files and Directories: Use the du command to identify large files and directories
du -sh /path/to/directory/*
Find the 10 largest files and directories under root
du -ahx / | sort -rh | head -10
- Clean Up Unnecessary Files
- Remove Unnecessary Packages
sudo apt-get autoremove
sudo apt-get clean
- Clear Systemd Journal Logs
sudo journalctl --vacuum-size=100M
- Clear APT Cache (Debian/Ubuntu)
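sudo apt-get clean
- Delete Old Logs (the -mtime +30 filter limits deletion to files not modified in the last 30 days; 30 is an example value, and without the filter every .log file would be removed)
sudo find /var/log -type f -name "*.log" -mtime +30 -exec rm -f {} \;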
- Investigate and Clear Docker Disk Usage (if Docker is being used)
Docker can consume a significant amount of disk space.
- Check Docker Disk Usage
sudo docker system df
- Remove unused Docker data
sudo docker system prune -a
# or force remove (no confirmation prompt)
sudo docker system prune -af
- Implement log rotation using tools like logrotate to prevent log files from consuming too much disk space.
- Consider adding more disk space or storage to the server if disk space issues persist.
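Since deleting log files is irreversible, it can help to preview which files a cleanup would touch before running it. A minimal sketch: old_logs is a hypothetical helper name, and the 30-day cutoff in the usage comment is only an example value.

```shell
# old_logs DIR DAYS
# Prints every *.log file under DIR that was last modified more than
# DAYS full days ago. Nothing is deleted; this is a dry run.
old_logs() {
  find "$1" -type f -name '*.log' -mtime +"$2" -print
}

# Example usage: review the list first, then delete only if it looks right.
#   sudo sh -c 'find /var/log -type f -name "*.log" -mtime +30 -print'
#   old_logs /var/log 30
```

Files that a running service still holds open do not release their space when removed, so a service restart or logrotate may still be needed afterwards.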
Alert rule
irate(node_network_transmit_bytes_total{instance="localhost:9100",job="node_exporter"}[5m])*8
Troubleshooting Steps
- Check network utilization:
iftop -i <interface>
- Analyze network connections:
netstat -tuln
- Monitor incoming/outgoing traffic:
tcpdump -i <interface> -n
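The alert expression multiplies a byte rate by 8 to get bits per second. The same figure can be approximated on the host from two readings of the kernel's per-interface byte counter. A rough sketch: tx_bits_per_sec is a hypothetical helper name, and the interface name and 5-second interval in the usage comment are placeholders.

```shell
# tx_bits_per_sec PREV_BYTES CUR_BYTES SECONDS
# Converts two readings of a transmit byte counter, taken SECONDS apart,
# into an average rate in bits per second: (cur - prev) * 8 / seconds.
tx_bits_per_sec() {
  awk -v a="$1" -v b="$2" -v s="$3" \
    'BEGIN { if (s <= 0) exit 1; printf "%.0f\n", (b - a) * 8 / s }'
}

# Example usage on Linux (replace eth0 with the real interface):
#   f=/sys/class/net/eth0/statistics/tx_bytes
#   a=$(cat "$f"); sleep 5; b=$(cat "$f")
#   tx_bits_per_sec "$a" "$b" 5
```

Comparing this number with the link speed shows how close the interface is to saturation.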
Resolution
- Optimize application code for network efficiency
- Implement caching mechanisms
- Consider load balancing or CDN solutions
Alert rule
increase(node_network_transmit_errs_total[1h]) + increase(node_network_receive_errs_total[1h])
Troubleshooting Steps
- Check DNS resolution:
nslookup <domain>
- Test network connectivity:
ping <host>
traceroute <host>
- Verify SSL/TLS configuration:
openssl s_client -connect <host>:<port>
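The alert expression for this section counts interface transmit and receive errors, so alongside the connectivity checks it may be worth reading the raw counters behind it. A minimal sketch, assuming a Linux host that exposes per-interface statistics under /sys/class/net; net_err_total is a hypothetical helper name.

```shell
# net_err_total STATS_DIR
# Prints rx_errors + tx_errors from a statistics directory such as
# /sys/class/net/<interface>/statistics. Returns non-zero if either
# counter file cannot be read.
net_err_total() {
  rx=$(cat "$1/rx_errors") || return 1
  tx=$(cat "$1/tx_errors") || return 1
  echo $(( rx + tx ))
}

# Example usage (replace eth0 with the real interface):
#   net_err_total /sys/class/net/eth0/statistics
```

The counters are cumulative since the interface came up, so two readings taken some time apart are needed to see whether errors are still increasing.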
Resolution
- Update DNS settings
- Check firewall rules
- Renew or reconfigure SSL/TLS certificates
Symptoms
- High disk usage
- Slow read/write operations
- I/O wait time spikes
Troubleshooting Steps
- Monitor disk I/O:
iostat -x 1
- Check disk usage:
df -h
du -sh /*
- Identify processes causing high I/O:
iotop
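When iostat reports many devices, filtering on the %util column narrows the output to the disks that are actually saturated. The sketch below locates that column by its header name because the column order differs between sysstat versions; busy_devices is a hypothetical helper name and the threshold of 80 is an example value.

```shell
# busy_devices THRESHOLD
# Reads `iostat -x` output on stdin and prints "<device> <%util>" for
# every device whose %util is at or above THRESHOLD.
busy_devices() {
  awk -v t="$1" '
    # Header row: remember which column holds %util.
    $1 ~ /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    # A blank line ends the device table of the current report.
    NF == 0 { col = 0; next }
    col && NF >= col && $col + 0 >= t + 0 { print $1, $col }
  '
}

# Example usage: two reports one second apart; the first report shows
# averages since boot, so the later lines are the more useful ones.
#   iostat -x 1 2 | busy_devices 80
```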
Resolution
- Optimize database queries
- Implement proper indexing
- Consider upgrading to SSDs or faster storage
- Adjust file system parameters (e.g., noatime mount option)
Alert Rule:
node_time_seconds{instance="localhost:9100",job="node_exporter"} - node_boot_time_seconds{instance="localhost:9100",job="node_exporter"}
The expression computes uptime in seconds as the difference between the current time and the boot time. The alert triggers when that value is low, meaning the system has recently rebooted.
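The same uptime value is available locally from /proc/uptime, which is useful for confirming the alert from the host itself. A minimal sketch: uptime_seconds and recently_rebooted are hypothetical helper names, and the 600-second threshold in the usage comment is an example, not a value taken from the alert configuration.

```shell
# uptime_seconds [FILE]
# Prints whole seconds since boot, read from /proc/uptime-formatted
# text (first field is uptime in seconds, with a fractional part).
uptime_seconds() {
  awk '{ printf "%d\n", $1 }' "${1:-/proc/uptime}"
}

# recently_rebooted UPTIME_SECONDS THRESHOLD_SECONDS
# Succeeds (exit status 0) when uptime is below the threshold.
recently_rebooted() {
  [ "$1" -lt "$2" ]
}

# Example usage:
#   if recently_rebooted "$(uptime_seconds)" 600; then
#     echo "host rebooted within the last 10 minutes"
#   fi
```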
Initial Assessment:
- Verify alert legitimacy
- Check if reboot was planned maintenance
Troubleshooting Steps:
a. Access the affected system
b. Review system logs:
sudo journalctl -b -1 -n
c. Check last reboot time: who -b
d. Examine uptime: uptime
Common Causes and Solutions:
a. Power failure
- Check UPS status
- Verify power supply integrity
b. Kernel panic
- Review kernel logs:
sudo dmesg | grep -i panic
- Note that dmesg only covers the current boot; if the journal is persistent, kernel messages from the previous boot are available with sudo journalctl -k -b -1
- Update kernel if necessary
c. Hardware failure
- Run hardware diagnostics
- Check for overheating
d. Software update
- Review package manager logs
- Roll back recent updates if problematic
Prevention Measures:
- Implement regular maintenance schedule
- Set up automatic security updates
- Monitor system resources
Alert Resolution:
- Document findings and actions taken
- Update alert status in monitoring system
- Notify relevant team members
Follow-up:
- Conduct root cause analysis
- Implement preventive measures
- Update runbook if necessary
- Always back up data before making significant changes
- Keep system and application logs for reference
- Regularly update and patch your systems
- Monitor server performance consistently to catch issues early