Server Troubleshooting and Resolution

Troubleshooting and Resolving High CPU Usage in Linux

Alert Rule

100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", instance="localhost:9100"}[5m])))

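This expression reports average CPU utilization, in percent, across all cores over the last 5 minutes (100 minus the idle share). The alert triggers when that value exceeds the configured threshold.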
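As a rough illustration of the arithmetic behind this expression, the same idle-versus-total calculation can be reproduced locally from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch, not part of the alerting setup: the function name cpu_usage_pct is hypothetical, and it assumes the standard Linux /proc/stat column order.

```shell
# Hypothetical helper: CPU usage (%) between two aggregate "cpu" lines from /proc/stat.
# Only the idle column counts as idle, mirroring mode="idle" in the alert expression.
cpu_usage_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2-9: user nice system idle iowait irq softirq steal.
            # Guest columns (10-11) are skipped because user/nice already include them.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle[NR] = $5
            sum[NR] = total
        }
        END {
            dt = sum[2] - sum[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (1 - (idle[2] - idle[1]) / dt)
        }'
}

# Usage: take two samples one second apart.
# a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
# cpu_usage_pct "$a" "$b"
```

A locally computed value that roughly matches the Grafana panel is a quick way to rule out a false positive caused by a stale or misconfigured exporter.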

Investigation Steps

Verify Alert
- Check Prometheus/Grafana to confirm high CPU usage
- Ensure alert is not a false positive

Identify CPU-intensive processes: use top or htop

Analyze specific processes

ps aux | grep <process_name_or_PID>

Check system load average

uptime
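The load average reported by uptime is easier to judge relative to the number of CPU cores. The sketch below divides the 1-minute load by the core count; the function name load_per_core is hypothetical, and treating a sustained value above roughly 1.0 as a sign of saturation is a common rule of thumb rather than a hard limit (on Linux the load average also counts tasks blocked on I/O).

```shell
# Hypothetical helper: 1-minute load average divided by the number of CPU cores.
# Usage: load_per_core <load_1min> <cores>
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1               # refuse a zero or missing core count
        printf "%.2f\n", load / cores
    }'
}

# Usage on a live system:
# load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```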

Monitor CPU usage over time

sudo sar -u 1 10

Examine CPU core usage

mpstat -P ALL 1 5

Investigate high I/O wait times

iostat -xz 1 10

Resolution Steps

Terminate unnecessary processes

kill <PID>

or force kill (only if the process ignores the default SIGTERM):

kill -9 <PID>
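The two termination commands can be combined so that SIGKILL is used only as a last resort, which gives the process a chance to flush data and release locks. This is a minimal sketch under stated assumptions: graceful_kill is a hypothetical helper name and the 10-second default timeout is an arbitrary choice.

```shell
# Hypothetical helper: ask a process to exit, escalate to SIGKILL only if needed.
# Usage: graceful_kill <PID> [timeout_seconds]
graceful_kill() {
    pid=$1
    timeout=${2:-10}
    # SIGTERM first; fails if the PID does not exist or we lack permission.
    kill -TERM "$pid" 2>/dev/null || return 1
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # process has exited
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null                # still alive: force kill
    return 0
}

# Usage:
# graceful_kill <PID> 15
```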

Adjust process priority

renice +10 <PID>

Limit CPU usage for a process

sudo cpulimit -p <PID> -l 50

Update or optimize software

sudo apt update && sudo apt upgrade

Check for malware

sudo rkhunter --check

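Optimize system services (stop and disable services that are not needed; --now stops the service immediately in addition to disabling it at boot)

sudo systemctl disable --now <service_name>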

Post-Resolution Actions

Document and Report

Record actions taken and their effects
Update alert status
Notify relevant team members (devops team)

Preventive Measures

Implement regular system maintenance
Set up resource usage monitoring
Optimize application code if applicable

Follow-up

Conduct root cause analysis
Implement long-term solutions
Update runbook if necessary

Note: Always back up your system before making significant changes, and test in a non-production environment first.

Troubleshooting and Resolving Low Memory Space in Linux

Alert rule

(1 - (node_memory_MemAvailable_bytes{instance="localhost:9100", job="node_exporter"} / node_memory_MemTotal_bytes{instance="localhost:9100", job="node_exporter"})) * 100
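The same percentage can be computed directly on the host to cross-check the alert. The sketch below applies the expression's formula, (1 - MemAvailable / MemTotal) * 100, to /proc/meminfo-style input; the function name mem_used_pct is hypothetical.

```shell
# Hypothetical helper: percentage of memory in use, computed as
# (1 - MemAvailable / MemTotal) * 100 from /proc/meminfo-style input on stdin.
mem_used_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1               # MemTotal missing or zero
            printf "%.1f\n", (1 - avail / total) * 100
        }'
}

# Usage on a live system:
# mem_used_pct < /proc/meminfo
```

Because MemAvailable already accounts for reclaimable page cache, this figure is usually lower than what "used" memory looks like at first glance in other tools.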

Troubleshooting tips

Check Current Memory Usage

Use the free command to view memory statistics:

free -h

For a more detailed view, use:

cat /proc/meminfo

Identify Memory-Intensive Processes: Use top or htop to see which processes are consuming the most memory

# Use top
top

# Use htop
htop

Sort processes by memory usage in top by pressing Shift+M.
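For a non-interactive view that can be pasted into an incident report, the output of ps can be sorted by resident memory instead. This is a minimal sketch: top_rss is a hypothetical helper name, it expects "PID RSS(KiB) COMMAND" lines on stdin, and it prints only the first word of the command name.

```shell
# Hypothetical helper: print the N processes with the largest resident set size.
# Expects lines of "PID RSS_KiB COMMAND" on stdin.
top_rss() {
    n=${1:-5}
    sort -k2,2nr | head -n "$n" |
        awk '{ printf "%s %s %.1f MiB\n", $1, $3, $2 / 1024 }'   # KiB -> MiB
}

# Usage on a live system (the trailing "=" suppresses the ps header line):
# ps -eo pid=,rss=,comm= | top_rss 5
```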

Analyze Specific Processes: For detailed information about a process's memory usage:

ps aux | grep <process_name_or_PID>

To see the memory map of a process:

pmap -x <PID>

Check for Memory Leaks: Use Valgrind to check for memory leaks in a specific application:

valgrind --leak-check=full /path/to/your/program

Monitor Swap Usage: Check swap space usage:

swapon --show

Examine System Logs: Look for memory-related errors, such as OOM killer messages, in the system logs:

sudo journalctl -p err..emerg

Resolution steps

Terminate unnecessary processes:

kill <PID>

or force kill:

kill -9 <PID>

Clear Page Cache: To free cached memory (a temporary measure; the kernel reclaims page cache on its own when applications need memory):

sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

Increase Swap Space: Create a new swap file:

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Add to /etc/fstab for persistence:

/swapfile none swap sw 0 0
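Appending this line more than once leaves duplicate entries in /etc/fstab. The sketch below adds an entry only when no existing line starts with the same first field; add_fstab_entry is a hypothetical helper name, and the function is meant to be run as root after backing up the file.

```shell
# Hypothetical helper: append an fstab entry only if its first field is not already listed.
# Usage: add_fstab_entry <fstab_file> <entry>
add_fstab_entry() {
    file=$1
    entry=$2
    # First whitespace-delimited field of the entry, e.g. /swapfile.
    device=$(printf '%s\n' "$entry" | awk '{ print $1 }')
    if awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$file"; then
        return 0                                 # already present; file left unchanged
    fi
    printf '%s\n' "$entry" >> "$file"
}

# Usage (as root):
# cp /etc/fstab /etc/fstab.bak
# add_fstab_entry /etc/fstab "/swapfile none swap sw 0 0"
```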

Optimize Applications:

Update software to latest versions
Configure applications to use less memory
Use lightweight alternatives for resource-heavy applications

Implement Memory Limits: Use cgroups to set memory limits for services (MemoryMax= applies on cgroup v2 systems; MemoryLimit= is the deprecated cgroup v1 equivalent):

sudo systemctl set-property <service_name> MemoryMax=1G

Clean Up Disk Space: Remove unnecessary files and uninstall unused applications:

sudo apt autoremove
sudo apt clean

Consider Hardware Upgrades: If issues persist, consider adding more RAM to your system.

Troubleshooting and Resolving Low Disk Space

Alert rule

100 - ((node_filesystem_avail_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{instance="localhost:9100",job="node_exporter",mountpoint="/",fstype!="rootfs"})

Low disk space on a Linux server can cause various issues, including application crashes and system instability. This guide provides steps and commands to troubleshoot and resolve low disk space issues.

Check Disk Usage

Use the df command to check disk usage of all mounted filesystems.

df -h
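When a server has many mounts, it can help to filter the df output down to the filesystems that are actually close to full. The sketch below reads POSIX-format df output and prints mount points at or above a threshold; full_filesystems is a hypothetical helper name, the 80% default is an arbitrary assumption, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: print mount points whose usage meets or exceeds a threshold (%).
# Reads `df -P` output on stdin so it can also be run against captured output.
full_filesystems() {
    threshold=${1:-80}
    awk -v limit="$threshold" '
        NR > 1 {                                 # skip the header line
            use = $5
            sub(/%/, "", use)                    # "91%" -> "91"
            if (use + 0 >= limit + 0) print $6, use "%"
        }'
}

# Usage on a live system:
# df -P | full_filesystems 80
```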

Identify Large Files and Directories: Use the du command to identify large files and directories

du -sh /path/to/directory/*

Find the 10 Largest Files and Directories on the Root Filesystem

sudo du -ahx / | sort -rh | head -10

Clean Up Unnecessary Files

Remove Unnecessary Packages

sudo apt-get autoremove
sudo apt-get clean

Clear Systemd Journal Logs

sudo journalctl --vacuum-size=100M

Clear APT Cache (Debian/Ubuntu)

sudo apt-get clean

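Delete Old Logs (this example removes .log files last modified more than 30 days ago; adjust the age to your retention policy)

sudo find /var/log -type f -name "*.log" -mtime +30 -delete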

Investigate and Clear Docker Disk Usage (if Docker is being used): Docker can consume a significant amount of disk space.

Check Docker Disk Usage

sudo docker system df

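Remove unused Docker data (the -a flag also removes all images not used by a container, not just dangling ones)

sudo docker system prune -a

# or skip the confirmation prompt
sudo docker system prune -af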

Implement log rotation using tools like logrotate to prevent log files from consuming too much disk space.
Consider adding more disk space or storage to the server if disk space issues persist.

Troubleshooting and Resolving Network Traffic Issues

Alert rule

irate(node_network_transmit_bytes_total{instance="localhost:9100",job="node_exporter"}[5m])*8
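The expression multiplies the per-second byte rate by 8 to obtain bits per second. The same conversion can be checked on the host from two readings of an interface's transmit counter. This is a minimal sketch: tx_mbps is a hypothetical helper name, and eth0 in the usage comment is an assumed interface name.

```shell
# Hypothetical helper: transmit rate in Mbit/s from two byte-counter samples.
# Usage: tx_mbps <bytes_before> <bytes_after> <interval_seconds>
tx_mbps() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN {
        if (t <= 0 || b < a) exit 1              # bad interval or counter reset
        printf "%.2f\n", (b - a) * 8 / t / 1000000
    }'
}

# Usage on a live system (eth0 is an assumed interface name):
# a=$(cat /sys/class/net/eth0/statistics/tx_bytes); sleep 5
# b=$(cat /sys/class/net/eth0/statistics/tx_bytes); tx_mbps "$a" "$b" 5
```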

Troubleshooting Steps

Check network utilization: iftop -i <interface>
Analyze network connections: netstat -tuln
Monitor incoming/outgoing traffic: tcpdump -i <interface> -n

Resolution

Optimize application code for network efficiency
Implement caching mechanisms
Consider load balancing or CDN solutions

Troubleshooting and Resolving Network Errors

Alert rule

increase(node_network_transmit_errs_total[1h]) + increase(node_network_receive_errs_total[1h])

Troubleshooting Steps

Check DNS resolution: nslookup <domain>
Test network connectivity: ping <host>, then traceroute <host>
Verify SSL/TLS configuration: openssl s_client -connect <host>:<port>

Resolution

Update DNS settings
Check firewall rules
Renew or reconfigure SSL/TLS certificates

Troubleshooting and Resolving Disk I/O Issues

Symptoms

High disk usage
Slow read/write operations
I/O wait time spikes

Troubleshooting Steps

Monitor disk I/O: iostat -x 1
Check disk usage: df -h, then du -sh /*
Identify processes causing high I/O: iotop

Resolution

Optimize database queries
Implement proper indexing
Consider upgrading to SSDs or faster storage
Adjust file system parameters (e.g., noatime mount option)

Troubleshooting and Resolving System Reboot Alerts

Alert Rule:

node_time_seconds{instance="localhost:9100",job="node_exporter"} - node_boot_time_seconds{instance="localhost:9100",job="node_exporter"}

The expression computes uptime in seconds (current time minus boot time). The alert triggers when uptime falls below the configured threshold, which indicates that the system has recently rebooted.
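The same check can be run on the host, where /proc/uptime exposes the number of seconds since boot. This is a minimal sketch: recently_rebooted is a hypothetical helper name, and the 600-second default threshold is an assumption, not the value used by the alert.

```shell
# Hypothetical helper: succeed if the uptime (seconds) is below a threshold.
# Usage: recently_rebooted <uptime_seconds> [threshold_seconds]
recently_rebooted() {
    awk -v up="$1" -v limit="${2:-600}" 'BEGIN { exit !(up + 0 < limit + 0) }'
}

# Usage on a live system (the first field of /proc/uptime is seconds since boot):
# recently_rebooted "$(cut -d' ' -f1 /proc/uptime)" 600 && echo "rebooted within the last 10 minutes"
```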

Initial Assessment:

Verify alert legitimacy
Check if reboot was planned maintenance

Troubleshooting Steps:

a. Access the affected system
b. Review system logs from the previous boot (-n without a number shows the last 10 lines):

sudo journalctl -b -1 -n

c. Check last reboot time: who -b
d. Examine uptime: uptime

Common Causes and Solutions:

a. Power failure
- Check UPS status
- Verify power supply integrity

b. Kernel panic
- Review kernel logs from the previous boot (dmesg only covers the current boot; this requires a persistent journal):

sudo journalctl -k -b -1 | grep -i panic

- Update kernel if necessary

c. Hardware failure
- Run hardware diagnostics
- Check for overheating

d. Software update
- Review package manager logs
- Roll back recent updates if problematic

Prevention Measures:

Implement regular maintenance schedule
Set up automatic security updates
Monitor system resources

Alert Resolution:

Document findings and actions taken
Update alert status in monitoring system
Notify relevant team members

Follow-up:

Conduct root cause analysis
Implement preventive measures
Update runbook if necessary

General Tips

Always back up data before making significant changes
Keep system and application logs for reference
Regularly update and patch your systems
Monitor server performance consistently to catch issues early

Prepared By Devops Python Team

Nwanochie Emmanuel
Omolara Adeboye
Sarah Aligbe
Divine Onyekwuluje
Aisha Muhammad
